Credit Risk Modeling with Machine Learning: A Practical Introduction
Every time a bank approves or declines a loan application, an algorithm is involved. Every time a fintech company sets an interest rate for a personal loan, a credit risk model is running in the background. Every time a credit card company decides on a spending limit, machine learning is estimating the probability that a borrower will default. Credit risk modeling is one of the oldest applications of quantitative methods in finance - and one of the most rapidly transformed by machine learning in the last decade.
The shift from traditional scorecards (rules-based, manually calibrated) to ML-based credit models is not simply a technical upgrade. It fundamentally changes what signals a lender can use (thousands of features instead of dozens), how quickly models can be updated (retraining on new data vs. manual recalibration), and how accurately individual risk can be assessed (non-linear patterns that linear scoring misses). The cost of getting this wrong is significant: underestimate risk and you fund defaults; overestimate risk and you exclude creditworthy borrowers - with legal and reputational consequences in both directions. Board Infinity's introduction to banking guide covers how banks use credit assessment as a foundational function of their lending operations.
This guide walks through the complete credit risk ML workflow - from understanding the core risk metrics (PD, LGD, EAD) through traditional scorecards, logistic regression, ensemble models, feature engineering, and the evaluation metrics that regulators and risk managers actually use. Every section includes Python code for immediate application.
Who This Guide Is For
This guide is for:
- Risk analysts at banks, fintechs, or credit bureaus who want to understand or build ML credit models
- Data scientists entering finance who need the credit risk domain context
- Finance professionals preparing for roles where credit modeling skills are assessed
- Anyone building data science portfolio projects in the credit and lending space - Board Infinity's building a data science portfolio guide identifies credit scoring models as one of the highest-value portfolio projects for finance-focused ML roles
1. What Is Credit Risk? PD, LGD, EAD Explained
Credit risk is the risk of loss that arises when a borrower fails to meet their financial obligations. In practice, it is decomposed into three distinct components that together determine the Expected Loss (EL) on any loan or credit exposure.
Expected Loss = PD × LGD × EAD
PD - Probability of Default is the likelihood that a borrower will default within a given time horizon (typically 12 months for retail credit). This is what ML models primarily predict - a number between 0 and 1 representing default risk. A PD of 0.03 means a 3% chance of default within the year.
LGD - Loss Given Default is the percentage of the exposure that the lender expects to lose if the borrower does default. If a lender is owed $100,000 and expects to recover $60,000 through collateral and collections, LGD = 40%. Secured loans (mortgages) have lower LGD than unsecured loans (personal loans, credit cards).
EAD - Exposure at Default is the total amount the lender is exposed to at the time of default. For term loans, this is straightforward. For revolving credit (credit cards, lines of credit), the borrower may draw down more before defaulting, making EAD estimation more complex.
For a broader view of where data science fits in, Board Infinity's guide on How Data Science in Financial Modelling Helps Businesses shows how predictive modeling is transforming risk assessment and cash flow forecasting across financial institutions.
| Component | Full Name | Typical Range | How ML Helps |
|---|---|---|---|
| PD | Probability of Default | 0.1% - 30%+ (retail) | Classification models predict PD directly from borrower features |
| LGD | Loss Given Default | 10% - 90% (varies by collateral) | Regression models estimate recovery rates from loan and collateral data |
| EAD | Exposure at Default | Outstanding balance to credit limit | ML predicts draw-down behavior for revolving credit facilities |
| EL | Expected Loss = PD × LGD × EAD | Varies widely by product/segment | All three components combined determine provisioning and pricing |
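The expected-loss identity is easy to check numerically. The sketch below reuses the 3% PD and 40% LGD figures from the definitions above. The function names, the card figures, and the `ccf` parameter (an assumed share of the undrawn limit that gets drawn before default) are illustrative choices, not a prescribed regulatory formula.

```python
def expected_loss(pd_: float, lgd: float, ead: float) -> float:
    """Expected loss = PD x LGD x EAD, with PD and LGD as fractions in [0, 1]."""
    for name, value in (("pd_", pd_), ("lgd", lgd)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return pd_ * lgd * ead


def revolving_ead(drawn: float, limit: float, ccf: float) -> float:
    """EAD for a revolving line: current balance plus an assumed share (ccf)
    of the undrawn limit that is drawn down before default."""
    undrawn = max(limit - drawn, 0.0)
    return drawn + ccf * undrawn


# Term loan: 3% PD, 40% LGD, $100,000 outstanding
term_el = expected_loss(0.03, 0.40, 100_000)

# Credit card: $2,000 drawn on a $10,000 limit, hypothetical 50% ccf,
# hypothetical 5% PD and 80% LGD for an unsecured product
card_ead = revolving_ead(drawn=2_000, limit=10_000, ccf=0.5)
card_el = expected_loss(0.05, 0.80, card_ead)

print(f"Term loan EL: ${term_el:,.2f}")
print(f"Card EAD: ${card_ead:,.2f} | Card EL: ${card_el:,.2f}")
```

The card example shows why EAD matters for revolving products: the exposure used in the loss calculation is larger than the balance drawn today.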
2. Traditional Scorecard vs ML-Based Models
For decades, credit scoring was dominated by traditional scorecards - point-based systems where each credit characteristic (payment history, credit utilization, length of credit history) is assigned a point value, and scores are summed to produce a final credit score. FICO scores are the most well-known example.
Traditional scorecards have significant strengths: they are fully transparent (every factor and weight is documented), auditable by regulators, stable over time, and well-understood by lenders. Their limitation is that they are linear, use a small number of pre-selected features, and require manual calibration to maintain accuracy as population characteristics shift.
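A points-based scorecard of the kind described above can be sketched in a few lines. The characteristics, bin boundaries, point values, and base score here are invented for illustration and do not correspond to FICO or any real scorecard.

```python
# Hypothetical scorecard: each characteristic maps value ranges to points.
# Each entry is (exclusive upper bound, points); ranges are checked in order.
SCORECARD = {
    "credit_utilization": [(0.30, 60), (0.70, 35), (float("inf"), 10)],
    "missed_payments_24m": [(1, 70), (3, 30), (float("inf"), 0)],
    "credit_age_years": [(2, 15), (7, 35), (float("inf"), 55)],
}
BASE_POINTS = 400  # hypothetical starting score


def score_applicant(applicant: dict) -> tuple:
    """Return (total score, points awarded per characteristic)."""
    breakdown = {}
    for characteristic, bins in SCORECARD.items():
        value = applicant[characteristic]
        for upper_bound, points in bins:
            if value < upper_bound:
                breakdown[characteristic] = points
                break
    return BASE_POINTS + sum(breakdown.values()), breakdown


applicant = {
    "credit_utilization": 0.82,
    "missed_payments_24m": 0,
    "credit_age_years": 5,
}
total, breakdown = score_applicant(applicant)
print(total, breakdown)
```

The per-characteristic breakdown is what makes this format auditable: every point in the final score traces back to one documented rule.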
ML-based credit models handle thousands of features simultaneously, capture non-linear relationships between variables, and can be retrained automatically as new data arrives. They typically outperform traditional scorecards on discrimination (AUC-ROC), although their raw probability outputs often need a separate calibration step before they can be used as PDs. The tradeoff is explainability, which regulators require, and this is driving adoption of SHAP and LIME for post-hoc explanation of ML credit decisions.
| Dimension | Traditional Scorecard | ML-Based Model |
|---|---|---|
| Features | 10-30 manually selected variables | Hundreds to thousands of features |
| Relationships | Linear only - additive point values | Non-linear - interactions and complex patterns |
| Accuracy (AUC) | Typically 0.65 - 0.75 | Typically 0.75 - 0.90+ on same data |
| Explainability | Fully transparent - every factor documented | Black box - requires SHAP/LIME for explanation |
| Regulatory acceptance | Well established - preferred by regulators | Increasingly accepted with explainability tools |
| Maintenance | Manual recalibration on a 6-12 month cycle | Automated retraining on new data pipelines |
The most common production setup at major lenders is a hybrid: an ML model provides the probability of default score, a traditional scorecard provides the regulatory-facing explanation ("your credit utilization was too high"), and SHAP values bridge the gap - generating the top factors from the ML model that drove a specific decision. This architecture gets the accuracy benefits of ML and the explainability requirements of fair lending compliance. Pure scorecard-only systems are increasingly rare at large financial institutions for new model development.
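To make the "top factors" idea concrete, here is a minimal sketch of reason-code generation. It does not call the SHAP library. Instead it uses the shortcut that applies to linear models with features treated as independent, where each feature's contribution to the log-odds is coefficient × (value − population mean). The coefficients and the applicant's values are made up for illustration.

```python
import numpy as np

features = ["credit_utilization", "num_missed_payments",
            "debt_to_income", "months_employed"]

# Hypothetical fitted coefficients on standardized inputs (log-odds of default)
coefficients = np.array([0.60, 0.85, 0.40, -0.30])
population_mean = np.zeros(len(features))  # standardized features have mean 0

# One applicant, expressed in standard deviations from the population mean
applicant = np.array([1.8, 0.6, -0.2, -1.0])

# Contribution of each feature to this applicant's log-odds of default,
# relative to an average applicant
contributions = coefficients * (applicant - population_mean)

# Reason codes: the features pushing this applicant's PD up the most
order = np.argsort(contributions)[::-1]
top_reasons = [(features[i], round(float(contributions[i]), 2)) for i in order[:2]]
print(top_reasons)
```

For a tree ensemble the contributions would come from a dedicated explainer rather than this closed form, but the output has the same shape: a ranked list of features with signed contributions for one decision.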
3. Logistic Regression for Default Prediction
Logistic regression is the baseline model for credit risk classification and remains the most widely deployed ML algorithm in regulated credit scoring environments. Despite its simplicity, it produces well-calibrated probability estimates (unlike many black-box models), is fully interpretable (coefficient direction and magnitude tell you the feature's effect on default risk), and is fast to train and score at scale. Regulators specifically prefer logistic regression because its behavior can be fully documented and challenged. Board Infinity's Goldman Sachs GBM Private Summer Analyst guide shows how quantitative risk frameworks at investment banks combine statistical rigor with regulatory compliance - the same balance logistic regression serves in credit.
    import pandas as pd
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score, classification_report
    from sklearn.calibration import calibration_curve
    import matplotlib.pyplot as plt

    # === CREDIT FEATURES ===
    # Standard retail credit variables
    features = [
        'credit_score',         # FICO or bureau score
        'debt_to_income',       # DTI ratio (total debt payments / gross income)
        'num_missed_payments',  # 30+ day delinquencies in last 24 months
        'credit_utilization',   # revolving balance / credit limit
        'loan_to_value',        # for secured loans: loan amount / collateral value
        'months_employed',      # employment stability
        'num_accounts',         # breadth of credit history
        'loan_amount'           # exposure size
    ]
    target = 'default_12m'  # 1 = defaulted within 12 months, 0 = did not

    # df is assumed to be an already-loaded DataFrame of loan records with these columns
    X = df[features]
    y = df[target]

    # === TRAIN/TEST SPLIT ===
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # === SCALE FEATURES ===
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # === LOGISTIC REGRESSION WITH L2 REGULARIZATION ===
    # C = inverse regularization strength: smaller C = stronger regularization
    model = LogisticRegression(C=0.1, max_iter=1000, random_state=42)
    model.fit(X_train_scaled, y_train)

    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]  # PD scores
    y_pred = (y_pred_proba >= 0.5).astype(int)

    print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_proba):.3f}")
    print(classification_report(y_test, y_pred))

    # === COEFFICIENT INTERPRETATION ===
    coef_df = pd.DataFrame({
        'feature': features,
        'coefficient': model.coef_[0]
    }).sort_values('coefficient', ascending=False)
    print(coef_df)
    # Positive coefficient = increases default probability
    # Negative coefficient = decreases default probability
    # e.g., num_missed_payments: +0.85 = strong positive predictor of default
    #       credit_score: -0.72 = higher score strongly reduces default risk
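Because logistic regression is linear in the log-odds, exponentiating a coefficient gives an odds ratio. The sketch below applies that to the two illustrative coefficients quoted in the code comments above. Since the features were standardized, each ratio is per one standard deviation of the feature. The intercept is an assumed value, chosen only to make the PD arithmetic concrete.

```python
import math


def sigmoid(z: float) -> float:
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


coef_missed_payments = 0.85   # illustrative coefficient from the comments above
coef_credit_score = -0.72     # illustrative coefficient from the comments above
intercept = -3.0              # hypothetical log-odds of default for an average applicant

# Odds ratios: multiplicative change in default odds per +1 standard deviation
odds_ratio_missed = math.exp(coef_missed_payments)
odds_ratio_score = math.exp(coef_credit_score)

# Effect on PD for an otherwise-average applicant
baseline_pd = sigmoid(intercept)
pd_more_missed = sigmoid(intercept + coef_missed_payments)
pd_higher_score = sigmoid(intercept + coef_credit_score)

print(f"Odds ratio, missed payments: {odds_ratio_missed:.2f}")
print(f"Odds ratio, credit score:    {odds_ratio_score:.2f}")
print(f"Baseline PD: {baseline_pd:.1%}")
print(f"PD at +1 SD missed payments: {pd_more_missed:.1%}")
print(f"PD at +1 SD credit score:    {pd_higher_score:.1%}")
```

Note that the odds ratio is constant across applicants while the change in PD is not: the same coefficient moves PD more for applicants who start closer to 50%.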
In a typical retail credit portfolio, default rates are 2-8%, so your dataset has 92-98% non-defaults and only 2-8% defaults. A naive model that predicts "no default" for everyone achieves 92-98% accuracy (95% at a 5% default rate) while being completely useless for credit risk. Always use stratify=y in train/test splits to maintain class proportions. Use class weights (class_weight='balanced' in sklearn) or SMOTE oversampling, keeping in mind that both shift the predicted probabilities, so recalibrate before treating the outputs as PDs. Evaluate on ROC-AUC, precision-recall, KS statistic, and Gini rather than accuracy, which is misleading for imbalanced credit data.
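The accuracy trap is easy to reproduce on synthetic labels. The sketch below assumes a 5% default rate and a simulated dataset. It also computes the per-class weights using the formula scikit-learn documents for class_weight='balanced', n_samples / (n_classes * class_count).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
y = (rng.random(n) < 0.05).astype(int)   # synthetic labels, roughly 5% defaults

# Naive model: predict "no default" for everyone
naive_pred = np.zeros(n, dtype=int)
accuracy = float((naive_pred == y).mean())
recall = float(naive_pred[y == 1].mean())   # share of true defaults caught

# Per-class weights equivalent to class_weight='balanced':
# n_samples / (n_classes * count_of_class)
counts = np.bincount(y)
balanced_weights = n / (2 * counts)

print(f"Accuracy of naive model: {accuracy:.1%}")
print(f"Recall on defaults:      {recall:.1%}")
print(f"Weight for non-defaults: {balanced_weights[0]:.2f}")
print(f"Weight for defaults:     {balanced_weights[1]:.2f}")
```

The naive model scores about 95% accuracy while catching no defaults at all, and the balanced weights show how much more each default counts during training once the imbalance is corrected.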
4. Random Forest and Gradient Boosting for Credit Scoring
While logistic regression is the regulatory baseline, ensemble models - Random Forest and gradient boosting algorithms like XGBoost and LightGBM - consistently deliver higher discrimination performance on credit datasets. They handle non-linear relationships between features, automatically learn feature interactions, and are robust to outliers and multicollinearity. They are the production standard at most fintechs and increasingly at banks with ML governance frameworks in place.
    from xgboost import XGBClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    import numpy as np

    # === CALCULATE CLASS WEIGHT FOR IMBALANCED DATA ===
    default_rate = y_train.mean()
    scale_pos_weight = (1 - default_rate) / default_rate
    print(f"Default rate: {default_rate:.1%} | scale_pos_weight: {scale_pos_weight:.1f}")
    # e.g., 5% default rate → scale_pos_weight = 19 (19 non-defaults per default)

    # === XGBOOST MODEL ===
    xgb_model = XGBClassifier(
        n_estimators=300,
        max_depth=4,                        # shallow trees reduce overfitting
        learning_rate=0.05,                 # slow learning rate + more trees = better generalization
        subsample=0.8,                      # 80% of rows per tree - prevents overfitting
        colsample_bytree=0.8,               # 80% of features per tree
        scale_pos_weight=scale_pos_weight,  # handle class imbalance
        eval_metric='auc',
        random_state=42
    )
    xgb_model.fit(
        X_train, y_train,
        eval_set=[(X_test, y_test)],
        verbose=False
    )

    # === RANDOM FOREST MODEL (comparison baseline) ===
    rf_model = RandomForestClassifier(
        n_estimators=200,
        max_depth=6,
        class_weight='balanced',  # auto-adjusts for class imbalance
        random_state=42
    )
    rf_model.fit(X_train, y_train)

    # === MODEL COMPARISON ===
    xgb_auc = roc_auc_score(y_test, xgb_model.predict_proba(X_test)[:, 1])
    rf_auc = roc_auc_score(y_test, rf_model.predict_proba(X_test)[:, 1])
    lr_auc = roc_auc_score(y_test, model.predict_proba(X_test_scaled)[:, 1])

    print(f"Logistic Regression AUC: {lr_auc:.3f}")
    print(f"Random Forest AUC: {rf_auc:.3f}")
    print(f"XGBoost AUC: {xgb_auc:.3f}")
    # Expected: XGBoost > Random Forest > Logistic Regression
    # Typical improvement: LR ~0.72 → RF ~0.78 → XGB ~0.83 on same credit data
5. Feature Engineering for Financial Data
Feature engineering - transforming raw data into predictive variables - is where the most significant performance gains come from in credit risk modeling. The raw inputs (credit score, income, loan amount) are improved by creating derived features that capture behavioral patterns, ratios, trends, and interaction effects that raw variables miss.
    import pandas as pd
    import numpy as np

    # === RAW FEATURES (from credit application + bureau data) ===
    # df has: credit_score, annual_income, loan_amount, monthly_debt,
    #         num_accounts, oldest_account_months, num_missed_12m,
    #         revolving_balance, revolving_limit, num_hard_inquiries

    # === RATIO FEATURES ===
    df['debt_to_income'] = df['monthly_debt'] / (df['annual_income'] / 12)
    df['loan_to_income'] = df['loan_amount'] / df['annual_income']
    df['credit_utilization'] = df['revolving_balance'] / df['revolving_limit'].replace(0, np.nan)
    df['payment_burden'] = df['monthly_debt'] / df['loan_amount']

    # === BEHAVIORAL FEATURES ===
    # replace(0, np.nan) avoids infinite values for borrowers with no accounts
    df['delinquency_rate'] = df['num_missed_12m'] / df['num_accounts'].replace(0, np.nan)
    df['inquiry_intensity'] = df['num_hard_inquiries'] / 12  # monthly inquiry rate
    df['any_delinquency'] = (df['num_missed_12m'] > 0).astype(int)  # binary flag
    df['severe_delinquency'] = (df['num_missed_12m'] >= 3).astype(int)

    # === CREDIT HISTORY FEATURES ===
    df['credit_age_years'] = df['oldest_account_months'] / 12
    df['accounts_per_year'] = df['num_accounts'] / (df['credit_age_years'] + 0.1)

    # === CREDIT SCORE BUCKETS ===
    # Bin edges are inclusive on the right, so 579 closes the lowest band;
    # include_lowest=True keeps a score of exactly 300 in the first bucket
    df['score_bucket'] = pd.cut(
        df['credit_score'],
        bins=[300, 579, 669, 739, 799, 850],
        labels=['Very Poor', 'Fair', 'Good', 'Very Good', 'Exceptional'],
        include_lowest=True
    )
    # One-hot encode for ML models
    df = pd.get_dummies(df, columns=['score_bucket'], drop_first=True)

    # === INTERACTION FEATURES ===
    df['high_dti_poor_credit'] = (
        (df['debt_to_income'] > 0.43) & (df['credit_score'] < 620)
    ).astype(int)  # high-risk combination flag

    # === WINSORIZE EXTREME VALUES ===
    # For brevity the caps are computed on the full frame here; in a real pipeline,
    # compute them on the training split only and reuse them on the test split
    for col in ['debt_to_income', 'credit_utilization', 'loan_to_income']:
        p1 = df[col].quantile(0.01)
        p99 = df[col].quantile(0.99)
        df[col] = df[col].clip(p1, p99)  # cap extreme outliers

    print(f"Features created: {df.shape[1]} total columns")
Credit datasets contain genuine extreme values - a DTI of 3.5 (debt payments 350% of income) is unusual but real, and often highly predictive of default. Removing these observations loses real signal. Instead, winsorize: cap values at the 1st and 99th percentile. A DTI of 3.5 gets capped at, say, 1.2 (the 99th percentile), preserving the extreme-risk signal without letting one outlier distort the model's coefficients. Apply winsorization to the training set, then apply the same caps to the test set using the training set's quantile values - never fit winsorization boundaries on test data.
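The fit-on-train, apply-to-test rule can be sketched with two small helpers. The helper names and the synthetic DTI data are assumptions made for illustration. The only point is that the caps come from the training frame and are reused unchanged on the test frame.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic DTI values with a heavy right tail (illustrative only)
train = pd.DataFrame({"debt_to_income": rng.lognormal(mean=-1.0, sigma=0.6, size=5_000)})
test = pd.DataFrame({"debt_to_income": rng.lognormal(mean=-1.0, sigma=0.6, size=1_000)})
test.loc[0, "debt_to_income"] = 3.5   # an extreme but plausible DTI


def fit_winsor_caps(frame, columns, lower=0.01, upper=0.99):
    """Learn per-column (low, high) caps from the training data only."""
    return {col: (frame[col].quantile(lower), frame[col].quantile(upper))
            for col in columns}


def apply_winsor_caps(frame, caps):
    """Clip each column to previously learned caps; returns a copy."""
    out = frame.copy()
    for col, (low, high) in caps.items():
        out[col] = out[col].clip(lower=low, upper=high)
    return out


caps = fit_winsor_caps(train, ["debt_to_income"])
train_w = apply_winsor_caps(train, caps)
test_w = apply_winsor_caps(test, caps)

low, high = caps["debt_to_income"]
print(f"Training caps: {low:.3f} to {high:.3f}")
print(f"Test DTI of 3.5 becomes {test_w.loc[0, 'debt_to_income']:.3f}")
```

The extreme test observation stays in the data and still sits at the top of the range, but it can no longer pull the model's coefficients on its own.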
6. Model Evaluation: AUC-ROC, KS Statistic, Gini Coefficient
Credit risk model evaluation uses different metrics than general classification problems, because the goal is not just accuracy but discrimination (how well the model separates defaulters from non-defaulters) and calibration (how well the predicted PD aligns with actual default rates). Regulators, model validation teams, and risk managers use specific metrics that are standard in the credit industry.
    import numpy as np
    import pandas as pd
    from sklearn.metrics import roc_auc_score, roc_curve
    import matplotlib.pyplot as plt

    y_scores = xgb_model.predict_proba(X_test)[:, 1]  # predicted PD scores

    # === 1. AUC-ROC (Area Under ROC Curve) ===
    auc = roc_auc_score(y_test, y_scores)
    print(f"AUC-ROC: {auc:.3f}")
    # Interpretation:
    # > 0.75: acceptable for credit scoring
    # > 0.80: good - clear separation between defaults and non-defaults
    # > 0.85: strong - production-grade for most retail credit products
    # > 0.90: excellent - rare in credit (data may have leakage - investigate)

    # === 2. GINI COEFFICIENT ===
    gini = 2 * auc - 1
    print(f"Gini: {gini:.3f}")
    # Gini = 2 * AUC - 1 | Range: 0 (random) to 1 (perfect)
    # Widely used in credit: Gini > 0.40 considered acceptable

    # === 3. KS STATISTIC (Kolmogorov-Smirnov) ===
    fpr, tpr, thresholds = roc_curve(y_test, y_scores)
    ks_stat = np.max(tpr - fpr)
    ks_threshold = thresholds[np.argmax(tpr - fpr)]
    print(f"KS Statistic: {ks_stat:.3f} at threshold: {ks_threshold:.3f}")
    # KS = max separation between cumulative default and non-default distributions
    # KS > 0.20: acceptable | > 0.40: good | > 0.60: excellent
    # The KS threshold is often used as the decision cutoff for approve/decline

    # === 4. ROC CURVE PLOT ===
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # ROC Curve
    axes[0].plot(fpr, tpr, color='#0f3460', linewidth=2,
                 label=f'XGBoost (AUC = {auc:.3f}, Gini = {gini:.3f})')
    axes[0].plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random (AUC = 0.500)')
    axes[0].scatter([fpr[np.argmax(tpr - fpr)]], [tpr[np.argmax(tpr - fpr)]],
                    color='#e94560', s=100, zorder=5, label=f'KS = {ks_stat:.3f}')
    axes[0].set_xlabel('False Positive Rate')
    axes[0].set_ylabel('True Positive Rate')
    axes[0].set_title('ROC Curve - Credit Default Model', fontweight='bold')
    axes[0].legend()

    # Score distribution (defaults vs non-defaults)
    default_scores = y_scores[y_test == 1]
    clean_scores = y_scores[y_test == 0]
    axes[1].hist(clean_scores, bins=50, alpha=0.6, color='#0f3460', label='Non-Default')
    axes[1].hist(default_scores, bins=50, alpha=0.6, color='#e94560', label='Default')
    axes[1].set_xlabel('Predicted PD Score')
    axes[1].set_title('Score Distribution by Outcome', fontweight='bold')
    axes[1].legend()

    plt.tight_layout()
    plt.savefig('credit_model_evaluation.png', dpi=150, bbox_inches='tight')
    plt.show()
| Metric | Formula / Method | Acceptable | Good | What It Measures |
|---|---|---|---|---|
| AUC-ROC | Area under ROC curve | >0.75 | >0.80 | Probability model ranks a random default above a random non-default |
| Gini | 2 × AUC - 1 | >0.40 | >0.60 | Discrimination - used by Basel and many regulators as primary metric |
| KS Statistic | Max(TPR - FPR) on ROC curve | >0.20 | >0.40 | Maximum separation between default and non-default score distributions |
| Brier Score | Mean((PD - actual)²) | <0.05 | <0.03 | Calibration - how well predicted probabilities match actual default rates |
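The Brier score row is the only calibration metric in the table, so a small numerical sketch may help. It simulates outcomes from known "true" PDs, then compares a well-calibrated set of predictions with one that doubles every PD. All data is synthetic, and the helper names and the five-bin calibration table are illustrative choices rather than a standard API.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000
true_pd = rng.beta(1, 19, size=n)            # synthetic "true" PDs, mean about 5%
y = (rng.random(n) < true_pd).astype(int)    # simulated default outcomes

well_calibrated = true_pd
overstated = np.clip(true_pd * 2.0, 0, 1)    # same ranking, every PD doubled


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted PD and the 0/1 outcome."""
    return float(np.mean((y_prob - y_true) ** 2))


def calibration_table(y_true, y_prob, n_bins=5):
    """(mean predicted PD, observed default rate) per score-quantile bin."""
    edges = np.quantile(y_prob, np.linspace(0, 1, n_bins + 1))
    bin_ids = np.clip(np.searchsorted(edges, y_prob, side="right") - 1, 0, n_bins - 1)
    return [(float(y_prob[bin_ids == b].mean()), float(y_true[bin_ids == b].mean()))
            for b in range(n_bins)]


brier_good = brier_score(y, well_calibrated)
brier_bad = brier_score(y, overstated)
table = calibration_table(y, well_calibrated)

print(f"Brier (well calibrated): {brier_good:.4f}")
print(f"Brier (PDs doubled):     {brier_bad:.4f}")
for predicted, observed in table:
    print(f"predicted {predicted:.3f} | observed {observed:.3f}")
```

Both prediction sets rank borrowers identically, so their AUC, Gini, and KS are the same; only the Brier score and the calibration table reveal that one of them overstates risk. Because the Brier score also depends on the portfolio's base default rate, fixed thresholds are best read relative to a baseline for the same portfolio.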
Extremely high AUC scores on credit default datasets almost always indicate data leakage - a future variable has been included as a feature. Common leakage sources: including the default flag from a slightly different time window, including the loan status field that was derived from the same outcome, using post-application payment behavior in the feature set. A realistically achievable AUC for retail credit models using application-time features is 0.75-0.87. If your model scores above 0.90, systematically audit every feature's data timestamp relative to the application date before declaring success.
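A timestamp audit of this kind can be sketched with a simple join. The table layout below (a feature log with an observed_at date per value, joined to application dates) is a hypothetical structure for illustration; real feature stores will differ.

```python
import pandas as pd

# Hypothetical log of when each feature value was observed for each application
feature_log = pd.DataFrame({
    "application_id": [101, 101, 101, 102, 102],
    "feature": ["credit_score", "debt_to_income", "loan_status",
                "credit_score", "months_since_last_payment"],
    "observed_at": pd.to_datetime(["2023-01-10", "2023-01-12", "2023-09-30",
                                   "2023-02-01", "2023-06-15"]),
})
applications = pd.DataFrame({
    "application_id": [101, 102],
    "application_date": pd.to_datetime(["2023-01-15", "2023-02-03"]),
})

# Attach each application's date, then flag values observed after it
audit = feature_log.merge(applications, on="application_id", how="left")
audit["leaks_future_info"] = audit["observed_at"] > audit["application_date"]

leaky_features = sorted(audit.loc[audit["leaks_future_info"], "feature"].unique())
print(leaky_features)
```

Any feature that appears in the flagged list was observed after the lending decision and should be dropped or rebuilt from data available at application time.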
Further Reading
Board Infinity Guides:
- Introduction to Banking: A Beginner's Essential Guide
- How Data Science in Financial Modelling Helps Businesses
- Goldman Sachs GBM Private Summer Analyst Interview Guide
- Colliers Financial Analyst - Real Estate Interview Guide
- Building a Data Science Portfolio for Job Seekers
- Pro Tips for Building a Data Science Portfolio
- Is Data Literacy the New Mandatory Skill for Every Job Role?
- Personal Finance and Investment Planning
- Mastering the Art of Investment Banking
External Resources:
- Scikit-learn - Classification Models Documentation
- XGBoost Documentation - Gradient Boosting for Credit
- SHAP - Explainable AI for Credit Models
Apply AI & Machine Learning to Financial Forecasting on Coursera
This Coursera course by Board Infinity applies every credit risk ML concept in this guide through a structured 16-hour curriculum. Build classification models for credit scoring, master feature engineering for financial data, implement model validation with walk-forward testing, and apply generative AI to financial risk reporting - all using Python, Scikit-learn, and XGBoost.
Conclusion
Credit risk modeling with machine learning is one of the most consequential applications of data science in finance. PD, LGD, and EAD together determine expected loss - the number that drives lending decisions, interest rate pricing, regulatory capital requirements, and loan loss provisioning. Errors in these models have direct consequences for both lenders and borrowers: a model that underestimates PD funds defaults at scale; one that overestimates it denies credit to borrowers who would have repaid.
The workflow in this guide - from feature engineering through a logistic regression baseline, XGBoost for performance, and AUC/KS/Gini for evaluation - covers the production standard for retail credit ML at most financial institutions. The most important disciplines throughout: treat class imbalance explicitly, evaluate on discrimination metrics (not accuracy), winsorize rather than remove outliers, and validate on held-out data. The examples here use a random stratified split for simplicity; in production, prefer a chronological (out-of-time) split that prevents future data from contaminating your training set.
The next steps from here are model calibration (ensuring PD scores align with actual default rates for pricing decisions), SHAP-based explainability for regulatory compliance, and model monitoring - tracking whether a model's discrimination degrades over time as the population it was trained on shifts. Board Infinity's course on applying AI and machine learning to financial forecasting covers these advanced topics through Python-based labs applied to real financial datasets, building the complete credit risk ML skillset in a structured, project-based curriculum.