Credit Risk Modeling with Machine Learning: A Practical Introduction

Every time a bank approves or declines a loan application, an algorithm is involved. Every time a fintech company sets an interest rate for a personal loan, a credit risk model is running in the background. Every time a credit card company decides on a spending limit, machine learning is estimating the probability that a borrower will default. Credit risk modeling is one of the oldest applications of quantitative methods in finance - and one of the most rapidly transformed by machine learning in the last decade.

The shift from traditional scorecards (rules-based, manually calibrated) to ML-based credit models is not simply a technical upgrade. It fundamentally changes what signals a lender can use (thousands of features instead of dozens), how quickly models can be updated (retraining on new data vs. manual recalibration), and how accurately individual risk can be assessed (non-linear patterns that linear scoring misses). The cost of getting this wrong is significant: underestimate risk and you fund defaults; overestimate risk and you exclude creditworthy borrowers - with legal and reputational consequences in both directions. Board Infinity's introduction to banking guide covers how banks use credit assessment as a foundational function of their lending operations.

This guide walks through the complete credit risk ML workflow - from understanding the core risk metrics (PD, LGD, EAD) through traditional scorecards, logistic regression, ensemble models, feature engineering, and the evaluation metrics that regulators and risk managers actually use. The modeling sections include Python code you can adapt to your own data.

Who This Guide Is For

This guide is for:

  • Risk analysts at banks, fintechs, or credit bureaus who want to understand or build ML credit models
  • Data scientists entering finance who need the credit risk domain context
  • Finance professionals preparing for roles where credit modeling skills are assessed
  • Anyone building data science portfolio projects in the credit and lending space - Board Infinity's building a data science portfolio guide identifies credit scoring models as one of the highest-value portfolio projects for finance-focused ML roles

1. What Is Credit Risk? PD, LGD, EAD Explained

Credit risk is the risk of loss when a borrower fails to meet their financial obligations. In practice, "credit risk" is decomposed into three distinct components that together determine the Expected Loss (EL) on any loan or credit exposure.

Expected Loss = PD × LGD × EAD

PD - Probability of Default is the likelihood that a borrower will default within a given time horizon (typically 12 months for retail credit). This is what ML models primarily predict - a number between 0 and 1 representing default risk. A PD of 0.03 means a 3% chance of default within the year.

LGD - Loss Given Default is the percentage of the exposure that the lender expects to lose if the borrower does default. If a lender is owed $100,000 and expects to recover $60,000 through collateral and collections, LGD = 40%. Secured loans (mortgages) have lower LGD than unsecured loans (personal loans, credit cards).

EAD - Exposure at Default is the total amount the lender is exposed to at the time of default. For term loans, this is straightforward. For revolving credit (credit cards, lines of credit), the borrower may draw down more before defaulting, making EAD estimation more complex.

Understanding where data science fits in - Board Infinity's guide on How Data Science in Financial Modelling Helps Businesses shows how predictive modeling is transforming risk assessment and cash flow forecasting across financial institutions.

Component | Full Name | Typical Range | How ML Helps
PD | Probability of Default | 0.1% - 30%+ (retail) | Classification models predict PD directly from borrower features
LGD | Loss Given Default | 10% - 90% (varies by collateral) | Regression models estimate recovery rates from loan and collateral data
EAD | Exposure at Default | Outstanding balance to credit limit | ML predicts draw-down behavior for revolving credit facilities
EL | Expected Loss = PD × LGD × EAD | Varies widely by product/segment | All three components combined determine provisioning and pricing
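
To make the formula concrete, here is a minimal sketch of the expected-loss arithmetic for a single loan. The function name expected_loss is a hypothetical helper, and the inputs reuse the illustrative figures from the definitions above (3% PD, 40% LGD, $100,000 exposure) rather than data from a real portfolio.

```python
def expected_loss(pd_12m: float, lgd: float, ead: float) -> float:
    """Expected Loss = PD x LGD x EAD, in the same currency units as EAD."""
    if not (0.0 <= pd_12m <= 1.0 and 0.0 <= lgd <= 1.0):
        raise ValueError("PD and LGD must be fractions between 0 and 1")
    return pd_12m * lgd * ead


# Illustrative loan: 3% chance of default, 40% loss if it defaults,
# $100,000 outstanding at the time of default.
el = expected_loss(pd_12m=0.03, lgd=0.40, ead=100_000)
print(f"Expected loss: ${el:,.2f}")
```

Summing this quantity across every loan in a portfolio gives the portfolio-level expected loss that feeds provisioning and pricing.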

2. Traditional Scorecard vs ML-Based Models

For decades, credit scoring was dominated by traditional scorecards - point-based systems where each credit characteristic (payment history, credit utilization, length of credit history) is assigned a point value, and scores are summed to produce a final credit score. FICO scores are the most well-known example.

Traditional scorecards have significant strengths: they are fully transparent (every factor and weight is documented), auditable by regulators, stable over time, and well-understood by lenders. Their limitation is that they are linear, use a small number of pre-selected features, and require manual calibration to maintain accuracy as population characteristics shift.
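
A points-based scorecard can be sketched in a few lines. The characteristics, bands, point values, and base score below are invented for illustration; they are not FICO's or any lender's actual weights.

```python
# Hypothetical scorecard: each characteristic maps value bands to points.
# Each band is (exclusive upper bound, points) and bands are checked in order.
SCORECARD = {
    "missed_payments_24m": [(1, 60), (3, 30), (float("inf"), 0)],
    "credit_utilization":  [(0.30, 50), (0.70, 25), (float("inf"), 5)],
    "credit_age_years":    [(2, 10), (7, 30), (float("inf"), 45)],
}
BASE_POINTS = 400  # assumed starting score


def score_applicant(applicant: dict) -> tuple[int, dict]:
    """Return the total score and the points awarded per characteristic."""
    breakdown = {}
    for name, bands in SCORECARD.items():
        value = applicant[name]
        for upper, points in bands:
            if value < upper:
                breakdown[name] = points
                break
    return BASE_POINTS + sum(breakdown.values()), breakdown


total, breakdown = score_applicant(
    {"missed_payments_24m": 0, "credit_utilization": 0.82, "credit_age_years": 5}
)
print(total, breakdown)
```

The per-characteristic breakdown is what makes this style of model easy to audit: every point in the final score traces back to one documented rule.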

ML-based credit models handle thousands of features simultaneously, capture non-linear relationships between variables, and can be retrained automatically as new data arrives. They typically outperform traditional scorecards on discrimination (AUC-ROC), although their predicted probabilities often need a separate calibration check. The tradeoff is explainability - which regulators require - driving adoption of SHAP and LIME for post-hoc explanation of ML credit decisions.

Dimension | Traditional Scorecard | ML-Based Model
Features | 10-30 manually selected variables | Hundreds to thousands of features
Relationships | Linear only - additive point values | Non-linear - interactions and complex patterns
Accuracy (AUC) | Typically 0.65 - 0.75 | Typically 0.75 - 0.90+ on same data
Explainability | Fully transparent - every factor documented | Black box - requires SHAP/LIME for explanation
Regulatory acceptance | Well established - preferred by regulators | Increasingly accepted with explainability tools
Maintenance | Manual recalibration - 6-12 months cycle | Automated retraining on new data pipelines
In Practice: Most Lenders Use Both

The most common production setup at major lenders is a hybrid: an ML model provides the probability of default score, a traditional scorecard provides the regulatory-facing explanation ("your credit utilization was too high"), and SHAP values bridge the gap - generating the top factors from the ML model that drove a specific decision. This architecture gets the accuracy benefits of ML and the explainability requirements of fair lending compliance. Pure scorecard-only systems are increasingly rare at large financial institutions for new model development.
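
The "top factors" idea can be sketched without any extra libraries for a linear model. The example below ranks features by coefficient × standardized value, which is each feature's contribution to the applicant's log-odds of default relative to the portfolio average; for a linear model with independent features this is what SHAP values reduce to. It is a stand-in, not the SHAP library itself, which would be needed for tree-based models. The data is synthetic and the function name top_reasons is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
feature_names = ["credit_utilization", "num_missed_payments", "months_employed"]
n = 5000

# Synthetic applicants (illustrative only): higher utilization and more
# missed payments raise default risk; longer employment lowers it.
X = np.column_stack([
    rng.uniform(0, 1, n),
    rng.poisson(0.5, n),
    rng.integers(0, 240, n),
])
logit = -3 + 2.5 * X[:, 0] + 0.9 * X[:, 1] - 0.005 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

scaler = StandardScaler().fit(X)
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X), y)


def top_reasons(x_row, k=2):
    """Rank features by how far they push this applicant's log-odds of
    default above the portfolio average (coefficient x standardized value)."""
    z = scaler.transform(x_row.reshape(1, -1))[0]
    contributions = clf.coef_[0] * z
    order = np.argsort(contributions)[::-1][:k]
    return [(feature_names[i], float(contributions[i])) for i in order]


# Applicant with near-maxed cards, 3 missed payments, and 6 months in the job
applicant = np.array([0.95, 3, 6])
reasons = top_reasons(applicant)
print(reasons)
```

Each returned pair is a candidate reason code: the feature name maps to customer-facing wording, and the contribution determines the order in which reasons are listed.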

3. Logistic Regression for Default Prediction

Logistic regression is the baseline model for credit risk classification and remains widely deployed in regulated credit scoring environments. Despite its simplicity, it tends to produce well-calibrated probability estimates (unlike many black-box models), is fully interpretable (each coefficient's sign and magnitude give the feature's effect on the log-odds of default), and is fast to train and score at scale. Regulators and model validation teams tend to favor logistic regression because its behavior can be fully documented and challenged. Board Infinity's Goldman Sachs GBM Private Summer Analyst guide shows how quantitative risk frameworks at investment banks combine statistical rigor with regulatory compliance - the same balance logistic regression serves in credit.

Python - Logistic Regression Credit Default Model
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, classification_report

# === CREDIT FEATURES ===
# Standard retail credit variables
features = [
    'credit_score',           # FICO or bureau score
    'debt_to_income',         # DTI ratio (total debt payments / gross income)
    'num_missed_payments',    # 30+ day delinquencies in last 24 months
    'credit_utilization',     # revolving balance / credit limit
    'loan_to_value',          # for secured loans: loan amount / collateral value
    'months_employed',        # employment stability
    'num_accounts',           # breadth of credit history
    'loan_amount'             # exposure size
]
target = 'default_12m'   # 1 = defaulted within 12 months, 0 = did not

# df is assumed to be a pandas DataFrame already loaded with these columns
X = df[features]
y = df[target]

# === TRAIN/TEST SPLIT ===
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# === SCALE FEATURES ===
# Fit the scaler on the training set only, then reuse it on the test set
scaler         = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

# === LOGISTIC REGRESSION WITH L2 REGULARIZATION ===
# C = inverse regularization strength: smaller C = stronger regularization
model = LogisticRegression(C=0.1, max_iter=1000, random_state=42)
model.fit(X_train_scaled, y_train)

y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]  # PD scores
y_pred       = (y_pred_proba >= 0.5).astype(int)         # illustrative cutoff; tune for the portfolio

print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_proba):.3f}")
print(classification_report(y_test, y_pred))

# === COEFFICIENT INTERPRETATION ===
coef_df = pd.DataFrame({
    'feature':     features,
    'coefficient': model.coef_[0]
}).sort_values('coefficient', ascending=False)
print(coef_df)
# Coefficients are on standardized features (effect per 1 standard deviation)
# Positive coefficient = increases default probability
# Negative coefficient = decreases default probability
# e.g., num_missed_payments: +0.85 = strong positive predictor of default
#       credit_score: -0.72        = higher score strongly reduces default risk
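
Because the features are standardized, each coefficient is the change in the log-odds of default for a one-standard-deviation increase in that feature. Exponentiating turns it into an odds ratio, which is often easier to explain to non-technical stakeholders. The short sketch below uses the two illustrative coefficients from the comments above, not values from a fitted model.

```python
import numpy as np

# Illustrative coefficients on standardized features (example values only)
coefficients = {"num_missed_payments": 0.85, "credit_score": -0.72}

# exp(coefficient) = multiplicative change in the odds of default
# for a one-standard-deviation increase in the feature.
odds_ratios = {name: float(np.exp(beta)) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio per +1 SD = {ratio:.2f}")
```

An odds ratio above 1 means the feature raises the odds of default; below 1 means it lowers them.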
Credit Data Is Always Severely Imbalanced - Handle It Explicitly

In a typical retail credit portfolio, default rates are 2-8%. This means your dataset has 92-98% non-defaults and only 2-8% defaults. A naive model that predicts "no default" for everyone achieves 92-98% accuracy while being completely useless for credit risk. Always use stratify=y in train/test splits to maintain class proportions. Use class weights (class_weight='balanced' in sklearn) or SMOTE oversampling, applied to the training set only so that the test set keeps the real default rate. Evaluate on ROC-AUC, precision-recall, KS statistic, and Gini - not accuracy, which is misleading on imbalanced credit data.
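
The accuracy trap is easy to reproduce. The sketch below builds a synthetic portfolio with a mid-single-digit default rate (the data-generating coefficients are arbitrary assumptions) and compares an always-"no default" baseline with a class-weighted logistic regression.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 10_000
X = rng.normal(size=(n, 3))

# Synthetic portfolio with a low default rate (illustrative only)
logit = -3.6 + 1.2 * X[:, 0] + 0.8 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Naive baseline: always predict the majority class ("no default")
naive = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
naive_pred = naive.predict(X_te)
naive_acc = accuracy_score(y_te, naive_pred)
naive_recall = recall_score(y_te, naive_pred)

# Class-weighted model: errors on defaults are penalized more heavily
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
weighted_recall = recall_score(y_te, weighted.predict(X_te))
weighted_auc = roc_auc_score(y_te, weighted.predict_proba(X_te)[:, 1])

print(f"Default rate:        {y.mean():.1%}")
print(f"Naive accuracy:      {naive_acc:.1%} | defaults caught: {naive_recall:.0%}")
print(f"Weighted model AUC:  {weighted_auc:.3f} | defaults caught: {weighted_recall:.0%}")
```

The baseline scores high on accuracy while catching no defaults at all, which is why the discrimination metrics are the ones to report.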

4. Random Forest and Gradient Boosting for Credit Scoring

While logistic regression is the regulatory baseline, ensemble models - Random Forest and gradient boosting algorithms like XGBoost and LightGBM - consistently deliver higher discrimination performance on credit datasets. They handle non-linear relationships between features, automatically learn feature interactions, and are robust to outliers and multicollinearity. They are the production standard at most fintechs and increasingly at banks with ML governance frameworks in place.

Python - XGBoost Credit Scoring with Class Imbalance Handling
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
import numpy as np

# Reuses X_train, X_test, y_train, y_test, X_test_scaled, and model
# from the logistic regression example above.

# === CALCULATE CLASS WEIGHT FOR IMBALANCED DATA ===
default_rate = y_train.mean()
scale_pos_weight = (1 - default_rate) / default_rate
print(f"Default rate: {default_rate:.1%} | scale_pos_weight: {scale_pos_weight:.1f}")
# e.g., 5% default rate -> scale_pos_weight = 19 (19 non-defaults per default)

# === XGBOOST MODEL ===
xgb_model = XGBClassifier(
    n_estimators=300,
    max_depth=4,              # shallow trees reduce overfitting
    learning_rate=0.05,       # slow learning rate + more trees usually generalizes better
    subsample=0.8,            # 80% of rows per tree - reduces overfitting
    colsample_bytree=0.8,     # 80% of features per tree
    scale_pos_weight=scale_pos_weight,  # handle class imbalance
    eval_metric='auc',
    random_state=42
)
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False
)

# === RANDOM FOREST MODEL (comparison baseline) ===
rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=6,
    class_weight='balanced',  # auto-adjusts for class imbalance
    random_state=42
)
rf_model.fit(X_train, y_train)

# === MODEL COMPARISON ===
xgb_auc = roc_auc_score(y_test, xgb_model.predict_proba(X_test)[:, 1])
rf_auc  = roc_auc_score(y_test, rf_model.predict_proba(X_test)[:, 1])
lr_auc  = roc_auc_score(y_test, model.predict_proba(X_test_scaled)[:, 1])
print(f"Logistic Regression AUC: {lr_auc:.3f}")
print(f"Random Forest AUC:       {rf_auc:.3f}")
print(f"XGBoost AUC:             {xgb_auc:.3f}")
# Often: XGBoost > Random Forest > Logistic Regression
# Illustrative values: LR ~0.72 -> RF ~0.78 -> XGB ~0.83 on the same credit data;
# the actual gap depends on the dataset and features

5. Feature Engineering for Financial Data

Feature engineering - transforming raw data into predictive variables - is where the most significant performance gains come from in credit risk modeling. The raw inputs (credit score, income, loan amount) are improved by creating derived features that capture behavioral patterns, ratios, trends, and interaction effects that raw variables miss.

Python - Credit Feature Engineering
import pandas as pd
import numpy as np

# === RAW FEATURES (from credit application + bureau data) ===
# df is assumed to have: credit_score, annual_income, loan_amount, monthly_debt,
#         num_accounts, oldest_account_months, num_missed_12m,
#         revolving_balance, revolving_limit, num_hard_inquiries

# === RATIO FEATURES ===
# replace(0, np.nan) avoids division by zero for empty denominators
df['debt_to_income']      = df['monthly_debt'] / (df['annual_income'].replace(0, np.nan) / 12)
df['loan_to_income']      = df['loan_amount'] / df['annual_income'].replace(0, np.nan)
df['credit_utilization']  = df['revolving_balance'] / df['revolving_limit'].replace(0, np.nan)
df['payment_burden']      = df['monthly_debt'] / df['loan_amount']

# === BEHAVIORAL FEATURES ===
df['delinquency_rate']    = df['num_missed_12m'] / df['num_accounts'].replace(0, np.nan)
df['inquiry_intensity']   = df['num_hard_inquiries'] / 12  # monthly inquiry rate
df['any_delinquency']     = (df['num_missed_12m'] > 0).astype(int)  # binary flag
df['severe_delinquency']  = (df['num_missed_12m'] >= 3).astype(int)

# === CREDIT HISTORY FEATURES ===
df['credit_age_years']    = df['oldest_account_months'] / 12
df['accounts_per_year']   = df['num_accounts'] / (df['credit_age_years'] + 0.1)

# === CREDIT SCORE BUCKETS ===
# Bins are right-inclusive: 300-579, 580-669, 670-739, 740-799, 800-850
df['score_bucket'] = pd.cut(
    df['credit_score'],
    bins=[300, 579, 669, 739, 799, 850],
    labels=['Poor', 'Fair', 'Good', 'Very Good', 'Exceptional'],
    include_lowest=True   # keep a score of exactly 300 in the first bucket
)
# One-hot encode for ML models
df = pd.get_dummies(df, columns=['score_bucket'], drop_first=True)

# === INTERACTION FEATURES ===
df['high_dti_poor_credit'] = (
    (df['debt_to_income'] > 0.43) &
    (df['credit_score'] < 620)
).astype(int)   # high-risk combination flag

# === WINSORIZE EXTREME VALUES ===
# In a train/test workflow, compute p1/p99 on the training set only
# and reuse those caps on the test set.
for col in ['debt_to_income', 'credit_utilization', 'loan_to_income']:
    p1  = df[col].quantile(0.01)
    p99 = df[col].quantile(0.99)
    df[col] = df[col].clip(p1, p99)    # cap extreme outliers

print(f"Features created: {df.shape[1]} total columns")
Winsorize - Don't Remove - Extreme Credit Values

Credit datasets contain genuine extreme values - a DTI of 3.5 (debt payments 350% of income) is unusual but real, and often highly predictive of default. Removing these observations loses real signal. Instead, winsorize: cap values at the 1st and 99th percentile. A DTI of 3.5 gets capped at, say, 1.2 (the 99th percentile), preserving the extreme-risk signal without letting one outlier distort the model's coefficients. Apply winsorization to the training set, then apply the same caps to the test set using the training set's quantile values - never fit winsorization boundaries on test data.
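
One way to keep the caps honest is to separate "learn the caps" from "apply the caps". The sketch below is a minimal version of that pattern; the helper names fit_winsor_caps and apply_winsor_caps are hypothetical, and the DTI values are synthetic.

```python
import numpy as np
import pandas as pd


def fit_winsor_caps(train: pd.DataFrame, cols, lower=0.01, upper=0.99) -> dict:
    """Learn per-column (low, high) caps from the training set only."""
    return {c: (train[c].quantile(lower), train[c].quantile(upper)) for c in cols}


def apply_winsor_caps(df: pd.DataFrame, caps: dict) -> pd.DataFrame:
    """Clip each column to previously learned caps; returns a copy."""
    out = df.copy()
    for col, (lo, hi) in caps.items():
        out[col] = out[col].clip(lower=lo, upper=hi)
    return out


rng = np.random.default_rng(1)
# Synthetic, right-skewed DTI values for the training set (illustrative only)
train = pd.DataFrame({"debt_to_income": rng.lognormal(-1.0, 0.6, 5000)})
# Test set containing the extreme DTI of 3.5 discussed above
test = pd.DataFrame({"debt_to_income": [0.25, 0.40, 3.5]})

caps = fit_winsor_caps(train, ["debt_to_income"])
train_capped = apply_winsor_caps(train, caps)
test_capped = apply_winsor_caps(test, caps)

print(f"Training caps: {caps['debt_to_income'][0]:.2f} to {caps['debt_to_income'][1]:.2f}")
print(test_capped)
```

The extreme test value is pulled down to the training set's 99th percentile, while ordinary values pass through unchanged and the test data never influences the boundaries.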

6. Model Evaluation: AUC-ROC, KS Statistic, Gini Coefficient

Credit risk model evaluation uses different metrics than general classification problems, because the goal is not just accuracy but discrimination (how well the model separates defaulters from non-defaulters) and calibration (how well the predicted PD aligns with actual default rates). Regulators, model validation teams, and risk managers use specific metrics that are standard in the credit industry.

Python - Credit Model Evaluation: AUC, KS, Gini, and Calibration
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

y_scores = xgb_model.predict_proba(X_test)[:, 1]  # predicted PD scores

# === 1. AUC-ROC (Area Under ROC Curve) ===
auc = roc_auc_score(y_test, y_scores)
print(f"AUC-ROC: {auc:.3f}")
# Interpretation:
# > 0.75: acceptable for credit scoring
# > 0.80: good - clear separation between defaults and non-defaults
# > 0.85: strong - production-grade for most retail credit products
# > 0.90: excellent - rare in credit (data may have leakage - investigate)

# === 2. GINI COEFFICIENT ===
gini = 2 * auc - 1
print(f"Gini: {gini:.3f}")
# Gini = 2 * AUC - 1 | Range: 0 (random) to 1 (perfect)
# Widely used in credit: Gini > 0.40 considered acceptable

# === 3. KS STATISTIC (Kolmogorov-Smirnov) ===
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
ks_idx = np.argmax(tpr - fpr)          # point of maximum separation
ks_stat = tpr[ks_idx] - fpr[ks_idx]
ks_threshold = thresholds[ks_idx]
print(f"KS Statistic: {ks_stat:.3f} at threshold: {ks_threshold:.3f}")
# KS = max separation between cumulative default and non-default distributions
# KS > 0.20: acceptable | > 0.40: good | > 0.60: excellent
# The KS threshold is often used as the decision cutoff for approve/decline

# === 4. ROC CURVE PLOT ===
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# ROC Curve
axes[0].plot(fpr, tpr, color='#0f3460', linewidth=2,
             label=f'XGBoost (AUC = {auc:.3f}, Gini = {gini:.3f})')
axes[0].plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random (AUC = 0.500)')
axes[0].scatter([fpr[ks_idx]], [tpr[ks_idx]],
                color='#e94560', s=100, zorder=5, label=f'KS = {ks_stat:.3f}')
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curve - Credit Default Model', fontweight='bold')
axes[0].legend()

# Score distribution (defaults vs non-defaults)
default_scores = y_scores[y_test == 1]
clean_scores   = y_scores[y_test == 0]
axes[1].hist(clean_scores, bins=50, alpha=0.6, color='#0f3460', label='Non-Default')
axes[1].hist(default_scores, bins=50, alpha=0.6, color='#e94560', label='Default')
axes[1].set_xlabel('Predicted PD Score')
axes[1].set_title('Score Distribution by Outcome', fontweight='bold')
axes[1].legend()

plt.tight_layout()
plt.savefig('credit_model_evaluation.png', dpi=150, bbox_inches='tight')
plt.show()
Metric | Formula / Method | Acceptable | Good | What It Measures
AUC-ROC | Area under ROC curve | >0.75 | >0.80 | Probability model ranks a random default above a random non-default
Gini | 2 × AUC - 1 | >0.40 | >0.60 | Discrimination - used by Basel and many regulators as primary metric
KS Statistic | Max(TPR - FPR) on ROC curve | >0.20 | >0.40 | Maximum separation between default and non-default score distributions
Brier Score | Mean((PD - actual)²) | <0.05 | <0.03 | Calibration - how well predicted probabilities match actual default rates
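
The Brier score in the table is the one metric here that measures calibration, so a short sketch of how it might be computed is useful. The example uses synthetic PDs that are calibrated by construction (outcomes are drawn from the PDs themselves), compares the Brier score against a constant base-rate forecast, and bins predictions with scikit-learn's calibration_curve. The Beta distribution parameters are arbitrary assumptions.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(3)

# Synthetic, well-calibrated PDs: outcomes are sampled from the PDs themselves
true_pd = rng.beta(1, 15, size=20_000)   # average PD of roughly 6%
outcomes = rng.binomial(1, true_pd)

brier_model = brier_score_loss(outcomes, true_pd)

# Reference point: predict the portfolio default rate for every borrower
base_rate = outcomes.mean()
brier_baseline = brier_score_loss(outcomes, np.full_like(true_pd, base_rate))

# Brier skill score: above 0 means the model beats the base-rate forecast
skill = 1 - brier_model / brier_baseline

# Observed default rate vs mean predicted PD within ten equal-sized bins
observed, predicted = calibration_curve(
    outcomes, true_pd, n_bins=10, strategy="quantile"
)

print(f"Brier (model):    {brier_model:.4f}")
print(f"Brier (baseline): {brier_baseline:.4f}")
print(f"Skill score:      {skill:.3f}")
for obs, pred in zip(observed, predicted):
    print(f"predicted {pred:.3f} | observed {obs:.3f}")
```

The Brier score scales with the portfolio's default rate, so even these calibrated PDs land in the 0.05 to 0.06 range. Reading the score against the base-rate baseline, and checking the binned predicted-versus-observed rates, gives more context than an absolute threshold alone.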
AUC > 0.90 on Credit Data Is Usually a Red Flag

Extremely high AUC scores on credit default datasets almost always indicate data leakage - a future variable has been included as a feature. Common leakage sources: including the default flag from a slightly different time window, including the loan status field that was derived from the same outcome, using post-application payment behavior in the feature set. A realistically achievable AUC for retail credit models using application-time features is 0.75-0.87. If your model scores above 0.90, systematically audit every feature's data timestamp relative to the application date before declaring success.
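
A timestamp audit can be as simple as a table recording when each feature becomes known relative to the application date. The sketch below assumes such metadata exists; the column name known_days_after_application, the feature list, and the function flag_leaky_features are all hypothetical.

```python
import pandas as pd

# Hypothetical metadata: when each feature's value becomes known, in days
# relative to the application date (negative or zero = known at application).
feature_availability = pd.DataFrame({
    "feature": ["credit_score", "debt_to_income", "loan_status", "payments_made_6m"],
    "known_days_after_application": [-1, 0, 365, 180],
})


def flag_leaky_features(meta: pd.DataFrame) -> list[str]:
    """Return features whose values are only known after the application date."""
    leaky = meta[meta["known_days_after_application"] > 0]
    return leaky["feature"].tolist()


leaky = flag_leaky_features(feature_availability)
print(f"Drop before training: {leaky}")
```

Anything flagged here would not exist at decision time, so it should be removed from the feature set before the model is retrained and re-evaluated.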

Conclusion

Credit risk modeling with machine learning is one of the most consequential applications of data science in finance. PD, LGD, and EAD together determine expected loss - the number that drives lending decisions, interest rate pricing, regulatory capital requirements, and loan loss provisioning. Getting these models right matters in ways that getting a stock prediction model right does not: a model that underestimates PD funds defaults at scale; one that overestimates it denies credit to borrowers who would have repaid.

The workflow in this guide - from feature engineering through a logistic regression baseline, XGBoost for performance, and AUC/KS/Gini for evaluation - reflects common practice for retail credit ML at many financial institutions. The most important disciplines throughout: treat class imbalance explicitly, evaluate on discrimination metrics (not accuracy), winsorize rather than remove outliers, and validate on held-out data. The examples here use a stratified random split for simplicity; in production, prefer a chronological (out-of-time) split so that future data cannot contaminate your training set.
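
A chronological split can be sketched as follows. The column name application_date, the synthetic loan table, and the helper out_of_time_split are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 3000

# Synthetic applications spread over three years (illustrative only)
loans = pd.DataFrame({
    "application_date": pd.to_datetime("2021-01-01")
    + pd.to_timedelta(rng.integers(0, 1095, n), unit="D"),
    "credit_score": rng.integers(300, 851, n),
    "default_12m": rng.binomial(1, 0.05, n),
})


def out_of_time_split(df: pd.DataFrame, date_col: str, test_fraction: float = 0.2):
    """Train on the earliest applications and test on the most recent ones.

    Rows sharing the cutoff date all go to the test set, so no single day
    appears on both sides of the split.
    """
    ordered = df.sort_values(date_col)
    split_at = int(len(ordered) * (1 - test_fraction))
    cutoff = ordered[date_col].iloc[split_at]
    train = ordered[ordered[date_col] < cutoff]
    test = ordered[ordered[date_col] >= cutoff]
    return train, test, cutoff


train, test, cutoff = out_of_time_split(loans, "application_date")
print(f"Cutoff date: {cutoff.date()}")
print(f"Train rows: {len(train)} | Test rows: {len(test)}")
```

Unlike a stratified random split, this does not guarantee equal default rates in both sets, so it is worth checking the default rate on each side before comparing metrics.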

The next steps from here are model calibration (ensuring PD scores align with actual default rates for pricing decisions), SHAP-based explainability for regulatory compliance, and model monitoring - tracking whether a model's discrimination degrades over time as the population it was trained on shifts. Board Infinity's course on applying AI and machine learning to financial forecasting covers these advanced topics through Python-based labs applied to real financial datasets, building the complete credit risk ML skillset in a structured, project-based curriculum.
