Understanding ML Model Evaluation Metrics

A comprehensive guide to model evaluation metrics. Learn when to use accuracy, precision, recall, F1-score, ROC-AUC, and more to properly evaluate your machine learning models.

By Dery Febriantara

Choosing the right evaluation metric is crucial for building effective machine learning models. A model that looks great on one metric might perform poorly on another. In this comprehensive guide, we’ll explore all the essential metrics, understand when to use each, and learn how to interpret results correctly.

Why Metrics Matter

Model evaluation metrics tell us how well our model performs on unseen data. Different metrics capture different aspects of model performance, and choosing the wrong metric can lead to:

  • Deploying models that fail in production
  • Optimizing for the wrong objective
  • Missing critical failure modes
  • False confidence in model quality

The Problem with Accuracy

Accuracy is the most intuitive metric, but it can be dangerously misleading.

The Accuracy Paradox

import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# Imagine a fraud detection system
# 99% of transactions are legitimate, 1% are fraud
n_samples = 10000
np.random.seed(42)

# Actual labels: 99% legitimate (0), 1% fraud (1)
y_true = np.array([0] * 9900 + [1] * 100)

# A "smart" model that always predicts legitimate
y_pred_always_legit = np.zeros(n_samples)

# Our actual fraud detection model
# Catches 80% of fraud, 5% false positive rate
y_pred_model = np.zeros(n_samples)
y_pred_model[np.where(y_true == 1)[0][:80]] = 1  # Catch 80 out of 100 frauds
y_pred_model[np.random.choice(9900, 495, replace=False)] = 1  # 5% false positives

print("Always Predict Legitimate:")
print(f"  Accuracy: {accuracy_score(y_true, y_pred_always_legit):.2%}")
print(f"  Frauds caught: 0/100")

print("\nOur Fraud Detection Model:")
print(f"  Accuracy: {accuracy_score(y_true, y_pred_model):.2%}")
print(f"  Frauds caught: 80/100")

The “always legitimate” model achieves 99% accuracy but catches zero fraudsters. This demonstrates why accuracy alone is insufficient for imbalanced datasets.

The Confusion Matrix

The confusion matrix is the foundation for understanding classification performance.

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Generate sample predictions
np.random.seed(42)
y_true = np.random.randint(0, 2, 1000)
y_pred = np.random.randint(0, 2, 1000)
y_pred[y_true == 1] = np.where(np.random.random(sum(y_true == 1)) > 0.2, 1, 0)
y_pred[y_true == 0] = np.where(np.random.random(sum(y_true == 0)) > 0.15, 0, 1)

# Create confusion matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)

# Visualize
fig, ax = plt.subplots(figsize=(8, 6))
disp = ConfusionMatrixDisplay(cm, display_labels=['Negative', 'Positive'])
disp.plot(ax=ax, cmap='Blues')
plt.title('Confusion Matrix')
plt.show()

Understanding the Four Quadrants

                     Predicted Negative      Predicted Positive
Actual Negative      True Negative (TN)      False Positive (FP)
Actual Positive      False Negative (FN)     True Positive (TP)

  • True Negatives (TN): Correctly predicted negative cases
  • True Positives (TP): Correctly predicted positive cases
  • False Positives (FP): Incorrectly predicted as positive (Type I error)
  • False Negatives (FN): Incorrectly predicted as negative (Type II error)

Extracting Values from Confusion Matrix

def extract_confusion_values(y_true, y_pred):
    """Extract TN, FP, FN, TP from confusion matrix."""
    cm = confusion_matrix(y_true, y_pred)

    if cm.shape == (2, 2):
        tn, fp, fn, tp = cm.ravel()
        return {
            'TN': tn, 'FP': fp, 'FN': fn, 'TP': tp,
            'Total': tn + fp + fn + tp,
            'Actual Positive': fn + tp,
            'Actual Negative': tn + fp,
            'Predicted Positive': fp + tp,
            'Predicted Negative': tn + fn
        }

    return cm

values = extract_confusion_values(y_true, y_pred)
for key, value in values.items():
    print(f"{key}: {value}")

Classification Metrics Deep Dive

Accuracy

Definition: Proportion of correct predictions

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Manual calculation
values = extract_confusion_values(y_true, y_pred)
manual_accuracy = (values['TP'] + values['TN']) / values['Total']
print(f"Manual Accuracy: {manual_accuracy:.4f}")

Use when:

  • Classes are balanced
  • Both types of errors are equally costly
  • Quick baseline evaluation

Don’t use when:

  • Classes are imbalanced
  • Different errors have different costs
  • You need to understand error types

Precision

Definition: Of all predicted positives, how many are actually positive?

$$\text{Precision} = \frac{TP}{TP + FP}$$

from sklearn.metrics import precision_score

precision = precision_score(y_true, y_pred)
print(f"Precision: {precision:.4f}")

# Manual calculation
manual_precision = values['TP'] / (values['TP'] + values['FP'])
print(f"Manual Precision: {manual_precision:.4f}")

High precision means: When the model predicts positive, it’s usually right.

Use when:

  • False positives are costly
  • Example: Spam detection (don’t want to lose important emails)
  • Example: Recommendation systems (bad recommendations hurt trust)

Recall (Sensitivity, True Positive Rate)

Definition: Of all actual positives, how many did we catch?

$$\text{Recall} = \frac{TP}{TP + FN}$$

from sklearn.metrics import recall_score

recall = recall_score(y_true, y_pred)
print(f"Recall: {recall:.4f}")

# Manual calculation
manual_recall = values['TP'] / (values['TP'] + values['FN'])
print(f"Manual Recall: {manual_recall:.4f}")

High recall means: We catch most of the actual positives.

Use when:

  • False negatives are costly
  • Example: Disease detection (don’t want to miss sick patients)
  • Example: Fraud detection (don’t want to miss fraudsters)

The Precision-Recall Trade-off

Precision and recall typically pull in opposite directions: as you adjust the decision threshold, improving one usually hurts the other.

from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

# Get probability predictions
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Generate sample data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2,
                          n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

# Calculate precision-recall curve
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(thresholds, precisions[:-1], label='Precision')
plt.plot(thresholds, recalls[:-1], label='Recall')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Precision-Recall Trade-off')
plt.legend()
plt.grid(True)
plt.show()

# Plot precision vs recall
plt.figure(figsize=(10, 6))
plt.plot(recalls, precisions, marker='.')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.grid(True)
plt.show()

F1 Score

Definition: Harmonic mean of precision and recall

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}$$

from sklearn.metrics import f1_score

f1 = f1_score(y_true, y_pred)
print(f"F1 Score: {f1:.4f}")

# Manual calculation
manual_f1 = 2 * (precision * recall) / (precision + recall)
print(f"Manual F1: {manual_f1:.4f}")

Why harmonic mean? The harmonic mean penalizes extreme differences. If precision = 1.0 and recall = 0.1:

  • Arithmetic mean: 0.55
  • Harmonic mean: 0.18

The harmonic mean better reflects poor performance in either metric.
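
As a quick sanity check, here is a minimal computation of both means for the values quoted above (plain arithmetic, no model involved):

# Arithmetic vs. harmonic mean for precision = 1.0, recall = 0.1
p_ex, r_ex = 1.0, 0.1

arithmetic_mean = (p_ex + r_ex) / 2
harmonic_mean = 2 * p_ex * r_ex / (p_ex + r_ex)

print(f"Arithmetic mean: {arithmetic_mean:.2f}")  # 0.55
print(f"Harmonic mean:   {harmonic_mean:.2f}")    # ~0.18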

Use when:

  • You need a single metric balancing precision and recall
  • Both false positives and false negatives matter
  • Classes are imbalanced

F-Beta Score

Generalization of F1 that allows weighting precision vs recall.

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{(\beta^2 \cdot \text{Precision}) + \text{Recall}}$$

from sklearn.metrics import fbeta_score

# F0.5 - prioritizes precision (2x weight)
f05 = fbeta_score(y_true, y_pred, beta=0.5)
print(f"F0.5 Score (prioritize precision): {f05:.4f}")

# F1 - equal weight
f1 = fbeta_score(y_true, y_pred, beta=1.0)
print(f"F1 Score (balanced): {f1:.4f}")

# F2 - prioritizes recall (2x weight)
f2 = fbeta_score(y_true, y_pred, beta=2.0)
print(f"F2 Score (prioritize recall): {f2:.4f}")

Choosing beta:

  • β < 1: Precision is more important
  • β = 1: Equal importance (F1)
  • β > 1: Recall is more important

Specificity (True Negative Rate)

Definition: Of all actual negatives, how many did we correctly identify?

$$\text{Specificity} = \frac{TN}{TN + FP}$$

def specificity_score(y_true, y_pred):
    """Calculate specificity (true negative rate)."""
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    return tn / (tn + fp)

specificity = specificity_score(y_true, y_pred)
print(f"Specificity: {specificity:.4f}")

Use when:

  • Correctly identifying negatives is important
  • Example: Criminal justice (don’t convict innocent people)

Complete Classification Report

from sklearn.metrics import classification_report

# Generate multi-class example
y_true_multi = np.random.randint(0, 3, 1000)
y_pred_multi = y_true_multi.copy()
# Add some errors
error_idx = np.random.choice(1000, 200, replace=False)
y_pred_multi[error_idx] = np.random.randint(0, 3, 200)

print(classification_report(y_true_multi, y_pred_multi,
                           target_names=['Class 0', 'Class 1', 'Class 2']))

Understanding the report:

  • support: Number of actual occurrences of each class
  • macro avg: Average of metrics (treats all classes equally)
  • weighted avg: Weighted by support (accounts for class imbalance); the sketch below recomputes these averages directly
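
The averaged rows can be reproduced with the per-class scorers. A minimal sketch, assuming y_true_multi and y_pred_multi from the block above:

from sklearn.metrics import precision_score

# Recompute the report's precision averages (recall and F1 work the same way)
macro_p = precision_score(y_true_multi, y_pred_multi, average='macro')        # plain mean over classes
weighted_p = precision_score(y_true_multi, y_pred_multi, average='weighted')  # mean weighted by support
micro_p = precision_score(y_true_multi, y_pred_multi, average='micro')        # global TP / (TP + FP)

print(f"Macro precision:    {macro_p:.4f}")
print(f"Weighted precision: {weighted_p:.4f}")
print(f"Micro precision:    {micro_p:.4f}")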

ROC-AUC: Area Under the ROC Curve

The ROC curve plots True Positive Rate vs False Positive Rate at various thresholds.

Creating ROC Curves

from sklearn.metrics import roc_curve, roc_auc_score, auc
import matplotlib.pyplot as plt

# Get predictions from our model
y_proba = model.predict_proba(X_test)[:, 1]

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

# Plot
plt.figure(figsize=(10, 8))
plt.plot(fpr, tpr, color='darkorange', lw=2,
         label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--',
         label='Random classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

print(f"ROC-AUC Score: {roc_auc:.4f}")

Interpreting ROC-AUC

AUC Score     Interpretation
0.50          Random guessing
0.50-0.60     Fail
0.60-0.70     Poor
0.70-0.80     Fair
0.80-0.90     Good
0.90-1.00     Excellent
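
A useful way to read these numbers: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counted as half). A minimal check of that equivalence, assuming y_test and y_proba from the binary example above:

# AUC as a ranking probability: P(positive score > negative score), ties counted as 0.5
pos_scores = y_proba[y_test == 1]
neg_scores = y_proba[y_test == 0]

diffs = pos_scores[:, None] - neg_scores[None, :]              # all positive/negative score pairs
pairwise_auc = (diffs > 0).mean() + 0.5 * (diffs == 0).mean()  # fraction of correctly ranked pairs

print(f"Pairwise ranking probability: {pairwise_auc:.4f}")
print(f"roc_auc_score:                {roc_auc_score(y_test, y_proba):.4f}")  # should agree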

Multi-Class ROC-AUC

from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier

# Create multi-class data
X_multi, y_multi = make_classification(n_samples=1000, n_features=20, n_classes=3,
                                       n_informative=10, n_clusters_per_class=1,
                                       random_state=42)

# Binarize labels
y_multi_bin = label_binarize(y_multi, classes=[0, 1, 2])
n_classes = y_multi_bin.shape[1]

X_train, X_test, y_train, y_test = train_test_split(X_multi, y_multi_bin,
                                                     test_size=0.3, random_state=42)

# Train OvR classifier
classifier = OneVsRestClassifier(LogisticRegression())
classifier.fit(X_train, y_train)
y_proba = classifier.predict_proba(X_test)

# Calculate ROC curve for each class
plt.figure(figsize=(10, 8))
colors = ['blue', 'red', 'green']

for i, color in enumerate(colors):
    fpr, tpr, _ = roc_curve(y_test[:, i], y_proba[:, i])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, color=color, lw=2,
             label=f'ROC curve class {i} (AUC = {roc_auc:.2f})')

plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Multi-class ROC Curve')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

# Calculate macro and micro average AUC
from sklearn.metrics import roc_auc_score

# For multi-label/binary indicator format
macro_auc = roc_auc_score(y_test, y_proba, multi_class='ovr', average='macro')
print(f"Macro Average AUC: {macro_auc:.4f}")

Precision-Recall Curves and Average Precision

ROC-AUC can be misleading for imbalanced datasets. Precision-Recall curves are often more informative.

Creating PR Curves

from sklearn.metrics import precision_recall_curve, average_precision_score

# Create imbalanced dataset
X_imb, y_imb = make_classification(n_samples=1000, n_features=20, n_classes=2,
                                   n_informative=10, weights=[0.95, 0.05],
                                   random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X_imb, y_imb,
                                                     test_size=0.3, random_state=42)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

# Calculate PR curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
avg_precision = average_precision_score(y_test, y_proba)

# Plot
plt.figure(figsize=(10, 8))
plt.plot(recall, precision, color='darkorange', lw=2,
         label=f'PR curve (AP = {avg_precision:.2f})')

# Baseline: ratio of positive class
baseline = y_test.sum() / len(y_test)
plt.axhline(y=baseline, color='navy', linestyle='--',
            label=f'Baseline (ratio = {baseline:.2f})')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc="upper right")
plt.grid(True)
plt.show()

print(f"Average Precision: {avg_precision:.4f}")

When to Use PR vs ROC

Scenario                                Recommended Curve
Balanced classes                        ROC
Imbalanced classes                      PR
Cost of FP matters                      PR
Both error types matter equally         ROC
Comparing models on imbalanced data     PR
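
One way to see the difference is to compute both summary scores on the same imbalanced predictions. A small sketch, assuming y_test and y_proba from the imbalanced example above:

from sklearn.metrics import roc_auc_score, average_precision_score

# On heavily imbalanced data the two summaries can tell different stories:
# ROC-AUC is computed against the large negative class, while average
# precision is anchored to the positive-class baseline.
print(f"ROC-AUC:                 {roc_auc_score(y_test, y_proba):.4f}")
print(f"Average precision:       {average_precision_score(y_test, y_proba):.4f}")
print(f"Positive-class baseline: {y_test.mean():.4f}")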

Regression Metrics

For continuous target variables, we use different metrics.

Mean Absolute Error (MAE)

Definition: Average absolute difference between predictions and actual values

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

from sklearn.metrics import mean_absolute_error
import numpy as np

# Generate sample regression data
np.random.seed(42)
y_true_reg = np.random.randn(100) * 10 + 50
y_pred_reg = y_true_reg + np.random.randn(100) * 5

mae = mean_absolute_error(y_true_reg, y_pred_reg)
print(f"MAE: {mae:.4f}")

# Manual calculation
manual_mae = np.mean(np.abs(y_true_reg - y_pred_reg))
print(f"Manual MAE: {manual_mae:.4f}")

Characteristics:

  • Same units as target variable
  • Easy to interpret (average error)
  • Robust to outliers
  • All errors weighted equally

Mean Squared Error (MSE) and RMSE

Definition: Average squared difference between predictions and actual values

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

$$RMSE = \sqrt{MSE}$$

from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_true_reg, y_pred_reg)
rmse = np.sqrt(mse)

print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")

# Manual calculation
manual_mse = np.mean((y_true_reg - y_pred_reg)**2)
manual_rmse = np.sqrt(manual_mse)
print(f"Manual MSE: {manual_mse:.4f}")
print(f"Manual RMSE: {manual_rmse:.4f}")

Characteristics:

  • Penalizes large errors more heavily
  • RMSE in same units as target
  • Sensitive to outliers
  • Commonly used in optimization

MAE vs RMSE

# Demonstrate sensitivity to outliers
y_true_outlier = np.array([10, 20, 30, 40, 50])
y_pred_outlier = np.array([11, 21, 31, 41, 100])  # One large error

mae_outlier = mean_absolute_error(y_true_outlier, y_pred_outlier)
rmse_outlier = np.sqrt(mean_squared_error(y_true_outlier, y_pred_outlier))

print(f"With outlier:")
print(f"  MAE: {mae_outlier:.2f}")
print(f"  RMSE: {rmse_outlier:.2f}")

# Without outlier
y_true_clean = np.array([10, 20, 30, 40])
y_pred_clean = np.array([11, 21, 31, 41])

mae_clean = mean_absolute_error(y_true_clean, y_pred_clean)
rmse_clean = np.sqrt(mean_squared_error(y_true_clean, y_pred_clean))

print(f"\nWithout outlier:")
print(f"  MAE: {mae_clean:.2f}")
print(f"  RMSE: {rmse_clean:.2f}")

R² Score (Coefficient of Determination)

Definition: Proportion of variance explained by the model

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

from sklearn.metrics import r2_score

r2 = r2_score(y_true_reg, y_pred_reg)
print(f"R² Score: {r2:.4f}")

# Manual calculation
ss_res = np.sum((y_true_reg - y_pred_reg)**2)  # Residual sum of squares
ss_tot = np.sum((y_true_reg - np.mean(y_true_reg))**2)  # Total sum of squares
manual_r2 = 1 - (ss_res / ss_tot)
print(f"Manual R²: {manual_r2:.4f}")

Interpretation:

  • R² = 1.0: Perfect predictions
  • R² = 0.0: Model predicts the mean
  • R² < 0.0: Model is worse than predicting the mean (this can happen; see the sketch below)
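
A minimal illustration of the negative case, reusing y_true_reg from above with a deliberately bad constant prediction:

# A constant prediction far from the data does worse than predicting the mean,
# so R² comes out negative
bad_pred = np.full_like(y_true_reg, 100.0)   # y_true_reg is centered around 50
print(f"R² of a bad constant model: {r2_score(y_true_reg, bad_pred):.4f}")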

Mean Absolute Percentage Error (MAPE)

Definition: Average percentage error

$$MAPE = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

from sklearn.metrics import mean_absolute_percentage_error

mape = mean_absolute_percentage_error(y_true_reg, y_pred_reg)
print(f"MAPE: {mape:.4%}")

# Manual calculation (avoiding division by zero)
mask = y_true_reg != 0
manual_mape = np.mean(np.abs((y_true_reg[mask] - y_pred_reg[mask]) / y_true_reg[mask]))
print(f"Manual MAPE: {manual_mape:.4%}")

Characteristics:

  • Scale-independent (percentage)
  • Undefined when y = 0
  • Asymmetric (over- and under-predictions of the same size are not penalized equally; see the sketch below)
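
A small sketch of the asymmetry: the same absolute error of 50 produces a different percentage penalty depending on which value is the actual, so mirrored over- and under-forecasts are not scored equally:

# Same absolute error (50), different MAPE depending on the actual value
print(f"{mean_absolute_percentage_error([100], [150]):.2%}")  # actual 100, predicted 150 -> 50% error
print(f"{mean_absolute_percentage_error([150], [100]):.2%}")  # actual 150, predicted 100 -> ~33% error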

Choosing Regression Metrics

Metric    When to Use
MAE       Interpretability, robust to outliers
RMSE      Large errors are particularly bad
R²        Comparing to baseline, model explanatory power
MAPE      Need percentage error, no zeros in target

Cross-Validation for Reliable Evaluation

Single train-test splits can give unreliable estimates. Cross-validation provides more robust evaluation.

K-Fold Cross-Validation

from sklearn.model_selection import cross_val_score, cross_validate, KFold

# Create sample data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

model = LogisticRegression(max_iter=1000)

# Simple cross-validation
cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")

# Multiple metrics
cv_results = cross_validate(model, X, y, cv=5,
                           scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'])

print("\nMultiple Metrics:")
for metric in ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']:
    scores = cv_results[f'test_{metric}']
    print(f"  {metric}: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

Stratified K-Fold for Classification

from sklearn.model_selection import StratifiedKFold

# Ensures each fold has similar class distribution
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]

    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    fold_scores.append(score)

    # Check class distribution in each fold
    print(f"Fold {fold+1}: Score={score:.4f}, "
          f"Train positive ratio={y_train.mean():.2f}, "
          f"Val positive ratio={y_val.mean():.2f}")

print(f"\nMean Score: {np.mean(fold_scores):.4f}")

Leave-One-Out Cross-Validation

from sklearn.model_selection import LeaveOneOut, cross_val_score

# Use on small datasets
X_small = X[:100]
y_small = y[:100]

loo = LeaveOneOut()
loo_scores = cross_val_score(model, X_small, y_small, cv=loo)
print(f"LOO CV Accuracy: {loo_scores.mean():.4f}")

Threshold Optimization

For many applications, the default 0.5 threshold isn’t optimal.

Finding the Optimal Threshold

from sklearn.metrics import f1_score, precision_score, recall_score
import pandas as pd  # used below to tabulate the per-threshold results

# Train model and get probabilities
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

# Test different thresholds
thresholds = np.arange(0.1, 0.9, 0.05)
results = []

for threshold in thresholds:
    y_pred_thresh = (y_proba >= threshold).astype(int)
    results.append({
        'threshold': threshold,
        'precision': precision_score(y_test, y_pred_thresh),
        'recall': recall_score(y_test, y_pred_thresh),
        'f1': f1_score(y_test, y_pred_thresh)
    })

results_df = pd.DataFrame(results)

# Plot
plt.figure(figsize=(12, 6))
plt.plot(results_df['threshold'], results_df['precision'], 'b-', label='Precision')
plt.plot(results_df['threshold'], results_df['recall'], 'r-', label='Recall')
plt.plot(results_df['threshold'], results_df['f1'], 'g-', label='F1')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Metrics vs. Classification Threshold')
plt.legend()
plt.grid(True)
plt.show()

# Find optimal threshold for F1
optimal_threshold = results_df.loc[results_df['f1'].idxmax(), 'threshold']
print(f"Optimal threshold for F1: {optimal_threshold:.2f}")

Business-Driven Threshold Selection

def calculate_business_metric(y_true, y_pred, fp_cost, fn_cost):
    """Calculate total cost based on business constraints."""
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    return fp * fp_cost + fn * fn_cost

# Example: Fraud detection where missing fraud costs $1000, false alarm costs $10
fp_cost = 10
fn_cost = 1000

costs = []
for threshold in thresholds:
    y_pred_thresh = (y_proba >= threshold).astype(int)
    cost = calculate_business_metric(y_test, y_pred_thresh, fp_cost, fn_cost)
    costs.append({'threshold': threshold, 'cost': cost})

costs_df = pd.DataFrame(costs)
optimal_business_threshold = costs_df.loc[costs_df['cost'].idxmin(), 'threshold']
print(f"Optimal threshold for minimum cost: {optimal_business_threshold:.2f}")
print(f"Minimum cost: ${costs_df['cost'].min():,.0f}")

Comparing Models

Statistical Significance Testing

from scipy import stats
from sklearn.ensemble import RandomForestClassifier

# Compare two models using paired t-test on CV scores
model1 = LogisticRegression()
model2 = RandomForestClassifier(n_estimators=100)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

scores1 = cross_val_score(model1, X, y, cv=cv, scoring='accuracy')
scores2 = cross_val_score(model2, X, y, cv=cv, scoring='accuracy')

# Paired t-test
t_stat, p_value = stats.ttest_rel(scores1, scores2)

print(f"Model 1 (Logistic): {scores1.mean():.4f} (+/- {scores1.std()*2:.4f})")
print(f"Model 2 (Random Forest): {scores2.mean():.4f} (+/- {scores2.std()*2:.4f})")
print(f"Paired t-test: t={t_stat:.4f}, p={p_value:.4f}")

if p_value < 0.05:
    print("Difference is statistically significant at p<0.05")
else:
    print("No statistically significant difference")

Model Comparison Visualization

import pandas as pd

# Compare multiple models
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'SVM': SVC(probability=True, random_state=42),
    'KNN': KNeighborsClassifier()
}

results = {}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    cv_results = cross_validate(model, X, y, cv=cv,
                               scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'])
    results[name] = {
        'accuracy': cv_results['test_accuracy'].mean(),
        'precision': cv_results['test_precision'].mean(),
        'recall': cv_results['test_recall'].mean(),
        'f1': cv_results['test_f1'].mean(),
        'roc_auc': cv_results['test_roc_auc'].mean()
    }

results_df = pd.DataFrame(results).T
print(results_df.round(4))

# Visualize
fig, ax = plt.subplots(figsize=(12, 6))
results_df.plot(kind='bar', ax=ax)
plt.title('Model Comparison Across Metrics')
plt.xlabel('Model')
plt.ylabel('Score')
plt.xticks(rotation=45)
plt.legend(loc='lower right')
plt.tight_layout()
plt.show()

Metric Selection Guide

Classification Metrics Decision Tree

Is your problem binary or multi-class?
├── Binary
│   ├── Are classes balanced?
│   │   ├── Yes → Accuracy, F1
│   │   └── No → F1, Precision-Recall AUC, ROC-AUC
│   ├── What costs more?
│   │   ├── False Positives → Precision, Specificity
│   │   ├── False Negatives → Recall, Sensitivity
│   │   └── Both matter → F1, ROC-AUC
│   └── Need probability ranking? → ROC-AUC, Log Loss

└── Multi-class
    ├── Macro Average → All classes equally important
    ├── Weighted Average → Account for class imbalance
    └── Micro Average → Overall accuracy-like

Regression Metrics Decision Tree

What matters most?
├── Interpretability → MAE (same units as target)
├── Penalize large errors → RMSE
├── Explain variance → R²
├── Percentage errors → MAPE
└── Robust to outliers → MAE, Median Absolute Error

Conclusion

Choosing the right evaluation metric is as important as choosing the right algorithm. Key takeaways:

  1. Never rely on accuracy alone for imbalanced datasets
  2. Understand the confusion matrix - it’s the foundation
  3. Match metrics to business objectives - what errors cost more?
  4. Use appropriate metrics - PR-AUC for imbalanced, ROC-AUC for balanced
  5. Cross-validate for reliable estimates
  6. Consider multiple metrics for a complete picture
  7. Test statistical significance when comparing models

Remember: The best metric is the one that aligns with your real-world objective.

Further Reading

  • Scikit-learn Metrics Documentation
  • “Pattern Recognition and Machine Learning” by Christopher Bishop
  • “The Elements of Statistical Learning” (free online)
  • Google’s Machine Learning Crash Course (Metrics section)
  • Kaggle competition winning solutions for metric optimization