Understanding ML Model Evaluation Metrics

A comprehensive guide to model evaluation metrics. Learn when to use accuracy, precision, recall, F1-score, ROC-AUC, and more to properly evaluate your machine learning models.

By Dery Febriantara

Choosing the right evaluation metric is crucial for building effective machine learning models. A model that looks great on one metric might perform poorly on another. In this comprehensive guide, we’ll explore all the essential metrics, understand when to use each, and learn how to interpret results correctly.

Why Metrics Matter

Model evaluation metrics tell us how well our model performs on unseen data. Different metrics capture different aspects of model performance, and choosing the wrong metric can lead to:

  • Deploying models that fail in production
  • Optimizing for the wrong objective
  • Missing critical failure modes
  • False confidence in model quality

The Problem with Accuracy

Accuracy is the most intuitive metric, but it can be dangerously misleading.

The Accuracy Paradox

import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# Imagine a fraud detection system
# 99% of transactions are legitimate, 1% are fraud
n_samples = 10000
np.random.seed(42)

# Actual labels: 99% legitimate (0), 1% fraud (1)
y_true = np.array([0] * 9900 + [1] * 100)

# A "smart" model that always predicts legitimate
y_pred_always_legit = np.zeros(n_samples)

# Our actual fraud detection model
# Catches 80% of fraud, 5% false positive rate
y_pred_model = np.zeros(n_samples)
y_pred_model[np.where(y_true == 1)[0][:80]] = 1  # Catch 80 out of 100 frauds
y_pred_model[np.random.choice(9900, 495, replace=False)] = 1  # 5% false positives

print("Always Predict Legitimate:")
print(f"  Accuracy: {accuracy_score(y_true, y_pred_always_legit):.2%}")
print(f"  Frauds caught: 0/100")

print("\nOur Fraud Detection Model:")
print(f"  Accuracy: {accuracy_score(y_true, y_pred_model):.2%}")
print(f"  Frauds caught: 80/100")

The “always legitimate” model achieves 99% accuracy but catches zero fraudsters. This demonstrates why accuracy alone is insufficient for imbalanced datasets.

The Confusion Matrix

The confusion matrix is the foundation for understanding classification performance.

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Generate sample predictions
np.random.seed(42)
y_true = np.random.randint(0, 2, 1000)
y_pred = np.random.randint(0, 2, 1000)
y_pred[y_true == 1] = np.where(np.random.random(sum(y_true == 1)) > 0.2, 1, 0)
y_pred[y_true == 0] = np.where(np.random.random(sum(y_true == 0)) > 0.15, 0, 1)

# Create confusion matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)

# Visualize
fig, ax = plt.subplots(figsize=(8, 6))
disp = ConfusionMatrixDisplay(cm, display_labels=['Negative', 'Positive'])
disp.plot(ax=ax, cmap='Blues')
plt.title('Confusion Matrix')
plt.show()

Understanding the Four Quadrants

                     Predicted Negative      Predicted Positive
Actual Negative      True Negative (TN)      False Positive (FP)
Actual Positive      False Negative (FN)     True Positive (TP)

  • True Negatives (TN): Correctly predicted negative cases
  • True Positives (TP): Correctly predicted positive cases
  • False Positives (FP): Incorrectly predicted as positive (Type I error)
  • False Negatives (FN): Incorrectly predicted as negative (Type II error)

Extracting Values from Confusion Matrix

def extract_confusion_values(y_true, y_pred):
    """Extract TN, FP, FN, TP from confusion matrix."""
    cm = confusion_matrix(y_true, y_pred)

    if cm.shape == (2, 2):
        tn, fp, fn, tp = cm.ravel()
        return {
            'TN': tn, 'FP': fp, 'FN': fn, 'TP': tp,
            'Total': tn + fp + fn + tp,
            'Actual Positive': fn + tp,
            'Actual Negative': tn + fp,
            'Predicted Positive': fp + tp,
            'Predicted Negative': tn + fn
        }

    return cm

values = extract_confusion_values(y_true, y_pred)
for key, value in values.items():
    print(f"{key}: {value}")

Classification Metrics Deep Dive

Accuracy

Definition: Proportion of correct predictions

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Manual calculation
values = extract_confusion_values(y_true, y_pred)
manual_accuracy = (values['TP'] + values['TN']) / values['Total']
print(f"Manual Accuracy: {manual_accuracy:.4f}")

Use when:

  • Classes are balanced
  • Both types of errors are equally costly
  • Quick baseline evaluation

Don’t use when:

  • Classes are imbalanced
  • Different errors have different costs
  • You need to understand error types

Precision

Definition: Of all predicted positives, how many are actually positive?

$$\text{Precision} = \frac{TP}{TP + FP}$$

from sklearn.metrics import precision_score

precision = precision_score(y_true, y_pred)
print(f"Precision: {precision:.4f}")

# Manual calculation
manual_precision = values['TP'] / (values['TP'] + values['FP'])
print(f"Manual Precision: {manual_precision:.4f}")

High precision means: When the model predicts positive, it’s usually right.

Use when:

  • False positives are costly
  • Example: Spam detection (don’t want to lose important emails)
  • Example: Recommendation systems (bad recommendations hurt trust)

Recall (Sensitivity, True Positive Rate)

Definition: Of all actual positives, how many did we catch?

$$\text{Recall} = \frac{TP}{TP + FN}$$

from sklearn.metrics import recall_score

recall = recall_score(y_true, y_pred)
print(f"Recall: {recall:.4f}")

# Manual calculation
manual_recall = values['TP'] / (values['TP'] + values['FN'])
print(f"Manual Recall: {manual_recall:.4f}")

High recall means: We catch most of the actual positives.

Use when:

  • False negatives are costly
  • Example: Disease detection (don’t want to miss sick patients)
  • Example: Fraud detection (don’t want to miss fraudsters)

The Precision-Recall Trade-off

Precision and recall typically pull in opposite directions: as you adjust the decision threshold, improving one usually hurts the other.

from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

# Get probability predictions
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Generate sample data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2,
                          n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

# Calculate precision-recall curve
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(thresholds, precisions[:-1], label='Precision')
plt.plot(thresholds, recalls[:-1], label='Recall')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Precision-Recall Trade-off')
plt.legend()
plt.grid(True)
plt.show()

# Plot precision vs recall
plt.figure(figsize=(10, 6))
plt.plot(recalls, precisions, marker='.')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.grid(True)
plt.show()

F1 Score

Definition: Harmonic mean of precision and recall

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}$$

from sklearn.metrics import f1_score

f1 = f1_score(y_true, y_pred)
print(f"F1 Score: {f1:.4f}")

# Manual calculation
manual_f1 = 2 * (precision * recall) / (precision + recall)
print(f"Manual F1: {manual_f1:.4f}")

Why harmonic mean? The harmonic mean penalizes extreme differences. If precision = 1.0 and recall = 0.1:

  • Arithmetic mean: 0.55
  • Harmonic mean: 0.18

The harmonic mean better reflects poor performance in either metric.
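
As a quick sanity check, here is a minimal computation of both means for the values quoted above (plain arithmetic, no model involved):

# Arithmetic vs. harmonic mean for precision = 1.0, recall = 0.1
p_ex, r_ex = 1.0, 0.1

arithmetic_mean = (p_ex + r_ex) / 2
harmonic_mean = 2 * p_ex * r_ex / (p_ex + r_ex)

print(f"Arithmetic mean: {arithmetic_mean:.2f}")  # 0.55
print(f"Harmonic mean:   {harmonic_mean:.2f}")    # ~0.18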

Use when:

  • You need a single metric balancing precision and recall
  • Both false positives and false negatives matter
  • Classes are imbalanced

F-Beta Score

Generalization of F1 that allows weighting precision vs recall.

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{(\beta^2 \cdot \text{Precision}) + \text{Recall}}$$

from sklearn.metrics import fbeta_score

# F0.5 - prioritizes precision (2x weight)
f05 = fbeta_score(y_true, y_pred, beta=0.5)
print(f"F0.5 Score (prioritize precision): {f05:.4f}")

# F1 - equal weight
f1 = fbeta_score(y_true, y_pred, beta=1.0)
print(f"F1 Score (balanced): {f1:.4f}")

# F2 - prioritizes recall (2x weight)
f2 = fbeta_score(y_true, y_pred, beta=2.0)
print(f"F2 Score (prioritize recall): {f2:.4f}")

Choosing beta:

  • β < 1: Precision is more important
  • β = 1: Equal importance (F1)
  • β > 1: Recall is more important

Specificity (True Negative Rate)

Definition: Of all actual negatives, how many did we correctly identify?

$$\text{Specificity} = \frac{TN}{TN + FP}$$

def specificity_score(y_true, y_pred):
    """Calculate specificity (true negative rate)."""
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    return tn / (tn + fp)

specificity = specificity_score(y_true, y_pred)
print(f"Specificity: {specificity:.4f}")

Use when:

  • Correctly identifying negatives is important
  • Example: Criminal justice (don’t convict innocent people)

Complete Classification Report

from sklearn.metrics import classification_report

# Generate multi-class example
y_true_multi = np.random.randint(0, 3, 1000)
y_pred_multi = y_true_multi.copy()
# Add some errors
error_idx = np.random.choice(1000, 200, replace=False)
y_pred_multi[error_idx] = np.random.randint(0, 3, 200)

print(classification_report(y_true_multi, y_pred_multi,
                           target_names=['Class 0', 'Class 1', 'Class 2']))

Understanding the report:

  • support: Number of actual occurrences of each class
  • macro avg: Average of metrics (treats all classes equally)
  • weighted avg: Weighted by support (accounts for class imbalance); the sketch below recomputes these averages directly
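
The averaged rows can be reproduced with the per-class scorers. A minimal sketch, assuming y_true_multi and y_pred_multi from the block above:

from sklearn.metrics import precision_score

# Recompute the report's precision averages (recall and F1 work the same way)
macro_p = precision_score(y_true_multi, y_pred_multi, average='macro')        # plain mean over classes
weighted_p = precision_score(y_true_multi, y_pred_multi, average='weighted')  # mean weighted by support
micro_p = precision_score(y_true_multi, y_pred_multi, average='micro')        # global TP / (TP + FP)

print(f"Macro precision:    {macro_p:.4f}")
print(f"Weighted precision: {weighted_p:.4f}")
print(f"Micro precision:    {micro_p:.4f}")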

ROC-AUC: Area Under the ROC Curve

The ROC curve plots True Positive Rate vs False Positive Rate at various thresholds.

Creating ROC Curves

from sklearn.metrics import roc_curve, roc_auc_score, auc
import matplotlib.pyplot as plt

# Get predictions from our model
y_proba = model.predict_proba(X_test)[:, 1]

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

# Plot
plt.figure(figsize=(10, 8))
plt.plot(fpr, tpr, color='darkorange', lw=2,
         label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--',
         label='Random classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

print(f"ROC-AUC Score: {roc_auc:.4f}")

Interpreting ROC-AUC

AUC Score     Interpretation
0.50          Random guessing
0.50-0.60     Fail
0.60-0.70     Poor
0.70-0.80     Fair
0.80-0.90     Good
0.90-1.00     Excellent
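
A useful way to read these numbers: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counted as half). A minimal check of that equivalence, assuming y_test and y_proba from the binary example above:

# AUC as a ranking probability: P(positive score > negative score), ties counted as 0.5
pos_scores = y_proba[y_test == 1]
neg_scores = y_proba[y_test == 0]

diffs = pos_scores[:, None] - neg_scores[None, :]              # all positive/negative score pairs
pairwise_auc = (diffs > 0).mean() + 0.5 * (diffs == 0).mean()  # fraction of correctly ranked pairs

print(f"Pairwise ranking probability: {pairwise_auc:.4f}")
print(f"roc_auc_score:                {roc_auc_score(y_test, y_proba):.4f}")  # should agree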

Multi-Class ROC-AUC

from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier

# Create multi-class data
X_multi, y_multi = make_classification(n_samples=1000, n_features=20, n_classes=3,
                                       n_informative=10, n_clusters_per_class=1,
                                       random_state=42)

# Binarize labels
y_multi_bin = label_binarize(y_multi, classes=[0, 1, 2])
n_classes = y_multi_bin.shape[1]

X_train, X_test, y_train, y_test = train_test_split(X_multi, y_multi_bin,
                                                     test_size=0.3, random_state=42)

# Train OvR classifier
classifier = OneVsRestClassifier(LogisticRegression())
classifier.fit(X_train, y_train)
y_proba = classifier.predict_proba(X_test)

# Calculate ROC curve for each class
plt.figure(figsize=(10, 8))
colors = ['blue', 'red', 'green']

for i, color in enumerate(colors):
    fpr, tpr, _ = roc_curve(y_test[:, i], y_proba[:, i])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, color=color, lw=2,
             label=f'ROC curve class {i} (AUC = {roc_auc:.2f})')

plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Multi-class ROC Curve')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

# Calculate macro and micro average AUC
from sklearn.metrics import roc_auc_score

# For multi-label/binary indicator format
macro_auc = roc_auc_score(y_test, y_proba, multi_class='ovr', average='macro')
print(f"Macro Average AUC: {macro_auc:.4f}")

Precision-Recall Curves and Average Precision

ROC-AUC can be misleading for imbalanced datasets. Precision-Recall curves are often more informative.

Creating PR Curves

from sklearn.metrics import precision_recall_curve, average_precision_score

# Create imbalanced dataset
X_imb, y_imb = make_classification(n_samples=1000, n_features=20, n_classes=2,
                                   n_informative=10, weights=[0.95, 0.05],
                                   random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X_imb, y_imb,
                                                     test_size=0.3, random_state=42)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

# Calculate PR curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
avg_precision = average_precision_score(y_test, y_proba)

# Plot
plt.figure(figsize=(10, 8))
plt.plot(recall, precision, color='darkorange', lw=2,
         label=f'PR curve (AP = {avg_precision:.2f})')

# Baseline: ratio of positive class
baseline = y_test.sum() / len(y_test)
plt.axhline(y=baseline, color='navy', linestyle='--',
            label=f'Baseline (ratio = {baseline:.2f})')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc="upper right")
plt.grid(True)
plt.show()

print(f"Average Precision: {avg_precision:.4f}")

When to Use PR vs ROC

Scenario                                Recommended Curve
Balanced classes                        ROC
Imbalanced classes                      PR
Cost of FP matters                      PR
Both error types matter equally         ROC
Comparing models on imbalanced data     PR
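
One way to see the difference is to compute both summary scores on the same imbalanced predictions. A small sketch, assuming y_test and y_proba from the imbalanced example above:

from sklearn.metrics import roc_auc_score, average_precision_score

# On heavily imbalanced data the two summaries can tell different stories:
# ROC-AUC is computed against the large negative class, while average
# precision is anchored to the positive-class baseline.
print(f"ROC-AUC:                 {roc_auc_score(y_test, y_proba):.4f}")
print(f"Average precision:       {average_precision_score(y_test, y_proba):.4f}")
print(f"Positive-class baseline: {y_test.mean():.4f}")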

Regression Metrics

For continuous target variables, we use different metrics.

Mean Absolute Error (MAE)

Definition: Average absolute difference between predictions and actual values

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

from sklearn.metrics import mean_absolute_error
import numpy as np

# Generate sample regression data
np.random.seed(42)
y_true_reg = np.random.randn(100) * 10 + 50
y_pred_reg = y_true_reg + np.random.randn(100) * 5

mae = mean_absolute_error(y_true_reg, y_pred_reg)
print(f"MAE: {mae:.4f}")

# Manual calculation
manual_mae = np.mean(np.abs(y_true_reg - y_pred_reg))
print(f"Manual MAE: {manual_mae:.4f}")

Characteristics:

  • Same units as target variable
  • Easy to interpret (average error)
  • Robust to outliers
  • All errors weighted equally

Mean Squared Error (MSE) and RMSE

Definition: Average squared difference between predictions and actual values

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

$$RMSE = \sqrt{MSE}$$

from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_true_reg, y_pred_reg)
rmse = np.sqrt(mse)

print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")

# Manual calculation
manual_mse = np.mean((y_true_reg - y_pred_reg)**2)
manual_rmse = np.sqrt(manual_mse)
print(f"Manual MSE: {manual_mse:.4f}")
print(f"Manual RMSE: {manual_rmse:.4f}")

Characteristics:

  • Penalizes large errors more heavily
  • RMSE in same units as target
  • Sensitive to outliers
  • Commonly used in optimization

MAE vs RMSE

# Demonstrate sensitivity to outliers
y_true_outlier = np.array([10, 20, 30, 40, 50])
y_pred_outlier = np.array([11, 21, 31, 41, 100])  # One large error

mae_outlier = mean_absolute_error(y_true_outlier, y_pred_outlier)
rmse_outlier = np.sqrt(mean_squared_error(y_true_outlier, y_pred_outlier))

print(f"With outlier:")
print(f"  MAE: {mae_outlier:.2f}")
print(f"  RMSE: {rmse_outlier:.2f}")

# Without outlier
y_true_clean = np.array([10, 20, 30, 40])
y_pred_clean = np.array([11, 21, 31, 41])

mae_clean = mean_absolute_error(y_true_clean, y_pred_clean)
rmse_clean = np.sqrt(mean_squared_error(y_true_clean, y_pred_clean))

print(f"\nWithout outlier:")
print(f"  MAE: {mae_clean:.2f}")
print(f"  RMSE: {rmse_clean:.2f}")

R² Score (Coefficient of Determination)

Definition: Proportion of variance explained by the model

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

from sklearn.metrics import r2_score

r2 = r2_score(y_true_reg, y_pred_reg)
print(f"R² Score: {r2:.4f}")

# Manual calculation
ss_res = np.sum((y_true_reg - y_pred_reg)**2)  # Residual sum of squares
ss_tot = np.sum((y_true_reg - np.mean(y_true_reg))**2)  # Total sum of squares
manual_r2 = 1 - (ss_res / ss_tot)
print(f"Manual R²: {manual_r2:.4f}")

Interpretation:

  • R² = 1.0: Perfect predictions
  • R² = 0.0: Model predicts the mean
  • R² < 0.0: Model is worse than predicting the mean (this can happen; see the sketch below)
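
A minimal illustration of the negative case, reusing y_true_reg from above with a deliberately bad constant prediction:

# A constant prediction far from the data does worse than predicting the mean,
# so R² comes out negative
bad_pred = np.full_like(y_true_reg, 100.0)   # y_true_reg is centered around 50
print(f"R² of a bad constant model: {r2_score(y_true_reg, bad_pred):.4f}")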

Mean Absolute Percentage Error (MAPE)

Definition: Average percentage error

$$MAPE = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

from sklearn.metrics import mean_absolute_percentage_error

mape = mean_absolute_percentage_error(y_true_reg, y_pred_reg)
print(f"MAPE: {mape:.4%}")

# Manual calculation (avoiding division by zero)
mask = y_true_reg != 0
manual_mape = np.mean(np.abs((y_true_reg[mask] - y_pred_reg[mask]) / y_true_reg[mask]))
print(f"Manual MAPE: {manual_mape:.4%}")

Characteristics:

  • Scale-independent (percentage)
  • Undefined when y = 0
  • Asymmetric (over- and under-predictions of the same size are not penalized equally; see the sketch below)
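
A small sketch of the asymmetry: the same absolute error of 50 produces a different percentage penalty depending on which value is the actual, so mirrored over- and under-forecasts are not scored equally:

# Same absolute error (50), different MAPE depending on the actual value
print(f"{mean_absolute_percentage_error([100], [150]):.2%}")  # actual 100, predicted 150 -> 50% error
print(f"{mean_absolute_percentage_error([150], [100]):.2%}")  # actual 150, predicted 100 -> ~33% error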

Choosing Regression Metrics

Metric    When to Use
MAE       Interpretability, robust to outliers
RMSE      Large errors are particularly bad
R²        Comparing to baseline, model explanatory power
MAPE      Need percentage error, no zeros in target

Cross-Validation for Reliable Evaluation

Single train-test splits can give unreliable estimates. Cross-validation provides more robust evaluation.

K-Fold Cross-Validation

from sklearn.model_selection import cross_val_score, cross_validate, KFold

# Create sample data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

model = LogisticRegression(max_iter=1000)

# Simple cross-validation
cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")

# Multiple metrics
cv_results = cross_validate(model, X, y, cv=5,
                           scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'])

print("\nMultiple Metrics:")
for metric in ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']:
    scores = cv_results[f'test_{metric}']
    print(f"  {metric}: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

Stratified K-Fold for Classification

from sklearn.model_selection import StratifiedKFold

# Ensures each fold has similar class distribution
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]

    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    fold_scores.append(score)

    # Check class distribution in each fold
    print(f"Fold {fold+1}: Score={score:.4f}, "
          f"Train positive ratio={y_train.mean():.2f}, "
          f"Val positive ratio={y_val.mean():.2f}")

print(f"\nMean Score: {np.mean(fold_scores):.4f}")

Leave-One-Out Cross-Validation

from sklearn.model_selection import LeaveOneOut, cross_val_score

# Use on small datasets
X_small = X[:100]
y_small = y[:100]

loo = LeaveOneOut()
loo_scores = cross_val_score(model, X_small, y_small, cv=loo)
print(f"LOO CV Accuracy: {loo_scores.mean():.4f}")

Threshold Optimization

For many applications, the default 0.5 threshold isn’t optimal.

Finding the Optimal Threshold

from sklearn.metrics import f1_score, precision_score, recall_score
import pandas as pd  # used below to tabulate the per-threshold results

# Train model and get probabilities
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

# Test different thresholds
thresholds = np.arange(0.1, 0.9, 0.05)
results = []

for threshold in thresholds:
    y_pred_thresh = (y_proba >= threshold).astype(int)
    results.append({
        'threshold': threshold,
        'precision': precision_score(y_test, y_pred_thresh),
        'recall': recall_score(y_test, y_pred_thresh),
        'f1': f1_score(y_test, y_pred_thresh)
    })

results_df = pd.DataFrame(results)

# Plot
plt.figure(figsize=(12, 6))
plt.plot(results_df['threshold'], results_df['precision'], 'b-', label='Precision')
plt.plot(results_df['threshold'], results_df['recall'], 'r-', label='Recall')
plt.plot(results_df['threshold'], results_df['f1'], 'g-', label='F1')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Metrics vs. Classification Threshold')
plt.legend()
plt.grid(True)
plt.show()

# Find optimal threshold for F1
optimal_threshold = results_df.loc[results_df['f1'].idxmax(), 'threshold']
print(f"Optimal threshold for F1: {optimal_threshold:.2f}")

Business-Driven Threshold Selection

def calculate_business_metric(y_true, y_pred, fp_cost, fn_cost):
    """Calculate total cost based on business constraints."""
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    return fp * fp_cost + fn * fn_cost

# Example: Fraud detection where missing fraud costs $1000, false alarm costs $10
fp_cost = 10
fn_cost = 1000

costs = []
for threshold in thresholds:
    y_pred_thresh = (y_proba >= threshold).astype(int)
    cost = calculate_business_metric(y_test, y_pred_thresh, fp_cost, fn_cost)
    costs.append({'threshold': threshold, 'cost': cost})

costs_df = pd.DataFrame(costs)
optimal_business_threshold = costs_df.loc[costs_df['cost'].idxmin(), 'threshold']
print(f"Optimal threshold for minimum cost: {optimal_business_threshold:.2f}")
print(f"Minimum cost: ${costs_df['cost'].min():,.0f}")

Comparing Models

Statistical Significance Testing

from scipy import stats
from sklearn.ensemble import RandomForestClassifier

# Compare two models using paired t-test on CV scores
model1 = LogisticRegression()
model2 = RandomForestClassifier(n_estimators=100)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

scores1 = cross_val_score(model1, X, y, cv=cv, scoring='accuracy')
scores2 = cross_val_score(model2, X, y, cv=cv, scoring='accuracy')

# Paired t-test
t_stat, p_value = stats.ttest_rel(scores1, scores2)

print(f"Model 1 (Logistic): {scores1.mean():.4f} (+/- {scores1.std()*2:.4f})")
print(f"Model 2 (Random Forest): {scores2.mean():.4f} (+/- {scores2.std()*2:.4f})")
print(f"Paired t-test: t={t_stat:.4f}, p={p_value:.4f}")

if p_value < 0.05:
    print("Difference is statistically significant at p<0.05")
else:
    print("No statistically significant difference")

Model Comparison Visualization

import pandas as pd

# Compare multiple models
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'SVM': SVC(probability=True, random_state=42),
    'KNN': KNeighborsClassifier()
}

results = {}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    cv_results = cross_validate(model, X, y, cv=cv,
                               scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'])
    results[name] = {
        'accuracy': cv_results['test_accuracy'].mean(),
        'precision': cv_results['test_precision'].mean(),
        'recall': cv_results['test_recall'].mean(),
        'f1': cv_results['test_f1'].mean(),
        'roc_auc': cv_results['test_roc_auc'].mean()
    }

results_df = pd.DataFrame(results).T
print(results_df.round(4))

# Visualize
fig, ax = plt.subplots(figsize=(12, 6))
results_df.plot(kind='bar', ax=ax)
plt.title('Model Comparison Across Metrics')
plt.xlabel('Model')
plt.ylabel('Score')
plt.xticks(rotation=45)
plt.legend(loc='lower right')
plt.tight_layout()
plt.show()

Metric Selection Guide

Classification Metrics Decision Tree

Is your problem binary or multi-class?
├── Binary
│   ├── Are classes balanced?
│   │   ├── Yes → Accuracy, F1
│   │   └── No → F1, Precision-Recall AUC, ROC-AUC
│   ├── What costs more?
│   │   ├── False Positives → Precision, Specificity
│   │   ├── False Negatives → Recall, Sensitivity
│   │   └── Both matter → F1, ROC-AUC
│   └── Need probability ranking? → ROC-AUC, Log Loss

└── Multi-class
    ├── Macro Average → All classes equally important
    ├── Weighted Average → Account for class imbalance
    └── Micro Average → Overall accuracy-like

Regression Metrics Decision Tree

What matters most?
├── Interpretability → MAE (same units as target)
├── Penalize large errors → RMSE
├── Explain variance → R²
├── Percentage errors → MAPE
└── Robust to outliers → MAE, Median Absolute Error

Conclusion

Choosing the right evaluation metric is as important as choosing the right algorithm. Key takeaways:

  1. Never rely on accuracy alone for imbalanced datasets
  2. Understand the confusion matrix - it’s the foundation
  3. Match metrics to business objectives - what errors cost more?
  4. Use appropriate metrics - PR-AUC for imbalanced, ROC-AUC for balanced
  5. Cross-validate for reliable estimates
  6. Consider multiple metrics for a complete picture
  7. Test statistical significance when comparing models

Remember: The best metric is the one that aligns with your real-world objective.

Further Reading

  • Scikit-learn Metrics Documentation
  • “Pattern Recognition and Machine Learning” by Christopher Bishop
  • “The Elements of Statistical Learning” (free online)
  • Google’s Machine Learning Crash Course (Metrics section)
  • Kaggle competition winning solutions for metric optimization