Supervised Learning Algorithms Every Developer Should Know
A comprehensive guide to essential supervised learning algorithms with Python implementations, mathematical intuition, and practical tips for choosing the right algorithm.
Supervised learning is the most common and practical type of machine learning. From predicting house prices to classifying emails as spam, supervised learning algorithms form the backbone of most real-world ML applications. In this guide, we’ll explore the essential algorithms every developer should master.
What is Supervised Learning?
Supervised learning is a type of machine learning where we train models using labeled data—examples where we know the correct answer. The model learns the relationship between inputs (features) and outputs (labels) to make predictions on new, unseen data.
Types of Supervised Learning
- Classification: Predicting discrete categories (spam/not spam, cat/dog)
- Regression: Predicting continuous values (house prices, temperature)
The Supervised Learning Workflow
# Standard supervised learning workflow
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, mean_squared_error
# 1. Load and prepare data
X, y = load_data()
# 2. Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# 3. Preprocess features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 4. Train model (SomeAlgorithm is a placeholder for any scikit-learn estimator)
model = SomeAlgorithm()
model.fit(X_train_scaled, y_train)
# 5. Evaluate
predictions = model.predict(X_test_scaled)
score = accuracy_score(y_test, predictions) # or mean_squared_error for regression
Linear Regression
Linear regression is one of the simplest and most interpretable algorithms for predicting continuous values. It models the relationship between the features and the target as a linear equation.
Mathematical Foundation
The linear regression model:
$$\hat{y} = b + w_1x_1 + w_2x_2 + \dots + w_nx_n = \mathbf{w}^T\mathbf{x} + b$$
Where:
- $\hat{y}$ = predicted value
- $w_i$ = weights (coefficients)
- $x_i$ = features
- $b$ = bias (intercept)
The model finds weights that minimize the Mean Squared Error (MSE):
$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
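To see what this objective looks like in code, here is a small standalone NumPy sketch (the numbers are invented for illustration) that fits the weights with the closed-form least-squares solution and computes the MSE exactly as defined above:
import numpy as np

# Toy data: 5 samples, 2 features (illustrative values only)
X_toy = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y_toy = np.array([8.0, 7.0, 18.0, 17.0, 25.0])

# Prepend a column of ones so the intercept b is learned as an extra weight
X_b = np.c_[np.ones(len(X_toy)), X_toy]

# Least-squares solution of the normal equations (numerically stable form)
w, *_ = np.linalg.lstsq(X_b, y_toy, rcond=None)

# Predictions and MSE computed directly from the definitions above
y_hat = X_b @ w
mse = np.mean((y_toy - y_hat) ** 2)
print(f"weights: {np.round(w, 2)}, MSE: {mse:.4f}")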
Implementation
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
import matplotlib.pyplot as plt
# Generate sample data - house prices
np.random.seed(42)
n_samples = 500
# Features
square_feet = np.random.randint(800, 4000, n_samples)
bedrooms = np.random.randint(1, 6, n_samples)
bathrooms = np.random.randint(1, 4, n_samples)
age = np.random.randint(0, 50, n_samples)
distance_to_city = np.random.uniform(0.5, 30, n_samples)
# Target (price) with some noise
price = (
50 * square_feet +
15000 * bedrooms +
10000 * bathrooms -
500 * age -
2000 * distance_to_city +
np.random.normal(0, 30000, n_samples)
)
# Create DataFrame
data = pd.DataFrame({
'square_feet': square_feet,
'bedrooms': bedrooms,
'bathrooms': bathrooms,
'age': age,
'distance_to_city': distance_to_city,
'price': price
})
# Prepare features and target
X = data.drop('price', axis=1)
y = data['price']
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train linear regression
model = LinearRegression()
model.fit(X_train_scaled, y_train)
# Predictions
y_pred = model.predict(X_test_scaled)
# Evaluate
print("Linear Regression Results:")
print(f"R² Score: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred)):,.2f}")
print(f"MAE: ${mean_absolute_error(y_test, y_pred):,.2f}")
# Feature coefficients
coefficients = pd.DataFrame({
'feature': X.columns,
'coefficient': model.coef_
}).sort_values('coefficient', ascending=False)
print("\nFeature Coefficients:")
print(coefficients)
Regularized Linear Regression
Regularization prevents overfitting by penalizing large weights.
# Ridge Regression (L2 regularization)
# Adds penalty: λ * Σ(w_i)²
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train_scaled, y_train)
ridge_pred = ridge_model.predict(X_test_scaled)
print(f"\nRidge R²: {r2_score(y_test, ridge_pred):.4f}")
# Lasso Regression (L1 regularization)
# Adds penalty: λ * Σ|w_i|
# Can reduce some coefficients to exactly zero (feature selection)
lasso_model = Lasso(alpha=1.0)
lasso_model.fit(X_train_scaled, y_train)
lasso_pred = lasso_model.predict(X_test_scaled)
print(f"Lasso R²: {r2_score(y_test, lasso_pred):.4f}")
print(f"Features used by Lasso: {np.sum(lasso_model.coef_ != 0)}/{len(lasso_model.coef_)}")
# Elastic Net (combines L1 and L2)
elastic_model = ElasticNet(alpha=1.0, l1_ratio=0.5)
elastic_model.fit(X_train_scaled, y_train)
elastic_pred = elastic_model.predict(X_test_scaled)
print(f"Elastic Net R²: {r2_score(y_test, elastic_pred):.4f}")
# Finding optimal regularization strength
from sklearn.model_selection import GridSearchCV
param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
ridge_cv = GridSearchCV(Ridge(), param_grid, cv=5, scoring='r2')
ridge_cv.fit(X_train_scaled, y_train)
print(f"\nBest Ridge alpha: {ridge_cv.best_params_['alpha']}")
print(f"Best Ridge R²: {ridge_cv.best_score_:.4f}")
Polynomial Regression
For non-linear relationships:
# Create polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)
print(f"Original features: {X_train_scaled.shape[1]}")
print(f"Polynomial features: {X_train_poly.shape[1]}")
# Train with regularization to prevent overfitting
poly_model = Ridge(alpha=10)
poly_model.fit(X_train_poly, y_train)
poly_pred = poly_model.predict(X_test_poly)
print(f"\nPolynomial Regression R²: {r2_score(y_test, poly_pred):.4f}")
When to Use Linear Regression
Good for:
- Continuous target variables
- Linear relationships between features and target
- When interpretability is important
- Baseline model for regression problems
- Small to medium datasets
Limitations:
- Assumes linear relationship
- Sensitive to outliers
- Features should be independent (no multicollinearity)
Logistic Regression
Despite its name, logistic regression is a classification algorithm. It predicts the probability of an instance belonging to a class.
Mathematical Foundation
Logistic regression uses the sigmoid function to convert linear output to probability:
$$P(y=1|x) = \sigma(w^Tx + b) = \frac{1}{1 + e^{-(w^Tx + b)}}$$
The decision boundary: predict class 1 if $P(y=1|x) > 0.5$
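As a quick standalone illustration of that mapping (the weights and feature values below are made up, not taken from the loan example that follows):
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and a single feature vector
w = np.array([0.8, -1.2])
b = 0.5
x_new = np.array([1.5, 0.4])

p = sigmoid(w @ x_new + b)   # P(y = 1 | x)
prediction = int(p > 0.5)    # default 0.5 threshold
print(f"P(y=1|x) = {p:.3f}, predicted class = {prediction}")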
Implementation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
classification_report, confusion_matrix, roc_auc_score,
roc_curve, precision_recall_curve
)
# Generate sample data - binary classification
np.random.seed(42)
n_samples = 1000
# Features for loan approval prediction
income = np.random.normal(60000, 20000, n_samples)
credit_score = np.random.normal(700, 50, n_samples)
debt_ratio = np.random.uniform(0.1, 0.6, n_samples)
employment_years = np.random.randint(0, 30, n_samples)
# Generate target with some logic
probability = 1 / (1 + np.exp(-(
0.00003 * income +
0.01 * credit_score -
5 * debt_ratio +
0.1 * employment_years -
10
)))
approved = (np.random.random(n_samples) < probability).astype(int)
# Create DataFrame (named loan_data so the house-price DataFrame above stays available)
loan_data = pd.DataFrame({
'income': income,
'credit_score': credit_score,
'debt_ratio': debt_ratio,
'employment_years': employment_years,
'approved': approved
})
# Prepare features and target
X = loan_data.drop('approved', axis=1)
y = loan_data['approved']
# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train logistic regression
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
# Predictions
y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)[:, 1]
# Evaluate
print("Logistic Regression Results:")
print(classification_report(y_test, y_pred))
print(f"ROC-AUC Score: {roc_auc_score(y_test, y_proba):.4f}")
# Feature importance (odds ratios)
odds_ratios = pd.DataFrame({
'feature': X.columns,
'coefficient': model.coef_[0],
'odds_ratio': np.exp(model.coef_[0])
}).sort_values('odds_ratio', ascending=False)
print("\nOdds Ratios:")
print(odds_ratios)
# Plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {roc_auc_score(y_test, y_proba):.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(True)
plt.show()
Multi-Class Logistic Regression
from sklearn.datasets import load_iris
# Load iris dataset (3 classes)
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# One-vs-Rest (OvR) approach
# (the multi_class parameter is deprecated in recent scikit-learn releases, which default to multinomial)
model_ovr = LogisticRegression(multi_class='ovr', max_iter=1000)
model_ovr.fit(X_train, y_train)
print(f"One-vs-Rest Accuracy: {model_ovr.score(X_test, y_test):.4f}")
# Multinomial (Softmax) approach
model_multi = LogisticRegression(multi_class='multinomial', max_iter=1000)
model_multi.fit(X_train, y_train)
print(f"Multinomial Accuracy: {model_multi.score(X_test, y_test):.4f}")
Threshold Tuning
# Find optimal threshold for specific needs
def find_optimal_threshold(y_true, y_proba, metric='f1'):
"""Find threshold that optimizes the specified metric."""
thresholds = np.arange(0.1, 0.9, 0.01)
scores = []
for threshold in thresholds:
y_pred = (y_proba >= threshold).astype(int)
if metric == 'f1':
from sklearn.metrics import f1_score
scores.append(f1_score(y_true, y_pred))
elif metric == 'precision':
from sklearn.metrics import precision_score
scores.append(precision_score(y_true, y_pred))
elif metric == 'recall':
from sklearn.metrics import recall_score
scores.append(recall_score(y_true, y_pred))
optimal_idx = np.argmax(scores)
return thresholds[optimal_idx], scores[optimal_idx]
optimal_threshold, optimal_score = find_optimal_threshold(y_test, y_proba, 'f1')
print(f"\nOptimal threshold for F1: {optimal_threshold:.2f} (F1 = {optimal_score:.4f})")
Decision Trees
Decision trees create a tree structure where each internal node represents a decision based on a feature, and each leaf node represents a prediction.
How Decision Trees Work
- Select the best feature and threshold to split on (using Gini impurity or entropy; see the sketch below)
- Create child nodes for each branch of the split
- Recursively repeat until a stopping criterion is met (e.g. maximum depth or minimum samples per leaf)
- Assign predictions to leaf nodes
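To make the first step concrete, here is a minimal sketch of how Gini impurity scores a candidate split; the labels are toy values, independent of the loan data used below:
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_k^2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gini(y_left, y_right):
    """Size-weighted Gini impurity of a candidate split."""
    n = len(y_left) + len(y_right)
    return (len(y_left) / n) * gini(y_left) + (len(y_right) / n) * gini(y_right)

# Hypothetical labels falling on each side of a candidate threshold
left = np.array([0, 0, 0, 1])
right = np.array([1, 1, 1, 0, 1])
print(f"Parent Gini: {gini(np.concatenate([left, right])):.3f}")
print(f"Split Gini:  {split_gini(left, right):.3f}")  # lower is better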
Implementation
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.tree import plot_tree, export_text
# Reuse the loan approval data from the logistic regression section
X = loan_data.drop('approved', axis=1)
y = loan_data['approved']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Train decision tree
dt_model = DecisionTreeClassifier(
max_depth=5,
min_samples_split=10,
min_samples_leaf=5,
random_state=42
)
dt_model.fit(X_train, y_train)
# Evaluate
y_pred = dt_model.predict(X_test)
print("Decision Tree Results:")
print(classification_report(y_test, y_pred))
# Visualize the tree
plt.figure(figsize=(20, 10))
plot_tree(
dt_model,
feature_names=X.columns.tolist(),
class_names=['Rejected', 'Approved'],
filled=True,
rounded=True,
fontsize=10
)
plt.title('Decision Tree for Loan Approval')
plt.tight_layout()
plt.show()
# Text representation
print("\nDecision Tree Rules:")
print(export_text(dt_model, feature_names=X.columns.tolist()))
# Feature importance
importance_df = pd.DataFrame({
'feature': X.columns,
'importance': dt_model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nFeature Importance:")
print(importance_df)
Controlling Overfitting
# Compare different tree depths
depths = [2, 4, 6, 8, 10, 15, 20, None]
train_scores = []
test_scores = []
for depth in depths:
model = DecisionTreeClassifier(max_depth=depth, random_state=42)
model.fit(X_train, y_train)
train_scores.append(model.score(X_train, y_train))
test_scores.append(model.score(X_test, y_test))
# Plot training vs test accuracy by depth
plt.figure(figsize=(10, 6))
plt.plot(range(len(depths)), train_scores, 'o-', label='Training Score')
plt.plot(range(len(depths)), test_scores, 'o-', label='Test Score')
plt.xticks(range(len(depths)), [str(d) for d in depths])
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.title('Decision Tree: Training vs Test Score by Depth')
plt.legend()
plt.grid(True)
plt.show()
Decision Tree for Regression
from sklearn.tree import DecisionTreeRegressor
# Use house price data
X = data.drop('price', axis=1)
y = data['price']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train regression tree
dt_regressor = DecisionTreeRegressor(
max_depth=5,
min_samples_split=20,
random_state=42
)
dt_regressor.fit(X_train, y_train)
# Evaluate
y_pred = dt_regressor.predict(X_test)
print(f"Decision Tree Regression R²: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred)):,.2f}")
Random Forest
Random Forest is an ensemble method that combines multiple decision trees to improve accuracy and reduce overfitting.
How Random Forest Works
- Create multiple decision trees using bootstrapped samples
- Each tree considers a random subset of features at each split
- Aggregate predictions (majority vote for classification, average for regression); see the sketch below
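These steps can be imitated by hand with scikit-learn building blocks. The sketch below is only an illustration of bagging plus majority voting on synthetic data (via make_classification), not how RandomForestClassifier is implemented internally:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

trees = []
for _ in range(25):
    # 1. Bootstrap sample: draw rows with replacement
    idx = rng.integers(0, len(X_demo), len(X_demo))
    # 2. Each tree also restricts the features considered at every split
    tree = DecisionTreeClassifier(max_features='sqrt')
    trees.append(tree.fit(X_demo[idx], y_demo[idx]))

# 3. Aggregate: majority vote across the individual trees
votes = np.stack([t.predict(X_demo) for t in trees])
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
print(f"Ensemble accuracy on the demo data: {(ensemble_pred == y_demo).mean():.3f}")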
Implementation
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
# Classification - reuse the loan approval data (the regression example above re-split X and y)
X = loan_data.drop('approved', axis=1)
y = loan_data['approved']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
rf_classifier = RandomForestClassifier(
n_estimators=100,
max_depth=10,
min_samples_split=5,
min_samples_leaf=2,
max_features='sqrt',
random_state=42,
n_jobs=-1
)
rf_classifier.fit(X_train, y_train)
# Evaluate
y_pred = rf_classifier.predict(X_test)
y_proba = rf_classifier.predict_proba(X_test)[:, 1]
print("Random Forest Classification Results:")
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")
# Feature importance
importance_df = pd.DataFrame({
'feature': X.columns,
'importance': rf_classifier.feature_importances_
}).sort_values('importance', ascending=False)
plt.figure(figsize=(10, 6))
plt.barh(importance_df['feature'], importance_df['importance'])
plt.xlabel('Importance')
plt.title('Random Forest Feature Importance')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
Out-of-Bag Error
# Use OOB score for validation without separate test set
rf_oob = RandomForestClassifier(
n_estimators=100,
oob_score=True,
random_state=42,
n_jobs=-1
)
rf_oob.fit(X_train, y_train)
print(f"OOB Score: {rf_oob.oob_score_:.4f}")
print(f"Test Score: {rf_oob.score(X_test, y_test):.4f}")
Hyperparameter Tuning
from sklearn.model_selection import RandomizedSearchCV
# Define parameter distributions
param_distributions = {
'n_estimators': [50, 100, 200, 300],
'max_depth': [5, 10, 15, 20, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'max_features': ['sqrt', 'log2', None],
'bootstrap': [True, False]
}
# Randomized search
rf_random = RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_distributions,
n_iter=50,
cv=5,
scoring='roc_auc',
random_state=42,
n_jobs=-1
)
rf_random.fit(X_train, y_train)
print(f"Best parameters: {rf_random.best_params_}")
print(f"Best CV Score: {rf_random.best_score_:.4f}")
print(f"Test Score: {rf_random.score(X_test, y_test):.4f}")
Support Vector Machines (SVM)
SVM finds the optimal hyperplane that separates classes with the maximum margin.
Mathematical Foundation
SVM optimizes: $$\min_{w,b} \frac{1}{2}||w||^2 + C\sum_{i=1}^{n}\xi_i$$
Subject to: $y_i(w^Tx_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$
Where:
- $w$ = weight vector (defines hyperplane)
- $b$ = bias
- $C$ = regularization parameter
- $\xi_i$ = slack variables for the soft margin (illustrated in the sketch below)
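A small sketch of what the slack variables mean in practice: for a (hypothetical) weight vector and bias, $\xi_i = \max(0, 1 - y_i(w^Tx_i + b))$, so only points inside the margin or on the wrong side contribute to the penalty term:
import numpy as np

# Hypothetical hyperplane and a few labelled points (labels in {-1, +1})
w = np.array([1.0, -1.0])
b = -0.5
X_pts = np.array([[3.0, 1.0], [1.0, 1.2], [0.5, 2.0]])
y_pts = np.array([1, 1, -1])

margins = y_pts * (X_pts @ w + b)         # y_i (w^T x_i + b)
slacks = np.maximum(0.0, 1.0 - margins)   # xi_i = 0 for points outside the margin
for m, s in zip(margins, slacks):
    print(f"margin = {m:+.2f}, slack xi = {s:.2f}")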
Implementation
from sklearn.svm import SVC, SVR
from sklearn.preprocessing import StandardScaler
# Scale features (very important for SVM!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Linear SVM
svm_linear = SVC(kernel='linear', C=1.0, probability=True, random_state=42)
svm_linear.fit(X_train_scaled, y_train)
# RBF (Radial Basis Function) kernel - most common
svm_rbf = SVC(kernel='rbf', C=1.0, gamma='scale', probability=True, random_state=42)
svm_rbf.fit(X_train_scaled, y_train)
# Polynomial kernel
svm_poly = SVC(kernel='poly', degree=3, C=1.0, probability=True, random_state=42)
svm_poly.fit(X_train_scaled, y_train)
# Compare kernels
kernels = {
'Linear': svm_linear,
'RBF': svm_rbf,
'Polynomial': svm_poly
}
print("SVM Results by Kernel:")
for name, model in kernels.items():
y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)[:, 1]
print(f"{name}: Accuracy = {model.score(X_test_scaled, y_test):.4f}, "
f"ROC-AUC = {roc_auc_score(y_test, y_proba):.4f}")
Tuning SVM Hyperparameters
from sklearn.model_selection import GridSearchCV
# Grid search for RBF kernel
param_grid = {
'C': [0.1, 1, 10, 100],
'gamma': ['scale', 'auto', 0.001, 0.01, 0.1, 1]
}
svm_cv = GridSearchCV(
SVC(kernel='rbf', probability=True),
param_grid,
cv=5,
scoring='roc_auc',
n_jobs=-1
)
svm_cv.fit(X_train_scaled, y_train)
print(f"Best parameters: {svm_cv.best_params_}")
print(f"Best CV Score: {svm_cv.best_score_:.4f}")
print(f"Test Score: {svm_cv.score(X_test_scaled, y_test):.4f}")
SVM for Regression
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline

# SVR is for regression, so switch back to the house price data (feature scaling happens in the pipeline;
# with large-scale targets like prices, scaling y as well often helps SVR)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    data.drop('price', axis=1), data['price'], test_size=0.2, random_state=42
)
svr = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=100, gamma='scale', epsilon=0.1))
svr.fit(Xr_train, yr_train)
y_pred = svr.predict(Xr_test)
print(f"SVR R²: {r2_score(yr_test, y_pred):.4f}")
K-Nearest Neighbors (KNN)
KNN is a simple, instance-based learning algorithm that makes predictions based on the K closest training examples.
How KNN Works
- Calculate distance from query point to all training points
- Select the K nearest neighbors
- For classification: majority vote
- For regression: average of the neighbors' values (a brute-force sketch of these steps follows)
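Those steps are short enough to write directly in NumPy. The sketch below uses invented toy data and is only for intuition; scikit-learn's KNeighborsClassifier (used next) also offers faster tree-based neighbor search:
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Brute-force KNN classification for a single query point."""
    # 1. Euclidean distance to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # 2. Indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # 3. Majority vote among their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Invented toy data: two well-separated clusters
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([0.8, 0.9])))  # -> 0
print(knn_predict(X_toy, y_toy, np.array([5.2, 5.1])))  # -> 1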
Implementation
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
# Scale features (important for distance-based algorithms)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# KNN Classifier
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform', metric='euclidean')
knn.fit(X_train_scaled, y_train)
# Evaluate
y_pred = knn.predict(X_test_scaled)
print(f"KNN Accuracy: {knn.score(X_test_scaled, y_test):.4f}")
# Find optimal K
k_values = range(1, 31)
train_scores = []
test_scores = []
for k in k_values:
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train_scaled, y_train)
train_scores.append(knn.score(X_train_scaled, y_train))
test_scores.append(knn.score(X_test_scaled, y_test))
# Plot
plt.figure(figsize=(10, 6))
plt.plot(k_values, train_scores, 'o-', label='Training Score')
plt.plot(k_values, test_scores, 'o-', label='Test Score')
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Accuracy')
plt.title('KNN: Accuracy vs K')
plt.legend()
plt.grid(True)
plt.show()
print(f"Optimal K: {k_values[np.argmax(test_scores)]}")
print(f"Best Test Accuracy: {max(test_scores):.4f}")
Distance Metrics and Weights
from sklearn.neighbors import KNeighborsClassifier
# Different distance metrics
metrics = ['euclidean', 'manhattan', 'chebyshev', 'minkowski']
print("KNN Results by Distance Metric:")
for metric in metrics:
knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
knn.fit(X_train_scaled, y_train)
print(f"{metric}: Accuracy = {knn.score(X_test_scaled, y_test):.4f}")
# Weighted KNN (closer neighbors have more influence)
knn_weighted = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn_weighted.fit(X_train_scaled, y_train)
print(f"\nWeighted KNN Accuracy: {knn_weighted.score(X_test_scaled, y_test):.4f}")
Gradient Boosting
Gradient Boosting builds an ensemble of weak learners sequentially, with each learner correcting the errors of the previous ones.
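For squared-error regression, "correcting the errors" literally means fitting each new tree to the residuals of the current ensemble. The sketch below shows that idea on synthetic data; it is a simplification of what GradientBoostingRegressor and the libraries below actually do:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, 200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from the mean prediction
for _ in range(100):
    residuals = y_demo - prediction                        # errors of the current ensemble
    stump = DecisionTreeRegressor(max_depth=2).fit(X_demo, residuals)
    prediction += learning_rate * stump.predict(X_demo)    # small corrective step

print(f"Training MSE after 100 boosting rounds: {np.mean((y_demo - prediction) ** 2):.4f}")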
Implementation with Scikit-learn
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
# Gradient Boosting Classifier
gb_classifier = GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=3,
min_samples_split=5,
min_samples_leaf=2,
subsample=0.8,
random_state=42
)
gb_classifier.fit(X_train, y_train)
# Evaluate
y_pred = gb_classifier.predict(X_test)
y_proba = gb_classifier.predict_proba(X_test)[:, 1]
print("Gradient Boosting Results:")
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")
XGBoost
XGBoost is an optimized implementation of gradient boosting:
import xgboost as xgb
# XGBoost Classifier
xgb_classifier = xgb.XGBClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=5,
min_child_weight=1,
subsample=0.8,
colsample_bytree=0.8,
gamma=0,
reg_alpha=0,
reg_lambda=1,
random_state=42,
use_label_encoder=False,  # only relevant for older XGBoost releases; newer versions ignore it
eval_metric='logloss'
)
xgb_classifier.fit(X_train, y_train)
# Evaluate
y_pred = xgb_classifier.predict(X_test)
y_proba = xgb_classifier.predict_proba(X_test)[:, 1]
print("XGBoost Results:")
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")
# Feature importance
xgb.plot_importance(xgb_classifier, max_num_features=10)
plt.title('XGBoost Feature Importance')
plt.tight_layout()
plt.show()
LightGBM
LightGBM is often faster than XGBoost on large datasets:
import lightgbm as lgb
# LightGBM Classifier
lgb_classifier = lgb.LGBMClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=5,
num_leaves=31,
subsample=0.8,
colsample_bytree=0.8,
random_state=42,
verbose=-1
)
lgb_classifier.fit(X_train, y_train)
# Evaluate
y_pred = lgb_classifier.predict(X_test)
y_proba = lgb_classifier.predict_proba(X_test)[:, 1]
print("LightGBM Results:")
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")
CatBoost
CatBoost handles categorical features natively:
from catboost import CatBoostClassifier
# CatBoost Classifier
cat_classifier = CatBoostClassifier(
iterations=100,
learning_rate=0.1,
depth=5,
random_seed=42,
verbose=False
)
cat_classifier.fit(X_train, y_train)
# Evaluate
y_pred = cat_classifier.predict(X_test)
y_proba = cat_classifier.predict_proba(X_test)[:, 1]
print("CatBoost Results:")
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")
Naive Bayes
Naive Bayes is a probabilistic classifier based on Bayes’ theorem, with the "naive" assumption that features are conditionally independent given the class.
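A tiny worked example of the arithmetic behind it: the posterior for each class is proportional to the prior times the product of per-feature likelihoods (that product is the independence assumption). The probabilities below are made up for illustration:
import numpy as np

# Hypothetical spam-filter estimates: class priors and P(word | class)
prior = {'spam': 0.4, 'ham': 0.6}
likelihood = {
    'spam': {'free': 0.30, 'meeting': 0.02},
    'ham':  {'free': 0.03, 'meeting': 0.20},
}

words_in_email = ['free', 'meeting']
scores = {}
for c in prior:
    # Naive independence: multiply the per-word likelihoods
    scores[c] = prior[c] * np.prod([likelihood[c][w] for w in words_in_email])

total = sum(scores.values())
for c, s in scores.items():
    print(f"P({c} | email) = {s / total:.3f}")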
Implementation
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
# Gaussian Naive Bayes (for continuous features)
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
y_proba = gnb.predict_proba(X_test)[:, 1]
print("Gaussian Naive Bayes Results:")
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")
# For text classification, use MultinomialNB with TF-IDF features
from sklearn.feature_extraction.text import TfidfVectorizer
# Example with text data
texts = ["This is spam", "Meeting tomorrow", "Free money now", "Project update"]
labels = [1, 0, 1, 0]
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(texts)
mnb = MultinomialNB()
mnb.fit(X_text, labels)
Algorithm Selection Guide
Quick Reference Table
| Algorithm | Type | Interpretability | Training Speed | Prediction Speed | Handles Non-linearity |
|---|---|---|---|---|---|
| Linear Regression | Regression | High | Fast | Fast | No |
| Logistic Regression | Classification | High | Fast | Fast | No |
| Decision Tree | Both | High | Fast | Fast | Yes |
| Random Forest | Both | Medium | Medium | Medium | Yes |
| SVM | Both | Low | Slow | Medium | Yes (with kernels) |
| KNN | Both | High | Fast | Slow | Yes |
| Gradient Boosting | Both | Low | Slow | Fast | Yes |
| Naive Bayes | Classification | High | Fast | Fast | No |
Decision Flowchart
Is your target continuous or categorical?
├── Continuous (Regression)
│ ├── Linear relationship? → Linear Regression
│ ├── Non-linear relationship?
│ │ ├── Need interpretability? → Decision Tree
│ │ ├── Need best accuracy? → XGBoost/LightGBM
│ │ └── High-dimensional? → Random Forest
│ └── Small dataset? → SVM or KNN
│
└── Categorical (Classification)
├── Binary classification
│ ├── Need probabilities? → Logistic Regression
│ ├── Need interpretability? → Decision Tree
│ └── Need best accuracy? → XGBoost/LightGBM
├── Multi-class classification
│ ├── Few classes? → Same as binary
│ └── Many classes? → Random Forest or Neural Network
└── Text classification? → Naive Bayes or SVM
Complete Comparison Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Define models
models = {
'Logistic Regression': LogisticRegression(max_iter=1000),
'Decision Tree': DecisionTreeClassifier(max_depth=10, random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'SVM': SVC(probability=True, random_state=42),
'KNN': KNeighborsClassifier(n_neighbors=5),
'Gradient Boosting': GradientBoostingClassifier(random_state=42),
'Naive Bayes': GaussianNB()
}
# Evaluate each model
results = {}
for name, model in models.items():
# Create pipeline with scaling
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', model)
])
# Cross-validation
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring='roc_auc')
results[name] = {
'mean_score': cv_scores.mean(),
'std_score': cv_scores.std()
}
# Display results
results_df = pd.DataFrame(results).T
results_df = results_df.sort_values('mean_score', ascending=False)
print("\nModel Comparison (ROC-AUC):")
print(results_df.round(4))
# Visualize
plt.figure(figsize=(12, 6))
plt.barh(results_df.index, results_df['mean_score'])
plt.xlabel('ROC-AUC Score')
plt.title('Model Comparison')
plt.xlim(0.5, 1.0)
for i, v in enumerate(results_df['mean_score']):
plt.text(v + 0.01, i, f'{v:.3f}', va='center')
plt.tight_layout()
plt.show()
Best Practices
1. Always Start Simple
# Start with a simple baseline (scale inside a pipeline, so no separate X_scaled is needed)
baseline = Pipeline([('scaler', StandardScaler()),
                     ('classifier', LogisticRegression(max_iter=1000))])
baseline_score = cross_val_score(baseline, X, y, cv=5).mean()
print(f"Baseline (Logistic Regression): {baseline_score:.4f}")
# Then try more complex models
complex_model = RandomForestClassifier(n_estimators=100)
complex_score = cross_val_score(complex_model, X, y, cv=5).mean()
print(f"Complex Model (Random Forest): {complex_score:.4f}")
# Is the improvement worth the complexity?
2. Use Cross-Validation
from sklearn.model_selection import StratifiedKFold
# For imbalanced classification
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
print(f"CV Score: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
3. Handle Class Imbalance
from sklearn.utils.class_weight import compute_class_weight
# Compute class weights
classes = np.unique(y)
weights = compute_class_weight('balanced', classes=classes, y=y)
class_weights = dict(zip(classes, weights))
# Use in model - pass the computed dict explicitly
model = LogisticRegression(class_weight=class_weights)
# or let scikit-learn compute it with the 'balanced' shortcut
model = RandomForestClassifier(class_weight='balanced')
4. Save and Load Models
import joblib
# Save model
joblib.dump(model, 'model.joblib')
# Load model
loaded_model = joblib.load('model.joblib')
predictions = loaded_model.predict(X_test)
Conclusion
Supervised learning algorithms form the foundation of practical machine learning. Key takeaways:
- Start with simple models (Linear/Logistic Regression) to establish baselines
- Tree-based models (Random Forest, XGBoost) work well for most tabular data
- Always scale features for distance-based algorithms (SVM, KNN)
- Use cross-validation to get reliable performance estimates
- Consider interpretability vs. accuracy trade-offs
- Tune hyperparameters for optimal performance
The best algorithm depends on your specific problem, data characteristics, and requirements. Experiment with multiple algorithms and let the data guide your choice.
Further Reading
- Scikit-learn Documentation and User Guide
- “Introduction to Statistical Learning” (free online)
- “Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow”
- XGBoost, LightGBM, and CatBoost documentation
- Kaggle competition solutions for practical examples