Supervised Learning Algorithms Every Developer Should Know
A comprehensive guide to essential supervised learning algorithms with Python implementations, mathematical intuition, and practical tips for choosing the right algorithm.
Supervised learning is the most common and practical type of machine learning. From predicting house prices to classifying emails as spam, supervised learning algorithms form the backbone of most real-world ML applications. In this guide, we’ll explore the essential algorithms every developer should master.
What is Supervised Learning?
Supervised learning is a type of machine learning where we train models using labeled data—examples where we know the correct answer. The model learns the relationship between inputs (features) and outputs (labels) to make predictions on new, unseen data.
Types of Supervised Learning
- Classification: Predicting discrete categories (spam/not spam, cat/dog)
- Regression: Predicting continuous values (house prices, temperature)
The Supervised Learning Workflow
# Standard supervised learning workflow
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, mean_squared_error
# 1. Load and prepare data
X, y = load_data()
# 2. Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# 3. Preprocess features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 4. Train model (SomeAlgorithm is a placeholder for any scikit-learn estimator)
model = SomeAlgorithm()
model.fit(X_train_scaled, y_train)
# 5. Evaluate
predictions = model.predict(X_test_scaled)
score = accuracy_score(y_test, predictions) # or mean_squared_error for regression
Linear Regression
Linear regression is one of the simplest and most interpretable algorithms for predicting continuous values. It models the relationship between the features and the target as a linear equation.
Mathematical Foundation
The linear regression model:
$$\hat{y} = b + w_1x_1 + w_2x_2 + \dots + w_nx_n = \mathbf{w}^T\mathbf{x} + b$$
Where:
- $\hat{y}$ = predicted value
- $w_i$ = weights (coefficients)
- $x_i$ = features
- $b$ = bias (intercept)
The model finds weights that minimize the Mean Squared Error (MSE):
$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
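To see what this objective looks like in code, here is a small standalone NumPy sketch (the numbers are invented for illustration) that fits the weights with the closed-form least-squares solution and computes the MSE exactly as defined above:
import numpy as np

# Toy data: 5 samples, 2 features (illustrative values only)
X_toy = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y_toy = np.array([8.0, 7.0, 18.0, 17.0, 25.0])

# Prepend a column of ones so the intercept b is learned as an extra weight
X_b = np.c_[np.ones(len(X_toy)), X_toy]

# Least-squares solution of the normal equations (numerically stable form)
w, *_ = np.linalg.lstsq(X_b, y_toy, rcond=None)

# Predictions and MSE computed directly from the definitions above
y_hat = X_b @ w
mse = np.mean((y_toy - y_hat) ** 2)
print(f"weights: {np.round(w, 2)}, MSE: {mse:.4f}")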
Implementation
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
import matplotlib.pyplot as plt
# Generate sample data - house prices
np.random.seed(42)
n_samples = 500
# Features
square_feet = np.random.randint(800, 4000, n_samples)
bedrooms = np.random.randint(1, 6, n_samples)
bathrooms = np.random.randint(1, 4, n_samples)
age = np.random.randint(0, 50, n_samples)
distance_to_city = np.random.uniform(0.5, 30, n_samples)
# Target (price) with some noise
price = (
50 * square_feet +
15000 * bedrooms +
10000 * bathrooms -
500 * age -
2000 * distance_to_city +
np.random.normal(0, 30000, n_samples)
)
# Create DataFrame
data = pd.DataFrame({
'square_feet': square_feet,
'bedrooms': bedrooms,
'bathrooms': bathrooms,
'age': age,
'distance_to_city': distance_to_city,
'price': price
})
# Prepare features and target
X = data.drop('price', axis=1)
y = data['price']
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train linear regression
model = LinearRegression()
model.fit(X_train_scaled, y_train)
# Predictions
y_pred = model.predict(X_test_scaled)
# Evaluate
print("Linear Regression Results:")
print(f"R² Score: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred)):,.2f}")
print(f"MAE: ${mean_absolute_error(y_test, y_pred):,.2f}")
# Feature coefficients
coefficients = pd.DataFrame({
'feature': X.columns,
'coefficient': model.coef_
}).sort_values('coefficient', ascending=False)
print("\nFeature Coefficients:")
print(coefficients)
Regularized Linear Regression
Regularization prevents overfitting by penalizing large weights.
# Ridge Regression (L2 regularization)
# Adds penalty: λ * Σ(w_i)²
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train_scaled, y_train)
ridge_pred = ridge_model.predict(X_test_scaled)
print(f"\nRidge R²: {r2_score(y_test, ridge_pred):.4f}")
# Lasso Regression (L1 regularization)
# Adds penalty: λ * Σ|w_i|
# Can reduce some coefficients to exactly zero (feature selection)
lasso_model = Lasso(alpha=1.0)
lasso_model.fit(X_train_scaled, y_train)
lasso_pred = lasso_model.predict(X_test_scaled)
print(f"Lasso R²: {r2_score(y_test, lasso_pred):.4f}")
print(f"Features used by Lasso: {np.sum(lasso_model.coef_ != 0)}/{len(lasso_model.coef_)}")
# Elastic Net (combines L1 and L2)
elastic_model = ElasticNet(alpha=1.0, l1_ratio=0.5)
elastic_model.fit(X_train_scaled, y_train)
elastic_pred = elastic_model.predict(X_test_scaled)
print(f"Elastic Net R²: {r2_score(y_test, elastic_pred):.4f}")
# Finding optimal regularization strength
from sklearn.model_selection import GridSearchCV
param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
ridge_cv = GridSearchCV(Ridge(), param_grid, cv=5, scoring='r2')
ridge_cv.fit(X_train_scaled, y_train)
print(f"\nBest Ridge alpha: {ridge_cv.best_params_['alpha']}")
print(f"Best Ridge R²: {ridge_cv.best_score_:.4f}")
Polynomial Regression
For non-linear relationships:
# Create polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)
print(f"Original features: {X_train_scaled.shape[1]}")
print(f"Polynomial features: {X_train_poly.shape[1]}")
# Train with regularization to prevent overfitting
poly_model = Ridge(alpha=10)
poly_model.fit(X_train_poly, y_train)
poly_pred = poly_model.predict(X_test_poly)
print(f"\nPolynomial Regression R²: {r2_score(y_test, poly_pred):.4f}")
When to Use Linear Regression
Good for:
- Continuous target variables
- Linear relationships between features and target
- When interpretability is important
- Baseline model for regression problems
- Small to medium datasets
Limitations:
- Assumes linear relationship
- Sensitive to outliers
- Features should be independent (no multicollinearity)
Logistic Regression
Despite its name, logistic regression is a classification algorithm. It predicts the probability of an instance belonging to a class.
Mathematical Foundation
Logistic regression uses the sigmoid function to convert linear output to probability:
$$P(y=1|x) = \sigma(w^Tx + b) = \frac{1}{1 + e^{-(w^Tx + b)}}$$
The decision boundary: predict class 1 if $P(y=1|x) > 0.5$
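As a quick standalone illustration of that mapping (the weights and feature values below are made up, not taken from the loan example that follows):
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and a single feature vector
w = np.array([0.8, -1.2])
b = 0.5
x_new = np.array([1.5, 0.4])

p = sigmoid(w @ x_new + b)   # P(y = 1 | x)
prediction = int(p > 0.5)    # default 0.5 threshold
print(f"P(y=1|x) = {p:.3f}, predicted class = {prediction}")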
Implementation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
classification_report, confusion_matrix, roc_auc_score,
roc_curve, precision_recall_curve
)
# Generate sample data - binary classification
np.random.seed(42)
n_samples = 1000
# Features for loan approval prediction
income = np.random.normal(60000, 20000, n_samples)
credit_score = np.random.normal(700, 50, n_samples)
debt_ratio = np.random.uniform(0.1, 0.6, n_samples)
employment_years = np.random.randint(0, 30, n_samples)
# Generate target with some logic
probability = 1 / (1 + np.exp(-(
0.00003 * income +
0.01 * credit_score -
5 * debt_ratio +
0.1 * employment_years -
10
)))
approved = (np.random.random(n_samples) < probability).astype(int)
# Create DataFrame (named loan_data so the house-price DataFrame above stays available)
loan_data = pd.DataFrame({
'income': income,
'credit_score': credit_score,
'debt_ratio': debt_ratio,
'employment_years': employment_years,
'approved': approved
})
# Prepare features and target
X = loan_data.drop('approved', axis=1)
y = loan_data['approved']
# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train logistic regression
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
# Predictions
y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)[:, 1]
# Evaluate
print("Logistic Regression Results:")
print(classification_report(y_test, y_pred))
print(f"ROC-AUC Score: {roc_auc_score(y_test, y_proba):.4f}")
# Feature importance (odds ratios)
odds_ratios = pd.DataFrame({
'feature': X.columns,
'coefficient': model.coef_[0],
'odds_ratio': np.exp(model.coef_[0])
}).sort_values('odds_ratio', ascending=False)
print("\nOdds Ratios:")
print(odds_ratios)
# Plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {roc_auc_score(y_test, y_proba):.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(True)
plt.show()
Multi-Class Logistic Regression
from sklearn.datasets import load_iris
# Load iris dataset (3 classes)
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# One-vs-Rest (OvR) approach
# (the multi_class parameter is deprecated in recent scikit-learn releases, which default to multinomial)
model_ovr = LogisticRegression(multi_class='ovr', max_iter=1000)
model_ovr.fit(X_train, y_train)
print(f"One-vs-Rest Accuracy: {model_ovr.score(X_test, y_test):.4f}")
# Multinomial (Softmax) approach
model_multi = LogisticRegression(multi_class='multinomial', max_iter=1000)
model_multi.fit(X_train, y_train)
print(f"Multinomial Accuracy: {model_multi.score(X_test, y_test):.4f}")
Threshold Tuning
# Find optimal threshold for specific needs
def find_optimal_threshold(y_true, y_proba, metric='f1'):
"""Find threshold that optimizes the specified metric."""
thresholds = np.arange(0.1, 0.9, 0.01)
scores = []
for threshold in thresholds:
y_pred = (y_proba >= threshold).astype(int)
if metric == 'f1':
from sklearn.metrics import f1_score
scores.append(f1_score(y_true, y_pred))
elif metric == 'precision':
from sklearn.metrics import precision_score
scores.append(precision_score(y_true, y_pred))
elif metric == 'recall':
from sklearn.metrics import recall_score
scores.append(recall_score(y_true, y_pred))
optimal_idx = np.argmax(scores)
return thresholds[optimal_idx], scores[optimal_idx]
optimal_threshold, optimal_score = find_optimal_threshold(y_test, y_proba, 'f1')
print(f"\nOptimal threshold for F1: {optimal_threshold:.2f} (F1 = {optimal_score:.4f})")
Decision Trees
Decision trees create a tree structure where each internal node represents a decision based on a feature, and each leaf node represents a prediction.
How Decision Trees Work
- Select the best feature and threshold to split on (using Gini impurity or entropy; see the sketch below)
- Create child nodes for each branch of the split
- Recursively repeat until a stopping criterion is met (e.g. maximum depth or minimum samples per leaf)
- Assign predictions to leaf nodes
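To make the first step concrete, here is a minimal sketch of how Gini impurity scores a candidate split; the labels are toy values, independent of the loan data used below:
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_k^2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gini(y_left, y_right):
    """Size-weighted Gini impurity of a candidate split."""
    n = len(y_left) + len(y_right)
    return (len(y_left) / n) * gini(y_left) + (len(y_right) / n) * gini(y_right)

# Hypothetical labels falling on each side of a candidate threshold
left = np.array([0, 0, 0, 1])
right = np.array([1, 1, 1, 0, 1])
print(f"Parent Gini: {gini(np.concatenate([left, right])):.3f}")
print(f"Split Gini:  {split_gini(left, right):.3f}")  # lower is better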
Implementation
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.tree import plot_tree, export_text
# Reuse the loan approval data from the logistic regression section
X = loan_data.drop('approved', axis=1)
y = loan_data['approved']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Train decision tree
dt_model = DecisionTreeClassifier(
max_depth=5,
min_samples_split=10,
min_samples_leaf=5,
random_state=42
)
dt_model.fit(X_train, y_train)
# Evaluate
y_pred = dt_model.predict(X_test)
print("Decision Tree Results:")
print(classification_report(y_test, y_pred))
# Visualize the tree
plt.figure(figsize=(20, 10))
plot_tree(
dt_model,
feature_names=X.columns.tolist(),
class_names=['Rejected', 'Approved'],
filled=True,
rounded=True,
fontsize=10
)
plt.title('Decision Tree for Loan Approval')
plt.tight_layout()
plt.show()
# Text representation
print("\nDecision Tree Rules:")
print(export_text(dt_model, feature_names=X.columns.tolist()))
# Feature importance
importance_df = pd.DataFrame({
'feature': X.columns,
'importance': dt_model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nFeature Importance:")
print(importance_df)
Controlling Overfitting
# Compare different tree depths
depths = [2, 4, 6, 8, 10, 15, 20, None]
train_scores = []
test_scores = []
for depth in depths:
model = DecisionTreeClassifier(max_depth=depth, random_state=42)
model.fit(X_train, y_train)
train_scores.append(model.score(X_train, y_train))
test_scores.append(model.score(X_test, y_test))
# Plot training vs test accuracy by depth
plt.figure(figsize=(10, 6))
plt.plot(range(len(depths)), train_scores, 'o-', label='Training Score')
plt.plot(range(len(depths)), test_scores, 'o-', label='Test Score')
plt.xticks(range(len(depths)), [str(d) for d in depths])
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.title('Decision Tree: Training vs Test Score by Depth')
plt.legend()
plt.grid(True)
plt.show()
Decision Tree for Regression
from sklearn.tree import DecisionTreeRegressor
# Use house price data
X = data.drop('price', axis=1)
y = data['price']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train regression tree
dt_regressor = DecisionTreeRegressor(
max_depth=5,
min_samples_split=20,
random_state=42
)
dt_regressor.fit(X_train, y_train)
# Evaluate
y_pred = dt_regressor.predict(X_test)
print(f"Decision Tree Regression R²: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred)):,.2f}")
Random Forest
Random Forest is an ensemble method that combines multiple decision trees to improve accuracy and reduce overfitting.
How Random Forest Works
- Create multiple decision trees using bootstrapped samples
- Each tree considers a random subset of features at each split
- Aggregate predictions (majority vote for classification, average for regression); see the sketch below
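These steps can be imitated by hand with scikit-learn building blocks. The sketch below is only an illustration of bagging plus majority voting on synthetic data (via make_classification), not how RandomForestClassifier is implemented internally:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

trees = []
for _ in range(25):
    # 1. Bootstrap sample: draw rows with replacement
    idx = rng.integers(0, len(X_demo), len(X_demo))
    # 2. Each tree also restricts the features considered at every split
    tree = DecisionTreeClassifier(max_features='sqrt')
    trees.append(tree.fit(X_demo[idx], y_demo[idx]))

# 3. Aggregate: majority vote across the individual trees
votes = np.stack([t.predict(X_demo) for t in trees])
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
print(f"Ensemble accuracy on the demo data: {(ensemble_pred == y_demo).mean():.3f}")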
Implementation
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
# Classification - reuse the loan approval data (the regression example above re-split X and y)
X = loan_data.drop('approved', axis=1)
y = loan_data['approved']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
rf_classifier = RandomForestClassifier(
n_estimators=100,
max_depth=10,
min_samples_split=5,
min_samples_leaf=2,
max_features='sqrt',
random_state=42,
n_jobs=-1
)
rf_classifier.fit(X_train, y_train)
# Evaluate
y_pred = rf_classifier.predict(X_test)
y_proba = rf_classifier.predict_proba(X_test)[:, 1]
print("Random Forest Classification Results:")
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")
# Feature importance
importance_df = pd.DataFrame({
'feature': X.columns,
'importance': rf_classifier.feature_importances_
}).sort_values('importance', ascending=False)
plt.figure(figsize=(10, 6))
plt.barh(importance_df['feature'], importance_df['importance'])
plt.xlabel('Importance')
plt.title('Random Forest Feature Importance')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
Out-of-Bag Error
# Use OOB score for validation without separate test set
rf_oob = RandomForestClassifier(
n_estimators=100,
oob_score=True,
random_state=42,
n_jobs=-1
)
rf_oob.fit(X_train, y_train)
print(f"OOB Score: {rf_oob.oob_score_:.4f}")
print(f"Test Score: {rf_oob.score(X_test, y_test):.4f}")
Hyperparameter Tuning
from sklearn.model_selection import RandomizedSearchCV
# Define parameter distributions
param_distributions = {
'n_estimators': [50, 100, 200, 300],
'max_depth': [5, 10, 15, 20, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'max_features': ['sqrt', 'log2', None],
'bootstrap': [True, False]
}
# Randomized search
rf_random = RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_distributions,
n_iter=50,
cv=5,
scoring='roc_auc',
random_state=42,
n_jobs=-1
)
rf_random.fit(X_train, y_train)
print(f"Best parameters: {rf_random.best_params_}")
print(f"Best CV Score: {rf_random.best_score_:.4f}")
print(f"Test Score: {rf_random.score(X_test, y_test):.4f}")
Support Vector Machines (SVM)
SVM finds the optimal hyperplane that separates classes with the maximum margin.
Mathematical Foundation
SVM optimizes: $$\min_{w,b} \frac{1}{2}||w||^2 + C\sum_{i=1}^{n}\xi_i$$
Subject to: $y_i(w^Tx_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$
Where:
- $w$ = weight vector (defines hyperplane)
- $b$ = bias
- $C$ = regularization parameter
- $\xi_i$ = slack variables for the soft margin (illustrated in the sketch below)
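A small sketch of what the slack variables mean in practice: for a (hypothetical) weight vector and bias, $\xi_i = \max(0, 1 - y_i(w^Tx_i + b))$, so only points inside the margin or on the wrong side contribute to the penalty term:
import numpy as np

# Hypothetical hyperplane and a few labelled points (labels in {-1, +1})
w = np.array([1.0, -1.0])
b = -0.5
X_pts = np.array([[3.0, 1.0], [1.0, 1.2], [0.5, 2.0]])
y_pts = np.array([1, 1, -1])

margins = y_pts * (X_pts @ w + b)         # y_i (w^T x_i + b)
slacks = np.maximum(0.0, 1.0 - margins)   # xi_i = 0 for points outside the margin
for m, s in zip(margins, slacks):
    print(f"margin = {m:+.2f}, slack xi = {s:.2f}")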
Implementation
from sklearn.svm import SVC, SVR
from sklearn.preprocessing import StandardScaler
# Scale features (very important for SVM!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Linear SVM
svm_linear = SVC(kernel='linear', C=1.0, probability=True, random_state=42)
svm_linear.fit(X_train_scaled, y_train)
# RBF (Radial Basis Function) kernel - most common
svm_rbf = SVC(kernel='rbf', C=1.0, gamma='scale', probability=True, random_state=42)
svm_rbf.fit(X_train_scaled, y_train)
# Polynomial kernel
svm_poly = SVC(kernel='poly', degree=3, C=1.0, probability=True, random_state=42)
svm_poly.fit(X_train_scaled, y_train)
# Compare kernels
kernels = {
'Linear': svm_linear,
'RBF': svm_rbf,
'Polynomial': svm_poly
}
print("SVM Results by Kernel:")
for name, model in kernels.items():
y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)[:, 1]
print(f"{name}: Accuracy = {model.score(X_test_scaled, y_test):.4f}, "
f"ROC-AUC = {roc_auc_score(y_test, y_proba):.4f}")
Tuning SVM Hyperparameters
from sklearn.model_selection import GridSearchCV
# Grid search for RBF kernel
param_grid = {
'C': [0.1, 1, 10, 100],
'gamma': ['scale', 'auto', 0.001, 0.01, 0.1, 1]
}
svm_cv = GridSearchCV(
SVC(kernel='rbf', probability=True),
param_grid,
cv=5,
scoring='roc_auc',
n_jobs=-1
)
svm_cv.fit(X_train_scaled, y_train)
print(f"Best parameters: {svm_cv.best_params_}")
print(f"Best CV Score: {svm_cv.best_score_:.4f}")
print(f"Test Score: {svm_cv.score(X_test_scaled, y_test):.4f}")
SVM for Regression
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline

# SVR is for regression, so switch back to the house price data (feature scaling happens in the pipeline;
# with large-scale targets like prices, scaling y as well often helps SVR)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    data.drop('price', axis=1), data['price'], test_size=0.2, random_state=42
)
svr = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=100, gamma='scale', epsilon=0.1))
svr.fit(Xr_train, yr_train)
y_pred = svr.predict(Xr_test)
print(f"SVR R²: {r2_score(yr_test, y_pred):.4f}")
K-Nearest Neighbors (KNN)
KNN is a simple, instance-based learning algorithm that makes predictions based on the K closest training examples.
How KNN Works
- Calculate distance from query point to all training points
- Select the K nearest neighbors
- For classification: majority vote
- For regression: average of the neighbors' values (a brute-force sketch of these steps follows)
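Those steps are short enough to write directly in NumPy. The sketch below uses invented toy data and is only for intuition; scikit-learn's KNeighborsClassifier (used next) also offers faster tree-based neighbor search:
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Brute-force KNN classification for a single query point."""
    # 1. Euclidean distance to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # 2. Indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # 3. Majority vote among their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Invented toy data: two well-separated clusters
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([0.8, 0.9])))  # -> 0
print(knn_predict(X_toy, y_toy, np.array([5.2, 5.1])))  # -> 1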
Implementation
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
# Scale features (important for distance-based algorithms)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# KNN Classifier
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform', metric='euclidean')
knn.fit(X_train_scaled, y_train)
# Evaluate
y_pred = knn.predict(X_test_scaled)
print(f"KNN Accuracy: {knn.score(X_test_scaled, y_test):.4f}")
# Find optimal K
k_values = range(1, 31)
train_scores = []
test_scores = []
for k in k_values:
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train_scaled, y_train)
train_scores.append(knn.score(X_train_scaled, y_train))
test_scores.append(knn.score(X_test_scaled, y_test))
# Plot
plt.figure(figsize=(10, 6))
plt.plot(k_values, train_scores, 'o-', label='Training Score')
plt.plot(k_values, test_scores, 'o-', label='Test Score')
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Accuracy')
plt.title('KNN: Accuracy vs K')
plt.legend()
plt.grid(True)
plt.show()
print(f"Optimal K: {k_values[np.argmax(test_scores)]}")
print(f"Best Test Accuracy: {max(test_scores):.4f}")
Distance Metrics and Weights
from sklearn.neighbors import KNeighborsClassifier
# Different distance metrics
metrics = ['euclidean', 'manhattan', 'chebyshev', 'minkowski']
print("KNN Results by Distance Metric:")
for metric in metrics:
knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
knn.fit(X_train_scaled, y_train)
print(f"{metric}: Accuracy = {knn.score(X_test_scaled, y_test):.4f}")
# Weighted KNN (closer neighbors have more influence)
knn_weighted = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn_weighted.fit(X_train_scaled, y_train)
print(f"\nWeighted KNN Accuracy: {knn_weighted.score(X_test_scaled, y_test):.4f}")
Gradient Boosting
Gradient Boosting builds an ensemble of weak learners sequentially, with each learner correcting the errors of the previous ones.
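For squared-error regression, "correcting the errors" literally means fitting each new tree to the residuals of the current ensemble. The sketch below shows that idea on synthetic data; it is a simplification of what GradientBoostingRegressor and the libraries below actually do:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, 200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from the mean prediction
for _ in range(100):
    residuals = y_demo - prediction                        # errors of the current ensemble
    stump = DecisionTreeRegressor(max_depth=2).fit(X_demo, residuals)
    prediction += learning_rate * stump.predict(X_demo)    # small corrective step

print(f"Training MSE after 100 boosting rounds: {np.mean((y_demo - prediction) ** 2):.4f}")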
Implementation with Scikit-learn
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
# Gradient Boosting Classifier
gb_classifier = GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=3,
min_samples_split=5,
min_samples_leaf=2,
subsample=0.8,
random_state=42
)
gb_classifier.fit(X_train, y_train)
# Evaluate
y_pred = gb_classifier.predict(X_test)
y_proba = gb_classifier.predict_proba(X_test)[:, 1]
print("Gradient Boosting Results:")
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")
XGBoost
XGBoost is an optimized implementation of gradient boosting:
import xgboost as xgb
# XGBoost Classifier
xgb_classifier = xgb.XGBClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=5,
min_child_weight=1,
subsample=0.8,
colsample_bytree=0.8,
gamma=0,
reg_alpha=0,
reg_lambda=1,
random_state=42,
use_label_encoder=False,  # only relevant for older XGBoost releases; newer versions ignore it
eval_metric='logloss'
)
xgb_classifier.fit(X_train, y_train)
# Evaluate
y_pred = xgb_classifier.predict(X_test)
y_proba = xgb_classifier.predict_proba(X_test)[:, 1]
print("XGBoost Results:")
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")
# Feature importance
xgb.plot_importance(xgb_classifier, max_num_features=10)
plt.title('XGBoost Feature Importance')
plt.tight_layout()
plt.show()
LightGBM
LightGBM is often faster than XGBoost on large datasets:
import lightgbm as lgb
# LightGBM Classifier
lgb_classifier = lgb.LGBMClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=5,
num_leaves=31,
subsample=0.8,
colsample_bytree=0.8,
random_state=42,
verbose=-1
)
lgb_classifier.fit(X_train, y_train)
# Evaluate
y_pred = lgb_classifier.predict(X_test)
y_proba = lgb_classifier.predict_proba(X_test)[:, 1]
print("LightGBM Results:")
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")
CatBoost
CatBoost handles categorical features natively:
from catboost import CatBoostClassifier
# CatBoost Classifier
cat_classifier = CatBoostClassifier(
iterations=100,
learning_rate=0.1,
depth=5,
random_seed=42,
verbose=False
)
cat_classifier.fit(X_train, y_train)
# Evaluate
y_pred = cat_classifier.predict(X_test)
y_proba = cat_classifier.predict_proba(X_test)[:, 1]
print("CatBoost Results:")
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")
Naive Bayes
Naive Bayes is a probabilistic classifier based on Bayes’ theorem, with the "naive" assumption that features are conditionally independent given the class.
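A tiny worked example of the arithmetic behind it: the posterior for each class is proportional to the prior times the product of per-feature likelihoods (that product is the independence assumption). The probabilities below are made up for illustration:
import numpy as np

# Hypothetical spam-filter estimates: class priors and P(word | class)
prior = {'spam': 0.4, 'ham': 0.6}
likelihood = {
    'spam': {'free': 0.30, 'meeting': 0.02},
    'ham':  {'free': 0.03, 'meeting': 0.20},
}

words_in_email = ['free', 'meeting']
scores = {}
for c in prior:
    # Naive independence: multiply the per-word likelihoods
    scores[c] = prior[c] * np.prod([likelihood[c][w] for w in words_in_email])

total = sum(scores.values())
for c, s in scores.items():
    print(f"P({c} | email) = {s / total:.3f}")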
Implementation
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
# Gaussian Naive Bayes (for continuous features)
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
y_proba = gnb.predict_proba(X_test)[:, 1]
print("Gaussian Naive Bayes Results:")
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")
# For text classification, use MultinomialNB with TF-IDF features
from sklearn.feature_extraction.text import TfidfVectorizer
# Example with text data
texts = ["This is spam", "Meeting tomorrow", "Free money now", "Project update"]
labels = [1, 0, 1, 0]
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(texts)
mnb = MultinomialNB()
mnb.fit(X_text, labels)
Algorithm Selection Guide
Quick Reference Table
| Algorithm | Type | Interpretability | Training Speed | Prediction Speed | Handles Non-linearity |
|---|---|---|---|---|---|
| Linear Regression | Regression | High | Fast | Fast | No |
| Logistic Regression | Classification | High | Fast | Fast | No |
| Decision Tree | Both | High | Fast | Fast | Yes |
| Random Forest | Both | Medium | Medium | Medium | Yes |
| SVM | Both | Low | Slow | Medium | Yes (with kernels) |
| KNN | Both | High | Fast | Slow | Yes |
| Gradient Boosting | Both | Low | Slow | Fast | Yes |
| Naive Bayes | Classification | High | Fast | Fast | No |
Decision Flowchart
Is your target continuous or categorical?
├── Continuous (Regression)
│ ├── Linear relationship? → Linear Regression
│ ├── Non-linear relationship?
│ │ ├── Need interpretability? → Decision Tree
│ │ ├── Need best accuracy? → XGBoost/LightGBM
│ │ └── High-dimensional? → Random Forest
│ └── Small dataset? → SVM or KNN
│
└── Categorical (Classification)
├── Binary classification
│ ├── Need probabilities? → Logistic Regression
│ ├── Need interpretability? → Decision Tree
│ └── Need best accuracy? → XGBoost/LightGBM
├── Multi-class classification
│ ├── Few classes? → Same as binary
│ └── Many classes? → Random Forest or Neural Network
└── Text classification? → Naive Bayes or SVM
Complete Comparison Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Define models
models = {
'Logistic Regression': LogisticRegression(max_iter=1000),
'Decision Tree': DecisionTreeClassifier(max_depth=10, random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'SVM': SVC(probability=True, random_state=42),
'KNN': KNeighborsClassifier(n_neighbors=5),
'Gradient Boosting': GradientBoostingClassifier(random_state=42),
'Naive Bayes': GaussianNB()
}
# Evaluate each model
results = {}
for name, model in models.items():
# Create pipeline with scaling
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', model)
])
# Cross-validation
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring='roc_auc')
results[name] = {
'mean_score': cv_scores.mean(),
'std_score': cv_scores.std()
}
# Display results
results_df = pd.DataFrame(results).T
results_df = results_df.sort_values('mean_score', ascending=False)
print("\nModel Comparison (ROC-AUC):")
print(results_df.round(4))
# Visualize
plt.figure(figsize=(12, 6))
plt.barh(results_df.index, results_df['mean_score'])
plt.xlabel('ROC-AUC Score')
plt.title('Model Comparison')
plt.xlim(0.5, 1.0)
for i, v in enumerate(results_df['mean_score']):
plt.text(v + 0.01, i, f'{v:.3f}', va='center')
plt.tight_layout()
plt.show()
Best Practices
1. Always Start Simple
# Start with a simple baseline (scale inside a pipeline, so no separate X_scaled is needed)
baseline = Pipeline([('scaler', StandardScaler()),
                     ('classifier', LogisticRegression(max_iter=1000))])
baseline_score = cross_val_score(baseline, X, y, cv=5).mean()
print(f"Baseline (Logistic Regression): {baseline_score:.4f}")
# Then try more complex models
complex_model = RandomForestClassifier(n_estimators=100)
complex_score = cross_val_score(complex_model, X, y, cv=5).mean()
print(f"Complex Model (Random Forest): {complex_score:.4f}")
# Is the improvement worth the complexity?
2. Use Cross-Validation
from sklearn.model_selection import StratifiedKFold
# For imbalanced classification
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
print(f"CV Score: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
3. Handle Class Imbalance
from sklearn.utils.class_weight import compute_class_weight
# Compute class weights
classes = np.unique(y)
weights = compute_class_weight('balanced', classes=classes, y=y)
class_weights = dict(zip(classes, weights))
# Use in model - pass the computed dict explicitly
model = LogisticRegression(class_weight=class_weights)
# or let scikit-learn compute it with the 'balanced' shortcut
model = RandomForestClassifier(class_weight='balanced')
4. Save and Load Models
import joblib
# Save model
joblib.dump(model, 'model.joblib')
# Load model
loaded_model = joblib.load('model.joblib')
predictions = loaded_model.predict(X_test)
Conclusion
Supervised learning algorithms form the foundation of practical machine learning. Key takeaways:
- Start with simple models (Linear/Logistic Regression) to establish baselines
- Tree-based models (Random Forest, XGBoost) work well for most tabular data
- Always scale features for distance-based algorithms (SVM, KNN)
- Use cross-validation to get reliable performance estimates
- Consider interpretability vs. accuracy trade-offs
- Tune hyperparameters for optimal performance
The best algorithm depends on your specific problem, data characteristics, and requirements. Experiment with multiple algorithms and let the data guide your choice.
Further Reading
- Scikit-learn Documentation and User Guide
- “Introduction to Statistical Learning” (free online)
- “Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow”
- XGBoost, LightGBM, and CatBoost documentation
- Kaggle competition solutions for practical examples