Feature Engineering: The Key to Better ML Models
Master the art of feature engineering with practical techniques to transform raw data into powerful features that dramatically improve model performance.
Feature engineering is often the difference between a mediocre model and a world-class one. While algorithms and hyperparameters matter, the features you create and select have the biggest impact on model performance. In this comprehensive guide, we’ll explore practical techniques that can transform your raw data into powerful predictive features.
Why Feature Engineering Matters
“Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering.” — Andrew Ng
Raw data rarely works well directly with machine learning algorithms. Good features:
- Capture relevant patterns that algorithms can learn from
- Reduce noise and irrelevant information
- Make relationships explicit that would be hard for algorithms to discover
- Enable simpler models to achieve better performance
The Feature Engineering Workflow
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
# Typical workflow
# 1. Understand your data
# 2. Handle missing values
# 3. Encode categorical variables
# 4. Create new features
# 5. Scale/normalize
# 6. Select important features
# 7. Evaluate impact
Understanding Your Data
Before engineering features, you need to understand what you have.
Exploratory Data Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load sample data
np.random.seed(42)
n_samples = 1000
data = pd.DataFrame({
'age': np.random.randint(18, 80, n_samples),
'income': np.random.lognormal(10.5, 0.5, n_samples),
'education': np.random.choice(['high_school', 'bachelor', 'master', 'phd'], n_samples),
'occupation': np.random.choice(['engineer', 'doctor', 'teacher', 'artist', 'manager'], n_samples),
'years_employed': np.random.randint(0, 40, n_samples),
'credit_score': np.random.normal(700, 50, n_samples).clip(300, 850),
'num_credit_cards': np.random.randint(0, 10, n_samples),
'loan_amount': np.random.lognormal(10, 0.8, n_samples),
'property_value': np.random.lognormal(12, 0.5, n_samples),
'approved': np.random.randint(0, 2, n_samples)
})
# Add some missing values
data.loc[np.random.choice(data.index, 50), 'income'] = np.nan
data.loc[np.random.choice(data.index, 30), 'credit_score'] = np.nan
data.loc[np.random.choice(data.index, 20), 'years_employed'] = np.nan
# Basic info
print("Dataset Shape:", data.shape)
print("\nColumn Types:")
print(data.dtypes)
print("\nMissing Values:")
print(data.isnull().sum())
print("\nNumeric Summary:")
print(data.describe())
# Check distributions
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
numeric_cols = ['age', 'income', 'credit_score', 'loan_amount', 'property_value', 'years_employed']
for i, col in enumerate(numeric_cols):
ax = axes[i // 3, i % 3]
data[col].hist(bins=30, ax=ax, edgecolor='black')
ax.set_title(f'Distribution of {col}')
ax.set_xlabel(col)
plt.tight_layout()
plt.show()
# Check correlations
correlation_matrix = data.select_dtypes(include=[np.number]).corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()
Handling Missing Values
Missing data is common in real-world datasets. How you handle it affects model performance.
Strategies for Missing Values
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Create a copy for experiments
df = data.copy()
# 1. Simple Statistics-Based Imputation
# Mean imputation (for normally distributed data)
df['income_mean'] = df['income'].fillna(df['income'].mean())
# Median imputation (robust to outliers)
df['income_median'] = df['income'].fillna(df['income'].median())
# Mode imputation (best suited to categorical or discrete data; applied to
# credit_score here purely for illustration)
df['credit_score_mode'] = df['credit_score'].fillna(df['credit_score'].mode()[0])
# 2. Group-Based Imputation
# Fill based on related groups
df['income_by_education'] = df.groupby('education')['income'].transform(
lambda x: x.fillna(x.median())
)
df['income_by_occupation'] = df.groupby('occupation')['income'].transform(
lambda x: x.fillna(x.median())
)
# 3. Forward/Backward Fill (for time series)
# df['value_ffill'] = df['value'].ffill()
# df['value_bfill'] = df['value'].bfill()
# 4. Interpolation
# df['value_interpolate'] = df['value'].interpolate(method='linear')
# 5. KNN Imputation
numeric_cols = ['age', 'income', 'credit_score', 'years_employed']
knn_imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = knn_imputer.fit_transform(df[numeric_cols])
# 6. Iterative Imputation (MICE-like)
iterative_imputer = IterativeImputer(max_iter=10, random_state=42)
# df[numeric_cols] = iterative_imputer.fit_transform(df[numeric_cols])
print("Missing values after imputation:")
print(df.isnull().sum())
Creating Missing Indicators
# Sometimes the fact that a value is missing is informative
df['income_missing'] = data['income'].isnull().astype(int)
df['credit_score_missing'] = data['credit_score'].isnull().astype(int)
df['years_employed_missing'] = data['years_employed'].isnull().astype(int)
# Check if missingness predicts the target
print("\nMissingness correlation with target:")
missing_cols = ['income_missing', 'credit_score_missing', 'years_employed_missing']
for col in missing_cols:
correlation = df[col].corr(df['approved'])
print(f"{col}: {correlation:.4f}")
Encoding Categorical Variables
Most machine learning algorithms require numeric input (a few, such as LightGBM and CatBoost, handle categories natively). Here’s how to convert categories.
One-Hot Encoding
Best for nominal categories (no inherent order).
from sklearn.preprocessing import OneHotEncoder
# Method 1: Pandas get_dummies
df_encoded = pd.get_dummies(df, columns=['education', 'occupation'], drop_first=True)
print("Shape after one-hot encoding:", df_encoded.shape)
# Method 2: Sklearn OneHotEncoder (better for pipelines)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
categorical_cols = ['education', 'occupation']
encoded_features = encoder.fit_transform(df[categorical_cols])
encoded_df = pd.DataFrame(
encoded_features,
columns=encoder.get_feature_names_out(categorical_cols)
)
print("\nEncoded column names:")
print(encoded_df.columns.tolist())
Ordinal Encoding
For categories with a natural order.
from sklearn.preprocessing import OrdinalEncoder
# Define the order
education_order = ['high_school', 'bachelor', 'master', 'phd']
# Custom ordinal mapping
education_map = {edu: i for i, edu in enumerate(education_order)}
df['education_ordinal'] = df['education'].map(education_map)
# Or use OrdinalEncoder
ordinal_encoder = OrdinalEncoder(categories=[education_order])
df['education_ordinal_v2'] = ordinal_encoder.fit_transform(df[['education']])
print("Education encoding:")
print(df[['education', 'education_ordinal']].drop_duplicates().sort_values('education_ordinal'))
Target Encoding
Replace categories with target statistics. Useful for high-cardinality features.
from sklearn.model_selection import KFold
def target_encode(df, column, target, n_splits=5):
"""
Target encode a categorical column using K-fold to prevent leakage.
"""
df = df.copy()
df['encoded'] = np.nan
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df):
# Calculate encoding from training fold
encoding = df.loc[train_idx].groupby(column)[target].mean()
# Apply to validation fold
df.loc[val_idx, 'encoded'] = df.loc[val_idx, column].map(encoding)
# Handle unseen categories with global mean
global_mean = df[target].mean()
df['encoded'] = df['encoded'].fillna(global_mean)
return df['encoded']
# Apply target encoding
df['occupation_target_encoded'] = target_encode(df, 'occupation', 'approved')
df['education_target_encoded'] = target_encode(df, 'education', 'approved')
print("\nTarget encoding for occupation:")
occupation_stats = df.groupby('occupation').agg({
'approved': 'mean',
'occupation_target_encoded': 'mean'  # encodings vary across CV folds, so average them
}).round(4)
print(occupation_stats)
Frequency Encoding
Replace categories with their frequency.
def frequency_encode(df, column):
"""Encode categories by their frequency in the dataset."""
frequency = df[column].value_counts(normalize=True)
return df[column].map(frequency)
df['occupation_frequency'] = frequency_encode(df, 'occupation')
df['education_frequency'] = frequency_encode(df, 'education')
print("\nFrequency encoding:")
print(df[['occupation', 'occupation_frequency']].drop_duplicates())
Binary Encoding
Represents each category’s integer code in binary, so k categories need only about log2(k) columns, which makes it efficient for high-cardinality features.
# Requires the third-party category_encoders package (pip install category_encoders)
import category_encoders as ce
# Binary encoding
binary_encoder = ce.BinaryEncoder(cols=['occupation'])
df_binary = binary_encoder.fit_transform(df[['occupation']])
print("\nBinary encoding columns:", df_binary.columns.tolist())
Feature Scaling
Many algorithms require features to be on similar scales.
Standardization (Z-score)
Centers data around 0 with unit variance.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
numeric_features = ['age', 'income', 'credit_score', 'years_employed', 'loan_amount']
# Fit only on training data!
X_train, X_test = train_test_split(df[numeric_features], test_size=0.2, random_state=42)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Before scaling:")
print(f" Mean: {X_train.mean().values}")
print(f" Std: {X_train.std().values}")
print("\nAfter scaling:")
print(f" Mean: {X_train_scaled.mean(axis=0)}")
print(f" Std: {X_train_scaled.std(axis=0)}")
Min-Max Normalization
Scales to a fixed range (usually 0-1).
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler(feature_range=(0, 1))
X_train_minmax = minmax_scaler.fit_transform(X_train)
print("\nMin-Max scaling:")
print(f" Min: {X_train_minmax.min(axis=0)}")
print(f" Max: {X_train_minmax.max(axis=0)}")
Robust Scaling
Robust to outliers using median and IQR.
from sklearn.preprocessing import RobustScaler
robust_scaler = RobustScaler()
X_train_robust = robust_scaler.fit_transform(X_train)
# Compare on data with outliers
outlier_data = np.array([[100], [200], [300], [400], [10000]])
print("\nRobust vs Standard scaling on outliers:")
print(f" Standard: {StandardScaler().fit_transform(outlier_data).flatten()}")
print(f" Robust: {RobustScaler().fit_transform(outlier_data).flatten()}")
When to Scale
| Algorithm | Scaling Needed? |
|---|---|
| Linear Regression | Helpful |
| Logistic Regression | Yes |
| SVM | Yes |
| KNN | Yes |
| Decision Trees | No |
| Random Forest | No |
| Gradient Boosting | No |
| Neural Networks | Yes |
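A quick way to see the table in action is to compare a distance-based model with a tree on the same unscaled columns. The sketch below reuses the synthetic df from earlier with an arbitrary three-feature subset; because the synthetic target is random, both models hover near 0.5, so the point is not the absolute score but that the tree’s results are essentially identical with and without scaling (splits are invariant to monotonic rescaling) while KNN’s change once income stops dominating the distance calculation.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

scale_demo_cols = ['age', 'income', 'credit_score']
X_demo = df[scale_demo_cols]
y_demo = df['approved']

for name, estimator in [('KNN', KNeighborsClassifier()),
                        ('Decision Tree', DecisionTreeClassifier(random_state=42))]:
    # Same estimator evaluated on raw features and inside a scaling pipeline
    raw = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring='roc_auc').mean()
    scaled = cross_val_score(make_pipeline(StandardScaler(), estimator),
                             X_demo, y_demo, cv=5, scoring='roc_auc').mean()
    print(f"{name}: raw={raw:.4f}, scaled={scaled:.4f}")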
Creating New Features
This is where domain knowledge shines. Here are common techniques.
Mathematical Combinations
# Ratios
df['debt_to_income'] = df['loan_amount'] / df['income']
df['loan_to_property'] = df['loan_amount'] / df['property_value']
df['income_per_year_employed'] = df['income'] / (df['years_employed'] + 1)
# Products (interactions)
df['income_x_credit_score'] = df['income'] * df['credit_score']
df['age_x_years_employed'] = df['age'] * df['years_employed']
# Sums and differences
df['total_assets'] = df['income'] + df['property_value']
df['age_at_first_job'] = df['age'] - df['years_employed']
# Polynomial features
df['income_squared'] = df['income'] ** 2
df['credit_score_squared'] = df['credit_score'] ** 2
# Log transformations (for skewed data)
df['log_income'] = np.log1p(df['income'])
df['log_loan_amount'] = np.log1p(df['loan_amount'])
df['log_property_value'] = np.log1p(df['property_value'])
# Square root (another way to handle skew)
df['sqrt_income'] = np.sqrt(df['income'])
print("New feature statistics:")
new_features = ['debt_to_income', 'loan_to_property', 'log_income']
print(df[new_features].describe())
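If you want interactions and polynomial terms generated systematically rather than one at a time, scikit-learn’s PolynomialFeatures can expand a set of base columns into all pairwise products. A minimal sketch, restricted to three base columns to keep the output readable:
from sklearn.preprocessing import PolynomialFeatures

base_cols = ['age', 'income', 'credit_score']
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
poly_features = poly.fit_transform(df[base_cols])
poly_df = pd.DataFrame(poly_features,
                       columns=poly.get_feature_names_out(base_cols),
                       index=df.index)
print("\nGenerated interaction columns:", poly_df.columns.tolist())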
Binning Continuous Variables
# Equal-width bins
df['age_bin'] = pd.cut(df['age'], bins=5, labels=['very_young', 'young', 'middle', 'senior', 'elderly'])
# Quantile-based bins (equal frequency)
df['income_quantile'] = pd.qcut(df['income'], q=4, labels=['low', 'medium', 'high', 'very_high'])
# Custom bins based on domain knowledge
credit_bins = [0, 580, 670, 740, 800, 900]
credit_labels = ['poor', 'fair', 'good', 'very_good', 'excellent']
df['credit_category'] = pd.cut(df['credit_score'], bins=credit_bins, labels=credit_labels)
# Binary flags
df['is_senior'] = (df['age'] >= 60).astype(int)
df['has_excellent_credit'] = (df['credit_score'] >= 750).astype(int)
df['high_debt_ratio'] = (df['debt_to_income'] > 0.4).astype(int)
print("\nBinning results:")
print(df['credit_category'].value_counts())
Date and Time Features
# Create sample date data
date_df = pd.DataFrame({
'transaction_date': pd.date_range('2023-01-01', periods=1000, freq='H')
})
date_df['date'] = date_df['transaction_date']
# Extract components
date_df['year'] = date_df['date'].dt.year
date_df['month'] = date_df['date'].dt.month
date_df['day'] = date_df['date'].dt.day
date_df['day_of_week'] = date_df['date'].dt.dayofweek
date_df['day_of_year'] = date_df['date'].dt.dayofyear
date_df['week_of_year'] = date_df['date'].dt.isocalendar().week
date_df['hour'] = date_df['date'].dt.hour
date_df['minute'] = date_df['date'].dt.minute
date_df['quarter'] = date_df['date'].dt.quarter
# Binary flags
date_df['is_weekend'] = date_df['day_of_week'].isin([5, 6]).astype(int)
date_df['is_month_start'] = date_df['date'].dt.is_month_start.astype(int)
date_df['is_month_end'] = date_df['date'].dt.is_month_end.astype(int)
date_df['is_business_hour'] = date_df['hour'].between(9, 17).astype(int)
# Cyclical encoding for time features
# Important for algorithms that need to understand cyclical nature
date_df['month_sin'] = np.sin(2 * np.pi * date_df['month'] / 12)
date_df['month_cos'] = np.cos(2 * np.pi * date_df['month'] / 12)
date_df['hour_sin'] = np.sin(2 * np.pi * date_df['hour'] / 24)
date_df['hour_cos'] = np.cos(2 * np.pi * date_df['hour'] / 24)
date_df['day_of_week_sin'] = np.sin(2 * np.pi * date_df['day_of_week'] / 7)
date_df['day_of_week_cos'] = np.cos(2 * np.pi * date_df['day_of_week'] / 7)
# Time since reference
reference_date = pd.Timestamp('2023-01-01')
date_df['days_since_ref'] = (date_df['date'] - reference_date).dt.days
print("Date features sample:")
print(date_df.head())
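To see why the sin/cos pair matters: hour 0 and hour 23 are adjacent on the clock but 23 apart numerically, and after cyclical encoding their distance becomes small again. A quick check:
hours = np.array([0, 23])
encoded = np.column_stack([np.sin(2 * np.pi * hours / 24),
                           np.cos(2 * np.pi * hours / 24)])
raw_gap = abs(hours[0] - hours[1])                    # 23
cyclic_gap = np.linalg.norm(encoded[0] - encoded[1])  # roughly 0.26
print(f"\nRaw hour gap: {raw_gap}, cyclically encoded gap: {cyclic_gap:.3f}")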
Text Features
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# Sample text data
text_df = pd.DataFrame({
'description': [
"Excellent credit history with long employment",
"New to credit, short employment history",
"Good credit score, stable income",
"High debt, recent bankruptcy",
"Excellent income, owns multiple properties"
]
})
# Basic text features
text_df['char_count'] = text_df['description'].str.len()
text_df['word_count'] = text_df['description'].str.split().str.len()
text_df['avg_word_length'] = text_df['char_count'] / text_df['word_count']
# Count specific keywords
text_df['has_excellent'] = text_df['description'].str.contains('excellent', case=False).astype(int)
text_df['has_good'] = text_df['description'].str.contains('good', case=False).astype(int)
text_df['has_high'] = text_df['description'].str.contains('high', case=False).astype(int)
# TF-IDF features
tfidf = TfidfVectorizer(max_features=50, stop_words='english')
tfidf_matrix = tfidf.fit_transform(text_df['description'])
tfidf_df = pd.DataFrame(
tfidf_matrix.toarray(),
columns=[f'tfidf_{word}' for word in tfidf.get_feature_names_out()]
)
print("Text features:")
print(text_df[['description', 'word_count', 'has_excellent', 'has_good']].head())
Aggregation Features
For data with multiple rows per entity (e.g., transactions per customer).
# Sample transaction data
transactions = pd.DataFrame({
'customer_id': np.random.randint(1, 100, 1000),
'amount': np.random.lognormal(4, 1, 1000),
'category': np.random.choice(['food', 'entertainment', 'utilities', 'shopping'], 1000),
'date': pd.date_range('2023-01-01', periods=1000, freq='H')
})
# Aggregation features per customer
customer_features = transactions.groupby('customer_id').agg({
'amount': ['mean', 'sum', 'std', 'min', 'max', 'count'],
'category': 'nunique',
'date': ['min', 'max']
}).reset_index()
# Flatten column names
customer_features.columns = ['_'.join(col).strip('_') for col in customer_features.columns.values]
# Additional derived features
customer_features['amount_range'] = customer_features['amount_max'] - customer_features['amount_min']
customer_features['avg_transaction_value'] = customer_features['amount_sum'] / customer_features['amount_count']
customer_features['days_active'] = (customer_features['date_max'] - customer_features['date_min']).dt.days
# Category-specific aggregations
category_agg = transactions.pivot_table(
index='customer_id',
columns='category',
values='amount',
aggfunc=['sum', 'count']
).fillna(0)
category_agg.columns = ['_'.join(col) for col in category_agg.columns]
print("Customer features:")
print(customer_features.head())
Lag Features (Time Series)
# Sample time series data
ts_df = pd.DataFrame({
'date': pd.date_range('2023-01-01', periods=100, freq='D'),
'sales': np.random.randint(100, 500, 100)
})
ts_df = ts_df.sort_values('date')
# Lag features
for lag in [1, 7, 14, 30]:
ts_df[f'sales_lag_{lag}'] = ts_df['sales'].shift(lag)
# Rolling statistics
for window in [7, 14, 30]:
ts_df[f'sales_rolling_mean_{window}'] = ts_df['sales'].rolling(window=window).mean()
ts_df[f'sales_rolling_std_{window}'] = ts_df['sales'].rolling(window=window).std()
ts_df[f'sales_rolling_min_{window}'] = ts_df['sales'].rolling(window=window).min()
ts_df[f'sales_rolling_max_{window}'] = ts_df['sales'].rolling(window=window).max()
# Exponential moving average
ts_df['sales_ema_7'] = ts_df['sales'].ewm(span=7).mean()
ts_df['sales_ema_14'] = ts_df['sales'].ewm(span=14).mean()
# Difference features
ts_df['sales_diff_1'] = ts_df['sales'].diff(1)
ts_df['sales_diff_7'] = ts_df['sales'].diff(7)
# Percent change
ts_df['sales_pct_change'] = ts_df['sales'].pct_change()
print("Time series features:")
print(ts_df.dropna().head(10))
Feature Selection
Not all features are useful. Selecting the right ones improves performance and reduces complexity.
Filter Methods
Statistical tests to select features independently of the model.
from sklearn.feature_selection import (
SelectKBest, f_classif, mutual_info_classif,
VarianceThreshold
)
# Prepare data
X = df.select_dtypes(include=[np.number]).drop(['approved'], axis=1)
X = X.fillna(X.median())
y = df['approved']
# 1. Remove low variance features
variance_selector = VarianceThreshold(threshold=0.01)
X_high_var = variance_selector.fit_transform(X)
print(f"Features after variance threshold: {X_high_var.shape[1]}/{X.shape[1]}")
# 2. Select K best features using ANOVA F-test
k_best_selector = SelectKBest(score_func=f_classif, k=10)
X_k_best = k_best_selector.fit_transform(X, y)
# Get feature scores
feature_scores = pd.DataFrame({
'feature': X.columns,
'score': k_best_selector.scores_,
'pvalue': k_best_selector.pvalues_
}).sort_values('score', ascending=False)
print("\nTop 10 features by ANOVA F-score:")
print(feature_scores.head(10))
# 3. Mutual information
mi_scores = mutual_info_classif(X, y, random_state=42)
mi_df = pd.DataFrame({
'feature': X.columns,
'mi_score': mi_scores
}).sort_values('mi_score', ascending=False)
print("\nTop 10 features by Mutual Information:")
print(mi_df.head(10))
Wrapper Methods
Use model performance to select features.
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression
# Recursive Feature Elimination
model = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=model, n_features_to_select=10, step=1)
rfe.fit(X, y)
rfe_ranking = pd.DataFrame({
'feature': X.columns,
'ranking': rfe.ranking_,
'selected': rfe.support_
}).sort_values('ranking')
print("\nRFE Feature Ranking:")
print(rfe_ranking.head(15))
# RFE with Cross-Validation (finds optimal number of features)
rfecv = RFECV(estimator=model, step=1, cv=5, scoring='roc_auc', n_jobs=-1)
rfecv.fit(X, y)
print(f"\nOptimal number of features: {rfecv.n_features_}")
print(f"Best CV score: {rfecv.cv_results_['mean_test_score'].max():.4f}")
# Plot number of features vs score
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(rfecv.cv_results_['mean_test_score']) + 1),
rfecv.cv_results_['mean_test_score'])
plt.xlabel('Number of Features')
plt.ylabel('CV Score')
plt.title('RFECV: Feature Selection')
plt.grid(True)
plt.show()
Embedded Methods
Feature selection built into the learning algorithm.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LassoCV
# 1. L1 Regularization (Lasso) - shrinks some coefficients exactly to zero
# Note: LassoCV is a regressor; it runs on the 0/1 target but treats it as continuous
lasso = LassoCV(cv=5, random_state=42)
lasso.fit(X, y)
lasso_features = pd.DataFrame({
'feature': X.columns,
'coefficient': lasso.coef_
}).sort_values('coefficient', key=abs, ascending=False)
print("Lasso non-zero coefficients:")
print(lasso_features[lasso_features['coefficient'] != 0])
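# Since 'approved' is binary, an L1-penalized logistic regression is arguably the
# more natural embedded selector than LassoCV. Illustrative sketch only: C=0.1 is
# an arbitrary regularization strength, not a tuned value.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

l1_logit = make_pipeline(
    StandardScaler(),  # L1 coefficients are scale-sensitive, so standardize first
    LogisticRegression(penalty='l1', solver='liblinear', C=0.1, random_state=42)
)
l1_logit.fit(X, y)
l1_coefs = pd.Series(l1_logit.named_steps['logisticregression'].coef_[0], index=X.columns)
print("\nL1 logistic regression non-zero coefficients:")
print(l1_coefs[l1_coefs != 0].sort_values(key=abs, ascending=False))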
# 2. Tree-based feature importance
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
rf_importance = pd.DataFrame({
'feature': X.columns,
'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print("\nRandom Forest Feature Importance:")
print(rf_importance.head(10))
# Visualize
fig, ax = plt.subplots(figsize=(10, 8))
rf_importance_top = rf_importance.head(15)
ax.barh(rf_importance_top['feature'], rf_importance_top['importance'])
ax.set_xlabel('Importance')
ax.set_title('Random Forest Feature Importance')
ax.invert_yaxis()
plt.tight_layout()
plt.show()
Permutation Importance
A more reliable estimate than impurity-based importance: it is computed on held-out data and is not biased toward high-cardinality or continuous features.
from sklearn.inspection import permutation_importance
# Train model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf.fit(X_train, y_train)
# Calculate permutation importance on test set
perm_importance = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
perm_importance_df = pd.DataFrame({
'feature': X.columns,
'importance_mean': perm_importance.importances_mean,
'importance_std': perm_importance.importances_std
}).sort_values('importance_mean', ascending=False)
print("\nPermutation Importance:")
print(perm_importance_df.head(10))
# Visualize with error bars
fig, ax = plt.subplots(figsize=(10, 8))
top_features = perm_importance_df.head(15)
ax.barh(top_features['feature'], top_features['importance_mean'],
xerr=top_features['importance_std'])
ax.set_xlabel('Decrease in Accuracy')
ax.set_title('Permutation Importance')
ax.invert_yaxis()
plt.tight_layout()
plt.show()
Avoiding Data Leakage
Data leakage occurs when information that would not be available at prediction time, such as statistics computed on the test set or on the target, sneaks into the training process.
Common Sources of Leakage
# BAD: Fitting on all data before splitting
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Leakage!
X_train, X_test = train_test_split(X_scaled, test_size=0.2)
# GOOD: Fit only on training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit on train only
X_test_scaled = scaler.transform(X_test) # Transform test with train params
# BAD: Target encoding without cross-validation
df['occupation_encoded'] = df.groupby('occupation')['approved'].transform('mean') # Leakage!
# GOOD: Use cross-validation for target encoding (shown earlier)
Using Pipelines to Prevent Leakage
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
# Define preprocessing
numeric_features = ['age', 'income', 'credit_score', 'years_employed', 'loan_amount']
categorical_features = ['education', 'occupation']
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('encoder', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Full pipeline
pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Cross-validation is now leakage-free
cv_scores = cross_val_score(pipeline, data[numeric_features + categorical_features],
data['approved'], cv=5, scoring='roc_auc')
print(f"CV Score: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
Measuring Feature Impact
Before and After Comparison
from sklearn.model_selection import cross_val_score
# Baseline with original features
X_original = df[['age', 'income', 'credit_score', 'years_employed', 'loan_amount']].fillna(0)
y = df['approved']
baseline_scores = cross_val_score(
RandomForestClassifier(n_estimators=100, random_state=42),
X_original, y, cv=5, scoring='roc_auc'
)
print(f"Baseline CV Score: {baseline_scores.mean():.4f}")
# With engineered features
engineered_features = ['age', 'income', 'credit_score', 'years_employed', 'loan_amount',
'debt_to_income', 'loan_to_property', 'log_income',
'income_x_credit_score', 'age_x_years_employed']
X_engineered = df[engineered_features].fillna(0)
engineered_scores = cross_val_score(
RandomForestClassifier(n_estimators=100, random_state=42),
X_engineered, y, cv=5, scoring='roc_auc'
)
print(f"Engineered CV Score: {engineered_scores.mean():.4f}")
print(f"Improvement: {(engineered_scores.mean() - baseline_scores.mean()):.4f}")
Feature Ablation Study
def ablation_study(X, y, model, cv=5):
"""Test impact of removing each feature."""
baseline = cross_val_score(model, X, y, cv=cv, scoring='roc_auc').mean()
results = {'baseline': baseline}
for col in X.columns:
X_reduced = X.drop(columns=[col])
score = cross_val_score(model, X_reduced, y, cv=cv, scoring='roc_auc').mean()
results[f'without_{col}'] = score
return pd.DataFrame([results]).T.rename(columns={0: 'score'}).sort_values('score')
# Run ablation study
ablation_results = ablation_study(
X_engineered, y,
RandomForestClassifier(n_estimators=100, random_state=42)
)
print("\nAblation Study (lower = feature is important):")
print(ablation_results)
Best Practices
1. Document Everything
# Keep track of feature engineering decisions
feature_documentation = {
'debt_to_income': {
'description': 'Ratio of loan amount to income',
'formula': 'loan_amount / income',
'rationale': 'Key indicator of debt burden',
'created_date': '2024-01-15'
},
'log_income': {
'description': 'Log-transformed income',
'formula': 'log(1 + income)',
'rationale': 'Handle right-skewed distribution',
'created_date': '2024-01-15'
}
}
2. Version Control Your Features
# Save feature engineering code in version control
# Create reproducible pipelines
def create_features(df):
"""Create all engineered features."""
df = df.copy()
# Ratios
df['debt_to_income'] = df['loan_amount'] / df['income']
df['loan_to_property'] = df['loan_amount'] / df['property_value']
# Transformations
df['log_income'] = np.log1p(df['income'])
# Interactions
df['income_x_credit'] = df['income'] * df['credit_score']
return df
# Apply consistently to both splits (train_data and test_data are placeholders
# for your own DataFrames)
df_train = create_features(train_data)
df_test = create_features(test_data)
3. Test on Validation Data
# Always test feature impact on held-out data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
# Test a candidate feature ('new_feature' is a placeholder for whichever engineered column you are evaluating)
model_without = RandomForestClassifier(n_estimators=100, random_state=42)
model_without.fit(X_train.drop(columns=['new_feature']), y_train)
score_without = roc_auc_score(y_val, model_without.predict_proba(X_val.drop(columns=['new_feature']))[:, 1])
model_with = RandomForestClassifier(n_estimators=100, random_state=42)
model_with.fit(X_train, y_train)
score_with = roc_auc_score(y_val, model_with.predict_proba(X_val)[:, 1])
print(f"Without new feature: {score_without:.4f}")
print(f"With new feature: {score_with:.4f}")
print(f"Improvement: {score_with - score_without:.4f}")
Conclusion
Feature engineering is both an art and a science. Key takeaways:
- Understand your data before engineering features
- Handle missing values thoughtfully—they often contain information
- Choose appropriate encodings for categorical variables
- Scale features when required by your algorithm
- Create domain-relevant features that capture business logic
- Select important features to reduce noise and complexity
- Avoid data leakage by using proper cross-validation and pipelines
- Measure impact of every feature engineering decision
The best features come from combining domain expertise with systematic experimentation. Start simple, measure impact, and iterate. Often, a few well-crafted features outperform hundreds of mediocre ones.
Further Reading
- “Feature Engineering for Machine Learning” by Alice Zheng
- Kaggle winning solutions for real-world examples
- Scikit-learn documentation on preprocessing
- Category Encoders library documentation
- “Applied Machine Learning” by Max Kuhn and Kjell Johnson