Feature Engineering: The Key to Better ML Models
Master the art of feature engineering with practical techniques to transform raw data into powerful features that dramatically improve model performance.
Feature engineering is often the difference between a mediocre model and a world-class one. While algorithms and hyperparameters matter, the features you create and select have the biggest impact on model performance. In this comprehensive guide, we’ll explore practical techniques that can transform your raw data into powerful predictive features.
Why Feature Engineering Matters
“Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering.” — Andrew Ng
Raw data rarely works well directly with machine learning algorithms. Good features:
- Capture relevant patterns that algorithms can learn from
- Reduce noise and irrelevant information
- Make relationships explicit that would be hard for algorithms to discover
- Enable simpler models to achieve better performance
The Feature Engineering Workflow
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
# Typical workflow
# 1. Understand your data
# 2. Handle missing values
# 3. Encode categorical variables
# 4. Create new features
# 5. Scale/normalize
# 6. Select important features
# 7. Evaluate impact
Understanding Your Data
Before engineering features, you need to understand what you have.
Exploratory Data Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load sample data
np.random.seed(42)
n_samples = 1000
data = pd.DataFrame({
'age': np.random.randint(18, 80, n_samples),
'income': np.random.lognormal(10.5, 0.5, n_samples),
'education': np.random.choice(['high_school', 'bachelor', 'master', 'phd'], n_samples),
'occupation': np.random.choice(['engineer', 'doctor', 'teacher', 'artist', 'manager'], n_samples),
'years_employed': np.random.randint(0, 40, n_samples),
'credit_score': np.random.normal(700, 50, n_samples).clip(300, 850),
'num_credit_cards': np.random.randint(0, 10, n_samples),
'loan_amount': np.random.lognormal(10, 0.8, n_samples),
'property_value': np.random.lognormal(12, 0.5, n_samples),
'approved': np.random.randint(0, 2, n_samples)
})
# Add some missing values
data.loc[np.random.choice(data.index, 50), 'income'] = np.nan
data.loc[np.random.choice(data.index, 30), 'credit_score'] = np.nan
data.loc[np.random.choice(data.index, 20), 'years_employed'] = np.nan
# Basic info
print("Dataset Shape:", data.shape)
print("\nColumn Types:")
print(data.dtypes)
print("\nMissing Values:")
print(data.isnull().sum())
print("\nNumeric Summary:")
print(data.describe())
# Check distributions
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
numeric_cols = ['age', 'income', 'credit_score', 'loan_amount', 'property_value', 'years_employed']
for i, col in enumerate(numeric_cols):
ax = axes[i // 3, i % 3]
data[col].hist(bins=30, ax=ax, edgecolor='black')
ax.set_title(f'Distribution of {col}')
ax.set_xlabel(col)
plt.tight_layout()
plt.show()
# Check correlations
correlation_matrix = data.select_dtypes(include=[np.number]).corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()
Handling Missing Values
Missing data is common in real-world datasets. How you handle it affects model performance.
Strategies for Missing Values
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Create a copy for experiments
df = data.copy()
# 1. Simple Statistics-Based Imputation
# Mean imputation (for normally distributed data)
df['income_mean'] = df['income'].fillna(df['income'].mean())
# Median imputation (robust to outliers)
df['income_median'] = df['income'].fillna(df['income'].median())
# Mode imputation (best suited to categorical or discrete data; applied to
# credit_score here purely for illustration)
df['credit_score_mode'] = df['credit_score'].fillna(df['credit_score'].mode()[0])
# 2. Group-Based Imputation
# Fill based on related groups
df['income_by_education'] = df.groupby('education')['income'].transform(
lambda x: x.fillna(x.median())
)
df['income_by_occupation'] = df.groupby('occupation')['income'].transform(
lambda x: x.fillna(x.median())
)
# 3. Forward/Backward Fill (for time series)
# df['value_ffill'] = df['value'].ffill()
# df['value_bfill'] = df['value'].bfill()
# 4. Interpolation
# df['value_interpolate'] = df['value'].interpolate(method='linear')
# 5. KNN Imputation
numeric_cols = ['age', 'income', 'credit_score', 'years_employed']
knn_imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = knn_imputer.fit_transform(df[numeric_cols])
# 6. Iterative Imputation (MICE-like)
iterative_imputer = IterativeImputer(max_iter=10, random_state=42)
# df[numeric_cols] = iterative_imputer.fit_transform(df[numeric_cols])
print("Missing values after imputation:")
print(df.isnull().sum())
Creating Missing Indicators
# Sometimes the fact that a value is missing is informative
df['income_missing'] = data['income'].isnull().astype(int)
df['credit_score_missing'] = data['credit_score'].isnull().astype(int)
df['years_employed_missing'] = data['years_employed'].isnull().astype(int)
# Check if missingness predicts the target
print("\nMissingness correlation with target:")
missing_cols = ['income_missing', 'credit_score_missing', 'years_employed_missing']
for col in missing_cols:
correlation = df[col].corr(df['approved'])
print(f"{col}: {correlation:.4f}")
Encoding Categorical Variables
Most machine learning algorithms require numeric input (a few, such as LightGBM and CatBoost, handle categories natively). Here’s how to convert categories.
One-Hot Encoding
Best for nominal categories (no inherent order).
from sklearn.preprocessing import OneHotEncoder
# Method 1: Pandas get_dummies
df_encoded = pd.get_dummies(df, columns=['education', 'occupation'], drop_first=True)
print("Shape after one-hot encoding:", df_encoded.shape)
# Method 2: Sklearn OneHotEncoder (better for pipelines)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
categorical_cols = ['education', 'occupation']
encoded_features = encoder.fit_transform(df[categorical_cols])
encoded_df = pd.DataFrame(
encoded_features,
columns=encoder.get_feature_names_out(categorical_cols)
)
print("\nEncoded column names:")
print(encoded_df.columns.tolist())
Ordinal Encoding
For categories with a natural order.
from sklearn.preprocessing import OrdinalEncoder
# Define the order
education_order = ['high_school', 'bachelor', 'master', 'phd']
# Custom ordinal mapping
education_map = {edu: i for i, edu in enumerate(education_order)}
df['education_ordinal'] = df['education'].map(education_map)
# Or use OrdinalEncoder
ordinal_encoder = OrdinalEncoder(categories=[education_order])
df['education_ordinal_v2'] = ordinal_encoder.fit_transform(df[['education']])
print("Education encoding:")
print(df[['education', 'education_ordinal']].drop_duplicates().sort_values('education_ordinal'))
Target Encoding
Replace categories with target statistics. Useful for high-cardinality features.
from sklearn.model_selection import KFold
def target_encode(df, column, target, n_splits=5):
"""
Target encode a categorical column using K-fold to prevent leakage.
"""
df = df.copy()
df['encoded'] = np.nan
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df):
# Calculate encoding from training fold
encoding = df.loc[train_idx].groupby(column)[target].mean()
# Apply to validation fold
df.loc[val_idx, 'encoded'] = df.loc[val_idx, column].map(encoding)
# Handle unseen categories with global mean
global_mean = df[target].mean()
df['encoded'] = df['encoded'].fillna(global_mean)
return df['encoded']
# Apply target encoding
df['occupation_target_encoded'] = target_encode(df, 'occupation', 'approved')
df['education_target_encoded'] = target_encode(df, 'education', 'approved')
print("\nTarget encoding for occupation:")
occupation_stats = df.groupby('occupation').agg({
'approved': 'mean',
'occupation_target_encoded': 'mean'  # encodings vary across CV folds, so average them
}).round(4)
print(occupation_stats)
Frequency Encoding
Replace categories with their frequency.
def frequency_encode(df, column):
"""Encode categories by their frequency in the dataset."""
frequency = df[column].value_counts(normalize=True)
return df[column].map(frequency)
df['occupation_frequency'] = frequency_encode(df, 'occupation')
df['education_frequency'] = frequency_encode(df, 'education')
print("\nFrequency encoding:")
print(df[['occupation', 'occupation_frequency']].drop_duplicates())
Binary Encoding
Represents each category’s integer code in binary, so k categories need only about log2(k) columns, which makes it efficient for high-cardinality features.
# Requires the third-party category_encoders package (pip install category_encoders)
import category_encoders as ce
# Binary encoding
binary_encoder = ce.BinaryEncoder(cols=['occupation'])
df_binary = binary_encoder.fit_transform(df[['occupation']])
print("\nBinary encoding columns:", df_binary.columns.tolist())
Feature Scaling
Many algorithms require features to be on similar scales.
Standardization (Z-score)
Centers data around 0 with unit variance.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
numeric_features = ['age', 'income', 'credit_score', 'years_employed', 'loan_amount']
# Fit only on training data!
X_train, X_test = train_test_split(df[numeric_features], test_size=0.2, random_state=42)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Before scaling:")
print(f" Mean: {X_train.mean().values}")
print(f" Std: {X_train.std().values}")
print("\nAfter scaling:")
print(f" Mean: {X_train_scaled.mean(axis=0)}")
print(f" Std: {X_train_scaled.std(axis=0)}")
Min-Max Normalization
Scales to a fixed range (usually 0-1).
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler(feature_range=(0, 1))
X_train_minmax = minmax_scaler.fit_transform(X_train)
print("\nMin-Max scaling:")
print(f" Min: {X_train_minmax.min(axis=0)}")
print(f" Max: {X_train_minmax.max(axis=0)}")
Robust Scaling
Robust to outliers using median and IQR.
from sklearn.preprocessing import RobustScaler
robust_scaler = RobustScaler()
X_train_robust = robust_scaler.fit_transform(X_train)
# Compare on data with outliers
outlier_data = np.array([[100], [200], [300], [400], [10000]])
print("\nRobust vs Standard scaling on outliers:")
print(f" Standard: {StandardScaler().fit_transform(outlier_data).flatten()}")
print(f" Robust: {RobustScaler().fit_transform(outlier_data).flatten()}")
When to Scale
| Algorithm | Scaling Needed? |
|---|---|
| Linear Regression | Helpful |
| Logistic Regression | Yes |
| SVM | Yes |
| KNN | Yes |
| Decision Trees | No |
| Random Forest | No |
| Gradient Boosting | No |
| Neural Networks | Yes |
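A quick way to see the table in action is to compare a distance-based model with a tree on the same unscaled columns. The sketch below reuses the synthetic df from earlier with an arbitrary three-feature subset; because the synthetic target is random, both models hover near 0.5, so the point is not the absolute score but that the tree’s results are essentially identical with and without scaling (splits are invariant to monotonic rescaling) while KNN’s change once income stops dominating the distance calculation.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

scale_demo_cols = ['age', 'income', 'credit_score']
X_demo = df[scale_demo_cols]
y_demo = df['approved']

for name, estimator in [('KNN', KNeighborsClassifier()),
                        ('Decision Tree', DecisionTreeClassifier(random_state=42))]:
    # Same estimator evaluated on raw features and inside a scaling pipeline
    raw = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring='roc_auc').mean()
    scaled = cross_val_score(make_pipeline(StandardScaler(), estimator),
                             X_demo, y_demo, cv=5, scoring='roc_auc').mean()
    print(f"{name}: raw={raw:.4f}, scaled={scaled:.4f}")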
Creating New Features
This is where domain knowledge shines. Here are common techniques.
Mathematical Combinations
# Ratios
df['debt_to_income'] = df['loan_amount'] / df['income']
df['loan_to_property'] = df['loan_amount'] / df['property_value']
df['income_per_year_employed'] = df['income'] / (df['years_employed'] + 1)
# Products (interactions)
df['income_x_credit_score'] = df['income'] * df['credit_score']
df['age_x_years_employed'] = df['age'] * df['years_employed']
# Sums and differences
df['total_assets'] = df['income'] + df['property_value']
df['age_at_first_job'] = df['age'] - df['years_employed']
# Polynomial features
df['income_squared'] = df['income'] ** 2
df['credit_score_squared'] = df['credit_score'] ** 2
# Log transformations (for skewed data)
df['log_income'] = np.log1p(df['income'])
df['log_loan_amount'] = np.log1p(df['loan_amount'])
df['log_property_value'] = np.log1p(df['property_value'])
# Square root (another way to handle skew)
df['sqrt_income'] = np.sqrt(df['income'])
print("New feature statistics:")
new_features = ['debt_to_income', 'loan_to_property', 'log_income']
print(df[new_features].describe())
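If you want interactions and polynomial terms generated systematically rather than one at a time, scikit-learn’s PolynomialFeatures can expand a set of base columns into all pairwise products. A minimal sketch, restricted to three base columns to keep the output readable:
from sklearn.preprocessing import PolynomialFeatures

base_cols = ['age', 'income', 'credit_score']
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
poly_features = poly.fit_transform(df[base_cols])
poly_df = pd.DataFrame(poly_features,
                       columns=poly.get_feature_names_out(base_cols),
                       index=df.index)
print("\nGenerated interaction columns:", poly_df.columns.tolist())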
Binning Continuous Variables
# Equal-width bins
df['age_bin'] = pd.cut(df['age'], bins=5, labels=['very_young', 'young', 'middle', 'senior', 'elderly'])
# Quantile-based bins (equal frequency)
df['income_quantile'] = pd.qcut(df['income'], q=4, labels=['low', 'medium', 'high', 'very_high'])
# Custom bins based on domain knowledge
credit_bins = [0, 580, 670, 740, 800, 900]
credit_labels = ['poor', 'fair', 'good', 'very_good', 'excellent']
df['credit_category'] = pd.cut(df['credit_score'], bins=credit_bins, labels=credit_labels)
# Binary flags
df['is_senior'] = (df['age'] >= 60).astype(int)
df['has_excellent_credit'] = (df['credit_score'] >= 750).astype(int)
df['high_debt_ratio'] = (df['debt_to_income'] > 0.4).astype(int)
print("\nBinning results:")
print(df['credit_category'].value_counts())
Date and Time Features
# Create sample date data
date_df = pd.DataFrame({
'transaction_date': pd.date_range('2023-01-01', periods=1000, freq='H')
})
date_df['date'] = date_df['transaction_date']
# Extract components
date_df['year'] = date_df['date'].dt.year
date_df['month'] = date_df['date'].dt.month
date_df['day'] = date_df['date'].dt.day
date_df['day_of_week'] = date_df['date'].dt.dayofweek
date_df['day_of_year'] = date_df['date'].dt.dayofyear
date_df['week_of_year'] = date_df['date'].dt.isocalendar().week
date_df['hour'] = date_df['date'].dt.hour
date_df['minute'] = date_df['date'].dt.minute
date_df['quarter'] = date_df['date'].dt.quarter
# Binary flags
date_df['is_weekend'] = date_df['day_of_week'].isin([5, 6]).astype(int)
date_df['is_month_start'] = date_df['date'].dt.is_month_start.astype(int)
date_df['is_month_end'] = date_df['date'].dt.is_month_end.astype(int)
date_df['is_business_hour'] = date_df['hour'].between(9, 17).astype(int)
# Cyclical encoding for time features
# Important for algorithms that need to understand cyclical nature
date_df['month_sin'] = np.sin(2 * np.pi * date_df['month'] / 12)
date_df['month_cos'] = np.cos(2 * np.pi * date_df['month'] / 12)
date_df['hour_sin'] = np.sin(2 * np.pi * date_df['hour'] / 24)
date_df['hour_cos'] = np.cos(2 * np.pi * date_df['hour'] / 24)
date_df['day_of_week_sin'] = np.sin(2 * np.pi * date_df['day_of_week'] / 7)
date_df['day_of_week_cos'] = np.cos(2 * np.pi * date_df['day_of_week'] / 7)
# Time since reference
reference_date = pd.Timestamp('2023-01-01')
date_df['days_since_ref'] = (date_df['date'] - reference_date).dt.days
print("Date features sample:")
print(date_df.head())
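To see why the sin/cos pair matters: hour 0 and hour 23 are adjacent on the clock but 23 apart numerically, and after cyclical encoding their distance becomes small again. A quick check:
hours = np.array([0, 23])
encoded = np.column_stack([np.sin(2 * np.pi * hours / 24),
                           np.cos(2 * np.pi * hours / 24)])
raw_gap = abs(hours[0] - hours[1])                    # 23
cyclic_gap = np.linalg.norm(encoded[0] - encoded[1])  # roughly 0.26
print(f"\nRaw hour gap: {raw_gap}, cyclically encoded gap: {cyclic_gap:.3f}")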
Text Features
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# Sample text data
text_df = pd.DataFrame({
'description': [
"Excellent credit history with long employment",
"New to credit, short employment history",
"Good credit score, stable income",
"High debt, recent bankruptcy",
"Excellent income, owns multiple properties"
]
})
# Basic text features
text_df['char_count'] = text_df['description'].str.len()
text_df['word_count'] = text_df['description'].str.split().str.len()
text_df['avg_word_length'] = text_df['char_count'] / text_df['word_count']
# Count specific keywords
text_df['has_excellent'] = text_df['description'].str.contains('excellent', case=False).astype(int)
text_df['has_good'] = text_df['description'].str.contains('good', case=False).astype(int)
text_df['has_high'] = text_df['description'].str.contains('high', case=False).astype(int)
# TF-IDF features
tfidf = TfidfVectorizer(max_features=50, stop_words='english')
tfidf_matrix = tfidf.fit_transform(text_df['description'])
tfidf_df = pd.DataFrame(
tfidf_matrix.toarray(),
columns=[f'tfidf_{word}' for word in tfidf.get_feature_names_out()]
)
print("Text features:")
print(text_df[['description', 'word_count', 'has_excellent', 'has_good']].head())
Aggregation Features
For data with multiple rows per entity (e.g., transactions per customer).
# Sample transaction data
transactions = pd.DataFrame({
'customer_id': np.random.randint(1, 100, 1000),
'amount': np.random.lognormal(4, 1, 1000),
'category': np.random.choice(['food', 'entertainment', 'utilities', 'shopping'], 1000),
'date': pd.date_range('2023-01-01', periods=1000, freq='H')
})
# Aggregation features per customer
customer_features = transactions.groupby('customer_id').agg({
'amount': ['mean', 'sum', 'std', 'min', 'max', 'count'],
'category': 'nunique',
'date': ['min', 'max']
}).reset_index()
# Flatten column names
customer_features.columns = ['_'.join(col).strip('_') for col in customer_features.columns.values]
# Additional derived features
customer_features['amount_range'] = customer_features['amount_max'] - customer_features['amount_min']
customer_features['avg_transaction_value'] = customer_features['amount_sum'] / customer_features['amount_count']
customer_features['days_active'] = (customer_features['date_max'] - customer_features['date_min']).dt.days
# Category-specific aggregations
category_agg = transactions.pivot_table(
index='customer_id',
columns='category',
values='amount',
aggfunc=['sum', 'count']
).fillna(0)
category_agg.columns = ['_'.join(col) for col in category_agg.columns]
print("Customer features:")
print(customer_features.head())
Lag Features (Time Series)
# Sample time series data
ts_df = pd.DataFrame({
'date': pd.date_range('2023-01-01', periods=100, freq='D'),
'sales': np.random.randint(100, 500, 100)
})
ts_df = ts_df.sort_values('date')
# Lag features
for lag in [1, 7, 14, 30]:
ts_df[f'sales_lag_{lag}'] = ts_df['sales'].shift(lag)
# Rolling statistics
for window in [7, 14, 30]:
ts_df[f'sales_rolling_mean_{window}'] = ts_df['sales'].rolling(window=window).mean()
ts_df[f'sales_rolling_std_{window}'] = ts_df['sales'].rolling(window=window).std()
ts_df[f'sales_rolling_min_{window}'] = ts_df['sales'].rolling(window=window).min()
ts_df[f'sales_rolling_max_{window}'] = ts_df['sales'].rolling(window=window).max()
# Exponential moving average
ts_df['sales_ema_7'] = ts_df['sales'].ewm(span=7).mean()
ts_df['sales_ema_14'] = ts_df['sales'].ewm(span=14).mean()
# Difference features
ts_df['sales_diff_1'] = ts_df['sales'].diff(1)
ts_df['sales_diff_7'] = ts_df['sales'].diff(7)
# Percent change
ts_df['sales_pct_change'] = ts_df['sales'].pct_change()
print("Time series features:")
print(ts_df.dropna().head(10))
Feature Selection
Not all features are useful. Selecting the right ones improves performance and reduces complexity.
Filter Methods
Statistical tests to select features independently of the model.
from sklearn.feature_selection import (
SelectKBest, f_classif, mutual_info_classif,
VarianceThreshold
)
# Prepare data
X = df.select_dtypes(include=[np.number]).drop(['approved'], axis=1)
X = X.fillna(X.median())
y = df['approved']
# 1. Remove low variance features
variance_selector = VarianceThreshold(threshold=0.01)
X_high_var = variance_selector.fit_transform(X)
print(f"Features after variance threshold: {X_high_var.shape[1]}/{X.shape[1]}")
# 2. Select K best features using ANOVA F-test
k_best_selector = SelectKBest(score_func=f_classif, k=10)
X_k_best = k_best_selector.fit_transform(X, y)
# Get feature scores
feature_scores = pd.DataFrame({
'feature': X.columns,
'score': k_best_selector.scores_,
'pvalue': k_best_selector.pvalues_
}).sort_values('score', ascending=False)
print("\nTop 10 features by ANOVA F-score:")
print(feature_scores.head(10))
# 3. Mutual information
mi_scores = mutual_info_classif(X, y, random_state=42)
mi_df = pd.DataFrame({
'feature': X.columns,
'mi_score': mi_scores
}).sort_values('mi_score', ascending=False)
print("\nTop 10 features by Mutual Information:")
print(mi_df.head(10))
Wrapper Methods
Use model performance to select features.
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression
# Recursive Feature Elimination
model = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=model, n_features_to_select=10, step=1)
rfe.fit(X, y)
rfe_ranking = pd.DataFrame({
'feature': X.columns,
'ranking': rfe.ranking_,
'selected': rfe.support_
}).sort_values('ranking')
print("\nRFE Feature Ranking:")
print(rfe_ranking.head(15))
# RFE with Cross-Validation (finds optimal number of features)
rfecv = RFECV(estimator=model, step=1, cv=5, scoring='roc_auc', n_jobs=-1)
rfecv.fit(X, y)
print(f"\nOptimal number of features: {rfecv.n_features_}")
print(f"Best CV score: {rfecv.cv_results_['mean_test_score'].max():.4f}")
# Plot number of features vs score
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(rfecv.cv_results_['mean_test_score']) + 1),
rfecv.cv_results_['mean_test_score'])
plt.xlabel('Number of Features')
plt.ylabel('CV Score')
plt.title('RFECV: Feature Selection')
plt.grid(True)
plt.show()
Embedded Methods
Feature selection built into the learning algorithm.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LassoCV
# 1. L1 Regularization (Lasso) - shrinks some coefficients exactly to zero
# Note: LassoCV is a regressor; it runs on the 0/1 target but treats it as continuous
lasso = LassoCV(cv=5, random_state=42)
lasso.fit(X, y)
lasso_features = pd.DataFrame({
'feature': X.columns,
'coefficient': lasso.coef_
}).sort_values('coefficient', key=abs, ascending=False)
print("Lasso non-zero coefficients:")
print(lasso_features[lasso_features['coefficient'] != 0])
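# Since 'approved' is binary, an L1-penalized logistic regression is arguably the
# more natural embedded selector than LassoCV. Illustrative sketch only: C=0.1 is
# an arbitrary regularization strength, not a tuned value.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

l1_logit = make_pipeline(
    StandardScaler(),  # L1 coefficients are scale-sensitive, so standardize first
    LogisticRegression(penalty='l1', solver='liblinear', C=0.1, random_state=42)
)
l1_logit.fit(X, y)
l1_coefs = pd.Series(l1_logit.named_steps['logisticregression'].coef_[0], index=X.columns)
print("\nL1 logistic regression non-zero coefficients:")
print(l1_coefs[l1_coefs != 0].sort_values(key=abs, ascending=False))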
# 2. Tree-based feature importance
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
rf_importance = pd.DataFrame({
'feature': X.columns,
'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print("\nRandom Forest Feature Importance:")
print(rf_importance.head(10))
# Visualize
fig, ax = plt.subplots(figsize=(10, 8))
rf_importance_top = rf_importance.head(15)
ax.barh(rf_importance_top['feature'], rf_importance_top['importance'])
ax.set_xlabel('Importance')
ax.set_title('Random Forest Feature Importance')
ax.invert_yaxis()
plt.tight_layout()
plt.show()
Permutation Importance
A more reliable estimate than impurity-based importance: it is computed on held-out data and is not biased toward high-cardinality or continuous features.
from sklearn.inspection import permutation_importance
# Train model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf.fit(X_train, y_train)
# Calculate permutation importance on test set
perm_importance = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
perm_importance_df = pd.DataFrame({
'feature': X.columns,
'importance_mean': perm_importance.importances_mean,
'importance_std': perm_importance.importances_std
}).sort_values('importance_mean', ascending=False)
print("\nPermutation Importance:")
print(perm_importance_df.head(10))
# Visualize with error bars
fig, ax = plt.subplots(figsize=(10, 8))
top_features = perm_importance_df.head(15)
ax.barh(top_features['feature'], top_features['importance_mean'],
xerr=top_features['importance_std'])
ax.set_xlabel('Decrease in Accuracy')
ax.set_title('Permutation Importance')
ax.invert_yaxis()
plt.tight_layout()
plt.show()
Avoiding Data Leakage
Data leakage occurs when information that would not be available at prediction time, such as statistics computed on the test set or on the target, sneaks into the training process.
Common Sources of Leakage
# BAD: Fitting on all data before splitting
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Leakage!
X_train, X_test = train_test_split(X_scaled, test_size=0.2)
# GOOD: Fit only on training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit on train only
X_test_scaled = scaler.transform(X_test) # Transform test with train params
# BAD: Target encoding without cross-validation
df['occupation_encoded'] = df.groupby('occupation')['approved'].transform('mean') # Leakage!
# GOOD: Use cross-validation for target encoding (shown earlier)
Using Pipelines to Prevent Leakage
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
# Define preprocessing
numeric_features = ['age', 'income', 'credit_score', 'years_employed', 'loan_amount']
categorical_features = ['education', 'occupation']
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('encoder', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Full pipeline
pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Cross-validation is now leakage-free
cv_scores = cross_val_score(pipeline, data[numeric_features + categorical_features],
data['approved'], cv=5, scoring='roc_auc')
print(f"CV Score: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
Measuring Feature Impact
Before and After Comparison
from sklearn.model_selection import cross_val_score
# Baseline with original features
X_original = df[['age', 'income', 'credit_score', 'years_employed', 'loan_amount']].fillna(0)
y = df['approved']
baseline_scores = cross_val_score(
RandomForestClassifier(n_estimators=100, random_state=42),
X_original, y, cv=5, scoring='roc_auc'
)
print(f"Baseline CV Score: {baseline_scores.mean():.4f}")
# With engineered features
engineered_features = ['age', 'income', 'credit_score', 'years_employed', 'loan_amount',
'debt_to_income', 'loan_to_property', 'log_income',
'income_x_credit_score', 'age_x_years_employed']
X_engineered = df[engineered_features].fillna(0)
engineered_scores = cross_val_score(
RandomForestClassifier(n_estimators=100, random_state=42),
X_engineered, y, cv=5, scoring='roc_auc'
)
print(f"Engineered CV Score: {engineered_scores.mean():.4f}")
print(f"Improvement: {(engineered_scores.mean() - baseline_scores.mean()):.4f}")
Feature Ablation Study
def ablation_study(X, y, model, cv=5):
"""Test impact of removing each feature."""
baseline = cross_val_score(model, X, y, cv=cv, scoring='roc_auc').mean()
results = {'baseline': baseline}
for col in X.columns:
X_reduced = X.drop(columns=[col])
score = cross_val_score(model, X_reduced, y, cv=cv, scoring='roc_auc').mean()
results[f'without_{col}'] = score
return pd.DataFrame([results]).T.rename(columns={0: 'score'}).sort_values('score')
# Run ablation study
ablation_results = ablation_study(
X_engineered, y,
RandomForestClassifier(n_estimators=100, random_state=42)
)
print("\nAblation Study (lower = feature is important):")
print(ablation_results)
Best Practices
1. Document Everything
# Keep track of feature engineering decisions
feature_documentation = {
'debt_to_income': {
'description': 'Ratio of loan amount to income',
'formula': 'loan_amount / income',
'rationale': 'Key indicator of debt burden',
'created_date': '2024-01-15'
},
'log_income': {
'description': 'Log-transformed income',
'formula': 'log(1 + income)',
'rationale': 'Handle right-skewed distribution',
'created_date': '2024-01-15'
}
}
2. Version Control Your Features
# Save feature engineering code in version control
# Create reproducible pipelines
def create_features(df):
"""Create all engineered features."""
df = df.copy()
# Ratios
df['debt_to_income'] = df['loan_amount'] / df['income']
df['loan_to_property'] = df['loan_amount'] / df['property_value']
# Transformations
df['log_income'] = np.log1p(df['income'])
# Interactions
df['income_x_credit'] = df['income'] * df['credit_score']
return df
# Apply consistently to both splits (train_data and test_data are placeholders
# for your own DataFrames)
df_train = create_features(train_data)
df_test = create_features(test_data)
3. Test on Validation Data
# Always test feature impact on held-out data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
# Test a candidate feature ('new_feature' is a placeholder for whichever engineered column you are evaluating)
model_without = RandomForestClassifier(n_estimators=100, random_state=42)
model_without.fit(X_train.drop(columns=['new_feature']), y_train)
score_without = roc_auc_score(y_val, model_without.predict_proba(X_val.drop(columns=['new_feature']))[:, 1])
model_with = RandomForestClassifier(n_estimators=100, random_state=42)
model_with.fit(X_train, y_train)
score_with = roc_auc_score(y_val, model_with.predict_proba(X_val)[:, 1])
print(f"Without new feature: {score_without:.4f}")
print(f"With new feature: {score_with:.4f}")
print(f"Improvement: {score_with - score_without:.4f}")
Conclusion
Feature engineering is both an art and a science. Key takeaways:
- Understand your data before engineering features
- Handle missing values thoughtfully—they often contain information
- Choose appropriate encodings for categorical variables
- Scale features when required by your algorithm
- Create domain-relevant features that capture business logic
- Select important features to reduce noise and complexity
- Avoid data leakage by using proper cross-validation and pipelines
- Measure impact of every feature engineering decision
The best features come from combining domain expertise with systematic experimentation. Start simple, measure impact, and iterate. Often, a few well-crafted features outperform hundreds of mediocre ones.
Further Reading
- “Feature Engineering for Machine Learning” by Alice Zheng
- Kaggle winning solutions for real-world examples
- Scikit-learn documentation on preprocessing
- Category Encoders library documentation
- “Applied Machine Learning” by Max Kuhn and Kjell Johnson