feature-engineering

Create and transform features using encoding, scaling, polynomial features, and domain-specific transformations for improved model performance and…

INSTALLATION
npx skills add https://github.com/aj-geddes/useful-ai-prompts --skill feature-engineering
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

Feature Engineering

Overview

Feature engineering creates and transforms features to improve model performance, interpretability, and generalization through domain knowledge and mathematical transformations.

When to Use

  • When raw features alone do not give adequate model performance
  • When dealing with categorical variables that need encoding for ML algorithms
  • When features have different scales and require normalization
  • When creating domain-specific features based on business knowledge
  • When handling skewed distributions or non-linear relationships
  • When preparing data for different types of ML algorithms with specific requirements

Engineering Techniques

  • Encoding: Converting categorical to numerical
  • Scaling: Normalizing feature ranges
  • Polynomial Features: Higher-order terms
  • Interactions: Combining features
  • Domain-specific: Business-relevant transformations
  • Temporal: Time-based features

Key Principles

  • Create features based on domain knowledge
  • Remove redundant features
  • Scale features appropriately
  • Handle categorical variables
  • Create meaningful interactions
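
The "remove redundant features" principle can be illustrated with a correlation filter. This is a minimal sketch, not a prescribed method: the helper name drop_highly_correlated, the 0.95 threshold, and the demo columns are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(features: pd.DataFrame, threshold: float = 0.95):
    """Drop one column from each pair whose absolute Pearson correlation exceeds threshold.

    Hypothetical helper: the name and the default threshold are illustrative choices.
    """
    corr = features.corr().abs()
    # Keep only the strict upper triangle so every pair is inspected exactly once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop), to_drop


# Synthetic demo data (assumption): age_in_months is an exact linear copy of age.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"age": rng.uniform(18, 80, 200)})
demo["age_in_months"] = demo["age"] * 12
demo["income"] = rng.uniform(20_000, 150_000, 200)

reduced, dropped = drop_highly_correlated(demo)
print("Dropped:", dropped)
print("Kept:", list(reduced.columns))
```

The filter keeps the first column of each correlated pair and drops the later one, so column order decides which copy survives. Pearson correlation only detects linear redundancy.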

Implementation with Python

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

from sklearn.preprocessing import (

    StandardScaler, MinMaxScaler, RobustScaler, PolynomialFeatures,

    OneHotEncoder, OrdinalEncoder, LabelEncoder

)

from sklearn.pipeline import Pipeline

from sklearn.compose import ColumnTransformer

import seaborn as sns

# Create sample dataset

np.random.seed(42)

df = pd.DataFrame({

    'age': np.random.uniform(18, 80, 1000),

    'income': np.random.uniform(20000, 150000, 1000),

    'experience_years': np.random.uniform(0, 50, 1000),

    'category': np.random.choice(['A', 'B', 'C'], 1000),

    'city': np.random.choice(['NYC', 'LA', 'Chicago'], 1000),

    'purchased': np.random.choice([0, 1], 1000),

})

print("Original Data:")

print(df.head())

df.info()  # info() prints its summary directly and returns None

# 1. Categorical Encoding

# One-Hot Encoding

print("\n1. One-Hot Encoding:")

df_ohe = pd.get_dummies(df, columns=['category', 'city'], drop_first=True)

print(df_ohe.head())

# Ordinal Encoding (default order is alphabetical; pass categories=[...] when the levels have a real order)

print("\n2. Ordinal Encoding:")

ordinal_encoder = OrdinalEncoder()

df['category_ordinal'] = ordinal_encoder.fit_transform(df[['category']]).ravel()  # flatten (n, 1) output to 1-D

print(df[['category', 'category_ordinal']].head())

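# Label Encoding (LabelEncoder is intended for target labels; prefer OrdinalEncoder or OneHotEncoder for input features)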

print("\n3. Label Encoding:")

le = LabelEncoder()

df['city_encoded'] = le.fit_transform(df['city'])

print(df[['city', 'city_encoded']].head())

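# 2. Feature Scaling (fit on the full sample here for illustration; in a real workflow fit scalers on the training split only)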

print("\n4. Feature Scaling:")

X = df[['age', 'income', 'experience_years']].copy()

# StandardScaler (mean=0, std=1)

scaler = StandardScaler()

X_standard = scaler.fit_transform(X)

# MinMaxScaler [0, 1]

minmax_scaler = MinMaxScaler()

X_minmax = minmax_scaler.fit_transform(X)

# RobustScaler (resistant to outliers)

robust_scaler = RobustScaler()

X_robust = robust_scaler.fit_transform(X)

# Visualization

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

axes[0, 0].hist(X['age'], bins=30, edgecolor='black')

axes[0, 0].set_title('Original Age')

axes[0, 1].hist(X_standard[:, 0], bins=30, edgecolor='black')

axes[0, 1].set_title('StandardScaler Age')

axes[1, 0].hist(X_minmax[:, 0], bins=30, edgecolor='black')

axes[1, 0].set_title('MinMaxScaler Age')

axes[1, 1].hist(X_robust[:, 0], bins=30, edgecolor='black')

axes[1, 1].set_title('RobustScaler Age')

plt.tight_layout()

plt.show()

# 3. Polynomial Features

print("\n5. Polynomial Features:")

X_simple = df[['age']].copy()

poly = PolynomialFeatures(degree=2, include_bias=False)

X_poly = poly.fit_transform(X_simple)

X_poly_df = pd.DataFrame(X_poly, columns=['age', 'age^2'])

print(X_poly_df.head())

# Visualization: scatter of raw Age vs Income, to check for a non-linear relationship

plt.figure(figsize=(12, 5))

plt.scatter(df['age'], df['income'], alpha=0.5)

plt.xlabel('Age')

plt.ylabel('Income')

plt.title('Age vs Income')

plt.grid(True, alpha=0.3)

plt.show()

# 4. Feature Interactions

print("\n6. Feature Interactions:")

df['age_income_interaction'] = df['age'] * df['income'] / 10000

df['age_experience_ratio'] = df['age'] / (df['experience_years'] + 1)

print(df[['age', 'income', 'age_income_interaction', 'age_experience_ratio']].head())

# 5. Domain-specific Transformations

print("\n7. Domain-specific Features:")

df['age_group'] = pd.cut(df['age'], bins=[0, 30, 45, 60, 100],

                          labels=['Young', 'Middle', 'Senior', 'Retired'])

df['income_level'] = pd.qcut(df['income'], q=3, labels=['Low', 'Medium', 'High'])

df['log_income'] = np.log1p(df['income'])

df['sqrt_experience'] = np.sqrt(df['experience_years'])

print(df[['age', 'age_group', 'income', 'income_level', 'log_income']].head())

# 6. Temporal Features (if date data available)

print("\n8. Temporal Features:")

dates = pd.date_range('2023-01-01', periods=len(df))

df['date'] = dates

df['year'] = df['date'].dt.year

df['month'] = df['date'].dt.month

df['day_of_week'] = df['date'].dt.dayofweek

df['quarter'] = df['date'].dt.quarter

df['is_weekend'] = df['date'].dt.dayofweek >= 5

print(df[['date', 'year', 'month', 'day_of_week', 'is_weekend']].head())

# 7. Feature Standardization Pipeline

print("\n9. Feature Engineering Pipeline:")

# Separate numerical and categorical features

numerical_features = ['age', 'income', 'experience_years']

categorical_features = ['category', 'city']

# Create preprocessing pipeline

preprocessor = ColumnTransformer(

    transformers=[

        ('num', StandardScaler(), numerical_features),

        ('cat', OneHotEncoder(drop='first'), categorical_features),

    ]

)

X_processed = preprocessor.fit_transform(df[numerical_features + categorical_features])

print(f"Processed shape: {X_processed.shape}")

# 8. Feature Statistics

print("\n10. Feature Statistics:")

X_for_stats = df[numerical_features].copy()

X_for_stats['category_A'] = (df['category'] == 'A').astype(int)

X_for_stats['city_NYC'] = (df['city'] == 'NYC').astype(int)

feature_stats = pd.DataFrame({

    'Feature': X_for_stats.columns,

    'Mean': X_for_stats.mean(),

    'Std': X_for_stats.std(),

    'Min': X_for_stats.min(),

    'Max': X_for_stats.max(),

    'Skewness': X_for_stats.skew(),

    'Kurtosis': X_for_stats.kurtosis(),

})

print(feature_stats)

# 9. Feature Correlations

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

X_numeric = df[numerical_features].copy()

X_numeric['purchased'] = df['purchased']

corr_matrix = X_numeric.corr()

sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, ax=axes[0])

axes[0].set_title('Feature Correlation Matrix')

# Distribution of engineered features

axes[1].hist(df['age_income_interaction'], bins=30, edgecolor='black', alpha=0.7)

axes[1].set_title('Age-Income Interaction Distribution')

axes[1].set_xlabel('Value')

axes[1].set_ylabel('Frequency')

plt.tight_layout()

plt.show()

# 10. Feature Binning / Discretization

print("\n11. Feature Binning:")

df['age_bin_equal'] = pd.cut(df['age'], bins=5)

df['age_bin_quantile'] = pd.qcut(df['age'], q=5)

df['income_bins'] = pd.cut(df['income'], bins=[0, 50000, 100000, 150000])

print("Equal Width Binning:")

print(df['age_bin_equal'].value_counts().sort_index())

print("\nEqual Frequency Binning:")

print(df['age_bin_quantile'].value_counts().sort_index())

# 11. Missing Value Creation and Handling

print("\n12. Missing Value Imputation:")

df_with_missing = df.copy()

missing_indices = np.random.choice(len(df), 50, replace=False)

df_with_missing.loc[missing_indices, 'age'] = np.nan

# Mean imputation

age_mean = df_with_missing['age'].mean()

df_with_missing['age_imputed_mean'] = df_with_missing['age'].fillna(age_mean)

# Median imputation

age_median = df_with_missing['age'].median()

df_with_missing['age_imputed_median'] = df_with_missing['age'].fillna(age_median)

# Forward fill (Series.ffill() replaces the deprecated fillna(method='ffill'); only meaningful when rows are ordered)

df_with_missing['age_imputed_ffill'] = df_with_missing['age'].ffill()

print(df_with_missing[['age', 'age_imputed_mean', 'age_imputed_median']].head(10))

print("\nFeature Engineering Complete!")

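# Count the columns the sample dataset started with, rather than subtracting a fixed number

original_features = ['age', 'income', 'experience_years', 'category', 'city', 'purchased']

print(f"Original features: {len(original_features)}")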

print(f"Final features available: {len(df.columns)}")

Best Practices

  • Understand your domain before engineering features
  • Create features that are interpretable
  • Avoid data leakage (using future or test-set information, for example fitting a scaler on the full dataset before splitting)
  • Test feature importance after engineering
  • Document all transformations
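  • Match scaling to the algorithm: distance- and gradient-based models (k-NN, SVM, linear models, neural networks) are sensitive to feature scale, while tree-based models generally are not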
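
One way to follow the data-leakage advice is to split first and wrap preprocessing and model in a single Pipeline, so that scaler statistics and encoder categories are learned from the training rows only. The sketch below makes several assumptions: synthetic data with random labels, LogisticRegression as a placeholder model, and an 80/20 split.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (assumption): column names mirror the sample dataset.
rng = np.random.default_rng(42)
n_rows = 500
data = pd.DataFrame({
    "age": rng.uniform(18, 80, n_rows),
    "income": rng.uniform(20_000, 150_000, n_rows),
    "city": rng.choice(["NYC", "LA", "Chicago"], n_rows),
})
target = rng.integers(0, 2, n_rows)  # random labels, so expect roughly chance accuracy

# Split first, so no statistic is ever computed on the held-out rows.
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=42, stratify=target
)

numeric_cols = ["age", "income"]
model = Pipeline([
    ("preprocess", ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        # handle_unknown="ignore" avoids errors on categories unseen during fit
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),  # placeholder model (assumption)
])

# fit() learns scaler means and encoder categories from the training rows only;
# score() then applies those fitted transforms to the test rows.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {test_accuracy:.2f}")
```

The same Pipeline object can be passed to cross_val_score, which refits the preprocessing inside each fold.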

Common Transformations

  • Log Transform: For skewed distributions
  • Polynomial Features: For non-linear relationships
  • Interaction Terms: For combined effects
  • Binning: For converting continuous values into discrete categories
  • Normalization: For putting features on a comparable scale
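
The effect of a log transform on a skewed feature can be checked by comparing skewness before and after. This is a small sketch on synthetic log-normal data. The distribution parameters are arbitrary assumptions chosen to resemble an income column.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "income" values (assumption: log-normal with arbitrary parameters).
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10.5, sigma=0.8, size=5000), name="income")

# log1p computes log(1 + x), which stays defined when a value is exactly zero.
log_income = np.log1p(income)

skew_before = income.skew()
skew_after = log_income.skew()
print(f"Skewness before log transform: {skew_before:.2f}")
print(f"Skewness after log transform:  {skew_after:.2f}")
```

A skewness close to zero after the transform suggests the log scale suits the feature. If the value barely changes, the feature was probably not right-skewed in the first place.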

Deliverables

  • Engineered feature dataset
  • Feature transformation documentation
  • Correlation analysis of new features
  • Distribution comparisons (before/after)
  • Feature importance rankings
  • Preprocessing pipeline code
  • Data dictionary with feature descriptions
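
The feature importance rankings listed above could be produced in several ways. The sketch below uses permutation importance on a held-out split as one option. The random forest model, the synthetic data, and the age-driven label rule are assumptions chosen so the expected ranking is easy to verify.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): the label depends on age only.
rng = np.random.default_rng(0)
n_rows = 1000
X = pd.DataFrame({
    "age": rng.uniform(18, 80, n_rows),
    "income": rng.uniform(20_000, 150_000, n_rows),
    "noise": rng.normal(size=n_rows),
})
X["log_income"] = np.log1p(X["income"])  # an engineered feature to rank alongside the raw ones
y = (X["age"] > 50).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Shuffle one column at a time on the held-out rows and measure the drop in accuracy.
result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
ranking = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(ranking)
```

Strongly correlated features such as income and log_income can share or mask each other's importance under permutation, so it may help to rank them together with the correlation analysis.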