SKILL.md
$27
Scenario
Recommended
When to upgrade
Need interpretability
Logistic / Linear Regression
Always start here for stakeholder-facing models
Small data (< 10K rows)
Random Forest
Move to XGBoost if accuracy insufficient
Medium data, high accuracy needed
XGBoost / LightGBM
Default workhorse for tabular data
Large data, complex patterns
Neural Network
Only when tree methods plateau
Unsupervised grouping
K-Means / DBSCAN
Use silhouette score to validate k
Feature Engineering Examples
Numerical transforms:
import numpy as np, pandas as pd
def engineer_numerical(df: pd.DataFrame, col: str) -> pd.DataFrame:
return pd.DataFrame({
f'{col}_log': np.log1p(df[col]),
f'{col}_sqrt': np.sqrt(df[col].clip(lower=0)),
f'{col}_squared': df[col] ** 2,
f'{col}_binned': pd.cut(df[col], bins=5, labels=False),
})
Time-based features with cyclical encoding:
def engineer_time(df: pd.DataFrame, col: str) -> pd.DataFrame:
dt = pd.to_datetime(df[col])
return pd.DataFrame({
f'{col}_hour': dt.dt.hour,
f'{col}_dayofweek': dt.dt.dayofweek,
f'{col}_month': dt.dt.month,
f'{col}_is_weekend': dt.dt.dayofweek.isin([5, 6]).astype(int),
f'{col}_hour_sin': np.sin(2 * np.pi * dt.dt.hour / 24),
f'{col}_hour_cos': np.cos(2 * np.pi * dt.dt.hour / 24),
})
Feature selection (importance-based):
from sklearn.ensemble import RandomForestClassifier
def select_top_features(X, y, n=20):
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
importance = pd.Series(rf.feature_importances_, index=X.columns)
return importance.nlargest(n).index.tolist()
Model Evaluation
Classification:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
def evaluate_classifier(y_true, y_pred, y_proba=None) -> dict:
m = {
"accuracy": accuracy_score(y_true, y_pred),
"precision": precision_score(y_true, y_pred),
"recall": recall_score(y_true, y_pred),
"f1": f1_score(y_true, y_pred),
}
if y_proba is not None:
m["auc_roc"] = roc_auc_score(y_true, y_proba)
return m
Regression:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
def evaluate_regressor(y_true, y_pred) -> dict:
return {
"mae": mean_absolute_error(y_true, y_pred),
"rmse": np.sqrt(mean_squared_error(y_true, y_pred)),
"r2": r2_score(y_true, y_pred),
}
A/B Test Design and Analysis
Sample size calculation:
from scipy import stats
import numpy as np
def required_sample_size(baseline_rate: float, mde: float, alpha: float = 0.05, power: float = 0.8) -> int:
"""Return required N per variant. mde is relative (e.g., 0.10 = 10% lift)."""
effect = baseline_rate * mde
z_a = stats.norm.ppf(1 - alpha / 2)
z_b = stats.norm.ppf(power)
p = baseline_rate
return int(np.ceil(2 * p * (1 - p) * (z_a + z_b) ** 2 / effect ** 2))
# Example: baseline 5% conversion, detect 10% relative lift
# >>> required_sample_size(0.05, 0.10) -> ~62,214 per variant
Result analysis:
def analyze_ab(control: np.ndarray, treatment: np.ndarray, alpha: float = 0.05) -> dict:
"""Analyze A/B test with proportions z-test."""
n_c, n_t = len(control), len(treatment)
p_c, p_t = control.mean(), treatment.mean()
p_pool = (control.sum() + treatment.sum()) / (n_c + n_t)
se = np.sqrt(p_pool * (1 - p_pool) * (1/n_c + 1/n_t))
z = (p_t - p_c) / se
p_val = 2 * (1 - stats.norm.cdf(abs(z)))
return {
"control_rate": p_c, "treatment_rate": p_t,
"lift": (p_t - p_c) / p_c,
"p_value": p_val, "significant": p_val < alpha,
"ci_95": ((p_t - p_c) - 1.96 * se, (p_t - p_c) + 1.96 * se),
}
Project Template
# Data Science Project: [Name]
## Business Objective -- What problem are we solving?
## Success Metrics -- Primary: [metric]; Secondary: [metric]
## Data -- Sources, size (rows/features), time period
## Methodology -- Numbered steps
## Results
| Metric | Baseline | Model | Improvement |
|--------|----------|-------|-------------|
## Business Impact -- [Quantified impact]
## Recommendations -- [Next actions]
## Limitations -- [Known caveats]
Reference Materials
references/ml_algorithms.md-- Algorithm deep dives
references/feature_engineering.md-- Feature engineering patterns
references/experimentation.md-- A/B testing guide
references/statistics.md-- Statistical methods
Scripts
python scripts/experiment_tracker.py log --name "xgb_v2" --params '{"lr":0.1,"depth":6}' --metrics '{"f1":0.87,"auc":0.92}'
python scripts/experiment_tracker.py list --sort-by f1 --top 5
python scripts/experiment_tracker.py compare --ids 1 3 5 --json
python scripts/hypothesis_tester.py ttest --file data.csv --col-a group_a --col-b group_b
python scripts/hypothesis_tester.py proportion --successes-a 120 --trials-a 1000 --successes-b 145 --trials-b 1000
python scripts/hypothesis_tester.py chi-square --file contingency.csv --json
python scripts/feature_selector.py --file dataset.csv --target churn --top 10
python scripts/feature_selector.py --file dataset.csv --target revenue --method correlation --json
Tool Reference
Tool
Purpose
Key Flags
experiment_tracker.py
Log, list, and compare experiments with parameters, metrics, and tags in a local JSON file
log --name --params --metrics --tags, list --sort-by --top, compare --ids, --json
hypothesis_tester.py
Run statistical tests: Welch's t-test, paired t-test, proportion z-test, chi-square independence
ttest --file --col-a --col-b [--paired], proportion --successes-a --trials-a ..., chi-square --file, --json
feature_selector.py
Rank features by composite score (variance, correlation, mutual information, null rate) for a target column
--file <csv>, --target <col>, --top <n>, --method all/correlation/mutual_info, --json
Troubleshooting
Problem
Likely Cause
Resolution
Model overfits (large train-test gap in metrics)
Too many features, insufficient regularization, or data leakage
Reduce feature count with feature_selector.py, add regularization, and audit feature engineering for temporal leakage
A/B test shows significant result but tiny effect size
Large sample size makes small differences statistically significant
Always report effect size (Cohen's d) alongside p-value; use practical significance thresholds
hypothesis_tester.py p-value differs from scipy
The tool uses normal/t-distribution approximations (standard library only)
For publication-grade analysis, validate with scipy.stats; the tool is designed for fast directional estimates
Feature importance scores are near-zero for all features
Target variable has extremely low variance or the feature set lacks predictive signal
Check target distribution; consider feature engineering or collecting additional data sources
experiment_tracker.py shows experiment IDs out of order
Experiments were logged non-sequentially or the log file was manually edited
IDs are auto-incremented; use --sort-by on a metric for meaningful ordering
Chi-square test fails with "table must be at least 2x2"
CSV contingency table has fewer than 2 rows or 2 columns of numeric data
Ensure the CSV has a header row and at least 2x2 numeric cells; verify the format matches expectations
Class imbalance causes misleading accuracy
Accuracy inflated by majority class predictions
Use F1, precision-recall, or AUC-ROC instead; apply SMOTE or class weights during training
Success Criteria
- Every ML project follows the Define-Collect-Engineer-Train-Evaluate-Communicate workflow before deployment.
- Feature selection is documented:
feature_selector.pyoutput is saved with the experiment record.
- All experiments are tracked with
experiment_tracker.pyincluding parameters, metrics, and a descriptive name.
- Model evaluation reports include at least 3 metrics (e.g., F1, AUC-ROC, precision) and comparison against a baseline.
- A/B tests pre-register the hypothesis, sample size calculation, and primary metric before data collection begins.
- Statistical tests report effect size and confidence intervals, not just p-values.
- Business impact is quantified in dollar terms or user-metric terms (e.g., "reduces false positives by 30%, saving $500K/yr").
Scope & Limitations
In scope: Machine learning algorithm selection, feature engineering, model training and evaluation, A/B test design and analysis, statistical hypothesis testing, experiment tracking, and communicating results to stakeholders.
Out of scope: Model deployment to production (see ml-ops-engineer), data pipeline infrastructure, dashboard development, and real-time serving architecture.
Limitations: The Python tools use only the Python standard library. hypothesis_tester.py uses normal and t-distribution approximations that are accurate for moderate sample sizes but should be validated with scipy for edge cases (very small n, extreme skew). feature_selector.py computes approximate mutual information using binned discretization -- for high-precision feature selection, use sklearn's mutual_info_classif or permutation importance. All tools process local files and do not integrate with MLflow, W&B, or other tracking platforms.
Integration Points
- MLOps Engineer (
data-analytics/ml-ops-engineer): Trained models are handed off for production deployment, monitoring, and registry management.
- Data Analyst (
data-analytics/data-analyst): Complex analytical questions requiring predictive modeling are escalated from the analyst to the data scientist.
- Analytics Engineer (
data-analytics/analytics-engineer): Feature engineering pipelines may depend on mart models as upstream data sources.
- Product Team (
product-team/): Experiment results inform product decisions; A/B test designs are co-created with product managers.
- Engineering (
engineering/senior-ml-engineer): Algorithm implementation details and model architecture decisions bridge data science and ML engineering.