SKILL.md

Data Cleaning and Variable Screening

Quick Start

# Run the complete data cleaning pipeline

python ".github/skills/datanalysis-credit-risk/scripts/example.py"

Complete Process Description

The data cleaning pipeline consists of the following 11 steps, each executed independently without deleting the original data:

Get Data - Load and format raw data

Organization Sample Analysis - Statistics of sample count and bad sample rate for each organization

Separate OOS Data - Separate out-of-sample (OOS) samples from modeling samples

Filter Abnormal Months - Remove months with insufficient bad sample count or total sample count

Calculate Missing Rate - Calculate overall and organization-level missing rates for each feature

Drop High Missing Rate Features - Remove features with overall missing rate exceeding threshold

Drop Low IV Features - Remove features with overall IV too low or IV too low in too many organizations

Drop High PSI Features - Remove features with unstable PSI

Null Importance Denoising - Remove noise features using label permutation method

Drop High Correlation Features - Remove high correlation features based on original gain

Export Report - Generate Excel report containing details and statistics of all steps

Core Functions

Function

Purpose

Module

get_dataset()

Load and format data

references.func

org_analysis()

Organization sample analysis

references.func

missing_check()

Calculate missing rate

references.func

drop_abnormal_ym()

Filter abnormal months

references.analysis

drop_highmiss_features()

Drop high missing rate features

references.analysis

drop_lowiv_features()

Drop low IV features

references.analysis

drop_highpsi_features()

Drop high PSI features

references.analysis

drop_highnoise_features()

Null Importance denoising

references.analysis

drop_highcorr_features()

Drop high correlation features

references.analysis

iv_distribution_by_org()

IV distribution statistics

references.analysis

psi_distribution_by_org()

PSI distribution statistics

references.analysis

value_ratio_distribution_by_org()

Value ratio distribution statistics

references.analysis

export_cleaning_report()

Export cleaning report

references.analysis

Parameter Description

Data Loading Parameters

DATA_PATH: Data file path (best are parquet format)

DATE_COL: Date column name

Y_COL: Label column name

ORG_COL: Organization column name

KEY_COLS: Primary key column name list

OOS Organization Configuration

OOS_ORGS: Out-of-sample organization list

Abnormal Month Filtering Parameters

min_ym_bad_sample: Minimum bad sample count per month (default 10)

min_ym_sample: Minimum total sample count per month (default 500)

Missing Rate Parameters

missing_ratio: Overall missing rate threshold (default 0.6)

IV Parameters

overall_iv_threshold: Overall IV threshold (default 0.1)

org_iv_threshold: Single organization IV threshold (default 0.1)

max_org_threshold: Maximum tolerated low IV organization count (default 2)

PSI Parameters

psi_threshold: PSI threshold (default 0.1)

max_months_ratio: Maximum unstable month ratio (default 1/3)

max_orgs: Maximum unstable organization count (default 6)

Null Importance Parameters

n_estimators: Number of trees (default 100)

max_depth: Maximum tree depth (default 5)

gain_threshold: Gain difference threshold (default 50)

High Correlation Parameters

max_corr: Correlation threshold (default 0.9)

top_n_keep: Keep top N features by original gain ranking (default 20)

Output Report

The generated Excel report contains the following sheets:

汇总 - Summary information of all steps, including operation results and conditions

机构样本统计 - Sample count and bad sample rate for each organization

分离OOS数据 - OOS sample and modeling sample counts

Step4-异常月份处理 - Abnormal months that were removed

缺失率明细 - Overall and organization-level missing rates for each feature

Step5-有值率分布统计 - Distribution of features in different value ratio ranges

Step6-高缺失率处理 - High missing rate features that were removed

Step7-IV明细 - IV values of each feature in each organization and overall

Step7-IV处理 - Features that do not meet IV conditions and low IV organizations

Step7-IV分布统计 - Distribution of features in different IV ranges

Step8-PSI明细 - PSI values of each feature in each organization each month

Step8-PSI处理 - Features that do not meet PSI conditions and unstable organizations

Step8-PSI分布统计 - Distribution of features in different PSI ranges

Step9-null importance处理 - Noise features that were removed

Step10-高相关性剔除 - High correlation features that were removed

Features

Interactive Input: Parameters can be input before each step execution, with default values supported

Independent Execution: Each step is executed independently without deleting original data, facilitating comparative analysis

Complete Report: Generate complete Excel report containing details, statistics, and distributions

Multi-process Support: IV and PSI calculations support multi-process acceleration

Organization-level Analysis: Support organization-level statistics and modeling/OOS distinction

datanalysis-credit-risk

SKILL.md

Data Cleaning and Variable Screening

Quick Start

Complete Process Description

Core Functions

Parameter Description

Data Loading Parameters

OOS Organization Configuration

Abnormal Month Filtering Parameters

Missing Rate Parameters

IV Parameters

PSI Parameters

Null Importance Parameters

High Correlation Parameters

Output Report

Features

Let your agent run on any real-world website

Related skills

Stop writing automation&scrapers