SKILL.md
Pandas Best Practices
Expert guidelines for Pandas development, focusing on data manipulation, analysis, and efficient DataFrame operations.
Code Style and Structure
- Write concise, technical responses with accurate Python examples
- Prioritize reproducibility in data analysis workflows
- Use functional programming; avoid unnecessary classes
- Prefer vectorized operations over explicit loops
- Use descriptive variable names reflecting data content
- Follow PEP 8 style guidelines
DataFrame Creation and I/O
- Use
pd.read_csv(),pd.read_excel(),pd.read_json()with appropriate parameters
- Specify
dtypeparameter to ensure correct data types on load
- Use
parse_datesfor automatic datetime parsing
- Set
index_colwhen the data has a natural index column
- Use
chunksizefor reading large files incrementally
Data Selection
- Use
.loc[]for label-based indexing
- Use
.iloc[]for integer position-based indexing
- Avoid chained indexing (e.g.,
df['col'][0]) - use.locor.ilocinstead
- Use boolean indexing for conditional selection:
df[df['col'] > value]
- Use
.query()method for complex filtering conditions
Method Chaining
- Prefer method chaining for data transformations when possible
- Use
.pipe()for applying custom functions in a chain
- Chain operations like
.assign(),.query(),.groupby(),.agg()
- Keep chains readable by breaking across multiple lines
Data Cleaning and Validation
Missing Data
- Check for missing data with
.isna()and.info()
- Handle missing data appropriately:
.fillna(),.dropna(), or imputation
- Use
pd.NAfor nullable integer and boolean types
- Document decisions about missing data handling
Data Quality Checks
- Implement data quality checks at the beginning of analysis
- Validate data types with
.dtypesand convert as needed
- Check for duplicates with
.duplicated()and handle appropriately
- Use
.describe()for quick statistical overview
Type Conversion
- Use
.astype()for explicit type conversion
- Use
pd.to_datetime()for date parsing
- Use
pd.to_numeric()witherrors='coerce'for safe numeric conversion
- Utilize categorical data types for low-cardinality string columns
Grouping and Aggregation
GroupBy Operations
- Use
.groupby()for efficient aggregation operations
- Specify aggregation functions with
.agg()for multiple operations
- Use named aggregation for clearer output column names
- Consider
.transform()for broadcasting results back to original shape
Pivot Tables and Reshaping
- Use
.pivot_table()for multi-dimensional aggregation
- Use
.melt()to convert wide to long format
- Use
.pivot()to convert long to wide format
- Use
.stack()and.unstack()for hierarchical index manipulation
Performance Optimization
Memory Efficiency
- Use categorical data types for low-cardinality strings
- Downcast numeric types when appropriate
- Use
pd.eval()and.eval()for large expression evaluation
Computation Speed
- Use vectorized operations instead of
.apply()with row-wise functions
- Prefer built-in aggregation functions over custom ones
- Use
.valuesor.to_numpy()for NumPy operations when faster
Avoiding Common Pitfalls
- Avoid iterating with
.iterrows()- use vectorized operations
- Don't modify DataFrames while iterating
- Be aware of SettingWithCopyWarning - use
.copy()when needed
- Avoid growing DataFrames row by row - collect in list and create once
Time Series Operations
- Use
DatetimeIndexfor time series data
- Leverage
.resample()for time-based aggregation
- Use
.shift()and.diff()for lag operations
- Use
.rolling()and.expanding()for window calculations
Merging and Joining
- Use
.merge()for SQL-style joins
- Specify
howparameter: 'inner', 'outer', 'left', 'right'
- Use
validateparameter to check join cardinality
- Use
.concat()for stacking DataFrames
Key Conventions
- Import as
import pandas as pd
- Use
snake_casefor column names when possible
- Document data sources and transformations
- Keep notebooks reproducible with clear cell execution order