SKILL.md
$28
Build and run the benchmarks to get a baseline measurement. Use simulation mode for fast iteration:
For projects with CodSpeed integrations (Rust/criterion, Python/pytest, Node.js/vitest, etc.):
# Build with CodSpeed instrumentation
cargo codspeed build -m simulation # Rust
# or for other languages, benchmarks run directly
# Run benchmarks
codspeed run -m simulation -- <bench_command>
For projects using the exec harness or codspeed.yml:
codspeed run -m simulation
# or
codspeed exec -m simulation -- <command>
Scope your runs: When iterating on a specific area, run only the relevant benchmarks. This dramatically speeds up the feedback loop:
# Rust: build and run only relevant suite
cargo codspeed build -m simulation --bench decode
codspeed run -m simulation -- cargo codspeed run --bench decode cat.jpg
# codspeed.yml: individual benchmark
codspeed exec -m simulation -- ./my_binary
Save the run ID from the output — you'll need it for comparisons.
Step 2: Analyze with flamegraphs
Use the CodSpeed MCP tools to understand where time is spent:
-
List runs to find your baseline run ID:
- Use
list_runswith appropriate filters (branch, event type)
-
Query flamegraphs on the hottest benchmarks:
- Use
query_flamegraphwith the run ID and benchmark name
- Start with
depth_limit: 5to get the big picture
- Use
root_function_nameto zoom into hot subtrees
- Look for:
- Functions with high self time (these are the actual bottlenecks)
- Instruction-bound vs cache-bound vs memory-bound breakdown
- Unexpected functions appearing high in the profile (redundant work, unnecessary abstractions)
-
Identify optimization targets: Rank functions by self time. The top 2-3 are your targets. Consider:
- Can this computation be avoided entirely?
- Can the algorithm be improved (O(n) vs O(n^2))?
- Are there unnecessary allocations in hot loops?
- Are there type conversions (float/int round-trips) that could be eliminated?
- Could data layout be improved for cache locality?
- Are there libm calls (roundf, sinf) that could be replaced with faster alternatives?
- Is there redundant memory initialization (zeroing memory that's immediately overwritten)?
Step 3: Make targeted changes
Apply optimizations one at a time. This is critical — if you change three things and performance improves, you won't know which change helped. If it regresses, you won't know which one hurt.
Important constraints:
- Only change code you've read and understood
- Preserve correctness — run existing tests after each change
- Keep changes minimal and focused
- Don't over-engineer — the simplest fix that works is the best fix
Common optimization patterns by bottleneck type:
- Instruction-bound: Algorithmic improvements, loop unrolling, removing redundant computations, SIMD
- Cache-bound: Improve data locality, reduce struct size, use contiguous memory, avoid pointer chasing
- Memory-bound: Reduce allocations, reuse buffers, avoid unnecessary copies, use stack allocation
- System-call-bound: Batch I/O, reduce file operations, buffer writes (note: simulation mode doesn't measure syscalls, use walltime for these)
Step 4: Re-measure and compare
After each change, rebuild and rerun the relevant benchmarks:
# Rebuild and rerun (scoped to what you changed)
cargo codspeed build -m simulation --bench <suite>
codspeed run -m simulation -- cargo codspeed run --bench <suite>
Then compare against the baseline using the MCP tools:
- Use
compare_runswithbase_run_id(baseline) andhead_run_id(after your change)
- Check for:
- Improvements in your target benchmarks
- Regressions in other benchmarks (shared code paths can affect unrelated benchmarks)
- The magnitude of the change — is it significant?
Step 5: Report and decide next steps
When you find a significant improvement (>5% on target benchmarks with no regressions), pause and tell the user:
- What you changed and why
- The before/after numbers from
compare_runs
- What the flamegraph showed as the bottleneck
- What further optimizations you see as possible next steps
Then ask if they want you to continue optimizing or if they're satisfied.
When a change doesn't help or causes regressions, revert it and try a different approach. Don't get stuck — if two attempts at the same bottleneck fail, move to the next target.
Step 6: Validate with walltime
Before finalizing any optimization, always validate with walltime benchmarks. Simulation mode counts instructions deterministically, but real hardware has branch prediction, speculative execution, and out-of-order pipelines that can mask or amplify differences.
# Build for walltime
cargo codspeed build -m walltime # Rust with cargo-codspeed
# or just run directly for other setups
# Run with walltime
codspeed run -m walltime -- <bench_command>
# or
codspeed exec -m walltime -- <command>
Then compare the walltime run against a walltime baseline using compare_runs.
Patterns that often show up in simulation but NOT walltime:
- Iterator adapter overhead (e.g.,
.take(n)to[..n]) — branch prediction hides it
- Bounds check elimination — hardware speculates past them
- Trivial arithmetic simplifications — hidden by out-of-order execution
Patterns that reliably help in both modes:
- Avoiding type conversions in hot loops (float/integer round-trips)
- Eliminating libm calls (roundf, sinf — these are software routines)
- Skipping redundant memory initialization
- Algorithmic improvements (reducing overall work)
If a simulation improvement doesn't show up in walltime, strongly consider reverting it — the added code complexity isn't worth a phantom improvement.
Step 7: Continue or finish
If the user wants more optimization, go back to Step 2 with fresh flamegraphs from your latest run. The profile will have shifted now that you've addressed the top bottleneck, revealing new targets.
Keep iterating until:
- The user says they're satisfied
- The flamegraph shows no clear bottleneck (time is spread evenly)
- Remaining optimizations would require architectural changes the user hasn't approved
- You've hit diminishing returns (<1-2% improvement per change)
Language-specific notes
Rust
- Use
cargo codspeed build -m <mode>to build,cargo codspeed runto run
--bench <name>selects specific benchmark suites (matching[[bench]]targets in Cargo.toml)
- Positional filter after
cargo codspeed runmatches benchmark names (e.g.,cargo codspeed run cat.jpg)
- Frameworks: criterion, divan, bencher (all work with cargo-codspeed)
Python
- Uses pytest-codspeed:
codspeed run -m simulation -- pytest --codspeed
- Framework: pytest-benchmark compatible
Node.js
- Frameworks: vitest (
@codspeed/vitest-plugin), tinybench v5 (@codspeed/tinybench-plugin), benchmark.js (@codspeed/benchmark.js-plugin)
- Run via:
codspeed run -m simulation -- npx vitest bench(or equivalent)
Go
- Built-in:
codspeed run -m simulation -- go test -bench .
- No special packages needed — CodSpeed instruments
go test -benchdirectly
C/C++
- Uses Google Benchmark with valgrind-codspeed
- Build with CMake, run benchmarks via
codspeed run
Any language (exec harness)
- Use
codspeed exec -m <mode> -- <command>for any executable
- Or define benchmarks in
codspeed.ymland usecodspeed run
- No code changes required — CodSpeed instruments the binary externally
MCP tools reference
You have access to these CodSpeed MCP tools:
- **
list_runs**: Find run IDs. Filter by branch, event type. Use this to find your baseline and latest runs.
- **
compare_runs**: Compare two runs. Shows improvements, regressions, new/missing benchmarks with formatted values. This is your primary tool for measuring impact.
- **
query_flamegraph**: Inspect where time is spent. Parameters:
run_id: which run to look at
benchmark_name: full benchmark URI
depth_limit: call tree depth (default 5, max 20)
root_function_name: re-root at a specific function to zoom in
- **
list_repositories**: Find the repository slug if needed
- **
get_run**: Get details about a specific run
Guiding principles
- Everything goes through CodSpeed. Never run benchmarks outside of the CodSpeed CLI. Never quote timing numbers from raw benchmark output. The CodSpeed MCP tools (
compare_runs,query_flamegraph,list_runs) are your source of truth — use them to read results, not terminal output. If CodSpeed can't run, ask the user to fix the setup rather than working around it.
- Measure first, optimize second. Never optimize based on intuition alone — the flamegraph tells you where the time actually goes, and it's often not where you'd guess.
- One change at a time. Isolated changes make it clear what helped and what didn't.
- Correctness over speed. Always run tests. A fast but broken program is useless.
- Simulation for iteration, walltime for validation. Simulation is deterministic and fast for feedback. Walltime is the ground truth. Both run through CodSpeed.
- Know when to stop. Diminishing returns are real. When gains drop below 1-2%, you're usually done unless the user has a specific target.
- Be transparent. Show the user your reasoning, the numbers, and the tradeoffs. Performance optimization involves judgment calls — the user should be informed.