experiment-design

Name: experiment-design
Author: lingzhi227

Design experiment plans with progressive stages — initial implementation, baseline tuning, creative research, and ablation studies. Plan baselines, datasets,…

INSTALLATION

npx skills add https://github.com/lingzhi227/agent-research-skills --skill experiment-design

Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

Generates baselines, an ablation matrix, a hyperparameter grid, and a metric selection. Stdlib-only.

4-Stage Progressive Framework (from AI-Scientist-v2)

Stage 1: Initial Implementation

Focus on getting a basic working implementation

Use a simple dataset

Aim for basic functional correctness

Completion: at least one working (non-buggy) implementation

Stage 2: Baseline Tuning

Tune hyperparameters (learning rate, epochs, batch size)

Do NOT change model architecture

Test on at least TWO datasets

Completion: stable training curves, improvement over Stage 1

Stage 3: Creative Research

Explore novel improvements and insights

Be creative and think outside the box

Test on at least THREE datasets

Completion: demonstrated novel improvement

Stage 4: Ablation Studies

Systematic component analysis

Each ablation tests a different aspect

Use same datasets as Stage 3

Completion: all planned ablations done
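One way to read "each ablation tests a different aspect" is a leave-one-out matrix: one full configuration, plus one configuration per component with only that component disabled. The sketch below assumes that reading. The function name `ablation_matrix` and the `no_<component>` labels are hypothetical and are not part of the skill itself.

```python
def ablation_matrix(components):
    """Build a leave-one-out ablation plan (an assumed design, not the skill's own code).

    Returns a list of rows. The first row enables every component; each
    following row disables exactly one component and keeps the rest on.
    """
    rows = [{"name": "full", "enabled": {c: True for c in components}}]
    for removed in components:
        enabled = {c: c != removed for c in components}
        rows.append({"name": f"no_{removed}", "enabled": enabled})
    return rows


if __name__ == "__main__":
    # Component names taken from the example plan in this document.
    for row in ablation_matrix(["component_A", "component_B"]):
        print(row["name"], row["enabled"])
```

Because every row differs from the full configuration in exactly one switch, a change in a metric can be attributed to a single component, provided the datasets stay the same as in Stage 3.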

Output Format

{
  "stages": [
    {
      "name": "initial_implementation",
      "goals": ["Basic working baseline", "Simple dataset"],
      "max_iterations": 5,
      "completion_criteria": "Working implementation with non-zero accuracy"
    }
  ],
  "baselines": ["Method A", "Method B"],
  "datasets": ["Dataset1", "Dataset2", "Dataset3"],
  "metrics": ["accuracy", "F1", "inference_time"],
  "ablation_components": ["component_A", "component_B"],
  "hyperparameter_grid": {
    "lr": [1e-4, 1e-3, 1e-2],
    "batch_size": [32, 64, 128]
  },
  "num_seeds": 3
}
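A plan in this format can be expanded into concrete runs with the standard library alone. The sketch below is illustrative: `expand_grid` and `count_runs` are hypothetical helper names, and it assumes a full Cartesian sweep in which every configuration is run on every dataset with every seed. The skill may schedule runs differently.

```python
import itertools
import json

# A trimmed copy of the example plan; only the fields used here are kept.
PLAN_JSON = """
{
  "datasets": ["Dataset1", "Dataset2", "Dataset3"],
  "hyperparameter_grid": {
    "lr": [1e-4, 1e-3, 1e-2],
    "batch_size": [32, 64, 128]
  },
  "num_seeds": 3
}
"""


def expand_grid(grid):
    """Return every combination of hyperparameter values as a list of dicts."""
    keys = sorted(grid)  # fixed key order keeps the output reproducible
    value_lists = [grid[key] for key in keys]
    return [dict(zip(keys, combo)) for combo in itertools.product(*value_lists)]


def count_runs(plan):
    """Count runs for a full sweep: configurations x datasets x seeds (assumed)."""
    configs = expand_grid(plan["hyperparameter_grid"])
    return len(configs) * len(plan["datasets"]) * plan["num_seeds"]


if __name__ == "__main__":
    plan = json.loads(PLAN_JSON)
    print(len(expand_grid(plan["hyperparameter_grid"])))  # 9 configurations
    print(count_runs(plan))  # 81 runs under the full-sweep assumption
```

Counting runs before launching anything is a cheap check on whether the grid fits the compute budget; 3 learning rates and 3 batch sizes already give 81 runs across three datasets and three seeds.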

Rules

Always start simple (Stage 1) before complex experiments

Each stage builds on the best result from the previous stage

Evaluate each configuration with multiple seeds (see num_seeds) so that comparisons reflect run-to-run variance rather than a single run

Document every experiment run in notes.txt

Generate figures for training curves and comparisons
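The multi-seed and notes.txt rules can be combined in a small logging helper. This is a minimal sketch under stated assumptions: the helper names `summarize` and `log_run`, the line format, and the example scores are all hypothetical. Reporting mean and sample standard deviation shows the spread across seeds; it is not by itself a significance test.

```python
import statistics
from pathlib import Path


def summarize(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    mean = statistics.mean(scores)
    # stdev needs at least two values; report 0.0 for a single seed.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def log_run(notes_path, label, scores):
    """Append a one-line summary of a multi-seed run to the notes file."""
    mean, std = summarize(scores)
    line = f"{label}: mean={mean:.4f} std={std:.4f} n={len(scores)}\n"
    with Path(notes_path).open("a", encoding="utf-8") as handle:
        handle.write(line)
    return line


if __name__ == "__main__":
    # Made-up accuracy values for three seeds, for illustration only.
    print(log_run("notes.txt", "stage2 lr=0.001 batch_size=64", [0.81, 0.79, 0.80]), end="")
```

Opening the file in append mode means earlier entries are never overwritten, so notes.txt accumulates a chronological record of every run.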

Related Skills

Upstream: research-planning, idea-generation

Downstream: experiment-code, data-analysis

See also: paper-assembly