autoresearch

INSTALLATION
npx skills add https://github.com/supercent-io/skills-template --skill autoresearch
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md


Human authors program.md
       │
       ▼
Agent reads program.md + train.py
       │
       ▼
Agent modifies train.py → git commit
       │
       ▼
uv run train.py  (exactly 300 seconds)
       │
       ▼
Extract val_bpb + peak_vram_mb
       │
  ┌────┴─────────┐
improved?   no improvement
  │              │
keep commit   git reset HEAD~1
  │              │
  └──────┬───────┘
         │
   log to results.tsv
         │
         ▼
     repeat ∞
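The keep-or-revert step in the loop above acts as a ratchet: the best val_bpb on the branch can only go down. A minimal Python sketch of that decision follows. The names `decide` and `ratchet` are hypothetical and not taken from the repository, and treating a missing or NaN metric as a crash is an assumption.

```python
import math
from typing import Iterable, List, Optional, Tuple


def decide(best_bpb: float, new_bpb: Optional[float]) -> str:
    """Classify one experiment (hypothetical helper, not from the repo)."""
    # Assumption: a run that produced no usable metric counts as a crash.
    if new_bpb is None or math.isnan(new_bpb):
        return "crash"
    # Lower val_bpb is better; a tie is not an improvement.
    return "keep" if new_bpb < best_bpb else "discard"


def ratchet(baseline: float, runs: Iterable[Optional[float]]) -> Tuple[float, List[str]]:
    """Replay a sequence of results and track the best val_bpb seen so far."""
    best = baseline
    statuses = []
    for bpb in runs:
        status = decide(best, bpb)
        if status == "keep":
            best = bpb  # only a strict improvement moves the reference point
        statuses.append(status)
    return best, statuses


best, statuses = ratchet(0.9979, [0.9697, 0.9821, None])
print(best, statuses)
```

In the real loop, "discard" and "crash" both end with the commit being reverted, while "keep" leaves it on the branch.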

Mutable vs. Immutable Files

| File | Agent access | Purpose |
| --- | --- | --- |
| train.py | Read + Write | Model, optimizer, training loop (~630 lines) |
| program.md | Read-only | Human research directives |
| prepare.py | Read-only | Data pipeline + evaluate_bpb() harness |
| constants.py | Read-only | TIME_BUDGET=300, MAX_SEQ_LEN, EVAL_TOKENS |
| pyproject.toml | Read-only | Locked dependencies (no new packages) |
| results.tsv | Append | All experiments: kept and discarded |
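The access rules in this table can be checked mechanically after each agent commit. The sketch below is one hypothetical way to do that. `forbidden_changes` is an illustrative name, and the list of changed paths is assumed to come from something like `git diff --name-only HEAD~1`.

```python
from typing import Iterable, List

# Files the agent may touch, per the table above.
WRITABLE = {"train.py"}
APPEND_ONLY = {"results.tsv"}
ALLOWED = WRITABLE | APPEND_ONLY


def forbidden_changes(changed_files: Iterable[str]) -> List[str]:
    """Return the changed paths that the agent was not allowed to modify."""
    return sorted(path for path in changed_files if path not in ALLOWED)


# Example: a commit that touched the training script and the data pipeline.
print(forbidden_changes(["train.py", "prepare.py"]))
```

An empty list means the commit respected the contract. This check only looks at file names, so it does not verify that results.tsv was appended to rather than rewritten.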

Instructions

Step 1: Install Prerequisites

# Install uv (fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/karpathy/autoresearch
cd autoresearch

# Install locked dependencies
uv sync

Step 2: Prepare Data (One-Time, ~2 Minutes)

# Downloads FineWeb-Edu parquet shards, trains BPE tokenizer
# Last shard is reserved for validation and is never seen during training
uv run prepare.py

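For constrained hardware, edit prepare.py once, before this first run. This is a human setup step; once experiments have started, prepare.py must stay fixed: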

# Lower MAX_SEQ_LEN for GPUs with limited VRAM
MAX_SEQ_LEN = 256   # default: 2048

Step 3: Run a Baseline Experiment

# Single 5-minute experiment to verify setup
uv run train.py > run.log 2>&1

# Extract key metrics
grep "^val_bpb:\|^peak_vram_mb:" run.log

Expected output:

val_bpb: 0.9979
peak_vram_mb: 38420
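The same extraction can be done in Python, which is convenient when a script needs the numbers rather than the text. This is a sketch that assumes the log contains lines in exactly the `name: value` form shown above. `parse_metrics` is a hypothetical helper, not part of the repository.

```python
import re
from typing import Dict

# Matches lines such as "val_bpb: 0.9979" or "peak_vram_mb: 38420".
METRIC_RE = re.compile(
    r"^(val_bpb|peak_vram_mb):\s*([0-9]+(?:\.[0-9]+)?)\s*$", re.MULTILINE
)


def parse_metrics(log_text: str) -> Dict[str, float]:
    """Return the metrics found in a training log; the last occurrence wins."""
    return {name: float(value) for name, value in METRIC_RE.findall(log_text)}


sample_log = "step 100 loss 3.21\nval_bpb: 0.9979\npeak_vram_mb: 38420\n"
print(parse_metrics(sample_log))
```

A run that crashed before evaluation yields a dictionary without `val_bpb`, which a caller can treat as a failed experiment.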

Step 4: Author program.md

program.md is the human-written research charter the agent reads at the start of every loop iteration. Write it as precise Markdown instructions:

# Research Program

## Goal
Minimize val_bpb on the FineWeb-Edu validation set within the 300-second budget.

## Current Baseline
val_bpb: 0.9979 (depth-12 GPT, Muon + AdamW optimizer)

## Directions to Explore
1. Attention variants: MLA, GQA, sliding window, local-global hybrid
2. Layer types: MoE FFN layers, SwiGLU activations
3. Optimizer tuning: Muon momentum, AdamW β values, learning rate schedule
4. Architectural depth/width tradeoffs within VRAM budget

## Constraints
- Must complete within 300 seconds
- Peak VRAM must stay under 39 GB
- No new packages (use only what is in pyproject.toml)
- Do not modify prepare.py or constants.py

## Notes from Previous Runs
- Depth-12 improvements transfer to depth-24 (scale-invariant gains)
- RoPE positional encoding outperformed learned embeddings (-0.008 val_bpb)

Effective program.md principles:

  • Be specific about what to explore — vague directives waste experiments
  • Record what has already been tried (prevents redundant experiments)
  • Note hardware constraints explicitly
  • Use the current best val_bpb as a reference point
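A quick check like the following can catch a program.md that is missing one of the sections used in the template above. The required-section list mirrors that template and is an assumption, not a format the agent enforces. `missing_sections` is a hypothetical helper.

```python
import re
from typing import List

# Section titles taken from the example program.md above (assumed, not mandated).
REQUIRED_SECTIONS = ("Goal", "Current Baseline", "Directions to Explore", "Constraints")


def missing_sections(markdown_text: str) -> List[str]:
    """Return required level-2 headings that do not appear in the text."""
    found = {
        title.strip()
        for title in re.findall(r"^##\s+(.+)$", markdown_text, re.MULTILINE)
    }
    return [title for title in REQUIRED_SECTIONS if title not in found]


draft = "# Research Program\n\n## Goal\nMinimize val_bpb.\n\n## Constraints\n- 300 seconds\n"
print(missing_sections(draft))
```

The check only confirms that headings exist. Whether the directives under them are specific enough still needs a human read.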

Step 5: Run the Autonomous Agent Loop

Point your AI agent (Claude Code, Codex, etc.) at the repository with program.md as its research context. The agent will:

  • Read program.md + current train.py
  • Hypothesize an improvement
  • Modify train.py + commit
  • Execute uv run train.py (300 seconds)
  • Extract val_bpb; keep or revert via git
  • Append to results.tsv
  • Repeat

With Claude Code (OMC):

# From inside autoresearch/

# Give Claude the context: "Run the autoresearch loop following program.md"

With Claude Code CLI directly:

claude "Follow program.md. Run autonomous research loop on train.py.
Execute: uv run train.py, extract val_bpb, keep improvements, revert failures.
Log everything to results.tsv. Do not stop until I say so."

Step 6: Monitor Results

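# Live monitoring during a run
watch -n 30 "tail -20 results.tsv"

# Count experiments per status (skip the header row)
awk -F'\t' 'NR>1 {print $4}' results.tsv | sort | uniq -c

# Show the five best kept experiments (lowest val_bpb first)
awk -F'\t' 'NR>1 && $4=="keep"' results.tsv | sort -t$'\t' -k2,2n | head -5

# List recent commits on the branch; the tip holds the current best train.py
git log --oneline -5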
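For anything beyond one-liners, the same summary can be computed in Python. The sketch below assumes results.tsv is tab-separated with a header row naming the columns commit, val_bpb, memory_gb, status, and description. `summarize` is a hypothetical helper, and the inline sample stands in for the real file contents.

```python
import csv
import io
from collections import Counter
from typing import Dict, Optional, Tuple


def summarize(tsv_text: str) -> Tuple[Counter, Optional[Dict[str, str]]]:
    """Count rows per status and return the kept row with the lowest val_bpb."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    counts = Counter(row["status"] for row in rows)
    # Only kept rows are guaranteed to carry a numeric val_bpb.
    kept = [row for row in rows if row["status"] == "keep"]
    best = min(kept, key=lambda row: float(row["val_bpb"])) if kept else None
    return counts, best


sample = (
    "commit\tval_bpb\tmemory_gb\tstatus\tdescription\n"
    "a3f2c91\t0.9697\t37.2\tkeep\tSwiGLU activation + depth-12\n"
    "b8e1d04\t0.9821\t38.1\tdiscard\tMoE 4-expert: marginal gain\n"
    "c1a5f30\tcrash\t—\tcrash\tOOM: sequence length 4096\n"
)
counts, best = summarize(sample)
print(dict(counts), best["commit"] if best else None)
```

To run it against a live session, read results.tsv into a string and pass that to `summarize`.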

Step 7: Interpret results.tsv

commit    val_bpb    memory_gb    status     description
a3f2c91   0.9697     37.2         keep       SwiGLU activation + depth-12
b8e1d04   0.9821     38.1         discard    MoE 4-expert: marginal gain
c1a5f30   crash      —            crash      OOM: sequence length 4096

| Status | Meaning |
| --- | --- |
| keep | val_bpb improved; commit retained on branch |
| discard | No improvement; git reset HEAD~1 applied |
| crash | OOM, syntax error, or timeout; always reverted |

Examples

Example 1: Overnight Run Summary

Session summary: 126 experiments, 18 improvements

Best val_bpb: 0.9697 (started: 0.9979)

Top improvements:

- SwiGLU activation: -0.012 val_bpb

- GQA with 4 KV heads: -0.009 val_bpb

- Muon momentum 0.92→0.95: -0.006 val_bpb
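The headline numbers in a summary like this reduce to a few lines of arithmetic. The sketch below recomputes them from the figures above; `session_stats` is a hypothetical helper, and the hours figure is a lower bound that ignores startup, evaluation, and agent thinking time.

```python
from typing import Dict


def session_stats(start_bpb: float, best_bpb: float, experiments: int,
                  improvements: int, time_budget_s: int = 300) -> Dict[str, float]:
    """Derive gain, keep rate, and minimum wall-clock time for a session."""
    abs_gain = start_bpb - best_bpb
    return {
        "abs_gain": abs_gain,                              # val_bpb removed overall
        "rel_gain": abs_gain / start_bpb,                  # fraction of the baseline
        "keep_rate": improvements / experiments,           # share of runs kept
        "min_hours": experiments * time_budget_s / 3600,   # training time only
    }


stats = session_stats(0.9979, 0.9697, experiments=126, improvements=18)
print(f"gain {stats['abs_gain']:.4f} ({stats['rel_gain']:.2%}), "
      f"keep rate {stats['keep_rate']:.1%}, at least {stats['min_hours']} h")
```

For this session the total gain is 0.0282 val_bpb, about one experiment in seven was kept, and 126 runs at 300 seconds each account for at least 10.5 hours of training time.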

Example 2: Low-VRAM Configuration (6GB GPU)

# In prepare.py: edit before running uv run prepare.py
MAX_SEQ_LEN = 256        # was 2048
EVAL_TOKENS = 2_097_152  # was 20_971_520 (reduced 10x)
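One way to sanity-check a scaled-down pair of values is to count how many full-length evaluation windows they imply. The sketch assumes the evaluation consumes EVAL_TOKENS in windows of MAX_SEQ_LEN tokens, which is an assumption about prepare.py rather than something stated here. `eval_windows` is a hypothetical helper.

```python
def eval_windows(eval_tokens: int, max_seq_len: int) -> int:
    """Number of complete MAX_SEQ_LEN-token windows in the evaluation budget."""
    return eval_tokens // max_seq_len


default_windows = eval_windows(20_971_520, 2048)
low_vram_windows = eval_windows(2_097_152, 256)
print(default_windows, low_vram_windows)
```

Under that assumption the default configuration evaluates 10240 windows and the low-VRAM configuration 8192, so the two settings see a similar number of sequences even though the token count drops tenfold. The resulting val_bpb values are still not comparable across the two configurations.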

Example 3: Extract Experiments by Category

# Find all attention-related experiments
grep -i "attention\|GQA\|MLA\|MHA" results.tsv

# List only kept experiments, lowest val_bpb first
awk -F'\t' '$4=="keep"' results.tsv | sort -t$'\t' -k2,2n
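Sorting by val_bpb ranks experiments by their absolute score, not by how much each one improved on the previous best. If the per-experiment gain is what you want, it can be derived from the kept rows in chronological order. The sketch assumes rows appear in results.tsv in the order they were run and that the baseline val_bpb is known. `gains` is a hypothetical helper.

```python
from typing import Iterable, List, Tuple


def gains(baseline: float, kept: Iterable[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Return (label, gain) pairs for kept runs, largest gain first.

    `kept` must be in chronological order; each gain is measured against
    the best val_bpb that existed just before that run.
    """
    previous_best = baseline
    result = []
    for label, val_bpb in kept:
        result.append((label, previous_best - val_bpb))
        previous_best = val_bpb
    return sorted(result, key=lambda item: item[1], reverse=True)


# Illustrative values only, not taken from a real session.
for label, gain in gains(1.0, [("first", 0.9), ("second", 0.85)]):
    print(f"{label}: -{gain:.3f} val_bpb")
```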

Available scripts

Run from inside the autoresearch repository directory:

| Script | Purpose | Usage |
| --- | --- | --- |
| setup.sh | One-time environment setup | bash scripts/setup.sh [--seq-len 512] |
| run-experiment.sh | Single 5-min experiment + metric extraction | bash scripts/run-experiment.sh |
| run-loop.sh | Autonomous loop: run → keep/revert → repeat | bash scripts/run-loop.sh [--max 20] |
| show-results.sh | Human-readable results.tsv report | bash scripts/show-results.sh [--top 10] |
| check-hardware.sh | GPU/CUDA/uv availability check (JSON output) | bash scripts/check-hardware.sh |

# Typical overnight session
bash scripts/check-hardware.sh
bash scripts/setup.sh --seq-len 512     # adjust for your VRAM
# Edit program.md with your research directives
bash scripts/run-loop.sh --max 100 --desc "session-1"
bash scripts/show-results.sh --kept-only

References

Detailed documentation in references/:

| File | Contents |
| --- | --- |
| references/architecture.md | System design, immutability contract, git ratcheting, key design decisions |
| references/program-md-guide.md | How to write effective program.md directives; full template + principles |
| references/hardware-config.md | VRAM settings by GPU, memory optimization techniques, troubleshooting |

Best practices

  • Write program.md before running: the agent is only as good as its directives, and vague programs waste compute
  • Run the baseline first: always run uv run train.py manually before launching the loop to confirm the setup works
  • Keep MAX_SEQ_LEN in prepare.py consistent: changing it mid-run invalidates val_bpb comparisons
  • Never modify prepare.py or constants.py once experiments have started: the evaluation harness must stay fixed for results to be comparable
  • Validate improvements at scale: confirm that a depth-12 improvement also holds at depth-24 before treating it as a fundamental gain
  • Commit program.md updates: version-control your research directives alongside results.tsv for reproducibility
  • Monitor VRAM: add a peak_vram_mb constraint to program.md that leaves headroom on your GPU
  • No new dependencies: the agent cannot pip install; it can only use what is in pyproject.toml

Hardware Requirements

| Hardware | Status | Notes |
| --- | --- | --- |
| H100 80GB | Recommended | Default config, full MAX_SEQ_LEN=2048 |
| A100 40GB | Supported | Lower MAX_SEQ_LEN if needed |
| RTX 4090 24GB | Community | Reduce MAX_SEQ_LEN to 512 |
| GTX 1660 Ti 6GB | Community fork | MAX_SEQ_LEN=256, reduced EVAL_TOKENS |
| Apple Silicon (M-series) | MLX port | Community fork; different optimizer API |
| Windows RTX | Community | WSL2 + CUDA recommended |

Key Metrics Reference

| Metric | Direction | Description |
| --- | --- | --- |
| val_bpb | Lower = better | Validation bits-per-byte; vocabulary-size-independent |
| peak_vram_mb | Lower = more headroom | Peak GPU memory during the training run |
| Experiments/hour | Higher = faster search | ~12 at TIME_BUDGET=300 |
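Bits-per-byte is commonly computed by summing the cross-entropy loss over all predicted tokens, converting from nats to bits, and dividing by the number of UTF-8 bytes in the evaluated text. Because the denominator counts bytes rather than tokens, the result does not depend on the tokenizer's vocabulary size. The sketch below shows that standard definition together with the experiments-per-hour arithmetic. Whether evaluate_bpb() in prepare.py is implemented exactly this way is an assumption, and the function names are illustrative.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)


def experiments_per_hour(time_budget_s: float, overhead_s: float = 0.0) -> float:
    """Upper bound on throughput; overhead_s covers startup and evaluation."""
    return 3600 / (time_budget_s + overhead_s)


# A model that spends exactly one bit per byte on 1000 bytes of text.
print(bits_per_byte(math.log(2) * 1000, 1000))
print(experiments_per_hour(300))
```

With a 300-second budget and no overhead the ceiling is 12 experiments per hour. Any per-run overhead lowers that figure, which is why the table gives it as approximate.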
