nanochat-llm-training

Train your own GPT-2 level LLM for under $100 using nanochat, Karpathy's minimal hackable harness covering tokenization, pretraining, finetuning, evaluation,…

INSTALLATION
npx skills add https://github.com/aradotso/trending-skills --skill nanochat-llm-training
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

$27

Key Commands

Full GPT-2 Speedrun (8×H100 node, ~2–3 hours, ~$48)

# Run the reference pipeline: data download, pretraining, SFT, eval, chat

bash runs/speedrun.sh

Pretraining (distributed)

OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \

    --depth=26 \

    --run="d26_run" \

    --model-tag="d26"

Pretraining (single GPU)

python -m scripts.base_train -- \

    --depth=26 \

    --run="d26_single"

Quick Research Iteration (~5 min, GPT-1 scale)

OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \

    --depth=12 \

    --run="d12_exp" \

    --model-tag="d12" \

    --core-metric-every=999999 \

    --sample-every=-1 \

    --save-every=-1

CPU / Apple Silicon (tiny model, ~minutes)

bash runs/runcpu.sh

Serve Chat UI

# After training completes

source .venv/bin/activate

python -m scripts.chat_web

# Visit http://<your-server-ip>:8000/

CLI Chat

python -m scripts.chat_cli -p "hello"

Scaling Laws / Miniseries

bash runs/scaling_laws.sh   # sweep depths for scaling law data

bash runs/miniseries.sh     # train full compute-optimal miniseries

The Depth Dial

The single most important parameter. Everything else is derived automatically:

--depth

Approximate model scale

Notes

6–8

Tiny (toy)

CPU/MPS feasible

12

GPT-1 size

~5 min on 8×H100, great for research iteration

16

Medium

~15 min on 8×H100

24–26

GPT-2 size

~2 hrs on 8×H100, ~$48

# Smaller/faster experiments

python -m scripts.base_train -- --depth=12 --run="quick_test"

# Full GPT-2 grade

torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26 --run="gpt2_repro"

Precision / dtype Configuration

nanochat uses explicit dtype management via COMPUTE_DTYPE in nanochat/common.py. No torch.amp.autocast.

Hardware

Default

Override

CUDA SM 80+ (A100, H100)

bfloat16

NANOCHAT_DTYPE=float32

CUDA SM < 80 (V100, T4)

float32

NANOCHAT_DTYPE=float16

CPU / MPS

float32

# Force fp32 for inference

NANOCHAT_DTYPE=float32 python -m scripts.chat_cli -p "hello"

# Force bf16 for training

NANOCHAT_DTYPE=bfloat16 torchrun --nproc_per_node=8 -m scripts.base_train

# float16 training (enables GradScaler automatically)

NANOCHAT_DTYPE=float16 torchrun --nproc_per_node=8 -m scripts.base_train

How it works: Weights stored in fp32 (optimizer precision), custom Linear casts to COMPUTE_DTYPE in forward pass, embeddings stored directly in COMPUTE_DTYPE to save memory.

Key Python Modules

nanochat/

├── gpt.py              # GPT nn.Module Transformer

├── engine.py           # Inference with KV Cache

├── dataloader.py       # Tokenizing Distributed Data Loader

├── dataset.py          # Download/read utils for pretraining data

├── optim.py            # AdamW + Muon optimizer (1GPU and distributed)

├── core_eval.py        # DCLM CORE score evaluation

├── loss_eval.py        # Bits-per-byte evaluation

├── checkpoint_manager.py  # Save/Load checkpoints

├── common.py           # Utilities, COMPUTE_DTYPE

├── execution.py        # Python code execution tool for LLM

└── engine.py           # Efficient KV-cache inference

scripts/

├── base_train.py       # Pretraining entry point

├── chat_web.py         # Web chat UI server

└── chat_cli.py         # CLI chat interface

runs/

├── speedrun.sh         # Reference full pipeline (GPT-2 speedrun)

├── scaling_laws.sh     # Scaling law sweeps

├── miniseries.sh       # Full compute-optimal miniseries

└── runcpu.sh           # CPU/MPS example

Real Code Examples

Load and Run Inference on a Trained Model

import torch

from nanochat.gpt import GPT

from nanochat.engine import InferenceEngine

from nanochat.checkpoint_manager import CheckpointManager

# Load checkpoint

ckpt_manager = CheckpointManager("checkpoints/d26")

model, config = ckpt_manager.load()

model.eval()

# Run inference with KV cache

engine = InferenceEngine(model)

output = engine.generate(

    prompt="Once upon a time",

    max_new_tokens=200,

    temperature=0.8,

    top_p=0.95,

)

print(output)

Custom Training Script with Depth Dial

import subprocess

def train_model(depth: int, run_name: str, nproc: int = 8):

    """Launch a compute-optimal training run for given depth."""

    cmd = [

        "torchrun",

        "--standalone",

        f"--nproc_per_node={nproc}",

        "-m", "scripts.base_train",

        "--",

        f"--depth={depth}",

        f"--run={run_name}",

        f"--model-tag={run_name}",

    ]

    subprocess.run(cmd, env={"OMP_NUM_THREADS": "1", **__import__("os").environ})

# Quick research iteration

train_model(depth=12, run_name="my_experiment_d12")

# Full GPT-2 grade

train_model(depth=26, run_name="my_gpt2_repro")

Adjust Device Batch Size for Lower VRAM

# Default device_batch_size=32 needs ~80GB VRAM per GPU

# Reduce for smaller GPUs (gradient accumulation handles the rest)

torchrun --standalone --nproc_per_node=4 -m scripts.base_train -- \

    --depth=12 \

    --device_batch_size=16 \

    --run="low_vram_run"

# Even smaller

python -m scripts.base_train -- \

    --depth=8 \

    --device_batch_size=4 \

    --run="single_gpu_small"

Monitoring Key Metrics in wandb

# nanochat logs to wandb automatically. Key metrics to watch:

# - val_bpb: validation loss in bits-per-byte (vocab-size-invariant)

#   as a function of step, total_training_time, total_training_flops

# - core_metric: DCLM CORE score (target > 0.2565 to beat GPT-2)

# - train/mfu: Model FLOPS utilization

# - train/tok_per_sec: Training throughput

# Set wandb project via env var before training

import os

os.environ["WANDB_PROJECT"] = "my-nanochat-runs"

Synthetic Data for SFT Personality

# dev/gen_synthetic_data.py — generate identity/personality data

# Then mix into SFT stage per the guide:

# https://github.com/karpathy/nanochat/discussions/139

# Example: generate data and point SFT to it

python dev/gen_synthetic_data.py --output data/identity_sft.jsonl

# Then reference in your SFT script configuration

Common Patterns

Research Iteration Loop

# 1. Make a code change in nanochat/

# 2. Run quick d12 to validate

OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \

    --depth=12 --run="test_my_change" \

    --core-metric-every=999999 --sample-every=-1 --save-every=-1

# 3. Check wandb: val_bpb vs step/time/flops

# 4. If promising, test at d16 or d26

FP8 Training (H100 only, for speedrun)

# FP8 is used in the speedrun for additional speedup

# See runs/speedrun.sh for the exact invocation

bash runs/speedrun.sh

Evaluate CORE Score Only

python -m nanochat.core_eval --checkpoint checkpoints/d26/latest

Serve on Lambda / Remote Machine

# On remote machine after training:

source .venv/bin/activate

python -m scripts.chat_web

# Access via: http://<PUBLIC_IP>:8000/

# Use `screen` or `tmux` to keep alive

screen -S nanochat

python -m scripts.chat_web

# Ctrl+A, D to detach

Troubleshooting

OOM / Out of VRAM

# Reduce --device_batch_size (default 32)

# Code uses gradient accumulation to maintain effective batch size

--device_batch_size=16   # Try 16, 8, 4, 2, 1

Single GPU is 8× Slower

This is expected. Omit torchrun and use python -m scripts.base_train directly. Gradient accumulation kicks in automatically to maintain equivalent total batch size.

Running on Non-CUDA Hardware

# MPS (Apple Silicon) or CPU — use runcpu.sh as template

bash runs/runcpu.sh

# Results will be weak; this is for development/debugging only

float16 Gradient Underflow

# nanochat auto-enables GradScaler when NANOCHAT_DTYPE=float16

NANOCHAT_DTYPE=float16 torchrun --nproc_per_node=8 -m scripts.base_train -- --depth=12

# Note: RL scripts do NOT support float16 (SFT and base_train do)

V100 / T4 (SM < 80) — No bf16

# Default falls back to float32; optionally use float16

NANOCHAT_DTYPE=float16 torchrun --nproc_per_node=8 -m scripts.base_train -- --depth=12

Chat UI Not Accessible

# Ensure the port (default 8000) is open in your cloud provider's firewall/security group

# Use the public IP, not localhost:

# http://<PUBLIC_IP>:8000/

Resources

  • Discord: #nanochat channel on Karpathy's Discord
  • Leaderboard docs: dev/LEADERBOARD.md
BrowserAct

Let your agent run on any real-world website

Bypass CAPTCHA & anti-bot for free. Start local, scale to cloud.

Explore BrowserAct Skills →

Stop writing automation&scrapers

Install the CLI. Run your first Skill in 30 seconds. Scale when you're ready.

Start free
free · no credit card