```bash
# Or LLaMA-3 8B training
./examples/llama/train_llama3_8b_fp8.sh
```
## Common workflows
### Workflow 1: Train LLaMA-style model with 3D parallelism
Copy this checklist:
LLaMA Training Setup:
- [ ] Step 1: Choose parallelism configuration
- [ ] Step 2: Configure training hyperparameters
- [ ] Step 3: Launch distributed training
- [ ] Step 4: Monitor performance metrics
**Step 1: Choose parallelism configuration**
Model size determines parallelism strategy:
| Model Size | GPUs | Tensor Parallel | Pipeline Parallel | Data Parallel | Context Parallel |
|------------|------|-----------------|-------------------|---------------|------------------|
| 7B | 8 | 1 | 1 | 8 | 1 |
| 13B | 8 | 2 | 1 | 4 | 1 |
| 70B | 64 | 4 | 4 | 4 | 1 |
| 405B | 128 | 8 | 8 | 1 | 2 |
**Step 2: Configure training hyperparameters**
```bash
#!/bin/bash
# train_llama_70b.sh

GPUS_PER_NODE=8
NNODES=8  # 64 GPUs total
TP=4      # Tensor parallel
PP=4      # Pipeline parallel
CP=1      # Context parallel

# LLaMA 70B configuration
MODEL_SIZE=70  # Billion parameters
HIDDEN_SIZE=8192
NUM_LAYERS=80
NUM_HEADS=64
SEQ_LENGTH=4096

# Training hyperparameters
MICRO_BATCH=1
GLOBAL_BATCH=1024
LR=3e-4

torchrun \
  --nproc_per_node=$GPUS_PER_NODE \
  --nnodes=$NNODES \
  pretrain_gpt.py \
  --tensor-model-parallel-size $TP \
  --pipeline-model-parallel-size $PP \
  --context-parallel-size $CP \
  --sequence-parallel \
  --num-layers $NUM_LAYERS \
  --hidden-size $HIDDEN_SIZE \
  --num-attention-heads $NUM_HEADS \
  --seq-length $SEQ_LENGTH \
  --max-position-embeddings $SEQ_LENGTH \
  --micro-batch-size $MICRO_BATCH \
  --global-batch-size $GLOBAL_BATCH \
  --lr $LR \
  --train-iters 100000 \
  --lr-decay-style cosine \
  --lr-warmup-iters 2000 \
  --weight-decay 0.1 \
  --clip-grad 1.0 \
  --bf16 \
  --use-mcore-models \
  --transformer-impl transformer_engine \
  --data-path /path/to/data \
  --vocab-file /path/to/vocab.json \
  --merge-file /path/to/merges.txt
```
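The batch settings interact with the data-parallel size: each optimizer step is split into micro-batches that are accumulated before the weights update. The sketch below assumes the relationship global batch = micro-batch × DP × number of micro-batches; the function name `num_microbatches` is hypothetical and not part of Megatron.

```python
def num_microbatches(global_batch, micro_batch, dp):
    """Micro-batches each model replica processes per optimizer step.

    Assumes global_batch = micro_batch * dp * num_microbatches.
    """
    samples_per_pass = micro_batch * dp
    if global_batch % samples_per_pass != 0:
        raise ValueError("global_batch must be divisible by micro_batch * dp")
    return global_batch // samples_per_pass


# 70B settings: GLOBAL_BATCH=1024, MICRO_BATCH=1, DP=4 (64 GPUs / (TP=4 * PP=4))
steps_70b = num_microbatches(global_batch=1024, micro_batch=1, dp=4)
print(steps_70b)
```

A large micro-batch count per step also keeps the pipeline stages busy, which matters once PP > 1.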
**Step 3: Launch distributed training**
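```bash
# Single node (8 GPUs): first set NNODES=1 and choose TP and PP so that TP × PP ≤ 8
bash train_llama_70b.sh

# Multi-node with SLURM: the batch script must start torchrun on every node
# (for example via srun) with shared rendezvous settings
sbatch --nodes=8 --gpus-per-node=8 train_llama_70b.sh
```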
**Step 4: Monitor performance metrics**
Key metrics to track:
- **Model FLOPs Utilization (MFU)**: Target >40% on H100
- **Throughput**: Tokens/sec/GPU
- **Memory usage**: <80 GB per GPU for the 70B model
- **Loss**: Should decrease steadily
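MFU can be estimated from the measured throughput. The sketch below uses the common approximation of 6 FLOPs per parameter per token for a forward and backward pass, which ignores attention FLOPs. The function name `estimate_mfu`, the 1000 tokens/sec/GPU figure, and the 989 TFLOPS dense BF16 peak for H100 are assumptions for illustration; substitute the numbers for your hardware and logs.

```python
def estimate_mfu(num_params, tokens_per_sec_per_gpu, peak_flops_per_gpu):
    """Approximate Model FLOPs Utilization as achieved / peak FLOPs per GPU.

    Uses ~6 FLOPs per parameter per token (forward + backward),
    ignoring the attention term.
    """
    achieved_flops = 6 * num_params * tokens_per_sec_per_gpu
    return achieved_flops / peak_flops_per_gpu


H100_BF16_PEAK = 989e12  # assumed dense BF16 peak, FLOPs/sec

# Hypothetical measurement: 70B model at 1000 tokens/sec/GPU
mfu = estimate_mfu(70e9, 1000, H100_BF16_PEAK)
print(f"MFU = {mfu:.1%}")
```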
### Workflow 2: Configure Mixture of Experts (MoE) training
For sparse MoE models like Mixtral.
MoE Training:
- [ ] Step 1: Configure expert parallelism
- [ ] Step 2: Set MoE hyperparameters
- [ ] Step 3: Launch training with EP
**Step 1: Configure expert parallelism**
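```bash
# Mixtral 8x7B example
TENSOR_PARALLEL=2
PIPELINE_PARALLEL=1
EXPERT_PARALLEL=4  # Split 8 experts across 4 GPUs
DATA_PARALLEL=4

# Expert-parallel groups are formed from the data-parallel ranks (EP must divide DP),
# so EP does not multiply the GPU count.
TOTAL_GPUS=$((TENSOR_PARALLEL * PIPELINE_PARALLEL * DATA_PARALLEL))
# = 2 * 1 * 4 = 8 GPUs
```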
**Step 2: Set MoE hyperparameters**
```bash
# Sequence parallelism is enabled because Megatron expects it when MoE is combined with TP > 1.
torchrun \
  --nproc_per_node=8 \
  pretrain_gpt.py \
  --tensor-model-parallel-size 2 \
  --pipeline-model-parallel-size 1 \
  --expert-model-parallel-size 4 \
  --sequence-parallel \
  --num-experts 8 \
  --moe-router-topk 2 \
  --moe-router-load-balancing-type aux_loss \
  --moe-aux-loss-coeff 0.01 \
  --hidden-size 4096 \
  --num-layers 32 \
  --num-attention-heads 32 \
  --seq-length 4096 \
  --max-position-embeddings 4096 \
  --bf16 \
  --use-mcore-models \
  --transformer-impl transformer_engine \
  --data-path /path/to/data \
  --vocab-file /path/to/vocab.json \
  --merge-file /path/to/merges.txt
```
**Step 3: Launch training with EP**
Expert parallelism distributes different experts across GPUs, reducing memory while maintaining capacity.
```
Without EP: 8 experts × ~7B = ~56B expert parameters per GPU
With EP=4:  2 experts × ~7B = ~14B expert parameters per GPU
Savings:    75% less expert parameter memory per GPU
```
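The arithmetic above can be reproduced for other expert counts and EP degrees. This is a minimal sketch in terms of parameter counts, using the same simplification of roughly 7B parameters per expert; it ignores the shared attention weights and any further sharding from tensor parallelism. The function name `expert_params_per_gpu` is hypothetical.

```python
def expert_params_per_gpu(num_experts, params_per_expert, expert_parallel):
    """Expert parameters held by one GPU when experts are split across EP ranks."""
    if num_experts % expert_parallel != 0:
        raise ValueError("num_experts must be divisible by expert_parallel")
    return (num_experts // expert_parallel) * params_per_expert


without_ep = expert_params_per_gpu(8, 7e9, expert_parallel=1)
with_ep = expert_params_per_gpu(8, 7e9, expert_parallel=4)
saving = 1 - with_ep / without_ep

print(f"without EP: {without_ep / 1e9:.0f}B, with EP=4: {with_ep / 1e9:.0f}B")
print(f"saving: {saving:.0%}")
```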
### Workflow 3: Optimize for maximum throughput
Goal: approach the 47% MFU that Megatron-Core reports as its best case on H100 clusters.
Performance Optimization:
- [ ] Step 1: Enable optimizations
- [ ] Step 2: Use FP8 precision (H100)
- [ ] Step 3: Optimize micro-batch size
- [ ] Step 4: Tune parallelism degrees
**Step 1: Enable optimizations**
```bash
--use-mcore-models                     # Use Megatron Core models
--transformer-impl transformer_engine  # Use Transformer Engine
--sequence-parallel                    # Reduce activation memory (use with TP)
```
**Step 2: Use FP8 precision (H100 only)**
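```bash
--fp8-format hybrid  # FP8 mixed precision training (older releases: --fp8-hybrid)
# Transformer Engine handles FP8 scaling automatically
```

Result: up to 1.5-2x speedup over BF16 on H100, depending on model size and sequence length.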
**Step 3: Optimize micro-batch size**
Find largest micro-batch that fits in memory:
```bash
# Start with 1, increase until OOM
for MBS in 1 2 4 8; do
  echo "Testing micro-batch-size=$MBS"
  torchrun ... --micro-batch-size $MBS || break  # Stop at the first failure (e.g. OOM)
done
```
Typical values:
- 7B model: 4-8
- 70B model: 1-2
- 405B model: 1
**Step 4: Tune parallelism degrees**
Rules of thumb:
- **Tensor Parallel**: Use ≤8 (limited by NVLink within a node)
- **Pipeline Parallel**: Use for >70B models
- **Context Parallel**: Use for sequences >8K tokens
- **Data Parallel**: Fill remaining GPUs
Example 405B on 128 H100s:

```
TP=8 (1 node)
PP=8 (across nodes)
CP=2 (long sequences)
DP=1
Total = 8 × 8 × 2 × 1 = 128 GPUs
```
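The "fill remaining GPUs" rule can be checked before launching a job. The sketch below derives DP from the GPU count and the other degrees, and rejects a TP degree that would span more than one node. The function name `data_parallel_size` and the default of 8 GPUs per node are assumptions for illustration.

```python
def data_parallel_size(total_gpus, tp, pp, cp=1, gpus_per_node=8):
    """Derive DP so that TP * PP * CP * DP equals the total GPU count."""
    if tp > gpus_per_node:
        raise ValueError("TP should stay within one node to use NVLink")
    model_parallel = tp * pp * cp
    if total_gpus % model_parallel != 0:
        raise ValueError("total_gpus must be divisible by TP * PP * CP")
    return total_gpus // model_parallel


dp_405b = data_parallel_size(128, tp=8, pp=8, cp=2)  # 405B example above
dp_70b = data_parallel_size(64, tp=4, pp=4)          # 70B configuration
print(dp_405b, dp_70b)
```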
## When to use vs alternatives
**Use Megatron-Core when:**
- Training models >10B parameters
- Need maximum efficiency (target >40% MFU)
- Using NVIDIA GPUs (A100, H100)
- Production training at scale
- Want fine-grained parallelism control
**Use alternatives instead:**
- **PyTorch FSDP**: Models <70B, simpler API, PyTorch native
- **DeepSpeed**: Easier setup, good for <100B models
- **HuggingFace Accelerate**: Prototyping, simpler workflows
- **LitGPT**: Educational, single-file implementations
## Common issues
**Issue: Low GPU utilization (<30% MFU)**
Causes:
- Micro-batch too small
- Too much parallelism overhead
- Not using Flash Attention
Fixes:
```bash
# Increase micro-batch
--micro-batch-size 4  # Was 1

# Enable optimizations
--use-flash-attn
--sequence-parallel

# Reduce TP if >8
--tensor-model-parallel-size 4  # Was 16
```
**Issue: Out of memory**
Reduce memory with:
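```bash
--tensor-model-parallel-size 2  # Split model across GPUs
--recompute-granularity full    # Activation recomputation (gradient checkpointing)
--recompute-method uniform      # Divide layers into equal chunks
--recompute-num-layers 1        # Chunk size 1: checkpoint every layer
```

Or use CPU offloading (offloading flags differ between Megatron releases and forks, so confirm them with `python pretrain_gpt.py --help`):

```bash
--cpu-optimizer            # Offload optimizer to CPU
--cpu-optimizer-type ADAM  # CPU Adam variant
```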
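To judge whether more model parallelism is enough to avoid OOM, a rough estimate of the model-state memory per GPU can help. The sketch below assumes 16 bytes per parameter for mixed-precision Adam (2 for BF16 weights, 2 for BF16 gradients, 12 for FP32 master weights and the two Adam moments), even sharding across TP × PP, and no distributed optimizer; it ignores activations entirely. The function name `model_state_gb_per_gpu` is hypothetical.

```python
BYTES_PER_PARAM = 16  # assumed: BF16 weights + BF16 grads + FP32 master weights and Adam moments


def model_state_gb_per_gpu(num_params, tp, pp, bytes_per_param=BYTES_PER_PARAM):
    """Approximate weights + gradients + optimizer state per GPU, in GB."""
    params_per_gpu = num_params / (tp * pp)
    return params_per_gpu * bytes_per_param / 1e9


gb_tp4 = model_state_gb_per_gpu(70e9, tp=4, pp=4)
gb_tp8 = model_state_gb_per_gpu(70e9, tp=8, pp=4)
print(f"TP=4, PP=4: {gb_tp4:.1f} GB; TP=8, PP=4: {gb_tp8:.1f} GB")
```

Under these assumptions, a 70B model at TP=4 and PP=4 leaves little of an 80 GB GPU for activations, which is why recomputation or further sharding is often needed.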
**Issue: Training slower than expected**
Check:
- **Network bottleneck**: Ensure InfiniBand/NVLink enabled
- **Pipeline bubbles**: Use interleaved pipeline schedule
  ```bash
  --num-layers-per-virtual-pipeline-stage 2
  ```
- **Data loading**: Use fast data loader
  ```bash
  --dataloader-type cyclic
  ```
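The effect of the interleaved schedule on pipeline bubbles can be estimated with the commonly cited ratio of bubble time to ideal compute time: (p − 1) / m for p pipeline stages and m micro-batches, reduced to (p − 1) / (v · m) when each device holds v virtual stages. The sketch below applies that estimate; the function name `bubble_fraction` and the example values are assumptions for illustration.

```python
def bubble_fraction(pp, num_microbatches, virtual_stages=1):
    """Pipeline bubble time relative to ideal compute time.

    Plain schedule: (pp - 1) / m. Interleaved with v virtual
    stages per device: (pp - 1) / (v * m).
    """
    return (pp - 1) / (virtual_stages * num_microbatches)


# Hypothetical run: PP=4 with 16 micro-batches per step.
plain = bubble_fraction(pp=4, num_microbatches=16)
# 20 layers per stage at 2 layers per virtual stage gives 10 virtual stages.
interleaved = bubble_fraction(pp=4, num_microbatches=16, virtual_stages=10)
print(f"plain: {plain:.2%}, interleaved: {interleaved:.2%}")
```

Interleaving trades the smaller bubble for more pipeline communication, so the gain is largest when the micro-batch count per step is small.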
**Issue: Diverging loss**
Stabilize training:
```bash
--lr-warmup-iters 2000   # Longer warmup
--clip-grad 1.0          # Gradient clipping
--init-method-std 0.006  # Smaller init
--attention-dropout 0.0  # No dropout in attention
--hidden-dropout 0.0     # No dropout on hidden states
```
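To see what a longer warmup does to the early learning rate, the schedule can be sketched as linear warmup followed by cosine decay. This is an illustrative approximation using the values from the 70B script (`--lr 3e-4`, `--lr-warmup-iters 2000`, `--train-iters 100000`); the function name `lr_at`, the minimum learning rate of 0, and the decay horizon equal to the training length are assumptions, and Megatron's own scheduler may differ in detail.

```python
import math


def lr_at(step, max_lr=3e-4, min_lr=0.0, warmup_iters=2000, decay_iters=100000):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_iters:
        return max_lr * step / warmup_iters
    if step >= decay_iters:
        return min_lr
    progress = (step - warmup_iters) / (decay_iters - warmup_iters)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 1000, 2000, 51000, 100000):
    print(step, f"{lr_at(step):.2e}")
```

If loss spikes appear during the first few thousand steps, raising `warmup_iters` lowers the learning rate over exactly that range.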