SKILL.md


    # Or LLaMA-3 8B training
    ./examples/llama/train_llama3_8b_fp8.sh

## Common workflows

### Workflow 1: Train LLaMA-style model with 3D parallelism

Copy this checklist:

LLaMA Training Setup:

[ ] Step 1: Choose parallelism configuration

[ ] Step 2: Configure training hyperparameters

[ ] Step 3: Launch distributed training

[ ] Step 4: Monitor performance metrics

**Step 1: Choose parallelism configuration**

Model size determines parallelism strategy:

| Model Size | GPUs | Tensor Parallel | Pipeline Parallel | Data Parallel | Context Parallel |
|------------|------|-----------------|-------------------|---------------|------------------|
| 7B | 8 | 1 | 1 | 8 | 1 |
| 13B | 8 | 2 | 1 | 4 | 1 |
| 70B | 64 | 4 | 4 | 4 | 1 |
| 405B | 128 | 8 | 8 | 1 | 2 |
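
Each row of the table satisfies one constraint: GPUs = TP × PP × CP × DP, so the data-parallel size is whatever remains after the model-parallel degrees are chosen. A minimal sketch of that check follows; `dp_size` is a hypothetical helper written for illustration, not a Megatron utility.

```bash
# Hypothetical helper (not part of Megatron): derive the data-parallel size
# from the GPU count and the model-parallel degrees.
# Usage: dp_size TOTAL_GPUS TP PP CP
dp_size() {
  local gpus=$1 tp=$2 pp=$3 cp=$4
  local model_parallel=$(( tp * pp * cp ))
  if [ $(( gpus % model_parallel )) -ne 0 ]; then
    echo "error: $gpus GPUs is not divisible by TP*PP*CP=$model_parallel" >&2
    return 1
  fi
  echo $(( gpus / model_parallel ))
}

dp_size 8 2 1 1     # 13B row: prints 4
dp_size 64 4 4 1    # 70B row: prints 4
```

Running a check like this before launching catches layouts whose degrees do not divide the GPU count, which would otherwise only fail at process-group initialization.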

**Step 2: Configure training hyperparameters**

    #!/bin/bash
    # train_llama_70b.sh

    GPUS_PER_NODE=8
    NNODES=8  # 64 GPUs total

    # Rendezvous settings for multi-node runs. These variable names are
    # assumptions; set them per node from your scheduler or environment.
    NODE_RANK=${NODE_RANK:-0}
    MASTER_ADDR=${MASTER_ADDR:-localhost}
    MASTER_PORT=${MASTER_PORT:-6000}

    TP=4  # Tensor parallel
    PP=4  # Pipeline parallel
    CP=1  # Context parallel

    # LLaMA 70B configuration
    MODEL_SIZE=70  # Billion parameters
    HIDDEN_SIZE=8192
    NUM_LAYERS=80
    NUM_HEADS=64
    SEQ_LENGTH=4096

    # Training hyperparameters
    MICRO_BATCH=1
    GLOBAL_BATCH=1024
    LR=3e-4

    torchrun \
      --nproc_per_node=$GPUS_PER_NODE \
      --nnodes=$NNODES \
      --node_rank=$NODE_RANK \
      --master_addr=$MASTER_ADDR \
      --master_port=$MASTER_PORT \
      pretrain_gpt.py \
      --tensor-model-parallel-size $TP \
      --pipeline-model-parallel-size $PP \
      --context-parallel-size $CP \
      --sequence-parallel \
      --num-layers $NUM_LAYERS \
      --hidden-size $HIDDEN_SIZE \
      --num-attention-heads $NUM_HEADS \
      --seq-length $SEQ_LENGTH \
      --max-position-embeddings $SEQ_LENGTH \
      --micro-batch-size $MICRO_BATCH \
      --global-batch-size $GLOBAL_BATCH \
      --lr $LR \
      --train-iters 100000 \
      --lr-decay-style cosine \
      --lr-warmup-iters 2000 \
      --weight-decay 0.1 \
      --clip-grad 1.0 \
      --bf16 \
      --use-mcore-models \
      --transformer-impl transformer_engine \
      --data-path /path/to/data \
      --vocab-file /path/to/vocab.json \
      --merge-file /path/to/merges.txt
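
The batch settings in the script interact with the parallel layout: the global batch must divide evenly into micro-batches across the data-parallel replicas, and the quotient is the number of gradient-accumulation steps per optimizer step. The sketch below works through that arithmetic with the 70B values from the script; the variable names `DP`, `GRAD_ACCUM`, and `TOKENS_PER_STEP` are introduced here for illustration only.

```bash
# Values taken from the 70B script above.
GPUS=64; TP=4; PP=4; CP=1
MICRO_BATCH=1; GLOBAL_BATCH=1024; SEQ_LENGTH=4096

# Data-parallel replicas left over after model parallelism.
DP=$(( GPUS / (TP * PP * CP) ))

# The global batch must split evenly into micro-batches across replicas.
if [ $(( GLOBAL_BATCH % (MICRO_BATCH * DP) )) -ne 0 ]; then
  echo "error: GLOBAL_BATCH must be divisible by MICRO_BATCH * DP" >&2
  exit 1
fi

# Micro-batches each replica processes per optimizer step.
GRAD_ACCUM=$(( GLOBAL_BATCH / (MICRO_BATCH * DP) ))
# Tokens consumed per optimizer step.
TOKENS_PER_STEP=$(( GLOBAL_BATCH * SEQ_LENGTH ))

echo "DP=$DP GRAD_ACCUM=$GRAD_ACCUM TOKENS_PER_STEP=$TOKENS_PER_STEP"
# prints: DP=4 GRAD_ACCUM=256 TOKENS_PER_STEP=4194304
```

With 256 micro-batches per step and PP=4, the pipeline stays well fed, which is one reason a micro-batch size of 1 is workable at this scale.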


**Step 3: Launch distributed training**

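    # Single node (8 GPUs): requires NNODES=1 and TP × PP × CP ≤ 8 in the script
    bash train_llama_70b.sh

    # Multi-node with SLURM (torchrun must start on every node, e.g. via srun inside the batch script)
    sbatch --nodes=8 --gpus-per-node=8 train_llama_70b.sh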


**Step 4: Monitor performance metrics**

Key metrics to track:

- Model FLOP Utilization (MFU): Target >40% on H100
- Throughput: Tokens/sec/GPU
- Memory usage: <80GB per GPU for 70B model
- Loss: Should decrease steadily
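
MFU can be estimated from the measured throughput. The sketch below assumes the common approximation of about 6 FLOPs per parameter per token for a forward and backward pass (attention FLOPs ignored) and takes the hardware peak as an argument. `mfu_percent` is a hypothetical helper, and the 989 TFLOPS figure in the example is an assumed dense BF16 peak for an H100 SXM; substitute the number for your hardware.

```bash
# Hypothetical helper: estimate MFU (percent) from measured throughput.
# Assumes ~6 FLOPs per parameter per token (forward + backward, attention
# FLOPs ignored) and a caller-supplied hardware peak in TFLOPS.
# Usage: mfu_percent PARAMS_BILLIONS TOKENS_PER_SEC_PER_GPU PEAK_TFLOPS
mfu_percent() {
  awk -v p="$1" -v t="$2" -v peak="$3" 'BEGIN {
    achieved_tflops = 6 * p * 1e9 * t / 1e12
    printf "%.1f\n", 100 * achieved_tflops / peak
  }'
}

# 70B model at 1000 tokens/s/GPU against an assumed 989 TFLOPS peak.
mfu_percent 70 1000 989   # prints 42.5
```

Because the estimate ignores attention FLOPs, it slightly understates utilization at long sequence lengths.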


### Workflow 2: Configure Mixture of Experts (MoE) training

For sparse MoE models like Mixtral.

MoE Training:

[ ] Step 1: Configure expert parallelism

[ ] Step 2: Set MoE hyperparameters

[ ] Step 3: Launch training with EP


**Step 1: Configure expert parallelism**

    # Mixtral 8x7B example
    TENSOR_PARALLEL=2
    PIPELINE_PARALLEL=1
    EXPERT_PARALLEL=4  # Split 8 experts across 4 GPUs
    DATA_PARALLEL=4    # Replicas of each expert shard

    TOTAL_GPUS=$((TENSOR_PARALLEL * PIPELINE_PARALLEL * EXPERT_PARALLEL * DATA_PARALLEL))
    # = 2 * 1 * 4 * 4 = 32 GPUs
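
Expert parallelism only works when the experts divide evenly across the expert-parallel ranks. A minimal sketch of that check, using a hypothetical helper `experts_per_rank` that is not part of Megatron:

```bash
# Hypothetical helper: how many experts each expert-parallel rank holds.
# Usage: experts_per_rank NUM_EXPERTS EP
experts_per_rank() {
  local num_experts=$1 ep=$2
  if [ $(( num_experts % ep )) -ne 0 ]; then
    echo "error: $num_experts experts cannot be split evenly across EP=$ep" >&2
    return 1
  fi
  echo $(( num_experts / ep ))
}

experts_per_rank 8 4   # Mixtral 8x7B with EP=4: prints 2
```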


**Step 2: Set MoE hyperparameters**

    # --sequence-parallel is included because Megatron's MoE support expects it
    # when tensor parallelism (TP > 1) is combined with expert parallelism.
    torchrun \
      --nproc_per_node=8 \
      pretrain_gpt.py \
      --tensor-model-parallel-size 2 \
      --pipeline-model-parallel-size 1 \
      --expert-model-parallel-size 4 \
      --sequence-parallel \
      --num-experts 8 \
      --moe-router-topk 2 \
      --moe-router-load-balancing-type aux_loss \
      --moe-aux-loss-coeff 0.01 \
      --hidden-size 4096 \
      --num-layers 32 \
      --num-attention-heads 32 \
      --seq-length 4096 \
      --max-position-embeddings 4096 \
      --bf16 \
      --use-mcore-models \
      --transformer-impl transformer_engine \
      --data-path /path/to/data \
      --vocab-file /path/to/vocab.json \
      --merge-file /path/to/merges.txt


**Step 3: Launch training with EP**

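Expert parallelism places different experts on different GPUs, so each GPU stores only a fraction of the expert weights while the model keeps its full capacity.

- Without EP: all 8 experts on every GPU (nominally 8 × 7B = 56B expert parameters)
- With EP=4: 2 experts per GPU (nominally 2 × 7B = 14B expert parameters)
- Savings: 75% fewer expert parameters per GPU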
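
The 75% reduction does not depend on precision, because EP only divides the number of experts stored per GPU. The sketch below turns the expert counts into gigabytes of weights under stated assumptions: the nominal 7B parameters per expert used above, a caller-supplied number of bytes per parameter, and weights only (no gradients, optimizer state, or activations). `expert_weight_gb` is a hypothetical helper.

```bash
# Hypothetical helper: GB of expert weights held per GPU (weights only).
# Assumes NUM_EXPERTS is divisible by EP.
# Usage: expert_weight_gb NUM_EXPERTS EP PARAMS_PER_EXPERT_BILLIONS BYTES_PER_PARAM
expert_weight_gb() {
  local num_experts=$1 ep=$2 params_b=$3 bytes=$4
  # 1e9 parameters at 1 byte each is 1 GB, so billions * bytes gives GB.
  echo $(( num_experts / ep * params_b * bytes ))
}

expert_weight_gb 8 1 7 2   # no EP, 2 bytes/param (BF16): prints 112
expert_weight_gb 8 4 7 2   # EP=4, 2 bytes/param (BF16): prints 28
```

At one byte per parameter the same helper gives 56 and 14, matching the figures above.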


### Workflow 3: Optimize for maximum throughput

Target: up to 47% MFU on H100.

Performance Optimization:

[ ] Step 1: Enable Flash Attention

[ ] Step 2: Use FP8 precision (H100)

[ ] Step 3: Optimize micro-batch size

[ ] Step 4: Tune parallelism degrees


**Step 1: Enable Flash Attention and other optimizations**

    --use-mcore-models                     # Use Megatron Core models
    --transformer-impl transformer_engine  # Use Transformer Engine (provides fused and Flash Attention kernels)
    --sequence-parallel                    # Reduce activation memory (use with TP)


**Step 2: Use FP8 precision (H100 only)**

    --fp8-format hybrid  # FP8 mixed precision training (older releases used --fp8-hybrid)
    # Transformer Engine handles FP8 scaling and casting automatically


Result: 1.5-2x speedup on H100 vs BF16.

**Step 3: Optimize micro-batch size**

Find the largest micro-batch size that fits in memory:

    # Start with 1, increase until OOM
    for MBS in 1 2 4 8; do
      echo "Testing micro-batch-size=$MBS"
      torchrun ... --micro-batch-size $MBS || break  # Stop at the first failure (e.g. OOM)
    done
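
The same search can be wrapped in a function that reports the last size that worked. This is a sketch only: `find_max_mbs` and `fake_probe` are hypothetical names, and the probe here is a stand-in that pretends sizes above 4 run out of memory. In practice the probe would launch a short training run and return its exit status.

```bash
# Hypothetical helper: return the largest micro-batch size for which a probe
# command succeeds. Candidates must be listed in increasing order.
# Usage: find_max_mbs PROBE_COMMAND SIZE...
find_max_mbs() {
  local probe=$1
  shift
  local best=0 mbs
  for mbs in "$@"; do
    if "$probe" "$mbs"; then
      best=$mbs
    else
      break   # First failure (e.g. OOM): larger sizes will not fit either
    fi
  done
  echo "$best"
}

# Stand-in probe for illustration: succeeds only for sizes up to 4.
fake_probe() { [ "$1" -le 4 ]; }

find_max_mbs fake_probe 1 2 4 8   # prints 4
```

A result of 0 means even the smallest candidate failed, which points to a layout problem rather than a batch-size problem.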


Typical values:

- 7B model: 4-8

- 70B model: 1-2

- 405B model: 1

**Step 4: Tune parallelism degrees**

Rules of thumb:

- Tensor Parallel: Keep ≤8 so TP communication stays on NVLink within one node
- Pipeline Parallel: Use for models of 70B parameters and larger
- Context Parallel: Use for sequences >8K tokens
- Data Parallel: Fill remaining GPUs


Example 405B on 128 H100s:

    TP=8 (1 node)
    PP=8 (across nodes)
    CP=2 (long sequences)
    DP=1
    Total = 8 × 8 × 2 × 1 = 128 GPUs


## When to use vs alternatives

**Use Megatron-Core when:**

- Training models >10B parameters

- Need maximum efficiency (target >40% MFU)

- Using NVIDIA GPUs (A100, H100)

- Production training at scale

- Want fine-grained parallelism control

**Use alternatives instead:**

- **PyTorch FSDP**: Models <70B, simpler API, PyTorch native

- **DeepSpeed**: Easier setup, good for <100B models

- **HuggingFace Accelerate**: Prototyping, simpler workflows

- **LitGPT**: Educational, single-file implementations

## Common issues

**Issue: Low GPU utilization (<30% MFU)**

Causes:

- Micro-batch too small

- Too much parallelism overhead

- Not using Flash Attention

Fixes:

    # Increase micro-batch
    --micro-batch-size 4  # Was 1

    # Enable optimizations
    --use-flash-attn
    --sequence-parallel

    # Reduce TP if >8
    --tensor-model-parallel-size 4  # Was 16


**Issue: Out of memory**

Reduce memory with:

    --tensor-model-parallel-size 2  # Split model across GPUs
    --recompute-granularity full    # Activation recomputation (gradient checkpointing)
    --recompute-method uniform      # Checkpoint uniformly sized chunks of layers
    --recompute-num-layers 1        # Chunk size 1: checkpoint every layer


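Or offload optimizer state to CPU (flag names differ between Megatron versions and forks; confirm them against your version's --help output):

    --cpu-optimizer            # Offload optimizer to CPU
    --cpu-optimizer-type ADAM  # CPU Adam variant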


**Issue: Training slower than expected**

Check:

- **Network bottleneck**: Ensure InfiniBand/NVLink enabled
- **Pipeline bubbles**: Use interleaved pipeline schedule

      --num-layers-per-virtual-pipeline-stage 2

- **Data loading**: Use fast data loader

      --dataloader-type cyclic
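
To judge whether pipeline bubbles are worth tuning, the idle fraction can be estimated with the standard analysis of the 1F1B schedule: bubble ≈ (PP − 1) / (V × M), where M is the number of micro-batches per step and V is the number of virtual stages per GPU (V = 1 without interleaving). V equals the layers per pipeline stage divided by `--num-layers-per-virtual-pipeline-stage`. The sketch below uses a hypothetical helper `bubble_percent` and ignores communication overhead.

```bash
# Hypothetical helper: estimated pipeline bubble as a percentage of ideal
# compute time, using bubble = (PP - 1) / (V * M).
# Usage: bubble_percent PP NUM_MICROBATCHES VIRTUAL_STAGES
bubble_percent() {
  awk -v pp="$1" -v m="$2" -v v="$3" 'BEGIN {
    printf "%.2f\n", 100 * (pp - 1) / (v * m)
  }'
}

bubble_percent 4 8 1   # PP=4, 8 micro-batches, no interleaving: prints 37.50
bubble_percent 4 8 2   # same, 2 virtual stages per GPU: prints 18.75
```

Interleaving helps most when there are few micro-batches per step; with hundreds of micro-batches the bubble is already small and the extra communication may not pay off.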


**Issue: Diverging loss**

Stabilize training:

    --lr-warmup-iters 2000   # Longer warmup
    --clip-grad 1.0          # Gradient clipping
    --init-method-std 0.006  # Smaller init
    --attention-dropout 0.0  # No dropout in attention
    --hidden-dropout 0.0     # No dropout in FFN
