pytorch-lightning

Streamlined PyTorch training framework with automatic distributed training, callbacks, and minimal boilerplate.

- Organizes PyTorch code into a LightningModule with training_step, validation_step, and test_step methods; the Trainer class handles device management, mixed precision, checkpointing, and logging automatically.

- Supports distributed training strategies including DDP, FSDP, and DeepSpeed with single-line configuration; scales from laptop to multi-node clusters without code changes.

- Built-in callback system for ModelCheckpoint, EarlyStopping, LearningRateMonitor, and custom extensions; integrates with TensorBoard and popular logging platforms.

- Handles gradient accumulation, learning rate scheduling, and precision modes (FP32, FP16, BF16, FP8); works across GPU, TPU, CPU, and Apple MPS accelerators.

INSTALLATION
npx skills add https://github.com/davila7/claude-code-templates --skill pytorch-lightning
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

    import lightning as L
    import torch
    from torch import nn
    from torch.utils.data import DataLoader

    # Step 1: Define LightningModule (organize your PyTorch code)
    class LitModel(L.LightningModule):
        def __init__(self, hidden_size=128):
            super().__init__()
            self.model = nn.Sequential(
                nn.Linear(28 * 28, hidden_size),
                nn.ReLU(),
                nn.Linear(hidden_size, 10)
            )

        def training_step(self, batch, batch_idx):
            x, y = batch
            x = x.view(x.size(0), -1)  # Flatten images to (batch, 784)
            y_hat = self.model(x)
            loss = nn.functional.cross_entropy(y_hat, y)
            self.log('train_loss', loss)  # Sent to the Trainer's logger (TensorBoard by default, if installed)
            return loss

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)

    # Step 2: Create data (train_dataset is any Dataset yielding (image, label) pairs)
    train_loader = DataLoader(train_dataset, batch_size=32)

    # Step 3: Train with Trainer (handles everything else!)
    trainer = L.Trainer(max_epochs=10, accelerator='gpu', devices=2)
    model = LitModel()
    trainer.fit(model, train_loader)

**That's it!** Trainer handles:

- GPU/TPU/CPU switching

- Distributed training (DDP, FSDP, DeepSpeed)

- Mixed precision (FP16, BF16)

- Gradient accumulation

- Checkpointing

- Logging

- Progress bars

## Common workflows

### Workflow 1: From PyTorch to Lightning

**Original PyTorch code**:

    model = MyModel()
    optimizer = torch.optim.Adam(model.parameters())
    model.to('cuda')

    for epoch in range(max_epochs):
        for batch in train_loader:
            batch = batch.to('cuda')
            optimizer.zero_grad()
            loss = model(batch)
            loss.backward()
            optimizer.step()


**Lightning version**:

    class LitModel(L.LightningModule):
        def __init__(self):
            super().__init__()
            self.model = MyModel()

        def training_step(self, batch, batch_idx):
            loss = self.model(batch)  # No .to('cuda') needed!
            return loss

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters())

    # Train
    trainer = L.Trainer(max_epochs=10, accelerator='gpu')
    trainer.fit(LitModel(), train_loader)


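**Benefits**: no manual device management, no hand-written epoch and batch loops, and the same module runs under distributed strategies unchanged. The line savings are small for this toy loop and grow once validation, checkpointing, and logging enter the script.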

### Workflow 2: Validation and testing

    class LitModel(L.LightningModule):
        def __init__(self):
            super().__init__()
            self.model = MyModel()

        def training_step(self, batch, batch_idx):
            x, y = batch
            y_hat = self.model(x)
            loss = nn.functional.cross_entropy(y_hat, y)
            self.log('train_loss', loss)
            return loss

        def validation_step(self, batch, batch_idx):
            x, y = batch
            y_hat = self.model(x)
            val_loss = nn.functional.cross_entropy(y_hat, y)
            acc = (y_hat.argmax(dim=1) == y).float().mean()
            self.log('val_loss', val_loss)
            self.log('val_acc', acc)

        def test_step(self, batch, batch_idx):
            x, y = batch
            y_hat = self.model(x)
            test_loss = nn.functional.cross_entropy(y_hat, y)
            self.log('test_loss', test_loss)

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)

    # Train with validation
    model = LitModel()
    trainer = L.Trainer(max_epochs=10)
    trainer.fit(model, train_loader, val_loader)

    # Test
    trainer.test(model, test_loader)


**Automatic features**:

- Validation runs every epoch by default

- Metrics logged to TensorBoard

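- Checkpointing of the most recent epoch; saving the best model by val_loss requires a ModelCheckpoint callback with `monitor='val_loss'` (see Workflow 4)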
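
The per-batch values passed to `self.log` inside `validation_step` are reduced to one number per epoch. The sketch below shows the reduction in plain Python, assuming the default mean reduction weighted by batch size; `epoch_mean` and the sample numbers are hypothetical and are not part of Lightning's API.

```python
def epoch_mean(step_values, batch_sizes):
    """Batch-size-weighted mean of per-step metric values."""
    total = sum(v * n for v, n in zip(step_values, batch_sizes))
    return total / sum(batch_sizes)


# Hypothetical per-batch val_loss values; the last batch is smaller.
val_losses = [0.50, 0.40, 0.10]
batch_sizes = [32, 32, 16]

val_loss_epoch = epoch_mean(val_losses, batch_sizes)
unweighted = sum(val_losses) / len(val_losses)
print(f"weighted: {val_loss_epoch:.4f}, unweighted: {unweighted:.4f}")
```

The two numbers differ whenever the final batch is smaller than the rest, which is why a hand-computed average of step values may not match the logged epoch metric.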

### Workflow 3: Distributed training (DDP)

    # Same code as single GPU!
    model = LitModel()

    # 8 GPUs with DDP (automatic!)
    trainer = L.Trainer(
        accelerator='gpu',
        devices=8,
        strategy='ddp'  # Or 'fsdp', 'deepspeed'
    )

    trainer.fit(model, train_loader)


**Launch**:

    # Single command, Lightning handles the rest
    python train.py


**No changes needed**:

- Automatic data distribution

- Gradient synchronization

- Multi-node support (set `num_nodes=2` and launch the script on every node, for example through SLURM or `torchrun`)
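
To make "automatic data distribution" concrete: under DDP each process trains on its own slice of the dataset, which PyTorch normally provides through `torch.utils.data.DistributedSampler`. The sketch below mimics that sampler's index layout in plain Python with shuffling disabled. `shard_indices` is a hypothetical helper, and the padding is simplified for the case where the dataset is at least as large as the number of processes.

```python
import math


def shard_indices(dataset_len, world_size, rank):
    """Indices one rank would see, without shuffling (simplified sketch)."""
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    indices = list(range(dataset_len))
    # Pad by repeating samples from the start so every rank gets the same count.
    indices += indices[: total - dataset_len]
    # Each rank takes every world_size-th index, offset by its rank.
    return indices[rank:total:world_size]


shards = [shard_indices(10, 4, rank) for rank in range(4)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

With 10 samples and 4 processes, two samples are repeated so that every process runs the same number of steps. The per-process batch size stays at the DataLoader's `batch_size`, so the global batch grows with the number of devices.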

### Workflow 4: Callbacks for monitoring

    from lightning.pytorch.callbacks import ModelCheckpoint, EarlyStopping, LearningRateMonitor

    # Create callbacks
    checkpoint = ModelCheckpoint(
        monitor='val_loss',
        mode='min',
        save_top_k=3,
        filename='model-{epoch:02d}-{val_loss:.2f}'
    )

    early_stop = EarlyStopping(
        monitor='val_loss',
        patience=5,
        mode='min'
    )

    lr_monitor = LearningRateMonitor(logging_interval='epoch')

    # Add to Trainer
    trainer = L.Trainer(
        max_epochs=100,
        callbacks=[checkpoint, early_stop, lr_monitor]
    )

    trainer.fit(model, train_loader, val_loader)


**Result**:

- Auto-saves best 3 models

- Stops early if no improvement for 5 epochs

- Logs learning rate to TensorBoard
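
The bookkeeping behind the first two results can be sketched in plain Python. This is a simplified illustration, assuming `mode='min'` and one validation run per epoch; `early_stop_epoch`, `top_k`, and the loss history are hypothetical and do not reproduce Lightning's internals.

```python
def early_stop_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the epoch index at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, wait = loss, 0  # Improvement: reset the counter.
        else:
            wait += 1             # No improvement this epoch.
            if wait >= patience:
                return epoch
    return None


def top_k(val_losses, k=3):
    """Epoch indices of the k lowest losses, best first."""
    return sorted(range(len(val_losses)), key=lambda e: val_losses[e])[:k]


# Hypothetical validation losses: the best value is reached at epoch 2.
history = [0.90, 0.70, 0.60, 0.65, 0.62, 0.61, 0.63, 0.64, 0.66]
stop_at = early_stop_epoch(history, patience=5)
kept = top_k(history, k=3)
print(stop_at, kept)
```

Epoch 5 has a lower loss than its neighbours but does not beat the best value from epoch 2, so the patience counter keeps running while that checkpoint still earns a place in the top three.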

### Workflow 5: Learning rate scheduling

    class LitModel(L.LightningModule):
        # ... (training_step, etc.)

        def configure_optimizers(self):
            optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)

            # Cosine annealing
            scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
                optimizer,
                T_max=100,
                eta_min=1e-5
            )

            return {
                'optimizer': optimizer,
                'lr_scheduler': {
                    'scheduler': scheduler,
                    'interval': 'epoch',  # Update per epoch
                    'frequency': 1
                }
            }

    # Add a LearningRateMonitor callback (Workflow 4) to log the learning rate
    trainer = L.Trainer(max_epochs=100)
    trainer.fit(model, train_loader)
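
For reference, the schedule configured above follows the closed-form cosine annealing curve. The sketch below evaluates it in plain Python with the same values (`lr=1e-3`, `T_max=100`, `eta_min=1e-5`); `cosine_lr` is a hypothetical helper for inspection, not a Lightning or PyTorch function.

```python
import math


def cosine_lr(epoch, base_lr=1e-3, eta_min=1e-5, t_max=100):
    """Learning rate after `epoch` scheduler steps under cosine annealing."""
    cosine = (1 + math.cos(math.pi * epoch / t_max)) / 2
    return eta_min + (base_lr - eta_min) * cosine


for epoch in (0, 25, 50, 75, 100):
    print(epoch, f"{cosine_lr(epoch):.6f}")
```

The rate starts at `lr`, passes the midpoint between `lr` and `eta_min` halfway through, and reaches `eta_min` at `T_max`. Setting `T_max` equal to `max_epochs`, as above, makes the decay span the whole run.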


## When to use vs alternatives

**Use PyTorch Lightning when**:

- Want clean, organized code

- Need production-ready training loops

- Switching between single GPU, multi-GPU, TPU

- Want built-in callbacks and logging

- Team collaboration (standardized structure)

**Key advantages**:

- **Organized**: Separates research code from engineering

- **Automatic**: DDP, FSDP, DeepSpeed with 1 line

- **Callbacks**: Modular training extensions

- **Reproducible**: Less boilerplate = fewer bugs

- **Tested**: 1M+ downloads/month, battle-tested

**Use alternatives instead**:

- **Accelerate**: Minimal changes to existing code, more flexibility

- **Ray Train**: Multi-node orchestration, hyperparameter tuning

- **Raw PyTorch**: Maximum control, learning purposes

- **Keras**: TensorFlow ecosystem

## Common issues

**Issue: Loss not decreasing**

Check data and model setup:

    # Add to training_step
    def training_step(self, batch, batch_idx):
        if batch_idx == 0:
            print(f"Batch shape: {batch[0].shape}")
            print(f"Labels: {batch[1]}")

        loss = ...
        return loss
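
A quick reference point when reading the first logged losses: a classifier that spreads probability evenly over the classes has a cross-entropy of ln(number of classes). The sketch below computes that baseline for the 10-class quick-start model; `chance_level_cross_entropy` is a hypothetical helper, and the rule of thumb assumes roughly balanced classes.

```python
import math


def chance_level_cross_entropy(num_classes):
    """Cross-entropy of a uniform prediction over num_classes classes."""
    return math.log(num_classes)


baseline = chance_level_cross_entropy(10)
print(f"Chance-level loss for 10 classes: {baseline:.4f}")
```

An initial loss far above this value often points to a problem in the labels or input scaling, while a loss that stays at this value suggests the model is not learning from its inputs.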


**Issue: Out of memory**

Reduce batch size or use gradient accumulation:

    trainer = L.Trainer(
        accumulate_grad_batches=4,  # Effective batch = batch_size × 4
        precision='bf16-mixed'      # Or '16-mixed'; lowers activation memory
    )
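
The arithmetic behind the trade-off can be sketched as follows: memory use follows the per-device batch, while the optimizer sees the product of all the factors. `effective_batch_size` is a hypothetical helper, not a Lightning function.

```python
def effective_batch_size(per_device_batch, accumulate_grad_batches=1,
                         devices=1, num_nodes=1):
    """Number of samples contributing to each optimizer step."""
    return per_device_batch * accumulate_grad_batches * devices * num_nodes


# Halving the per-device batch and doubling accumulation keeps the
# effective batch unchanged while lowering peak activation memory.
before = effective_batch_size(32, accumulate_grad_batches=4)
after = effective_batch_size(16, accumulate_grad_batches=8)
print(before, after)
```

Keeping the effective batch constant this way means the learning rate usually does not need retuning when the per-device batch is reduced to fit in memory.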


**Issue: Validation not running**

Ensure the module defines validation_step and that you pass val_loader:

    # WRONG
    trainer.fit(model, train_loader)

    # CORRECT
    trainer.fit(model, train_loader, val_loader)


**Issue: DDP spawns multiple processes unexpectedly**

By default Lightning uses every GPU it detects. Set the accelerator and device count explicitly:

    # Test on CPU first
    trainer = L.Trainer(accelerator='cpu', devices=1)

    # Then GPU
    trainer = L.Trainer(accelerator='gpu', devices=1)
