evaluating-llms-harness

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting…

INSTALLATION
npx skills add https://github.com/davila7/claude-code-templates --skill evaluating-llms-harness
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

$2c

View available tasks:

lm_eval --tasks list

Common workflows

Workflow 1: Standard benchmark evaluation

Evaluate model on core benchmarks (MMLU, GSM8K, HumanEval).

Copy this checklist:

Benchmark Evaluation:

- [ ] Step 1: Choose benchmark suite

- [ ] Step 2: Configure model

- [ ] Step 3: Run evaluation

- [ ] Step 4: Analyze results

Step 1: Choose benchmark suite

Core reasoning benchmarks:

  • MMLU (Massive Multitask Language Understanding) - 57 subjects, multiple choice
  • GSM8K - Grade school math word problems
  • HellaSwag - Common sense reasoning
  • TruthfulQA - Truthfulness and factuality
  • ARC (AI2 Reasoning Challenge) - Science questions

Code benchmarks:

  • HumanEval - Python code generation (164 problems)
  • MBPP (Mostly Basic Python Problems) - Python coding

Standard suite (recommended for model releases):

--tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge

Step 2: Configure model

HuggingFace model:

lm_eval --model hf \

  --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16 \

  --tasks mmlu \

  --device cuda:0 \

  --batch_size auto  # Auto-detect optimal batch size

Quantized model (4-bit/8-bit):

lm_eval --model hf \

  --model_args pretrained=meta-llama/Llama-2-7b-hf,load_in_4bit=True \

  --tasks mmlu \

  --device cuda:0

Custom checkpoint:

lm_eval --model hf \

  --model_args pretrained=/path/to/my-model,tokenizer=/path/to/tokenizer \

  --tasks mmlu \

  --device cuda:0

Step 3: Run evaluation

# Full MMLU evaluation (57 subjects)

lm_eval --model hf \

  --model_args pretrained=meta-llama/Llama-2-7b-hf \

  --tasks mmlu \

  --num_fewshot 5 \  # 5-shot evaluation (standard)

  --batch_size 8 \

  --output_path results/ \

  --log_samples  # Save individual predictions

# Multiple benchmarks at once

lm_eval --model hf \

  --model_args pretrained=meta-llama/Llama-2-7b-hf \

  --tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge \

  --num_fewshot 5 \

  --batch_size 8 \

  --output_path results/llama2-7b-eval.json

Step 4: Analyze results

Results saved to results/llama2-7b-eval.json:

{

  "results": {

    "mmlu": {

      "acc": 0.459,

      "acc_stderr": 0.004

    },

    "gsm8k": {

      "exact_match": 0.142,

      "exact_match_stderr": 0.006

    },

    "hellaswag": {

      "acc_norm": 0.765,

      "acc_norm_stderr": 0.004

    }

  },

  "config": {

    "model": "hf",

    "model_args": "pretrained=meta-llama/Llama-2-7b-hf",

    "num_fewshot": 5

  }

}

Workflow 2: Track training progress

Evaluate checkpoints during training.

Training Progress Tracking:

- [ ] Step 1: Set up periodic evaluation

- [ ] Step 2: Choose quick benchmarks

- [ ] Step 3: Automate evaluation

- [ ] Step 4: Plot learning curves

Step 1: Set up periodic evaluation

Evaluate every N training steps:

#!/bin/bash

# eval_checkpoint.sh

CHECKPOINT_DIR=$1

STEP=$2

lm_eval --model hf \

  --model_args pretrained=$CHECKPOINT_DIR/checkpoint-$STEP \

  --tasks gsm8k,hellaswag \

  --num_fewshot 0 \  # 0-shot for speed

  --batch_size 16 \

  --output_path results/step-$STEP.json

Step 2: Choose quick benchmarks

Fast benchmarks for frequent evaluation:

  • HellaSwag: ~10 minutes on 1 GPU
  • GSM8K: ~5 minutes
  • PIQA: ~2 minutes

Avoid for frequent eval (too slow):

  • MMLU: ~2 hours (57 subjects)
  • HumanEval: Requires code execution

Step 3: Automate evaluation

Integrate with training script:

# In training loop

if step % eval_interval == 0:

    model.save_pretrained(f"checkpoints/step-{step}")

    # Run evaluation

    os.system(f"./eval_checkpoint.sh checkpoints step-{step}")

Or use PyTorch Lightning callbacks:

from pytorch_lightning import Callback

class EvalHarnessCallback(Callback):

    def on_validation_epoch_end(self, trainer, pl_module):

        step = trainer.global_step

        checkpoint_path = f"checkpoints/step-{step}"

        # Save checkpoint

        trainer.save_checkpoint(checkpoint_path)

        # Run lm-eval

        os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")

Step 4: Plot learning curves

import json

import matplotlib.pyplot as plt

# Load all results

steps = []

mmlu_scores = []

for file in sorted(glob.glob("results/step-*.json")):

    with open(file) as f:

        data = json.load(f)

        step = int(file.split("-")[1].split(".")[0])

        steps.append(step)

        mmlu_scores.append(data["results"]["mmlu"]["acc"])

# Plot

plt.plot(steps, mmlu_scores)

plt.xlabel("Training Step")

plt.ylabel("MMLU Accuracy")

plt.title("Training Progress")

plt.savefig("training_curve.png")

Workflow 3: Compare multiple models

Benchmark suite for model comparison.

Model Comparison:

- [ ] Step 1: Define model list

- [ ] Step 2: Run evaluations

- [ ] Step 3: Generate comparison table

Step 1: Define model list

# models.txt

meta-llama/Llama-2-7b-hf

meta-llama/Llama-2-13b-hf

mistralai/Mistral-7B-v0.1

microsoft/phi-2

Step 2: Run evaluations

#!/bin/bash

# eval_all_models.sh

TASKS="mmlu,gsm8k,hellaswag,truthfulqa"

while read model; do

    echo "Evaluating $model"

    # Extract model name for output file

    model_name=$(echo $model | sed 's/\//-/g')

    lm_eval --model hf \

      --model_args pretrained=$model,dtype=bfloat16 \

      --tasks $TASKS \

      --num_fewshot 5 \

      --batch_size auto \

      --output_path results/$model_name.json

done < models.txt

Step 3: Generate comparison table

import json

import pandas as pd

models = [

    "meta-llama-Llama-2-7b-hf",

    "meta-llama-Llama-2-13b-hf",

    "mistralai-Mistral-7B-v0.1",

    "microsoft-phi-2"

]

tasks = ["mmlu", "gsm8k", "hellaswag", "truthfulqa"]

results = []

for model in models:

    with open(f"results/{model}.json") as f:

        data = json.load(f)

        row = {"Model": model.replace("-", "/")}

        for task in tasks:

            # Get primary metric for each task

            metrics = data["results"][task]

            if "acc" in metrics:

                row[task.upper()] = f"{metrics['acc']:.3f}"

            elif "exact_match" in metrics:

                row[task.upper()] = f"{metrics['exact_match']:.3f}"

        results.append(row)

df = pd.DataFrame(results)

print(df.to_markdown(index=False))

Output:

| Model                  | MMLU  | GSM8K | HELLASWAG | TRUTHFULQA |

|------------------------|-------|-------|-----------|------------|

| meta-llama/Llama-2-7b  | 0.459 | 0.142 | 0.765     | 0.391      |

| meta-llama/Llama-2-13b | 0.549 | 0.287 | 0.801     | 0.430      |

| mistralai/Mistral-7B   | 0.626 | 0.395 | 0.812     | 0.428      |

| microsoft/phi-2        | 0.560 | 0.613 | 0.682     | 0.447      |

Workflow 4: Evaluate with vLLM (faster inference)

Use vLLM backend for 5-10x faster evaluation.

vLLM Evaluation:

- [ ] Step 1: Install vLLM

- [ ] Step 2: Configure vLLM backend

- [ ] Step 3: Run evaluation

Step 1: Install vLLM

pip install vllm

Step 2: Configure vLLM backend

lm_eval --model vllm \

  --model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8 \

  --tasks mmlu \

  --batch_size auto

Step 3: Run evaluation

vLLM is 5-10× faster than standard HuggingFace:

# Standard HF: ~2 hours for MMLU on 7B model

lm_eval --model hf \

  --model_args pretrained=meta-llama/Llama-2-7b-hf \

  --tasks mmlu \

  --batch_size 8

# vLLM: ~15-20 minutes for MMLU on 7B model

lm_eval --model vllm \

  --model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=2 \

  --tasks mmlu \

  --batch_size auto

When to use vs alternatives

Use lm-evaluation-harness when:

  • Benchmarking models for academic papers
  • Comparing model quality across standard tasks
  • Tracking training progress
  • Reporting standardized metrics (everyone uses same prompts)
  • Need reproducible evaluation

Use alternatives instead:

  • HELM (Stanford): Broader evaluation (fairness, efficiency, calibration)
  • AlpacaEval: Instruction-following evaluation with LLM judges
  • MT-Bench: Conversational multi-turn evaluation
  • Custom scripts: Domain-specific evaluation

Common issues

Issue: Evaluation too slow

Use vLLM backend:

lm_eval --model vllm \

  --model_args pretrained=model-name,tensor_parallel_size=2

Or reduce fewshot examples:

--num_fewshot 0  # Instead of 5

Or evaluate subset of MMLU:

--tasks mmlu_stem  # Only STEM subjects

Issue: Out of memory

Reduce batch size:

--batch_size 1  # Or --batch_size auto

Use quantization:

--model_args pretrained=model-name,load_in_8bit=True

Enable CPU offloading:

--model_args pretrained=model-name,device_map=auto,offload_folder=offload

Issue: Different results than reported

Check fewshot count:

--num_fewshot 5  # Most papers use 5-shot

Check exact task name:

--tasks mmlu  # Not mmlu_direct or mmlu_fewshot

Verify model and tokenizer match:

--model_args pretrained=model-name,tokenizer=same-model-name

Issue: HumanEval not executing code

Install execution dependencies:

pip install human-eval

Enable code execution:

lm_eval --model hf \

  --model_args pretrained=model-name \

  --tasks humaneval \

  --allow_code_execution  # Required for HumanEval

Advanced topics

Benchmark descriptions: See references/benchmark-guide.md for detailed description of all 60+ tasks, what they measure, and interpretation.

Custom tasks: See references/custom-tasks.md for creating domain-specific evaluation tasks.

API evaluation: See references/api-evaluation.md for evaluating OpenAI, Anthropic, and other API models.

Multi-GPU strategies: See references/distributed-eval.md for data parallel and tensor parallel evaluation.

Hardware requirements

  • GPU: NVIDIA (CUDA 11.8+), works on CPU (very slow)
  • VRAM:
  • 7B model: 16GB (bf16) or 8GB (8-bit)
  • 13B model: 28GB (bf16) or 14GB (8-bit)
  • 70B model: Requires multi-GPU or quantization
  • Time (7B model, single A100):
  • HellaSwag: 10 minutes
  • GSM8K: 5 minutes
  • MMLU (full): 2 hours
  • HumanEval: 20 minutes

Resources

  • Task library: 60+ tasks including MMLU, GSM8K, HumanEval, TruthfulQA, HellaSwag, ARC, WinoGrande, etc.
BrowserAct

Let your agent run on any real-world website

Bypass CAPTCHA & anti-bot for free. Start local, scale to cloud.

Explore BrowserAct Skills →

Stop writing automation&scrapers

Install the CLI. Run your first Skill in 30 seconds. Scale when you're ready.

Start free
free · no credit card