Without sudo, installs to ~/.local/bin
curl -fsSL https://llmfit.axjns.dev/install.sh | sh -s -- --local
### Windows (Scoop)
scoop install llmfit
### Docker / Podman
docker run ghcr.io/alexsjones/llmfit
With jq for scripting
podman run ghcr.io/alexsjones/llmfit recommend --use-case coding | jq '.models[].name'
### From source (Rust)
git clone https://github.com/AlexsJones/llmfit.git
cd llmfit
cargo build --release
binary at target/release/llmfit
## Core Concepts
- **Fit tiers**: `perfect` (runs great), `good` (runs well), `marginal` (runs but tight), `too_tight` (won't run)
- **Scoring dimensions**: quality, speed (tok/s estimate), fit (memory headroom), context capacity
- **Run modes**: GPU, CPU+GPU offload, CPU-only, MoE
- **Quantization**: automatically selects best quant (e.g. Q4_K_M, Q5_K_S, mlx-4bit) for your hardware
- **Providers**: Ollama, llama.cpp, MLX, Docker Model Runner
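
When post-processing llmfit output in a script, it can help to treat the fit tiers as an ordered scale. The sketch below assumes the ordering implied by the descriptions above (`perfect` best, `too_tight` worst); the helper names `meets_min_fit` and `is_runnable` are hypothetical and not part of llmfit.

```python
# Fit tiers from best to worst, in the order they are listed above.
FIT_ORDER = ["perfect", "good", "marginal", "too_tight"]


def meets_min_fit(fit: str, min_fit: str) -> bool:
    """Return True if `fit` is at least as good as `min_fit`."""
    return FIT_ORDER.index(fit) <= FIT_ORDER.index(min_fit)


def is_runnable(fit: str) -> bool:
    """Every tier except `too_tight` is described as able to run."""
    return fit != "too_tight"


print(meets_min_fit("perfect", "good"))   # True
print(meets_min_fit("marginal", "good"))  # False
```

An unknown tier name raises `ValueError` from `list.index`, which is usually preferable to silently accepting a typo.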
## Key Commands
### Launch Interactive TUI
llmfit
### CLI Table Output
llmfit --cli
### Show System Hardware Detection
llmfit system
llmfit --json system # JSON output
### List All Models
llmfit list
### Search Models
llmfit search "llama 8b"
llmfit search "mistral"
llmfit search "qwen coding"
### Fit Analysis
All runnable models ranked by fit
llmfit fit
Only perfect fits, top 5
llmfit fit --perfect -n 5
JSON output
llmfit --json fit -n 10
### Model Detail
llmfit info "Mistral-7B"
llmfit info "Llama-3.1-70B"
### Recommendations
Top 5 recommendations (JSON default)
llmfit recommend --json --limit 5
Filter by use case: general, coding, reasoning, chat, multimodal, embedding
llmfit recommend --json --use-case coding --limit 3
llmfit recommend --json --use-case reasoning --limit 5
### Hardware Planning (invert: what hardware do I need?)
llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192
llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192 --quant mlx-4bit
llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192 --target-tps 25 --json
llmfit plan "Qwen/Qwen2.5-Coder-0.5B-Instruct" --context 8192 --json
### REST API Server (for cluster scheduling)
llmfit serve
llmfit serve --host 0.0.0.0 --port 8787
## Hardware Overrides
When autodetection fails (VMs, broken nvidia-smi, passthrough setups):
Override GPU VRAM
llmfit --memory=32G
llmfit --memory=24G --cli
llmfit --memory=24G fit --perfect -n 5
llmfit --memory=24G recommend --json
Megabytes
llmfit --memory=32000M
Works with any subcommand
llmfit --memory=16G info "Llama-3.1-70B"
Accepted suffixes: `G`/`GB`/`GiB`, `M`/`MB`/`MiB`, `T`/`TB`/`TiB` (case-insensitive).
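
If a wrapper script needs to validate a `--memory` value before passing it on, the accepted suffixes can be checked with a small parser. This is a minimal sketch, not llmfit's own parsing code: `parse_memory` is a hypothetical helper, and it assumes every unit is binary (1G = 1024³ bytes), which the text above does not confirm.

```python
import re

# ASSUMPTION: all units are treated as binary multiples (1G = 1024**3 bytes).
# llmfit's actual interpretation of G vs GB vs GiB is not stated in this document.
_UNIT_BYTES = {"m": 1024**2, "g": 1024**3, "t": 1024**4}

# Matches e.g. "32G", "32gb", "32GiB", "32000M", "1.5T" (case-insensitive).
_MEMORY_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([mgt])(?:i?b)?\s*$", re.IGNORECASE)


def parse_memory(value: str) -> int:
    """Convert a size string with an M/G/T suffix into a byte count."""
    match = _MEMORY_RE.match(value)
    if match is None:
        raise ValueError(f"unsupported memory value: {value!r}")
    number, unit = match.groups()
    return int(float(number) * _UNIT_BYTES[unit.lower()])


print(parse_memory("32G"))
print(parse_memory("32000M"))
```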
### Context Length Cap
Estimate memory fit at 4K context
llmfit --max-context 4096 --cli
With subcommands
llmfit --max-context 8192 fit --perfect -n 5
llmfit --max-context 16384 recommend --json --limit 5
Environment variable alternative
export OLLAMA_CONTEXT_LENGTH=8192
llmfit recommend --json
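
In the examples above, the override flags always appear before the subcommand. A script that assembles the command line can follow the same pattern. The sketch below only builds the argument list and does not run anything; `build_llmfit_argv` is a hypothetical helper name.

```python
from typing import List, Optional, Sequence


def build_llmfit_argv(
    subcommand: Sequence[str] = (),
    memory: Optional[str] = None,
    max_context: Optional[int] = None,
) -> List[str]:
    """Build an llmfit command line with overrides placed before the subcommand."""
    argv = ["llmfit"]
    if memory is not None:
        argv.append(f"--memory={memory}")
    if max_context is not None:
        argv += ["--max-context", str(max_context)]
    argv += list(subcommand)
    return argv


print(build_llmfit_argv(["fit", "--perfect", "-n", "5"], memory="24G", max_context=8192))
```

The resulting list can be passed to `subprocess.run` as-is, which avoids shell quoting problems with model names.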
## REST API Reference
Start the server:
llmfit serve --host 0.0.0.0 --port 8787
### Endpoints
Health check
curl http://localhost:8787/health
Node hardware info
curl http://localhost:8787/api/v1/system
Full model list with filters
curl "http://localhost:8787/api/v1/models?min_fit=marginal&runtime=llamacpp&sort=score&limit=20"
Top runnable models for this node (key scheduling endpoint)
curl "http://localhost:8787/api/v1/models/top?limit=5&min_fit=good&use_case=coding"
Search by model name/provider
curl "http://localhost:8787/api/v1/models/Mistral?runtime=any"
### Query Parameters for /models and /models/top
| Param | Values | Description |
|---|---|---|
| `limit` / `n` | integer | Max rows returned |
| `min_fit` | `perfect`, `good`, `marginal`, `too_tight` | Minimum fit tier |
| `perfect` | `true`, `false` | Force perfect-only |
| `runtime` | `any`, `mlx`, `llamacpp` | Filter by runtime |
| `use_case` | `general`, `coding`, `reasoning`, `chat`, `multimodal`, `embedding` | Use case filter |
| `provider` | string | Substring match on provider |
| `search` | string | Free-text across name/provider/size/use-case |
| `sort` | `score`, `tps`, `params`, `mem`, `ctx`, `date`, `use_case` | Sort column |
| `include_too_tight` | `true`, `false` | Include non-runnable models |
| `max_context` | integer | Per-request context cap |
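
The parameters above can be combined into a request URL programmatically. The following is a minimal sketch that validates the enumerated values client-side before encoding them; `build_models_url` is a hypothetical helper, and the validation sets are copied from this section rather than from the server.

```python
from urllib.parse import urlencode

# Enumerated values as documented in this section.
ALLOWED = {
    "min_fit": {"perfect", "good", "marginal", "too_tight"},
    "runtime": {"any", "mlx", "llamacpp"},
    "use_case": {"general", "coding", "reasoning", "chat", "multimodal", "embedding"},
    "sort": {"score", "tps", "params", "mem", "ctx", "date", "use_case"},
}


def build_models_url(base_url: str, top: bool = False, **params) -> str:
    """Build a /api/v1/models or /api/v1/models/top URL from keyword filters."""
    query = {}
    for key, value in params.items():
        if key in ALLOWED and value not in ALLOWED[key]:
            raise ValueError(f"invalid value for {key}: {value!r}")
        if isinstance(value, bool):
            value = "true" if value else "false"  # the API documents true|false
        query[key] = value
    path = "/api/v1/models/top" if top else "/api/v1/models"
    url = base_url.rstrip("/") + path
    return f"{url}?{urlencode(query)}" if query else url


print(build_models_url("http://localhost:8787", top=True,
                       limit=5, min_fit="good", use_case="coding"))
```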
## Scripting & Automation Examples
### Bash: Get top coding models as JSON
#!/bin/bash
Get the top 3 coding models recommended for this machine
llmfit recommend --json --use-case coding --limit 3 | \
jq -r '.models[] | "\(.name) (\(.score)) - \(.quantization)"'
### Bash: Check if a specific model fits
#!/bin/bash
MODEL="Mistral-7B"
RESULT=$(llmfit info "$MODEL" --json 2>/dev/null)
FIT=$(echo "$RESULT" | jq -r '.fit')
if [[ "$FIT" == "perfect" || "$FIT" == "good" ]]; then
  echo "$MODEL will run well (fit: $FIT)"
else
  echo "$MODEL may not run well (fit: $FIT)"
fi
### Bash: Auto-pull top Ollama model
#!/bin/bash
Get the top fitting model name and pull it with Ollama
TOP_MODEL=$(llmfit recommend --json --limit 1 | jq -r '.models[0].name')
echo "Pulling: $TOP_MODEL"
ollama pull "$TOP_MODEL"
### Python: Query the REST API
import requests
BASE_URL = "http://localhost:8787"
def get_system_info():
    resp = requests.get(f"{BASE_URL}/api/v1/system")
    return resp.json()
def get_top_models(use_case="coding", limit=5, min_fit="good"):
    params = {
        "use_case": use_case,
        "limit": limit,
        "min_fit": min_fit,
        "sort": "score"
    }
    resp = requests.get(f"{BASE_URL}/api/v1/models/top", params=params)
    return resp.json()
def search_models(query, runtime="any"):
    resp = requests.get(
        f"{BASE_URL}/api/v1/models/{query}",
        params={"runtime": runtime}
    )
    return resp.json()
Example usage
system = get_system_info()
print(f"GPU: {system.get('gpu_name')} | VRAM: {system.get('vram_gb')}GB")
models = get_top_models(use_case="reasoning", limit=3)
for m in models.get("models", []):
    print(f"{m['name']}: score={m['score']}, fit={m['fit']}, quant={m['quantization']}")
### Python: Hardware-aware model selector for agents
import json
import subprocess
from typing import Optional
def get_best_model_for_task(use_case: str) -> Optional[dict]:
    """Use llmfit to select the best model for a given task, or None if none is returned."""
    result = subprocess.run(
        ["llmfit", "recommend", "--json", "--use-case", use_case, "--limit", "1"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError instead of parsing empty output
    )
    data = json.loads(result.stdout)
    models = data.get("models", [])
    return models[0] if models else None
def plan_hardware_requirements(model_name: str, context: int = 4096) -> dict:
    """Get hardware requirements for running a specific model."""
    result = subprocess.run(
        ["llmfit", "plan", model_name, "--context", str(context), "--json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)
Select best coding model
best = get_best_model_for_task("coding")
if best:
    print(f"Best coding model: {best['name']}")
    print(f" Quantization: {best['quantization']}")
    print(f" Estimated tok/s: {best['tps']}")
    print(f" Memory usage: {best['mem_pct']}%")
Plan hardware for a specific model
plan = plan_hardware_requirements("Qwen/Qwen3-4B-MLX-4bit", context=8192)
print(f"Min VRAM needed: {plan['hardware']['min_vram_gb']}GB")
print(f"Recommended VRAM: {plan['hardware']['recommended_vram_gb']}GB")
### Docker Compose: Node scheduler pattern
version: "3.8"
services:
  llmfit-api:
    image: ghcr.io/alexsjones/llmfit
    command: serve --host 0.0.0.0 --port 8787
    ports:
      - "8787:8787"
    environment:
      - OLLAMA_CONTEXT_LENGTH=8192
    devices:
      - /dev/nvidia0:/dev/nvidia0 # pass GPU through
## TUI Key Reference
| Key | Action |
|---|---|
| `↑`/`↓` or `j`/`k` | Navigate models |
| `/` | Search (name, provider, params, use case) |
| `Esc`/`Enter` | Exit search |
| `Ctrl-U` | Clear search |
| `f` | Cycle fit filter: All → Runnable → Perfect → Good → Marginal |
| `a` | Cycle availability: All → GGUF Avail → Installed |
| `s` | Cycle sort: Score → Params → Mem% → Ctx → Date → Use Case |
| `t` | Cycle color theme (auto-saved) |
| `v` | Visual mode (multi-select for comparison) |
| `V` | Select mode (column-based filtering) |
| `p` | Plan mode (what hardware needed for this model?) |
| `P` | Provider filter popup |
| `U` | Use-case filter popup |
| `C` | Capability filter popup |
| `m` | Mark model for comparison |
| `c` | Compare view (marked vs selected) |
| `d` | Download model (via detected runtime) |
| `r` | Refresh installed models from runtimes |
| `Enter` | Toggle detail view |
| `g`/`G` | Jump to top/bottom |
| `q` | Quit |
### Themes
`t` cycles: Default → Dracula → Solarized → Nord → Monokai → Gruvbox
Theme saved to `~/.config/llmfit/theme`
## GPU Detection Details
| GPU Vendor | Detection Method |
|---|---|
| NVIDIA | `nvidia-smi` (multi-GPU, aggregates VRAM) |
| AMD | `rocm-smi` |
| Intel Arc | sysfs (discrete) / `lspci` (integrated) |
| Apple Silicon | `system_profiler` (unified memory = VRAM) |
| Ascend | `npu-smi` |
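
Because detection relies on these vendor tools, a quick way to diagnose a missing GPU is to check which of them are on `PATH`. The sketch below uses only the tool names from the table; `available_detectors` is a hypothetical helper. A tool being present does not guarantee that llmfit's detection will succeed, only that the prerequisite exists.

```python
import shutil

# Tool names taken from the detection table; sysfs-based detection is not covered.
DETECTION_TOOLS = {
    "NVIDIA": "nvidia-smi",
    "AMD": "rocm-smi",
    "Intel Arc (integrated)": "lspci",
    "Apple Silicon": "system_profiler",
    "Ascend": "npu-smi",
}


def available_detectors() -> dict:
    """Map each vendor to whether its detection tool is found on PATH."""
    return {
        vendor: shutil.which(tool) is not None
        for vendor, tool in DETECTION_TOOLS.items()
    }


for vendor, found in available_detectors().items():
    print(f"{vendor}: {'found' if found else 'missing'}")
```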
## Common Patterns
### "What can I run on my 16GB M2 Mac?"
llmfit fit --perfect -n 10
or interactively
llmfit
press 'f' to filter to Perfect fit
### "I have a 3090 (24GB VRAM), what coding models fit?"
llmfit recommend --json --use-case coding | jq '.models[]'
or with manual override if detection fails
llmfit --memory=24G recommend --json --use-case coding
### "Can Llama 70B run on my machine?"
llmfit info "Llama-3.1-70B"
Plan what hardware you'd need
llmfit plan "Llama-3.1-70B" --context 4096 --json
### "Show me only models already installed in Ollama"
llmfit
press 'a' to cycle to Installed filter
or
llmfit fit -n 20 # run, press 'i' in TUI for installed-first
### "Script: find best model and start Ollama"
MODEL=$(llmfit recommend --json --limit 1 | jq -r '.models[0].name')
ollama serve &
ollama run "$MODEL"
### "API: poll node capabilities for cluster scheduler"
Check node, get top 3 good+ models for reasoning
curl -s "http://node1:8787/api/v1/models/top?limit=3&min_fit=good&use_case=reasoning" | \
jq '.models[].name'
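
On the scheduler side, the per-node responses can be compared to decide where to place a workload. This is a sketch under assumptions: it expects each response to contain a `models` list whose entries have `name` and `score` fields (the fields used by the examples in this document), and it assumes a higher score is better. `pick_node` and the sample payloads are made up for illustration.

```python
from typing import Dict, Optional, Tuple


def pick_node(responses: Dict[str, dict]) -> Optional[Tuple[str, str, float]]:
    """Return (node, model name, score) for the highest-scoring model across nodes."""
    best = None
    for node, payload in responses.items():
        models = payload.get("models", [])
        if not models:
            continue  # this node has nothing that meets the requested filters
        top = max(models, key=lambda m: m["score"])
        candidate = (node, top["name"], top["score"])
        if best is None or candidate[2] > best[2]:
            best = candidate
    return best


# Illustrative payloads only; real responses come from /api/v1/models/top on each node.
sample = {
    "node1": {"models": [{"name": "model-a", "score": 71.5}]},
    "node2": {"models": [{"name": "model-b", "score": 88.0},
                         {"name": "model-c", "score": 64.2}]},
    "node3": {"models": []},
}
print(pick_node(sample))  # ('node2', 'model-b', 88.0)
```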
## Troubleshooting
**GPU not detected / wrong VRAM reported**
Verify detection
llmfit system
Manual override
llmfit --memory=24G --cli
**`nvidia-smi` not found but you have an NVIDIA GPU**
Install CUDA toolkit or nvidia-utils, then retry
Or override manually:
llmfit --memory=8G fit --perfect
**Models show as too_tight but you have enough RAM**
llmfit may be using context-inflated estimates; cap context
llmfit --max-context 2048 fit --perfect -n 10
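
To see why capping context can change the fit tier, it helps to estimate the key/value cache, which grows linearly with context length. The function below is the general textbook approximation for a transformer KV cache, not llmfit's internal formula, and the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) is an illustrative assumption.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context: int, bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size: one key and one value vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_value


GIB = 1024**3
for ctx in (2048, 8192, 131072):
    size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context=ctx)
    print(f"context {ctx}: {size / GIB:.2f} GiB")
# context 2048: 0.25 GiB
# context 8192: 1.00 GiB
# context 131072: 16.00 GiB
```

With this shape the cache costs 128 KiB per token, so a model's full advertised context can need far more memory than its weights alone suggest.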
**REST API: test endpoints**
Spawn server and run validation suite
python3 scripts/test_api.py --spawn
Test already-running server
python3 scripts/test_api.py --base-url http://127.0.0.1:8787
**Apple Silicon: VRAM shows as system RAM (expected)**
This is correct — Apple Silicon uses unified memory
llmfit accounts for this automatically
llmfit system # should show backend: Metal
**Context length environment variable**
export OLLAMA_CONTEXT_LENGTH=4096
llmfit recommend --json # uses 4096 as context cap