SKILL.md


    # Without sudo, installs to ~/.local/bin
    curl -fsSL https://llmfit.axjns.dev/install.sh | sh -s -- --local

### Windows (Scoop)

    scoop install llmfit

### Docker / Podman

    docker run ghcr.io/alexsjones/llmfit

    # With jq for scripting
    podman run ghcr.io/alexsjones/llmfit recommend --use-case coding | jq '.models[].name'

### From source (Rust)

    git clone https://github.com/AlexsJones/llmfit.git
    cd llmfit
    cargo build --release
    # binary at target/release/llmfit


## Core Concepts

- **Fit tiers**: `perfect` (runs great), `good` (runs well), `marginal` (runs but tight), `too_tight` (won't run)

- **Scoring dimensions**: quality, speed (tok/s estimate), fit (memory headroom), context capacity

- **Run modes**: GPU, CPU+GPU offload, CPU-only, MoE

- **Quantization**: automatically selects best quant (e.g. Q4_K_M, Q5_K_S, mlx-4bit) for your hardware

- **Providers**: Ollama, llama.cpp, MLX, Docker Model Runner
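The fit tiers form an ordering from best to worst, which is what any "minimum fit" filter relies on. A minimal sketch of that comparison, assuming the order listed above; `meets_min_fit` and `is_runnable` are illustrative helper names, not part of llmfit:

```python
# Fit tiers from best to worst, in the order given in the list above.
FIT_TIERS = ["perfect", "good", "marginal", "too_tight"]


def meets_min_fit(fit: str, min_fit: str) -> bool:
    """Return True when `fit` is at least as good as `min_fit`."""
    return FIT_TIERS.index(fit) <= FIT_TIERS.index(min_fit)


def is_runnable(fit: str) -> bool:
    """Every tier except `too_tight` is described as runnable."""
    return fit != "too_tight"
```

With these helpers, `meets_min_fit("marginal", "good")` is false, while `meets_min_fit("perfect", "good")` is true.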

## Key Commands

### Launch Interactive TUI

    llmfit

### CLI Table Output

    llmfit --cli

### Show System Hardware Detection

    llmfit system
    llmfit --json system  # JSON output

### List All Models

    llmfit list

### Search Models

    llmfit search "llama 8b"
    llmfit search "mistral"
    llmfit search "qwen coding"


### Fit Analysis

    # All runnable models ranked by fit
    llmfit fit

    # Only perfect fits, top 5
    llmfit fit --perfect -n 5

    # JSON output
    llmfit --json fit -n 10

### Model Detail

    llmfit info "Mistral-7B"
    llmfit info "Llama-3.1-70B"

### Recommendations

    # Top 5 recommendations (JSON default)
    llmfit recommend --json --limit 5

    # Filter by use case: general, coding, reasoning, chat, multimodal, embedding
    llmfit recommend --json --use-case coding --limit 3
    llmfit recommend --json --use-case reasoning --limit 5


### Hardware Planning (invert: what hardware do I need?)

    llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192
    llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192 --quant mlx-4bit
    llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192 --target-tps 25 --json
    llmfit plan "Qwen/Qwen2.5-Coder-0.5B-Instruct" --context 8192 --json

### REST API Server (for cluster scheduling)

    llmfit serve
    llmfit serve --host 0.0.0.0 --port 8787


## Hardware Overrides

When autodetection fails (VMs, broken nvidia-smi, passthrough setups):

    # Override GPU VRAM
    llmfit --memory=32G
    llmfit --memory=24G --cli
    llmfit --memory=24G fit --perfect -n 5
    llmfit --memory=24G recommend --json

    # Megabytes
    llmfit --memory=32000M

    # Works with any subcommand
    llmfit --memory=16G info "Llama-3.1-70B"

Accepted suffixes: `G`/`GB`/`GiB`, `M`/`MB`/`MiB`, `T`/`TB`/`TiB` (case-insensitive).
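When generating `--memory` values from a script, it can help to validate them first. The sketch below is a hypothetical pre-check, not llmfit's parser: `parse_memory_gb` is an illustrative name, it accepts whole numbers only, and it assumes binary steps of 1024 between units, which llmfit itself may or may not use.

```python
import re

# Multipliers relative to one gigabyte.
# Assumption: 1024-based steps; llmfit's own conversion may differ.
_UNIT_TO_GB = {"M": 1 / 1024, "G": 1.0, "T": 1024.0}

# A whole number followed by M/G/T and an optional "B" or "iB", any case.
_PATTERN = re.compile(r"^\s*(\d+)\s*([MGT])(?:I?B)?\s*$", re.IGNORECASE)


def parse_memory_gb(value: str) -> float:
    """Convert a string such as '24G', '32000MiB' or '1tb' to gigabytes."""
    match = _PATTERN.match(value)
    if match is None:
        raise ValueError(f"unrecognized memory size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_TO_GB[unit.upper()]
```

A value that passes this check can then be handed to the CLI unchanged, for example as `--memory=24G`.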

### Context Length Cap

    # Estimate memory fit at 4K context
    llmfit --max-context 4096 --cli

    # With subcommands
    llmfit --max-context 8192 fit --perfect -n 5
    llmfit --max-context 16384 recommend --json --limit 5

    # Environment variable alternative
    export OLLAMA_CONTEXT_LENGTH=8192
    llmfit recommend --json


## REST API Reference

Start the server:

    llmfit serve --host 0.0.0.0 --port 8787

### Endpoints

    # Health check
    curl http://localhost:8787/health

    # Node hardware info
    curl http://localhost:8787/api/v1/system

    # Full model list with filters
    curl "http://localhost:8787/api/v1/models?min_fit=marginal&runtime=llamacpp&sort=score&limit=20"

    # Top runnable models for this node (key scheduling endpoint)
    curl "http://localhost:8787/api/v1/models/top?limit=5&min_fit=good&use_case=coding"

    # Search by model name/provider
    curl "http://localhost:8787/api/v1/models/Mistral?runtime=any"


### Query Parameters for /models and /models/top

| Param | Values | Description |
| --- | --- | --- |
| `limit` / `n` | integer | Max rows returned |
| `min_fit` | `perfect`, `good`, `marginal`, `too_tight` | Minimum fit tier |
| `perfect` | `true`, `false` | Force perfect-only |
| `runtime` | `any`, `mlx`, `llamacpp` | Filter by runtime |
| `use_case` | `general`, `coding`, `reasoning`, `chat`, `multimodal`, `embedding` | Use case filter |
| `provider` | string | Substring match on provider |
| `search` | string | Free-text across name/provider/size/use-case |
| `sort` | `score`, `tps`, `params`, `mem`, `ctx`, `date`, `use_case` | Sort column |
| `include_too_tight` | `true`, `false` | Include non-runnable models |
| `max_context` | integer | Per-request context cap |
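Because several parameters take a fixed set of values, a client can catch typos before sending a request. The following is a minimal sketch that builds a query URL from the documented parameters; `build_models_url` and `ALLOWED` are hypothetical names introduced here, and the server's own validation behavior is not assumed.

```python
from urllib.parse import urlencode

# Allowed values copied from the parameter list above.
ALLOWED = {
    "min_fit": {"perfect", "good", "marginal", "too_tight"},
    "runtime": {"any", "mlx", "llamacpp"},
    "use_case": {"general", "coding", "reasoning", "chat", "multimodal", "embedding"},
    "sort": {"score", "tps", "params", "mem", "ctx", "date", "use_case"},
}


def build_models_url(base_url: str, top: bool = False, **params) -> str:
    """Build a /models or /models/top URL, rejecting unknown enum values."""
    query = {}
    for key, value in params.items():
        if key in ALLOWED and value not in ALLOWED[key]:
            raise ValueError(f"{key}={value!r} is not one of {sorted(ALLOWED[key])}")
        # Booleans are written as lowercase true/false, as in the list above.
        query[key] = str(value).lower() if isinstance(value, bool) else value
    path = "/api/v1/models/top" if top else "/api/v1/models"
    url = f"{base_url}{path}"
    return f"{url}?{urlencode(query)}" if query else url
```

Calling `build_models_url("http://localhost:8787", top=True, limit=5, min_fit="good", use_case="coding")` yields the same URL as the "top runnable models" curl example in the Endpoints section.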

## Scripting & Automation Examples

### Bash: Get top coding models as JSON

    #!/bin/bash
    # Get the top 3 recommended coding models
    llmfit recommend --json --use-case coding --limit 3 | \
      jq -r '.models[] | "\(.name) (\(.score)) - \(.quantization)"'


### Bash: Check if a specific model fits

    #!/bin/bash
    MODEL="Mistral-7B"
    RESULT=$(llmfit info "$MODEL" --json 2>/dev/null)
    FIT=$(echo "$RESULT" | jq -r '.fit')

    if [[ "$FIT" == "perfect" || "$FIT" == "good" ]]; then
      echo "$MODEL will run well (fit: $FIT)"
    else
      echo "$MODEL may not run well (fit: $FIT)"
    fi


### Bash: Auto-pull top Ollama model

    #!/bin/bash
    # Get the top fitting model name and pull it with Ollama
    TOP_MODEL=$(llmfit recommend --json --limit 1 | jq -r '.models[0].name')
    echo "Pulling: $TOP_MODEL"
    ollama pull "$TOP_MODEL"


### Python: Query the REST API

    import requests

    BASE_URL = "http://localhost:8787"

    def get_system_info():
        resp = requests.get(f"{BASE_URL}/api/v1/system")
        return resp.json()

    def get_top_models(use_case="coding", limit=5, min_fit="good"):
        params = {
            "use_case": use_case,
            "limit": limit,
            "min_fit": min_fit,
            "sort": "score",
        }
        resp = requests.get(f"{BASE_URL}/api/v1/models/top", params=params)
        return resp.json()

    def search_models(query, runtime="any"):
        resp = requests.get(
            f"{BASE_URL}/api/v1/models/{query}",
            params={"runtime": runtime},
        )
        return resp.json()

    # Example usage
    system = get_system_info()
    print(f"GPU: {system.get('gpu_name')} | VRAM: {system.get('vram_gb')}GB")

    models = get_top_models(use_case="reasoning", limit=3)
    for m in models.get("models", []):
        print(f"{m['name']}: score={m['score']}, fit={m['fit']}, quant={m['quantization']}")


### Python: Hardware-aware model selector for agents

    import json
    import subprocess
    from typing import Optional

    def get_best_model_for_task(use_case: str) -> Optional[dict]:
        """Use llmfit to select the best model for a given task, or None if nothing fits."""
        result = subprocess.run(
            ["llmfit", "recommend", "--json", "--use-case", use_case, "--limit", "1"],
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError instead of parsing empty output
        )
        data = json.loads(result.stdout)
        models = data.get("models", [])
        return models[0] if models else None

    def plan_hardware_requirements(model_name: str, context: int = 4096) -> dict:
        """Get hardware requirements for running a specific model."""
        result = subprocess.run(
            ["llmfit", "plan", model_name, "--context", str(context), "--json"],
            capture_output=True,
            text=True,
            check=True,
        )
        return json.loads(result.stdout)

    # Select best coding model
    best = get_best_model_for_task("coding")
    if best:
        print(f"Best coding model: {best['name']}")
        print(f"  Quantization: {best['quantization']}")
        print(f"  Estimated tok/s: {best['tps']}")
        print(f"  Memory usage: {best['mem_pct']}%")

    # Plan hardware for a specific model
    plan = plan_hardware_requirements("Qwen/Qwen3-4B-MLX-4bit", context=8192)
    print(f"Min VRAM needed: {plan['hardware']['min_vram_gb']}GB")
    print(f"Recommended VRAM: {plan['hardware']['recommended_vram_gb']}GB")


### Docker Compose: Node scheduler pattern

    version: "3.8"
    services:
      llmfit-api:
        image: ghcr.io/alexsjones/llmfit
        command: serve --host 0.0.0.0 --port 8787
        ports:
          - "8787:8787"
        environment:
          - OLLAMA_CONTEXT_LENGTH=8192
        devices:
          - /dev/nvidia0:/dev/nvidia0  # pass GPU through


## TUI Key Reference

| Key | Action |
| --- | --- |
| `↑`/`↓` or `j`/`k` | Navigate models |
| `/` | Search (name, provider, params, use case) |
| `Esc`/`Enter` | Exit search |
| `Ctrl-U` | Clear search |
| `f` | Cycle fit filter: All → Runnable → Perfect → Good → Marginal |
| `a` | Cycle availability: All → GGUF Avail → Installed |
| `s` | Cycle sort: Score → Params → Mem% → Ctx → Date → Use Case |
| `t` | Cycle color theme (auto-saved) |
| `v` | Visual mode (multi-select for comparison) |
| `V` | Select mode (column-based filtering) |
| `p` | Plan mode (what hardware needed for this model?) |
| `P` | Provider filter popup |
| `U` | Use-case filter popup |
| `C` | Capability filter popup |
| `m` | Mark model for comparison |
| `c` | Compare view (marked vs selected) |
| `d` | Download model (via detected runtime) |
| `r` | Refresh installed models from runtimes |
| `Enter` | Toggle detail view |
| `g`/`G` | Jump to top/bottom |
| `q` | Quit |

### Themes

`t` cycles: Default → Dracula → Solarized → Nord → Monokai → Gruvbox

Theme saved to `~/.config/llmfit/theme`

## GPU Detection Details

| GPU Vendor | Detection Method |
| --- | --- |
| NVIDIA | `nvidia-smi` (multi-GPU, aggregates VRAM) |
| AMD | `rocm-smi` |
| Intel Arc | sysfs (discrete) / `lspci` (integrated) |
| Apple Silicon | `system_profiler` (unified memory = VRAM) |
| Ascend | `npu-smi` |
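When `llmfit system` reports no GPU, a quick first check is whether the relevant vendor tool is on `PATH` at all. The sketch below only tests for the presence of the command-line tools named above. It does not reproduce llmfit's detection logic, it skips the sysfs path for discrete Intel Arc cards, and `available_detection_tools` is a hypothetical helper name.

```python
import shutil

# Vendor -> command-line tool named in the detection list above.
DETECTION_TOOLS = {
    "NVIDIA": "nvidia-smi",
    "AMD": "rocm-smi",
    "Intel Arc (integrated)": "lspci",
    "Apple Silicon": "system_profiler",
    "Ascend": "npu-smi",
}


def available_detection_tools(which=shutil.which) -> dict:
    """Map each vendor to True if its detection tool is found on PATH."""
    return {vendor: which(tool) is not None for vendor, tool in DETECTION_TOOLS.items()}


if __name__ == "__main__":
    for vendor, found in available_detection_tools().items():
        print(f"{vendor}: {'found' if found else 'missing'}")
```

A missing tool for hardware you do have points to a driver or toolkit install problem rather than an llmfit bug; the `--memory` override remains available in the meantime.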

## Common Patterns

### "What can I run on my 16GB M2 Mac?"

    llmfit fit --perfect -n 10

    # or interactively
    llmfit
    # press 'f' to filter to Perfect fit

### "I have a 3090 (24GB VRAM), what coding models fit?"

    llmfit recommend --json --use-case coding | jq '.models[]'

    # or with manual override if detection fails
    llmfit --memory=24G recommend --json --use-case coding

### "Can Llama 70B run on my machine?"

    llmfit info "Llama-3.1-70B"

    # Plan what hardware you'd need
    llmfit plan "Llama-3.1-70B" --context 4096 --json


### "Show me only models already installed in Ollama"

    llmfit
    # press 'a' to cycle to Installed filter

    # or
    llmfit fit -n 20  # run, press 'i' in TUI for installed-first

### "Script: find best model and start Ollama"

    MODEL=$(llmfit recommend --json --limit 1 | jq -r '.models[0].name')
    ollama serve &
    ollama run "$MODEL"

### "API: poll node capabilities for cluster scheduler"

    # Check node, get top 3 good+ models for reasoning
    curl -s "http://node1:8787/api/v1/models/top?limit=3&min_fit=good&use_case=reasoning" | \
      jq '.models[].name'
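The same poll can be run across several nodes from a scheduler. This is a minimal sketch using only the standard library: it assumes every node serves on port 8787 and that the response carries a `models` list with `name` fields, as the `jq` filter above implies. `top_models_by_node` and `fetch_json` are hypothetical names, and hostnames other than `node1` are placeholders.

```python
import json
from urllib.request import urlopen


def fetch_json(url: str, timeout: float = 5.0) -> dict:
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def top_models_by_node(nodes, use_case="reasoning", limit=3, min_fit="good", fetch=fetch_json):
    """Return {node: [model names]}; unreachable or malformed nodes map to None."""
    results = {}
    for node in nodes:
        url = (
            f"http://{node}:8787/api/v1/models/top"
            f"?limit={limit}&min_fit={min_fit}&use_case={use_case}"
        )
        try:
            payload = fetch(url)
        except (OSError, ValueError):
            # OSError covers connection errors and timeouts; ValueError covers bad JSON.
            results[node] = None
            continue
        results[node] = [m["name"] for m in payload.get("models", [])]
    return results
```

Passing `fetch` as a parameter keeps the function testable without a live server; a scheduler would call `top_models_by_node(["node1", "node2"])` and skip any node mapped to `None`.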


## Troubleshooting

**GPU not detected / wrong VRAM reported**

    # Verify detection
    llmfit system

    # Manual override
    llmfit --memory=24G --cli

**`nvidia-smi` not found but you have an NVIDIA GPU**

    # Install CUDA toolkit or nvidia-utils, then retry
    # Or override manually:
    llmfit --memory=8G fit --perfect

**Models show as too_tight but you have enough RAM**

    # llmfit may be using context-inflated estimates; cap context
    llmfit --max-context 2048 fit --perfect -n 10


**REST API: test endpoints**

    # Spawn server and run validation suite
    python3 scripts/test_api.py --spawn

    # Test already-running server
    python3 scripts/test_api.py --base-url http://127.0.0.1:8787

**Apple Silicon: VRAM shows as system RAM (expected)**

    # This is correct: Apple Silicon uses unified memory
    # llmfit accounts for this automatically
    llmfit system  # should show backend: Metal

**Context length environment variable**

    export OLLAMA_CONTEXT_LENGTH=4096
    llmfit recommend --json  # uses 4096 as context cap
