llmfit-hardware-model-matcher

Terminal tool that detects your hardware and recommends which LLM models will actually run well on your system

INSTALLATION
npx skills add https://github.com/aradotso/trending-skills --skill llmfit-hardware-model-matcher
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

$27

Without sudo, installs to ~/.local/bin

curl -fsSL https://llmfit.axjns.dev/install.sh | sh -s -- --local

### Windows (Scoop)

scoop install llmfit


### Docker / Podman

docker run ghcr.io/alexsjones/llmfit

With jq for scripting

podman run ghcr.io/alexsjones/llmfit recommend --use-case coding | jq '.models[].name'


### From source (Rust)

git clone https://github.com/AlexsJones/llmfit.git

cd llmfit

cargo build --release

binary at target/release/llmfit


## Core Concepts

- **Fit tiers**: `perfect` (runs great), `good` (runs well), `marginal` (runs but tight), `too_tight` (won't run)

- **Scoring dimensions**: quality, speed (tok/s estimate), fit (memory headroom), context capacity

- **Run modes**: GPU, CPU+GPU offload, CPU-only, MoE

- **Quantization**: automatically selects best quant (e.g. Q4_K_M, Q5_K_S, mlx-4bit) for your hardware

- **Providers**: Ollama, llama.cpp, MLX, Docker Model Runner

## Key Commands

### Launch Interactive TUI

llmfit


### CLI Table Output

llmfit --cli


### Show System Hardware Detection

llmfit system

llmfit --json system # JSON output


### List All Models

llmfit list


### Search Models

llmfit search "llama 8b"

llmfit search "mistral"

llmfit search "qwen coding"


### Fit Analysis

All runnable models ranked by fit

llmfit fit

Only perfect fits, top 5

llmfit fit --perfect -n 5

JSON output

llmfit --json fit -n 10


### Model Detail

llmfit info "Mistral-7B"

llmfit info "Llama-3.1-70B"


### Recommendations

Top 5 recommendations (JSON default)

llmfit recommend --json --limit 5

Filter by use case: general, coding, reasoning, chat, multimodal, embedding

llmfit recommend --json --use-case coding --limit 3

llmfit recommend --json --use-case reasoning --limit 5


### Hardware Planning (invert: what hardware do I need?)

llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192

llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192 --quant mlx-4bit

llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192 --target-tps 25 --json

llmfit plan "Qwen/Qwen2.5-Coder-0.5B-Instruct" --context 8192 --json


### REST API Server (for cluster scheduling)

llmfit serve

llmfit serve --host 0.0.0.0 --port 8787


## Hardware Overrides

When autodetection fails (VMs, broken nvidia-smi, passthrough setups):

Override GPU VRAM

llmfit --memory=32G

llmfit --memory=24G --cli

llmfit --memory=24G fit --perfect -n 5

llmfit --memory=24G recommend --json

Megabytes

llmfit --memory=32000M

Works with any subcommand

llmfit --memory=16G info "Llama-3.1-70B"


Accepted suffixes: `G`/`GB`/`GiB`, `M`/`MB`/`MiB`, `T`/`TB`/`TiB` (case-insensitive).

### Context Length Cap

Estimate memory fit at 4K context

llmfit --max-context 4096 --cli

With subcommands

llmfit --max-context 8192 fit --perfect -n 5

llmfit --max-context 16384 recommend --json --limit 5

Environment variable alternative

export OLLAMA_CONTEXT_LENGTH=8192

llmfit recommend --json


## REST API Reference

Start the server:

llmfit serve --host 0.0.0.0 --port 8787


### Endpoints

Health check

curl http://localhost:8787/health

Node hardware info

curl http://localhost:8787/api/v1/system

Full model list with filters

curl "http://localhost:8787/api/v1/models?min_fit=marginal&runtime=llamacpp&sort=score&limit=20"

Top runnable models for this node (key scheduling endpoint)

curl "http://localhost:8787/api/v1/models/top?limit=5&min_fit=good&use_case=coding"

Search by model name/provider

curl "http://localhost:8787/api/v1/models/Mistral?runtime=any"


### Query Parameters for /models and /models/top

Param
Values
Description

`limit` / `n`
integer
Max rows returned

`min_fit`
`perfect|good|marginal|too_tight`
Minimum fit tier

`perfect`
`true|false`
Force perfect-only

`runtime`
`any|mlx|llamacpp`
Filter by runtime

`use_case`
`general|coding|reasoning|chat|multimodal|embedding`
Use case filter

`provider`
string
Substring match on provider

`search`
string
Free-text across name/provider/size/use-case

`sort`
`score|tps|params|mem|ctx|date|use_case`
Sort column

`include_too_tight`
`true|false`
Include non-runnable models

`max_context`
integer
Per-request context cap

## Scripting & Automation Examples

### Bash: Get top coding models as JSON

#!/bin/bash

Get top 3 coding models that fit perfectly

llmfit recommend --json --use-case coding --limit 3 | \

jq -r '.models[] | "\(.name) (\(.score)) - \(.quantization)"'


### Bash: Check if a specific model fits

#!/bin/bash

MODEL="Mistral-7B"

RESULT=$(llmfit info "$MODEL" --json 2>/dev/null)

FIT=$(echo "$RESULT" | jq -r '.fit')

if [[ "$FIT" == "perfect" || "$FIT" == "good" ]]; then

echo "$MODEL will run well (fit: $FIT)"

else

echo "$MODEL may not run well (fit: $FIT)"

fi


### Bash: Auto-pull top Ollama model

#!/bin/bash

Get the top fitting model name and pull it with Ollama

TOP_MODEL=$(llmfit recommend --json --limit 1 | jq -r '.models[0].name')

echo "Pulling: $TOP_MODEL"

ollama pull "$TOP_MODEL"


### Python: Query the REST API

import requests

BASE_URL = "http://localhost:8787"

def get_system_info():

resp = requests.get(f"{BASE_URL}/api/v1/system")

return resp.json()

def get_top_models(use_case="coding", limit=5, min_fit="good"):

params = {

"use_case": use_case,

"limit": limit,

"min_fit": min_fit,

"sort": "score"

}

resp = requests.get(f"{BASE_URL}/api/v1/models/top", params=params)

return resp.json()

def search_models(query, runtime="any"):

resp = requests.get(

f"{BASE_URL}/api/v1/models/{query}",

params={"runtime": runtime}

)

return resp.json()

Example usage

system = get_system_info()

print(f"GPU: {system.get('gpu_name')} | VRAM: {system.get('vram_gb')}GB")

models = get_top_models(use_case="reasoning", limit=3)

for m in models.get("models", []):

print(f"{m['name']}: score={m['score']}, fit={m['fit']}, quant={m['quantization']}")


### Python: Hardware-aware model selector for agents

import subprocess

import json

def get_best_model_for_task(use_case: str, min_fit: str = "good") -> dict:

"""Use llmfit to select the best model for a given task."""

result = subprocess.run(

["llmfit", "recommend", "--json", "--use-case", use_case, "--limit", "1"],

capture_output=True,

text=True

)

data = json.loads(result.stdout)

models = data.get("models", [])

return models[0] if models else None

def plan_hardware_requirements(model_name: str, context: int = 4096) -> dict:

"""Get hardware requirements for running a specific model."""

result = subprocess.run(

["llmfit", "plan", model_name, "--context", str(context), "--json"],

capture_output=True,

text=True

)

return json.loads(result.stdout)

Select best coding model

best = get_best_model_for_task("coding")

if best:

print(f"Best coding model: {best['name']}")

print(f" Quantization: {best['quantization']}")

print(f" Estimated tok/s: {best['tps']}")

print(f" Memory usage: {best['mem_pct']}%")

Plan hardware for a specific model

plan = plan_hardware_requirements("Qwen/Qwen3-4B-MLX-4bit", context=8192)

print(f"Min VRAM needed: {plan['hardware']['min_vram_gb']}GB")

print(f"Recommended VRAM: {plan['hardware']['recommended_vram_gb']}GB")


### Docker Compose: Node scheduler pattern

version: "3.8"

services:

llmfit-api:

image: ghcr.io/alexsjones/llmfit

command: serve --host 0.0.0.0 --port 8787

ports:

- "8787:8787"

environment:

- OLLAMA_CONTEXT_LENGTH=8192

devices:

- /dev/nvidia0:/dev/nvidia0 # pass GPU through


## TUI Key Reference

Key
Action

`↑`/`↓` or `j`/`k`
Navigate models

`/`
Search (name, provider, params, use case)

`Esc`/`Enter`
Exit search

`Ctrl-U`
Clear search

`f`
Cycle fit filter: All → Runnable → Perfect → Good → Marginal

`a`
Cycle availability: All → GGUF Avail → Installed

`s`
Cycle sort: Score → Params → Mem% → Ctx → Date → Use Case

`t`
Cycle color theme (auto-saved)

`v`
Visual mode (multi-select for comparison)

`V`
Select mode (column-based filtering)

`p`
Plan mode (what hardware needed for this model?)

`P`
Provider filter popup

`U`
Use-case filter popup

`C`
Capability filter popup

`m`
Mark model for comparison

`c`
Compare view (marked vs selected)

`d`
Download model (via detected runtime)

`r`
Refresh installed models from runtimes

`Enter`
Toggle detail view

`g`/`G`
Jump to top/bottom

`q`
Quit

### Themes

`t` cycles: Default → Dracula → Solarized → Nord → Monokai → Gruvbox

Theme saved to `~/.config/llmfit/theme`

## GPU Detection Details

GPU Vendor
Detection Method

NVIDIA
`nvidia-smi` (multi-GPU, aggregates VRAM)

AMD
`rocm-smi`

Intel Arc
sysfs (discrete) / `lspci` (integrated)

Apple Silicon
`system_profiler` (unified memory = VRAM)

Ascend
`npu-smi`

## Common Patterns

### "What can I run on my 16GB M2 Mac?"

llmfit fit --perfect -n 10

or interactively

llmfit

press 'f' to filter to Perfect fit


### "I have a 3090 (24GB VRAM), what coding models fit?"

llmfit recommend --json --use-case coding | jq '.models[]'

or with manual override if detection fails

llmfit --memory=24G recommend --json --use-case coding


### "Can Llama 70B run on my machine?"

llmfit info "Llama-3.1-70B"

Plan what hardware you'd need

llmfit plan "Llama-3.1-70B" --context 4096 --json


### "Show me only models already installed in Ollama"

llmfit

press 'a' to cycle to Installed filter

or

llmfit fit -n 20 # run, press 'i' in TUI for installed-first


### "Script: find best model and start Ollama"

MODEL=$(llmfit recommend --json --limit 1 | jq -r '.models[0].name')

ollama serve &

ollama run "$MODEL"


### "API: poll node capabilities for cluster scheduler"

Check node, get top 3 good+ models for reasoning

curl -s "http://node1:8787/api/v1/models/top?limit=3&min_fit=good&use_case=reasoning" | \

jq '.models[].name'


## Troubleshooting

**GPU not detected / wrong VRAM reported**

Verify detection

llmfit system

Manual override

llmfit --memory=24G --cli


**`nvidia-smi` not found but you have an NVIDIA GPU**

Install CUDA toolkit or nvidia-utils, then retry

Or override manually:

llmfit --memory=8G fit --perfect


**Models show as too_tight but you have enough RAM**

llmfit may be using context-inflated estimates; cap context

llmfit --max-context 2048 fit --perfect -n 10


**REST API: test endpoints**

Spawn server and run validation suite

python3 scripts/test_api.py --spawn

Test already-running server

python3 scripts/test_api.py --base-url http://127.0.0.1:8787


**Apple Silicon: VRAM shows as system RAM (expected)**

This is correct — Apple Silicon uses unified memory

llmfit accounts for this automatically

llmfit system # should show backend: Metal


**Context length environment variable**

export OLLAMA_CONTEXT_LENGTH=4096

llmfit recommend --json # uses 4096 as context cap

BrowserAct

Let your agent run on any real-world website

Bypass CAPTCHA & anti-bot for free. Start local, scale to cloud.

Explore BrowserAct Skills →

Stop writing automation&scrapers

Install the CLI. Run your first Skill in 30 seconds. Scale when you're ready.

Start free
free · no credit card