SKILL.md

$27

Build everything

make

Verify build artifacts

ls infer chat main

The Makefile compiles `infer.m`, `chat.m`, `main.m` with Metal shader compilation for `shaders.metal`.

## Weight Preparation

### Step 1: Extract non-expert weights

From the metal_infer/ directory

Point to your downloaded Qwen3.5-397B safetensors directory

python3 extract_weights.py /path/to/Qwen3.5-397B-A17B-Instruct/

Produces:

model_weights.bin (~5.5GB, mmap'd at runtime)

model_weights.json (tensor manifest)

vocab.bin (vocabulary)

tokenizer.bin (BPE tokenizer data)


### Step 2: Pack expert weights (4-bit, production)

From repo root

python3 repack_experts.py /path/to/Qwen3.5-397B-A17B-Instruct/ metal_infer/packed_experts/

Produces packed_experts/ directory (~209GB)

Each expert is a separate file: layer_XX_expert_YYYY.bin


### Step 3: Optional 2-bit requantization (faster but breaks JSON/tool calling)

Convert 4-bit experts to 2-bit (saves ~89GB, 120GB total)

python3 metal_infer/repack_experts_2bit.py \

metal_infer/packed_experts/ \

metal_infer/packed_experts_2bit/


## Key Commands

### Basic inference

cd metal_infer

4-bit inference (production quality, tool calling works)

./infer --prompt "Explain quantum computing" --tokens 100

2-bit inference (faster, breaks JSON/tool calling)

./infer --prompt "Explain quantum computing" --tokens 100 --2bit

Per-layer timing breakdown

./infer --prompt "Hello" --tokens 20 --timing


### Interactive chat with tool calling

./chat

Opens TUI with full tool calling support

Uses 4-bit experts by default


### MoE-only benchmark (measures expert throughput)

./main

Runs pure expert forward-pass benchmark

Reports tokens/sec without attention overhead


## Project Structure

flash-moe/

├── paper/

│ └── flash_moe.pdf # Full technical paper

├── metal_infer/

│ ├── infer.m # Complete inference engine (~7000 lines)

│ ├── shaders.metal # Metal compute kernels (~1200 lines)

│ ├── chat.m # Interactive chat TUI

│ ├── tokenizer.h # Single-header C BPE tokenizer (449 lines)

│ ├── main.m # MoE-only benchmark

│ ├── Makefile

│ ├── extract_weights.py # Safetensors → model_weights.bin

│ ├── repack_experts_2bit.py # 4-bit → 2-bit requantization

│ ├── train_predictor.py # Expert routing prediction analysis

│ ├── model_weights.bin # Non-expert weights (mmap'd)

│ ├── model_weights.json # Tensor manifest

│ ├── vocab.bin

│ ├── tokenizer.bin

│ ├── packed_experts/ # 4-bit expert files (209GB)

│ └── packed_experts_2bit/ # 2-bit expert files (120GB, optional)

├── repack_experts.py # 4-bit expert packing from safetensors

├── progress.py # Results visualization

└── results.tsv # Experiment log


## Architecture Overview

The model has **60 transformer layers**:

- 45 GatedDeltaNet (linear attention) layers

- 15 standard full attention layers

- Each layer: 512 experts, K=4 activated per token + 1 shared expert

- Hidden dimension: 4096

### Per-layer pipeline (4.28ms average at 4-bit)

CMD3(prev) → CMD1: attention projections + delta-net [1.22ms GPU]

→ CPU: flush results [0.01ms CPU]

→ CMD2: o_proj + norm + routing + shared [0.55ms GPU]

→ CPU: softmax + topK routing [0.003ms]

→ I/O: parallel pread K=4 experts [2.41ms SSD]

→ CMD3: expert forward + combine + norm [0.04ms encode, DEFERRED]


## Metal Shader Kernels

The `shaders.metal` file contains hand-written kernels. Key kernels:

// 4-bit dequantized matrix-vector multiply (FMA-optimized)

// Key insight: fma(nibble, scalex, biasx) instead of (nibblescale + bias)x

// Pre-compute scalex and biasx to fuse dequant+multiply in one FMA instruction

kernel void matvec_4bit_fma(

device const uint8_t* weights [[buffer(0)]],

device const float* scales [[buffer(1)]],

device const float* biases [[buffer(2)]],

device const float* x [[buffer(3)]],

device float* out [[buffer(4)]],

uint tid [[thread_position_in_threadgroup]],

uint gid [[threadgroup_position_in_grid]])

{

// ... tiled SIMD-reduced FMA kernel

// 12% faster than naive (nibble scale + bias) x

}

// Fused SwiGLU activation

kernel void swiglu(device float* gate [[buffer(0)]],

device const float* up [[buffer(1)]],

uint gid [[thread_position_in_grid]])

{

float g = gate[gid];

gate[gid] = (g / (1.0f + exp(-g))) * up[gid];

}

// RMS normalization (two-pass)

kernel void rms_norm_pass1(...) // sum of squares reduction

kernel void rms_norm_pass2(...) // apply normalization

// GPU RoPE (fused with Q deinterleave and K normalization)

kernel void rope_qk(...)

// MoE combine + residual + sigmoid gate (fused)

kernel void moe_combine_residual(...)


## SSD Expert Streaming Pattern

The core innovation — loading only K=4 active experts per layer from SSD:

// Parallel expert loading using GCD dispatch groups

// From infer.m (conceptual pattern)

dispatch_group_t group = dispatch_group_create();

dispatch_queue_t ioQueue = dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0);

for (int k = 0; k < K_EXPERTS; k++) {

int expert_id = top_k_indices[k];

dispatch_group_async(group, ioQueue, ^{

// Each expert: ~6.75MB at 4-bit

char path[256];

snprintf(path, sizeof(path),

"packed_experts/layer_%02d_expert_%04d.bin",

layer, expert_id);

int fd = open(path, O_RDONLY);

// pread() — non-blocking, OS page cache handles LRU

pread(fd, expert_buffer[k], expert_size, 0);

close(fd);

});

}

dispatch_group_wait(group, DISPATCH_TIME_FOREVER);

// GPU compute follows — serial pipeline is hardware-optimal on Apple Silicon


**Why `pread()` not `mmap()`**: mmap incurs per-page fault overhead on cold data (~5x slower). Direct `pread()` with OS page cache achieves ~71% hit rate naturally.

## GatedDeltaNet Linear Attention (BLAS)

The recurrence update uses Accelerate BLAS — 64% faster than scalar:

// GatedDeltaNet state update per head (conceptual pattern)

// state: 128×128 float matrix, 64 heads

// From infer.m

#import <Accelerate/Accelerate.h>

for (int h = 0; h < 64; h++) {

float S = state + h 128 * 128; // 128×128 state matrix

float q = Q + h 128;

float k = K + h 128;

float v = V + h 128;

// β·(k⊗v) outer product update

// cblas_sger: S += beta * (k ⊗ v)

cblas_sger(CblasRowMajor, 128, 128,

beta[h], k, 1, v, 1, S, 128);

// Decay: S = alpha * S

cblas_sscal(128 * 128, alpha[h], S, 1);

// Output: o = S @ q

cblas_sgemv(CblasRowMajor, CblasNoTrans,

128, 128, 1.0f, S, 128, q, 1, 0.0f,

output + h * 128, 1);

}


## Performance Configuration

### 4-bit (production default)

- **Quality**: Excellent — full tool calling, correct JSON

- **Speed**: 4.36 tok/s

- **Disk**: 209GB

### 2-bit (speed testing only)

- **Quality**: Good — but breaks JSON/tool calling (`\name\` instead of `"name"`)

- **Speed**: 5.74 tok/s (7.05 peak single token with warm cache)

- **Disk**: 120GB

- Uses `F_NOCACHE` flag to avoid page cache thrashing

## What NOT to Try (Learned from 58 Experiments)

Approach
Why it fails

`mmap()` expert files
Per-page fault overhead: 5x slower than `pread()`

`dispatch_io`
`dispatch_data` management overhead: -70%

`F_RDADVISE` prefetch
SSD DMA + GPU share memory controller — concurrent access: -73% GPU speed

Custom Metal LRU cache
GPU memory pressure: -38% vs OS page cache

LZ4 expert compression
Decompress overhead > warm cache savings: -13%

Temporal expert prediction
25% hit rate, wastes SSD bandwidth: -18%

Speculative early routing
Cache pollution: -38%

MTP speculative decoding
MoE I/O scales per-token (unlike dense models): break-even

Spin-poll GPU wait
CPU thermal throttle competes with GPU: -23%

Parallel SSD + GPU overlap
Unified memory controller arbitration: net negative

**Key principle**: On Apple Silicon, GPU DMA and SSD DMA share the same memory controller. The serial pipeline (GPU → SSD → GPU) is hardware-optimal.

## Troubleshooting

### Build fails

Ensure Xcode CLI tools are installed

xcode-select --install

Check Metal compiler is available

xcrun -sdk macosx metal --version


### Out of memory

The engine is designed to use ~6GB active:

- 5.5GB: `model_weights.bin` (mmap'd, read-only)

- ~200MB: Metal scratch buffers

- Remaining ~42GB: OS page cache for expert data

If you see OOM, check for other processes consuming unified memory:

sudo memory_pressure

vm_stat


### Slow performance

Check SSD speed — needs ~17GB/s for target performance

Run with timing to identify bottleneck

./infer --prompt "Hello" --tokens 5 --timing

Verify packed_experts/ is on internal SSD, not external drive

diskutil info /


### Wrong expert directory

flash-moe-inference

SKILL.md

Build everything

Verify build artifacts

From the metal_infer/ directory

Point to your downloaded Qwen3.5-397B safetensors directory

Produces:

model_weights.bin (~5.5GB, mmap'd at runtime)

model_weights.json (tensor manifest)

vocab.bin (vocabulary)

tokenizer.bin (BPE tokenizer data)

From repo root

Produces packed_experts/ directory (~209GB)

Each expert is a separate file: layer_XX_expert_YYYY.bin

Convert 4-bit experts to 2-bit (saves ~89GB, 120GB total)

4-bit inference (production quality, tool calling works)

2-bit inference (faster, breaks JSON/tool calling)

Per-layer timing breakdown

Opens TUI with full tool calling support

Uses 4-bit experts by default

Runs pure expert forward-pass benchmark

Reports tokens/sec without attention overhead

Ensure Xcode CLI tools are installed

Check Metal compiler is available

Check SSD speed — needs ~17GB/s for target performance

Run with timing to identify bottleneck

Verify packed_experts/ is on internal SSD, not external drive

Default paths expected by infer.m:

metal_infer/packed_experts/ (4-bit)

metal_infer/packed_experts_2bit/ (2-bit)

Ensure you're running from metal_infer/ directory

Stop writing automation&scrapers

flash-moe-inference

SKILL.md

Build everything

Verify build artifacts

From the metal_infer/ directory

Point to your downloaded Qwen3.5-397B safetensors directory

Produces:

model_weights.bin (~5.5GB, mmap'd at runtime)

model_weights.json (tensor manifest)

vocab.bin (vocabulary)

tokenizer.bin (BPE tokenizer data)

From repo root

Produces packed_experts/ directory (~209GB)

Each expert is a separate file: layer_XX_expert_YYYY.bin

Convert 4-bit experts to 2-bit (saves ~89GB, 120GB total)

4-bit inference (production quality, tool calling works)

2-bit inference (faster, breaks JSON/tool calling)

Per-layer timing breakdown

Opens TUI with full tool calling support

Uses 4-bit experts by default

Runs pure expert forward-pass benchmark

Reports tokens/sec without attention overhead

Ensure Xcode CLI tools are installed

Check Metal compiler is available

Check SSD speed — needs ~17GB/s for target performance

Run with timing to identify bottleneck

Verify packed_experts/ is on internal SSD, not external drive

Default paths expected by infer.m:

metal_infer/packed_experts/ (4-bit)

metal_infer/packed_experts_2bit/ (2-bit)

Ensure you're running from metal_infer/ directory

Let your agent run on any real-world website

Related skills

Stop writing automation&scrapers