SKILL.md
$27
Build everything
make
Verify build artifacts
ls infer chat main
The Makefile compiles `infer.m`, `chat.m`, `main.m` with Metal shader compilation for `shaders.metal`.
## Weight Preparation
### Step 1: Extract non-expert weights
From the metal_infer/ directory
Point to your downloaded Qwen3.5-397B safetensors directory
python3 extract_weights.py /path/to/Qwen3.5-397B-A17B-Instruct/
Produces:
model_weights.bin (~5.5GB, mmap'd at runtime)
model_weights.json (tensor manifest)
vocab.bin (vocabulary)
tokenizer.bin (BPE tokenizer data)
### Step 2: Pack expert weights (4-bit, production)
From repo root
python3 repack_experts.py /path/to/Qwen3.5-397B-A17B-Instruct/ metal_infer/packed_experts/
Produces packed_experts/ directory (~209GB)
Each expert is a separate file: layer_XX_expert_YYYY.bin
### Step 3: Optional 2-bit requantization (faster but breaks JSON/tool calling)
Convert 4-bit experts to 2-bit (saves ~89GB, 120GB total)
python3 metal_infer/repack_experts_2bit.py \
metal_infer/packed_experts/ \
metal_infer/packed_experts_2bit/
## Key Commands
### Basic inference
cd metal_infer
4-bit inference (production quality, tool calling works)
./infer --prompt "Explain quantum computing" --tokens 100
2-bit inference (faster, breaks JSON/tool calling)
./infer --prompt "Explain quantum computing" --tokens 100 --2bit
Per-layer timing breakdown
./infer --prompt "Hello" --tokens 20 --timing
### Interactive chat with tool calling
./chat
Opens TUI with full tool calling support
Uses 4-bit experts by default
### MoE-only benchmark (measures expert throughput)
./main
Runs pure expert forward-pass benchmark
Reports tokens/sec without attention overhead
## Project Structure
flash-moe/
├── paper/
│ └── flash_moe.pdf # Full technical paper
├── metal_infer/
│ ├── infer.m # Complete inference engine (~7000 lines)
│ ├── shaders.metal # Metal compute kernels (~1200 lines)
│ ├── chat.m # Interactive chat TUI
│ ├── tokenizer.h # Single-header C BPE tokenizer (449 lines)
│ ├── main.m # MoE-only benchmark
│ ├── Makefile
│ ├── extract_weights.py # Safetensors → model_weights.bin
│ ├── repack_experts_2bit.py # 4-bit → 2-bit requantization
│ ├── train_predictor.py # Expert routing prediction analysis
│ ├── model_weights.bin # Non-expert weights (mmap'd)
│ ├── model_weights.json # Tensor manifest
│ ├── vocab.bin
│ ├── tokenizer.bin
│ ├── packed_experts/ # 4-bit expert files (209GB)
│ └── packed_experts_2bit/ # 2-bit expert files (120GB, optional)
├── repack_experts.py # 4-bit expert packing from safetensors
├── progress.py # Results visualization
└── results.tsv # Experiment log
## Architecture Overview
The model has **60 transformer layers**:
- 45 GatedDeltaNet (linear attention) layers
- 15 standard full attention layers
- Each layer: 512 experts, K=4 activated per token + 1 shared expert
- Hidden dimension: 4096
### Per-layer pipeline (4.28ms average at 4-bit)
CMD3(prev) → CMD1: attention projections + delta-net [1.22ms GPU]
→ CPU: flush results [0.01ms CPU]
→ CMD2: o_proj + norm + routing + shared [0.55ms GPU]
→ CPU: softmax + topK routing [0.003ms]
→ I/O: parallel pread K=4 experts [2.41ms SSD]
→ CMD3: expert forward + combine + norm [0.04ms encode, DEFERRED]
## Metal Shader Kernels
The `shaders.metal` file contains hand-written kernels. Key kernels:
// 4-bit dequantized matrix-vector multiply (FMA-optimized)
// Key insight: fma(nibble, scalex, biasx) instead of (nibblescale + bias)x
// Pre-compute scalex and biasx to fuse dequant+multiply in one FMA instruction
kernel void matvec_4bit_fma(
device const uint8_t* weights [[buffer(0)]],
device const float* scales [[buffer(1)]],
device const float* biases [[buffer(2)]],
device const float* x [[buffer(3)]],
device float* out [[buffer(4)]],
uint tid [[thread_position_in_threadgroup]],
uint gid [[threadgroup_position_in_grid]])
{
// ... tiled SIMD-reduced FMA kernel
// 12% faster than naive (nibble scale + bias) x
}
// Fused SwiGLU activation
kernel void swiglu(device float* gate [[buffer(0)]],
device const float* up [[buffer(1)]],
uint gid [[thread_position_in_grid]])
{
float g = gate[gid];
gate[gid] = (g / (1.0f + exp(-g))) * up[gid];
}
// RMS normalization (two-pass)
kernel void rms_norm_pass1(...) // sum of squares reduction
kernel void rms_norm_pass2(...) // apply normalization
// GPU RoPE (fused with Q deinterleave and K normalization)
kernel void rope_qk(...)
// MoE combine + residual + sigmoid gate (fused)
kernel void moe_combine_residual(...)
## SSD Expert Streaming Pattern
The core innovation — loading only K=4 active experts per layer from SSD:
// Parallel expert loading using GCD dispatch groups
// From infer.m (conceptual pattern)
dispatch_group_t group = dispatch_group_create();
dispatch_queue_t ioQueue = dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0);
for (int k = 0; k < K_EXPERTS; k++) {
int expert_id = top_k_indices[k];
dispatch_group_async(group, ioQueue, ^{
// Each expert: ~6.75MB at 4-bit
char path[256];
snprintf(path, sizeof(path),
"packed_experts/layer_%02d_expert_%04d.bin",
layer, expert_id);
int fd = open(path, O_RDONLY);
// pread() — non-blocking, OS page cache handles LRU
pread(fd, expert_buffer[k], expert_size, 0);
close(fd);
});
}
dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
// GPU compute follows — serial pipeline is hardware-optimal on Apple Silicon
**Why `pread()` not `mmap()`**: mmap incurs per-page fault overhead on cold data (~5x slower). Direct `pread()` with OS page cache achieves ~71% hit rate naturally.
## GatedDeltaNet Linear Attention (BLAS)
The recurrence update uses Accelerate BLAS — 64% faster than scalar:
// GatedDeltaNet state update per head (conceptual pattern)
// state: 128×128 float matrix, 64 heads
// From infer.m
#import <Accelerate/Accelerate.h>
for (int h = 0; h < 64; h++) {
float S = state + h 128 * 128; // 128×128 state matrix
float q = Q + h 128;
float k = K + h 128;
float v = V + h 128;
// β·(k⊗v) outer product update
// cblas_sger: S += beta * (k ⊗ v)
cblas_sger(CblasRowMajor, 128, 128,
beta[h], k, 1, v, 1, S, 128);
// Decay: S = alpha * S
cblas_sscal(128 * 128, alpha[h], S, 1);
// Output: o = S @ q
cblas_sgemv(CblasRowMajor, CblasNoTrans,
128, 128, 1.0f, S, 128, q, 1, 0.0f,
output + h * 128, 1);
}
## Performance Configuration
### 4-bit (production default)
- **Quality**: Excellent — full tool calling, correct JSON
- **Speed**: 4.36 tok/s
- **Disk**: 209GB
### 2-bit (speed testing only)
- **Quality**: Good — but breaks JSON/tool calling (`\name\` instead of `"name"`)
- **Speed**: 5.74 tok/s (7.05 peak single token with warm cache)
- **Disk**: 120GB
- Uses `F_NOCACHE` flag to avoid page cache thrashing
## What NOT to Try (Learned from 58 Experiments)
Approach
Why it fails
`mmap()` expert files
Per-page fault overhead: 5x slower than `pread()`
`dispatch_io`
`dispatch_data` management overhead: -70%
`F_RDADVISE` prefetch
SSD DMA + GPU share memory controller — concurrent access: -73% GPU speed
Custom Metal LRU cache
GPU memory pressure: -38% vs OS page cache
LZ4 expert compression
Decompress overhead > warm cache savings: -13%
Temporal expert prediction
25% hit rate, wastes SSD bandwidth: -18%
Speculative early routing
Cache pollution: -38%
MTP speculative decoding
MoE I/O scales per-token (unlike dense models): break-even
Spin-poll GPU wait
CPU thermal throttle competes with GPU: -23%
Parallel SSD + GPU overlap
Unified memory controller arbitration: net negative
**Key principle**: On Apple Silicon, GPU DMA and SSD DMA share the same memory controller. The serial pipeline (GPU → SSD → GPU) is hardware-optimal.
## Troubleshooting
### Build fails
Ensure Xcode CLI tools are installed
xcode-select --install
Check Metal compiler is available
xcrun -sdk macosx metal --version
### Out of memory
The engine is designed to use ~6GB active:
- 5.5GB: `model_weights.bin` (mmap'd, read-only)
- ~200MB: Metal scratch buffers
- Remaining ~42GB: OS page cache for expert data
If you see OOM, check for other processes consuming unified memory:
sudo memory_pressure
vm_stat
### Slow performance
Check SSD speed — needs ~17GB/s for target performance
Run with timing to identify bottleneck
./infer --prompt "Hello" --tokens 5 --timing
Verify packed_experts/ is on internal SSD, not external drive
diskutil info /
### Wrong expert directory
Default paths expected by infer.m:
metal_infer/packed_experts/ (4-bit)
metal_infer/packed_experts_2bit/ (2-bit)
Ensure you're running from metal_infer/ directory
cd metal_infer
./infer --prompt "test"