nowait-reasoning-optimizer

Implements the NOWAIT technique for efficient reasoning in R1-style LLMs. Use when optimizing inference of reasoning models (QwQ, DeepSeek-R1, Phi4-Reasoning,…

INSTALLATION

npx skills add https://github.com/davila7/claude-code-templates --skill nowait-reasoning-optimizer

Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

NOWAIT Reasoning Optimizer

Name: nowait-reasoning-optimizer
Author: davila7

Implements the NOWAIT technique from the paper "Wait, We Don't Need to 'Wait'! Removing Thinking Tokens Improves Reasoning Efficiency" (Wang et al., 2025).

Overview

NOWAIT is a training-free inference-time intervention that suppresses self-reflection tokens (e.g., "Wait", "Hmm", "Alternatively") during generation, reducing chain-of-thought (CoT) trajectory length by 27-51% without compromising model utility.

When to Use

Deploying R1-style reasoning models with limited compute

Reducing inference latency for production systems

Optimizing token costs for reasoning tasks

Working with verbose CoT outputs that need streamlining

Supported Models

| Model Series | Type | Token Reduction |
| --- | --- | --- |
| QwQ-32B | RL-based | 16-31% |
| Phi4-Reasoning-Plus | RL-based | 23-28% |
| Qwen3-32B | RL-based | 13-16% |
| Kimi-VL-A3B | Multimodal | 40-60% |
| QvQ-72B-Preview | Multimodal | 20-30% |

Important: NOWAIT works best with RL-based models. Distilled models (Qwen3-4B/8B/14B) show degraded performance when reflection tokens are suppressed.

Quick Start

1. Basic Implementation

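from scripts.nowait_processor import NOWAITLogitProcessor

# Assumes `model`, `tokenizer`, and tokenized `inputs` are already defined
# (see the HuggingFace Transformers example under Integration Examples).

# Initialize the processor for your model's tokenizer
processor = NOWAITLogitProcessor(tokenizer)

# Pass the processor to generate() so it runs at every decoding step
outputs = model.generate(
    inputs,
    logits_processor=[processor],
    max_new_tokens=32768
)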

2. Keywords Suppressed

See references/keywords.md for the complete list. Core keywords:

wait, alternatively, hmm, but, however, check,

double-check, maybe, verify, again, oh, ah

How It Works

1. Initialize Keywords: Identify reflection keywords from empirical analysis
2. Expand to Token Variants: Map keywords to all token variants in vocabulary (e.g., "wait" → " wait", "Wait", " Wait", ".wait", "WAIT")
3. Suppress During Inference: Set logits of reflection tokens to large negative values during decoding

Logits (Before)         Logits (After)
Wait     0.8     →     Wait     -inf
First    0.6     →     First    0.6
Hmm      0.5     →     Hmm      -inf
Let      0.4     →     Let      0.4
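The three steps above can be sketched in plain Python without loading a model. This is a minimal illustration, not the contents of scripts/nowait_processor.py: the toy vocabulary and the names `expand_keywords`, `suppress`, and `softmax` are hypothetical. A real processor would walk the tokenizer's full vocabulary, handle tokenizer-specific space markers, and operate on a tensor of logits.

```python
import math

# Toy vocabulary (token string -> token id); a real tokenizer has far more entries.
TOY_VOCAB = {"Wait": 0, " wait": 1, "First": 2, "Hmm": 3, "Let": 4, " waiter": 5}

def expand_keywords(keywords, vocab):
    """Return ids of tokens that equal a keyword once surrounding
    whitespace/punctuation is stripped and the text is lowercased."""
    targets = {k.lower() for k in keywords}
    banned = set()
    for token, token_id in vocab.items():
        normalized = token.strip(" .,!?\n").lower()
        if normalized in targets:
            banned.add(token_id)
    return banned

def suppress(logits, banned_ids):
    """Replace the logit of every banned token id with -inf."""
    return [-math.inf if i in banned_ids else x for i, x in enumerate(logits)]

def softmax(logits):
    """Numerically stable softmax; -inf logits map to probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

banned = expand_keywords(["wait", "hmm"], TOY_VOCAB)
logits = [0.8, 0.1, 0.6, 0.5, 0.4, 0.0]
masked = suppress(logits, banned)
probs = softmax(masked)
```

Matching on the whole normalized token (rather than a substring) keeps unrelated tokens such as " waiter" available. After masking, the probability mass is redistributed over the remaining tokens, so sampling can never pick a suppressed token.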

Key Findings

Why It Works

NOWAIT doesn't eliminate self-reflection entirely—it guides models to skip unnecessary "waiting" reasoning

Models still perform essential verification at key decision points

Results in more linear, straightforward reasoning paths

RL vs Distilled Models

| Model Type | NOWAIT Effect | Recommendation |
| --- | --- | --- |
| RL-based (QwQ, Phi4, Qwen3-32B) | Stable accuracy, significant token reduction | ✅ Recommended |
| Distilled (Qwen3-4B/8B/14B) | Accuracy degradation on hard tasks | ⚠️ Use with caution |

Distilled models rely heavily on CoT structure from training data—removing reflection tokens disrupts their reasoning patterns.
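One way to act on this distinction is to gate the processor on the model name. The sketch below is an assumption for illustration only: the lookup sets and the `nowait_recommendation` helper are hypothetical and simply mirror the models named in this document.

```python
# Hypothetical lookup built from the models named above; extend it for your own models.
RL_BASED = {"qwq-32b", "phi4-reasoning-plus", "qwen3-32b"}
DISTILLED = {"qwen3-4b", "qwen3-8b", "qwen3-14b"}

def nowait_recommendation(model_name: str) -> str:
    """Return 'enable', 'caution', or 'unknown' for a model id such as 'Qwen/QwQ-32B'."""
    name = model_name.split("/")[-1].lower()  # drop the organization prefix
    if name in RL_BASED:
        return "enable"
    if name in DISTILLED:
        return "caution"
    return "unknown"
```

For a model that comes back as "caution" or "unknown", it is probably worth comparing accuracy with and without suppression on a small held-out set before enabling it.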

Integration Examples

HuggingFace Transformers

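from transformers import AutoModelForCausalLM, AutoTokenizer
from scripts.nowait_processor import NOWAITLogitProcessor

model = AutoModelForCausalLM.from_pretrained("Qwen/QwQ-32B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")
processor = NOWAITLogitProcessor(tokenizer)

prompt = "How many positive divisors does 360 have?"  # placeholder prompt for illustration

# Tokenize and keep the attention mask alongside the input ids
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

response = model.generate(
    **inputs,
    logits_processor=[processor],
    max_new_tokens=32768,
    do_sample=True,
    temperature=0.7
)

# Decode the generated ids back to text
print(tokenizer.decode(response[0], skip_special_tokens=True))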

vLLM

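from vllm import LLM, SamplingParams
from scripts.nowait_processor import get_nowait_bad_words_ids

llm = LLM(model="Qwen/QwQ-32B")
bad_words_ids = get_nowait_bad_words_ids(llm.get_tokenizer())

# Caution: `bad_words_ids` is the argument name used by Hugging Face generate().
# Confirm that SamplingParams in your vLLM version accepts it; vLLM's own
# documentation lists `bad_words` (strings) and `logit_bias` (token id -> bias).
sampling_params = SamplingParams(
    max_tokens=32768,
    bad_words_ids=bad_words_ids
)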
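If your vLLM version rejects `bad_words_ids`, one alternative is a strongly negative per-token bias. The sketch below rests on two assumptions that are not verified against the skill's script: that `get_nowait_bad_words_ids` returns Hugging Face style `list[list[int]]`, and that your vLLM `SamplingParams` accepts a `logit_bias` mapping of token id to bias. `to_logit_bias` is a hypothetical helper, and the ids are made up.

```python
def to_logit_bias(bad_words_ids, bias=-100.0):
    """Map each single-token banned sequence to a large negative bias.

    Multi-token sequences are skipped: a per-token bias cannot express
    "ban this sequence" without also penalizing each token on its own.
    """
    return {seq[0]: bias for seq in bad_words_ids if len(seq) == 1}

example_ids = [[1234], [5678], [11, 22]]  # made-up ids for illustration
logit_bias = to_logit_bias(example_ids)
```

The resulting dict could then be passed as `SamplingParams(max_tokens=32768, logit_bias=logit_bias)`. A bias of -100 strongly discourages a token rather than strictly forbidding it, so this is an approximation of the `-inf` masking described under How It Works.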

Expected Results

| Task Type | Original Tokens | NOWAIT Tokens | Reduction |
| --- | --- | --- | --- |
| Math (AIME) | 15,000 | 10,500 | 30% |
| Visual QA (MMMU) | 2,900 | 1,450 | 50% |
| Video QA (MMVU) | 1,700 | 1,250 | 27% |
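The Reduction column follows from the two token counts as (original - NOWAIT) / original. A short check, using only the figures in the table (`reduction_percent` is a hypothetical helper name):

```python
def reduction_percent(original_tokens: int, nowait_tokens: int) -> float:
    """Percentage of tokens saved relative to the original trajectory length."""
    return 100.0 * (original_tokens - nowait_tokens) / original_tokens

# Token counts copied from the Expected Results table
rows = {
    "Math (AIME)": (15_000, 10_500),
    "Visual QA (MMMU)": (2_900, 1_450),
    "Video QA (MMVU)": (1_700, 1_250),
}
reductions = {task: round(reduction_percent(o, n), 1) for task, (o, n) in rows.items()}
```

The first two rows reproduce the table exactly (30% and 50%). The third works out to about 26.5%, slightly below the listed 27%, which suggests the token counts shown are rounded.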

Limitations

Less effective on very simple problems where CoT overhead is already minimal

Distilled models may suffer accuracy loss on challenging tasks

Some domains may require model-specific keyword tuning
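For model-specific keyword tuning, a reasonable starting point is to look at which words most often open a new paragraph in your model's reasoning traces. The sketch below is an assumed heuristic, not the procedure from the paper or the skill's scripts; `paragraph_openers` and the sample traces are hypothetical.

```python
import re
from collections import Counter

def paragraph_openers(traces, top_k=5):
    """Count the first word of every paragraph (split on blank lines) across traces."""
    counts = Counter()
    for trace in traces:
        for paragraph in trace.split("\n\n"):
            match = re.match(r"\s*([A-Za-z'-]+)", paragraph)
            if match:
                counts[match.group(1).lower()] += 1
    return counts.most_common(top_k)

# Made-up traces for illustration
sample_traces = [
    "Let x = 2.\n\nWait, that is wrong.\n\nHmm, try x = 3.",
    "Wait, check the sign.\n\nSo the answer is 5.",
]
openers = paragraph_openers(sample_traces)
```

Frequent openers that signal reflection rather than progress (here "wait" and "hmm") are candidates for the suppression list; words such as "let" or "so" usually carry the reasoning forward and are better left alone.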

References

Paper: arXiv:2506.08343v2

Complete keyword list: references/keywords.md

Implementation: scripts/nowait_processor.py
