aris-autonomous-ml-research

Autonomous ML research workflows using ARIS (Auto-Research-In-Sleep) — Markdown-only skills for cross-model paper review, idea discovery, experiment…

INSTALLATION
npx skills add https://github.com/aradotso/trending-skills --skill aris-autonomous-ml-research
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

$27

Core value: going from research direction → paper ideas → experiments → written paper → rebuttal, autonomously, overnight.

Installation

1. Clone the Repository

git clone https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep.git

cd Auto-claude-code-research-in-sleep

No pip install, no Docker, no daemon. The entire system is Markdown files.

2. Install Claude Code (Primary Agent)

npm install -g @anthropic-ai/claude-code

3. Install Codex MCP (Cross-Model Reviewer)

npm install -g @openai/codex

Configure Claude Code to use the Codex MCP server by adding to your Claude Code config (~/.claude/settings.json):

{

  "mcpServers": {

    "codex": {

      "command": "codex",

      "args": ["mcp"],

      "env": {

        "OPENAI_API_KEY": "$OPENAI_API_KEY"

      }

    }

  }

}

4. Copy Skills into Claude Code

# Copy all skills to Claude Code's custom skills directory

cp -r skills/claude-code/ ~/.claude/skills/

# Or symlink to stay up to date

ln -s $(pwd)/skills/claude-code ~/.claude/skills/aris

5. Set Environment Variables

# Required for Claude Code

export ANTHROPIC_API_KEY=your_anthropic_key

# Required for cross-model review (GPT-5.4 as reviewer)

export OPENAI_API_KEY=your_openai_key

# Optional: alternative reviewer models (no OpenAI needed)

export LLM_REVIEWER_BASE_URL=https://api.minimax.chat/v1

export LLM_REVIEWER_API_KEY=your_minimax_key

export LLM_REVIEWER_MODEL=MiniMax-M2.7

Alternative Model Combinations (No Claude/OpenAI Required)

ARIS works with any OpenAI-compatible API. Configure the llm-chat MCP server:

{

  "mcpServers": {

    "llm-chat": {

      "command": "node",

      "args": ["mcp-servers/llm-chat/index.js"],

      "env": {

        "LLM_BASE_URL": "$LLM_REVIEWER_BASE_URL",

        "LLM_API_KEY": "$LLM_REVIEWER_API_KEY",

        "LLM_MODEL": "$LLM_REVIEWER_MODEL"

      }

    }

  }

}

Tested combinations:

Executor

Reviewer

Config

Claude Code

GPT-5.4 xhigh

Default

Codex CLI

Gemini

Guide

Claude Code

MiniMax-M2.7

LLM_BASE_URL=https://api.minimax.chat/v1

Claude Code

GLM-5

LLM_BASE_URL=https://open.bigmodel.cn/api/paas/v4

MiniMax-M2.7

GLM-5

Guide

Codex CLI

Claude

Swap executor/reviewer

Core Workflows

Workflow 0: Full Pipeline (Start Here)

/research-pipeline "factorized gap in discrete diffusion LMs"

With a reference paper and base repo:

/research-pipeline "improve method X" — ref paper: https://arxiv.org/abs/2406.04329, base repo: https://github.com/org/project

ARIS will:

  • Read the paper → find weaknesses
  • Clone the codebase
  • Generate ideas that fix those weaknesses using that code
  • Run experiments
  • Write the paper

Parameters:

/research-pipeline "topic"

  — ref paper: <arxiv_url>       # Optional: paper to improve

  — base repo: <github_url>      # Optional: codebase to build on

  — venue: ICML                  # Target venue (default: ICML)

  — compact: true                # Lean summaries for short-context models

Workflow 1: Idea Discovery

/idea-discovery "discrete diffusion language models"

Scans literature, identifies gaps, generates novel research directions, scores each idea for novelty/feasibility, and outputs a ranked proposal list.

Workflow 1.5: Experiment Bridge

/experiment-bridge "run ablation on temperature scaling" — code review: true

Cross-model code review before GPU deployment (enabled by default). Catches bugs, confirms experimental validity, then runs.

# Example: what experiment-bridge automates

# 1. Claude Code writes training script

# 2. GPT-5.4 reviews the code (code review gate)

# 3. If approved → submits to GPU cluster

# 4. Monitors via W&#x26;B API

import wandb

api = wandb.Api()

runs = api.runs("your-entity/your-project")

for run in runs:

    print(run.name, run.summary.get("val_loss", None))

Workflow 2: Paper Writing

/paper-writing "results/" — venue: NeurIPS

Generates LaTeX paper from experiment results. Anti-hallucination enforced: every citation verified via DBLP → CrossRef → [VERIFY] tag if unconfirmed.

Venue templates available: ICML, NeurIPS, ICLR, CVPR, ACL, AAAI, ACM MM

Workflow 3: Auto Review Loop

/auto-review "paper.pdf"

The core ARIS loop:

  • Claude Code reads the paper
  • GPT-5.4 reviews as adversarial critic
  • Claude Code rewrites based on critique
  • Score tracked across rounds (target: 8/10 "clear accept")
  • Loop repeats until convergence or max rounds
Score progression: 5.2 → 6.1 → 7.3 → 8.0 ✓

Workflow 4: Rebuttal

/rebuttal "paper/ + reviews" — venue: ICML, character limit: 5000

Parameters:

Parameter

Default

Description

venue

ICML

Target venue

character limit

required

Hard limit for submission

quick mode

false

Stop after parsing + strategy (no draft)

auto experiment

false

Auto-run supplementary experiments

max stress test rounds

1

GPT-5.4 stress-test iterations

max followup rounds

3

Per-reviewer follow-up limit

Three safety gates (rebuttal won't finalize if any fails):

  • 🔒 No fabrication — every claim maps to paper/review/user-confirmed result
  • 🔒 No overpromise — every promise is user-approved
  • 🔒 Full coverage — every reviewer concern is tracked

Outputs:

  • PASTE_READY.txt — exact char count, paste directly to venue
  • REBUTTAL_DRAFT_rich.md — extended version for manual editing

Bonus: Slides and Poster

# Conference presentation

/paper-slides "paper/"     # → Beamer PDF + PPTX + speaker notes + Q&#x26;A prep

# Conference poster

/paper-poster "paper/"     # → A0/A1 poster PDF + editable PPTX + SVG

Standalone Skills

These skills can be invoked independently or are integrated into the core workflows:

Skill

Command

Description

Research Refine

/research-refine

Turn vague ideas into anchored proposals

Experiment Plan

/experiment-plan

Claim-driven experiment roadmaps

Training Check

/training-check

Validate training runs before full launch

Result to Claim

/result-to-claim

Convert raw results to paper claims

Ablation Planner

/ablation-planner

Design ablation study structure

Formula Derivation

/formula-derivation

Research formula development and verification

Grant Proposal

/grant-proposal

Write grant proposals from research

Paper Illustration

/paper-illustration

Generate figures (Gemini-powered)

Citation Claw

/citation-claw

Verify and format citations

Session Recovery &#x26; Compact Mode

For short-context models or after interruption:

/research-pipeline "topic" — compact: true

Generates lean summary files at each checkpoint. Resume after interruption:

/research-refine — resume: true

ARIS auto-checkpoints the research-refine workflow and resumes from last completed phase.

Codex CLI Native Skills

Full skill set available for OpenAI Codex without Claude Code:

cd skills/skills-codex/

codex "run idea-discovery on discrete diffusion"

MCP Server: llm-chat

The llm-chat MCP server bridges any OpenAI-compatible API as a reviewer. Start it manually for debugging:

cd mcp-servers/llm-chat/

node index.js

Environment variables:

export LLM_BASE_URL=https://api.openai.com/v1   # Any OpenAI-compatible endpoint

export LLM_API_KEY=$OPENAI_API_KEY

export LLM_MODEL=gpt-4o                          # Any model name

Free Tier via ModelScope

Zero-cost option — no API key required:

# See full guide: docs/MODELSCOPE_GUIDE.md

export MODELSCOPE_API_KEY=your_modelscope_token

export LLM_BASE_URL=https://api-inference.modelscope.cn/v1

export LLM_MODEL=Qwen/Qwen2.5-72B-Instruct

Input Templates

Templates for every workflow live in templates/:

ls templates/

# idea-discovery.md

# experiment-bridge.md

# paper-writing.md

# auto-review.md

# rebuttal.md

# research-refine.md

Use them to structure your inputs:

cat templates/rebuttal.md

# Fill in: paper path, review text, venue, character limit

# Then: /rebuttal [filled template]

Directory Structure

Auto-claude-code-research-in-sleep/

├── skills/

│   ├── claude-code/          # Claude Code SKILL.md files

│   ├── skills-codex/         # Codex CLI native skills

│   ├── idea-discovery/

│   ├── experiment-bridge/

│   ├── paper-writing/

│   ├── auto-review/

│   ├── rebuttal/             SKILL.md  ← each is a single readable file

│   ├── paper-slides/

│   ├── paper-poster/

│   ├── research-refine/

│   ├── formula-derivation/

│   └── ...

├── mcp-servers/

│   └── llm-chat/             # Universal reviewer bridge

├── templates/                # Input templates for every workflow

├── docs/

│   ├── CURSOR_ADAPTATION.md

│   ├── TRAE_ARIS_RUNBOOK_EN.md

│   ├── ANTIGRAVITY_ADAPTATION.md

│   ├── MODELSCOPE_GUIDE.md

│   ├── MiniMax-GLM-Configuration.md

│   └── CODEX_GEMINI_REVIEW_GUIDE.md

└── README.md

Troubleshooting

Cross-model review not triggering:

  • Check MCP server is running: codex mcp or node mcp-servers/llm-chat/index.js
  • Verify OPENAI_API_KEY or LLM_API_KEY is set
  • Check Claude Code MCP config in ~/.claude/settings.json

W&#x26;B metrics not loading:

import wandb

# Ensure you're logged in

wandb.login(key=os.environ["WANDB_API_KEY"])

api = wandb.Api()

# Use full entity/project path

runs = api.runs("your-entity/your-project")

Context window exceeded mid-workflow:

/research-pipeline "topic" — compact: true

Then resume with — resume: true on the next interrupted skill.

**Citation hallucination warnings ([VERIFY] tags):**

These are intentional — ARIS flags unverified citations rather than silently hallucinating. Manually verify flagged citations before submission.

Rebuttal exceeds character limit:

Increase max stress test rounds — each round trims the draft:

/rebuttal "paper/ + reviews" — character limit: 5000, max stress test rounds: 3

ModelScope free tier rate limits:

Add delay between skill calls or switch to a paid endpoint for overnight runs.

Why Two Models (Not One, Not Four)

  • 1 model self-reviewing → local minima, blind spots (stochastic bandit)
  • 2 models cross-reviewing → adversarial critique breaks blind spots (adversarial bandit)
  • 4+ models → diminishing returns, 2-4× API cost, coordination overhead

Claude Code = fast fluid execution. GPT-5.4/Gemini/GLM = slower, more deliberate critique. Speed × Rigor = better outcomes than either model alone.

Community &#x26; Citation

@software{aris2026,

  title  = {ARIS: Auto-Research-In-Sleep},

  author = {wanshuiyin},

  year   = {2026},

  url    = {https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep}

}

Join the community: GitHub Discussions

Papers accepted using ARIS: CS Conference (8/10 "clear accept"), AAAI 2026 Main Technical (7/10 "good paper, accept").

BrowserAct

Let your agent run on any real-world website

Bypass CAPTCHA & anti-bot for free. Start local, scale to cloud.

Explore BrowserAct Skills →

Stop writing automation&scrapers

Install the CLI. Run your first Skill in 30 seconds. Scale when you're ready.

Start free
free · no credit card