ai-paper-reproduction

Main orchestrator for README-first AI repo reproduction. Use when the user wants an end-to-end, minimal-trustworthy reproduction flow that reads the repository…

INSTALLATION
npx skills add https://github.com/lllllllama/ai-paper-reproduction-skill --skill ai-paper-reproduction
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md


Success criteria

  • README is treated as the primary source of reproduction intent.
  • A minimum trustworthy target is selected and justified.
  • Documented inference is preferred over evaluation, and evaluation is preferred over training.
  • Any repo edits remain conservative, explicit, and auditable.
  • Assumptions, protocol deviations, and human decision points are surfaced rather than hidden.
  • repro_outputs/ is generated with consistent structure and stable machine-readable fields.
  • Final user-facing explanation is short and follows the user's language when practical.

Interaction and usability policy

  • Keep the workflow simple enough for a new user to understand quickly.
  • Prefer short, concrete plans over exhaustive research.
  • Expose commands, assumptions, blockers, and evidence.
  • Avoid turning the skill into an opaque automation layer.
  • Preserve a low learning cost for both humans and downstream agents.

Language policy

  • Human-readable Markdown outputs should follow the user's language when it is clear.
  • If the user's language is unclear, default to concise English.
  • Machine-readable fields, filenames, keys, and enum values stay in stable English.
  • Paths, package names, CLI commands, config keys, and code identifiers remain unchanged.

See references/language-policy.md.

Reproduction policy

Core priority order:

  • documented inference
  • documented evaluation
  • documented training startup or partial verification
  • full training only when the user explicitly asks later
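
As a minimal sketch of the priority order above, the selection can be read as "pick the lowest-numbered documented target, and only consider full training when the user has asked for it". The `Target` enum and `select_target` function are hypothetical names used for illustration; they are not part of the skill.

```python
from enum import IntEnum


class Target(IntEnum):
    """Hypothetical labels for the four priority levels (lower value wins)."""
    INFERENCE = 1
    EVALUATION = 2
    TRAINING_STARTUP = 3  # startup or partial verification
    FULL_TRAINING = 4


def select_target(documented, user_requested_full_training=False):
    """Return the highest-priority documented target, or None if none qualifies."""
    candidates = set(documented)
    if not user_requested_full_training:
        # Full training is never chosen unless the user explicitly asks.
        candidates.discard(Target.FULL_TRAINING)
    return min(candidates) if candidates else None


if __name__ == "__main__":
    print(select_target({Target.EVALUATION, Target.TRAINING_STARTUP}).name)
```

Returning `None` rather than falling through to full training keeps the "record a blocker" outcome explicit when nothing cheaper is documented.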

Rules:

  • README-first: use repository files to clarify, not casually override, the README.
  • Aim for minimal trustworthy reproduction rather than maximum task coverage.
  • Treat smoke tests, startup verification, and early-step checks as valid training evidence when full training is not appropriate.
  • In trusted reproduction, a documented training command should first be checked through startup verification or a short monitoring window, then paused for explicit human confirmation before broader training continues.
  • In explicitly authorized explore-lane execution, the training record can continue without the trusted-lane confirmation pause, but it must stay isolated from trusted conclusions.
  • Record unresolved gaps rather than fabricating confidence.
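
The two training rules above can be sketched as a small gate. This is an illustration under assumptions: the function name, the lane strings, and the returned action labels are hypothetical and do not come from the skill or from run-train.

```python
def training_gate(lane, startup_verified, human_confirmed=False,
                  explore_authorized=False):
    """Decide the next training action for a lane (hypothetical labels)."""
    if lane == "trusted":
        if not startup_verified:
            # Startup verification or a short monitoring window comes first.
            return "verify_startup"
        if not human_confirmed:
            # Broader training waits for explicit human confirmation.
            return "pause_for_human"
        return "continue_trusted"
    if lane == "explore":
        if not explore_authorized:
            # Exploration needs explicit user authorization.
            return "blocked_needs_authorization"
        # No confirmation pause, but results stay out of trusted conclusions.
        return "continue_isolated"
    raise ValueError(f"unknown lane: {lane!r}")


if __name__ == "__main__":
    print(training_gate("trusted", startup_verified=True))
```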

Patch policy

  • Prefer no code changes.
  • Prefer safer adjustments first:
      • command-line arguments
      • environment variables
      • path fixes
      • dependency version fixes
      • dependency file fixes such as requirements.txt or environment.yml
  • Avoid changing:
      • model architecture
      • core inference semantics
      • core training logic
      • loss functions
      • experiment meaning
  • If repository files must change:
      • create a patch branch first using repro/YYYY-MM-DD-short-task
      • apply low-risk changes before medium-risk changes
      • avoid high-risk changes by default
      • commit only verified groups of changes
      • keep verified patch commits sparse, usually 0-2
      • use commit messages in the form repro: <scope> for documented <command>

See references/patch-policy.md.
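
The branch and commit-message formats in the patch policy can be sketched with two small helpers. The helper names and the slug normalization (lowercasing, collapsing non-alphanumeric runs into hyphens) are assumptions for illustration; references/patch-policy.md remains the authority on the exact rules.

```python
import datetime
import re


def patch_branch_name(short_task, today=None):
    """Build a branch name in the form repro/YYYY-MM-DD-short-task."""
    today = today or datetime.date.today()
    # Assumed normalization: lowercase, non-alphanumeric runs become hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", short_task.lower()).strip("-")
    return f"repro/{today.isoformat()}-{slug}"


def commit_message(scope, command):
    """Build a message in the form 'repro: <scope> for documented <command>'."""
    return f"repro: {scope} for documented {command}"


if __name__ == "__main__":
    print(patch_branch_name("Fix CUDA path", datetime.date(2025, 1, 2)))
    print(commit_message("pin dependency version", "inference"))
```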

Research safety boundary

  • Preserve experiment meaning over convenience.
  • Do not silently change dataset, split, checkpoint, preprocessing, metric, loss, or model semantics.
  • Distinguish direct evidence from inference and from user-approved decisions.
  • Prefer a recorded blocker over an unrecorded workaround.
  • Escalate for explicit human review before any change that could alter scientific meaning or reported conclusions.

See references/research-safety-principles.md.

Workflow

  • Read README and repo signals.
  • Call repo-intake-and-plan to scan the repository and extract documented commands.
  • Select the smallest trustworthy reproduction target.
  • Call env-and-assets-bootstrap to prepare environment assumptions and asset paths.
  • Call analyze-project only when repo structure, insertion points, or suspicious implementation patterns need a read-only pass before continuing.
  • Run a conservative smoke check or documented inference or evaluation command with minimal-run-and-audit.
  • If the selected trustworthy target is documented training startup, short-run verification, or resume, hand execution to run-train instead of minimal-run-and-audit.
  • When training is selected inside trusted reproduction, let run-train capture the startup evidence first, then surface a human review checkpoint before any fuller training claim.
  • Stop for human review if protocol meaning, model semantics, or result interpretation would otherwise be changed implicitly.
  • Use paper-context-resolver only if README and repo files leave a narrow reproduction-critical gap that blocks the current target.
  • Never auto-route into explore-code or explore-run; exploration requires explicit user authorization.
  • Write the standardized outputs with evidence, assumptions, deviations, and next safe action.
  • Give the user a short final note in the user's language.
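
The hand-off between minimal-run-and-audit and run-train in the workflow above can be sketched as a routing table. The sub-skill names come from this document; the target labels, the `route` function, and the returned dictionary shape are hypothetical.

```python
# Hypothetical target labels; only the sub-skill names come from the document.
RUN_TARGETS = {"smoke_check", "inference", "evaluation"}
TRAINING_TARGETS = {"training_startup", "short_run_verification", "resume"}


def route(target_kind):
    """Map a selected target to the sub-skill that executes it."""
    if target_kind in RUN_TARGETS:
        return {"skill": "minimal-run-and-audit", "human_checkpoint": False}
    if target_kind in TRAINING_TARGETS:
        # Trusted training captures startup evidence, then asks for review.
        return {"skill": "run-train", "human_checkpoint": True}
    # Anything else, including exploration, is never routed automatically.
    raise ValueError(f"no automatic route for target: {target_kind!r}")


if __name__ == "__main__":
    print(route("evaluation"))
    print(route("resume"))
```

Raising on unknown targets mirrors the rule that exploration is never entered without explicit user authorization.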

Required outputs

Always target:

repro_outputs/
  SUMMARY.md
  COMMANDS.md
  LOG.md
  status.json
  PATCHES.md   # only if patches were applied

Use the templates under assets/ and the field rules in references/output-spec.md.
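
A minimal sketch of creating this layout follows. It writes empty placeholder files rather than the real templates under assets/, and the `status` payload is a caller-supplied placeholder: the actual field rules live in references/output-spec.md and are not reproduced here. The `write_outputs` name is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_outputs(root, status, patches_applied):
    """Create repro_outputs/ under root and return the sorted file names."""
    out = Path(root) / "repro_outputs"
    out.mkdir(parents=True, exist_ok=True)
    for name in ("SUMMARY.md", "COMMANDS.md", "LOG.md"):
        (out / name).touch()  # placeholder; real content comes from templates
    # Sorted keys keep the machine-readable file stable across runs.
    (out / "status.json").write_text(json.dumps(status, indent=2, sort_keys=True))
    if patches_applied:
        (out / "PATCHES.md").touch()  # only present when patches were applied
    return sorted(p.name for p in out.iterdir())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(write_outputs(tmp, {"placeholder": True}, patches_applied=False))
```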

Reporting policy

  • Put the shortest high-value summary in SUMMARY.md.
  • Put copyable commands in COMMANDS.md.
  • Put process evidence, assumptions, failures, and decisions in LOG.md.
  • Put durable machine-readable state in status.json.
  • Put branch, commit, validation, and README-fidelity impact in PATCHES.md when needed.
  • Distinguish verified facts from inferred guesses.

Maintainability notes

  • Keep this skill narrow: README-first AI repo reproduction only.
  • Push specialized logic into sub-skills or helper scripts.
  • Prefer stable templates and simple schemas over ad hoc prose.
  • Keep machine-readable outputs backward compatible when possible.
  • Add new evidence sources only when they improve auditability without raising learning cost.
  • Treat repo-intake-and-plan and paper-context-resolver as narrow helpers, not primary public entrypoints.