Name: ai-paper-reproduction
Author: lllllllama

SKILL.md

Success criteria

README is treated as the primary source of reproduction intent.

A minimum trustworthy target is selected and justified.

Documented inference is preferred over evaluation, and evaluation is preferred over training.

Any repo edits remain conservative, explicit, and auditable.

Assumptions, protocol deviations, and human decision points are surfaced rather than hidden.

repro_outputs/ is generated with consistent structure and stable machine-readable fields.

Final user-facing explanation is short and follows the user's language when practical.

Interaction and usability policy

Keep the workflow simple enough for a new user to understand quickly.

Prefer short, concrete plans over exhaustive research.

Expose commands, assumptions, blockers, and evidence.

Avoid turning the skill into an opaque automation layer.

Preserve a low learning cost for both humans and downstream agents.

Language policy

Human-readable Markdown outputs should follow the user's language when it is clear.

If the user's language is unclear, default to concise English.

Machine-readable fields, filenames, keys, and enum values stay in stable English.

Paths, package names, CLI commands, config keys, and code identifiers remain unchanged.

See references/language-policy.md.

Reproduction policy

Core priority order:

documented inference

documented evaluation

documented training startup or partial verification

full training only when the user explicitly asks later
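The priority order above can be sketched as a small selection function. This is a minimal illustration, not the skill's actual implementation: `TargetKind`, `DocumentedCommand`, and `select_target` are hypothetical names, and the example commands in the usage note are placeholders rather than commands from any real README.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, Optional


class TargetKind(IntEnum):
    """Hypothetical ranking of reproduction targets; lower values are preferred."""
    INFERENCE = 1
    EVALUATION = 2
    TRAINING_STARTUP = 3  # training startup or partial verification
    FULL_TRAINING = 4     # eligible only when the user explicitly asks


@dataclass(frozen=True)
class DocumentedCommand:
    kind: TargetKind
    command: str  # copied verbatim from the README


def select_target(
    commands: Iterable[DocumentedCommand],
    full_training_requested: bool = False,
) -> Optional[DocumentedCommand]:
    """Return the highest-priority documented command, or None if nothing is eligible."""
    eligible = [
        c for c in commands
        if c.kind != TargetKind.FULL_TRAINING or full_training_requested
    ]
    # min() picks the lowest TargetKind value, i.e. the most preferred target.
    return min(eligible, key=lambda c: c.kind, default=None)
```

For example, given one documented evaluation command and one documented full-training command, `select_target` returns the evaluation command unless `full_training_requested=True` and no cheaper target exists.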

Rules:

README-first: use repository files to clarify, not casually override, the README.

Aim for minimal trustworthy reproduction rather than maximum task coverage.

Treat smoke tests, startup verification, and early-step checks as valid training evidence when full training is not appropriate.

In trusted reproduction, a documented training command should first be checked through startup verification or a short monitoring window, then paused for explicit human confirmation before broader training continues.

In explicitly authorized explore-lane execution, training can continue without the trusted-lane confirmation pause, but its record must stay isolated from trusted conclusions.

Record unresolved gaps rather than fabricating confidence.
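The trusted-lane and explore-lane training rules above can be read as a small decision function. The sketch below is an assumption-laden illustration: the lane names, the returned action strings, and `next_training_action` itself are hypothetical and are not defined by this skill.

```python
def next_training_action(
    lane: str,
    startup_verified: bool = False,
    human_confirmed: bool = False,
    explore_authorized: bool = False,
) -> str:
    """Return the next safe step for a documented training command.

    All lane names and action strings are hypothetical labels for illustration.
    """
    if lane == "trusted":
        if not startup_verified:
            # First check startup or a short monitoring window.
            return "verify_startup"
        if not human_confirmed:
            # Pause before any broader training continues.
            return "pause_for_human_confirmation"
        return "continue_training"
    if lane == "explore":
        if not explore_authorized:
            # Exploration is never entered without explicit authorization.
            return "require_explicit_authorization"
        # No trusted-lane pause, but results stay out of trusted conclusions.
        return "continue_isolated_from_trusted"
    raise ValueError(f"unknown lane: {lane!r}")
```

Keeping the gate as a pure function makes each decision easy to log alongside the evidence that justified it.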

Patch policy

Prefer no code changes.

Prefer safer adjustments first:

command-line arguments

environment variables

path fixes

dependency version fixes

fixes to dependency files such as requirements.txt or environment.yml

Avoid changing:

model architecture

core inference semantics

core training logic

loss functions

experiment meaning

If repository files must change:

create a patch branch first, named in the form repro/YYYY-MM-DD-short-task

apply low-risk changes before medium-risk changes

avoid high-risk changes by default

commit only verified groups of changes

keep verified patch commits sparse, usually 0-2

use commit messages in the form repro: <scope> for documented <command>

See references/patch-policy.md.
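As an illustration of the branch and commit naming forms above, the following sketch builds both strings. The helper names are hypothetical, and the slug rule (lowercase, hyphen-separated) is an assumption; the policy itself only gives the repro/YYYY-MM-DD-short-task and repro: <scope> for documented <command> patterns.

```python
import re
from datetime import date


def patch_branch_name(task: str, today: date) -> str:
    """Build a branch name of the form repro/YYYY-MM-DD-short-task.

    Assumption: the short task is lowercased and non-alphanumeric runs
    are collapsed into single hyphens.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", task.lower()).strip("-")
    return f"repro/{today.isoformat()}-{slug}"


def commit_message(scope: str, command: str) -> str:
    """Build a commit message of the form 'repro: <scope> for documented <command>'."""
    return f"repro: {scope} for documented {command}"
```

With these helpers, `patch_branch_name("Fix CUDA path", date(2024, 1, 5))` yields `repro/2024-01-05-fix-cuda-path`, and `commit_message("deps", "inference")` yields `repro: deps for documented inference`.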

Research safety boundary

Preserve experiment meaning over convenience.

Do not silently change dataset, split, checkpoint, preprocessing, metric, loss, or model semantics.

Distinguish direct evidence from inference and from user-approved decisions.

Prefer a recorded blocker over an unrecorded workaround.

Escalate for explicit human review before any change that could alter scientific meaning or reported conclusions.

See references/research-safety-principles.md.

Workflow

Read README and repo signals.

Call repo-intake-and-plan to scan the repository and extract documented commands.

Select the smallest trustworthy reproduction target.

Call env-and-assets-bootstrap to prepare environment assumptions and asset paths.

Call analyze-project only when repo structure, insertion points, or suspicious implementation patterns need a read-only pass before continuing.

Run a conservative smoke check, or a documented inference or evaluation command, with minimal-run-and-audit.

If the selected trustworthy target is documented training startup, short-run verification, or resume, hand execution to run-train instead of minimal-run-and-audit.

When training is selected inside trusted reproduction, let run-train capture the startup evidence first, then surface a human review checkpoint before any fuller training claim.

Stop for human review if protocol meaning, model semantics, or result interpretation would otherwise be changed implicitly.

Use paper-context-resolver only if README and repo files leave a narrow reproduction-critical gap that blocks the current target.

Never auto-route into explore-code or explore-run; exploration requires explicit user authorization.

Write the standardized outputs with evidence, assumptions, deviations, and next safe action.

Give the user a short final note in the user's language.
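The execution hand-off in the workflow above can be sketched as a routing table. The sub-skill names come from this document; the target-kind strings and the function names are hypothetical labels chosen for illustration.

```python
# Hypothetical target-kind labels; only the sub-skill names come from the workflow.
TRAINING_KINDS = {"training_startup", "short_run_verification", "resume"}
AUDIT_KINDS = {"smoke_check", "inference", "evaluation"}
EXPLORE_SKILLS = {"explore-code", "explore-run"}


def route_execution(target_kind: str) -> str:
    """Pick the sub-skill that should execute the selected target."""
    if target_kind in TRAINING_KINDS:
        return "run-train"
    if target_kind in AUDIT_KINDS:
        return "minimal-run-and-audit"
    raise ValueError(f"no documented route for target kind: {target_kind!r}")


def may_route_to(skill: str, user_authorized_exploration: bool = False) -> bool:
    """Exploration sub-skills are reachable only with explicit user authorization."""
    if skill in EXPLORE_SKILLS:
        return user_authorized_exploration
    return True
```

Raising on an unknown kind, rather than guessing a route, matches the preference for a recorded blocker over an unrecorded workaround.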

Required outputs

Always target:

repro_outputs/
  SUMMARY.md
  COMMANDS.md
  LOG.md
  status.json
  PATCHES.md   # only if patches were applied

Use the templates under assets/ and the field rules in references/output-spec.md.
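A simple completeness check for the layout above might look like the sketch below. The function names are hypothetical, and the check only confirms that the files exist; it makes no assumption about the fields in status.json, which are governed by references/output-spec.md.

```python
from pathlib import Path
from typing import List

# File names taken from the required layout; PATCHES.md is conditional.
REQUIRED_FILES = ("SUMMARY.md", "COMMANDS.md", "LOG.md", "status.json")


def expected_outputs(patches_applied: bool) -> List[str]:
    """List the files that repro_outputs/ should contain for this run."""
    names = list(REQUIRED_FILES)
    if patches_applied:
        names.append("PATCHES.md")
    return names


def missing_outputs(root: Path, patches_applied: bool) -> List[str]:
    """Return the expected files that are absent under root/repro_outputs/."""
    out_dir = root / "repro_outputs"
    return [
        name for name in expected_outputs(patches_applied)
        if not (out_dir / name).is_file()
    ]
```

Calling `missing_outputs(Path("."), patches_applied=False)` from the repository root returns an empty list once all four required files have been written.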

Reporting policy

Put the shortest high-value summary in SUMMARY.md.

Put copyable commands in COMMANDS.md.

Put process evidence, assumptions, failures, and decisions in LOG.md.

Put durable machine-readable state in status.json.

Put branch, commit, validation, and README-fidelity impact in PATCHES.md when needed.

Distinguish verified facts from inferred guesses.

Maintainability notes

Keep this skill narrow: README-first AI repo reproduction only.

Push specialized logic into sub-skills or helper scripts.

Prefer stable templates and simple schemas over ad hoc prose.

Keep machine-readable outputs backward compatible when possible.

Add new evidence sources only when they improve auditability without raising learning cost.

Treat repo-intake-and-plan and paper-context-resolver as narrow helpers, not primary public entrypoints.
