Initial: pi-skill — 68 skills, 43 extensions, 11 themes for Pi
This commit is contained in:
280
skills/autoresearch/SKILL.md
Normal file
280
skills/autoresearch/SKILL.md
Normal file
@@ -0,0 +1,280 @@
|
||||
---
|
||||
name: autoresearch
|
||||
description: Autonomous Goal-directed Iteration. Apply Karpathy's autoresearch principles to ANY task. Loops autonomously — modify, verify, keep/discard, repeat. Invoke with /skill:autoresearch or when user says "work autonomously", "iterate until done", "keep improving", or "run overnight".
|
||||
allowed-tools: Bash(git:*) Bash(npm:*) Bash(npx:*) Read Write Edit ask_user show_plan show_research subagent_create_batch dispatch_agent commander_task commander_mailbox show_report
|
||||
---
|
||||
|
||||
# Autoresearch — Autonomous Goal-directed Iteration
|
||||
|
||||
Inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch). Applies constraint-driven autonomous iteration to ANY work — not just ML research.
|
||||
|
||||
**Core idea:** You are an autonomous agent. Modify -> Verify -> Keep/Discard -> Repeat.
|
||||
|
||||
## When to Activate
|
||||
|
||||
- User invokes `/skill:autoresearch` or `/autoresearch`
|
||||
- User says "work autonomously", "iterate until done", "keep improving", "run overnight"
|
||||
- Any task requiring repeated iteration cycles with measurable outcomes
|
||||
|
||||
## Phase 1: Understand (Do This First — Before ANY Work)
|
||||
|
||||
Before touching any files, deeply understand the goal. Do NOT rush into iteration.
|
||||
|
||||
1. **Read relevant files** — Scan the codebase to build context around the user's goal. Understand what exists, what patterns are in use, and what's realistic.
|
||||
|
||||
2. **Identify ambiguities** — Based on the goal and codebase context, what's unclear?
|
||||
- Is the success metric obvious or ambiguous?
|
||||
- Is the scope (which files to modify) clear?
|
||||
- Are there constraints the user hasn't mentioned?
|
||||
- Are there multiple valid interpretations?
|
||||
|
||||
3. **Ask clarifying questions** — If ANY ambiguity exists, use `ask_user` to ask targeted questions:
|
||||
```
|
||||
ask_user {
|
||||
question: "I have a few questions before I build the research plan:",
|
||||
mode: "questions",
|
||||
options: [
|
||||
{ label: "1. What metric should define success? (e.g. test coverage %, build time ms, bundle size KB)" },
|
||||
{ label: "2. Which files/directories are in scope for modification?" },
|
||||
{ label: "3. Are there any approaches to avoid or constraints I should know about?" },
|
||||
{ label: "4. What does 'done' look like — a specific target, or iterate until interrupted?" }
|
||||
]
|
||||
}
|
||||
```
|
||||
**Tailor questions to the specific goal.** Don't ask about what's already clear. Ask about genuine ambiguities.
|
||||
|
||||
4. **Skip if crystal clear** — If the goal is unambiguous (clear metric, scope, exit criteria), skip questions and proceed to Phase 2. State briefly why no questions are needed.
|
||||
|
||||
5. **Synthesize understanding** — Form a concrete statement: Goal, Metric (what + direction + verify command), Scope (in/out), Constraints, Exit criteria.
|
||||
|
||||
6. **Save research session** — Create `.context/research-sessions/<session-id>.json` with the initial session data: goal, metric, scope, clarifying Q&A, status "understanding". Store the `session_id` for updates throughout the lifecycle.
|
||||
|
||||
## Phase 2: Plan (Present Before Executing)
|
||||
|
||||
Now write and present a research plan for user approval. Do NOT start iterating without approval.
|
||||
|
||||
1. **Establish baseline** — Run the verification command to get a starting metric value.
|
||||
|
||||
2. **Write the research plan** — Create `.context/autoresearch-plan.md`:
|
||||
|
||||
```markdown
|
||||
# Autoresearch Plan: <goal summary>
|
||||
|
||||
## Goal
|
||||
<Concrete goal statement>
|
||||
|
||||
## Metric
|
||||
- **Measuring:** <what>
|
||||
- **Direction:** <higher/lower is better>
|
||||
- **Verify command:** `<command>`
|
||||
- **Baseline:** <current value>
|
||||
- **Target:** <target value or "continuous improvement">
|
||||
|
||||
## Scope
|
||||
- **In scope:** <files/directories to modify>
|
||||
- **Read only:** <files for context only>
|
||||
- **Out of scope:** <excluded areas>
|
||||
|
||||
## Strategy
|
||||
Ordered approaches, most to least promising:
|
||||
1. <First approach — why promising>
|
||||
2. <Second approach — what it explores>
|
||||
3. <Third approach — alternative angle>
|
||||
4. <Fourth approach — radical idea>
|
||||
5. <Fifth approach — simplification play>
|
||||
|
||||
## Iteration Plan
|
||||
- **Mode:** <bounded (N) / unbounded>
|
||||
- **Estimated time per iteration:** <seconds/minutes>
|
||||
- **When stuck:** Re-read plan, combine near-misses, try opposites
|
||||
|
||||
## Exit Criteria
|
||||
- <When to stop>
|
||||
```
|
||||
|
||||
3. **Present for approval:**
|
||||
```
|
||||
show_plan { file_path: ".context/autoresearch-plan.md", title: "Autoresearch Plan: <goal>" }
|
||||
```
|
||||
- **Approved** → proceed to Phase 3
|
||||
- **Declined** → revise based on feedback and re-present
|
||||
|
||||
4. **Update session** — Set status to "planning", save plan content and baseline metric.
|
||||
|
||||
## Phase 3: Setup & Begin
|
||||
|
||||
With understanding confirmed and plan approved, set up tracking and start.
|
||||
|
||||
1. **Create results log** — Create `autoresearch-results.tsv` (see `references/results-logging.md`)
|
||||
2. **Record baseline** — Log the baseline metric from Phase 2 as iteration #0
|
||||
3. **Commander tracking** — If available, create task group and broadcast start (see Commander Integration below)
|
||||
4. **Update session** — Set status to "researching"
|
||||
5. **Begin the loop** — Start iterating immediately. No further confirmation needed.
|
||||
|
||||
## The Loop
|
||||
|
||||
Read `references/autonomous-loop-protocol.md` for full protocol details.
|
||||
|
||||
```
|
||||
LOOP (FOREVER or N times):
|
||||
1. Review: Read current state + git history + results log
|
||||
2. Ideate: Pick next change based on goal, past results, what hasn't been tried
|
||||
3. Modify: Make ONE focused change to in-scope files
|
||||
4. Commit: Git commit the change (before verification)
|
||||
5. Verify: Run the mechanical metric (tests, build, benchmark, etc.)
|
||||
6. Decide:
|
||||
- IMPROVED -> Keep commit, log "keep", advance
|
||||
- SAME/WORSE -> Git revert, log "discard"
|
||||
- CRASHED -> Try to fix (max 3 attempts), else log "crash" and move on
|
||||
7. Log: Record result in results log
|
||||
8. Repeat: Go to step 1.
|
||||
- If unbounded: NEVER STOP. NEVER ASK "should I continue?"
|
||||
- If bounded (N): Stop after N iterations, print final summary
|
||||
```
|
||||
|
||||
## Critical Rules
|
||||
|
||||
1. **Loop until done** — Unbounded: loop until interrupted. Bounded: loop N times then summarize.
|
||||
2. **Read before write** — Always understand full context before modifying
|
||||
3. **One change per iteration** — Atomic changes. If it breaks, you know exactly why
|
||||
4. **Mechanical verification only** — No subjective "looks good". Use metrics
|
||||
5. **Automatic rollback** — Failed changes revert instantly. No debates
|
||||
6. **Simplicity wins** — Equal results + less code = KEEP. Tiny improvement + ugly complexity = DISCARD
|
||||
7. **Git is memory** — Every kept change committed. Agent reads history to learn patterns
|
||||
8. **When stuck, think harder** — Re-read files, re-read goal AND `.context/autoresearch-plan.md` for planned strategy, try next untried approach from the plan, combine near-misses, try radical changes. Don't ask for help unless truly blocked by missing access/permissions
|
||||
|
||||
## Principles Reference
|
||||
|
||||
See `references/core-principles.md` for the 7 generalizable principles from autoresearch.
|
||||
|
||||
## Commander Integration (Task Tracking & Visibility)
|
||||
|
||||
When Commander is available, autoresearch MUST track every iteration as a Commander task. This gives the dashboard full visibility into autonomous work — just like the `tasks` extension does for manual workflows.
|
||||
|
||||
### Setup Phase (Phase 3) — Create Task Group
|
||||
|
||||
After establishing the baseline and getting plan approval, create a Commander task group for this research session:
|
||||
|
||||
```
|
||||
commander_task {
|
||||
operation: "group:create",
|
||||
group_name: "Autoresearch: <goal summary>",
|
||||
initiative_summary: "<full goal description with metric and scope>",
|
||||
total_waves: 1,
|
||||
working_directory: "<cwd>",
|
||||
tasks: []
|
||||
}
|
||||
```
|
||||
|
||||
Store the returned `group_id` — all iteration tasks will be added to this group.
|
||||
|
||||
Send an initial mailbox status broadcast:
|
||||
```
|
||||
commander_mailbox {
|
||||
operation: "send",
|
||||
from_agent: "autoresearch",
|
||||
to_agent: "commander",
|
||||
body: "Autoresearch started: <goal>. Baseline metric: <value>. Scope: <files>. Plan approved.",
|
||||
message_type: "status"
|
||||
}
|
||||
```
|
||||
|
||||
### Per-Iteration — Create → Claim → Complete
|
||||
|
||||
**Before modifying** (step 3 of each loop iteration), create and claim a Commander task:
|
||||
|
||||
```
|
||||
commander_task { operation: "create", description: "Iteration #N: <planned change>", working_directory: "<cwd>", group_id: <group_id> }
|
||||
commander_task { operation: "claim", task_id: <task_id>, agent_name: "autoresearch" }
|
||||
```
|
||||
|
||||
**After logging results** (step 7 of each loop iteration), complete the task with the outcome:
|
||||
|
||||
```
|
||||
commander_task { operation: "complete", task_id: <task_id>, result: "<status>: <description>. Metric: <old> → <new> (delta: <delta>)" }
|
||||
```
|
||||
|
||||
Also add a comment to the task with detailed results:
|
||||
```
|
||||
commander_task { operation: "comment:add", task_id: <task_id>, body: "Status: <keep|discard|crash>\nMetric: <value> (delta: <delta>)\nCommit: <hash or '-'>\nDescription: <what was tried>", agent_name: "autoresearch" }
|
||||
```
|
||||
|
||||
**Note:** Use `complete` for ALL outcomes (keep, discard, crash). Discards and crashes are expected in autoresearch — they're not failures. Reserve `fail` only for unrecoverable errors that halt the entire loop.
|
||||
|
||||
### Status Broadcasts — Every ~5 Iterations
|
||||
|
||||
Every 5 iterations, send a mailbox status update AND add a comment to the group:
|
||||
|
||||
```
|
||||
commander_mailbox {
|
||||
operation: "send",
|
||||
from_agent: "autoresearch",
|
||||
to_agent: "commander",
|
||||
body: "Autoresearch progress — Iteration #N: metric at <value> (baseline: <baseline>). Keeps: X | Discards: Y | Crashes: Z",
|
||||
message_type: "status"
|
||||
}
|
||||
```
|
||||
|
||||
### Research Complete — Report & Implementation Handoff (MANDATORY)
|
||||
|
||||
When the loop ends (bounded mode reaching N, or goal achieved):
|
||||
|
||||
1. **Final mailbox broadcast** with full summary:
|
||||
```
|
||||
commander_mailbox {
|
||||
operation: "send",
|
||||
from_agent: "autoresearch",
|
||||
to_agent: "commander",
|
||||
body: "Autoresearch complete (N iterations). Baseline: <X> → Final: <Y> (delta: <Z>). Keeps: A | Discards: B | Crashes: C. Best iteration: #M — <description>",
|
||||
message_type: "result"
|
||||
}
|
||||
```
|
||||
|
||||
2. **Compile findings & next steps** — Extract prioritized, actionable implementation items from the research. Update the session file with findings, next steps array, and final metric.
|
||||
|
||||
3. **Research report** — Present via `show_report` framed as a handoff:
|
||||
```
|
||||
show_report {
|
||||
title: "Research Complete — Ready for Implementation: <goal>",
|
||||
summary: "## Research Results\n\n...\n\n## Prioritized Next Steps\n\n1. <action item>\n2. ...\n\n## Recommended Implementation Approach\n\n<how to implement>"
|
||||
}
|
||||
```
|
||||
|
||||
4. **Ask about implementation** — Use `ask_user` to offer three choices:
|
||||
- **Implement now** → spawn a team of builder agents via `subagent_create_batch`
|
||||
- **Save & pause** → set session to "paused", resume later via `/research`
|
||||
- **Done** → mark session "complete"
|
||||
|
||||
5. **Implementation (if chosen)** — Update session to "implementing", create Commander task group, dispatch builders, track completion. When done, present final comprehensive report covering research results AND implementation work. Set session to "complete".
|
||||
|
||||
6. **Preserve the plan** — Leave `.context/autoresearch-plan.md` intact. Leave the session file for browsing via `/research`.
|
||||
|
||||
### Graceful Degradation
|
||||
|
||||
All Commander calls are **optional**. If Commander is unavailable:
|
||||
- Skip `commander_task` and `commander_mailbox` calls silently
|
||||
- The local `autoresearch-results.tsv` log remains the primary record
|
||||
- The `show_report` call still works (it only needs git, not Commander)
|
||||
- Never let a Commander error interrupt the autonomous loop
|
||||
|
||||
## Adapting to Different Domains
|
||||
|
||||
| Domain | Metric | Scope | Verify Command |
|
||||
|--------|--------|-------|----------------|
|
||||
| Backend code | Tests pass + coverage % | `src/**/*.ts` | `npm test` |
|
||||
| Frontend UI | Lighthouse score | `src/components/**` | `npx lighthouse` |
|
||||
| ML training | val_bpb / loss | `train.py` | `uv run train.py` |
|
||||
| Blog/content | Word count + readability | `content/*.md` | Custom script |
|
||||
| Performance | Benchmark time (ms) | Target files | `npm run bench` |
|
||||
| Refactoring | Tests pass + LOC reduced | Target module | `npm test && wc -l` |
|
||||
|
||||
Adapt the loop to your domain. The PRINCIPLES are universal; the METRICS are domain-specific.
|
||||
|
||||
## Session Persistence
|
||||
|
||||
Every autoresearch session is saved to `.context/research-sessions/<session-id>.json`. This enables:
|
||||
- **Resume later** — pick up where you left off via `/research` command
|
||||
- **Browse history** — see all past research sessions in the research browser
|
||||
- **Track lifecycle** — from understanding through implementation completion
|
||||
|
||||
Update the session file at every major transition: understand → plan → research → implement → complete. On every "keep" iteration or every ~5 iterations, append iteration data to the session. This creates a complete record of the research lifecycle that can be browsed and resumed.
|
||||
246
skills/autoresearch/references/autonomous-loop-protocol.md
Normal file
246
skills/autoresearch/references/autonomous-loop-protocol.md
Normal file
@@ -0,0 +1,246 @@
|
||||
# Autonomous Loop Protocol
|
||||
|
||||
Detailed protocol for the autoresearch iteration loop. SKILL.md has the summary; this file has the full rules.
|
||||
|
||||
## Loop Modes
|
||||
|
||||
Autoresearch supports two loop modes:
|
||||
|
||||
- **Unbounded (default):** Loop forever until manually interrupted (Ctrl+C)
|
||||
- **Bounded:** Loop exactly N times when user specifies a count
|
||||
|
||||
When bounded, track `current_iteration` against `max_iterations`. After the final iteration, print a summary and stop.
|
||||
|
||||
## Phase 1: Review (30 seconds)
|
||||
|
||||
Before each iteration, build situational awareness:
|
||||
|
||||
```
|
||||
1. Read current state of in-scope files (full context)
|
||||
2. Read last 10-20 entries from results log
|
||||
3. Read git log --oneline -20 to see recent changes
|
||||
4. Re-read `.context/autoresearch-plan.md` Strategy section for planned approaches
|
||||
5. Identify: what worked, what failed, what's untried from the plan
|
||||
6. If bounded: check current_iteration vs max_iterations
|
||||
```
|
||||
|
||||
**Why read every time?** After rollbacks, state may differ from what you expect. Never assume — always verify.
|
||||
|
||||
## Phase 2: Ideate (Strategic)
|
||||
|
||||
Pick the NEXT change. Priority order:
|
||||
|
||||
1. **Fix crashes/failures** from previous iteration first
|
||||
2. **Exploit successes** — if last change improved metric, try variants in same direction
|
||||
3. **Explore new approaches** — try something the results log shows hasn't been attempted
|
||||
4. **Combine near-misses** — two changes that individually didn't help might work together
|
||||
5. **Simplify** — remove code while maintaining metric. Simpler = better
|
||||
6. **Radical experiments** — when incremental changes stall, try something dramatically different
|
||||
|
||||
**Anti-patterns:**
|
||||
- Don't repeat exact same change that was already discarded
|
||||
- Don't make multiple unrelated changes at once (can't attribute improvement)
|
||||
- Don't chase marginal gains with ugly complexity
|
||||
|
||||
**Bounded mode consideration:** If remaining iterations are limited (<3 left), prioritize exploiting successes over exploration.
|
||||
|
||||
### Commander: Create + Claim Task
|
||||
|
||||
After ideating the next change, create a Commander task and claim it:
|
||||
|
||||
```
|
||||
commander_task { operation: "create", description: "Iteration #N: <planned change>", working_directory: "<cwd>", group_id: <group_id> }
|
||||
commander_task { operation: "claim", task_id: <returned_task_id>, agent_name: "autoresearch" }
|
||||
```
|
||||
|
||||
This makes the iteration visible on the Commander dashboard as an active task.
|
||||
|
||||
## Phase 3: Modify (One Atomic Change)
|
||||
|
||||
- Make ONE focused change to in-scope files
|
||||
- The change should be explainable in one sentence
|
||||
- Write the description BEFORE making the change (forces clarity)
|
||||
|
||||
## Phase 4: Commit (Before Verification)
|
||||
|
||||
```bash
|
||||
git add <changed-files>
|
||||
git commit -m "experiment: <one-sentence description>"
|
||||
```
|
||||
|
||||
Commit BEFORE running verification so rollback is clean: `git reset --hard HEAD~1`
|
||||
|
||||
## Phase 5: Verify (Mechanical Only)
|
||||
|
||||
Run the agreed-upon verification command. Capture output.
|
||||
|
||||
**Timeout rule:** If verification exceeds 2x normal time, kill and treat as crash.
|
||||
|
||||
**Extract metric:** Parse the verification output for the specific metric number.
|
||||
|
||||
## Phase 6: Decide (No Ambiguity)
|
||||
|
||||
```
|
||||
IF metric_improved:
|
||||
STATUS = "keep"
|
||||
# Do nothing — commit stays
|
||||
ELIF metric_same_or_worse:
|
||||
STATUS = "discard"
|
||||
git reset --hard HEAD~1
|
||||
ELIF crashed:
|
||||
# Attempt fix (max 3 tries)
|
||||
IF fixable:
|
||||
Fix -> re-commit -> re-verify
|
||||
ELSE:
|
||||
STATUS = "crash"
|
||||
git reset --hard HEAD~1
|
||||
```
|
||||
|
||||
**Simplicity override:** If metric barely improved (+<0.1%) but change adds significant complexity, treat as "discard". If metric unchanged but code is simpler, treat as "keep".
|
||||
|
||||
## Phase 7: Log Results
|
||||
|
||||
Append to results log (TSV format):
|
||||
|
||||
```
|
||||
iteration commit metric status description commander_task_id
|
||||
42 a1b2c3d 0.9821 keep increase attention heads from 8 to 12 184
|
||||
43 - 0.9845 discard switch optimizer to SGD 185
|
||||
44 - 0.0000 crash double batch size (OOM) 186
|
||||
```
|
||||
|
||||
### Commander: Complete Task + Add Comment
|
||||
|
||||
After logging, complete the Commander task and add a detailed comment:
|
||||
|
||||
```
|
||||
commander_task { operation: "complete", task_id: <task_id>, result: "<keep|discard|crash>: <description>. Metric: <old> → <new> (delta: <delta>)" }
|
||||
commander_task { operation: "comment:add", task_id: <task_id>, body: "Status: <status>\nMetric: <value> (delta: <delta>)\nCommit: <hash or '-'>\nDescription: <what was tried>", agent_name: "autoresearch" }
|
||||
```
|
||||
|
||||
Use `complete` for all outcomes — discards and crashes are expected in autoresearch, not failures.
|
||||
|
||||
### Session Save
|
||||
|
||||
On every "keep" result or every ~5 iterations, update the research session file (`.context/research-sessions/<session-id>.json`):
|
||||
- Append to the `iterations` array
|
||||
- Update `metric.final` with the current best metric value
|
||||
|
||||
## Phase 8: Repeat
|
||||
|
||||
### Commander: Status Broadcast Every ~5 Iterations
|
||||
|
||||
Every 5 iterations, send a mailbox status update so the Commander dashboard shows live progress:
|
||||
|
||||
```
|
||||
commander_mailbox {
|
||||
operation: "send",
|
||||
from_agent: "autoresearch",
|
||||
to_agent: "commander",
|
||||
body: "Autoresearch progress — Iteration #N: metric at <value> (baseline: <baseline>). Keeps: X | Discards: Y | Crashes: Z",
|
||||
message_type: "status"
|
||||
}
|
||||
```
|
||||
|
||||
### Unbounded Mode (default)
|
||||
|
||||
Go to Phase 1. **NEVER STOP. NEVER ASK IF YOU SHOULD CONTINUE.**
|
||||
|
||||
### Bounded Mode (with iteration count)
|
||||
|
||||
```
|
||||
IF current_iteration < max_iterations:
|
||||
Go to Phase 1
|
||||
ELIF goal_achieved:
|
||||
Print: "Goal achieved at iteration {N}! Final metric: {value}"
|
||||
Print final summary
|
||||
STOP
|
||||
ELSE:
|
||||
Print final summary
|
||||
STOP
|
||||
```
|
||||
|
||||
**Final summary format:**
|
||||
```
|
||||
=== Autoresearch Complete (N/N iterations) ===
|
||||
Baseline: {baseline} -> Final: {current} ({delta})
|
||||
Keeps: X | Discards: Y | Crashes: Z
|
||||
Best iteration: #{n} — {description}
|
||||
```
|
||||
|
||||
### Commander: Final Broadcast + Completion Report (MANDATORY)
|
||||
|
||||
When the loop ends (bounded or goal achieved), ALL three steps are required:
|
||||
|
||||
1. Send a final mailbox result broadcast:
|
||||
```
|
||||
commander_mailbox {
|
||||
operation: "send",
|
||||
from_agent: "autoresearch",
|
||||
to_agent: "commander",
|
||||
body: "Autoresearch complete (N iterations). Baseline: <X> → Final: <Y> (delta: <Z>). Keeps: A | Discards: B | Crashes: C. Best: #M — <description>",
|
||||
message_type: "result"
|
||||
}
|
||||
```
|
||||
|
||||
2. Call `show_report` to open the visual completion report with a rich summary that ties back to the research plan:
|
||||
```
|
||||
show_report {
|
||||
title: "Autoresearch Complete: <goal>",
|
||||
summary: "## Results\n\nBaseline: <X> → Final: <Y> (delta: <Z>)\n\n**Iterations:** N total (A keeps, B discards, C crashes)\n\n**Best:** #M — <description>\n\n## Plan vs. Reality\n\nReference `.context/autoresearch-plan.md` — which planned strategies were tried? Which worked? What was surprising?\n\n## Kept Changes\n\n<list of kept iterations with descriptions>\n\n## What Didn't Work\n\n<Discarded approaches and why>"
|
||||
}
|
||||
```
|
||||
|
||||
3. Preserve `.context/autoresearch-plan.md` as a record of the research session. Do not delete it.
|
||||
|
||||
### Implementation Handoff
|
||||
|
||||
After the completion report, the autoresearch process continues:
|
||||
|
||||
1. **Compile findings** — Extract prioritized next steps from the research results
|
||||
2. **Update session** — Save findings and next steps to the session file
|
||||
3. **Ask user** — Offer to implement now (spawn team), save & pause, or mark done
|
||||
4. **If implementing** — Dispatch builder agents via `subagent_create_batch`, track via Commander, present final report when done
|
||||
5. **Save session** — Update status to "complete" with implementation summary
|
||||
|
||||
See the main SKILL.md or autoresearch.md for full implementation handoff instructions.
|
||||
|
||||
### When Stuck (>5 consecutive discards)
|
||||
|
||||
Applies to both modes:
|
||||
1. Re-read ALL in-scope files from scratch
|
||||
2. Re-read the original goal AND `.context/autoresearch-plan.md` for planned strategy
|
||||
3. Review entire results log for patterns
|
||||
4. Try the next untried approach from the plan's Strategy section
|
||||
5. Try combining 2-3 previously successful changes
|
||||
6. Try the OPPOSITE of what hasn't been working
|
||||
7. Try a radical architectural change
|
||||
|
||||
## Crash Recovery
|
||||
|
||||
- Syntax error: fix immediately, don't count as separate iteration
|
||||
- Runtime error: attempt fix (max 3 tries), then move on
|
||||
- Resource exhaustion (OOM): revert, try smaller variant
|
||||
- Infinite loop/hang: kill after timeout, revert, avoid that approach
|
||||
- External dependency failure: skip, log, try different approach
|
||||
|
||||
## Communication
|
||||
|
||||
- **DO NOT** ask "should I keep going?" — in unbounded mode, YES. ALWAYS. In bounded mode, continue until N is reached.
|
||||
- **DO NOT** summarize after each iteration — just log and continue
|
||||
- **DO** print a brief one-line status every ~5 iterations (e.g., "Iteration 25: metric at 0.95, 8 keeps / 17 discards")
|
||||
- **DO** alert if you discover something surprising or game-changing
|
||||
- **DO** print a final summary when bounded loop completes
|
||||
- **DO** send Commander mailbox status broadcasts every ~5 iterations
|
||||
- **DO** add comments to Commander tasks with detailed iteration results
|
||||
- **DO** send a final Commander mailbox result broadcast when the loop ends
|
||||
- **DO** call `show_report` at loop end for a visual completion report
|
||||
|
||||
## Commander Graceful Degradation
|
||||
|
||||
All Commander integration (task tracking, mailbox broadcasts, comments) is **optional**. If Commander is unavailable:
|
||||
|
||||
- Skip all `commander_task` and `commander_mailbox` calls silently
|
||||
- Never let a Commander error interrupt or break the autonomous loop
|
||||
- The local `autoresearch-results.tsv` remains the primary record
|
||||
- `show_report` still works without Commander (it only needs git)
|
||||
92
skills/autoresearch/references/core-principles.md
Normal file
92
skills/autoresearch/references/core-principles.md
Normal file
@@ -0,0 +1,92 @@
|
||||
# Core Principles — From Karpathy's Autoresearch
|
||||
|
||||
7 universal principles extracted from autoresearch, applicable to ANY autonomous work.
|
||||
|
||||
## 1. Constraint = Enabler
|
||||
|
||||
Autonomy succeeds through intentional constraint, not despite it.
|
||||
|
||||
| Autoresearch | Generalized |
|
||||
|--------------|-------------|
|
||||
| 630-line codebase | Bounded scope that fits agent context |
|
||||
| 5-minute time budget | Fixed iteration cost |
|
||||
| One metric (val_bpb) | Single mechanical success criterion |
|
||||
|
||||
**Why:** Constraints enable agent confidence (full context understood), verification simplicity (no ambiguity), iteration velocity (low cost = rapid feedback loops).
|
||||
|
||||
**Apply:** Before starting, define: what files are in-scope? What's the ONE metric? What's the time budget per iteration?
|
||||
|
||||
## 2. Separate Strategy from Tactics
|
||||
|
||||
Humans set direction. Agents execute iterations.
|
||||
|
||||
| Strategic (Human) | Tactical (Agent) |
|
||||
|-------------------|------------------|
|
||||
| "Improve page load speed" | "Lazy-load images, code-split routes" |
|
||||
| "Increase test coverage" | "Add tests for uncovered edge cases" |
|
||||
| "Refactor auth module" | "Extract middleware, simplify handlers" |
|
||||
|
||||
**Why:** Humans understand WHY. Agents handle HOW. Mixing these roles wastes both human creativity and agent iteration speed.
|
||||
|
||||
**Apply:** Get clear direction from user (or project docs). Then iterate autonomously on implementation.
|
||||
|
||||
## 3. Metrics Must Be Mechanical
|
||||
|
||||
If you can't verify with a command, you can't iterate autonomously.
|
||||
|
||||
- Tests pass/fail (exit code 0)
|
||||
- Benchmark time in milliseconds
|
||||
- Coverage percentage
|
||||
- Lighthouse score
|
||||
- File size in bytes
|
||||
- Lines of code count
|
||||
|
||||
**Anti-pattern:** "Looks better", "probably improved", "seems cleaner" — these KILL autonomous loops because there's no decision function.
|
||||
|
||||
**Apply:** Define the grep command (or equivalent) that extracts your metric BEFORE starting.
|
||||
|
||||
## 4. Verification Must Be Fast
|
||||
|
||||
If verification takes longer than the work itself, incentives misalign.
|
||||
|
||||
| Fast (enables iteration) | Slow (kills iteration) |
|
||||
|-------------------------|----------------------|
|
||||
| Unit tests (seconds) | Full E2E suite (minutes) |
|
||||
| Type check (seconds) | Manual QA (hours) |
|
||||
| Lint check (instant) | Code review (async) |
|
||||
|
||||
**Apply:** Use the FASTEST verification that still catches real problems. Save slow verification for after the loop.
|
||||
|
||||
## 5. Iteration Cost Shapes Behavior
|
||||
|
||||
- Cheap iteration: bold exploration, many experiments
|
||||
- Expensive iteration: conservative, few experiments
|
||||
|
||||
Autoresearch: 5-minute cost = 100 experiments/night.
|
||||
Software: 10-second test = 360 experiments/hour.
|
||||
|
||||
**Apply:** Minimize iteration cost. Use fast tests, incremental builds, targeted verification. Every minute saved = more experiments run.
|
||||
|
||||
## 6. Git as Memory and Audit Trail
|
||||
|
||||
Every successful change is committed. This enables:
|
||||
- **Causality tracking** — which change drove improvement?
|
||||
- **Stacking wins** — each commit builds on prior successes
|
||||
- **Pattern learning** — agent sees what worked in THIS codebase
|
||||
- **Human review** — researcher inspects agent's decision sequence
|
||||
|
||||
**Apply:** Commit before verify. Revert on failure. Agent reads its own git history to inform next experiment.
|
||||
|
||||
## 7. Honest Limitations
|
||||
|
||||
State what the system can and cannot do. Don't oversell.
|
||||
|
||||
Autoresearch CANNOT: change tokenizer, replace human direction, guarantee meaningful improvements.
|
||||
|
||||
**Apply:** At setup, explicitly state constraints. If agent hits a wall it can't solve (missing permissions, external dependency, needs human judgment), say so clearly instead of guessing.
|
||||
|
||||
## The Meta-Principle
|
||||
|
||||
> Autonomy scales when you constrain scope, clarify success, mechanize verification, and let agents optimize tactics while humans optimize strategy.
|
||||
|
||||
This isn't "removing humans." It's reassigning human effort from execution to direction. Humans become MORE valuable by focusing on irreducibly creative/strategic work.
|
||||
65
skills/autoresearch/references/results-logging.md
Normal file
65
skills/autoresearch/references/results-logging.md
Normal file
@@ -0,0 +1,65 @@
|
||||
# Results Logging Protocol
|
||||
|
||||
Track every iteration in a structured log. Enables pattern recognition and prevents repeating failed experiments.
|
||||
|
||||
## Log Format (TSV)
|
||||
|
||||
Create `autoresearch-results.tsv` in the working directory (gitignored):
|
||||
|
||||
```tsv
|
||||
iteration commit metric delta status description commander_task_id
|
||||
```
|
||||
|
||||
### Columns
|
||||
|
||||
| Column | Type | Description |
|
||||
|--------|------|-------------|
|
||||
| iteration | int | Sequential counter starting at 0 (baseline) |
|
||||
| commit | string | Short git hash (7 chars), "-" if reverted |
|
||||
| metric | float | Measured value from verification |
|
||||
| delta | float | Change from previous best (negative = improved for "lower is better") |
|
||||
| status | enum | `baseline`, `keep`, `discard`, `crash` |
|
||||
| description | string | One-sentence description of what was tried |
|
||||
| commander_task_id | int | Commander task ID for this iteration, "-" if Commander unavailable |
|
||||
|
||||
### Example
|
||||
|
||||
```tsv
|
||||
iteration commit metric delta status description commander_task_id
|
||||
0 a1b2c3d 85.2 0.0 baseline initial state — test coverage 85.2% -
|
||||
1 b2c3d4e 87.1 +1.9 keep add tests for auth middleware edge cases 184
|
||||
2 - 86.5 -0.6 discard refactor test helpers (broke 2 tests) 185
|
||||
3 - 0.0 0.0 crash add integration tests (DB connection failed) 186
|
||||
4 c3d4e5f 88.3 +1.2 keep add tests for error handling in API routes 187
|
||||
5 d4e5f6g 89.0 +0.7 keep add boundary value tests for validators 188
|
||||
```
|
||||
|
||||
## Log Management
|
||||
|
||||
- Create at setup (iteration 0 = baseline)
|
||||
- Append after EVERY iteration (including crashes)
|
||||
- Do NOT commit this file to git (add to .gitignore)
|
||||
- Read last 10-20 entries at start of each iteration for context
|
||||
- Use to detect patterns: what kind of changes tend to succeed?
|
||||
|
||||
## Summary Reporting
|
||||
|
||||
Every 10 iterations (or at loop completion in bounded mode), print a brief summary:
|
||||
|
||||
```
|
||||
=== Autoresearch Progress (iteration 20) ===
|
||||
Baseline: 85.2% -> Current best: 92.1% (+6.9%)
|
||||
Keeps: 8 | Discards: 10 | Crashes: 2
|
||||
Last 5: keep, discard, discard, keep, keep
|
||||
```
|
||||
|
||||
## Metric Direction
|
||||
|
||||
Clarify at setup whether lower or higher is better:
|
||||
- **Lower is better:** val_bpb, response time (ms), bundle size (KB), error count
|
||||
- **Higher is better:** test coverage (%), lighthouse score, throughput (req/s)
|
||||
|
||||
Record direction in first line of results log as a comment:
|
||||
```
|
||||
# metric_direction: higher_is_better
|
||||
```
|
||||
Reference in New Issue
Block a user