Initial: pi-skill — 68 skills, 43 extensions, 11 themes for Pi

2026-05-25 16:38:02 +07:00
commit 69f7d8bdda
1689 changed files with 342427 additions and 0 deletions
--- a/skills/autoresearch/references/autonomous-loop-protocol.md
+++ b/skills/autoresearch/references/autonomous-loop-protocol.md
@@ -0,0 +1,246 @@
+# Autonomous Loop Protocol
+
+Detailed protocol for the autoresearch iteration loop. SKILL.md has the summary; this file has the full rules.
+
+## Loop Modes
+
+Autoresearch supports two loop modes:
+
+- **Unbounded (default):** Loop forever until manually interrupted (Ctrl+C)
+- **Bounded:** Loop exactly N times when user specifies a count
+
+When bounded, track `current_iteration` against `max_iterations`. After the final iteration, print a summary and stop.
+
+## Phase 1: Review (30 seconds)
+
+Before each iteration, build situational awareness:
+
+```
+1. Read current state of in-scope files (full context)
+2. Read last 10-20 entries from results log
+3. Read git log --oneline -20 to see recent changes
+4. Re-read `.context/autoresearch-plan.md` Strategy section for planned approaches
+5. Identify: what worked, what failed, what's untried from the plan
+6. If bounded: check current_iteration vs max_iterations
+```
+
+**Why read every time?** After rollbacks, state may differ from what you expect. Never assume — always verify.
+
+## Phase 2: Ideate (Strategic)
+
+Pick the NEXT change. Priority order:
+
+1. **Fix crashes/failures** from previous iteration first
+2. **Exploit successes** — if last change improved metric, try variants in same direction
+3. **Explore new approaches** — try something the results log shows hasn't been attempted
+4. **Combine near-misses** — two changes that individually didn't help might work together
+5. **Simplify** — remove code while maintaining metric. Simpler = better
+6. **Radical experiments** — when incremental changes stall, try something dramatically different
+
+**Anti-patterns:**
+- Don't repeat exact same change that was already discarded
+- Don't make multiple unrelated changes at once (can't attribute improvement)
+- Don't chase marginal gains with ugly complexity
+
+**Bounded mode consideration:** If remaining iterations are limited (<3 left), prioritize exploiting successes over exploration.
+
+### Commander: Create + Claim Task
+
+After ideating the next change, create a Commander task and claim it:
+
+```
+commander_task { operation: "create", description: "Iteration #N: <planned change>", working_directory: "<cwd>", group_id: <group_id> }
+commander_task { operation: "claim", task_id: <returned_task_id>, agent_name: "autoresearch" }
+```
+
+This makes the iteration visible on the Commander dashboard as an active task.
+
+## Phase 3: Modify (One Atomic Change)
+
+- Make ONE focused change to in-scope files
+- The change should be explainable in one sentence
+- Write the description BEFORE making the change (forces clarity)
+
+## Phase 4: Commit (Before Verification)
+
+```bash
+git add <changed-files>
+git commit -m "experiment: <one-sentence description>"
+```
+
+Commit BEFORE running verification so rollback is clean: `git reset --hard HEAD~1`
+
+## Phase 5: Verify (Mechanical Only)
+
+Run the agreed-upon verification command. Capture output.
+
+**Timeout rule:** If verification exceeds 2x normal time, kill and treat as crash.
+
+**Extract metric:** Parse the verification output for the specific metric number.
+
+## Phase 6: Decide (No Ambiguity)
+
+```
+IF metric_improved:
+    STATUS = "keep"
+    # Do nothing — commit stays
+ELIF metric_same_or_worse:
+    STATUS = "discard"
+    git reset --hard HEAD~1
+ELIF crashed:
+    # Attempt fix (max 3 tries)
+    IF fixable:
+        Fix -> re-commit -> re-verify
+    ELSE:
+        STATUS = "crash"
+        git reset --hard HEAD~1
+```
+
+**Simplicity override:** If metric barely improved (+<0.1%) but change adds significant complexity, treat as "discard". If metric unchanged but code is simpler, treat as "keep".
+
+## Phase 7: Log Results
+
+Append to results log (TSV format):
+
+```
+iteration  commit   metric   status   description   commander_task_id
+42         a1b2c3d  0.9821   keep     increase attention heads from 8 to 12   184
+43         -        0.9845   discard  switch optimizer to SGD   185
+44         -        0.0000   crash    double batch size (OOM)   186
+```
+
+### Commander: Complete Task + Add Comment
+
+After logging, complete the Commander task and add a detailed comment:
+
+```
+commander_task { operation: "complete", task_id: <task_id>, result: "<keep|discard|crash>: <description>. Metric: <old> → <new> (delta: <delta>)" }
+commander_task { operation: "comment:add", task_id: <task_id>, body: "Status: <status>\nMetric: <value> (delta: <delta>)\nCommit: <hash or '-'>\nDescription: <what was tried>", agent_name: "autoresearch" }
+```
+
+Use `complete` for all outcomes — discards and crashes are expected in autoresearch, not failures.
+
+### Session Save
+
+On every "keep" result or every ~5 iterations, update the research session file (`.context/research-sessions/<session-id>.json`):
+- Append to the `iterations` array
+- Update `metric.final` with the current best metric value
+
+## Phase 8: Repeat
+
+### Commander: Status Broadcast Every ~5 Iterations
+
+Every 5 iterations, send a mailbox status update so the Commander dashboard shows live progress:
+
+```
+commander_mailbox {
+  operation: "send",
+  from_agent: "autoresearch",
+  to_agent: "commander",
+  body: "Autoresearch progress — Iteration #N: metric at <value> (baseline: <baseline>). Keeps: X | Discards: Y | Crashes: Z",
+  message_type: "status"
+}
+```
+
+### Unbounded Mode (default)
+
+Go to Phase 1. **NEVER STOP. NEVER ASK IF YOU SHOULD CONTINUE.**
+
+### Bounded Mode (with iteration count)
+
+```
+IF current_iteration < max_iterations:
+    Go to Phase 1
+ELIF goal_achieved:
+    Print: "Goal achieved at iteration {N}! Final metric: {value}"
+    Print final summary
+    STOP
+ELSE:
+    Print final summary
+    STOP
+```
+
+**Final summary format:**
+```
+=== Autoresearch Complete (N/N iterations) ===
+Baseline: {baseline} -> Final: {current} ({delta})
+Keeps: X | Discards: Y | Crashes: Z
+Best iteration: #{n} — {description}
+```
+
+### Commander: Final Broadcast + Completion Report (MANDATORY)
+
+When the loop ends (bounded or goal achieved), ALL three steps are required:
+
+1. Send a final mailbox result broadcast:
+```
+commander_mailbox {
+  operation: "send",
+  from_agent: "autoresearch",
+  to_agent: "commander",
+  body: "Autoresearch complete (N iterations). Baseline: <X> → Final: <Y> (delta: <Z>). Keeps: A | Discards: B | Crashes: C. Best: #M — <description>",
+  message_type: "result"
+}
+```
+
+2. Call `show_report` to open the visual completion report with a rich summary that ties back to the research plan:
+```
+show_report {
+  title: "Autoresearch Complete: <goal>",
+  summary: "## Results\n\nBaseline: <X> → Final: <Y> (delta: <Z>)\n\n**Iterations:** N total (A keeps, B discards, C crashes)\n\n**Best:** #M — <description>\n\n## Plan vs. Reality\n\nReference `.context/autoresearch-plan.md` — which planned strategies were tried? Which worked? What was surprising?\n\n## Kept Changes\n\n<list of kept iterations with descriptions>\n\n## What Didn't Work\n\n<Discarded approaches and why>"
+}
+```
+
+3. Preserve `.context/autoresearch-plan.md` as a record of the research session. Do not delete it.
+
+### Implementation Handoff
+
+After the completion report, the autoresearch process continues:
+
+1. **Compile findings** — Extract prioritized next steps from the research results
+2. **Update session** — Save findings and next steps to the session file
+3. **Ask user** — Offer to implement now (spawn team), save & pause, or mark done
+4. **If implementing** — Dispatch builder agents via `subagent_create_batch`, track via Commander, present final report when done
+5. **Save session** — Update status to "complete" with implementation summary
+
+See the main SKILL.md or autoresearch.md for full implementation handoff instructions.
+
+### When Stuck (>5 consecutive discards)
+
+Applies to both modes:
+1. Re-read ALL in-scope files from scratch
+2. Re-read the original goal AND `.context/autoresearch-plan.md` for planned strategy
+3. Review entire results log for patterns
+4. Try the next untried approach from the plan's Strategy section
+5. Try combining 2-3 previously successful changes
+6. Try the OPPOSITE of what hasn't been working
+7. Try a radical architectural change
+
+## Crash Recovery
+
+- Syntax error: fix immediately, don't count as separate iteration
+- Runtime error: attempt fix (max 3 tries), then move on
+- Resource exhaustion (OOM): revert, try smaller variant
+- Infinite loop/hang: kill after timeout, revert, avoid that approach
+- External dependency failure: skip, log, try different approach
+
+## Communication
+
+- **DO NOT** ask "should I keep going?" — in unbounded mode, YES. ALWAYS. In bounded mode, continue until N is reached.
+- **DO NOT** summarize after each iteration — just log and continue
+- **DO** print a brief one-line status every ~5 iterations (e.g., "Iteration 25: metric at 0.95, 8 keeps / 17 discards")
+- **DO** alert if you discover something surprising or game-changing
+- **DO** print a final summary when bounded loop completes
+- **DO** send Commander mailbox status broadcasts every ~5 iterations
+- **DO** add comments to Commander tasks with detailed iteration results
+- **DO** send a final Commander mailbox result broadcast when the loop ends
+- **DO** call `show_report` at loop end for a visual completion report
+
+## Commander Graceful Degradation
+
+All Commander integration (task tracking, mailbox broadcasts, comments) is **optional**. If Commander is unavailable:
+
+- Skip all `commander_task` and `commander_mailbox` calls silently
+- Never let a Commander error interrupt or break the autonomous loop
+- The local `autoresearch-results.tsv` remains the primary record
+- `show_report` still works without Commander (it only needs git)
--- a/skills/autoresearch/references/core-principles.md
+++ b/skills/autoresearch/references/core-principles.md
@@ -0,0 +1,92 @@
+# Core Principles — From Karpathy's Autoresearch
+
+7 universal principles extracted from autoresearch, applicable to ANY autonomous work.
+
+## 1. Constraint = Enabler
+
+Autonomy succeeds through intentional constraint, not despite it.
+
+| Autoresearch | Generalized |
+|--------------|-------------|
+| 630-line codebase | Bounded scope that fits agent context |
+| 5-minute time budget | Fixed iteration cost |
+| One metric (val_bpb) | Single mechanical success criterion |
+
+**Why:** Constraints enable agent confidence (full context understood), verification simplicity (no ambiguity), iteration velocity (low cost = rapid feedback loops).
+
+**Apply:** Before starting, define: what files are in-scope? What's the ONE metric? What's the time budget per iteration?
+
+## 2. Separate Strategy from Tactics
+
+Humans set direction. Agents execute iterations.
+
+| Strategic (Human) | Tactical (Agent) |
+|-------------------|------------------|
+| "Improve page load speed" | "Lazy-load images, code-split routes" |
+| "Increase test coverage" | "Add tests for uncovered edge cases" |
+| "Refactor auth module" | "Extract middleware, simplify handlers" |
+
+**Why:** Humans understand WHY. Agents handle HOW. Mixing these roles wastes both human creativity and agent iteration speed.
+
+**Apply:** Get clear direction from user (or project docs). Then iterate autonomously on implementation.
+
+## 3. Metrics Must Be Mechanical
+
+If you can't verify with a command, you can't iterate autonomously.
+
+- Tests pass/fail (exit code 0)
+- Benchmark time in milliseconds
+- Coverage percentage
+- Lighthouse score
+- File size in bytes
+- Lines of code count
+
+**Anti-pattern:** "Looks better", "probably improved", "seems cleaner" — these KILL autonomous loops because there's no decision function.
+
+**Apply:** Define the grep command (or equivalent) that extracts your metric BEFORE starting.
+
+## 4. Verification Must Be Fast
+
+If verification takes longer than the work itself, incentives misalign.
+
+| Fast (enables iteration) | Slow (kills iteration) |
+|-------------------------|----------------------|
+| Unit tests (seconds) | Full E2E suite (minutes) |
+| Type check (seconds) | Manual QA (hours) |
+| Lint check (instant) | Code review (async) |
+
+**Apply:** Use the FASTEST verification that still catches real problems. Save slow verification for after the loop.
+
+## 5. Iteration Cost Shapes Behavior
+
+- Cheap iteration: bold exploration, many experiments
+- Expensive iteration: conservative, few experiments
+
+Autoresearch: 5-minute cost = 100 experiments/night.
+Software: 10-second test = 360 experiments/hour.
+
+**Apply:** Minimize iteration cost. Use fast tests, incremental builds, targeted verification. Every minute saved = more experiments run.
+
+## 6. Git as Memory and Audit Trail
+
+Every successful change is committed. This enables:
+- **Causality tracking** — which change drove improvement?
+- **Stacking wins** — each commit builds on prior successes
+- **Pattern learning** — agent sees what worked in THIS codebase
+- **Human review** — researcher inspects agent's decision sequence
+
+**Apply:** Commit before verify. Revert on failure. Agent reads its own git history to inform next experiment.
+
+## 7. Honest Limitations
+
+State what the system can and cannot do. Don't oversell.
+
+Autoresearch CANNOT: change tokenizer, replace human direction, guarantee meaningful improvements.
+
+**Apply:** At setup, explicitly state constraints. If agent hits a wall it can't solve (missing permissions, external dependency, needs human judgment), say so clearly instead of guessing.
+
+## The Meta-Principle
+
+> Autonomy scales when you constrain scope, clarify success, mechanize verification, and let agents optimize tactics while humans optimize strategy.
+
+This isn't "removing humans." It's reassigning human effort from execution to direction. Humans become MORE valuable by focusing on irreducibly creative/strategic work.
--- a/skills/autoresearch/references/results-logging.md
+++ b/skills/autoresearch/references/results-logging.md
@@ -0,0 +1,65 @@
+# Results Logging Protocol
+
+Track every iteration in a structured log. Enables pattern recognition and prevents repeating failed experiments.
+
+## Log Format (TSV)
+
+Create `autoresearch-results.tsv` in the working directory (gitignored):
+
+```tsv
+iteration	commit	metric	delta	status	description	commander_task_id
+```
+
+### Columns
+
+| Column | Type | Description |
+|--------|------|-------------|
+| iteration | int | Sequential counter starting at 0 (baseline) |
+| commit | string | Short git hash (7 chars), "-" if reverted |
+| metric | float | Measured value from verification |
+| delta | float | Change from previous best (negative = improved for "lower is better") |
+| status | enum | `baseline`, `keep`, `discard`, `crash` |
+| description | string | One-sentence description of what was tried |
+| commander_task_id | int | Commander task ID for this iteration, "-" if Commander unavailable |
+
+### Example
+
+```tsv
+iteration	commit	metric	delta	status	description	commander_task_id
+0	a1b2c3d	85.2	0.0	baseline	initial state — test coverage 85.2%	-
+1	b2c3d4e	87.1	+1.9	keep	add tests for auth middleware edge cases	184
+2	-	86.5	-0.6	discard	refactor test helpers (broke 2 tests)	185
+3	-	0.0	0.0	crash	add integration tests (DB connection failed)	186
+4	c3d4e5f	88.3	+1.2	keep	add tests for error handling in API routes	187
+5	d4e5f6g	89.0	+0.7	keep	add boundary value tests for validators	188
+```
+
+## Log Management
+
+- Create at setup (iteration 0 = baseline)
+- Append after EVERY iteration (including crashes)
+- Do NOT commit this file to git (add to .gitignore)
+- Read last 10-20 entries at start of each iteration for context
+- Use to detect patterns: what kind of changes tend to succeed?
+
+## Summary Reporting
+
+Every 10 iterations (or at loop completion in bounded mode), print a brief summary:
+
+```
+=== Autoresearch Progress (iteration 20) ===
+Baseline: 85.2% -> Current best: 92.1% (+6.9%)
+Keeps: 8 | Discards: 10 | Crashes: 2
+Last 5: keep, discard, discard, keep, keep
+```
+
+## Metric Direction
+
+Clarify at setup whether lower or higher is better:
+- **Lower is better:** val_bpb, response time (ms), bundle size (KB), error count
+- **Higher is better:** test coverage (%), lighthouse score, throughput (req/s)
+
+Record direction in first line of results log as a comment:
+```
+# metric_direction: higher_is_better
+```