# Agent Flat-File Context System Review
## Scope
This review documents the **current implementation** of ALwrity's onboarding flat-file context system and compares it to the proposed **Direct-to-File Virtual Shell (VFS)** model.

---
## 1) Present Implementation (What Exists Today)
### 1.1 Storage model
- Context is stored per user under:
  - `backend/workspace/workspace_<safe_user_id>/agent_context/`
- Files are JSON documents, one per onboarding domain, plus a manifest index:
  - `step2_website_analysis.json`
  - `step3_research_preferences.json`
  - `step4_persona_data.json`
  - `step5_integrations.json`
  - `context_manifest.json`
### 1.2 Writer and reader
- `AgentFlatContextStore` is the core component that:
  - sanitizes user IDs for path safety,
  - writes documents atomically (`tempfile` + `os.replace`),
  - sets restrictive file permissions (`0600` best effort),
  - generates structured `agent_summary` objects,
  - updates a manifest index of available documents.
- Data is loaded by direct file reads from the same class (`load_stepX_context_document`).
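The write path described above can be sketched roughly as follows. This is a minimal illustration, not the actual `AgentFlatContextStore` code: the helper names `sanitize_user_id` and `write_json_atomically` and the exact sanitization rule are assumptions made for the example.

```python
import json
import os
import re
import tempfile
from pathlib import Path


def sanitize_user_id(user_id: str) -> str:
    """Keep only characters that are safe in a directory name (assumed rule)."""
    safe = re.sub(r"[^A-Za-z0-9_-]", "_", user_id)
    return safe or "anonymous"


def write_json_atomically(target: Path, payload: dict) -> None:
    """Write payload to target so readers never observe a half-written file."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, ensure_ascii=False, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        try:
            os.chmod(tmp_name, 0o600)  # best effort; limited effect on some platforms
        except OSError:
            pass
        os.replace(tmp_name, target)  # atomic rename over any existing file
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Writing to a sibling temp file and renaming it means a concurrent reader sees either the old document or the new one, never a partial write.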
### 1.3 Read-path fallback chain
`SIFIntegrationService` uses a strict fallback sequence for onboarding context retrieval:
1. **flat file** (`AgentFlatContextStore`)
2. **database** (`WebsiteAnalysis`, `ResearchPreferences`, `PersonaData`, etc.)
3. **SIF semantic index** (`TxtaiIntelligenceService.search`)
Step 5 (integrations) skips the database tier and uses `flat_file -> sif_semantic`.
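A fallback chain of this kind can be expressed as an ordered list of loaders. The sketch below is illustrative only: `load_with_fallback` and the loader callables are hypothetical stand-ins, not methods of `SIFIntegrationService`.

```python
from typing import Callable, Optional, Sequence, Tuple

# A loader takes a user ID and returns a document, or None when it has nothing.
Loader = Callable[[str], Optional[dict]]


def load_with_fallback(
    user_id: str, sources: Sequence[Tuple[str, Loader]]
) -> Tuple[Optional[str], Optional[dict]]:
    """Return (source_name, document) from the first source that yields data."""
    for name, loader in sources:
        try:
            document = loader(user_id)
        except Exception:
            # A failing source should not block the next one in the chain.
            document = None
        if document:
            return name, document
    return None, None
```

Under these assumptions, Steps 2 to 4 would pass three loaders (`flat_file`, `database`, `sif_semantic`) and Step 5 would pass only two, omitting the database entry.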
### 1.4 Producer flow (onboarding persistence)
`StepManagementService` persists canonical snapshots to flat context when onboarding steps are saved:
- Step 2 website analysis
- Step 3 research preferences (and later competitor-enriched refresh)
- Step 4 persona data
- Step 5 integrations
### 1.5 Context optimization currently implemented
- Sensitive-key redaction in nested payloads (`api_key`, `token`, `secret`, etc.).
- Size budgeting with trimming (`DEFAULT_MAX_BYTES = 300_000`) and trim metadata.
- Generated summaries include:
  - quick facts,
  - retrieval hints (high-signal terms and suggested agent queries),
  - domain-specific focus blocks.
- Document context includes audience, retrieval contract, journey stage, related documents, and context-window guidance.
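To make the redaction and size-budget ideas concrete, here is a small sketch. Only the key names `api_key`, `token`, `secret` and the constant `DEFAULT_MAX_BYTES = 300_000` come from this document; the function names, the `[REDACTED]` placeholder, the shape of `trim_metadata`, and the "drop the largest top-level field first" strategy are assumptions for illustration. It also assumes the document keeps its payload in a `data` dict.

```python
import json
from typing import Any

SENSITIVE_MARKERS = ("api_key", "token", "secret")  # the real list is likely longer
DEFAULT_MAX_BYTES = 300_000


def redact(value: Any) -> Any:
    """Recursively replace values whose key name looks sensitive."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]"
            if any(marker in str(key).lower() for marker in SENSITIVE_MARKERS)
            else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def encoded_size(document: dict) -> int:
    """Size of the document once serialized as UTF-8 JSON."""
    return len(json.dumps(document, ensure_ascii=False).encode("utf-8"))


def trim_to_budget(document: dict, max_bytes: int = DEFAULT_MAX_BYTES) -> dict:
    """Drop the largest top-level "data" fields until the document fits the budget."""
    trimmed = dict(document)
    data = dict(trimmed.get("data", {}))
    trimmed["data"] = data
    dropped = []
    while data and encoded_size(trimmed) > max_bytes:
        largest = max(data, key=lambda key: encoded_size({key: data[key]}))
        del data[largest]
        dropped.append(largest)
    if dropped:
        # The metadata itself is not counted against the budget in this sketch.
        trimmed["trim_metadata"] = {"dropped_keys": dropped, "max_bytes": max_bytes}
    return trimmed
```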
---
## 2) Comparison vs Proposed Direct-to-File VFS
### Strong alignment
The current system already matches the proposal in important ways:
- **Direct-to-file persistence** instead of DB-backed retrieval for fast reads.
- **Manifest/index concept** (`context_manifest.json`) that can act like a precomputed path map.
- **Agent-first retrieval semantics** (summary-first contract and fallback policy).
- **Operational safety controls** (atomic writes, redaction, path sanitization).
### Gaps vs full virtual shell abstraction
The following pieces are not fully implemented as described in your proposed architecture:
- No explicit **virtual shell provider** (`IFileSystem`) exposing `ls/cat/grep/find` commands.
- No always-live, process-level **in-memory `Map<virtualPath, absolutePath>`** for path lookups.
- No native glob/query command layer for agent shell UX.
- No **read-only enforcement at the API surface** (writes are intentionally allowed so that onboarding services can refresh context).
---
## 3) Practical Recommendation: Incremental VFS Evolution
1. **Introduce a read-only VFS facade for agents**
   - Keep `AgentFlatContextStore` as the write path for trusted onboarding services.
   - Add `AgentContextVFS` read adapter exposing:
     - `ls(path)` from manifest,
     - `cat(path)` mapped to underlying JSON,
     - `find(glob)` on virtual keys,
     - `grep(query)` with path prefilter + stream scan.
2. **Promote manifest to a first-class path map**
   - Build and cache an in-memory map on service startup or first access.
   - Refresh map when manifest `updated_at` changes.
3. **Add explicit write policy boundaries**
   - Agent-facing interface: hard read-only (`EROFS`).
   - Internal system service interface: allow writes for onboarding synchronization.
4. **Metadata strategy for grep ranking**
   - Prioritize in order:
     1) `agent_summary.quick_facts`
     2) `agent_summary.retrieval_hints.high_signal_terms`
     3) `document_context.context_type` and `journey.stage`
     4) full `data` body
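The ranking order in item 4 could be implemented as a field-priority lookup. In the sketch below, only the ordering of fields comes from the list above; the numeric weights, the function names, and the assumption that `journey.stage` lives under `document_context` are illustrative choices.

```python
import json
from typing import Any, Dict, List, Optional, Tuple

# Weights are illustrative; only their relative order reflects the list above.
FIELD_WEIGHTS: List[Tuple[Tuple[str, ...], int]] = [
    (("agent_summary", "quick_facts"), 8),
    (("agent_summary", "retrieval_hints", "high_signal_terms"), 4),
    (("document_context", "context_type"), 2),
    (("document_context", "journey", "stage"), 2),  # assumed location of journey.stage
    (("data",), 1),
]


def get_path(document: Dict[str, Any], path: Tuple[str, ...]) -> Optional[Any]:
    """Walk nested dicts along path; return None if any key is missing."""
    node: Any = document
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def score_document(document: Dict[str, Any], query: str) -> int:
    """Return the weight of the highest-priority field that contains the query."""
    needle = query.lower()
    for path, weight in FIELD_WEIGHTS:
        value = get_path(document, path)
        if value is not None and needle in json.dumps(value, ensure_ascii=False).lower():
            return weight
    return 0


def rank(documents: Dict[str, Dict[str, Any]], query: str) -> List[Tuple[str, int]]:
    """Return (name, score) pairs for matching documents, best match first."""
    scored = [(name, score_document(doc, query)) for name, doc in documents.items()]
    return sorted((item for item in scored if item[1] > 0), key=lambda item: -item[1])
```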
---
## 4) Response to the Metadata Header Question
> "Does your current `.txt` optimization include specific metadata headers (like YAML frontmatter) that the grep tool should prioritize?"

For this implementation, context is currently persisted as structured JSON (not `.txt` with YAML frontmatter). Equivalent high-value metadata already exists and should be prioritized for search/ranking:
- `context_type`
- `updated_at`
- `agent_summary.quick_facts`
- `agent_summary.retrieval_hints.high_signal_terms`
- `document_context.journey.stage`
- `document_context.related_documents`
If you later move to `.txt` transport files, mirror these as frontmatter fields to preserve retrieval quality.
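If `.txt` transport files are adopted later, the mirroring could look like the sketch below. The function name `to_frontmatter_text` and the flattened frontmatter key names are hypothetical. Values are emitted with `json.dumps`, which relies on JSON values being valid YAML flow values and avoids a YAML dependency.

```python
import json
from typing import Any, Dict


def to_frontmatter_text(document: Dict[str, Any], body: str) -> str:
    """Render a context document as text with a YAML frontmatter header."""
    summary = document.get("agent_summary", {})
    context = document.get("document_context", {})
    fields = {
        "context_type": document.get("context_type"),
        "updated_at": document.get("updated_at"),
        "quick_facts": summary.get("quick_facts", []),
        "high_signal_terms": summary.get("retrieval_hints", {}).get("high_signal_terms", []),
        "journey_stage": context.get("journey", {}).get("stage"),
        "related_documents": context.get("related_documents", []),
    }
    # One "key: value" line per field, in the priority order listed above.
    lines = [f"{key}: {json.dumps(value, ensure_ascii=False)}" for key, value in fields.items()]
    return "---\n" + "\n".join(lines) + "\n---\n" + body
```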

---
## 5) Bottom line
Your current onboarding flat-file context implementation is already a strong "shim" architecture and close to the proposed model. The biggest missing piece is a dedicated virtual-shell read interface (`ls/cat/grep/find`) backed by a persistent path-map cache and a clear read-only contract for agent execution contexts.
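One possible shape for such a read interface is sketched below. The class name `ReadOnlyContextVFS` and the manifest layout (`updated_at` plus a `documents` mapping of virtual path to filename) are assumptions for the example and may not match the real `context_manifest.json` schema.

```python
import errno
import json
from pathlib import Path
from typing import Dict, List, Optional


class ReadOnlyContextVFS:
    """Hypothetical read-only view over one user's agent_context directory."""

    def __init__(self, context_dir: Path) -> None:
        self._root = context_dir.resolve()
        self._stamp: Optional[str] = None
        self._paths: Dict[str, Path] = {}

    def _refresh(self) -> None:
        """Rebuild the virtual-path map only when the manifest stamp changes."""
        manifest_path = self._root / "context_manifest.json"
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
        stamp = manifest.get("updated_at")
        if stamp is not None and stamp == self._stamp:
            return  # cached map is still current
        paths: Dict[str, Path] = {}
        for virtual, filename in manifest.get("documents", {}).items():
            real = (self._root / filename).resolve()
            if real.parent == self._root:  # skip entries that escape the root
                paths[virtual] = real
        self._paths, self._stamp = paths, stamp

    def ls(self) -> List[str]:
        self._refresh()
        return sorted(self._paths)

    def cat(self, virtual_path: str) -> dict:
        self._refresh()
        if virtual_path not in self._paths:
            raise FileNotFoundError(errno.ENOENT, "no such context document", virtual_path)
        return json.loads(self._paths[virtual_path].read_text(encoding="utf-8"))

    def write(self, virtual_path: str, payload: dict) -> None:
        # Agents never write through this facade; trusted services use the store.
        raise OSError(errno.EROFS, "agent context is read-only", virtual_path)
```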

---
## 6) Implemented Follow-up (VFS Adapter + Workspace Guide)
The following enhancements are now implemented:
1. **Auto-generated workspace map**
   - The system now generates `workspace_<user>/README.md` whenever `context_manifest.json` is updated.
   - The README includes:
     - available context files,
     - key signal hints from `agent_summary.retrieval_hints.high_signal_terms`,
     - journey-stage hints,
     - virtual path mappings and retrieval strategy guidance.
2. **Read-only VFS facade**
   - Added `AgentContextVFS` with:
     - `list_context()` (`ls` equivalent),
     - `search_context()` (`grep` equivalent; prioritizes `high_signal_terms` and `quick_facts`),
     - `read_context_file()` (`cat` equivalent; large-file summary mode + subkey drilldown),
     - explicit write rejection (`EROFS`).
3. **Virtual path support**
   - `/env/summary` maps to `AgentFlatContextStore.generate_total_summary()`.
   - `/steps/website`, `/steps/research`, `/steps/persona`, `/steps/integrations` map to step documents.
4. **System-prompt helper**
   - Added `build_filesystem_header(user_id)` to inject a compact file availability + priority hint block into agent startup prompts.
5. **Merged context helper in SIF integration**
   - `SIFIntegrationService.get_merged_flat_context()` now provides a unified view across all available flat files while preserving existing per-step retrieval methods.
6. **Basic file-level security hardening**
   - Workspace and context directories are now explicitly forced to `0700`.
   - Context and workspace files are written with strict `0600`.
   - Added path sandboxing to ensure requested paths cannot escape user workspace roots.
   - Restricted context-file loading to an allowlist of known onboarding context documents.
   - Added deterministic per-user secret derivation from `.env` (`FILE_ENCRYPTION_SALT` + `safe_user_id`) with non-sensitive fingerprints for audit/debug and future encryption-at-rest rollout.
7. **Tool-logic enhancement (coarse-to-fine search)**
   - `search_context` now performs a two-pass retrieval:
     1) high-relevance summary match pass (`high_signal_terms`, `quick_facts`),
     2) parallelized stream scan pass over sandboxed allowlisted files for supporting details.
   - Results include relevance labels, snippets, and line numbers for body matches.
   - Large-result behavior now reports truncation guidance (show top 10 and suggest narrower keywords).
   - `inspect_file` now provides token-saving behavior: full return for small files, or `agent_summary` + top-level keys for larger files, with key-level zoom-in support.
8. **Retrieval robustness roadmap (next hardening phase)**
   - **Query normalization:** Add synonym expansion and typo-tolerant matching (e.g., `tone` -> `brand voice`) before coarse/fine passes.
   - **Confidence scoring:** Return confidence tiers that blend source freshness (`updated_at`), summary-match strength, and match density.
   - **Field-aware boosting:** Weight matches by field priority (`high_signal_terms` > `quick_facts` > `data`) and document recency.
   - **Deduplicated evidence:** Collapse repeated hits from the same file/key into one clustered result with a single best snippet and hit count.
   - **Fallback query reformulation:** If zero hits, automatically retry with narrowed or expanded variants and return the attempted queries.
   - **Answerability contract:** Add a lightweight `can_answer` signal in search responses so orchestrators can decide whether to ask follow-up questions or fetch more context.
   - **Evaluation harness:** Track retrieval metrics over golden queries (`precision@k`, `MRR`, zero-hit rate, stale-hit rate) in CI to prevent relevance regressions.
9. **Collaborative VFS namespace (shared memory mode)**
   - Added optional `project_id` support to `AgentContextVFS` with isolated root: `workspace/project_<project_id>/`.
   - Introduced `scratchpad/` for collaborative writes while keeping onboarding `agent_context` read-first.
   - Added `write_shared_note(...)` with advisory locking (`flock`) and strict filename/path validation.
   - Added append-only `activity_log.jsonl` via `append_activity_log(...)` for watchdog/event-driven coordination.
   - Maintains owner-only permissions (`0700` scratchpad dir, `0600` files) and audit trails for shared writes.
10. **Testing readiness upgrades**
    - Added automated tests for:
      - query reformulation + `can_answer` behavior in `search_context`,
      - large-file progressive disclosure behavior in `inspect_file`,
      - collaborative write path (`write_shared_note`) and append-only activity logging.
    - Test module: `backend/tests/test_agent_context_vfs.py`.
    - These tests provide a baseline regression harness for VFS retrieval quality and shared-memory safety.
11. **Static + Structural retrieval hardening**
    - Added a **static triage layer** in `search_context`:
      - keyword-density scoring,
      - `low_probability` flags for likely-noisy hits,
      - `triage_top5` shortlist for router-style pre-filtering.
    - Added `read_struct(filename, path_query)`:
      - resolves dot/bracket JSON paths to return node-level data only,
      - includes lightweight dependency injection (e.g., Step 4 persona reads include Step 2 brand voice context when available),
      - keeps output token-efficient for downstream agents.
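The dot/bracket path resolution mentioned for `read_struct` can be illustrated with a small resolver. This is a sketch under assumptions, not the actual implementation: `parse_path` and `resolve_path` are hypothetical names, the resolver works on an already loaded document rather than a filename, and the accepted path grammar (dot-separated keys plus `[n]` list indexes) is a guess at the supported syntax.

```python
import re
from typing import Any, List, Union

# Either an optional dot followed by a key, or a bracketed list index.
_TOKEN = re.compile(r"\.?([^.\[\]]+)|\[(\d+)\]")


def parse_path(path_query: str) -> List[Union[str, int]]:
    """Split a path such as "a.b[0].c" into keys (str) and indexes (int)."""
    tokens: List[Union[str, int]] = []
    position = 0
    while position < len(path_query):
        match = _TOKEN.match(path_query, position)
        if match is None:
            raise ValueError(f"invalid path near position {position}: {path_query!r}")
        key, index = match.group(1), match.group(2)
        tokens.append(int(index) if index is not None else key)
        position = match.end()
    return tokens


def resolve_path(document: Any, path_query: str) -> Any:
    """Return only the node addressed by path_query, or raise KeyError."""
    node = document
    for token in parse_path(path_query):
        if isinstance(token, int):
            if not isinstance(node, list) or token >= len(node):
                raise KeyError(path_query)
        elif not isinstance(node, dict) or token not in node:
            raise KeyError(path_query)
        node = node[token]
    return node
```

Returning a single node instead of the whole document is what keeps the output small for downstream agents; an empty path returns the document unchanged.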