# Agent Flat-File Context System Review ## Scope This review documents the **current implementation** of ALwrity's onboarding flat-file context system and compares it to the proposed **Direct-to-File Virtual Shell (VFS)** model. --- ## 1) Present Implementation (What Exists Today) ### 1.1 Storage model - Context is stored per user under: - `backend/workspace/workspace_/agent_context/` - Files are JSON documents, one per onboarding domain: - `step2_website_analysis.json` - `step3_research_preferences.json` - `step4_persona_data.json` - `step5_integrations.json` - `context_manifest.json` ### 1.2 Writer and reader - `AgentFlatContextStore` is the core component that: - sanitizes user IDs for path safety, - writes documents atomically (`tempfile` + `os.replace`), - sets restrictive file permissions (`0600` best effort), - generates structured `agent_summary` objects, - updates a manifest index of available documents. - Data is loaded by direct file reads from the same class (`load_stepX_context_document`). ### 1.3 Read-path fallback chain `SIFIntegrationService` uses a strict fallback sequence for onboarding context retrieval: 1. **flat file** (`AgentFlatContextStore`) 2. **database** (`WebsiteAnalysis`, `ResearchPreferences`, `PersonaData`, etc.) 3. **SIF semantic index** (`TxtaiIntelligenceService.search`) Step 5 uses `flat_file -> sif_semantic`. ### 1.4 Producer flow (onboarding persistence) `StepManagementService` persists canonical snapshots to flat context when onboarding steps are saved: - Step 2 website analysis - Step 3 research preferences (and later competitor-enriched refresh) - Step 4 persona data - Step 5 integrations ### 1.5 Context optimization currently implemented - Sensitive-key redaction in nested payloads (`api_key`, `token`, `secret`, etc.). - Size budgeting with trimming (`DEFAULT_MAX_BYTES = 300_000`) and trim metadata. - Generated summaries include: - quick facts, - retrieval hints (high-signal terms and suggested agent queries), - domain-specific focus blocks. - Document context includes audience, retrieval contract, journey stage, related documents, and context-window guidance. --- ## 2) Comparison vs Proposed Direct-to-File VFS ## Strong alignment The current system already matches the proposal in important ways: - **Direct-to-file persistence** instead of DB-backed retrieval for fast reads. - **Manifest/index concept** (`context_manifest.json`) that can act like a precomputed path map. - **Agent-first retrieval semantics** (summary-first contract and fallback policy). - **Operational safety controls** (atomic writes, redaction, path sanitization). ## Gaps vs full virtual shell abstraction The following pieces are not fully implemented as described in your proposed architecture: - No explicit **virtual shell provider** (`IFileSystem`) exposing `ls/cat/grep/find` commands. - No always-live, process-level **in-memory `Map`** for path lookups. - No native glob/query command layer for agent shell UX. - Not currently **read-only enforced at API surface** (writes are intentionally allowed by onboarding services to refresh context). --- ## 3) Practical Recommendation: Incremental VFS Evolution 1. **Introduce a read-only VFS facade for agents** - Keep `AgentFlatContextStore` as the write path for trusted onboarding services. - Add `AgentContextVFS` read adapter exposing: - `ls(path)` from manifest, - `cat(path)` mapped to underlying JSON, - `find(glob)` on virtual keys, - `grep(query)` with path prefilter + stream scan. 2. **Promote manifest to a first-class path map** - Build and cache an in-memory map on service startup or first access. - Refresh map when manifest `updated_at` changes. 3. **Add explicit write policy boundaries** - Agent-facing interface: hard read-only (`EROFS`). - Internal system service interface: allow writes for onboarding synchronization. 4. **Metadata strategy for grep ranking** - Prioritize in order: 1) `agent_summary.quick_facts` 2) `agent_summary.retrieval_hints.high_signal_terms` 3) `document_context.context_type` and `journey.stage` 4) full `data` body --- ## 4) Response to the Metadata Header Question > "Does your current `.txt` optimization include specific metadata headers (like YAML frontmatter) that the grep tool should prioritize?" For this implementation, context is currently persisted as structured JSON (not `.txt` with YAML frontmatter). Equivalent high-value metadata already exists and should be prioritized for search/ranking: - `context_type` - `updated_at` - `agent_summary.quick_facts` - `agent_summary.retrieval_hints.high_signal_terms` - `document_context.journey.stage` - `document_context.related_documents` If you later move to `.txt` transport files, mirror these as frontmatter fields to preserve retrieval quality. --- ## 5) Bottom line Your current onboarding flat-file context implementation is already a strong "shim" architecture and close to the proposed model. The biggest missing piece is a dedicated virtual-shell read interface (`ls/cat/grep/find`) backed by a persistent path-map cache and a clear read-only contract for agent execution contexts. --- ## 6) Implemented Follow-up (VFS Adapter + Workspace Guide) The following enhancements are now implemented: 1. **Auto-generated workspace map** - The system now generates `workspace_/README.md` whenever `context_manifest.json` is updated. - The README includes: - available context files, - key signal hints from `agent_summary.retrieval_hints.high_signal_terms`, - journey-stage hints, - virtual path mappings and retrieval strategy guidance. 2. **Read-only VFS facade** - Added `AgentContextVFS` with: - `list_context()` (`ls` equivalent), - `search_context()` (`grep` equivalent; prioritizes `high_signal_terms` and `quick_facts`), - `read_context_file()` (`cat` equivalent; large-file summary mode + subkey drilldown), - explicit write rejection (`EROFS`). 3. **Virtual path support** - `/env/summary` maps to `AgentFlatContextStore.generate_total_summary()`. - `/steps/website`, `/steps/research`, `/steps/persona`, `/steps/integrations` map to step documents. 4. **System-prompt helper** - Added `build_filesystem_header(user_id)` to inject a compact file availability + priority hint block into agent startup prompts. 5. **Merged context helper in SIF integration** - `SIFIntegrationService.get_merged_flat_context()` now provides a unified view across all available flat files while preserving existing per-step retrieval methods. 6. **Basic file-level security hardening** - Workspace and context directories are now explicitly forced to `0700`. - Context and workspace files are written with strict `0600`. - Added path sandboxing to ensure requested paths cannot escape user workspace roots. - Restricted context-file loading to an allowlist of known onboarding context documents. - Added deterministic per-user secret derivation from `.env` (`FILE_ENCRYPTION_SALT` + `safe_user_id`) with non-sensitive fingerprints for audit/debug and future encryption-at-rest rollout. 7. **Tool-logic enhancement (coarse-to-fine search)** - `search_context` now performs a two-pass retrieval: 1) high-relevance summary match pass (`high_signal_terms`, `quick_facts`), 2) parallelized stream scan pass over sandboxed allowlisted files for supporting details. - Results include relevance labels, snippets, and line numbers for body matches. - Large-result behavior now reports truncation guidance (show top 10 and suggest narrower keywords). - `inspect_file` now provides token-saving behavior: full return for small files, or `agent_summary` + top-level keys for larger files, with key-level zoom-in support. 8. **Retrieval robustness roadmap (next hardening phase)** - **Query normalization:** Add synonym expansion and typo-tolerant matching (e.g., `tone` ≈ `brand voice`) before coarse/fine passes. - **Confidence scoring:** Return confidence tiers that blend source freshness (`updated_at`), summary-match strength, and match density. - **Field-aware boosting:** Weight matches by field priority (`high_signal_terms` > `quick_facts` > `data`) and document recency. - **Deduplicated evidence:** Collapse repeated hits from the same file/key into one clustered result with a single best snippet and hit count. - **Fallback query reformulation:** If zero hits, automatically retry with narrow/expanded variants and return attempted queries. - **Answerability contract:** Add a lightweight `can_answer` signal in search responses so orchestrators can decide whether to ask follow-up questions or fetch more context. - **Evaluation harness:** Track retrieval metrics over golden queries (`precision@k`, `MRR`, zero-hit rate, stale-hit rate) in CI to prevent relevance regressions. 9. **Collaborative VFS namespace (shared memory mode)** - Added optional `project_id` support to `AgentContextVFS` with isolated root: `workspace/project_/`. - Introduced `scratchpad/` for collaborative writes while keeping onboarding `agent_context` read-first. - Added `write_shared_note(...)` with advisory locking (`flock`) and strict filename/path validation. - Added append-only `activity_log.jsonl` via `append_activity_log(...)` for watchdog/event-driven coordination. - Maintains owner-only permissions (`0700` scratchpad dir, `0600` files) and audit trails for shared writes. 10. **Testing readiness upgrades** - Added automated tests for: - query reformulation + `can_answer` behavior in `search_context`, - large-file progressive disclosure behavior in `inspect_file`, - collaborative write path (`write_shared_note`) and append-only activity logging. - Test module: `backend/tests/test_agent_context_vfs.py`. - These tests provide a baseline regression harness for VFS retrieval quality and shared-memory safety. 11. **Static + Structural retrieval hardening** - Added a **static triage layer** in `search_context`: - keyword-density scoring, - `low_probability` flags for likely-noisy hits, - `triage_top5` shortlist for router-style pre-filtering. - Added `read_struct(filename, path_query)`: - resolves dot/bracket JSON paths to return node-level data only, - includes lightweight dependency injection (e.g., Step 4 persona reads include Step 2 brand voice context when available), - keeps output token-efficient for downstream agents.