Add Step 5 flat context and txtai file tools for agents

2026-03-11 10:42:05 +05:30
parent b410ece4ca
commit cbe41ef8c7
13 changed files with 1480 additions and 7 deletions
--- a/docs/SIF/SIF_AGENTS_TEAM_ARCHITECTURE.md
+++ b/docs/SIF/SIF_AGENTS_TEAM_ARCHITECTURE.md
@@ -189,3 +189,20 @@ All orchestration updates are emitted as typed records under a shared schema:
 *   **Inter-Agent Chat**: Allow agents to debate strategy (e.g., SEO Agent vs. Creative Agent).
 *   **Auto-Execution**: Allow agents to *perform* tasks (e.g., fix a broken link) with user approval.
 *   **Voice Interface**: Daily standup meeting via voice.
+
+
+## ⚡ Agent Fast-Context Layer (Onboarding Step 2)
+
+To reduce latency for repeated agent reads, the Step 2 website analysis is now persisted to a per-user flat file in the workspace:
+
+- `workspace/workspace_<safe_user_id>/agent_context/step2_website_analysis.json`
+
+**Read order for agents:**
+1. Flat-file context (agent-only, fastest)
+2. Relational database (`website_analyses`)
+3. SIF semantic index retrieval
+
+This preserves SIF intelligence workflows while giving agents deterministic, low-latency access to core onboarding context.
+The flat file also stores agent-optimized `quick_facts`, `retrieval_hints`, and full-fidelity raw payload blocks, so both fast inference and deep-dive reasoning are supported.
+
+Reference design docs: `docs/flat_file_context/STEP2_FLAT_FILE_CONTEXT_DESIGN.md`, `docs/flat_file_context/STEP3_FLAT_FILE_CONTEXT_DESIGN.md`, `docs/flat_file_context/STEP4_FLAT_FILE_CONTEXT_DESIGN.md`, `docs/flat_file_context/STEP5_FLAT_FILE_CONTEXT_DESIGN.md`, `docs/flat_file_context/FLAT_FILE_CONTEXT_FRAMEWORK_DESIGN.md`, `docs/flat_file_context/FLAT_FILE_CONTEXT_SECURITY_AND_ISOLATION.md`, and `docs/flat_file_context/FLAT_FILE_CONTEXT_PROGRESS_AND_QUICK_WINS.md`.
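Reviewer sketch (not part of the commit): the read order documented above could look roughly like this. The function name `load_step2_context` and the two loader callables are hypothetical stand-ins for the real DB and SIF readers, which this diff does not show. Only the path layout comes from the doc.

```python
import json
from pathlib import Path
from typing import Callable, Optional, Tuple

# Hypothetical loader signature: takes a safe user id, returns a dict or None.
Loader = Callable[[str], Optional[dict]]


def load_step2_context(
    workspace_root: Path,
    safe_user_id: str,
    load_from_db: Loader,
    load_from_sif: Loader,
) -> Tuple[str, Optional[dict]]:
    """Return (source, context), trying flat file, then DB, then SIF."""
    path = (
        workspace_root
        / f"workspace_{safe_user_id}"
        / "agent_context"
        / "step2_website_analysis.json"
    )
    try:
        doc = json.loads(path.read_text(encoding="utf-8"))
        if isinstance(doc, dict):
            return "flat_file", doc
    except (OSError, ValueError):
        pass  # missing or unparsable file: fall through to slower sources

    for source, loader in (("database", load_from_db), ("sif", load_from_sif)):
        doc = loader(safe_user_id)
        if doc is not None:
            return source, doc
    return "none", None
```

Returning the source label alongside the payload lets a caller log which tier answered, which is one way to confirm the fast path is actually being hit.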
--- a/docs/flat_file_context/FLAT_FILE_CONTEXT_ENHANCEMENTS_BACKLOG.md
+++ b/docs/flat_file_context/FLAT_FILE_CONTEXT_ENHANCEMENTS_BACKLOG.md
@@ -0,0 +1,69 @@
+# Flat File Context Enhancements Backlog
+
+This document tracks next-phase implementation items for the flat-file context framework.
+
+## 1) TTL/Refresh Hints + Freshness Policy
+### Objective
+Prevent stale agent decisions by adding explicit freshness semantics.
+
+### Proposed additions
+- Add `m.ttl_s` (seconds) and `m.stale_after` (timestamp) to context envelope.
+- Add `m.refresh_recommended` boolean.
+- Define per-context defaults (Step 2 likely long TTL, but still bounded).
+
+### Acceptance criteria
+- Reader utility can classify context as `fresh|stale|expired`.
+- Fallback to DB/SIF is triggered automatically when the staleness policy requires it.
+
+---
+
+## 2) Optional `.json.gz` Companion for Large Payloads
+### Objective
+Reduce disk footprint and IO for large context payloads.
+
+### Proposed additions
+- Write primary `.json` always.
+- If payload exceeds threshold (e.g., >256 KB), write `.json.gz` companion.
+- Add pointer metadata (`m.gz=true`, `m.gz_path`).
+
+### Acceptance criteria
+- Reader transparently supports JSON + GZIP variants.
+- No regression for small payloads.
+
+---
+
+## 3) Section Checksums for Drift Detection
+### Objective
+Detect inconsistencies between flat-file context and database state.
+
+### Proposed additions
+- Add checksums per section (`d.brand`, `d.seo`, `d.audience`, etc.) under `m.chk`.
+- Persist DB-row reference (`m.db_ref`) with latest row id/timestamp.
+- Add `verify_drift()` utility.
+
+### Acceptance criteria
+- Drift check can flag `in_sync|partial_drift|out_of_sync`.
+- On drift, reader suggests refresh + fallback path.
+
+---
+
+## 4) Extend Pattern to Step 3 and Step 4
+### Objective
+Standardize agent context retrieval across onboarding steps.
+
+### Proposed additions
+- `step3_research_context.json`
+- `step4_persona_context.json`
+- Shared envelope with step-specific `d/s` contracts.
+
+### Acceptance criteria
+- Same fallback chain works for step-specific readers.
+- SIF agents can consume a common interface across Steps 2, 3, and 4.
+
+---
+
+## Suggested implementation order
+1. TTL/freshness
+2. Checksums/drift detection
+3. Step 3/4 expansion
+4. Optional gzip optimization
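Reviewer sketch (not part of the commit): backlog item 1 asks for a reader utility that classifies a context as `fresh|stale|expired`. A minimal version follows. It assumes `m.stale_after` is a soft limit and `ts + m.ttl_s` is the hard limit; the backlog names the fields but does not define how they interact, so that split is a guess. The function name `classify_freshness` is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def _parse_utc(value: str) -> datetime:
    """Parse an ISO 8601 string; treat naive timestamps as UTC."""
    parsed = datetime.fromisoformat(value)
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def classify_freshness(envelope: dict, now: Optional[datetime] = None) -> str:
    """Return 'fresh', 'stale', or 'expired' for a context envelope."""
    now = now or datetime.now(timezone.utc)
    meta = envelope.get("m", {})
    written_at = _parse_utc(envelope["ts"])

    # Assumed hard limit: the document is unusable once its TTL has elapsed.
    ttl_s = meta.get("ttl_s")
    if ttl_s is not None and now >= written_at + timedelta(seconds=ttl_s):
        return "expired"

    # Assumed soft limit: usable, but a refresh or fallback is advisable.
    stale_after = meta.get("stale_after")
    if stale_after is not None and now >= _parse_utc(stale_after):
        return "stale"
    return "fresh"
```

Note that `datetime.fromisoformat` only accepts a trailing `Z` from Python 3.11 onward, so writers targeting older interpreters would need to emit an explicit `+00:00` offset.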
--- a/docs/flat_file_context/FLAT_FILE_CONTEXT_FRAMEWORK_DESIGN.md
+++ b/docs/flat_file_context/FLAT_FILE_CONTEXT_FRAMEWORK_DESIGN.md
@@ -0,0 +1,140 @@
+# Flat File Context Framework Design (Agent-Optimized)
+
+## Purpose
+Design a **compact, machine-first flat-file framework** for ALwrity AI agents.
+
+This framework is optimized for:
+- deterministic structure,
+- minimal token footprint,
+- fast parsing,
+- high-signal retrieval,
+- robust fallback behavior.
+
+## Core Principles
+1. **Agent-first, not human-first**
+   - Keys are short and stable.
+   - Avoid verbose prose in payloads.
+   - Include only fields needed for reasoning and tool actions.
+
+2. **Compact + predictable schema**
+   - Fixed top-level keys in strict order.
+   - Canonical value types (no shape drift).
+   - Avoid polymorphic fields when possible.
+
+3. **Dual-layer context**
+   - `d` (full normalized data for deep reasoning).
+   - `s` (summary/high-signal fast path for most agent reads).
+
+4. **Fallback-safe design**
+   - Every context doc includes source + freshness metadata.
+   - If the file is missing or stale, consumers fall back to the DB, then to SIF semantic retrieval.
+
+5. **Multi-tenant isolation**
+   - Per-user file under `workspace/workspace_<safe_user_id>/agent_context/`.
+
+---
+
+## Canonical Context Envelope (compact)
+```json
+{
+  "v": "1.0",
+  "t": "onboarding.step2.website_analysis",
+  "u": "<user_id>",
+  "ts": "<iso8601>",
+  "src": "onboarding_step2",
+  "d": {},
+  "s": {},
+  "m": {
+    "db": 0,
+    "sb": 0,
+    "q": []
+  }
+}
+```
+
+### Field map
+- `v`: schema version
+- `t`: context type
+- `u`: user id
+- `ts`: updated timestamp
+- `src`: source writer
+- `d`: canonical normalized data
+- `s`: high-signal summary for quick agent use
+- `m`: meta (`db`=data bytes, `sb`=summary bytes, `q`=query hints)
+
+---
+
+## Agent Readability Best Practices
+- Prefer enums/controlled vocab over free text.
+- Use compact keys and arrays for repetitive entities.
+- Truncate long textual blobs unless explicitly required.
+- Keep “quick facts” flattened.
+- Separate operational metadata from semantic content.
+- Include retrieval hints (`q`) for consistent query drafting.
+
+---
+
+## Write Pipeline Pattern
+1. Normalize incoming source payload.
+2. Derive compact summary (`s`) from normalized data.
+3. Compute lightweight metadata (`m`).
+4. Write the JSON file atomically.
+5. Emit writer version + timestamp.
+
+## Read Pipeline Pattern
+1. Attempt flat-file load.
+2. Validate the minimum envelope fields (`v`, `t`, `u`, `ts`, `d`).
+3. Prefer `s` for quick tasks; use `d` for deeper reasoning.
+4. If invalid, missing, or stale: fall back to DB, then SIF semantic retrieval.
+
+---
+
+## Scope Expansion Pattern
+Apply the same envelope for:
+- Step 2: website analysis
+- Step 3: research preferences + competitor snapshots
+- Step 4: persona profile + platform personas
+
+Only `t`, `d`, and `s` payload contracts should vary.
+
+---
+
+## Governance
+- Schema changes require version bump (`v`).
+- Backward compatibility policy: readers support N and N-1.
+- Drift checks should compare the canonical hash/checksum against the latest DB row.
+
+
+## Document Context + End-User Journey Metadata
+Each context file should carry explicit machine-oriented document metadata so agents understand *what this file is* before reading full payloads.
+
+Suggested `document_context` fields:
+- `audience`: `ai_agents`
+- `purpose`: `fast_context_retrieval`
+- `context_type`: step-scoped type identifier
+- `journey`: stage/action/agent expectation
+- `retrieval_contract`: preferred source + fallback order
+- `context_window_guidance`: byte budget and summary-first policy
+
+This block is intentionally compact and deterministic to reduce wasted token usage for agent planning.
+
+## Context Window and Length Policy
+- Keep combined `data + summary` under a defined byte budget where practical.
+- Enforce summary-first reads in agent consumers.
+- Truncate long textual fields in summaries; keep full text only in `data` when needed.
+- Flag oversize docs in metadata so readers can skip low-priority sections.
+- Prefer short, stable keys in machine envelopes and avoid natural-language verbosity.
+
+
+## Implemented baseline controls
+- Atomic file writes to avoid partial documents.
+- Best-effort restricted file permissions (`0600`).
+- Recursive sensitive-key redaction for payload snapshots.
+- Payload size budget enforcement with deterministic trimming metadata.
+- Internal document linking via `related_documents` and manifest index.
+
+
+Security and isolation details: `docs/flat_file_context/FLAT_FILE_CONTEXT_SECURITY_AND_ISOLATION.md`
+
+
+Step docs: `docs/flat_file_context/STEP2_FLAT_FILE_CONTEXT_DESIGN.md`, `docs/flat_file_context/STEP3_FLAT_FILE_CONTEXT_DESIGN.md`, `docs/flat_file_context/STEP4_FLAT_FILE_CONTEXT_DESIGN.md`, `docs/flat_file_context/STEP5_FLAT_FILE_CONTEXT_DESIGN.md`
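Reviewer sketch (not part of the commit): the canonical envelope and the minimum-field validation from the framework doc can be illustrated as follows. The helper names `build_envelope` and `missing_fields` are hypothetical. Measuring `db`/`sb` as the UTF-8 length of compact JSON is an assumption, since the doc only says they are data bytes and summary bytes.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, List

# Minimum fields the read pipeline validates, per the framework doc.
REQUIRED_FIELDS = ("v", "t", "u", "ts", "d")


def _byte_len(value: Any) -> int:
    """Size of a value as compact UTF-8 JSON (an assumed measurement)."""
    text = json.dumps(value, separators=(",", ":"), ensure_ascii=False)
    return len(text.encode("utf-8"))


def build_envelope(
    context_type: str,
    user_id: str,
    source: str,
    data: Dict[str, Any],
    summary: Dict[str, Any],
    query_hints: Iterable[str],
) -> Dict[str, Any]:
    """Assemble the envelope with top-level keys in the documented order."""
    return {
        "v": "1.0",
        "t": context_type,
        "u": user_id,
        "ts": datetime.now(timezone.utc).isoformat(),
        "src": source,
        "d": data,
        "s": summary,
        "m": {"db": _byte_len(data), "sb": _byte_len(summary), "q": list(query_hints)},
    }


def missing_fields(envelope: Dict[str, Any]) -> List[str]:
    """Return required fields absent from the envelope (empty means valid)."""
    return [key for key in REQUIRED_FIELDS if key not in envelope]
```

A reader that gets a non-empty list back from `missing_fields` would treat the file as invalid and move to the DB tier.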
--- a/docs/flat_file_context/FLAT_FILE_CONTEXT_PROGRESS_AND_QUICK_WINS.md
+++ b/docs/flat_file_context/FLAT_FILE_CONTEXT_PROGRESS_AND_QUICK_WINS.md
@@ -0,0 +1,26 @@
+# Flat File Context Progress Review and Quick Wins
+
+## Progress so far
+- Step 2 context: implemented (website analysis fast path + fallback).
+- Step 3 context: implemented (research preferences + competitors fast path + fallback).
+- Step 4 context: implemented (persona data fast path + fallback).
+- Step 5 context: implemented (integrations fast path + fallback).
+- Security baseline: user isolation checks, redaction, atomic writes, file-permission hardening.
+- Size governance: payload budget + deterministic trimming + trim metadata.
+- Internal linking: related-document links + manifest index.
+
+## Quick-win improvements (next 1-2 sprints)
+1. Add explicit TTL/staleness fields and auto-refresh hints per step.
+2. Add lightweight checksums per section to detect DB drift quickly.
+3. Add optional `.json.gz` companion for oversized archives.
+4. Add shared reader utility for summary-first + selective field loading.
+5. Add minimal unit tests for:
+   - redaction
+   - trimming behavior
+   - manifest linking
+   - cross-user load rejection
+6. Add agent telemetry: record which sections are actually read to optimize summaries.
+
+
+## Newly added agent tooling
+- txtai agent tools for flat-file context operations (manifest, read, write-note) were added to the SIF base agent to support file operations in agent workflows.
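Reviewer sketch (not part of the commit): quick win 2 (section checksums for DB drift) might be prototyped as below. The `verify_drift` name and the `in_sync|partial_drift|out_of_sync` states come from the enhancements backlog. The signature, the truncated SHA-256 digest, and the rule that `out_of_sync` means every section differs are assumptions.

```python
import hashlib
import json
from typing import Any, Dict


def section_checksums(data: Dict[str, Any]) -> Dict[str, str]:
    """Checksum each top-level section using key-sorted compact JSON."""
    checksums = {}
    for name, section in data.items():
        canonical = json.dumps(section, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        checksums[name] = digest[:16]  # shortened to keep metadata compact
    return checksums


def verify_drift(flat_checksums: Dict[str, str], db_data: Dict[str, Any]) -> str:
    """Compare stored checksums with checksums of the latest DB data."""
    db_checksums = section_checksums(db_data)
    names = set(flat_checksums) | set(db_checksums)
    changed = {n for n in names if flat_checksums.get(n) != db_checksums.get(n)}
    if not changed:
        return "in_sync"
    return "out_of_sync" if changed == names else "partial_drift"
```

Sorting keys before hashing keeps the checksum stable when the same section is serialized with a different key order.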
--- a/docs/flat_file_context/FLAT_FILE_CONTEXT_SECURITY_AND_ISOLATION.md
+++ b/docs/flat_file_context/FLAT_FILE_CONTEXT_SECURITY_AND_ISOLATION.md
@@ -0,0 +1,39 @@
+# Flat File Context Security, Isolation, and Size Controls
+
+## Objective
+Provide minimal but practical security for agent flat-file context with strong end-user isolation and bounded document growth.
+
+## Isolation model
+- Per-user namespace: `workspace/workspace_<safe_user_id>/agent_context/`
+- Sanitized user IDs only (`[a-zA-Z0-9_-]`) to prevent path traversal.
+- Reader-side user check: loaded document `user_id` must match requesting user context.
+
+## Minimal security controls implemented
+1. **Atomic writes**
+   - Context files are written via temporary file + `os.replace`.
+   - Prevents partial/corrupt files under concurrent writes.
+2. **File permissions**
+   - Context file permissions are set to `0600` on a best-effort basis.
+3. **Sensitive key redaction**
+   - Recursive redaction for key patterns like `api_key`, `token`, `secret`, `password`, `authorization`, `cookie`.
+4. **Manifest index**
+   - `context_manifest.json` gives agents a controlled map of available docs and relationships.
+
+## Size and context-window controls
+- Byte budget for raw document payloads (`DEFAULT_MAX_BYTES`).
+- If oversize, low-priority/heavy sections are trimmed first (`raw_*`, large snapshots, heavy arrays).
+- Trim metadata is preserved under `meta.trim` for traceability.
+- Agent policy remains summary-first (`agent_summary` before `data`).
+
+## Internal document linking
+- Each context file includes `document_context.related_documents`.
+- Manifest includes per-document `related_documents` links.
+- This enables agents to:
+  1. read one document,
+  2. discover related context files,
+  3. fetch only relevant next documents.
+
+## Recommended next steps
+- Add optional file-level signatures/HMAC for tamper evidence.
+- Add checksum per section to detect DB drift.
+- Add staleness policy (`ttl_s`, `stale_after`) and auto-refresh triggers.
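Reviewer sketch (not part of the commit): the isolation and write controls listed above (sanitized user IDs, recursive redaction, temp file plus `os.replace`, best-effort `0600`) could be combined roughly as follows. All function names are hypothetical, as are the `_` replacement character and the `[REDACTED]` marker. Only the allowed character set, the key patterns, and the use of `os.replace` come from the doc.

```python
import json
import os
import re
import tempfile
from pathlib import Path
from typing import Any

# Key patterns named in the security doc; substring matching is an assumption.
SENSITIVE_KEY = re.compile(
    r"api_key|token|secret|password|authorization|cookie", re.IGNORECASE
)


def sanitize_user_id(user_id: str) -> str:
    """Replace anything outside [a-zA-Z0-9_-] so the id is path-safe."""
    return re.sub(r"[^a-zA-Z0-9_-]", "_", user_id)


def redact(value: Any) -> Any:
    """Recursively mask values whose keys look sensitive."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if SENSITIVE_KEY.search(str(key)) else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def write_context_atomic(path: Path, document: dict) -> None:
    """Write redacted JSON to a temp file, then swap it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(redact(document), handle, separators=(",", ":"))
            handle.flush()
            os.fsync(handle.fileno())
        try:
            os.chmod(tmp_name, 0o600)  # best effort; may be a no-op on some platforms
        except OSError:
            pass
        os.replace(tmp_name, path)  # readers see the old file or the new one, never a partial
    except Exception:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Creating the temp file in the destination directory keeps the final `os.replace` on a single filesystem, which is what makes the swap atomic.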
--- a/docs/flat_file_context/STEP2_FLAT_FILE_CONTEXT_DESIGN.md
+++ b/docs/flat_file_context/STEP2_FLAT_FILE_CONTEXT_DESIGN.md
@@ -0,0 +1,54 @@
+# Step 2 Flat File Context Design (Website Analysis)
+
+## Intent
+Step 2 context must be optimized for **AI-agent retrieval speed and token efficiency**, not human readability.
+
+## Current storage location
+- `workspace/workspace_<safe_user_id>/agent_context/step2_website_analysis.json`
+
+## Current retrieval chain
+1. Flat file (fastest)
+2. DB (`website_analyses`)
+3. SIF semantic fallback
+
+## Compactness strategy
+For implementation, keep two logical layers:
+- **`d` equivalent (full canonical data)** for deep reasoning.
+- **`s` equivalent (high-signal summary)** for fast agent prompts and most decisions.
+- **`document_context`** for machine-readable orientation (purpose, journey stage, fallback contract, context-window guidance).
+
+Agents should default to summary-first reads and only open full data when needed.
+
+## Step 2 coverage requirements
+The Step 2 context should preserve these semantic groups:
+- identity/state: website url, timestamps, status/error/warning
+- brand/style: writing style, style patterns/guidelines, brand analysis
+- audience/content: target audience, content type, recommended settings, characteristics
+- strategy/seo: strategy insights, SEO audit, strategic history
+- crawl/discovery: crawl output, meta info, sitemap analysis
+- traceability: raw inbound payload snapshots
+
+## Agent-readability best practices
+- Keep keys stable and deterministic.
+- Prefer arrays/enums over long free text.
+- Keep summary fields flattened and high signal.
+- Avoid duplicate verbose nested structures unless required for correctness.
+- Include retrieval hints for consistent downstream querying.
+
+## Practical guidance for consumers
+- Use summary/high-signal fields first for routing and lightweight reasoning.
+- Pull deep fields only for specialist tasks (SEO, persona fidelity, editorial style checks).
+- If the flat file is missing or stale, fall back automatically to DB, then SIF.
+
+## Note
+A generalized compact framework is documented in:
+- `docs/flat_file_context/FLAT_FILE_CONTEXT_FRAMEWORK_DESIGN.md`
+
+Future enhancements are tracked in:
+- `docs/flat_file_context/FLAT_FILE_CONTEXT_ENHANCEMENTS_BACKLOG.md`
+
+
+## Context window guidance
+- Keep summary compact and deterministic.
+- Add byte-size metadata to help agents decide whether to expand into full data.
+- Prefer short keys and avoid verbose natural language in machine envelopes.
--- a/docs/flat_file_context/STEP3_FLAT_FILE_CONTEXT_DESIGN.md
+++ b/docs/flat_file_context/STEP3_FLAT_FILE_CONTEXT_DESIGN.md
@@ -0,0 +1,39 @@
+# Step 3 Flat File Context Design (Research Preferences + Competitors)
+
+## Intent
+Provide agent-ready Step 3 context with compact summaries for routing plus full payload for deep analysis.
+
+## Storage location
+- `workspace/workspace_<safe_user_id>/agent_context/step3_research_preferences.json`
+
+## Why this matters for agents
+Step 3 is the bridge from website understanding (Step 2) to competitive strategy and research execution. Agents need this file to understand:
+- depth and quality preference constraints,
+- factuality constraints,
+- content-type priorities,
+- competitor landscape and industry context.
+
+## Document-context block
+Every context file should include machine-readable document metadata to orient agents quickly:
+- audience (`ai_agents`)
+- purpose (`fast_context_retrieval`)
+- journey stage (`onboarding_step_3`)
+- retrieval contract and fallback order
+- context-window guidance (size budget + summary-first policy)
+
+## Minimal Step 3 data groups
+- research config: depth/content types/auto/factual
+- inherited style profile (if present): writing style, target audience, recommended settings
+- competitors: domain/url/title/relevance highlights
+- industry context: compact market framing text
+- traceability: source payload and timestamps
+
+## Agent usage policy
+1. Start with `agent_summary.quick_facts` and `retrieval_hints`.
+2. Use competitor summary before opening full competitor objects.
+3. Read full `data` only for tasks requiring strict evidence/fields.
+4. Fall back to DB, then SIF semantic if missing or stale.
+
+
+## Related-document navigation
+Agents can consult `context_manifest.json` to discover linked context files and traverse only the required documents for the task.
--- a/docs/flat_file_context/STEP4_FLAT_FILE_CONTEXT_DESIGN.md
+++ b/docs/flat_file_context/STEP4_FLAT_FILE_CONTEXT_DESIGN.md
@@ -0,0 +1,25 @@
+# Step 4 Flat File Context Design (Persona Data)
+
+## Intent
+Capture onboarding Step 4 persona outputs in an agent-first flat file so agents can quickly personalize strategy, content, and platform execution.
+
+## Storage location
+- `workspace/workspace_<safe_user_id>/agent_context/step4_persona_data.json`
+
+## Required Step 4 coverage
+- core persona profile (`core_persona`)
+- platform personas (`platform_personas`)
+- quality metrics (`quality_metrics`)
+- selected platforms (`selected_platforms`)
+- research persona/notes when available
+- source payload + timestamps for traceability
+
+## Agent summary expectations
+- quick facts: selected platform count, persona availability flags
+- retrieval hints: persona/profile adaptation queries
+- persona focus: compact actionable slice of core persona + quality constraints
+
+## Usage policy
+1. Start with `agent_summary`.
+2. Expand into `data` only when a task needs full fidelity.
+3. Use `document_context.related_documents` to fetch upstream Step 2/Step 3 context as needed.
--- a/docs/flat_file_context/STEP5_FLAT_FILE_CONTEXT_DESIGN.md
+++ b/docs/flat_file_context/STEP5_FLAT_FILE_CONTEXT_DESIGN.md
@@ -0,0 +1,22 @@
+# Step 5 Flat File Context Design (Integrations)
+
+## Intent
+Capture onboarding Step 5 integration configuration in a compact agent-readable context so agents can reason about connected services and execution constraints.
+
+## Storage location
+- `workspace/workspace_<safe_user_id>/agent_context/step5_integrations.json`
+
+## Required Step 5 coverage
+- integration map (`integrations`)
+- provider list (`providers`)
+- connected account references (`connected_accounts`)
+- integration status and notes
+- source payload and timestamps
+
+## Agent summary expectations
+- connected integration count/list
+- provider count
+- retrieval hints for integration readiness checks
+
+## Linked traversal
+Use `document_context.related_documents` and `context_manifest.json` to navigate Step 2/3/4 upstream dependencies when deciding tool execution paths.
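Reviewer sketch (not part of the commit): linked traversal through `context_manifest.json` could be planned with a bounded breadth-first walk. The manifest layout used here (a `documents` mapping whose entries carry `related_documents` name lists) is hypothetical, because the diff does not show the manifest schema. `plan_reads` and `max_docs` are illustrative names.

```python
from collections import deque
from typing import Any, Dict, List


def plan_reads(manifest: Dict[str, Any], start: str, max_docs: int = 4) -> List[str]:
    """Return document names to read, nearest first, capped at max_docs."""
    documents = manifest.get("documents", {})
    order: List[str] = []
    seen = {start}
    queue = deque([start])
    while queue and len(order) < max_docs:
        name = queue.popleft()
        if name not in documents:
            continue  # skip links to documents the manifest does not list
        order.append(name)
        for related in documents[name].get("related_documents", []):
            if related not in seen:
                seen.add(related)
                queue.append(related)
    return order
```

Capping the walk keeps an agent from loading every upstream step when a task only needs the nearest one or two documents.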