Add Step 5 flat context and txtai file tools for agents

ي
2026-03-11 10:42:05 +05:30
parent b410ece4ca
commit cbe41ef8c7
13 changed files with 1480 additions and 7 deletions

@@ -189,3 +189,20 @@ All orchestration updates are emitted as typed records under a shared schema:
* **Inter-Agent Chat**: Allow agents to debate strategy (e.g., SEO Agent vs. Creative Agent).
* **Auto-Execution**: Allow agents to *perform* tasks (e.g., fix a broken link) with user approval.
* **Voice Interface**: Daily standup meeting via voice.
## ⚡ Agent Fast-Context Layer (Onboarding Step 2)
To reduce latency for repetitive agent reads, Step 2 website analysis is now persisted to a per-user flat file in the workspace:
- `workspace/workspace_<safe_user_id>/agent_context/step2_website_analysis.json`
**Read order for agents:**
1. Flat-file context (agent-only, fastest)
2. Relational database (`website_analyses`)
3. SIF semantic index retrieval
This preserves SIF intelligence workflows while giving agents deterministic, low-latency access to core onboarding context.
It also stores agent-optimized `quick_facts`, `retrieval_hints`, and full-fidelity raw payload blocks so both fast inference and deep-dive reasoning are supported.
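
The read order above can be sketched as a small fallback chain. This is a minimal illustration, not the project's reader: `read_step2_context`, `load_from_db`, and `load_from_sif` are hypothetical names, and the DB and SIF loaders are passed in as plain callables so the sketch stays self-contained.

```python
import json
from pathlib import Path
from typing import Callable, Optional, Tuple


def load_flat_file(path: Path) -> Optional[dict]:
    """Return the parsed context document, or None if it is missing or unreadable."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None


def read_step2_context(
    flat_path: Path,
    load_from_db: Callable[[], Optional[dict]],   # hypothetical DB loader
    load_from_sif: Callable[[], Optional[dict]],  # hypothetical SIF loader
) -> Tuple[Optional[dict], str]:
    """Try each source in the documented order and report which one answered."""
    sources = [
        ("flat_file", lambda: load_flat_file(flat_path)),
        ("database", load_from_db),
        ("sif", load_from_sif),
    ]
    for name, loader in sources:
        doc = loader()
        if doc is not None:
            return doc, name
    return None, "none"
```

Returning the source name alongside the document lets a caller log how often the fast path is actually hit.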
Reference design docs: `docs/flat_file_context/STEP2_FLAT_FILE_CONTEXT_DESIGN.md`, `docs/flat_file_context/STEP3_FLAT_FILE_CONTEXT_DESIGN.md`, `docs/flat_file_context/STEP4_FLAT_FILE_CONTEXT_DESIGN.md`, `docs/flat_file_context/STEP5_FLAT_FILE_CONTEXT_DESIGN.md`, `docs/flat_file_context/FLAT_FILE_CONTEXT_FRAMEWORK_DESIGN.md`, `docs/flat_file_context/FLAT_FILE_CONTEXT_SECURITY_AND_ISOLATION.md`, and `docs/flat_file_context/FLAT_FILE_CONTEXT_PROGRESS_AND_QUICK_WINS.md`.

@@ -0,0 +1,69 @@
# Flat File Context Enhancements Backlog
This document tracks next-phase implementation items for the flat-file context framework.
## 1) TTL/Refresh Hints + Freshness Policy
### Objective
Prevent stale agent decisions by adding explicit freshness semantics.
### Proposed additions
- Add `m.ttl_s` (seconds) and `m.stale_after` (timestamp) to context envelope.
- Add `m.refresh_recommended` boolean.
- Define per-context defaults (Step 2 likely long TTL, but still bounded).
### Acceptance criteria
- Reader utility can classify context as `fresh|stale|expired`.
- Fallback to DB/SIF is triggered automatically when the staleness policy requires it.
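
One way the `fresh|stale|expired` classification could work is sketched below. The backlog does not define how `m.ttl_s` and `m.stale_after` interact, so this sketch assumes a document is stale once `stale_after` has passed and expired once `ts + ttl_s` has passed; `classify_freshness` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def classify_freshness(doc: dict, now: datetime) -> str:
    """Classify a context envelope as 'fresh', 'stale', or 'expired'.

    Assumes ISO 8601 timestamps with an explicit UTC offset in `ts` and
    `m.stale_after`, and a TTL in seconds in `m.ttl_s`.
    """
    meta = doc["m"]
    written_at = datetime.fromisoformat(doc["ts"])
    stale_after = datetime.fromisoformat(meta["stale_after"])
    expires_at = written_at + timedelta(seconds=meta["ttl_s"])

    if now >= expires_at:
        return "expired"
    if now >= stale_after:
        return "stale"
    return "fresh"
```

Under these assumptions a reader might still serve a `stale` document while recommending a refresh, and fall back to DB/SIF only for `expired`.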
---
## 2) Optional `.json.gz` Companion for Large Payloads
### Objective
Reduce disk footprint and IO for large context payloads.
### Proposed additions
- Write primary `.json` always.
- If payload exceeds threshold (e.g., >256 KB), write `.json.gz` companion.
- Add pointer metadata (`m.gz=true`, `m.gz_path`).
### Acceptance criteria
- Reader transparently supports JSON + GZIP variants.
- No regression for small payloads.
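
A rough sketch of the companion-file idea follows. The function names are hypothetical, the 256 KB threshold is only the example value mentioned above, and the `m.gz` / `m.gz_path` pointer metadata is left out to keep the sketch short.

```python
import gzip
import json
from pathlib import Path

GZ_THRESHOLD_BYTES = 256 * 1024  # example threshold from the proposal above


def write_with_optional_gz(path: Path, doc: dict, threshold: int = GZ_THRESHOLD_BYTES) -> bool:
    """Always write the primary .json; add a .json.gz companion if it is large.

    Returns True when a companion file was written.
    """
    raw = json.dumps(doc, separators=(",", ":")).encode("utf-8")
    path.write_bytes(raw)
    if len(raw) <= threshold:
        return False
    gz_path = path.with_name(path.name + ".gz")
    with gzip.open(gz_path, "wb") as fh:
        fh.write(raw)
    return True


def read_context(path: Path) -> dict:
    """Read either variant, choosing the opener from the file suffix."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)
```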
---
## 3) Section Checksums for Drift Detection
### Objective
Detect inconsistencies between flat-file context and database state.
### Proposed additions
- Add checksums per section (`d.brand`, `d.seo`, `d.audience`, etc.) under `m.chk`.
- Persist DB-row reference (`m.db_ref`) with latest row id/timestamp.
- Add `verify_drift()` utility.
### Acceptance criteria
- Drift check can flag `in_sync|partial_drift|out_of_sync`.
- On drift, reader suggests refresh + fallback path.
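
The drift check could look roughly like this. It is a sketch under assumptions: checksums are SHA-256 over canonical (sorted-key) JSON, `m.chk` maps section names to hex digests, and the caller supplies the current DB sections as a plain dict. Only the `verify_drift()` name and the three result labels come from the proposal above.

```python
import hashlib
import json


def section_checksum(section) -> str:
    """Hash a section's canonical JSON form so key order does not matter."""
    canonical = json.dumps(section, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_checksums(data: dict) -> dict:
    """Compute one checksum per top-level section (brand, seo, audience, ...)."""
    return {name: section_checksum(value) for name, value in data.items()}


def verify_drift(doc: dict, db_data: dict) -> str:
    """Compare stored section checksums (m.chk) against the current DB sections."""
    stored = doc.get("m", {}).get("chk", {})
    current = build_checksums(db_data)
    names = set(stored) | set(current)
    drifted = [n for n in names if stored.get(n) != current.get(n)]
    if not drifted:
        return "in_sync"
    if len(drifted) == len(names):
        return "out_of_sync"
    return "partial_drift"
```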
---
## 4) Extend Pattern to Step 3 and Step 4
### Objective
Standardize agent context retrieval across onboarding steps.
### Proposed additions
- `step3_research_context.json`
- `step4_persona_context.json`
- Shared envelope with step-specific `d/s` contracts.
### Acceptance criteria
- Same fallback chain works for step-specific readers.
- SIF agents can consume common interface across Step 2/3/4.
---
## Suggested implementation order
1. TTL/freshness
2. Checksums/drift detection
3. Step 3/4 expansion
4. Optional gzip optimization

@@ -0,0 +1,140 @@
# Flat File Context Framework Design (Agent-Optimized)
## Purpose
Design a **compact, machine-first flat-file framework** for ALwrity AI agents.
This framework is optimized for:
- deterministic structure,
- minimal token footprint,
- fast parsing,
- high-signal retrieval,
- robust fallback behavior.
## Core Principles
1. **Agent-first, not human-first**
   - Keys are short and stable.
   - Avoid verbose prose in payloads.
   - Include only fields needed for reasoning and tool actions.
2. **Compact + predictable schema**
   - Fixed top-level keys in strict order.
   - Canonical value types (no shape drift).
   - Avoid polymorphic fields when possible.
3. **Dual-layer context**
   - `d` (full normalized data for deep reasoning).
   - `s` (summary/high-signal fast path for most agent reads).
4. **Fallback-safe design**
   - Every context doc includes source + freshness metadata.
   - If the file is missing or stale, consumers fall back to DB, then SIF semantic retrieval.
5. **Multi-tenant isolation**
   - Per-user file under `workspace/workspace_<safe_user_id>/agent_context/`.
---
## Canonical Context Envelope (compact)
```json
{
"v": "1.0",
"t": "onboarding.step2.website_analysis",
"u": "<user_id>",
"ts": "<iso8601>",
"src": "onboarding_step2",
"d": {},
"s": {},
"m": {
"db": 0,
"sb": 0,
"q": []
}
}
```
### Field map
- `v`: schema version
- `t`: context type
- `u`: user id
- `ts`: updated timestamp
- `src`: source writer
- `d`: canonical normalized data
- `s`: high-signal summary for quick agent use
- `m`: meta (`db`=data bytes, `sb`=summary bytes, `q`=query hints)
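
As an illustration of the field map, the envelope could be assembled as below. `build_envelope` is a hypothetical helper, and measuring `db`/`sb` as the UTF-8 length of compact JSON is an assumption; the design only says they are byte counts.

```python
import json
from datetime import datetime, timezone


def _json_size(obj) -> int:
    """Byte length of the compact JSON encoding (assumed size measure)."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


def build_envelope(user_id: str, data: dict, summary: dict, hints: list) -> dict:
    """Build a Step 2 envelope with top-level keys in the canonical order."""
    return {
        "v": "1.0",
        "t": "onboarding.step2.website_analysis",
        "u": user_id,
        "ts": datetime.now(timezone.utc).isoformat(),
        "src": "onboarding_step2",
        "d": data,
        "s": summary,
        "m": {"db": _json_size(data), "sb": _json_size(summary), "q": list(hints)},
    }
```

Python dicts preserve insertion order, so serializing this dict keeps the fixed key order the design calls for.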
---
## Agent Readability Best Practices
- Prefer enums/controlled vocab over free text.
- Use compact keys and arrays for repetitive entities.
- Truncate long textual blobs unless explicitly required.
- Keep “quick facts” flattened.
- Separate operational metadata from semantic content.
- Include retrieval hints (`q`) for consistent query drafting.
---
## Write Pipeline Pattern
1. Normalize incoming source payload.
2. Derive compact summary (`s`) from normalized data.
3. Compute lightweight metadata (`m`).
4. Write the JSON file atomically.
5. Emit writer version + timestamp.
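
Step 4 of the write pipeline can be sketched with a temporary file plus `os.replace`, which is the mechanism the security notes in this change set describe. `atomic_write_json` is a hypothetical name, and the `fsync` call is an extra durability step not mentioned in the docs.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, doc: dict) -> None:
    """Write `doc` to `path` so readers never observe a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the target directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(path.parent), prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(doc, fh, separators=(",", ":"))
            fh.flush()
            os.fsync(fh.fileno())
        try:
            os.chmod(tmp_name, 0o600)  # best-effort owner-only permissions
        except OSError:
            pass
        os.replace(tmp_name, str(path))  # atomic swap into place
    except Exception:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```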
## Read Pipeline Pattern
1. Attempt flat-file load.
2. Validate minimum envelope fields (`v,t,u,ts,d`).
3. Prefer `s` for quick tasks; use `d` for deeper reasoning.
4. If the file is invalid, missing, or stale: fall back to DB, then SIF semantic retrieval.
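
Steps 2 and 3 of the read pipeline might be sketched as follows. `validate_envelope` and `select_payload` are hypothetical names; the required field list is the one given in step 2.

```python
REQUIRED_FIELDS = ("v", "t", "u", "ts", "d")


def validate_envelope(doc) -> bool:
    """Check that the minimum envelope fields are present."""
    return isinstance(doc, dict) and all(key in doc for key in REQUIRED_FIELDS)


def select_payload(doc: dict, deep: bool = False) -> dict:
    """Return the summary `s` for quick tasks, or the full data `d` when deep=True.

    Raises ValueError for an invalid envelope so the caller can fall back
    to the DB and then to SIF semantic retrieval.
    """
    if not validate_envelope(doc):
        raise ValueError("invalid context envelope")
    if not deep and doc.get("s"):
        return doc["s"]
    return doc["d"]
```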
---
## Scope Expansion Pattern
Apply the same envelope to:
- Step 2: website analysis
- Step 3: research preferences + competitor snapshots
- Step 4: persona profile + platform personas
Only `t`, `d`, and `s` payload contracts should vary.
---
## Governance
- Schema changes require version bump (`v`).
- Backward compatibility policy: readers support N and N-1.
- Drift checks should compare canonical hash/checksum vs DB latest row.
## Document Context + End-User Journey Metadata
Each context file should carry explicit machine-oriented document metadata so agents understand *what this file is* before reading full payloads.
Suggested `document_context` fields:
- `audience`: `ai_agents`
- `purpose`: `fast_context_retrieval`
- `context_type`: step-scoped type identifier
- `journey`: stage/action/agent expectation
- `retrieval_contract`: preferred source + fallback order
- `context_window_guidance`: byte budget and summary-first policy
This block is intentionally compact and deterministic to reduce wasted token usage for agent planning.
## Context Window and Length Policy
- Keep combined `data + summary` under a defined byte budget where practical.
- Enforce summary-first reads in agent consumers.
- Truncate long textual fields in summaries; keep full text only in `data` when needed.
- Flag oversize docs in metadata so readers can skip low-priority sections.
- Prefer short, stable keys in machine envelopes and avoid natural-language verbosity.
## Implemented baseline controls
- Atomic file writes to avoid partial documents.
- Best-effort restricted file permissions (`0600`).
- Recursive sensitive-key redaction for payload snapshots.
- Payload size budget enforcement with deterministic trimming metadata.
- Internal document linking via `related_documents` and manifest index.
Security and isolation details: `docs/flat_file_context/FLAT_FILE_CONTEXT_SECURITY_AND_ISOLATION.md`
Step docs: `docs/flat_file_context/STEP2_FLAT_FILE_CONTEXT_DESIGN.md`, `docs/flat_file_context/STEP3_FLAT_FILE_CONTEXT_DESIGN.md`, `docs/flat_file_context/STEP4_FLAT_FILE_CONTEXT_DESIGN.md`, `docs/flat_file_context/STEP5_FLAT_FILE_CONTEXT_DESIGN.md`

@@ -0,0 +1,26 @@
# Flat File Context Progress Review and Quick Wins
## Progress so far
- Step 2 context: implemented (website analysis fast path + fallback).
- Step 3 context: implemented (research preferences + competitors fast path + fallback).
- Step 4 context: implemented (persona data fast path + fallback).
- Step 5 context: implemented (integrations fast path + fallback).
- Security baseline: user isolation checks, redaction, atomic writes, file-permission hardening.
- Size governance: payload budget + deterministic trimming + trim metadata.
- Internal linking: related-document links + manifest index.
## Quick-win improvements (next 1-2 sprints)
1. Add explicit TTL/staleness fields and auto-refresh hints per step.
2. Add lightweight checksums per section to detect DB drift quickly.
3. Add optional `.json.gz` companion for oversized archives.
4. Add shared reader utility for summary-first + selective field loading.
5. Add minimal unit tests for:
   - redaction
   - trimming behavior
   - manifest linking
   - cross-user load rejection
6. Add agent telemetry: record which sections are actually read to optimize summaries.
## Newly added agent tooling
- txtai agent tools for flat-file context manifest, read, and write-note operations were added to the SIF base agent to support file operations in agent workflows.

@@ -0,0 +1,39 @@
# Flat File Context Security, Isolation, and Size Controls
## Objective
Provide minimal but practical security for agent flat-file context with strong end-user isolation and bounded document growth.
## Isolation model
- Per-user namespace: `workspace/workspace_<safe_user_id>/agent_context/`
- Sanitized user IDs only (`[a-zA-Z0-9_-]`) to prevent path traversal.
- Reader-side user check: loaded document `user_id` must match requesting user context.
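
The isolation rules above could be enforced roughly as shown here. Both function names are hypothetical, and this sketch rejects unsafe IDs outright; the actual implementation may instead map them to a safe form.

```python
import re
from pathlib import Path

_SAFE_USER_ID = re.compile(r"[a-zA-Z0-9_-]+")


def context_dir(workspace_root: Path, user_id: str) -> Path:
    """Build the per-user namespace, refusing IDs outside the safe character set."""
    if not _SAFE_USER_ID.fullmatch(user_id):
        raise ValueError("unsafe user id")
    return workspace_root / f"workspace_{user_id}" / "agent_context"


def assert_owner(doc: dict, requesting_user_id: str) -> None:
    """Reader-side check: the loaded document must belong to the requester."""
    if doc.get("user_id") != requesting_user_id:
        raise PermissionError("context document belongs to a different user")
```

Because `.` and `/` are outside the allowed set, an ID such as `../other_user` can never reach the path join.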
## Minimal security controls implemented
1. **Atomic writes**
   - Context files are written via temporary file + `os.replace`.
   - Prevents partial/corrupt files under concurrent writes.
2. **File permissions**
   - Context files are best-effort set to `0600`.
3. **Sensitive key redaction**
   - Recursive redaction for key patterns like `api_key`, `token`, `secret`, `password`, `authorization`, `cookie`.
4. **Manifest index**
   - `context_manifest.json` gives agents a controlled map of available docs and relationships.
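
Recursive redaction (control 3) could be sketched like this. It assumes case-insensitive substring matching on key names and a `[REDACTED]` placeholder; neither detail is specified above, and `redact` is a hypothetical name.

```python
SENSITIVE_PATTERNS = ("api_key", "token", "secret", "password", "authorization", "cookie")
REDACTED = "[REDACTED]"  # placeholder value chosen for this sketch


def _is_sensitive(key) -> bool:
    """True if the key name contains any sensitive pattern (case-insensitive)."""
    lowered = str(key).lower()
    return any(pattern in lowered for pattern in SENSITIVE_PATTERNS)


def redact(value):
    """Return a copy of `value` with sensitive keys masked at every nesting level."""
    if isinstance(value, dict):
        return {
            key: REDACTED if _is_sensitive(key) else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value
```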
## Size and context-window controls
- Byte budget for raw document payloads (`DEFAULT_MAX_BYTES`).
- If oversize, low-priority/heavy sections are trimmed first (`raw_*`, large snapshots, heavy arrays).
- Trim metadata is preserved under `meta.trim` for traceability.
- Agent policy remains summary-first (`agent_summary` before `data`).
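
A simplified view of budget enforcement is sketched below. It only drops `raw_*` sections under `data`, largest first, and records them under `meta.trim`; the real trimming order, the shape of `meta.trim`, and the value of `DEFAULT_MAX_BYTES` are not given here, so the budget is passed in as a parameter and `trim_to_budget` is a hypothetical name.

```python
import json


def _json_size(obj) -> int:
    """Byte length of the compact JSON encoding (assumed size measure)."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


def trim_to_budget(doc: dict, max_bytes: int) -> dict:
    """Drop raw_* sections (largest first) until the document fits the budget."""
    out = dict(doc)
    data = dict(out.get("data", {}))
    out["data"] = data
    # Sort by size descending, then by name, so trimming is deterministic.
    candidates = sorted(
        (key for key in data if key.startswith("raw_")),
        key=lambda key: (-_json_size(data[key]), key),
    )
    dropped = []
    for key in candidates:
        if _json_size(out) <= max_bytes:
            break
        del data[key]
        dropped.append(key)
    if dropped:
        # The trim record itself is not counted against the budget in this sketch.
        meta = dict(out.get("meta", {}))
        meta["trim"] = {"dropped": dropped, "max_bytes": max_bytes}
        out["meta"] = meta
    return out
```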
## Internal document linking
- Each context file includes `document_context.related_documents`.
- Manifest includes per-document `related_documents` links.
- This enables agents to:
  1. read one document,
  2. discover related context files,
  3. fetch only relevant next documents.
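
The traversal above might reduce to a filter over `document_context.related_documents`. This sketch assumes that field is a list of context file names, which is not spelled out here; `next_documents` is a hypothetical helper.

```python
def next_documents(doc: dict, wanted: set) -> list:
    """Return the related context files that the current task actually needs.

    Assumes `document_context.related_documents` is a list of file names.
    """
    related = doc.get("document_context", {}).get("related_documents", [])
    return [name for name in related if name in wanted]
```

For example, an agent working on persona adaptation could pass only the Step 2 and Step 3 file names in `wanted` and skip everything else the document links to.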
## Recommended next steps
- Add optional file-level signatures/HMAC for tamper evidence.
- Add checksum per section to detect DB drift.
- Add staleness policy (`ttl_s`, `stale_after`) and auto-refresh triggers.

@@ -0,0 +1,54 @@
# Step 2 Flat File Context Design (Website Analysis)
## Intent
Step 2 context must be optimized for **AI-agent retrieval speed and token efficiency**, not human readability.
## Current storage location
- `workspace/workspace_<safe_user_id>/agent_context/step2_website_analysis.json`
## Current retrieval chain
1. Flat file (fastest)
2. DB (`website_analyses`)
3. SIF semantic fallback
## Compactness strategy
For implementation, keep two logical layers:
- **`d` equivalent (full canonical data)** for deep reasoning.
- **`s` equivalent (high-signal summary)** for fast agent prompts and most decisions.
- **`document_context`** for machine-readable orientation (purpose, journey stage, fallback contract, context-window guidance).
Agents should default to summary-first reads and only open full data when needed.
## Step 2 coverage requirements
The Step 2 context should preserve these semantic groups:
- identity/state: website url, timestamps, status/error/warning
- brand/style: writing style, style patterns/guidelines, brand analysis
- audience/content: target audience, content type, recommended settings, characteristics
- strategy/seo: strategy insights, SEO audit, strategic history
- crawl/discovery: crawl output, meta info, sitemap analysis
- traceability: raw inbound payload snapshots
## Agent-readability best practices
- Keep keys stable and deterministic.
- Prefer arrays/enums over long free text.
- Keep summary fields flattened and high signal.
- Avoid duplicate verbose nested structures unless required for correctness.
- Include retrieval hints for consistent downstream querying.
## Practical guidance for consumers
- Use summary/high-signal fields first for routing and lightweight reasoning.
- Pull deep fields only for specialist tasks (SEO, persona fidelity, editorial style checks).
- If the flat file is missing or stale, fall back automatically to DB, then SIF.
## Note
A generalized compact framework is documented in:
- `docs/flat_file_context/FLAT_FILE_CONTEXT_FRAMEWORK_DESIGN.md`
Future enhancements are tracked in:
- `docs/flat_file_context/FLAT_FILE_CONTEXT_ENHANCEMENTS_BACKLOG.md`
## Context window guidance
- Keep summary compact and deterministic.
- Add byte-size metadata to help agents decide whether to expand into full data.
- Prefer short keys and avoid verbose natural language in machine envelopes.

@@ -0,0 +1,39 @@
# Step 3 Flat File Context Design (Research Preferences + Competitors)
## Intent
Provide agent-ready Step 3 context with compact summaries for routing plus full payload for deep analysis.
## Storage location
- `workspace/workspace_<safe_user_id>/agent_context/step3_research_preferences.json`
## Why this matters for agents
Step 3 is the bridge from website understanding (Step 2) to competitive strategy and research execution. Agents need this file to understand:
- depth and quality preference constraints,
- factuality constraints,
- content-type priorities,
- competitor landscape and industry context.
## Document-context block
Every context file should include machine-readable document metadata to orient agents quickly:
- audience (`ai_agents`)
- purpose (`fast_context_retrieval`)
- journey stage (`onboarding_step_3`)
- retrieval contract and fallback order
- context-window guidance (size budget + summary-first policy)
## Minimal Step 3 data groups
- research config: depth/content types/auto/factual
- inherited style profile (if present): writing style, target audience, recommended settings
- competitors: domain/url/title/relevance highlights
- industry context: compact market framing text
- traceability: source payload and timestamps
## Agent usage policy
1. Start with `agent_summary.quick_facts` and `retrieval_hints`.
2. Use competitor summary before opening full competitor objects.
3. Read full `data` only for tasks requiring strict evidence/fields.
4. Fall back to DB, then SIF semantic if missing or stale.
## Related-document navigation
Agents can consult `context_manifest.json` to discover linked context files and traverse only the required documents for the task.

@@ -0,0 +1,25 @@
# Step 4 Flat File Context Design (Persona Data)
## Intent
Capture onboarding Step 4 persona outputs in an agent-first flat file so agents can quickly personalize strategy, content, and platform execution.
## Storage location
- `workspace/workspace_<safe_user_id>/agent_context/step4_persona_data.json`
## Required Step 4 coverage
- core persona profile (`core_persona`)
- platform personas (`platform_personas`)
- quality metrics (`quality_metrics`)
- selected platforms (`selected_platforms`)
- research persona/notes when available
- source payload + timestamps for traceability
## Agent summary expectations
- quick facts: selected platform count, persona availability flags
- retrieval hints: persona/profile adaptation queries
- persona focus: compact actionable slice of core persona + quality constraints
## Usage policy
1. Start with `agent_summary`.
2. Expand into `data` only when a task needs full fidelity.
3. Use `document_context.related_documents` to fetch upstream Step 2/Step 3 context as needed.

@@ -0,0 +1,22 @@
# Step 5 Flat File Context Design (Integrations)
## Intent
Capture onboarding Step 5 integration configuration in a compact agent-readable context so agents can reason about connected services and execution constraints.
## Storage location
- `workspace/workspace_<safe_user_id>/agent_context/step5_integrations.json`
## Required Step 5 coverage
- integration map (`integrations`)
- provider list (`providers`)
- connected account references (`connected_accounts`)
- integration status and notes
- source payload and timestamps
## Agent summary expectations
- connected integration count/list
- provider count
- retrieval hints for integration readiness checks
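
As a rough illustration of the summary expectations, the counts could be derived from the Step 5 data like this. The shape of `integrations` is assumed (a mapping from integration name to a config dict with a hypothetical `connected` flag), and `summarize_integrations` is not a name from the codebase.

```python
def summarize_integrations(data: dict) -> dict:
    """Derive compact quick facts from Step 5 data.

    Assumes `integrations` maps names to dicts carrying a boolean `connected`
    flag and `providers` is a list; both shapes are illustrative only.
    """
    integrations = data.get("integrations", {})
    connected = sorted(
        name
        for name, config in integrations.items()
        if isinstance(config, dict) and config.get("connected")
    )
    return {
        "connected_count": len(connected),
        "connected": connected,
        "provider_count": len(data.get("providers", [])),
    }
```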
## Linked traversal
Use `document_context.related_documents` and `context_manifest.json` to navigate Step 2/3/4 upstream dependencies when deciding tool execution paths.