Add Step 5 flat context and txtai file tools for agents
@@ -189,3 +189,20 @@ All orchestration updates are emitted as typed records under a shared schema:
* **Inter-Agent Chat**: Allow agents to debate strategy (e.g., SEO Agent vs. Creative Agent).
* **Auto-Execution**: Allow agents to *perform* tasks (e.g., fix a broken link) with user approval.
* **Voice Interface**: Daily standup meeting via voice.


## ⚡ Agent Fast-Context Layer (Onboarding Step 2)

To reduce latency for repetitive agent reads, Step 2 website analysis is now persisted to a per-user flat file in the workspace:

- `workspace/workspace_<safe_user_id>/agent_context/step2_website_analysis.json`

**Read order for agents:**
1. Flat-file context (agent-only, fastest)
2. Relational database (`website_analyses`)
3. SIF semantic index retrieval

This preserves SIF intelligence workflows while giving agents deterministic, low-latency access to core onboarding context.
It also stores agent-optimized `quick_facts`, `retrieval_hints`, and full-fidelity raw payload blocks so both fast inference and deep-dive reasoning are supported.

Reference design docs: `docs/flat_file_context/STEP2_FLAT_FILE_CONTEXT_DESIGN.md`, `docs/flat_file_context/STEP3_FLAT_FILE_CONTEXT_DESIGN.md`, `docs/flat_file_context/STEP4_FLAT_FILE_CONTEXT_DESIGN.md`, `docs/flat_file_context/STEP5_FLAT_FILE_CONTEXT_DESIGN.md`, `docs/flat_file_context/FLAT_FILE_CONTEXT_FRAMEWORK_DESIGN.md`, `docs/flat_file_context/FLAT_FILE_CONTEXT_SECURITY_AND_ISOLATION.md`, and `docs/flat_file_context/FLAT_FILE_CONTEXT_PROGRESS_AND_QUICK_WINS.md`.

@@ -0,0 +1,69 @@
# Flat File Context Enhancements Backlog

This document tracks next-phase implementation items for the flat-file context framework.

## 1) TTL/Refresh Hints + Freshness Policy
### Objective
Prevent stale agent decisions by adding explicit freshness semantics.

### Proposed additions
- Add `m.ttl_s` (seconds) and `m.stale_after` (timestamp) to the context envelope.
- Add `m.refresh_recommended` boolean.
- Define per-context defaults (Step 2 likely long TTL, but still bounded).

### Acceptance criteria
- Reader utility can classify context as `fresh|stale|expired`.
- Fallback to DB/SIF is triggered automatically when the staleness policy requires it.
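
A minimal sketch of the reader-side classification named in the acceptance criteria. It assumes `m.ttl_s` is counted from the envelope's `ts` and that `m.stale_after` is an ISO 8601 timestamp. The function name `classify_freshness` and the exact boundary between `stale` and `expired` are illustrative assumptions; the backlog item does not define them.

```python
from datetime import datetime, timedelta, timezone


def _parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, treating naive values as UTC (assumption)."""
    parsed = datetime.fromisoformat(value)
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def classify_freshness(envelope: dict, now: datetime) -> str:
    """Return 'fresh', 'stale', or 'expired' for a context envelope.

    Assumed policy (hypothetical, not taken from the backlog):
      - before m.stale_after            -> fresh
      - after it, but within m.ttl_s    -> stale (usable, refresh recommended)
      - at or past ts + m.ttl_s         -> expired (fall back to DB/SIF)
    """
    meta = envelope.get("m", {})
    written = _parse_ts(envelope["ts"])
    expires = written + timedelta(seconds=meta["ttl_s"])
    stale_after = _parse_ts(meta["stale_after"])
    if now >= expires:
        return "expired"
    if now >= stale_after:
        return "stale"
    return "fresh"


envelope = {
    "ts": "2024-01-01T00:00:00+00:00",
    "m": {"ttl_s": 86400, "stale_after": "2024-01-01T12:00:00+00:00"},
}
now = datetime(2024, 1, 1, 18, 0, tzinfo=timezone.utc)
print(classify_freshness(envelope, now))  # stale
```

Under this policy a reader could keep serving a `stale` document while setting `m.refresh_recommended`, and only trigger the DB/SIF fallback on `expired`; whether that split is the right one is a policy decision left open here.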

---

## 2) Optional `.json.gz` Companion for Large Payloads
### Objective
Reduce disk footprint and IO for large context payloads.

### Proposed additions
- Write primary `.json` always.
- If payload exceeds threshold (e.g., >256 KB), write `.json.gz` companion.
- Add pointer metadata (`m.gz=true`, `m.gz_path`).

### Acceptance criteria
- Reader transparently supports JSON + GZIP variants.
- No regression for small payloads.
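
One way the writer and reader for this item might look, as a sketch only. The helper names `write_context` and `read_context`, storing `m.gz_path` as a bare file name, and the reader's preference for the plain `.json` file are all assumptions made for illustration.

```python
import gzip
import json
import tempfile
from pathlib import Path

GZ_THRESHOLD_BYTES = 256 * 1024  # example threshold quoted in the backlog item


def _encode(envelope: dict) -> bytes:
    return json.dumps(envelope, separators=(",", ":")).encode("utf-8")


def write_context(path: Path, envelope: dict, threshold: int = GZ_THRESHOLD_BYTES) -> None:
    """Always write the primary .json; add a .json.gz companion when it is large."""
    envelope = {**envelope, "m": dict(envelope.get("m", {}))}  # do not mutate the caller's dict
    if len(_encode(envelope)) > threshold:
        gz_path = path.with_name(path.name + ".gz")
        envelope["m"].update(gz=True, gz_path=gz_path.name)  # pointer metadata
        gz_path.write_bytes(gzip.compress(_encode(envelope)))
    path.write_bytes(_encode(envelope))


def read_context(path: Path) -> dict:
    """Load whichever variant exists, preferring the plain JSON file."""
    if path.exists():
        return json.loads(path.read_bytes())
    gz_path = path.with_name(path.name + ".gz")
    return json.loads(gzip.decompress(gz_path.read_bytes()))


with tempfile.TemporaryDirectory() as tmp:
    big = Path(tmp) / "step2_website_analysis.json"
    write_context(big, {"v": "1.0", "d": {"blob": "x" * 1000}, "m": {}}, threshold=512)
    big.unlink()  # simulate a reader that only finds the compressed companion
    from_gz = read_context(big)

    small = Path(tmp) / "small.json"
    write_context(small, {"v": "1.0", "d": {}, "m": {}})
    small_has_gz = small.with_name("small.json.gz").exists()

print(from_gz["m"], small_has_gz)
```

Small payloads take exactly the same path as before (one `.json` file, no extra metadata), which is what the "no regression" criterion asks for.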

---

## 3) Section Checksums for Drift Detection
### Objective
Detect inconsistencies between flat-file context and database state.

### Proposed additions
- Add checksums per section (`d.brand`, `d.seo`, `d.audience`, etc.) under `m.chk`.
- Persist DB-row reference (`m.db_ref`) with latest row id/timestamp.
- Add `verify_drift()` utility.

### Acceptance criteria
- Drift check can flag `in_sync|partial_drift|out_of_sync`.
- On drift, reader suggests refresh + fallback path.
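
A sketch of how per-section checksums and `verify_drift()` could fit together. The backlog names the utility but not its signature, so the `(envelope, db_data)` parameters, the choice of SHA-256 over canonical JSON, and the rule that "every section differs" means `out_of_sync` are assumptions.

```python
import hashlib
import json


def section_checksum(section: object) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, compact separators)."""
    canonical = json.dumps(section, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_checksums(data: dict) -> dict:
    """Compute the per-section map that would be stored under m.chk."""
    return {name: section_checksum(value) for name, value in data.items()}


def verify_drift(envelope: dict, db_data: dict) -> str:
    """Compare stored m.chk against checksums of the latest DB sections."""
    stored = envelope.get("m", {}).get("chk", {})
    current = build_checksums(db_data)
    names = set(stored) | set(current)
    mismatched = [n for n in names if stored.get(n) != current.get(n)]
    if not mismatched:
        return "in_sync"
    # Assumed rule: every section differing means the file is fully out of sync.
    return "out_of_sync" if len(mismatched) == len(names) else "partial_drift"


data = {"brand": {"tone": "formal"}, "seo": {"score": 71}}
envelope = {"d": data, "m": {"chk": build_checksums(data)}}
db_now = {"brand": {"tone": "formal"}, "seo": {"score": 80}}
print(verify_drift(envelope, db_now))  # partial_drift
```

Sorting keys before hashing keeps the checksum stable when the DB and the writer serialize the same section with different key order.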

---

## 4) Extend Pattern to Step 3 and Step 4
### Objective
Standardize agent context retrieval across onboarding steps.

### Proposed additions
- `step3_research_context.json`
- `step4_persona_context.json`
- Shared envelope with step-specific `d/s` contracts.

### Acceptance criteria
- Same fallback chain works for step-specific readers.
- SIF agents can consume common interface across Step 2/3/4.

---

## Suggested implementation order
1. TTL/freshness
2. Checksums/drift detection
3. Step 3/4 expansion
4. Optional gzip optimization
140
docs/flat_file_context/FLAT_FILE_CONTEXT_FRAMEWORK_DESIGN.md
Normal file
@@ -0,0 +1,140 @@
# Flat File Context Framework Design (Agent-Optimized)

## Purpose
Design a **compact, machine-first flat-file framework** for ALwrity AI agents.

This framework is optimized for:
- deterministic structure,
- minimal token footprint,
- fast parsing,
- high-signal retrieval,
- robust fallback behavior.

## Core Principles
1. **Agent-first, not human-first**
   - Keys are short and stable.
   - Avoid verbose prose in payloads.
   - Include only fields needed for reasoning and tool actions.

2. **Compact + predictable schema**
   - Fixed top-level keys in strict order.
   - Canonical value types (no shape drift).
   - Avoid polymorphic fields when possible.

3. **Dual-layer context**
   - `d` (full normalized data for deep reasoning).
   - `s` (summary/high-signal fast path for most agent reads).

4. **Fallback-safe design**
   - Every context doc includes source + freshness metadata.
   - If the file is missing or stale, consumers fall back to the DB, then to SIF semantic retrieval.

5. **Multi-tenant isolation**
   - Per-user file under `workspace/workspace_<safe_user_id>/agent_context/`.

---

## Canonical Context Envelope (compact)
```json
{
  "v": "1.0",
  "t": "onboarding.step2.website_analysis",
  "u": "<user_id>",
  "ts": "<iso8601>",
  "src": "onboarding_step2",
  "d": {},
  "s": {},
  "m": {
    "db": 0,
    "sb": 0,
    "q": []
  }
}
```

### Field map
- `v`: schema version
- `t`: context type
- `u`: user id
- `ts`: updated timestamp
- `src`: source writer
- `d`: canonical normalized data
- `s`: high-signal summary for quick agent use
- `m`: meta (`db`=data bytes, `sb`=summary bytes, `q`=query hints)

---

## Agent Readability Best Practices
- Prefer enums/controlled vocab over free text.
- Use compact keys and arrays for repetitive entities.
- Truncate long textual blobs unless explicitly required.
- Keep “quick facts” flattened.
- Separate operational metadata from semantic content.
- Include retrieval hints (`q`) for consistent query drafting.

---

## Write Pipeline Pattern
1. Normalize incoming source payload.
2. Derive compact summary (`s`) from normalized data.
3. Compute lightweight metadata (`m`).
4. Write the JSON file atomically.
5. Emit writer version + timestamp.
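
The five write steps can be sketched as below. This is an illustration under stated assumptions, not the project's writer: the function names, the toy normalization rule (drop `None` values), the `website_url` field, and the single-field summary are all hypothetical. Only the envelope keys and the temp-file-plus-`os.replace` technique come from these docs.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def _nbytes(obj) -> int:
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


def build_envelope(user_id: str, data: dict, summary: dict, hints: list) -> dict:
    """Assemble the envelope with its fixed top-level key order."""
    return {
        "v": "1.0",
        "t": "onboarding.step2.website_analysis",
        "u": user_id,
        "ts": datetime.now(timezone.utc).isoformat(),
        "src": "onboarding_step2",
        "d": data,
        "s": summary,
        "m": {"db": _nbytes(data), "sb": _nbytes(summary), "q": hints},
    }


def atomic_write_json(path: Path, envelope: dict) -> None:
    """Write to a temp file in the target directory, then swap it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(envelope, handle, separators=(",", ":"))
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def write_context(path: Path, user_id: str, payload: dict):
    data = {k: v for k, v in payload.items() if v is not None}   # 1. normalize (toy rule)
    summary = {"url": data.get("website_url")}                   # 2. derive summary
    envelope = build_envelope(user_id, data, summary, hints=[])  # 3. compute metadata
    atomic_write_json(path, envelope)                            # 4. atomic write
    return envelope["v"], envelope["ts"]                         # 5. emit version + timestamp


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "agent_context" / "step2_website_analysis.json"
    version, stamp = write_context(
        target, "user_42", {"website_url": "https://example.com", "error": None}
    )
    stored = json.loads(target.read_text(encoding="utf-8"))

print(version, list(stored))
```

Creating the temp file in the same directory as the target matters: `os.replace` is only atomic within one filesystem, so a temp file under the system temp directory could silently lose that guarantee.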

## Read Pipeline Pattern
1. Attempt flat-file load.
2. Validate minimum envelope fields (`v,t,u,ts,d`).
3. Prefer `s` for quick tasks; use `d` for deeper reasoning.
4. If invalid, missing, or stale: fall back to the DB, then to SIF semantic retrieval.
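
The read steps above can be sketched as a single function. The DB and SIF loaders and the staleness check are passed in as callables because their real interfaces are not described here; `load_flat_file`, `read_context`, and the `(source, payload)` return shape are hypothetical names chosen for this example.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple

REQUIRED_FIELDS = ("v", "t", "u", "ts", "d")


def load_flat_file(path: Path) -> Optional[dict]:
    """Return the envelope, or None if it is missing, unparsable, or incomplete."""
    try:
        envelope = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):  # missing file, bad encoding, or invalid JSON
        return None
    if not isinstance(envelope, dict) or any(k not in envelope for k in REQUIRED_FIELDS):
        return None
    return envelope


def read_context(
    path: Path,
    is_stale: Callable[[dict], bool],
    load_from_db: Callable[[], Optional[dict]],
    load_from_sif: Callable[[], dict],
    deep: bool = False,
) -> Tuple[str, dict]:
    """Return (source, payload), trying flat file, then DB, then SIF."""
    envelope = load_flat_file(path)
    if envelope is not None and not is_stale(envelope):
        # Summary-first: only hand back full data when asked, or when no summary exists.
        use_data = deep or not envelope.get("s")
        return "flat_file", envelope["d"] if use_data else envelope["s"]
    row = load_from_db()
    if row is not None:
        return "db", row
    return "sif", load_from_sif()


doc = {
    "v": "1.0", "t": "onboarding.step2.website_analysis", "u": "user_42",
    "ts": "2024-01-01T00:00:00+00:00",
    "d": {"website_url": "https://example.com", "seo": {"score": 71}},
    "s": {"url": "https://example.com"},
}
loaders = dict(
    is_stale=lambda e: False,
    load_from_db=lambda: {"website_url": "https://example.com"},
    load_from_sif=lambda: {},
)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "step2_website_analysis.json"
    miss = read_context(path, **loaders)   # file does not exist yet
    path.write_text(json.dumps(doc), encoding="utf-8")
    hit = read_context(path, **loaders)

print(miss[0], hit[0])  # db flat_file
```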

---

## Scope Expansion Pattern
Apply the same envelope to:
- Step 2: website analysis
- Step 3: research preferences + competitor snapshots
- Step 4: persona profile + platform personas

Only `t`, `d`, and `s` payload contracts should vary.

---

## Governance
- Schema changes require version bump (`v`).
- Backward compatibility policy: readers support N and N-1.
- Drift checks should compare canonical hash/checksum vs DB latest row.


## Document Context + End-User Journey Metadata
Each context file should carry explicit machine-oriented document metadata so agents understand *what this file is* before reading full payloads.

Suggested `document_context` fields:
- `audience`: `ai_agents`
- `purpose`: `fast_context_retrieval`
- `context_type`: step-scoped type identifier
- `journey`: stage/action/agent expectation
- `retrieval_contract`: preferred source + fallback order
- `context_window_guidance`: byte budget and summary-first policy

This block is intentionally compact and deterministic to reduce wasted token usage for agent planning.

## Context Window and Length Policy
- Keep combined `data + summary` under a defined byte budget where practical.
- Enforce summary-first reads in agent consumers.
- Truncate long textual fields in summaries; keep full text only in `data` when needed.
- Flag oversize docs in metadata so readers can skip low-priority sections.
- Prefer short, stable keys in machine envelopes and avoid natural-language verbosity.


## Implemented baseline controls
- Atomic file writes to avoid partial documents.
- Best-effort restricted file permissions (`0600`).
- Recursive sensitive-key redaction for payload snapshots.
- Payload size budget enforcement with deterministic trimming metadata.
- Internal document linking via `related_documents` and manifest index.


Security and isolation details: `docs/flat_file_context/FLAT_FILE_CONTEXT_SECURITY_AND_ISOLATION.md`


Step docs: `docs/flat_file_context/STEP2_FLAT_FILE_CONTEXT_DESIGN.md`, `docs/flat_file_context/STEP3_FLAT_FILE_CONTEXT_DESIGN.md`, `docs/flat_file_context/STEP4_FLAT_FILE_CONTEXT_DESIGN.md`, `docs/flat_file_context/STEP5_FLAT_FILE_CONTEXT_DESIGN.md`
@@ -0,0 +1,26 @@
# Flat File Context Progress Review and Quick Wins

## Progress so far
- Step 2 context: implemented (website analysis fast path + fallback).
- Step 3 context: implemented (research preferences + competitors fast path + fallback).
- Step 4 context: implemented (persona data fast path + fallback).
- Step 5 context: implemented (integrations fast path + fallback).
- Security baseline: user isolation checks, redaction, atomic writes, file-permission hardening.
- Size governance: payload budget + deterministic trimming + trim metadata.
- Internal linking: related-document links + manifest index.

## Quick-win improvements (next 1-2 sprints)
1. Add explicit TTL/staleness fields and auto-refresh hints per step.
2. Add lightweight checksums per section to detect DB drift quickly.
3. Add optional `.json.gz` companion for oversized archives.
4. Add shared reader utility for summary-first + selective field loading.
5. Add minimal unit tests for:
   - redaction
   - trimming behavior
   - manifest linking
   - cross-user load rejection
6. Add agent telemetry: record which sections are actually read to optimize summaries.


## Newly added agent tooling
- txtai agent tools for flat-file context manifest/read/write-note operations were added to the SIF base agent to support file operations in agent workflows.
@@ -0,0 +1,39 @@
# Flat File Context Security, Isolation, and Size Controls

## Objective
Provide minimal but practical security for agent flat-file context with strong end-user isolation and bounded document growth.

## Isolation model
- Per-user namespace: `workspace/workspace_<safe_user_id>/agent_context/`
- Sanitized user IDs only (`[a-zA-Z0-9_-]`) to prevent path traversal.
- Reader-side user check: loaded document `user_id` must match requesting user context.
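
A small sketch of the two isolation checks listed above. The function names are hypothetical, and rejecting an unsafe ID outright (rather than stripping or substituting characters) is an assumption; the document only states which characters are allowed.

```python
import re
from pathlib import Path

_SAFE_ID = re.compile(r"[a-zA-Z0-9_-]+")  # character class taken from the isolation model


def context_dir(workspace_root: Path, user_id: str) -> Path:
    """Build the per-user directory, rejecting IDs outside the safe alphabet."""
    if not _SAFE_ID.fullmatch(user_id):
        raise ValueError(f"unsafe user id: {user_id!r}")
    return workspace_root / f"workspace_{user_id}" / "agent_context"


def assert_owner(document: dict, requesting_user_id: str) -> dict:
    """Reader-side check: refuse documents written for a different user."""
    if document.get("user_id") != requesting_user_id:
        raise PermissionError("context document belongs to a different user")
    return document


path = context_dir(Path("workspace"), "user_42")
try:
    context_dir(Path("workspace"), "../other_user")
    rejected = False
except ValueError:
    rejected = True
doc = assert_owner({"user_id": "user_42", "data": {}}, "user_42")

print(path.name, rejected)  # agent_context True
```

Because `.` and `/` are outside the allowed set, a value such as `../other_user` can never reach the path join, which is what closes off traversal into another user's namespace.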

## Minimal security controls implemented
1. **Atomic writes**
   - Context files are written via temporary file + `os.replace`.
   - Prevents partial/corrupt files under concurrent writes.
2. **File permissions**
   - Context files are best-effort set to `0600`.
3. **Sensitive key redaction**
   - Recursive redaction for key patterns like `api_key`, `token`, `secret`, `password`, `authorization`, `cookie`.
4. **Manifest index**
   - `context_manifest.json` gives agents a controlled map of available docs and relationships.
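
The recursive redaction in item 3 could be sketched as follows. The pattern list is the one given above; the `[REDACTED]` placeholder, case-insensitive substring matching, and the function name `redact` are assumptions made for this example.

```python
import re

SENSITIVE = re.compile(r"api_key|token|secret|password|authorization|cookie", re.IGNORECASE)
REDACTED = "[REDACTED]"  # placeholder value; the real marker is not specified


def redact(value):
    """Return a copy with values of sensitive-looking keys replaced, at any depth."""
    if isinstance(value, dict):
        return {
            key: REDACTED if isinstance(key, str) and SENSITIVE.search(key) else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


snapshot = {
    "url": "https://example.com",
    "headers": {"Authorization": "Bearer abc"},
    "integrations": [{"api_key": "k1", "name": "gsc"}],
}
clean = redact(snapshot)
print(clean["headers"], clean["integrations"])
```

Substring matching errs on the side of over-redaction: a harmless key such as `tokens_used` would also be masked. For payload snapshots that trade-off is usually acceptable, but it is worth knowing when a field unexpectedly shows up as redacted.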

## Size and context-window controls
- Byte budget for raw document payloads (`DEFAULT_MAX_BYTES`).
- If oversize, low-priority/heavy sections are trimmed first (`raw_*`, large snapshots, heavy arrays).
- Trim metadata is preserved under `meta.trim` for traceability.
- Agent policy remains summary-first (`agent_summary` before `data`).
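
A sketch of deterministic trimming under a byte budget. The value given to `DEFAULT_MAX_BYTES` here is a placeholder (the real one is not stated in this document), and the trim order (all `raw_*` sections first, then remaining sections by descending size) and the shape of `meta.trim` are assumptions.

```python
import json

DEFAULT_MAX_BYTES = 64 * 1024  # placeholder value, not taken from the implementation


def _size(obj) -> int:
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


def enforce_budget(document: dict, max_bytes: int = DEFAULT_MAX_BYTES) -> dict:
    """Drop raw_* sections first, then the largest remaining ones, until the doc fits."""
    doc = json.loads(json.dumps(document))  # deep copy of JSON-compatible data
    data = doc.get("data", {})
    # raw_* keys sort first, then larger sections; the name breaks ties deterministically.
    order = sorted(data, key=lambda k: (not k.startswith("raw_"), -_size(data[k]), k))
    removed = []
    for key in order:
        if _size(doc) <= max_bytes:
            break
        del data[key]
        removed.append(key)
    if removed:
        # The trim record itself is not counted against the budget in this sketch.
        doc.setdefault("meta", {})["trim"] = {"removed": removed, "max_bytes": max_bytes}
    return doc


document = {
    "agent_summary": {"quick_facts": {"site": "example.com"}},
    "data": {"brand": {"tone": "formal"}, "raw_crawl": "x" * 5000, "seo": {"score": 71}},
}
result = enforce_budget(document, max_bytes=500)
print(result["meta"]["trim"]["removed"])  # ['raw_crawl']
```

`agent_summary` is never a trim candidate here, so the summary-first read path keeps working even for heavily trimmed documents.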

## Internal document linking
- Each context file includes `document_context.related_documents`.
- Manifest includes per-document `related_documents` links.
- This enables agents to:
  1. read one document,
  2. discover related context files,
  3. fetch only relevant next documents.
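
To illustrate the read, discover, fetch loop, the sketch below walks `related_documents` links breadth-first with a hop limit. The manifest layout used here (a top-level `documents` map keyed by file name) is purely an assumption, since the actual structure of `context_manifest.json` is not shown in these docs; `related_closure` is a hypothetical helper name.

```python
from collections import deque


def related_closure(manifest: dict, start: str, max_hops: int = 1) -> list:
    """List documents reachable from `start` via related_documents, nearest first."""
    documents = manifest.get("documents", {})
    seen, order = {start}, []
    queue = deque([(start, 0)])
    while queue:
        name, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for linked in documents.get(name, {}).get("related_documents", []):
            if linked not in seen and linked in documents:
                seen.add(linked)
                order.append(linked)
                queue.append((linked, hops + 1))
    return order


# Hypothetical manifest shape, for illustration only.
manifest = {
    "documents": {
        "step2_website_analysis.json": {"related_documents": []},
        "step3_research_preferences.json": {
            "related_documents": ["step2_website_analysis.json"]
        },
        "step4_persona_data.json": {
            "related_documents": [
                "step2_website_analysis.json",
                "step3_research_preferences.json",
            ]
        },
        "step5_integrations.json": {"related_documents": ["step4_persona_data.json"]},
    }
}
print(related_closure(manifest, "step5_integrations.json"))
# ['step4_persona_data.json']
```

A hop limit keeps an agent from pulling the whole context graph into its window when it only needs the immediate upstream document.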

## Recommended next steps
- Add optional file-level signatures/HMAC for tamper evidence.
- Add checksum per section to detect DB drift.
- Add staleness policy (`ttl_s`, `stale_after`) and auto-refresh triggers.
54
docs/flat_file_context/STEP2_FLAT_FILE_CONTEXT_DESIGN.md
Normal file
@@ -0,0 +1,54 @@
# Step 2 Flat File Context Design (Website Analysis)

## Intent
Step 2 context must be optimized for **AI-agent retrieval speed and token efficiency**, not human readability.

## Current storage location
- `workspace/workspace_<safe_user_id>/agent_context/step2_website_analysis.json`

## Current retrieval chain
1. Flat file (fastest)
2. DB (`website_analyses`)
3. SIF semantic fallback

## Compactness strategy
For implementation, keep two logical layers plus an orientation block:
- **`d` equivalent (full canonical data)** for deep reasoning.
- **`s` equivalent (high-signal summary)** for fast agent prompts and most decisions.
- **`document_context`** for machine-readable orientation (purpose, journey stage, fallback contract, context-window guidance).

Agents should default to summary-first reads and only open full data when needed.

## Step 2 coverage requirements
The Step 2 context should preserve these semantic groups:
- identity/state: website url, timestamps, status/error/warning
- brand/style: writing style, style patterns/guidelines, brand analysis
- audience/content: target audience, content type, recommended settings, characteristics
- strategy/seo: strategy insights, SEO audit, strategic history
- crawl/discovery: crawl output, meta info, sitemap analysis
- traceability: raw inbound payload snapshots

## Agent-readability best practices
- Keep keys stable and deterministic.
- Prefer arrays/enums over long free text.
- Keep summary fields flattened and high signal.
- Avoid duplicate verbose nested structures unless required for correctness.
- Include retrieval hints for consistent downstream querying.

## Practical guidance for consumers
- Use summary/high-signal fields first for routing and lightweight reasoning.
- Pull deep fields only for specialist tasks (SEO, persona fidelity, editorial style checks).
- If the flat file is missing or stale, fall back automatically to the DB, then to SIF.

## Note
A generalized compact framework is documented in:
- `docs/flat_file_context/FLAT_FILE_CONTEXT_FRAMEWORK_DESIGN.md`

Future enhancements are tracked in:
- `docs/flat_file_context/FLAT_FILE_CONTEXT_ENHANCEMENTS_BACKLOG.md`


## Context window guidance
- Keep summary compact and deterministic.
- Add byte-size metadata to help agents decide whether to expand into full data.
- Prefer short keys and avoid verbose natural language in machine envelopes.
39
docs/flat_file_context/STEP3_FLAT_FILE_CONTEXT_DESIGN.md
Normal file
@@ -0,0 +1,39 @@
# Step 3 Flat File Context Design (Research Preferences + Competitors)

## Intent
Provide agent-ready Step 3 context with compact summaries for routing plus full payload for deep analysis.

## Storage location
- `workspace/workspace_<safe_user_id>/agent_context/step3_research_preferences.json`

## Why this matters for agents
Step 3 is the bridge from website understanding (Step 2) to competitive strategy and research execution. Agents need this file to understand:
- depth and quality preference constraints,
- factuality constraints,
- content-type priorities,
- competitor landscape and industry context.

## Document-context block
Every context file should include machine-readable document metadata to orient agents quickly:
- audience (`ai_agents`)
- purpose (`fast_context_retrieval`)
- journey stage (`onboarding_step_3`)
- retrieval contract and fallback order
- context-window guidance (size budget + summary-first policy)

## Minimal Step 3 data groups
- research config: depth/content types/auto/factual
- inherited style profile (if present): writing style, target audience, recommended settings
- competitors: domain/url/title/relevance highlights
- industry context: compact market framing text
- traceability: source payload and timestamps

## Agent usage policy
1. Start with `agent_summary.quick_facts` and `retrieval_hints`.
2. Use competitor summary before opening full competitor objects.
3. Read full `data` only for tasks requiring strict evidence/fields.
4. Fall back to DB, then SIF semantic if missing or stale.


## Related-document navigation
Agents can consult `context_manifest.json` to discover linked context files and traverse only the required documents for the task.
25
docs/flat_file_context/STEP4_FLAT_FILE_CONTEXT_DESIGN.md
Normal file
@@ -0,0 +1,25 @@
# Step 4 Flat File Context Design (Persona Data)

## Intent
Capture onboarding Step 4 persona outputs in an agent-first flat file so agents can quickly personalize strategy, content, and platform execution.

## Storage location
- `workspace/workspace_<safe_user_id>/agent_context/step4_persona_data.json`

## Required Step 4 coverage
- core persona profile (`core_persona`)
- platform personas (`platform_personas`)
- quality metrics (`quality_metrics`)
- selected platforms (`selected_platforms`)
- research persona/notes when available
- source payload + timestamps for traceability

## Agent summary expectations
- quick facts: selected platform count, persona availability flags
- retrieval hints: persona/profile adaptation queries
- persona focus: compact actionable slice of core persona + quality constraints

## Usage policy
1. Start with `agent_summary`.
2. Expand into `data` only when a task needs full fidelity.
3. Use `document_context.related_documents` to fetch upstream Step 2/Step 3 context as needed.
22
docs/flat_file_context/STEP5_FLAT_FILE_CONTEXT_DESIGN.md
Normal file
@@ -0,0 +1,22 @@
# Step 5 Flat File Context Design (Integrations)

## Intent
Capture onboarding Step 5 integration configuration in a compact agent-readable context so agents can reason about connected services and execution constraints.

## Storage location
- `workspace/workspace_<safe_user_id>/agent_context/step5_integrations.json`

## Required Step 5 coverage
- integration map (`integrations`)
- provider list (`providers`)
- connected account references (`connected_accounts`)
- integration status and notes
- source payload and timestamps

## Agent summary expectations
- connected integration count/list
- provider count
- retrieval hints for integration readiness checks

## Linked traversal
Use `document_context.related_documents` and `context_manifest.json` to navigate Step 2/3/4 upstream dependencies when deciding tool execution paths.