# Emdash Perf Monitor
Tracks cold start / TTFB of the emdash demo sites over time from multiple regions. Two sites are measured in parallel so the effect of Astro's experimental cache provider can be compared head-to-head:
- `blog` -- `blog-demo.emdashcms.com` (baseline, stock Astro)
- `cache` -- `cache-demo.emdashcms.com` (prerelease Astro with `cacheCloudflare()` enabled)
Each measurement row is tagged with a `site` column matching one of those ids.
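The registry lives in `src/routes.ts` (see "Operational notes" below). A minimal sketch of its shape -- `id` and `targetUrl` are documented field names, while `workerName` and the cache site's Worker name are assumptions:
```ts
// Sketch of the site registry in src/routes.ts. `id` and `targetUrl` are
// documented field names; `workerName` (and the cache site's Worker name)
// are assumptions.
export interface Site {
  id: string;         // stable id, stored in perf_results.site
  targetUrl: string;  // URL the probes time with fetch()
  workerName: string; // Worker whose builds deploy this site
}

export const SITES: Site[] = [
  { id: "blog",  targetUrl: "https://blog-demo.emdashcms.com",  workerName: "blog-demo" },
  { id: "cache", targetUrl: "https://cache-demo.emdashcms.com", workerName: "cache-demo" },
];
```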
## Architecture
- **Coordinator Worker** (`emdash-perf-coordinator`) owns the D1 database, cron trigger, queue consumer, HTTP API, and frontend dashboard. Served at `https://perf.emdashcms.com`.
- **4 Probe Workers** (`emdash-perf-probe-{use,euw,ape,aps}`) are placed near AWS regions via `placement.region`. They receive measurement requests from the coordinator via service bindings and run `fetch()` timing from their placed location.
- **D1 database** (`emdash_perf`) stores all measurements, tagged by `source`: `deploy` (queue-triggered, has SHA + PR) or `cron` (ambient baseline, untagged).
- **Cloudflare Queue** (`emdash-perf-deploy-events`) subscribes to `cf.workersBuilds.worker.build.succeeded` events. The coordinator consumes these, filters for the baseline demo Worker, resolves the PR via the GitHub API, and runs a measurement against every registered site. This is the primary attribution path; see `src/routes.ts` for the site registry.
All five Workers are built from this directory by the Cloudflare Vite plugin -- the coordinator entry is `src/index.ts` and the four probes are defined as `auxiliaryWorkers` in `vite.config.ts`.
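A sketch of that wiring: `auxiliaryWorkers` is the plugin's option for building extra Workers alongside the entry Worker, but the probe config paths below are assumptions:
```ts
import { defineConfig } from "vite";
import { cloudflare } from "@cloudflare/vite-plugin";

// Coordinator is the entry Worker; the four probes build as auxiliary
// Workers. The probe config paths are illustrative only.
export default defineConfig({
  plugins: [
    cloudflare({
      configPath: "./wrangler.jsonc", // emdash-perf-coordinator
      auxiliaryWorkers: [
        { configPath: "./probes/use/wrangler.jsonc" },
        { configPath: "./probes/euw/wrangler.jsonc" },
        { configPath: "./probes/ape/wrangler.jsonc" },
        { configPath: "./probes/aps/wrangler.jsonc" },
      ],
    }),
  ],
});
```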
## Measurement triggers
| Trigger | When | `source` | Sites | SHA | PR | On graph? | Persisted? |
| ---------------------- | ---------------------------------- | -------- | ------------ | ---------- | -------- | --------- | ---------- |
| Queue event | Every successful `blog-demo` build | `deploy` | all | from event | resolved | yes | yes |
| Cron (`*/30 * * * *`) | Every 30 min | `cron` | all | null | null | yes | yes |
| `pnpm trigger` | Private/quiet check (default) | n/a | all (or one) | n/a | n/a | no | **no** |
| `pnpm trigger --store` | Manual, persisted | `manual` | all (or one) | optional | optional | **no** | yes |
The queue is the deploy-attribution path. The cron is a safety net that fills gaps between deploys and catches regressions the queue might miss.
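A rough sketch of that consumer in the coordinator. The payload fields (`workerName`, `commitSha`) and the helpers `resolvePr` / `measureAllSites` are assumptions; only the event type and the filter-then-measure behavior come from this README:
```ts
import { TRIGGER_WORKER_NAME } from "./routes";

// Hypothetical queue consumer. Payload fields on the
// cf.workersBuilds.worker.build.succeeded event are illustrative.
export default {
  async queue(batch: MessageBatch<unknown>, env: Env): Promise<void> {
    for (const msg of batch.messages) {
      const event = msg.body as { workerName?: string; commitSha?: string };
      if (event.workerName !== TRIGGER_WORKER_NAME) {
        msg.ack(); // build of some other Worker -- discard, don't retry
        continue;
      }
      const pr = await resolvePr(event.commitSha!); // GitHub lookup (sketched below)
      await measureAllSites(env, { source: "deploy", sha: event.commitSha, pr });
      msg.ack();
    }
  },
} satisfies ExportedHandler<Env>;
```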
`pnpm trigger` defaults to ephemeral: the probes run for real, but the coordinator skips the database insert and just returns the results, which the CLI prints to stdout. Use this for private/local checks you don't want on the dashboard.
Passing `--store` persists the run as `source=manual`. Stored manual runs land in the results table with a yellow `manual` badge but are excluded from the line chart, the summary cards, and the 7-day rolling medians so they don't skew the baseline.
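That exclusion is just a `source` filter at query time. A sketch of the kind of D1 query the chart endpoint might run -- only `perf_results.site` and the `source` column are confirmed by this README; `ttfb_ms` and `measured_at` are illustrative:
```ts
// Hypothetical chart query against the emdash_perf D1 database.
const sevenDaysAgo = Date.now() - 7 * 24 * 60 * 60 * 1000;
const { results } = await env.DB.prepare(
  `SELECT site, region, ttfb_ms, measured_at
     FROM perf_results
    WHERE source IN ('deploy', 'cron')  -- manual runs never reach the chart
      AND measured_at >= ?1
    ORDER BY measured_at ASC`,
).bind(sevenDaysAgo).all();
```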
## Manual triggers
```bash
# Default: run the probes, print results, record nothing.
# First invocation opens a browser for Cloudflare Access login; subsequent
# invocations reuse the token until the Access session expires.
pnpm trigger
# Persist the run as source=manual (appears in the results table)
pnpm trigger -- --store --note "pre-cold-start-fix baseline"
# Attach a SHA and/or PR number to a persisted run
pnpm trigger -- --store --sha 1a2b3c4 --pr 532 --note "PR #532 preview"
```
Auth is handled by a Cloudflare Access policy on `POST /api/trigger`.
## First-time setup
```bash
# 1. Create the D1 database and apply the initial schema
wrangler d1 create emdash_perf
# copy the database_id into wrangler.jsonc
wrangler d1 execute emdash_perf --remote --file=schema.sql
pnpm db:migrations:apply # any incremental migrations on top
# 2. Create the deploy events queue and DLQ
wrangler queues create emdash-perf-deploy-events
wrangler queues create emdash-perf-deploy-events-dlq
# 3. Build and deploy all 5 Workers
pnpm deploy
# 4. Subscribe the queue to Workers Builds events.
# (No wrangler command for this yet -- use the CF dashboard or API:
# https://developers.cloudflare.com/queues/event-subscriptions/manage-event-subscriptions/)
# Source: Workers Builds
# Events: build.succeeded (at minimum)
# Queue: emdash-perf-deploy-events
# 5. (Optional, to enable manual triggers) Add a Cloudflare Access policy
# on POST /api/trigger. See "Manual triggers" above.
```
No secrets required. PR lookup hits the public GitHub API unauthenticated
(60 req/hr limit, plenty for one lookup per deploy).
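That lookup amounts to one call to GitHub's "list pull requests associated with a commit" endpoint. A sketch of the kind of helper `src/github.ts` might contain; the repo slug is a placeholder:
```ts
// Unauthenticated PR resolution -- OWNER/REPO is a placeholder, not the
// real repo slug.
async function resolvePr(sha: string): Promise<number | null> {
  const res = await fetch(
    `https://api.github.com/repos/OWNER/REPO/commits/${sha}/pulls`,
    {
      headers: {
        accept: "application/vnd.github+json",
        "user-agent": "emdash-perf-monitor", // GitHub rejects requests without one
      },
    },
  );
  if (!res.ok) return null; // rate-limited or unknown SHA: record the run without a PR
  const pulls = (await res.json()) as Array<{ number: number }>;
  return pulls[0]?.number ?? null;
}
```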
## Deploy order
The coordinator's service bindings require the probes to exist first. `pnpm deploy` handles this: it builds, deploys all 4 probes, then deploys the coordinator.
## Dev
```bash
pnpm dev # Vite dev server, all 5 Workers via Miniflare
```
Open `http://localhost:5173` for the dashboard. API is at `/api/*`. Queue events can't be exercised locally without manual message publishing -- rely on the live environment or the next cron tick to verify the measurement path.
Local manual trigger (no Access locally):
```bash
curl -sS -X POST http://localhost:5173/api/trigger \
-H 'content-type: application/json' \
-d '{"note":"local test"}'
```
## Endpoints
| Endpoint | Method | Auth | Purpose |
| -------------- | ------ | --------- | ------------------------------------------------- |
| `/` | GET | none | Dashboard |
| `/api/config` | GET | none | Target URL, available routes and regions |
| `/api/summary` | GET | none | Latest result per route/region + rolling medians |
| `/api/results` | GET | none | Filtered historical results |
| `/api/chart` | GET | none | Time series for charting (with PR markers) |
| `/api/trigger` | POST | CF Access | Run an ad-hoc measurement, tagged `source=manual` |
All GET endpoints are read-only. `POST /api/trigger` is the only state-changing endpoint and is expected to be protected by a Cloudflare Access policy at the edge.
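For scripting against the read endpoints, plain `fetch` works. The query parameter names here are illustrative, not documented; check the coordinator's route handlers for the real filters:
```ts
// Illustrative read -- parameter names (site, source) are assumptions.
const res = await fetch(
  "https://perf.emdashcms.com/api/results?site=blog&source=deploy",
);
if (!res.ok) throw new Error(`perf API returned ${res.status}`);
const rows = await res.json();
console.log(rows);
```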
## Schema changes
D1's native migrations are wired up (`migrations_dir` in `wrangler.jsonc`).
```bash
pnpm db:migrations:list # show pending migrations
pnpm db:migrations:apply # apply pending migrations
pnpm db:migrations:create # scaffold a new migration file
```
`schema.sql` is the desired end state for fresh installs only. For incremental changes on an existing database, add a file under `migrations/` and apply it -- editing `schema.sql` alone has no effect on a database that already exists.
## Types
Binding types come from `wrangler types`, which reads `wrangler.jsonc` and writes `worker-configuration.d.ts`. The generated file is committed so `tsc` doesn't need wrangler to run first.
Re-run after any binding change:
```bash
pnpm cf-typegen
```
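The generated file declares an `Env` interface with one member per binding. Roughly, with binding names invented for illustration:
```ts
// Approximate shape of the generated Env in worker-configuration.d.ts.
// Binding names are guesses -- wrangler.jsonc is the source of truth.
interface Env {
  DB: D1Database;       // emdash_perf
  DEPLOY_EVENTS: Queue; // emdash-perf-deploy-events
  PROBE_USE: Fetcher;   // service binding to emdash-perf-probe-use
  PROBE_EUW: Fetcher;
  PROBE_APE: Fetcher;
  PROBE_APS: Fetcher;
}
```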
## Operational notes
- **Trigger worker name**: `TRIGGER_WORKER_NAME` in `src/routes.ts` is the Worker whose `build.succeeded` event drives deploy-attributed runs. Events for any other Worker are discarded (the cron job still measures every site on its own schedule). Since every registered site rebuilds from the same main-branch commit, one event triggers a measurement for all of them. If the baseline demo is ever renamed, update this constant.
- **Adding a site**: add an entry to `SITES` in `src/routes.ts` with a stable `id` (stored in `perf_results.site`), `targetUrl`, and Worker name. Existing rows continue to use their recorded site id.
- **PR lookup**: hits the public GitHub API unauthenticated (60 req/hr per IP). One call per deploy, so rate limits are a non-issue. If deploy rate ever gets anywhere near that, add a fine-grained PAT via `wrangler secret put GITHUB_TOKEN` and pass it in `src/github.ts`.
- **DLQ**: failed messages retry 3x, then go to `emdash-perf-deploy-events-dlq`. Check this periodically if deploy-attributed results stop appearing.