Self-Monitoring

cf-monitor monitors itself. It tracks cron execution, counts its own errors, writes self-telemetry to Analytics Engine, and alerts you if it becomes unhealth...

cf-monitor monitors itself. It tracks cron execution, counts its own errors, writes self-telemetry to Analytics Engine, and alerts you if it becomes unhealthy. If cf-monitor breaks, you’ll know.

How it works

Every time cf-monitor runs a handler (tail, scheduled, or fetch), it records:

Cron execution timestamps — a single KV JSON blob with the last run time, duration, and success status of each cron handler
Error counts — per-handler and total daily counters in KV
AE telemetry — a data point per handler invocation for historical analysis

All self-monitoring operations are fail-open. If KV or AE is unreachable, cf-monitor continues working normally — it just can’t report on its own health until KV recovers.

Cron staleness detection

cf-monitor knows the expected schedule and maximum staleness for each cron handler:

Handler	Schedule	Max staleness
`gap-detection`	Every 15 min	45 min
`cost-spike`	Every 15 min	45 min
`collect-metrics`	Hourly	150 min (2.5 hr)
`collect-account-usage`	Hourly	150 min (2.5 hr)
`budget-check`	Hourly	150 min (2.5 hr)
`synthetic-health`	Hourly	150 min (2.5 hr)
`daily-rollup`	Daily	1500 min (25 hr)
`worker-discovery`	Daily	1500 min (25 hr)

If a handler hasn’t run within its max staleness window, it’s flagged as stale. Staleness thresholds are 3x the expected interval, giving generous margin for transient issues.

First boot: After initial deployment, all handlers show lastRun: null — this is reported as healthy, not stale. Handlers populate their timestamps on first execution.

The /self-health endpoint

GET /self-health

Returns structured health status with HTTP status codes:

200 — healthy (no stale crons, fewer than 50 errors today)
503 — unhealthy (stale crons detected or 50+ errors today)

Example response (healthy)

{
  "healthy": true,
  "staleCrons": [],
  "todayErrors": 2,
  "handlerErrors": {
    "tail": 2
  },
  "crons": {
    "gap-detection": { "lastRun": "2026-03-22T14:15:03Z", "stale": false },
    "cost-spike": { "lastRun": "2026-03-22T14:15:03Z", "stale": false },
    "collect-metrics": { "lastRun": "2026-03-22T14:00:01Z", "stale": false },
    "collect-account-usage": { "lastRun": "2026-03-22T14:00:02Z", "stale": false },
    "budget-check": { "lastRun": "2026-03-22T14:00:03Z", "stale": false },
    "synthetic-health": { "lastRun": "2026-03-22T14:00:04Z", "stale": false },
    "daily-rollup": { "lastRun": "2026-03-22T00:00:01Z", "stale": false },
    "worker-discovery": { "lastRun": "2026-03-22T00:00:02Z", "stale": false }
  }
}

Example response (unhealthy)

{
  "healthy": false,
  "staleCrons": ["collect-metrics", "budget-check"],
  "todayErrors": 12,
  "handlerErrors": {
    "scheduled:collect-metrics": 8,
    "scheduled:budget-check": 4
  },
  "crons": {
    "collect-metrics": { "lastRun": "2026-03-22T08:00:01Z", "stale": true },
    "budget-check": { "lastRun": "2026-03-22T08:00:03Z", "stale": true }
  }
}

Slack alerts

When cron staleness is detected, cf-monitor sends a Slack alert:

:warning: cf-monitor self-check: stale crons detected — collect-metrics, budget-check

Alerts are deduplicated once per day (self:stale:{date} key, 24-hour TTL) — you won’t be spammed if the same handlers stay stale.

Self-telemetry in Analytics Engine

Every handler invocation writes a data point to AE with a special format:

blob1: 'cf-monitor' (worker name)
blob2: 'self:{durationMs}:{1|0}' (e.g. self:250:1 for a 250ms success)
blob3: handler name (e.g. scheduled:budget-check, tail, fetch)
doubles[0]: 1 (invocation count)
index: cf-monitor:self:{handlerName}

Querying self-telemetry

Invocation counts per handler:

SELECT blob3 AS handler, count() AS invocations
FROM "cf-monitor"
WHERE blob2 LIKE 'self:%'
GROUP BY handler
ORDER BY invocations DESC

Average duration per handler (last 24 hours):

SELECT
  blob3 AS handler,
  count() AS invocations,
  AVG(CAST(SPLIT(blob2, ':')[2] AS INT)) AS avg_duration_ms
FROM "cf-monitor"
WHERE blob2 LIKE 'self:%'
  AND timestamp > NOW() - INTERVAL '24' HOUR
GROUP BY handler

Error rate (success=0 invocations):

SELECT blob3 AS handler, count() AS errors
FROM "cf-monitor"
WHERE blob2 LIKE 'self:%:0'
GROUP BY handler

KV state

Self-monitoring uses three KV key patterns, all with 48-hour TTL:

Key	Value	Purpose
`self:v1:cron:last_run`	JSON blob	All handler timestamps in a single key
`self:v1:error:{handler}:{YYYY-MM-DD}`	Integer string	Per-handler daily error count
`self:v1:errors:count:{YYYY-MM-DD}`	Integer string	Total daily error count

The single JSON blob for cron timestamps minimises KV writes (1 write per handler execution vs 1 per handler).

Manually triggering staleness check

curl -X POST https://cf-monitor.YOUR_SUBDOMAIN.workers.dev/admin/cron/staleness-check \
  -H "Authorization: Bearer YOUR_ADMIN_TOKEN"

This runs the staleness detection logic immediately and sends a Slack alert if any handlers are stale.

Cost impact

Self-monitoring adds approximately:

~115 KV writes/day — cron timestamps + error counters across all handlers
~150 KV reads/day — health checks and staleness detection
~310 AE writes/day — self-telemetry data points (free tier)

Total: ~265 KV operations/day, well under the 1,000 KV ops/day budget.

Troubleshooting

/self-health returns 503 with stale crons: See Troubleshooting — Self-monitoring shows stale crons.

Error counts seem high: Check wrangler tail cf-monitor for the handler name that’s erroring. Common causes: expired API token, GitHub rate limit, Slack webhook revoked.

No self-telemetry in AE: Self-telemetry uses the same CF_MONITOR_AE binding as SDK telemetry. If AE writes work for consumer workers but not self-telemetry, check wrangler tail cf-monitor for [cf-monitor:self] warning messages.

/self-health shows all lastRun: null: Normal after first deploy. Wait for each handler’s schedule to fire (up to 24 hours for daily handlers). To accelerate, trigger handlers manually via admin endpoints.