AOF SLA Guide
Audience: Project leads, PMs, SREs
Scope: SLA configuration, alerting, and tuning (Phase 1.5)
Overview
AOF’s SLA primitive detects in-progress tasks that exceed expected duration and emits advisory alerts. Phase 1.5 is alert-only — no blocking or automatic transitions.
Defaults (Phase 1.5):
- Normal tasks: 1 hour (
3600000ms) - Research tasks: 4 hours (
14400000ms) whenrouting.agent: swe-researcher - Alert rate limit: 1 per task per 15 minutes
1. SLA Configuration Hierarchy
SLA limits are resolved in this order:
- Per-task override (task frontmatter)
- Per-project defaults (
org/org-chart.yaml→aof.projects.<projectId>.sla) - Global fallback (1 hour)
2. Per-Task Overrides
Add SLA overrides in task frontmatter:
---id: AOF-123title: Deep research spikestatus: in-progressrouting: agent: swe-researchersla: maxInProgressMs: 14400000 # 4 hours onViolation: alert # Phase 1: only 'alert' supported---Validation rules:
maxInProgressMsmust be between 1 minute and 24 hoursonViolationmust bealert(Phase 1 constraint)
3. Project Defaults (org/org-chart.yaml)
Per-project defaults live in the org chart under aof.projects.<projectId>.sla:
aof: projects: backend: sla: defaultMaxInProgressMs: 3600000 # 1 hour researchMaxInProgressMs: 14400000 # 4 hours (applies to routing.agent: swe-researcher) onViolation: alert # Phase 1: advisory only alerting: rateLimitMinutes: 154. onViolation Modes
onViolation controls what happens when a task exceeds its SLA:
alert— emit an advisory alert (Phase 1.5 default and only supported mode)block— transition task toblocked(Phase 2)deadletter— transition task todeadletter(Phase 2)
⚠️ Phase 1 constraint: block and deadletter are rejected by validation in Phase 1.5.
5. How SLA Checks Work
- Scheduler checks every poll cycle (default 30s)
- Duration is computed as:
now - task.updatedAt - If duration > limit →
sla.violationevent + console alert - Alerts are rate-limited to 1 per task per 15 minutes
6. Alert Format (Console)
[AOF] SLA VIOLATION: Task AOF-123 (Implement auth middleware)[AOF] SLA VIOLATION: Duration: 1.3h (limit: 1.0h)[AOF] SLA VIOLATION: Agent: swe-backend[AOF] SLA VIOLATION: Action: Check if agent is stuck or task needs SLA overridePhase 2: External alert channels (Slack/Discord/email) are planned.
7. Querying SLA Violations
Use event logs to find violations:
# Tail latest eventsLATEST=$(ls -t ~/.openclaw/aof/events/*.jsonl 2>/dev/null | head -1)tail -f "$LATEST"
# Filter SLA violationscat "$LATEST" | jq 'select(.type == "sla.violation")'Example event:
{ "timestamp": "2026-02-14T09:00:00.000Z", "type": "sla.violation", "actor": "scheduler", "taskId": "AOF-123", "payload": { "duration": 4500000, "limit": 3600000, "agent": "swe-backend", "timestamp": 1707901200000 }}8. Tuning SLAs
Start conservative:
- Keep 1h/4h defaults initially
- Observe alerts over 1–2 weeks
- Adjust based on real duration data
Red flags:
- Too many alerts → limits too aggressive
- No alerts ever → limits too generous
- Alerts for expected long tasks → add per-task overrides
Example: increase backend defaults to 2 hours
sla: defaultMaxInProgressMs: 7200000 # 2 hours9. Troubleshooting
SLA alerts too noisy:
- Increase project defaults
- Add per-task overrides
- Verify task is correctly marked
in-progress
No SLA alerts at all:
- Confirm daemon is running
- Confirm tasks are in
in-progress - Check event logs for
scheduler.poll
References
- Recovery runbook:
docs/RECOVERY-RUNBOOK.md - SLA design:
docs/SLA-PRIMITIVE-DESIGN.md - Event logs:
docs/event-logs.md