OpenWatch monitoring and operations guide
Updated 2026-06-10
Last updated: 2026-06-10
This guide describes how you monitor a running OpenWatch deployment and how you
respond to common operational incidents. OpenWatch ships as a single Go binary
(/usr/bin/openwatch) that serves the REST API and the embedded React UI over
HTTPS on port 8443, backed by PostgreSQL and managed by systemd. There is no
container runtime, no separate web tier, and no message broker.
For installation, configuration layering, and first-run setup, see
docs/guides/INSTALLATION.md. This guide does not repeat those steps; it
focuses on observing the service and running it day to day.
Contents
- What you can observe today
- Health and version endpoints
- Logs via journald
- Audit events
- Fleet and connectivity signals
- Service lifecycle
- Operational runbooks
- Not yet implemented
1. What you can observe today
OpenWatch exposes operational signals through four channels:
| Channel | Source | Authentication |
|---|---|---|
| Health probe | GET /api/v1/health | None |
| Version metadata | GET /api/v1/version | None |
| Structured logs | systemd journal (journalctl -u openwatch) | Host access |
| Audit and fleet APIs | GET /api/v1/audit/events, /api/v1/fleet/*, /api/v1/system/connectivity/status | Bearer token |
OpenWatch does not currently expose a Prometheus /metrics endpoint and does
not ship a Prometheus, Grafana, Jaeger, or exporter stack. See
Not yet implemented.
2. Health and version endpoints
Health probe
The health endpoint is anonymous and is the right target for an external uptime
check or a load-balancer probe. It is implemented in
internal/server/handlers.go (GetHealth) and pings PostgreSQL with a
two-second timeout.
curl -k https://localhost:8443/api/v1/healthA healthy response returns 200 OK:
{"status": "healthy", "db_connected": true, "version": "0.2.0-rc.13"}When the database ping fails, the endpoint returns 503 Service Unavailable
with an error envelope. Treat a non-200 status, or a connection failure, as
service-down.
The response schema (status, db_connected, version) is defined in
api/openapi.yaml under HealthResponse. The current contract reports only a
binary healthy/degraded status driven by database reachability.
Version metadata
The version endpoint is also anonymous and reports build metadata sourced from
ldflags and Go build info (internal/server/handlers.go, GetVersion):
curl -k https://localhost:8443/api/v1/version{
"openwatch": "0.2.0-rc.13",
"kensa": "<embedded engine version>",
"go": "<go toolchain>",
"commit": "<abbrev commit>",
"build_time": "<ISO-8601 build timestamp>"
}Use this to confirm which build is running after an upgrade. The same metadata
prints from the CLI with openwatch --version.
3. Logs via journald
The systemd unit (packaging/common/openwatch.service) sends both stdout and
stderr to the journal. With format = "json" in [logging] (the packaged
default in packaging/common/openwatch.toml), every line is a structured JSON
record that carries a correlation ID.
sudo journalctl -u openwatch -f # tail live
sudo journalctl -u openwatch --since '15 min ago' # recent window
sudo journalctl -u openwatch -o cat | jq . # pretty-print JSON
sudo journalctl -u openwatch -p err --since today # errors onlyTo trace one request or one boot across log lines, filter on its correlation ID:
sudo journalctl -u openwatch -o cat | jq 'select(.correlation_id == "<id>")'Set level = "debug" in [logging] (or pass --log-level debug, or set
OPENWATCH_LOGGING_LEVEL=debug) to raise verbosity, then restart the service.
Log level precedence follows the standard config layering documented in
docs/guides/INSTALLATION.md.
4. Audit events
Every server action that mutates state, authenticates, or authorizes emits a
row to the audit_events PostgreSQL table (migrations 0001_initial.sql and
0002_audit_events_taxonomy.sql). This is the durable record for security
review; the journal is the operational record.
Query audit events through the API (requires a bearer token with the appropriate permission):
curl -k -H "Authorization: Bearer $TOKEN" \
'https://localhost:8443/api/v1/audit/events'The endpoint (getAuditEvents in api/openapi.yaml) is cursor-paginated. For
direct inspection during an incident you can also read the table with psql:
psql "$OPENWATCH_DATABASE_DSN" -c \
"SELECT recorded_at, action, severity, actor_type, actor_id, outcome
FROM audit_events
ORDER BY recorded_at DESC
LIMIT 50;"Indexed columns include recorded_at, action, severity, and
(actor_type, actor_id), so filtered queries on those fields stay fast. For the
event taxonomy (action names and severities), see
docs/engineering/audit_event_taxonomy.md.
5. Fleet and connectivity signals
OpenWatch continuously probes managed hosts (the liveness loop wired in
cmd/openwatch/main.go). These endpoints expose the resulting fleet state and
require a bearer token:
| Endpoint | Reports |
|---|---|
GET /api/v1/fleet/liveness | Counts: reachable, unreachable, unknown, never_probed |
GET /api/v1/fleet/connectivity/breakdown | 4-state breakdown: online, degraded, critical, down, never_probed |
GET /api/v1/system/connectivity/status | In-process connectivity-monitor metrics and the maintenance flag |
curl -k -H "Authorization: Bearer $TOKEN" \
https://localhost:8443/api/v1/fleet/livenessA rising unreachable/down count is a useful early signal that either the
monitored fleet or the OpenWatch host's network path is degrading. The schemas
(FleetLiveness, ConnectivityBreakdown) are defined in api/openapi.yaml.
6. Service lifecycle
OpenWatch runs as the openwatch.service systemd unit, which executes
openwatch serve --config /etc/openwatch/openwatch.toml.
sudo systemctl status openwatch # current state
sudo systemctl restart openwatch # restart
sudo systemctl stop openwatch # stop
sudo systemctl enable --now openwatch # start now and at bootBefore restarting after a config change, validate the resolved configuration:
sudo -u openwatch openwatch check-config --config /etc/openwatch/openwatch.tomlOther CLI subcommands (cmd/openwatch/main.go): migrate applies pending
database migrations, create-admin bootstraps the first admin user, and
worker runs the background scan-job loop. The packaged systemd unit runs only
serve; the in-process schedulers and liveness loop run inside the serve
process.
7. Operational runbooks
These runbooks assume the single binary on systemd with a PostgreSQL backend. Run the commands from the OpenWatch host unless noted.
SERVICE_DOWN
The service is unreachable or GET /api/v1/health does not return 200.
-
Check the unit state and recent errors:
sudo systemctl status openwatch sudo journalctl -u openwatch --since '10 min ago' -p err -
Confirm the local probe:
curl -k https://localhost:8443/api/v1/health -
If the journal shows a database ping failure (for example
db: ping: ... connection refused), check PostgreSQL:sudo systemctl status postgresql psql "$OPENWATCH_DATABASE_DSN" -c 'SELECT 1;' -
If the config is suspect, validate it before restarting:
sudo -u openwatch openwatch check-config --config /etc/openwatch/openwatch.toml -
Restart and confirm recovery:
sudo systemctl restart openwatch curl -k https://localhost:8443/api/v1/health
The unit is configured with Restart=on-failure and RestartSec=5s, so a
crashing process restarts automatically; persistent restart loops show up in
systemctl status as repeated restarts and warrant the steps above.
DISK_FULL
Disk pressure on the OpenWatch or PostgreSQL data volume.
-
Find what is full:
df -h sudo du -xh /var/log/openwatch /var/lib/openwatch | sort -h | tail -
The journal is a common consumer. Inspect and cap it:
journalctl --disk-usage sudo journalctl --vacuum-time=7d # drop entries older than 7 days sudo journalctl --vacuum-size=500M # or cap total size -
Check the PostgreSQL data directory and database size:
psql "$OPENWATCH_DATABASE_DSN" -c \ "SELECT pg_size_pretty(pg_database_size(current_database()));"The
audit_eventstable grows over time. Confirm its size before pruning, and follow your retention policy:psql "$OPENWATCH_DATABASE_DSN" -c \ "SELECT pg_size_pretty(pg_total_relation_size('audit_events'));" -
After freeing space, confirm the service is healthy (
curl -k https://localhost:8443/api/v1/health).
OpenWatch does not currently rotate or prune audit_events automatically; apply
your own retention if the table dominates database size.
HIGH_CPU
The OpenWatch process is consuming excessive CPU.
-
Confirm which process and how much:
top -b -n1 | head -20 sudo systemctl status openwatch # shows the main PID -
Correlate with request and scan activity in the journal:
sudo journalctl -u openwatch --since '15 min ago' -o cat | jq -r '.msg' | sort | uniq -c | sort -rn | head -
Check whether background work is driving load. The liveness loop and the intelligence and discovery schedulers run inside
serve. If a scheduler is misconfigured, pause it via its config endpoint, for example:curl -k -H "Authorization: Bearer $TOKEN" \ https://localhost:8443/api/v1/system/intelligence/config -
Check PostgreSQL for long-running or stuck queries:
psql "$OPENWATCH_DATABASE_DSN" -c \ "SELECT pid, now()-query_start AS runtime, state, left(query,80) FROM pg_stat_activity WHERE state <> 'idle' ORDER BY runtime DESC NULLS LAST LIMIT 10;" -
If the process is wedged rather than merely busy, capture the journal context first, then
sudo systemctl restart openwatch.
SECURITY_INCIDENT
Suspected unauthorized access, credential misuse, or anomalous authorization failures.
-
Pull recent authentication and authorization events from the audit log. These are the durable security record:
psql "$OPENWATCH_DATABASE_DSN" -c \ "SELECT recorded_at, action, severity, actor_type, actor_id, outcome FROM audit_events WHERE severity IN ('warning','critical') OR action LIKE 'auth.%' ORDER BY recorded_at DESC LIMIT 100;" -
Cross-reference with the journal for the same window, filtering by correlation ID where you have one:
sudo journalctl -u openwatch --since '1 hour ago' -o cat | jq 'select(.level=="WARN" or .level=="ERROR")' -
If you must contain immediately, stop the service to halt all access while you investigate:
sudo systemctl stop openwatch -
Rotate any potentially exposed secrets in
/etc/openwatch/secrets.env(for exampleOPENWATCH_DATABASE_DSN) and the keys under/etc/openwatch/, then restart. Preserve the journal and a copy of relevantaudit_eventsrows before you prune anything.
For role and permission definitions referenced by audit action/actor fields,
see docs/engineering/rbac_registry.md.
8. Not yet implemented
The following observability capabilities described in earlier (Python-era) documentation do not exist in the current Go build. They are recorded here so operators do not look for them:
- Prometheus
/metricsendpoint — not exposed. The only health signal isGET /api/v1/health. - Bundled monitoring stack (Prometheus, Grafana, Jaeger, Alertmanager,
node/redis/postgres exporters, cAdvisor) — not shipped. There is no
monitoring/Compose stack and no container runtime in this architecture. - Distributed tracing — not implemented. Correlation IDs in the JSON logs are the current mechanism for following a request across log lines.
- Detailed authenticated health endpoints (per-service, content, history) —
not implemented. The current health contract is a single binary
healthy/degradedstatus.
If and when metrics or tracing land, this section and the contract in
api/openapi.yaml will be updated together.