Install: claude install-skill proyecto26/system-design-skills
# Observability
Decide *what to measure* so a system can be seen, alerted on, and debugged in
production. Getting this wrong is failure mode #6 — ignoring failure: a design
that works on the whiteboard but goes dark under load, where the first signal of
an outage is a user complaint instead of a page.
## When to reach for this
Any production design needs an answer to "how would we know this broke, and how
fast?" Reach for this when defining what the system measures, what pages a human,
what an acceptable level of service is (SLO), or how a request is traced across
services. It is the design move that makes every *other* block's stress section
real — you cannot mitigate a thundering herd or a hot shard you can't see.
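As a rough illustration of the "acceptable level of service" question, the sketch below checks a measured success ratio against an SLO target. The function names (`sli`, `allowed_failures`, `slo_met`) are hypothetical and not part of this skill. The counts of good and total events are assumed to come from whatever metrics system is already in place.

```python
import math


def sli(good_events: int, total_events: int) -> float:
    """Fraction of events that met the user-facing definition of 'good'."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def allowed_failures(slo_target: float, total_events: int) -> int:
    """How many events may fail in the window before the SLO is missed."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    # round() trims float noise (1 - 0.999 is not exactly 0.001) before flooring.
    return math.floor(round((1.0 - slo_target) * total_events, 6))


def slo_met(slo_target: float, good_events: int, total_events: int) -> bool:
    """True if the observed failures fit inside what the target allows."""
    failed = total_events - good_events
    return failed <= allowed_failures(slo_target, total_events)
```

With a 99.9% target and 1,000,000 requests in the window, `allowed_failures(0.999, 1_000_000)` returns `1000`. The window length and the definition of a "good" event are design decisions this sketch leaves open.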
## When NOT to
Do not build a full metrics-logs-traces stack for a prototype or an internal tool
with no users to disappoint (YAGNI) — a health check and error logging are
enough. Do not invent SLOs nobody will defend, or wire alerts before knowing the
symptom that matters; an alert with no owner and no runbook is noise that trains
the team to ignore pages. This skill owns *what* to measure and alert on; the
**high-volume log pipeline** (collect → buffer → ship → index → retain) lives in
`distributed-logging` — summarize and link, don't rebuild it here.
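For the prototype case above, "a health check and error logging" can be as small as the following sketch, which uses only the Python standard library. The `/healthz` path, the port, the logger name, and the `check_dependency` hook are assumptions chosen for illustration, not conventions defined by this skill.

```python
import json
import logging
from http.server import BaseHTTPRequestHandler, HTTPServer

# Error logging: anything at ERROR or above goes to stderr with a timestamp.
logging.basicConfig(
    level=logging.ERROR,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("internal-tool")  # hypothetical service name


def health_status(check_dependency) -> tuple[int, dict]:
    """Run one dependency check and map the outcome to an HTTP status."""
    try:
        check_dependency()
        return 200, {"status": "ok"}
    except Exception:
        # log.exception records the traceback alongside the message.
        log.exception("health check failed")
        return 503, {"status": "unhealthy"}


class HealthHandler(BaseHTTPRequestHandler):
    # Replace with a real probe (e.g. a cheap database query); no-op by default.
    check_dependency = staticmethod(lambda: None)

    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        code, body = health_status(self.check_dependency)
        payload = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(host: str = "127.0.0.1", port: int = 8080) -> None:
    """Block forever, answering health checks on the given address."""
    HTTPServer((host, port), HealthHandler).serve_forever()
```

Calling `serve()` starts the endpoint. Keeping `health_status` separate from the handler lets the check be tested without opening a socket.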
## Clarify first
- **What is "healthy" from a user's view?** The symptom that defines a bad
experience (slow checkout, failed upload) — alerts target this, not CPU.
- **SLO target and window?** e.g. 99.9% of