Install: claude install-skill ryukyagamilight/terminal-skills
# Monitoring and Alerting
## Overview
Skills for working with Prometheus, Grafana, and alerting rule configuration.
## Prometheus
### Basic Queries (PromQL)
```promql
# Instant vector
http_requests_total
http_requests_total{job="api", status="200"}
# Range vector
http_requests_total[5m]
# Offset (value as of 1 hour ago)
http_requests_total offset 1h
# Aggregation
sum(http_requests_total)
sum by (job) (http_requests_total)
sum without (instance) (http_requests_total)
# Per-second rate (rate: average over the window; irate: last two samples only)
rate(http_requests_total[5m])
irate(http_requests_total[5m])
# Increase over the window
increase(http_requests_total[1h])
# Histogram quantile
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
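To make the relationship between `increase()` and `rate()` concrete, here is a minimal Python sketch of the underlying idea: sum the deltas between consecutive counter samples, treat any drop as a counter reset, and divide by the elapsed time. It is a simplification, not Prometheus's implementation, since the real functions also extrapolate to the window boundaries. The names `counter_increase` and `simple_rate` are hypothetical.

```python
from typing import List, Tuple

# One sample: (unix timestamp in seconds, counter value). Hypothetical type alias.
Sample = Tuple[float, float]


def counter_increase(samples: List[Sample]) -> float:
    """Total increase across the samples, treating any drop as a counter reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts from 0, so the whole
        # current value counts as new increase.
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase between the first and last sample.

    Unlike PromQL's rate(), this does not extrapolate to the window edges.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return counter_increase(samples) / elapsed


if __name__ == "__main__":
    # The counter resets between t=15 and t=30.
    window = [(0.0, 100.0), (15.0, 130.0), (30.0, 10.0), (45.0, 40.0)]
    print(counter_increase(window))
    print(simple_rate(window))
```

Because resets are absorbed this way, `rate()` and `increase()` should only be applied to counters, not gauges.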
### Common Queries
```promql
# CPU usage (%)
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk usage (%)
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100
# Network traffic (bytes per second)
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
# HTTP request rate by status
sum(rate(http_requests_total[5m])) by (status)
# Error rate (fraction of 5xx responses)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# P99 latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
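The P99 latency query relies on `histogram_quantile`, which estimates a quantile from cumulative `le` buckets by linear interpolation inside the bucket that contains the target rank. The Python sketch below illustrates that idea under simplifying assumptions: it handles only `0 < q <= 1`, assumes the lowest bucket starts at 0, and is not a copy of the Prometheus implementation. The function name and the sample bucket values are made up for illustration.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (le upper bound, cumulative count) pairs.

    The bucket list must include the +Inf bucket, as Prometheus histograms do.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus the +Inf bucket")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations

    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]

    upper, count = buckets[idx]
    lower, prev_count = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    # Linear interpolation inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


if __name__ == "__main__":
    latency_buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)]
    print(histogram_quantile(0.95, latency_buckets))
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.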
### Configuration File
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
rule_files:
  - "rules/*.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'