Grafana

otel-queries

Analyze gh-aw OpenTelemetry traces from JSONL mirrors or OTLP backends.

4,811 Updated today

github

devops

DevOps - Docker, CI/CD, cloud infra, monitoring.

736 Updated today

sipyourdrink-ltd

observability-monitoring

Design, audit, and troubleshoot production monitoring and observability using user-impact checks, layered telemetry, USE/RED, SLI/SLO/SLA, error budgets, cardinality controls, actionable alerting, burn-rate response, and postmortems. Use when asked about monitoring, наблюдаемость (observability), алерты (alerts), Prometheus, Grafana, OpenTelemetry, logs, traces, profiles, service health, or incident evidence. Do not use for generic dashboard styling, frontend-only UI work, or unrelated code review.

138 Updated today

AnastasiyaW

monitoring-alerting

Monitoring and alerting design reviewer for production backend services. ALWAYS use when writing Prometheus alerting rules, designing Grafana dashboards, defining SLI/SLO, configuring alert routing (PagerDuty/OpsGenie/Slack), or reviewing existing monitoring setups. Covers SLI/SLO definition, alert rule quality (sensitivity/specificity tradeoff), burn-rate alerting, alert fatigue prevention, dashboard design principles, label cardinality management, and on-call routing configuration. Use even for "just add an alert" — a poorly designed alert either pages at 3AM for non-issues (alert fatigue) or stays silent during real outages (false confidence).

27 Updated yesterday

johnqtcg

Data & Documents Solid

grafana-foundation-sdk

Build Grafana dashboards as code with the grafana-foundation-sdk typed builders (TypeScript or Go). Use when creating, modifying, or generating Grafana dashboard JSON programmatically, converting hand-written dashboard JSON to typed code, building monitoring dashboards, or working with Prometheus/Loki queries in dashboards.

31 Updated 3 days ago

tenequm

Lifecycle-Innovations-Limited

ops-monitor

Unified APM and monitoring surface. Polls Datadog, New Relic, and OpenTelemetry backends for active alerts, error traces, and entity health. Use --watch for live polling every 60 seconds. Use --setup to configure monitoring credentials.

20 Updated yesterday

llm-self-loop

Restructures human-gated workflows into autonomous LLM loops with file-based outputs. Use when a task needs a button click, dashboard check, or human verdict inside its iteration loop.

33 Updated today

OutlineDriven

hunt-cloud-misconfig

Hunt cloud / infrastructure misconfigurations. AWS: public S3 buckets (s3:GetObject anonymous), permissive bucket policies (PutObjectAcl public-write), exposed CloudFront origin, public Lambda function URL, public RDS snapshot, IAM credentials in JS bundles, AWS metadata accessible via SSRF. GCP: public GCS buckets, exposed Cloud Run services, leaked service account JSON. Azure: public blob containers, exposed Function App. (Kubernetes/Docker exposure is owned by hunt-k8s; CI/CD pipeline attacks by hunt-cicd; post-credential IAM escalation by cloud-iam-deep.) Detection: targeted dorking, certificate transparency, JS bundle secret extraction, port scan for known service ports. Validate: actual data read / write / RCE. Use when hunting cloud-native storage and compute misconfig (S3/GCS/Blob, IMDS-via-SSRF, serverless, public managed services).

3,176 Updated 4 days ago

elementalsouls

smithers-observability

Start the local observability stack (Grafana, Prometheus, Tempo, OTLP Collector) via Docker Compose. Run `smithers observability --help` for usage details.

338 Updated today

smithersai

promql-cli

CLI for querying Prometheus and PromQL-compatible engines (Thanos, Cortex, VictoriaMetrics, Grafana Mimir, Grafana Tempo...): instant queries, range queries, metric discovery (metrics/labels/meta subcommands), and output formats (table/csv/json/graph). Apply when executing PromQL queries, troubleshooting performance issues in software that exposes observability data, investigating latency, error rates, or saturation, or analyzing time series data.

2 Updated today

hssh8917

Data & Documents Solid

build-grafana-dashboards

Create production-ready Grafana dashboards with reusable panels, template variables, annotations, and provisioning for version-controlled dashboard deployment. Use when creating visual representations of Prometheus, Loki, or other data source metrics, building operational dashboards for SRE teams, migrating from manual dashboard creation to version-controlled provisioning, or establishing executive-level SLO compliance reporting.

26 Updated today

pjt222

monitoring-expert

Use when setting up monitoring systems, logging, metrics, tracing, or alerting. Invoke for dashboards, Prometheus/Grafana, load testing, profiling, capacity planning.

4 Updated today

zacklecon

k8s-components-checker

Survey an RKE2 community cluster against an embedded compatibility registry of 19 stack components and produce a verdict for upgrade-readiness, drift-review, and version-skew questions. Components: RKE2, Rancher, Harvester, Cilium, Tetragon, cert-manager, Kyverno, KEDA, Argo CD, Harbor, Traefik, Rook, Ceph, OpenEBS, GitLab, ECK, Zalando postgres-operator, Grafana Mimir, NVIDIA GPU Operator. Works air-gapped — compatibility data lives in `references/compat/`. Surveys run via `kubectl` + `helm` + `pluto` + the apiserver `apiserver_requested_deprecated_apis` metric from the operator's workstation. Community editions only — Prime/EE-gated content is ignored. NOT for installing components, NOT for executing upgrades, NOT for tracking per-cluster running state (the registry is methodology, not inventory).

mimir-upgrade

Plan and run a controlled, COMMUNITY-edition Grafana Mimir upgrade on the `mimir-distributed` Helm chart, air-gap first — the chart↔app co-pinned ladder (5.7→5.8→6.0.6→6.1.0 = app 2.16→2.17→3.0.4→3.1.2), the classic-vs-ingest-storage decision (the chart ships a supported `classic-architecture.yaml`; `kafka.enabled: false` alone is NOT the switch and causes an ingestion outage), the community-specific nginx→gateway rename that silently moves the proxy's DNS name and breaks every remote_write client, the silent-no-op vs crashloop asymmetry between stale chart keys and stale app config, rollout-operator sequencing and the abort levers that deadlock a namespace, per-hop verification, and air-gap image/CRD/egress work. Companion to k8s-components-checker.

prometheus-mimir-grafana

Query Prometheus and Grafana Mimir, write and debug PromQL, and build or fix Grafana dashboards — for agents solving problems from metrics. Covers the Prometheus HTTP API (`/api/v1/query`, `query_range`, `series`, `labels`, `metadata`), Mimir multi-tenancy (`X-Scope-OrgID`, federation `a|b|c`, per-tenant 422/429 limits), the PromQL surface (selectors, rate family, classic + native histograms, `histogram_quantile`, vector matching `on()`/`group_left`, recording rules), Grafana dashboard JSON (panels, targets, variables + interpolation specifiers, legacy `/api/dashboards/db` vs Grafana-12 `/apis/dashboard.grafana.app/v1beta1/…`), KPI frameworks (RED, USE, Golden Signals, SLO burn-rate), connection recipes, MCP servers vs curl, and the PromQL trap list.

snmp-exporter

Best practices for Prometheus snmp_exporter (v0.30.x): writing generator.yml modules, curating MIB walks, SNMPv2c/v3 auth, timeout tuning, Kubernetes deployment (Probe/ScrapeConfig CRDs, secrets, UDP egress), local docker testing, and debugging failed scrapes. Includes worked device references for Dell iDRAC 9/10, Cisco CBS250/350 (+ Catalyst 1200/1300), and NVIDIA/Mellanox Onyx switches.

vllm-observability

Observe production vLLM — `/metrics` Prometheus surface (V1 engine), SLO-driven alerting on TTFT/ITL/queue/KV/preemption/aborts/corrupted-logits, shipping Grafana dashboards in `examples/observability/`, OTLP tracing with `--otlp-traces-endpoint` and `--collect-detailed-traces={model,worker,all}`, diagnostic rules to triage from /metrics alone — queue-grows + TPOT-stable means capacity, queue-stable + TPOT-grows means context/model, DCGM `SM_OCCUPANCY` is the real GPU-saturation signal not `GPU_UTIL`. V1 metric names (kv_cache_usage_perc), gpu_→kv_ rename saga, DCGM-exporter pairing, dashboard-lying pitfalls.

platform-skills

Use when troubleshooting, implementing, reviewing, or auditing platform infrastructure as a system — where Kubernetes, GitOps, CI/CD, and security concerns intersect. Provides structured diagnosis with blast radius, validation steps, and rollback plan for: Kubernetes, Flux CD, Argo CD, Terraform, GitHub Actions (composite actions, OIDC, SHA pinning), AWS, Azure, GKE, Linkerd, KEDA, Karpenter, supply chain security (Cosign, SBOM, SLSA), Falco, Chaos Engineering, DORA metrics, Datadog/Dynatrace/LLM observability, SOC 2, and PR review.

36 Updated today

nitinjain999