reliability-improvement-plan

Solid

Identify single points of failure, assess recovery capabilities, and produce a prioritized remediation plan aligned with the Well-Architected Reliability pillar.

AI & Automation 141 stars 21 forks Updated yesterday MIT-0

Quality Score: 86/100

Skill Content

# Reliability Improvement Plan

## Step 1: Gather context

Ask the user:

> What workload would you like me to assess for reliability? Please share:
> - **Architecture overview** (services, regions, AZs, dependencies)
> - **Availability target** (99.9%, 99.95%, 99.99%, etc.)
> - **Recovery objectives** (RTO and RPO if defined)
> - **Past incidents** (optional: recent outages or near-misses)

If context is already provided, proceed directly.

## Step 2: Identify single points of failure

For each component, ask: "What happens if this fails?"

Classify each SPOF by severity:

- 🔴 **High Risk**: total outage or data loss if this component fails
- 🟡 **Medium Risk**: degraded experience or partial outage
- 🟢 **Low Risk**: minimal impact, graceful degradation

Check for:

- Single-AZ deployments (databases, compute, caches)
- Single-region dependencies with no failover
- Unreplicated data stores (no backups, no read replicas)
- Hard dependencies on third-party services without fallback
- Single NAT Gateway, single bastion, single load balancer
- Shared-nothing vs shared-everything bottlenecks

**If the workload is a data pipeline** (S3/Lambda/Step Functions/Glue/EMR/Kinesis/Kafka/Redshift):

- Prioritize **data durability** over compute availability; message/data loss is worse than temporary processing delay
- Check: DLQ on every async invocation (Lambda, SQS, EventBridge)
- Check: retry policies with exponential backoff and max attempts
- Check: idempotency guarantees (duplic...
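
The severity classification in Step 2 can be sketched in Python. This is a minimal illustration and not part of the skill itself: the `Component` fields, the `classify`, `findings`, and `prioritize` helpers, and the rule that an unreplicated data store counts as High Risk are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Risk(IntEnum):
    # Lower value sorts first, so High Risk items lead the remediation list.
    HIGH = 0
    MEDIUM = 1
    LOW = 2


@dataclass
class Component:
    # All fields are hypothetical; adapt them to the workload being assessed.
    name: str
    az_count: int                 # number of AZs the component runs in
    stores_data: bool = False     # True for databases, queues, object stores
    replicated: bool = False      # True if backups or replicas exist
    total_outage: bool = False    # failure takes the whole workload down
    partial_outage: bool = False  # failure degrades the experience


def classify(c: Component) -> Risk:
    """Map the answer to 'What happens if this fails?' onto a severity."""
    if c.total_outage or (c.stores_data and not c.replicated):
        return Risk.HIGH          # total outage or data loss
    if c.partial_outage:
        return Risk.MEDIUM        # degraded experience or partial outage
    return Risk.LOW               # minimal impact, graceful degradation


def findings(c: Component) -> List[str]:
    """Return which of the Step 2 checks this component trips."""
    out = []
    if c.az_count < 2:
        out.append("single-AZ deployment")
    if c.stores_data and not c.replicated:
        out.append("unreplicated data store")
    return out


def prioritize(components: List[Component]) -> List[Component]:
    """Order components by severity, then by name for a stable listing."""
    return sorted(components, key=lambda c: (classify(c), c.name))


workload = [
    Component("cache", az_count=2),
    Component("db", az_count=1, stores_data=True, total_outage=True),
    Component("nat-gateway", az_count=1, partial_outage=True),
]

for c in prioritize(workload):
    print(classify(c).name, c.name, findings(c))
```

Sorting on an `IntEnum` keeps the ordering rule next to the severity definition, so changing priorities only means reordering the enum values.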

Details

Author: aws-samples
Repository: aws-samples/sample-well-architected-skills-and-steering
Created: 1 week ago
Last Updated: yesterday
Language: Python
License: MIT-0
