cluster-operations

Solid

Day-2 cluster operations — node management, etcd backup/restore, certificate rotation, namespace lifecycle.

AI & Automation 14 stars 3 forks Updated 3 days ago MIT

Quality Score: 86/100

Stars (20%): 39
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 80
License (10%): 100
Description (5%): 100

Skill Content

# Skill: Cluster Operations

> **Expertise:** Safe day-2 operations on self-hosted Kubernetes clusters: node drain, etcd ops, cert rotation.

## When to load

When draining nodes for maintenance, rotating certificates, backing up etcd, or troubleshooting control plane issues.

## Node Lifecycle Operations

```bash
# --- CORDON (stop scheduling new pods, don't evict existing) ---
kubectl cordon <node-name>
# Use case: pre-drain notification, temporary maintenance hold

# --- DRAIN (evict all pods, mark unschedulable) ---
# Flag notes are kept above the command: a comment placed after a
# trailing backslash would break the line continuation.
#   --ignore-daemonsets      DaemonSet pods can't be evicted
#   --delete-emptydir-data   required for pods using emptyDir
#   --grace-period=60        give pods time to shut down cleanly
#   --timeout=300s           abort if the drain takes > 5 minutes
kubectl drain <node-name> \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=300s
# After drain: node is unschedulable and empty (except daemonsets)

# --- UNCORDON (return to service) ---
kubectl uncordon <node-name>

# --- Verify node is empty before maintenance ---
kubectl get pods -A --field-selector=spec.nodeName=<node-name>
```

## etcd Backup (bare-metal / kubeadm)

```bash
# --- Take snapshot (run on a control plane node) ---
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

# --- Verify snapshot ---
ETCDCTL_API=3 etcd...
```
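The cordon, drain, and verify steps from Node Lifecycle Operations can be chained into one helper so that a failed step stops the sequence. The following is a minimal sketch, not part of the skill itself: the function name `drain_for_maintenance` and the `KUBECTL` override variable are hypothetical, and the flag values are simply the ones shown in the drain example.

```bash
#!/usr/bin/env bash
# Hypothetical helper; KUBECTL and drain_for_maintenance are illustrative names.
# KUBECTL can be overridden (for example with a wrapper or a test stub).
KUBECTL="${KUBECTL:-kubectl}"

drain_for_maintenance() {
  local node="${1:-}"
  if [ -z "$node" ]; then
    echo "usage: drain_for_maintenance <node-name>" >&2
    return 2
  fi

  # Stop new pods from being scheduled onto the node.
  "$KUBECTL" cordon "$node" || return 1

  # Evict workloads; stop here if the drain fails or times out.
  "$KUBECTL" drain "$node" \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --grace-period=60 \
    --timeout=300s || return 1

  # List what is still on the node (DaemonSet pods are expected to remain).
  "$KUBECTL" get pods -A --field-selector="spec.nodeName=${node}"
}
```

A call such as `drain_for_maintenance worker-1` (the node name is a placeholder) leaves the node cordoned whether or not the drain succeeds; returning it to service is still a separate `kubectl uncordon`.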

Details

Author: sawrus
Repository: sawrus/agent-guides
Created: 3 months ago
Last Updated: 3 days ago
Language: Shell
License: MIT

Similar Skills

Semantically similar based on skill content — not just same category

DevOps & Infrastructure Listed

operating-kubernetes

Operating production Kubernetes clusters effectively with resource management, advanced scheduling, networking, storage, security hardening, and autoscaling. Use when deploying workloads to Kubernetes, configuring cluster resources, implementing security policies, or troubleshooting operational issues.

368 Updated 5 months ago
ancoleman
AI & Automation Listed

openstack-backup

OpenStack backup operations skill for protecting cloud infrastructure through systematic backup strategies and disaster recovery procedures. Covers database backups (MariaDB full and incremental with mariabackup), configuration backups (globals.yml, inventory, Fernet keys), volume snapshots (Cinder LVM snapshots), image exports (Glance), instance snapshots (Nova), backup encryption (GPG/OpenSSL), retention policies (daily/weekly/monthly rotation), restore procedures (database point-in-time recovery, service rebuild), RPO/RTO planning, and disaster recovery drills. Use when planning backup strategy, scheduling automated backups, testing restore procedures, or executing disaster recovery.

62 Updated today
Tibsfox
DevOps & Infrastructure Solid

kubernetes-ops

Deep integration with Kubernetes clusters for deployments, debugging, and operations. Execute kubectl commands, analyze pod logs/events/resources, generate and validate manifests, and debug cluster issues.

1,034 Updated today
a5c-ai
DevOps & Infrastructure Listed

openstack-kolla-ansible-ops

Kolla-Ansible day-2 operations skill for post-deployment infrastructure lifecycle management. Covers service reconfiguration (globals.yml changes, config overrides, prechecks, targeted reconfigure with --tags), minor and major OpenStack upgrades (image pull, upgrade procedure, rollback), container management (restart, logs, health inspection), maintenance mode (compute disable, instance drain, host maintenance), password rotation, certificate renewal, and rolling updates. This skill is for operations after initial deployment -- the kolla-ansible deployment skill covers initial bootstrap and deploy.

62 Updated today
Tibsfox
DevOps & Infrastructure Solid

eks

AWS EKS Kubernetes management for clusters, node groups, and workloads. Use when creating clusters, configuring IRSA, managing node groups, deploying applications, or integrating with AWS services.

1,111 Updated 5 days ago
itsmostafa