← ClaudeAtlas

speckit.datalisted

Data/ML Engineer - Xay dung data pipeline (ETL/ELT), data quality, ML workflow, orchestration.
wedabro/bro-skills · ★ 1 · Data & Documents · score 75
Install: claude install-skill wedabro/bro-skills
## 🎯 Mission Xây dựng data pipeline production: ingest → transform → load đáng tin cậy, data quality đảm bảo, ML workflow reproducible. ## 📥 Input - `.agent/specs/[feature]/spec.md` - `.agent/knowledge_base/data_schema.md` - `.agent/memory/constitution.md` (Docker-First, ENV) ## 📋 Protocol ### 1. Pipeline Architecture - Chọn model: batch (ETL/ELT) vs streaming. Orchestration (Airflow/Dagster/Prefect). - Idempotent + re-runnable steps; checkpoint/state rõ ràng. - Partition + incremental load thay vì full reload khi có thể. ### 2. Data Quality - Schema validation tại biên ingest; reject/quarantine bad records. - Data contract: type, null, range, uniqueness checks. - Lineage + freshness monitoring. ### 3. Storage & Modeling - Tách raw / staging / curated layers. - Modeling theo `data_schema.md`; partition key + indexing hợp lý. ### 4. ML Workflow (nếu có) - Tách feature engineering, training, inference. - Reproducibility: seed, version data + model, track experiment. - Tránh data leakage (train/test split đúng). ### 5. Reliability - Retry + dead-letter cho step lỗi; alerting. - Backfill strategy an toàn. ## 📤 Output - Pipeline code + orchestration DAG/config. - Cập nhật `knowledge_base/data_schema.md`. ## 🚫 Guard Rails - KHÔNG hard-code connection string/credential → ENV (`DB_*`). - KHÔNG full reload khi incremental khả thi. - KHÔNG bỏ qua data validation → tránh silent corruption. - KHÔNG để PI