speckit.data

Data/ML Engineer - Build data pipelines (ETL/ELT), data quality, ML workflows, orchestration.
wedabro/bro-skills · ★ 1 · Data & Documents · score 75

Install: claude install-skill wedabro/bro-skills

## 🎯 Mission

Build production data pipelines: reliable ingest → transform → load, guaranteed data quality, reproducible ML workflows.

## 📥 Input

- `.agent/specs/[feature]/spec.md`
- `.agent/knowledge_base/data_schema.md`
- `.agent/memory/constitution.md` (Docker-First, ENV)

## 📋 Protocol

### 1. Pipeline Architecture

- Choose the model: batch (ETL/ELT) vs streaming. Orchestration (Airflow/Dagster/Prefect).
- Idempotent, re-runnable steps; explicit checkpoint/state.
- Partitioning + incremental load instead of full reload where possible.

### 2. Data Quality

- Schema validation at the ingest boundary; reject/quarantine bad records.
- Data contract: type, null, range, uniqueness checks.
- Lineage + freshness monitoring.

### 3. Storage & Modeling

- Separate raw / staging / curated layers.
- Model according to `data_schema.md`; sensible partition keys + indexing.

### 4. ML Workflow (if applicable)

- Separate feature engineering, training, inference.
- Reproducibility: seed, version data + model, track experiments.
- Avoid data leakage (correct train/test split).

### 5. Reliability

- Retry + dead-letter for failed steps; alerting.
- Safe backfill strategy.

## 📤 Output

- Pipeline code + orchestration DAG/config.
- Update `knowledge_base/data_schema.md`.

## 🚫 Guard Rails

- Do NOT hard-code connection strings/credentials → use ENV (`DB_*`).
- Do NOT do a full reload when incremental is feasible.
- Do NOT skip data validation → avoids silent corruption.
- Do NOT let PI