dvc-dataset-versioning

Solid

Dataset versioning skill using DVC for tracking data changes, managing data pipelines, and ensuring reproducibility.

Data & Documents | 814 stars | 53 forks | Updated today | MIT

Quality Score: 95/100

Stars (20%): 97
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# dvc-dataset-versioning

## Overview

Dataset versioning skill using DVC (Data Version Control) for tracking data changes, managing data pipelines, and ensuring reproducibility in ML workflows.

## Capabilities

- Dataset version tracking
- Data pipeline definition and execution
- Remote storage management (S3, GCS, Azure, etc.)
- Reproducibility enforcement
- Data lineage tracking
- Experiment comparison with data versions
- Cache management for large datasets

## Target Processes

- Data Collection and Validation Pipeline
- ML Model Retraining Pipeline
- Feature Store Implementation

## Tools and Libraries

- DVC
- Git
- Remote storage SDKs (boto3, google-cloud-storage, etc.)

## Input Schema

```json
{
  "type": "object",
  "required": ["action"],
  "properties": {
    "action": {
      "type": "string",
      "enum": ["init", "add", "push", "pull", "diff", "checkout", "run", "repro"],
      "description": "DVC action to perform"
    },
    "paths": {
      "type": "array",
      "items": { "type": "string" },
      "description": "File or directory paths to track"
    },
    "remote": {
      "type": "string",
      "description": "Remote storage name"
    },
    "revision": {
      "type": "string",
      "description": "Git revision for checkout/diff"
    },
    "pipeline": {
      "type": "object",
      "description": "Pipeline stage definition for run action"
    }
  }
}
```

## Output Schema

```json
{ "type": "object", "required": ["status", "action"], "prop...
```
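
As an illustration of how a request shaped like the Input Schema might map onto DVC and Git command lines, here is a minimal Python sketch. It is not taken from the skill's source: the function names `validate_request` and `build_commands` are hypothetical, the validation is hand-rolled rather than driven by a JSON Schema library, and the mapping from each `action` to a command is an assumption. The `run` action is deliberately left unhandled, because the shape of the `pipeline` object is not specified in the text shown here and newer DVC releases favor `dvc stage add` over `dvc run`.

```python
from typing import Any, Dict, List

# Actions listed in the Input Schema's "action" enum.
ALLOWED_ACTIONS = {"init", "add", "push", "pull", "diff", "checkout", "run", "repro"}


def validate_request(request: Dict[str, Any]) -> None:
    """Check a request against the field types named in the Input Schema."""
    action = request.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported or missing action: {action!r}")
    paths = request.get("paths", [])
    if not isinstance(paths, list) or not all(isinstance(p, str) for p in paths):
        raise ValueError("paths must be a list of strings")
    for key in ("remote", "revision"):
        if key in request and not isinstance(request[key], str):
            raise ValueError(f"{key} must be a string")
    if "pipeline" in request and not isinstance(request["pipeline"], dict):
        raise ValueError("pipeline must be an object")


def build_commands(request: Dict[str, Any]) -> List[List[str]]:
    """Return the argv lists a request could translate to (assumed mapping)."""
    validate_request(request)
    action = request["action"]
    paths = request.get("paths", [])
    remote = request.get("remote")
    revision = request.get("revision")

    if action == "init":
        return [["dvc", "init"]]
    if action == "add":
        if not paths:
            raise ValueError("add needs at least one path")
        return [["dvc", "add", *paths]]
    if action in ("push", "pull"):
        cmd = ["dvc", action]
        if remote:
            cmd += ["-r", remote]  # -r selects a named remote
        return [cmd + paths]
    if action == "diff":
        # With one revision, dvc diff compares it against the workspace.
        return [["dvc", "diff"] + ([revision] if revision else [])]
    if action == "checkout":
        # dvc checkout takes no revision itself: switch Git first, then
        # sync the workspace data to the .dvc files of that revision.
        cmds = [["git", "checkout", revision]] if revision else []
        return cmds + [["dvc", "checkout", *paths]]
    if action == "repro":
        return [["dvc", "repro"]]
    # "run": pipeline object shape is unspecified here, so it is not mapped.
    raise NotImplementedError("run is not covered by this sketch")


if __name__ == "__main__":
    for cmd in build_commands({"action": "checkout", "revision": "v1.0"}):
        print(" ".join(cmd))
```

The sketch only builds command lists and prints them; a caller could pass each list to `subprocess.run(cmd, check=True)` to execute it. Returning a list of commands rather than a single one keeps the two-step Git-then-DVC checkout explicit.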

Details

Author: a5c-ai
Repository: a5c-ai/babysitter
Created: 4 months ago
Last Updated: today
Language: JavaScript
License: MIT

Related Skills