messydata

Use this skill when working with MessyData — a synthetic dirty data generator. Covers writing and validating YAML configs, using the CLI, and the Python API. Trigger on: "generate synthetic data", "messydata config", "create a dataset schema", "add anomalies", "fake data", "dirty data".

Data & Documents, 34 stars, 1 fork, updated 2 months ago, Apache-2.0

Skill Content

# MessyData Skill

MessyData generates realistic messy DataFrames from a YAML config. It produces structured data with configurable anomalies (missing values, duplicates, invalid categories, bad dates, outliers).

## Workflow

Always follow this order:

1. Write or edit the YAML config
2. **Validate it** with the CLI before generating
3. Generate data

```bash
# 1. Validate (fast: no generation, exits 0/1)
uv run messydata validate my_config.yaml

# 2. Generate to stdout
uv run messydata generate my_config.yaml --rows 1000 --seed 42

# 3. Generate to file (format inferred from extension)
uv run messydata generate my_config.yaml --rows 1000 --output data.csv
uv run messydata generate my_config.yaml --rows 1000 --output data.parquet
uv run messydata generate my_config.yaml --rows 1000 --output data.json
uv run messydata generate my_config.yaml --rows 1000 --output data.jsonl
```

Always run `validate` after writing a config. Fix any errors before generating.

---

## YAML Config Structure

```yaml
name: <string>                           # required: dataset identifier
primary_key: <string>                    # optional, default: "id"
records_per_primary_key: <distribution>  # required: rows per PK group (continuous only)
fields: [<field_spec>, ...]              # required: column definitions
anomalies: [<anomaly_spec>, ...]         # optional: data quality injections
```

**Row count:** `--rows N` is approximate. Each PK group is sampled from `records_per_primary_key`; actual co...
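The validate-then-generate order from the Workflow section can also be enforced from a script. The following is a minimal sketch, not part of MessyData: the helper names (`validate_cmd`, `generate_cmd`, `validate_then_generate`) and the injectable `runner` parameter are hypothetical, and it only assembles the CLI invocations shown above (`validate`, `generate`, `--rows`, `--seed`, `--output`). It relies on the stated behavior that `validate` exits 0/1; passing `--seed` together with `--output` is an assumption, since the examples above show them separately.

```python
import subprocess
from typing import Callable, List, Optional, Sequence

# A runner takes a command (list of arguments) and returns its exit code.
Runner = Callable[[Sequence[str]], int]


def run_command(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def validate_cmd(config: str) -> List[str]:
    """Build the `messydata validate` invocation shown in the workflow."""
    return ["uv", "run", "messydata", "validate", config]


def generate_cmd(
    config: str,
    rows: int,
    seed: Optional[int] = None,
    output: Optional[str] = None,
) -> List[str]:
    """Build the `messydata generate` invocation; optional flags are added only if set."""
    cmd = ["uv", "run", "messydata", "generate", config, "--rows", str(rows)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if output is not None:
        # Assumption: --output may be combined with --seed.
        cmd += ["--output", output]
    return cmd


def validate_then_generate(
    config: str,
    rows: int,
    seed: Optional[int] = None,
    output: Optional[str] = None,
    runner: Runner = run_command,
) -> bool:
    """Validate first; generate only if validation exits with code 0."""
    if runner(validate_cmd(config)) != 0:
        return False  # fix the config before generating
    return runner(generate_cmd(config, rows, seed, output)) == 0
```

Calling `validate_then_generate("my_config.yaml", 1000, output="data.csv")` would run both steps in order and skip generation when validation fails. The `runner` parameter exists only so the sequencing can be exercised without the CLI installed.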

Details

Author: sodadata
Repository: sodadata/messydata
Created: 10 months ago
Last Updated: 2 months ago
Language: Python
License: Apache-2.0
