
dataset-generator

Use this when the user wants to generate, normalize, verify, deduplicate, or export training datasets for Codex, Antigravity, or Claude Code from topics, URLs, reference material, web research, or existing JSONL/CSV files. Supports SFT and DPO workflows, custom export schemas, and deterministic local pipeline scripts.
Bhanunamikaze/AI-Dataset-Generator · ★ 15 · Data & Documents · score 75
Install: claude install-skill Bhanunamikaze/AI-Dataset-Generator
# Dataset Generator

This skill is a tool-native dataset pipeline for Codex, Antigravity, and Claude Code.

- Use the IDE's own tools for browsing, reading, search, and reasoning.
- Use local Python scripts for deterministic normalization, state tracking, verification, deduplication, and export.
- Do not call external LLM-provider APIs as part of this skill.

## Command surface

- `dataset generate "<request>" [--count <n>]`
- `dataset collect "<topic or query>" [--urls url1 url2] [--paths ./dir]`
- `dataset verify <path/to/file>`
- `dataset audit [<path/to/file>]`
- `dataset export --format <openai|huggingface|csv|jsonl|all> [--schema-file path] [--split 0.1]`

If `dataset generate` does not include a size, default to `500` records. If `dataset collect` does not include `--max-results`, default to `10` results per query.

## Core architecture

- `sub-skills/` contains the cognitive instructions.
- `scripts/` contains deterministic helpers.
- `resources/internal-schema/canonical_schema.json` is the fixed pipeline backbone.
- `resources/target-schemas/` contains preset export profiles.
- `resources/templates/custom_flat_schema.json` is the starting point for custom headers.

## Fixed vs flexible schema

- The canonical internal schema is fixed.
- The final export schema is not universal and must be chosen per user request.
- For custom CSV or flat JSONL headers, create or update a schema file and pass it to `scripts/export.py`.

Read `sub-skills/dataset-strategy.md` first when