Install: claude install-skill Bhanunamikaze/AI-Dataset-Generator
# Dataset Generator
This skill is a tool-native dataset pipeline for Codex, Antigravity, and Claude Code.
- Use the IDE's own tools for browsing, reading, search, and reasoning.
- Use local Python scripts for deterministic normalization, state tracking, verification, deduplication, and export.
- Do not call external LLM-provider APIs as part of this skill.
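As a rough illustration of what a deterministic helper could look like, the sketch below normalizes records and drops duplicates by content hash. The field names (`instruction`, `response`) and all function names are hypothetical; they are not taken from the skill's `scripts/` directory.

```python
import hashlib
import json
import unicodedata


def normalize_record(record: dict) -> dict:
    """Return a copy with string values NFC-normalized and whitespace collapsed."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = unicodedata.normalize("NFC", value)
            value = " ".join(value.split())  # collapse runs of whitespace, trim ends
        cleaned[key] = value
    return cleaned


def record_fingerprint(record: dict) -> str:
    """Hash a record deterministically: sorted keys give a stable serialization."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first occurrence of each normalized record, preserving order."""
    seen = set()
    unique = []
    for record in records:
        normalized = normalize_record(record)
        fingerprint = record_fingerprint(normalized)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(normalized)
    return unique
```

Because the same input always produces the same fingerprints and ordering, a step like this can be re-run safely, which is the property the split between IDE tools and local scripts relies on.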
## Command surface
- `dataset generate "<request>" [--count <n>]`
- `dataset collect "<topic or query>" [--urls url1 url2] [--paths ./dir]`
- `dataset verify <path/to/file>`
- `dataset audit [<path/to/file>]`
- `dataset export --format <openai|huggingface|csv|jsonl|all> [--schema-file path] [--split 0.1]`
If `dataset generate` is called without `--count`, default to `500` records.
If `dataset collect` does not include `--max-results`, default to `10` results per query.
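The two defaults can be expressed as a small `argparse` sketch. This is only an illustration of the default values, however the commands are actually dispatched; `build_parser` and the positional argument names are assumptions, and only `generate` and `collect` are modeled.

```python
import argparse

DEFAULT_COUNT = 500        # records, when `dataset generate` has no --count
DEFAULT_MAX_RESULTS = 10   # results per query, when `dataset collect` has no --max-results


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the two commands with documented defaults."""
    parser = argparse.ArgumentParser(prog="dataset")
    commands = parser.add_subparsers(dest="command", required=True)

    generate = commands.add_parser("generate")
    generate.add_argument("request")
    generate.add_argument("--count", type=int, default=DEFAULT_COUNT)

    collect = commands.add_parser("collect")
    collect.add_argument("query")
    collect.add_argument("--urls", nargs="*", default=[])
    collect.add_argument("--paths", nargs="*", default=[])
    collect.add_argument("--max-results", type=int, default=DEFAULT_MAX_RESULTS)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["generate", "customer support dialogues"])
    print(args.command, args.count)
```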
## Core architecture
- `sub-skills/` contains the cognitive instructions.
- `scripts/` contains deterministic helpers.
- `resources/internal-schema/canonical_schema.json` is the fixed pipeline backbone.
- `resources/target-schemas/` contains preset export profiles.
- `resources/templates/custom_flat_schema.json` is the starting point for custom headers.
## Fixed vs flexible schema
- The canonical internal schema is fixed.
- The final export schema is not universal and must be chosen per user request.
- For custom CSV or flat JSONL headers, create or update a schema file and pass it to `scripts/export.py`.
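To make the fixed-versus-flexible split concrete, here is a minimal sketch of mapping canonical records onto custom flat headers. The schema layout (`columns` with `header` and `source` keys), the canonical field names, and `export_flat_csv` are all assumptions for illustration; the real `custom_flat_schema.json` and `scripts/export.py` may look different.

```python
import csv
import io
import json

# Hypothetical schema layout; the skill's actual template may differ.
SCHEMA_JSON = """
{
  "columns": [
    {"header": "prompt", "source": "instruction"},
    {"header": "answer", "source": "response"}
  ]
}
"""


def export_flat_csv(records: list[dict], schema: dict) -> str:
    """Render records as CSV text, renaming canonical fields to the schema's headers."""
    columns = schema["columns"]
    headers = [column["header"] for column in columns]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=headers, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Fields missing from a record become empty cells; unmapped fields are dropped.
        writer.writerow({c["header"]: record.get(c["source"], "") for c in columns})
    return buffer.getvalue()


if __name__ == "__main__":
    canonical = [{"instruction": "Say hi", "response": "hi", "source_url": "n/a"}]
    print(export_flat_csv(canonical, json.loads(SCHEMA_JSON)))
```

Keeping the mapping in a schema file means the canonical records never change shape; only the final projection does.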
Read `sub-skills/dataset-strategy.md` first when