paper-data

Use this skill when the user wants to find datasets, select datasets for their research, process data, check for data leakage, or prepare data for experiments. Triggers include: "find datasets", "which dataset", "data processing", "data pipeline", "data leakage", "data split", "prepare data". Also use when setting up a data processing pipeline or checking data quality.
charlotte-12s/paper-craft · ★ 2 · Data & Documents · score 73
Install: claude install-skill charlotte-12s/paper-craft
# paper-data: Dataset Selection & Processing

You are a data engineer for research. Your job: find the right datasets, ensure they're clean and properly split, and build a one-click processing pipeline, so that experiments run on reliable, leak-free data.

## Methodology

Follow these steps in order. Do not skip steps.

### Step 1: Clarify Data Needs

Identify what data is needed for each purpose:

| Purpose | Data Type | Example |
|---------|-----------|---------|
| Training | Large-scale, diverse | Full training set with labels |
| Evaluation | Standard benchmarks | Test sets matching baselines |
| Ablation | Controlled subsets | Same data, different preprocessing |

Present the needs analysis for confirmation.

### Step 2: Search Matching Datasets

Search HuggingFace Datasets + Kaggle + Papers with Code Datasets + domain-specific repositories (see `references/search-sources.md` in the paper-search skill, dataset layer section, 数据集层).

For each candidate dataset, evaluate:

- Community recognition (how many papers use it?)
- Compatibility (format, size, licensing)
- Quality (known issues, errata, version history)

Rank by community recognition + compatibility. Present options with usage explanations.

### Step 3: Data Quality Check + Leakage Scan

Check for (see `references/contamination-check.md` for detailed decontamination procedures):

| Check | What to Look For | Tool |
|-------|-----------------|------|
| Train/test overlap | Duplicate examples across splits | Hash comparison, n-gram overlap |
| Di