bio-stats-ml-reporting
Install: claude install-skill fmschulz/omics-skills
# Bio Stats ML Reporting
Aggregate results, train ML models, and produce reports with validated references.
## Instructions
1. Join outputs in DuckDB v1.1+ and build feature tables. Arrow / DuckLake integration is the recommended bridge into ML pipelines for large datasets.
2. Train baseline models and evaluate with cross-validation.
- CPU baseline: scikit-learn v1.5+ for linear/tree/clustering baselines; XGBoost v2.1.4+ for gradient boosting.
- GPU node available (CUDA): by default, set `device="cuda"` on XGBoost (the `device` parameter is native since v2.0). For sklearn-compatible estimators (random forest, k-means, PCA, UMAP), use **RAPIDS cuML** as a drop-in replacement. Record the device used in the run log.
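
The CPU/GPU choice in step 2 can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the helper names (`pick_device`, `xgb_params`, `run_log_entry`, `cross_validate_xgb`), the run-log fields, and the use of `nvidia-smi` on `PATH` as a proxy for a usable CUDA device are all assumptions made for the example.

```python
import json
import shutil
from datetime import datetime, timezone


def pick_device(force_cpu: bool = False) -> str:
    """Return "cuda" when `nvidia-smi` is on PATH, else "cpu".

    Assumption: the presence of `nvidia-smi` is only a cheap heuristic
    for a usable CUDA device, not a guarantee.
    """
    if force_cpu:
        return "cpu"
    return "cuda" if shutil.which("nvidia-smi") else "cpu"


def xgb_params(device: str) -> dict:
    """Keyword arguments for xgboost.XGBClassifier (XGBoost >= 2.0)."""
    # `device` selects CPU or GPU; `hist` works on both.
    return {"tree_method": "hist", "device": device, "n_estimators": 200}


def run_log_entry(model: str, device: str, cv_folds: int) -> str:
    """Serialize one run-log record as a JSON line (hypothetical schema)."""
    record = {
        "model": model,
        "device": device,
        "cv_folds": cv_folds,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record, sort_keys=True)


def cross_validate_xgb(X, y, device: str, cv_folds: int = 5) -> float:
    """Mean cross-validated accuracy of an XGBoost baseline.

    Imports are deferred so the helpers above work even when
    scikit-learn or xgboost is not installed.
    """
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    model = XGBClassifier(**xgb_params(device))
    return float(cross_val_score(model, X, y, cv=cv_folds).mean())
```

A typical call would be `device = pick_device()`, then `cross_validate_xgb(X, y, device)` on the feature table from step 1, followed by appending `run_log_entry("xgboost", device, 5)` to the run log so the device is recorded alongside the scores.
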
3. Generate reports and validate references.
4. For exploratory omics projects, aggregate discovery evidence across the literature-derived analysis playbook, annotation, phylogenomics, viromics, and comparative-genomics outputs.
5. **Comparative-axes rollup** — join the per-axis comparison artifacts produced by upstream skills into a single `comparative_axes_summary.tsv`. The rollup must have one row per (query genome, axis) and include:
- `genome-property frontier` (size, gene count, etc. — link to `relative_genome_metrics.tsv` and `genome_size_frontier.tsv`)
- `marker-gene census` (link to `marker_census.tsv`)
- `family copy-number expansions/contractions` (link to `family_copy_number_comparison.tsv` and `family_expansion_candidates.tsv`)
- `synteny / conserved neighborhoo