bio-stats-ml-reporting

Aggregate results, train ML models, and produce reports with validated references.
fmschulz/omics-skills · ★ 3 · Data & Documents · score 67

Install: claude install-skill fmschulz/omics-skills

# Bio Stats ML Reporting

Aggregate results, train ML models, and produce reports with validated references.

## Instructions

1. Join outputs in DuckDB v1.1+ and build feature tables. Arrow / DuckLake integration is the recommended bridge into ML pipelines for large datasets.
2. Train baseline models and evaluate with cross-validation.
   - CPU baseline: scikit-learn v1.5+ for linear/tree/clustering baselines; XGBoost v2.1.4+ for gradient boosting.
   - GPU node available (CUDA): set `device="cuda"` on XGBoost (native since v2.0) by default. For sklearn-compatible estimators (random forest, k-means, PCA, UMAP), use **RAPIDS cuML** as a drop-in replacement and record the device in the run log.
3. Generate reports and validate references.
4. For exploratory omics projects, aggregate discovery evidence across the literature-derived analysis playbook, annotation, phylogenomics, viromics, and comparative-genomics outputs.
5. **Comparative-axes rollup**: join the per-axis comparison artifacts produced by upstream skills into a single `comparative_axes_summary.tsv`. The rollup must have one row per (query genome, axis) and include:
   - `genome-property frontier` (size, gene count, etc.; link to `relative_genome_metrics.tsv` and `genome_size_frontier.tsv`)
   - `marker-gene census` (link to `marker_census.tsv`)
   - `family copy-number expansions/contractions` (link to `family_copy_number_comparison.tsv` and `family_expansion_candidates.tsv`)
   - `synteny / conserved neighborhoo