bio-protein-clustering-pangenome
Install: claude install-skill fmschulz/omics-skills
# Bio Protein Clustering Pangenome
Cluster proteins into orthogroups and derive pangenome matrices.
## Instructions
1. Cluster proteins. Choose the tool by dataset size and goal:
   - Default for orthology inference up to a few hundred genomes: **OrthoFinder v3** (improved accuracy and lower RAM at scale, supports MSA-based gene trees; supersedes OrthoFinder v2 and OrthoMCL workflows).
   - Very large pangenomes where OrthoFinder is too RAM-heavy: **ProteinOrtho v6.1.7+** as the fast, scalable alternative.
   - Sequence clustering (not strict orthology) and similarity search backbones: **MMseqs2** v15-6f452+. Enable GPU mode (`mmseqs ... --gpu 1`) on CUDA Turing+ nodes for a ~20× speedup at near-identical sensitivity.
2. Build presence/absence matrix AND an integer copy-number matrix (orthogroup × genome) covering the query AND the close relatives produced by `/bio-phylogenomics`.
3. Compute core/accessory/cloud/singleton partitions.
4. Identify single-copy orthologs for phylogenetic analysis.
5. Discriminate paralogs from orthologs in multi-copy gene families.
6. Calculate pangenome statistics (completeness, orthogroup occupancy).
7. When a query genome or genome set is under study, use the literature-derived analysis playbook to choose an appropriate comparison baseline: closest relatives, a broader clade, environmental references, or a negative/control set.
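Steps 2 to 4 and the occupancy statistic in step 6 can be sketched in plain Python as follows. This is a minimal illustration, not the skill's own implementation: the `copy_number` dictionary, the genome names, and the helper functions are hypothetical, and the partition thresholds (core = present in every genome, singleton = exactly one genome, cloud = fewer than 15% of genomes, accessory = everything else) are assumptions that vary between tools and should be adjusted to the study.

```python
# Hypothetical copy-number matrix: orthogroup -> {genome: gene count}.
# In practice this would be parsed from the clustering tool's gene-count table.
copy_number = {
    "OG0001": {"query": 1, "relA": 1, "relB": 1, "relC": 1},
    "OG0002": {"query": 2, "relA": 1, "relB": 1, "relC": 1},
    "OG0003": {"query": 1, "relA": 0, "relB": 1, "relC": 0},
    "OG0004": {"query": 1, "relA": 0, "relB": 0, "relC": 0},
}
genomes = ["query", "relA", "relB", "relC"]


def presence_absence(counts, genomes):
    """Binarize the copy-number matrix (1 = at least one gene copy)."""
    return {
        og: {g: int(row.get(g, 0) > 0) for g in genomes}
        for og, row in counts.items()
    }


def classify(occupancy, n_genomes, cloud_fraction=0.15):
    """Assign a pangenome partition from the number of genomes with the orthogroup.

    The thresholds are assumptions for illustration only.
    """
    if occupancy == n_genomes:
        return "core"
    if occupancy == 1:
        return "singleton"
    if occupancy / n_genomes < cloud_fraction:
        return "cloud"
    return "accessory"


pa = presence_absence(copy_number, genomes)

# Occupancy: fraction of genomes in which each orthogroup is present.
occupancy = {og: sum(row.values()) / len(genomes) for og, row in pa.items()}

partitions = {
    og: classify(sum(row.values()), len(genomes)) for og, row in pa.items()
}

# Single-copy orthologs: exactly one copy in every genome.
single_copy = sorted(
    og
    for og, row in copy_number.items()
    if all(row.get(g, 0) == 1 for g in genomes)
)

for og in sorted(copy_number):
    print(og, partitions[og], f"occupancy={occupancy[og]:.2f}")
print("single-copy orthogroups:", single_copy)
```

Keeping the integer copy-number matrix alongside the binary matrix matters here: `OG0002` is core by presence but is excluded from the single-copy set because the query carries two copies.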
8. **Genome-property frontier table** — produce `relative_genome_metrics.tsv` with one row per (query + relative) an