bio-protein-clustering-pangenome

Cluster proteins into orthogroups and derive pangenome matrices.
fmschulz/omics-skills · ★ 3 · AI & Automation · score 67

Install: claude install-skill fmschulz/omics-skills

# Bio Protein Clustering Pangenome

Cluster proteins into orthogroups and derive pangenome matrices.

## Instructions

1. Cluster proteins. Choose the tool by dataset size and goal:
   - Default for orthology inference up to a few hundred genomes: **OrthoFinder v3** (improved accuracy and lower RAM at scale, supports MSA-based gene trees; supersedes OrthoFinder v2 and OrthoMCL workflows).
   - Very large pangenomes where OrthoFinder is too RAM-heavy: **ProteinOrtho v6.1.7+** as the fast, scalable alternative.
   - Sequence clustering (not strict orthology) and similarity search backbones: **MMseqs2** v15-6f452+. Enable GPU mode (`mmseqs ... --gpu`) on CUDA Turing+ nodes for a ~20× speedup at near-identical sensitivity.
2. Build presence/absence matrix AND an integer copy-number matrix (orthogroup × genome) covering the query AND the close relatives produced by `/bio-phylogenomics`.
3. Compute core/accessory/cloud/singleton partitions.
4. Identify single-copy orthologs for phylogenetic analysis.
5. Discriminate paralogs from orthologs in multi-copy gene families.
6. Calculate pangenome statistics (completeness, orthogroup occupancy).
7. When a query genome or genome set is under study, use the literature-derived analysis playbook to choose an appropriate comparison baseline: closest relatives, a broader clade, environmental references, or a negative/control set.
8. **Genome-property frontier table**: produce `relative_genome_metrics.tsv` with one row per (query + relative) an