bio-fasta-database-curator

Curate, validate, and standardize FASTA/FAA sequence databases: standardize headers, merge databases, remove duplicates, convert GenBank to FASTA, and generate statistics. Use when preparing reference databases for HMM searches, MMseqs2, BLAST, and other bioinformatics workflows.
fmschulz/omics-skills · ★ 3 · Data & Documents · score 67
Install: claude install-skill fmschulz/omics-skills
# FASTA Database Curator

## Overview

Automate the curation and standardization of biological sequence databases. This skill handles the tedious work of processing FASTA/FAA files, ensuring consistent header formats, removing duplicates, and preparing databases for downstream analysis.

Supplementary version-grounded tool notes: [tools.md](tools.md).

**Key Capabilities:**

- Header format standardization (pipe separators, prefixes)
- Duplicate detection and removal (by sequence or ID)
- Format conversion (GenBank → FASTA, multi-line → single-line)
- Database merging with conflict resolution
- Statistics generation (counts, lengths, taxonomy, GC content)
- Validation (no whitespace in headers, proper formatting)
- Taxonomy label extraction and standardization

## When to Use This Skill

Use this skill when:

- User needs to standardize sequence headers
- User wants to merge multiple FASTA files
- User needs to remove duplicate sequences
- User is preparing a database for HMM/BLAST/MMseqs2
- User wants database statistics and quality metrics
- User needs to convert between sequence formats

## Header Format Standards

### Recommended Format

Use pipe-separated fields with consistent prefixes:

```
>PREFIX|ACCESSION|DESCRIPTION
SEQUENCE...
```

**Examples:**

```
>VP|Mavirus_MCP|Major capsid protein [Virophage]
>PLV|NC_021333_1|Polinton-like virus hypothetical protein
>NCLDV|YP_009173877.1|DNA polymerase [Marseilleviridae]
```

### Common Transformations

```python
# Remove white