jgi-lakehouse
Install: `claude install-skill fmschulz/omics-skills`
# JGI Lakehouse Skill
## Quick Start
**What is it?** JGI's unified data warehouse (651 tables), plus filesystem access to genome files.
**Two data access methods:**
1. **Lakehouse (Dremio)** → Metadata, annotations, taxonomy (no sequences)
2. **JGI Filesystem** → Actual genome files (FNA, FAA, GFF) via taxon OID
**SQL Dialect:** ANSI SQL (not PostgreSQL)
- Use `CAST(x AS type)` not `::`
- Use `REGEXP_LIKE()` not `~`
- Identifiers containing dashes or spaces need double quotes: `"gold-db-2 postgresql"`
```sql
-- Quick test
SELECT gold_id, project_name FROM "gold-db-2 postgresql".gold.project
WHERE is_public = 'Yes' LIMIT 5;
```
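The dialect rules above can be combined in a single query. The following is a minimal sketch that reuses only the table and columns from the quick test. The search term `Escherichia` is an illustrative value, and the query assumes `project_name` is a text column.

```sql
-- Sketch: ANSI-style cast and regex match instead of PostgreSQL's :: and ~
SELECT
  CAST(gold_id AS VARCHAR) AS gold_id_text,          -- not gold_id::varchar
  project_name
FROM "gold-db-2 postgresql".gold.project             -- source name needs double quotes
WHERE is_public = 'Yes'
  AND REGEXP_LIKE(project_name, '.*Escherichia.*')   -- not project_name ~ 'Escherichia'
LIMIT 5;
```

The pattern is wrapped in `.*` so that it behaves the same whether the engine matches the whole string or any substring.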
---
## When to Use
- Query JGI genomics metadata (GOLD, IMG, Mycocosm, Phytozome)
- Find genomes or metagenomes by taxonomy, ecosystem, or phenotype
- Download microbial genomes with IMG taxon OIDs
- Cross-reference GOLD projects with IMG annotations
## Instructions
1. Decide whether the task needs metadata, files, or read recovery.
2. Use Lakehouse SQL for metadata/annotations and the JGI filesystem or JAMO for sequence files.
3. Inspect schemas with a small `LIMIT`; remove `LIMIT` for comprehensive results.
4. Record source, table, fields, filters, and access route in every result summary.
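As a rough illustration of steps 3 and 4, the sketch below runs against the same `gold.project` table used in the quick test. The layout of the summary comment is hypothetical, not a required format; only the table and column names come from this document.

```sql
-- Step 3a: inspect the available columns with a small LIMIT
SELECT * FROM "gold-db-2 postgresql".gold.project LIMIT 5;

-- Step 3b: once the fields are known, drop LIMIT for the comprehensive result.
-- Step 4: record the following alongside the result (illustrative layout):
--   source: "gold-db-2 postgresql"   table: gold.project
--   fields: gold_id, project_name    filter: is_public = 'Yes'
--   access route: Lakehouse (Dremio)
SELECT gold_id, project_name
FROM "gold-db-2 postgresql".gold.project
WHERE is_public = 'Yes';
```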
## Quick Reference
| Task | Action |
|---|---|
| Test Lakehouse | Query `"gold-db-2 postgresql".gold.project` |
| Query IMG metadata | Use `"img-db-2 postgresql".img_core_v400.*` tables |
| Query NUMG proteins | Join `faa` and `gene2pfam` on both `oid` and `gene_oid` |