media-processorlisted
Install: claude install-skill avibebuilder/claude-prime
# Media Processor
Process audio, images, videos, documents, and generate images using Google Gemini's multimodal API. Unified interface for all multimedia content understanding and generation.
## Core Capabilities
### Image Understanding
- Image captioning and description
- Object detection with bounding boxes (2.0+)
- Pixel-level segmentation (2.5+)
- Visual question answering
- Multi-image comparison (up to 3,600 images)
- OCR and text extraction
### Video Analysis
- Scene detection and summarization
- Video Q&A with temporal understanding
- Transcription with visual descriptions
- YouTube URL support
- Long video processing (up to 6 hours)
- Frame-level analysis
### Document Extraction
- Native PDF vision processing (up to 1,000 pages)
- Table and form extraction
- Chart and diagram analysis
- Multi-page document understanding
- Structured data output (JSON schema)
- Format conversion (PDF to HTML/JSON)
### Image Generation
- Text-to-image generation
- High-quality generation variant (`generate-hq`) for detailed outputs
- Image editing and modification
- Multi-image composition (up to 3 images)
- Iterative refinement
- Multiple aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4)
- Controllable style and quality
### Audio Processing
- Transcription with timestamps (up to 9.5 hours)
- Audio summarization and analysis
- Speech understanding and speaker identification
- Music and environmental sound analysis
- Text-to-speech generation with controllable voice
## Supported Tasks