blip-2-vision-language

Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.

AI & Automation | 27,984 stars | 2,901 forks | Updated today | MIT

Quality Score: 99/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# BLIP-2: Vision-Language Pre-training

Comprehensive guide to using Salesforce's BLIP-2 for vision-language tasks with frozen image encoders and large language models.

## When to use BLIP-2

**Use BLIP-2 when you:**
- Need high-quality image captioning with natural descriptions
- Are building visual question answering (VQA) systems
- Require zero-shot image-text understanding without task-specific training
- Want to leverage LLM reasoning for visual tasks
- Are building multimodal conversational AI
- Need image-text retrieval or matching

**Key features:**
- **Q-Former architecture**: A lightweight querying transformer bridges the frozen image encoder and the frozen LLM
- **Frozen backbone efficiency**: No need to fine-tune the large vision or language models
- **Multiple LLM backends**: OPT (2.7B, 6.7B) and FlanT5 (XL, XXL)
- **Zero-shot capabilities**: Strong performance without task-specific training
- **Efficient training**: Only the Q-Former (~188M parameters) is trained
- **State-of-the-art results at release**: Outperforms much larger models such as Flamingo-80B on zero-shot VQAv2

**Use alternatives instead:**
- **LLaVA**: For instruction-following multimodal chat
- **InstructBLIP**: For improved instruction-following (BLIP-2 successor)
- **GPT-4V/Claude 3**: For production multimodal chat (proprietary)
- **CLIP**: For simple image-text similarity without generation
- **Flamingo**: For few-shot visual learning

## Quick start

### Installation

```bash
# HuggingFace Transformers (recommended)
pip install transformers accelerate torch Pillow

# Or LAVIS library (Sa...
```
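The captioning and VQA use cases listed above can be sketched with the HuggingFace Transformers install path. This is a minimal sketch, not the skill's own code: the function names `build_vqa_prompt` and `describe_image` are hypothetical, the checkpoint `Salesforce/blip2-opt-2.7b` is assumed as the default, and the "Question: ... Answer:" prompt format is the one commonly used with the OPT-based checkpoints. The first call downloads several gigabytes of weights.

```python
from typing import Optional


def build_vqa_prompt(question: str) -> str:
    """Wrap a question in the "Question: ... Answer:" format (assumed prompt style)."""
    return f"Question: {question.strip()} Answer:"


def describe_image(
    image_path: str,
    question: Optional[str] = None,
    model_id: str = "Salesforce/blip2-opt-2.7b",  # assumed default checkpoint
) -> str:
    """Caption an image, or answer a question about it when one is given."""
    # Heavy imports stay local so the prompt helper works without them installed.
    import torch
    from PIL import Image
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Half precision only on GPU; CPU inference stays in float32.
    dtype = torch.float16 if device == "cuda" else torch.float32

    processor = Blip2Processor.from_pretrained(model_id)
    model = Blip2ForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=dtype
    ).to(device)

    image = Image.open(image_path).convert("RGB")
    if question is None:
        # No text prompt: the model generates a caption.
        inputs = processor(images=image, return_tensors="pt")
    else:
        # Text prompt: the model completes the answer.
        inputs = processor(
            images=image, text=build_vqa_prompt(question), return_tensors="pt"
        )
    inputs = inputs.to(device, dtype)

    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()


# Example usage (the image path is a placeholder):
#   describe_image("photo.jpg")                         # caption
#   describe_image("photo.jpg", "What is in the photo?")  # VQA
```

Loading the model inside the function keeps the sketch short; a real application would likely load the processor and model once and reuse them across calls.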

Details

Author: davila7
Repository: davila7/claude-code-templates
Created: 11 months ago
Last Updated: today
Language: Python
License: MIT
