open-webui-embeddings

Solid

Wire HuggingFace embedding + reranker models (BGE-M3, BGE-Reranker-v2-m3, etc.) into Open WebUI's RAG pipeline via LiteLLM proxying HuggingFace Text Embeddings Inference (TEI). Covers the exact wire shapes Open WebUI sends (URL auto-append on embed but NOT rerank; payload + response shapes for both modes), the LiteLLM-TEI gotchas (encoding_format=null trap, HF-driver task_type misdetection, openai vs huggingface driver tradeoffs), TEI config cliffs (max-client-batch-size 422 under hybrid search, max-batch-tokens AS the auto-truncate boundary, arch-specific Docker images), and the end-to-end production config. BGE-M3 + BGE-Reranker-v2-m3 are worked examples; patterns generalise to any TEI encoder.

AI & Automation · 3 stars · 1 fork · Updated 2 days ago · MIT


Quality Score: 79/100

Score breakdown (weight, then component score where the page shows one): Stars 20% · Recency 20%, 100 · Frontmatter 20% · Documentation 15%, 100 · Issue Health 10% · License 10%, 100 · Description 5%, 100

Skill Content

# Open WebUI embeddings + reranking: operator reference

Target: operators wiring Open WebUI's RAG pipeline to HuggingFace Text Embeddings Inference (TEI) via LiteLLM. Three hops, each with its own wire-shape quirks. Most failure modes silently degrade to "answer quality dropped" rather than visible errors; this skill is a triage for catching them at config-time.

**Siblings in the `open-webui` plugin.** Setting these values *through the REST API* rather than the settings UI (and the knowledge/RAG endpoints that ingest documents) is the **`open-webui-api`** skill. Running more than one Open WebUI replica, where RAG requests and their WebSocket streams must survive hitting a different pod, is **`open-webui-valkey-websocket`**.

## The architecture in 30 seconds

```
Open WebUI → LiteLLM proxy → TEI (GPU)
                 └ embed:  openai-driver      → /v1/embeddings
                 └ rerank: huggingface-driver → /rerank (Cohere↔TEI translation)
```

Why proxy through LiteLLM rather than point Open WebUI at TEI directly?

- **Embed:** TEI exposes `/v1/embeddings` natively (OpenAI-compat); direct path works. LiteLLM adds: virtual-key auth, per-model rate limits, request logging, optional caching.
- **Rerank:** TEI's native `/rerank` is `{query, texts}` → `[{index, score}]`. Open WebUI's `ExternalReranker` sends Cohere shape `{query, documents, top_n}` → `{results: [{index, relevance_score}]}`. **Direct path fails with HTTP 422**: wire shapes do not match. LiteLLM's HuggingFace rerank han...
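The rerank shape mismatch described above can be illustrated with a pair of pure translation functions. This is a minimal sketch, not LiteLLM's actual handler: the function names are hypothetical, only the field names quoted in the text (`query`, `documents`, `top_n`, `texts`, `index`, `score`, `relevance_score`) are used, and the sort-then-truncate handling of `top_n` is an assumption about how a translator would reasonably behave.

```python
from typing import Any, Dict, List, Optional


def cohere_to_tei_request(body: Dict[str, Any]) -> Dict[str, Any]:
    """Map a Cohere-shaped rerank request onto TEI's native /rerank body.

    `top_n` has no counterpart in the TEI shape quoted above, so it is
    dropped here and applied to the response instead (assumption).
    """
    return {"query": body["query"], "texts": list(body["documents"])}


def tei_to_cohere_response(
    tei_results: List[Dict[str, Any]], top_n: Optional[int] = None
) -> Dict[str, Any]:
    """Map TEI's [{index, score}] list onto Cohere's {results: [...]} shape."""
    # Sort by score, highest first, so truncating to top_n keeps the best hits.
    ranked = sorted(tei_results, key=lambda r: r["score"], reverse=True)
    if top_n is not None:
        ranked = ranked[:top_n]
    return {
        "results": [
            {"index": r["index"], "relevance_score": r["score"]} for r in ranked
        ]
    }


# Illustrative request in the Cohere shape (document texts are made up).
cohere_request = {
    "query": "how do I rotate API keys?",
    "documents": ["billing FAQ", "key rotation guide", "auth overview"],
    "top_n": 2,
}
tei_request = cohere_to_tei_request(cohere_request)

# Illustrative TEI response; the scores are invented for the example.
tei_response = [
    {"index": 0, "score": 0.12},
    {"index": 1, "score": 0.97},
    {"index": 2, "score": 0.55},
]
cohere_response = tei_to_cohere_response(tei_response, cohere_request.get("top_n"))
```

Sending `cohere_request` straight to TEI would present `documents` where `texts` is expected, which is consistent with the HTTP 422 noted above; the two functions show the renaming a proxy layer has to perform in each direction.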

Details

Author: air-gapped
Repository: air-gapped/skills
Created: 3 months ago
Last Updated: 2 days ago
Language: Python
License: MIT

Integrates with

OpenAI (AI) · Anthropic (AI) · Hugging Face (AI) · Docker (Infrastructure) · Kubernetes (Infrastructure) · WebSocket (API)

Bundled in these plugins

skills

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

open-webui-api

Administer Open WebUI entirely via its REST API (v0.10.x): user/group lifecycle, permissions, model catalog GitOps (export/import/sync), knowledge/RAG pipelines, config-as-code, SCIM provisioning, event webhooks, and backup surfaces. Grounded in the v0.10.2 source — covers the 458-path surface the official docs leave ~96% undocumented, the auth bootstrapping traps (ENABLE_API_KEYS default-off, JWT 4-week expiry, one unscoped key per user), and the 0.10.0 breaking changes (access_control→access_grants with inverted public/private defaults, flat dot-keyed config) that silently break every pre-0.10 script and most LLM training-data knowledge.

3 Updated 2 days ago

air-gapped

AI & Automation Solid

open-webui-valkey-websocket

Deploy Open WebUI multi-pod with WebSockets and Valkey/Redis Sentinel at 1000+ user scale on Kubernetes. Centerpiece is the structural Socket.IO+Redis frame-amplification bug (#23733) that cripples multi-pod streaming, and the maintainer-endorsed mitigation (`CHAT_RESPONSE_STREAM_DELTA_CHUNK_SIZE`). Covers all multi-pod env vars, the custom-model-icon perf history (base64-in-/api/models, fixed late 2025–Apr 2026), the official helm chart's gaps (bundled Redis is unsuitable for production; no HPA/PDB/probes/sticky sessions), and the catalog of known multi-pod issues with current status.

3 Updated 2 days ago

air-gapped

AI & Automation Listed

inference-engineer

Productize an open-source model into a hosted inference endpoint the researcher (or their agent) can call. Picks the right hardware, the right serving stack (vLLM / Triton / TEI / BentoML), wraps it in an OpenAI-compatible gateway (LiteLLM) with per-tenant auth, exposes it as an MCP tool in chat, and runs a quality + latency + cost probe so the user knows what they actually shipped. Triggers on `/inference-engineer [model-url]`, or on natural intent like "host this model", "serve inference for X", "deploy model Y", "get me an API for Z", "I want to reproduce paper P on a small GPU", "what do I do with this trained model".

20 Updated 1 week ago

Rockielab