gemma4-local-deploy
Solid在本机 Mac 或 Apple Silicon 上部署 Gemma 4 12B。本地安装/升级 llama.cpp,下载 GGUF 量化模型,用 llama-server 暴露 OpenAI-compatible API,或用 Ollama 暴露本地模型服务;按用户需求在默认 Q4_K_M、64K/128K 长上下文、QAT Q4_0 @ 256K、左右对比演示之间选择,配置 tmux 后台运行,验证健康检查、问答接口、资源占用和常见故障。当用户说部署 Gemma 4、Gemma 4 12B、本地大模型、长上下文、QAT、量化、llama-server、Ollama、GGUF、Mac 本地模型服务时使用。
Install
Quality Score: 90/100
Skill Content
Details
- Author
- majiayu000
- Repository
- majiayu000/spellbook
- Created
- 6 months ago
- Last Updated
- today
- Language
- Python
- License
- MIT
Integrates with
Similar Skills
Semantically similar based on skill content — not just same category
vllm-gemma-4-31b
Operating-point reference for serving Gemma 4 31B on vLLM — TP sizing, max_model_len, max_num_seqs, gpu_memory_utilization, kv_cache_dtype, EAGLE3 spec-dec, chat_template choice.
llama-cpp
Secondary local LLM inference engine via llama.cpp. This skill should be used when running GGUF models directly, loading LoRA adapters for Kothar, benchmarking inference speed, or serving models via llama-server. Includes dedicated Qwen 3.5 serve scripts (9B dense with F16 option, 35B MoE) with asymmetric KV cache and thinking mode. Complements Ollama (which remains primary for RLAMA and general use).
dispatch
多模型调用器 — 把任务或 prompt 派发给其他 AI 模型(Codex / Gemini / Kimi / DeepSeek / 豆包 / Qwen / GLM / MiniMax)执行并取回结果。当你想用某个或某几个其他模型跑任务、需要多模型交叉对比验证、或想用更便宜的模型省钱时触发。提供两类调用通道:API 直调(只需 API key)和 CLI 调用(需本地装对应 CLI)。触发信号:「用 Kimi/Codex/Gemini 跑一下」「交给其他 AI」「换个模型试试」「让几个模型都看看」「这个不用最贵的模型」。