llama-cpp

Secondary local LLM inference engine via llama.cpp. This skill should be used when running GGUF models directly, loading LoRA adapters for Kothar, benchmarking inference speed, or serving models via llama-server. Includes dedicated Qwen3.5 serve scripts (9B dense with F16 option, 35B MoE) with asymmetric KV cache and thinking mode. Complements Ollama (which remains primary for RLAMA and general use).
tdimino/claude-code-minoan · ★ 32 · AI & Automation · score 85

Install: claude install-skill tdimino/claude-code-minoan

# llama.cpp - Secondary Inference Engine

Direct access to llama.cpp for faster inference, LoRA adapter loading, and benchmarking on Apple Silicon. Ollama remains primary for RLAMA and general use; llama.cpp is the power tool.

## Prerequisites

```bash
brew install llama.cpp
```

Binaries: `llama-cli`, `llama-server`, `llama-embedding`, `llama-quantize`

## Quick Reference

### Resolve Ollama Model to GGUF Path

To avoid duplicating model files, resolve an Ollama model name to its GGUF blob path:

```bash
~/.claude/skills/llama-cpp/scripts/ollama_model_path.sh qwen2.5:7b
```

### Run Inference

```bash
GGUF=$(~/.claude/skills/llama-cpp/scripts/ollama_model_path.sh qwen2.5:7b)
llama-cli -m "$GGUF" -p "Your prompt here" -n 128 --n-gpu-layers all --single-turn --simple-io --no-display-prompt
```

### Start API Server

To start an OpenAI-compatible server (port 8081, avoids Ollama's 11434):

```bash
~/.claude/skills/llama-cpp/scripts/llama_serve.sh <model.gguf>

# Or with options:
PORT=8082 CTX=8192 ~/.claude/skills/llama-cpp/scripts/llama_serve.sh <model.gguf>
```

Test the server:

```bash
curl http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Hello"}]}'
```

### Serve Qwen3.5

Dedicated servers for Qwen3.5 models with asymmetric KV cache, jinja templates, and thinking mode.

**9B Dense (recommended for 24-36GB systems):**

```bash
# Default: Qwen3.5-9B, thinking mode, 32K context
~