gguf-quantization

Quality Score: 93/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# GGUF - Quantization Format for llama.cpp

GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp. It enables efficient inference on CPUs, Apple Silicon, and GPUs, with flexible quantization options.

## When to use GGUF

**Use GGUF when:**
- Deploying on consumer hardware (laptops, desktops)
- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
- Running CPU inference with no GPU available
- Choosing among flexible quantization levels (Q2_K to Q8_0)
- Using local AI tools (LM Studio, Ollama, text-generation-webui)

**Key advantages:**
- **Universal hardware**: CPU, Apple Silicon, NVIDIA, AMD support
- **No Python runtime**: Pure C/C++ inference
- **Flexible quantization**: 2-8 bit with various methods (K-quants)
- **Ecosystem support**: LM Studio, Ollama, koboldcpp, and more
- **imatrix**: Importance matrix for better low-bit quality

**Use alternatives instead:**
- **AWQ/GPTQ**: Maximum accuracy with calibration on NVIDIA GPUs
- **HQQ**: Fast calibration-free quantization for HuggingFace
- **bitsandbytes**: Simple integration with transformers library
- **TensorRT-LLM**: Production NVIDIA deployment with maximum speed

## Quick start

### Installation

```bash
# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build (CPU); recent llama.cpp versions build with CMake rather than make
cmake -B build
cmake --build build --config Release

# Build with CUDA (NVIDIA)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Build with Metal (Apple Silicon)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release

# Install Python bindings (optional)
pip install llama-cpp-python
```

### Convert model to GGUF ```bas...
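
As a rough illustration of what the 2-8 bit range listed under the key advantages means in practice, the sketch below estimates weight storage from a parameter count and a bits-per-weight figure. It is back-of-the-envelope arithmetic, not part of llama.cpp: the function name `estimate_weight_size_gb` and the 7-billion-parameter model are hypothetical, and real quantization types store per-block scale data, so their effective bits per weight run somewhat higher than the nominal figure.

```python
def estimate_weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (10**9 bytes).

    Assumes every parameter is stored at the same bit width and ignores
    metadata, tokenizer data, and per-block quantization scales.
    """
    if n_params <= 0 or bits_per_weight <= 0:
        raise ValueError("n_params and bits_per_weight must be positive")
    total_bits = n_params * bits_per_weight
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    n_params = 7e9  # hypothetical 7B-parameter model, for illustration only
    for bits in (2, 4, 8, 16):
        size = estimate_weight_size_gb(n_params, bits)
        print(f"{bits:>2} bits/weight: ~{size:.2f} GB")
```

Comparing the 16-bit row with the lower rows gives a quick sense of whether a given quantization level is likely to fit in the RAM or VRAM of the target machine before any conversion is attempted.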

Details

Author: NousResearch
Repository: NousResearch/hermes-agent
Created: 10 months ago
Last Updated: today
Language: Python
License: MIT

Similar Skills

gguf-quantization

gguf-quantization

llama-cpp