evaluating-llms-harness

Solid

Evaluates LLMs across 60+ academic benchmarks, including MMLU, HumanEval, GSM8K, TruthfulQA, and HellaSwag. Use it when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. An industry standard used by EleutherAI, Hugging Face, and major labs. Supports Hugging Face models, vLLM, and API-served models.

AI & Automation · 5 stars · 0 forks · Updated yesterday · MIT

Quality Score: 83/100

Stars (20%)
Recency (20%): 100
Frontmatter (20%)
Documentation (15%): 100
Issue Health (10%)
License (10%): 100
Description (5%): 100

Skill Content

# lm-evaluation-harness - LLM Benchmarking

## Quick start

lm-evaluation-harness evaluates LLMs across 60+ academic benchmarks using standardized prompts and metrics.

**Installation**:

```bash
pip install lm-eval
```

**Evaluate any HuggingFace model**:

```bash
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu,gsm8k,hellaswag \
  --device cuda:0 \
  --batch_size 8
```

**View available tasks**:

```bash
lm_eval --tasks list
```

## Common workflows

### Workflow 1: Standard benchmark evaluation

Evaluate model on core benchmarks (MMLU, GSM8K, HumanEval).

Copy this checklist:

```
Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model
- [ ] Step 3: Run evaluation
- [ ] Step 4: Analyze results
```

**Step 1: Choose benchmark suite**

**Core reasoning benchmarks**:

- **MMLU** (Massive Multitask Language Understanding) - 57 subjects, multiple choice
- **GSM8K** - Grade school math word problems
- **HellaSwag** - Common sense reasoning
- **TruthfulQA** - Truthfulness and factuality
- **ARC** (AI2 Reasoning Challenge) - Science questions

**Code benchmarks**:

- **HumanEval** - Python code generation (164 problems)
- **MBPP** (Mostly Basic Python Problems) - Python coding

**Standard suite** (recommended for model releases):

```bash
--tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge
```

**Step 2: Configure model**

**HuggingFace model**:

```bash
lm_eval --model hf \
  --model_args pretrained=meta-llama/Ll...
```
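The quick-start flags and the standard suite above can be combined into a single invocation. The following is a minimal sketch of a dry run, not part of lm-eval itself: `build_eval_cmd` and `STANDARD_SUITE` are hypothetical names introduced here for illustration, and the function only prints the command (using the flags shown in the quick start) so it can be reviewed before a long evaluation is started.

```bash
#!/usr/bin/env bash
# Sketch only: build_eval_cmd and STANDARD_SUITE are hypothetical helpers,
# not part of lm-eval. The flags are the ones shown in the quick start.

STANDARD_SUITE="mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge"

build_eval_cmd() {
  local model="$1"              # HuggingFace model ID
  local device="${2:-cuda:0}"   # defaults mirror the quick-start example
  local batch_size="${3:-8}"
  printf 'lm_eval --model hf --model_args pretrained=%s --tasks %s --device %s --batch_size %s\n' \
    "$model" "$STANDARD_SUITE" "$device" "$batch_size"
}

# Dry run: print the command instead of executing it.
build_eval_cmd meta-llama/Llama-2-7b-hf
```

Printing the command first keeps the model, task list, and device visible in one place; once it looks right, the printed line can be pasted into the shell to start the run.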

Details

Author: immacualate
Repository: immacualate/claude-forge
Created: 1 year ago
Last Updated: yesterday
Language: Shell
License: MIT

Integrates with

Hugging Face · AI

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Listed

compare-llm-models

Use this to pick or switch the LLM behind a feature, based on evidence instead of hype or the newest release. Trigger on "which model should I use", "is GPT/Claude/Gemini/Llama better for this", "should I switch models", "can a cheaper model do this", "compare models for my use case". Evaluate on YOUR task, not on leaderboards alone.

26 Updated yesterday

ContextJet-ai

AI & Automation Solid

llm-evaluation

Use when measuring the quality of an LLM feature. Covers building an evaluation set, choosing metrics, LLM-as-judge and its pitfalls, regression testing prompts, and evaluating in production.

23 Updated yesterday

nimadorostkar

AI & Automation Listed

hugging-face-evaluation

Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.

3 Updated yesterday

tayyabexe