evaluating-code-models

Solid

Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from BigCode Project used by HuggingFace leaderboards.

AI & Automation · 9,609 stars · 724 forks · Updated 1 month ago · MIT

Quality Score: 94/100

Stars 20%

100

Recency 20%

Frontmatter 20%

Documentation 15%

100

Issue Health 10%

License 10%

100

Description 5%

100

Skill Content

# BigCode Evaluation Harness - Code Model Benchmarking

## Quick Start

BigCode Evaluation Harness evaluates code generation models across 15+ benchmarks including HumanEval, MBPP, and MultiPL-E (18 languages).

**Installation**:

```bash
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
pip install -e .
accelerate config
```

**Evaluate on HumanEval**:

```bash
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --max_length_generation 512 \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10 \
  --allow_code_execution \
  --save_generations
```

**View available tasks**:

```bash
python -c "from bigcode_eval.tasks import ALL_TASKS; print(ALL_TASKS)"
```

## Common Workflows

### Workflow 1: Standard Code Benchmark Evaluation

Evaluate model on core code benchmarks (HumanEval, MBPP, HumanEval+).

**Checklist**:

```
Code Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model and generation
- [ ] Step 3: Run evaluation with code execution
- [ ] Step 4: Analyze pass@k results
```

**Step 1: Choose benchmark suite**

**Python code generation** (most common):

- **HumanEval**: 164 handwritten problems, function completion
- **HumanEval+**: Same 164 problems with 80× more tests (stricter)
- **MBPP**: 500 crowd-sourced problems, entry-level difficulty
- **MBPP+**: 399 curated problems with 35× more tests

**Multi-language** (18 languages):

- **MultiPL-E**: ...
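
The checklist above ends with analyzing pass@k results. As a rough illustration of what that number means, here is a minimal sketch of the unbiased pass@k estimator commonly used with HumanEval-style benchmarks: for a problem with `n` generated samples of which `c` pass the tests, pass@k is estimated as 1 - C(n-c, k) / C(n, k), then averaged over problems. The function names `pass_at_k` and `mean_pass_at_k` are hypothetical helpers, not part of the harness, and the sketch assumes the harness reports this same estimator.

```python
import math
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed all tests
    k: the k in pass@k (must satisfy 1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # If fewer than k samples failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average the per-problem estimate over a benchmark (hypothetical helper)."""
    if not correct_counts:
        raise ValueError("correct_counts must not be empty")
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Illustrative counts only, not real benchmark results:
    # three problems, 20 samples each, with 5, 0, and 20 passing samples.
    counts = [5, 0, 20]
    print("pass@1 :", mean_pass_at_k(counts, n=20, k=1))
    print("pass@10:", mean_pass_at_k(counts, n=20, k=10))
```

Because `k` cannot exceed `n`, a run with `--n_samples 20` as in the Quick Start can support estimates up to pass@20 under this formula; larger `k` needs more samples per problem.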

Details

Author: Orchestra-Research
Repository: Orchestra-Research/AI-Research-SKILLs
Created: 7 months ago
Last Updated: 1 month ago
Language: TeX
License: MIT

Integrates with

Hugging Face · AI
