aiperf


NVIDIA AIPerf — vendor-neutral generative-AI inference benchmarking (genai-perf successor). Covers `aiperf profile` with concurrency / request-rate / fixed-schedule trace replay / user-centric / multi-run confidence, 17 endpoint types (chat, completions, embeddings, rankings, responses, image-gen, image-edit, video-gen, NIM, HF-TGI, raw, template, etc.), 10 custom dataset formats (single_turn, multi_turn, mooncake_trace, bailian_trace, burst_gpt_trace, random_pool, dag_jsonl, raw_payload, inputs_json, sagemaker_data_capture) plus the SPEED-Bench family, 20+ public datasets, goodput SLOs, GPU + Prometheus telemetry, plot/analyze-trace/synthesize/service subcommands, plugin extensibility, and reasoning-token TTFT/TTFO split.

AI & Automation 3 stars 1 forks Updated 2 days ago MIT


Quality Score: 79/100 (weights: Stars 20%, Recency 20%, Frontmatter 20%, Documentation 15%, Issue Health 10%, License 10%, Description 5%)

Skill Content

# AIPerf — NVIDIA generative-AI inference benchmarking

Target audience: operators producing defensible latency/throughput/goodput numbers against any OpenAI-compatible inference server (vLLM, SGLang, TensorRT-LLM, NVIDIA Dynamo, NIM, Triton, HF TGI, Ollama), and developers extending AIPerf with custom endpoints, datasets, exporters, or metrics.

## Why this matters

`aiperf` is the open-source successor to `genai-perf`, written by NVIDIA's AI-Dynamo team. It is **the** vendor-neutral way to:

1. **Replay production traces** (Mooncake / Bailian / BurstGPT) at exact timestamps — synthetic load lies about cache reuse and tail behavior.
2. **Measure goodput**, not just throughput — the percentage of requests that meet **all** SLOs simultaneously. A system at 1000 req/s throughput and 28% goodput is mis-provisioned by ~3.5×.
3. **Account for reasoning tokens correctly.** GPT-OSS / DeepSeek-R1 / Qwen3 emit `reasoning_content` before the answer. genai-perf ignored those; aiperf splits **TTFT** (any first token) from **TTFO** (first non-reasoning token). Numbers are not directly comparable across the two tools — see migration notes.
4. **Collect server + GPU telemetry alongside client-side timings** in one run — DCGM / pynvml GPU metrics + Prometheus `/metrics` scrape, all aligned in the same artifact dir.
5. **Extend cleanly** — 25 plugin categories with a YAML manifest + entry-point system. New endpoints, dataset formats, exporters, accuracy graders, plot types all go through the ...
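The goodput definition and the TTFT/TTFO split described above can be illustrated with a small, self-contained sketch. This is not AIPerf code and does not use AIPerf's record schema: `RequestRecord`, `goodput_fraction`, `first_token_times`, the field names, and the SLO thresholds are all hypothetical, chosen only to show the arithmetic.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence, Tuple


@dataclass
class RequestRecord:
    """Hypothetical per-request timings in milliseconds (not AIPerf's schema)."""
    ttft_ms: float     # time to the first token of any kind
    ttfo_ms: float     # time to the first non-reasoning (output) token
    latency_ms: float  # end-to-end request latency


def goodput_fraction(records: Sequence[RequestRecord],
                     slo_ms: Mapping[str, float]) -> float:
    """Fraction of requests that meet every SLO in slo_ms at the same time."""
    if not records:
        return 0.0
    good = sum(
        all(getattr(r, name) <= limit for name, limit in slo_ms.items())
        for r in records
    )
    return good / len(records)


def first_token_times(
    chunks: Sequence[Tuple[float, bool]],
) -> Tuple[Optional[float], Optional[float]]:
    """Split first-token timing for one streamed response.

    chunks holds (ms_since_request_start, is_reasoning) pairs in arrival order.
    Returns (ttft, ttfo); ttfo is None if no non-reasoning token arrived.
    """
    ttft = chunks[0][0] if chunks else None
    ttfo = next((t for t, is_reasoning in chunks if not is_reasoning), None)
    return ttft, ttfo


# Assumed sample data: only the first request meets both SLOs.
records = [
    RequestRecord(ttft_ms=100, ttfo_ms=400, latency_ms=900),
    RequestRecord(ttft_ms=100, ttfo_ms=900, latency_ms=1500),
    RequestRecord(ttft_ms=300, ttfo_ms=350, latency_ms=800),
    RequestRecord(ttft_ms=50, ttfo_ms=60, latency_ms=2500),
]
slo = {"ttft_ms": 200, "latency_ms": 1000}
goodput = goodput_fraction(records, slo)

# Two reasoning chunks arrive before the first answer chunk.
ttft, ttfo = first_token_times([(120.0, True), (180.0, True), (450.0, False)])
```

Read this way, the "~3.5×" figure in item 2 presumably follows from 1 / 0.28 ≈ 3.57: if only 28% of requests meet all SLOs, roughly 3.6 times the capacity would be needed for the same offered load to be served within SLO, assuming goodput scales linearly with capacity.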

Details

Author: air-gapped
Repository: air-gapped/skills
Created: 3 months ago
Last Updated: 2 days ago
Language: Python
License: MIT

Integrates with

OpenAI (AI) · Anthropic (AI) · Ollama (AI) · Kubernetes (Infrastructure)

Bundled in these plugins

skills
