benchmark-kernel
Install: claude install-skill aiskillstore/marketplace
# Tutorial: Benchmarking FlashInfer Kernels
This tutorial shows you how to accurately benchmark FlashInfer kernels.
## Goal
Measure the performance of FlashInfer kernels:
- Get accurate GPU kernel execution time
- Compare multiple backends (FlashAttention2/3, cuDNN, CUTLASS, TensorRT-LLM)
- Generate reproducible benchmark results
- Save results to CSV for analysis
## Timing Methods
FlashInfer supports two timing methods:
1. **CUPTI (Preferred)**: Hardware-level profiling for the most accurate GPU kernel time
   - Measures pure GPU compute time without host-device overhead
   - Requires `cupti-python >= 13.0.0` (CUDA 13+)
2. **CUDA Events (Fallback)**: Standard CUDA event timing
   - Automatically used if CUPTI is not available
   - Good accuracy, slight overhead from host synchronization
**The framework automatically uses CUPTI if available, otherwise falls back to CUDA events.**
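For orientation, CUDA event timing can be sketched roughly as follows. This is a minimal illustration, not FlashInfer's own timing code: it assumes PyTorch is installed and a CUDA device is available, and `time_with_cuda_events` and `summarize` are hypothetical helper names introduced here.

```python
import statistics


def time_with_cuda_events(fn, warmup: int = 5, iters: int = 20) -> list[float]:
    """Return per-iteration GPU times in milliseconds for calling fn().

    Hypothetical helper; assumes PyTorch with a CUDA device.
    """
    import torch  # imported lazily so summarize() below works without a GPU

    # Warm-up runs absorb one-time costs such as JIT compilation and caching.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()

    starts = [torch.cuda.Event(enable_timing=True) for _ in range(iters)]
    ends = [torch.cuda.Event(enable_timing=True) for _ in range(iters)]
    for start, end in zip(starts, ends):
        start.record()
        fn()
        end.record()

    # Events are recorded asynchronously; wait before reading elapsed times.
    torch.cuda.synchronize()
    return [s.elapsed_time(e) for s, e in zip(starts, ends)]


def summarize(times_ms: list[float]) -> dict:
    """Reduce raw samples to a median and a sample standard deviation."""
    std = statistics.stdev(times_ms) if len(times_ms) > 1 else 0.0
    return {"median_ms": statistics.median(times_ms), "std_ms": std}
```

On a CUDA-capable machine, something like `summarize(time_with_cuda_events(lambda: torch.matmul(a, b)))` would report the median time of a matrix multiply. The median is used here because it is less sensitive to occasional slow iterations than the mean.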
## Installation
### Install CUPTI (Recommended)
For the most accurate benchmarking:
```bash
pip install -U cupti-python
```
**Requirements**: CUDA 13+ (CUPTI version 13+)
### Without CUPTI
If you don't install CUPTI, the framework will:
- Print a warning: `CUPTI is not installed. Falling back to CUDA events.`
- Automatically use CUDA events for timing
- Still provide good benchmark results
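The fallback behavior described above can be approximated with a small check like the one below. It is a sketch under stated assumptions, not the framework's actual logic: `pick_timing_backend` and `parse_major` are hypothetical names, and the check relies only on the installed `cupti-python` distribution's version metadata.

```python
import warnings
from importlib import metadata


def parse_major(version: str) -> int:
    """Extract the major version number from a string such as '13.0.0'."""
    return int(version.split(".")[0])


def pick_timing_backend(min_major: int = 13) -> str:
    """Return 'cupti' if a new enough cupti-python is installed, else 'cuda_events'.

    Hypothetical helper for illustration only.
    """
    try:
        installed = metadata.version("cupti-python")
    except metadata.PackageNotFoundError:
        warnings.warn("CUPTI is not installed. Falling back to CUDA events.")
        return "cuda_events"

    if parse_major(installed) < min_major:
        warnings.warn(
            f"cupti-python {installed} is older than {min_major}.x. "
            "Falling back to CUDA events."
        )
        return "cuda_events"
    return "cupti"


if __name__ == "__main__":
    print(f"Timing backend: {pick_timing_backend()}")
```

Running this before a benchmark session is a quick way to confirm which timing path to expect, so that results gathered with different methods are not compared by accident.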
## Method 1: Using flashinfer_benchmark.py (Recommended)
### Step 1: Choose Your Test Routine
Available routines:
- **Attention**: `BatchDecodeWithPagedKVCacheWrapper`, `BatchPrefillWithPagedKVCacheWr