benchmark-kernel

Guide for benchmarking FlashInfer kernels with CUPTI timing

Install: claude install-skill aiskillstore/marketplace

# Tutorial: Benchmarking FlashInfer Kernels

This tutorial shows you how to accurately benchmark FlashInfer kernels.

## Goal

Measure the performance of FlashInfer kernels:

- Get accurate GPU kernel execution time
- Compare multiple backends (FlashAttention2/3, cuDNN, CUTLASS, TensorRT-LLM)
- Generate reproducible benchmark results
- Save results to CSV for analysis

## Timing Methods

FlashInfer supports two timing methods:

1. **CUPTI (Preferred)**: Hardware-level profiling for most accurate GPU kernel time
   - Measures pure GPU compute time without host-device overhead
   - Requires `cupti-python >= 13.0.0` (CUDA 13+)
2. **CUDA Events (Fallback)**: Standard CUDA event timing
   - Automatically used if CUPTI is not available
   - Good accuracy, slight overhead from host synchronization

**The framework automatically uses CUPTI if available, otherwise falls back to CUDA events.**

## Installation

### Install CUPTI (Recommended)

For the most accurate benchmarking:

```bash
pip install -U cupti-python
```

**Requirements**: CUDA 13+ (CUPTI version 13+)

### Without CUPTI

If you don't install CUPTI, the framework will:

- Print a warning: `CUPTI is not installed. Falling back to CUDA events.`
- Automatically use CUDA events for timing
- Still provide good benchmark results

## Method 1: Using flashinfer_benchmark.py (Recommended)

### Step 1: Choose Your Test Routine

Available routines:

- **Attention**: `BatchDecodeWithPagedKVCacheWrapper`, `BatchPrefillWithPagedKVCacheWr