add-cuda-kernel

Step-by-step tutorial for adding new CUDA kernels to FlashInfer
aiskillstore/marketplace · ★ 329 · AI & Automation · score 79

Install: claude install-skill aiskillstore/marketplace

# Tutorial: Adding a New Kernel to FlashInfer

This tutorial walks through adding a simple element-wise scale operation to FlashInfer. We'll implement `scale(x, factor) = x * factor` to demonstrate the complete workflow.

## Goal

Add a new operation that scales each element of a tensor by a scalar factor:

- Input: tensor `x` and scalar `factor`
- Output: `x * factor` (element-wise)
- Support multiple dtypes (FP16, BF16, FP32)

## Step 1: Define CUDA Kernel in `include/`

Create `include/flashinfer/scale.cuh`:

```cpp
#pragma once
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cuda_bf16.h>

namespace flashinfer {

/*!
 * \brief Element-wise scale kernel
 * \tparam T Data type (half, __nv_bfloat16, float)
 * \param input Input tensor
 * \param output Output tensor
 * \param factor Scale factor
 * \param n Number of elements
 */
template <typename T>
__global__ void ScaleKernel(const T* input, T* output, T factor, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) {
    output[idx] = input[idx] * factor;
  }
}

/*!
 * \brief Launch scale kernel
 * \tparam T Data type
 * \param input Input pointer
 * \param output Output pointer
 * \param factor Scale factor
 * \param n Number of elements
 * \param stream CUDA stream
 */
template <typename T>
cudaError_t ScaleLauncher(const T* input, T* output, T factor, int n,
                          cudaStream_t stream = nullptr) {
  const int threads = 256;
  const int blocks = (n + threads - 1) / thread