add-cuda-kernellisted
Install: claude install-skill aiskillstore/marketplace
# Tutorial: Adding a New Kernel to FlashInfer
This tutorial walks through adding a simple element-wise scale operation to FlashInfer. We'll implement `scale(x, factor) = x * factor` to demonstrate the complete workflow.
## Goal
Add a new operation that scales each element of a tensor by a scalar factor:
- Input: tensor `x` and scalar `factor`
- Output: `x * factor` (element-wise)
- Support multiple dtypes (FP16, BF16, FP32)
## Step 1: Define CUDA Kernel in `include/`
Create `include/flashinfer/scale.cuh`:
```cpp
#pragma once
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cuda_bf16.h>
namespace flashinfer {
/*!
* \brief Element-wise scale kernel
* \tparam T Data type (half, __nv_bfloat16, float)
* \param input Input tensor
* \param output Output tensor
* \param factor Scale factor
* \param n Number of elements
*/
template <typename T>
__global__ void ScaleKernel(const T* input, T* output, T factor, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
output[idx] = input[idx] * factor;
}
}
/*!
* \brief Launch scale kernel
* \tparam T Data type
* \param input Input pointer
* \param output Output pointer
* \param factor Scale factor
* \param n Number of elements
* \param stream CUDA stream
*/
template <typename T>
cudaError_t ScaleLauncher(const T* input, T* output, T factor, int n,
cudaStream_t stream = nullptr) {
const int threads = 256;
const int blocks = (n + threads - 1) / thread