The Fused Kernel Library: A C++ API to Develop Highly-Efficient GPU Libraries
Published as an arXiv preprint, 2025
The Fused Kernel Library (FKL) provides a C++ API designed to simplify the development of highly efficient fused GPU kernels. Kernel fusion is a critical optimization technique for GPU computing: by combining multiple operations into a single kernel launch, FKL avoids writing intermediate results to GPU memory and reading them back, and it reduces kernel launch overhead, which can yield significant performance improvements.
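To make the idea of fusion concrete, here is a minimal CPU-side sketch in plain C++. It is not FKL's API and it does not launch any GPU kernel. The names `Scale`, `AddBias`, `Relu`, `apply_chain`, `run_single`, and `run_fused` are hypothetical and exist only for this illustration. Each loop stands in for one kernel launch: the unfused path makes one full pass over the data per operation, while the fused path chains the operations at compile time and makes a single pass.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical element-wise operations (illustrative only, not FKL types).
struct Scale {
    float factor;
    float operator()(float x) const { return x * factor; }
};

struct AddBias {
    float bias;
    float operator()(float x) const { return x + bias; }
};

struct Relu {
    float operator()(float x) const { return x > 0.0f ? x : 0.0f; }
};

// Base case: no operations left, return the value unchanged.
inline float apply_chain(float x) { return x; }

// Recursive case: apply the first operation, then the rest, left to right.
// The chain is resolved at compile time, so the value never leaves a local.
template <typename Op, typename... Rest>
float apply_chain(float x, const Op& op, const Rest&... rest) {
    return apply_chain(op(x), rest...);
}

// Unfused stage: one full pass over the buffer for a single operation.
// Chaining several of these materializes an intermediate buffer per stage,
// which is the analogue of one kernel launch per operation.
template <typename Op>
std::vector<float> run_single(const std::vector<float>& in, const Op& op) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = op(in[i]);
    }
    return out;
}

// Fused: a single pass that applies every operation to each element
// before writing the result once.
template <typename... Ops>
std::vector<float> run_fused(const std::vector<float>& in, const Ops&... ops) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = apply_chain(in[i], ops...);
    }
    return out;
}
```

Under these assumptions, calling `run_fused(in, Scale{2.0f}, AddBias{-1.0f}, Relu{})` produces the same values as three nested `run_single` calls, but reads the input once and writes the output once. On a GPU the same pattern keeps intermediate values in registers instead of sending them through global memory between launches.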
FKL is particularly relevant for:
- High-performance inference — Fusing operations in neural network inference pipelines
- Scientific computing — Optimizing GPU-accelerated numerical computations
- Custom CUDA development — Providing a higher-level API while maintaining low-level performance
This work reflects my deep involvement in CUDA programming and GPU optimization, contributing to the tooling that powers efficient AI computation.
Recommended citation: O. Amoros, A. Andaluz, J. Nunez, A.J. Pena. (2025). "The Fused Kernel Library: A C++ API to Develop Highly-Efficient GPU Libraries." arXiv preprint arXiv:2508.07071.
Download Paper | Download Slides
