The Fused Kernel Library: A C++ API to Develop Highly-Efficient GPU Libraries

Published in arXiv preprint, 2025

The Fused Kernel Library (FKL) provides a C++ API designed to simplify the development of highly efficient fused GPU kernels. Kernel fusion is a key optimization technique in GPU computing: by combining multiple operations into a single kernel launch, FKL avoids redundant memory transfers for intermediate results and reduces kernel launch overhead, which can yield significant performance improvements.
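To make the idea of fusion concrete, here is a minimal CPU-side C++ sketch of the general technique. It is not FKL's API: the names `Scale`, `AddBias`, `run_unfused`, `run_fused`, and `apply_chain` are hypothetical and chosen only for illustration. The unfused version makes one pass per operation and stores the intermediate result in a temporary buffer, which on a GPU would roughly correspond to two kernel launches plus an extra round trip through global memory. The fused version applies the whole chain of operations to each element in a single pass.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical element-wise operations (illustrative names, not FKL's API).
struct Scale {
    float factor;
    float operator()(float x) const { return x * factor; }
};

struct AddBias {
    float bias;
    float operator()(float x) const { return x + bias; }
};

// Unfused: one pass per operation, with the intermediate result
// written to and read back from a temporary buffer.
template <typename Op1, typename Op2>
void run_unfused(const std::vector<float>& in, std::vector<float>& out,
                 Op1 op1, Op2 op2) {
    assert(in.size() == out.size());
    std::vector<float> tmp(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) tmp[i] = op1(in[i]);
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = op2(tmp[i]);
}

// Base case: no operations left, return the value unchanged.
inline float apply_chain(float x) { return x; }

// Recursive case: apply the first operation, then the rest of the chain.
template <typename Op, typename... Rest>
float apply_chain(float x, Op op, Rest... rest) {
    return apply_chain(op(x), rest...);
}

// Fused: a single pass applies the whole chain to each element, so the
// intermediate value lives in a local variable instead of a buffer.
template <typename... Ops>
void run_fused(const std::vector<float>& in, std::vector<float>& out,
               Ops... ops) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = apply_chain(in[i], ops...);
    }
}
```

Calling `run_fused(in, out, Scale{2.0f}, AddBias{1.0f})` on the input `{1, 2, 3}` gives `{3, 5, 7}`, the same result as `run_unfused`, but without allocating or traversing the intermediate buffer. Because the operation chain is a template parameter pack, the compiler can inline the whole chain into the loop body. A GPU library can apply the same compile-time composition inside a single kernel.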

FKL is particularly relevant for:

High-performance inference — Fusing operations in neural network inference pipelines
Scientific computing — Optimizing GPU-accelerated numerical computations
Custom CUDA development — Providing a higher-level API while maintaining low-level performance

This work reflects my deep involvement in CUDA programming and GPU optimization, and it contributes to the tooling that powers efficient AI computation.

Recommended citation: O. Amoros, A. Andaluz, J. Nunez, A.J. Pena. (2025). "The Fused Kernel Library: A C++ API to Develop Highly-Efficient GPU Libraries." arXiv preprint arXiv:2508.07071.
Download Paper | Download Slides

Johnny Núñez Cano
