We implemented a sophisticated matrix multiplication engine in CubeCL that rivals the performance of cuBLAS and CUTLASS while supporting a wider range of GPUs. The engine leverages double buffering, tensor cores, and vectorization, and it compiles seamlessly to CUDA, ROCm, WebGPU, Metal, and Vulkan backends without relying on proprietary or third-party binaries. Matrix multiplication is central to modern AI workloads, especially transformers, and optimizing it ourselves was essential to enable kernel fusion and achieve state-of-the-art performance across platforms in a deep learning framework.
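To give a feel for the core idea behind such kernels, here is a minimal CPU-side sketch of tiled (blocked) matrix multiplication in plain Rust. This is not CubeCL code and does not use its API; the tile size and function names are illustrative assumptions. On a GPU, each tile would be staged in shared memory or registers (with double buffering overlapping loads and compute, and tensor cores performing the inner tile product), but the data-reuse pattern is the same.

```rust
// Illustrative sketch only: NOT CubeCL's API. Tile size and names are assumptions.

const BLOCK: usize = 32; // hypothetical tile size, analogous to a GPU thread-block tile

/// Computes c += a * b for row-major n x n matrices using square tiles,
/// so each tile of `a` and `b` is reused many times once "loaded"
/// (on a GPU, the tile would sit in shared memory or registers).
fn matmul_tiled(a: &[f32], b: &[f32], c: &mut [f32], n: usize) {
    assert_eq!(a.len(), n * n);
    assert_eq!(b.len(), n * n);
    assert_eq!(c.len(), n * n);

    for bi in (0..n).step_by(BLOCK) {
        for bj in (0..n).step_by(BLOCK) {
            for bk in (0..n).step_by(BLOCK) {
                // Multiply one BLOCK x BLOCK tile of `a` with one tile of `b`
                // and accumulate into the corresponding tile of `c`.
                for i in bi..(bi + BLOCK).min(n) {
                    for k in bk..(bk + BLOCK).min(n) {
                        let a_ik = a[i * n + k];
                        for j in bj..(bj + BLOCK).min(n) {
                            c[i * n + j] += a_ik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}

fn main() {
    let n = 64;
    let a = vec![1.0f32; n * n];
    let b = vec![2.0f32; n * n];
    let mut c = vec![0.0f32; n * n];
    matmul_tiled(&a, &b, &mut c, n);
    // With all-ones times all-twos, each entry should equal 2.0 * n.
    assert!((c[0] - 2.0 * n as f32).abs() < 1e-3);
    println!("c[0] = {}", c[0]);
}
```

The blocking order matters because it keeps a small working set hot: once a tile of `a` and `b` is resident, it contributes to an entire tile of `c` before being evicted, which is exactly the reuse that shared-memory staging and double buffering exploit on GPUs.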