How MakoraGenerate Leverages PTX and Tensor Cores for Fast Matrix Multiplication

MakoraGenerate writes inline PTX to achieve near-optimal GEMM performance

Written by

Cătălin Milu

Published on

Jul 29, 2025

At Makora, our mission is to revolutionize GPU development. Our flagship product, MakoraGenerate, is an intelligent agent that automatically produces highly optimized GPU kernel code. Today, we're excited to show how MakoraGenerate leverages PTX code for CUDA and NVIDIA's Tensor Cores, delivering unparalleled performance for intensive tasks like matrix multiplication.

The Challenge: Manual GPU Optimization is Hard

Achieving peak GPU performance requires a deep understanding of the hardware, the memory hierarchy, and low-level representations like PTX. Even for experienced CUDA developers, hand-optimizing a Tensor Core-enabled GEMM kernel is time-consuming and error-prone: small mismatches between instruction shapes, operands, and addresses creep in easily and cause silent inefficiencies or wrong results.

A Glimpse Under the Hood: Our Inline PTX GEMM Example

To illustrate, consider a manually written CUDA kernel that uses inline PTX to drive the Tensor Cores. While functional, code like this shows how much complexity, and how many opportunities for simple mismatches, are inherent in low-level manual optimization.

__global__ void gemm_tensor_core_kernel(
    const half* __restrict__ A,
    const half* __restrict__ B,
    half* __restrict__ C,
    int M_dim, int N_dim, int K_dim
) {
    // Shared memory, block/warp/lane indices, register storage initialization
    // (including the per-lane byte offsets used below for ldmatrix addressing)
    // ...

    // Main computation loop over K dimension
    for (int block_k = 0; block_k < K_dim; block_k += BK) {
        // Load A and B tiles from global to shared memory
        // ...
        __syncthreads();

        // Process warp tiles in K dimension
        for (int warp_k = 0; warp_k < BK; warp_k += WK) {
            // Convert this warp's shared-memory tile pointers into the
            // 32-bit shared-state-space addresses that ldmatrix requires
            uint32_t A_warp_base = cvta_to_shared_u32(A_warp_tile);
            uint32_t B_warp_base = cvta_to_shared_u32(B_warp_tile);

            // Fragment loops over mma_m / mma_n / mma_k elided
            // Load A fragments into registers using ldmatrix (inline PTX)
            asm volatile (
                "ldmatrix.sync.aligned.m8n8.x2.shared.b16 "
                "{%0, %1}, [%2];"
                : "=r"(A_register[mma_m][mma_k][0]),
                  "=r"(A_register[mma_m][mma_k][1])
                : "r"(A_warp_base + A_thread_offset_bytes)
            );

            // Load B fragments into registers using ldmatrix (transposed, inline PTX)
            asm volatile (
                "ldmatrix.sync.aligned.m8n8.x1.trans.shared.b16 "
                "{%0}, [%1];"
                : "=r"(B_register[mma_k][mma_n])
                : "r"(B_warp_base + B_thread_offset_bytes)
            );

            // Perform the m16n8k8 FP16 MMA on the Tensor Cores (inline PTX),
            // accumulating into C_register
            asm volatile (
                "mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16 "
                "{%0, %1}, {%2, %3}, {%4}, {%5, %6};"
                : "=r"(C_register[mma_m][mma_n][0]),
                  "=r"(C_register[mma_m][mma_n][1])
                : "r"(A_register[mma_m][mma_k][0]),
                  "r"(A_register[mma_m][mma_k][1]),
                  "r"(B_register[mma_k][mma_n]),
                  "r"(C_register[mma_m][mma_n][0]),
                  "r"(C_register[mma_m][mma_n][1])
            );
        }
        // Wait until every warp has finished reading this K tile before
        // the next iteration overwrites shared memory
        __syncthreads();
    }

    // Store results back to global memory
    // ...
}

This kernel drives the Tensor Cores directly through PTX instructions such as ldmatrix and mma.sync. While powerful, manually managing register allocation, shared-memory addressing, and warp-level parallelism through inline assembly invites exactly the kind of subtle bugs described above: a single mismatched operand shape or address offset can silently produce wrong results or fall off the fast path.
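
One detail the snippet takes for granted is the cvta_to_shared_u32 helper, which turns a generic pointer into the 32-bit shared-state-space address that ldmatrix expects. Its definition isn't shown here; a minimal sketch of one common way to write it, assuming a CUDA 11+ toolchain where the __cvta_generic_to_shared intrinsic is available, looks like this:

#include <cstdint>

// Convert a generic pointer that is known to point into shared memory
// into a 32-bit shared-state-space byte address usable by ldmatrix/mma PTX.
__device__ __forceinline__ uint32_t cvta_to_shared_u32(const void* ptr) {
    return static_cast<uint32_t>(__cvta_generic_to_shared(ptr));
}

Equivalent versions based on an inline cvta.to.shared.u32 PTX instruction are also common; either way, even this one-line address conversion is a detail the programmer has to get right by hand.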

Enter MakoraGenerate: Your AI-Powered Optimization Agent

MakoraGenerate sidesteps this by generating highly optimized PTX, with efficient Tensor Core and shared-memory usage built in, and by handling the low-level complexities that plague manual efforts.

How MakoraGenerate works:

  • High-Level Specification: You provide MakoraGenerate with your computational task (e.g., matrix multiplication).

  • Architectural Awareness: It understands your GPU's architecture, including Tensor Cores and memory hierarchy.

  • PTX Generation: MakoraGenerate produces tailored PTX code, ensuring optimal shared memory usage, precise Tensor Core utilization, efficient thread/warp scheduling, and automatic avoidance of the operand and addressing mismatches that trip up hand-written kernels.

  • Integration: The generated PTX integrates directly into your CUDA C++ application, for example by loading it through the CUDA Driver API, as sketched below.
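
As an illustration of that integration step, the sketch below loads a standalone PTX file with the CUDA Driver API and launches a kernel from it. The file name gemm.ptx, the entry-point name gemm_tensor_core_kernel, the problem size, and the launch configuration are assumptions made for the example rather than MakoraGenerate output, and error handling is reduced to a single check.

#include <cuda.h>
#include <cuda_fp16.h>
#include <cstdio>
#include <cstdlib>

// Abort on any Driver API error (minimal error handling for the sketch).
static void check(CUresult res, const char* what) {
    if (res != CUDA_SUCCESS) {
        const char* msg = nullptr;
        cuGetErrorString(res, &msg);
        std::fprintf(stderr, "%s failed: %s\n", what, msg ? msg : "unknown");
        std::exit(1);
    }
}

int main() {
    check(cuInit(0), "cuInit");
    CUdevice dev;
    check(cuDeviceGet(&dev, 0), "cuDeviceGet");
    CUcontext ctx;
    check(cuCtxCreate(&ctx, 0, dev), "cuCtxCreate");

    // Load the generated PTX (assumed file name) and look up the kernel
    // entry point (assumed symbol name).
    CUmodule mod;
    check(cuModuleLoad(&mod, "gemm.ptx"), "cuModuleLoad");
    CUfunction gemm;
    check(cuModuleGetFunction(&gemm, mod, "gemm_tensor_core_kernel"),
          "cuModuleGetFunction");

    // Allocate device buffers for an illustrative M x N x K problem.
    int M = 1024, N = 1024, K = 1024;
    CUdeviceptr dA, dB, dC;
    check(cuMemAlloc(&dA, sizeof(half) * M * K), "cuMemAlloc A");
    check(cuMemAlloc(&dB, sizeof(half) * K * N), "cuMemAlloc B");
    check(cuMemAlloc(&dC, sizeof(half) * M * N), "cuMemAlloc C");

    // Kernel parameters in the order declared by the kernel signature.
    void* params[] = { &dA, &dB, &dC, &M, &N, &K };

    // Illustrative launch configuration; a real one must match the tile
    // sizes the generated kernel was built for.
    check(cuLaunchKernel(gemm,
                         /*grid*/  8, 8, 1,
                         /*block*/ 128, 1, 1,
                         /*sharedMemBytes*/ 0,
                         /*stream*/ nullptr,
                         params, nullptr),
          "cuLaunchKernel");
    check(cuCtxSynchronize(), "cuCtxSynchronize");

    cuMemFree(dA); cuMemFree(dB); cuMemFree(dC);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}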

The Benefits: Performance and Productivity

MakoraGenerate offers clear advantages:

  • Superior Performance: It generates expertly optimized PTX, consistently achieving near-peak performance, often surpassing hand-tuned kernels. Our example demonstrates how a manually optimized Tensor Core kernel already outpaces a baseline.

  • Reduced Development Time: Focus on high-level logic, not low-level GPU optimization.

  • Increased Reliability: Automated generation minimizes human error, leading to more robust kernels.

  • Future-Proofing: It adapts to evolving GPU architectures, so kernels keep delivering strong performance on new hardware.

Conclusion

Generating efficient GPU kernel code, especially leveraging Tensor Cores via PTX, is transformative for high-performance computing. At Makora, we believe MakoraGenerate empowers developers to push GPU capabilities, turning complex optimization into streamlined, automated processes.

Stay tuned for more updates as we evolve MakoraGenerate and enhance your GPU development workflow!

Copyright © 2025 MakoRA. All rights reserved.
