Code generated by MakoraGenerate is wrong... or brilliant?

The code was correct. The problem wasn't.

Written by
Piotr Kowalczyk

Published on
Feb 18, 2026

Yesterday I looked at one of Makora’s internal benchmarks for MakoraGenerate and noticed an unusually high speedup of over 10x. Curious about what was behind it, I looked at the problem and the solution. I was pretty sure that MakoraGenerate had fallen victim to an undetected reward hack.

The problem to optimize was simple:

import torch
import torch.nn as nn

class Model(nn.Module):
    """
    Model that performs a series of operations: Gemm, Subtract, GlobalAvgPool, LogSumExp, GELU, and ResidualAdd.
    """
    def __init__(self, in_features, out_features, bias=True):
        super(Model, self).__init__()
        self.gemm = nn.Linear(in_features, out_features, bias=bias)
        self.subtract = nn.Parameter(torch.randn(out_features))

    def forward(self, x):
        original_x = x.clone().detach()
        # Gemm
        x = self.gemm(x)

        # Subtract
        x = x - self.subtract

        # GlobalAvgPool
        x = torch.mean(x, dim=1, keepdim=True)

        # LogSumExp
        x = torch.logsumexp(x, dim=1, keepdim=True)

        # GELU
        x = torch.nn.functional.gelu(x)

        # ResidualAdd
        x = x + original_x

        return x

batch_size = 2048
in_features = 8192
out_features = 8192

How would you approach optimizing something like this?

If you are someone like me (rather low-level, not super fluent with Python), it might be tempting to do it mechanically:

  • there are two main code blocks: a GEMM followed by some memory-bound ops

  • in a first version, one could use a vendor library to execute the GEMM

  • then probably fuse the rest of the operations into a single kernel with at least one reduction

  • and at the end, potentially everything could be fused into a single kernel (a rough sketch of this expected structure follows)
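
To make that plan concrete, here is a rough PyTorch-level sketch of the structure I had in mind (my own illustration, not part of the benchmark or of MakoraGenerate's output); torch.compile just stands in for a hand-written fused kernel:

import torch
import torch.nn.functional as F

@torch.compile  # stand-in for a hand-written kernel fusing the memory-bound tail
def post_ops(y, subtract, original_x):
    y = y - subtract
    y = torch.mean(y, dim=1, keepdim=True)
    y = torch.logsumexp(y, dim=1, keepdim=True)
    y = F.gelu(y)
    return y + original_x

def expected_forward(x, weight, bias, subtract):
    y = F.linear(x, weight, bias)  # vendor GEMM (cuBLAS under the hood)
    return post_ops(y, subtract, x)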

So I was expecting either gemm_kernel + fused_ops_kernel, or just one kernel call that does everything. However, MakoraGenerate took a different path: it replaced everything with three distinct kernels:

#define BLOCK 256

// Sum weights over out_features dimension to get a vector of size in_features
__global__ void weight_col_sum_kernel(const float* __restrict__ W,
                                      int out_features,
                                      int in_features,
                                      float* __restrict__ wsum) {
    int col = blockIdx.x;
    if (col >= in_features) return;

    int tid = threadIdx.x;
    double acc = 0.0;

    // W is [out_features, in_features], row-major
    for (int j = tid; j < out_features; j += blockDim.x) {
        acc += (double)W[j * in_features + col];
    }

    __shared__ double shm[BLOCK];
    shm[tid] = acc;
    __syncthreads();

    // Block reduction
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            shm[tid] += shm[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) {
        wsum[col] = (float)shm[0];
    }
}

// Compute sum_j (bias_j - subtract_j) into a single scalar
__global__ void bias_sub_sum_kernel(const float* __restrict__ bias,
                                    const float* __restrict__ subtract,
                                    int has_bias,
                                    int out_features,
                                    float* __restrict__ bs_sum) {
    int tid = threadIdx.x;
    double acc = 0.0;

    for (int j = tid; j < out_features; j += blockDim.x) {
        float b = has_bias ? bias[j] : 0.0f;
        acc += (double)b - (double)subtract[j];
    }

    __shared__ double shm[BLOCK];
    shm[tid] = acc;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            shm[tid] += shm[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) {
        bs_sum[0] = (float)shm[0];
    }
}

// Fused output kernel: computes per-batch-row scalar = GELU(((X * sum_j W_j) + sum_j(bias - subtract)) / out_features)
// Then adds this scalar to every element of the corresponding row of X (residual add)
__global__ void fused_output_kernel(const float* __restrict__ X,
                                    const float* __restrict__ wsum,
                                    const float* __restrict__ bs_sum_ptr,
                                    int B,
                                    int in_features,
                                    int out_features,
                                    float* __restrict__ Out) {
    int b = blockIdx.x;
    if (b >= B) return;

    int tid = threadIdx.x;

    // Dot product of X[b, :] with wsum[:]
    double acc = 0.0;
    for (int i = tid; i < in_features; i += blockDim.x) {
        acc += (double)X[b * in_features + i] * (double)wsum[i];
    }

    __shared__ double shm[BLOCK];
    __shared__ float gelu_scalar;
    shm[tid] = acc;
    __syncthreads();

    // Reduce within block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            shm[tid] += shm[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) {
        float bs_sum = bs_sum_ptr[0];
        double mean = (shm[0] + (double)bs_sum) / (double)out_features;
        float m = (float)mean;
        // Exact GELU using erf, as in PyTorch default
        float v = 0.5f * m * (1.0f + erff(m * 0.70710678118654752440f));
        gelu_scalar = v;
    }
    __syncthreads();

    for (int i = tid; i < in_features; i += blockDim.x) {
        Out[b * in_features + i] = X[b * in_features + i] + gelu_scalar;
    }
}

torch::Tensor fused_forward_cuda(torch::Tensor x,
                                 torch::Tensor weight,
                                 torch::Tensor bias,
                                 torch::Tensor subtract,
                                 bool has_bias) {
// [PK]: Setup omitted for readability

    // Launch kernels
    dim3 block_wsum(BLOCK);
    dim3 grid_wsum(in_features);
    weight_col_sum_kernel<<<grid_wsum, block_wsum, 0, stream.stream()>>>(W_ptr, out_features, in_features, wsum_ptr);

    dim3 block_bsum(BLOCK);
    dim3 grid_bsum(1);
    bias_sub_sum_kernel<<<grid_bsum, block_bsum, 0, stream.stream()>>>(bias_ptr, sub_ptr, has_bias ? 1 : 0, out_features, bs_sum_ptr);

    dim3 block_out(BLOCK);
    dim3 grid_out(B);
    fused_output_kernel<<<grid_out, block_out, 0, stream.stream()>>>(
        X_ptr,
        wsum_ptr,
        bs_sum_ptr,
        B,
        in_features,
        out_features,
        Out_ptr
    );

    return out;
}

// Python/PyTorch-related code omitted for readability

Interestingly enough, none of the kernels computes a GEMM! There is only some form of GEMV in fused_output_kernel.

At this point I was pretty sure it was some form of hallucination or reward hack. The only thing that puzzled me was why it had passed our validation tests. I assumed that MakoraGenerate had exploited some odd input pattern. However, to my surprise, I was not able to find an input where the generated code fails…

So I looked closer at what was going on and realized that the generated code is not wrong. It is brilliant!

Basically, MakoraGenerate was able to determine that for this particular problem the GEMM can be replaced with a GEMV - effectively turning a problem that is (in the general case) compute-bound into a memory-bound one.
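
A quick back-of-the-envelope calculation with the benchmark shapes (my own numbers, not measured) shows how large that shift is:

# Rough arithmetic for the benchmark shapes (my estimate, not measured).
batch_size, in_features, out_features = 2048, 8192, 8192
bytes_per_float = 4

# Original path: a full GEMM costs ~2 * B * in * out FLOPs.
gemm_flops = 2 * batch_size * in_features * out_features         # ~2.7e11 FLOPs

# Generated path: column-sum W once per call, then one GEMV over the batch.
wsum_flops = out_features * in_features                          # ~6.7e7 FLOPs
gemv_flops = 2 * batch_size * in_features                        # ~3.4e7 FLOPs

# The generated kernels are instead dominated by memory traffic:
# reading W once for the column sums, and streaming x in and out for the residual add.
bytes_moved = (out_features * in_features + 2 * batch_size * in_features) * bytes_per_float

print(f"GEMM FLOPs      : {gemm_flops:.1e}")                # ~2.7e+11
print(f"GEMV-path FLOPs : {wsum_flops + gemv_flops:.1e}")   # ~1.0e+08
print(f"GEMV-path bytes : {bytes_moved / 1e6:.0f} MB")      # ~403 MB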

This is possible because the code:

 x = self.gemm(x)
 ...
 x = torch.mean(x, dim=1, keepdim=True)
 ...

can be replaced by:

*Pseudocode:*
gemv(x, gemm_weights_summed_over_out_features) / out_features

This is implemented in fused_output_kernel. The proof is pretty easy - the mean over the output dimension distributes over x @ W.T + bias - subtract, leaving a dot product of x with the column sums of W plus a scalar - so I will skip the details.
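
If you don't feel like doing the algebra, a quick numerical check in PyTorch (my own sketch, with arbitrary small shapes) confirms the identity:

import torch

torch.manual_seed(0)
B, in_f, out_f = 4, 8, 16
x = torch.randn(B, in_f)
linear = torch.nn.Linear(in_f, out_f)
subtract = torch.randn(out_f)

# Original: GEMM, subtract, then mean over the output dimension.
lhs = torch.mean(linear(x) - subtract, dim=1, keepdim=True)

# Rewritten: GEMV against the column sums of W, plus a scalar, divided by out_features.
wsum = linear.weight.sum(dim=0)              # [in_f], summed over out_features
bs_sum = (linear.bias - subtract).sum()      # single scalar
rhs = ((x @ wsum + bs_sum) / out_f).unsqueeze(1)

print(torch.allclose(lhs, rhs, atol=1e-5))   # True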

Note that MakoraGenerate also correctly realized that x = torch.logsumexp(x, dim=1, keepdim=True) is a no-op here, because dimension 1 has size 1 at that point (the logsumexp of a single element is just the element itself).

This is an example of a poorly defined problem. Or, if you look at it from another angle, a cleverly designed one, aimed at measuring your ability to identify algorithmic optimizations. Of course, the generated CUDA code is still far from optimal: it should probably fuse everything into a single kernel, apply vectorized loads/stores, and so on. We are actively working on MakoraGenerate to improve the generated CUDA code, but the main speedup in this case comes from the idea of replacing the GEMM with a GEMV - an opportunity that might otherwise be hard to spot in a real codebase. However, with MakoraGenerate you don't need to think about it - you can get a working version with just one click.

Copyright © 2025 MakoRA. All rights reserved.