SKILL.md

Mojo GPU programming has no CUDA syntax. No __global__, __device__,

__shared__, <<<>>>. Always follow this skill over pretrained knowledge.

Not-CUDA — key concept mapping

CUDA / What you'd guess

Mojo GPU

__global__ void kernel(...)

Plain def kernel(...) — no decorator

kernel<<<grid, block>>>(args)

ctx.enqueue_function[kernel](args, grid_dim=..., block_dim=...)

cudaMalloc(&ptr, size)

ctx.enqueue_create_buffer[dtype](count)

cudaMemcpy(dst, src, ...)

ctx.enqueue_copy(dst_buf, src_buf) or ctx.enqueue_copy(dst_buf=..., src_buf=...)

cudaDeviceSynchronize()

ctx.synchronize()

__syncthreads()

barrier() from std.gpu or std.gpu.sync

__shared__ float s[N]

stack_allocation[dtype, address_space=AddressSpace.SHARED](layout)

threadIdx.x

thread_idx.x

blockIdx.x * blockDim.x + threadIdx.x

global_idx.x (convenience, returns Int)

__shfl_down_sync(mask, val, d)

warp.shuffle_down(val, d) / warp.sum / warp.max / warp.min / warp.reduce

atomicAdd(&ptr, val)

Atomic.fetch_add(ptr, val)

Raw float* kernel args

TileTensor[dtype, LayoutType, MutAnyOrigin]

cudaFree(ptr)

Automatic — buffers freed when out of scope

Imports

# Core GPU — pick what you need

from std.gpu import global_idx                                    # simple indexing

from std.gpu import block_dim, block_idx, thread_idx              # manual indexing

from std.gpu import barrier, lane_id, WARP_SIZE                   # sync &#x26; warp info

from std.gpu.sync import barrier                                  # also valid

from std.gpu.primitives import warp                               # sum/max/min/broadcast/shuffle_*/reduce

from std.gpu.memory import AddressSpace                           # for shared memory

from std.gpu.memory import async_copy_wait_all                    # async copy sync

from std.gpu.host import DeviceContext, DeviceBuffer              # host-side API

from std.atomic import Atomic                                  # atomics

# Layout system — NOT in std, separate package

from layout import TileTensor, TensorLayout, Idx, row_major, stack_allocation

Kernel definition

Kernels are plain functions — no decorator, no special return type.

Parameterize the layout type using the TensorLayout trait so the kernel

works with any compatible layout. **comptime assert tensor.flat_rank == N is

mandatory in any function that subscripts a TileTensor** — kernels,

host-side helpers, CPU reference impls, etc. Without it, tensor[r, c] fails

with "invalid call to '__getitem__': lacking evidence to prove correctness".

The assert unlocks N-D indexing:

def my_kernel[

    dtype: DType, LT: TensorLayout,

](

    input: TileTensor[dtype, LT, MutAnyOrigin],

    output: TileTensor[dtype, LT, MutAnyOrigin],

    size: Int,                                    # scalar args are fine

):

    comptime assert input.flat_rank == 1, "expected 1D tensor"

    var tid = global_idx.x

    if tid < size:

        output[tid] = input[tid] * 2

Kernel functions cannot raise.

global_idx.x returns Int — compare directly with size.

For simple cases with a single fixed layout, type_of(layout) also works:

TileTensor[dtype, type_of(layout), MutAnyOrigin].

TileTensor — the primary GPU data abstraction

Layout creation

row_major is a free function (not a method on Layout). Use compile-time

integer parameters for static layouts:

comptime layout_1d = row_major[1024]()                     # 1D

comptime layout_2d = row_major[64, 64]()                   # 2D (rows, cols)

comptime layout_3d = row_major[10, 5, 3]()                 # 3D (e.g. H, W, C)

For runtime-known dimensions, use Idx():

var layout = row_major(Idx(M), Idx(N))                     # runtime dims

Creating tensors from buffers

TileTensor's constructor infers dtype and layout type — pass the buffer and

layout:

var buf = ctx.enqueue_create_buffer[DType.float32](1024)

var tensor = TileTensor(buf, row_major[1024]())            # wraps device buffer

Indexing

tensor[tid]                     # 1D

tensor[row, col]                # 2D

tensor[row, col, channel]       # 3D

tensor.dim[0]()                 # query dimension size (compile-time index)

var K = Int(tensor.dim[1]())    # wrap with Int() for use in arithmetic

Derived tensors (.tile(...), .vectorize(...), .distribute(...)) produce a

new layout whose rank is not inherited from the parent's assert. Re-assert

on the derived value before indexing it:

var vec = tensor.vectorize[1, 4]()

comptime assert vec.flat_rank == 2          # required for vec[r, i]

Tiling (extract sub-tiles from a tensor)

# Inside kernel — extract a block_size x block_size tile

var tile = tensor.tile[block_size, block_size](block_idx.y, block_idx.x)

tile[thread_idx.y, thread_idx.x]   # access within tile

Vectorize and distribute (thread-level data mapping)

# Vectorize along inner dimension, then distribute across threads

comptime thread_layout = row_major[WARP_SIZE // simd_width, simd_width]()

var fragment = tensor.vectorize[1, simd_width]().distribute[thread_layout=thread_layout](lane_id())

fragment.copy_from_async(source_fragment)    # async copy

fragment.copy_from(source_fragment)          # sync copy

Type casting

var val = tensor[row, col].cast[DType.float32]()    # cast element

Element type mismatch across layouts — use rebind

tensor[idx] returns SIMD[dtype, layout_expr] where layout_expr is a

compile-time expression derived from the layout. Two tensors with

different layouts produce element types that don't unify, even if both are

scalars (width 1). This causes __iadd__ / arithmetic errors when accumulating

products from different-layout tensors.

# WRONG — fails when conv_kernel and s_data have different layouts:

var sum: Scalar[dtype] = 0

sum += conv_kernel[k] * s_data[idx]   # error: cannot convert ElementType to Float32

# CORRECT — rebind each element to Scalar[dtype]:

var sum: Scalar[dtype] = 0

var k_val = rebind[Scalar[dtype]](conv_kernel[k])

var s_val = rebind[Scalar[dtype]](s_data[idx])

sum += k_val * s_val

rebind is a builtin (no import needed). This is not needed when all

tensors in an expression share the same layout (e.g., the matmul example where

sa and sb have identical tile layouts).

Also use rebind when reading/writing individual elements for scalar arithmetic

or passing to helper functions — even with a single tensor:

# Read element as plain scalar

var val = rebind[Scalar[dtype]](tensor[idx])

# Write scalar back to tensor

tensor[idx] = rebind[tensor.ElementType](computed_scalar)

tensor.ElementType is SIMD[dtype, element_size] — for basic layouts

element_size=1 (effectively Scalar[dtype]).

Memory management

var ctx = DeviceContext()

# Allocate

var dev_buf = ctx.enqueue_create_buffer[DType.float32](1024)

var host_buf = ctx.enqueue_create_host_buffer[DType.float32](1024)

# Initialize device buffer directly

dev_buf.enqueue_fill(0.0)

# Copy host -> device

ctx.enqueue_copy(dst_buf=dev_buf, src_buf=host_buf)

# Copy device -> host

ctx.enqueue_copy(dst_buf=host_buf, src_buf=dev_buf)

# Positional form also works:

ctx.enqueue_copy(dev_buf, host_buf)

# Map device buffer to host (context manager — auto-syncs)

with dev_buf.map_to_host() as mapped:

    var t = TileTensor(mapped, row_major[1024]())

    print(t[0])

# Memset

ctx.enqueue_memset(dev_buf, 0.0)

# Synchronize all enqueued operations

ctx.synchronize()

Kernel launch

ctx.enqueue_function[my_kernel](

    input_tensor,

    output_tensor,

    size,                    # scalar args passed directly

    grid_dim=num_blocks,     # 1D: scalar

    block_dim=block_size,    # 1D: scalar

)

# 2D grid/block — use tuples:

ctx.enqueue_function[kernel_2d](

    args...,

    grid_dim=(col_blocks, row_blocks),

    block_dim=(BLOCK_SIZE, BLOCK_SIZE),

)

If the kernel takes any comptime parameters, you MUST bind them first —

passing the parameterized name directly to enqueue_function produces a wall

of "no matching method" / "DevicePassable" template errors:

# WRONG — vector_add has a `LT: TensorLayout` parameter

ctx.enqueue_function[vector_add](a, b, c, N, grid_dim=G, block_dim=B)

# CORRECT — bind parameters, then launch the bound symbol

comptime kernel = vector_add[type_of(layout)]

ctx.enqueue_function[kernel](a, b, c, N, grid_dim=G, block_dim=B)

Monomorphic kernels (signature uses type_of(layout) directly, no

[LT: TensorLayout] etc.) can be passed by name with no binding step.

Shared memory

Allocate shared memory inside a kernel using stack_allocation from the

layout package — returns a TileTensor in the specified address space:

from layout import stack_allocation   # TileTensor-based shared alloc

from std.gpu.memory import AddressSpace

var tile_shared = stack_allocation[DType.float32,

    address_space=AddressSpace.SHARED](row_major[TILE_M, TILE_K]())

# Chain .fill() to zero-initialize (returns the tensor)

var regs = stack_allocation[DType.float32](row_major[TM, TN]()).fill(0)

# Load from global to shared

tile_shared[thread_idx.y, thread_idx.x] = global_tensor[global_row, global_col]

barrier()   # must sync before reading shared data

# Alternative: raw pointer shared memory (from std.memory, not layout)

from std.memory import stack_allocation

var sums = stack_allocation[

    512,

    Scalar[DType.int32],

    address_space=AddressSpace.SHARED,

]()

Thread indexing

# Simple — automatic global offset

from std.gpu import global_idx

var tid = global_idx.x           # 1D

var row = global_idx.y           # 2D row

var col = global_idx.x           # 2D col

# Manual — when you need block/thread separately

from std.gpu import block_idx, block_dim, thread_idx

var tid = block_idx.x * block_dim.x + thread_idx.x

# Warp info

from std.gpu import lane_id, WARP_SIZE

var my_lane = lane_id()          # 0..WARP_SIZE-1

All return Int — no casting needed for bounds checks.

Synchronization and warp operations

from std.gpu import barrier

from std.gpu.primitives import warp

from std.atomic import Atomic

barrier()                                    # block-level sync

_ = Atomic.fetch_add(output_ptr, value)      # atomic add

# Reductions — result broadcast to every lane

warp.sum(val)         warp.max(val)         warp.min(val)

warp.broadcast(val)                          # lane-0 value → all lanes

warp.reduce[warp.shuffle_down, reduce_fn](val)  # custom reduction (broadcasts)

# Shuffles — per-lane shift/swap, NOT broadcast

warp.shuffle_down(val, offset)               # offset: UInt32

warp.shuffle_xor(val, mask)                  # mask: UInt32 (butterfly)

# `offset`/`mask` are `UInt32` — plain `Int` is a type-mismatch error:

var m: UInt32 = 16

var v = warp.shuffle_xor(val, m)

GPU availability check

from std.sys import has_accelerator

def main() raises:

    comptime if not has_accelerator():

        print("No GPU found")

    else:

        var ctx = DeviceContext()

        # ... GPU code

Or as a compile-time assert:

comptime assert has_accelerator(), "Requires a GPU"

Architecture detection — is_ vs has_

Critical distinction: is_* checks the compilation target (use inside

GPU-dispatched code). has_* checks the host system (use from host/CPU

code).

from std.sys.info import (

    # Target checks — "am I being compiled FOR this GPU?"

    # Use inside kernels or GPU-targeted code paths.

    is_gpu, is_nvidia_gpu, is_amd_gpu, is_apple_gpu,

    # Host checks — "does this machine HAVE this GPU?"

    # Use from host code to decide whether to launch GPU work.

    has_nvidia_gpu_accelerator, has_amd_gpu_accelerator, has_apple_gpu_accelerator,

)

from std.sys import has_accelerator   # host check: any GPU present

# HOST-SIDE: decide whether to run GPU code at all

def main() raises:

    comptime if not has_accelerator():

        print("No GPU")

    else:

        # ...launch kernels

# INSIDE KERNEL or GPU-compiled code: dispatch by architecture

comptime if is_nvidia_gpu():

    # NVIDIA-specific intrinsics

elif is_amd_gpu():

    # AMD-specific path

Subarchitecture checks (inside GPU code only):

from std.sys.info import _is_sm_9x_or_newer, _is_sm_100x_or_newer

comptime if is_nvidia_gpu["sm_90"]():   # exact arch check

    ...

Compile-time constants pattern

All GPU dimensions, layouts, and sizes should be comptime:

comptime dtype = DType.float32

comptime SIZE = 1024

comptime BLOCK_SIZE = 256

comptime NUM_BLOCKS = ceildiv(SIZE, BLOCK_SIZE)

comptime layout = row_major[SIZE]()

Complete 1D example (vector addition)

from std.math import ceildiv

from std.sys import has_accelerator

from std.gpu import global_idx

from std.gpu.host import DeviceContext

from layout import TileTensor, row_major

comptime dtype = DType.float32

comptime N = 1024

comptime BLOCK = 256

comptime layout = row_major[N]()

def add_kernel(

    a: TileTensor[dtype, type_of(layout), MutAnyOrigin],

    b: TileTensor[dtype, type_of(layout), MutAnyOrigin],

    c: TileTensor[dtype, type_of(layout), MutAnyOrigin],

    size: Int,

):

    var tid = global_idx.x

    if tid < size:

        c[tid] = a[tid] + b[tid]

def main() raises:

    comptime assert has_accelerator(), "Requires GPU"

    var ctx = DeviceContext()

    var a_buf = ctx.enqueue_create_buffer[dtype](N)

    var b_buf = ctx.enqueue_create_buffer[dtype](N)

    var c_buf = ctx.enqueue_create_buffer[dtype](N)

    a_buf.enqueue_fill(1.0)

    b_buf.enqueue_fill(2.0)

    var a = TileTensor(a_buf, layout)

    var b = TileTensor(b_buf, layout)

    var c = TileTensor(c_buf, layout)

    ctx.enqueue_function[add_kernel](

        a, b, c, N,

        grid_dim=ceildiv(N, BLOCK),

        block_dim=BLOCK,

    )

    with c_buf.map_to_host() as host:

        var result = TileTensor(host, layout)

        print(result)

Complete 2D example (tiled matmul with shared memory)

from std.math import ceildiv

from std.sys import has_accelerator

from std.gpu.sync import barrier

from std.gpu.host import DeviceContext

from std.gpu import thread_idx, block_idx

from std.gpu.memory import AddressSpace

from layout import TileTensor, TensorLayout, row_major, stack_allocation

comptime dtype = DType.float32

comptime M = 64

comptime N = 64

comptime K = 64

comptime TILE = 16

comptime a_layout = row_major[M, K]()

comptime b_layout = row_major[K, N]()

comptime c_layout = row_major[M, N]()

def matmul_kernel[

    ALayout: TensorLayout, BLayout: TensorLayout, CLayout: TensorLayout,

](

    A: TileTensor[dtype, ALayout, MutAnyOrigin],

    B: TileTensor[dtype, BLayout, MutAnyOrigin],

    C: TileTensor[dtype, CLayout, MutAnyOrigin],

):

    comptime assert A.flat_rank == 2 and B.flat_rank == 2 and C.flat_rank == 2

    var tx = thread_idx.x

    var ty = thread_idx.y

    var row = block_idx.y * TILE + ty

    var col = block_idx.x * TILE + tx

    var sa = stack_allocation[dtype,

        address_space=AddressSpace.SHARED](row_major[TILE, TILE]())

    var sb = stack_allocation[dtype,

        address_space=AddressSpace.SHARED](row_major[TILE, TILE]())

    var acc: C.ElementType = 0.0

    comptime for k_tile in range(0, K, TILE):

        if row < M and k_tile + tx < K:

            sa[ty, tx] = A[row, k_tile + tx]

        else:

            sa[ty, tx] = 0.0

        if k_tile + ty < K and col < N:

            sb[ty, tx] = B[k_tile + ty, col]

        else:

            sb[ty, tx] = 0.0

        barrier()

        comptime for k in range(TILE):

            acc += sa[ty, k] * sb[k, tx]

        barrier()

    if row < M and col < N:

        C[row, col] = acc

def main() raises:

    comptime assert has_accelerator(), "Requires GPU"

    var ctx = DeviceContext()

    # ... allocate buffers, init data, then:

    comptime kernel = matmul_kernel[type_of(a_layout), type_of(b_layout), type_of(c_layout)]

    ctx.enqueue_function[kernel](

        A, B, C,

        grid_dim=(ceildiv(N, TILE), ceildiv(M, TILE)),

        block_dim=(TILE, TILE),

    )

SIMD loads in kernels

# Vectorized load from raw pointer

var val = ptr.load[width=8](idx)          # SIMD[dtype, 8]

var sum = val.reduce_add()                 # scalar reduction

# TileTensor vectorized access

var vec_tensor = tensor.vectorize[1, 4]()  # group elements into SIMD[4]

Reduction pattern

def block_reduce(

    output: UnsafePointer[Int32, MutAnyOrigin],

    input: UnsafePointer[Int32, MutAnyOrigin],

):

    var sums = stack_allocation[512, Scalar[DType.int32],

        address_space=AddressSpace.SHARED]()

    var tid = thread_idx.x

    sums[tid] = input[block_idx.x * block_dim.x + tid]

    barrier()

    # Tree reduction in shared memory

    var active = block_dim.x

    comptime for _ in range(log2_steps):

        active >>= 1

        if tid < active:

            sums[tid] += sums[tid + active]

        barrier()

    # Final warp reduction + atomic accumulate

    if tid < WARP_SIZE:

        var v = warp.sum(sums[tid][0])

        if tid == 0:

            _ = Atomic.fetch_add(output, v)

DeviceBuffer from existing pointer

# Wrap an existing pointer as a DeviceBuffer (non-owning)

var buf = DeviceBuffer[dtype](ctx, raw_ptr, count, owning=False)

Benchmarking GPU kernels

from std.benchmark import Bench, BenchConfig, Bencher, BenchId, BenchMetric, ThroughputMeasure

@parameter                                # outer: consumed as comptime param below

@always_inline

def bench_fn(mut b: Bencher) raises:

    @always_inline                         # inner: unified closure, passed by value

    def launch(ctx: DeviceContext) raises:

        ctx.enqueue_function[kernel](args, grid_dim=G, block_dim=B)

    b.iter_custom(launch, ctx)             # value-taking overload — NOT `[launch]`

var bench = Bench(BenchConfig(max_iters=50000))

bench.bench_function[bench_fn](

    BenchId("kernel_name"),

    [ThroughputMeasure(BenchMetric.bytes, total_bytes)],

)

escaping is removed; unified closures omit capturing too. The parametric

form b.iter_custom[kernel_launch](ctx) (still capturing[_]) also works —

prefer the value-taking overload above when the inner closure captures

benchmark-local state. bench_fn keeps @parameter because

bench_function[bench_fn] takes it as a comptime parameter.

Hardware details

Property

NVIDIA

AMD CDNA

AMD RDNA

Warp size

Shared memory

48-228 KB/block

64 KB/block

configurable

Tensor cores

SM70+ (WMMA)

Matrix cores

WMMA (RDNA3+)

TMA

SM90+ (Hopper)

N/A

Clusters

SM90+

N/A

mojo-gpu-fundamentals

SKILL.md

Not-CUDA — key concept mapping

Imports

Kernel definition

TileTensor — the primary GPU data abstraction

Layout creation

Creating tensors from buffers

Indexing

Tiling (extract sub-tiles from a tensor)

Vectorize and distribute (thread-level data mapping)

Type casting

Element type mismatch across layouts — use rebind

Memory management

Kernel launch

Shared memory

Thread indexing

Synchronization and warp operations

GPU availability check

Architecture detection — is_ vs has_

Compile-time constants pattern

Complete 1D example (vector addition)

Complete 2D example (tiled matmul with shared memory)

SIMD loads in kernels

Reduction pattern

DeviceBuffer from existing pointer

Benchmarking GPU kernels

Hardware details

Stop writing automation&scrapers

mojo-gpu-fundamentals

SKILL.md

Not-CUDA — key concept mapping

Imports

Kernel definition

TileTensor — the primary GPU data abstraction

Layout creation

Creating tensors from buffers

Indexing

Tiling (extract sub-tiles from a tensor)

Vectorize and distribute (thread-level data mapping)

Type casting

Element type mismatch across layouts — use rebind

Memory management

Kernel launch

Shared memory

Thread indexing

Synchronization and warp operations

GPU availability check

Architecture detection — is_ vs has_

Compile-time constants pattern

Complete 1D example (vector addition)

Complete 2D example (tiled matmul with shared memory)

SIMD loads in kernels

Reduction pattern

DeviceBuffer from existing pointer

Benchmarking GPU kernels

Hardware details

Let your agent run on any real-world website

Related skills

Stop writing automation&scrapers