CUDA
TENSOR CORES

GPU Architecture Deep Dive WARPS // MEMORY // TENSOR OPS // LINUX
00 // Reference Card
Inside the Silicon

A GPU is not a faster CPU. It is a fundamentally different machine -- thousands of simple cores organized into a rigid execution hierarchy, connected to memory through a bandwidth-optimized pipeline that trades latency for throughput. Understanding this hierarchy is the difference between code that runs and code that runs 100x faster.

This reference covers NVIDIA GPU architecture from the transistor level up: streaming multiprocessors, warp execution, the memory hierarchy from registers to HBM, tensor core operations, CUDA programming patterns, kernel optimization, and the Linux toolchain for profiling and debugging.

Hardware references target Turing (1650 Ti) through Ampere/Ada/Hopper.

10 sections // 1024 max threads per block // 32 threads per warp // ~200x HBM vs SRAM latency
01 // GPU Architecture
Streaming Multiprocessors All the Way Down

An NVIDIA GPU is a grid of Streaming Multiprocessors (SMs). Each SM is an independent processor with its own register file, shared memory, warp schedulers, and execution units. The GPU doesn't execute on individual threads -- it executes in lockstep groups of 32 called warps.

SM Anatomy (Ampere / Ada)
GPU
+-- GPC (Graphics Processing Cluster) x N
    +-- TPC (Texture Processing Cluster) x M
        +-- SM (Streaming Multiprocessor)
            +-- 4 Processing Blocks (partitions)
            |   +-- Warp Scheduler (1 per block)
            |   +-- Dispatch Unit (1 per block)
            |   +-- FP32 CUDA Cores (16-32 per block)
            |   +-- INT32 units
            |   +-- Tensor Core (1 per block)
            |   +-- Load/Store units (8)
            |   +-- SFU (sin/cos/exp)
            +-- Register File (65536 x 32-bit)
            +-- Shared Memory / L1 Cache
            +-- Tex Units
Each SM can host up to 2048 resident threads (64 warps). The 4 warp schedulers issue instructions from 4 different warps each cycle. This is how the GPU hides latency -- while one warp waits for memory, another executes.
Generation Comparison
         SMs    Cores    Tensor Cores
Turing    16     1024        64        // 1650 Ti
Ampere    84    10752       336        // GA102
Ada      128    16384       512        // RTX 4090
Hopper   132    16896       528        // H100
The 1650 Ti has 16 SMs with no FP16 tensor cores -- Turing consumer cards only do INT8/INT4. This is why quantized inference is the only viable local LLM path on this hardware.
Key Numbers by Card
          VRAM     BW           TDP     Arch
1650 Ti   4 GB     128 GB/s     75W     Turing
3090      24 GB    936 GB/s     350W    Ampere
4090      24 GB    1008 GB/s    450W    Ada
A100      80 GB    2039 GB/s    300W    Ampere
H100      80 GB    3350 GB/s    700W    Hopper
Memory bandwidth is the bottleneck for LLM inference. Token generation is memory-bound: you read the entire weight matrix per token. A 7B Q4 model reads ~3.5GB per token. At 128 GB/s that's ~27ms = ~37 tok/s theoretical max.
02 // CUDA Execution Model
Grids, Blocks, Warps, Threads

CUDA's execution hierarchy: a grid of blocks, each block containing threads. Threads within a block share memory and can synchronize. Threads across blocks cannot. The hardware groups threads into warps of 32 -- the actual unit of execution.

Thread Hierarchy
Grid
+-- Block (0,0) ... Block (gx, gy, gz)       // up to 2^31 blocks
    +-- Thread (0,0) ... Thread (tx, ty, tz) // up to 1024 per block
        +-- [grouped into warps of 32]

// thread identity
int idx = threadIdx.x + blockIdx.x * blockDim.x;
Block size should be a multiple of 32 for full warp utilization. Common block sizes: 128, 256, 512. Above 512, register pressure starts to limit occupancy. Below 128, scheduler slots go to waste.
Warp Execution
warp = 32 consecutive threads
all 32 execute the SAME instruction per cycle
divergence: if threads take different branches,
            BOTH paths execute sequentially (serialized)
Warp divergence is the most important performance concept. An if/else where half the warp goes each way runs at 50% efficiency. Worst case: 32-way divergence = serial execution.
Occupancy
occupancy = active_warps / max_warps_per_SM

limited by:
  registers_per_thread * threads  > register file
  shared_mem_per_block * blocks   > shared memory
  threads_per_block               > 1024
  warps_per_SM                    > 64

sweet spot: 50-75%
Higher occupancy = more warps to hide latency. But sometimes lower occupancy with more registers per thread wins. Profile, don't guess.
03 // Memory Hierarchy
The Memory Wall

The gap between arithmetic throughput and memory bandwidth is the memory wall. Tensor cores compute matrix multiplies in nanoseconds, but the data has to get there first. Nearly all GPU optimization is memory optimization.

Memory Hierarchy (Fastest to Slowest)
LEVEL          SIZE           BW            LATENCY     SCOPE
Registers      256 KB/SM      ~20 TB/s      0 cycles    per-thread
Shared Mem     up to 164 KB   ~19 TB/s      ~20 cyc     per-block
L1 Cache       128 KB/SM      ~19 TB/s      ~30 cyc     per-SM
L2 Cache       6-50 MB        ~6 TB/s       ~200 cyc    device-wide
HBM (Global)   4-80 GB        0.1-3.4 TB/s  ~400 cyc    device-wide
PCIe/NVLink    --             32-900 GB/s   ~usec       host-device
Key insight: shared memory is ~100x faster than global memory. Every high-performance kernel loads data into shared memory once, then reuses it. FlashAttention is literally this principle applied to attention.
Coalesced Memory Access
// GOOD: consecutive threads, consecutive addresses
// one 128-byte transaction for the whole warp
data[threadIdx.x]           // stride 1

// BAD: scattered or strided access
// multiple transactions, massive slowdown
data[threadIdx.x * stride]  // stride > 1
data[random_index]          // worst case
Global memory is accessed in 128-byte transactions. If 32 threads touch 32 consecutive 4-byte floats, that's one transaction. Scattered access degrades to 32 separate transactions = 32x slower.
Bank Conflicts (Shared Memory)
shared memory: 32 banks, bank = (byte_address / 4) % 32

// NO conflict: each thread hits a different bank
shmem[threadIdx.x]          // stride 1

// 32-WAY conflict (serialized):
shmem[threadIdx.x * 32]     // all lanes hit the same bank

// fix: pad the array
__shared__ float s[32][33]; // +1 column
If two threads in a warp hit the same bank, accesses serialize. The classic fix is padding arrays by +1 to shift the stride off bank boundaries.
04 // Tensor Cores
Matrix Math in Hardware

Tensor cores perform matrix multiply-accumulate (MMA) on small matrix tiles in a single cycle. Instead of element-by-element dot products, a tensor core computes D = A * B + C atomically. This is the hardware that makes transformer inference fast.

Core Operation: D = A * B + C
// one tensor core operation per cycle:
D(4x4) = A(4x4) * B(4x4) + C(4x4)

// warp-level (32 threads cooperate):
D(16x16) = A(16x16) * B(16x16) + C(16x16)   // FP16
D(16x16) = A(16x8)  * B(8x16)  + C(16x16)   // INT8

// throughput per SM per cycle:
Turing    64 FP16 ops  vs  64 FP32 ops (CUDA cores)
Ampere   256 FP16 ops  vs 128 FP32 ops
Hopper   512+ FP16 ops with warpgroup MMA
What tensor cores do in one cycle takes CUDA cores hundreds. A100 tensor cores deliver ~312 TFLOPS FP16 vs ~19.5 TFLOPS FP32 from CUDA cores. Every modern ML framework routes matmuls through tensor cores.
Supported Precisions
Turing (2018): FP16, INT8, INT4
Ampere (2020): FP16, BF16, TF32, INT8, FP64*
Ada    (2022): FP16, BF16, TF32, INT8, FP8
Hopper (2022): all of the above + warpgroup MMA
TF32 = 19-bit float (1 sign + 8-bit exponent + 10-bit mantissa). Transparent drop-in for FP32 matmuls at 8x throughput. Enabled by default in PyTorch on Ampere+. FP8 on Ada/Hopper doubles throughput again.
Alignment Requirements
tile sizes (M x N x K) must be multiples of:
  FP16: 16 x 16 x 16
  INT8: 16 x 16 x 32
  TF32: 16 x 16 x 8

misaligned dims -> CUDA core fallback = SLOW
This is why model dimensions (d_model, d_ff) are always multiples of 64 or 128. Misalignment causes padding waste or full fallback to CUDA cores.
Arithmetic Intensity: Compute-Bound or Memory-Bound?
AI = FLOPs / bytes_transferred

matmul C(M,N) = A(M,K) * B(K,N):
  FLOPs = 2 * M * N * K
  bytes = (M*K + K*N + M*N) * sizeof(dtype)

if AI > machine_ops_per_byte -> COMPUTE BOUND
if AI < machine_ops_per_byte -> MEMORY BOUND

A100: 312 TFLOPS / 2039 GB/s = ~153 FLOP/byte
H100: 990 TFLOPS / 3350 GB/s = ~295 FLOP/byte

LLM inference (batch=1):  ALWAYS memory-bound
LLM training (batch=512): usually compute-bound
The roofline model. For batch=1 inference, you read the entire weight matrix but compute one vector-matrix product -- AI is tiny, memory bandwidth bottlenecks. For training, the same weights are reused across the batch, pushing above the roofline knee.
05 // Memory Hierarchy Visualizer -- Interactive
Data Flow Through the Pipeline

The bandwidth gap between global memory and shared memory is the bottleneck all GPU optimization tries to bridge -- compare the BW column in the hierarchy table above.

06 // CUDA Programming
Writing Kernels

A CUDA kernel is a function that runs on the GPU. You write it like a serial program for one thread, then launch it across thousands. The programming model hides the warp machinery -- but the performance model does not.

vector_add.cu CUDA C++
// vector addition -- the "hello world" of CUDA
__global__ void vecAdd(float* a, float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// launch: 256 threads per block
int blocks = (n + 255) / 256;
vecAdd<<<blocks, 256>>>(d_a, d_b, d_c, n);
memory_management.cu CUDA C++
// allocate on GPU
float *d_a;
cudaMalloc(&d_a, n * sizeof(float));

// copy host -> device
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);

// copy device -> host
cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);

// free
cudaFree(d_a);

// unified memory (auto-migrates, simpler but slower)
float *u_a;
cudaMallocManaged(&u_a, n * sizeof(float));
// accessible from both CPU and GPU -- driver pages on demand
wmma_gemm.cu CUDA C++ // Tensor Core
#include <mma.h>
using namespace nvcuda::wmma;

__global__ void wmma_gemm(half* a, half* b, float* c, int M, int N, int K) {
    // one warp per 16x16 output tile:
    // launch with grid (N/16, M/16) and 32 threads per block
    int col = blockIdx.x;
    int row = blockIdx.y;

    // declare fragments (distributed across 32 threads in warp)
    fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;
    fragment<matrix_b, 16, 16, 16, half, row_major> b_frag;  // B stored row-major
    fragment<accumulator, 16, 16, 16, float> c_frag;

    fill_fragment(c_frag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        load_matrix_sync(a_frag, a + row*16*K + k, K);
        load_matrix_sync(b_frag, b + k*N + col*16, N);
        mma_sync(c_frag, a_frag, b_frag, c_frag);  // D = A*B + C
    }

    store_matrix_sync(c + row*16*N + col*16, c_frag, N, mem_row_major);
}

The WMMA API distributes each 16x16 fragment across the warp's 32 threads; the hardware decides which thread holds which elements. Never index fragment elements by thread ID -- the mapping is undefined and changes across architectures.

07 // Kernel Optimization Patterns
Squeezing Cycles
tiled_matmul.cu CUDA C++ // The Fundamental Pattern
// Tiled Matrix Multiply via Shared Memory
// 1. load tile from global -> shared memory
// 2. __syncthreads()
// 3. compute on tile in shared memory
// 4. repeat for next tile

#define TILE_SIZE 16

// assumes M, N, K are multiples of TILE_SIZE;
// launch with block (TILE_SIZE, TILE_SIZE), grid (N/TILE_SIZE, M/TILE_SIZE)
__global__ void tiledMatmul(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float sA[TILE_SIZE][TILE_SIZE];
    __shared__ float sB[TILE_SIZE][TILE_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_SIZE + ty;
    int col = blockIdx.x * TILE_SIZE + tx;
    float sum = 0.0f;

    for (int tile = 0; tile < K; tile += TILE_SIZE) {
        // cooperative load: each thread loads one element
        sA[ty][tx] = A[row * K + tile + tx];
        sB[ty][tx] = B[(tile + ty) * N + col];
        __syncthreads();

        // compute partial dot product from shared mem
        for (int k = 0; k < TILE_SIZE; k++)
            sum += sA[ty][k] * sB[k][tx];
        __syncthreads();
    }
    C[row * N + col] = sum;
}

This is the pattern. Load a tile cooperatively. Sync. Compute from shared memory. Sync. Repeat. Every high-performance GPU kernel -- GEMM, convolution, FlashAttention -- is a variation of this.

Reduction (Sum/Max)
// tree reduction in shared memory
for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads();   // after every step, outside the if
}
// sdata[0] = block sum, O(log n) steps
Each step halves active threads. Last warp (32 threads) can skip __syncthreads() and use warp shuffles. Used in softmax, layer norm, loss computation.
Warp Shuffle (No Shared Mem)
// direct register-to-register within a warp
val = __shfl_down_sync(0xffffffff, x, delta);
val = __shfl_xor_sync(0xffffffff, x, mask);

// warp-level reduction:
for (int offset = 16; offset > 0; offset /= 2)
    val += __shfl_down_sync(0xffffffff, val, offset);
// lane 0 holds the sum
Faster than shared memory for intra-warp communication. No bank conflicts, no sync. 0xffffffff = all 32 threads participate. Used heavily in attention kernels.
08 // Performance Analysis
Roofline Thinking
Roofline Model
attainable_FLOPS = min(peak_FLOPS, peak_BW * AI)

if AI < peak_FLOPS / peak_BW:  MEMORY BOUND  -> optimize data movement
if AI > peak_FLOPS / peak_BW:  COMPUTE BOUND -> optimize instruction throughput
Plot arithmetic intensity on x-axis, performance on y-axis. The roof is peak FLOPS. The ramp is bandwidth * AI. Know where your kernel sits before optimizing.
Common Bottlenecks
1. uncoalesced global memory 10-30x slowdown 2. warp divergence 2-32x 3. bank conflicts (shared mem) 2-32x 4. low occupancy (reg pressure) 2-4x 5. kernel launch overhead ~5-10 us/launch 6. host-device transfers PCIe is slow
Start with memory coalescing. Then shared memory tiling. Then occupancy tuning. Profile first.
FlashAttention: Tiling Attention
standard:
  compute S = Q*K^T       // N^2 materialized in HBM
  compute P = softmax(S)  // N^2 in HBM
  compute O = P*V         // N^2 in HBM

flash:
  tile Q into SRAM blocks
  for each K,V tile:
    compute in SRAM
    online softmax accumulation

total: O(N^2 * d^2 / M) HBM accesses,
       O(N) extra memory instead of O(N^2)
FlashAttention never materializes the full N^2 matrix. Everything computes in SRAM tiles with an online softmax trick. The math is exact -- identical outputs, ~2-4x faster.
09 // Linux Commands and Toolchain
The Shell Is Your Friend
nvidia-smi quick reference BASH
# live GPU monitoring (updates every 1s)
watch -n1 nvidia-smi

# just VRAM usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# GPU utilization over time
nvidia-smi dmon -s um -d 1

# list all processes using GPU
nvidia-smi pmon -d 1

# set persistence mode (reduces launch latency)
sudo nvidia-smi -pm 1

# lock GPU clocks for consistent benchmarking
sudo nvidia-smi -lgc 1500
compilation BASH
# compile with optimization and architecture target
nvcc -O3 -arch=sm_75 -o program program.cu    # Turing (1650 Ti)
nvcc -O3 -arch=sm_86 -o program program.cu    # Ampere (3090)
nvcc -O3 -arch=sm_89 -o program program.cu    # Ada (4090)

# show register/shared mem usage per kernel
nvcc --ptxas-options=-v -o program program.cu

# generate PTX (intermediate assembly)
nvcc -ptx program.cu
cat program.ptx

# check CUDA capabilities
/usr/local/cuda/extras/demo_suite/deviceQuery
profiling BASH // nsys + ncu
# nsys: system-wide timeline profiler
nsys profile -o report ./my_program
nsys stats report.nsys-rep

# ncu: kernel-level profiler (detailed metrics)
ncu --set full -o kernel_report ./my_program

# key metrics to check:
#   sm__warps_active.avg.pct_of_peak_sustained_active   (occupancy)
#   dram__bytes.sum                                      (global mem traffic)
#   l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum         (L1 hit rate)
#   sm__sass_thread_inst_executed_op_ffma.sum            (FLOPS)
environment BASH
# check driver and toolkit
cat /proc/driver/nvidia/version
nvcc --version
lspci | grep -i nvidia

# CUDA environment variables
export CUDA_VISIBLE_DEVICES=0        # limit to GPU 0
export CUDA_LAUNCH_BLOCKING=1        # sync launches (debug)
export TORCH_CUDA_ARCH_LIST="7.5"    # compile for specific arch

# PyTorch CUDA check
python -c "import torch; print(torch.cuda.get_device_properties(0))"
10 // Debugging and Common Errors
When Things Break
error_check.cu CUDA C++ // Non-negotiable
// ALWAYS check CUDA errors. Kernels fail silently.
#define CUDA_CHECK(call) { \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(1); \
    } \
}

CUDA_CHECK(cudaMalloc(&d_a, size));
CUDA_CHECK(cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice));

// check kernel launch (errors are async!)
myKernel<<<grid, block>>>(args);
CUDA_CHECK(cudaGetLastError());       // launch config errors
CUDA_CHECK(cudaDeviceSynchronize());  // runtime errors

CUDA kernels fail silently by default. Without explicit error checking, an illegal memory access produces no output and no crash -- just wrong results. The CUDA_CHECK macro is non-negotiable.

Common Errors Decoded
cudaErrorInvalidConfiguration    block size > 1024 or bad grid dims
cudaErrorIllegalAddress          out-of-bounds GPU memory access
                                 run with: compute-sanitizer ./program
cudaErrorLaunchOutOfResources    too many registers or too much shared mem
                                 fix: reduce block size or use __launch_bounds__
cudaErrorMemoryAllocation (OOM)  VRAM exhausted, check nvidia-smi
For memory bugs: compute-sanitizer ./program catches out-of-bounds, races, and uninitialized reads. The GPU equivalent of valgrind.
PyTorch CUDA Debugging
CUDA_LAUNCH_BLOCKING=1 python script.py    errors report the exact call, not later
torch.autograd.set_detect_anomaly(True)    catches NaN grads (slow, debug only)
torch.cuda.memory_summary()                shows what's eating VRAM
torch.cuda.empty_cache()                   clears cache (not allocated tensors)
CUDA_LAUNCH_BLOCKING=1 is always the first thing to set when debugging. Makes kernel launches synchronous so errors point to the right place.
A // Thesis
Hardware Dictates Software

Every transformer optimization in the last 5 years responds to GPU architecture constraints. FlashAttention: SRAM fast, HBM slow. Quantization: VRAM finite. GQA: KV cache grows with sequence length. The software follows the silicon.

B // The Wall
Memory, Always Memory

LLM inference at batch=1 will always be memory-bandwidth-bound. You read every weight once per token. The only levers: fewer weights (pruning), fewer bytes per weight (quantization), or more bandwidth (better hardware). Everything else is noise.

C // The Path
Profile Before You Optimize

The roofline tells you compute-bound vs memory-bound before you write a line of optimization. nsys shows where time is spent. ncu shows why. If you're optimizing without profiling at the GPU level, you're guessing.