A GPU is not a faster CPU. It is a fundamentally different machine -- thousands of simple cores organized into a rigid execution hierarchy, connected to memory through a bandwidth-optimized pipeline that trades latency for throughput. Understanding this hierarchy is the difference between code that runs and code that runs 100x faster.
This reference covers NVIDIA GPU architecture from the transistor level up: streaming multiprocessors, warp execution, the memory hierarchy from registers to HBM, tensor core operations, CUDA programming patterns, kernel optimization, and the Linux toolchain for profiling and debugging.
Hardware references target Turing (1650 Ti) through Ampere/Ada/Hopper.
An NVIDIA GPU is a grid of Streaming Multiprocessors (SMs). Each SM is an independent processor with its own register file, shared memory, warp schedulers, and execution units. The GPU doesn't execute on individual threads -- it executes in lockstep groups of 32 called warps.
CUDA's execution hierarchy: a grid of blocks, each block containing threads. Threads within a block share memory and can synchronize. Threads across blocks cannot. The hardware groups threads into warps of 32 -- the actual unit of execution.
The gap between arithmetic throughput and memory bandwidth is the memory wall. Tensor cores compute matrix multiplies in nanoseconds, but the data has to get there first. Nearly all GPU optimization is memory optimization.
Tensor cores perform matrix multiply-accumulate (MMA) on small matrix tiles in a handful of clock cycles. Instead of element-by-element dot products, a tensor core computes D = A * B + C as a single warp-wide operation. This is the hardware that makes transformer inference fast.
The bandwidth gap between global memory and shared memory is the bottleneck all GPU optimization tries to bridge.
A CUDA kernel is a function that runs on the GPU. You write it like a serial program for one thread, then launch it across thousands. The programming model hides the warp machinery -- but the performance model does not.
```cuda
// vector addition -- the "hello world" of CUDA
__global__ void vecAdd(float* a, float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// launch: 256 threads per block
int blocks = (n + 255) / 256;
vecAdd<<<blocks, 256>>>(d_a, d_b, d_c, n);
```
```cuda
// allocate on GPU
float *d_a;
cudaMalloc(&d_a, n * sizeof(float));

// copy host -> device
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);

// copy device -> host
cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);

// free
cudaFree(d_a);

// unified memory (auto-migrates, simpler but slower)
float *u_a;
cudaMallocManaged(&u_a, n * sizeof(float));
// accessible from both CPU and GPU -- driver pages on demand
```
```cuda
#include <mma.h>
using namespace nvcuda::wmma;

__global__ void wmma_gemm(half* a, half* b, float* c, int M, int N, int K) {
    // one warp computes one 16x16 tile of C
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int row = warpId / (N / 16);  // tile row in C
    int col = warpId % (N / 16);  // tile column in C

    // declare fragments (distributed across 32 threads in warp)
    fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;
    fragment<matrix_b, 16, 16, 16, half, row_major> b_frag;
    fragment<accumulator, 16, 16, 16, float> c_frag;
    fill_fragment(c_frag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        load_matrix_sync(a_frag, a + row*16*K + k, K);
        load_matrix_sync(b_frag, b + k*N + col*16, N);
        mma_sync(c_frag, a_frag, b_frag, c_frag);  // D = A*B + C
    }
    store_matrix_sync(c + row*16*N + col*16, c_frag, N, mem_row_major);
}
```
The WMMA API distributes the 16x16 fragment across the 32 warp threads. You never index individual elements -- the hardware decides which thread holds what. Do not access fragment elements by thread ID -- the mapping is undefined and changes across architectures.
```cuda
// Tiled Matrix Multiply via Shared Memory
// 1. load tile from global -> shared memory
// 2. __syncthreads()
// 3. compute on tile in shared memory
// 4. repeat for next tile
__shared__ float sA[TILE_SIZE][TILE_SIZE];
__shared__ float sB[TILE_SIZE][TILE_SIZE];
float sum = 0.0f;

for (int tile = 0; tile < K; tile += TILE_SIZE) {
    // cooperative load: each thread loads one element
    sA[ty][tx] = A[row * K + tile + tx];
    sB[ty][tx] = B[(tile + ty) * N + col];
    __syncthreads();

    // compute partial dot product from shared mem
    for (int k = 0; k < TILE_SIZE; k++)
        sum += sA[ty][k] * sB[k][tx];
    __syncthreads();
}
C[row * N + col] = sum;
```
This is the pattern. Load a tile cooperatively. Sync. Compute from shared memory. Sync. Repeat. Every high-performance GPU kernel -- GEMM, convolution, FlashAttention -- is a variation of this.
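The same blocking structure can be sketched on the CPU without any CUDA machinery -- the point is the K-loop split into tiles with partial accumulation, not the hardware. Purely illustrative (on a CPU the payoff is cache reuse, not shared memory); `TILE` and the loop nest are assumptions of this sketch:

```cpp
#include <vector>

// CPU sketch of the tiling pattern: block the K loop into TILE-sized
// chunks and accumulate partial dot products, mirroring the
// global -> shared -> compute structure of the GPU kernel.
const int TILE = 4;

void tiledMatmul(const std::vector<float>& A, const std::vector<float>& B,
                 std::vector<float>& C, int M, int N, int K) {
    for (int tile = 0; tile < K; tile += TILE)          // one "tile pass"
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++) {
                float sum = 0.0f;
                for (int k = tile; k < tile + TILE; k++)
                    sum += A[i * K + k] * B[k * N + j]; // partial dot product
                C[i * N + j] += sum;                    // accumulate across tiles
            }
}
```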
```sh
# live GPU monitoring (updates every 1s)
watch -n1 nvidia-smi

# just VRAM usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# GPU utilization over time
nvidia-smi dmon -s um -d 1

# list all processes using GPU
nvidia-smi pmon -d 1

# set persistence mode (reduces launch latency)
sudo nvidia-smi -pm 1

# lock GPU clocks for consistent benchmarking
sudo nvidia-smi -lgc 1500
```
```sh
# compile with optimization and architecture target
nvcc -O3 -arch=sm_75 -o program program.cu   # Turing (1650 Ti)
nvcc -O3 -arch=sm_86 -o program program.cu   # Ampere (3090)
nvcc -O3 -arch=sm_89 -o program program.cu   # Ada (4090)

# show register/shared mem usage per kernel
nvcc --ptxas-options=-v -o program program.cu

# generate PTX (intermediate assembly)
nvcc -ptx program.cu
cat program.ptx

# check CUDA capabilities
/usr/local/cuda/extras/demo_suite/deviceQuery
```
```sh
# nsys: system-wide timeline profiler
nsys profile -o report ./my_program
nsys stats report.nsys-rep

# ncu: kernel-level profiler (detailed metrics)
ncu --set full -o kernel_report ./my_program

# key metrics to check:
# sm__warps_active.avg.pct_of_peak_sustained_active   (occupancy)
# dram__bytes.sum                                     (global mem traffic)
# l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum        (global load bytes through L1)
# sm__sass_thread_inst_executed_op_ffma.sum           (FFMA instruction count)
```
```sh
# check driver and toolkit
cat /proc/driver/nvidia/version
nvcc --version
lspci | grep -i nvidia

# CUDA environment variables
export CUDA_VISIBLE_DEVICES=0        # limit to GPU 0
export CUDA_LAUNCH_BLOCKING=1        # sync launches (debug)
export TORCH_CUDA_ARCH_LIST="7.5"    # compile for specific arch

# PyTorch CUDA check
python -c "import torch; print(torch.cuda.get_device_properties(0))"
```
```cuda
// ALWAYS check CUDA errors. Kernels fail silently.
// do/while(0) makes the macro safe inside if/else without braces.
#define CUDA_CHECK(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(1); \
    } \
} while (0)

CUDA_CHECK(cudaMalloc(&d_a, size));
CUDA_CHECK(cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice));

// check kernel launch (errors are async!)
myKernel<<<grid, block>>>(args);
CUDA_CHECK(cudaGetLastError());       // launch config errors
CUDA_CHECK(cudaDeviceSynchronize());  // runtime errors
```
CUDA kernels fail silently by default. Without explicit error checking, an illegal memory access produces no output and no crash -- just wrong results. The CUDA_CHECK macro is non-negotiable.
Every transformer optimization in the last 5 years responds to GPU architecture constraints. FlashAttention: SRAM fast, HBM slow. Quantization: VRAM finite. GQA: KV cache grows with sequence length. The software follows the silicon.
LLM inference at batch=1 will always be memory-bandwidth-bound. You read every weight once per token. The only levers: fewer weights (pruning), fewer bytes per weight (quantization), or more bandwidth (better hardware). Everything else is noise.
The roofline model tells you whether a kernel is compute-bound or memory-bound before you write a line of optimization. nsys shows where time is spent; ncu shows why. If you're optimizing without profiling at the GPU level, you're guessing.