CUDA

Some Good Stuff to Read:

  - Course on CUDA Programming on NVIDIA GPUs, Nov 28 - Dec 9, 2022, at UT Austin - the website is offline, but you can find my notes below and the slides are in Here
  - COM4521/COM6521 Parallel Computing with GPUs (Course) - very in-depth coverage of CUDA and OpenMP programming
  - Computer Architecture (ETH Zurich, Fall 2017) - course website - youtube
  - Youtube CUDA Crash Course by CoffeeBeforeArch
  - Nvidia Official CUDA Programming Guide
  - CS5984: Advanced Computer Graphics: Parallel Computing and Visualization on GPU (VirginiaTech) - mostly computer graphics and GPU - course website - course slides
  - Nvidia CUDA Training Resources
  - ACMS 40212/60212: Advanced Scientific Computing (Notre Dame) - mostly MPI, with CUDA briefly covered in 2 lectures - course website
  - ECE1742S: Programming Massively Parallel Multiprocessors Using CUDA (UToronto) - this one isn't so good - course website - course slides or local repo

GPU is about 9 times faster than CPU (based on the memory-bandwidth difference between a P100 and an E7-8894v4). But since CPU code is usually poorly optimized while GPU code tends to be written with performance in mind, people often experience an even larger speedup from the GPU in practice.

Original Design of CPU: make a single thread fast (reduce latency through caches and branch prediction)

Original Design of GPU: throughput matters (hide memory latency through parallelism, let the programmer deal with the raw storage hierarchy, avoid high clock frequencies)

Graphics Processing Clusters (GPC): a computer cluster in which each node is equipped with a Graphics Processing Unit (GPU)

Nvidia GPU Architecture

Introduction to GPU, SIMD, SIMT

Youtube: An Intro to GPU Architecture and Programming Models | Tim Warburton, Virginia Tech

Sample GPU Architecture with 16 cores

SIMD

SIMD/T (Single Instruction, Multiple Data/Thread): unlike a CPU, there is no scalar execution unit; only vectorized (matrix-like) operations are allowed. One instruction transforms multiple pieces of data. In each SM, a group of 32 ALUs (a warp of them) executes the same instruction simultaneously, but on different data.

GPU cores:

GPU limits: parallel computation is fast, but it is limited by memory bandwidth (feeding data into the GPU is slow). To achieve high FLOPS, each piece of data loaded usually needs to be reused by about 8 instructions.
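This bandwidth limit can be phrased as a roofline-style condition (a generic back-of-the-envelope formula, not specific to any particular GPU):

\begin{align*} \text{compute-bound only if } \frac{\text{FLOPs performed}}{\text{bytes moved to/from DRAM}} \;\ge\; \frac{\text{peak FLOP/s}}{\text{peak memory bandwidth (bytes/s)}} \end{align*}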

GPU Architecture

Thread, Kernel, Blocks, Cluster, Grid

Terminologies:

Host/Device, Grid, Block, Thread. Programmer's perspective on memory.

Memory Hierarchy

Memory: (ranked by speed)

  1. register memory: registers, stack-based memory, only accessible within the current thread. In most cases, accessing a register consumes zero extra clock cycles per instruction.
    1. However, delays can occur due to read after write dependencies and bank conflicts. The latency of read after write dependencies is roughly 24 clock cycles. For newer CUDA devices that have 32 cores per multiprocessor, it may take up to 768 threads to completely hide latency.
    2. In addition to the read after write latency, register pressure can severely detract from the performance of the application. Register pressure occurs when there are not enough registers available for a given task. When this occurs, the data is “spilled over” using local memory.
  2. shared memory: memory created using the __shared__ keyword and shared within the block. (Shared memory is as fast as registers if there are no bank conflicts)
  3. __constant__ memory: read only data. Using constant rather than global memory can reduce the required memory bandwidth, however, this performance gain can only be realized when a warp of threads read the same location. (1…10s … 100s of cycles, depending on cache locality)
    1. initialized by the host code using cudaMemcpyToSymbol(), cudaMemcpyFromSymbol() or cudaMemcpy() in combination with cudaGetSymbolAddress()
    2. Only 64KB of constant memory
    3. Does not tie up registers, so we can save register count to prevent spilling
    4. Don't forget the f at the end if you want single precision (since single precision \times double precision = double precision)
  4. texture memory: read-only memory on the device. When all reads in a warp are physically adjacent, using texture memory can reduce memory traffic and increase performance compared to global memory. (1…10s … 100s of cycles, depending on cache locality)
  5. local memory: usually used (automatically) when one runs out of SM resources. Visible to current thread. Bytes are stored in global memory, but they are cached in L1. (Slow DRAM)
  6. global memory: GPU-wide (per-application) memory, usually passed in as parameter. (100x slower than shared memory) (Slow DRAM)
  7. Instruction Memory (invisible): DRAM, cached

Memory Space:

Example of using Constant and Global (__device__) Memory

__constant__ float constData[256];
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));   // host -> constant memory
cudaMemcpyFromSymbol(data, constData, sizeof(data)); // constant memory -> host

__device__ float devData;
float value = 3.14f;
cudaMemcpyToSymbol(devData, &value, sizeof(float));  // host -> global (__device__) variable

__device__ float* devPointer;
float* ptr;
cudaMalloc(&ptr, 256 * sizeof(float));
cudaMemcpyToSymbol(devPointer, &ptr, sizeof(ptr));   // store the allocated device address in devPointer

For detail about memory, read here

GPU Core Consists of:

GPU Address Space: a 20-bit address space: 5 bits represent the 32 threads of a warp, 3 bits represent 8 warps per block (if the programmer chooses so), and the remaining 12 bits identify the block.

Global Dispatcher

\begin{align*}
2^{20} \text{ data} &= 4096 \text{ blocks} \\
1 \text{ block} &= 8 \text{ warps} \tag{configured by specifying the number of threads in a block} \\
1 \text{ warp} &= 32 \text{ threads} \tag{fixed for every GPU} \\
1 \text{ half-warp} &= 16 \text{ threads} \tag{also fixed}
\end{align*}

How it works: (assume 8 warps in a block for now)

  1. The programmer splits d pieces of data into b blocks, so a block will contain d/b threads, or d/b/8 warps. One SM (streaming multiprocessor) core can only store 1 block at a time.
  2. Global dispatcher send 1 block to 1 core, depending on how many cores we have.
  3. Since a block can be broken down into d/b/8 warps, at maximum a clock cycle allows the GPU to execute 1 warp (32 threads). However, since we only have 16 int32 units, assuming our data are integers, we need 2 GPU cycles to finish 1 warp. Within a block, data are pipelined using derivatives of the Tomasulo algorithm. Therefore, in this example, 16 threads (a half-warp) are synchronously parallel, 1 warp is asynchronously parallel through pipelining, and b blocks are synchronously parallel across multiple cores.
  4. Since 16 threads are synchronous, there is no branching within these 16 threads. But you can still branch between different groups of 16 threads.
    1. Warp divergence: both branches are executed (with inactive threads masked), but you don't need to worry about illegal memory accesses in the branch whose condition you already checked; masked threads do not perform the access. (See the sketch after this list.)
    2. Warp voting: if the branches are big, the nvcc compiler inserts code to check whether all threads in the warp take the same branch and then branches accordingly. (not very costly)
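A minimal sketch of what warp divergence looks like in code (hypothetical kernel; data and N are just placeholders):

// Threads 0-15 of each warp take one branch and threads 16-31 take the other.
// The two paths are executed one after the other with the inactive threads
// masked, so the warp pays the cost of both branches.
__global__ void divergentKernel(float *data, int N) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= N) return;
  if ((threadIdx.x % 32) < 16) {
    data[tid] = data[tid] * 2.0f; // first half-warp
  } else {
    data[tid] = data[tid] + 1.0f; // second half-warp
  }
}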

Suppose we have 1000 blocks, and each one has 128 threads - how does it get executed?

On Ampere hardware, you would probably get 8-12 blocks running at the same time on each SM; since each block has 4 warps, that is 32-48 warps resident on each SM.

Each clock tick, the SM warp scheduler decides which warps to execute next, choosing from those not waiting for:

  - data coming from device memory (memory latency)
  - completion of earlier instructions (pipeline delay)

Pipelining in CPU

5 stages of every instruction in processor diagram

Pipelining in CPU

Since every instruction is broken up into fetch, decode, execute, memory access, and register write stages, we can pipeline multiple instructions one after the other.

Pipelining

| Cycle | Fetch | Decode | ALU | Mem | Wr Reg. |
|---|---|---|---|---|---|
| C1 | Add R3,R2,1 | Add R2,R1,1 | Ldr R1,0x.. | ... | ... |
| C2 | Add R3,R2,1 | Add R2,R1,1 | /stall/ | Ldr R1,0x.. | ... |
| C3 | Add R3,R2,1 | Add R2,R1,1 | /stall/ | /stall/ | Ldr R1,0x.. |
| C4 | Jmp 0x.. | Add R3,R2,1 | Add R2,R1,1 | /stall/ | /stall/ |
| C5 | WRONG INST | Jmp 0x.. | Add R3,R2,1 | Add R2,R1,1 | /stall/ |
| C6 | WRONG INST | WRONG INST | Jmp 0x.. | Add R3,R2,1 | Add R2,R1,1 |
| C7 | CORRECT | /flush/ | /flush/ | Jmp 0x.. | Add R3,R2,1 |

Above is an example of pipelined execution where instructions move from left to right over the cycles. Notice in C2 the Add R2,R1,1 instruction depends on the value of register R1 after loading from memory, so the Add cannot be performed at the same time as Ldr R1,0x..; the CPU stalls the Add R2,R1,1 instruction. Also notice in C5 and C6 the instruction fetched from memory depends on the result of Jmp 0x.., so C6 may contain wrong instructions. If that is the case, once Jmp 0x.. executes, the wrong instructions are flushed in C7. Stalls and flushes are two results of dependencies in pipelining and both slow down the process.

Programmer's Abstraction

Grid Dimension and Block Dimension

| NAME | DESCRIPTION | x | y | z |
|---|---|---|---|---|
| threadIdx | Thread index within the block (zero-based) | threadIdx.x | threadIdx.y | threadIdx.z |
| blockIdx | Block index within the grid (zero-based) | blockIdx.x | blockIdx.y | blockIdx.z |
| blockDim | Block dimensions in threads | blockDim.x | blockDim.y | blockDim.z |
| gridDim | Grid dimensions in blocks | gridDim.x | gridDim.y | gridDim.z |

threadIdx, blockIdx, blockDim, gridDim

In GPU programming, we only specify the number of threads in a block (the block dimension), and the hardware automatically divides those threads into warps (which execute instructions in lock step). One block is assigned to one shader core (SM).

Note that there is a maximum number of threads in a block and a maximum number of blocks in a grid. The limits depend on the specific GPU architecture.
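A small self-contained sketch of a 2D launch configuration (fill2D, W, and H are made-up names; the rounding-up idiom covers sizes that are not multiples of the block dimensions):

#include <cuda_runtime.h>

// One thread per element of a W x H image.
__global__ void fill2D(float *img, int W, int H) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x < W && y < H) img[y * W + x] = (float)(y * W + x);
}

int main() {
  const int W = 1000, H = 700;
  float *d_img;
  cudaMalloc(&d_img, W * H * sizeof(float));

  dim3 block(16, 16, 1);                      // blockDim = (16, 16, 1), 256 threads
  dim3 grid((W + block.x - 1) / block.x,      // gridDim rounds up so the grid
            (H + block.y - 1) / block.y, 1);  // covers the whole image
  fill2D<<<grid, block>>>(d_img, W, H);
  cudaDeviceSynchronize();

  cudaFree(d_img);
  return 0;
}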

Compute Capability: determines the maximum dimensionality and size of the grids and blocks you can specify

OpenMP, OpenACC, CUDA, OpenCL support for different GPU vendor

| CUDA | OpenCL |
|---|---|
| Kernel | Kernel |
| Host program | Host program |
| Thread | Work item |
| Thread block | Work group |
| Grid | NDRange (index space) |
| threadIdx.x | get_local_id(0) |
| blockIdx.x * blockDim.x + threadIdx.x | get_global_id(0) |
| __global__ function | __kernel function |
| __device__ function | function |
| __constant__ variable | __constant variable |
| __device__ variable | __global variable |
| __shared__ variable | __local variable |

Coding CUDA

Function and Variable Declaration

CUDA extends C:

Threads within a block coordinate via shared memory, atomic operations, and barrier synchronization. Threads in different blocks cannot cooperate.

| Variable declaration | Memory | Scope | Lifetime |
|---|---|---|---|
| __device__ __local__ int LocalVar; | local | thread | thread |
| __device__ __shared__ int SharedVar; | shared | block | block |
| __device__ int GlobalVar; | global | grid | kernel |
| __device__ __constant__ int ConstantVar; | constant | grid | kernel |

__device__ is optional when used with __local__, __shared__, or __constant__.
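A small sketch showing these qualifiers side by side (filter, hitCounter, and useQualifiers are made-up names; assumes a launch with 256 threads per block):

__constant__ float filter[16];   // constant memory: read-only from the kernel
__device__ int hitCounter;       // global memory: visible to the whole grid

__global__ void useQualifiers(float *out) {
  __shared__ float tile[256];    // shared memory: one copy per block
  int localVar = threadIdx.x;    // register/local: private to this thread
  tile[threadIdx.x] = filter[threadIdx.x % 16] * localVar;
  __syncthreads();
  out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
  if (threadIdx.x == 0) atomicAdd(&hitCounter, 1);
}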

Helper Functions

cudaGetDeviceCount(&deviceCount): The number of active GPUs visible to the rank

cudaSetDevice(): use device

cudaGetDeviceProperties(): get cudaDeviceProp
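A minimal device-query sketch using these helpers (the output format is arbitrary):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int deviceCount = 0;
  cudaGetDeviceCount(&deviceCount);
  for (int d = 0; d < deviceCount; ++d) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, d);
    printf("Device %d: %s, %d SMs, %.1f GB, compute capability %d.%d\n",
           d, prop.name, prop.multiProcessorCount,
           prop.totalGlobalMem / 1e9, prop.major, prop.minor);
  }
  cudaSetDevice(0); // use the first device
  return 0;
}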

Datatypes

Vectors: (addressed by .x, .y, .z, .w)

Floats:

Built-in Variables

Taste of Programming with CUDA

4 Tasks we need to do to use GPU

#include "cuda.h"
int main(int argc, char **argv) {
  int N = 3789; // size of array we want to compute

  float *d_a; // an address in GPU we want to know
  cudaMalloc((void**) &d_a, N*sizeof(float)); // tell GPU to give us allocated GPU address

  int B = 512; // programmer partition data into blocks each has size of 512 threads
  dim3 dimBlock(B, 1, 1); // programmer specify the dimension of a block, which in this case is (512 threads, 1 thread, 1 thread)
  dim3 dimGrid((N + B - 1) / B, 1, 1); // programmer specify the dimension of Grid which contains `(N + B - 1) / B` many blocks

  // Queue Kernel on DEVICE
  simpleKernel<<<dimGrid, dimBlock>>>(N, d_a);

  float *h_a = (float*) calloc(N, sizeof(float)); // allocate array in CPU host to receive calculated data from GPU device
  cudaMemcpy(h_a, d_a, N*sizeof(float), cudaMemcpyDeviceToHost); // this will block until GPU finish kernel

  for(int n = 0; n < N; ++n) printf("h_a[%d] = %f\n", n, h_a[n]); // print stuff out
}

You can imagine there is an internal function that calls your kernel (conceptually, not real code):

void kernelCaller(int N, float *d_a) {
  for (int b = 0; b < gridDim; ++b) {        // conceptually loop over blocks in the grid
    for (int t = 0; t < blockDim; ++t) {     // conceptually loop over threads in a block
      simpleKernel(b, t, gridDim, blockDim); // pseudocode: in reality these become implicit built-in variables
    }
  }
}

But instead of passing b, t, gridDim, and blockDim, you have blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y, threadIdx.z, blockDim.x, blockDim.y, blockDim.z, gridDim.x, gridDim.y, and gridDim.z available in your kernel. These variables are built in; they appear in the kernel without being declared anywhere in your code.

The actual kernel looks like this:

__global__ void simpleKernel(int N, float *d_a) {
  // Convert threadIdx and blockIdx to array index
  const int n = threadIdx.x + blockDim.x * blockIdx.x;

  if (n < N) { // make sure we are inside the array we want to calculate
    d_a[n] = n;
  }
  // You don't return
}

Finally, you compile with nvcc -o ./out -arch=sm_50 main.cu. Make sure to set -arch=sm_50 to your GPU architecture. You can use cuda-memcheck --leak-check full to check for memory leaks.

cuobjdump, cuobjdump -ptx, and cuobjdump -sass -ptx are used to inspect the byte code (PTX and SASS) of a compiled kernel.

nvprof and nvvp are for profiling.

Below is another sample code from Nvidia.

#include <iostream>
#include <math.h>

// CUDA Kernel function to add the elements of two arrays on the GPU. `__global__` tells the CUDA C++ compiler that this is a function that runs on the GPU and can be called from CPU code.
__global__
void add(int n, float *x, float *y) {
  for (int i = 0; i < n; i++)
      y[i] = x[i] + y[i];
}

int main(void) {
  int N = 1<<20; // 1M elements

  float *x, *y;
  cudaMallocManaged(&x, N*sizeof(float)); // float *x = new float[N];
  cudaMallocManaged(&y, N*sizeof(float)); // float *y = new float[N];

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU (a single thread here; see the
  // parallel grid-stride version after this example)
  add<<<1, 1>>>(N, x, y);

  // Wait for the GPU to finish before accessing the result on the host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory (allocated with cudaMallocManaged, so use cudaFree)
  cudaFree(x);
  cudaFree(y);

  return 0;
}
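The launch above uses a single GPU thread. The usual way to parallelize it is a grid-stride loop kernel launched with many blocks (a sketch following the same example; add_parallel is a made-up name):

__global__ void add_parallel(int n, float *x, float *y) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;   // total number of threads in the grid
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

// Launch: 256 threads per block, enough blocks to cover N elements
// int blockSize = 256;
// int numBlocks = (N + blockSize - 1) / blockSize;
// add_parallel<<<numBlocks, blockSize>>>(N, x, y);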

Compiling

CUDA Random Number Generator (RNG)

CURAND:

curandGenerator_t r;
// argument tells which algorithm to use
curandCreateGenerator(&r, CURAND_RNG_PSEUDO_DEFAULT);
curandSetStream(r, stream); // optional
curandSetPseudoRandomGeneratorSeed(r, seed);
curandGenerateUniform(r, data, numElems);
curandDestroyGenerator(r);

__device__ void curand_init(unsigned long long seed, unsigned long long sequence,
                            unsigned long long offset, curandState *state);
__device__ float curand_uniform (curandState *state);
__device__ float curand_normal (curandState *state);

Generate Random Number in Host

#include <curand.h>
int main() {
  . . .
  curandGenerator_t gen;
  float *devNum, *hostNum;
  hostNum = new float[n];
  cudaMalloc((void **)&devNum, n*sizeof(float));
  . . .
  curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
  curandSetPseudoRandomGeneratorSeed(gen, 12345);
  curandGenerateUniform(gen, devNum, n);
  cudaMemcpy(hostNum, devNum, n*sizeof(float), cudaMemcpyDeviceToHost);
  . . .
  curandDestroyGenerator(gen);
  cudaFree(devNum);
  . . .
}

Generate Random Number in Kernel

#include <stdio.h>
#include <stdlib.h>
#include <curand_kernel.h> // CURAND lib header file
#define TRIALS_PER_THREAD 2048
#define BLOCKS 256
#define THREADS 256
int main(int argc, char *argv[]) {
  float host[BLOCKS * THREADS];
  float *dev;
  curandState *devStates;
  cudaMalloc((void **) &dev, BLOCKS * THREADS * sizeof(float));
  cudaMalloc((void **) &devStates, BLOCKS * THREADS * sizeof(curandState));
  …
}

__global__ void pi_mc(float *estimate, curandState *states) {
  unsigned int tid = threadIdx.x + blockDim.x * blockIdx.x;
  int points_in_circle = 0;
  float x, y;
  // Initialize CURAND
  curand_init(tid, 0, 0, &states[tid]);
  for (int i = 0; i < TRIALS_PER_THREAD; i++) {
    x = curand_uniform(&states[tid]);
    y = curand_uniform(&states[tid]);
    // count if (x, y) is inside the circle
    points_in_circle += (x*x + y*y <= 1.0f);
  }
  estimate[tid] = 4.0f * points_in_circle / (float) TRIALS_PER_THREAD;
}
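The host part of main is elided with … above; a possible completion (an assumption, not the original code) launches pi_mc, copies the per-thread estimates back, and averages them:

// inside main(), after the two cudaMalloc calls
pi_mc<<<BLOCKS, THREADS>>>(dev, devStates);
cudaMemcpy(host, dev, BLOCKS * THREADS * sizeof(float), cudaMemcpyDeviceToHost);

float pi = 0.0f;
for (int i = 0; i < BLOCKS * THREADS; i++) pi += host[i];
pi /= (BLOCKS * THREADS);
printf("CUDA estimate of pi = %f\n", pi);
// (in a real program, pi_mc would have to be declared before main)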

Simple Vector Addition

In order to store data on the GPU that can be communicated back to the host, we need allocated memory that lives until it is freed. Think of global memory as heap space that lives until the application closes or the memory is freed; it is visible to any thread in any block that has a pointer to that memory region. Shared memory can be thought of as stack space that lives until a block of a kernel finishes; its visibility is limited to threads within the same block.

Shared memory is faster than global memory, but need to be initialized from global memory. This initialization must be coalesced (access global memory with locality in mind) to achieve high performance.

Main Take away: Global memory refers to allocated memory in the input of kernel. It can be used by all threads within all blocks. In contrast, shared memory is a stack space that is only shared among threads within the same block.

The size of global memory varies from card to card, anything from none to 32GB (V100), while the amount of shared memory depends on the compute capability. Cards below compute capability 2.x have a maximum of 16KB of shared memory per multiprocessor (where the number of multiprocessors varies from card to card), and cards with compute capability 2.x and greater have a minimum of 48KB of shared memory per multiprocessor. (See the documentation for details.)

CUDA can also use CPU memory (mapped memory) as global memory instead of GPU memory. Although a bit slow, the amount of memory then depends on CPU memory availability.
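A minimal sketch of such mapped (zero-copy) memory (scaleKernel is a made-up kernel; cudaSetDeviceFlags, cudaHostAlloc, and cudaHostGetDevicePointer are the relevant runtime calls):

#include <cuda_runtime.h>

__global__ void scaleKernel(float *x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= 2.0f;
}

int main() {
  const int N = 1 << 20;
  cudaSetDeviceFlags(cudaDeviceMapHost);        // allow mapping host memory

  float *h_x, *d_x;
  cudaHostAlloc((void **)&h_x, N * sizeof(float), cudaHostAllocMapped); // pinned + mapped
  for (int i = 0; i < N; i++) h_x[i] = 1.0f;

  cudaHostGetDevicePointer((void **)&d_x, h_x, 0); // device-visible alias of h_x
  scaleKernel<<<(N + 255) / 256, 256>>>(d_x, N);   // kernel reads/writes host memory over PCIe
  cudaDeviceSynchronize();

  cudaFreeHost(h_x);
  return 0;
}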

#include <algorithm>
#include <cassert>
#include <cstdlib> // rand()
#include <iostream>
#include <vector>

// CUDA kernel for vector addition
// __global__ means this is called from the CPU, and runs on the GPU. The name global refers to global memory allocated on GPU.
// In the C programming language, restrict is a keyword that can be used in pointer declarations. By adding this type qualifier, a programmer hints to the compiler that for the lifetime of the pointer, no other pointer will be used to access the object to which it points. This allows the compiler to make optimizations (for example, vectorization) that would not otherwise have been possible. __restrict / __restrict__ is the spelling of this qualifier accepted in CUDA C++.
// a, b, c both are global GPU memory
__global__ void vectorAdd(const int *__restrict a, const int *__restrict b,
                          int *__restrict c, int N) {
  int tid = (blockIdx.x * blockDim.x) + threadIdx.x; // global thread ID
  if (tid < N) c[tid] = a[tid] + b[tid]; // Boundary check and then write back to global memory
}

int main() {
  constexpr int N = 1 << 16; // Array size of 2^16 (65536 elements)
  constexpr size_t bytes = sizeof(int) * N;

  // Initialize random vectors for holding the host-side (CPU-side) data
  std::vector<int> a;
  a.reserve(N);
  std::vector<int> b;
  b.reserve(N);
  std::vector<int> c(N); // pre-sized so the GPU result can be copied into c.data()
  for (int i = 0; i < N; i++) {
    a.push_back(rand() % 100);
    b.push_back(rand() % 100);
  }

  // Allocate memory on the device (GPU)
  int *d_a, *d_b, *d_c;
  cudaMalloc(&d_a, bytes);
  cudaMalloc(&d_b, bytes);
  cudaMalloc(&d_c, bytes);
  // Copy data from the host to the device (CPU -> GPU)
  cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

  // Threads per "compute thread array" (generally called block, or CTA) (1024)
  int NUM_THREADS = 1 << 10;
  // blocks per Grid
  int NUM_BLOCKS = (N + NUM_THREADS - 1) / NUM_THREADS;

  // Launch the kernel on the GPU
  vectorAdd<<<NUM_BLOCKS, NUM_THREADS>>>(d_a, d_b, d_c, N);

  // Copy sum vector from device to host
  // cudaMemcpy is a synchronous operation, and waits for the prior kernel
  // launch to complete (both go to the default stream in this case).
  // Therefore, this cudaMemcpy acts as both a memcpy and synchronization
  // barrier.
  cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); // Free memory on device

  std::cout << "COMPLETED SUCCESSFULLY\n";
  return 0;
}

Unified Memory and Prefetch

Q: What are the advantages of using unified memory other than not needing to transfer data between host and device? After profiling, it seems like using unified memory takes significantly more time than the usual malloc and cudaMalloc.

A: Good question! Unified memory can take longer than normal allocation and copy, but with adequate pre-fetching, they can yield roughly similar results. It is mainly a programmability feature, and that goes beyond just avoiding duplicate data structures. Modern architectures (Volta and later) support over-subscription, where you can allocate more memory than you have on your GPU. Doing this without unified memory would require you to decompose your problem into multiple kernels so that you can work on a chunk of data, finish computation, copy new data in, start a new kernel, etc. With unified memory, none of that is required.

We also learned about hints, for more, read this paper

#include <stdio.h>
#include <stdlib.h> // rand()
#include <cassert>
#include <iostream>

using std::cout;

__global__ void vectorAdd(int *a, int *b, int *c, int N) {
  int tid = (blockDim.x * blockIdx.x) + threadIdx.x;
  if (tid < N) c[tid] = a[tid] + b[tid];
}

int main() {
  const int N = 1 << 16;
  size_t bytes = N * sizeof(int);

  int *a, *b, *c;
  // Allocation memory for these pointers
  // CUDA transfer memory from CPU to GPU and from GPU to CPU automatically
  // based on demand. You can think of this memory as a shared memory in GPU/CPU
  cudaMallocManaged(&a, bytes);
  cudaMallocManaged(&b, bytes);
  cudaMallocManaged(&c, bytes);


  int id;
  cudaGetDevice(&id); // Get the device ID for prefetching calls
  // cudaMemAdviseSetPreferredLocation (PL) sets the preferred location (CPU or GPU) for a range of data. If another processor wants to access the memory region, then the data will not migrate from the preferred location.
  // In this example, we prefer data a, b to stay on CPU
  cudaMemAdvise(a, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
  cudaMemAdvise(b, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
  // Set some hints about the data and do some prefetching for c
  cudaMemPrefetchAsync(c, bytes, id);

  // Initialize vectors
  for (int i = 0; i < N; i++) {
    a[i] = rand() % 100;
    b[i] = rand() % 100;
  }

  // cudaMemAdviseSetReadMostly (RM) implies that the data will be mostly read and occasionally written to. It lets the driver create read-only copies of that data, and then send copies to other processors rather than migrate.
  cudaMemAdvise(a, bytes, cudaMemAdviseSetReadMostly, id);
  cudaMemAdvise(b, bytes, cudaMemAdviseSetReadMostly, id);
  // Pre-fetch 'a' and 'b' arrays to the specified device (GPU)
  cudaMemPrefetchAsync(a, bytes, id);
  cudaMemPrefetchAsync(b, bytes, id);

  int BLOCK_SIZE = 1 << 10;
  int GRID_SIZE = (N + BLOCK_SIZE - 1) / BLOCK_SIZE;

  vectorAdd<<<GRID_SIZE, BLOCK_SIZE>>>(a, b, c, N);

  // Wait for all previous operations before using values
  // We need this because we don't get the implicit synchronization of
  // cudaMemcpy like in the original example
  cudaDeviceSynchronize();

  // Prefetch to the host (CPU)
  cudaMemPrefetchAsync(a, bytes, cudaCpuDeviceId);
  cudaMemPrefetchAsync(b, bytes, cudaCpuDeviceId);
  cudaMemPrefetchAsync(c, bytes, cudaCpuDeviceId);

  cudaFree(a); cudaFree(b); cudaFree(c);

  cout << "COMPLETED SUCCESSFULLY!\n";
  return 0;
}

Matrix Multiplication

Matrix Multiplication Memory Access

To calculate A^{n \times n} \cdot B^{n \times n} = C^{n \times n}, we assign each thread to an element of C. So we can view our threads as 2D threads, each identified by (threadIdx.x, threadIdx.y). Internally, however, GPU threads are one-dimensional, say [(0, 0), (0, 1), ..., (0, n), ..., (1, 0), (1, 1), ..., (1, n), ..., (n, 0), ..., (n, n)]. For each thread there are n additions, and for each addition we need to load 2 integers. Since at least 32 threads run in lock step, we need to load 32 \times 2 integers at once from the 2 matrices. To preserve locality, we hope that those 32 integers are in the same memory region. For the first iteration, the threads collectively load the first column of A and the first row of B. Therefore, B preserves locality while A does not.

However, another problem is that the cache is not big enough to keep all of these loads resident across iterations. Cache tiling is needed, just like in the CMU 15-213 Cache Lab.

We will take a simpler approach in the code below: we divide C^{n \times n} into (n /\text{block\_size}, n /\text{block\_size}) regions. We still map each thread to each element in C^{n \times n}, but change the order of calculation by maintaining partial sum as the image below:

Tiled Matrix Multiplication

// N (matrix dimension) and SHMEM_SIZE (tile size in elements, blockDim.x * blockDim.y)
// are assumed to be defined elsewhere.
__global__ void matrixMul(const int *a, const int *b, int *c) {
  // Compute each thread's global row and column index
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;

  // Statically allocated shared memory
  __shared__ int s_a[SHMEM_SIZE];
  __shared__ int s_b[SHMEM_SIZE];

  // Accumulate in temporary variable
  int tmp = 0;

  // Sweep tile across matrix
  for (int i = 0; i < N; i += blockDim.x) {
    // Load in elements for this tile
    s_a[threadIdx.y * blockDim.x + threadIdx.x] = a[row * N + i + threadIdx.x];
    s_b[threadIdx.y * blockDim.x + threadIdx.x] =
        b[i * N + threadIdx.y * N + col];

    // Wait for both tiles to be loaded in before doing computation
    __syncthreads();

    // Do matrix multiplication on the small matrix
    for (int j = 0; j < blockDim.x; j++) {
      tmp +=
          s_a[threadIdx.y * blockDim.x + j] * s_b[j * blockDim.x + threadIdx.x];
    }

    // Wait for all threads to finish using current tiles before loading in new
    // ones
    __syncthreads();
  }

  // Write back results
  c[row * N + col] = tmp;
}
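A host-side launch sketch for this kernel (a fragment, assuming a square N divisible by the tile width and SHMEM_SIZE = TILE * TILE; d_a, d_b, d_c are device pointers allocated as in the earlier examples):

const int TILE = 16;                 // blockDim.x == blockDim.y == TILE
dim3 threads(TILE, TILE);
dim3 blocks(N / TILE, N / TILE);     // one block per TILE x TILE tile of C
matrixMul<<<blocks, threads>>>(d_a, d_b, d_c);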

Reduction Optimization in CUDA

Reductions usually use operators that are associative and commutative (e.g. sum, min, max), which enables re-arranging the order in which elements are combined.

Note: in MPI there are special routines to perform reductions over distributed arrays

Scan: scan is another important building block of algorithms that can be massively parallelized.

Scan is very tricky to implement quickly with maximum efficiency.

Advice: it is best to first see if you can get working code from someone else (e.g. investigate the Thrust C++ library). Don't re-invent the wheel unless you really think you can do it better.

For this reason, I will not cover the exact algorithm here. Interested readers can read directly from the slides.
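For example, with Thrust both reduction and scan are one-liners (a minimal sketch):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <cstdio>

int main() {
  thrust::device_vector<int> d(1 << 20, 1);               // 1M ones on the device

  int sum = thrust::reduce(d.begin(), d.end(), 0);        // parallel reduction
  thrust::inclusive_scan(d.begin(), d.end(), d.begin());  // in-place prefix sum

  printf("sum = %d, last prefix = %d\n", sum, (int)d.back());
  return 0;
}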

General Tips:

Single-block parallel reduction for commutative operator

We partition our data into only 1 block (gridSize = 1) consisting of 1024 threads. Therefore each thread will process more than 1 element of the array.

static const int arraySize = 10000;
static const int blockSize = 1024;

// the function is run at thread level
__global__ void sumCommSingleBlock(const int *a, int *out) {
    int idx = threadIdx.x; // each thread starts at its thread id: thread 2 takes a[2]
    int sum = 0;
    for (int i = idx; i < arraySize; i += blockSize) // each thread sums elements, jumping by blockSize: thread 2 takes a[2], a[1024+2], a[2048+2], ...
        sum += a[i];
    __shared__ int r[blockSize]; // threads within the block pool their results here (shared at block level)
    r[idx] = sum;
    __syncthreads(); // wait until every thread in the block finishes
    for (int size = blockSize/2; size>0; size/=2) { // reduction
        if (idx<size)
            r[idx] += r[idx+size];
        __syncthreads();
    }
    if (idx == 0) // copy result to global memory
        *out = r[0];
}

...

sumCommSingleBlock<<<1, blockSize>>>(dev_a, dev_out);

In the above example, we use shared memory (__shared__ int r[blockSize]), which is about 100x faster than the global memory (const int *a, int *out).

Multi-block parallel reduction for commutative operator

#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <iostream>
#include <numeric>
#include <vector>
using namespace std;

#define SHMEM_SIZE 256

__global__ void sumReduction(int *v, int *v_r) {
  // Allocate shared memory
  __shared__ int partial_sum[SHMEM_SIZE];

  // Calculate thread ID
  int tid = blockIdx.x * blockDim.x + threadIdx.x;

  // Load elements into shared memory
  partial_sum[threadIdx.x] = v[tid];
  __syncthreads();

  // Iterate of log base 2 the block dimension
  for (int s = 1; s < blockDim.x; s *= 2) {
    // Halve the number of threads doing work each iteration
    // (the stride s doubles every iteration)
    if (threadIdx.x % (2 * s) == 0) {
      partial_sum[threadIdx.x] += partial_sum[threadIdx.x + s];
    }
    __syncthreads();
  }

  // Let thread 0 of this block write its result to main memory
  // Result is indexed by this block
  if (threadIdx.x == 0) {
    v_r[blockIdx.x] = partial_sum[0];
  }
}

int main() {
  // Vector size
  int N = 1 << 16;
  size_t bytes = N * sizeof(int);

  // Host data
  vector<int> h_v(N);
  vector<int> h_v_r(N);
  generate(begin(h_v), end(h_v), [](){ return rand() % 10; }); // Initialize data

  int *d_v, *d_v_r;
  cudaMalloc(&d_v, bytes);
  cudaMalloc(&d_v_r, bytes);
  cudaMemcpy(d_v, h_v.data(), bytes, cudaMemcpyHostToDevice);

  const int TB_SIZE = 256; // Block Size (how many threads in a block)
  int GRID_SIZE = N / TB_SIZE; // Grid Size (No padding)

  // first launch 256 blocks, producing 256 partial sums
  sumReduction<<<GRID_SIZE, TB_SIZE>>>(d_v, d_v_r);
  // then use 1 block of 256 threads to sum those 256 partial sums
  sumReduction<<<1, TB_SIZE>>> (d_v_r, d_v_r);

  cudaMemcpy(h_v_r.data(), d_v_r, bytes, cudaMemcpyDeviceToHost);
  assert(h_v_r[0] == std::accumulate(begin(h_v), end(h_v), 0));
  cout << "COMPLETED SUCCESSFULLY\n";
  return 0;
}

Multi-block parallel reduction for commutative operator with warp divergence

if (branch_condition) do_work(); else do_no_work()

In the example above, we set the block size to 256, so each block consists of 256 / 32 = 8 warps that execute semi-sequentially, each in lock step. In every warp, at least half of the threads fail the if condition and do no work while the rest do some work. When some threads in a warp must not run the current instruction, they are masked when the warp is dispatched. If the conditional is such that some threads in the warp must do one thing and the other threads must do something else, the two code paths are taken sequentially by the warp, with the appropriate threads masked. The resulting runtime is therefore the same as if every thread in the warp did the same amount of work as the working threads. This is not ideal: we could instead let at least half of the warps in a block finish early, because their threads are essentially doing nothing. We will implement this by making all threads that do work have adjacent threadIdx values, so that the working threads are grouped into the same warps.

When a warp instruction is executed on compute capability 2.x, the entire warp is sent to 16 of the CUDA cores. CUDA cores are pipelined (as are all modern CPUs), so what happens is that the 32 threads are queued up in two consecutive pipeline stages in those 16 CUDA cores. The instruction takes something like 16 to 24 clock ticks to propagate through the pipeline, but because many instructions are moving through the pipeline at once, that group of 16 CUDA cores will complete one warp every two clock ticks. (seibert)

Although the smallest execution unit is a half-warp, warp divergence happens at the warp level (not the half-warp level) because the scheduler works with entire warps. The programmer should not worry about the implementation detail of half-warps. For details, see Compute Capabilities.

When the warp instruction is issued to a group of 16 cores, the entire warp executes the instruction, because the cores are clocked twice (Fermi's "hotclock") so that each core actually executes two thread's worth of computation in a single cycle (= 2 hotclocks). When a warp instruction is dispatched, the entire warp gets serviced. It does not need to be scheduled twice. Robert Crovella Certainly all threads in a warp are executing the same instruction at any given time. But warps execute independently from each other and so different warps within a block may be executing different instructions from the stream, at any given time.

The instruction-level parallelism (ILP) story is more complicated. For details, please see Stackoverflow.

Whether to mask or to branch seems to be determined at run time. One thing to keep in mind is that the warp size on all CUDA devices so far is 32, but NVIDIA has said that it may change in the future. If you need the size of a warp in your device code, use the built-in integer variable warpSize, which is provided by the compiler. (seibert)

Multi-block parallel reduction for commutative operator with warp divergence

The code is basically identical to the above except for the inner for loop. The execution is about 2x faster than the above implementation.

for (int s = 1; s < blockDim.x; s *= 2) {
  // threads doing work get consecutive threadIdx.x values, so whole warps
  // are either all active or all idle (no divergence inside a warp)
  int index = 2 * s * threadIdx.x;
  if (index < blockDim.x) {
    partial_sum[index] += partial_sum[index + s];
  }
  __syncthreads();
}

Multi-block parallel reduction for commutative operator with warp divergence and memory locality without shared memory bank conflict

The shared memory in each block is broken into 32-bit words (an int32, a float, 4 bytes, 2 shorts, or half of a double). These words live in 32 different banks, as the diagram below shows.

However, Kepler architecture introduced the option to increase banks to 8 bytes using cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte). This can help avoid bank conflicts when accessing double precision data.

Bank Diagram in terms of words: word 0 at bank 0, word 1 at bank 1, ..., word 31 at bank 31, but word 32 goes back again to bank 0.

Bank Diagram in GPU. The total size of shared memory may be set to 16KB, 32KB or 48KB (with the remaining amount automatically used for L1 cache) at kernel launch, as shown in the image. Shared memory defaults to 48KB (with 16KB remaining for L1 cache). With the Kepler architecture, each bank has a bandwidth of 64 bits per clock cycle. The older Fermi architecture was clocked differently, but effectively offered half this bandwidth.

Bank Conflict: occurs when two or more threads of a half-warp request different words from the same bank in a single request; those memory accesses are then serialized.

// QUESTION: some article say that bank conflict are in half-warp level while other say in warp level.

Broadcast: if all threads of a half-warp request exactly the same word, the word is read once from shared memory and broadcast to the threads.

arr[threadIdx.x*0], arr[12], and arr[blockIdx.x*3] are examples of broadcast.
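A sketch of the classic padding trick that avoids bank conflicts when a warp reads a column of a shared-memory tile (the standard 32x32 transpose pattern; assumes a square matrix whose side width is a multiple of 32 and 32x32 thread blocks):

__global__ void transposeTile(const float *in, float *out, int width) {
  __shared__ float tile[32][33];   // 33, not 32: shifts each row by one bank

  int x = blockIdx.x * 32 + threadIdx.x;
  int y = blockIdx.y * 32 + threadIdx.y;
  tile[threadIdx.y][threadIdx.x] = in[y * width + x];      // coalesced global read
  __syncthreads();

  int tx = blockIdx.y * 32 + threadIdx.x;                  // transposed block offsets
  int ty = blockIdx.x * 32 + threadIdx.y;
  out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];   // column read, conflict-free thanks to padding
}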

/**
* Allocates at least width (in bytes) * height bytes of
* linear memory on the device and returns in *devPtr a pointer
* to the allocated memory. The function may pad the allocation
* to ensure that corresponding pointers in any given row will
* continue to meet the alignment requirements for coalescing as
* the address is updated from row to row. The pitch returned in
* *pitch by cudaMallocPitch() is the width in bytes of the allocation.
* @param devPtr: Pointer to allocated pitched device memory
* @param pitch: Pitch for allocation
* @param width: Requested pitched allocation width (in bytes)
* @param height: Requested pitched allocation height
* @returns cudaSuccess cudaErrorMemoryAllocation
**/
cudaError_t cudaMallocPitch(void **devPtr, size_t *pitch, size_t width, size_t height);

Given the row and column of an array element of type T, the address is computed as T* pElement = (T*)((char*)BaseAddress + Row * pitch) + Column;

Linear memory can also be allocated through cudaMallocPitch() and cudaMalloc3D().
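A hypothetical usage sketch of cudaMallocPitch together with the addressing formula above (zeroRows, width, and height are made-up names):

__global__ void zeroRows(float *devPtr, size_t pitch, int width, int height) {
  int r = blockIdx.y * blockDim.y + threadIdx.y;
  int c = blockIdx.x * blockDim.x + threadIdx.x;
  if (r < height && c < width) {
    float *row = (float *)((char *)devPtr + r * pitch);  // pitch is in bytes
    row[c] = 0.0f;
  }
}

// Host side (fragment):
// float *devPtr; size_t pitch;
// cudaMallocPitch((void **)&devPtr, &pitch, width * sizeof(float), height);
// dim3 block(16, 16), grid((width + 15) / 16, (height + 15) / 16);
// zeroRows<<<grid, block>>>(devPtr, pitch, width, height);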

Multi-block parallel reduction for commutative operator with warp divergence and memory locality without shared memory bank conflict

The code is basically identical to the above except for the inner for loop. The execution is about 1.3x faster than the above implementation.

for (int s = blockDim.x / 2; s > 0; s >>= 1) {
  // the first s threads add the partial sums of the second s threads
  if (threadIdx.x < s) partial_sum[threadIdx.x] += partial_sum[threadIdx.x + s];
  __syncthreads();
}

A final optimization can be done by doing a first addition while loading from global memory to shared memory, getting rid of the idle threads in the first iteration. This gives about a further 1.4x (5 us) speedup.

Single-block parallel reduction for non-commutative operator

static const int arraySize = 1000000;
static const int blockSize = 1024;

__global__ void sumNoncommSingleBlock(const int *gArr, int *out) {
    int thIdx = threadIdx.x;
    __shared__ int shArr[blockSize*2];
    __shared__ int offset;
    shArr[thIdx] = thIdx<arraySize ? gArr[thIdx] : 0;
    if (thIdx == 0)
        offset = blockSize;
    __syncthreads();
    while (offset < arraySize) { //uniform
        shArr[thIdx + blockSize] = thIdx+offset<arraySize ? gArr[thIdx+offset] : 0;
        __syncthreads();
        if (thIdx == 0)
            offset += blockSize;
        int sum = shArr[2*thIdx] + shArr[2*thIdx+1];
        __syncthreads();
        shArr[thIdx] = sum;
    }
    __syncthreads();
    for (int stride = 1; stride<blockSize; stride*=2) { //uniform
        int arrIdx = thIdx*stride*2;
        if (arrIdx+stride<blockSize)
            shArr[arrIdx] += shArr[arrIdx+stride];
        __syncthreads();
    }
    if (thIdx == 0)
        *out = shArr[0];
}

...

sumNoncommSingleBlock<<<1, blockSize>>>(dev_a, dev_out);

Special Topics

Atomic Instructions

Atomic vs. Non-Atomic

// addition / subtraction
atomicAdd()
atomicSub()
// minimum / maximum
atomicMin()
atomicMax()
// increment / decrement
atomicInc()
atomicDec()
// exchange / compare-and-swap
atomicExch()
atomicCAS()
// bitwise AND / OR / XOR
atomicAnd()
atomicOr()
atomicXor()

For locks on global memory, you can use:

// compare and swap (useful for atomic locks)
int atomicCAS(int* address,int compare,int val);
// if the old value stored at address equals compare, then val is stored at address;
// in either case, the routine returns the old value

Here is an example of global atomic lock using __threadfence();:

__threadfence_block(): wait until all global and shared memory writes are visible to all threads in block

__threadfence(): wait until all shared memory writes are visible to all threads in the block, and all global memory writes are visible to all threads on the device

__global__ void kernel(...) {
  ...
  if (threadIdx.x == 0) {
    do {} while (atomicCAS(&lock, 0, 1)); // spin until we acquire the lock
    ... // some global memory write
    __threadfence(); // wait for the writes to finish: writes to global memory are async, so manual fencing is needed for correctness
    lock = 0; // free lock
  }
}

Warp Shuffles

Warp Shuffles: exchanging data between threads in the same warp

LaneID: threadIdx.x % 32 for 1D blocks

// mask: controls which threads are involved — usually set to -1
// var: a local register variable (int, unsigned int, long long, unsigned long long, float or double) whose value is handed to one other thread
// delta: offset within the warp
// if the appropriate thread does not exist then the value is taken from the current thread
// return: value from one other thread
T __shfl_up_sync(unsigned mask, T var, unsigned int delta);
// copy from a lane with lower ID relative to caller
T __shfl_down_sync(unsigned mask, T var, unsigned int delta);
// copy from a lane with higher ID relative to caller

// an XOR operation is performed between laneMask and the calling thread’s laneID to determine the lane from which to copy the value
// very useful for reduction operations and FFTs
T __shfl_xor_sync(unsigned mask, T var, int laneMask);
// copy from a lane based on bitwise XOR of own lane ID

// copies data from srcLane
T __shfl_sync(unsigned mask, T var, int srcLane);
// copy from indexed lane ID

Be careful with conditional code: threads may only read data from another thread that is actively participating in the shuffle command. If the target thread is inactive, the retrieved value is undefined.

sum all the elements in a warp: XOR

sum all the elements in a warp: Down
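A common warp-level sum using the down-shuffle (a sketch; after the loop, lane 0 of each warp holds the sum of all 32 lane values):

__inline__ __device__ int warpReduceSum(int val) {
  for (int offset = 16; offset > 0; offset /= 2)
    val += __shfl_down_sync(0xffffffff, val, offset);  // add the value from the lane `offset` above
  return val;
}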

Streams

Here we introduce streams.

Originally there was no concept of a "stream", so we call that implementation the "default stream". Say I want the GPU to run two different kernels: the first computes one integer addition and the second one integer multiplication. There is little memory and ALU usage in either kernel, so you might expect the kernels to run concurrently. However, that is not possible, since we only have one stream, the "default stream": kernel execution is serialized.

// default stream and stream 0 are different
// launch of kernel is async
kernel<<< blocks, threads, bytes >>>();    // default stream
kernel<<< blocks, threads, bytes, 0 >>>(); // stream 0

When the concept of streams was introduced (but before CUDA 7), all kernel launches that don't have a stream assigned go to the "default stream". So if you launch kernels concurrently from multiple host threads, they all go to the "default stream" and execute serially (in sequential order).

Before CUDA7: A multi-threaded example with the legacy default stream behavior: all eight threads are serialized.

In later versions of CUDA, users can control which stream they want to execute kernels in. But this ran into a backward-compatibility issue: if you have a job executing on the "default stream", no numbered stream can execute concurrently with it. This is because the "default stream" is application-wide, shared across host threads, and takes up the whole device.

After CUDA7: Each host-thread creates a new stream (the same as a regular stream as far as synchronization and concurrency goes) automatically, and they do not synchronize, so the kernels from all eight threads run concurrently.

The code below demonstrates this issue:

const int N = 1 << 20;

__global__ void kernel(float *x, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159,i));
    }
}

int main() {
    const int num_streams = 8;

    cudaStream_t streams[num_streams];
    float *data[num_streams];

    for (int i = 0; i < num_streams; i++) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&data[i], N * sizeof(float));

        // launch one worker kernel per stream
        kernel<<<1, 64, 0, streams[i]>>>(data[i], N);

        // launch a dummy kernel on the default stream
        kernel<<<1, 1>>>(0, 0);
    }
    cudaDeviceReset();
    return 0;
}

If we compile the above code with nvcc ./stream_test.cu -o stream_legacy, you will see that the dummy kernel on the default stream, sandwiched between two launches of kernels in numbered streams, causes the numbered streams to serialize.

NVIDIA Visual Profiler (nvvp): resulting kernel timeline on a Macbook Pro with an NVIDIA GeForce GT 750M (a Kepler GPU). You can see the very small bars for the dummy kernels on the default stream, and how they cause all of the other streams to serialize.

CUDA 7 introduces a new option, the "per-thread default stream", that has two effects: each host thread gets its own default stream, and those default streams behave like regular streams (they do not implicitly synchronize with other streams).

If we compile the above code with nvcc --default-stream per-thread ./stream_test.cu -o stream_per-thread, you will see that the default stream now behaves like a regular stream.

To enable per-thread default streams in CUDA 7 and later, you can either compile with the nvcc command-line option --default-stream per-thread, or #define the CUDA_API_PER_THREAD_DEFAULT_STREAM preprocessor macro before including CUDA headers (cuda.h or cuda_runtime.h). It is important to note: you cannot use #define CUDA_API_PER_THREAD_DEFAULT_STREAM to enable this behavior in a .cu file when the code is compiled by nvcc because nvcc implicitly includes cuda_runtime.h at the top of the translation unit. The --default-stream option is applied per compilation unit, so make sure to apply it to all nvcc command lines that need it.

Results from nvvp. Here you can see full concurrency between nine streams: the default stream, which in this case maps to Stream 14, and the eight other streams we created. Note that the dummy kernels run so quickly that it's hard to see that there are eight calls on the default stream in this image.

Other tips: cudaDeviceSynchronize() continues to synchronize everything on the device, even with the new per-thread default stream option. If you want to synchronize only a single stream, use cudaStreamSynchronize(cudaStream_t stream), as in the example here.
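A sketch of the typical stream pattern: split the data into chunks and let each stream copy its chunk in, run the kernel, and copy it back, so copies and kernels from different streams can overlap (scale, N, and nStreams are made-up names; pinned host memory via cudaMallocHost is needed for the copies to be truly asynchronous):

#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= 2.0f;
}

int main() {
  const int N = 1 << 22, nStreams = 4, chunk = N / nStreams;
  float *h, *d;
  cudaMallocHost((void **)&h, N * sizeof(float)); // pinned host memory
  cudaMalloc(&d, N * sizeof(float));
  for (int i = 0; i < N; i++) h[i] = 1.0f;

  cudaStream_t s[nStreams];
  for (int i = 0; i < nStreams; i++) cudaStreamCreate(&s[i]);

  for (int i = 0; i < nStreams; i++) {
    int off = i * chunk;
    cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float), cudaMemcpyHostToDevice, s[i]);
    scale<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d + off, chunk);
    cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float), cudaMemcpyDeviceToHost, s[i]);
  }
  for (int i = 0; i < nStreams; i++) cudaStreamSynchronize(s[i]);

  for (int i = 0; i < nStreams; i++) cudaStreamDestroy(s[i]);
  cudaFree(d); cudaFreeHost(h);
  return 0;
}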

Other Useful Cuda Functions

cudaError_t cudaMemset (void * devPtr, int value, size_t count )

It sets count bytes starting at address devPtr to the value value. For example:

#include <cstdio>

int main(void) {
    const int n = 32;
    const size_t sz = size_t(n) * sizeof(int);
    int *dJunk;
    cudaMalloc((void**)&dJunk, sz);
    cudaMemset(dJunk, 0, sz);
    cudaMemset(dJunk, 0x12, 32);

    int *Junk = new int[n];

    cudaMemcpy(Junk, dJunk, sz, cudaMemcpyDeviceToHost);

    for(int i=0; i<n; i++) {
        fprintf(stdout, "%d %x\n", i, Junk[i]);
    }

    cudaDeviceReset();
    return 0;
}

The above code sets all 128 bytes to 0, then sets the first 32 bytes to 0x12, as described by talonmies.

$ nvcc memset.cu
$ ./a.out

0 12121212
1 12121212
2 12121212
3 12121212
4 12121212
5 12121212
6 12121212
7 12121212
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 0
31 0

Error Checking

checkCudaErrors(...); // check for error return codes
getLastCudaError(...); // check for kernel failure messages
printf("message"); // within kernel, each thread generates its own output
// need to use either cudaDeviceSynchronize(); or cudaDeviceReset(); to make sure buffer is flushed
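checkCudaErrors / getLastCudaError come from the helper headers shipped with the CUDA samples; a hand-rolled equivalent looks roughly like this (a sketch; CUDA_CHECK is a made-up macro name):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                              \
  do {                                                                \
    cudaError_t err = (call);                                         \
    if (err != cudaSuccess) {                                         \
      fprintf(stderr, "CUDA error %s at %s:%d\n",                     \
              cudaGetErrorString(err), __FILE__, __LINE__);           \
      exit(1);                                                        \
    }                                                                 \
  } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc(&d_a, bytes));
// myKernel<<<grid, block>>>(d_a);
// CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
// CUDA_CHECK(cudaDeviceSynchronize());  // catches errors from kernel execution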

Acceleration Tips

#pragma unroll 3 // for loop unroll
__launch_bounds__(TRACE_RAY_CUDA_THREADS, MIN_BLOCKS_PER_SM) // for register restriction
float* __restrict__ out
const __restrict__ // for constant
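A hypothetical kernel combining these hints (axpy is a made-up name; TRACE_RAY_CUDA_THREADS and MIN_BLOCKS_PER_SM are placeholders for the block size and the desired resident blocks per SM):

#define TRACE_RAY_CUDA_THREADS 128
#define MIN_BLOCKS_PER_SM 4

__global__ void
__launch_bounds__(TRACE_RAY_CUDA_THREADS, MIN_BLOCKS_PER_SM)
axpy(const float *__restrict__ x, float *__restrict__ out, int n, float a) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  #pragma unroll 3
  for (int k = 0; k < 3; k++) {            // small fixed-trip loop, fully unrolled
    int idx = i * 3 + k;                   // each thread handles 3 consecutive elements
    if (idx < n) out[idx] = a * x[idx] + out[idx];
  }
}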

General Advice

On Ampere:

If different elements in an array have different workloads, you can separate the heavy and light tasks and execute them separately to increase performance.

General advice:

Other Instructions (Never Used)

// About Sync
int __syncthreads_count(predicate)
// counts how many predicates are true
int __syncthreads_and(predicate)
// returns non-zero (true) if all predicates are true
int __syncthreads_or(predicate)
// returns non-zero (true) if any predicate is true

// About Warp Voting
int __all(predicate)
// returns non-zero (true) if all predicates in warp are true
int __any(predicate)
// returns non-zero (true) if any predicate is true
unsigned int __ballot(predicate)
// sets nth bit based on nth predicate
