CUDA

Some Good Stuff to Read: - Course on CUDA Programming on NVIDIA GPUs, Nov 28 - Dec 9, 2022, at UT Austin - The website is offline, but you can find my notes below and slides are in Here - COM4521/COM6521 Parallel Computing with GPUs (Course) - very in-depth knowledge about CUDA and OpenMP programming) - Computer Architecture (ETH Zurich, Fall 2017) - course website - youtube - Youtube CUDA Crash Course by CoffeeBeforeArch - Nvidia Official CUDA Programming Guide - CS5984: Advanced Computer Graphics: Parallel Computing and Visualization on GPU (VirginiaTech) - This one is mostly computer graphics and GPU - course website - course slides - Nvidia CUDA Training Resources - ACMS 40212/60212: Advanced Scientific Computing (Notre Dame) - Mostly on MPI and briefly mentioned CUDA for 2 lectures - course website - ECE1742S: Programming Massively Parallel Multiprocessors Using CUDA (UToronto) - This one isn't so good - course website - course slides or local repo

GPU is about 9 times faster than CPU (based on bandwidth difference - P100 vs E7-8894v4). But since usually CPU code is shit and GPU code is okay, people experience GPU to run faster.

Original Design of CPU: make single thread fast (reduce latency through cache, and predict branch)

Original Design of GPU: throughput matters (hide memory latency through parallelism, let programmer deal with raw storage hierarchy, avoid high frequency clock speed)

Graphics Processing Clusters (GPC): a computer cluster in which each node is equipped with a Graphics Processing Unit (GPU)

Nvidia GPU Architecture

Introduction to GPU, SIMD, SIMT

Youtube: An Intro to GPU Architecture and Programming Models I Tim Warburton, Virginia Tech

SIMD/T (Single Instruction, Multiple Data/Thread): unlike CPU, there exists no scaler operation unit, only matrix (or vectorized) operations is allowed. Here we have one instruction transforming multiple pieces of data. In each SM, a group of 32 ALU (a warp of them) execute the same instruction simultaneously, but with different data.

GPU cores:

Big Idea: A core is made of registers and ALU and other stuffs, GPU takes many CPU's core and bind them together to form a bigger core.
A GPU core has:
- one (or more) SIMD vector units execute one (or more) instruction stream
- a pool of shared memory
- private shared register file
- a shared memory where all threads in the core can access
GPU core is overloaded term: we refer to big core here
- Every body agrees how many CPU cores there are given a CPU
- However, since one GPU core can contain multiple ALUs, Nvidia tries to convince people to count the number of ALUs as "cores"

GPU limits: Parallel computation is fast, but is limited by memory bandwidth (feeding into GPU is slow). To achieve high FLOPS, one data usually requires 8 instructions.

GPU Architecture

Thread, Kernel, Blocks, Cluster, Grid

Terminologies:

Thread: the same function split in multiple execution units for efficiency purpose.
Kernel: a sequence of instructions (code) that is executed in parallel (much like GLSL code)
Thread Blocks: Multiple threads grouped together, all executing at the same time (co-scheduled), although SM may execute multiple blocks at the same time too.
- All threads in the block can access to a block's memory. On current GPUs, a thread block may contain up to $1024$ threads.
- All threads of a block are expected to reside on the same streaming multiprocessor core and must share the limited memory resources of that core
- Threads within a block can access block-specific shared memory. You might need __syncthreads() (deadlock warning). For more primitives, see Cooperative Groups
Cluster: a new thing in capability 9.0. Pack maximum of 8 blocks (exceeding limit require hardware support) together to enforce those 8 blocks will be executed by a group of SMs within a GPC.
- For compile-time: use __cluster_dims__(X,Y,Z)
- For run-time: use cudaLaunchKernelEx(...)
- Can be synchronized using cluster.sync()
- Distributed Shared Memory: shared between blocks belong to same cluster
Grid: multiple thread blocks grouped together. One kernel launch always correspond to one grid.
- There is no automatic synching between kernel launches (a later kernel can finish before other launched earlier). To sync, use cudaDeviceSynchronize()

Host/Device, Grid, Block, Thread. Programmer's perspective on memory.

Memory Hierarchy

Memory: (ranked by speed)

register memory: register, stack-based memory, only accessable within current thread. In most cases, accessing a register consumes zero clock cycles per instruction.
1. However, delays can occur due to read after write dependencies and bank conflicts. The latency of read after write dependencies is roughly 24 clock cycles. For newer CUDA devices that have 32 cores per multiprocessor, it may take up to 768 threads to completely hide latency.
2. In addition to the read after write latency, register pressure can severely detract from the performance of the application. Register pressure occurs when there are not enough registers available for a given task. When this occurs, the data is “spilled over” using local memory.
shared memory: memory created using __shared keyword and is shared within the block. (Shared memory is as fast as registers if there are no bank conflicts)
__constant__ memory: read only data. Using constant rather than global memory can reduce the required memory bandwidth, however, this performance gain can only be realized when a warp of threads read the same location. (1…10s … 100s of cycles, depending on cache locality)
1. initialized by the host code using cudaMemcpyToSymbol(), cudaMemcpyFromSymbol() or cudaMemcpy() in combination with cudaGetSymbolAddress()
2. Only 64KB of constant memory
3. Does not tie up to register, so we can safe register count to prevent spelling
4. Don't forget the f at the end if you want single precision (since single precision $\times$ double precision $=$ double precision)
texture memory: read-only memory on the device. When all reads in a warp are physically adjacent, using texture memory can reduce memory traffic and increase performance compared to global memory. (1…10s … 100s of cycles, depending on cache locality)
local memory: usually used (automatically) when one runs out of SM resources. Visible to current thread. Bytes are stored in global memory, but they are cached in L1. (Slow DRAM)
global memory: GPU-wide (per-application) memory, usually passed in as parameter. (100x slower than shared memory) (Slow DRAM)
Instruction Memory (invisible): DRAM, cached

Memory Space:

host memory: Memory space on CPU
device memory: Memory space on GPU
managed memory: unified memory, an programming abstraction for pre-fetching and better management. See sections about unified memory below.

Example of using Global Memory

__constant__ float constData[256];
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));
cudaMemcpyFromSymbol(data, constData, sizeof(data));

__device__ float devData;
float value = 3.14f;
cudaMemcpyToSymbol(devData, &value, sizeof(float));
__device__ float* devPointer;

float* ptr;
cudaMalloc(&ptr, 256 * sizeof(float));
cudaMemcpyToSymbol(devPointer, &ptr, sizeof(ptr));

For detail about memory, read here

GPU Core Consists of:

L0 memory: store stuff
Scheduler: make schedule
Dispacher: sends data for ALU to calculate
Registers: super quick temporary storage (compiler might expand loop to use more registers than expected)
- 64K 32-bit registers per SM
- up to 255 registers per thread
- up to 2048 threads per SM (at most 1024 per thread block)
- max registers per thread -> 256 threads
- max threads -> 32 registers per thread
- 8x difference between "fat" and "thin" threads
ALU: capable of computing $32$ threads, or $1$ warp
- 16 * int32
- 16 * fp32
- 8 * fp64
- 1 * tensor core

GPU Address Space: $20$ bits address space $5$ bits for representing $32$ threads, $3$ bits for $8$ (if programmer choose so) warps, and remaining $12$ remaining bits.

Global Dispatcher

$\begin{align*} 2^{20} \text{ data} =& 4096 \text{ blocks}\\ 1 \text{ block} =& 8 \text{ warps} \tag{configurated by specifying number of threads in a block}\\ 1 \text{ warp} =& 32 \text{ threads} \tag{fixed for every GPU}\\ 1 \text{ halfwarp} =& 16 \text{ threads} \tag{also fixed}\\ \end{align*}$

How it works: (assume $8$ warps in a block for now)

Programmer split $d$ bits of data into $b$ blocks, so a block will contains $d/b$ threads or $d/b/8$ warps. One SM (streaming multiprocessors) core can only store $1$ block at a time.
Global dispatcher send $1$ block to $1$ core, depending on how many cores we have.
Since a block can be broken down into $d/b/8$ warps, at maximum, a clock cycle allows the GPU to execute $1$ warp ( $32$ threads). However, since we have $16$ int32 units, assuming our data is integer, we need $2$ GPU cycles to finish $1$ warp. Within a block, data are passed by pipelining using derivatives of Tomasulo algorithm. Therefore, in this example, $16$ threads (a half-warp) are synchronously parallel, $1$ warp is asynchronously parallel by pipelining, and $b$ blocks are synchronously parallel by multiple cores.
Since $16$ threads are synchronous, there isn't branching within these data. But you might be able to do branching between each $16$ threads.
1. Warp divergence: execute both branch, but don't worry about illegal access of memory in branch that you already checked condition.
2. warp voting: If the branches are big, nvcc compiler inserts code to check if all threads in the warp take the same branch and then branches accordingly. (not very costly)

Suppose we have 1000 blocks, and each one has 128 threads - how does it get executed?

On Ampere hardware, would probably get 8-12 blocks running at the same time on each SM, and each block has 4 warps to 32-48 warps running on each SM.

Each clock tick, SM warp scheduler decides which warps to execute next, choosing from those not waiting for - data coming from device memory (memory latency) - completion of earlier instructions (pipeline delay)

Pipelining in CPU

5 stages of every instruction in processor diagram

Since every instruction is broken up in fetch, decode, execute, memory access, register write stages, we can pipeline multiple instructions one after the other.

fetch: fetch instruction from memory
decode: denode the meaning of instruction from byte code so we know what action to do
execute: ALU calculate value based on existing value in register
mem: access memory
wr reg: write memory to register

Cycles	Fetch	Decode	ALU	Mem	Wr Reg.
C1	Add R3,R2,1	Add R2,R1,1	Ldr R1,0x..	...	...
C2	Add R3,R2,1	Add R2,R1,1	/stall/	Ldr R1,0x..	...
C3	Add R3,R2,1	Add R2,R1,1	/stall/	/stall/	Ldr R1,0x..
C4	Jmp 0x..	Add R3,R2,1	Add R2,R1,1	/stall/	/stall/
C5	WRONG INST	Jmp 0x..	Add R3,R2,1	Add R2,R1,1	/stall/
C6	WRONG INST	WRONG INST	Jmp 0x..	Add R3,R2,1	Add R2,R1,1
C7	CORRECT	/flush/	/flush/	Jmp 0x..	Add R3,R2,1

Above is an example of pipelining execution where instructions move from left to right in cycles. Notice in C2 the Add R2,R1,1 instruction depends on the value of register R1 after loading from memory, therefore the Add could not be performed at the same time as Ldr R1,0x... CPU will stall Add R2,R1,1 instruction. Also notice in C5, C6 the instruction fetched from memory depends on the execution of Jmp 0x.., and therefore C6 will contain potentially wrong instructions. If this is the case, after Jmp 0x.. executed, wrong instructions will be flushed in C7. Stall and flush are two result of dependency in pipelining and will slow down the process.

Programmer's Abstraction

NAME	DESCRIPTION	x	y	z
threadIdx	Thread index within the block (zero-based)	threadIdx.x	threadIdx.y	threadIdx.z
blockIdx	Block index within the grid (zero-based)	blockIdx.x	blockIdx.y	blockIdx.z
blockDim	Block dimensions in threads	blockDim.x	blockDim.y	blockDim.z
gridDim	Grid dimensions in blocks	gridDim.x	gridDim.y	gridDim.z

In GPU programming, we only specify the number of threads in a block (block dimension), and the program will automatically divide those threads into warps (executing instructions in lock steps). One block correspond to one shader core.

Note that there are maximum number of threads in a block and maximum number of blocks in a grid. The limit depends on specific GPU architecture.

Computation Capability: there are maximum dimensionality of grids and blocks you can specify

OpenMP, OpenACC, CUDA, OpenCL support for different GPU vendor

CUDA	OpenCL
Kernel	Kernel
Host program	Host program
Thread	Work item
Thread block	Work group
Grid	NDRange(index space)
`threadIdx.x`	`get_local_id(0)`
`blockIdx.x * blockDim.x + threadIdx.x`	`get_global_id(0)`
`__global__ function`	`__kernel function`
`__device__ function`	`function`
`__constent__ variable`	`__constant variable`
`__device__ variable`	`__global variable`
`__shared__ variable`	`__local variable`

Coding CUDA

Function and Variable Declearation

CUDA extends C:

Type Qualifiers: __device__, __global__, __shared__
- __global__: defines a kernel function that must return void, executed on device, only callable from the host (CPU).
- __device__: defines a function called by kernels. Executed on the device, only callable from the device (GPU). You cannot take address of __device__ function. No recursion, no static variable declarations inside the function, no variable number of arguments.
- __host__: defines a function running on the host. Executed on the host, only callable from the host (CPU).
Keywords: threadIdx, blockIdx
Intrinsics: __syncthreads
Runtime API: cudaMalloc()
Kernel launch: convolve<<<100, 10>>> (myimage);

Threads within a block coordinate by "shared memory", "atomic operations" and "barrier synchronization". Threads in different blocks can not cooperate.

Variable declaration	Mmeory	Scope	Lifetime
`__device__ __local__ int LocalVar;`	local	thread	thread
`__device__ __shared__ int SharedVar;`	shared	block	block
`__device__ int GlobalVar;`	global	grid	kernel
`__device__ __constant__ int ConstantVar;`	constant	grid	kernel

__device__ is optional when used with __local__, __shared__, or __constant__.

Helper Functions

cudaGetDeviceCount(&deviceCount): The number of active GPUs visible to the rank

cudaSetDevice(): use device

cudaGetDeviceProperties(): get cudaDeviceProp

Datatypes

Vectors: (addressed by .x, .y, .z, .w)

double2, double3, double4
float2, float3, float4
int, uint

Floats:

half, half2: $6 \times 10^{-5} \to 6 \times 10^4$
bfloat16, bfloat162: lower precision version of float

Built-in Variables

gridDim: type dim3 (like uint3 but .x, .y, .z initialized to 1 by default)
blockIdx: type uint3
blockDim: type dim3
threadIdx: type uint3
warpSize: always 32

Taste of Programming with CUDA

#include "cuda.h"
int main(int argc, char **argv) {
  int N = 3789; // size of array we want to compute

  float *d_a; // an address in GPU we want to know
  cudaMalloc((void**) &d_a, N*sizeof(float)); // tell GPU to give us allocated GPU address

  int B = 512; // programmer partition data into blocks each has size of 512 threads
  dim3 dimBlock(B, 1, 1); // programmer specify the dimension of a block, which in this case is (512 threads, 1 thread, 1 thread)
  dim3 dimGrid((N + B - 1) / B, 1, 1); // programmer specify the dimension of Grid which contains `(N + B - 1) / B` many blocks

  // Queue Kernel on DEVICE
  simpleKernel<<<dimGrid, dimBlock>>>(N, d_a);

  float *h_a = (float*) calloc(N, sizeof(float)); // allocate array in CPU host to receive calculated data from GPU device
  cudaMemcpy(h_a, d_a, N*sizeof(float), cudaMemcpyDeviceToHost); // this will block until GPU finish kernel

  for(int n = 0; n < N; ++n) printf("h_a[%d] = %f\n", n, h_a[n]); // print stuff out
}

You can imagine there is a internal function that calls your kernel:

void kernelCaller(int N, float *d_a) {
  for (int b = 0; b < gridDim; ++b) {
    for (int t = 0; t < blockDim; ++t) {
      simpleKernel(b, t, gridDim, blockDim)
    }
  }
}

But instead of passing b, t, gridDim, and blockDim, you have blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y, threadIdx.z, blockDim.x, blockDim.y, blockDim.z, gridDim.x, gridDim.y, gridDim.z avaliable in your kernel. Those variables appear from nowhere in code.

The actual kernel looks like this:

__global__ void simpleKernel(int N, float *d_a) {
  // Convert threadIdx and blockIdx to array index
  const int n = threadIdx.x + blockDim.x * blockIdx.x;

  if (n < N) { // make sure we are inside the array we want to calculate
    d_a[n] = n;
  }
  // You don't return
}

Finally, you compile with nvcc -o ./out -arch=sm_50 main.cu. Make sure to set -arch=sm_50 to your GPU architecture. You can use cuda-memcheck --leak-check full to check for memory leak.

cuobjdump, cuobjdump -ptx, cuobjdump -sass -ptx is used to see byte code of compiled kernel.

nvprof and nvvp are for profiling.

Below is another sample code from Nvidia.

#include <iostream>
#include <math.h>

// CUDA Kernel function to add the elements of two arrays on the GPU. `__global__` tells the CUDA C++ compiler that this is a function that runs on the GPU and can be called from CPU code.
__global__
void add(int n, float *x, float *y) {
  for (int i = 0; i < n; i++)
      y[i] = x[i] + y[i];
}

int main(void) {
  int N = 1<<20; // 1M elements

  float *x, *y;
  cudaMallocManaged(&x, N*sizeof(float)); // float *x = new float[N];
  cudaMallocManaged(&y, N*sizeof(float)); // float *y = new float[N];

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the CPU
  add(N, x, y);

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  delete [] x;
  delete [] y;

  return 0;
}

Compiling

Kernel, kernel invocation, cudaMalloc, are all best off in a .cu file somewhere
MPI calls should be in .c files
nvcc processes .cu files to generate objective files
mpicc/mpicxx processes .c/.cpp files to generate objective files
If we need to call CUDA kernels from within an MPI task, we can wrap the appropriate CUDA compiled functions with the extern keyword.

CUDA Random Number Generator (RNG)

CURAND:

NVIDIA's library for random number generation in CUDA
CURAND can be called from the host and the device
CURAND Host API provides functions callable on the host to generate random data in GPU global memory
Can create multiple pseudorandom generators using different algorithms

curandGenerator_t r;
// argument tells which algorithm to use
curandCreateGenerator(&r, CURAND_RNG_PSEUDO_DEFAULT);
curandSetStream(r, stream); // optional
curandSetPseudoRandomGeneratorSeed(r, seed);
curandGenerateUniform(r, data, numElems);
curandDestroyGenerator(r);

__device__ void curand_init (unsigned long long seed, unsigned long long
sequence, unsigned long long offset, curandState *state);
__device__ float curand_uniform (curandState *state);
__device__ float curand_normal (curandState *state);

Generate Random Number in Host

#include <curand.h>
int main() {
. . .
curandGenerator_t gen;
float *devNum, *hostNum;
hostNum = new float[n];
cudaMalloc((void **)&devNum, n*sizeof(float));
. . .
curandCreateGenerator(&gen,CURAND_RNG_PSEUDO_DEFAULT);
curandSetPseudoRandomGeneratorSeed(gen, 12345);
curandGenerateUniform(gen, devNum, n);
cudaMemcpy(hostNum, devNum, n*sizeof(float),cudaMemcpyDeviceToHost);
. . .
curandDestroyGenerator(gen);
cudaFree(devNum);
. . .
}

Generate Random Number in Kernel

#include <stdio.h>
#include <stdlib.h>
#include <curand_kernel.h> // CURAND lib header file
#define TRIALS_PER_THREAD 2048
#define BLOCKS 256
#define THREADS 256
int main(int argc, char *argv[]) {
 float host[BLOCKS * THREADS];
 float *dev;
 curandState *devStates;
 cudaMalloc((void **) &dev, BLOCKS * THREADS * sizeof(float));
 cudaMalloc( (void **)&devStates, BLOCKS*THREADS*sizeof(curandState) );
 …
}

__global__ void pi_mc(float *estimate, curandState *states) {
 unsigned int tid = threadIdx.x + blockDim.x*blockIdx.x;
 int points_in_circle = 0;
 float x, y;
 // Initialize CURAND
 curand_init(tid, 0, 0, &states[tid]);
 for(int i = 0; i < TRIALS_PER_THREAD; i++) {
 x = curand_uniform(&states[tid]);
 y = curand_uniform(&states[tid]);
 // count if x & y is in the circule.
 points_in_circle += (x*x + y*y <= 1.0f);
 }
 estimate[tid] = 4.0f * points_in_circle / (float) TRIALS_PER_THREAD;
}

Simple Vector Addition

In order to store data on the gpu that can be communicated back to the host, we need to have allocated memory that lives until it is freed, see global memory as the heap space with life until the application closes or is freed, it is visible to any thread and block that have a pointer to that memory region. Shared memory can be considered as stack space with life until a block of a kernel finishes, the visibility is limited to only threads within the same block.

Shared memory is faster than global memory, but need to be initialized from global memory. This initialization must be coalesced (access global memory with locality in mind) to achieve high performance.

Main Take away: Global memory refers to allocated memory in the input of kernel. It can be used by all threads within all blocks. In contrast, shared memory is a stack space that is only shared among threads within the same block.

The size of the Global memory depends from card to card, anything from none to 32GB (V100). While the shared memory depend on the compute capability. Anything below compute capability 2.x have a maximum 16KB of shared memory per multiprocessor(where the amount of multiprocessors varies from card to card). And cards with compute capability of 2.x and greater have an minimum of 48KB of shared memory per multiprocessor. (See documentation for details)

CUDA could use CPU memory (mapped memory) as Global memory instead of GPU memory. Although a bit slow, the amount of memory now depends on CPU memory avaliability.

#include <algorithm>
#include <cassert>
#include <iostream>
#include <vector>

// CUDA kernel for vector addition
// __global__ means this is called from the CPU, and runs on the GPU. The name global refers to global memory allocated on GPU.
// In the C programming language, restrict is a keyword that can be used in pointer declarations. By adding this type qualifier, a programmer hints to the compiler that for the lifetime of the pointer, no other pointer will be used to access the object to which it points. This allows the compiler to make optimizations (for example, vectorization) that would not otherwise have been possible. __restrict is a cuda-memory version of restrict.
// a, b, c both are global GPU memory
__global__ void vectorAdd(const int *__restrict a, const int *__restrict b,
                          int *__restrict c, int N) {
  int tid = (blockIdx.x * blockDim.x) + threadIdx.x; // global thread ID
  if (tid < N) c[tid] = a[tid] + b[tid]; // Boundary check and then write back to global memory
}

int main() {
  constexpr int N = 1 << 16; // Array size of 2^16 (65536 elements)
  constexpr size_t bytes = sizeof(int) * N;

  // Initialize random vectors for holding the host-side (CPU-side) data
  std::vector<int> a;
  a.reserve(N);
  std::vector<int> b;
  b.reserve(N);
  std::vector<int> c;
  c.reserve(N); // allocate empty memory for GPU return
  for (int i = 0; i < N; i++) {
    a.push_back(rand() % 100);
    b.push_back(rand() % 100);
  }

  // Allocate memory on the device (GPU)
  int *d_a, *d_b, *d_c;
  cudaMalloc(&d_a, bytes);
  cudaMalloc(&d_b, bytes);
  cudaMalloc(&d_c, bytes);
  // Copy data from the host to the device (CPU -> GPU)
  cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

  // Threads per "compute thread array" (generally called block, or CTA) (1024)
  int NUM_THREADS = 1 << 10;
  // blocks per Grid
  int NUM_BLOCKS = (N + NUM_THREADS - 1) / NUM_THREADS;

  // Launch the kernel on the GPU
  vectorAdd<<<NUM_BLOCKS, NUM_THREADS>>>(d_a, d_b, d_c, N);

  // Copy sum vector from device to host
  // cudaMemcpy is a synchronous operation, and waits for the prior kernel
  // launch to complete (both go to the default stream in this case).
  // Therefore, this cudaMemcpy acts as both a memcpy and synchronization
  // barrier.
  cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); // Free memory on device

  std::cout << "COMPLETED SUCCESSFULLY\n";
  return 0;
}

Unified Memory and Prefetch

Q: What are the advantages of using unified memory other than not needing to transfer data between host and device? After profiling it seems like using unified memory takes significant more time than the usual malloc and cudaMalloc.

A: Good question! Unified memory can take longer than normal allocation and copy, but with adequate pre-fetching, they can yield roughly similar results. It is mainly a programmability feature, and that goes beyond just avoiding duplicate data structures. Modern architectures (Volta and later) support over-subscription, where you can allocate more memory than you have on your GPU. Doing this without unified memory would require you to decompose your problem into multiple kernels so that you can work on a chunk of data, finish computation, copy new data in, start a new kernel, etc. With unified memory, none of that is required.

We also learned about hints, for more, read this paper

cudaMemAdviseSetAccessedBy (AB) indicates that the data will always be mapped into a specified processor's page tables if possible. When the data is migrated, the mapping will be updated accordingly.
cudaMemAdviseSetPreferredLocation (PL) sets the preferred location (CPU or GPU) for a range of data. If another processor wants to access the memory region, then the data will not migrate from the preferred location.
cudaMemAdviseSetReadMostly (RM) implies that the data will be mostly read and occasionally written to. It lets the driver create read-only copies of that data, and then send copies to other processors rather than migrate.

#include <stdio.h>
#include <cassert>
#include <iostream>

using std::cout;

__global__ void vectorAdd(int *a, int *b, int *c, int N) {
  int tid = (blockDim.x * blockIdx.x) + threadIdx.x;
  if (tid < N) c[tid] = a[tid] + b[tid];
}

int main() {
  const int N = 1 << 16;
  size_t bytes = N * sizeof(int);

  int *a, *b, *c;
  // Allocation memory for these pointers
  // CUDA transfer memory from CPU to GPU and from GPU to CPU automatically
  // based on demand. You can think of this memory as a shared memory in GPU/CPU
  cudaMallocManaged(&a, bytes);
  cudaMallocManaged(&b, bytes);
  cudaMallocManaged(&c, bytes);


  int id = cudaGetDevice(&id); // Get the device ID for prefetching calls
  // cudaMemAdviseSetPreferredLocation (PL) sets the preferred location (CPU or GPU) for a range of data. If another processor wants to access the memory region, then the data will not migrate from the preferred location.
  // In this example, we prefer data a, b to stay on CPU
  cudaMemAdvise(a, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
  cudaMemAdvise(b, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
  // Set some hints about the data and do some prefetching for c
  cudaMemPrefetchAsync(c, bytes, id);

  // Initialize vectors
  for (int i = 0; i < N; i++) {
    a[i] = rand() % 100;
    b[i] = rand() % 100;
  }

  // cudaMemAdviseSetReadMostly (RM) implies that the data will be mostly read and occasionally written to. It lets the driver create read-only copies of that data, and then send copies to other processors rather than migrate.
  cudaMemAdvise(a, bytes, cudaMemAdviseSetReadMostly, id);
  cudaMemAdvise(b, bytes, cudaMemAdviseSetReadMostly, id);
  // Pre-fetch 'a' and 'b' arrays to the specified device (GPU)
  cudaMemPrefetchAsync(a, bytes, id);
  cudaMemPrefetchAsync(b, bytes, id);

  int BLOCK_SIZE = 1 << 10;
  int GRID_SIZE = (N + BLOCK_SIZE - 1) / BLOCK_SIZE;

  vectorAdd<<<GRID_SIZE, BLOCK_SIZE>>>(a, b, c, N);

  // Wait for all previous operations before using values
  // We need this because we don't get the implicit synchronization of
  // cudaMemcpy like in the original example
  cudaDeviceSynchronize();

  // Prefetch to the host (CPU)
  cudaMemPrefetchAsync(a, bytes, cudaCpuDeviceId);
  cudaMemPrefetchAsync(b, bytes, cudaCpuDeviceId);
  cudaMemPrefetchAsync(c, bytes, cudaCpuDeviceId);

  cudaFree(a); cudaFree(b); cudaFree(c);

  cout << "COMPLETED SUCCESSFULLY!\n";
  return 0;
}

Matrix Multiplication

To calculate $A^{a \times n} \cdot B^{n \times b} = C^{n \times n}$ , we assign each thread to an element of $C$ . So we can view our threads as 2D threads each can be identified as (threadIdx.x, threadIdx.y). However, internally, the GPU threads are 1 dimensional, say [(0, 0), (0, 1), ..., (0, n), ..., (1, 0), (1, 1), ..., (1, n), ..., (n, 0), ..., (n, n)]. For each thread, there are $n$ additions, and for each addition we need to load $2$ integers. Since at least $32$ threads are running in lock step, we need to load $32 \times 2$ integers at once from $2$ matrix. To preserve locality, we hope that $32$ integers are in the same memory region. For the first iteration, every thread loads the first column of $A$ and first row of $B$ . Therefore, $B$ preserves locality while $A$ is not.

However, another problem is that we don't usually have enough cache to fit $32 \times 2$ integers at once. Cache tiling is needed just like CMU 15-213 Cache Lab.

We will take a simpler approach in the code below: we divide $C^{n \times n}$ into $(n /\text{block\_size}, n /\text{block\_size})$ regions. We still map each thread to each element in $C^{n \times n}$ , but change the order of calculation by maintaining partial sum as the image below:

__global__ void matrixMul(const int *a, const int *b, int *c) {
  // Compute each thread's global row and column index
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;

  // Statically allocated shared memory
  __shared__ int s_a[SHMEM_SIZE];
  __shared__ int s_b[SHMEM_SIZE];

  // Accumulate in temporary variable
  int tmp = 0;

  // Sweep tile across matrix
  for (int i = 0; i < N; i += blockDim.x) {
    // Load in elements for this tile
    s_a[threadIdx.y * blockDim.x + threadIdx.x] = a[row * N + i + threadIdx.x];
    s_b[threadIdx.y * blockDim.x + threadIdx.x] =
        b[i * N + threadIdx.y * N + col];

    // Wait for both tiles to be loaded in before doing computation
    __syncthreads();

    // Do matrix multiplication on the small matrix
    for (int j = 0; j < blockDim.x; j++) {
      tmp +=
          s_a[threadIdx.y * blockDim.x + j] * s_b[j * blockDim.x + threadIdx.x];
    }

    // Wait for all threads to finish using current tiles before loading in new
    // ones
    __syncthreads();
  }

  // Write back results
  c[row * N + col] = tmp;
}

Reduction Optimization in CUDA

Reduction Usually are: which enable re-arrange

commutative
associative

Note: in MPI there are special routines to perform reductions over distributed arrays

Scan: scan is another important building block of algorithm that can be massively paralleled.

Strategy: (1) perform local scan within each block (2) add on sum of all preceding blocks.
Local Scan:
- $O(n\log n)$ : using shared memory
- $O(n)$ : using warp shuffles and a recursive structure

Scan is very tricky to be implement quickly with maximum efficiency.

Advice: best to first see if you can get working code from someone else (e.g. investigate Thrust C++ library) Don’t re-invent the wheel unless you really think you can do it better.

For this reason, I will not cover the exact algorithm here. Interested reader can read directly from slide

General Tips:

for within-block calculation, there are two choices
- use shuffles to compute solution within each warp
- use shared memory and shuffles to combine results from different warps and update solution from first stage
When combining the results from different blocks
- conbine result in CPU
- use atomic trick to launch blocks in order, and then after completing first phase get $v_{-1}$ from previous block to complete the computation.

Single-block parallel reduction for commutative operator

We partition our data into only 1 block (gridSize = 1) consists of 1024 threads. Therefore each thread will process more than 1 element of an array.

static const int arraySize = 10000;
static const int blockSize = 1024;

// the function is run at thread level
__global__ void sumCommSingleBlock(const int *a, int *out) {
    int idx = threadIdx.x; // each thread start at thread id. Thread 2 takes a[2].
    int sum = 0;
    for (int i = idx; i < arraySize; i += blockSize) // each thread sum and jump blocksize. Thread 2 takes a[2], a[1024+2], a[2048+2], ...
        sum += a[i];
    __shared__ int r[blockSize]; // each thread within a block pull their result together. This instruction is executed at block level
    r[idx] = sum;
    __syncthreads(); // wait until every thread in the block finishes
    for (int size = blockSize/2; size>0; size/=2) { // reduction
        if (idx<size)
            r[idx] += r[idx+size];
        __syncthreads();
    }
    if (idx == 0) // copy result to global memory
        *out = r[0];
}

...

sumCommSingleBlock<<<1, blockSize>>>(dev_a, dev_out);

In above example, we use shared memory __shared__ int r[blockSize], which is 100x faster than global memory const int *a, int *out.

Multi-block parallel reduction for commutative operator

#define SHMEM_SIZE 256

__global__ void sumReduction(int *v, int *v_r) {
  // Allocate shared memory
  __shared__ int partial_sum[SHMEM_SIZE];

  // Calculate thread ID
  int tid = blockIdx.x * blockDim.x + threadIdx.x;

  // Load elements into shared memory
  partial_sum[threadIdx.x] = v[tid];
  __syncthreads();

  // Iterate of log base 2 the block dimension
  for (int s = 1; s < blockDim.x; s *= 2) {
    // Reduce the threads performing work by half previous the previous
    // iteration each cycle
    if (threadIdx.x % (2 * s) == 0) {
      partial_sum[threadIdx.x] += partial_sum[threadIdx.x + s];
    }
    __syncthreads();
  }

  // Let the thread 0 for this block write it's result to main memory
  // Result is indexed by this block
  if (threadIdx.x == 0) {
    v_r[blockIdx.x] = partial_sum[0];
  }
}

int main() {
  // Vector size
  int N = 1 << 16;
  size_t bytes = N * sizeof(int);

  // Host data
  vector<int> h_v(N);
  vector<int> h_v_r(N);
  generate(begin(h_v), end(h_v), [](){ return rand() % 10; }); // Initialize data

  int *d_v, *d_v_r;
  cudaMalloc(&d_v, bytes);
  cudaMalloc(&d_v_r, bytes);
  cudaMemcpy(d_v, h_v.data(), bytes, cudaMemcpyHostToDevice);

  const int TB_SIZE = 256; // Block Size (how many threads in a block)
  int GRID_SIZE = N / TB_SIZE; // Grid Size (No padding)

  // first summon 256 blocks and calculate 256 partial sum
  sumReduction<<<GRID_SIZE, TB_SIZE>>>(d_v, d_v_r);
  // then use 1 block of 256 threads to sum those 256 partial sum
  sumReduction<<<1, TB_SIZE>>> (d_v_r, d_v_r);

  cudaMemcpy(h_v_r.data(), d_v_r, bytes, cudaMemcpyDeviceToHost);
  assert(h_v_r[0] == std::accumulate(begin(h_v), end(h_v), 0));
  cout << "COMPLETED SUCCESSFULLY\n";
  return 0;
}

Multi-block parallel reduction for commutative operator with warp divergence

if (branch_condition) do_work(); else do_no_work()

In the example above, we set the block size to be 256 and therefore each block consists of $256 / 32 = 8$ semi-sequential warps with lock step. In every warp, at least half of the threads takes the if branch condition to do no work while the rest do some work. In this case, some work-items in the same warp must not run the same instruction, that they are masked when the warp is dispatched. If the conditional is such that some warp in the block must do something, and the other work-items in the block must do something else, then what happens is that the two code paths are taken sequentially by the block, with the appropriate work-items masked. The resulting runtime is exactly the same as every thread in the same warp will do the same amount of work as those working threads. This is not ideal since we can make at least half of the threads (and therefore half of the warps) in a block execute faster than the rest of the threads (rest of the warps) because they are essentially doing nothing. We will implement this by making all threads that do work have similar threadIdx, which means those that do work are grouped in the same warp.

When a warp instruction is executed on compute capability 2.x, the entire warp is sent to 16 of the CUDA cores. CUDA cores are pipelined (as are all modern CPUs), so what happens is that the 32 threads are queued up in two consecutive pipeline stages in those 16 CUDA cores. The instruction takes something like 16 to 24 clock ticks to propagate through the pipeline, but because many instructions moving through the pipeline at once, that group of 16 CUDA cores will complete one warp every two clock ticks. (seibert)

Although the smallest execution unit is half-warp, the warp-divergence is only in the warp level (not half-warp level) because the scheduler works with entire warps. The programmer should not worry about implementation detail of half-warp. For detail, see Compute Capabilities

When the warp instruction is issued to a group of 16 cores, the entire warp executes the instruction, because the cores are clocked twice (Fermi's "hotclock") so that each core actually executes two thread's worth of computation in a single cycle (= 2 hotclocks). When a warp instruction is dispatched, the entire warp gets serviced. It does not need to be scheduled twice. Robert Crovella Certainly all threads in a warp are executing the same instruction at any given time. But warps execute independently from each other and so different warps within a block may be executing different instructions from the stream, at any given time.

The instruction-level parallism (ILP) is more complicated. For detail, please see Stackoverflow

The determination of whether to do mask or to do branch seem to be in run-time. One thing to keep in mind is that the warp size on all CUDA devices so far is 32, but NVIDIA has said that it may change in the future. If you need to use the size of a warp in your device code somewhere, then you should use the integer variable warpSize, which is provided by the compiler. (seibert)

Multi-block parallel reduction for commutative operator with warp divergence

The code is basically identical as above except for the inner for loop. The execution is about 2x faster than above implementation.

for (int s = 1; s < blockDim.x; s*= 2) {
  int index = 2 * s * threadIdx.x;
  if (index < blockDim.x) {
    partial_sum[index] += partial_sum[index + s];
  }
  __syncthreads();
}

Multi-block parallel reduction for commutative operator with warp divergence and memory locality without shared memory bank conflict

The shared memory in each block is broken into words of 32 bit (a int32, a float, 4 bytes, 2 short, or half double). Those words live in 32 different banks as the diagram below shows.

However, Kepler architecture introduced the option to increase banks to 8 bytes using cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte). This can help avoid bank conflicts when accessing double precision data.

Bank Diagram in terms of words: word 0 at bank 0, word 1 at bank 1, ..., word 31 in bank 31, but word 32 go back again to bank 0.

Bank Diagram in GPU. The total size of shared memory may be set to 16KB, 32KB or 48KB (with the remaining amount automatically used for L1 Cache) when kernel launch as shown in the image. Shared memory defaults to 48KB (with 16KB remaining for L1 Cache). With the Kepler architecture, each bank has a bandwidth of 64 bits per clock cycle. The older Fermi architecture was clocked differently, but effectively offered half this bandwidth.

Bank Conflict: occurs when two or more threads of a half-warp request different words from the same bank in a single request, those memory access are done in sequential order.

Bank conflicts are not that bad. However, since there are only 32 banks, if you map every threadIdx to a unique double, then you are guaranteed to have a 2-way bank conflicts.
If you have 16 structs of two int32, then if 16 threads (half-warp) access the first int32 simultaneously, there is a 16-way bank conflict. However, if you have 16 structs of three int32, then if 16 threads (half-warp) access the first, second, or third int32, simultaneously it is not an issue. When accessing the first int32, the first thread reads from bank0, the second thread reads from bank3, the third thread reads from bank6, ..., and the 11th thread read from bank30, and the 12th thread read from bank1... This can work as long as there isn't more than 32 memory access to a word simultaneously.

// QUESTION: some article say that bank conflict are in half-warp level while other say in warp level.

Broadcast: if all threads of a half-warp request exactly the same word. The word will be read once from shared memory and broadcast to the threads.

Multicast (compute capability >= 2.0): if several threads request the same value from a particular bank, while other threads request a same word from different bank, it is still a multicast (same speed as broadcast). As long as only one word is read from a bank in one cycle, it is considered a multicast.
Due to multicast, consecutive access to shared memory by a warp of threads does not result in a bank conflict for "small" element sizes.

arr[threadIdx.x*0], arr[12], and arr[blockIdx.x*3] are examples of broadcast.

/**
* Allocates at least width (in bytes) * height bytes of
* linear memory on the device and returns in *devPtr a pointer
* to the allocated memory. The function may pad the allocation
* to ensure that corresponding pointers in any given row will
* continue to meet the alignment requirements for coalescing as
* the address is updated from row to row. The pitch returned in
* *pitch by cudaMallocPitch() is the width in bytes of the allocation.
* @param devPtr: Pointer to allocated pitched device memory
* @param pitch: Pitch for allocation
* @param width: Requested pitched allocation width (in bytes)
* @param height: Requested pitched allocation height
* @returns cudaSuccess cudaErrorMemoryAllocation
**/
cudaError_t cudaMallocPitch(void **devPtr, size_t *pitch, size_t width,
size_t height);

Given the row and column of an array element of type T, the address is computed as T* pElement = (T*)((char*)BaseAddress + Row * pitch) + Column;

Linear memory can also be allocated through cudaMallocPitch() and cudaMalloc3D().

Recommended for allocations of 2D or 3D arrays as it makes sure that the allocation is appropriately padded to meet the alignment requirements imposed by the device.
It ensures best performance when accessing the row addresses or performing copies between 2D arrays and other regions of device memory (using the cudaMemcpy2D() and cudaMemcpy3D() functions)
The returned pitch (or stride) must be used to access array elements.

Multi-block parallel reduction for commutative operator with warp divergence and memory locality without shared memory bank conflict

The code is basically identical as above except for the inner for loop. The execution is about 1.3x faster than above implementation.

for (int s = blockDim.x / 2; s > 0; s >>=1) {
  if (threadIdx.x < x) partial_sum[threadIdx.x] += partial_sum[threadIdx.x + s];
  __syncthreads();
}

A final optimization can be done by do calculation at the time when loading from global memory to shared memory to get rid of idle threads in the first batch execution. This will improve 1.4x(5 us) faster.

Single-block parallel reduction for non-commutative operator

static const int arraySize = 1000000;
static const int blockSize = 1024;

__global__ void sumNoncommSingleBlock(const int *gArr, int *out) {
    int thIdx = threadIdx.x;
    __shared__ int shArr[blockSize*2];
    __shared__ int offset;
    shArr[thIdx] = thIdx<arraySize ? gArr[thIdx] : 0;
    if (thIdx == 0)
        offset = blockSize;
    __syncthreads();
    while (offset < arraySize) { //uniform
        shArr[thIdx + blockSize] = thIdx+offset<arraySize ? gArr[thIdx+offset] : 0;
        __syncthreads();
        if (thIdx == 0)
            offset += blockSize;
        int sum = shArr[2*thIdx] + shArr[2*thIdx+1];
        __syncthreads();
        shArr[thIdx] = sum;
    }
    __syncthreads();
    for (int stride = 1; stride<blockSize; stride*=2) { //uniform
        int arrIdx = thIdx*stride*2;
        if (arrIdx+stride<blockSize)
            shArr[arrIdx] += shArr[arrIdx+stride];
        __syncthreads();
    }
    if (thIdx == 0)
        *out = shArr[0];
}

...

sumNoncommSingleBlock<<<1, blockSize>>>(dev_a, dev_out);

Special Topics

Atomic Instructions

// addition / subtraction
atomicAdd()
atomicSub()
// minimum / maximum
atomicMin()
atomicMax()
// increment / decrement
atomicInc()
atomicDec()
// exchange / compare-and-swap
atomicExch()
atomicCAS()
// bitwise AND / OR / XOR
atomicAnd()
atomicOr()
atomicXor()

For locks of global memory, you can use:

// compare and swap (useful for atomic locks)
int atomicCAS(int* address,int compare,int val);
// if compare equals old value stored at address else val is stored instead
// in either case, routine returns the value of old

Here is an example of global atomic lock using __threadfence();:

__threadfence_block(): wait until all global and shared memory writes are visible to all threads in block

__threadfence(): wait until all global and shared memory writes are visible to all threads in block, all threads, for global data

__global__ void kernel(...) {
...
if (threadIdx.x==0) {
  do {} while(atomicCAS(&lock,0,1));
  ... // some global memory write
  __threadfence(); // wait for writes to finish: since writing to global memory is async, we need manual synching to ensure correctness
  lock = 0; // free lock
  }
}

Warp Shuffles

Warp Shuffles: exchanging data between threads in the same warp

LaneID: threadIdx.x % 32 for 1D blocks

// mask: controls which threads are involved — usually set to -1
// var: a local register variable (int, unsigned int, long long, unsigned long long, float or double) put in for one other thread
// delta: offset within the warp
// if the appropriate thread does not exist then the value is taken from the current thread
// return: value from one other thread
T __shfl_up_sync(unsigned mask, T var, unsigned int delta);
// copy from a lane with lower ID relative to caller
T __shfl_down_sync(unsigned mask, T var, unsigned int delta);
// copy from a lane with higher ID relative to caller

// an XOR operation is performed between laneMask and the calling thread’s laneID to determine the lane from which to copy the value
// very useful for reduction operations and FFTs
T __shfl_xor_sync(unsigned mask, T var, int laneMask);
// copy from a lane based on bitwise XOR of own lane ID

// copies data from srcLane
T __shfl_syncc(unsigned mask, T var, int srcLane);
// copy from indexed lane ID

Be careful about condition if code: Threads may only read data from another thread which is actively participating in the shuffle command. If the target thread is inactive, the retrieved value is undefined.

Streams

Here we introduce streams.

When there exists no concept of "stream", therefore we call the implementation as "default stream". Say somehow I want the GPU to process two different kernel code. The first calculate one integer addition while the second calculate one integer multiplication. As you can see, there is little memory and ALU usage in both kernel, and therefore you expect the kernels run in concurrent. However, it is not possible since we only have one "stream", the "default stream". The kernel execution is serialized.

// default stream and stream 0 are different
// launch of kernel is async
kernel<<< blocks, threads, bytes >>>();    // default stream
kernel<<< blocks, threads, bytes, 0 >>>(); // stream 0

When the concept of stream is introduced (but before CUDA7), all launch of kernel that doesn't have a stream assigned will go to "default stream". Say if you launch multiple kernels concurrently in multiple host-thread, they will all go to "default stream" and executed in serial (sequential order)

Before CUDA7: A multi-threaded example with the legacy default stream behavior: all eight threads are serialized.

In later versions of cuda, user can control which stream they want to execute kernels in. But they run in to backward compatibility issue: if you have a job executing "default stream", all other numbered stream cannot execute in concurrent with "default stream". This is because "default stream" is application-wide and is shared accross host-threads and want to take up the whole bandwidth.

After CUDA7: Each host-thread creates a new stream (the same as regular stream as far as synchronization and concurrency goes) automatically and they do not synchronize, so the kernels from all eight threads run concurrently.

The code below demonstrates this issue:

const int N = 1 << 20;

__global__ void kernel(float *x, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159,i));
    }
}

int main() {
    const int num_streams = 8;

    cudaStream_t streams[num_streams];
    float *data[num_streams];

    for (int i = 0; i < num_streams; i++) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&data[i], N * sizeof(float));

        // launch one worker kernel per stream
        kernel<<<1, 64, 0, streams[i]>>>(data[i], N);

        // launch a dummy kernel on the default stream
        kernel<<<1, 1>>>(0, 0);
    }
    cudaDeviceReset();
    return 0;
}

If we compile above code with nvcc ./stream_test.cu -o stream_legacy you will see that the dummy kernel in default stream sandwitched between two launches of kernel in numbered stream causes numbered stream to serialize.

NVIDIA Visual Profiler (nvvp): resulting kernel timeline on a Macbook Pro with an NVIDIA GeForce GT 750M (a Kepler GPU). You can see the very small bars for the dummy kernels on the default stream, and how they cause all of the other streams to serialize.

CUDA 7 introduces a new option, the "per-thread default stream", that has two effects:

it gives each host-thread its own default stream. This means that commands issued to the default stream by different host threads can run concurrently.
these default streams are regular streams. This means that commands in the default stream may run concurrently with commands in non-default streams.

If we compile above code with nvcc --default-stream per-thread ./stream_test.cu -o stream_per-thread, you will see that default stream now becomes an usual stream.

To enable per-thread default streams in CUDA 7 and later, you can either compile with the nvcc command-line option --default-stream per-thread, or #define the CUDA_API_PER_THREAD_DEFAULT_STREAM preprocessor macro before including CUDA headers (cuda.h or cuda_runtime.h). It is important to note: you cannot use #define CUDA_API_PER_THREAD_DEFAULT_STREAM to enable this behavior in a .cu file when the code is compiled by nvcc because nvcc implicitly includes cuda_runtime.h at the top of the translation unit. The --default-stream option is applied per compilation unit, so make sure to apply it to all nvcc command lines that need it.

Results from klzzwxh:0138. Here you can see full concurrency between nine streams: the default stream, which in this case maps to Stream 14, and the eight other streams we created. Note that the dummy kernels run so quickly that it’s hard to see that there are eight calls on the default stream in this image. — Results from `nvvp`. Here you can see full concurrency between nine streams: the default stream, which in this case maps to Stream 14, and the eight other streams we created. Note that the dummy kernels run so quickly that it’s hard to see that there are eight calls on the default stream in this image.

Other Tips - cudaDeviceSynchronize() continues to synchronize everything on the device, even with the new per-thread default stream option. If you want to only synchronize a single stream, use cudaStreamSynchronize(cudaStream_t stream), as in example here.

Starting in CUDA 7 you can also explicitly access the per-thread default stream using the handle cudaStreamPerThread, and you can access the legacy default stream using the handle cudaStreamLegacy. Note that cudaStreamLegacy still synchronizes implicitly with the per-thread default streams if you happen to mix them in a program.

You can create non-blocking streams which do not synchronize with the legacy default stream by passing the cudaStreamNonBlocking flag to cudaStreamCreate().

Other Useful Cuda Functions

cudaError_t cudaMemset (void * devPtr, int value, size_t count )

It set the count number of bytes at address devPtr all with value value. For example:

#include <cstdio>

int main(void) {
    const int n = 32;
    const size_t sz = size_t(n) * sizeof(int);
    int *dJunk;
    cudaMalloc((void**)&dJunk, sz);
    cudaMemset(dJunk, 0, sz);
    cudaMemset(dJunk, 0x12, 32);

    int *Junk = new int[n];

    cudaMemcpy(Junk, dJunk, sz, cudaMemcpyDeviceToHost);

    for(int i=0; i<n; i++) {
        fprintf(stdout, "%d %x\n", i, Junk[i]);
    }

    cudaDeviceReset();
    return 0;
}

Above code will set all 128 bytes set to 0, then first 32 bytes set to 0x12 as described by talonmies

$ nvcc memset.cu
$ ./a.out

0 12121212
1 12121212
2 12121212
3 12121212
4 12121212
5 12121212
6 12121212
7 12121212
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 0
31 0

Error Checking

checkCudaErrors(...); // check for error return codes
getLastCudaError(...); // check for kernel failure messages

printf("message"); // within kernel, each thread generates its own output
// need to use either cudaDeviceSynchronize(); or cudaDeviceReset(); to make sure buffer is flushed

Acceleration Tips

#pragma unroll 3 // for loop unroll
__launch_bounds__(TRACE_RAY_CUDA_THREADS, MIN_BLOCKS_PER_SM) // for register restriction
float* __restrict__ out
const __restrict__ // for constant

General Advice

number of active threads depends on number of registers each needs
good to have at least 4 active blocks per SM, each with at least 128 threads
smaller number of blocks when each needs lots of shared memory
larger number of blocks when they don’t need any shared memory

On Ampere:

maybe 4 big blocks (512 threads) if each needs a lot of shared memory
maybe 12 small blocks (128 threads) if no shared memory needed
or 4 small blocks (128 threads) if each thread needs lots of registers

If different elements in an array have different load, you can filter out heavy and light task and execute separately to increase performance.

General advice:

number of active threads depends on number of registers each needs
good to have at least 2-4 active blocks, each with at least 128 threads
smaller number of blocks when each needs lots of shared memory
larger number of blocks when they don’t need shared memory
On Fermi card:
- maybe 2 big blocks (512 threads) if each needs a lot of shared memory
- maybe 6 smaller blocks (256 threads) if no shared memory needed
- or 4 small blocks (128 threads) if each thread needs lots of registers
Use shared memory
- Partition data into subsets that fit into shared memory
- Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
- Performing the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element
- Copying results from shared memory to global memory
Constant memory also resides in device memory (DRAM) - much slower access than shared memory
- but they are cached!
- Highly efficient access for read-only data
Carefully divide data according to access patterns
- R/Only -> constant memory (very fast if in cache)
- R/W shared within Block -> shared memory (very fast)
- R/W within each thread -> registers (very fast)
- R/W inputs/results -> global memory (very slow)

Other Instructions Never be Used

// About Sync
int __syncthreads_count(predicate)
// counts how many predicates are true
int __syncthreads_and(predicate)
// returns non-zero (true) if all predicates are true
int __syncthreads_or(predicate)
// returns non-zero (true) if any predicate is true

// About Warp Voting
int __all(predicate)
// returns non-zero (true) if all predicates in warp are true
int __any(predicate)
// returns non-zero (true) if any predicate is true
unsigned int __ballot(predicate)
// sets nth bit based on nth predicate

Table of Content