Some Good Stuff to Read:
- Course on CUDA Programming on NVIDIA GPUs, Nov 28 - Dec 9, 2022, at UT Austin - The website is offline, but you can find my notes below and the slides are in Here
- COM4521/COM6521 Parallel Computing with GPUs (Course) - very in-depth coverage of CUDA and OpenMP programming
- Computer Architecture (ETH Zurich, Fall 2017) - course website - youtube
- Youtube CUDA Crash Course by CoffeeBeforeArch
- Nvidia Official CUDA Programming Guide
- CS5984: Advanced Computer Graphics: Parallel Computing and Visualization on GPU (VirginiaTech) - mostly computer graphics and GPU - course website - course slides
- Nvidia CUDA Training Resources
- ACMS 40212/60212: Advanced Scientific Computing (Notre Dame) - mostly MPI, with CUDA only briefly covered in 2 lectures - course website
- ECE1742S: Programming Massively Parallel Multiprocessors Using CUDA (UToronto) - this one isn't so good - course website - course slides or local repo
GPU is about 9 times faster than CPU (based on the memory-bandwidth difference between a P100 and an E7-8894v4). But since CPU code is usually poorly optimized while GPU code is at least decent, people often experience a much larger GPU speedup in practice.
Original design goal of the CPU: make a single thread fast (reduce latency through caches and branch prediction)
Original design goal of the GPU: throughput matters (hide memory latency through parallelism, let the programmer deal with the raw storage hierarchy, avoid high clock frequencies)
Graphics Processing Clusters (GPC): a computer cluster in which each node is equipped with a Graphics Processing Unit (GPU)
Youtube: An Intro to GPU Architecture and Programming Models | Tim Warburton, Virginia Tech
SIMD/T (Single Instruction, Multiple Data/Thread): unlike a CPU, there is no scalar operation unit; only vectorized (matrix-like) operations are allowed. One instruction transforms multiple pieces of data: in each SM, a group of 32 ALUs (a warp's worth) executes the same instruction simultaneously, but on different data.
GPU cores:
Big Idea: a core is made of registers, ALUs, and other components; a GPU takes many CPU-like cores and binds them together to form one bigger core.
A GPU core has: (see the breakdown under "GPU Core Consists of" below)
"GPU core" is an overloaded term: here we refer to the big core.
GPU limits: parallel computation is fast, but it is limited by memory bandwidth (feeding data into the GPU is slow). To achieve high FLOPS, each loaded value usually needs to be reused by roughly 8 instructions.
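A back-of-the-envelope way to see this (generic numbers of my own, not from the course): if a GPU has peak compute F FLOP/s and memory bandwidth B bytes/s, a kernel only approaches peak when it performs at least F / B floating-point operations per byte moved from DRAM, i.e. about 4F/B operations per float. With F = 10^{13} FLOP/s and B = 10^{12} bytes/s that is roughly 40 operations per loaded float, which is why simple element-wise kernels end up bandwidth-bound.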
Terminologies:
Thread: one of many copies of the same function running on separate execution units for efficiency; each thread works on its own piece of the data.
Kernel: a sequence of instructions (code) that is executed in parallel (much like GLSL code)
Thread Block: multiple threads grouped together and co-scheduled (all executing at the same time on one SM), although an SM may execute multiple blocks at the same time too. Threads in a block can synchronize with `__syncthreads()` (deadlock warning: every thread of the block must reach the barrier). For more primitives, see Cooperative Groups.
Cluster: a new thing in compute capability 9.0. Packs a maximum of 8 blocks together (exceeding the limit requires hardware support) to enforce that those blocks are executed by a group of SMs within one GPC. Declared at compile time with `__cluster_dims__(X,Y,Z)` or configured at launch with `cudaLaunchKernelEx(...)`; synchronized with `cluster.sync()`. (See the launch sketch after this list.)
Grid: multiple thread blocks grouped together. One kernel launch always corresponds to one grid; the host can wait for a whole grid to finish with `cudaDeviceSynchronize()`.
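A minimal sketch of launching a kernel with a cluster configuration through the runtime attribute API (following the pattern in the CUDA programming guide; the grid/block/cluster sizes here are arbitrary placeholders):

__global__ void cluster_kernel(float *input, float *output) { /* ... */ }

void launch_with_cluster(float *input, float *output) {
  cudaLaunchConfig_t config = {0};
  config.gridDim = dim3(128, 1, 1);   // total number of blocks; must be divisible by the cluster size
  config.blockDim = dim3(256, 1, 1);

  cudaLaunchAttribute attr[1];
  attr[0].id = cudaLaunchAttributeClusterDimension;
  attr[0].val.clusterDim.x = 2;       // 2 blocks per cluster
  attr[0].val.clusterDim.y = 1;
  attr[0].val.clusterDim.z = 1;
  config.attrs = attr;
  config.numAttrs = 1;

  cudaLaunchKernelEx(&config, cluster_kernel, input, output);
}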
Memory: (ranked by speed)
Shared memory: declared with the `__shared__` keyword and shared within the block. (Shared memory is as fast as registers if there are no bank conflicts.)
`__constant__` memory: read-only data. Using constant rather than global memory can reduce the required memory bandwidth; however, this performance gain is only realized when a warp of threads reads the same location (1s ... 10s ... 100s of cycles, depending on cache locality). Written from the host with `cudaMemcpyToSymbol()`, `cudaMemcpyFromSymbol()`, or `cudaMemcpy()` in combination with `cudaGetSymbolAddress()`. Write `f` at the end of a literal if you want single precision (since single precision \times double precision = double precision).
Global memory: slow DRAM, about 100x slower than shared memory.
Memory Space:
host memory: Memory space on CPU
device memory: Memory space on GPU
managed memory: unified memory, a programming abstraction for pre-fetching and easier management. See the sections about unified memory below.
Example of using constant memory and global (device) memory symbols:
__constant__ float constData[256];                    // array in constant memory
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));    // host -> constant memory
cudaMemcpyFromSymbol(data, constData, sizeof(data));  // constant memory -> host

__device__ float devData;                             // a single float in global memory
float value = 3.14f;
cudaMemcpyToSymbol(devData, &value, sizeof(float));   // host -> global memory symbol

__device__ float* devPointer;                         // a pointer variable in global memory
float* ptr;
cudaMalloc(&ptr, 256 * sizeof(float));
cudaMemcpyToSymbol(devPointer, &ptr, sizeof(ptr));    // store the allocated address in devPointer
For detail about memory, read here
GPU Core Consists of:
L0 memory: a small per-core cache (the L0 instruction cache in recent architectures)
Scheduler: decides which warp issues its next instruction
Dispatcher: sends the instruction and its data to the ALUs to calculate
Registers: super quick temporary storage (the compiler might unroll loops and use more registers than expected)
ALUs: capable of computing 32 threads, i.e. 1 warp, at a time
GPU Address Space: a 20-bit address space: 5 bits represent the 32 threads of a warp, 3 bits represent 8 warps (if the programmer chooses so), and 12 bits remain.
How it works (assume 8 warps per block for now):
Suppose we have 1000 blocks, and each one has 128 threads - how does it get executed?
On Ampere hardware, you would probably get 8-12 blocks running at the same time on each SM; each block has 4 warps, so 32-48 warps are running on each SM.
Each clock tick, the SM warp scheduler decides which warps to execute next, choosing from those not waiting for
- data coming from device memory (memory latency)
- completion of earlier instructions (pipeline delay)
Since every instruction is broken into `fetch`, `decode`, `execute`, `memory access`, and `register write` stages, we can pipeline multiple instructions one after the other.
`fetch`: fetch the instruction from memory
`decode`: decode the instruction from byte code so we know what action to perform
`execute`: the ALU computes a value from existing register values
`mem`: access memory
`wr reg`: write the result back to a register
Cycles | Fetch | Decode | ALU | Mem | Wr Reg. |
---|---|---|---|---|---|
C1 | Add R3,R2,1 | Add R2,R1,1 | Ldr R1,0x.. | ... | ... |
C2 | Add R3,R2,1 | Add R2,R1,1 | /stall/ | Ldr R1,0x.. | ... |
C3 | Add R3,R2,1 | Add R2,R1,1 | /stall/ | /stall/ | Ldr R1,0x.. |
C4 | Jmp 0x.. | Add R3,R2,1 | Add R2,R1,1 | /stall/ | /stall/ |
C5 | WRONG INST | Jmp 0x.. | Add R3,R2,1 | Add R2,R1,1 | /stall/ |
C6 | WRONG INST | WRONG INST | Jmp 0x.. | Add R3,R2,1 | Add R2,R1,1 |
C7 | CORRECT | /flush/ | /flush/ | Jmp 0x.. | Add R3,R2,1 |
Above is an example of pipelined execution where instructions move from left to right over the cycles. Notice that in `C2` the `Add R2,R1,1` instruction depends on the value of register `R1` after it is loaded from memory, so the `Add` cannot be performed at the same time as `Ldr R1,0x..`; the CPU stalls the `Add R2,R1,1` instruction. Also notice that in `C5` and `C6` the instruction fetched from memory depends on the result of `Jmp 0x..`, so `C6` may contain wrong instructions. If so, once `Jmp 0x..` has executed, the wrong instructions are flushed in `C7`. Stalls and flushes are the two consequences of dependencies in a pipeline, and both slow it down.
NAME | DESCRIPTION | x | y | z |
---|---|---|---|---|
threadIdx | Thread index within the block (zero-based) | threadIdx.x | threadIdx.y | threadIdx.z |
blockIdx | Block index within the grid (zero-based) | blockIdx.x | blockIdx.y | blockIdx.z |
blockDim | Block dimensions in threads | blockDim.x | blockDim.y | blockDim.z |
gridDim | Grid dimensions in blocks | gridDim.x | gridDim.y | gridDim.z |
In GPU programming, we only specify the number of threads in a block (the block dimension), and the hardware automatically divides those threads into warps (which execute instructions in lockstep). One block is executed on one SM (shader core).
Note that there is a maximum number of threads per block and a maximum number of blocks per grid; the limits depend on the specific GPU architecture.
CUDA | OpenCL |
---|---|
Kernel | Kernel |
Host program | Host program |
Thread | Work item |
Thread block | Work group |
Grid | NDRange (index space) |
threadIdx.x | get_local_id(0) |
blockIdx.x * blockDim.x + threadIdx.x | get_global_id(0) |
__global__ function | __kernel function |
__device__ function | function |
__constant__ variable | __constant variable |
__device__ variable | __global variable |
__shared__ variable | __local variable |
CUDA extends C:
Type qualifiers: `__device__`, `__global__`, `__shared__`
- `__global__`: defines a kernel function, which must return `void`; executed on the device, only callable from the host (CPU).
- `__device__`: defines a function called by kernels; executed on the device, only callable from the device (GPU). You cannot take the address of a `__device__` function. No recursion, no static variable declarations inside the function, no variable number of arguments.
- `__host__`: defines a function running on the host; executed on the host, only callable from the host (CPU).
Keywords: `threadIdx`, `blockIdx`
Intrinsics: `__syncthreads()`
Runtime API: `cudaMalloc()`
Kernel launch: `convolve<<<100, 10>>>(myimage);`
Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization. Threads in different blocks cannot cooperate.
Variable declaration | Memory | Scope | Lifetime |
---|---|---|---|
__device__ __local__ int LocalVar; | local | thread | thread |
__device__ __shared__ int SharedVar; | shared | block | block |
__device__ int GlobalVar; | global | grid | kernel |
__device__ __constant__ int ConstantVar; | constant | grid | kernel |

`__device__` is optional when used with `__local__`, `__shared__`, or `__constant__`.
`cudaGetDeviceCount(&deviceCount)`: returns the number of active GPUs visible to the rank
`cudaSetDevice()`: select which device to use
`cudaGetDeviceProperties()`: get the `cudaDeviceProp` struct describing a device
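For example, a small stand-alone query program (the printed fields are just a sketch of what is available in `cudaDeviceProp`):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int deviceCount = 0;
  cudaGetDeviceCount(&deviceCount);            // how many GPUs this process can see
  for (int d = 0; d < deviceCount; ++d) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, d);         // fill the property struct for device d
    printf("Device %d: %s, %d SMs, max %d threads/block, %zu KB shared memory/block\n",
           d, prop.name, prop.multiProcessorCount,
           prop.maxThreadsPerBlock, prop.sharedMemPerBlock / 1024);
  }
  return 0;
}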
Vectors (addressed by `.x`, `.y`, `.z`, `.w`):
- `double2`, `double3`, `double4`
- `float2`, `float3`, `float4`
- `int`, `uint`
Floats:
- `half`, `half2`: range about 6 \times 10^{-5} \to 6 \times 10^4
- `bfloat16`, `bfloat162`: lower-precision version of `float`
Built-in Variables:
- `gridDim`: type `dim3` (like `uint3` but `.x`, `.y`, `.z` are initialized to 1 by default)
- `blockIdx`: type `uint3`
- `blockDim`: type `dim3`
- `threadIdx`: type `uint3`
- `warpSize`: always 32
#include "cuda.h"
int main(int argc, char **argv) {
int N = 3789; // size of array we want to compute
float *d_a; // an address in GPU we want to know
cudaMalloc((void**) &d_a, N*sizeof(float)); // tell GPU to give us allocated GPU address
int B = 512; // programmer partition data into blocks each has size of 512 threads
dim3 dimBlock(B, 1, 1); // programmer specify the dimension of a block, which in this case is (512 threads, 1 thread, 1 thread)
dim3 dimGrid((N + B - 1) / B, 1, 1); // programmer specify the dimension of Grid which contains `(N + B - 1) / B` many blocks
// Queue Kernel on DEVICE
simpleKernel<<<dimGrid, dimBlock>>>(N, d_a);
float *h_a = (float*) calloc(N, sizeof(float)); // allocate array in CPU host to receive calculated data from GPU device
cudaMemcpy(h_a, d_a, N*sizeof(float), cudaMemcpyDeviceToHost); // this will block until GPU finish kernel
for(int n = 0; n < N; ++n) printf("h_a[%d] = %f\n", n, h_a[n]); // print stuff out
cudaFree(d_a); free(h_a); // release device and host memory
return 0;
}
You can imagine there is an internal function that calls your kernel (conceptual pseudocode, not real code):
void kernelCaller(int N, float *d_a) {
for (int b = 0; b < gridDim; ++b) {
for (int t = 0; t < blockDim; ++t) {
simpleKernel(b, t, gridDim, blockDim); // conceptually: run the kernel body for block b, thread t
}
}
}
But instead of passing `b`, `t`, `gridDim`, and `blockDim`, you have `blockIdx.x`, `blockIdx.y`, `blockIdx.z`, `threadIdx.x`, `threadIdx.y`, `threadIdx.z`, `blockDim.x`, `blockDim.y`, `blockDim.z`, `gridDim.x`, `gridDim.y`, and `gridDim.z` available in your kernel. Those variables appear out of nowhere in the code.
The actual kernel looks like this:
__global__ void simpleKernel(int N, float *d_a) {
// Convert threadIdx and blockIdx to array index
const int n = threadIdx.x + blockDim.x * blockIdx.x;
if (n < N) { // make sure we are inside the array we want to calculate
d_a[n] = n;
}
// You don't return
}
Finally, you compile with `nvcc -o ./out -arch=sm_50 main.cu`. Make sure to set `-arch=sm_50` to match your GPU architecture. You can use `cuda-memcheck --leak-check full` to check for memory leaks.
`cuobjdump`, `cuobjdump -ptx`, and `cuobjdump -sass -ptx` are used to inspect the byte code of a compiled kernel.
`nvprof` and `nvvp` are for profiling.
Below is another sample code from Nvidia.
#include <iostream>
#include <math.h>
// CUDA Kernel function to add the elements of two arrays on the GPU. `__global__` tells the CUDA C++ compiler that this is a function that runs on the GPU and can be called from CPU code.
__global__
void add(int n, float *x, float *y) {
for (int i = 0; i < n; i++)
y[i] = x[i] + y[i];
}
int main(void) {
int N = 1<<20; // 1M elements
float *x, *y;
cudaMallocManaged(&x, N*sizeof(float)); // float *x = new float[N];
cudaMallocManaged(&y, N*sizeof(float)); // float *y = new float[N];
// initialize x and y arrays on the host
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
// Run kernel on 1M elements on the GPU
add<<<1, 1>>>(N, x, y);
// Wait for the GPU to finish before accessing the result on the host
cudaDeviceSynchronize();
// Check for errors (all values should be 3.0f)
float maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = fmax(maxError, fabs(y[i]-3.0f));
std::cout << "Max error: " << maxError << std::endl;
// Free memory (allocated with cudaMallocManaged, so use cudaFree)
cudaFree(x);
cudaFree(y);
return 0;
}
Compiling
Kernels, kernel invocations, and cudaMalloc calls are all best kept in a `.cu` file somewhere
MPI calls should be in `.c` files
`nvcc` processes `.cu` files to generate object files
`mpicc/mpicxx` processes `.c/.cpp` files to generate object files
If we need to call CUDA kernels from within an MPI task, we can wrap the appropriate nvcc-compiled functions with the `extern "C"` keyword, as sketched below.
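A minimal sketch of that wrapping pattern (the function names `scale`/`launch_scale` are made up for illustration):

// kernels.cu -- compiled with nvcc
__global__ void scale(float *x, float alpha, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= alpha;
}

// extern "C" disables C++ name mangling so the mpicc-compiled .c file can link against this symbol
extern "C" void launch_scale(float *d_x, float alpha, int n) {
  scale<<<(n + 255) / 256, 256>>>(d_x, alpha, n);
  cudaDeviceSynchronize();
}

/* main.c -- compiled with mpicc -- only needs the declaration:
   extern void launch_scale(float *d_x, float alpha, int n);  */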
CURAND:
NVIDIA's library for random number generation in CUDA
CURAND can be called from the host and the device
CURAND Host API provides functions callable on the host to generate random data in GPU global memory
Can create multiple pseudorandom generators using different algorithms
curandGenerator_t r;
// argument tells which algorithm to use
curandCreateGenerator(&r, CURAND_RNG_PSEUDO_DEFAULT);
curandSetStream(r, stream); // optional
curandSetPseudoRandomGeneratorSeed(r, seed);
curandGenerateUniform(r, data, numElems);
curandDestroyGenerator(r);
__device__ void curand_init (unsigned long long seed, unsigned long long
sequence, unsigned long long offset, curandState *state);
__device__ float curand_uniform (curandState *state);
__device__ float curand_normal (curandState *state);
Generate Random Number in Host
#include <curand.h>
int main() {
. . .
curandGenerator_t gen;
float *devNum, *hostNum;
hostNum = new float[n];
cudaMalloc((void **)&devNum, n*sizeof(float));
. . .
curandCreateGenerator(&gen,CURAND_RNG_PSEUDO_DEFAULT);
curandSetPseudoRandomGeneratorSeed(gen, 12345);
curandGenerateUniform(gen, devNum, n);
cudaMemcpy(hostNum, devNum, n*sizeof(float),cudaMemcpyDeviceToHost);
. . .
curandDestroyGenerator(gen);
cudaFree(devNum);
. . .
}
Generate Random Number in Kernel
#include <stdio.h>
#include <stdlib.h>
#include <curand_kernel.h> // CURAND lib header file
#define TRIALS_PER_THREAD 2048
#define BLOCKS 256
#define THREADS 256
int main(int argc, char *argv[]) {
float host[BLOCKS * THREADS];
float *dev;
curandState *devStates;
cudaMalloc((void **) &dev, BLOCKS * THREADS * sizeof(float));
cudaMalloc( (void **)&devStates, BLOCKS*THREADS*sizeof(curandState) );
…
}
__global__ void pi_mc(float *estimate, curandState *states) {
unsigned int tid = threadIdx.x + blockDim.x*blockIdx.x;
int points_in_circle = 0;
float x, y;
// Initialize CURAND
curand_init(tid, 0, 0, &states[tid]);
for(int i = 0; i < TRIALS_PER_THREAD; i++) {
x = curand_uniform(&states[tid]);
y = curand_uniform(&states[tid]);
// count if (x, y) falls inside the circle
points_in_circle += (x*x + y*y <= 1.0f);
}
estimate[tid] = 4.0f * points_in_circle / (float) TRIALS_PER_THREAD;
}
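The elided part of main() presumably launches pi_mc and averages the per-thread estimates; a hypothetical completion might look like the sketch below (this launch-and-average code is mine, not from the original notes):

pi_mc<<<BLOCKS, THREADS>>>(dev, devStates);
cudaMemcpy(host, dev, BLOCKS * THREADS * sizeof(float), cudaMemcpyDeviceToHost); // blocks until the kernel finishes

float pi = 0.0f;
for (int i = 0; i < BLOCKS * THREADS; i++) pi += host[i];
pi /= BLOCKS * THREADS;                 // average of the per-thread estimates
printf("pi is approximately %f\n", pi);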
In order to store data on the GPU that can be communicated back to the host, we need allocated memory that lives until it is freed. Think of global memory as heap space: it lives until the application closes or it is freed, and it is visible to any thread in any block that has a pointer to that memory region. Shared memory can be considered stack space: it lives until a block of a kernel finishes, and its visibility is limited to the threads within that block.
Shared memory is faster than global memory, but need to be initialized from global memory. This initialization must be coalesced (access global memory with locality in mind) to achieve high performance.
Main Take away: Global memory refers to allocated memory in the input of kernel. It can be used by all threads within all blocks. In contrast, shared memory is a stack space that is only shared among threads within the same block.
The size of global memory varies from card to card, anywhere from none to 32 GB (V100). The amount of shared memory depends on the compute capability: anything below compute capability 2.x has a maximum of 16 KB of shared memory per multiprocessor (where the number of multiprocessors varies from card to card), and cards with compute capability 2.x and greater have a minimum of 48 KB of shared memory per multiprocessor. (See the documentation for details.)
CUDA can also use CPU memory (mapped memory) as global memory instead of GPU memory. Although a bit slow, the amount of memory then depends on CPU memory availability.
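A minimal sketch of that mapped (zero-copy) setup, assuming the device supports host-mapped memory; `someKernel` is a placeholder for any kernel taking a float pointer:

float *h_buf, *d_buf;
cudaSetDeviceFlags(cudaDeviceMapHost);                          // must be set before the CUDA context is created
cudaHostAlloc((void**)&h_buf, N * sizeof(float), cudaHostAllocMapped); // pinned host memory, visible to the GPU
cudaHostGetDevicePointer((void**)&d_buf, h_buf, 0);             // device-side alias of h_buf
someKernel<<<blocks, threads>>>(d_buf, N);                      // accesses travel over PCIe/NVLink
cudaDeviceSynchronize();
cudaFreeHost(h_buf);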
#include <algorithm>
#include <cassert>
#include <iostream>
#include <vector>
// CUDA kernel for vector addition
// __global__ means this is called from the CPU, and runs on the GPU. The name global refers to global memory allocated on GPU.
// In the C programming language, restrict is a keyword used in pointer declarations. By adding this type qualifier, a programmer hints to the compiler that for the lifetime of the pointer, no other pointer will be used to access the object to which it points. This allows the compiler to make optimizations (for example, vectorization) that would not otherwise be possible. __restrict is the C++/CUDA spelling of restrict.
// a, b, c both are global GPU memory
__global__ void vectorAdd(const int *__restrict a, const int *__restrict b,
int *__restrict c, int N) {
int tid = (blockIdx.x * blockDim.x) + threadIdx.x; // global thread ID
if (tid < N) c[tid] = a[tid] + b[tid]; // Boundary check and then write back to global memory
}
int main() {
constexpr int N = 1 << 16; // Array size of 2^16 (65536 elements)
constexpr size_t bytes = sizeof(int) * N;
// Initialize random vectors for holding the host-side (CPU-side) data
std::vector<int> a;
a.reserve(N);
std::vector<int> b;
b.reserve(N);
std::vector<int> c(N); // sized (not just reserved) so the GPU result can be copied into c.data()
for (int i = 0; i < N; i++) {
a.push_back(rand() % 100);
b.push_back(rand() % 100);
}
// Allocate memory on the device (GPU)
int *d_a, *d_b, *d_c;
cudaMalloc(&d_a, bytes);
cudaMalloc(&d_b, bytes);
cudaMalloc(&d_c, bytes);
// Copy data from the host to the device (CPU -> GPU)
cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);
// Threads per "compute thread array" (generally called block, or CTA) (1024)
int NUM_THREADS = 1 << 10;
// blocks per Grid
int NUM_BLOCKS = (N + NUM_THREADS - 1) / NUM_THREADS;
// Launch the kernel on the GPU
vectorAdd<<<NUM_BLOCKS, NUM_THREADS>>>(d_a, d_b, d_c, N);
// Copy sum vector from device to host
// cudaMemcpy is a synchronous operation, and waits for the prior kernel
// launch to complete (both go to the default stream in this case).
// Therefore, this cudaMemcpy acts as both a memcpy and synchronization
// barrier.
cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); // Free memory on device
std::cout << "COMPLETED SUCCESSFULLY\n";
return 0;
}
Q: What are the advantages of using unified memory other than not needing to transfer data between host and device? After profiling it seems like using unified memory takes significant more time than the usual malloc and cudaMalloc.
A: Good question! Unified memory can take longer than normal allocation and copy, but with adequate pre-fetching, they can yield roughly similar results. It is mainly a programmability feature, and that goes beyond just avoiding duplicate data structures. Modern architectures (Volta and later) support over-subscription, where you can allocate more memory than you have on your GPU. Doing this without unified memory would require you to decompose your problem into multiple kernels so that you can work on a chunk of data, finish computation, copy new data in, start a new kernel, etc. With unified memory, none of that is required.
We also learned about hints, for more, read this paper
`cudaMemAdviseSetAccessedBy` (AB) indicates that the data will always be mapped into a specified processor's page tables if possible. When the data is migrated, the mapping is updated accordingly.
`cudaMemAdviseSetPreferredLocation` (PL) sets the preferred location (CPU or GPU) for a range of data. If another processor wants to access the memory region, the data will not migrate away from the preferred location.
`cudaMemAdviseSetReadMostly` (RM) implies that the data will be mostly read and only occasionally written. It lets the driver create read-only copies of the data and send those copies to other processors rather than migrating it.
#include <stdio.h>
#include <cassert>
#include <iostream>
using std::cout;
__global__ void vectorAdd(int *a, int *b, int *c, int N) {
int tid = (blockDim.x * blockIdx.x) + threadIdx.x;
if (tid < N) c[tid] = a[tid] + b[tid];
}
int main() {
const int N = 1 << 16;
size_t bytes = N * sizeof(int);
int *a, *b, *c;
// Allocation memory for these pointers
// CUDA transfer memory from CPU to GPU and from GPU to CPU automatically
// based on demand. You can think of this memory as a shared memory in GPU/CPU
cudaMallocManaged(&a, bytes);
cudaMallocManaged(&b, bytes);
cudaMallocManaged(&c, bytes);
int id;
cudaGetDevice(&id); // Get the device ID for prefetching calls (the return value is an error code, not the ID)
// cudaMemAdviseSetPreferredLocation (PL) sets the preferred location (CPU or GPU) for a range of data. If another processor wants to access the memory region, then the data will not migrate from the preferred location.
// In this example, we prefer data a, b to stay on CPU
cudaMemAdvise(a, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
cudaMemAdvise(b, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
// Set some hints about the data and do some prefetching for c
cudaMemPrefetchAsync(c, bytes, id);
// Initialize vectors
for (int i = 0; i < N; i++) {
a[i] = rand() % 100;
b[i] = rand() % 100;
}
// cudaMemAdviseSetReadMostly (RM) implies that the data will be mostly read and occasionally written to. It lets the driver create read-only copies of that data, and then send copies to other processors rather than migrate.
cudaMemAdvise(a, bytes, cudaMemAdviseSetReadMostly, id);
cudaMemAdvise(b, bytes, cudaMemAdviseSetReadMostly, id);
// Pre-fetch 'a' and 'b' arrays to the specified device (GPU)
cudaMemPrefetchAsync(a, bytes, id);
cudaMemPrefetchAsync(b, bytes, id);
int BLOCK_SIZE = 1 << 10;
int GRID_SIZE = (N + BLOCK_SIZE - 1) / BLOCK_SIZE;
vectorAdd<<<GRID_SIZE, BLOCK_SIZE>>>(a, b, c, N);
// Wait for all previous operations before using values
// We need this because we don't get the implicit synchronization of
// cudaMemcpy like in the original example
cudaDeviceSynchronize();
// Prefetch to the host (CPU)
cudaMemPrefetchAsync(a, bytes, cudaCpuDeviceId);
cudaMemPrefetchAsync(b, bytes, cudaCpuDeviceId);
cudaMemPrefetchAsync(c, bytes, cudaCpuDeviceId);
cudaFree(a); cudaFree(b); cudaFree(c);
cout << "COMPLETED SUCCESSFULLY!\n";
return 0;
}
To calculate C = A \cdot B for n \times n matrices, we assign each thread to one element of C. So we can view our threads as 2D threads, each identified by (threadIdx.x, threadIdx.y). However, internally the GPU threads are one-dimensional, say [(0, 0), (0, 1), ..., (0, n), ..., (1, 0), (1, 1), ..., (1, n), ..., (n, 0), ..., (n, n)]. For each thread, there are n additions, and for each addition we need to load 2 integers. Since at least 32 threads run in lockstep, we need to load 32 \times 2 integers at once from the 2 matrices. To preserve locality, we hope those 32 integers sit in the same memory region. In the first iteration, the threads collectively load the first column of A and the first row of B; the accesses to B therefore preserve locality (they are contiguous) while the accesses to A do not.
However, another problem is that we don't usually have enough cache to fit 32 \times 2 integers at once. Cache tiling is needed just like CMU 15-213 Cache Lab.
We will take a simpler approach in the code below: we divide C^{n \times n} into (n / \text{block\_size}, n / \text{block\_size}) regions. We still map each thread to one element of C^{n \times n}, but change the order of calculation by maintaining a partial sum, as the image below shows:
__global__ void matrixMul(const int *a, const int *b, int *c) {
// Compute each thread's global row and column index
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
// Statically allocated shared memory
__shared__ int s_a[SHMEM_SIZE];
__shared__ int s_b[SHMEM_SIZE];
// Accumulate in temporary variable
int tmp = 0;
// Sweep tile across matrix
for (int i = 0; i < N; i += blockDim.x) {
// Load in elements for this tile
s_a[threadIdx.y * blockDim.x + threadIdx.x] = a[row * N + i + threadIdx.x];
s_b[threadIdx.y * blockDim.x + threadIdx.x] =
b[i * N + threadIdx.y * N + col];
// Wait for both tiles to be loaded in before doing computation
__syncthreads();
// Do matrix multiplication on the small matrix
for (int j = 0; j < blockDim.x; j++) {
tmp +=
s_a[threadIdx.y * blockDim.x + j] * s_b[j * blockDim.x + threadIdx.x];
}
// Wait for all threads to finish using current tiles before loading in new
// ones
__syncthreads();
}
// Write back results
c[row * N + col] = tmp;
}
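A launch configuration consistent with the kernel above (a sketch: it assumes square matrices with N a multiple of the tile width, blockDim.x == blockDim.y == 32, and SHMEM_SIZE == 32 * 32):

const int TILE = 32;
dim3 threads(TILE, TILE);              // one thread per element of a 32x32 tile of C
dim3 blocks(N / TILE, N / TILE);       // assumes N % TILE == 0, like the kernel does
matrixMul<<<blocks, threads>>>(d_a, d_b, d_c);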
Reduction operators are usually commutative and associative, which enables re-arranging the order of evaluation.
Note: in MPI there are special routines to perform reductions over distributed arrays
Scan: scan is another important building block of algorithm that can be massively paralleled.
Strategy: (1) perform local scan within each block (2) add on sum of all preceding blocks.
Local Scan:
Scan is very tricky to implement quickly with maximum efficiency.
Advice: it is best to first see if you can get working code from someone else (e.g. investigate the Thrust C++ library). Don't re-invent the wheel unless you really think you can do it better.
For this reason, I will not cover the exact algorithm here. Interested readers can read the slides directly.
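For example, with Thrust an inclusive prefix sum is a one-liner (a minimal sketch; the vector size n and its contents here are arbitrary):

#include <thrust/device_vector.h>
#include <thrust/scan.h>

int main() {
  int n = 1 << 20;
  thrust::device_vector<int> d(n, 1);                      // n ones on the device
  thrust::inclusive_scan(d.begin(), d.end(), d.begin());   // d becomes 1, 2, 3, ..., n
  return 0;
}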
General Tips:
for within-block calculation, there are two choices
When combining the results from different blocks
We partition our data into only 1 block (`gridSize = 1`) consisting of 1024 threads, so each thread will process more than 1 element of the array.
static const int arraySize = 10000;
static const int blockSize = 1024;
// the function is run at thread level
__global__ void sumCommSingleBlock(const int *a, int *out) {
int idx = threadIdx.x; // each thread start at thread id. Thread 2 takes a[2].
int sum = 0;
for (int i = idx; i < arraySize; i += blockSize) // each thread sum and jump blocksize. Thread 2 takes a[2], a[1024+2], a[2048+2], ...
sum += a[i];
__shared__ int r[blockSize]; // each thread within a block pull their result together. This instruction is executed at block level
r[idx] = sum;
__syncthreads(); // wait until every thread in the block finishes
for (int size = blockSize/2; size>0; size/=2) { // reduction
if (idx<size)
r[idx] += r[idx+size];
__syncthreads();
}
if (idx == 0) // copy result to global memory
*out = r[0];
}
...
sumCommSingleBlock<<<1, blockSize>>>(dev_a, dev_out);
In the above example, we use shared memory (`__shared__ int r[blockSize]`), which is about 100x faster than the global memory holding `const int *a` and `int *out`.
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <iostream>
#include <numeric>
#include <vector>
using namespace std;

#define SHMEM_SIZE 256
__global__ void sumReduction(int *v, int *v_r) {
// Allocate shared memory
__shared__ int partial_sum[SHMEM_SIZE];
// Calculate thread ID
int tid = blockIdx.x * blockDim.x + threadIdx.x;
// Load elements into shared memory
partial_sum[threadIdx.x] = v[tid];
__syncthreads();
// Iterate of log base 2 the block dimension
for (int s = 1; s < blockDim.x; s *= 2) {
// Reduce the threads performing work by half previous the previous
// iteration each cycle
if (threadIdx.x % (2 * s) == 0) {
partial_sum[threadIdx.x] += partial_sum[threadIdx.x + s];
}
__syncthreads();
}
// Let the thread 0 for this block write it's result to main memory
// Result is indexed by this block
if (threadIdx.x == 0) {
v_r[blockIdx.x] = partial_sum[0];
}
}
int main() {
// Vector size
int N = 1 << 16;
size_t bytes = N * sizeof(int);
// Host data
vector<int> h_v(N);
vector<int> h_v_r(N);
generate(begin(h_v), end(h_v), [](){ return rand() % 10; }); // Initialize data
int *d_v, *d_v_r;
cudaMalloc(&d_v, bytes);
cudaMalloc(&d_v_r, bytes);
cudaMemcpy(d_v, h_v.data(), bytes, cudaMemcpyHostToDevice);
const int TB_SIZE = 256; // Block Size (how many threads in a block)
int GRID_SIZE = N / TB_SIZE; // Grid Size (No padding)
// first summon 256 blocks and calculate 256 partial sum
sumReduction<<<GRID_SIZE, TB_SIZE>>>(d_v, d_v_r);
// then use 1 block of 256 threads to sum those 256 partial sum
sumReduction<<<1, TB_SIZE>>> (d_v_r, d_v_r);
cudaMemcpy(h_v_r.data(), d_v_r, bytes, cudaMemcpyDeviceToHost);
assert(h_v_r[0] == std::accumulate(begin(h_v), end(h_v), 0));
cout << "COMPLETED SUCCESSFULLY\n";
return 0;
}
if (branch_condition) do_work(); else do_no_work()
In the example above, we set the block size to 256, so each block consists of 256 / 32 = 8 warps executing in lockstep. In every warp, at least half of the threads take the `if` branch and do no work while the rest do some work. Work-items in the same warp cannot run different instructions at the same time, so the inactive ones are masked when the warp is dispatched. If the conditional is such that some work-items in a warp must do one thing and the other work-items must do something else, then the two code paths are taken sequentially by the warp, with the appropriate work-items masked each time. The resulting runtime is therefore the same as if every thread in the warp did the same amount of work as the working threads. This is not ideal: we could instead make at least half of the warps in a block finish early because all of their threads are doing nothing. We achieve this by giving all the threads that do work adjacent `threadIdx` values, so that the working threads are grouped into the same warps.
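For instance (reusing the do_work() placeholder from the snippet above), both conditions below keep half the threads idle, but only the second keeps whole warps idle:

// Divergent: even and odd threads alternate inside every warp, so each warp
// executes both paths serially with half of its lanes masked each time.
if (threadIdx.x % 2 == 0) do_work();

// Non-divergent (with blockDim.x == 256): threads 0..127 (warps 0-3) all take the
// branch and threads 128..255 (warps 4-7) all skip it, so no warp runs both paths.
if (threadIdx.x < blockDim.x / 2) do_work();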
When a warp instruction is executed on compute capability 2.x, the entire warp is sent to 16 of the CUDA cores. CUDA cores are pipelined (as are all modern CPUs), so what happens is that the 32 threads are queued up in two consecutive pipeline stages in those 16 CUDA cores. The instruction takes something like 16 to 24 clock ticks to propagate through the pipeline, but because many instructions are moving through the pipeline at once, that group of 16 CUDA cores completes one warp every two clock ticks. (seibert)
Although the smallest execution unit is half-warp, the warp-divergence is only in the warp level (not half-warp level) because the scheduler works with entire warps. The programmer should not worry about implementation detail of half-warp. For detail, see Compute Capabilities
When the warp instruction is issued to a group of 16 cores, the entire warp executes the instruction, because the cores are clocked twice (Fermi's "hotclock") so that each core actually executes two thread's worth of computation in a single cycle (= 2 hotclocks). When a warp instruction is dispatched, the entire warp gets serviced. It does not need to be scheduled twice. Robert Crovella Certainly all threads in a warp are executing the same instruction at any given time. But warps execute independently from each other and so different warps within a block may be executing different instructions from the stream, at any given time.
The instruction-level parallism (ILP) is more complicated. For detail, please see Stackoverflow
The determination of whether to mask or to branch seems to be made at run time. One thing to keep in mind is that the warp size on all CUDA devices so far is 32, but NVIDIA has said that it may change in the future. If you need the size of a warp in your device code, use the integer variable `warpSize`, which is provided by the compiler. (seibert)
The code below is basically identical to the above except for the inner `for` loop. It executes about 2x faster than the previous implementation.
for (int s = 1; s < blockDim.x; s*= 2) {
int index = 2 * s * threadIdx.x;
if (index < blockDim.x) {
partial_sum[index] += partial_sum[index + s];
}
__syncthreads();
}
The shared memory in each block is broken into 32-bit words (an `int32`, a `float`, 4 `byte`s, 2 `short`s, or half of a `double`). Those words live in 32 different banks, as the diagram below shows.
However, the Kepler architecture introduced the option to increase the bank width to 8 bytes using `cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte)`. This can help avoid bank conflicts when accessing `double` precision data.
Bank Conflict: occurs when two or more threads of a half-warp request different words from the same bank in a single request; those memory accesses are then performed sequentially.
Bank conflicts are not that bad. However, since there are only 32 banks, if you map every `threadIdx` to a unique `double`, you are guaranteed to have a 2-way bank conflict.
If you have 16 `struct`s of two `int32`s and 16 threads (a half-warp) access the first `int32` simultaneously, there is a 16-way bank conflict. However, if you have 16 `struct`s of three `int32`s, then 16 threads accessing the first, second, or third `int32` simultaneously is not an issue: when accessing the first `int32`, the first thread reads from bank 0, the second thread from bank 3, the third thread from bank 6, ..., the 11th thread from bank 30, and the 12th thread from bank 1... This can work as long as there aren't more than 32 simultaneous memory accesses to words in the same bank.
// QUESTION: some articles say bank conflicts are at the half-warp level while others say the warp level.
Broadcast: if all threads of a half-warp request exactly the same word. The word will be read once from shared memory and broadcast to the threads.
Multicast (compute capability >= 2.0): if several threads request the same value from a particular bank while other threads each request a word from a different bank, it is still a multicast (same speed as a broadcast). As long as only one word is read from each bank in one cycle, it is considered a multicast.
Due to multicast, consecutive access to shared memory by a warp of threads does not result in a bank conflict for "small" element sizes.
`arr[threadIdx.x*0]`, `arr[12]`, and `arr[blockIdx.x*3]` are examples of broadcast.
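A standard trick (not from the original notes) when a warp reads a shared-memory tile column-wise is to pad the tile so that consecutive rows start in different banks:

// 32 threads reading tile[0][c], tile[1][c], ..., tile[31][c] would all hit bank c
// with a plain 32x32 layout (32-way conflict); the extra column shifts each row by one bank.
__shared__ float tile[32][32 + 1];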
/**
* Allocates at least width (in bytes) * height bytes of
* linear memory on the device and returns in *devPtr a pointer
* to the allocated memory. The function may pad the allocation
* to ensure that corresponding pointers in any given row will
* continue to meet the alignment requirements for coalescing as
* the address is updated from row to row. The pitch returned in
* *pitch by cudaMallocPitch() is the width in bytes of the allocation.
* @param devPtr: Pointer to allocated pitched device memory
* @param pitch: Pitch for allocation
* @param width: Requested pitched allocation width (in bytes)
* @param height: Requested pitched allocation height
* @returns cudaSuccess cudaErrorMemoryAllocation
**/
cudaError_t cudaMallocPitch(void **devPtr, size_t *pitch, size_t width,
size_t height);
Given the row and column of an array element of type T, the address is computed as
T* pElement = (T*)((char*)BaseAddress + Row * pitch) + Column;
Linear memory can also be allocated through `cudaMallocPitch()` and `cudaMalloc3D()`.
Recommended for allocations of 2D or 3D arrays as it makes sure that the allocation is appropriately padded to meet the alignment requirements imposed by the device.
It ensures best performance when accessing the row addresses or performing copies between 2D arrays and other regions of device memory (using the cudaMemcpy2D() and cudaMemcpy3D() functions)
The returned pitch (or stride) must be used to access array elements.
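A sketch of how the pitch is used in practice (width, height, and the kernel name are placeholders; this mirrors the addressing rule quoted above):

__global__ void myKernel(float *devPtr, size_t pitch, int width, int height) {
  for (int r = 0; r < height; ++r) {
    float *row = (float *)((char *)devPtr + r * pitch);   // pitch is in bytes
    for (int c = 0; c < width; ++c) {
      float element = row[c];                             // element (r, c)
    }
  }
}

// Host side
size_t pitch;
float *devPtr;
cudaMallocPitch((void **)&devPtr, &pitch, width * sizeof(float), height);
myKernel<<<100, 512>>>(devPtr, pitch, width, height);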
The code below is basically identical to the above except for the inner `for` loop. It executes about 1.3x faster than the previous implementation.
for (int s = blockDim.x / 2; s > 0; s >>=1) {
if (threadIdx.x < s) partial_sum[threadIdx.x] += partial_sum[threadIdx.x + s];
__syncthreads();
}
A final optimization is to do some of the computation while loading from global memory into shared memory, which gets rid of the idle threads in the first pass. This is about 1.4x (5 us) faster.
static const int arraySize = 1000000;
static const int blockSize = 1024;
__global__ void sumNoncommSingleBlock(const int *gArr, int *out) {
int thIdx = threadIdx.x;
__shared__ int shArr[blockSize*2];
__shared__ int offset;
shArr[thIdx] = thIdx<arraySize ? gArr[thIdx] : 0;
if (thIdx == 0)
offset = blockSize;
__syncthreads();
while (offset < arraySize) { //uniform
shArr[thIdx + blockSize] = thIdx+offset<arraySize ? gArr[thIdx+offset] : 0;
__syncthreads();
if (thIdx == 0)
offset += blockSize;
int sum = shArr[2*thIdx] + shArr[2*thIdx+1];
__syncthreads();
shArr[thIdx] = sum;
}
__syncthreads();
for (int stride = 1; stride<blockSize; stride*=2) { //uniform
int arrIdx = thIdx*stride*2;
if (arrIdx+stride<blockSize)
shArr[arrIdx] += shArr[arrIdx+stride];
__syncthreads();
}
if (thIdx == 0)
*out = shArr[0];
}
...
sumNoncommSingleBlock<<<1, blockSize>>>(dev_a, dev_out);
// addition / subtraction
atomicAdd()
atomicSub()
// minimum / maximum
atomicMin()
atomicMax()
// increment / decrement
atomicInc()
atomicDec()
// exchange / compare-and-swap
atomicExch()
atomicCAS()
// bitwise AND / OR / XOR
atomicAnd()
atomicOr()
atomicXor()
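As an example, atomicAdd() gives an alternative ending to the block reduction shown earlier: each block adds its partial sum straight into one global accumulator instead of requiring a second kernel launch. (A sketch: it assumes blockDim.x == 256, an input length that is a multiple of 256, and that *total was zeroed with cudaMemset before the launch; sumReductionAtomic is a made-up name.)

__global__ void sumReductionAtomic(const int *v, int *total) {
  __shared__ int partial[256];
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  partial[threadIdx.x] = v[tid];
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0) atomicAdd(total, partial[0]);   // one atomic per block
}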
For locks on global memory, you can use:
// compare and swap (useful for atomic locks)
int atomicCAS(int* address, int compare, int val);
// reads old = *address; if old == compare, val is stored at address, otherwise the value is left unchanged
// in either case, the routine returns the value of old
Here is an example of a global atomic lock using `__threadfence()`:
`__threadfence_block()`: wait until all global and shared memory writes are visible to all threads in the block
`__threadfence()`: wait until all global and shared memory writes are visible to all threads in the block and, for global data, to all threads on the device
__global__ void kernel(...) {
...
if (threadIdx.x==0) {
do {} while(atomicCAS(&lock,0,1));
... // some global memory write
__threadfence(); // wait for writes to finish: since writing to global memory is async, we need manual synching to ensure correctness
lock = 0; // free lock
}
}
Warp Shuffles: exchanging data between threads in the same warp
LaneID: `threadIdx.x % 32` for 1D blocks
// mask: controls which threads are involved — usually set to -1
// var: a local register variable (int, unsigned int, long long, unsigned long long, float or double) put in for one other thread
// delta: offset within the warp
// if the appropriate thread does not exist then the value is taken from the current thread
// return: value from one other thread
T __shfl_up_sync(unsigned mask, T var, unsigned int delta);
// copy from a lane with lower ID relative to caller
T __shfl_down_sync(unsigned mask, T var, unsigned int delta);
// copy from a lane with higher ID relative to caller
// an XOR operation is performed between laneMask and the calling thread’s laneID to determine the lane from which to copy the value
// very useful for reduction operations and FFTs
T __shfl_xor_sync(unsigned mask, T var, int laneMask);
// copy from a lane based on bitwise XOR of own lane ID
// copies data from srcLane
T __shfl_sync(unsigned mask, T var, int srcLane);
// copy from indexed lane ID
Be careful with conditional (`if`) code: threads may only read data from another thread that is actively participating in the shuffle command. If the target thread is inactive, the retrieved value is undefined.
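A classic use (a minimal sketch) is a warp-level reduction with `__shfl_down_sync`; after the loop, lane 0 holds the sum of all 32 lanes, assuming the full-warp mask participates:

__device__ int warpReduceSum(int val) {
  for (int offset = warpSize / 2; offset > 0; offset /= 2)
    val += __shfl_down_sync(0xffffffff, val, offset);   // add the value held `offset` lanes above
  return val;                                           // the total is valid in lane 0
}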
Here we introduce streams.
Originally there was no concept of a "stream", so we call that implementation the "default stream". Say I want the GPU to run two different kernels: the first computes one integer addition and the second one integer multiplication. There is very little memory and ALU usage in either kernel, so you might expect them to run concurrently. However, that is not possible with only one stream, the "default stream": kernel execution is serialized.
// default stream and stream 0 are different
// launch of kernel is async
kernel<<< blocks, threads, bytes >>>(); // default stream
kernel<<< blocks, threads, bytes, 0 >>>(); // stream 0
When the concept of streams was introduced (but before CUDA 7), every kernel launch without an explicitly assigned stream went to the "default stream". So if you launch kernels concurrently from multiple host threads, they all go to the "default stream" and are executed serially (in sequential order).
In later versions of CUDA, users can control which stream their kernels execute in, but this ran into a backward-compatibility issue: if you have work executing in the "default stream", no numbered stream can execute concurrently with it. This is because the "default stream" is application-wide, shared across host threads, and synchronizes with everything else on the device.
The code below demonstrates this issue:
const int N = 1 << 20;
__global__ void kernel(float *x, int n) {
int tid = threadIdx.x + blockIdx.x * blockDim.x;
for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
x[i] = sqrt(pow(3.14159,i));
}
}
int main() {
const int num_streams = 8;
cudaStream_t streams[num_streams];
float *data[num_streams];
for (int i = 0; i < num_streams; i++) {
cudaStreamCreate(&streams[i]);
cudaMalloc(&data[i], N * sizeof(float));
// launch one worker kernel per stream
kernel<<<1, 64, 0, streams[i]>>>(data[i], N);
// launch a dummy kernel on the default stream
kernel<<<1, 1>>>(0, 0);
}
cudaDeviceReset();
return 0;
}
If we compile the above code with `nvcc ./stream_test.cu -o stream_legacy`, you will see that the dummy kernel launched on the default stream, sandwiched between two kernel launches on numbered streams, causes the numbered streams to serialize.
CUDA 7 introduces a new option, the "per-thread default stream", that has two effects:
it gives each host-thread its own default stream. This means that commands issued to the default stream by different host threads can run concurrently.
these default streams are regular streams. This means that commands in the default stream may run concurrently with commands in non-default streams.
If we compile the above code with `nvcc --default-stream per-thread ./stream_test.cu -o stream_per-thread`, you will see that the default stream now behaves like a regular stream.
To enable per-thread default streams in CUDA 7 and later, you can either compile with the `nvcc` command-line option `--default-stream per-thread`, or `#define` the `CUDA_API_PER_THREAD_DEFAULT_STREAM` preprocessor macro before including CUDA headers (`cuda.h` or `cuda_runtime.h`). It is important to note: you cannot use `#define CUDA_API_PER_THREAD_DEFAULT_STREAM` to enable this behavior in a `.cu` file when the code is compiled by `nvcc` because `nvcc` implicitly includes `cuda_runtime.h` at the top of the translation unit. The `--default-stream` option is applied per compilation unit, so make sure to apply it to all `nvcc` command lines that need it.
Other Tips:
- `cudaDeviceSynchronize()` continues to synchronize everything on the device, even with the new per-thread default stream option. If you want to synchronize only a single stream, use `cudaStreamSynchronize(cudaStream_t stream)`, as in the example here.
- Starting in CUDA 7 you can also explicitly access the per-thread default stream using the handle `cudaStreamPerThread`, and you can access the legacy default stream using the handle `cudaStreamLegacy`. Note that `cudaStreamLegacy` still synchronizes implicitly with the per-thread default streams if you happen to mix them in a program.
- You can create non-blocking streams which do not synchronize with the legacy default stream by passing the `cudaStreamNonBlocking` flag to `cudaStreamCreate()`.
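Putting the last two tips together, a small sketch (reusing the kernel signature from the stream example above; `data` and `N` stand for a device pointer and element count):

cudaStream_t s;
cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);   // does not synchronize with the legacy default stream
kernel<<<1, 64, 0, s>>>(data, N);
cudaStreamSynchronize(s);                               // wait only for the work queued on s
cudaStreamDestroy(s);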
`cudaError_t cudaMemset(void *devPtr, int value, size_t count)`
It sets `count` bytes starting at address `devPtr` to the byte value `value`. For example:
#include <cstdio>
int main(void) {
const int n = 32;
const size_t sz = size_t(n) * sizeof(int);
int *dJunk;
cudaMalloc((void**)&dJunk, sz);
cudaMemset(dJunk, 0, sz);
cudaMemset(dJunk, 0x12, 32);
int *Junk = new int[n];
cudaMemcpy(Junk, dJunk, sz, cudaMemcpyDeviceToHost);
for(int i=0; i<n; i++) {
fprintf(stdout, "%d %x\n", i, Junk[i]);
}
cudaDeviceReset();
return 0;
}
The code above sets all 128 bytes to `0`, then sets the first 32 bytes to `0x12`, as described by talonmies:
$ nvcc memset.cu
$ ./a.out
0 12121212
1 12121212
2 12121212
3 12121212
4 12121212
5 12121212
6 12121212
7 12121212
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 0
31 0
checkCudaErrors(...); // check for error return codes
getLastCudaError(...); // check for kernel failure messages
printf("message"); // within kernel, each thread generates its own output
// need to use either cudaDeviceSynchronize(); or cudaDeviceReset(); to make sure buffer is flushed
#pragma unroll 3 // for loop unroll
__launch_bounds__(TRACE_RAY_CUDA_THREADS, MIN_BLOCKS_PER_SM) // for register restriction
float* __restrict__ out
const __restrict__ // for constant
General Advice
number of active threads depends on number of registers each needs
good to have at least 4 active blocks per SM, each with at least 128 threads
smaller number of blocks when each needs lots of shared memory
larger number of blocks when they don’t need any shared memory
On Ampere:
maybe 4 big blocks (512 threads) if each needs a lot of shared memory
maybe 12 small blocks (128 threads) if no shared memory needed
or 4 small blocks (128 threads) if each thread needs lots of registers
If different elements in an array carry different amounts of work, you can separate the heavy and the light tasks and execute them separately to increase performance.
General advice:
number of active threads depends on number of registers each needs
good to have at least 2-4 active blocks, each with at least 128 threads
smaller number of blocks when each needs lots of shared memory
larger number of blocks when they don’t need shared memory
On Fermi card:
Use shared memory
Constant memory also resides in device memory (DRAM) - much slower access than shared memory
Carefully divide data according to access patterns
// About Sync
int __syncthreads_count(predicate)
// counts how many predicates are true
int __syncthreads_and(predicate)
// returns non-zero (true) if all predicates are true
int __syncthreads_or(predicate)
// returns non-zero (true) if any predicate is true
// About Warp Voting
int __all(predicate)
// returns non-zero (true) if all predicates in warp are true
int __any(predicate)
// returns non-zero (true) if any predicate is true
unsigned int __ballot(predicate)
// sets nth bit based on nth predicate
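For example, combining the ballot vote (in its newer __ballot_sync form, which takes an explicit participation mask) with a population count tells each warp how many of its lanes satisfy a predicate; `predicate` below is any per-thread boolean:

unsigned mask = __ballot_sync(0xffffffff, predicate);   // bit n is set if lane n's predicate is true
int lanes_true = __popc(mask);                          // number of lanes with a true predicate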