
CUDA Week 2: Mastering Thread Organization and Grid-Stride Loops
Gautam Sharma (@iamgs_)
Last week, I took my first steps into CUDA programming with basic array operations. While the results were... let's say "educational" (GPU was slower than CPU!), this week I dug deeper into the fundamental concepts that make CUDA truly powerful. The key breakthrough? Understanding how threads, blocks, and grids actually work together.
The Problem with Week 1
Looking back at my week 1 implementation, the issue was glaringly obvious once I understood CUDA's execution model better. I was using:
add<<<N,1>>>(dev_a, dev_b, dev_c, N); // One thread per block!
This launched 100,000 blocks with just 1 thread each. That's like hiring 100,000 workers but making each one work completely alone. The overhead of managing all those blocks completely overwhelmed any parallel processing benefits.
Week 2: Grid-Stride Loops and Proper Thread Organization
This week's implementation introduces two crucial concepts that transformed my understanding of CUDA programming.
Better Thread Indexing
__global__ void add(int *a, int *b, int *c, size_t N) {
    // size_t so the index can exceed INT_MAX when N is in the billions
    size_t tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x; // Grid-stride loop
    }
}
The magic happens in that thread ID calculation: threadIdx.x + blockIdx.x * blockDim.x. This gives each thread a unique global index across the entire grid. No more one-thread-per-block nonsense!
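To make the mapping concrete, here is a small worked example of my own (not from the repo), assuming the 128-thread blocks used later in this post:
// Worked example (illustration only), with blockDim.x = 128:
//   block 0 covers tids   0–127   (threadIdx.x + 0 * 128)
//   block 1 covers tids 128–255   (threadIdx.x + 1 * 128)
//   block 3, thread 5  ->  tid = 5 + 3 * 128 = 389
// Every thread in the grid therefore starts at a distinct element.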
The Grid-Stride Loop Pattern
The while loop with tid += blockDim.x * gridDim.x is called a grid-stride loop, and it's a game-changer. Here's why:
- Scalability: The same kernel works regardless of array size
- Efficiency: Each thread processes multiple elements, reducing launch overhead
- Hardware optimization: Better utilization of streaming multiprocessors (SMs)
Instead of trying to launch exactly N threads, we launch a reasonable number of blocks and threads, then let each thread handle multiple array elements.
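For instance, one common way to pick that "reasonable number" (my own sketch, not necessarily how the repo does it) is to derive the block count from N but cap it, and let the grid-stride loop mop up whatever the cap leaves over:
constexpr int threadsPerBlock = 128; // multiple of the 32-thread warp size

// Blocks needed to give every element its own thread, rounded up...
size_t blocksNeeded = (N + threadsPerBlock - 1) / threadsPerBlock;

// ...but capped so the grid stays small; the grid-stride loop covers the rest.
int blocks = blocksNeeded < 128 ? static_cast<int>(blocksNeeded) : 128;

add<<<blocks, threadsPerBlock>>>(dev_a, dev_b, dev_c, N);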
Optimal Launch Configuration
add<<<128,128>>>(dev_a, dev_b, dev_c, N); // 128 blocks × 128 threads = 16,384 total threads
This configuration follows CUDA best practices:
- 128 threads per block: Multiple of 32 (warp size) for efficient execution
- 128 blocks: Enough to keep all streaming multiprocessors busy (see the sketch after this list for deriving the count from the hardware)
- Total threads: 16,384, which cover all 10 billion elements thanks to the grid-stride loop
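If you'd rather not hard-code 128 blocks, the runtime can tell you how many SMs the device actually has. This is my own sketch, and the multiplier of 4 is an assumption to tune, not a rule from the post:
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);   // query device 0

// Rule of thumb: a few resident blocks per SM keeps every multiprocessor fed.
int blocks = 4 * prop.multiProcessorCount;

add<<<blocks, 128>>>(dev_a, dev_b, dev_c, N);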
Scaling Up: The Real Test
To really test the improvements, I scaled up the problem dramatically:
- Week 1: 100,000 elements
- Week 2: 10,000,000,000 elements (100,000× larger!)
constexpr size_t N = 10000000000; // 10 billion elements
This required switching from stack-allocated arrays to heap allocation with new[]/delete[], since 10 billion integers would overflow the stack.
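For reference, the host-side pattern looks roughly like this (a sketch using the dev_a/dev_b/dev_c names from the launches above; error checking and the memcpy details are omitted):
// Host arrays: far too big for the stack, so they live on the heap.
int *a = new int[N];
int *b = new int[N];
int *c = new int[N];

// Device arrays.
int *dev_a = nullptr, *dev_b = nullptr, *dev_c = nullptr;
cudaMalloc((void**)&dev_a, N * sizeof(int));
cudaMalloc((void**)&dev_b, N * sizeof(int));
cudaMalloc((void**)&dev_c, N * sizeof(int));

// ... fill a and b, cudaMemcpy to the device, launch add, copy c back ...

cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
delete[] a; delete[] b; delete[] c;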
Performance Results: Finally, GPU Wins!
The results speak for themselves:
Time taken for GPU Addition : 334.532 milliseconds
Time taken for CPU Addition : 17871.6 milliseconds
GPU is now 53× faster than CPU!
This massive speedup comes from:
- Proper parallelization: 16,384 threads working simultaneously
- Efficient memory bandwidth utilization: GPU's high-bandwidth memory shines with large datasets
- Reduced overhead: Grid-stride loops minimize kernel launch costs
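I'm not reproducing the repo's exact timing harness here, but CUDA events are the standard way to time the GPU side of a measurement like the one above:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
add<<<128, 128>>>(dev_a, dev_b, dev_c, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);              // block until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);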
Key Lessons Learned
Understanding CUDA's Hierarchy
- Threads: Basic execution units (grouped in warps of 32)
- Blocks: Groups of threads that can share memory and synchronize
- Grid: Collection of blocks launched by a single kernel (the toy kernel below makes these coordinates visible)
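A throwaway kernel of my own makes the hierarchy tangible, since device-side printf can report each thread's coordinates:
#include <cstdio>

__global__ void whereami() {
    // Each thread reports where it sits in the grid -> block -> thread hierarchy.
    printf("block %u of %u, thread %u of %u\n",
           blockIdx.x, gridDim.x, threadIdx.x, blockDim.x);
}

// whereami<<<2, 4>>>(); followed by cudaDeviceSynchronize()
// prints 8 lines: 2 blocks x 4 threads per block.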
Thread Organization Best Practices
- Always use multiples of 32 for thread counts (warp size)
- Typical block sizes: 64, 128, 256, 512, or 1024 threads
- Launch enough blocks to saturate the GPU's streaming multiprocessors
Grid-Stride Loops Are Essential
This pattern allows:
- Scalable kernels: Same code works for any array size
- Efficient resource usage: Optimal thread-to-work ratio
- Better performance: Reduced kernel launch overhead
The Code Evolution
Comparing the kernel implementations:
Week 1 (problematic):
__global__ void add(int *a, int *b, int *c, int N) {
    int tid = blockIdx.x; // Only using the block index
    if (tid < N) {
        c[tid] = a[tid] + b[tid];
    }
}
// Called with: add<<<N,1>>> // Bad: Too many blocks!
Week 2 (improved):
__global__ void add(int *a, int *b, int *c, size_t N) {
    size_t tid = threadIdx.x + blockIdx.x * blockDim.x; // Proper global index (size_t avoids overflow)
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x; // Grid-stride loop
    }
}
// Called with: add<<<128,128>>> // Good: Balanced configuration!
Looking Ahead
Week 2 taught me that CUDA programming isn't just about "making things parallel" – it's about understanding the hardware architecture and designing algorithms that work with the GPU's strengths rather than against them.
Next week, I'm planning to explore:
- Shared memory: Ultra-fast on-chip memory for thread communication
- Memory coalescing: Optimizing global memory access patterns
- Occupancy optimization: Maximizing hardware utilization
The journey from "GPU slower than CPU" to "GPU 53× faster" in just one week shows the importance of understanding the fundamentals. CUDA's power lies not just in parallelism, but in intelligent parallelism.
Want to try this yourself? The complete code is available in my cuda-ml-journey repository. Compile with: nvcc -std=c++20 main.cu add.cu multiply.cu -O0 -o add-multiply