
CUDA Week 2: Mastering Thread Organization and Grid-Stride Loops
Gautam Sharma (@iamgs_)
Last week, I took my first steps into CUDA programming with basic array operations. While the results were... let's say "educational" (GPU was slower than CPU!), this week I dug deeper into the fundamental concepts that make CUDA truly powerful. The key breakthrough? Understanding how threads, blocks, and grids actually work together.
The Problem with Week 1
Looking back at my week 1 implementation, the issue was glaringly obvious once I understood CUDA's execution model better. I was using:
add<<<N,1>>>(dev_a, dev_b, dev_c, N); // One thread per block!
This launched 100,000 blocks with just 1 thread each. That's like hiring 100,000 workers but making each one work completely alone. The overhead of managing all those blocks completely overwhelmed any parallel processing benefits.
Week 2: Grid-Stride Loops and Proper Thread Organization
This week's implementation introduces two crucial concepts that transformed my understanding of CUDA programming.
Better Thread Indexing
__global__ void add(int *a, int *b, int *c, size_t N) {
    // size_t so the index can exceed INT_MAX when N is in the billions
    size_t tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x; // Grid-stride loop
    }
}
The magic happens in that thread ID calculation: threadIdx.x + blockIdx.x * blockDim.x. This gives each thread a unique global index across the entire grid. No more one-thread-per-block nonsense!
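To make the mapping concrete, here is a small worked example of my own (not from the repo), assuming the 128-thread blocks used later in this post:
// Worked example (illustration only), with blockDim.x = 128:
//   block 0 covers tids   0–127   (threadIdx.x + 0 * 128)
//   block 1 covers tids 128–255   (threadIdx.x + 1 * 128)
//   block 3, thread 5  ->  tid = 5 + 3 * 128 = 389
// Every thread in the grid therefore starts at a distinct element.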
The Grid-Stride Loop Pattern
The while loop with tid += blockDim.x * gridDim.x is called a grid-stride loop, and it's a game-changer. Here's why:
- Scalability: The same kernel works regardless of array size
- Efficiency: Each thread processes multiple elements, reducing launch overhead
- Hardware optimization: Better utilization of streaming multiprocessors (SMs)
Instead of trying to launch exactly N threads, we launch a reasonable number of blocks and threads, then let each thread handle multiple array elements.
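For instance, one common way to pick that "reasonable number" (my own sketch, not necessarily how the repo does it) is to derive the block count from N but cap it, and let the grid-stride loop mop up whatever the cap leaves over:
constexpr int threadsPerBlock = 128; // multiple of the 32-thread warp size

// Blocks needed to give every element its own thread, rounded up...
size_t blocksNeeded = (N + threadsPerBlock - 1) / threadsPerBlock;

// ...but capped so the grid stays small; the grid-stride loop covers the rest.
int blocks = blocksNeeded < 128 ? static_cast<int>(blocksNeeded) : 128;

add<<<blocks, threadsPerBlock>>>(dev_a, dev_b, dev_c, N);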
Optimal Launch Configuration
add<<<128,128>>>(dev_a, dev_b, dev_c, N); // 128 blocks × 128 threads = 16,384 total threads
This configuration follows CUDA best practices:
- 128 threads per block: Multiple of 32 (warp size) for efficient execution
- 128 blocks: Enough to keep all streaming multiprocessors busy (see the sketch after this list for deriving the count from the hardware)
- Total threads: 16,384, which cover all 10 billion elements thanks to the grid-stride loop
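If you'd rather not hard-code 128 blocks, the runtime can tell you how many SMs the device actually has. This is my own sketch, and the multiplier of 4 is an assumption to tune, not a rule from the post:
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);   // query device 0

// Rule of thumb: a few resident blocks per SM keeps every multiprocessor fed.
int blocks = 4 * prop.multiProcessorCount;

add<<<blocks, 128>>>(dev_a, dev_b, dev_c, N);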
Scaling Up: The Real Test
To really test the improvements, I scaled up the problem dramatically:
- Week 1: 100,000 elements
- Week 2: 10,000,000,000 elements (100,000× larger!)
constexpr size_t N = 10000000000; // 10 billion elements
This required switching from stack-allocated arrays to heap allocation with new[]/delete[], since 10 billion integers would overflow the stack.
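For reference, the host-side pattern looks roughly like this (a sketch using the dev_a/dev_b/dev_c names from the launches above; error checking and the memcpy details are omitted):
// Host arrays: far too big for the stack, so they live on the heap.
int *a = new int[N];
int *b = new int[N];
int *c = new int[N];

// Device arrays.
int *dev_a = nullptr, *dev_b = nullptr, *dev_c = nullptr;
cudaMalloc((void**)&dev_a, N * sizeof(int));
cudaMalloc((void**)&dev_b, N * sizeof(int));
cudaMalloc((void**)&dev_c, N * sizeof(int));

// ... fill a and b, cudaMemcpy to the device, launch add, copy c back ...

cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
delete[] a; delete[] b; delete[] c;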
Performance Results: Finally, GPU Wins!
The results speak for themselves:
Time taken for GPU Addition : 334.532 milliseconds
Time taken for CPU Addition : 17871.6 milliseconds
GPU is now 53× faster than CPU!
This massive speedup comes from:
- Proper parallelization: 16,384 threads working simultaneously
- Efficient memory bandwidth utilization: GPU's high-bandwidth memory shines with large datasets
- Reduced overhead: Grid-stride loops minimize kernel launch costs
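I'm not reproducing the repo's exact timing harness here, but CUDA events are the standard way to time the GPU side of a measurement like the one above:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
add<<<128, 128>>>(dev_a, dev_b, dev_c, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);              // block until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);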
Key Lessons Learned
Understanding CUDA's Hierarchy
- Threads: Basic execution units (grouped in warps of 32)
- Blocks: Groups of threads that can share memory and synchronize
- Grid: Collection of blocks launched by a single kernel (the toy kernel below makes these coordinates visible)
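A throwaway kernel of my own makes the hierarchy tangible, since device-side printf can report each thread's coordinates:
#include <cstdio>

__global__ void whereami() {
    // Each thread reports where it sits in the grid -> block -> thread hierarchy.
    printf("block %u of %u, thread %u of %u\n",
           blockIdx.x, gridDim.x, threadIdx.x, blockDim.x);
}

// whereami<<<2, 4>>>(); followed by cudaDeviceSynchronize()
// prints 8 lines: 2 blocks x 4 threads per block.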
Thread Organization Best Practices
- Always use multiples of 32 for thread counts (warp size)
- Typical block sizes: 64, 128, 256, 512, or 1024 threads
- Launch enough blocks to saturate the GPU's streaming multiprocessors
Grid-Stride Loops Are Essential
This pattern allows:
- Scalable kernels: Same code works for any array size
- Efficient resource usage: Optimal thread-to-work ratio
- Better performance: Reduced kernel launch overhead
The Code Evolution
Comparing the kernel implementations:
Week 1 (problematic):
__global__ void add(int *a, int *b, int *c, int N) {
    int tid = blockIdx.x; // Only using the block index
    if (tid < N) {
        c[tid] = a[tid] + b[tid];
    }
}
// Called with: add<<<N,1>>> // Bad: Too many blocks!
Week 2 (improved):
__global__ void add(int *a, int *b, int *c, size_t N) {
    size_t tid = threadIdx.x + blockIdx.x * blockDim.x; // Proper global index (size_t avoids overflow)
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x; // Grid-stride loop
    }
}
// Called with: add<<<128,128>>> // Good: Balanced configuration!
Looking Ahead
Week 2 taught me that CUDA programming isn't just about "making things parallel" – it's about understanding the hardware architecture and designing algorithms that work with the GPU's strengths rather than against them.
Next week, I'm planning to explore:
- Shared memory: Ultra-fast on-chip memory for thread communication
- Memory coalescing: Optimizing global memory access patterns
- Occupancy optimization: Maximizing hardware utilization
The journey from "GPU slower than CPU" to "GPU 53× faster" in just one week shows the importance of understanding the fundamentals. CUDA's power lies not just in parallelism, but in intelligent parallelism.
Want to try this yourself? The complete code is available in my cuda-ml-journey repository. Compile with: nvcc -std=c++20 main.cu add.cu multiply.cu -O0 -o add-multiply