CUDA Week 2: Mastering Thread Organization and Grid-Stride Loops

Last week, I took my first steps into CUDA programming with basic array operations. While the results were... let's say "educational" (GPU was slower than CPU!), this week I dug deeper into the fundamental concepts that make CUDA truly powerful. The key breakthrough? Understanding how threads, blocks, and grids actually work together.

The Problem with Week 1

Looking back at my week 1 implementation, the issue was glaringly obvious once I understood CUDA's execution model better. I was using:

add<<<N,1>>>(dev_a, dev_b, dev_c, N);  // One thread per block!

This launched 100,000 blocks with just 1 thread each. That's like hiring 100,000 workers but making each one work completely alone. Because the GPU schedules threads in warps of 32, a one-thread block leaves 31 of every 32 lanes idle, and the overhead of managing all those blocks overwhelmed any parallel processing benefit.

Week 2: Grid-Stride Loops and Proper Thread Organization

This week's implementation introduces two crucial concepts that transformed my understanding of CUDA programming.

Better Thread Indexing

__global__ void add(int *a, int *b, int *c, size_t N){
    // size_t index: with N in the billions, a 32-bit int would overflow partway through
    size_t tid = threadIdx.x + (size_t)blockIdx.x * blockDim.x;
    while(tid < N){
        c[tid] = a[tid] + b[tid];
        tid += (size_t)blockDim.x * gridDim.x;  // Grid-stride loop
    }
}

The magic happens in that thread ID calculation: threadIdx.x + blockIdx.x * blockDim.x. This gives each thread a unique global index across the entire grid. For example, with 128-thread blocks, thread 7 of block 3 gets index 7 + 3 × 128 = 391. No more one-thread-per-block nonsense!

The Grid-Stride Loop Pattern

The while loop with tid += blockDim.x * gridDim.x is called a grid-stride loop, and it's a game-changer. Here's why:

  1. Scalability: The same kernel works regardless of array size
  2. Efficiency: Each thread processes multiple elements, reducing launch overhead
  3. Hardware optimization: Better utilization of streaming multiprocessors (SMs)

Instead of trying to launch exactly N threads, we launch a reasonable number of blocks and threads, then let each thread handle multiple array elements. With the 128 × 128 launch below, the stride is 16,384, so thread 0 handles elements 0, 16,384, 32,768, and so on until it runs past the end of the array.

Optimal Launch Configuration

add<<<128,128>>>(dev_a, dev_b, dev_c, N);  // 128 blocks × 128 threads = 16,384 total threads

This configuration follows CUDA best practices:

  • 128 threads per block: Multiple of 32 (warp size) for efficient execution
  • 128 blocks: Enough to keep all streaming multiprocessors busy
  • Total threads: 16,384, so each thread strides across roughly 610,000 of the 10 billion elements

Scaling Up: The Real Test

To really test the improvements, I scaled up the problem dramatically:

  • Week 1: 100,000 elements
  • Week 2: 10,000,000,000 elements (100,000× larger!)

constexpr size_t N = 10000000000; // 10 billion elements

This required switching from stack-allocated arrays to heap allocation with new/delete, since 10 billion ints come to roughly 40 GB per array, far beyond any stack limit.
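
If you want to follow along at home, the host side looks roughly like this. It's a simplified sketch rather than my exact program: it reuses the add kernel above, but the host array names, the much smaller N, and the missing error checks are all illustrative choices.

#include <cuda_runtime.h>
#include <cstdio>

// Assumes the grid-stride add kernel shown above is defined in this file.
int main(){
    constexpr size_t N = 1 << 24;                 // ~16.7 million elements for illustration, not 10 billion
    const size_t bytes = N * sizeof(int);

    // Host arrays on the heap (h_a/h_b/h_c are illustrative names)
    int *h_a = new int[N];
    int *h_b = new int[N];
    int *h_c = new int[N];
    for(size_t i = 0; i < N; i++){ h_a[i] = (int)i; h_b[i] = (int)(i * 2); }

    // Device arrays
    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void**)&dev_a, bytes);
    cudaMalloc((void**)&dev_b, bytes);
    cudaMalloc((void**)&dev_c, bytes);

    cudaMemcpy(dev_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, h_b, bytes, cudaMemcpyHostToDevice);

    add<<<128,128>>>(dev_a, dev_b, dev_c, N);     // 16,384 threads grid-striding over all N elements

    cudaMemcpy(h_c, dev_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[42] = %d\n", h_c[42]);              // sanity check: 42 + 84 = 126

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}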

Performance Results: Finally, GPU Wins!

The results speak for themselves:

Time taken for GPU Addition : 334.532 milliseconds
Time taken for CPU Addition : 17871.6 milliseconds

GPU is now 53× faster than CPU!

This massive speedup comes from:

  1. Proper parallelization: 16,384 threads working simultaneously
  2. Efficient memory bandwidth utilization: GPU's high-bandwidth memory shines with large datasets
  3. Reduced overhead: Grid-stride loops minimize kernel launch costs
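
In case you're wondering where numbers like these come from, one common approach is to time the kernel with CUDA events (which measure elapsed time on the GPU itself) and the CPU loop with std::chrono. The helper below is an illustrative sketch along those lines, not my exact timing code:

// Illustrative timing helper, assuming dev_a/dev_b/dev_c are already allocated
// and filled, and that the add kernel above is in scope.
float timeGpuAdd(int *dev_a, int *dev_b, int *dev_c, size_t N){
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    add<<<128,128>>>(dev_a, dev_b, dev_c, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // block until the kernel and stop event have completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed milliseconds between the two events

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}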

Key Lessons Learned

Understanding CUDA's Hierarchy

  • Threads: Basic execution units (grouped in warps of 32)
  • Blocks: Groups of threads that can share memory and synchronize
  • Grid: Collection of blocks launched by a single kernel
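
A quick way to make this hierarchy concrete is a throwaway kernel that prints where each thread sits in it. This is purely illustrative, so keep the launch tiny or the output becomes unreadable:

#include <cstdio>
#include <cuda_runtime.h>

// Each thread reports its block, its position within the block, and its global index.
__global__ void whoAmI(){
    int global = threadIdx.x + blockIdx.x * blockDim.x;
    printf("block %d, thread %d -> global index %d\n", blockIdx.x, threadIdx.x, global);
}

int main(){
    whoAmI<<<2,4>>>();           // a grid of 2 blocks, each block holding 4 threads
    cudaDeviceSynchronize();     // wait for the kernel so its printf output gets flushed
    return 0;
}

Eight lines of output, one per thread, and the thread/block/grid picture clicks.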

Thread Organization Best Practices

  • Always use multiples of 32 for thread counts (warp size)
  • Typical block sizes: 64, 128, 256, 512, or 1024 threads
  • Launch enough blocks to saturate the GPU's streaming multiprocessors (one way to size this is sketched below)
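
On that last point, the block count doesn't have to be a hard-coded 128. One option (an illustrative sketch, not what my current code does) is to ask the runtime how many SMs the device has and scale the grid accordingly:

#include <cstdio>
#include <cuda_runtime.h>

int main(){
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // query device 0

    int threadsPerBlock = 128;                    // multiple of the 32-thread warp size
    int blocks = prop.multiProcessorCount * 4;    // a few blocks per SM keeps every SM busy

    printf("%d SMs -> launching %d blocks of %d threads\n",
           prop.multiProcessorCount, blocks, threadsPerBlock);
    // add<<<blocks, threadsPerBlock>>>(dev_a, dev_b, dev_c, N);  // the actual launch would go here
    return 0;
}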

Grid-Stride Loops Are Essential

This pattern allows:

  • Scalable kernels: Same code works for any array size
  • Efficient resource usage: Optimal thread-to-work ratio
  • Better performance: Reduced kernel launch overhead

The Code Evolution

Comparing the kernel implementations:

Week 1 (problematic):

__global__ void add(int *a, int *b, int *c, int N){
    int tid = blockIdx.x;  // Only using block index
    if(tid < N){
        c[tid] = a[tid] + b[tid];
    }
}
// Called with: add<<<N,1>>>  // Bad: Too many blocks!

Week 2 (improved):

__global__ void add(int *a, int *b, int *c, size_t N){
    size_t tid = threadIdx.x + (size_t)blockIdx.x * blockDim.x;  // Proper global index (size_t so 10 billion elements can't overflow it)
    while(tid < N){
        c[tid] = a[tid] + b[tid];
        tid += (size_t)blockDim.x * gridDim.x;  // Grid-stride loop
    }
}
// Called with: add<<<128,128>>>  // Good: Balanced configuration!

Looking Ahead

Week 2 taught me that CUDA programming isn't just about "making things parallel" – it's about understanding the hardware architecture and designing algorithms that work with the GPU's strengths rather than against them.

Next week, I'm planning to explore:

  • Shared memory: Ultra-fast on-chip memory for thread communication
  • Memory coalescing: Optimizing global memory access patterns
  • Occupancy optimization: Maximizing hardware utilization

The journey from "GPU slower than CPU" to "GPU 53× faster" in just one week shows the importance of understanding the fundamentals. CUDA's power lies not just in parallelism, but in intelligent parallelism.

Ready to dive deeper into the world of high-performance GPU computing! 🚀