CUDA Dot Product Reduction: Beyond Simple Array Operations
Gautam Sharma (@iamgs_)
After mastering basic array addition and multiplication in my previous CUDA adventure, I wanted to tackle something more practical for machine learning: dot product reduction. This operation - multiplying corresponding elements of two arrays and summing the results - is everywhere in ML: loss functions, similarity calculations, and gradient computations.
The challenge? Doing this efficiently on GPU requires combining element-wise operations with parallel reduction - a perfect next step from my simple array operations.
The CPU Approach
On CPU, dot product is straightforward:
float dot_product_cpu(float* a, float* b, int N) {
    float sum = 0.0f;
    for (int i = 0; i < N; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
But this sequential approach wastes the GPU's parallel power. We need a strategy that combines parallel computation with reduction.
My CUDA Implementation
Here's the kernel I developed:
// threadsPerBlock must be a compile-time constant visible to the kernel,
// since it sizes the shared-memory array below.
const int threadsPerBlock = 256;

__global__ void dot_product_kernel(float* a, float* b, float* result, int N) {
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;

    float temp = 0.0f;
    // Grid-stride loop: each thread processes multiple elements
    while (tid < N) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }

    // Store this thread's partial result in shared memory
    cache[cacheIndex] = temp;
    __syncthreads();

    // Tree reduction in shared memory (assumes blockDim.x is a power of two)
    int i = blockDim.x / 2;
    while (i != 0) {
        if (cacheIndex < i) {
            cache[cacheIndex] += cache[cacheIndex + i];
        }
        __syncthreads();
        i /= 2;
    }

    // First thread writes this block's result
    if (cacheIndex == 0) {
        result[blockIdx.x] = cache[0];
    }
}
Key Concepts Explained
Grid-Stride Loop
while (tid < N) {
    temp += a[tid] * b[tid];
    tid += blockDim.x * gridDim.x;  // Jump by total grid size
}
This elegant pattern lets each thread process multiple elements. Instead of launching millions of threads for millions of elements, we use fewer threads that each do more work. Benefits:
- Handles arrays of any size with a single kernel launch
- Better memory locality
- Reduces kernel launch overhead
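As a minimal sketch of how the pattern generalizes beyond dot products, here is a hypothetical grid-stride kernel that scales an array of arbitrary length in place. The name scale_kernel and the alpha parameter are purely illustrative and not part of my implementation:

// Hypothetical example (not from my dot product code): a grid-stride
// kernel that scales an array of arbitrary length N in place.
__global__ void scale_kernel(float* x, float alpha, int N) {
    for (int i = threadIdx.x + blockIdx.x * blockDim.x;
         i < N;
         i += blockDim.x * gridDim.x) {   // each thread handles i, i + gridSize, i + 2*gridSize, ...
        x[i] *= alpha;
    }
}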
Two-Phase Operation
Phase 1: Compute. Each thread multiplies corresponding elements and accumulates locally in a register (temp).
Phase 2: Reduce. Threads collaborate within each block using shared memory to sum their partial results.
Shared Memory Reduction
The reduction follows a tree pattern - threads pair up and combine results. For an 8-thread block, for example:
- Step 1: Thread 0 + Thread 4, Thread 1 + Thread 5, etc.
- Step 2: Thread 0 + Thread 2, Thread 1 + Thread 3
- Step 3: Thread 0 + Thread 1
- Final: Thread 0 holds the block's result
int i = blockDim.x / 2;
while (i != 0) {
    if (cacheIndex < i) {
        cache[cacheIndex] += cache[cacheIndex + i];
    }
    __syncthreads();  // outside the if: every thread in the block must reach it
    i /= 2;
}
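To make the tree concrete, here is a hypothetical trace of the cache array for an 8-thread block whose partial sums happen to be 1 through 8 (values I made up purely for illustration):

i = 4:  cache = {1+5, 2+6, 3+7, 4+8, ...} = {6, 8, 10, 12, 5, 6, 7, 8}
i = 2:  cache = {6+10, 8+12, ...}         = {16, 20, 10, 12, 5, 6, 7, 8}
i = 1:  cache = {16+20, ...}              = {36, 20, 10, 12, 5, 6, 7, 8}
Final:  cache[0] == 36 == 1 + 2 + ... + 8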
Complete Host Code
void dot_product_gpu(float* h_a, float* h_b, float* result, int N) {
    float *d_a, *d_b, *d_partial_results;
    const int threadsPerBlock = 256;
    const int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    // Allocate GPU memory
    cudaMalloc(&d_a, N * sizeof(float));
    cudaMalloc(&d_b, N * sizeof(float));
    cudaMalloc(&d_partial_results, blocksPerGrid * sizeof(float));

    // Copy data to GPU
    cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, N * sizeof(float), cudaMemcpyHostToDevice);

    // Launch kernel
    dot_product_kernel<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_partial_results, N);

    // Copy partial results back and sum on CPU
    float* h_partial = new float[blocksPerGrid];
    cudaMemcpy(h_partial, d_partial_results, blocksPerGrid * sizeof(float), cudaMemcpyDeviceToHost);

    *result = 0.0f;
    for (int i = 0; i < blocksPerGrid; i++) {
        *result += h_partial[i];
    }

    // Cleanup
    delete[] h_partial;
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_partial_results);
}
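For completeness, here is a minimal, hypothetical driver showing how the two versions can be called and compared. The array sizes and fill values are made up for illustration; this is not my actual timing harness.

// Hypothetical usage sketch, not part of the original timing harness.
#include <cstdio>
#include <cmath>

int main() {
    const int N = 1 << 20;            // 1M elements
    float* a = new float[N];
    float* b = new float[N];
    for (int i = 0; i < N; i++) {
        a[i] = 1.0f;
        b[i] = 0.5f;
    }

    float gpu_result = 0.0f;
    dot_product_gpu(a, b, &gpu_result, N);
    float cpu_result = dot_product_cpu(a, b, N);

    // Both should equal N * 0.5 = 524288
    printf("GPU: %f  CPU: %f  diff: %f\n",
           gpu_result, cpu_result, std::fabs(gpu_result - cpu_result));

    delete[] a;
    delete[] b;
    return 0;
}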
Performance Results
Testing with 1M element arrays:
Implementation | Time (ms)
---|---
GPU Dot Product | 16.335
CPU Dot Product | 2.967

At this size, the CPU version significantly outperforms the GPU.
This analysis measures only the kernel execution on the GPU and the function execution on the CPU; it does not take memory allocation or deallocation into account.
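As a sketch of how kernel-only time can be measured, CUDA events can bracket the launch inside dot_product_gpu like this (an illustration of the standard technique, not necessarily my exact harness):

// Sketch: timing only the kernel launch with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
dot_product_kernel<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_partial_results, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);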
Testing with 10M element arrays:
Implementation | Time (ms)
---|---
GPU Dot Product | 16.262
CPU Dot Product | 28.847

It's quite fascinating that the GPU time stays essentially the same even though the data size grew by a factor of 10, while the CPU time grows roughly linearly.
Testing with 10M element arrays, including memory allocation:
Measuring the whole process, including memory allocation and deallocation, we get:

Implementation | Time (ms)
---|---
GPU Dot Product | 385.854
CPU Dot Product | 104.82

Once this overhead is included, the CPU version comes out ahead again.
Real-World Applications
This dot product pattern appears everywhere in ML:
// Mean squared error: dot product of the error vector with itself
// (errors[i] = predictions[i] - targets[i])
float mse_loss = dot_product(errors, errors, N) / N;

// Cosine similarity
float similarity = dot_product(vector1, vector2, N) /
                   (norm(vector1) * norm(vector2));

// Squared gradient norm (used for gradient clipping and convergence checks)
float gradient_norm_squared = dot_product(gradients, gradients, N);
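The norm() used above is itself just a dot product; a minimal sketch of such a helper (my own illustrative naming, not a library function) could look like this:

#include <cmath>

// Illustrative helper: L2 norm of a vector via the dot product with itself.
float norm(float* v, int N) {
    return std::sqrt(dot_product_cpu(v, v, N));
}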
What I Learned
- Grid-stride loops are elegant for handling variable-sized data
- Shared memory enables efficient thread collaboration within blocks
- Two-phase algorithms (compute + reduce) are common in GPU programming
- Partial results from blocks can be combined on CPU for simplicity (a GPU-side alternative is sketched below)
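As an aside, the per-block partials could instead be summed on the GPU with a tiny second kernel launched on a single block, so only the final scalar has to be copied back to the host. This is not part of my implementation; sum_partials_kernel is a name I made up for the sketch.

// Sketch only: finish the reduction on the GPU instead of the CPU.
__global__ void sum_partials_kernel(float* partials, float* out, int M) {
    __shared__ float cache[256];               // matches the 256-thread launch below
    int idx = threadIdx.x;

    // One block strides over all M partial results
    float temp = 0.0f;
    for (int i = idx; i < M; i += blockDim.x) {
        temp += partials[i];
    }
    cache[idx] = temp;
    __syncthreads();

    // Same tree reduction as in the main kernel
    for (int s = blockDim.x / 2; s > 0; s /= 2) {
        if (idx < s) {
            cache[idx] += cache[idx + s];
        }
        __syncthreads();
    }
    if (idx == 0) {
        *out = cache[0];
    }
}

// Usage: sum_partials_kernel<<<1, 256>>>(d_partial_results, d_result, blocksPerGrid);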
This dot product implementation bridges the gap between my simple array operations and more complex ML algorithms. The reduction pattern here will be useful for implementing neural network layers, optimization algorithms, and statistical computations.
Next up: exploring how these patterns scale to matrix operations and more advanced ML primitives!
Code for this dot product implementation is available in my cuda-ml-journey repository.