CUDA algorithm¶

Since the computation of the integral depends primarily on the sum of the terms $$2f(x_{i})$$, the task can be divided among a number of processes, computed independently, and the results of those individual tasks combined to obtain the sum. This is the basis for a distributed parallel algorithm.
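Written out, the composite Trapezoidal Rule over $$N$$ subintervals of width $$\Delta x$$ makes this independence explicit:

$$\int_{c}^{d} f(x)\,dx \approx \frac{\Delta x}{2}\sum_{i=0}^{N-1}\left[f(x_{i}) + f(x_{i+1})\right], \qquad x_{i} = c + i\,\Delta x$$

Each term $$f(x_{i}) + f(x_{i+1})$$ depends only on its own subinterval, so the terms can be computed independently; only the final summation and the scaling by $$\Delta x/2$$ combine results.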

CUDA¶

The CUDA (Compute Unified Device Architecture) model of parallel computing is based on a set of computing processes housed on a CUDA-certified NVIDIA graphics chip [cuda]. A large number of threads can be executed simultaneously on the NVIDIA graphics chip.

In the following subsection, a CUDA program is implemented for the Trapezoidal Rule. The different device and host functions and their calls will be carefully explained in the context of the Trapezoidal Rule.

CUDA implementation¶

This algorithm can be implemented straightforwardly in CUDA: assign the independent tasks to different computing threads on the graphics chip, and compute the sum from the partial sums on the host. The following example shows a naive implementation; its purpose is to maximize the clarity of computing with CUDA, even at the expense of additional speedup.

On lines 19-23 is the implementation of myfunction(), $$f(x)$$. The __device__ designator indicates that this function is compiled to run only on the device (the graphics chip) and can be called only by functions running on the device.

On lines 27-40 is the implementation of the kernel function, integratorKernel(), which is executed on each thread in the graphics chip. The __global__ designator specifies the function as an entry point to the GPU; the calling program must specify the number of participating threads when it invokes this function. The function is passed the address of an array of floats where the partial results will be stored, a floating point value c (the left hand endpoint of the interval of integration), a floating point value deltaX, and an integer N; together deltaX and N indicate that the numerical approximation will be computed over N subintervals, each of length deltaX. On line 30, idx is computed to be the thread number (remember there are many executing threads); on line 31, x is computed to be the left hand endpoint of this thread's subinterval; and on lines 33-37 the partial result is computed, but only if the thread corresponds to one of the N subintervals.

On lines 46-94 the host function cudaIntegrate() is implemented. It is passed the interval of integration $$[c,d]$$ and the number of subintervals to use, N. On line 49, the length of each subinterval, deltaX, is computed. On line 55, the size in bytes of the array to hold the partial results is computed. On line 58, space is allocated on the host for the partial results, and on lines 60-65, using cudaMalloc(), space is allocated on the device. On lines 68-69 the number of threads needed is determined from the number of blocks and the individual block size. On line 73 integratorKernel is launched on the device with four arguments; the number of threads is specified as n_blocks*block_size. On line 76, the array of partial results computed on the device is copied into an array on the host using cudaMemcpy(). On lines 84-86 the trapezoidal approximation is completed. On lines 89-90 the host and device arrays are released; cudaFree() is used to deallocate the array on the device.

In the main function the host function cudaIntegrate() is called on line 111.

```
  1  // This program implements trapezoidal integration for a function
  2  // f(x) over the interval [c,d] using N subdivisions. This program
  3  // runs on a host and device (NVIDIA graphics chip with cuda
  4  // certification). The function f(x) is implemented as a callable
  5  // function on the device. The kernel computes the sums
  6  // f(xi) + f(xi+deltaX). The host function sums the individual
  7  // results computed on the device and multiplies by deltaX/2.
  8
  9  #include <iostream>
 10  #include <iomanip>
 11  using namespace std;
 12
 13  #include <cstdlib>
 14  #include <math.h>
 15  #include <cuda.h>
 16
 17  // function to integrate, defined as a function on the
 18  // GPU device
 19  __device__ float myfunction(float a)
 20  {
 21      return a*a + 2.0f*a + 3.0f;
 22  }
 23
 24
 25  // kernel function to compute the summation used in the trapezoidal
 26  // rule for numerical integration
 27  __global__ void integratorKernel(float *a, float c, float deltaX, int N)
 28  {
 29      // unique index of this thread among all executing threads
 30      int idx = blockIdx.x * blockDim.x + threadIdx.x;
 31      float x = c + (float)idx * deltaX;
 32
 33      // compute this thread's partial result, but only if the
 34      // thread corresponds to one of the N subintervals
 35      if (idx < N)
 36      {
 37          a[idx] = myfunction(x) + myfunction(x + deltaX);
 38      }
 39  }
 40
 41
 42  // host function: computes the trapezoidal approximation of the
 43  // integral of f(x) over [c,d] using N subintervals by setting
 44  // up and launching the kernel on the device
 45
 46  float cudaIntegrate(float c, float d, int N)
 47  {
 48      // the length of each subinterval
 49      float deltaX = (d - c) / N;
 50
 51      // error code returned by device operations
 52      cudaError_t errorcode = cudaSuccess;
 53
 54      // size, in bytes, of the array of partial results
 55      size_t size = N * sizeof(float);
 56
 57      // allocate space for the partial results on the host ...
 58      float *a_h = (float *)malloc(size);
 59
 60      // ... and on the device
 61      float *a_d;
 62      if ((errorcode = cudaMalloc((void **)&a_d, size)) != cudaSuccess) {
 63          cout << "cudaMalloc(): " << cudaGetErrorString(errorcode) << endl;
 64          exit(1);
 65      }
 66
 67      // number of blocks and threads per block
 68      int block_size = 256;
 69      int n_blocks = N / block_size + (N % block_size == 0 ? 0 : 1);
 70
 71      // start the kernel on the device with n_blocks*block_size
 72      // threads, one per subinterval
 73      integratorKernel <<< n_blocks, block_size >>> (a_d, c, deltaX, N);
 74
 75      // copy results from device to host
 76      if ((errorcode = cudaMemcpy(a_h, a_d, sizeof(float) * N,
 77              cudaMemcpyDeviceToHost)) != cudaSuccess)
 78      {
 79          cout << "cudaMemcpy(): " << cudaGetErrorString(errorcode) << endl;
 80          exit(1);
 81      }
 82
 83      // add up results and complete the approximation
 84      float sum = 0.0;
 85      for (int i = 0; i < N; i++) sum += a_h[i];
 86      sum *= deltaX / 2.0f;
 87
 88      // release the host and device arrays
 89      free(a_h);
 90      cudaFree(a_d);
 91
 92      return sum;
 93  }
 94
 95
 96  // main program: compute and report the approximation of the
 97  // integral of f(x) = x*x + 2x + 3 over an interval
 98  int main()
 99  {
100      // interval of integration and number of subintervals
101      float c = 0.0f;
102      float d = 10.0f;
103      int N = 1000;
104
105      cout << "Numerical integration of f(x) = x*x + 2x + 3" << endl;
106      cout << "over the interval [" << c << ", " << d << "]" << endl;
107      cout << "using " << N << " subintervals" << endl;
108      cout << endl;
109
110      // compute the approximation on the device
111      float answer = cudaIntegrate(c, d, N);
112
113      // report the result
114      cout << "The approximation of the integral is "
115           << setprecision(8) << answer << endl;
116
117      return 0;
118  }
119
```

Key algorithmic features¶

The allocation of space on the host and device, and the transfer of the partial values from the device to the host, are computationally expensive. A good question is whether these can be significantly improved. The answer is yes, but how will not be addressed here, since the technique would, at this point, obfuscate our discussion.

Speed up¶

For any parallel algorithm, the speedup is the ratio of the execution times of the sequential and parallel implementations. All the threads in the same block execute simultaneously, and each takes constant time. The number of blocks is $$n\_blocks = \frac{n}{256}$$. However, on line 85 of cudaIntegrate(), $$\Theta(n)$$ steps are taken to complete the computation; so the total execution time is the parallel execution time plus the sequential time to complete the computation. Now the speedup can be computed:

$S(n) = \frac{T(1)}{\Theta(n\_blocks)+\Theta(n)} = \frac{\Theta(n)}{\frac{n}{256} + \Theta(n)} \approx \Theta(1)$

for reasonable $$n$$. So the question of whether the allocation of space and the transfer of data can be improved turns out to be very important.
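To see why, suppose (with hypothetical constants) the sequential implementation takes $$c_{1}n$$ steps, the kernel takes $$c_{2}\frac{n}{256}$$ steps, and the final host-side summation takes $$c_{3}n$$ steps. Then

$$S(n) = \frac{c_{1}n}{c_{2}\frac{n}{256} + c_{3}n} = \frac{c_{1}}{\frac{c_{2}}{256} + c_{3}}$$

a constant independent of $$n$$: the sequential summation dominates, so reducing the amount of data transferred to and summed on the host is the key to better speedup.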

Questions¶

1. Take time to carefully read through the implementation. What do you not understand?
2. Why does the screen flicker when this program runs?