Since the computation of the integral depends primarily only on the sum of the terms \(2f(x_{i})\), this task can be divided among a number of processes, computed independently, and the results of those individual tasks can be combined to get the sum. This is the basis for a distributed parallel algorithm approach.

The CUDA (Compute Unified Device Architecture) model of parallel computing is based on a set of computing processes housed on a CUDA certified NVIDIA graphics chip [cuda]. A large number of threads can be executed simultaneously on the NVIDIA graphics chip.

In the following subsection, a CUDA program is implemented for the Trapezoidal Rule. The different *device* and *host* functions and their calls will be carefully explained in the context of the Trapezoidal Rule.

This algorithm can be easily implemented using *CUDA*. Assign the independent tasks to different computing threads on the graphics chip, and compute the sum from the partial sums on the host. The following example shows a naive implementation; the purpose of this implementation is to maximize clarity of computing using *CUDA*, even at the expense of addition speedup.

On lines 19-23 is the implementation of the *myfunction()*, *f(x)*. the *__device__* designator indicates this function is implemented to run only on the device, graphics, and can be called only by functions running on the device.

On lines 27-40 is the implementation of the kernel function, *integrateKernel()*, which is executed on each thread in the graphics chip. The *__global__* designator specifies the function as an entry point to the GPU. The calling program must specify the number of participating threads to execute this function. This function is passed the address of an array of floats where the partial results will be stored, a floating point value *c* which is the left hand endpoint of the interval of integration, a floating point value *deltaX* and an integer N which together indicate the numerical approximation will be computed over *N* subintervals each with length *deltaX*. One line 30, *idx* is computed to be the thread number (remember there are many executing threads), on line 31, *x* is computed to be the value of the left hand endpoint for this thread, and lines 33-37 compute the partial result, only if the thread corresponds to one of the *N* subintervals.

On lines 46-94 the host function *cudaIntegrate()* is implemented. It is passed the interval of integration \([c,d]\) and the number of subintervals to use, *N*. On line 49, the length of each subinterval, *deltaX* is computed. On line 55, the size in bytes of the array to hold the partial results is computed. On line 58, the space is allocated on the host computer for the partial results, and on lines 60-65 using cudaMalloc() the space is allocated on the device for the partial results. On lines 68-69 the number of threads needed is determined based on the number of blocks and the individual block size. On line 73 the *integratorKernel* on the device is started and is passed four arguments, as well the number of threads are specified as *n_blocks*block_size*. On line 76, the array of partial results computed on the device are copied into an array on the host using cudaMemcpy(). On lines 84-86 the trapezoidal approximation is completed. On lines 89-90 the host and device arrays are released; cudaFree() is used to deallocate the array on the device.

In the main function the host function *cudaIntegrate()* is called on line 111.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 | ```
// This program implements trapezoidal integration for a function
// f(x) over the interval [c,d] using N subdivisions. This program
// runs on a host and device (NVIDIA graphics chip with cuda
// certification). The function f(x) is implemented as a callable
// function on the device. The kernel computes the sums f(xi)+f(xi+deltaX).
// The host function computes of the individual sums computed on the
// device and multiplies by deltaX/2.
#include <iostream>
#include <ctime>
using namespace std;
#include <cuda.h>
#include <math_constants.h>
#include <cuda_runtime.h>
// function to integrate, defined as a function on the
// GPU device
__device__ float myfunction(float a)
{
return a*a+2.0f*a + 3.0f;
}
// kernel function to compute the summation used in the trapezoidal
// rule for numerical integration
__global__ __device__ void integratorKernel(float *a, float c, float deltaX, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
float x = c + (float)idx * deltaX;
if (idx<N)
{
a[idx] = myfunction(x)+myfunction(x+deltaX);
}
}
// cudaIntegrate() is the host function that sets up the
// computation of the integral of f(x) over the interval
// [c,d].
__host__ float cudaIntegrate(float c, float d, int N)
{
// deltaX
float deltaX = (d-c)/N;
// error code variable
cudaError_t errorcode = cudaSuccess;
// size of the arrays in bytes
int size = N*sizeof(float);
// allocate array on host and device
float* a_h = (float *)malloc(size);
float* a_d;
if (( errorcode = cudaMalloc((void **)&a_d,size))!= cudaSuccess)
{
cout << "cudaMalloc(): " << cudaGetErrorString(errorcode) << endl;
exit(1);
}
// do calculation on device
int block_size = 256;
int n_blocks = N/block_size + ( N % block_size == 0 ? 0:1);
// cout << "blocks: " << n_blocks << endl;
// cout << "block size: " << block_size << endl;
integratorKernel <<< n_blocks, block_size >>> (a_d, c, deltaX, N);
// copy results from device to host
if((errorcode = cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost))!=cudaSuccess)
{
cout << "cudaMemcpy(): " << cudaGetErrorString(errorcode) << endl;
exit(1);
}
// add up results
float sum = 0.0;
for(int i=0; i<N; i++) sum += a_h[i];
sum *= deltaX/2.0;
// clean up
free(a_h);
cudaFree(a_d);
return sum;
}
// utility host function to convert the length of time into
// micro seconds
__host__ double diffclock(clock_t clock1, clock_t clock2)
{
double diffticks = clock1-clock2;
double diffms = diffticks/(CLOCKS_PER_SEC/1000);
return diffms;
}
// host main program
int main()
{
clock_t start = clock();
float answer = cudaIntegrate(0.0,1.0,1000);
clock_t end = clock();
cout << "The answer is " << answer << endl;
cout << "Computation time: " << diffclock(end,start);
cout << " micro seconds" << endl;
return 0;
}
``` |

The allocation of space on the host and device and the transfer of the partial values from the device to the host are computationally expensive. A good question is can these be significantly improved. The answer is *yes*, but how will not be addressed here, since the technique, at this point, would obfuscate our discussion.

For any parallel algorithm, the *speedup* is the ratio of the execution times of the sequential and parallel implementations. All the threads in the same block execute simultaneously and each takes constant time. The number of blocks is \(n\_blocks = \frac{n}{256}\). However, on line 85 of *main()* \(\Theta(n)\) steps are taken to complete the computation; so the total execution time is the sum of the parallel execution times plus the sequential time to complete the computation. Now speed up can be computed:

\[S(n) = \frac{T(1)}{\Theta(n\_blocks)+\Theta(n)} = \frac{\Theta(n)}{\frac{n}{256} + \Theta(n)} \approx \Theta(1)\]

for reasonable n. So it appears that question of *can the allocation of space and transfer to data be improved?* turns out to be very important.

- Take time to carefully read through the implementation. What do you
**not**understand? - Why does the screen flicker when this program runs?