
Comparing latency between the CPU and the GPU code 

The programs for the CPU and the GPU addition are written in a modular way, so you can play around with the value of N. If N is small, you will not notice any significant time difference between the CPU and the GPU code. But if N is sufficiently large, you will notice a significant difference between the CPU execution time and the GPU execution time for the same vector addition. The time taken for the execution of a particular block can be measured by adding the following lines to the existing code:

clock_t start_d = clock();                 // start of the device (GPU) measurement
printf("Doing GPU Vector add\n");
gpuAdd<<<N, 1>>>(d_a, d_b, d_c);           // launch the kernel with N blocks of one thread each
cudaThreadSynchronize();                   // wait for the kernel to finish (cudaDeviceSynchronize() in newer CUDA versions)
clock_t end_d = clock();                   // end of the device measurement
double time_d = (double)(end_d - start_d) / CLOCKS_PER_SEC;
printf("No of Elements in Array:%d \n Device time %f seconds \n host time %f Seconds\n", N, time_d, time_h);

Time is measured by counting the clock ticks taken to perform a particular operation: the difference between the starting and ending tick counts, each read with the clock() function, is divided by CLOCKS_PER_SEC (the number of clock ticks per second) to get the execution time in seconds. When N is taken as 10,000,000 in the previous vector addition programs of the CPU and the GPU, the program prints the times measured for both versions.

As the output shows, the execution time drops from about 25 milliseconds on the CPU to roughly 1 millisecond when the same addition is performed on the GPU. This confirms what we saw in theory earlier: executing code in parallel on the GPU improves throughput. CUDA also provides a more efficient and accurate method for measuring the performance of CUDA programs, using CUDA events, which will be explained in later chapters.
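As a preview of that approach, a minimal sketch of timing the same kernel launch with the CUDA event API (cudaEventCreate, cudaEventRecord, cudaEventSynchronize, and cudaEventElapsedTime) might look like the following; the kernel launch and device pointers are the same as above, and the result is reported in milliseconds:

cudaEvent_t e_start, e_stop;               // events used to time-stamp the GPU work
cudaEventCreate(&e_start);
cudaEventCreate(&e_stop);
cudaEventRecord(e_start, 0);               // record the start event on the default stream
gpuAdd<<<N, 1>>>(d_a, d_b, d_c);           // same kernel launch as before
cudaEventRecord(e_stop, 0);                // record the stop event after the launch
cudaEventSynchronize(e_stop);              // wait until the stop event has completed
float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, e_start, e_stop);
printf("Device time measured with events: %f ms\n", elapsed_ms);
cudaEventDestroy(e_start);
cudaEventDestroy(e_stop);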
