
Executing threads on a device

We have seen that, while configuring kernel launch parameters, we can start multiple blocks and multiple threads in parallel. So, in which order do these blocks and threads start and finish their execution? It is important to know this if we want to use the output of one thread in another thread. To understand this, we have modified the kernel from the Hello, CUDA! program we saw in the first chapter by adding a print statement inside the kernel that prints the block number. The modified code is as follows:

#include <stdio.h>

__global__ void myfirstkernel(void)
{
    // blockIdx.x gives the block index of the current copy of the kernel
    printf("Hello!!!I'm thread in block: %d\n", blockIdx.x);
}

int main(void)
{
    // A kernel launch with 16 blocks and 1 thread per block
    myfirstkernel<<<16, 1>>>();

    // Wait for all launched GPU work to finish
    cudaDeviceSynchronize();

    printf("All threads are finished!\n");
    return 0;
}

As can be seen from the code, we are launching a kernel with 16 blocks in parallel, with each block having a single thread. In the kernel code, we print the block ID of the executing copy. We can think of this as 16 copies of the same myfirstkernel starting execution in parallel. Each of these copies has a unique block ID, which can be accessed through the built-in blockIdx.x variable, and a unique thread ID, which can be accessed through threadIdx.x. These IDs tell us which block and thread are executing the kernel. When you run the program many times, you will find that the blocks execute in a different order each time.
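Since the paragraph above mentions threadIdx.x as well, a minimal variant of the kernel that prints both IDs may help; note that this sketch (including the kernel name myidkernel) is illustrative and not taken from the book:

```cpp
#include <stdio.h>

__global__ void myidkernel(void)
{
    // Each copy of the kernel sees its own block and thread indices
    printf("Block: %d, Thread: %d\n", blockIdx.x, threadIdx.x);
}

int main(void)
{
    // 4 blocks of 4 threads each: 16 kernel copies in total
    myidkernel<<<4, 4>>>();

    // Wait for the kernel to finish so its output is flushed
    cudaDeviceSynchronize();
    return 0;
}
```

Running this repeatedly shows that the block order changes between runs, while the threads within one block print together in this simple case.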

One question you should ask is: how many different output patterns can the previous program produce? The correct answer is 16! (16 factorial). In general, such a program can produce n! different orderings, where n is the number of blocks launched in parallel. So, whenever you write a program in CUDA, you should make sure that its correctness does not depend on the blocks executing in any particular order.

This program also uses one more CUDA function: cudaDeviceSynchronize(). Why is it used? A kernel launch is an asynchronous operation: it returns control to the CPU thread immediately after starting the GPU work, before the kernel has finished executing. In the previous code, the next lines in the CPU thread print a message and exit the application, which would terminate the console before the kernel has finished. So, if we did not include this call, we would not see any of the print statements from the kernel: the output generated later by the kernel would have nowhere to go, and we would never see it. cudaDeviceSynchronize() ensures that the kernel finishes before the application is allowed to exit, so the output from the kernel finds a waiting standard output queue.
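In addition to blocking the host, cudaDeviceSynchronize() returns a cudaError_t status that reports errors from previously launched GPU work. A sketch of the same program with this return value checked (this is common practice, not something the original listing does) might look like:

```cpp
#include <stdio.h>

__global__ void myfirstkernel(void)
{
    printf("Hello!!!I'm thread in block: %d\n", blockIdx.x);
}

int main(void)
{
    myfirstkernel<<<16, 1>>>();

    // Block the host until all launched GPU work has completed, and
    // inspect the returned status instead of discarding it
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
    {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("All threads are finished!\n");
    return 0;
}
```

cudaGetErrorString() converts the error code into a human-readable message, which makes failed launches much easier to diagnose.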
