- Hands-On GPU-Accelerated Computer Vision with OpenCV and CUDA
- Bhaumik Vaidya
Executing threads on a device
We have seen that, while configuring kernel parameters, we can launch multiple blocks and multiple threads in parallel. So, in which order do these blocks and threads start and finish their execution? It is important to know this if we want to use the output of one thread in other threads. To understand this, we have modified the kernel of the Hello, CUDA! program from the first chapter by adding a print statement to the kernel that prints the block number. The modified code is as follows:
#include <iostream>
#include <stdio.h>
__global__ void myfirstkernel(void)
{
//blockIdx.x gives the index of the block currently executing this kernel
printf("Hello!!!I'm thread in block: %d\n", blockIdx.x);
}
int main(void)
{
//A kernel call with 16 blocks and 1 thread per block
myfirstkernel<<<16, 1>>>();
//Wait for all launched kernels to finish
cudaDeviceSynchronize();
printf("All threads are finished!\n");
return 0;
}
As can be seen from the code, we are launching a kernel with 16 blocks in parallel, each block having a single thread. In the kernel code, we print the block ID of the executing kernel. We can think of this as 16 copies of the same myfirstkernel starting execution in parallel. Each of these copies has a unique block ID, which can be accessed via the blockIdx.x built-in variable, and a unique thread ID, which can be accessed via threadIdx.x. These IDs tell us which block and thread are executing the kernel. When you run the program many times, you will find that the blocks execute in a different order each time. One sample output is shown as follows:
[Sample output: sixteen "Hello!!!I'm thread in block: N" lines, one per block, with the block numbers appearing in a different, nondeterministic order on each run]
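To see the thread index alongside the block index, consider a small variant (our sketch, not a listing from the book) that launches 4 blocks of 4 threads each and prints a per-thread global index, computed with the common idiom blockIdx.x * blockDim.x + threadIdx.x:

```cuda
#include <stdio.h>

__global__ void whoami(void)
{
    // blockDim.x is the number of threads per block, so this
    // expression assigns each thread a unique global index.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d, global id %d\n",
           blockIdx.x, threadIdx.x, tid);
}

int main(void)
{
    whoami<<<4, 4>>>();   // 4 blocks, 4 threads per block
    cudaDeviceSynchronize();
    return 0;
}
```

As with the 16-block example, the blocks may print in any order, but within each line the global ID is always consistent with its block and thread indices.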
One question you should ask is how many different output patterns the previous program can produce. The correct answer is 16 factorial (16!). In general, a program launching n blocks in parallel can produce n! different orderings of its output. So, whenever you write a CUDA program, you should remember that blocks may execute in any order.
This program also uses one more CUDA function: cudaDeviceSynchronize(). Why is it needed? A kernel launch is an asynchronous operation: it returns control to the CPU thread immediately after starting the GPU work, before the kernel has finished executing. In the previous code, the next steps on the CPU thread are the print statement and application exit, which would terminate the program before the kernel completes. If we omitted this call, you would not see any of the kernel's print statements; the output generated later by the kernel would have nowhere to go. Including cudaDeviceSynchronize() ensures that the kernel finishes before the application is allowed to exit, so the kernel's output reaches a waiting standard output stream.
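Because the launch is asynchronous, it is also good practice to check the error code that cudaDeviceSynchronize() returns; a kernel that crashes otherwise fails silently. A minimal sketch (the error handling is our addition, not part of the original listing):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void myfirstkernel(void)
{
    printf("Hello!!!I'm thread in block: %d\n", blockIdx.x);
}

int main(void)
{
    myfirstkernel<<<16, 1>>>();

    // Block the CPU thread until the kernel finishes; the return
    // value reports any error raised during kernel execution.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("All threads are finished!\n");
    return 0;
}
```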