- Hands-On GPU Programming with Python and CUDA
- Dr. Brian Tuomanen
- 645字
- 2021-06-10 19:25:33
Parallelization and Amdahl's Law
Before we can dive in and unlock the potential of GPUs, we first have to realize where their computational power lies in comparison to a modern Intel/AMD central processing unit (CPU)—the power does not lie in the fact that it has a higher clock speed than a CPU, nor in the complexity or particular design of the individual cores. An individual GPU core is actually quite simplistic, and at a disadvantage when compared to a modern individual CPU core, which use many fancy engineering tricks, such as branch prediction to reduce the latency of computations. Latency refers to the beginning-to-end duration of performing a single computation.
The power of the GPU derives from the fact that there are many, many more cores than in a CPU, which means a huge step forward in throughput. Throughput here refers to the number of computations that can be performed simultaneously. Let's use an analogy to get a better understanding of what this means. A GPU is like a very wide city road that is designed to handle many slower-moving cars at once (high throughput, high latency), whereas a CPU is like a narrow highway that can only admit a few cars at once, but can get each individual car to its destination much quicker (low throughput, low latency).
We can get an idea of the increase in throughput by seeing how many cores these new GPUs have. To give you an idea, the average Intel or AMD CPU has only two to eight cores—while an entry-level, consumer-grade NVIDIA GTX 1050 GPU has 640 cores, and a new top-of-the-line NVIDIA RTX 2080 Ti has 4,352 cores! We can exploit this massive throughput, provided we know how properly to parallelize any program or algorithm we wish to speed up. By parallelize, we mean to rewrite a program or algorithm so that we can split up our workload to run in parallel on multiple processors simultaneously. Let's think about an analogy from real-life.
Suppose that you are building a house and that you already have all of the designs and materials in place. You hire a single laborer, and you estimate it will take 100 hours to construct the house. Let's suppose that this particular house can be built in such a way that the work can be perfectly divided between every additional laborer you hire—that is to say, it will take 50 hours for two laborers, 25 hours for four laborers, and 10 hours for ten laborers to construct the house—the number of hours to construct your house will be 100 divided by the number of laborers you hire. This is an example of a parallelizable task.
We notice that this task is twice as fast to complete for two laborers, and ten times as fast for ten laborers to complete together (that is, in parallel) as opposed to one laborer building the house alone (that is, in serial)—that is, if N is the number of laborers, then it will be N times as fast. In this case, N is known as the speedup of parallelizing our task over the serial version of our task.
Before we begin to program a parallel version of a given algorithm, we often start by coming up with an estimate of the potential speedup that parallelization would bring to our task. This can help us determine whether it is worth expending resources and time writing a parallelization of our program or not. Because real life is more complicated than the example we gave here, it's pretty obvious that we won't be able to parallelize every program perfectly, all of the time—most of the time, only a part of our program will be nicely parallelizable, while the rest will have to run in serial.
- 樂學Windows操作系統
- Ubuntu Linux操作系統
- Linux網絡內核分析與開發
- macOS效率手冊
- Microsoft Operations Management Suite Cookbook
- 蘋果OS X Mavericks 10.9應用大全
- 細說Linux基礎知識
- Mastering Reactive JavaScript
- 一學就會:Windows Vista應用完全自學手冊
- Linux服務器配置與管理
- 計算機系統的自主設計
- Social Data Visualization with HTML5 and JavaScript
- 新編電腦辦公(Windows 10+ Office 2013版)從入門到精通
- Kali Linux高級滲透測試(原書第3版)
- Learn CUDA Programming