  • R High Performance Programming
  • Aloysius Lim, William Tjhi
  • 2021-08-06 19:17:05

Three constraints on computing performance – CPU, RAM, and disk I/O

First, let's see how R programs are executed in a computer. This is a very simplified version of what actually happens, but it suffices for us to understand the performance limitations of R. The following figure illustrates the steps required to execute an R program.

Steps to execute an R program

Take, for example, this simple R program, which loads some data from a CSV file, computes the column sums, and writes the results to another CSV file:

data <- read.csv("mydata.csv")
totals <- colSums(data)
write.csv(totals, "totals.csv")

The numbered steps below correspond to the numbers in the preceding diagram:

  1. When we load and run an R program, the R code is first loaded into RAM.
  2. The R interpreter then translates the R code into machine code and loads the machine code into the CPU.
  3. The CPU executes the program.
  4. The program loads the data to be processed from the hard disk into RAM (read.csv() in the example).
  5. The data is loaded in small chunks into the CPU for processing.
  6. The CPU processes the data one chunk at a time, and exchanges chunks of data with RAM until all the data has been processed (in the example, the CPU executes the instructions of the colSums() function to compute the column sums on the data set).
  7. Sometimes, the processed data is stored back onto the hard drive (write.csv() in the example).
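
The steps above can be observed roughly by timing each phase of the example program separately. The following is a minimal sketch, not a rigorous benchmark: the temporary file paths, the demo data, and the variable names t_read, t_compute, and t_write are assumptions made for illustration. Each timing also includes the interpreter's own overhead (steps 1 to 3), which system.time() cannot separate out.

```r
# Hypothetical demo data: 100,000 rows x 5 numeric columns in a temporary CSV file
set.seed(42)
input_path  <- tempfile(fileext = ".csv")
output_path <- tempfile(fileext = ".csv")
write.csv(as.data.frame(matrix(runif(5e5), ncol = 5)), input_path,
          row.names = FALSE)

# Step 4: load the data from disk into RAM
t_read <- system.time(data <- read.csv(input_path))

# Steps 5-6: the CPU processes the data held in RAM
t_compute <- system.time(totals <- colSums(data))

# Step 7: store the results back on disk
t_write <- system.time(write.csv(totals, output_path))

# Wall-clock seconds spent in each phase
elapsed <- c(read    = t_read[["elapsed"]],
             compute = t_compute[["elapsed"]],
             write   = t_write[["elapsed"]])
print(elapsed)
```

The exact numbers depend on the machine, the disk, and the operating system's file cache, so they are best read as a relative comparison between the phases.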

From this depiction of the computing process, we can see a few places where performance bottlenecks can occur:

  • The speed and performance of the CPU determine how quickly computing instructions, such as colSums() in the example, are executed. This includes both the interpretation of the R code into machine code and the actual execution of the machine code to process the data.
  • The size of the RAM available on the computer limits the amount of data that can be processed at any given time. In this example, if the mydata.csv file contains more data than can be held in RAM, the call to read.csv() will fail.
  • The speed of disk input/output (I/O), that is, the speed at which data can be read from or written to the hard disk (read.csv() and write.csv() in the example), affects how quickly the data can be loaded into memory and stored back onto the hard disk.
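
To get a feel for the RAM constraint, it can help to compare how much space the same data takes on disk and in memory. The sketch below assumes a hypothetical temporary CSV file and illustrative variable names; object.size() and file.size() are standard R functions, but the figures they report will vary with the data and the R version.

```r
# One million random doubles: each value needs 8 bytes in RAM,
# plus a small fixed header for the vector itself
set.seed(7)
x <- runif(1e6)
bytes_in_ram <- as.numeric(object.size(x))
print(format(object.size(x), units = "MB"))

# Hypothetical file: the same values written as text to a temporary CSV
path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = x), path, row.names = FALSE)
bytes_on_disk <- file.size(path)

# Side-by-side comparison, in bytes
print(c(in_ram = bytes_in_ram, on_disk = bytes_on_disk))
```

The two sizes generally differ because the CSV stores each number as text while R stores it in a fixed-width binary form, so the size of a file on disk is only a rough guide to whether its contents will fit in RAM.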

Sometimes, you might encounter these limiting factors one at a time. For example, when a dataset is small enough to be read quickly from the disk and stored fully in RAM, but the computations performed on it are complex, only the CPU constraint is encountered. At other times, you might find them occurring together in various combinations. For example, when a dataset is very large, it takes a long time to load from the disk, only a small chunk of it can be held in memory at any given time, and any computation on it takes a long time. In either case, these are symptoms of performance problems. To diagnose the problems and find solutions for them, we need to look at what is happening behind the scenes that might be causing these constraints.
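
As an illustration of the case where only a small chunk of a dataset can be held in memory at a time, the column sums from the earlier example could be accumulated chunk by chunk. This is one possible sketch rather than a recommended method: the demo file, the chunk_rows value, and the variable names are assumptions, and reading line by line in this way assumes that no field in the CSV contains an embedded line break.

```r
# Hypothetical demo data: 1,000 rows x 3 numeric columns in a temporary CSV file
set.seed(1)
path <- tempfile(fileext = ".csv")
demo <- as.data.frame(matrix(runif(1000 * 3), ncol = 3))
write.csv(demo, path, row.names = FALSE)

chunk_rows <- 250              # rows held in memory at any one time (assumed)
con <- file(path, open = "r")

# Read the header line once and strip the quotes that write.csv() adds
col_names <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])
totals <- setNames(numeric(length(col_names)), col_names)

repeat {
  lines <- readLines(con, n = chunk_rows)   # next chunk of raw text lines
  if (length(lines) == 0) break             # end of file reached
  chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)
  totals <- totals + colSums(chunk)         # fold this chunk into the running sums
}
close(con)

print(totals)
```

This works for column sums because partial sums can simply be added together; computations that need all the data at once, such as a median, do not split into chunks so easily.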

Let's now take a look at how R is designed and how it works, and see what the implications are for its performance.
