The velocity at which data is generated and transferred to the Hadoop cluster also impacts cluster sizing. Consider two scenarios of data population, each measured in GB generated per minute, as shown in the following diagram:
In the preceding diagram, both scenarios generate the same amount of data each day, but at different velocities. The first scenario has spikes of data, whereas the second sees a consistent flow. As a result, scenario 1 requires more hardware, with additional CPU or GPU capacity and storage, than scenario 2.

Many other parameters can influence the sizing of the cluster. For example, the type of data affects the compression factor: compression can be achieved with gzip, bzip2, and other compression utilities, and if the data is textual, the compression ratio is usually higher. Similarly, intermediate storage requirements add an extra 25% to 35% of capacity. Intermediate storage is used by MapReduce tasks to hold intermediate results of processing. You can access an example Hadoop sizing calculator here.
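To make these factors concrete, the following is a minimal sizing sketch. The function name, parameter names, and the specific figures (60% compression ratio, 30% intermediate overhead, 500 GB/day, 90-day retention) are illustrative assumptions for this example, not values taken from the text; only the HDFS default replication factor of 3 and the 25%-35% intermediate-storage range come from standard practice and the discussion above.

```python
def raw_cluster_storage(daily_ingest_gb,
                        retention_days,
                        replication_factor=3,        # HDFS default replication
                        compression_ratio=0.6,       # assumed: textual data compresses well
                        intermediate_overhead=0.30): # within the 25%-35% range discussed
    """Rough estimate of raw HDFS capacity (in GB) for the cluster."""
    # Data actually stored on HDFS after compression
    stored = daily_ingest_gb * retention_days * compression_ratio
    # Every block is replicated across the cluster
    replicated = stored * replication_factor
    # Reserve headroom for intermediate MapReduce output
    return replicated * (1 + intermediate_overhead)

# Hypothetical example: 500 GB ingested per day, retained for 90 days
print(f"Estimated raw capacity: {raw_cluster_storage(500, 90):,.0f} GB")
```

A spiky ingest pattern such as scenario 1 would not change this storage estimate, but it would require more compute and network headroom to absorb the peaks, which is why the two scenarios size differently despite equal daily volumes.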