  • Deep Learning for Beginners
  • Dr. Pablo Rivas Laura Montoya
  • 411 words
  • 2021-06-11 18:20:17

Altering the distribution of data

It has been demonstrated that changing the distribution of the targets, particularly in the case of regression, can improve the performance of a learning algorithm (Andrews, D. F., et al. (1971)).

Here, we'll discuss one particularly useful transformation known as Quantile Transformation. This method remaps the data so that its histogram follows either a normal distribution or a uniform distribution. It achieves this by estimating the quantiles of each variable and mapping them onto the corresponding quantiles of the target distribution.
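To make the idea concrete, here is a minimal sketch of a rank-based quantile mapping onto a normal distribution. The function name quantile_to_normal and the synthetic exponential data are assumptions made for illustration. This is not how scikit-learn implements the transformation internally; it only shows the principle of going from a value, to its empirical quantile level, to the matching quantile of the target distribution.

```python
import numpy as np
from statistics import NormalDist

def quantile_to_normal(values):
    """Map a 1-D array to approximately standard-normal values via ranks.

    Hypothetical helper for illustration only.
    """
    values = np.asarray(values, dtype=float)
    n = len(values)
    # Rank of each value (0 .. n-1); ties are broken arbitrarily in this sketch.
    ranks = np.argsort(np.argsort(values))
    # Empirical quantile level of each value, kept strictly inside (0, 1).
    levels = (ranks + 0.5) / n
    # The inverse CDF of the standard normal turns a level into a quantile.
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(p) for p in levels])

# Synthetic, heavily skewed data standing in for a real feature.
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=1000)
gaussianized = quantile_to_normal(skewed)
```

The mapping is monotonic, so the ordering of the samples is preserved; only the spacing between them changes.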

We can use the following commands to transform the same data as in the previous section:

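from sklearn.preprocessing import QuantileTransformer
# Map each selected column onto a normal distribution
transformer = QuantileTransformer(output_distribution='normal')
# Columns 4 and 9 hold the variables x5 and x10
df[[4,9]] = transformer.fit_transform(df[[4,9]])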

This will effectively map the data into a new distribution, namely, a normal distribution. 

Here, the term normal distribution refers to a Gaussian-like probability density function (PDF). This is a classic distribution found in any statistics textbook. It is usually identified by its bell-like shape when plotted.

Note that we are also using the fit_transform() method, which conveniently performs both fit() and transform() in a single call.
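The equivalence between the two-step and one-step calls can be checked with a short, self-contained sketch. The DataFrame here is randomly generated stand-in data, not the dataset used in the text, and n_quantiles=100 is an assumed setting chosen so that it does not exceed the number of rows.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

# Hypothetical stand-in data: 500 rows, 10 integer-labelled columns (0..9).
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.lognormal(size=(500, 10)))

# Two steps: learn the quantiles first, then apply the mapping.
separate = QuantileTransformer(n_quantiles=100, output_distribution='normal')
separate.fit(df[[4, 9]])
two_step = separate.transform(df[[4, 9]])

# One step: fit_transform() does both on the same data.
combined = QuantileTransformer(n_quantiles=100, output_distribution='normal')
one_step = combined.fit_transform(df[[4, 9]])

# Both routes should produce the same transformed values.
same_result = np.allclose(two_step, one_step)
```

Keeping fit() and transform() separate becomes useful when the quantiles are learned on one set of data and then applied to another.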

As can be seen in Figure 3.5, the variable related to cholesterol data, x5, was easily transformed into a normal distribution with a bell shape. However, for x10, the heavy presence of data in a particular region causes the distribution to have a bell shape, but with a long tail, which is not ideal:

Figure 3.5 – Scatter plot of the normally transformed columns x5 and x10 and their corresponding Gaussian-like histograms

The process of transforming the data to a uniform distribution is very similar. We only need to change one argument in the call to the QuantileTransformer() constructor, as follows:

transformer = QuantileTransformer(output_distribution='uniform')

Now, the data is transformed into a uniform distribution, as shown in Figure 3.6:

Figure 3.6 – Scatter plot of the uniformly transformed columns x5 and x10 and their corresponding uniform histograms

From the figure, we can see that the data has been uniformly distributed across each variable. Once again, the clustering of data in a particular region has the effect of causing a large concentration of values in the same space, which is not ideal. This artifact also creates a gap in the distribution of the data that is usually difficult to handle, unless we use techniques to augment the data, which we'll discuss next.
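The clustering effect described above can be reproduced with a small sketch on synthetic data. The column below is an assumption made for illustration: 400 identical values (the cluster) followed by 600 continuous values. It is not the dataset shown in the figures.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Synthetic column: a cluster of 400 identical values plus 600 spread-out ones.
rng = np.random.default_rng(0)
column = np.concatenate([np.zeros(400), rng.exponential(size=600)])
X = column.reshape(-1, 1)  # scikit-learn expects a 2-D array

transformer = QuantileTransformer(n_quantiles=100, output_distribution='uniform')
uniform = transformer.fit_transform(X).ravel()

# All clustered inputs are identical, so they land on one output value.
cluster_outputs = np.unique(uniform[:400])

# Count the transformed values that fall inside part of the resulting gap.
in_gap = np.sum((uniform > 0.05) & (uniform < 0.3))
```

In this sketch, the cluster collapses to a single point of the output range and a stretch of the [0, 1] interval is left empty, which mirrors the concentration and gap seen in the histograms.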
