
Standard pre-processing

The pre-processing we will perform for this experiment is called feature-based normalization, which we perform using scikit-learn's MinMaxScaler class. Continuing with the Jupyter Notebook from the rest of this chapter, first, we import this class:

from sklearn.preprocessing import MinMaxScaler

This class takes each feature and scales it to the range 0 to 1. This pre-processor replaces the minimum value with 0, the maximum with 1, and the other values somewhere in between based on a linear mapping.
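As a rough illustration of that linear mapping, the sketch below applies the formula (x - min) / (max - min) to each column by hand and compares the result with MinMaxScaler. The array X_demo is a made-up toy dataset, not the dataset used in this chapter, and the hand-written version assumes no column is constant (otherwise it would divide by zero).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: 3 samples, 2 features (not the chapter's dataset)
X_demo = np.array([[1.0, 10.0],
                   [2.0, 30.0],
                   [4.0, 20.0]])

# Per-column minimum and maximum
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)

# Linear mapping: minimum -> 0, maximum -> 1, everything else in between
X_manual = (X_demo - col_min) / (col_max - col_min)

# The same scaling using scikit-learn
X_scaled = MinMaxScaler().fit_transform(X_demo)

print(np.allclose(X_manual, X_scaled))
```

In the second column, for example, 20 sits halfway between the minimum of 10 and the maximum of 30, so it is mapped to 0.5.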

To apply our pre-processor, we call its transform function on our data. Transformers often need to be trained first, in the same way that classifiers do. We can combine these two steps by calling the fit_transform function instead:

X_transformed = MinMaxScaler().fit_transform(X)

Here, X_transformed will have the same shape as X. However, each column will have a maximum of 1 and a minimum of 0.
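To make the relationship between training and transforming more concrete, here is a minimal sketch that performs the two steps separately and checks that the result matches fit_transform. The random array X_demo is only a hypothetical stand-in for X, and the seed value is arbitrary.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for X: 6 samples, 3 features with values in [-5, 5)
rng = np.random.RandomState(14)
X_demo = rng.uniform(-5, 5, size=(6, 3))

# Two steps: learn each column's minimum and maximum, then apply the scaling
scaler = MinMaxScaler()
scaler.fit(X_demo)
X_two_step = scaler.transform(X_demo)

# One step: fit and transform together
X_one_step = MinMaxScaler().fit_transform(X_demo)

print(np.allclose(X_two_step, X_one_step))  # both routes give the same values
print(X_two_step.shape == X_demo.shape)     # shape is unchanged
print(X_two_step.min(axis=0))               # per-column minimums
print(X_two_step.max(axis=0))               # per-column maximums
```

Keeping the fitted scaler object around is useful when the same scaling later needs to be applied to other data, since transform reuses the minimums and maximums learned during fit.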

There are various other forms of normalization that work in a similar way and are effective for other applications and feature types:

  • Ensure the sum of the absolute values for each sample equals 1, using sklearn.preprocessing.Normalizer with norm='l1' (the default, norm='l2', instead scales each sample to unit Euclidean length)
  • Force each feature to have a zero mean and a variance of 1, using sklearn.preprocessing.StandardScaler, which is a commonly used starting point for normalization
  • Turn numerical features into binary features, where any value above a threshold is 1 and any value at or below it is 0, using sklearn.preprocessing.Binarizer
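The three pre-processors listed above can be tried out with a short sketch like the following. The array X_demo and the threshold of 1.5 are made up for illustration. Normalizer is given norm="l1" here so that each row sums to 1; its default, "l2", gives each row unit Euclidean length instead.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler, Binarizer

# Hypothetical toy data: 3 samples, 2 non-negative features
X_demo = np.array([[1.0, 3.0],
                   [2.0, 2.0],
                   [3.0, 1.0]])

# Per-sample scaling: each row is divided by the sum of its absolute values
X_l1 = Normalizer(norm="l1").fit_transform(X_demo)

# Per-feature scaling: each column ends up with zero mean and unit variance
X_std = StandardScaler().fit_transform(X_demo)

# Thresholding: values above 1.5 become 1, all others become 0
X_bin = Binarizer(threshold=1.5).fit_transform(X_demo)

print(X_l1.sum(axis=1))    # row sums
print(X_std.mean(axis=0))  # column means
print(X_std.std(axis=0))   # column standard deviations
print(X_bin)
```

Note that Normalizer works across each sample (row), while StandardScaler and MinMaxScaler work down each feature (column).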

We will use combinations of these pre-processors in later chapters, along with other types of Transformer objects.

Pre-processing is a critical step in the data mining pipeline, and one that can mean the difference between a bad result and a great one.
