Pipelines

As experiments grow, so does the complexity of the operations. We may split up our dataset, binarize features, perform feature-based scaling, perform sample-based scaling, and so on.

Keeping track of these operations can get quite confusing and can make a result impossible to replicate. Problems include forgetting a step, applying a transformation incorrectly, or adding a transformation that wasn't needed.

Another issue is the order of the code. In the previous section, we created our X_transformed dataset and then created a new estimator for the cross validation. If we had multiple steps, we would need to track each of these changes to the dataset in code.
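
To see the problem, here is a minimal sketch of that manual workflow. The variable names (X_broken, y, X_transformed, transformed_scores) follow the previous section, and the import paths assume a recent scikit-learn version; treat it as an illustration rather than the exact earlier listing:

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Step 1: transform the dataset, keeping the result in a new variable
X_transformed = MinMaxScaler().fit_transform(X_broken)

# Step 2: create a fresh estimator and cross validate it on the
# transformed data. Passing X_broken here by mistake would silently
# reintroduce the unscaled features.
estimator = KNeighborsClassifier()
transformed_scores = cross_val_score(estimator, X_transformed, y, scoring='accuracy')
print("Average accuracy: {0:.1f}%".format(np.mean(transformed_scores) * 100))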

Pipelines are a construct that addresses these problems (and others, which we will see in the next chapter). Pipelines store the steps of your data mining workflow. They can take your raw data in, perform all the necessary transformations, and then create a prediction. This allows us to use pipelines in functions such as cross_val_score, which expect an estimator. First, import the Pipeline object:

from sklearn.pipeline import Pipeline

Pipelines take a list of steps as input, representing the chain of operations in the data mining workflow. The last step needs to be an estimator, while all previous steps are transformers. The input dataset is altered by each transformer, with the output of one step being the input of the next. Finally, the samples are classified by the last step's estimator. In our pipeline, we have two steps:

  1. Use MinMaxScaler to scale the feature values to the range 0 to 1
  2. Use KNeighborsClassifier as the classification algorithm

Each step is represented by a tuple of ('name', step). We can then create our pipeline:

scaling_pipeline = Pipeline([('scale', MinMaxScaler()),
                             ('predict', KNeighborsClassifier())])

The key here is the list of tuples. The first tuple is our scaling step and the second tuple is the predicting step. We give each step a name: the first we call scale and the second we call predict, but you can choose your own names. The second element of each tuple is the actual transformer or estimator object.
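
The names do more than label the steps. As a brief aside (not part of the original workflow, but standard scikit-learn behaviour), they let us address individual steps later on, for example when adjusting parameters:

# A parameter of a single step can be set using the
# '<step name>__<parameter name>' convention:
scaling_pipeline.set_params(predict__n_neighbors=10)

# The step objects themselves can also be retrieved by name:
scaler = scaling_pipeline.named_steps['scale']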

Running this pipeline is now very easy, using the cross-validation code from before:

scores = cross_val_score(scaling_pipeline, X_broken, y, scoring='accuracy')
print("The pipeline scored an average accuracy of {0:.1f}%".format(np.mean(scores) * 100))

This gives us the same score as before (82.3 percent), which is expected, as we are running exactly the same steps, just with an improved interface.
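
Because the whole workflow now lives in a single object, the pipeline can also be used like any ordinary estimator outside of cross_val_score. A short sketch, assuming we split X_broken and y ourselves (the split variables below are illustrative, not from the original text):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_broken, y, random_state=14)
scaling_pipeline.fit(X_train, y_train)     # scales the data, then fits the k-NN classifier
y_pred = scaling_pipeline.predict(X_test)  # applies the same scaling before predicting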

In later chapters, we will use more advanced testing methods, and setting up pipelines is a great way to ensure that the code complexity does not grow unmanageably.