
Engineering new features

In the previous few examples, we saw that changing the features can have quite a large impact on the performance of the algorithm. Even in our limited testing, performance varied by more than 10 percent depending only on which features we used.

You can create features that come from a simple function in pandas by doing something like this:

dataset["New Feature"] = feature_creator()

The feature_creator function must return a list (or another sequence, such as a pandas Series) containing the feature's value for each sample in the dataset, in the same order as the rows. A common pattern is to use the dataset as a parameter:

dataset["New Feature"] = feature_creator(dataset)
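As a rough illustration of this pattern, the sketch below builds a tiny dataset and a feature_creator function that returns one value per row. The team names, the last_season_wins lookup table, and the HomeTeamWonMoreLastSeason column name are all made up for the example; only the "Home Team" and "Visitor Team" column names come from this chapter.

```python
import pandas as pd

# Toy data: the column names follow the chapter, the values are invented.
dataset = pd.DataFrame({
    "Home Team": ["Hawks", "Bulls", "Celtics"],
    "Visitor Team": ["Bulls", "Celtics", "Hawks"],
})

# Hypothetical lookup table: games won by each team in the previous season.
last_season_wins = {"Hawks": 38, "Bulls": 48, "Celtics": 25}


def feature_creator(dataset):
    """Return one value per row: did the home team win more games last season?"""
    home_wins = dataset["Home Team"].map(last_season_wins)
    visitor_wins = dataset["Visitor Team"].map(last_season_wins)
    # Convert the boolean Series to a plain list, one entry per sample.
    return (home_wins > visitor_wins).tolist()


dataset["HomeTeamWonMoreLastSeason"] = feature_creator(dataset)
print(dataset)
```

Because the function works on whole columns rather than on individual rows, it avoids an explicit Python loop, which is usually faster on larger datasets.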

You can create those features more directly by setting all the values to a single default value, like 0 in the next line:

dataset["My New Feature"] = 0

You can then iterate over the dataset, computing the features as you go. We used this format in this chapter to create many of our features:

for index, row in dataset.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    # Some calculation here to compute feature_value for this row
    # DataFrame.set_value was removed in pandas 1.0; .at sets a single cell
    dataset.at[index, "FeatureName"] = feature_value

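Keep in mind that this pattern isn't very efficient, because iterating over rows in Python is slow. If you are going to do this, compute all of your features in the same loop rather than looping over the dataset once per feature.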

A common best practice is to touch every sample as little as possible, preferably only once.
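One way to touch each sample only once is to collect several features in plain lists during a single loop and assign them as columns afterwards. The sketch below computes HomeLastWin and VisitorLastWin together in one pass. It assumes a boolean HomeWin column that records the outcome of each game and assumes the rows are already in chronological order; the data itself is invented.

```python
from collections import defaultdict

import pandas as pd

# Toy data, assumed to be sorted by date; HomeWin is an assumed outcome column.
dataset = pd.DataFrame({
    "Home Team": ["Hawks", "Bulls", "Hawks", "Celtics"],
    "Visitor Team": ["Bulls", "Celtics", "Celtics", "Bulls"],
    "HomeWin": [True, False, True, True],
})

# team -> whether it won its previous game (False before its first game)
won_last = defaultdict(bool)
home_last_win = []
visitor_last_win = []

# A single pass over the rows fills both feature lists.
for home_team, visitor_team, home_win in zip(
        dataset["Home Team"], dataset["Visitor Team"], dataset["HomeWin"]):
    # Read the state before this game, so the feature never sees the result.
    home_last_win.append(won_last[home_team])
    visitor_last_win.append(won_last[visitor_team])
    # Update the state for the next time these teams play.
    won_last[home_team] = bool(home_win)
    won_last[visitor_team] = not home_win

# Assign whole columns at once instead of writing cell by cell.
dataset["HomeLastWin"] = home_last_win
dataset["VisitorLastWin"] = visitor_last_win
print(dataset)
```

Building lists and assigning them as columns at the end also avoids repeated single-cell writes to the DataFrame, which tend to be slow.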

Some example features that you could try to implement are as follows:

  • How many days has it been since each team's previous match? Teams may be tired if they play too many games in a short time frame.
  • How many games of the last five did each team win? This will give a more stable form of the HomeLastWin and VisitorLastWin features we extracted earlier (and can be extracted in a very similar way).
  • Do teams have a good record when visiting certain other teams? For instance, one team may play well in a particular stadium, even if they are the visitors.
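As a starting point for the first of these ideas, the following sketch computes the number of days since each team's previous match. It assumes the dataset has a Date column holding datetime values; the HomeDaysRest and VisitorDaysRest column names and the sample data are invented for the example.

```python
import pandas as pd

# Toy data; the Date column is an assumption about how the dataset is laid out.
dataset = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2013-10-29", "2013-10-30", "2013-11-01", "2013-11-02"]),
    "Home Team": ["Hawks", "Bulls", "Hawks", "Celtics"],
    "Visitor Team": ["Bulls", "Celtics", "Celtics", "Bulls"],
})
# The calculation only makes sense if games are processed in date order.
dataset = dataset.sort_values("Date").reset_index(drop=True)

last_played = {}  # team -> date of its most recent game seen so far
home_rest = []
visitor_rest = []


def days_since(team, date):
    """Days since the team's previous game, or None if it has not played yet."""
    previous = last_played.get(team)
    return None if previous is None else (date - previous).days


for date, home_team, visitor_team in zip(
        dataset["Date"], dataset["Home Team"], dataset["Visitor Team"]):
    home_rest.append(days_since(home_team, date))
    visitor_rest.append(days_since(visitor_team, date))
    last_played[home_team] = date
    last_played[visitor_team] = date

# Missing values (a team's first game) end up as missing entries in the column.
dataset["HomeDaysRest"] = home_rest
dataset["VisitorDaysRest"] = visitor_rest
print(dataset)
```

A team's first game has no previous match, so you will need to decide how to fill those missing values (for example, with a large default number of rest days) before passing the feature to a classifier.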

If you are having trouble extracting features of these types, check the pandas documentation at http://pandas.pydata.org/pandas-docs/stable/ for help. Alternatively, you can try an online forum such as Stack Overflow for assistance.

More extreme examples could use player data to estimate the strength of each team's lineup and predict who will win. These types of complex features are used every day by gamblers and sports betting agencies to try to turn a profit by predicting the outcome of sports matches.
