Obtaining a dataset

As you can imagine, one of the most important aspects of the model building process is obtaining a high-quality dataset. In the supervised learning case described earlier, the dataset is used to train the model on what the output should be, which means the dataset must be labeled; in unsupervised learning, no labels are required. A common misconception when creating a dataset is that bigger is better. In many cases, this is far from the truth. Continuing the preceding example, what if every poll respondent answered every single question the same way? At that point, your dataset is composed of identical data points, and your model will not be able to properly predict any of the other candidates. This outcome is called overfitting: the model fits the narrow data it was trained on and fails to generalize beyond it. A diverse but representative dataset is required for machine learning algorithms to properly build a production-ready model.
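The point about size versus diversity can be made concrete with a quick check of how the labels in a dataset are distributed. The following is a minimal sketch in Python, not a method prescribed by this book: the function names, the `max_share` threshold of 0.95, and the poll data are all hypothetical and chosen purely for illustration.

```python
from collections import Counter


def label_distribution(labels):
    """Return the fraction of the dataset held by each distinct label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def is_degenerate(labels, max_share=0.95):
    """Flag a dataset in which one label holds more than max_share of the rows.

    The 0.95 default is an arbitrary illustrative threshold, not a standard.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    return max(label_distribution(labels).values()) > max_share


# Hypothetical poll results: 1,000 respondents who all gave the same answer.
uniform_poll = ["candidate_a"] * 1000

# A much smaller, but more varied, hypothetical sample.
varied_poll = (
    ["candidate_a"] * 40 + ["candidate_b"] * 35 + ["candidate_c"] * 25
)

print(is_degenerate(uniform_poll))
print(is_degenerate(varied_poll))
print(label_distribution(varied_poll))
```

Under these assumptions, the larger dataset is flagged because it contains only one answer, while the smaller one is not. A check like this only looks at label balance; it says nothing about whether the sample is representative of the population you want to predict.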

In Chapter 11, Training and Building Production Models, we will take a deep dive into the methodology of obtaining quality datasets, looking at helpful resources, ways to manage your datasets, and ways to transform data, commonly referred to as data wrangling.