
Classic datasets versus real-world datasets

Data scientists and machine-learning practitioners often use classic datasets to demonstrate the behavior of certain models. The Iris dataset, composed of 150 samples of three types of iris flowers, is one of the most commonly used to demonstrate or to teach predictive analytics. It has been around since 1936!

The Boston housing dataset and the Titanic dataset are other very popular datasets for predictive analytics. For text classification, the Reuters and the 20 newsgroups text datasets are very common, while image recognition datasets are used to benchmark deep learning models. These classic datasets are used to establish baselines when evaluating the performance of algorithms and models. Their characteristics are well known, and data scientists know what performance to expect.

These classic datasets can be downloaded:

However, classic datasets can be weak equivalents of real datasets, which have been extracted and aggregated from a diverse set of sources: databases, APIs, free-form documents, social networks, spreadsheets, and so on. In a real-life situation, the data scientist must often deal with messy data that has missing values, absurd outliers, human errors, weird formatting, strange inputs, and skewed distributions.
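To make these issues concrete, here is a minimal sketch of a first audit of one messy column, using only the Python standard library. The `raw_ages` values, the helper names `parse_number` and `audit_column`, and the 1.5 × IQR outlier rule are illustrative assumptions, not something prescribed by the text. In practice a library such as pandas would usually handle this step.

```python
import math
import statistics

# Hypothetical raw column, as it might arrive from a spreadsheet export:
# numbers stored as text, blanks, a placeholder string, stray whitespace,
# and one absurd value (290).
raw_ages = ["34", "29", "", "41", "n/a", "38", "290", " 45 ", "31", "36"]


def parse_number(text):
    """Return a float, or None when the cell is blank or not a finite number."""
    try:
        value = float(text.strip())
    except ValueError:
        return None
    return value if math.isfinite(value) else None


def audit_column(cells, iqr_factor=1.5):
    """Count missing cells and flag outliers with the interquartile-range rule."""
    parsed = [parse_number(cell) for cell in cells]
    values = [v for v in parsed if v is not None]
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - iqr_factor * spread, q3 + iqr_factor * spread
    return {
        "missing": len(parsed) - len(values),
        "outliers": [v for v in values if v < low or v > high],
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


report = audit_column(raw_ages)
print(report)
```

On this toy column the audit reports two missing cells and flags 290 as an outlier. The mean (68.0) sits far above the median (37.0), which is a quick sign that a single extreme value is skewing the distribution.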

The first task in a predictive analytics project is to clean up the data. In the following section, we will look at the main issues with raw data and what strategies can be applied. Since we will ultimately be using a linear model for our predictions, we will process the data with that in mind.
