- Effective Amazon Machine Learning
- Alexis Perrier
- 2021-07-03 00:17:50
Classic datasets versus real-world datasets
Data scientists and machine-learning practitioners often use classic datasets to demonstrate the behavior of certain models. The Iris dataset, composed of 150 samples of three types of iris flowers, is one of the most commonly used to demonstrate or to teach predictive analytics. It has been around since 1936!
The Boston housing dataset and the Titanic dataset are other very popular datasets for predictive analytics. For text classification, the Reuters and 20 newsgroups datasets are very common, while image recognition datasets are used to benchmark deep learning models. These classic datasets are used to establish baselines when evaluating the performance of algorithms and models. Their characteristics are well known, and data scientists know what level of performance to expect.
These classic datasets can be downloaded from the following locations:
- Iris: http://archive.ics.uci.edu/ml/datasets/Iris
- Boston housing: https://archive.ics.uci.edu/ml/datasets/Housing
- Titanic dataset: https://www.kaggle.com/c/titanic or http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/
- Reuters: https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection
- 20 newsgroups: http://scikit-learn.org/stable/datasets/twenty_newsgroups.html
- Image recognition and deep learning: http://deeplearning.net/datasets/
However, classic datasets can be weak stand-ins for real datasets, which are extracted and aggregated from a diverse set of sources: databases, APIs, free-form documents, social networks, spreadsheets, and so on. In a real-life situation, the data scientist must often deal with messy data that has missing values, absurd outliers, human errors, inconsistent formatting, unexpected inputs, and skewed distributions.
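To make these problems concrete, here is a minimal sketch of a first inspection pass over one messy column, using only the Python standard library. The `raw_ages` values are invented for illustration, the set of missing-value markers is an assumption, and the 1.5 × IQR rule is just one common convention for flagging outliers; none of this comes from a real dataset or from Amazon ML itself.

```python
import statistics

# Hypothetical raw column, invented for illustration only: it mixes
# blank entries, a textual "n/a", stray whitespace, and an absurd value.
raw_ages = ["34", "29", "", "41", "n/a", "38", "250", " 27 ", "33", "31"]

# Assumed markers for missing data; real sources may use others.
MISSING_MARKERS = {"", "n/a", "na", "null", "none"}

def parse_column(values):
    """Split raw strings into parsed floats and a count of missing entries."""
    parsed, missing = [], 0
    for value in values:
        text = value.strip().lower()
        if text in MISSING_MARKERS:
            missing += 1
        else:
            parsed.append(float(text))
    return parsed, missing

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)

ages, n_missing = parse_column(raw_ages)
outliers = iqr_outliers(ages)
print(f"missing: {n_missing}, outliers: {outliers}, skew: {skewness(ages):.2f}")
```

On this toy column the pass counts two missing entries and flags the value 250 as an outlier; the strongly positive skewness is driven by that single value, which illustrates how one bad record can distort a distribution.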
The first task in a predictive analytics project is to clean up the data. In the following section, we will look at the main issues with raw data and what strategies can be applied. Since we will ultimately be using a linear model for our predictions, we will process the data with that in mind.