Loading external datasets in Python

Thanks to the SciPy community, there are tons of resources out there for getting our hands on some data.

A particularly useful resource comes in the form of the sklearn.datasets package of scikit-learn. This package comes preinstalled with some small datasets that do not require us to download any files from external websites. These datasets include the following:

  • load_boston: The Boston dataset contains housing prices in different suburbs of Boston along with a number of interesting features such as per capita crime rate by town, proportion of residential land, non-retail business, and so on
  • load_iris: The Iris dataset contains three different types of iris flowers (setosa, versicolor, and virginica), along with four features describing the width and length of the sepals and petals
  • load_diabetes: The diabetes dataset lets us predict a quantitative measure of disease progression one year after baseline, based on features such as patient age, sex, body mass index, average blood pressure, and six blood serum measurements
  • load_digits: The digits dataset contains 8 x 8 pixel images of digits 0-9
  • load_linnerud: The Linnerud dataset contains three physiological and three exercise variables measured on twenty middle-aged men in a fitness club
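
As a minimal sketch of how these built-in loaders can be used, the following loads the Iris dataset and prints its basic layout. It assumes only that scikit-learn is installed; the variable name iris is chosen for illustration.

```python
from sklearn import datasets

# Load the Iris dataset that ships with scikit-learn (no download needed).
iris = datasets.load_iris()

# Features and labels live in separate attributes of the returned object.
print(iris.data.shape)      # one row per flower, one column per feature
print(iris.target.shape)    # one integer label per flower
print(iris.feature_names)   # names of the four sepal/petal measurements
print(iris.target_names)    # names of the three iris species
```

The other load_* functions listed above return objects with the same data and target attributes, so the same pattern should carry over to them.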

In addition, scikit-learn allows us to download datasets directly from external repositories, such as the following:

  • fetch_olivetti_faces: The Olivetti face dataset contains ten different images each of 40 distinct subjects
  • fetch_20newsgroups: The 20 newsgroup dataset contains around 18,000 newsgroups posts on 20 topics

Even better, it is possible to download datasets directly from the machine learning database at http://mldata.org. For example, to download the MNIST dataset of handwritten digits, type the following:

In [1]: from sklearn import datasets
In [2]: mnist = datasets.fetch_mldata('MNIST original')
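
The fetch_mldata function is not available in recent scikit-learn releases (it was removed in version 0.22). The following is a minimal sketch of one possible alternative, not the method used in this text. It assumes a scikit-learn version that provides fetch_openml with the as_frame parameter, and that the OpenML dataset named 'mnist_784' is reachable. The helper names load_mnist_openml and labels_to_int are hypothetical.

```python
import numpy as np
from sklearn.datasets import fetch_openml


def load_mnist_openml():
    """Download MNIST from OpenML (needs an internet connection)."""
    # as_frame=False asks for NumPy arrays instead of a pandas DataFrame.
    return fetch_openml('mnist_784', version=1, as_frame=False)


def labels_to_int(target):
    """Convert string labels such as '5' into integers."""
    # Assumption: OpenML delivers the MNIST labels as strings.
    return np.asarray(target).astype(np.int64)
```

Calling mnist = load_mnist_openml() should return a container with data and target attributes, and labels_to_int(mnist.target) then yields numeric labels.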

Note that this might take a while, depending on your internet connection. The MNIST database contains a total of 70,000 examples of handwritten digits (28 x 28 pixel images, labeled from 0 to 9). Data and labels are delivered in two separate containers, which we can inspect as follows:

In [3]: mnist.data.shape
Out[3]: (70000, 784)
In [4]: mnist.target.shape
Out[4]: (70000,)

Here, we can see that mnist.data contains 70,000 images of 28 x 28 = 784 pixels each. Labels are stored in mnist.target, where there is only one label per image.
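
The relationship between a flattened row and its two-dimensional image can be sketched without any download by using the smaller built-in digits dataset as a stand-in. This is an illustration under that assumption: the digits images are 8 x 8 = 64 pixels rather than 28 x 28 = 784, and the variable names are chosen for this example.

```python
from sklearn import datasets

digits = datasets.load_digits()

# Each row of digits.data is one flattened 8 x 8 image (64 pixel values).
n_samples, n_pixels = digits.data.shape

# Reshape the first row back into its two-dimensional form.
first_image = digits.data[0].reshape(8, 8)

# scikit-learn also stores the unflattened images in digits.images.
print(n_samples, n_pixels)
print((first_image == digits.images[0]).all())
```

For MNIST, the same idea would apply with reshape(28, 28) on a row of mnist.data.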

We can further inspect the values of all targets, but we don't want to print all 70,000 of them. Instead, we are interested in seeing the distinct target values, which is easy to do with NumPy:

In [5]: import numpy as np
In [6]: np.unique(mnist.target)
Out[6]: array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
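
If we also want to know how often each label occurs, np.unique can report that as well. The sketch below uses the built-in digits dataset as an offline stand-in for MNIST, which is an assumption made for this example only.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# return_counts=True also reports how often each distinct value occurs.
labels, counts = np.unique(digits.target, return_counts=True)

for label, count in zip(labels, counts):
    print(label, count)
```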

Another Python library for data analysis that you should have heard about is Pandas (http://pandas.pydata.org). Pandas implements a number of powerful data operations for both databases and spreadsheets. As great as the library is, Pandas is a bit too advanced for our purposes at this point.