Getting the data

The 20 newsgroups dataset is well known in the NLP community and is near-ideal for demonstration purposes. Its documents are distributed almost uniformly across 20 classes, which makes it easy to iterate rapidly on classification and clustering techniques.

We will use the 20 newsgroups dataset for our demonstrations as well:

from sklearn.datasets import fetch_20newsgroups  # helper that downloads and loads the dataset
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, download_if_missing=True)
twenty_test = fetch_20newsgroups(subset='test', shuffle=True, download_if_missing=True)
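
One quick way to check the near-uniform class distribution is to count the documents per class. The following is a minimal sketch: the helper name class_counts and the toy labels are hypothetical and only stand in for the real label array, so that the snippet runs without downloading anything.

```python
from collections import Counter


def class_counts(targets, target_names):
    """Return {class name: number of documents} for integer-coded labels."""
    counts = Counter(int(t) for t in targets)
    return {name: counts.get(i, 0) for i, name in enumerate(target_names)}


# Toy labels (hypothetical) standing in for the real label array and class names
toy_targets = [0, 1, 2, 0, 1, 2, 0]
toy_names = ["alt.atheism", "comp.graphics", "sci.space"]

toy_counts = class_counts(toy_targets, toy_names)
print(toy_counts)
```

With the real data loaded, calling class_counts(twenty_train.target, twenty_train.target_names) should give one count per newsgroup, and the counts should be of broadly similar size if the distribution is as uniform as described.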

Most modern NLP methods rely heavily on machine learning. These methods need words, which are written as strings of text, to be converted into a numerical representation. This representation can range from something as simple as a unique integer ID assigned to each word to a slightly more comprehensive vector of float values. The latter case is sometimes referred to as vectorization.
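
To make the two kinds of representation concrete, here is a minimal sketch in plain Python. It assumes whitespace tokenization and lowercasing, and it uses a bag-of-words count vector as one simple example of a float vector (not a learned per-word embedding). The function names and the toy sentences are hypothetical, chosen only for illustration.

```python
def build_vocab(docs):
    """Assign each distinct token a unique integer ID, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def to_ids(doc, vocab):
    """Represent a text as a list of integer IDs; unknown tokens are skipped."""
    return [vocab[t] for t in doc.lower().split() if t in vocab]


def to_count_vector(doc, vocab):
    """Represent a text as a float vector holding the count of each vocabulary token."""
    vec = [0.0] * len(vocab)
    for t in doc.lower().split():
        if t in vocab:
            vec[vocab[t]] += 1.0
    return vec


docs = ["the cat sat", "the dog sat down"]
vocab = build_vocab(docs)

ids = to_ids("the dog sat", vocab)
vec = to_count_vector("the dog sat", vocab)
print(vocab)
print(ids)
print(vec)
```

The integer IDs preserve word order but carry no notion of similarity, while the count vector has a fixed length (the vocabulary size) that most machine learning models can consume directly.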
