
Getting the data

The 20 newsgroups dataset is well known in the NLP community and is near-ideal for demonstration purposes. Its documents are spread almost uniformly across 20 classes, so no single class dominates, which makes it easy to iterate rapidly on classification and clustering techniques.

We will use this dataset for our demonstrations as well. The following code downloads the training and test splits with scikit-learn:

from sklearn.datasets import fetch_20newsgroups  # scikit-learn helper that downloads the dataset if it is missing
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, download_if_missing=True)
twenty_test = fetch_20newsgroups(subset='test', shuffle=True, download_if_missing=True)
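
As a quick sanity check on the near-uniform class distribution described above, one could count how many documents fall into each class. The sketch below is only an illustration: `class_distribution` is a hypothetical helper, not part of scikit-learn, and the toy labels stand in for the real data so that the snippet runs without a download.

```python
from collections import Counter


def class_distribution(targets, target_names):
    """Return {class name: share of documents} for integer class labels.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    counts = Counter(int(label) for label in targets)
    total = sum(counts.values())
    return {target_names[label]: counts[label] / total for label in sorted(counts)}


# Toy labels standing in for the real dataset (assumed data, two classes only).
toy_names = ["alt.atheism", "sci.space"]
toy_targets = [0, 1, 0, 1]
toy_shares = class_distribution(toy_targets, toy_names)
print(toy_shares)  # {'alt.atheism': 0.5, 'sci.space': 0.5}
```

With the real data loaded, the same helper could be called as `class_distribution(twenty_train.target, twenty_train.target_names)`; shares that are all close to 1/20 would be consistent with a near-uniform distribution.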

Most modern NLP methods rely heavily on machine learning. These methods need words, which are written as strings of text, to be converted into a numerical representation. This representation can range from something as simple as a unique integer ID for each word to a more comprehensive vector of float values. The latter case is sometimes referred to as vectorization.
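
To make the two kinds of representation concrete, here is a minimal sketch in plain Python. It assumes naive whitespace tokenization and a tiny made-up corpus. The helper names (`build_vocabulary`, `to_ids`, `to_frequency_vector`) are hypothetical, and the float vector shown is a simple relative term frequency, which is only one of many possible vectorizations.

```python
def build_vocabulary(documents):
    """Assign each distinct lowercase token a unique integer ID, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for token in doc.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def to_ids(document, vocab):
    """Represent a document as a list of integer IDs, skipping unknown tokens."""
    return [vocab[t] for t in document.lower().split() if t in vocab]


def to_frequency_vector(document, vocab):
    """Represent a document as a vector of floats: the relative frequency of each vocabulary token."""
    tokens = [t for t in document.lower().split() if t in vocab]
    vector = [0.0] * len(vocab)
    for t in tokens:
        vector[vocab[t]] += 1.0
    total = len(tokens)
    return [v / total for v in vector] if total else vector


# Tiny made-up corpus (assumed data for illustration only).
docs = ["the cat sat", "the dog sat down"]
vocab = build_vocabulary(docs)
ids = to_ids("the dog sat", vocab)
vec = to_frequency_vector("the dog the dog", vocab)

print(vocab)  # {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3, 'down': 4}
print(ids)    # [0, 3, 2]
print(vec)    # [0.5, 0.0, 0.0, 0.5, 0.0]
```

The integer IDs preserve word order but carry no notion of magnitude or similarity, whereas the float vector has a fixed length equal to the vocabulary size and can be fed directly to most machine learning models.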
