Getting the data

The 20 newsgroups dataset is well known in the NLP community and is near-ideal for demonstration purposes. Its documents are spread almost evenly across 20 classes, so no single class dominates the results. This makes it easy to iterate rapidly on classification and clustering techniques.

We will use the 20 newsgroups dataset for our demonstrations as well. The following code fetches its training and test subsets:

# fetch_20newsgroups downloads the dataset on first use and caches it locally
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, download_if_missing=True)
twenty_test = fetch_20newsgroups(subset='test', shuffle=True, download_if_missing=True)
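
One way to check the near-uniform class distribution mentioned earlier is to count the documents in each class. The following is a minimal sketch, not part of the original text: `class_counts` is a hypothetical helper, and the toy labels are made up for illustration so that the snippet runs without downloading anything.

```python
from collections import Counter


def class_counts(targets, target_names):
    """Return {class name: number of documents} for integer class labels."""
    counts = Counter(int(t) for t in targets)
    return {name: counts.get(i, 0) for i, name in enumerate(target_names)}


# Toy labels, made up for illustration only (not the real dataset).
toy_names = ["alt.atheism", "sci.space", "rec.autos"]
toy_targets = [0, 1, 2, 1, 0, 2]

toy_counts = class_counts(toy_targets, toy_names)
print(toy_counts)
```

With the data loaded above, calling `class_counts(twenty_train.target, twenty_train.target_names)` should give the per-newsgroup counts for the training subset.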

Most modern NLP methods rely heavily on machine learning. These methods need words, which are written as strings of text, to be converted into a numerical representation. This representation can range from something as simple as a unique integer ID for each word to a more comprehensive vector of float values. The latter case is sometimes referred to as vectorization.
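
To make the two ends of that range concrete, here is a small sketch in plain Python. It is an illustration under simplifying assumptions (lowercasing and whitespace tokenization), not the method used later in the text. The helper names `build_vocab`, `to_ids`, and `to_count_vector` are hypothetical, and the two example sentences are made up.

```python
def build_vocab(docs):
    """Assign each distinct lowercase token a unique integer ID,
    in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def to_ids(doc, vocab):
    """Represent a document as a list of integer token IDs.
    Tokens missing from the vocabulary are skipped."""
    return [vocab[t] for t in doc.lower().split() if t in vocab]


def to_count_vector(doc, vocab):
    """Represent a document as a float vector of token counts,
    with one position per vocabulary entry."""
    vec = [0.0] * len(vocab)
    for i in to_ids(doc, vocab):
        vec[i] += 1.0
    return vec


# Made-up example sentences, for illustration only.
docs = ["the cat sat", "the dog sat down"]
vocab = build_vocab(docs)

ids = to_ids("the dog sat", vocab)
vec = to_count_vector("the dog sat", vocab)
print(vocab)
print(ids)
print(vec)
```

The integer IDs only identify words, while the count vector has a fixed length equal to the vocabulary size, which is the shape most machine learning methods expect as input.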