Getting the data

The 20 newsgroups dataset is well known in the NLP community and is near-ideal for demonstration purposes. Its documents are spread almost uniformly across 20 classes, which makes it easy to iterate rapidly on classification and clustering techniques.

We will use this dataset for our demonstrations as well, loading the training and test splits with scikit-learn:

from sklearn.datasets import fetch_20newsgroups  # downloads the dataset on first use and caches it locally
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, download_if_missing=True)
twenty_test = fetch_20newsgroups(subset='test', shuffle=True, download_if_missing=True)
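To check the near-uniform class distribution for yourself, you can count the documents per class. The following is a minimal sketch: `class_counts` is a hypothetical helper (not part of scikit-learn), and the toy labels and the three class names below only stand in for the real data so that the snippet runs without a download.

```python
from collections import Counter


def class_counts(targets, target_names):
    """Map each class name to the number of documents carrying that label.

    `targets` holds integer labels, where label i refers to target_names[i].
    """
    counts = Counter(int(t) for t in targets)
    return {name: counts.get(i, 0) for i, name in enumerate(target_names)}


# Toy labels standing in for twenty_train.target (hypothetical values).
toy_targets = [0, 1, 2, 0, 1, 2, 0]
toy_names = ["alt.atheism", "comp.graphics", "sci.space"]

toy_counts = class_counts(toy_targets, toy_names)
print(toy_counts)
```

With the dataset loaded as above, the same helper can be called as `class_counts(twenty_train.target, twenty_train.target_names)`, since the object returned by `fetch_20newsgroups` exposes the integer labels as `target` and the class names as `target_names`.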

Most modern NLP methods rely heavily on machine learning, and machine learning methods need words written as strings of text to be converted into a numerical representation. This representation can be as simple as a unique integer ID assigned to each word, or it can be a somewhat richer vector of float values. Producing the latter is sometimes referred to as vectorization.
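The two kinds of representation can be illustrated with a small sketch in plain Python. The two example sentences, the whitespace tokenization, and the helper names `to_ids` and `to_vector` are assumptions made for illustration. The float vector here is a simple relative term frequency, which is only one of many possible vectorization schemes.

```python
from collections import Counter

# Two toy documents (hypothetical examples, not taken from the dataset).
docs = ["the cat sat", "the dog sat down"]

# Integer IDs: each distinct word gets a unique index,
# assigned in order of first appearance.
vocab = {}
for doc in docs:
    for word in doc.split():
        vocab.setdefault(word, len(vocab))


def to_ids(doc):
    """Represent a document as a list of integer word IDs."""
    return [vocab[word] for word in doc.split()]


def to_vector(doc):
    """Represent a document as a vector of floats: the relative
    frequency of every vocabulary word within that document."""
    counts = Counter(doc.split())
    total = sum(counts.values())
    return [counts.get(word, 0) / total for word in vocab]


ids = to_ids("the dog sat down")
vector = to_vector("the cat sat")
print(vocab)
print(ids)
print(vector)
```

The ID list keeps word order but varies in length with the document, whereas the float vector always has one entry per vocabulary word, which is the fixed-size input that most machine learning models expect.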
