- Natural Language Processing with Python Quick Start Guide
- Nirant Kasliwal
- 2021-06-10 18:36:37
Getting the data
The 20 newsgroups dataset is well known in the NLP community and is near-ideal for demonstration purposes. Its documents are spread almost uniformly across 20 classes, which makes it easy to iterate rapidly on classification and clustering techniques.
We will use this dataset for our demonstrations as well:
from sklearn.datasets import fetch_20newsgroups # downloads and caches the dataset on first use
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, download_if_missing=True)
twenty_test = fetch_20newsgroups(subset='test', shuffle=True, download_if_missing=True)
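The near-uniform class distribution mentioned above can be checked by counting labels. The following is a minimal sketch, not code from the book: `class_shares` is a hypothetical helper, and the toy label list is made-up data standing in for `twenty_train.target`.

```python
from collections import Counter

def class_shares(labels):
    """Return {label: fraction of samples} for any iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy labels (an assumption for illustration, not real dataset values).
toy_labels = [0, 1, 2, 0, 1, 2, 0, 1]
shares = class_shares(toy_labels)
print(shares)
```

With the dataset loaded, passing `twenty_train.target` to the same helper should give one share per class; shares that are close to each other indicate a balanced dataset.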
Most modern NLP methods rely heavily on machine learning. These methods need words, which are written as strings of text, to be converted into a numerical representation. This representation can range from something as simple as a unique integer ID for each word to a more comprehensive vector of float values. The latter conversion is sometimes referred to as vectorization.
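The two kinds of numerical representation described above can be sketched in plain Python. This is an illustrative sketch under simple assumptions (lowercasing and whitespace tokenization); the helper names `build_vocabulary`, `to_ids`, and `to_count_vector` are hypothetical and not taken from the book.

```python
def build_vocabulary(documents):
    """Assign a unique integer ID to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for token in doc.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_ids(doc, vocab):
    """Simplest representation: the sequence of integer IDs (unknown tokens are skipped)."""
    return [vocab[t] for t in doc.lower().split() if t in vocab]

def to_count_vector(doc, vocab):
    """Vector of floats: one slot per vocabulary entry, holding that token's count."""
    vector = [0.0] * len(vocab)
    for t in doc.lower().split():
        if t in vocab:
            vector[vocab[t]] += 1.0
    return vector

docs = ["the cat sat", "the dog sat down"]
vocab = build_vocabulary(docs)
ids = to_ids("the dog sat", vocab)
vector = to_count_vector("the dog sat", vocab)
print(vocab, ids, vector)
```

In practice, a library class such as scikit-learn's `CountVectorizer` would usually take the place of these hand-written helpers; the sketch only shows what the conversion amounts to.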