- Natural Language Processing with Python Quick Start Guide
- Nirant Kasliwal
- 2021-06-10 18:36:37
Getting the data
The 20 newsgroups dataset is well known in the NLP community and is near-ideal for demonstration purposes. Its documents are spread almost uniformly across 20 classes, so no single class dominates, which makes it easy to iterate rapidly on classification and clustering techniques.
We will use the famous 20 newsgroups dataset for our demonstrations as well:
from sklearn.datasets import fetch_20newsgroups  # helper that downloads and loads the dataset
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, download_if_missing=True)
twenty_test = fetch_20newsgroups(subset='test', shuffle=True, download_if_missing=True)
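The near-uniform class balance can be checked by counting how often each label occurs. The following is a minimal sketch, not part of the original text: the helper name class_distribution is hypothetical, and the toy labels stand in for an array such as twenty_train.target so that the snippet runs without downloading anything.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: fraction of samples} for a sequence of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}


# Toy labels used purely for illustration (not real dataset values).
toy_labels = [0, 1, 2, 0, 1, 2, 0, 1]
toy_distribution = class_distribution(toy_labels)
print(toy_distribution)

# With the dataset loaded, the same helper could presumably be applied as:
# class_distribution(twenty_train.target)
```

If the fractions come out roughly equal across labels, the dataset is close to balanced.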
Most modern NLP methods rely heavily on machine learning. These methods need words, which are written as strings of text, to be converted into a numerical representation. This representation can range from something as simple as a unique integer ID assigned to each word to a more comprehensive vector of float values. The latter case is sometimes referred to as vectorization.
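To make the two kinds of representation concrete, here is a small sketch in plain Python. It is an illustration under simplifying assumptions rather than the method used later in the book: tokens are produced by lowercasing and splitting on whitespace, the function names (build_vocabulary, to_ids, to_frequency_vector) are hypothetical, and the float vector is a simple relative term frequency.

```python
def build_vocabulary(documents):
    """Assign each distinct token a unique integer ID, in order of first appearance."""
    vocabulary = {}
    for document in documents:
        for token in document.lower().split():
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary


def to_ids(document, vocabulary):
    """Represent a document as a list of integer IDs, skipping unknown tokens."""
    tokens = document.lower().split()
    return [vocabulary[token] for token in tokens if token in vocabulary]


def to_frequency_vector(document, vocabulary):
    """Represent a document as a float vector of relative term frequencies."""
    vector = [0.0] * len(vocabulary)
    token_ids = to_ids(document, vocabulary)
    for token_id in token_ids:
        vector[token_id] += 1.0
    if not token_ids:
        return vector
    return [value / len(token_ids) for value in vector]


# Toy documents used purely for illustration.
toy_docs = ["the cat sat", "the dog sat down"]
vocab = build_vocabulary(toy_docs)
ids = to_ids("the dog sat", vocab)
vec = to_frequency_vector("the cat the", vocab)
print(vocab)
print(ids)
print(vec)
```

The integer IDs keep word order but carry no notion of magnitude, while the float vector has a fixed length equal to the vocabulary size, which is the shape most machine learning models expect as input.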