- Natural Language Processing with Python Quick Start Guide
- Nirant Kasliwal
- 141 words
- 2021-06-10 18:36:37
Getting the data
The 20 newsgroups dataset is well known in the NLP community and is near-ideal for demonstration purposes. Its documents are spread almost uniformly across 20 classes, which makes it easy to iterate rapidly on classification and clustering techniques.
We will use it for our demonstrations as well:
from sklearn.datasets import fetch_20newsgroups # import packages which help us download dataset
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, download_if_missing=True)
twenty_test = fetch_20newsgroups(subset='test', shuffle=True, download_if_missing=True)
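The near-uniform class distribution can be checked directly from the integer labels. The following is a minimal sketch, not part of the original text: the helper name `class_shares` is hypothetical, and the toy labels are made-up stand-ins so that the block runs without downloading anything.

```python
from collections import Counter


def class_shares(targets, target_names):
    """Return {class name: fraction of documents} for integer class labels.

    `targets` is an iterable of integer class indices and `target_names`
    is the list of class names those indices refer to.
    """
    counts = Counter(int(t) for t in targets)
    total = sum(counts.values())
    return {name: counts[i] / total for i, name in enumerate(target_names)}


# Toy labels standing in for twenty_train.target (hypothetical values).
toy_targets = [0, 1, 2, 0, 1, 2]
toy_names = ["alt.atheism", "comp.graphics", "sci.space"]
shares = class_shares(toy_targets, toy_names)
```

With the objects loaded above, calling `class_shares(twenty_train.target, twenty_train.target_names)` should give one share per newsgroup. If the distribution is close to uniform, each share will be near 1/20.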
Most modern NLP methods rely heavily on machine learning. These methods need words, which are written as strings of text, to be converted into a numerical representation. This representation can range from something as simple as a unique integer ID assigned to each word to a more comprehensive vector of float values. Producing the latter is sometimes referred to as vectorization.
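To make the two ends of that range concrete, here is a minimal sketch in plain Python. It rests on assumptions that do not come from the text: tokens are obtained by lowercasing and splitting on whitespace, and the float vector is a simple bag-of-words count. The function names are hypothetical, and real pipelines typically use a library vectorizer instead.

```python
def build_vocabulary(documents):
    """Assign each distinct token a unique integer ID, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for token in doc.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def to_ids(document, vocab):
    """Represent a document as a list of integer IDs, skipping unknown tokens."""
    return [vocab[t] for t in document.lower().split() if t in vocab]


def to_count_vector(document, vocab):
    """Represent a document as a float vector of per-token counts."""
    vector = [0.0] * len(vocab)
    for token in document.lower().split():
        if token in vocab:
            vector[vocab[token]] += 1.0
    return vector


docs = ["the cat sat", "the dog sat down"]
vocab = build_vocabulary(docs)               # the=0, cat=1, sat=2, dog=3, down=4
ids = to_ids("the dog sat", vocab)           # integer-ID representation
vec = to_count_vector("the dog sat", vocab)  # float-vector representation
```

The ID list preserves word order but carries no notion of magnitude, while the count vector has a fixed length equal to the vocabulary size, which is the shape most machine learning models expect as input.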