Title: Natural Language Processing with Python Quick Start Guide
Author: Nirant Kasliwal
Last updated: 2021-06-10 18:36:35

Understanding and preparing the data

Text and language are inherently unstructured. Before modeling, we usually want to clean the text in certain ways, such as expanding abbreviations and acronyms, removing punctuation, and so on. We also want to select a few samples that best represent the data we might see in the wild.
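As a rough illustration of the cleaning steps mentioned above, here is a minimal sketch using only the Python standard library. The abbreviation map and the function names (`expand_abbreviations`, `remove_punctuation`, `clean_text`) are hypothetical and chosen for this example; they are not taken from the book, and a real project would build its own map from its own data.

```python
import re
import string

# Hypothetical abbreviation map, for illustration only.
ABBREVIATIONS = {
    "nlp": "natural language processing",
    "btw": "by the way",
    "asap": "as soon as possible",
}


def expand_abbreviations(text, mapping=ABBREVIATIONS):
    """Replace whole-word abbreviations with their expansions (case-insensitive)."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(key) for key in mapping) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: mapping[match.group(0).lower()], text)


def remove_punctuation(text):
    """Delete every ASCII punctuation character from the text."""
    return text.translate(str.maketrans("", "", string.punctuation))


def clean_text(text):
    """Expand abbreviations, strip punctuation, and collapse extra whitespace."""
    text = expand_abbreviations(text)
    text = remove_punctuation(text)
    return " ".join(text.split())


print(clean_text("BTW, NLP is fun!"))
```

The order matters in this sketch: abbreviations are expanded before punctuation is removed, so an abbreviation followed by a comma or period is still matched as a whole word. Whether to remove punctuation at all depends on the downstream task, since some tasks rely on it.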

Another common practice is to prepare a gold dataset. A gold dataset is the best data available under reasonable conditions, which is not the same as the best data we could obtain under ideal conditions. Creating the gold dataset often involves manual tagging and cleaning.

The next few sections are dedicated to text cleaning and text representations at this stage of the NLP workflow.
