
Understanding and preparing the data

Text and language are inherently unstructured. We often need to clean text in certain ways, such as expanding abbreviations and acronyms or removing punctuation. We also want to select a few samples that best represent the data we expect to see in the wild.
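The cleaning steps mentioned above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a complete cleaning pipeline: the `ABBREVIATIONS` map and the `clean_text` helper are hypothetical names introduced here, and lowercasing the text is an extra assumption that may not suit every task.

```python
import re
import string

# Hypothetical abbreviation map, for illustration only.
# A real project would build this from its own domain vocabulary.
ABBREVIATIONS = {
    "nlp": "natural language processing",
    "asap": "as soon as possible",
    "e.g.": "for example",
}


def clean_text(text, abbreviations=ABBREVIATIONS):
    """Lowercase text, expand known abbreviations, and remove punctuation."""
    expanded = []
    for token in text.lower().split():
        # Try the token as written first (matches "e.g."),
        # then with leading/trailing punctuation stripped (matches "asap!").
        stripped = token.strip(string.punctuation)
        if token in abbreviations:
            expanded.append(abbreviations[token])
        elif stripped in abbreviations:
            expanded.append(abbreviations[stripped])
        else:
            expanded.append(token)

    joined = " ".join(expanded)
    # Delete every ASCII punctuation character.
    no_punct = joined.translate(str.maketrans("", "", string.punctuation))
    # Collapse any runs of whitespace left behind.
    return re.sub(r"\s+", " ", no_punct).strip()


print(clean_text("NLP is fun, e.g. ASAP!"))
```

Abbreviations are expanded before punctuation is removed so that entries such as "e.g." can still be matched; reversing the order would turn them into "eg" and miss the lookup.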

Another common practice is to prepare a gold dataset. A gold dataset is the best data available under reasonable conditions, which is not the same as the best data available under ideal conditions. Creating a gold dataset often involves manual tagging and cleaning.

The next few sections are dedicated to text cleaning and text representations at this stage of the NLP workflow.
