Understanding and preparing the data

Text and language are inherently unstructured. We might want to clean the text in certain ways, such as expanding abbreviations and acronyms or removing punctuation. We also want to select a few samples that best represent the data we might see in the wild.
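The cleaning steps mentioned above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a definitive pipeline: the `ABBREVIATIONS` map and the function names `expand_abbreviations`, `remove_punctuation`, and `clean_text` are hypothetical and chosen for this example only.

```python
import re
import string

# Hypothetical abbreviation map for illustration; a real project would
# build this from its own domain vocabulary.
ABBREVIATIONS = {
    "nlp": "natural language processing",
    "asap": "as soon as possible",
    "e.g.": "for example",
}


def expand_abbreviations(text, mapping=ABBREVIATIONS):
    """Replace whitespace-separated tokens found in the mapping."""
    expanded = []
    for token in text.split():
        key = token.lower()
        if key not in mapping:
            # Retry without surrounding punctuation, so "ASAP," matches "asap".
            key = key.strip(string.punctuation)
        expanded.append(mapping.get(key, token))
    return " ".join(expanded)


def remove_punctuation(text):
    """Drop every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))


def clean_text(text):
    """Expand abbreviations first, then strip punctuation and extra spaces."""
    text = expand_abbreviations(text)
    text = remove_punctuation(text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("Reply ASAP, e.g. today!"))
# Reply as soon as possible for example today
```

Abbreviations are expanded before punctuation is removed because some of them, such as "e.g.", contain punctuation and would no longer match the map afterwards.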

Another common practice is to prepare a gold dataset. A gold dataset is the best available data under reasonable conditions, not the best available data under ideal conditions. Creating it often involves manual tagging and cleaning.

The next few sections are dedicated to text cleaning and text representations at this stage of the NLP workflow.
