
Summary

In this chapter, you have learned about various types of data and ways to deal with unstructured text data. Text data is usually untidy and needs to be cleaned and pre-processed. The main pre-processing steps are tokenization, stemming, lemmatization, and stop-word removal. After pre-processing, features are extracted from the text using methods such as BoW and TF-IDF, which convert unstructured text data into structured numeric data. New features can also be derived from existing ones through feature engineering. In the last part of the chapter, we explored various ways of visualizing text data, such as word clouds.
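As a compact recap, the pipeline summarized above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the chapter's own code: the function names (`preprocess`, `bag_of_words`, `tf_idf`), the tiny stop-word list, and the sample sentences are all hypothetical, and stemming and lemmatization are left out because they normally rely on an NLP library. The TF-IDF weighting shown, raw count times `log(N / df)`, is only one common variant; library implementations often add smoothing and normalization.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "a", "on", "and", "of"}


def preprocess(text):
    """Lowercase the text, tokenize on runs of letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def bag_of_words(docs):
    """Return the sorted vocabulary and one term-count vector per document."""
    tokenized = [preprocess(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(vectors):
    """Weight raw counts by idf = log(N / df), one common TF-IDF variant."""
    n_docs = len(vectors)
    n_terms = len(vectors[0]) if vectors else 0
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for v in vectors if v[j] > 0) for j in range(n_terms)]
    return [
        [v[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
        for v in vectors
    ]


# Hypothetical sample corpus, used only for illustration.
docs = ["The cat sat on the mat.", "The dog sat on the log."]
vocab, counts = bag_of_words(docs)
weights = tf_idf(counts)

print(vocab)
print(counts)
print(weights)
```

In this sketch, a term that appears in every document (here, "sat") receives a TF-IDF weight of zero, which shows how TF-IDF down-weights terms that do not help distinguish one document from another, whereas BoW counts them like any other term.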

In the next chapter, you will learn how to develop machine learning models that classify texts using the features you have learned to extract in this chapter. Different sampling techniques and model evaluation metrics will also be introduced.
