Introduction

In the previous chapter, we learned about the concepts of Natural Language Processing (NLP) and text analytics. We also took a quick look at various preprocessing steps. In this chapter, we will learn how to make text understandable to machine learning algorithms.

Most machine learning algorithms cannot work directly with plain text or strings, so to apply them to textual data we first need a numerical (vector) representation of that text. Before converting text into numerical form, however, we need to pass it through some preprocessing steps, such as tokenization, stemming, lemmatization, and stop-word removal.
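To make these preprocessing steps concrete, here is a minimal sketch using only the Python standard library. The names `STOP_WORDS`, `tokenize`, `remove_stop_words`, `crude_stem`, and `preprocess` are hypothetical, the stop-word list is a tiny hand-picked sample, and the suffix stripper is a toy stand-in for a real stemming algorithm. Lemmatization is left out because it needs a dictionary of word forms.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "of", "to", "in", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip one common suffix, keeping at least three characters.

    This is a toy rule, not a real stemming algorithm.
    """
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The cats are sitting on the mats"))
# prints ['cat', 'sitt', 'mat']
```

Note that the stem "sitt" is not a dictionary word. Stemming only chops off suffixes, whereas lemmatization would map "sitting" to its base form "sit".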

In this chapter, we will look at these preprocessing steps in more detail and learn how to extract features from the preprocessed text and convert them into vectors. We will also explore two popular methods for feature extraction, Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF), as well as various methods for finding the similarity between different texts. By the end of this chapter, you will also have gained an in-depth understanding of how text data can be visualized.
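As a rough preview of these ideas, the sketch below builds Bag of Words and TF-IDF vectors for three toy sentences and compares them with cosine similarity, using only the standard library. The sample documents and all function names are illustrative assumptions. The TF-IDF weighting shown (relative term frequency multiplied by log(N / document frequency)) is one common variant among several, and cosine similarity is only one of the possible similarity measures.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

tokenized = [d.split() for d in docs]
# The vocabulary fixes the order of the vector dimensions.
vocab = sorted({w for doc in tokenized for w in doc})


def bow_vector(tokens):
    """Bag of Words: raw count of each vocabulary word in the document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def tfidf_vector(tokens):
    """TF-IDF variant: (count / doc length) * log(N / document frequency)."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    vec = []
    for w in vocab:
        tf = counts[w] / len(tokens)
        df = sum(1 for doc in tokenized if w in doc)
        vec.append(tf * math.log(n_docs / df))
    return vec


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


bow = [bow_vector(t) for t in tokenized]
tfidf = [tfidf_vector(t) for t in tokenized]

print(cosine_similarity(bow[0], bow[1]))      # approximately 0.75
print(cosine_similarity(bow[0], bow[2]))      # 0.0, no shared words
print(cosine_similarity(tfidf[0], tfidf[1]))  # lower than the BoW score
```

The first two sentences share the words "the", "sat", and "on", so their Bag of Words vectors are fairly similar. TF-IDF gives those shared words less weight than the words unique to each sentence, which lowers the similarity score. The third sentence shares no exact tokens with the first ("cats" is not "cat"), which hints at why stemming or lemmatization is applied before feature extraction.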
