NLP for machine learning

Unlike humans, computers do not understand text – at least not in the same way that we do. In order to create machine learning models that are able to learn from data, we must first learn to represent natural language in a way that computers are able to process.

When we discussed machine learning fundamentals, you may have noticed that loss functions all operate on numerical data, which is what allows the loss to be minimized. Because of this, we wish to represent our text in a numerical format that can form the basis of our input into a neural network. Here, we will cover a couple of basic ways of numerically representing our data.

Bag-of-words

The first and simplest way of representing text is by using a bag-of-words representation. This method simply counts how many times each word occurs in a given sentence or document. These counts are then transformed into a vector where each element is the number of times a given word from the corpus appears within the sentence. Here, the corpus is simply all the words that appear across all the sentences/documents being analyzed; the set of distinct words it contains is commonly called the vocabulary. Take the following two sentences:

The cat sat on the mat

The dog sat on the cat

We can represent each of these sentences as a count of words:

Figure 1.15 – Table of word counts

Then, we can transform these into individual vectors:

The cat sat on the mat -> [2,1,0,1,1,1]

The dog sat on the cat -> [2,1,1,1,1,0]

This numeric representation could then be used as the input features to a machine learning model, with each sentence's count vector serving as its feature vector.
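The counting step described above can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the helper names `tokenize` and `bag_of_words` are hypothetical, tokenization is assumed to be a simple lowercase whitespace split, and the vocabulary order (the, cat, dog, sat, on, mat) is an assumption chosen so that the output lines up with the vectors shown above.

```python
from collections import Counter

def tokenize(sentence):
    # Assumption: lowercase and split on whitespace. Real tokenizers
    # also deal with punctuation, contractions, and so on.
    return sentence.lower().split()

def bag_of_words(sentence, vocab):
    # Count every word in the sentence, then read the counts off in
    # vocabulary order. Counter returns 0 for words that never occur.
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in vocab]

# Assumed column order, chosen to match the vectors in the text.
vocab = ["the", "cat", "dog", "sat", "on", "mat"]

sentences = ["The cat sat on the mat", "The dog sat on the cat"]
vectors = [bag_of_words(s, vocab) for s in sentences]

for sentence, vector in zip(sentences, vectors):
    print(sentence, "->", vector)
```

Note that both sentences map to vectors of the same length regardless of how many words they contain, and that word order is discarded: any reordering of the same words produces the same vector.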

Sequential representation

We will see later in this book that more complex neural network models, including RNNs and LSTMs, do not just take a single vector as input, but can take a whole sequence of vectors in the form of a matrix. Because of this, in order to better capture the order of words and thus the meaning of any sentence, we can represent a sentence as a sequence of one-hot encoded vectors, one per word:

Figure 1.16 – One-hot encoded vectors
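As a rough sketch of this idea, the following Python snippet turns a sentence into a matrix with one row per word, where each row is a one-hot vector over the vocabulary. The function name `one_hot_sequence` is hypothetical, and the vocabulary order and whitespace tokenization are assumptions carried over from the two example sentences above.

```python
def one_hot_sequence(sentence, vocab):
    # Map each vocabulary word to its column index.
    index = {word: i for i, word in enumerate(vocab)}
    matrix = []
    # Assumption: lowercase whitespace tokenization. A word missing
    # from the vocabulary raises KeyError in this simple sketch.
    for word in sentence.lower().split():
        row = [0] * len(vocab)
        row[index[word]] = 1  # exactly one "hot" entry per row
        matrix.append(row)
    return matrix

# Assumed vocabulary order for illustration.
vocab = ["the", "cat", "dog", "sat", "on", "mat"]
matrix = one_hot_sequence("The cat sat on the mat", vocab)

for row in matrix:
    print(row)

# Summing each column discards word order and recovers the
# bag-of-words counts for the same sentence.
column_sums = [sum(column) for column in zip(*matrix)]
print(column_sums)
```

Unlike the bag-of-words vector, the number of rows here depends on the sentence length, and the row order preserves the word order, which is the information that sequence models can make use of.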