
Text to numbers

We will be using a bag of words model for our example. We simply count the number of times each word occurs per document. Each document is therefore a bag, and we count the frequency of every word in that bag. This also means that we lose any ordering information that's present in the text. Next, we assign each unique word an integer ID. All of these unique words together become our vocabulary, and each word in the vocabulary is treated as a machine learning feature. Let's build our vocabulary first.
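To make this concrete, here is a minimal sketch with two made-up sentences showing how each document collapses into a bag of word counts:

from collections import Counter

# Two toy documents (made up for illustration)
docs = ['the cat sat on the mat', 'the dog sat']

# Each document becomes a bag: a word -> count mapping; word order is discarded
bags = [Counter(doc.split()) for doc in docs]
print(bags[0])  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})

# The vocabulary is the set of unique words across all bags, each given an integer ID
vocabulary = {word: idx for idx, word in enumerate(sorted({w for d in docs for w in d.split()}))}
print(vocabulary)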

Scikit-learn has a high-level component that will create feature vectors for us. It is called CountVectorizer, and we recommend reading more about it in the scikit-learn docs:

# Extracting features from text files
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)

print(f'Shape of Term Frequency Matrix: {X_train_counts.shape}')

By calling count_vect.fit_transform(twenty_train.data), we learn the vocabulary dictionary and get back a Document-Term matrix of shape [n_samples, n_features]. This means that we have n_samples documents, or bags, with n_features unique words across them.
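If you want to inspect what was learned, the fitted CountVectorizer exposes the vocabulary as a word-to-ID mapping; a minimal sketch (the word algorithm is just an illustrative lookup):

# The learned vocabulary maps each unique word to its integer feature ID
print(f'Vocabulary size: {len(count_vect.vocabulary_)}')
print(f"Feature ID for 'algorithm': {count_vect.vocabulary_.get('algorithm')}")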

We will now be able to extract a meaningful relationship between these words and the tags or classes they belong to. One of the simplest ways to do this is to count the number of times a word occurs in each class.
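As a rough sketch of that idea (assuming twenty_train is the 20 newsgroups training split loaded earlier, so twenty_train.target holds each document's class index; get_feature_names_out requires a recent scikit-learn, older releases use get_feature_names):

import numpy as np

words = count_vect.get_feature_names_out()

# For each class, sum the count vectors of its documents and look at the top word
for class_idx, class_name in enumerate(twenty_train.target_names):
    rows = np.where(twenty_train.target == class_idx)[0]
    class_counts = np.asarray(X_train_counts[rows].sum(axis=0)).ravel()
    print(f'{class_name}: most frequent word is "{words[class_counts.argmax()]}"')

Running this typically surfaces the same common words (such as the) for every class, which is exactly the problem addressed next.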

We have a small issue with this: longer documents tend to influence the result a lot more. We can normalize this effect by dividing each word's frequency by the total number of words in that document. We call this Term Frequency, or simply TF.
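A minimal sketch of that normalization, dividing each row of the count matrix by the document's total word count (l1-normalizing the rows is equivalent here, since counts are non-negative):

from sklearn.preprocessing import normalize

# Term Frequency: each document's word counts divided by its total number of words,
# so every row sums to 1
X_train_tf = normalize(X_train_counts, norm='l1', axis=1)
print(f'Shape of TF Matrix: {X_train_tf.shape}')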

Words like the, a, and of are common across all documents and don't really help us distinguish between document classes. We want to emphasize rarer words, such as Manmohan and Modi, over common ones. One way to do this is to use inverse document frequency, or IDF, which measures whether a term is common or rare across all documents.

We multiply TF by IDF to get our TF-IDF metric, which is always greater than or equal to zero. TF-IDF is calculated for a triplet of term t, document d, and corpus of documents D.
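In its classic textbook form, the computation looks like the following minimal sketch (note that scikit-learn's TfidfTransformer uses a smoothed IDF and l2 normalization by default, so its exact numbers differ):

import numpy as np

# Classic (textbook) TF-IDF for a term t, a document d, and a corpus D:
#   tf(t, d)       = count of t in d / total words in d
#   idf(t, D)      = log(number of documents in D / number of documents containing t)
#   tfidf(t, d, D) = tf(t, d) * idf(t, D)
def tfidf(term, doc, corpus):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)          # document frequency
    idf = np.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [doc.split() for doc in ['the cat sat', 'the dog barked', 'the cat slept']]
print(tfidf('cat', corpus[0], corpus))  # rarer word -> positive score
print(tfidf('the', corpus[0], corpus))  # appears in every document -> score of 0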

We can directly calculate TF-IDF using the following lines of code:

from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

print(f'Shape of TFIDF Matrix: {X_train_tfidf.shape}')

The last line will output the dimension of the Document-Term matrix, which is (11314, 130107).
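As a related shortcut, scikit-learn also ships TfidfVectorizer, which combines CountVectorizer and TfidfTransformer into a single component; a minimal sketch:

from sklearn.feature_extraction.text import TfidfVectorizer

# Counting and TF-IDF weighting in one step
tfidf_vect = TfidfVectorizer()
X_train_tfidf_direct = tfidf_vect.fit_transform(twenty_train.data)

print(f'Shape of TFIDF Matrix: {X_train_tfidf_direct.shape}')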

Please note that in the preceding example we used each word as a feature, so the TF-IDF was calculated for each word. When we use a single word as a feature, we call it a unigram. If we were to use two consecutive words as a feature instead, we'd call it a bigram. In general, for n-words, we would call it an n-gram.
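For instance, CountVectorizer can be told to emit bigrams (or longer n-grams) alongside unigrams through its ngram_range parameter; a minimal sketch with a made-up sentence (get_feature_names_out is available in recent scikit-learn versions):

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps both unigrams and bigrams as features
bigram_vect = CountVectorizer(ngram_range=(1, 2))
bigram_vect.fit(['the quick brown fox'])
print(bigram_vect.get_feature_names_out())
# ['brown' 'brown fox' 'fox' 'quick' 'quick brown' 'the' 'the quick']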
