
Text to numbers

We will be using a bag of words model for our example. We simply count the number of times each word occurs per document. Each document is therefore a bag, and we count the frequency of every word in that bag. This also means that we lose any ordering information that's present in the text. Next, we assign each unique word an integer ID. All of these unique words together become our vocabulary, and each word in the vocabulary is treated as a machine learning feature. Let's build our vocabulary first.
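To make this concrete, here is a minimal sketch with two made-up sentences showing how each document collapses into a bag of word counts:

from collections import Counter

# Two toy documents (made up for illustration)
docs = ['the cat sat on the mat', 'the dog sat']

# Each document becomes a bag: a word -> count mapping; word order is discarded
bags = [Counter(doc.split()) for doc in docs]
print(bags[0])  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})

# The vocabulary is the set of unique words across all bags, each given an integer ID
vocabulary = {word: idx for idx, word in enumerate(sorted({w for d in docs for w in d.split()}))}
print(vocabulary)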

Scikit-learn has a high-level component that will create feature vectors for us. It is called CountVectorizer, and we recommend reading more about it in the scikit-learn docs:

# Extracting features from text files
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)

print(f'Shape of Term Frequency Matrix: {X_train_counts.shape}')

By calling count_vect.fit_transform(twenty_train.data), we learn the vocabulary dictionary and get back a Document-Term matrix of shape [n_samples, n_features]. This means that we have n_samples documents, or bags, with n_features unique words across them.
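If you want to inspect what was learned, the fitted CountVectorizer exposes the vocabulary as a word-to-ID mapping; a minimal sketch (the word algorithm is just an illustrative lookup):

# The learned vocabulary maps each unique word to its integer feature ID
print(f'Vocabulary size: {len(count_vect.vocabulary_)}')
print(f"Feature ID for 'algorithm': {count_vect.vocabulary_.get('algorithm')}")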

We will now be able to extract a meaningful relationship between these words and the tags or classes they belong to. One of the simplest ways to do this is to count the number of times a word occurs in each class.
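As a rough sketch of that idea (assuming twenty_train is the 20 newsgroups training split loaded earlier, so twenty_train.target holds each document's class index; get_feature_names_out requires a recent scikit-learn, older releases use get_feature_names):

import numpy as np

words = count_vect.get_feature_names_out()

# For each class, sum the count vectors of its documents and look at the top word
for class_idx, class_name in enumerate(twenty_train.target_names):
    rows = np.where(twenty_train.target == class_idx)[0]
    class_counts = np.asarray(X_train_counts[rows].sum(axis=0)).ravel()
    print(f'{class_name}: most frequent word is "{words[class_counts.argmax()]}"')

Running this typically surfaces the same common words (such as the) for every class, which is exactly the problem addressed next.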

We have a small issue with this: longer documents tend to influence the result a lot more. We can normalize this effect by dividing each word's frequency by the total number of words in that document. We call this Term Frequency, or simply TF.
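A minimal sketch of that normalization, dividing each row of the count matrix by the document's total word count (l1-normalizing the rows is equivalent here, since counts are non-negative):

from sklearn.preprocessing import normalize

# Term Frequency: each document's word counts divided by its total number of words,
# so every row sums to 1
X_train_tf = normalize(X_train_counts, norm='l1', axis=1)
print(f'Shape of TF Matrix: {X_train_tf.shape}')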

Words like the, a, and of are common across all documents and don't really help us distinguish between document classes. We want to emphasize rarer words, such as Manmohan and Modi, over common ones. One way to do this is to use inverse document frequency, or IDF, which measures whether a term is common or rare across all documents.

We multiply TF by IDF to get our TF-IDF metric, which is always greater than or equal to zero. TF-IDF is calculated for a triplet of term t, document d, and corpus of documents D.
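In its classic textbook form, the computation looks like the following minimal sketch (note that scikit-learn's TfidfTransformer uses a smoothed IDF and l2 normalization by default, so its exact numbers differ):

import numpy as np

# Classic (textbook) TF-IDF for a term t, a document d, and a corpus D:
#   tf(t, d)       = count of t in d / total words in d
#   idf(t, D)      = log(number of documents in D / number of documents containing t)
#   tfidf(t, d, D) = tf(t, d) * idf(t, D)
def tfidf(term, doc, corpus):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)          # document frequency
    idf = np.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [doc.split() for doc in ['the cat sat', 'the dog barked', 'the cat slept']]
print(tfidf('cat', corpus[0], corpus))  # rarer word -> positive score
print(tfidf('the', corpus[0], corpus))  # appears in every document -> score of 0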

We can directly calculate TF-IDF using the following lines of code:

from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

print(f'Shape of TFIDF Matrix: {X_train_tfidf.shape}')

The last line will output the dimension of the Document-Term matrix, which is (11314, 130107).
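As a related shortcut, scikit-learn also ships TfidfVectorizer, which combines CountVectorizer and TfidfTransformer into a single component; a minimal sketch:

from sklearn.feature_extraction.text import TfidfVectorizer

# Counting and TF-IDF weighting in one step
tfidf_vect = TfidfVectorizer()
X_train_tfidf_direct = tfidf_vect.fit_transform(twenty_train.data)

print(f'Shape of TFIDF Matrix: {X_train_tfidf_direct.shape}')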

Please note that in the preceding example we used each word as a feature, so the TF-IDF was calculated for each word. When we use a single word as a feature, we call it a unigram. If we were to use two consecutive words as a feature instead, we'd call it a bigram. In general, for n-words, we would call it an n-gram.
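For instance, CountVectorizer can be told to emit bigrams (or longer n-grams) alongside unigrams through its ngram_range parameter; a minimal sketch with a made-up sentence (get_feature_names_out is available in recent scikit-learn versions):

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps both unigrams and bigrams as features
bigram_vect = CountVectorizer(ngram_range=(1, 2))
bigram_vect.fit(['the quick brown fox'])
print(bigram_vect.get_feature_names_out())
# ['brown' 'brown fox' 'fox' 'quick' 'quick brown' 'the' 'the quick']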
