- Natural Language Processing with Python Quick Start Guide
- Nirant Kasliwal
Text to numbers
We will be using a bag of words model for our example. We simply count the number of times each word occurs per document. Each document is therefore a bag, and we count the frequency of every word in that bag. This also means that we lose any ordering information that's present in the text. Next, we assign each unique word an integer ID. All of these unique words make up our vocabulary, and each word in the vocabulary is treated as a machine learning feature. Let's build our vocabulary first.
Scikit-learn has a high-level component that will create feature vectors for us, called CountVectorizer. We recommend reading more about it in the scikit-learn docs:
# Extracting features from text files
# twenty_train is the 20 newsgroups training set, loaded earlier with
# sklearn.datasets.fetch_20newsgroups(subset='train')
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
print(f'Shape of Term Frequency Matrix: {X_train_counts.shape}')
By calling count_vect.fit_transform(twenty_train.data), we learn the vocabulary dictionary and get back a document-term matrix of shape [n_samples, n_features]. This means that we have n_samples documents, or bags, with n_features unique words across them.
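If you want to inspect the learned vocabulary, CountVectorizer exposes it as the vocabulary_ attribute, a dictionary mapping each word to its integer column index. A minimal sketch (the word 'algorithm' is just an arbitrary probe; any word from the corpus works):
# The vocabulary maps each unique word to a column index in the matrix
print(len(count_vect.vocabulary_))               # vocabulary size
print(count_vect.vocabulary_.get('algorithm'))   # column index for one word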
We will now be able to extract a meaningful relationship between these words and the tags or classes they belong to. One of the simplest ways to do this is to count the number of times a word occurs in each class.
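As a rough sketch of that idea, we can sum the count vectors of all documents that belong to one class. This assumes twenty_train.target holds integer class labels, as it does for the 20 newsgroups dataset; note that get_feature_names_out is called get_feature_names in older scikit-learn releases:
import numpy as np
# Rows of X_train_counts belonging to the first class (label 0)
rows = np.where(twenty_train.target == 0)[0]
# Column-wise sum gives the total count of each word within that class
class_word_counts = np.asarray(X_train_counts[rows].sum(axis=0)).ravel()
# Look up the most frequent word in that class (by raw count)
top_word = count_vect.get_feature_names_out()[class_word_counts.argmax()]
print(top_word)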
There is a small issue with this approach: longer documents tend to influence the result a lot more. We can normalize for this effect by dividing each word's frequency by the total number of words in that document. We call this Term Frequency, or simply TF.
Words like the, a, and of are common across all documents and don't really help us distinguish between document classes. We want to emphasize rarer words, such as Manmohan and Modi, over common words. One way to do this is to use inverse document frequency, or IDF, which measures whether a term is common or rare across all documents.
We multiply TF by IDF to get our TF-IDF metric, which is always non-negative. TF-IDF is calculated for a triplet of term t, document d, and document collection D.
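To make the arithmetic concrete, here is a minimal sketch that computes TF-IDF by hand for a toy two-document corpus. It uses the textbook formulation; scikit-learn's TfidfTransformer applies a smoothed IDF and normalizes each row, so its numbers will differ slightly:
import math
docs = [['modi', 'speaks', 'to', 'the', 'press'],
        ['the', 'press', 'reports', 'the', 'news']]
def tf(term, doc):
    # Term frequency: occurrences of the term, divided by document length
    return doc.count(term) / len(doc)
def idf(term, docs):
    # Inverse document frequency: log(total docs / docs containing the term)
    # Assumes the term appears in at least one document
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)
def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)
print(tfidf('modi', docs[0], docs))  # rare word: positive score (~0.139)
print(tfidf('the', docs[0], docs))   # word in every doc: IDF is 0, so score is 0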
We can directly calculate TF-IDF using the following lines of code:
from sklearn.feature_extraction.text import TfidfTransformer
# Re-weight the raw counts: scale TF by IDF and normalize each document vector
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print(f'Shape of TFIDF Matrix: {X_train_tfidf.shape}')
The last line prints the dimensions of the document-term matrix, which are (11314, 130107).
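As an aside, scikit-learn also provides TfidfVectorizer, which combines CountVectorizer and TfidfTransformer into a single step. A minimal sketch of the equivalent shortcut:
from sklearn.feature_extraction.text import TfidfVectorizer
# Tokenize, count, and apply TF-IDF weighting in one pass over the raw text
tfidf_vect = TfidfVectorizer()
X_train_tfidf = tfidf_vect.fit_transform(twenty_train.data)
print(f'Shape of TFIDF Matrix: {X_train_tfidf.shape}')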
Please note that in the preceding example, we used each word as a feature, so TF-IDF was calculated for each word. When we use a single word as a feature, we call it a unigram. If we were to use two consecutive words as a feature instead, we'd call it a bigram. In general, for n consecutive words, we call it an n-gram.
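CountVectorizer supports this directly through its ngram_range parameter. A minimal sketch that extracts both unigrams and bigrams (the feature count will grow considerably):
# ngram_range=(1, 2) treats every single word and every pair of
# consecutive words as features
bigram_vect = CountVectorizer(ngram_range=(1, 2))
X_train_bigrams = bigram_vect.fit_transform(twenty_train.data)
print(f'Shape with unigrams and bigrams: {X_train_bigrams.shape}')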