Text to numbers

We will use a bag-of-words model for our example. We simply count the number of times every word occurs in each document. Each document is therefore a bag, and we count the frequency of each word in that bag. This also means that we lose any ordering information present in the text. Next, we assign each unique word an integer ID. Together, these unique words form our vocabulary, and each word in the vocabulary is treated as a machine learning feature. Let's build our vocabulary first.
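
To make the idea concrete, here is a minimal pure-Python sketch of a bag of words, independent of scikit-learn. The two toy documents, the whitespace tokenization, and the helper name to_count_vector are assumptions made for illustration only; a real vectorizer applies its own tokenization rules.

```python
from collections import Counter

# Two toy documents (hypothetical, for illustration only)
docs = ["the cat sat on the mat", "the dog sat"]

# Naive tokenization: lowercase and split on whitespace
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: every unique word gets an integer ID (sorted for reproducibility)
unique_words = sorted({word for tokens in tokenized for word in tokens})
vocabulary = {word: idx for idx, word in enumerate(unique_words)}


def to_count_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocabulary]


# One row per document, one column per vocabulary word
count_matrix = [to_count_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
print(count_matrix)
```

Note that the rows no longer record where in the document a word appeared, which is exactly the loss of ordering described above.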

Scikit-learn has a high-level component, CountVectorizer, that will create feature vectors for us. We recommend reading more about it in the scikit-learn docs. The following code builds the feature vectors:

# Extracting features from text files
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)

print(f'Shape of Term Frequency Matrix: {X_train_counts.shape}')

Calling count_vect.fit_transform(twenty_train.data) learns the vocabulary dictionary and returns a document-term matrix of shape [n_samples, n_features]. This means that we have n_samples documents, or bags, with n_features unique words across them.

We will now be able to extract a meaningful relationship between these words and the tags or classes they belong to. One of the simplest ways to do this is to count the number of times a word occurs in each class.
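
As a rough illustration of counting words per class, the sketch below keeps one counter per label. The three labelled documents and the class names are hypothetical and are not taken from the dataset used in this chapter.

```python
from collections import Counter, defaultdict

# Hypothetical labelled documents: (text, class label)
labelled_docs = [
    ("the match was a great game", "sports"),
    ("the election was close", "politics"),
    ("a great game of cricket", "sports"),
]

# One Counter per class, accumulating word counts over that class's documents
class_word_counts = defaultdict(Counter)
for text, label in labelled_docs:
    class_word_counts[label].update(text.lower().split())

# A word seen often in one class and never in another hints at that class
print(class_word_counts["sports"]["game"])
print(class_word_counts["politics"]["game"])
```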

We have a small issue with this: long documents tend to influence the result a lot more. We can normalize this effect by dividing each word's count by the total number of words in that document. We call this Term Frequency, or simply TF.
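
The normalization can be sketched in a few lines. The helper name term_frequency and the two toy documents are assumptions for illustration. The long document repeats the short one ten times, so its raw counts are ten times larger while its term frequencies are identical.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {word: count / total number of words} for one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


short_doc = "modi spoke".split()
long_doc = ("modi spoke " * 10).split()  # same content, ten times longer

tf_short = term_frequency(short_doc)
tf_long = term_frequency(long_doc)

print(tf_short["modi"], tf_long["modi"])
```

Scikit-learn's TfidfTransformer handles document length differently: it works from the raw counts and then normalizes each row vector (L2 norm by default).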

Words like the, a, and of are common across all documents and don't really help us distinguish between document classes or separate them. We want to emphasize rarer words, such as Manmohan and Modi, over common words. One way to do this is to use inverse document frequency, or IDF. Inverse document frequency is a measure of whether the term is common or rare in all documents.
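
One common textbook definition of IDF is the logarithm of the number of documents divided by the number of documents containing the term. The sketch below assumes that variant; the toy corpus and the helper name are hypothetical, and libraries typically use smoothed versions of the formula.

```python
import math

# Toy corpus of tokenized documents (hypothetical)
docs = [
    "the prime minister modi spoke".split(),
    "the match was close".split(),
    "the election was close".split(),
]


def inverse_document_frequency(term, docs):
    """Textbook IDF, log(N / df); assumes the term occurs in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)


idf_the = inverse_document_frequency("the", docs)    # appears in every document
idf_modi = inverse_document_frequency("modi", docs)  # appears in one document

print(idf_the, idf_modi)
```

The rare word receives the larger weight, which is the emphasis we are after.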

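We multiply TF by IDF to get our TF-IDF metric, which is never negative: it is zero for a term that does not occur in the document and positive otherwise under scikit-learn's default settings. TF-IDF is calculated for a triplet of term t, document d, and corpus D (the full collection of documents).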
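
Putting the two pieces together, a minimal sketch of the product might look as follows. It assumes the plain textbook variant (TF as a fraction of document length, IDF as a logarithm of the inverse document fraction) computed over a collection of documents written D here, and it is not the exact formula scikit-learn applies, which adds smoothing and row normalization. The corpus and the helper name tf_idf are hypothetical.

```python
import math
from collections import Counter

# Toy corpus D of tokenized documents (hypothetical)
docs = [
    "the prime minister modi spoke".split(),
    "the match was close".split(),
    "the election was close".split(),
]


def tf_idf(term, doc, docs):
    """Textbook TF-IDF for term t in document d, given corpus D (assumed variant)."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for other in docs if term in other)
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf


# Score every word of the first document and rank from most to least distinctive
scores = {term: tf_idf(term, docs[0], docs) for term in docs[0]}
ranked = sorted(scores, key=scores.get, reverse=True)

print(ranked)
```

Words that occur in every document end up at the bottom of the ranking, while words specific to this document rise to the top.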

We can directly calculate TF-IDF using the following lines of code:

from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

print(f'Shape of TFIDF Matrix: {X_train_tfidf.shape}')

The last line will output the dimension of the Document-Term matrix, which is (11314, 130107).

Please note that in the preceding example we used each word as a feature, so the TF-IDF was calculated for each word. When we use a single word as a feature, we call it a unigram. If we were to use two consecutive words as a feature instead, we'd call it a bigram. In general, a sequence of n consecutive words is called an n-gram.
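
A short sketch of how n-grams can be extracted from a token list is shown below. The helper name ngrams and the sample sentence are assumptions for illustration.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "we used each word as a feature".split()

unigrams = ngrams(tokens, 1)  # single words
bigrams = ngrams(tokens, 2)   # pairs of consecutive words

print(unigrams)
print(bigrams)
```

In scikit-learn, CountVectorizer exposes an ngram_range parameter for the same purpose; for example, ngram_range=(1, 2) extracts both unigrams and bigrams.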
