
Solving our initial challenge

We now put everything together and demonstrate our system on the following new post, which we assign to the variable new_post:

Disk drive problems. Hi, I have a problem with my hard disk.
After 1 year it is working only sporadically now.
I tried to format it, but now it doesn't boot any more.
Any ideas? Thanks.
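
If new_post is not already defined, one plausible way to set it up is as a single triple-quoted string (a minimal sketch; the line breaks simply mirror the text above):

>>> new_post = """Disk drive problems. Hi, I have a problem with my hard disk.
... After 1 year it is working only sporadically now.
... I tried to format it, but now it doesn't boot any more.
... Any ideas? Thanks."""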

As we have learned previously, we first have to vectorize this post before we can predict its label:

>>> new_post_vec = vectorizer.transform([new_post])
>>> new_post_label = km.predict(new_post_vec)[0]

Now that we have the clustering, we do not need to compare new_post_vec to all post vectors. Instead, we can focus only on the posts of the same cluster. Let us fetch their indices in the original dataset:

>>> similar_indices = (km.labels_==new_post_label).nonzero()[0]

The comparison in the parentheses results in a Boolean array, and nonzero converts that array into a smaller array containing the indices of the True elements.
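
To see what nonzero does in isolation, here is a toy example (assuming NumPy is imported as np; this is not part of the original listing):

>>> import numpy as np
>>> labels = np.array([1, 0, 2, 0, 1])
>>> (labels == 1).nonzero()[0]
array([0, 4])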

Using similar_indices, we then simply have to build a list of posts together with their similarity scores as follows:

>>> similar = []
>>> for i in similar_indices:
...    dist = sp.linalg.norm((new_post_vec - vectorized[i]).toarray())
...    similar.append((dist, dataset.data[i]))
>>> similar = sorted(similar)
>>> print(len(similar))
44

We found 44 posts in the cluster of our post. To give the user a quick idea of the kinds of similar posts that are available, we can now present the most similar post (show_at_1), the least similar one (show_at_3), and an in-between post (show_at_2), all of which are from the same cluster:

>>> show_at_1 = similar[0]
>>> show_at_2 = similar[len(similar)//2]
>>> show_at_3 = similar[-1]

The following table shows the posts together with their similarity values:
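
To reproduce these values in the console, a small loop over the three tuples will do (a sketch; truncating each post to its first 100 characters is just for readability):

>>> for dist, post in [show_at_1, show_at_2, show_at_3]:
...     print('=== dist=%.2f ===' % dist)
...     print(post[:100])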

It is interesting how the posts reflect the similarity score. The first post contains all the salient words of our new post. The second one also revolves around hard disks, but lacks concepts such as formatting. Finally, the third one is only slightly related. Still, we would say that all of the posts belong to the same domain as the new post.

Another look at noise

We should not expect a perfect clustering, in the sense that posts from the same newsgroup (for example, comp.graphics) are also clustered together. An example will give us a quick impression of the noise that we have to expect:

>>> post_group = zip(dataset.data, dataset.target)
>>> z = sorted((len(post[0]), post[0], dataset.target_names[post[1]])
...            for post in post_group)
>>> print(z[5:7])
[(107, 'From: "kwansik kim" <kkim@cs.indiana.edu>\nSubject: Where is FAQ ?\n\nWhere can I find it ?\n\nThanks, Kwansik\n\n', 'comp.graphics'), (110, 'From: lioness@maple.circa.ufl.edu\nSubject: What is 3dO?\n\n\nSomeone please fill me in on what 3do.\n\nThanks,\n\nBH\n', 'comp.graphics')]

For both of these posts, there is no real indication that they belong to comp.graphics, considering only the wording that is left after the preprocessing step:

>>> analyzer = vectorizer.build_analyzer() 
>>> list(analyzer(z[5][1]))
[u'kwansik', u'kim', u'kkim', u'cs', u'indiana', u'edu', u'subject', u'faq', u'thank', u'kwansik']
>>> list(analyzer(z[6][1]))
[u'lioness', u'mapl', u'circa', u'ufl', u'edu', u'subject', u'3do', u'3do', u'thank', u'bh']

This is only after tokenization, lowercasing, and stop word removal. If we also discount the words that will later be filtered out via min_df and max_df during fit_transform, it gets even worse:

>>> list(set(analyzer(z[5][1])).intersection(
...         vectorizer.get_feature_names()))
[u'cs', u'faq', u'thank']
>>> list(set(analyzer(z[6][1])).intersection(
...         vectorizer.get_feature_names()))
[u'bh', u'thank']

Furthermore, most of these words occur frequently in other posts as well, as we can check with the IDF scores. Remember that the higher the TF-IDF value, the more discriminative a term is for a given post. And as IDF is a multiplicative factor here, a low IDF value signals that the term carries little discriminative value in general:

>>> for term in ['cs', 'faq', 'thank', 'bh', 'thank']:
...     print('IDF(%s)=%.2f' % (term,
...           vectorizer._tfidf.idf_[vectorizer.vocabulary_[term]]))
IDF(cs)=3.23
IDF(faq)=4.17
IDF(thank)=2.23
IDF(bh)=6.57
IDF(thank)=2.23

So, except for bh, which is close to the maximum overall IDF value of 6.74, the terms don't have much discriminative power. Understandably, posts from different newsgroups will be clustered together.
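
That ceiling can be read off the fitted vectorizer directly (using the same internal attribute as above; the result should match the 6.74 just mentioned):

>>> print('%.2f' % vectorizer._tfidf.idf_.max())
6.74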

For our goal, however, this is no big deal, as we are only interested in cutting down the number of posts that we have to compare a new post to. After all, the particular newsgroup our training data came from is of no special interest.
