Chapter 3. Clustering – Finding Related Posts

In the previous chapter, we learned how to find classes or categories of individual data points. With a handful of training data items that were paired with their respective classes, we learned a model that we can now use to classify future data items. We called this supervised learning, as the learning was guided by a teacher; in our case, the teacher took the form of correct classifications.

Let us now imagine that we do not possess the labels from which we could learn the classification model. This could be, for example, because they were too expensive to collect. What could we do in that case?

Well, of course, we would not be able to learn a classification model. Still, we could find some pattern within the data itself. This is what we will do in this chapter, where we consider the challenge of a "question and answer" website. When users browse our site looking for some particular information, the search engine will most likely point them to a specific answer. To improve the user experience, we now want to show all related questions with their answers. If the presented answer is not what the users were looking for, they can easily see the other available answers and hopefully stay on our site.

The naive approach would be to take the post, calculate its similarity to all other posts, and display the top N most similar posts as links on the page. Because every lookup compares the post against every other post, the cost grows with the number of posts and quickly becomes prohibitive. Instead, we need a method that quickly finds all related posts.
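To make the naive approach concrete, here is a minimal sketch of it. The similarity measure is an assumption chosen only for illustration: it is the share of words two posts have in common (Jaccard overlap), not the measure developed later in the chapter. The function names and the sample posts are likewise hypothetical.

```python
def tokenize(text):
    """Lowercase a post and split it on whitespace into a set of words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word-overlap similarity between two word sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_n_similar(query, posts, n=2):
    """Compare the query against every post and return the n best matches."""
    query_words = tokenize(query)
    # One similarity computation per stored post: this is the costly part.
    scored = [(jaccard(query_words, tokenize(post)), post) for post in posts]
    # Sort by score, highest first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [post for _, post in scored[:n]]


# Hypothetical sample posts, for illustration only.
posts = [
    "how to format my hard disk",
    "hard disk format problems",
    "how to bake bread at home",
]
result = top_n_similar("format hard disk", posts, n=2)
print(result)
```

Every call walks the full list of posts, so the work per page view grows linearly with the size of the site.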

We will achieve this goal in this chapter using clustering. This is a method of arranging items so that similar items are in one cluster and dissimilar items are in distinct ones. The tricky thing that we have to tackle first is how to turn text into something on which we can calculate similarity. With such a similarity measure, we will then investigate how to use it to quickly arrive at the cluster that contains similar posts. Once there, we will only have to check the documents that also belong to that cluster. To achieve this, we will introduce the marvelous Scikit library (scikit-learn), which comes with diverse machine-learning methods that we will also use in the following chapters.
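The lookup idea described above can be sketched in a few lines. This is only an illustration under strong assumptions: the clusters are assigned by hand rather than learned, each cluster is summarized by the combined vocabulary of its posts, and similarity is plain word overlap. None of this is the Scikit-based method the chapter goes on to build; all names and sample posts are hypothetical.

```python
def tokenize(text):
    """Lowercase a post and split it on whitespace into a set of words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word-overlap similarity between two word sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hand-made clusters (an assumption; a real system would learn them).
clusters = {
    "storage": ["how to format my hard disk", "hard disk format problems"],
    "baking": ["how to bake bread at home", "bread dough does not rise"],
}

# Computed once, ahead of time: one vocabulary summary per cluster.
vocabularies = {
    label: set().union(*(tokenize(post) for post in members))
    for label, members in clusters.items()
}


def related_posts(query):
    """Pick the best-matching cluster, then rank only its members."""
    query_words = tokenize(query)
    # Step 1: one comparison per cluster, not per post.
    best = max(vocabularies, key=lambda label: jaccard(query_words, vocabularies[label]))
    # Step 2: compare the query only against posts in the chosen cluster.
    ranked = sorted(
        clusters[best],
        key=lambda post: jaccard(query_words, tokenize(post)),
        reverse=True,
    )
    return best, ranked


label, ranked = related_posts("format hard disk")
print(label, ranked)
```

A lookup now costs one comparison per cluster plus one per member of the chosen cluster, instead of one per post on the whole site.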
