Semantics and topic modeling

Gensim is famous for its powerful semantic and topic modeling algorithms. Topic modeling is a typical text mining task of discovering the hidden semantic structures in a document. In plain English, semantic structure is the distribution of word occurrences. Because no labels are involved, it is an unsupervised learning task: we feed in plain text and let the model figure out the abstract "topics". We will study topic modeling in detail in Chapter 3, Mining the 20 Newsgroups Dataset with Clustering and Topic Modeling Algorithms.
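
To make "the distribution of word occurrences" concrete, here is a minimal sketch that uses only the Python standard library. It is not how gensim works internally; the helper name word_distribution, the regular-expression tokenizer, and the two sample sentences are all assumptions made for illustration.

```python
import re
from collections import Counter


def word_distribution(text):
    """Return the relative frequency of each lowercase word in text."""
    # Crude tokenizer (an assumption): keep runs of ASCII letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Two made-up documents about different subjects.
docs = [
    "The pitcher threw the ball and the batter hit the ball",
    "The senate passed the bill and the president signed the bill",
]

distributions = [word_distribution(doc) for doc in docs]

for dist in distributions:
    # Show the three most frequent words of each document.
    top = sorted(dist.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
    print(top)
```

Apart from function words such as "the", the frequent words differ between the two documents ("ball" versus "bill"). A topic model exploits this kind of signal across many documents, without being told what the subjects are.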

In addition to robust semantic modeling methods, gensim also provides the following functionalities:

  • Word embedding: Also known as word vectorization, this is an innovative way to represent words as vectors while preserving their co-occurrence features. We will study word embedding in detail in Chapter 10, Machine Learning Best Practices.
  • Similarity querying: This functionality retrieves objects that are similar to the given query object. It's a feature built on top of word embedding.
  • Distributed computing: This functionality makes it possible to efficiently learn from millions of documents.
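
The idea behind similarity querying can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it compares simple word-count vectors with cosine similarity as a stand-in for learned word embeddings, and the helper names (bag_of_words, cosine_similarity, most_similar) and the tiny corpus are hypothetical, not part of gensim's API.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Map each lowercase word in text to its count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, documents):
    """Return (score, document) for the document closest to the query."""
    query_vector = bag_of_words(query)
    scored = [(cosine_similarity(query_vector, bag_of_words(doc)), doc)
              for doc in documents]
    return max(scored, key=lambda pair: pair[0])


# A made-up corpus of three short documents.
corpus = [
    "the batter hit the ball out of the park",
    "the senate passed the new budget bill",
    "the chef baked fresh bread this morning",
]

score, best = most_similar("who hit the ball", corpus)
print(best)
```

The query shares "hit", "the", and "ball" with the first document, so that document gets the highest score. With embeddings in place of raw counts, documents can also match on related words rather than only on identical ones.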

Last but not least, as mentioned in the first chapter, scikit-learn is the main package we use throughout this book. Luckily, it provides all the text processing features we need, such as tokenization, in addition to comprehensive machine learning functionality. Plus, it comes with a built-in loader for the 20 newsgroups dataset.
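
As a quick taste of the tokenization scikit-learn offers, the sketch below uses the default analyzer of CountVectorizer, which lowercases the text and keeps tokens of two or more word characters. The sample sentence is an assumption chosen for illustration. The dataset loader (fetch_20newsgroups in sklearn.datasets) is only mentioned in a comment, since calling it downloads the data on first use.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The default analyzer lowercases the input and splits it into tokens
# of at least two word characters, dropping punctuation.
vectorizer = CountVectorizer()
analyzer = vectorizer.build_analyzer()

tokens = analyzer("Machine learning is fun, isn't it?")
print(tokens)

# The 20 newsgroups loader lives in sklearn.datasets as
# fetch_20newsgroups; it is not called here because it downloads
# the dataset the first time it runs.
```

Note that "isn't" is split at the apostrophe and the single-character "t" is dropped, which is a consequence of the default token pattern rather than any linguistic analysis.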

Now that the tools are available and properly installed, what about the data?