官术网_书友最值得收藏!

MALLET

The Machine Learning for Language Toolkit (MALLET) is a large library of natural language processing algorithms and utilities. It can be used in a variety of tasks such as document classification, document clustering, information extraction, and topic modelling. It features a command-line interface as well as a Java API for several algorithms such as Naive Bayes, HMM, Latent Dirichlet topic models, logistic regression, and conditional random fields.

MALLET is available under the Common Public License 1.0, which means that you can even use it in commercial applications. It can be downloaded from http://mallet.cs.umass.edu. A MALLET instance is represented by name, label, data, and source. However, there are two methods to import data into the MALLET format, as shown in the following list:

  • Instance per file: Each file or document corresponds to an instance and MALLET accepts the directory name for the input.
  • Instance per line: Each line corresponds to an instance, where the following format is assumed—the instance_name label token. Data will be a feature vector, consisting of distinct words that appear as tokens and their occurrence count.

The library is comprised of the following packages:

  • cc.mallet.classify: These are algorithms for training and classifying instances, including AdaBoost, bagging, C4.5, as well as other decision tree models, multivariate logistic regression, Naive Bayes, and Winnow2.
  • cc.mallet.cluster: These are unsupervised clustering algorithms, including greedy agglomerative, hill climbing, k-best, and k-means clustering.
  • cc.mallet.extract: This implements tokenizers, document extractors, document viewers, cleaners, and so on.
  • cc.mallet.fst: This implements sequence models, including conditional random fields, HMM, maximum entropy Markov models, and corresponding algorithms and evaluators.
  • cc.mallet.grmm: This implements graphical models and factor graphs such as inference algorithms, learning, and testing, for example, loopy belief propagation, Gibbs sampling, and so on.
  • cc.mallet.optimize: These are optimization algorithms for finding the maximum of a function, such as gradient ascent, limited-memory BFGS, stochastic meta ascent, and so on.
  • cc.mallet.pipe: These are methods as pipelines to process data into MALLET instances.
  • cc.mallet.topics: These are topics modelling algorithms, such as Latent Dirichlet allocation, four-level pachinko allocation, hierarchical PAM, DMRT, and so on.
  • cc.mallet.types: This implements fundamental data types such as dataset, feature vector, instance, and label.
  • cc.mallet.util: These are miscellaneous utility functions such as command-line processing, search, math, test, and so on.
主站蜘蛛池模板: 重庆市| 巩留县| 安达市| 广丰县| 托里县| 淅川县| 清徐县| 嘉义县| 石渠县| 吉林市| 凤阳县| 铜陵市| 唐河县| 邵阳市| 原阳县| 徐水县| 钦州市| 通榆县| 开平市| 景德镇市| 河北省| 揭东县| 文山县| 铜陵市| 新蔡县| 姚安县| 玉门市| 翁源县| 楚雄市| 郯城县| 科技| 柳河县| 建平县| 夏河县| 茂名市| 东乌珠穆沁旗| 涿鹿县| 两当县| 凤凰县| 昭苏县| 遵义市|