
Feature selection

The number of explanatory features (input variables) of a sample can be enormous: a training sample (observation/example) is a vector x_i = (x_i1, x_i2, x_i3, ..., x_id), where d is very large. An example of this is a document classification task, where you might have 10,000 distinct words and the input variables are the numbers of occurrences of the different words.
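As a quick illustration of such a representation, here is a minimal sketch that builds the word-count input variables for a toy corpus. It assumes scikit-learn is available; the three-document corpus is purely hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-document corpus; a real task would have far more
# documents and on the order of 10,000 distinct words.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the markets rallied as stocks rose",
]

# Bag-of-words: one column per distinct word, one row per document,
# each entry x_ij is the number of occurrences of word j in document i.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the d input variables (words)
print(X.toarray())                         # each row is x_i = (x_i1, ..., x_id)
```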

This enormous number of input variables can be problematic, and sometimes a curse, because we have many input variables and few training samples to help us in the learning procedure. To avoid this curse of having an enormous number of input variables (the curse of dimensionality), data scientists use dimensionality reduction techniques to select a subset of the input variables. For example, in the text classification task they can do the following (a sketch illustrating these options follows the list):

  • Extracting relevant inputs (for instance, using a mutual information measure)
  • Principal component analysis (PCA)
  • Grouping (clustering) similar words (this uses a similarity measure)
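
The following sketch illustrates all three ideas on synthetic data. It assumes scikit-learn and NumPy; the random count matrix, the labels, and the choices of 50 selected words, 50 components, and 20 word clusters are illustrative assumptions, not values from the text:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)

# Synthetic stand-in for a word-count matrix: 100 documents, 1,000 words,
# with a binary class label per document.
X = rng.poisson(0.3, size=(100, 1000)).astype(float)
y = rng.integers(0, 2, size=100)

# 1. Extract relevant inputs: keep the 50 words whose counts carry the
#    most mutual information about the class label.
X_mi = SelectKBest(score_func=mutual_info_classif, k=50).fit_transform(X, y)
print(X_mi.shape)   # (100, 50)

# 2. PCA: project the counts onto the 50 directions of largest variance.
X_pca = PCA(n_components=50).fit_transform(X)
print(X_pca.shape)  # (100, 50)

# 3. Group similar words: cluster the columns (each word is described by
#    its count pattern across documents), then sum the counts within each
#    cluster to get one grouped feature per cluster.
word_clusters = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(X.T)
X_grouped = np.stack(
    [X[:, word_clusters == c].sum(axis=1) for c in range(20)], axis=1)
print(X_grouped.shape)  # (100, 20)
```

In each case, the 1,000 original input variables are reduced to a much smaller set, which is the relief from the curse of dimensionality that these techniques aim for.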