
Feature selection

The number of explanatory features (input variables) of a sample can be enormous: a training sample (observation/example) is x_i = (x_{i1}, x_{i2}, x_{i3}, ..., x_{id}), where d is very large. An example of this is a document classification task, where the vocabulary contains 10,000 different words and each input variable is the number of occurrences of one of those words.
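To make this concrete, here is a minimal sketch of such a bag-of-words representation using scikit-learn's CountVectorizer; the tiny three-document corpus is invented for illustration and is not from the text:

```python
# A minimal sketch of the bag-of-words encoding described above.
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative corpus (assumption, not the book's data)
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = CountVectorizer()        # one input variable per distinct word
X = vectorizer.fit_transform(corpus)  # shape: (n_samples, d)

# Each row x_i is a vector of word counts; d grows with the vocabulary
# and can reach tens of thousands of features on a realistic corpus.
print(X.shape)
print(vectorizer.get_feature_names_out())
```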

This enormous number of input variables can be problematic, and sometimes a curse, because we have many input variables and few training samples to guide the learning procedure. To escape this curse of having an enormous number of input variables (the curse of dimensionality), data scientists use dimensionality reduction techniques to select a subset of the input variables. For example, in the text classification task they can do any of the following (see the sketch after this list):

  • Extracting relevant inputs (for instance, using a mutual information measure)
  • Principal component analysis (PCA)
  • Grouping (clustering) similar words (this uses a similarity measure)
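The following is a minimal sketch of these three approaches applied to word-count features; the data is synthetic and the parameter choices (100 selected words, 50 principal components, 20 word clusters) are illustrative assumptions, not values prescribed by the text:

```python
# A sketch of the three dimensionality reduction approaches listed above.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(200, 1000)).astype(float)  # 200 docs, d = 1000 word counts
y = rng.integers(0, 2, size=200)                      # binary class labels

# 1. Keep the k inputs with the highest mutual information with the label.
X_mi = SelectKBest(mutual_info_classif, k=100).fit_transform(X, y)

# 2. Project the inputs onto the leading principal components (PCA).
X_pca = PCA(n_components=50).fit_transform(X)

# 3. Group similar words: cluster the columns (one per word) and pool
#    the counts of each cluster into a single feature.
labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(X.T)
X_grouped = np.column_stack([X[:, labels == c].sum(axis=1) for c in range(20)])

print(X_mi.shape, X_pca.shape, X_grouped.shape)  # (200, 100) (200, 50) (200, 20)
```

Note that the mutual information scoring uses the labels y, whereas the PCA and clustering steps here are unsupervised and look only at the inputs.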