
Feature selection

The number of explanatory features (input variables) describing a sample can be enormous: a training sample (observation/example) takes the form xi = (xi1, xi2, xi3, ..., xid), where d is very large. An example of this is a document classification task, where you might have 10,000 distinct words and the input variables are the numbers of occurrences of those words in each document.
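As a concrete illustration of this representation, the sketch below turns two toy documents into word-count vectors. The use of scikit-learn's CountVectorizer here is an assumption for illustration; any bag-of-words tooling would do.

```python
# Minimal sketch: documents as word-count vectors (bag of words).
# Tooling choice (scikit-learn) is an assumption, not the book's own code.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# Each column counts occurrences of one distinct word across the corpus;
# with thousands of documents the vocabulary (d) quickly grows very large.
print(X.toarray().shape)
```

With a real corpus the column count would be the full vocabulary size (for example, 10,000), which is exactly the setting where dimensionality becomes a problem.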

This enormous number of input variables can be problematic, and sometimes a curse, because we have many input variables and few training samples to help us in the learning procedure. To avoid this curse of having an enormous number of input variables (the curse of dimensionality), data scientists use dimensionality reduction techniques to select a subset of the input variables. For example, in the text classification task they can do the following:

  • Extracting relevant inputs (for instance, using the mutual information measure)
  • Applying principal component analysis (PCA)
  • Grouping (clustering) similar words using a similarity measure
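The first two ideas above can be sketched briefly on synthetic data. This is a minimal illustration assuming scikit-learn; the feature count, k, and number of components are arbitrary choices, not values from the text.

```python
# Minimal sketch of two dimensionality-reduction approaches:
# (1) selecting inputs by mutual information, (2) projecting with PCA.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy "bag of words": 100 documents, 50 word-count features.
X = rng.integers(0, 5, size=(100, 50)).astype(float)
# Labels depend only on the first 3 features, so only those carry signal.
y = (X[:, :3].sum(axis=1) > 6).astype(int)

# 1) Extract relevant inputs: keep the 10 features with the highest
#    mutual information with the class label.
selector = SelectKBest(mutual_info_classif, k=10)
X_mi = selector.fit_transform(X, y)

# 2) PCA: project onto the 10 directions of greatest variance.
X_pca = PCA(n_components=10).fit_transform(X)

print(X_mi.shape, X_pca.shape)
```

Both reduce the 50 input variables to 10, but differently: mutual information keeps a subset of the original word-count features (and uses the labels), while PCA builds new features as linear combinations of all of them (and ignores the labels).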