官术网_书友最值得收藏!

  • Machine Learning with Swift
  • Alexander Sosnovshchenko
  • 184字
  • 2021-06-24 18:54:56

One-hot encoding

Most of the machine learning algorithms can't work with the categorical variables, so usually we want to convert them to the one-hot vectors (statisticians prefer to call them dummy variables). Let's convert first, and then I will explain what this is:

In []: 
features = pd.get_dummies(features, columns = ['color']) 
features.head() 
Out[]: 

So now, instead of one column, color, we have four columns: color_light black, color_pink gold, color_purple polka dot, and color_space gray. The color of each sample is encoded as 1 in the corresponding column. Why do we need this if we could simply replace colors with the numbers from 1 to 4? Well, this is the problem: why to prefer 1 to 4 over the 4 to 1, or powers of 2, or prime numbers? These colors on their own don't carry any quantitative information associated to them. They can't be sorted from the largest to the smallest. If we introduce this information artificially, the machine learning algorithm may attempt to utilize that meaningless information, and we will end up with the classifier that sees regularities where there are none.

主站蜘蛛池模板: 措勤县| 安新县| 乌兰浩特市| 陆河县| 新绛县| 衡阳县| 铜川市| 阿巴嘎旗| 滕州市| 大安市| 如皋市| 湘阴县| 邢台县| 泌阳县| 阿坝县| 新闻| 阿坝| 蒙自县| 田阳县| 永康市| 长兴县| 鹤壁市| 五指山市| 巫溪县| 富顺县| 靖江市| 米易县| 鄂温| 彩票| 乐清市| 泗洪县| 昆明市| 赣榆县| 旌德县| 兰州市| 浑源县| 汪清县| 太仓市| 平凉市| 浦江县| 东山县|