
Creating dummy variables

Creating dummy variables is a method of creating a separate variable for each category of a categorical variable. Although a categorical variable contains plenty of information and might show a causal relationship with the output variable, it can't be used in predictive models such as linear and logistic regression without some processing.

In our dataset, sex is a categorical variable with two categories: male and female. We can create two dummy variables from it, as follows:

dummy_sex=pd.get_dummies(data['sex'],prefix='sex')

The result of this statement is as follows:

Fig. 2.17: Dummy variable for the sex variable in the Titanic dataset

This process is called dummifying the variable. It creates two new variables, sex_female and sex_male, each of which takes a value of either 1 or 0 depending on the sex of the passenger. If the sex was female, sex_female is 1 and sex_male is 0. If the sex was male, sex_male is 1 and sex_female is 0. In general, all but one of the dummy variables in a row will have a value of 0; the one corresponding to the value in the original column for that row will have a value of 1.
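The one-indicator-per-row behaviour described above can be checked on a small example. The following is a minimal sketch rather than the book's own code: the toy data frame and the names toy, toy_dummies, and row_sums are made up for illustration. It also passes dtype=int, because recent pandas versions return True/False instead of 1/0 by default.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'sex' column of the Titanic dataset.
toy = pd.DataFrame({'sex': ['male', 'female', 'female', 'male']})

# One indicator column per category; dtype=int requests 0/1 values
# (newer pandas versions default to booleans).
toy_dummies = pd.get_dummies(toy['sex'], prefix='sex', dtype=int)

# Summing across each row counts how many indicators are set for that row.
row_sums = toy_dummies.sum(axis=1)

print(toy_dummies)
print(row_sums)
```

Every entry of row_sums should be 1, since each passenger falls into exactly one category.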

These two new variables can be joined to the source data frame so that they can be used in the models. The method to do that is illustrated as follows:

column_name=data.columns.values.tolist()
column_name.remove('sex')
data[column_name].join(dummy_sex)

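The column names are converted to a list, and sex is removed from that list before the two dummy variables are joined to the dataset, because it makes no sense to keep the sex variable alongside its two dummy variables. Note that join returns a new data frame rather than modifying data in place, so assign the result to a variable if you want to keep it.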
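The joining step can be sketched end to end on a small data frame. This is an illustration under assumptions, not the book's code: the toy data and the names toy, kept, joined, and alt are hypothetical. It follows the list-based approach from the text and then shows an equivalent form using drop, which some readers may find shorter.

```python
import pandas as pd

# Hypothetical toy data; the real Titanic data frame has more columns.
toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'sex': ['male', 'female', 'male'],
    'age': [22, 38, 26],
})
toy_dummies = pd.get_dummies(toy['sex'], prefix='sex', dtype=int)

# Approach from the text: list the columns, remove 'sex', then join.
kept = toy.columns.values.tolist()
kept.remove('sex')
joined = toy[kept].join(toy_dummies)  # join returns a new data frame

# Equivalent alternative: drop the original column, then join.
alt = toy.drop(columns='sex').join(toy_dummies)

print(joined)
```

Both joined and alt contain the name and age columns followed by sex_female and sex_male, while toy itself is left unchanged.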
