官术网_书友最值得收藏!

Handling missing values

We regularly encounter empty fields in data records. It's best that we accept this and learn how to handle this kind of issue in a robust manner. Real data can not only have gaps-it can also have wrong values, because of faulty measuring equipment, for example. In Pandas, missing numerical values will be designated as NaN, objects as None, and the datetime64 objects as NaT. The outcome of arithmetic operations with NaN values is also NaN. Descriptive statistics methods, such as summation and average, behave differently. As we observed in an earlier example, in such a case, NaN values are treated as zero values. However, if all the values are NaN during, say, summation, the sum returned is still NaN. In aggregation operations, NaN values in the column that we group are ignored. We will again load the WHO_first9cols.csv file into a DataFrame. Remember that this file contains empty fields. Let's only select the first three rows, including the headers of the Country and Net primary school enrolment ratio male (%) columns as follows:

df = df[['Country', df.columns[-2]]][:2] 
print("New df\n", df) 

We get a DataFrame with two NaN values:

New df
       Country  Net primary school enrolment ratio male (%)
0  Afghanistan                                          NaN
1      Albania                                           94

[2 rows x 2 columns]

The Pandas isnull() function checks for missing values as follows:

print("Null Values\n", pd.isnull(df)) 

The output for our DataFrame is as follows:

Null Values
  Country Net primary school enrolment ratio male (%)
0   False                                        True
1   False                                       False

To count the number of NaN values for each column, we can sum the Boolean values returned by isnull(). This works because, during summation, True values are considered as ones and False values are treated as zeros:

Total Null Values
Country                                        0
Net primary school enrolment ratio male (%)    1
dtype: int64

Likewise, we can check with the DataFrame notnull() method for any non-missing values that are present:

print("Not Null Values\n", df.notnull()) 

The result of the notnull() method is the opposite of the isnull() function:

Not Null Values
  Country Net primary school enrolment ratio male (%)
0    True                                       False
1    True                                        True

When we double values in a DataFrame that has NaN values, the product will still contain NaN values, since doubling is an arithmetic operation:

print("Last Column Doubled\n", 2 * df[df.columns[-1]]) 

We double the last column, which contains numerical values (doubling string values repeats the string):

Last Column Doubled
0    NaN
1    188
Name: Net primary school enrolment ratio male (%), dtype: float64

If we add a NaN value, however, the NaN value wins:

print("Last Column plus NaN\n", df[df.columns[-1]] + np.nan) 

As you can see, the NaN values declared total victory:

Last Column plus NaN
0   NaN1   NaN
Name: Net primary school enrolment ratio male (%), dtype: float64

Replace the missing values with a scalar value. For example, replace 0 with the fillna() method (we can't always replace missing values with zeros, but sometimes this is good enough):

print("Zero filled\n", df.fillna(0)) 

The effect of the preceding line is to replace the NaN value with 0:

Zero filled
       Country  Net primary school enrolment ratio male (%)
0  Afghanistan                                            0
1      Albania                                           94

The code for this section is in the ch-03.ipynb file of this book's code bundle:

主站蜘蛛池模板: 沧源| 桃园县| 阿瓦提县| 青州市| 万安县| 巩义市| 乌拉特后旗| 苗栗县| 富裕县| 抚州市| 原平市| 天台县| 湘西| 罗山县| 兰坪| 滕州市| 太白县| 鸡泽县| 阿拉善左旗| 石屏县| 习水县| 佛坪县| 龙里县| 重庆市| 察隅县| 普安县| 炉霍县| 鄂尔多斯市| 山阳县| 屯留县| 旬阳县| 潜山县| 岚皋县| 五莲县| 通州市| 富平县| 滕州市| 富宁县| 启东市| 阿鲁科尔沁旗| 定安县|