
Working with missing data

In this section, we discuss missing (NaN, or null) values in Pandas data structures. Missing data turns up very often in practice. One operation that creates it is reindexing: any row or column label that does not exist in the original object is filled with NaN:

>>> import numpy as np
>>> import pandas as pd
>>> df8 = pd.DataFrame(np.arange(12).reshape(4, 3),
...                    columns=['a', 'b', 'c'])
>>> df8
   a   b   c
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
>>> df9 = df8.reindex(columns=['a', 'b', 'c', 'd'])
>>> df9
   a   b   c   d
0  0   1   2 NaN
1  3   4   5 NaN
2  6   7   8 NaN
3  9  10  11 NaN
>>> df10 = df8.reindex([3, 2, 'a', 0])
>>> df10
     a     b     c
3  9.0  10.0  11.0
2  6.0   7.0   8.0
a  NaN   NaN   NaN
0  0.0   1.0   2.0
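
A side effect of reindexing is worth checking: NaN is a floating-point value, so an integer column that gains a missing entry is converted to a float column. The following minimal sketch rebuilds the frames above so that it runs on its own and inspects the column dtypes before and after:

```python
import numpy as np
import pandas as pd

df8 = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['a', 'b', 'c'])
df10 = df8.reindex([3, 2, 'a', 0])   # the label 'a' does not exist in df8

# Integer columns before reindexing, float columns once NaN is introduced
print(df8.dtypes)
print(df10.dtypes)
```
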

To detect missing values, we can use the isnull() or notnull() functions. Both work on a Series object as well as on a DataFrame object and return a Boolean mask of the same shape:

>>> df10.isnull()
       a      b      c
3  False  False  False
2  False  False  False
a   True   True   True
0  False  False  False
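
As a small sketch of how such a mask is typically used: True counts as 1, so summing the mask gives the number of missing entries per column, and notnull() can be used to keep only the rows that have a value. The frames are rebuilt here so the snippet is self-contained, and the names missing_per_column and complete_rows are ours, chosen for illustration:

```python
import numpy as np
import pandas as pd

df8 = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['a', 'b', 'c'])
df10 = df8.reindex([3, 2, 'a', 0])

# True counts as 1, so summing the mask counts missing entries per column
missing_per_column = df10.isnull().sum()

# notnull() is the inverse mask; keep the rows where column 'a' has a value
complete_rows = df10[df10['a'].notnull()]

print(missing_per_column)
print(complete_rows)
```
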

On a Series, we can drop all null values, together with their index labels, by using the dropna function:

>>> s4 = pd.Series({'001': 'Nam', '002': 'Mary',
...                 '003': 'Peter'},
...                index=['002', '001', '024', '065'])
>>> s4
002    Mary
001     Nam
024     NaN
065     NaN
dtype: object
>>> s4.dropna()    # drop all null values from the Series
002    Mary
001     Nam
dtype: object

With a DataFrame object, it is a little more complex than with a Series. We can specify whether to drop rows or columns (the axis parameter) and whether every entry must be null or a single null value is enough (the how parameter). By default, the function drops any row containing a missing value:

>>> df9.dropna()    # all rows will be dropped
Empty DataFrame
Columns: [a, b, c, d]
Index: []
>>> df9.dropna(axis=1)
   a   b   c
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
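
The choice between "a single null is enough" and "all entries must be null" can be sketched as follows. The small frame df is made up for illustration and is not one of the objects used above; how, thresh, and subset are standard dropna parameters:

```python
import numpy as np
import pandas as pd

# Illustrative frame: row 0 is complete, row 1 is entirely null,
# row 2 has one null value, and row 3 has two
df = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan],
                   'b': [4.0, np.nan, np.nan, np.nan],
                   'c': [7.0, np.nan, 9.0, 10.0]})

any_null = df.dropna()                # default how='any': drop rows with any null
all_null = df.dropna(how='all')       # drop only rows where every entry is null
at_least_two = df.dropna(thresh=2)    # keep rows with at least 2 non-null values
by_subset = df.dropna(subset=['c'])   # only look at column 'c' when deciding

print(any_null.index.tolist())
print(all_null.index.tolist())
print(at_least_two.index.tolist())
print(by_subset.index.tolist())
```
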

Another way to handle missing values is to use the parameters of the functions that we introduced in the previous section, such as the fill_value parameter of reindex. In our experience, it is helpful to assign a fixed value to the missing cases when we create data objects, because this keeps the objects cleaner in later processing steps. For example, consider the following:

>>> df11 = df8.reindex([3, 2, 'a', 0], fill_value=0)
>>> df11
   a   b   c
3  9  10  11
2  6   7   8
a  0   0   0
0  0   1   2

We can also use the fillna function to replace missing values with a custom value:

>>> df9.fillna(-1)
   a   b   c    d
0  0   1   2 -1.0
1  3   4   5 -1.0
2  6   7   8 -1.0
3  9  10  11 -1.0
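
A single constant is not the only option. As a hedged sketch on a small made-up frame (df and the result names are ours), fillna also accepts a dictionary of per-column values or a Series such as the column means, and ffill carries the last valid value forward:

```python
import numpy as np
import pandas as pd

# Illustrative frame with gaps in both columns
df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 5.0, np.nan]})

per_column = df.fillna({'a': 0.0, 'b': -1.0})   # a different value per column
forward = df.ffill()                            # carry the last valid value forward
col_mean = df.fillna(df.mean())                 # fill each column with its own mean

print(per_column)
print(forward)
print(col_mean)
```

Note that a forward fill leaves a leading gap untouched, because there is no earlier value to carry forward.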