Upsampling time series data

In upsampling, the frequency of the time series is increased. As a result, the resampled index contains more time stamps than we have measurements. The main question is how to fill the entries in the series for which no measurement exists.

Let's start with hourly data for a single day:

>>> import numpy as np
>>> import pandas as pd
>>> rng = pd.date_range('4/29/2015 8:00', periods=10, freq='H')
>>> ts = pd.Series(np.random.randint(0, 100, len(rng)), index=rng)
>>> ts.head()
2015-04-29 08:00:00 30
2015-04-29 09:00:00 27
2015-04-29 10:00:00 54
2015-04-29 11:00:00 9
2015-04-29 12:00:00 48
Freq: H, dtype: int64

If we upsample to data points taken every 15 minutes, our time series will be extended with NaN values:

>>> ts.resample('15min').head()
2015-04-29 08:00:00 30
2015-04-29 08:15:00 NaN
2015-04-29 08:30:00 NaN
2015-04-29 08:45:00 NaN
2015-04-29 09:00:00 27
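
In recent pandas releases, resample returns a lazy Resampler object rather than a series, so the new 15-minute grid has to be materialized explicitly. The following is a minimal sketch of that step using asfreq; it reuses the first five values printed above in place of random data, and spells the hourly frequency as '60min' because the hourly alias differs between pandas releases ('H' versus 'h').

```python
import pandas as pd

# Hourly series built from the first five values shown in the text.
rng = pd.date_range("2015-04-29 08:00", periods=5, freq="60min")
ts = pd.Series([30, 27, 54, 9, 48], index=rng)

# asfreq() builds the 15-minute index and leaves every new slot as NaN.
upsampled = ts.resample("15min").asfreq()
print(upsampled.head())
```

Because NaN is a floating-point value, the upsampled series is stored as float64 even though the input holds integers.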

There are various ways to deal with missing values, which can be controlled by the fill_method keyword argument to resample. Values can be filled either forward or backward:

>>> ts.resample('15min', fill_method='ffill').head()
2015-04-29 08:00:00 30
2015-04-29 08:15:00 30
2015-04-29 08:30:00 30
2015-04-29 08:45:00 30
2015-04-29 09:00:00 27
Freq: 15T, dtype: int64
>>> ts.resample('15min', fill_method='bfill').head()
2015-04-29 08:00:00 30
2015-04-29 08:15:00 27
2015-04-29 08:30:00 27
2015-04-29 08:45:00 27
2015-04-29 09:00:00 27
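
Recent pandas releases no longer accept the fill_method keyword; the same forward and backward fills are available as methods on the Resampler object. A minimal sketch, again with the first five values from above standing in for the random data:

```python
import pandas as pd

rng = pd.date_range("2015-04-29 08:00", periods=5, freq="60min")
ts = pd.Series([30, 27, 54, 9, 48], index=rng)

# Forward fill: each new slot repeats the last known measurement.
forward = ts.resample("15min").ffill()

# Backward fill: each new slot takes the next known measurement.
backward = ts.resample("15min").bfill()

print(forward.head())
print(backward.head())
```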

With the limit parameter, it is possible to control the number of missing values to be filled:

>>> ts.resample('15min', fill_method='ffill', limit=2).head()
2015-04-29 08:00:00 30
2015-04-29 08:15:00 30
2015-04-29 08:30:00 30
2015-04-29 08:45:00 NaN
2015-04-29 09:00:00 27
Freq: 15T, dtype: float64
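
With the method-based API of recent pandas releases, limit is passed to the fill method itself. The sketch below fills at most two consecutive missing values, so the third new slot in each hour stays NaN; the sample values are the first five from above.

```python
import pandas as pd

rng = pd.date_range("2015-04-29 08:00", periods=5, freq="60min")
ts = pd.Series([30, 27, 54, 9, 48], index=rng)

# Fill at most two consecutive gaps after each measurement.
limited = ts.resample("15min").ffill(limit=2)
print(limited.head())
```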

If you want to adjust the labels during resampling, you can use the loffset keyword argument:

>>> ts.resample('15min', fill_method='ffill', limit=2, loffset='5min').head()
2015-04-29 08:05:00 30
2015-04-29 08:20:00 30
2015-04-29 08:35:00 30
2015-04-29 08:50:00 NaN
2015-04-29 09:05:00 27
Freq: 15T, dtype: float64
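
The loffset keyword has been removed from recent pandas releases. One way to get the same labels, sketched here under that assumption, is to resample first and then add a fixed offset to the resulting index:

```python
import pandas as pd

rng = pd.date_range("2015-04-29 08:00", periods=5, freq="60min")
ts = pd.Series([30, 27, 54, 9, 48], index=rng)

shifted = ts.resample("15min").ffill(limit=2)

# Move every label five minutes later; the values are unchanged.
shifted.index = shifted.index + pd.Timedelta(minutes=5)
print(shifted.head())
```

Only the labels move: the value reported at 08:05 is still the measurement taken at 08:00.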

There is another way to fill in missing values. We could employ an algorithm to construct new data points that would somehow fit the existing points, for some definition of somehow. This process is called interpolation.

We can ask Pandas to interpolate a time series for us:

>>> tsx = ts.resample('15min')
>>> tsx.interpolate().head()
2015-04-29 08:00:00 30.00
2015-04-29 08:15:00 29.25
2015-04-29 08:30:00 28.50
2015-04-29 08:45:00 27.75
2015-04-29 09:00:00 27.00
Freq: 15T, dtype: float64
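
The interpolated values can be checked by hand: between 30 at 08:00 and 27 at 09:00 there are four equal steps of (27 - 30) / 4 = -0.75. A minimal sketch that reproduces the numbers above in recent pandas releases, where the upsampled grid is materialized with asfreq before interpolating:

```python
import pandas as pd

rng = pd.date_range("2015-04-29 08:00", periods=5, freq="60min")
ts = pd.Series([30, 27, 54, 9, 48], index=rng)

# Build the 15-minute grid with NaN gaps, then fill the gaps linearly.
interpolated = ts.resample("15min").asfreq().interpolate()
print(interpolated.head())

# Manual check of the first gap: one quarter of the way from 30 to 27.
first_step = 30 + (27 - 30) * 0.25
print(first_step)
```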

This is the default interpolation method, linear interpolation, in action: Pandas assumes that the values change linearly between two neighboring existing points.

Pandas supports over a dozen interpolation functions, some of which require the scipy library to be installed. We will not cover interpolation methods in this chapter, but we encourage you to explore the various methods yourself. The right interpolation method will depend on the requirements of your application.
