Upsampling time series data

In upsampling, the frequency of the time series is increased. As a result, the new index contains more time points than we have measurements. One of the main questions is how to account for the entries in the series where we have no measurement.

Let's start with hourly data for a single day:

>>> import numpy as np
>>> import pandas as pd
>>> rng = pd.date_range('4/29/2015 8:00', periods=10, freq='H')
>>> ts = pd.Series(np.random.randint(0, 100, len(rng)), index=rng)
>>> ts.head()
2015-04-29 08:00:00 30
2015-04-29 09:00:00 27
2015-04-29 10:00:00 54
2015-04-29 11:00:00 9
2015-04-29 12:00:00 48
Freq: H, dtype: int64

If we upsample to data points taken every 15 minutes, our time series will be extended with NaN values:

>>> ts.resample('15min').asfreq().head()
2015-04-29 08:00:00 30.0
2015-04-29 08:15:00 NaN
2015-04-29 08:30:00 NaN
2015-04-29 08:45:00 NaN
2015-04-29 09:00:00 27.0
Freq: 15T, dtype: float64
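
To make the effect of upsampling concrete, the following minimal sketch counts how many slots on the finer grid have no measurement. It assumes a recent pandas release, in which `resample` returns an intermediate object and `asfreq()` materializes the new grid without filling anything in; the three values are made up for illustration.

```python
import pandas as pd

# Three hourly readings (illustrative values, not taken from the text).
rng_small = pd.date_range('2015-04-29 08:00', periods=3, freq='60min')
ts_small = pd.Series([30, 27, 54], index=rng_small)

# asfreq() builds the 15-minute grid and leaves unknown slots as NaN.
up = ts_small.resample('15min').asfreq()

print(len(ts_small), len(up))   # number of measurements vs. slots on the new grid
print(int(up.isna().sum()))     # number of slots without a measurement
```

Two hours at 15-minute resolution give nine slots, of which only three carry an original reading.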

There are various ways to deal with missing values. Values can be filled either forward or backward by calling ffill or bfill on the resampled object:

>>> ts.resample('15min').ffill().head()
2015-04-29 08:00:00 30
2015-04-29 08:15:00 30
2015-04-29 08:30:00 30
2015-04-29 08:45:00 30
2015-04-29 09:00:00 27
Freq: 15T, dtype: int64
>>> ts.resample('15min').bfill().head()
2015-04-29 08:00:00 30
2015-04-29 08:15:00 27
2015-04-29 08:30:00 27
2015-04-29 08:45:00 27
2015-04-29 09:00:00 27
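
The difference between the two fill directions is easiest to see side by side. The sketch below assumes a recent pandas release, where filling is done by calling `ffill()` or `bfill()` on the resampled object; the two readings are made up for illustration.

```python
import pandas as pd

# Two hourly readings (illustrative values).
rng_pair = pd.date_range('2015-04-29 08:00', periods=2, freq='60min')
ts_pair = pd.Series([30, 27], index=rng_pair)

# Forward fill repeats the last known value; backward fill uses the next one.
filled = pd.DataFrame({
    'ffill': ts_pair.resample('15min').ffill(),
    'bfill': ts_pair.resample('15min').bfill(),
})
print(filled)
```

Forward filling carries 30 up to 08:45, whereas backward filling already reports 27 from 08:15 onward.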

With the limit parameter, it is possible to control the number of missing values to be filled:

>>> ts.resample('15min').ffill(limit=2).head()
2015-04-29 08:00:00 30.0
2015-04-29 08:15:00 30.0
2015-04-29 08:30:00 30.0
2015-04-29 08:45:00 NaN
2015-04-29 09:00:00 27.0
Freq: 15T, dtype: float64
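
As a small illustration of how the limit bounds each run of filled values, the sketch below compares two settings. It assumes a recent pandas release, in which `limit` is passed to `ffill()` on the resampled object; the data is made up.

```python
import pandas as pd

# Two hourly readings (illustrative values).
rng_lim = pd.date_range('2015-04-29 08:00', periods=2, freq='60min')
ts_lim = pd.Series([30, 27], index=rng_lim)

# limit caps how many consecutive missing slots are filled after a known value.
one = ts_lim.resample('15min').ffill(limit=1)
two = ts_lim.resample('15min').ffill(limit=2)

print(one.isna().tolist())   # 08:30 and 08:45 stay missing
print(two.isna().tolist())   # only 08:45 stays missing
```

Because unfilled slots remain NaN, the result is a float series even though the input held integers.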

If you want to adjust the labels of the resampled series, you can add an offset to the index of the result:

>>> shifted = ts.resample('15min').ffill(limit=2)
>>> shifted.index = shifted.index + pd.Timedelta('5min')
>>> shifted.head()
2015-04-29 08:05:00 30.0
2015-04-29 08:20:00 30.0
2015-04-29 08:35:00 30.0
2015-04-29 08:50:00 NaN
2015-04-29 09:05:00 27.0
Freq: 15T, dtype: float64
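
One way to move the labels that does not depend on any keyword argument of `resample` is to add an offset to the index of the result. This is a minimal sketch under that assumption, with made-up data and a hypothetical variable name `moved`.

```python
import pandas as pd

# Two hourly readings (illustrative values).
rng_off = pd.date_range('2015-04-29 08:00', periods=2, freq='60min')
ts_off = pd.Series([30, 27], index=rng_off)

moved = ts_off.resample('15min').ffill()

# Shift every label by five minutes; the values themselves are untouched.
moved.index = moved.index + pd.Timedelta('5min')
print(moved)
```

Only the timestamps change: the first label becomes 08:05 while the series still starts with 30.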

There is another way to fill in missing values. We could employ an algorithm to construct new data points that would somehow fit the existing points, for some definition of somehow. This process is called interpolation.

We can ask Pandas to interpolate a time series for us:

>>> tsx = ts.resample('15min')
>>> tsx.interpolate().head()
2015-04-29 08:00:00 30.00
2015-04-29 08:15:00 29.25
2015-04-29 08:30:00 28.50
2015-04-29 08:45:00 27.75
2015-04-29 09:00:00 27.00
Freq: 15T, dtype: float64

We saw the default interpolate method, linear interpolation, in action. Pandas assumes a linear relationship between two existing points and places the new values on the straight line connecting them.
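
The interpolated values can be reproduced by hand, which shows what the linear assumption means in practice. The sketch below uses the two endpoints 30 and 27 from the output above and compares the result of pandas with a direct evaluation of the straight line; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# The two known points from the example: 30 at 08:00 and 27 at 09:00.
rng_lin = pd.date_range('2015-04-29 08:00', periods=2, freq='60min')
ts_lin = pd.Series([30, 27], index=rng_lin)

interp = ts_lin.resample('15min').interpolate()

# Straight line between the endpoints, evaluated at quarter-hour steps.
by_hand = [30 + (27 - 30) * k / 4 for k in range(5)]

print(interp.tolist())
print(by_hand)
print(np.allclose(interp.to_numpy(), by_hand))
```

Each 15-minute step moves a quarter of the way from 30 to 27, that is, 0.75 per step.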

Pandas supports over a dozen interpolation functions, some of which require the scipy library to be installed. We will not cover interpolation methods in this chapter, but we encourage you to explore the various methods yourself. The right interpolation method will depend on the requirements of your application.
