
Downsampling time series data

Downsampling reduces the number of samples in the data. During this reduction, we are able to apply aggregations over data points. Let's imagine a busy airport with thousands of people passing through every hour. The airport administration has installed a visitor counter in the main area, to get an impression of exactly how busy their airport is.

They are receiving data from the counter device every minute. Here are the hypothetical measurements for a day, beginning at 08:00 and running for 600 minutes, so the last reading arrives just before 18:00:

>>> rng = pd.date_range('4/29/2015 8:00', periods=600, freq='T')
>>> ts = pd.Series(np.random.randint(0, 100, len(rng)), index=rng)
>>> ts.head()
2015-04-29 08:00:00 9
2015-04-29 08:01:00 60
2015-04-29 08:02:00 65
2015-04-29 08:03:00 25
2015-04-29 08:04:00 19

To get a better picture of the day, we can downsample this time series to larger intervals, for example, 10 minutes. We can choose an aggregation function as well. The default aggregation is to take all the values and calculate the mean:

>>> ts.resample('10min').head()
2015-04-29 08:00:00 49.1
2015-04-29 08:10:00 56.0
2015-04-29 08:20:00 42.0
2015-04-29 08:30:00 51.9
2015-04-29 08:40:00 59.0
Freq: 10T, dtype: float64

In our airport example, we are also interested in the sum of the values, that is, the combined number of visitors for a given time frame. We can choose the aggregation function by passing a function or a function name to the how parameter:

>>> ts.resample('10min', how='sum').head()
2015-04-29 08:00:00 442
2015-04-29 08:10:00 409
2015-04-29 08:20:00 532
2015-04-29 08:30:00 433
2015-04-29 08:40:00 470
Freq: 10T, dtype: int64

Or we can reduce the number of samples even further by resampling to an hourly interval:

>>> ts.resample('1h', how='sum').head()
2015-04-29 08:00:00 2745
2015-04-29 09:00:00 2897
2015-04-29 10:00:00 3088
2015-04-29 11:00:00 2616
2015-04-29 12:00:00 2691
Freq: H, dtype: int64

We can ask for other things as well. For example, what was the maximum number of people that passed through our airport within one hour:

>>> ts.resample('1h', how='max').head()
2015-04-29 08:00:00 97
2015-04-29 09:00:00 98
2015-04-29 10:00:00 99
2015-04-29 11:00:00 98
2015-04-29 12:00:00 99
Freq: H, dtype: int64

Or we can define a custom function if we are interested in more unusual metrics. For example, we could be interested in selecting a random sample for each hour:

>>> import random
>>> ts.resample('1h', how=lambda m: random.choice(m)).head()
2015-04-29 08:00:00 28
2015-04-29 09:00:00 14
2015-04-29 10:00:00 68
2015-04-29 11:00:00 31
2015-04-29 12:00:00 5 

If you specify an aggregation function by its string name, Pandas uses a highly optimized version of it.
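
As a rough check of that claim, here is a minimal sketch (assuming the ts series created above and the standard library's timeit module; output is omitted because timings vary by machine). The string name is dispatched to the built-in aggregation, while an equivalent Python lambda is called once per bucket and is usually noticeably slower:

>>> import timeit
>>> # string name: dispatched to the optimized built-in aggregation
>>> timeit.timeit(lambda: ts.resample('10min', how='mean'), number=100)
>>> # same result via a plain Python callable, usually slower
>>> timeit.timeit(lambda: ts.resample('10min', how=lambda m: m.mean()), number=100)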

The built-in functions that can be used as an argument to how are: sum, mean, std, sem, max, min, median, first, last, and ohlc. The ohlc metric is popular in finance. It stands for open-high-low-close. An OHLC chart is a typical way to illustrate movements in the price of a financial instrument over time.

While this metric might not be that valuable for our airport, we can compute it nonetheless:

>>> ts.resample('1h', how='ohlc').head()
                     open  high  low  close
2015-04-29 08:00:00     9    97    0     14
2015-04-29 09:00:00    68    98    3     12
2015-04-29 10:00:00    71    99    1      1
2015-04-29 11:00:00    59    98    0      4
2015-04-29 12:00:00    56    99    3     55
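
If you are following along with a newer pandas release (0.18 or later), note that the how keyword has been deprecated in favor of calling aggregation methods on the object returned by resample(). As a sketch, the examples above translate roughly as follows:

>>> ts.resample('10min').mean().head()   # default mean from the first example
>>> ts.resample('10min').sum().head()    # how='sum'
>>> ts.resample('1h').max().head()       # how='max'
>>> ts.resample('1h').ohlc().head()      # how='ohlc'
>>> ts.resample('1h').apply(lambda m: random.choice(m.tolist())).head()  # custom function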