
Downsampling time series data

Downsampling reduces the number of samples in the data. As part of this reduction, we can apply an aggregation over the data points that fall into each new interval. Let's imagine a busy airport with thousands of people passing through every hour. The airport administration has installed a visitor counter in the main area to get an impression of exactly how busy the airport is.

They are receiving data from the counter device every minute. Here are the hypothetical measurements for one day, consisting of 600 one-minute readings that begin at 08:00 and run until just before 18:00:

>>> import numpy as np
>>> import pandas as pd
>>> rng = pd.date_range('4/29/2015 8:00', periods=600, freq='T')
>>> ts = pd.Series(np.random.randint(0, 100, len(rng)), index=rng)
>>> ts.head()
2015-04-29 08:00:00 9
2015-04-29 08:01:00 60
2015-04-29 08:02:00 65
2015-04-29 08:03:00 25
2015-04-29 08:04:00 19

To get a better picture of the day, we can downsample this time series to larger intervals, for example, 10 minutes. We can choose an aggregation function as well. The default aggregation is to take all the values that fall into an interval and calculate their mean:

>>> ts.resample('10min').head()
2015-04-29 08:00:00 49.1
2015-04-29 08:10:00 56.0
2015-04-29 08:20:00 42.0
2015-04-29 08:30:00 51.9
2015-04-29 08:40:00 59.0
Freq: 10T, dtype: float64
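
As a quick sanity check on what the default aggregation does, the first resampled value should simply be the mean of the first ten one-minute readings (49.1 in the run shown above). A minimal sketch:

>>> # Mean of the first ten one-minute readings; in the run above this
>>> # matches the first value of the 10-minute resample, 49.1.
>>> ts.head(10).mean()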

In our airport example, we are also interested in the sum of the values, that is, the combined number of visitors for a given time frame. We can choose the aggregation function by passing a function or a function name to the how parameter:

>>> ts.resample('10min', how='sum').head()
2015-04-29 08:00:00 442
2015-04-29 08:10:00 409
2015-04-29 08:20:00 532
2015-04-29 08:30:00 433
2015-04-29 08:40:00 470
Freq: 10T, dtype: int64

Or we can reduce the number of samples even further by resampling to an hourly interval:

>>> ts.resample('1h', how='sum').head()
2015-04-29 08:00:00 2745
2015-04-29 09:00:00 2897
2015-04-29 10:00:00 3088
2015-04-29 11:00:00 2616
2015-04-29 12:00:00 2691
Freq: H, dtype: int64

We can ask for other things as well, for example, the maximum number of visitors counted in any single minute of each hour:

>>> ts.resample('1h', how='max').head()
2015-04-29 08:00:00 97
2015-04-29 09:00:00 98
2015-04-29 10:00:00 99
2015-04-29 11:00:00 98
2015-04-29 12:00:00 99
Freq: H, dtype: int64

Or we can define a custom function if we are interested in more unusual metrics. For example, we could be interested in selecting a random measurement from each hour:

>>> import random
>>> ts.resample('1h', how=lambda m: random.choice(m)).head()
2015-04-29 08:00:00 28
2015-04-29 09:00:00 14
2015-04-29 10:00:00 68
2015-04-29 11:00:00 31
2015-04-29 12:00:00 5 
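
Any callable that reduces the values of an interval to a single number can be passed this way. As a further sketch (the visitor_range helper below is our own illustration, not part of pandas), we could look at the spread between the busiest and the quietest minute of each hour:

>>> def visitor_range(values):
...     # spread between the busiest and the quietest minute of the hour
...     return values.max() - values.min()
...
>>> ts.resample('1h', how=visitor_range).head()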

If you specify the aggregation function by its string name, pandas uses a highly optimized implementation.
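
In other words, how='mean' and a callable such as np.mean compute the same aggregation, but the string form takes the faster code path. A small sketch of the two spellings (assuming numpy is imported as np, as above):

>>> fast = ts.resample('1h', how='mean')    # string name: optimized code path
>>> slow = ts.resample('1h', how=np.mean)   # callable: applied group by group
>>> np.allclose(fast, slow)                 # same numbers either way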

The built-in functions that can be used as an argument to how are: sum, mean, std, sem, max, min, median, first, last, and ohlc. The ohlc metric is popular in finance. It stands for open-high-low-close. An OHLC chart is a typical way to illustrate movements in the price of a financial instrument over time.

While this metric might not be particularly valuable for our airport, we can compute it nonetheless:

>>> ts.resample('1h', how='ohlc').head()
                     open  high  low  close
2015-04-29 08:00:00     9    97    0     14
2015-04-29 09:00:00    68    98    3     12
2015-04-29 10:00:00    71    99    1      1
2015-04-29 11:00:00    59    98    0      4
2015-04-29 12:00:00    56    99    3     55
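
The four columns are simply the first, highest, lowest, and last reading within each hour; note that the 'open' value for 08:00, 9, is exactly the first measurement of the day shown by ts.head() earlier. As a sketch, the open and close columns could also be obtained on their own with the first and last aggregations:

>>> ts.resample('1h', how='first').head()   # reproduces the 'open' column
>>> ts.resample('1h', how='last').head()    # reproduces the 'close' column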