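Copyright notice: this is an original article by the author, licensed under CC 4.0 BY-SA. When republishing, please include a link to the original source and this notice.
Article link: https://www.knowledgedict.com/tutorial/pandas-aggregations.html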

Pandas Aggregations


Once a rolling, expanding, or ewm object has been created, several methods are available for aggregating the data.
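As a rough illustration of the three window objects mentioned above, the following sketch builds each one on a small hand-made Series and applies one aggregation. The values, the window size of 2, and alpha=0.5 are assumptions chosen only so the results are easy to verify by hand.

```python
import pandas as pd

# Small deterministic Series (illustrative values, not from the article).
s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Rolling: aggregate over the most recent 2 observations.
rolling_sum = s.rolling(window=2, min_periods=1).sum()

# Expanding: aggregate over every observation seen so far.
expanding_sum = s.expanding(min_periods=1).sum()

# EWM: exponentially weighted mean; adjust=False uses the simple recursion
# y[t] = (1 - alpha) * y[t-1] + alpha * x[t].
ewm_mean = s.ewm(alpha=0.5, adjust=False).mean()

print(rolling_sum.tolist())
print(expanding_sum.tolist())
print(ewm_mean.tolist())
```

The rest of this tutorial uses only the rolling object, but the aggregation calls shown later are applied in the same way to expanding objects.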

Applying Aggregations to a DataFrame

Let's create a DataFrame and apply aggregations to it.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
      index = pd.date_range('1/1/2019', periods=10),
      columns = ['A', 'B', 'C', 'D'])

print (df)
print("=======================================")
r = df.rolling(window=3,min_periods=1)
print (r)

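Running the example above produces output similar to the following (the values differ on each run because the data is random):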

                   A         B         C         D
2019-01-01 -0.901602 -1.778484  0.728295 -0.758108
2019-01-02 -0.826162  0.994140  0.976164 -0.918249
2019-01-03  0.260841  0.905993  1.505967 -0.124883
2019-01-04 -0.112230 -0.111885  0.702712 -0.871768
2019-01-05 -0.239969  1.435918 -0.160140 -0.547702
2019-01-06 -0.126897 -2.628206 -0.280658  0.167422
2019-01-07  0.367903  0.994337 -0.529830  0.195990
2019-01-08 -0.530872 -0.384915 -0.397150 -0.024074
2019-01-09 -0.418925  0.049046 -0.816616  0.308107
2019-01-10 -0.176857  2.573145  0.010211 -1.427078
=======================================
Rolling [window=3,min_periods=1,center=False,axis=0]
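Because the DataFrame above is random, the effect of window=3 and min_periods=1 is hard to see from its values. The sketch below uses a small fixed column (an assumption made purely for illustration) to compare min_periods=1 with the default, which for an integer window equals the window size.

```python
import pandas as pd

# Fixed values so each window can be checked by hand (illustrative data).
df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0, 5.0]},
                  index=pd.date_range('2019-01-01', periods=5))

# min_periods=1: a result is produced as soon as the window holds one value.
loose = df.rolling(window=3, min_periods=1).sum()

# Default min_periods: the first two rows are NaN because the
# 3-row window is not yet full.
strict = df.rolling(window=3).sum()

print(loose)
print(strict)
```

With min_periods=1 the first rows are partial sums over fewer than three values, which is why the first row of each rolling result in this tutorial equals the first row of the input.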

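You can aggregate either by passing a function to the whole DataFrame, or by first selecting a column with the standard item-access ([]) syntax.

Applying an Aggregation to the Whole DataFrame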

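import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
      index = pd.date_range('1/1/2020', periods=10),
      columns = ['A', 'B', 'C', 'D'])
print (df)
print("============================================")
# Rolling window over 3 rows; min_periods=1 allows partial windows at the start
r = df.rolling(window=3,min_periods=1)
# Rolling sum of every column
print (r.aggregate(np.sum))

Running the example produces output similar to the following (the values differ on each run because the data is random):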

                   A         B         C         D
2020-01-01  1.069090 -0.802365 -0.323818 -1.994676
2020-01-02  0.190584  0.328272 -0.550378  0.559738
2020-01-03  0.044865  0.478342 -0.976129  0.106530
2020-01-04 -1.349188 -0.391635 -0.292740  1.412755
2020-01-05  0.057659 -1.331901 -0.297858 -0.500705
2020-01-06  2.651680 -1.459706 -0.726023  0.294283
2020-01-07  0.666481  0.679205 -1.511743  2.093833
2020-01-08 -0.284316 -1.079759  1.433632  0.534043
2020-01-09  1.115246 -0.268812  0.190440 -0.712032
2020-01-10 -0.121008  0.136952  1.279354  0.275773
============================================
                   A         B         C         D
2020-01-01  1.069090 -0.802365 -0.323818 -1.994676
2020-01-02  1.259674 -0.474093 -0.874197 -1.434938
2020-01-03  1.304539  0.004249 -1.850326 -1.328409
2020-01-04 -1.113739  0.414979 -1.819248  2.079023
2020-01-05 -1.246664 -1.245194 -1.566728  1.018580
2020-01-06  1.360151 -3.183242 -1.316621  1.206333
2020-01-07  3.375821 -2.112402 -2.535624  1.887411
2020-01-08  3.033846 -1.860260 -0.804134  2.922160
2020-01-09  1.497411 -0.669366  0.112329  1.915845
2020-01-10  0.709922 -1.211619  2.903427  0.097785
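The aggregate method also accepts a function name given as a string, and the result matches the dedicated method on the rolling object. The sketch below checks this on fixed data (the values are assumptions for illustration). Some newer pandas releases may emit a FutureWarning when a NumPy callable such as np.sum is passed, so the string form can be a quieter alternative.

```python
import pandas as pd

# Fixed values so the sums can be checked by hand (illustrative data).
df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0], 'B': [0.5, 0.5, 0.5, 0.5]},
                  index=pd.date_range('2020-01-01', periods=4))
r = df.rolling(window=3, min_periods=1)

by_name = r.aggregate('sum')   # function given by its string name
by_method = r.sum()            # dedicated method on the rolling object

print(by_name)
print(by_name.equals(by_method))
```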

Applying an Aggregation to a Single Column of a DataFrame

Example code:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
      index = pd.date_range('1/1/2000', periods=10),
      columns = ['A', 'B', 'C', 'D'])
print (df)
print("====================================")
r = df.rolling(window=3,min_periods=1)
print (r['A'].aggregate(np.sum))

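Running the example above produces output similar to the following (the values differ on each run because the data is random):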

                   A         B         C         D
2000-01-01 -1.095530 -0.415257 -0.446871 -1.267795
2000-01-02 -0.405793 -0.002723  0.040241 -0.131678
2000-01-03 -0.136526  0.742393 -0.692582 -0.271176
2000-01-04  0.318300 -0.592146 -0.754830  0.239841
2000-01-05 -0.125770  0.849980  0.685083  0.752720
2000-01-06  1.410294  0.054780  0.297992 -0.034028
2000-01-07  0.463223 -1.239204 -0.056420  0.440893
2000-01-08 -2.244446 -0.516937 -2.039601 -0.680606
2000-01-09  0.991139  0.026987 -2.391856  0.585565
2000-01-10  0.112228 -0.701284 -1.139827  1.484032
====================================
2000-01-01   -1.095530
2000-01-02   -1.501323
2000-01-03   -1.637848
2000-01-04   -0.224018
2000-01-05    0.056004
2000-01-06    1.602824
2000-01-07    1.747747
2000-01-08   -0.370928
2000-01-09   -0.790084
2000-01-10   -1.141079
Freq: D, Name: A, dtype: float64
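Selecting the column from the rolling object, as above, should give the same result as selecting the column from the DataFrame first and then creating the rolling object. The sketch below compares the two orders on fixed data (the values are assumptions for illustration).

```python
import pandas as pd

# Fixed values so the sums can be checked by hand (illustrative data).
df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0], 'B': [5.0, 6.0, 7.0, 8.0]},
                  index=pd.date_range('2000-01-01', periods=4))

# Column selected from the rolling object.
from_rolling = df.rolling(window=3, min_periods=1)['A'].sum()

# Column selected from the DataFrame before rolling.
from_column = df['A'].rolling(window=3, min_periods=1).sum()

print(from_rolling)
print(from_rolling.equals(from_column))
```

Either order returns a Series named A, so the choice is mostly a matter of whether the same rolling object is reused for several columns.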

Applying an Aggregation to Multiple Columns of a DataFrame

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
      index = pd.date_range('1/1/2018', periods=10),
      columns = ['A', 'B', 'C', 'D'])
print (df)
print ("==========================================")
r = df.rolling(window=3,min_periods=1)
print (r[['A','B']].aggregate(np.sum))

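Running the example above produces output similar to the following (the values differ on each run because the data is random):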

                   A         B         C         D
2018-01-01  0.518897  0.988917  0.435691 -1.005703
2018-01-02  1.793400  0.130314  2.313787  0.870057
2018-01-03 -0.297601  0.504137 -0.951311 -0.146720
2018-01-04  0.282177  0.142360 -0.059013  0.633174
2018-01-05  2.095398 -0.153359  0.431514 -1.185657
2018-01-06  0.134847  0.188138  0.828329 -1.035120
2018-01-07  0.780541  0.138942 -1.001229  0.714896
2018-01-08  0.579742 -0.642858  0.835013 -1.504110
2018-01-09 -1.692986 -0.861327 -1.125359  0.006687
2018-01-10 -0.263689  1.182349 -0.916569  0.617476
==========================================
                   A         B
2018-01-01  0.518897  0.988917
2018-01-02  2.312297  1.119232
2018-01-03  2.014697  1.623369
2018-01-04  1.777976  0.776811
2018-01-05  2.079975  0.493138
2018-01-06  2.512422  0.177140
2018-01-07  3.010786  0.173722
2018-01-08  1.495130 -0.315777
2018-01-09 -0.332703 -1.365242
2018-01-10 -1.376932 -0.321836

Applying Multiple Functions to a Single Column of a DataFrame

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
      index = pd.date_range('2019/01/01', periods=10),
      columns = ['A', 'B', 'C', 'D'])
print (df)

print("==========================================")

r = df.rolling(window=3,min_periods=1)
print (r['A'].aggregate([np.sum,np.mean]))

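Running the example above produces output similar to the following (the values differ on each run because the data is random):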

                   A         B         C         D
2019-01-01  1.022641 -1.431910  0.780941 -0.029811
2019-01-02 -0.302858  0.009886 -0.359331 -0.417708
2019-01-03 -1.396564  0.944374 -0.238989 -1.873611
2019-01-04  0.396995 -1.152009 -0.560552 -0.144212
2019-01-05 -2.513289 -1.085277 -1.016419 -1.586994
2019-01-06 -0.513179  0.823411  0.670734  1.196546
2019-01-07 -0.363239 -0.991799  0.587564 -1.100096
2019-01-08  1.474317  1.265496 -0.216486 -0.224218
2019-01-09  2.235798 -1.381457 -0.950745 -0.209564
2019-01-10 -0.061891 -0.025342  0.494245 -0.081681
==========================================
                 sum      mean
2019-01-01  1.022641  1.022641
2019-01-02  0.719784  0.359892
2019-01-03 -0.676780 -0.225593
2019-01-04 -1.302427 -0.434142
2019-01-05 -3.512859 -1.170953
2019-01-06 -2.629473 -0.876491
2019-01-07 -3.389707 -1.129902
2019-01-08  0.597899  0.199300
2019-01-09  3.346876  1.115625
2019-01-10  3.648224  1.216075

Applying Multiple Functions to Multiple Columns of a DataFrame

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
      index = pd.date_range('2020/01/01', periods=10),
      columns = ['A', 'B', 'C', 'D'])

print (df)
print("==========================================")
r = df.rolling(window=3,min_periods=1)
print (r[['A','B']].aggregate([np.sum,np.mean]))

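Running the example above produces output similar to the following (the values differ on each run because the data is random):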

                   A         B         C         D
2020-01-01  1.053702  0.355985  0.746638 -0.233968
2020-01-02  0.578520 -1.171843 -1.764249 -0.709913
2020-01-03 -0.491185  0.975212  0.200139 -3.372621
2020-01-04 -1.331328  0.776316  0.216623  0.202313
2020-01-05 -1.023147 -0.913686  1.457512  0.999232
2020-01-06  0.995328 -0.979826 -1.063695  0.057925
2020-01-07  0.576668  1.065767 -0.270744 -0.513707
2020-01-08  0.520258  0.969043 -0.119177 -0.125620
2020-01-09 -0.316480  0.549085  1.862249  1.091265
2020-01-10  0.461321 -0.368662 -0.988323  0.543011
==========================================
                   A                   B          
                 sum      mean       sum      mean
2020-01-01  1.053702  1.053702  0.355985  0.355985
2020-01-02  1.632221  0.816111 -0.815858 -0.407929
2020-01-03  1.141037  0.380346  0.159354  0.053118
2020-01-04 -1.243993 -0.414664  0.579686  0.193229
2020-01-05 -2.845659 -0.948553  0.837843  0.279281
2020-01-06 -1.359146 -0.453049 -1.117195 -0.372398
2020-01-07  0.548849  0.182950 -0.827744 -0.275915
2020-01-08  2.092254  0.697418  1.054985  0.351662
2020-01-09  0.780445  0.260148  2.583896  0.861299
2020-01-10  0.665099  0.221700  1.149466  0.383155
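The two-level header in the output above is a MultiIndex on the columns: the first level is the source column and the second is the function name. The sketch below shows one way to pull a single result out of such a frame. The data and the window size of 2 are assumptions for illustration, and the functions are given as strings rather than NumPy callables.

```python
import pandas as pd

# Fixed values so the results can be checked by hand (illustrative data).
df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                   'B': [10.0, 20.0, 30.0, 40.0]},
                  index=pd.date_range('2020-01-01', periods=4))
r = df.rolling(window=2, min_periods=1)

result = r[['A', 'B']].aggregate(['sum', 'mean'])

# Column labels are (source column, function name) tuples.
print(result.columns.tolist())

# Select the rolling mean of column A with a tuple key.
a_mean = result[('A', 'mean')]
print(a_mean.tolist())
```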

Applying Different Functions to Different Columns of a DataFrame

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 4),
      index = pd.date_range('2020/01/01', periods=3),
      columns = ['A', 'B', 'C', 'D'])
print (df)
print("==========================================")
r = df.rolling(window=3,min_periods=1)
print (r.aggregate({'A' : np.sum,'B' : np.mean}))

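Running the example above produces output similar to the following (the values differ on each run because the data is random):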

                   A         B         C         D
2020-01-01 -0.246302 -0.057202  0.923807 -1.019698
2020-01-02  0.285287  1.467206 -0.368735 -0.397260
2020-01-03 -0.163219 -0.401368  1.254569  0.580188
==========================================
                   A         B
2020-01-01 -0.246302 -0.057202
2020-01-02  0.038985  0.705002
2020-01-03 -0.124234  0.336212
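Note that the result above contains only columns A and B: columns not named in the dictionary are left out. The sketch below reproduces this on fixed data (the values and the extra column C are assumptions for illustration), using string function names in the dictionary.

```python
import pandas as pd

# Fixed values so the results can be checked by hand (illustrative data).
df = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                   'B': [3.0, 6.0, 9.0],
                   'C': [0.0, 0.0, 0.0]},
                  index=pd.date_range('2020-01-01', periods=3))
r = df.rolling(window=3, min_periods=1)

# Rolling sum for A, rolling mean for B; C is not listed in the dict.
result = r.aggregate({'A': 'sum', 'B': 'mean'})

print(result)
print(result.columns.tolist())
```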