Python Pandas: Using Aggregate Vs Apply To Define New Columns
Solution 1:
To step back slightly, a faster way to do this particular "aggregation" is to just use sum (it's optimised in cython) a couple of times.
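For context, here's a minimal sketch of the kind of setup the question is about (the data, g and h below are my assumptions, not the asker's exact frame):

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'a': np.random.randint(0, 4, 24),
                   'b': np.random.randint(0, 4, 24),
                   'val1': np.random.randint(10, 20, 24),
                   'val2': np.random.randint(10, 20, 24)})

g = df.groupby(['a', 'b'])
h = lambda x: x.val1.sum() / x.val2.sum()  # needs both columns at once

g.apply(h)                           # h sees each group as a sub-DataFrame
g['val1'].sum() / g['val2'].sum()    # same result, but each sum runs in cython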
In [11]: %timeit g.apply(h)
1000 loops, best of 3: 1.79 ms per loop
In [12]: %timeit g['val1'].sum() / g['val2'].sum()
1000 loops, best of 3: 600 µs per loop
IMO the groupby code is pretty hairy, and I usually lazily "blackbox"-peek at what's going on by building a list of the values it sees:
a = []  # collects every object pandas passes to h1

def h1(x):
    a.append(x)
    return h(x)
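For example (a sketch reusing the assumed g and h from above; pandas' exact fallback behaviour here varies between versions):

a = []
g.apply(h1)    # works: every object appended is a sub-DataFrame

a = []
try:
    g.agg(h1)  # gets fed single columns instead, see below
except Exception as exc:
    print(type(exc).__name__, exc)

a  # inspect whatever pandas actually passed to h1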
Warning: sometimes the types of the objects in this list are not consistent (pandas tries a few different things before settling on how to do the calculation)... as in this example!
The second aggregation gets stuck applying h to each column of the group separately (which raises an error):
0     10
4     16
8     13
9     17
17    17
19    11
Name: val1, dtype: int64
This is the sub-Series of the val1 column where (a, b) == (1, 3).
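You can reproduce that failure directly, assuming an h like the one above that needs both columns:

sub = df['val1'].head()  # a lone column, like the sub-Series agg passes in
try:
    h(sub)
except AttributeError as exc:
    print(exc)  # 'Series' object has no attribute 'val1'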
This may well be a bug; after this raises, perhaps it could try something else (my suspicion is that this is why the first version works, it's special-cased to)...
For those interested, the a I get is:
In [21]: a
Out[21]:
[SNDArray([125755456, 131767536, 13, 17, 17, 11]),
Series([], name: val1, dtype: int64),
0     10
4     16
8     13
9     17
17    17
19    11
Name: val1, dtype: int64]
I've no idea what the SNDArray is all about...