Pandas Rolling Moving Average Using Table Method And Time Period
Solution 1:
I think @Salmonstrikes's code is correct but not fast. In this situation, you don't need transform, method='table', or a Numba speed-up; a plain grouped rolling mean is enough. So I stripped those elements out of @Salmonstrikes's code, and the result runs faster.
import numpy as np
import pandas as pd
from pandas.tseries.offsets import *  # for Day()
import time
df = pd.DataFrame(dict(
    efcode=np.random.randint(0, 2, size=10000),
    date=pd.date_range(start=pd.Timestamp(1990, 1, 1), end=pd.Timestamp(1990, 1, 1) + Day(9999), freq='D'),
    liq_daily=np.random.randint(1, 100, size=10000),
    liq_daily_usd=np.random.randint(1, 100, size=10000),
    net_vwap_avg=np.random.randint(1, 100, size=10000)
))
proc_list = ['liq_daily', 'liq_daily_usd', 'net_vwap_avg']
start_time=time.time()
for p in [3, 5, 10, 22, 45, 67, 125, 252, 504, 756, 1260]:
    df[[f"{q}_{p}d" for q in proc_list]] = df.groupby('efcode')[proc_list].transform(
        lambda x: x.rolling(p, min_periods=int(0.8 * p)).mean()
    )
end_time=time.time()
print(f"Your origin execution time is: {end_time-start_time}")
df = df.sort_values(by=['efcode', 'date']).reset_index(drop=True)
df2=df.copy()
proc_list = ['liq_daily', 'liq_daily_usd', 'net_vwap_avg']
start_time=time.time()
for p in [3, 5, 10, 22, 45, 67, 125, 252, 504, 756, 1260]:
    df2[[f"{q}_{p}d" for q in proc_list]] = (
        df2
        .groupby('efcode')[proc_list]
        .rolling(p, min_periods=int(0.8 * p))
        .mean()
        .reset_index(drop=True)
    )
end_time=time.time()
print(f"My execution time is: {end_time-start_time}")
start_time=time.time()
for p in [3, 5, 10, 22, 45, 67, 125, 252, 504, 756, 1260]:
    df[[f"{q}_{p}d" for q in proc_list]] = (
        df
        .groupby('efcode')[proc_list]
        .rolling(p, min_periods=int(0.8 * p), method='table')
        .mean(engine='numba')
        .reset_index(drop=True)
    )
end_time=time.time()
print(f"Salmonstrikes execution time is: {end_time-start_time}")
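For a quick sanity check that the plain groupby-rolling version is numerically equivalent to the transform version once the frame is sorted by the group key, a small self-contained comparison (using made-up data and a single column) might look like this:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(22)
df = pd.DataFrame({
    'efcode': rng.integers(0, 2, size=500),
    'liq_daily': rng.integers(1, 100, size=500).astype(float),
}).sort_values('efcode').reset_index(drop=True)

p = 5
# Per-group rolling mean via transform (keeps the original row index)
via_transform = df.groupby('efcode')['liq_daily'].transform(
    lambda x: x.rolling(p, min_periods=int(0.8 * p)).mean()
)
# Per-group rolling mean via groupby.rolling (prepends the group key to the index)
via_rolling = (
    df.groupby('efcode')['liq_daily']
      .rolling(p, min_periods=int(0.8 * p))
      .mean()
      .reset_index(drop=True)
)
# The two agree element-wise because the frame was pre-sorted by efcode
assert via_transform.equals(via_rolling)
```

Note that the equivalence relies on the frame being sorted by the group key before dropping the index; on an unsorted frame, reset_index(drop=True) would misalign rows.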
Solution 2:
UPDATE
@Doraelee's answer is good: simple and gets the job done fast. After giving it some more thought and experimentation, I now realize that method='table' is more suited to very wide dataframes, i.e. ones with lots of columns. For narrow frames like in your case (3 columns), the performance boost from the vectorization and parallelism that happens across columns with method='table' is negligible.
I've included some more code below for benchmarking. You'll notice that the performance boost for the wide frame is larger than the boost for the narrow frame. In fact, method='table' can be slower on the narrow frame (as you've noticed yourself), and not just because of the compilation overhead. Perhaps Numba could be configured somehow to avoid this slowdown, or maybe there's just no tuned implementation in Pandas yet -- I don't know.
Note that it's tough to compare the wide times against the narrow times because the complexity is quite different, due to (i) the different lengths of the groupings and (ii) the parallelism invoked by Numba -- my vanilla Pandas rolling.mean appears to be serial-only by default.
import numpy as np
import pandas as pd
from datetime import datetime
# it's tough to get apples-to-apples for wide vs narrow comparisons with parallelism on
num_wide_rows = 10 ** 4
num_wide_cols = 10 ** 4
num_narrow_cols = 10
num_narrow_rows = 10 ** 5
# seed generator
np.random.seed(22)
rolling_period = 14
min_periods = int(0.8 * rolling_period)
# create wide DF
wide_group_id_list = np.random.randint(low=1, high=10+1, size=num_wide_rows) # 10 possible groups
wide_group_id_list.sort()
wide_data = np.random.rand(num_wide_rows, num_wide_cols)
wide_df = pd.DataFrame(data=wide_data)
wide_df.insert(0, 'group_id', wide_group_id_list)
# create narrow DF
narrow_group_id_list = np.random.randint(low=1, high=10+1, size=num_narrow_rows) # 10 possible groups
narrow_group_id_list.sort()
narrow_data = np.random.rand(num_narrow_rows, num_narrow_cols)
narrow_df = pd.DataFrame(data=narrow_data)
narrow_df.insert(0, 'group_id', narrow_group_id_list)
def time_operation(title, df, method='single'):
    kwargs = {'engine': 'numba'} if method == 'table' else {}
    t_begin = datetime.now()
    for i in range(3):  # repetitions
        df.groupby('group_id').rolling(rolling_period, min_periods=min_periods, method=method).mean(**kwargs)
    t_final = datetime.now()
    delta_t = t_final - t_begin
    print(f"'{title}' took {delta_t}.")
# (this step may be unnecessary) perform a cheap rolling mean in the hopes of smart, one-off precompilation for the timing test
narrow_df.head(2*min_periods).groupby('group_id').rolling(rolling_period, min_periods=min_periods, method='table').mean(engine='numba')
# timing experiment
time_operation('wide_df/method=single', wide_df, method='single')
time_operation('wide_df/method=table', wide_df, method='table')
time_operation('narrow_df/method=single', narrow_df, method='single')
time_operation('narrow_df/method=table', narrow_df, method='table')
Output from my laptop:
'wide_df/method=single' took 0:00:47.604131.
'wide_df/method=table' took 0:00:15.580090.
'narrow_df/method=single' took 0:00:00.365677.
'narrow_df/method=table' took 0:00:04.876920.
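As a side note on the min_periods=int(0.8 * rolling_period) setting used throughout these benchmarks: it makes each window emit NaN until at least 80% of the window is populated. A tiny illustration with a toy series:

```python
import pandas as pd

p = 5
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
out = s.rolling(p, min_periods=int(0.8 * p)).mean()
# int(0.8 * 5) == 4, so the first three windows (1-3 observations) yield NaN;
# the fourth window averages [1, 2, 3, 4] and later ones use the full window.
print(out.tolist())  # [nan, nan, nan, 2.5, 3.0, 4.0]
```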
ORIGINAL ANSWER
Here's some code that reproduces the results from your example, and uses the 'table' method:
import numpy as np
import pandas as pd
from pandas.tseries.offsets import *  # for Day()
np.random.seed(22) # basic reproducibility
df = pd.DataFrame(dict(
    efcode=np.random.randint(0, 2, size=10000),
    date=pd.date_range(start=pd.Timestamp(1990, 1, 1), end=pd.Timestamp(1990, 1, 1) + Day(9999), freq='D'),
    liq_daily=np.random.randint(1, 100, size=10000),
    liq_daily_usd=np.random.randint(1, 100, size=10000),
    net_vwap_avg=np.random.randint(1, 100, size=10000)
))
proc_list = ['liq_daily', 'liq_daily_usd', 'net_vwap_avg']
# sort by efcode first, date next
df = df.sort_values(by=['efcode', 'date']).reset_index(drop=True)
for p in [3, 5, 10, 22, 45, 67, 125, 252, 504, 756, 1260]:
    df[[f"{q}_{p}d" for q in proc_list]] = (
        df
        .groupby('efcode')[proc_list]
        .rolling(p, min_periods=int(0.8 * p), method='table')
        .mean(engine='numba')
        .reset_index(drop=True)
    )
Points to note:
- For an apples-to-apples comparison, please sort the dataframe appropriately in your example as well.
- The 'table' method is currently only callable with the numba engine, according to the documentation. This means that the compilation overhead might actually slow the code down for smaller dataframes like this one; hopefully you'll notice a speedup on larger datasets.
- efcode remains in the index as an outcome of the groupby operation -- this is why I reset and drop the index in the final step.
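To see concretely why the reset is needed, here is a minimal sketch with toy data: groupby(...).rolling(...) returns a result whose index has the group key prepended, and dropping that index restores positional alignment only because the frame is sorted by the group key first.

```python
import pandas as pd

df = pd.DataFrame({
    'efcode': [0, 0, 1, 1],  # already sorted by the group key
    'liq_daily': [10.0, 20.0, 30.0, 40.0],
})

rolled = df.groupby('efcode')['liq_daily'].rolling(2, min_periods=1).mean()
# The result is indexed by (efcode, original row label), not by position
print(list(rolled.index.names))  # ['efcode', None]
# Dropping the index re-aligns the values with the original (sorted) rows
aligned = rolled.reset_index(drop=True)
print(aligned.tolist())  # [10.0, 15.0, 30.0, 35.0]
```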