
Pandas Rolling Moving Average Using Table Method And Time Period

I have some code that calculates multiple rolling averages over different time periods (e.g. 3 to 1260 days, i.e. up to 5 years), but it is very slow and somewhat memory-intensive.

Solution 1:

I think @Salmonstrikes's code is correct, but not fast. In this situation, you don't need "transform", the 'table' method, or the Numba engine; a plain rolling mean is enough. So I stripped those elements out of @Salmonstrikes's code, and it runs faster.

import numpy as np
import pandas as pd
from pandas.tseries.offsets import Day  # for the Day offset used in date_range
import time


df = pd.DataFrame(dict(
    efcode = np.random.randint(0, 2, size=10000),
    date = pd.date_range(start=pd.Timestamp('1990-01-01'), end=pd.Timestamp('1990-01-01') + Day(9999), freq='D'),
    liq_daily = np.random.randint(1, 100, size=10000),
    liq_daily_usd = np.random.randint(1, 100, size=10000),
    net_vwap_avg = np.random.randint(1, 100, size=10000)
))

proc_list = ['liq_daily', 'liq_daily_usd', 'net_vwap_avg']

start_time=time.time()
for p in [3, 5, 10, 22, 45, 67, 125, 252, 504, 756, 1260]:
    df[[(q + '_' + str(p) + 'd') for q in proc_list]] = df.groupby('efcode')[proc_list].transform(lambda x: x.rolling(p, min_periods=int( 0.8 * p)).mean())
end_time=time.time()

print(f"Your origin execution time is: {end_time-start_time}")

# sort by efcode so that the groupby().rolling() results below align positionally after reset_index
df = df.sort_values(by=['efcode', 'date']).reset_index(drop=True)

df2=df.copy()

proc_list = ['liq_daily', 'liq_daily_usd', 'net_vwap_avg']

start_time=time.time()

for p in [3, 5, 10, 22, 45, 67, 125, 252, 504, 756, 1260]:
    df2[[f"{q}_{p}d" for q in proc_list]] = (
        df2
        .groupby('efcode')[proc_list]
        .rolling(p, min_periods=int(0.8 * p))
        .mean()
        .reset_index(drop=True)
    )
end_time=time.time()

print(f"My execution time is: {end_time-start_time}")

start_time=time.time()

for p in [3, 5, 10, 22, 45, 67, 125, 252, 504, 756, 1260]:
    df[[f"{q}_{p}d" for q in proc_list]] = (
        df
        .groupby('efcode')[proc_list]
        .rolling(p, min_periods=int(0.8 * p), method='table')
        .mean(engine='numba')
        .reset_index(drop=True)
    )
end_time=time.time()

print(f"Salmonstrikes execution time is: {end_time-start_time}")

[screenshot: speed comparison of the three timings]
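As a quick sanity check (my addition, not part of the original benchmark), you can verify that the plain groupby-rolling columns in df2 match the table-method columns written back into df, NaN positions included:

# Sanity check: the plain rolling mean (df2) and the table-method rolling
# mean (df) should agree element-wise, with NaNs in the same places.
for p in [3, 5, 10, 22, 45, 67, 125, 252, 504, 756, 1260]:
    for q in proc_list:
        c = f"{q}_{p}d"
        assert np.allclose(df[c], df2[c], equal_nan=True), f"mismatch in {c}"
print("All rolling-mean variants agree.")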

Solution 2:

UPDATE

@Doraelee's answer is good: simple, and it gets the job done fast. After giving it some more thought and experimentation, I now realize that method='table' is better suited to very wide dataframes, i.e., ones with lots of columns. For narrow frames like yours (3 columns), the performance boost from the vectorization and parallelism that method='table' applies across columns is negligible.

I've included some more code below for benchmarking. You'll notice that the performance boost for the wide frame is larger than the boost for the narrow frame. In fact, method='table' can be slower on the narrow frame (as you've noticed yourself), and not just because of the compilation overhead. Perhaps Numba could be configured somehow to avoid this slowdown, or maybe there's just not a reliable implementation in Pandas yet -- I don't know.
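For what it's worth, pandas does let you pass Numba options through the engine_kwargs argument of the aggregation; the documented keys are nopython, nogil, and parallel. Below is a minimal, self-contained sketch (my addition; whether any combination actually avoids the narrow-frame slowdown is untested):

import numpy as np
import pandas as pd

# Sketch: forwarding Numba compilation options via engine_kwargs.
tiny = pd.DataFrame(np.random.rand(100, 3))
tiny.rolling(14, min_periods=11, method='table').mean(
    engine='numba',
    engine_kwargs={'nopython': True, 'nogil': True, 'parallel': True},
)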

Note that it's tough to compare the wide times against the narrow times, because the complexity is quite different due to (i) the different group lengths and (ii) the parallelism invoked by Numba -- my vanilla Pandas rolling.mean appears to be serial-only by default.

import numpy as np
import pandas as pd
from datetime import datetime


# it's tough to get apples-to-apples for wide vs narrow comparisons with parallelism on

num_wide_rows = 10 ** 4
num_wide_cols = 10 ** 4

num_narrow_cols = 10
num_narrow_rows = 10 ** 5

# seed the generator
np.random.seed(22)
rolling_period = 14
min_periods = int(0.8 * rolling_period)

# create wide DF
wide_group_id_list = np.random.randint(low=1, high=10+1, size=num_wide_rows) # 10 possible groups
wide_group_id_list.sort()
wide_data = np.random.rand(num_wide_rows, num_wide_cols)
wide_df = pd.DataFrame(data=wide_data)
wide_df.insert(0, 'group_id', wide_group_id_list)

# create narrow DF
narrow_group_id_list = np.random.randint(low=1, high=10+1, size=num_narrow_rows) # 10 possible groups
narrow_group_id_list.sort()
narrow_data = np.random.rand(num_narrow_rows, num_narrow_cols)
narrow_df = pd.DataFrame(data=narrow_data)
narrow_df.insert(0, 'group_id', narrow_group_id_list)

def time_operation(title, df, method='single'):
    kwargs = {'engine': 'numba'} if method == 'table' else {}
    t_begin = datetime.now()
    for i in range(3):  # repetitions
        df.groupby('group_id').rolling(rolling_period, min_periods=min_periods, method=method).mean(**kwargs)
    t_final = datetime.now()
    delta_t = t_final - t_begin
    print(f"'{title}' took {delta_t}.")

# (this step may be unnecessary) perform a cheap rolling mean in the hopes of smart, one-off precompilation for the timing test
narrow_df.head(2*min_periods).groupby('group_id').rolling(rolling_period, min_periods=min_periods, method='table').mean(engine='numba')

# timing experiment
time_operation('wide_df/method=single', wide_df, method='single')
time_operation('wide_df/method=table', wide_df, method='table')
time_operation('narrow_df/method=single', narrow_df, method='single')
time_operation('narrow_df/method=table', narrow_df, method='table')

Output from my laptop:

'wide_df/method=single' took 0:00:47.604131.
'wide_df/method=table' took 0:00:15.580090.
'narrow_df/method=single' took 0:00:00.365677.
'narrow_df/method=table' took 0:00:04.876920.

ORIGINAL ANSWER

Here's some code that reproduces the results from your example, and uses the table method:

import numpy as np
import pandas as pd
from pandas.tseries.offsets import Day  # for the Day offset used in date_range

np.random.seed(22) # basic reproducibility
df = pd.DataFrame(dict(
    efcode = np.random.randint(0, 2, size=10000),
    date = pd.date_range(start=pd.Timestamp('1990-01-01'), end=pd.Timestamp('1990-01-01') + Day(9999), freq='D'),
    liq_daily = np.random.randint(1, 100, size=10000),
    liq_daily_usd = np.random.randint(1, 100, size=10000),
    net_vwap_avg = np.random.randint(1, 100, size=10000)
))

proc_list = ['liq_daily', 'liq_daily_usd', 'net_vwap_avg']

# sort by efcode first, date next
df = df.sort_values(by=['efcode', 'date']).reset_index(drop=True)
    
for p in [3, 5, 10, 22, 45, 67, 125, 252, 504, 756, 1260]:
    df[[f"{q}_{p}d"for q in proc_list]] = (
        df
        .groupby('efcode')[proc_list]
        .rolling(p, min_periods=int(0.8 * p), method='table')
        .mean(engine='numba')
        .reset_index(drop=True)
    )

Points to note:

  • For an apples-to-apples comparison, please sort the dataframe appropriately in your example as well
  • The table method is currently only callable with the numba engine according to the documentation
    • This means that the overhead from compilation might actually slow the code down for smaller dataframes like in this example; hopefully, you notice a speedup with larger datasets
  • efcode remains in the index as an outcome of the groupby operation -- this is why I reset and drop the index in the final step (an index-aligned alternative is sketched just below)
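If you'd rather not rely on the sort order for positional alignment, here's a sketch of an index-aligned variant (my addition, not part of the original answer): dropping the efcode level of the MultiIndex restores each row's original index, so the assignment aligns by index even on an unsorted frame.

for p in [3, 5, 10, 22, 45, 67, 125, 252, 504, 756, 1260]:
    out = (
        df
        .groupby('efcode')[proc_list]
        .rolling(p, min_periods=int(0.8 * p), method='table')
        .mean(engine='numba')
        .droplevel('efcode')  # drop the group level; the original row index remains
    )
    df[[f"{q}_{p}d" for q in proc_list]] = out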
