How To Find Duplicate Based Upon Multiple Columns In A Rolling Window In Pandas?

Sample Data {'transaction': {'merchant': 'merchantA', 'amount': 20, 'time': '2019-02-13T10:00:00.000Z'}} {'transaction': {'merchant': 'merchantB', 'amount': 90, 'time': '2019-02-13

Solution 1:

First, you could split the data into 120-second blocks. You could then apply one of the following to each block.

Evaluate using duplicated: df = df[df.duplicated(subset=['val1','val2','val3'], keep=False)]

Or groupby: df.groupby(['val1','val2','val3']).count()

Or even a SQL distinct. https://www.w3schools.com/sql/sql_distinct.asp

Please post what you have tried. The above methods work for strings, floats, datetimes and integer data types.
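As a rough sketch of the first two options, the example below assumes the 120-second blocks are fixed bins produced with `Series.dt.floor("120s")` rather than a true sliding window, and it uses the `merchant`, `amount` and `time` columns from the question in place of `val1`, `val2` and `val3`. The helper column name `block` is made up for illustration.

```python
import pandas as pd

# Small illustrative frame; column names follow the question's sample data.
df = pd.DataFrame({
    "merchant": ["merchantE", "merchantF", "merchantE", "merchantF"],
    "amount": [10, 10, 10, 10],
    "time": pd.to_datetime([
        "2019-02-13T11:02:00Z",
        "2019-02-13T11:02:20Z",
        "2019-02-13T11:02:30Z",
        "2019-02-13T11:05:20Z",
    ]),
})

# Assumption: fixed 120-second bins, not a sliding window.
df["block"] = df["time"].dt.floor("120s")

keys = ["merchant", "amount", "block"]

# Rows whose (merchant, amount) pair occurs more than once within a block.
dupes = df[df.duplicated(subset=keys, keep=False)]

# The same information expressed as a row count per group.
counts = df.groupby(keys).size()

print(dupes)
print(counts)
```

Note that fixed bins can miss two transactions that are less than 120 seconds apart but fall on either side of a bin boundary, so this is only an approximation of a rolling comparison.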

Solution 2:

So I made it work, but not with rolling windows, because rolling does not support string types. This feature has been reported and requested on the pandas repo as well.

My solution snippet to the problem:

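    # Fragment from inside the loop over incoming transactions (loop header not shown).
    # Assumes `import pandas as pd`; `data` is the current parsed record and
    # `df1` is the same record as a one-row DataFrame.
    if len(df.index) > 0:
        # Earlier rows with the same merchant and amount
        res = df.loc[(df.merchant == data['transaction']['merchant']) & (df.amount == data['transaction']['amount'])]
        # True where an earlier match lies within 120 seconds of the current transaction
        within_window = (data['transaction']['time'] - res['time']).dt.total_seconds().abs() <= 120
        if within_window.any():
            continue  # duplicate, skip it
    df = pd.concat([df, df1])  # DataFrame.append was removed in pandas 2.0
print(df)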

Sample data:

{"transaction":{"merchant":"merchantA","amount":20,"time":"2019-02-13T10:00:00.000Z"}}
{"transaction":{"merchant":"merchantB","amount":90,"time":"2019-02-13T11:00:01.000Z"}}
{"transaction":{"merchant":"merchantC","amount":10,"time":"2019-02-13T11:00:10.000Z"}}
{"transaction":{"merchant":"merchantD","amount":10,"time":"2019-02-13T11:00:20.000Z"}}
{"transaction":{"merchant":"merchantE","amount":10,"time":"2019-02-13T11:01:30.000Z"}}
{"transaction":{"merchant":"merchantF","amount":10,"time":"2019-02-13T11:03:00.000Z"}}
{"transaction":{"merchant":"merchantE","amount":10,"time":"2019-02-13T11:02:00.000Z"}}
{"transaction":{"merchant":"merchantF","amount":10,"time":"2019-02-13T11:02:20.000Z"}}
{"transaction":{"merchant":"merchantE","amount":10,"time":"2019-02-13T11:02:30.000Z"}}
{"transaction":{"merchant":"merchantF","amount":10,"time":"2019-02-13T11:05:20.000Z"}}
{"transaction":{"merchant":"merchantE","amount":10,"time":"2019-02-13T11:00:30.000Z"}}

Output:

                      merchant  amount                 time
2019-02-13 10:00:00  merchantA      20  2019-02-13 10:00:00
2019-02-13 11:00:01  merchantB      90  2019-02-13 11:00:01
2019-02-13 11:00:10  merchantC      10  2019-02-13 11:00:10
2019-02-13 11:00:20  merchantD      10  2019-02-13 11:00:20
2019-02-13 11:01:30  merchantE      10  2019-02-13 11:01:30
2019-02-13 11:03:00  merchantF      10  2019-02-13 11:03:00
2019-02-13 11:05:20  merchantF      10  2019-02-13 11:05:20
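
For anyone who wants to run the idea end to end, here is a minimal self-contained sketch of the same check applied to the sample records in arrival order. It is not the original answer's code: it compares each incoming transaction against the already accepted ones in plain Python and builds the DataFrame once at the end. The names `records`, `kept`, `WINDOW_SECONDS` and `is_duplicate` are illustrative assumptions.

```python
import pandas as pd

# (merchant, amount, time) for the sample records, in arrival order.
records = [
    ("merchantA", 20, "2019-02-13T10:00:00.000Z"),
    ("merchantB", 90, "2019-02-13T11:00:01.000Z"),
    ("merchantC", 10, "2019-02-13T11:00:10.000Z"),
    ("merchantD", 10, "2019-02-13T11:00:20.000Z"),
    ("merchantE", 10, "2019-02-13T11:01:30.000Z"),
    ("merchantF", 10, "2019-02-13T11:03:00.000Z"),
    ("merchantE", 10, "2019-02-13T11:02:00.000Z"),
    ("merchantF", 10, "2019-02-13T11:02:20.000Z"),
    ("merchantE", 10, "2019-02-13T11:02:30.000Z"),
    ("merchantF", 10, "2019-02-13T11:05:20.000Z"),
    ("merchantE", 10, "2019-02-13T11:00:30.000Z"),
]

WINDOW_SECONDS = 120
kept = []  # accepted transactions, as dicts

for merchant, amount, time in records:
    ts = pd.Timestamp(time)
    # Duplicate if an accepted row has the same merchant and amount
    # and lies within WINDOW_SECONDS of this transaction (before or after).
    is_duplicate = any(
        row["merchant"] == merchant
        and row["amount"] == amount
        and abs((ts - row["time"]).total_seconds()) <= WINDOW_SECONDS
        for row in kept
    )
    if not is_duplicate:
        kept.append({"merchant": merchant, "amount": amount, "time": ts})

df = pd.DataFrame(kept)
print(df)
```

On the sample data this keeps the same seven transactions as the output shown above. Because each record is compared only against accepted rows, a transaction that was skipped never causes a later one to be skipped.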
