How To Find Duplicate Based Upon Multiple Columns In A Rolling Window In Pandas?
Sample Data {'transaction': {'merchant': 'merchantA', 'amount': 20, 'time': '2019-02-13T10:00:00.000Z'}} {'transaction': {'merchant': 'merchantB', 'amount': 90, 'time': '2019-02-13
Solution 1:
First, you could form rolling 120-second blocks of data. You could then evaluate each block using duplicated: df = df[df.duplicated(subset=['val1','val2','val3'], keep=False)]
Or groupby: df.groupby(['val1','val2','val3']).count()
Or even a SQL distinct. https://www.w3schools.com/sql/sql_distinct.asp
Please post what you have tried. The above methods work for strings, floats, datetimes and integer data types.
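As a rough illustration of the block-then-duplicated idea, here is a minimal sketch. It assumes the columns are named merchant, amount and time as in the question's sample data; the rows themselves, and the helper column names block and is_dup, are made up for the example. Note that it uses fixed 120-second blocks, so two matching transactions that fall on either side of a block boundary are not flagged even if they are only seconds apart.

```python
import pandas as pd

# Hypothetical rows, modeled on the question's sample data.
df = pd.DataFrame(
    {
        "merchant": ["merchantE", "merchantF", "merchantE", "merchantF"],
        "amount": [10, 10, 10, 10],
        "time": pd.to_datetime(
            [
                "2019-02-13T11:01:30Z",
                "2019-02-13T11:03:00Z",
                "2019-02-13T11:01:40Z",
                "2019-02-13T11:05:20Z",
            ]
        ),
    }
)

# Assign each row to a fixed 120-second block (not a true sliding window).
df["block"] = df["time"].dt.floor("120s")

# Flag every row that shares merchant, amount and block with another row.
df["is_dup"] = df.duplicated(subset=["merchant", "amount", "block"], keep=False)

print(df[["merchant", "amount", "time", "is_dup"]])
```

Here the two merchantE rows land in the same block and are both flagged, while the two merchantF rows are more than 120 seconds apart and are not.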
Solution 2:
I made it work, but not with rolling windows, because rolling does not support string columns. This feature has been reported and requested on the pandas repo as well.
My solution snippet to the problem:
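# Runs inside the loop over incoming records. `data` is the current record
# (its 'time' already parsed to a Timestamp) and `df1` is assumed to be the
# one-row DataFrame built from it.
    if len(df.index) > 0:
        # Previously accepted rows with the same merchant and amount
        res = df.loc[(df.merchant == data['transaction']['merchant']) & (df.amount == data['transaction']['amount'])].copy()
        # True where an earlier row lies within 120 seconds of the new one
        res['timediff'] = (data['transaction']['time'] - res['time']).dt.total_seconds().abs() <= 120
        if res.timediff.any():
            continue  # duplicate: skip this record
    df = pd.concat([df, df1])  # DataFrame.append was removed in pandas 2.0
print(df)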
Sample data:
{"transaction":{"merchant":"merchantA","amount":20,"time":"2019-02-13T10:00:00.000Z"}}
{"transaction":{"merchant":"merchantB","amount":90,"time":"2019-02-13T11:00:01.000Z"}}
{"transaction":{"merchant":"merchantC","amount":10,"time":"2019-02-13T11:00:10.000Z"}}
{"transaction":{"merchant":"merchantD","amount":10,"time":"2019-02-13T11:00:20.000Z"}}
{"transaction":{"merchant":"merchantE","amount":10,"time":"2019-02-13T11:01:30.000Z"}}
{"transaction":{"merchant":"merchantF","amount":10,"time":"2019-02-13T11:03:00.000Z"}}
{"transaction":{"merchant":"merchantE","amount":10,"time":"2019-02-13T11:02:00.000Z"}}
{"transaction":{"merchant":"merchantF","amount":10,"time":"2019-02-13T11:02:20.000Z"}}
{"transaction":{"merchant":"merchantE","amount":10,"time":"2019-02-13T11:02:30.000Z"}}
{"transaction":{"merchant":"merchantF","amount":10,"time":"2019-02-13T11:05:20.000Z"}}
{"transaction":{"merchant":"merchantE","amount":10,"time":"2019-02-13T11:00:30.000Z"}}
Output:
                     merchant   amount  time
2019-02-13 10:00:00  merchantA  20      2019-02-13 10:00:00
2019-02-13 11:00:01  merchantB  90      2019-02-13 11:00:01
2019-02-13 11:00:10  merchantC  10      2019-02-13 11:00:10
2019-02-13 11:00:20  merchantD  10      2019-02-13 11:00:20
2019-02-13 11:01:30  merchantE  10      2019-02-13 11:01:30
2019-02-13 11:03:00  merchantF  10      2019-02-13 11:03:00
2019-02-13 11:05:20  merchantF  10      2019-02-13 11:05:20
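For anyone who wants to reproduce this end to end, below is a self-contained sketch of the same rule: skip a record if an already accepted record has the same merchant and amount and lies within 120 seconds of it. It is not the exact code from the answer. It assumes the records arrive one JSON object per line, and it collects accepted rows in a plain list (the names raw, rows and WINDOW_SECONDS are illustrative) and builds the DataFrame once at the end instead of growing it inside the loop.

```python
import json

import pandas as pd

# The sample records, one JSON object per line.
raw = """
{"transaction":{"merchant":"merchantA","amount":20,"time":"2019-02-13T10:00:00.000Z"}}
{"transaction":{"merchant":"merchantB","amount":90,"time":"2019-02-13T11:00:01.000Z"}}
{"transaction":{"merchant":"merchantC","amount":10,"time":"2019-02-13T11:00:10.000Z"}}
{"transaction":{"merchant":"merchantD","amount":10,"time":"2019-02-13T11:00:20.000Z"}}
{"transaction":{"merchant":"merchantE","amount":10,"time":"2019-02-13T11:01:30.000Z"}}
{"transaction":{"merchant":"merchantF","amount":10,"time":"2019-02-13T11:03:00.000Z"}}
{"transaction":{"merchant":"merchantE","amount":10,"time":"2019-02-13T11:02:00.000Z"}}
{"transaction":{"merchant":"merchantF","amount":10,"time":"2019-02-13T11:02:20.000Z"}}
{"transaction":{"merchant":"merchantE","amount":10,"time":"2019-02-13T11:02:30.000Z"}}
{"transaction":{"merchant":"merchantF","amount":10,"time":"2019-02-13T11:05:20.000Z"}}
{"transaction":{"merchant":"merchantE","amount":10,"time":"2019-02-13T11:00:30.000Z"}}
"""

WINDOW_SECONDS = 120  # duplicate window from the question

rows = []  # accepted transactions, in arrival order
for line in raw.strip().splitlines():
    tx = json.loads(line)["transaction"]
    tx["time"] = pd.Timestamp(tx["time"])
    # Duplicate if an accepted row has the same merchant and amount
    # and is at most WINDOW_SECONDS away in either direction.
    is_dup = any(
        r["merchant"] == tx["merchant"]
        and r["amount"] == tx["amount"]
        and abs((tx["time"] - r["time"]).total_seconds()) <= WINDOW_SECONDS
        for r in rows
    )
    if not is_dup:
        rows.append(tx)

df = pd.DataFrame(rows)
print(df)
```

Because each new record is compared only against accepted rows, the result depends on arrival order, just like the original snippet. With the sample data it keeps the same seven transactions shown in the output.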