Discard Points With X,Y Coordinates Close To Each Other In Dataframe
I have the following dataframe (it is actually several hundred MB long):
   X   Y  Size
0  10  20     5
1  11  21     2
2   9  35     1
3   8   7     7
4   9  19     2
I want to discard any
Solution 1:
As stated, the question does not make clear how chained distances should be handled, that is, whether A and C belong together when A is close to B and B is close to C, but A and C are farther apart than the threshold.
If chaining is allowed, one solution is to cluster the dataset with a density-based clustering algorithm such as DBSCAN.
Set the neighbourhood radius eps to delta and the min_samples parameter to 1 so that isolated points form their own clusters. Then, within each cluster, keep the point with the maximum Size.
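from sklearn.cluster import DBSCAN

# Cluster on the coordinates only; eps is the distance threshold (delta)
X = df[['X', 'Y']]
db = DBSCAN(eps=3, min_samples=1).fit(X)
df['grp'] = db.labels_
# Keep the row with the largest Size in each cluster
df_new = df.loc[df.groupby('grp')['Size'].idxmax()]
print(df_new)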
Result:
   X   Y  Size  grp
0  10  20     5    0
2   9  35     1    1
3   8   7     7    2
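To see what "chaining" means in practice, here is a small sketch on made-up data (the dataframe `line_df` is hypothetical and not part of the question). Five points sit on a line, each 2 units from the next, so with eps=3 every point is linked to its neighbour and DBSCAN should put all of them in one cluster, even though the two end points are 8 units apart.

```python
import pandas as pd
from sklearn.cluster import DBSCAN

# Hypothetical data: five points on a line, each 2 units from the next
line_df = pd.DataFrame({
    "X": [0, 2, 4, 6, 8],
    "Y": [0, 0, 0, 0, 0],
    "Size": [1, 2, 3, 4, 5],
})

# Same settings as in the answer above: eps is the threshold, min_samples=1
line_df["grp"] = DBSCAN(eps=3, min_samples=1).fit(line_df[["X", "Y"]]).labels_

# Neighbouring points link up, so the whole line collapses into one cluster
n_clusters = line_df["grp"].nunique()

# Only the single largest point of that cluster survives
kept = line_df.loc[line_df.groupby("grp")["Size"].idxmax()]
print(n_clusters)
print(kept)
```

If keeping only one point out of such a long chain is not what you want, the DBSCAN approach is probably not the right fit and a non-chaining rule is needed instead.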
Solution 2:
You can use the script below and try to improve it.
# Compute all pairwise Euclidean distances with sklearn (an n x n array),
# then collect the index pairs whose distance is less than 3
import pandas as pd
from itertools import chain
from sklearn.metrics.pairwise import euclidean_distances

Z = df[['X', 'Y']]
euc = euclidean_distances(Z, Z)
idx = [(i, j) for i in range(len(euc) - 1) for j in range(i + 1, len(euc)) if euc[i, j] < 3]

# Gather every row that is within distance 3 of another row and find the
# largest Size among them; note this keeps one maximum across all close
# rows, not one per group of neighbours
df_idx = list(set(chain(*idx)))
df2 = df.iloc[df_idx]
idx_max = df2[df2['Size'] == df2['Size'].max()].index.tolist()
# Combine the rows that have no close neighbour with the max-Size row(s)
df_new = pd.concat([df.drop(df.index[df_idx]), df2.loc[idx_max]])
df_new
Result:
X Y Size
2 9 35 1
3 8 7 7
0 10 20 5
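One possible improvement, sketched here under a few assumptions: the full pairwise matrix grows quadratically with the number of rows, which may not fit in memory for a dataframe of several hundred MB. The sketch below swaps in SciPy's cKDTree (not used in the original answers) to look up neighbours without building that matrix, and applies a greedy, non-chaining rule: visit points from largest to smallest Size, keep each point that has not been discarded yet, and discard everything within delta of it. The function name `discard_close_points` and the greedy rule itself are illustrative choices, not something the question specifies.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def discard_close_points(df, delta):
    """Greedy filter (hypothetical helper): keep the largest point,
    drop its neighbours within delta, and repeat."""
    coords = df[["X", "Y"]].to_numpy()
    tree = cKDTree(coords)
    # Row positions ordered from largest to smallest Size
    order = np.argsort(-df["Size"].to_numpy(), kind="stable")
    removed = np.zeros(len(df), dtype=bool)
    keep = []
    for pos in order:
        if removed[pos]:
            continue
        keep.append(pos)
        # query_ball_point uses distance <= delta (the script above uses < 3)
        removed[tree.query_ball_point(coords[pos], r=delta)] = True
    return df.iloc[sorted(keep)]

# Sample data from the question
df = pd.DataFrame({
    "X": [10, 11, 9, 8, 9],
    "Y": [20, 21, 35, 7, 19],
    "Size": [5, 2, 1, 7, 2],
})
df_new = discard_close_points(df, delta=3)
print(df_new)  # rows 0, 2 and 3 remain for the sample data
```

Because a point is only discarded when it lies within delta of a point that was actually kept, long chains are not collapsed into a single survivor. Whether that is the desired behaviour depends on how the question is meant to be read.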