Match Regex To Its Type In Another Dataframe
How to match data value with its regex type but the regex is in another dataframe? Here is the sample Data df and Regex df. Note that these two df have different shape as the regex
Solution 1:
refer to:Match column with its own regex in another column Python
just apply a new Coumun suggestion, it's logic depend on your description.
def func(dfRow):
#find the same Country and Type
sameDF = regexDF.loc[(regexDF['Country'] == dfRow['Country']) & (regexDF['Type'] == dfRow['Type'])]
if sameDF.size > 0 and re.match(sameDF.iloc[0]["Regex"],dfRow["Data"]):
return 0
#find the same Country, then find mathec Type
sameCountryDF = regexDF.loc[(regexDF['Country'] == dfRow['Country'])]
for index, row in sameCountryDF.iterrows():
if re.match(row["Regex"], dfRow["Data"]):
return row["Type"]
df["Suggestion"]=df.apply(func, axis=1)
Solution 2:
I suggest the following, merging by Country
and doing both operations in the same DataFrame (finding regex that match for the type in data_df
and for the type in regex_df
) as follows:
# First I merge only on country
new_df = pd.merge(df, df_regex, on="Country")
# Then I define an indicator for types that differ between the two DF
new_df["indicator"] = np.where(new_df["Type_x"] == new_df["Type_y"], "both", "right")
# I see if the regex matches Data for the `Type` in df
new_df['Data Quality'] = new_df.apply(lambda x:
np.where(re.match(x['Regex'], x['Data']) and
(x["indicator"] == "both"),
1, 0), axis=1)
# Then I fill Suggestion by looking if the regex matches data for the type in df_regex
new_df['Suggestion'] = new_df.apply(lambda x:
np.where(re.match(x['Regex'], x['Data']) and
(x["indicator"] == "right"),
x["Type_y"], ""), axis=1)
# I remove lines where there is no suggestion and I just added lines from df_regex
new_df = new_df.loc[~((new_df["indicator"] == "right") & (new_df["Suggestion"] == "")), :]
new_df = new_df.sort_values(["Country", "Type_x", "Data"])
# After sorting I move Suggestion up one line
new_df["Suggestion"] = new_df["Suggestion"].shift(periods=-1)
new_df = new_df.loc[new_df["indicator"] == "both", :]
new_df = new_df.drop(columns=["indicator", "Type_y", "Regex"]).fillna("")
And you get this result:
Country Type_x Data Data Quality Suggestion
4 IT ABC IT1234567890 1
8 IT ABC IT56788897 0 XYZ
6 IT ABC MY45889976 0 XYZ
2 MY ABC 456792abc 0 DEF
0 MY ABC MY1234567890 1
10 PL PQR PL123456 1
The last line of your output seems to have the wrong Type
since it is not in data_df
.
By using your sample data I find ABC
for Data == "456792abc"
and your suggestion DEF
.
Post a Comment for "Match Regex To Its Type In Another Dataframe"