Match Regex To Its Type In Another Dataframe

How can I match a data value against the regex for its type when the regex is in another dataframe? Here are the sample Data df and Regex df. Note that these two df have different shapes, as the regex

Solution 1:

Refer to: Match column with its own regex in another column Python

Just apply a function to create a new column, Suggestion; its logic depends on your description.

import re

def func(dfRow):
    # Find the regex registered for the same Country and Type as this row
    sameDF = regexDF.loc[(regexDF['Country'] == dfRow['Country']) & (regexDF['Type'] == dfRow['Type'])]
    if not sameDF.empty and re.match(sameDF.iloc[0]["Regex"], dfRow["Data"]):
        # 0 marks a row whose Data already matches the regex of its own Type
        return 0
    # Otherwise look at every regex of the same Country and return the first matching Type
    sameCountryDF = regexDF.loc[(regexDF['Country'] == dfRow['Country'])]
    for index, row in sameCountryDF.iterrows():
        if re.match(row["Regex"], dfRow["Data"]):
            return row["Type"]
    # If nothing matches, the function falls through and returns None

df["Suggestion"] = df.apply(func, axis=1)
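
To try the row-wise approach above end to end, here is a minimal, self-contained sketch. The sample frames and regex patterns are hypothetical (the original data is not shown in full here), and the helper name suggest_type and its regex_df parameter are introduced only for illustration; the lookup logic follows the function above.

```python
import re
import pandas as pd

# Hypothetical sample data; values and patterns are assumptions for illustration.
df = pd.DataFrame({
    "Country": ["MY", "MY", "IT"],
    "Type": ["ABC", "ABC", "ABC"],
    "Data": ["MY1234567890", "456792abc", "IT56788897"],
})
regexDF = pd.DataFrame({
    "Country": ["MY", "MY", "IT", "IT"],
    "Type": ["ABC", "DEF", "ABC", "XYZ"],
    "Regex": [r"^MY\d{10}$", r"^\d{6}[a-z]{3}$", r"^IT\d{10}$", r"^IT\d{8}$"],
})

def suggest_type(row, regex_df):
    """Return 0 if Data matches its own Type's regex, else the first matching Type."""
    same_country = regex_df[regex_df["Country"] == row["Country"]]
    own = same_country[same_country["Type"] == row["Type"]]
    if not own.empty and re.match(own.iloc[0]["Regex"], row["Data"]):
        return 0
    for _, candidate in same_country.iterrows():
        if re.match(candidate["Regex"], row["Data"]):
            return candidate["Type"]
    return None  # no regex of this Country matched

# Extra keyword arguments to DataFrame.apply are forwarded to the function.
df["Suggestion"] = df.apply(suggest_type, axis=1, regex_df=regexDF)
print(df)
```

Passing the regex frame as an argument instead of reading a global makes the function easier to test, at the cost of a slightly longer apply call.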

Solution 2:

I suggest merging on Country and doing both checks in the same DataFrame: whether Data matches the regex for the Type given in data_df, and whether it matches the regex for a different Type from regex_df. It goes as follows:

import re
import numpy as np
import pandas as pd

# First I merge only on country
new_df = pd.merge(df, df_regex, on="Country")

# Then I define an indicator for types that differ between the two DF
new_df["indicator"] = np.where(new_df["Type_x"] == new_df["Type_y"], "both", "right")

# I see if the regex matches Data for the `Type` in df
new_df['Data Quality'] = new_df.apply(
    lambda x: 1 if re.match(x['Regex'], x['Data']) and x["indicator"] == "both" else 0,
    axis=1)

# Then I fill Suggestion by looking if the regex matches data for the type in df_regex
new_df['Suggestion'] = new_df.apply(
    lambda x: x["Type_y"] if re.match(x['Regex'], x['Data']) and x["indicator"] == "right" else "",
    axis=1)

# I remove lines where there is no suggestion and I just added lines from df_regex
new_df = new_df.loc[~((new_df["indicator"] == "right") & (new_df["Suggestion"] == "")), :]
new_df = new_df.sort_values(["Country", "Type_x", "Data"])

# After sorting I move Suggestion up one line
new_df["Suggestion"] = new_df["Suggestion"].shift(periods=-1)
new_df = new_df.loc[new_df["indicator"] == "both", :]
new_df = new_df.drop(columns=["indicator", "Type_y", "Regex"]).fillna("")

And you get this result:

  Country Type_x          Data  Data Quality Suggestion
4       IT    ABC  IT1234567890             1           
8       IT    ABC    IT56788897             0        XYZ
6       IT    ABC    MY45889976             0        XYZ
2       MY    ABC     456792abc             0        DEF
0       MY    ABC  MY1234567890             1           
10      PL    PQR      PL123456             1           

The last line of your output seems to have the wrong Type, since that Type is not in data_df. Using your sample data, I find Type ABC for Data == "456792abc", with your suggestion DEF.
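
The shift step in this solution appears to rely on each suggestion row sorting directly below the row it belongs to. As a hedged alternative (not the answerer's method), the sketch below keeps the merge on Country but carries the original row index through the merge and aggregates per row with groupby, so no sorting or shifting is needed. The sample data, the patterns, and the helper column names (index, Type_regex, matches, same_type) are assumptions made for illustration.

```python
import re
import pandas as pd

# Hypothetical sample data; values and patterns are assumptions for illustration.
df = pd.DataFrame({
    "Country": ["MY", "MY", "IT"],
    "Type": ["ABC", "ABC", "ABC"],
    "Data": ["MY1234567890", "456792abc", "IT56788897"],
})
df_regex = pd.DataFrame({
    "Country": ["MY", "MY", "IT", "IT"],
    "Type": ["ABC", "DEF", "ABC", "XYZ"],
    "Regex": [r"^MY\d{10}$", r"^\d{6}[a-z]{3}$", r"^IT\d{10}$", r"^IT\d{8}$"],
})

# Keep the original row label as a column so results can be mapped back.
merged = df.reset_index().merge(df_regex, on="Country", suffixes=("", "_regex"))

# Evaluate every (Data, Regex) pair of the same Country once.
merged["matches"] = [
    bool(re.match(rx, val)) for rx, val in zip(merged["Regex"], merged["Data"])
]
merged["same_type"] = merged["Type"] == merged["Type_regex"]

# Data Quality: 1 when the regex of the row's own Type matches its Data.
quality = (
    (merged["matches"] & merged["same_type"])
    .groupby(merged["index"]).any().astype(int)
)

# Suggestion: the first other Type of the same Country whose regex matches.
others = merged[merged["matches"] & ~merged["same_type"]]
suggestion = others.groupby("index")["Type_regex"].first()

result = df.copy()
result["Data Quality"] = quality.reindex(df.index, fill_value=0)
result["Suggestion"] = suggestion.reindex(df.index).fillna("")
print(result)
```

Because the results are mapped back through the original index, rows whose Country has no entry in df_regex simply get a Data Quality of 0 and an empty Suggestion instead of being dropped by the merge.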

