Quick Inquiry

How to get all fuzzy matching substrings between two strings in python- #solutions

Yogesh Kumar
30-Apr-2022
0

Here is a code to find the substring of a given string based on the base string with settable top fuzzy_ratio.

This one uses nltk to generate ngrams.

Typical algorithm:

Generate ngrams from the given first string.
Example:
text2 = "The time of discomfort was 3 days ago."
total_length = 8

First we use param 5, 6, 7, 8.
param=5
ngrams = ['The time of discomfort was', 'time of discomfort was 3', 'of discomfort was 3 days', 'discomfort was 3 days ago.']

Compare it with second string.
Example:
text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."

@param=5

compare 'The time of discomfort was' vs text1 and get the fuzzy score
compare 'time of discomfort was 3' vs text1 and get the fuzzy score
and so on until all elements in ngrams_5 are finished
Save sub-string if fuzzy score is greater than or equal to given threshold.

@param=6

compare 'The time of discomfort was 3' vs text1 and get the fuzzy score
and so on

until @param=8

You can revise the code changing n_start to 5 or so, so that the ngrams of string1 will be compared to the ngrams of string2. At the moment the ngrams of string2 is the full text.

 # Generate ngrams for string2
n_start = 5  # st2_length
for n in range(n_start, st2_length + 1):
    ...

For comparison I use:

 fratio = fuzz.token_set_ratio(fs1, fs2)

Have a look at this also. You can try different ratios as well.

Your sample 'prescription of idx, 20mg to be given every four hours' has a fuzzy score of 52.

See sample console output.

 7                    prescription of idx, 20mg to be given every four hours           52

Code : -

 """
fuzzy_match.py

https://stackoverflow.com/questions/72017146/how-to-get-all-fuzzy-matching-substrings-between-two-strings-in-python

Dependent modules:
    pip install pandas
    pip install nltk
    pip install fuzzywuzzy
    pip install python-Levenshtein

"""


from nltk.util import ngrams
import pandas as pd
from fuzzywuzzy import fuzz


# Sample strings.
text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
text2 = "The time of discomfort was 3 days ago."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"


def myprocess(st1: str, st2: str, threshold):
    """
    Generate sub-strings from st1 based from st2.
    The sub-strings, full string and fuzzy ratio will be saved in csv file.
    """
    data = []
    st1_length = len(st1.split())
    st2_length = len(st2.split())

    # Generate ngrams for string1
    m_start = 5
    for m in range(m_start, st1_length + 1):  # st1_length >= m_start

        # If m=3, fs1 = 'Patient has checked', 'has checked in', 'checked in for' ...
        # If m=5, fs1 = 'Patient has checked in for', 'has checked in for abdominal', ...
        for s1 in ngrams(st1.split(), m):
            fs1 = ' '.join(s1)
            
            # Generate ngrams for string2
            n_start = st2_length
            for n in range(n_start, st2_length + 1):
                for s2 in ngrams(st2.split(), n):
                    fs2 = ' '.join(s2)

                    fratio = fuzz.token_set_ratio(fs1, fs2)  # there are other ratios

                    # Save sub string if ratio is within threshold.
                    if fratio >= threshold:
                        data.append([fs1, fs2, fratio])

    return data


def get_match(sub, full, colname1, colname2, threshold=50):
    """
    sub: is a string where we extract the sub-string.
    full: is a string as the base/reference.
    threshold: is the minimum fuzzy ratio where we will save the sub string. Max fuzz ratio is 100.
    """   
    save = myprocess(sub, full, threshold)

    df = pd.DataFrame(save)
    if len(df):
        df.columns = [colname1, colname2, 'fuzzy_ratio']

        is_sort_by_fuzzy_ratio_first = True

        if is_sort_by_fuzzy_ratio_first:
            df = df.sort_values(by=['fuzzy_ratio', colname1], ascending=[False, False])
        else:
            df = df.sort_values(by=[colname1, 'fuzzy_ratio'], ascending=[False, False])

        df = df.reset_index(drop=True)

        df.to_csv(f'{colname1}_{colname2}.csv', index=False)

        # Print to console. Show only the sub-string and the fuzzy ratio. High ratio implies high similarity.
        df1 = df[[colname1, 'fuzzy_ratio']]
        print(df1.to_string())
        print()

        print(f'sub: {sub}')
        print(f'base: {full}')
        print()


def main():
    get_match(text2, text1, 'string2', 'string1', threshold=50)  # output string2_string1.csv
    get_match(text3, text1, 'string3', 'string1', threshold=50)

    get_match(text2, text3, 'string2', 'string3', threshold=10)

    # Other param combo.


if __name__ == '__main__':
    main()

Console Output :-

  string2  fuzzy_ratio
0              discomfort was 3 days ago.           72
1           of discomfort was 3 days ago.           67
2      time of discomfort was 3 days ago.           60
3                of discomfort was 3 days           59
4  The time of discomfort was 3 days ago.           55
5           time of discomfort was 3 days           51

sub: The time of discomfort was 3 days ago.
base: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.

                                                                    string3  fuzzy_ratio
0                                                 be given every four hours           61
1                                    idx, 20mg to be given every four hours           58
2        was given a prescription of idx, 20mg to be given every four hours           56
3                                              to be given every four hours           56
4   John was given a prescription of idx, 20mg to be given every four hours           56
5                                 of idx, 20mg to be given every four hours           55
6              was given a prescription of idx, 20mg to be given every four           52
7                    prescription of idx, 20mg to be given every four hours           52
8            given a prescription of idx, 20mg to be given every four hours           52
9                  a prescription of idx, 20mg to be given every four hours           52
10        John was given a prescription of idx, 20mg to be given every four           52
11                                              idx, 20mg to be given every           51
12                                        20mg to be given every four hours           50

sub: John was given a prescription of idx, 20mg to be given every four hours
base: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.

                                  string2  fuzzy_ratio
0      time of discomfort was 3 days ago.           41
1           time of discomfort was 3 days           41
2                time of discomfort was 3           40
3                of discomfort was 3 days           40
4  The time of discomfort was 3 days ago.           40
5           of discomfort was 3 days ago.           39
6       The time of discomfort was 3 days           39
7              The time of discomfort was           38
8            The time of discomfort was 3           35
9              discomfort was 3 days ago.           34

sub: The time of discomfort was 3 days ago.
base: John was given a prescription of idx, 20mg to be given every four hours