
How to get all fuzzy matching substrings between two strings in Python

Here is code that extracts the substrings of a given string that fuzzily match a base string, keeping those whose fuzzy ratio is at or above a settable threshold.

This one uses nltk to generate ngrams.

Typical algorithm:

  1. Generate ngrams from the given first string.
    Example:
    text2 = "The time of discomfort was 3 days ago."
    total_length = 8

The n-gram size (param) runs from 5 up to total_length, i.e. 5, 6, 7, 8. For example:
param=5
ngrams_5 = ['The time of discomfort was', 'time of discomfort was 3', 'of discomfort was 3 days', 'discomfort was 3 days ago.']

  2. Compare each n-gram with the second string.
    Example:
    text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."

@param=5

  • compare 'The time of discomfort was' vs text1 and get the fuzzy score
  • compare 'time of discomfort was 3' vs text1 and get the fuzzy score
  • and so on until all elements in ngrams_5 are finished
    Save sub-string if fuzzy score is greater than or equal to given threshold.

@param=6

  • compare 'The time of discomfort was 3' vs text1 and get the fuzzy score
  • and so on

until @param=8
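
The two steps above can be sketched with the standard library alone. This is a minimal illustration, not the code from the answer: `word_ngrams`, `stand_in_score` and `sweep` are hypothetical helper names, and the scorer is `difflib.SequenceMatcher` standing in for `fuzz.token_set_ratio`, so its scores will differ from the console output shown later.

```python
from difflib import SequenceMatcher


def word_ngrams(text: str, n: int) -> list:
    """Return every run of n consecutive whitespace-separated words in text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def stand_in_score(a: str, b: str) -> int:
    """Hypothetical 0-100 similarity score; NOT fuzz.token_set_ratio."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def sweep(sub: str, base: str, start: int, threshold: int) -> list:
    """Score each n-gram of sub (n = start .. word count) against the whole base."""
    total_length = len(sub.split())
    kept = []
    for n in range(start, total_length + 1):
        for candidate in word_ngrams(sub, n):
            score = stand_in_score(candidate, base)
            # Keep the sub-string only if it reaches the threshold.
            if score >= threshold:
                kept.append((candidate, score))
    return kept


text1 = ("Patient has checked in for abdominal pain which started 3 days ago. "
         "Patient was prescribed idx 20 mg every 4 hours.")
text2 = "The time of discomfort was 3 days ago."

print(word_ngrams(text2, 5))  # the four 5-grams listed in step 1
rows = sweep(text2, text1, start=5, threshold=0)
print(len(rows))  # 4 + 3 + 2 + 1 = 10 candidates when nothing is filtered out
```

With `threshold=0` every candidate is kept, which makes it easy to check that sizes 5 through 8 yield 4, 3, 2 and 1 sub-strings respectively.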

You can revise the code by changing n_start to 5 or so, so that the n-grams of string1 are compared with the n-grams of string2. At the moment n_start equals st2_length, so the only n-gram of string2 is its full text.

 # Generate ngrams for string2
n_start = 5  # st2_length
for n in range(n_start, st2_length + 1):
    ... 
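
Lowering n_start multiplies the number of comparisons, which is worth estimating before changing it. The sketch below only counts n-gram pairs for the sample strings; `ngram_count` is a hypothetical helper, not part of the original code.

```python
def ngram_count(num_words: int, n_start: int) -> int:
    """Number of word n-grams with n_start <= n <= num_words."""
    return sum(num_words - n + 1 for n in range(n_start, num_words + 1))


text1 = ("Patient has checked in for abdominal pain which started 3 days ago. "
         "Patient was prescribed idx 20 mg every 4 hours.")
text2 = "The time of discomfort was 3 days ago."

st1_length = len(text2.split())  # 8 words: the string being sliced (st1)
st2_length = len(text1.split())  # 21 words: the base string (st2)

subs = ngram_count(st1_length, 5)                           # n-grams of st1
pairs_full_text = subs * ngram_count(st2_length, st2_length)  # n_start = st2_length
pairs_both_sides = subs * ngram_count(st2_length, 5)          # n_start = 5

print(subs, pairs_full_text, pairs_both_sides)  # 10 10 1530
```

For these samples, comparing against the full base text needs 10 fuzzy comparisons, while n_start = 5 needs 1530, so the two-sided variant is noticeably slower on longer texts.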

For comparison I use:

 fratio = fuzz.token_set_ratio(fs1, fs2) 

You can try different ratios as well; fuzzywuzzy also provides fuzz.ratio, fuzz.partial_ratio and fuzz.token_sort_ratio.
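
To get a feel for why the choice of ratio matters, here is a simplified, standard-library sketch of the idea behind a token-set comparison next to a plain character-level ratio. It is an approximation for illustration only: `plain_ratio` and `token_set_like` are hypothetical names, `difflib.SequenceMatcher` replaces the Levenshtein-based ratio, and punctuation is not stripped, so the numbers will not match fuzzywuzzy's.

```python
from difflib import SequenceMatcher


def plain_ratio(a: str, b: str) -> int:
    """Character-level similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_like(a: str, b: str) -> int:
    """Order-insensitive score built from shared and unshared word sets."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0
    shared = " ".join(sorted(tokens_a & tokens_b))
    # Shared words followed by the words unique to each side.
    with_a = (shared + " " + " ".join(sorted(tokens_a - tokens_b))).strip()
    with_b = (shared + " " + " ".join(sorted(tokens_b - tokens_a))).strip()
    return max(
        plain_ratio(shared, with_a),
        plain_ratio(shared, with_b),
        plain_ratio(with_a, with_b),
    )


short = "3 days ago"
longer = "pain started 3 days ago"

print(plain_ratio(short, longer))      # penalised by the length difference
print(token_set_like(short, longer))   # 100: every word of short occurs in longer
print(token_set_like("days ago 3", short))  # 100: word order is ignored
```

In this sketch a sub-string whose words all occur in the base string scores 100 regardless of length or order, whereas the plain ratio drops as the lengths diverge.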

Your sample 'prescription of idx, 20mg to be given every four hours' has a fuzzy score of 52.

See sample console output.

 7                    prescription of idx, 20mg to be given every four hours           52

Code:

 """
fuzzy_match.py

https://stackoverflow.com/questions/72017146/how-to-get-all-fuzzy-matching-substrings-between-two-strings-in-python

Dependent modules:
    pip install pandas
    pip install nltk
    pip install fuzzywuzzy
    pip install python-Levenshtein

"""


from nltk.util import ngrams
import pandas as pd
from fuzzywuzzy import fuzz


# Sample strings.
text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
text2 = "The time of discomfort was 3 days ago."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"


def myprocess(st1: str, st2: str, threshold):
    """
    Generate sub-strings (word n-grams) from st1 and score each against st2.
    Return a list of [sub-string, reference string, fuzzy ratio] rows whose
    ratio is at least threshold. Saving to csv is done by the caller.
    """
    data = []
    st1_length = len(st1.split())
    st2_length = len(st2.split())

    # Generate ngrams for string1
    m_start = 5
    for m in range(m_start, st1_length + 1):  # st1_length >= m_start

        # If m=3, fs1 = 'Patient has checked', 'has checked in', 'checked in for' ...
        # If m=5, fs1 = 'Patient has checked in for', 'has checked in for abdominal', ...
        for s1 in ngrams(st1.split(), m):
            fs1 = ' '.join(s1)
            
            # Generate ngrams for string2
            n_start = st2_length
            for n in range(n_start, st2_length + 1):
                for s2 in ngrams(st2.split(), n):
                    fs2 = ' '.join(s2)

                    fratio = fuzz.token_set_ratio(fs1, fs2)  # there are other ratios

                    # Save sub string if ratio is within threshold.
                    if fratio >= threshold:
                        data.append([fs1, fs2, fratio])

    return data


def get_match(sub, full, colname1, colname2, threshold=50):
    """
    sub: is a string where we extract the sub-string.
    full: is a string as the base/reference.
    threshold: is the minimum fuzzy ratio where we will save the sub string. Max fuzz ratio is 100.
    """   
    save = myprocess(sub, full, threshold)

    df = pd.DataFrame(save)
    if len(df):
        df.columns = [colname1, colname2, 'fuzzy_ratio']

        is_sort_by_fuzzy_ratio_first = True

        if is_sort_by_fuzzy_ratio_first:
            df = df.sort_values(by=['fuzzy_ratio', colname1], ascending=[False, False])
        else:
            df = df.sort_values(by=[colname1, 'fuzzy_ratio'], ascending=[False, False])

        df = df.reset_index(drop=True)

        df.to_csv(f'{colname1}_{colname2}.csv', index=False)

        # Print to console. Show only the sub-string and the fuzzy ratio. High ratio implies high similarity.
        df1 = df[[colname1, 'fuzzy_ratio']]
        print(df1.to_string())
        print()

        print(f'sub: {sub}')
        print(f'base: {full}')
        print()


def main():
    get_match(text2, text1, 'string2', 'string1', threshold=50)  # output string2_string1.csv
    get_match(text3, text1, 'string3', 'string1', threshold=50)

    get_match(text2, text3, 'string2', 'string3', threshold=10)

    # Other param combo.


if __name__ == '__main__':
    main()

 

Console output:

                                  string2  fuzzy_ratio
0              discomfort was 3 days ago.           72
1           of discomfort was 3 days ago.           67
2      time of discomfort was 3 days ago.           60
3                of discomfort was 3 days           59
4  The time of discomfort was 3 days ago.           55
5           time of discomfort was 3 days           51

sub: The time of discomfort was 3 days ago.
base: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.

                                                                    string3  fuzzy_ratio
0                                                 be given every four hours           61
1                                    idx, 20mg to be given every four hours           58
2        was given a prescription of idx, 20mg to be given every four hours           56
3                                              to be given every four hours           56
4   John was given a prescription of idx, 20mg to be given every four hours           56
5                                 of idx, 20mg to be given every four hours           55
6              was given a prescription of idx, 20mg to be given every four           52
7                    prescription of idx, 20mg to be given every four hours           52
8            given a prescription of idx, 20mg to be given every four hours           52
9                  a prescription of idx, 20mg to be given every four hours           52
10        John was given a prescription of idx, 20mg to be given every four           52
11                                              idx, 20mg to be given every           51
12                                        20mg to be given every four hours           50

sub: John was given a prescription of idx, 20mg to be given every four hours
base: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.

                                  string2  fuzzy_ratio
0      time of discomfort was 3 days ago.           41
1           time of discomfort was 3 days           41
2                time of discomfort was 3           40
3                of discomfort was 3 days           40
4  The time of discomfort was 3 days ago.           40
5           of discomfort was 3 days ago.           39
6       The time of discomfort was 3 days           39
7              The time of discomfort was           38
8            The time of discomfort was 3           35
9              discomfort was 3 days ago.           34

sub: The time of discomfort was 3 days ago.
base: John was given a prescription of idx, 20mg to be given every four hours
