Here is code that finds the sub-strings of a given string that fuzzily match a base string, with a settable fuzzy_ratio threshold.
It uses nltk to generate the n-grams.
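Typical algorithm:
First we generate word n-grams with param = 5, 6, 7 and 8 (the sample string below has 8 words).
param=5
ngrams = ['The time of discomfort was', 'time of discomfort was 3', 'of discomfort was 3 days', 'discomfort was 3 days ago.']
Each n-gram is compared with the base string at param=5,
then at param=6,
and so on until param=8.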
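The n-gram generation described above can be sketched in plain Python. The helper `word_ngrams` is a hypothetical stand-in for `nltk.util.ngrams`, used here only so that the snippet runs without extra dependencies.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words in text, joined by spaces."""
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]


text2 = "The time of discomfort was 3 days ago."
word_count = len(text2.split())  # 8 words

# Window sizes 5, 6, 7 and 8, as in the walkthrough above.
for n in range(5, word_count + 1):
    print(n, word_ngrams(text2, n))
```

For n = 5 this prints the same four n-grams listed above, and for n = 8 only the full sentence remains.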
You can revise the code by changing n_start to 5 or so, so that the n-grams of string1 are compared with the n-grams of string2. At the moment string2 is only used as the full text, because n_start is set to st2_length.
# Generate ngrams for string2
n_start = 5  # st2_length
for n in range(n_start, st2_length + 1):
    ...
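To get a feel for how much extra work a lower n_start causes, this rough sketch counts the word n-grams of the base string for a given starting size. `count_ngrams` is a hypothetical helper, not part of the script.

```python
def count_ngrams(text, n_start):
    """Count word n-grams of text for every size from n_start up to the full length."""
    word_count = len(text.split())
    # A text of w words has w - n + 1 n-grams of size n.
    return sum(word_count - n + 1 for n in range(n_start, word_count + 1))


text1 = ("Patient has checked in for abdominal pain which started 3 days ago. "
         "Patient was prescribed idx 20 mg every 4 hours.")

print(count_ngrams(text1, len(text1.split())))  # 1: only the full text
print(count_ngrams(text1, 5))                   # 153
```

Every one of these n-grams is compared with every n-gram of the other string, so lowering n_start makes the run noticeably slower on long texts.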
For comparison I use:
fratio = fuzz.token_set_ratio(fs1, fs2)
Have a look at this also. You can try other ratios as well, for example fuzz.ratio, fuzz.partial_ratio or fuzz.token_sort_ratio.
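For intuition about what a token-set comparison measures, here is a simplified sketch built on the standard library. It is an approximation only: `simple_ratio` and `token_set_sketch` are hypothetical names, `difflib` stands in for the scorer, and the preprocessing is reduced to lowercasing and dropping punctuation, so the numbers will not necessarily match what fuzz.token_set_ratio returns.

```python
import re
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of two strings on a 0-100 scale (difflib stand-in)."""
    if not a or not b:
        return 0
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def token_set_sketch(s1, s2):
    """Approximate a token-set comparison: ignore word order and duplicates."""
    # Simplified preprocessing: lowercase and keep only letters and digits.
    tokens1 = set(re.sub(r'[^a-z0-9]+', ' ', s1.lower()).split())
    tokens2 = set(re.sub(r'[^a-z0-9]+', ' ', s2.lower()).split())

    common = ' '.join(sorted(tokens1 & tokens2))
    combined1 = (common + ' ' + ' '.join(sorted(tokens1 - tokens2))).strip()
    combined2 = (common + ' ' + ' '.join(sorted(tokens2 - tokens1))).strip()

    # Best of: common words vs. each side, and the two sides against each other.
    return max(simple_ratio(common, combined1),
               simple_ratio(common, combined2),
               simple_ratio(combined1, combined2))


print(token_set_sketch('every four hours', 'hours every four'))             # 100
print(token_set_sketch('idx 20 mg', 'prescribed idx 20 mg every 4 hours'))  # 100
```

Word order does not matter, and a string whose words are all contained in the other one scores 100, which is why this kind of ratio suits comparing a short phrase with a longer text.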
Your sample 'prescription of idx, 20mg to be given every four hours' has a fuzzy score of 52.
See the sample console output:
7  prescription of idx, 20mg to be given every four hours  52
Code:
"""
fuzzy_match.py
https://stackoverflow.com/questions/72017146/how-to-get-all-fuzzy-matching-substrings-between-two-strings-in-python
Dependent modules:
pip install pandas
pip install nltk
pip install fuzzywuzzy
pip install python-Levenshtein
"""
from nltk.util import ngrams
import pandas as pd
from fuzzywuzzy import fuzz
# Sample strings.
text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
text2 = "The time of discomfort was 3 days ago."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"
def myprocess(st1: str, st2: str, threshold):
    """
    Generate sub-strings from st1 and compare them with st2.
    Return a list of [sub-string, base string, fuzzy ratio] rows whose
    ratio is at least threshold.
    """
    data = []
    st1_length = len(st1.split())
    st2_length = len(st2.split())

    # Generate ngrams for string1
    m_start = 5
    for m in range(m_start, st1_length + 1):  # st1_length >= m_start

        # If m=3, fs1 = 'Patient has checked', 'has checked in', 'checked in for' ...
        # If m=5, fs1 = 'Patient has checked in for', 'has checked in for abdominal', ...
        for s1 in ngrams(st1.split(), m):
            fs1 = ' '.join(s1)

            # Generate ngrams for string2
            n_start = st2_length
            for n in range(n_start, st2_length + 1):
                for s2 in ngrams(st2.split(), n):
                    fs2 = ' '.join(s2)

                    fratio = fuzz.token_set_ratio(fs1, fs2)  # there are other ratios

                    # Save sub string if ratio is within threshold.
                    if fratio >= threshold:
                        data.append([fs1, fs2, fratio])

    return data
def get_match(sub, full, colname1, colname2, threshold=50):
    """
    sub: is a string where we extract the sub-string.
    full: is a string as the base/reference.
    threshold: is the minimum fuzzy ratio where we will save the sub string. Max fuzz ratio is 100.
    """
    save = myprocess(sub, full, threshold)

    df = pd.DataFrame(save)
    if len(df):
        df.columns = [colname1, colname2, 'fuzzy_ratio']

        is_sort_by_fuzzy_ratio_first = True

        if is_sort_by_fuzzy_ratio_first:
            df = df.sort_values(by=['fuzzy_ratio', colname1], ascending=[False, False])
        else:
            df = df.sort_values(by=[colname1, 'fuzzy_ratio'], ascending=[False, False])

        df = df.reset_index(drop=True)

        df.to_csv(f'{colname1}_{colname2}.csv', index=False)

        # Print to console. Show only the sub-string and the fuzzy ratio. High ratio implies high similarity.
        df1 = df[[colname1, 'fuzzy_ratio']]
        print(df1.to_string())
        print()

        print(f'sub: {sub}')
        print(f'base: {full}')
        print()
def main():
    get_match(text2, text1, 'string2', 'string1', threshold=50)  # output string2_string1.csv
    get_match(text3, text1, 'string3', 'string1', threshold=50)
    get_match(text2, text3, 'string2', 'string3', threshold=10)

    # Other param combo.


if __name__ == '__main__':
    main()
Console output:
   string2                                 fuzzy_ratio
0  discomfort was 3 days ago.              72
1  of discomfort was 3 days ago.           67
2  time of discomfort was 3 days ago.      60
3  of discomfort was 3 days                59
4  The time of discomfort was 3 days ago.  55
5  time of discomfort was 3 days           51
sub: The time of discomfort was 3 days ago.
base: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.
    string3                                                                   fuzzy_ratio
 0  be given every four hours                                                 61
 1  idx, 20mg to be given every four hours                                    58
 2  was given a prescription of idx, 20mg to be given every four hours       56
 3  to be given every four hours                                              56
 4  John was given a prescription of idx, 20mg to be given every four hours  56
 5  of idx, 20mg to be given every four hours                                 55
 6  was given a prescription of idx, 20mg to be given every four             52
 7  prescription of idx, 20mg to be given every four hours                   52
 8  given a prescription of idx, 20mg to be given every four hours           52
 9  a prescription of idx, 20mg to be given every four hours                 52
10  John was given a prescription of idx, 20mg to be given every four        52
11  idx, 20mg to be given every                                               51
12  20mg to be given every four hours                                         50
sub: John was given a prescription of idx, 20mg to be given every four hours
base: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.
   string2                                 fuzzy_ratio
0  time of discomfort was 3 days ago.      41
1  time of discomfort was 3 days           41
2  time of discomfort was 3                40
3  of discomfort was 3 days                40
4  The time of discomfort was 3 days ago.  40
5  of discomfort was 3 days ago.           39
6  The time of discomfort was 3 days       39
7  The time of discomfort was              38
8  The time of discomfort was 3            35
9  discomfort was 3 days ago.              34
sub: The time of discomfort was 3 days ago.
base: John was given a prescription of idx, 20mg to be given every four hours
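If only the best few matches are needed rather than the whole CSV, the rows can be ranked directly. This is a minimal sketch, assuming rows shaped like those returned by myprocess ([sub-string, base string, ratio]). `top_matches` is a hypothetical helper, and the sample rows are copied from the first console table above.

```python
def top_matches(rows, k=3):
    """Return the k rows with the highest fuzzy ratio (third column)."""
    return sorted(rows, key=lambda row: row[2], reverse=True)[:k]


base = ("Patient has checked in for abdominal pain which started 3 days ago. "
        "Patient was prescribed idx 20 mg every 4 hours.")

# Sample rows in the [sub-string, base string, ratio] shape.
rows = [
    ['of discomfort was 3 days', base, 59],
    ['discomfort was 3 days ago.', base, 72],
    ['time of discomfort was 3 days', base, 51],
    ['of discomfort was 3 days ago.', base, 67],
]

for sub, _, ratio in top_matches(rows, k=2):
    print(ratio, sub)
```

This prints the rows with ratios 72 and 67, in that order.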