Finding Most Similar Sentences Among All In Python
Solution 1:
Why did it not work for you with cosine similarity and the TFIDF-vectorizer?
I tried it and it works with this code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame(columns=["ID", "DESCRIPTION"], data=[
    [10, "Cancel ASN WMS Cancel ASN"],
    [11, "MAXPREDO Validation is corect"],
    [12, "Move to QC"],
    [13, "Cancel ASN WMS Cancel ASN"],
    [14, "MAXPREDO Validation is right"],
    [15, "Verify files are sent every hours for this interface from Optima"],
    [16, "MAXPREDO Validation are correct"],
    [17, "Move to QC"],
    [18, "Verify files are not sent"],
])
corpus = list(df["DESCRIPTION"].values)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
threshold = 0.4

# Compare every pair of sentences once; starting y at x + 1 skips self-comparisons
for x in range(X.shape[0]):
    for y in range(x + 1, X.shape[0]):
        similarity = cosine_similarity(X[x], X[y])
        if similarity > threshold:
            print(df["ID"][x], ":", corpus[x])
            print(df["ID"][y], ":", corpus[y])
            print("Cosine similarity:", similarity)
            print()
The threshold can be adjusted, but a threshold of 0.9 will not yield the results you want, because only the identical sentences score that high here.
The output for a threshold of 0.4 is:
10 : Cancel ASN WMS Cancel ASN
13 : Cancel ASN WMS Cancel ASN
Cosine similarity: [[1.]]

11 : MAXPREDO Validation is corect
14 : MAXPREDO Validation is right
Cosine similarity: [[0.64183024]]

12 : Move to QC
17 : Move to QC
Cosine similarity: [[1.]]

15 : Verify files are sent every hours for this interface from Optima
18 : Verify files are not sent
Cosine similarity: [[0.44897995]]
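To see where a score such as 0.64183024 comes from, the calculation can be reproduced by hand. The following is only a sketch, assuming the default TfidfVectorizer settings (lowercasing, tokens of at least two word characters, smoothed IDF ln((1 + n) / (1 + df)) + 1, L2 normalization). The helper names tokenize, tfidf_vectors and cosine are made up for this illustration and are not part of scikit-learn.

```python
import math
import re
from collections import Counter

corpus = [
    "Cancel ASN WMS Cancel ASN",
    "MAXPREDO Validation is corect",
    "Move to QC",
    "Cancel ASN WMS Cancel ASN",
    "MAXPREDO Validation is right",
    "Verify files are sent every hours for this interface from Optima",
    "MAXPREDO Validation are correct",
    "Move to QC",
    "Verify files are not sent",
]

def tokenize(text):
    # Lowercase and keep tokens of at least two word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_vectors(docs):
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        weights = {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

def cosine(a, b):
    # The vectors are already unit length, so the dot product is the cosine
    return sum(w * b.get(t, 0.0) for t, w in a.items())

vectors = tfidf_vectors(corpus)
print(cosine(vectors[1], vectors[4]))  # sentences with IDs 11 and 14
```

The sentences with IDs 11 and 14 share only "maxpredo", "validation" and "is", while "corect" and "right" each occur in a single sentence and therefore carry the highest weights. That is why the pair ends up near 0.64 rather than anywhere close to 0.9.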
With a threshold of 0.39, all your expected sentences appear in the output, but an additional pair with the IDs [15, 18] shows up as well:
10 : Cancel ASN WMS Cancel ASN
13 : Cancel ASN WMS Cancel ASN
Cosine similarity: [[1.]]

11 : MAXPREDO Validation is corect
14 : MAXPREDO Validation is right
Cosine similarity: [[0.64183024]]

11 : MAXPREDO Validation is corect
16 : MAXPREDO Validation are correct
Cosine similarity: [[0.39895808]]

12 : Move to QC
17 : Move to QC
Cosine similarity: [[1.]]

14 : MAXPREDO Validation is right
16 : MAXPREDO Validation are correct
Cosine similarity: [[0.39895808]]

15 : Verify files are sent every hours for this interface from Optima
18 : Verify files are not sent
Cosine similarity: [[0.44897995]]
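The loop above calls cosine_similarity on one pair at a time. As a possible variant, the whole similarity matrix can be computed in a single call and the pairs above the threshold read from its upper triangle. This is a sketch that assumes the corpus is small enough for a dense n x n matrix to fit in memory; the function name similar_pairs is made up for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similar_pairs(sentences, threshold):
    """Return (i, j, score) for every pair i < j scoring above threshold."""
    X = TfidfVectorizer().fit_transform(sentences)
    scores = cosine_similarity(X)  # dense (n, n) array with all pairwise scores
    # Upper triangle without the diagonal: every pair exactly once, no self-pairs
    rows, cols = np.triu_indices(len(sentences), k=1)
    keep = scores[rows, cols] > threshold
    return [(int(i), int(j), float(scores[i, j]))
            for i, j in zip(rows[keep], cols[keep])]

ids = [10, 11, 12, 13, 14, 15, 16, 17, 18]
corpus = [
    "Cancel ASN WMS Cancel ASN",
    "MAXPREDO Validation is corect",
    "Move to QC",
    "Cancel ASN WMS Cancel ASN",
    "MAXPREDO Validation is right",
    "Verify files are sent every hours for this interface from Optima",
    "MAXPREDO Validation are correct",
    "Move to QC",
    "Verify files are not sent",
]

for i, j, score in similar_pairs(corpus, threshold=0.4):
    print(ids[i], ids[j], round(score, 4))
```

With threshold=0.4 this should list the same four pairs as the first run above. For much larger corpora the dense matrix becomes the bottleneck, and a nearest-neighbour search would be the more realistic choice.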
Solution 2:
A possible way would be to use word embeddings to create vector representations of your sentences. For example, you could take pretrained word embeddings and let an RNN layer combine the word embeddings of each sentence into a single sentence vector. You then have one vector per sentence and can calculate the distances between them. You still need to decide which threshold a pair has to pass to be accepted as similar, since the scale of word embeddings is not fixed.
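As a rough illustration of the idea, here is a minimal sketch that swaps the RNN for plain averaging (mean pooling) of the word vectors, which is simpler but ignores word order. The 3-dimensional vectors in EMBEDDINGS are made up for this example; real ones would come from a pretrained model, and the function names are hypothetical.

```python
import math

# Made-up 3-dimensional "embeddings", for illustration only. Real vectors
# would come from a pretrained model and have hundreds of dimensions.
EMBEDDINGS = {
    "validation": [0.9, 0.1, 0.0],
    "is":         [0.1, 0.1, 0.1],
    "correct":    [0.7, 0.6, 0.1],
    "right":      [0.6, 0.7, 0.1],
    "move":       [0.0, 0.2, 0.9],
    "to":         [0.1, 0.1, 0.2],
    "qc":         [0.1, 0.3, 0.8],
}

def sentence_vector(sentence):
    """Average the vectors of all known words; unknown words are skipped."""
    vectors = [EMBEDDINGS[w] for w in sentence.lower().split() if w in EMBEDDINGS]
    if not vectors:
        raise ValueError("no known words in sentence")
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = sentence_vector("Validation is correct")
b = sentence_vector("Validation is right")
c = sentence_vector("Move to QC")
print(cosine(a, b))  # paraphrases: expected to be high
print(cosine(a, c))  # unrelated sentences: expected to be clearly lower
```

Cosine similarity does not depend on the length of the vectors, which takes some of the sting out of the scale problem, but the threshold still has to be tuned on your own data.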
Update
I did some experiments. In my opinion, this is a viable method for such a task; however, you might want to find out for yourself how well it works in your case. I created an example in my git repository.
The Word Mover's Distance algorithm can also be used for this task. You can find more information about this topic in this Medium article.
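To give a feeling for what Word Mover's Distance measures, here is a small sketch. Note that it does not compute the exact distance, which requires solving an optimal-transport problem; it computes the simpler "relaxed" lower bound, where every word moves all of its weight to the nearest word of the other sentence. The 2-dimensional vectors and the function names are made up for this example, and every token is assumed to be present in EMBEDDINGS.

```python
import math
from collections import Counter

# Made-up 2-dimensional word vectors, for illustration only
EMBEDDINGS = {
    "files": [0.9, 0.1], "documents": [0.8, 0.2],
    "sent": [0.2, 0.9], "delivered": [0.3, 0.8],
    "move": [-0.7, 0.1], "qc": [-0.8, -0.3],
}

def nbow(tokens):
    """Normalized bag of words: each word's share of the sentence."""
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}

def one_sided_cost(source, target):
    # Every source word moves all of its weight to the nearest target word
    return sum(
        weight * min(math.dist(EMBEDDINGS[w], EMBEDDINGS[t]) for t in target)
        for w, weight in source.items()
    )

def relaxed_wmd(tokens_a, tokens_b):
    """Lower bound of the Word Mover's Distance (smaller means more similar)."""
    a, b = nbow(tokens_a), nbow(tokens_b)
    return max(one_sided_cost(a, b), one_sided_cost(b, a))

near = relaxed_wmd(["files", "sent"], ["documents", "delivered"])
far = relaxed_wmd(["files", "sent"], ["move", "qc"])
print(near, far)
```

Unlike cosine similarity this is a distance, so similar sentences get small values and the threshold works in the opposite direction. For the exact distance on real embeddings, a library implementation such as gensim's KeyedVectors.wmdistance is probably the more practical route.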
Solution 3:
One can use this Python 3 library to compute sentence similarity: https://github.com/UKPLab/sentence-transformers
Code example from https://www.sbert.net/docs/usage/semantic_textual_similarity.html:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-MiniLM-L12-v2')

# Two lists of sentences
sentences1 = ['The cat sits outside',
              'A man is playing guitar',
              'The new movie is awesome']
sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

# Compute embeddings for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

# Compute cosine similarities
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)

# Output the pairs with their score
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))
The library provides state-of-the-art sentence embedding models.
See https://stackoverflow.com/a/68728666/395857 to perform sentence clustering.