I have to compute the cosine similarity between columns 1 and 2 and store it in a 3rd column. How should I approach this problem in R?

r
nlp
text_analytics

#1

I have a dataset with 200k rows. The 1st and 2nd columns contain sentences, and I have to compute the cosine similarity between columns 1 and 2 and store it in a 3rd column.
Example: 1st column: "Why do I love movies so much? Is this strange", 2nd column: "Why do you love movies", 3rd column: 0.70 (their cosine similarity).
Any reference links related to this problem would be helpful.
A screenshot of the dataset is in the following image.


#2

@deva123 First, create a fixed-length vector for every sentence in both columns. You can build such vectors with a bag-of-words approach, TF-IDF, or word embeddings (word2vec or GloVe). Once you have these vectors, you can compute the cosine similarity between the corresponding sentences of the two columns.
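
As a rough illustration of the bag-of-words route, here is a minimal base R sketch. The toy data frame, the column names `Question.1` and `Question.2` (borrowed from later in this thread), and the helper names `tokenize` and `cosine_bow` are assumptions made for the example, not part of any package.

```r
# Hypothetical toy data; the column names are an assumption.
df <- data.frame(
  Question.1 = c("Why do I love movies so much? Is this strange",
                 "How do I learn R"),
  Question.2 = c("Why do you love movies",
                 "What is the capital of France"),
  stringsAsFactors = FALSE
)

# Lower-case, replace non-alphanumeric characters with spaces, split on spaces.
tokenize <- function(x) {
  x <- tolower(x)
  x <- gsub("[^[:alnum:]]+", " ", x)
  x <- trimws(x)
  if (!nzchar(x)) character(0) else strsplit(x, " +")[[1]]
}

# Cosine similarity between the word-count vectors of two sentences.
cosine_bow <- function(a, b) {
  ta <- tokenize(a)
  tb <- tokenize(b)
  vocab <- union(ta, tb)                      # shared vocabulary of the pair
  va <- as.numeric(table(factor(ta, levels = vocab)))
  vb <- as.numeric(table(factor(tb, levels = vocab)))
  denom <- sqrt(sum(va^2)) * sqrt(sum(vb^2))
  if (denom == 0) return(NA_real_)            # undefined for an empty sentence
  sum(va * vb) / denom
}

# Row-wise similarity between column 1 and column 2, stored in a 3rd column.
df$similarity <- mapply(cosine_bow, df$Question.1, df$Question.2,
                        USE.NAMES = FALSE)
print(df)
```

This uses raw word counts, so the values will differ from TF-IDF or embedding-based scores. For 200k rows, a per-row loop like this may be slow; building sparse document-term matrices for both columns over one shared vocabulary and using a row-wise similarity function (for example `psim2` in text2vec) is likely to scale better.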


#3

Thanks for the reply.
Do you have any reference links for this? I am new to this topic, and most of the blogs I saw were in Python, while I'm more familiar with R. Most of the sources also compare cosine similarity between documents.
So far I have only removed punctuation and stop words.

data1 <- data
library(NLP)
library(tm)
library(stringr)
library(text2vec)

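# sim2() expects numeric matrices (e.g. document-term matrices), not raw strings,
# and its norm argument takes "l2" or "none"
dd <- sim2(data2$Question.1[1], data2$Question.2[2], method = "cosine", norm = "l2")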

class(data2$Question.1[1])

# work on a copy of the data (take a subset, e.g. head(data2, 500), for faster running times)
data_q1 <- data2
prep_fun = function(x) {
  x %>% 
    # make text lower case
    str_to_lower %>% 
    # remove non-alphanumeric symbols
    str_replace_all("[^[:alnum:]]", " ") %>% 
    # collapse multiple spaces
    str_replace_all("\\s+", " ")
}


# clean both question columns, keeping the data frame structure
data_q1[, 1:2] <- lapply(data_q1[, 1:2], prep_fun)