# I have to compute the cosine similarity between columns 1 and 2 and store it in a 3rd column: how do I approach this problem in R?

#1

I have a dataset with 200k rows. The 1st and 2nd columns contain sentences, and I have to compute the cosine similarity between columns 1 and 2 and put it into a 3rd column.
Example: 1st column: "Why do I love movies so much? Is this strange", 2nd column: "Why do you love movies", and 3rd column: 0.70 (which is their cosine value).
A screenshot of the dataset is in the attached image.

#2

@deva123 First, you should create a fixed-length vector for every sentence in both columns. You can create such vectors using a bag-of-words approach, TF-IDF, or word embeddings (word2vec or GloVe). Once you have these vectors, you can easily compute the cosine similarity between the sentences of the two columns, row by row.
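To make the bag-of-words idea concrete, here is a minimal sketch in base R, assuming simple whitespace tokenization after lower-casing and stripping punctuation. The helper names `tokenize` and `cosine_bow` are made up for this illustration and do not come from any package.

```r
# Split a sentence into lower-case alphanumeric tokens
tokenize <- function(s) {
  s <- tolower(s)
  s <- gsub("[^[:alnum:]]+", " ", s)
  words <- strsplit(trimws(s), " +")[[1]]
  words[nzchar(words)]
}

# Cosine similarity between the word-count vectors of two sentences
cosine_bow <- function(a, b) {
  ta <- tokenize(a)
  tb <- tokenize(b)
  vocab <- union(ta, tb)
  # Fixed-length count vectors over the shared vocabulary
  va <- as.numeric(table(factor(ta, levels = vocab)))
  vb <- as.numeric(table(factor(tb, levels = vocab)))
  denom <- sqrt(sum(va^2)) * sqrt(sum(vb^2))
  if (denom == 0) return(NA_real_)  # at least one sentence has no tokens
  sum(va * vb) / denom
}

cosine_bow("Why do I love movies so much? Is this strange",
           "Why do you love movies")
```

With raw counts and no stop-word removal, this pair shares 4 words out of 10 and 5 distinct words respectively, so the result is 4 / sqrt(10 * 5), roughly 0.566. A different weighting (TF-IDF, embeddings) or preprocessing would give a different value, such as the 0.70 in the question. Looping this over 200k rows works but is slow; a sparse-matrix approach scales better.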

#3

Do you have any reference links for this? I am new to this topic. Many of the blogs I saw were about Python, but I'm more familiar with R, and most of the sources compare cosine similarity between documents rather than between two columns of sentences.
So far I have only removed punctuation and stop words:

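```r
library(NLP)
library(tm)
library(stringr)   # str_to_lower(), str_replace_all() and the %>% pipe
library(text2vec)  # sim2() / psim2()

# working copy; `data` is the data frame holding Question.1 and Question.2
data2 <- data

# the question columns should be character vectors, not factors
class(data2$Question.1[1])

# NOTE: this call fails. sim2() expects numeric matrices (for example
# document-term matrices), not raw strings, and `norm` must be "l2" or
# "none", not 12. The text has to be vectorized first.
# dd <- sim2(data2$Question.1[1], data2$Question.2[2], method = "cosine", norm = 12)

# select 500 rows for faster running times
data_q1 <- head(data2, 500)

prep_fun <- function(x) {
  x %>%
    # make text lower case
    str_to_lower() %>%
    # remove non-alphanumeric symbols
    str_replace_all("[^[:alnum:]]", " ") %>%
    # collapse multiple spaces
    str_replace_all("\\s+", " ")
}

# clean both question columns; lapply() keeps them as character columns
data_q1[, 1:2] <- lapply(data_q1[, 1:2], prep_fun)
```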
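One possible way to get from cleaned text to a similarity column is sketched below, assuming the `text2vec` package is installed. It builds a single bag-of-words vocabulary over both columns, turns each column into a document-term matrix, and uses `psim2()` to compare row i of one matrix with row i of the other. The toy data frame `df` and the helpers `prep` and `make_dtm` are hypothetical; only the column names `Question.1` and `Question.2` are taken from the snippet above.

```r
library(text2vec)

# Hypothetical toy data standing in for the real 200k-row data frame
df <- data.frame(
  Question.1 = c("Why do I love movies so much? Is this strange",
                 "How do I learn R?"),
  Question.2 = c("Why do you love movies",
                 "What is the capital of France?"),
  stringsAsFactors = FALSE
)

# Lower-case, replace non-alphanumerics with spaces, collapse whitespace
prep <- function(x) gsub("\\s+", " ", gsub("[^[:alnum:]]", " ", tolower(x)))

# One vocabulary over both columns, so both matrices share the same columns
it_all <- itoken(c(df$Question.1, df$Question.2), preprocessor = prep,
                 tokenizer = word_tokenizer, progressbar = FALSE)
vectorizer <- vocab_vectorizer(create_vocabulary(it_all))

# Turn a character vector into a sparse document-term matrix
make_dtm <- function(txt) {
  it <- itoken(txt, preprocessor = prep, tokenizer = word_tokenizer,
               progressbar = FALSE)
  create_dtm(it, vectorizer)
}
dtm1 <- make_dtm(df$Question.1)
dtm2 <- make_dtm(df$Question.2)

# Row-wise ("parallel") cosine similarity: row i of dtm1 vs row i of dtm2
df$cosine <- psim2(dtm1, dtm2, method = "cosine", norm = "l2")
df$cosine
```

`sim2()` would instead return the full matrix of every row against every row, which is not needed here and would be far too large for 200k rows. If raw counts prove too crude, TF-IDF weighting via text2vec's `TfIdf` model could be applied to both matrices before calling `psim2()`.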