How to implement semantic search in Python or R?

How can I implement it in R or Python? When I search a document database, it should return the documents that match the query as well as related documents that might not contain the same words as the query.

Hi @yashkan27

Please be a bit more specific. It will help us better understand your requirement.

There are just two things that matter in search:

  1. Accuracy
  2. Speed

Now, in order to implement semantic search, you first need to understand how search is implemented.

It is generally done using a reverse index (more commonly called an inverted index).

The first step is creating a vocabulary: say you take all the unigrams, bigrams and trigrams in all the documents and make a unique list, removing the English stopwords etc. Let's say you come up with a 50K-keyword vocabulary.

Then you create a bag of words for each document, counting only the words that belong to the chosen vocabulary.

Example: Sen1: My name is Anand. I like AV discuss portal
Sen2: AV Discuss portal is great

all_keywords = (my, name, is, anand, av, discuss, portal, great, i, like)
vocabulary = (anand, av, discuss, portal)

Sen1 bag of words [1,1,1,1]
Sen2 bag of words [0,1,1,1]
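
The example above can be reproduced with a few lines of Python. This is only a minimal sketch: the `tokenize` and `bag_of_words` helpers are hypothetical names, the vocabulary is typed in by hand exactly as in the example, and only unigrams are handled.

```python
import re

sentences = [
    "My name is Anand. I like AV discuss portal",
    "AV Discuss portal is great",
]

# Hand-picked vocabulary from the example; a real one would be built
# from the corpus after stopword removal.
vocabulary = ["anand", "av", "discuss", "portal"]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    tokens = tokenize(text)
    return [tokens.count(term) for term in vocabulary]


vectors = [bag_of_words(s, vocabulary) for s in sentences]
print(vectors)  # [[1, 1, 1, 1], [0, 1, 1, 1]]
```

Each position in a vector corresponds to one vocabulary term, so the vector length always equals the vocabulary size.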

Visualizing it, each keyword in the vocabulary can be taken as a dimension, and your documents become vectors in this high-dimensional space (a 4-dimensional space in this case). This is exactly what bag of words means, and this is your index.

Now you create a reverse index: a dictionary that maps each keyword to the list of documents it appears in, sorted by matching score (the simplest being the number of times the keyword occurs in the document).

Reverse Index
anand -> [Sen1]
av -> [Sen2, Sen1]
discuss -> [Sen2, Sen1]
portal -> [Sen2, Sen1]

When a query keyword comes in, you just return the top matching documents. Say someone queries anand as the search term: it will return just Sentence 1. If someone queries portal, both sentences are returned, since both contain it.
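
A rough sketch of such a reverse index in plain Python follows. The helper names (`build_reverse_index`, `search`) are hypothetical. The score used is the keyword count divided by the document length; that normalization is an assumption made here so that the shorter Sen2 ranks ahead of Sen1, because a raw count alone leaves the two sentences tied.

```python
import re
from collections import defaultdict

documents = {
    "Sen1": "My name is Anand. I like AV discuss portal",
    "Sen2": "AV Discuss portal is great",
}
vocabulary = ["anand", "av", "discuss", "portal"]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_reverse_index(documents, vocabulary):
    """Map each keyword to (doc_id, score) pairs, best score first."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        tokens = tokenize(text)
        for term in vocabulary:
            count = tokens.count(term)
            if count:
                # Assumed score: term count normalized by document length.
                index[term].append((doc_id, count / len(tokens)))
    for postings in index.values():
        postings.sort(key=lambda pair: pair[1], reverse=True)
    return dict(index)


def search(index, keyword, top_k=10):
    """Return the ids of the top_k documents for a single keyword."""
    postings = index.get(keyword.lower(), [])
    return [doc_id for doc_id, _ in postings[:top_k]]


index = build_reverse_index(documents, vocabulary)
print(search(index, "anand"))   # ['Sen1']
print(search(index, "portal"))  # ['Sen2', 'Sen1']
```

A lookup is a single dictionary access, which is why this structure is fast. Since portal appears in both example sentences, a lookup for it returns both, with the shorter Sen2 first.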

Next, you want to make the search semantic. For example, if someone searches for SVM, you want to return results for SVM as well as for Support Vector Machines, along with some other algorithms related to SVM.

In such a case, you need to somehow group vocabulary keywords together at a higher conceptual level.

This can be done by reducing the vocabulary dimension using something like LSI (Latent Semantic Indexing).

LSI is nothing but a combination of TF-IDF (which increases the weight of relevant words and decreases the weight of common English words) and dimensionality reduction using SVD, keeping only the largest singular values. This creates a model which takes your original vector in vocabulary space to the lower-dimensional conceptual space.

The final step is mapping both the query and all the documents into the lower-dimensional conceptual space and computing the similarity (typically cosine similarity) of the query with each document.
The most similar results are returned in the search.
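
To make the LSI steps concrete, here is a small sketch using NumPy only. Everything in it is an illustrative assumption: the four toy documents are made up, the TF-IDF weighting is the plain `count * log(N / df)` variant, and `k = 2` concepts is chosen by hand. A library such as gensim will differ in details like weighting and normalization.

```python
import numpy as np

# Made-up toy corpus. Document 1 never contains the token "svm".
docs = [
    "svm classifier margin kernel",
    "support vector machine margin kernel",
    "pasta tomato recipe sauce",
    "recipe sauce garlic pasta",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
term_id = {w: i for i, w in enumerate(vocab)}

# Term-document count matrix: rows are terms, columns are documents.
counts = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(tokenized):
    for w in doc:
        counts[term_id[w], j] += 1

# TF-IDF: terms that occur in many documents get a smaller weight.
df = (counts > 0).sum(axis=1)
idf = np.log(len(docs) / df)
tfidf = counts * idf[:, None]

# Truncated SVD: keep the k largest singular values ("concepts").
k = 2
U, S, Vt = np.linalg.svd(tfidf, full_matrices=False)
U_k = U[:, :k]

# Project every document from vocabulary space into concept space.
doc_concepts = U_k.T @ tfidf  # shape (k, n_docs)


def search(query):
    """Cosine similarity of the query to each document in concept space."""
    q = np.zeros(len(vocab))
    for w in query.lower().split():
        if w in term_id:
            q[term_id[w]] += idf[term_id[w]]
    q_hat = U_k.T @ q
    denom = np.linalg.norm(q_hat) * np.linalg.norm(doc_concepts, axis=0)
    return (q_hat @ doc_concepts) / np.maximum(denom, 1e-12)


sims = search("svm")
ranking = np.argsort(-sims)
for j in ranking:
    print(f"{sims[j]:.3f}  {docs[j]}")
```

In this toy setup the query svm scores highly against the "support vector machine" document even though they share no query word, because both machine learning documents load on the same concept through their shared terms (margin, kernel). A plain keyword lookup would have returned only the first document.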

You can look at the gensim library for an implementation of the LSI part.

The other parts you have to do by hand. Don't be tempted to use pandas etc., as they hold everything in RAM and thus won't scale. Also, R is of little use on any large-scale search backend, though the Revolution R version can be okay-ish.

My personal recommendation would be to use Python as the backend.