# Which machine learning algorithm should I use to map data between two data sets?

#1

I have two different data sets of products and their information, i.e. “cost”, “ingredients”, etc.
The two data sets contain different products. How can I map a product from the second data set to a product in the first data set that has the same or nearby properties?

For Example:
Data set 1
Product   Cost   Ingredient1_level   Ingredient2_level
Alpha     23     12.88               56.91
Beta      34     78.22               98.01
Gama      9      22.19               76.00

Data set 2
Item_name   Price   Material_level1   Material_level2
pro1        34      79.91             22.00
pro111      40      78.21             90.09
pro091      12      32.99             12.89
Pro292      11      21.56             23.99

Here Ingredient1_level and Material_level1 refer to the same kind of property,
and Ingredient2_level and Material_level2 refer to the same kind of property.
Likewise, Product = Item_name and Cost = Price.

I have to check which Item_name in data set 2 is the same as (or has the nearest matching properties to) each Product in data set 1.
Also, if both data sets have a column with character data, which algorithm can I use?
The data sets above are for reference only.

#2

Hey,

Matching data across two different datasets is a different thing from finding a similar product in dataset 2 based on the product features defined in dataset 1. The first only requires you to define a unique key across all rows and match on it, using VLOOKUP in Excel or a query in R or SQL. The latter, finding similar products, is a machine learning task.

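The easiest way to find similar products across two different datasets is to use a cosine similarity matrix. In simple words, it’s a matrix with one row per product in one dataset and one column per product in the other, where each cell holds the cosine similarity between the two products (correlation is the closest familiar notion). For each product, the best match is the one with the highest value in its row.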
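As a rough illustration of the cosine similarity matrix, here is a minimal Python sketch that uses the sample rows from the question. The dictionary layout and the names `cosine_similarity`, `similarity`, and `best_match` are my own choices for illustration, and it assumes the columns of the two datasets have already been lined up (Cost with Price, and so on).

```python
import math

# Feature vectors (cost, level 1, level 2) copied from the question's sample tables.
dataset1 = {
    "Alpha": [23, 12.88, 56.91],
    "Beta": [34, 78.22, 98.01],
    "Gama": [9, 22.19, 76.00],
}
dataset2 = {
    "pro1": [34, 79.91, 22.00],
    "pro111": [40, 78.21, 90.09],
    "pro091": [12, 32.99, 12.89],
    "Pro292": [11, 21.56, 23.99],
}


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Rows are products from dataset 1, columns are items from dataset 2.
similarity = {
    product: {item: cosine_similarity(vec1, vec2) for item, vec2 in dataset2.items()}
    for product, vec1 in dataset1.items()
}

# For each product, pick the item with the highest similarity in its row.
best_match = {product: max(row, key=row.get) for product, row in similarity.items()}

for product, row in similarity.items():
    print(product, "->", best_match[product], round(row[best_match[product]], 3))
```

Keep in mind that cosine similarity compares only the direction of the vectors, so two products whose values differ by a constant factor still score close to 1.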

I’d suggest you take a look at the following links to get a better understanding of how it works:

And if you’re an R user, here is a link that can help you with a cosine implementation:

R package and implementation of cosine

Hope this helps.

Peace.

#3

Besides cosine similarity, there’s another method.
Treat each record as a 3-D point, then calculate the Euclidean distance between records from the two data sets; the pair with the smallest distance is the best match.
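A minimal Python sketch of the Euclidean distance idea, again using the sample rows from the question. The names `dataset1`, `dataset2`, and `nearest` are illustrative only, and `math.dist` needs Python 3.8 or newer. It works on the raw values, so columns with larger numbers weigh more in the distance.

```python
import math

# Each record is a 3-D point: (cost, level 1, level 2), copied from the question.
dataset1 = {
    "Alpha": [23, 12.88, 56.91],
    "Beta": [34, 78.22, 98.01],
    "Gama": [9, 22.19, 76.00],
}
dataset2 = {
    "pro1": [34, 79.91, 22.00],
    "pro111": [40, 78.21, 90.09],
    "pro091": [12, 32.99, 12.89],
    "Pro292": [11, 21.56, 23.99],
}

# For every product in dataset 1, find the item in dataset 2 at the smallest distance.
nearest = {}
for product, point1 in dataset1.items():
    distances = {item: math.dist(point1, point2) for item, point2 in dataset2.items()}
    closest_item = min(distances, key=distances.get)
    nearest[product] = (closest_item, distances[closest_item])

for product, (item, distance) in nearest.items():
    print(f"{product} -> {item} (distance {distance:.2f})")
```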

#4

But first use an appropriate normalization technique, as the values in both datasets are on different scales.
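Here is a minimal Python sketch of normalizing before matching. Min-max scaling is only one possible choice (I am assuming it here, the thread does not name a technique), and fitting the bounds on the rows of both datasets together is also my assumption so that both end up on the same [0, 1] scale. The helper names `fit_min_max` and `scale` are made up for this example.

```python
import math

# Sample rows from the question: (cost, level 1, level 2).
dataset1 = {
    "Alpha": [23, 12.88, 56.91],
    "Beta": [34, 78.22, 98.01],
    "Gama": [9, 22.19, 76.00],
}
dataset2 = {
    "pro1": [34, 79.91, 22.00],
    "pro111": [40, 78.21, 90.09],
    "pro091": [12, 32.99, 12.89],
    "Pro292": [11, 21.56, 23.99],
}


def fit_min_max(rows):
    """Return one (min, max) pair per column, computed over all given rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def scale(row, bounds):
    """Rescale each value to [0, 1]; assumes no column is constant (max > min)."""
    return [(x - lo) / (hi - lo) for x, (lo, hi) in zip(row, bounds)]


# Fit the bounds on both datasets together so they share one scale.
bounds = fit_min_max(list(dataset1.values()) + list(dataset2.values()))
scaled1 = {name: scale(row, bounds) for name, row in dataset1.items()}
scaled2 = {name: scale(row, bounds) for name, row in dataset2.items()}

# Nearest item in dataset 2 for each product, by Euclidean distance on scaled values.
nearest = {
    product: min(scaled2, key=lambda item: math.dist(vec, scaled2[item]))
    for product, vec in scaled1.items()
}
print(nearest)
```

After scaling, every column contributes on a comparable range, so no single property dominates the distance just because its numbers are larger.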