I have many data sets from different countries in Europe. The entities (of the data sets) and the metadata are written in different languages, and abbreviations are sometimes used, so we have to use the provided metadata to recover the full entity names (Fr, En, De…). I'm working on mapping these datasets into one single big dataset, and I want to use machine learning to do that.
I will give you an example hoping to clarify the problem.
I have 2 data sets.
Data set 1 (metadata written in French + abbreviations used):
num  grav  atm
1    2     5
2    3     2
4    1     3
Data set 2 (written in English):
Accident_Index  Severity  Weather  Driver
2               2         2        L
3               3         3        R
4               1         5        L
So here, both data sets use the same accident ID values even though they come from different sources (problem 1).
For accident severity: in Data set 1, grav = 1 means fatal injury, but in Data set 2, Severity = 1 means slight injury, so the meaning of 1 depends on the country and its metadata (different standards are used, in other words). The same goes for the weather: 2 means normal weather in Data set 1, but raining in Data set 2 (problem 2).
And problem 3 is that some data sets have extra columns (like Driver in Data set 2).
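To make the target concrete, here is a minimal sketch in plain Python of the deterministic mapping I would like to automate: one column-name map and one per-column value map for each source, into a single target schema. The code tables below are mostly made-up placeholders (only grav = 1 → fatal, Severity = 1 → slight, atm = 2 → normal, and Weather = 2 → raining come from the example above; every other code is a hypothetical guess):

```python
# Hypothetical per-source mapping config into one target schema:
# accident_id, severity, weather (codes replaced by labels).
SOURCE_CONFIG = {
    "fr": {  # Data set 1 (French metadata)
        "columns": {"num": "accident_id", "grav": "severity", "atm": "weather"},
        "values": {
            "severity": {1: "fatal", 2: "serious", 3: "slight"},   # only 1=fatal is given; rest assumed
            "weather":  {2: "normal", 3: "raining", 5: "snowing"}, # only 2=normal is given; rest assumed
        },
    },
    "en": {  # Data set 2 (English metadata); extra columns like Driver are dropped
        "columns": {"Accident_Index": "accident_id", "Severity": "severity",
                    "Weather": "weather"},
        "values": {
            "severity": {1: "slight", 2: "serious", 3: "fatal"},   # only 1=slight is given; rest assumed
            "weather":  {1: "fine", 2: "raining", 3: "snowing", 5: "fog"},  # only 2=raining is given
        },
    },
}

def map_row(source, row):
    """Map one raw row from `source` into the unified schema."""
    cfg = SOURCE_CONFIG[source]
    out = {}
    for src_col, tgt_col in cfg["columns"].items():
        if src_col not in row:
            continue
        val = row[src_col]
        # translate the coded value to a shared label when a value map exists
        out[tgt_col] = cfg["values"].get(tgt_col, {}).get(val, val)
    # prefix the ID with the source so overlapping IDs stay distinct (problem 1)
    out["accident_id"] = f"{source}:{out['accident_id']}"
    return out

print(map_row("fr", {"num": 1, "grav": 1, "atm": 2}))
print(map_row("en", {"Accident_Index": 2, "Severity": 1, "Weather": 2, "Driver": "L"}))
```

What I would like the machine learning to do is produce these column and value maps automatically from the multilingual metadata, instead of my hand-writing one config per country.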
So how can I do this mapping using machine learning?
Please bear with my bad English… I'm not a native speaker.