Merging of Data Sets


What are the things to consider while merging two or more data sets? and how can one make sure that the merge has happened correctly?



Merging of data sets is most common task and it’s also very easy to get wrong. Things to remember before combining data sets are:

  • Understand the structure of data sets and the method of the way you’re merging them. Otherwise you can end up with a data set that you think is ready for analysis but it is not the required one and can impact the output.

  • If you are performing one-to-one merge, one observation from the master data set is combined with one observation from other data set. A one-to-one merge makes sense when the observations in both data sets describe the same things, but have different information about them. For example, you might merge the answers people gave in wave one of a survey with the answers the same people gave in wave two of the survey.

  • In case one-to-many or many-to-one merge, one observation from one data set is combined with many observations from the other (the difference between one-to-many and many-to-one being whether the master data set has the “many” or the other data set). These merges make sense when you have hierarchical data, and one data set contains information about the level one units while the other contains information about the level two units. For example, you might merge information about households with information about the individuals who live in those households.

  • Always check for duplicate records and if it is there investigate the reason and remove redundancy.

  • One of the common problem with merge is common variable is available in different data type format.

  • If you are going with left and right merge, the number of records in output dataset must be equals to number of records of left right data set respectively.

  • If you are performing inner merge, there must be no missing values for any variable of output data set.

Hope this helps!