Research Guides: Stata: Merging and Appending

Combining datasets

Merging datasets

In this example, we have two datasets, in_class_data and statepolitical, and we want to use variables from both in one analysis. So, we have to merge them together.

The first thing to do is identify a variable that appears in both datasets, and that is stored in exactly the same way, which will allow the software to match up information in both datasets. This is called the key variable. In both datasets, we have variables that describe states. Let's look at our other dataset.

use statepolitical, clear
codebook state
codebook statefips

How do these variables differ?

Now let's go back to first dataset.

use in_class_data, clear
codebook state
codebook statefips

It's better to use a numeric or alphanumeric code rather than a name to perform a merge. For U.S. states, look for a FIPS code or postal code rather than the name of the state; the Stata command statastates can be used to add them if they are not provided. For countries, there are various codes developed by the World Bank, IMF, etc., and all are preferable to using the names of the country; the Stata command kountry works similarly to statastates.

Now let's go back to the previous dataset. We want to merge the statepolitical dataset to in_class_data, so we want in_class_data to be open while we are merging. in_class_data.dta is called the "master" file, while statepolitical.dta will be the "using" file.

merge m:1 statefips using statepolitical.dta

m:1 stands for many to one. Your master dataset has many observations for each state, but the using dataset has only one observations for each state. What do imagine you would use if you were merging two datasets, each with only one observation for each value of the key variable?

After you have merged, you should check to see what didn't merge. Stata provides you with a handy variable called _merge that identifies if observations matched (3), were only in the master file (1) or only in the using file (2).

list if _merge==2
list if _merge!=3

Use your knowledge to get rid of _merge. Stata won't let you merge another dataset if _merge is already there.

A common problem with merging occurs when there are duplicate observations, which prevent the software from matching. Stata has commands for dropping duplicates, but it is also important to understand why there a duplicates, because there might be something else wrong with your data.

Appending observations

The append command is what you use when you have two datasets, structured exactly the same way with the same variables, that you want to stack on top of each other. You might use it if you have datasets from two different years, for example, with the same variables, that you want to put together in one file.

summarize
append using extra_observations.dta
summarize