Skip to Main Content

Stata: Merging and Appending

Combining datasets

Merging datasets

In this example, we have two datasets, in_class_data and statepolitical, and we want to use variables from both in one analysis. So, we have to merge them together.

The first thing to do is identify a variable that appears in both datasets, and that is stored in exactly the same way, which will allow the software to match up information in both datasets. This is called the key variable. In both datasets, we have variables that describe states. Let's look at our other dataset.

use statepolitical, clear
codebook state
codebook statefips

How do these variables differ?

Now let's go back to first dataset.

use in_class_data, clear
codebook state
codebook statefips

It's better to use a numeric or alphanumeric code rather than a name to perform a merge. For U.S. states, look for a FIPS code or postal code rather than the name of the state; the Stata command statastates can be used to add them if they are not provided. For countries, there are various codes developed by the World Bank, IMF, etc., and all are preferable to using the names of the country; the Stata command kountry works similarly to statastates.

Now let's go back to the previous dataset. We want to merge the statepolitical dataset to in_class_data, so we want in_class_data to be open while we are merging. in_class_data.dta  is called the "master" file, while statepolitical.dta will be the "using" file.

merge m:1 statefips using statepolitical.dta

m:1 stands for many to one. Your master dataset has many observations for each state, but the using dataset has only one observations for each state. What do imagine you would use if you were merging two datasets, each with only one observation for each value of the key variable?

statefips state Governor Political Affiliation 01 Alabama Republican 02 Alaska Independent 04 Arizona Republican 05 Arkansas Republican 06 California Democrat These are the key variables.They should be exactly the same in both datasets. This is the master dataset. Each row represents a person.There may be many people from each state. This is statepolitical, the using dataset. There is only 1 row for each state. serial statefips state age gender inctot 97009 02 Alaska 63 Male 99999 93382 06 California 42 Male 99731 61141 01 Alabama 39 Female 99728 81759 04 Arizona 60 Female 99652 60311 01 Alabama 46 Female 99570 serial statefips state age gender inctot Governor Political Affiliation 97009 02 Alaska 63 Male 99999 Independent 93382 06 California 42 Male 99731 Democrat 61141 01 Alabama 39 Female 99728 Republican 81759 04 Arizona 60 Female 99652 Republican 60311 01 Alabama 46 Female 99570 Republican merge m:1 statefips using statepolitical

After you have merged, you should check to see what didn't merge. Stata provides you with a handy variable called _merge that identifies if observations matched (3), were only in the master file (1) or only in the using file (2).

list if _merge==2
list if _merge!=3

Use your knowledge to get rid of _merge. Stata won't let you merge another dataset if _merge is already there.

A common problem with merging occurs when there are duplicate observations, which prevent the software from matching. Stata has commands for dropping duplicates, but it is also important to understand why there a duplicates, because there might be something else wrong with your data.

Appending observations

The append command is what you use when you have two datasets, structured exactly the same way with the same variables, that you want to stack on top of each other. You might use it if you have datasets from two different years, for example, with the same variables, that you want to put together in one file.

summarize
append using extra_observations.dta
summarize