In this example, we have two datasets, in_class_data and statepolitical, and we want to use variables from both in one analysis. So, we have to merge them together.
The first thing to do is identify a variable that appears in both datasets, and that is stored in exactly the same way, which will allow the software to match up information in both datasets. This is called the key variable. In both datasets, we have variables that describe states.
use in_class_data, clear
codebook state
codebook statefips
How do these variables differ?
Now let's look at the other dataset.
use statepolitical.dta, clear
codebook state
codebook statefips
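It's better to use a numeric code rather than a string (text) to perform a merge. Strings match only when they are identical character for character, so a difference in spelling, capitalization, or a stray space will keep two observations from matching.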
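One way to confirm how the candidate key variables are stored is to describe them. This is a minimal sketch that assumes statepolitical.dta is still open from the step above; the destring line is left commented out because it applies only if statefips turns out to be stored as text, which the codebook output will tell you.

```stata
* Show the storage type of each candidate key variable.
* byte, int, long, float, and double are numeric; str# types are text.
describe state statefips

* Only if statefips is stored as text containing digits,
* convert it to a numeric variable before merging:
* destring statefips, replace
```

The key must be stored the same way in both files, so run the same check on in_class_data as well.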
Now let's go back to the previous dataset. We want to merge the statepolitical dataset to in_class_data, so we want in_class_data to be open while we are merging. in_class_data.dta is called the "master" file, while statepolitical.dta will be the "using" file.
use in_class_data, clear
merge m:1 statefips using statepolitical.dta, keepusing(governorpoliticalaffiliation)
m:1 stands for many to one. Your master dataset has many observations for each state, but the using dataset has only one observation for each state. What do you imagine you would use if you were merging two datasets, each with only one observation for each value of the key variable?
After you have merged, you should check to see what didn't merge. Stata provides you with a handy variable called _merge that identifies whether observations matched (3), were only in the master file (1), or were only in the using file (2).
list if _merge==2
list if _merge!=3
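Listing every unmatched observation can get long in a large dataset, so a frequency table is often a quicker first look. A minimal sketch, assuming the merge above has just been run so that _merge exists:

```stata
* Frequency of each merge result: 1 = master only, 2 = using only, 3 = matched
tabulate _merge

* Number of observations that did not match
count if _merge != 3
```

If the count is zero, every observation matched and there is nothing further to list.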
Use your knowledge to get rid of _merge. Stata won't let you merge another dataset if _merge is already there.
A common problem with merging occurs when there are duplicate observations on the key variable in a dataset that is supposed to have only one observation per value, which prevents the software from matching. Stata has commands for dropping duplicates, but it is also important to understand why there are duplicates, because there might be something else wrong with your data.
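Before dropping anything, it can help to see which observations are duplicated. The sketch below builds a small toy dataset with made-up values (the variable governor and the flag dup are hypothetical names) so it can be run on its own; the same duplicates commands could be applied to statepolitical.dta using statefips.

```stata
* Toy data with hypothetical values: statefips 2 was entered twice
clear
input statefips str10 governor
1 "R"
2 "D"
2 "D"
4 "R"
end

* Summarize how many observations share each value of the key
duplicates report statefips

* Flag repeated keys: dup is 0 for unique keys, 1 or more otherwise
duplicates tag statefips, generate(dup)

* Inspect the flagged observations before deciding what to do with them
list if dup > 0
```

Looking at the flagged rows first shows whether they are exact copies or conflicting records, which is the clue to what went wrong in the data.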
The append command is what you use when you have two datasets, structured exactly the same way with the same variables, that you want to stack on top of each other. You might use it if you have datasets from two different years, for example, with the same variables, that you want to put together in one file.
summarize
append using extra_observations.dta
summarize
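After appending, it is sometimes useful to know which file each observation came from. A minimal sketch, meant to be run in place of the plain append command rather than after it; the variable name source is a hypothetical choice:

```stata
* generate() creates a marker variable: 0 for observations already in
* memory, 1 for observations that came from extra_observations.dta
append using extra_observations.dta, generate(source)

* Check how many observations came from each file
tabulate source
```

With datasets from two different years, a marker like this can stand in for a year variable if the files do not already contain one.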