Skip to Main Content

Stata: Stata v. R

Common data problems

  • Not knowing your working directory.
  • String variables that should be numeric.
  • Duplicate observations when merging.
  • "Wide" format data with dates in variable names.

Data for workshop

You will need to unzip/extract to use these data files.

Your working directory

What is your working directory? This is the first thing you should know, and a quick way to avoid problems.

Stata

Type the following into the Command window.

pwd

You can change it with File > Change Working Directory... Change it to the folder where you have saved the data.

R
getwd()

Change using Session > Set Working Directory

Do files and scripts

Stata

Do-files are scripts for automating Stata commands. They are simply text files with the .do file extension. With a correctly written do-file, anyone can reproduce your analysis.

Please DO use a do-file.

You can create a new do-file by clicking on the New Do-File Editor button, or typing

doedit

A properly documented do-file will contain comments that communicate your intentions at each step of code. Stata will ignore the text of the comments when you run your code, but they make your .do file understandable to humans. Comments can be indicated

*like this (for an entire line)
R

In R, the same thing is usually referred to as a script and has the .R file extension. You can create them right in RStudio.

# Comments look like this

The first few lines of an R script is usually installing and loading packages. The following are some packages we'll need to install.

install.packages(c("reshape2", "readr", "readxl"))

String problems

Stata

A major pitfall: string variables. These are variables stored as text. They take up more memory and are not useful for your regressions, etc. You can recognize these in the Data Editor because they will be colored red.

There are two cases in which you will want to convert string variables to numeric.

  1. When the string represents a categorical (nominal) value, and should be replaced with integers. In this case, you should use encode.
  2. When the string represents a continuous value, and was just accidentally imported as string. (This often happens when the values contain dollars signs, commas, etc.) In this case, you should use destring.

Our sample dataset has an example of each. Since the file is in Stata format (.dta) and it is in your working directory (right?), we can open with simply:

use labor_survey.dta, clear

Case #1: A string representing a categorical variable

Stata

First, use describe to reassure yourself that it is a string. With the encode command, add the gen() option to put the newly created numeric values in a new variable.

describe marital
encode marital, gen(marst)
codebook marst

codebook is very handy for seeing what the actual values of the new variable are.

We can easily create a dummy variable for married/not married.

gen married = 1
replace married = 0 if marst != 3
R

You can read in data using the Import Dataset menu in RStudio, or paste in the code.

library(readr)
labor_survey <- read_csv("labor_survey.csv")

In this case, I used the readr package, which you need to install the first time you use it. You can also use the read.csv() function that does not require installation.

You may encounter the same string problems in R. In case #1, we use the factor() function.

labor_survey$marst <- factor(marital)

Case #2: A continuous variable stored as a string

Stata

As in encode, you can (and should) create a new variable using the gen() option. You should also use the ignore() option to skip over any characters that aren't numbers. Otherwise, it will create missing values.

destring incwage, gen(wage) ignore("$")
describe wage
R
labor_survey$wage <- as.numeric(gsub("\\$", "", labor_survey$incwage))

Removing variables

Stata

If you don't need the variable anymore, you can drop it.

drop incwage

You can also get rid of variables that don't meet certain conditions.

summarize
drop if age < 18
summarize

keep works as a mirror opposite to drop.

None of these changes will persist until we save the data. It's a good idea to keep a copy of your original, pristine data, so let's save under a different filename.

save in_class_data, replace
R

In R, you typically drop variables or observations by selecting what you want and assigning to the same or a new object.

labor_survey <- labor_survey[which(labor_survey$age > 18),]
names(labor_survey)
labor_survey <- labor_survey[-19]

You can also look into the subset() function.

Combining datasets

Merging datasets

Stata

In this example, we have two datasets, in_class_data and statepolitical, and we want to use variables from both in one analysis. So, we have to merge them together.

The first thing to do is identify a variable that appears in both datasets, and that is stored in exactly the same way, which will allow the software to match up information in both datasets. This is called the key variable. In both datasets, we have variables that describe states.

use in_class_data, clear
codebook state
codebook statefips

How do these variables differ?

Now let's look at the other dataset

use statepolitical.dta, clear
codebook state
codebook statefips

It's better to use a numeric code rather than string (text) to perform a merge.

Now let's go back to the previous dataset. We want to merge the statepolitical dataset to in_class_data, so we want in_class_data to be open while we are merging. in_class_data.dta  is called the "master" file, while statepolitical.dta will be the "using" file.

use in_class_data, clear
merge m:1 statefips using statepolitical.dta, keepusing(governorpoliticalaffiliation)

m:1 stands for many to one. Your master dataset has many observations for each state, but the using dataset has only one observations for each state. What do imagine you would use if you were merging two datasets, each with only one observation for each value of the key variable?

statefips state Governor Political Affiliation 01 Alabama Republican 02 Alaska Independent 04 Arizona Republican 05 Arkansas Republican 06 California Democrat These are the key variables.They should be exactly the same in both datasets. This is the master dataset. Each row represents a person.There may be many people from each state. This is statepolitical, the using dataset. There is only 1 row for each state. serial statefips state age gender inctot 97009 02 Alaska 63 Male 99999 93382 06 California 42 Male 99731 61141 01 Alabama 39 Female 99728 81759 04 Arizona 60 Female 99652 60311 01 Alabama 46 Female 99570 serial statefips state age gender inctot Governor Political Affiliation 97009 02 Alaska 63 Male 99999 Independent 93382 06 California 42 Male 99731 Democrat 61141 01 Alabama 39 Female 99728 Republican 81759 04 Arizona 60 Female 99652 Republican 60311 01 Alabama 46 Female 99570 Republican merge m:1 statefips using statepolitical

After you have merged, you should check to see what didn't merge. Stata provides you with a handy variable called _merge that identifies if observations matched (3), were only in the master file (1) or only in the using file (2).

list if _merge==2
list if _merge!=3

Use your knowledge to get rid of _merge. Stata won't let you merge another dataset if _merge is already there.

A common problem with merging occurs when there are duplicate observations, which prevent the software from matching. Stata has commands for dropping duplicates, but it is also important to understand why there a duplicates, because there might be something else wrong with your data.

R

In R, you have the same requirements for key variables. In some ways, merging is actually simpler in R.

statepolitical <- read_csv("statepolitical.csv")
in_class_data <- merge(statepolitical, labor_survey, all = TRUE)

If you don't include "all = TRUE", the result will only include variables that matched correctly.

Appending observations

Stata

The append command is what you use when you have two datasets, structured exactly the same way with the same variables, that you want to stack on top of each other. You might use it if you have datasets from two different years, for example, with the same variables, that you want to put together in one file.

summarize
append using extra_observations.dta
summarize
R

In R you can use rbind() (row bind) to append observations. Variables in both dataframes need to have the same names.

Modifying your variables

Rarely will your data or variables come formatted the way you need them. Frequently you will need to reassign values, rename variables, and give labels that make sense to you.

Missing values

Stata

Missing values in Stata are stored as "." but many datasets store missing values as -99, 9999, etc. This will mess up your analysis if you don't recode.

tab race
recode race -999 = .
tab race
tab race, m

The above command is just one way to recode missing values. mvdecode is also very helpful if you have multiple values that represent missing.

mvdecode inctot, mv(-9999 -9998)
R
table(labor_survey$race)
labor_survey$race[which(labor_survey$race == -999)] <- NA
table(labor_survey$race)

Generate new variables

gen and egen commands can create a variety of new variables from existing variables

gen notwages = inctot - wage
gen lnwage = ln(wage)
help functions
recode age (18/29 = 1 18-29) (30/44 = 2 30-44) (45/64  = 3 45-64) (65/85 = 4 65-85), gen(agegroup)
egen meaninc = mean(inctot), by(agegroup)

Reshaping wide to long

Say you want to use a time series of unemployment by state. If you used a dataset like the one shown below, what problems would you encounter?

state statefips unemp2018m10 unemp2018m09 unemp2018m08 unemp2018m07
ALABAMA 1 89754 90830 91211 90928
ALASKA 2 22779 23339 24104 24919
ARIZONA 4 158154 157377 156195 155473
ARKANSAS 5 47441 47244 48131 49550
CALIFORNIA 6 804349 802959 803076 807518
COLORADO 8 98176 94532 90531 85492
CONNECTICUT 9 79992 80072 80898 82750
Stata

Let's open this file. It's in Excel format, so the command for opening is import excel, rather than use.

import excel "state_unemployment.xlsx", sheet("Sheet1") firstrow clear

Luckily, Stata is much, much better than Excel at dealing with datasets like this, through its reshape command. The first thing you should do is open the documentation like so:

help reshape

Using the documentation as a guide, we can try converting from wide to long

reshape long unemp, i( state statefips state_abbrev) j(month) string

This will create a new variable called month that will contain the month value. You can call it whatever you want, but the variable name goes inside the j() option. In the i() option, you list all identifying variables; these are things like id numbers, or this case state names, that identify individual observations. The string() option at the end allows it to be a string rather than a numeric value.

To truly use this data for a time series or panel data analysis, you would need to format month as a date. We won't get into the date formatting in depth, but you can consult the datetime documentation.

gen date = monthly(month, "YM")
format date %tm
R
library(readxl)
state_unemployment <- read_excel("state_unemployment.xlsx")

There are a few ways to do this in R, but I like melt() in the reshape2 package.

library(reshape2)
melted_unemp <- melt(state_unemployment, id.vars = c("state", "state_abbrev", "statefips"))
melted_unemp$year <- as.numeric(substr(melted_unemp$variable,6,9))