
Stata: Importing and Exploring Data

How to document your work: do files, comments, and logs

Stata is a command-line statistical package with an intuitive syntax, widely used by economists and other social scientists. Together, do files, log files, and comments give you a complete system for documenting your analysis so that it is fully reproducible.

Do files are scripts for automating Stata commands. They are simply text files with the .do file extension. With a correctly written do file, anyone can reproduce your analysis. Some academic journals (e.g. the American Economic Review) require that authors submit .do files along with their papers.

A properly documented do file contains comments that communicate your intentions at each step of the code. Stata ignores the text of the comments when you run your code, but they make your .do file understandable to humans. Comments can be indicated

* like this (for an entire line)

or

it could be // like this

for the end of a line of code (leave a space before the //). Or

it could be /* like this */ for the middle of a line.

A log file records the output of the commands as you run your code. Logging can be turned on and off, but keeping a log running is a good way to track what you've done.
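As a minimal sketch of how these three pieces fit together, the do file below uses all three comment styles and wraps its commands in a log. The file names example.do and example_log.txt are hypothetical, and the auto dataset is one that ships with Stata, not the workshop data.

```stata
* example.do: a hypothetical do file showing comments and a log
capture log close                        // close any log left open by an earlier run
log using example_log.txt, text replace  // start recording commands and output

sysuse auto, clear                       // load a dataset that ships with Stata
summarize price /* car price */ mpg      // summary statistics for two variables

log close                                // stop recording
```

Run it with do example.do, or open it in the Do-file Editor and click Execute.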

Importing and Exploring Data

Preliminaries

What is your working directory?

Check this first: Stata looks for files in the working directory, so knowing where it is will save you from file-not-found errors.

pwd

You can change it with File > Change Working Directory... or with the command cd. Change it to the folder where you have saved the data.
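A short sketch of checking and changing the working directory from a do file follows. It moves up one folder and back again so it runs anywhere; in practice you would cd to your own data folder. The macro name start is just an illustration.

```stata
pwd                        // print the current working directory
local start "`c(pwd)'"     // c(pwd) holds the same path as a string
cd ..                      // move up one folder
cd "`start'"               // and return to where we started
dir                        // list the files Stata can see here
```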

Create a log

This is a good thing to do! It will capture your commands and your output.

log using stataworkshop.txt, text replace

Opening data files

The usual way to get data is to download a file, import it into Stata, and save it as a Stata file. However, there is a world of economic data out there that you can open directly in Stata, without downloading a file. One example is Federal Reserve Economic Data (FRED).

To try it out, go to the menu File > Import > Federal Reserve Economic Data (FRED). In the search box, type in consumer price index. (Hint: if you need to adjust for inflation, this series could be very useful.) Select the first two results, click Add, and then Import.

You'll now have the opportunity to select the time period you want to cover, and the frequency (e.g. quarterly, annual). Click on Submit; the data will import, and you'll see a command like this in the results window:

import fred CPIAUCSL CPILFESL, daterange(2010-01-01 2018-01-22) aggregate(annual,avg) clear

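If you have not used FRED from Stata before, you will need to get an API key from the FRED website. Once you have the key, you can set it using

set fredkey YOURKEYHERE

Adding the permanently option (set fredkey YOURKEYHERE, permanently) saves the key for future sessions.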

Now, let's look at the data files we'll use for the workshop. The Stata file format has the file extension .dta. Not everything comes as a .dta file, unfortunately. If it's an Excel file (xls or xlsx), try:

import excel "in_class_data.xlsx", sheet("Sheet1") firstrow clear

Using the File > Import > Excel spreadsheet menu is a good way to figure out the command. Note the option sheet("Sheet1") indicates which sheet to use, because each Excel file can contain many sheets. firstrow indicates that the first row contains the variable names, which is also important. The clear option clears out any data you already had open.
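If you want to try these options without the workshop file, the sketch below first writes a small Excel file from the auto dataset that ships with Stata and then reads it back. The file name auto_demo.xlsx is hypothetical; the describe option lists the sheets in a workbook, which helps when you do not know the sheet name.

```stata
sysuse auto, clear
keep make price mpg
* write a test workbook; variable names go in the first row
export excel using "auto_demo.xlsx", firstrow(variables) replace

* list the sheets and cell ranges in the workbook
import excel using "auto_demo.xlsx", describe

* read it back, taking variable names from the first row
import excel "auto_demo.xlsx", sheet("Sheet1") firstrow clear
describe
```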

If it's a csv file, try:

import delimited "in_class_data.csv", clear

The imported data are now in memory. Save them as a dta file so that you can reopen them later without importing again.

save in_class_csv, replace
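The full round trip can be sketched as follows, again assuming the auto dataset that ships with Stata and the hypothetical file names auto_demo.csv and auto_demo.dta.

```stata
sysuse auto, clear
export delimited using "auto_demo.csv", replace  // create a csv file to practice on

import delimited "auto_demo.csv", clear          // read the csv into memory
save auto_demo, replace                          // write auto_demo.dta
use auto_demo, clear                             // reopen the saved Stata file
describe
```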

If the file is already in Stata format (dta), it's as easy as pie. You can double-click to open like any other file, you can use the File > Open menu, or you can use a command like this:

use "in_class_data.dta", clear

For ICPSR data, there is a PDF and a YouTube video explaining how to import data using setup files.

One thing to look out for is fixed-width files. You'll recognize these because each line of the data set is one long string with no delimiters between the columns. If you come across one of these, Christine is happy to help.
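For a simple case, Stata's infix command can read a fixed-width file once you know which columns hold which variable. The sketch below writes a two-line test file and reads it back. The file name, variable names, and column positions are all made up for illustration; a real file's codebook tells you the actual positions.

```stata
* write a tiny fixed-width file: name in columns 1-10, age in columns 11-13
tempname fh
file open `fh' using "fixed_demo.txt", write text replace
file write `fh' "Alice     034" _n
file write `fh' "Bob       051" _n
file close `fh'

* read it back by column position
infix str name 1-10 int age 11-13 using "fixed_demo.txt", clear
list
```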

Exploring the data

Your first concern is to make sure the data look as you expect them to.

tab marital
tab marital race
summarize
sum if age > 65
mean inctot
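The same checks can be tried on nlsw88, a dataset that ships with Stata (an assumption here, since it is not the workshop file). One caution the sketch illustrates: Stata treats missing values as larger than any number, so a condition like age > 40 also selects observations where age is missing unless you exclude them.

```stata
sysuse nlsw88, clear
describe                              // variable names, types, and labels
tab race                              // one-way frequency table
tab married race, column              // two-way table with column percentages
summarize wage, detail                // percentiles, skewness, and more
summarize wage if age > 40 & !missing(age)   // exclude missing ages explicitly
```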

If you are using a survey dataset with weights, getting summary statistics such as a mean works a bit differently: you apply either a frequency weight or a probability weight. See this guide to probability and frequency weights to understand the difference. Most likely, a survey dataset uses a probability weight, or pweight.

svyset [pweight=wtsupp]
svy: mean inctot
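To see the effect of weighting without a survey file, the sketch below attaches a made-up weight to nlsw88, a dataset that ships with Stata. The variable demo_wt is purely hypothetical and has no statistical meaning; a real survey supplies its own weight, such as wtsupp above.

```stata
sysuse nlsw88, clear
* hypothetical weight, for illustration only
gen double demo_wt = 100 + 50 * mod(idcode, 7)

mean wage                     // unweighted mean
svyset [pweight=demo_wt]      // declare the probability weight
svy: mean wage                // weighted mean with survey standard errors
```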

To calculate simple counts, you can skip svyset and use the weight as a frequency weight instead. Since Stata only accepts integers for frequency weights, you'll have to create a truncated weight first, like so:

gen trncwt = trunc(wtsupp)
summarize [fweight=trncwt]
tab race [fweight=trncwt]
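A three-row example, with made-up values, shows what truncation does. The original weights sum to 7.8, but the truncated weights sum to 6, so weighted counts come out slightly low.

```stata
clear
input str5 race wtsupp
"A" 2.7
"B" 1.2
"A" 3.9
end

gen trncwt = trunc(wtsupp)    // 2, 1, and 3
tab race [fweight=trncwt]     // A counts as 5 people, B as 1
```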

Confusing? Yes, it is! Many datasets will have codebooks or documentation that explain which weight to use, such as this IPUMS user group question.

Modifying your variables

Rarely will your data or variables come formatted the way you need them; you will need to alter some variables. For example, Stata stores missing values as . (a period), but many datasets code missing values as -99, 9999, etc. If you don't recode them, Stata treats them as real numbers and your results will be distorted.

tab race
recode race (-999 = .)
tab race
tab race, missing

The above command is just one way to recode missing values. mvdecode is also very helpful.

sum incwage
mvdecode incwage, mv(9999999 9999998)
mvdecode inctot, mv(-9999)
sum incwage
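The effect is easy to see on a small made-up dataset. Here -99 and 9999 stand in for missing-value codes; before mvdecode they drag the mean around, and afterwards only the two real incomes are counted.

```stata
clear
input id income
1 52000
2 -99
3 48000
4 9999
end

summarize income                  // mean includes the placeholder codes
mvdecode income, mv(-99 9999)     // turn the codes into Stata missing values
summarize income                  // now based on the two real values
count if missing(income)
```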

Frequently you will need to reassign values, rename variables, and give labels that make sense to you.

codebook gender
rename gender female
recode female (1=0)(2=1)
label define fm 1 female 0 male
label values female fm

The above code creates a dummy variable in which the value of 1 represents female. The following creates a separate dummy variable (r1, r2, and so on) for each value of the race variable.

tab race
tabulate race, generate(r)
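As a quick check of what generate() produces, this sketch uses nlsw88, a dataset that ships with Stata, where race takes three values. Each observation gets a 1 in exactly one of the new variables.

```stata
sysuse nlsw88, clear
tabulate race, generate(r)     // creates r1, r2, and r3
describe r1 r2 r3              // labels show which category each one flags
list race r1 r2 r3 in 1/5
```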

Another pitfall: string variables. These are variables stored as text. They take up more memory and cannot be used directly in regressions or most other statistical commands. You'll want to use encode to convert them to numeric variables with text labels.

describe marital
encode marital, gen(marst)
label variable marst "marital status"

codebook marst
* 1 if married (code 3 here; check the codebook output), 0 otherwise, missing if marst is missing
gen married = (marst == 3) if !missing(marst)
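Because encode numbers the categories in alphabetical order, the code for "married" depends on which categories appear in your data. The sketch below uses made-up values and looks up the code by its label instead of typing the number; the fifth observation is left blank to show that an empty string becomes a missing value.

```stata
clear
input str10 marital
"Married"
"Single"
"Divorced"
"Married"
end
set obs 5                          // adds one observation with an empty string

encode marital, gen(marst)         // Divorced = 1, Married = 2, Single = 3
label list marst                   // show the codes encode assigned

* "Married":marst looks up the numeric code attached to that label
gen married = (marst == "Married":marst) if !missing(marst)
list, nolabel
```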

You can use keep or drop to get rid of variables or observations that you don't want.

drop marital
keep if age>=18
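Since keep and drop change the data in memory, it can help to experiment inside preserve and restore, which put the original data back afterwards. A sketch using nlsw88, a dataset that ships with Stata:

```stata
sysuse nlsw88, clear
count                                 // observations before

preserve                              // take a snapshot of the data
keep if age >= 40 & !missing(age)     // keep only some observations
drop never_married                    // drop a variable
count                                 // fewer observations now
restore                               // return to the snapshot

count                                 // back to the original number
```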

How would you get rid of observations of anyone who isn't age 25?

Generate new variables

gen and egen commands can create new variables based on existing variables, using arithmetic or other functions.

gen notwages = inctot - incwage
gen lnwage = ln(incwage)
help functions
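The difference between the two is that gen works row by row, while egen offers functions that summarize across observations, optionally within groups. A sketch using nlsw88, a dataset that ships with Stata (the new variable names are only illustrations):

```stata
sysuse nlsw88, clear
gen lnwage = ln(wage)                             // row-by-row transformation
egen mean_wage = mean(wage)                       // overall mean, repeated on every row
bysort race: egen race_mean_wage = mean(wage)     // mean within each race group
gen wage_gap = wage - race_mean_wage              // deviation from the group mean
summarize wage mean_wage race_mean_wage wage_gap
```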

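With commands that support it, such as recode, encode, and tabulate, putting gen(variablename) as an option at the end of the command creates a new variable and leaves the original unchanged.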

codebook educ, tab(20)
recode educ (2 = 0) (10 = 4) (20 = 6) (30 = 8) (40 = 9) (50 = 10) (60/71 = 11) (73 = 12) (81 = 13) (91/92 = 14) (111 = 16) (123/124 = 18) (125 = 22), gen(edyears)
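A cut-down version on four made-up observations, using a few of the same rules, shows that the original variable is left untouched and the recoded values go into the new one.

```stata
clear
input educ
2
73
111
125
end

recode educ (2 = 0) (73 = 12) (111 = 16) (125 = 22), gen(edyears)
list        // educ keeps its codes; edyears holds years of education
```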

Regression

The regress command is followed by the dependent variable, and then the independent variables.

regress lnwage edyears age r1 married
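A comparable regression can be tried on nlsw88, a dataset that ships with Stata, where grade plays the role of years of education. This is only a sketch on different data, not the workshop model. It also shows an alternative to hand-made dummies: the i. prefix tells Stata to treat race as categorical and create the indicators itself.

```stata
sysuse nlsw88, clear
gen lnwage = ln(wage)

* dependent variable first, then the independent variables
regress lnwage grade age i.race married

display _b[grade]      // coefficient on years of schooling
display e(N)           // number of observations used
```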

In many cases you will want to use time series or panel data, which have slightly different commands; see Time Series.

End by closing the log file.

log close
