Stata is a command-line statistical package with an intuitive syntax, widely used by economists and other social scientists. With do files, log files, and comments, Stata provides a complete system for documenting your analysis so that it is fully reproducible.
Do files are scripts for automating Stata commands. They are simply text files with the .do file extension. With a correctly written do file, anyone can reproduce your analysis. Some academic journals (e.g. the American Economic Review) require that authors submit .do files along with their papers.
A properly documented do file will contain comments that communicate your intentions at each step of code. Stata will ignore the text of the comments when you run your code, but they make your .do file understandable to humans. Comments can be indicated
*like this (for an entire line)
or
it could be // like this
for the end of a line of code. Or
it could be /* like this */ for the middle of a line.
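The three comment styles can be seen together in a short do file. This is a minimal sketch that uses display so that it runs without any data loaded. Note that the // and /* */ styles are meant for do files; only the * style also works when typing commands interactively.

```stata
* this entire line is a comment
display 2 + 2 // a comment at the end of a command
display /* a comment in the middle of a command */ 2 + 2
```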
A log file records the output of the commands as you run your code. Logging can be turned on and off, but keeping it on is a good idea because it helps you keep track of what you've done.
What is your working directory?
This is the first thing you should know, and a quick way to avoid problems.
pwd
You can change it with File > Change Working Directory... Change it to the folder where you have saved the data.
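The working directory can also be changed with the cd command, which is handy at the top of a do file. A minimal sketch: the folder path below is a made-up placeholder, so substitute the folder where you saved the data.

```stata
* the path below is a hypothetical placeholder, not a folder from this workshop
cd "C:/Users/yourname/Documents/stataworkshop"
* confirm that the change took effect
pwd
```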
Starting a log file is a good thing to do! It will capture your commands and your output.
log using stataworkshop.txt, text replace
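A log should also be closed when you are done so that the file is complete. The sketch below shows one common pattern, reusing the log file name from above; the placement of the commands within your own do file is up to you.

```stata
* close any log left open by an earlier run; capture suppresses the error if none is open
capture log close
log using stataworkshop.txt, text replace

* pause logging, run something you do not want recorded, then resume
log off
log on

* stop recording and close the file at the end of the session
log close
```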
Stata only lets you open one data file at a time. If the file is already in Stata format (dta), it's as easy as pie. You can double-click to open like any other file, you can use the File > Open menu, or you can use a command like this:
use "in_class_data.dta", clear
The clear option clears out any data you already had open.
Not everything comes as a .dta file, unfortunately. If it's a csv file, try:
import delimited "filename.csv"
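If you already have unsaved data open, Stata may stop with an error rather than replace them, so the clear option is useful here too. A sketch with two commonly used options; filename.csv is the same placeholder name as above.

```stata
* clear discards the data in memory; varnames(1) reads variable names from the first row
import delimited "filename.csv", clear varnames(1)
* list the variables and their storage types to check the import
describe
```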
For ICPSR data, there is a PDF and a YouTube video explaining how to import data using setup files. Importing other formats is explained at Stata Class Notes: Entering Data.
Your first concern is to make sure the data look as you expect them to.
summarize
sum if age > 65
tab marital
tab marital race
Rarely will your data or variables come formatted the way you need them; you will need to alter some variables. For example, missing values in Stata are stored as "." but many datasets store missing values as -99, 9999, etc. This will mess up your analysis if you don't recode.
tab race
recode race (-999 = .)
tab race
tab race, m
The above command is just one way to recode missing values. mvdecode is also very helpful.
sum incwage
mvdecode incwage, mv(9999999 9999998)
sum incwage
Frequently you will need to reassign values, rename variables, and give labels that make sense to you.
tab gender
rename gender female
recode female (1=0)(2=1)
label define fm 1 "female" 0 "male"
label values female fm
The code above turns gender into a dummy variable in which the value 1 represents female. The following creates a separate dummy variable for each possible value of the race variable.
tab race
tabulate race, generate(r)
Another pitfall: string variables. These are variables stored as text. They take up more memory and are not useful for your regressions, etc. You'll want to use encode to convert them to numeric variables with text labels.
describe marital
encode marital, gen(marst)
label variable marst "marital status"
codebook marst
gen married = 1
replace married = 0 if marst != 3
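Because encode assigns numeric codes in alphabetical order of the text values, it is worth confirming which number stands for which category before relying on a code such as 3. A minimal sketch, assuming the encode step above has already been run on the workshop data:

```stata
* show the number-to-text mapping that encode created
* (by default the value label has the same name as the new variable)
label list marst
* cross-check the dummy against the categories, including missing values
tab marst married, missing
```

Including missing values in the cross-tab matters here because an observation with a missing marst also satisfies marst != 3 and so is coded 0.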
You can also get rid of observations that don't meet your criteria, for example everyone under 18.
drop if age < 18
The gen and egen commands can create a plethora of new variables from existing variables.
gen lnwage = ln(incwage)
gen notwages = inctot - incwage
codebook educ, tab(20)
recode educ (2 = 0) (10 = 4) (20 = 6) (30 = 8) (40 = 9) (50 = 10) (60/71 = 11) (73 = 12) (81 = 13) (91/92 = 14) (111 = 16) (123/124 = 18) (125 = 22), gen(edyears)
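Where gen works observation by observation, egen computes functions across observations, such as group means. A sketch under stated assumptions: the workshop variables incwage and race are in memory, and the new variable names are made up for illustration.

```stata
* mean wage within each race category, repeated on every observation in that group
egen meanwage_race = mean(incwage), by(race)
* each person's wage relative to the mean of their group
gen wagegap = incwage - meanwage_race
```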
The regress command is followed by the dependent variable, and then the independent variables.
regress lnwage edyears age r1 married [pweight=wtsupp]
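As an alternative to building dummy variables by hand, Stata's factor-variable notation can create them on the fly. A sketch, assuming race holds non-negative integer codes (factor variables require this) and that the variables created earlier are in memory:

```stata
* i.race adds an indicator for every race category except the base (lowest-valued) one
regress lnwage edyears age i.race married [pweight=wtsupp]
```

This controls for all race categories rather than only r1, so it is a different specification from the regression above, not a rewrite of it.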