You will need to unzip (extract) these data files before you can use them.
What is your working directory?
This is the first thing you should know, and a quick way to avoid problems. Type the following into the Command window.
pwd
You can change it with File > Change Working Directory... Change it to the folder where you have saved the data.
Do-files are scripts for automating Stata commands. They are simply text files with the .do file extension. With a correctly written do-file, anyone can reproduce your analysis.
Please DO use a do-file.
You can create a new do-file by clicking on the New Do-File Editor button, or typing
doedit
A properly documented do-file will contain comments that communicate your intentions at each step of code. Stata will ignore the text of the comments when you run your code, but they make your .do file understandable to humans. Comments can be indicated
*like this (for an entire line)
or
it could be // like this
for the end of a line of code. Or
it could be /* like this */ for the middle of a line.
Read more about how to format do files.
A log file records the output of your commands as you run your code. You can turn logging on and off, but keeping a log is a good way to keep track of what you've done.
log using stataworkshop.txt, text replace
When you want to stop logging, close the log file.
log close
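Putting the pieces above together, a do-file might be laid out roughly like the sketch below. The log filename example_log.txt and the use of Stata's built-in auto dataset are assumptions for illustration; they are not part of the workshop files.

```stata
* example.do: a minimal do-file skeleton (hypothetical file names)
capture log close                         // close any log left open by an earlier run
log using example_log.txt, text replace   // start recording output

sysuse auto, clear        // built-in example data, stands in for your own file
summarize price mpg       /* basic descriptive statistics */

log close                 // stop recording
```

Starting with capture log close is a common habit: it keeps the do-file from stopping with an error if a log was left open by a previous run. Note that // and /* */ comments are meant for do-files; only the * style works when you type directly into the Command window.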
A major pitfall: string variables. These are variables stored as text. They take up more memory and are not useful for your regressions, etc. You can recognize these in the Data Editor because they will be colored red.
There are two cases in which you will want to convert string variables to numeric.
If the variable holds categories stored as text, use encode. (However, don't encode a string if you plan to use it as a key variable for merging with another dataset; see the section on merging below.)
If the variable holds numbers stored as text, use destring.
Our sample dataset has an example of each. Since the file is in Stata format (.dta) and it is in your working directory (right?), we can open it with simply:
use labor_survey.dta, clear
First, use describe to reassure yourself that marital is a string. With the encode command, add the gen() option to put the newly created numeric values in a new variable.
describe marital
encode marital, gen(marst)
codebook marst
codebook is very handy for seeing what the actual values of the new variable are.
We can easily create a dummy variable for married/not married.
gen married = 1
replace married = 0 if marst != 3
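Because encode assigns codes in alphabetical order of the text, the number that stands for "married" depends on which categories appear in your data. The sketch below uses made-up values (not the workshop data) to show one way of building the dummy from the label rather than from a hard-coded number, while leaving missing values missing.

```stata
clear
input str10 marital
"married"
"single"
"divorced"
"married"
""
end

encode marital, gen(marst)      // empty strings become missing (.)
codebook marst                  // shows which number goes with which label

* 1 if married, 0 otherwise, missing if marital status is missing
gen married = (marst == "married":marst) if !missing(marst)
tab married, missing
```

The expression "married":marst looks up the numeric code attached to the text "married" in the value label marst, which encode creates with the same name as the new variable.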
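As with encode, you can (and should) create a new variable with destring by using the gen() option. You should also use the ignore() option to strip out any characters that aren't numbers. Otherwise, destring will refuse to convert the variable (or, if you add the force option, it will turn those values into missing).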
destring incwage, gen(wage) ignore("$")
describe wage
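To see what ignore() does, here is a small sketch on made-up values (an assumption, not the workshop data) that contain both dollar signs and commas. The strings are built with char(36), the dollar sign, so that Stata does not mistake text such as $1 for a macro.

```stata
clear
set obs 3
* char(36) is "$"; building the strings this way avoids macro expansion
gen str10 incwage = char(36) + "1,200" in 1
replace incwage = char(36) + "850" in 2
replace incwage = char(36) + "0" in 3

destring incwage, gen(wage) ignore("$,")   // strip dollar signs and commas
describe wage
list incwage wage
```

Listing the old and new variables side by side is a quick check that nothing was converted in an unexpected way before you drop the original.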
If you don't need the variable anymore, you can drop it.
drop incwage
You can also get rid of observations that don't meet certain conditions.
summarize
drop if age < 18
summarize
keep works as a mirror opposite to drop.
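As a rough illustration of how keep mirrors drop, the sketch below uses Stata's built-in auto dataset (an assumption, chosen only so the example runs without the workshop files).

```stata
sysuse auto, clear       // built-in example data

preserve                 // take a snapshot so the drop can be undone
drop if foreign == 1     // remove foreign cars...
count
restore                  // back to the full dataset

keep if foreign == 0     // ...or, equivalently, keep domestic cars
count                    // same number of observations as above

keep make price mpg      // keep also works on variables: all others are dropped
describe
```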
None of these changes will persist until we save the data. It's a good idea to keep a copy of your original, pristine data, so let's save under a different filename.
save in_class_data, replace
If you have a date variable as a string, you do not want to encode or destring. You want to use the date function. Importantly, you need to tell Stata the order of month, day, year.
gen date = date(date_string, "MDY")
It displays as an integer but you can format it so it is human-readable.
format date %td
Let's say the season is an important piece of information. You can pull out the month of year using the month() function.
gen month_of_year = month(date)
To create a monthly series, you can use the mofd() function.
gen month = mofd(date)
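I don't recommend the wofd() function. (Stata's weekly dates always start week 1 on January 1, so they do not line up with calendar weeks.)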
You can make conditions based on dates, but the date needs to be wrapped in td().
list if date < td(15jan2015)
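The date steps above can be tried end to end on a couple of made-up date strings (an assumption; the workshop data are not needed):

```stata
clear
input str10 date_string
"01/10/2015"
"03/05/2015"
end

gen date = date(date_string, "MDY")   // stored as days since 01jan1960
format date %td                       // display as 10jan2015 etc.
gen month_of_year = month(date)       // 1 to 12, ignores the year
gen month = mofd(date)                // monthly date: months since 1960m1
format month %tm
list
list if date < td(15jan2015)          // only the first row qualifies
```

The difference between the last two new variables is that month_of_year is the same for every January, while month distinguishes January 2015 from January 2016.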
In this example, we have two datasets, in_class_data and statepolitical, and we want to use variables from both in one analysis. So, we have to merge them together.
The first thing to do is identify a variable that appears in both datasets, and that is stored in exactly the same way, which will allow the software to match up information in both datasets. This is called the key variable. In both datasets, we have variables that describe states. Let's look at our other dataset.
use statepolitical, clear
codebook state
codebook statefips
How do these variables differ?
Now let's go back to the first dataset.
use in_class_data, clear
codebook state
codebook statefips
It's better to use a numeric or alphanumeric code rather than a name to perform a merge. For U.S. states, look for a FIPS code or postal code rather than the name of the state; the Stata command statastates can be used to add them if they are not provided. For countries, there are various codes developed by the World Bank, IMF, etc., and all are preferable to using the names of the country; the Stata command kountry works similarly to statastates.
We want to merge the statepolitical dataset into in_class_data, so in_class_data needs to be open while we are merging (it already is, from the step above). in_class_data.dta is called the "master" file, while statepolitical.dta is the "using" file.
merge m:1 statefips using statepolitical.dta
m:1 stands for many to one. Your master dataset has many observations for each state, but the using dataset has only one observation for each state. What do you imagine you would use if you were merging two datasets, each with only one observation for each value of the key variable?
After you have merged, you should check to see what didn't merge. Stata provides you with a handy variable called _merge that identifies whether observations matched (3), were only in the master file (1), or were only in the using file (2).
list if _merge==2
list if _merge!=3
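To see all three values of _merge at once, here is a small sketch on made-up data. The variables person_id and governor_party are hypothetical, and a temporary file stands in for statepolitical.dta.

```stata
* "using" data: one row per state (hypothetical values)
clear
input statefips str1 governor_party
1 "R"
2 "R"
6 "D"
end
tempfile states
save `states'

* "master" data: many rows per state
clear
input person_id statefips
1 1
2 1
3 6
4 8
end

merge m:1 statefips using `states'
tabulate _merge      // 1 = master only, 2 = using only, 3 = matched
list, sepby(_merge)
```

Here state 8 appears only in the master data and state 2 only in the using data, so one row gets _merge equal to 1 and one gets 2. Because the tempfile name is a local macro, run the whole block at once rather than line by line.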
Use your knowledge to get rid of _merge. Stata won't let you merge another dataset if _merge is already there.
A common problem with merging occurs when there are duplicate observations, which prevent the software from matching. Stata has commands for dropping duplicates, but it is also important to understand why there are duplicates, because there might be something else wrong with your data.
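One possible way to look for duplicates in a key before merging is sketched below, on made-up data with a hypothetical year variable. It reports and flags the duplicates first, and only drops them afterwards.

```stata
clear
input statefips year
1 2018
1 2018
2 2018
end

duplicates report statefips year            // how many copies of each key?
duplicates tag statefips year, gen(dup)     // dup > 0 marks duplicated keys
list if dup > 0                             // inspect them before deleting anything

duplicates drop statefips year, force       // keep one row per key
isid statefips year                         // errors out if the key is still not unique
```

The force option is required whenever you name specific variables, precisely because rows that share a key may still differ in other variables. That is the case worth investigating before dropping.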
The append command is what you use when you have two datasets, structured exactly the same way with the same variables, that you want to stack on top of each other. You might use it if you have datasets from two different years, for example, with the same variables, that you want to put together in one file.
summarize
append using extra_observations.dta
summarize
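A minimal sketch of append on made-up data follows, with a temporary file standing in for extra_observations.dta (the id and year variables are hypothetical):

```stata
* first "year" of data, saved to a temporary file
clear
input id year
1 2017
2 2017
end
tempfile y2017
save `y2017'

* second "year" of data, same variables
clear
input id year
1 2018
2 2018
end

append using `y2017'    // stack the 2017 rows underneath the 2018 rows
sort id year
list
```

Unlike merge, append adds rows rather than columns, so the number of observations afterwards is the sum of the two files.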
Rarely will your data or variables come formatted the way you need them. Frequently you will need to reassign values, rename variables, and give labels that make sense to you.
Missing values in Stata are stored as "." but many datasets store missing values as -99, 9999, etc. This will mess up your analysis if you don't recode.
tab race
recode race -999 = .
tab race
tab race, m
The above command is just one way to recode missing values. mvdecode is also very helpful if you have multiple values that represent missing.
mvdecode inctot, mv(-9999 -9998)
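Both approaches can be tried on a few made-up rows (an assumption; these are not values from the workshop data):

```stata
clear
input race inctot
1 25000
2 -9999
-999 40000
1 -9998
end

recode race (-999 = .)               // one placeholder code, one variable
mvdecode inctot, mv(-9999 -9998)     // several placeholder codes at once

tab race, missing                    // the missing option shows the . row
summarize inctot                     // the mean is no longer dragged down by -9999
```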
The following creates a dummy variable for each possible value of the race variable. The name of each dummy will begin with whatever text you put in the generate() option, followed by an integer indicating the order of the value in the tabulation.
tab race
tabulate race, generate(r)
What does r1 represent?
We can rename our dummy variables as needed.
rename r1 white
The gen and egen commands can create a variety of new variables from existing variables.
gen notwages = inctot - wage
gen lnwage = ln(wage)
help functions
recode age (18/29 = 1 "18-29") (30/44 = 2 "30-44") (45/64 = 3 "45-64") (65/85 = 4 "65-85"), gen(agegroup)
egen meaninc = mean(inctot), by(agegroup)
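The recode and egen steps can be checked on a handful of made-up rows (an assumption, not the workshop data). Note that the labels in each recode rule are written in double quotes.

```stata
clear
input age inctot
22 20000
35 50000
50 70000
70 30000
27 40000
end

* collapse age into four labeled groups, stored in a new variable
recode age (18/29 = 1 "18-29") (30/44 = 2 "30-44") (45/64 = 3 "45-64") (65/85 = 4 "65-85"), gen(agegroup)

* group mean, repeated on every row of the group
egen meaninc = mean(inctot), by(agegroup)
list, sepby(agegroup)
```

The two people aged 22 and 27 fall into the same group, so both rows get the same meaninc. This is the typical difference from gen: egen functions such as mean() work across observations rather than within a single row.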
The regress command is followed by the dependent variable, and then the independent variables.
regress lnwage age
Stata can convert a categorical variable into dummy variables on the fly; put i. in front of the name of the categorical (independent) variable.
regress lnwage age i.marst
Interaction terms can be created by putting a hashtag between two variables. Two hashtags will provide main effects for each variable and an interaction. See Factor Variables for more information.
regress lnwage i.marst##gender
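The factor-variable notation can be tried on Stata's built-in auto dataset (an assumption, used here only so the example runs without the workshop files). Within an interaction, Stata treats a variable as categorical unless you mark it as continuous with the c. prefix.

```stata
sysuse auto, clear

regress price mpg i.rep78          // rep78 enters as a set of dummies
regress price i.foreign#c.mpg      // interaction term only
regress price i.foreign##c.mpg     // main effects plus the interaction
```

In the last model, the coefficient on the interaction shows how the slope on mpg differs between foreign and domestic cars.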
In many cases you will want to use a time series or panel data, which have slightly different commands; see Time Series.
Maybe you are not satisfied with the way your regression estimates look in the Stata output window. We can do better with outreg2.
First we must install it. outreg2 is a user-written add-on that is not automatically installed with Stata.
ssc install outreg2
Run a regression, and then run outreg2 as shown below.
regress lnwage age gender
outreg2 using capstone.doc, replace ctitle(Model 1)
You can add a second model to the same table easily, using the append option. Just make sure the Word doc (in this instance, capstone.doc) is closed.
reg lnwage age
outreg2 using capstone.doc, append ctitle(Model 2)
See these slides for more information.
Say you want to use a time series of unemployment by state. If you used a dataset like the one shown below, what problems would you encounter?
state | statefips | unemp2018m10 | unemp2018m09 | unemp2018m08 | unemp2018m07 |
---|---|---|---|---|---|
ALABAMA | 1 | 89754 | 90830 | 91211 | 90928 |
ALASKA | 2 | 22779 | 23339 | 24104 | 24919 |
ARIZONA | 4 | 158154 | 157377 | 156195 | 155473 |
ARKANSAS | 5 | 47441 | 47244 | 48131 | 49550 |
CALIFORNIA | 6 | 804349 | 802959 | 803076 | 807518 |
COLORADO | 8 | 98176 | 94532 | 90531 | 85492 |
CONNECTICUT | 9 | 79992 | 80072 | 80898 | 82750 |
Let's open this file. It's in Excel format, so the command for opening it is import excel, rather than use.
import excel "state_unemployment.xlsx", sheet("Sheet1") firstrow clear
Luckily, Stata is much, much better than Excel at dealing with datasets like this, through its reshape command. The first thing you should do is open the documentation like so:
help reshape
Using the documentation as a guide, we can try converting from wide to long
reshape long unemp, i( state statefips state_abbrev) j(month) string
This will create a new variable called month that will contain the month value. You can call it whatever you want, but the variable name goes inside the j() option. In the i() option, you list all identifying variables; these are things like ID numbers or, in this case, state names, that identify individual observations. The string option at the end allows the new variable to be a string rather than a numeric value, which is needed here because suffixes like 2018m10 are not numbers.
To truly use this data for a time series or panel data analysis, you would need to format month as a date. We won't get into the date formatting in depth, but you can consult the datetime documentation.
gen date = monthly(month, "YM")
format date %tm
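To watch the reshape happen without the Excel file, the sketch below types in the first two states and the first two months from the table above. It leaves out state_abbrev, which is not shown in that table.

```stata
clear
input str10 state statefips unemp2018m10 unemp2018m09
"ALABAMA" 1 89754 90830
"ALASKA"  2 22779 23339
end

list                                   // wide: one row per state
reshape long unemp, i(state statefips) j(month) string
list, sepby(state)                     // long: one row per state per month

gen date = monthly(month, "YM")        // turn the text suffix into a monthly date
format date %tm
list state month date unemp
```

After reshaping, the two wide columns become a single unemp variable, and the part of each old variable name that followed unemp ends up in month.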