We’ll be able to:
Along the way, we’ll learn:
R is a statistical package and mathematical programming language.
Unlike Stata, SAS, SPSS, Matlab and other statistical packages, it is totally open source. Students can easily install it on their own computers and keep using it after they graduate.
Unlike Excel, in R you can easily write scripts that make your analysis reproducible.
RStudio is an integrated development environment (IDE) for R. You still need to install R to use RStudio, but it is a more helpful, graphical environment for working with R. Windows for scripts, files, packages, and plots, in addition to the console, make it easier to keep track of what you are doing. It has many add-ins to make R more powerful.
RStudio is especially good for making reports and presentations with the knitr package. It can even make PDFs with LaTeX.
You can go ahead and type some arithmetic into the console and it will print the answer to the screen.
1+2
## [1] 3
However, what makes it so flexible is that everything in R can be an object. The objects can hold numeric values, text strings, datasets, models, anything, really. For example, let’s create an object to hold the value of 1 + 2. <- is the assignment operator that lets us assign a value to a variable.
oneplustwo <- 1 + 2
Nothing will print to the screen. But, in RStudio, what do you see in the Environment pane, under Global Environment?
The object’s name chosen here, oneplustwo, is arbitrary and, for a number of reasons that may soon become apparent, not very smart. You can name objects whatever you want, but try to find names that are meaningful, do not start with numbers or special characters, and do not contain spaces.
Every object has a class, a type, and a structure, which will affect what you are able to do with the object.
class(oneplustwo)
## [1] "numeric"
str(oneplustwo)
## num 3
typeof(oneplustwo)
## [1] "double"
The class of the object can be changed (or rather, coerced) on the fly using as.character(), as.numeric(), etc., which can be very useful.
str(as.character(oneplustwo))
## chr "3"
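Coercion does not always succeed, so it is worth checking the result. The lines below are a small sketch with made-up object names (none of them come from the workshop script): text that looks like a number converts back cleanly, while other text becomes NA.

```r
# Hypothetical objects, for illustration only
three_text <- as.character(1 + 2)   # the number 3 stored as text
class(three_text)                   # "character"

# Text that looks like a number converts back cleanly
three_number <- as.numeric(three_text)

# Text that does not look like a number becomes NA
# (R normally prints a warning; it is silenced here)
not_a_number <- suppressWarnings(as.numeric("three"))
is.na(not_a_number)                 # TRUE

# Logical values coerce to 1 and 0
as.numeric(c(TRUE, FALSE))
```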
How would you increase the value of oneplustwo by 1?
We can store multiple values in a vector using the function c().
oneANDtwo <- c(1,2)
class(oneANDtwo)
## [1] "numeric"
str(oneANDtwo)
## num [1:2] 1 2
typeof(oneANDtwo)
## [1] "double"
These vectors can be the building blocks of datasets, which in R parlance we call data frames. A simple way to put together a data frame is the data.frame() function.
data.frame(oneANDtwo, oneplustwo)
## oneANDtwo oneplustwo
## 1 1 3
## 2 2 3
What did R do with the value in oneplustwo when it made this data frame?
Will you find this data frame in the Global Environment?
Datasets can be objects, regression models can be objects, anything can be an object.
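As a quick sketch of a dataset stored as an object, the lines below assign a data frame to a name and then inspect it. The names small_df, id, and total are invented for this illustration.

```r
# Hypothetical names, for illustration only
small_df <- data.frame(id = c(1, 2), total = c(3, 3))

exists("small_df")   # TRUE: the data frame is now a named object
class(small_df)      # "data.frame"
dim(small_df)        # number of rows, then number of columns
small_df$total       # the dollar sign pulls out one column as a vector
```

Once it has a name, the data frame can be passed to other functions or removed with rm(), just like the simpler objects above.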
What do you think will happen with the following?
oneANDtwo - 1
Let’s dispose of these objects.
rm(oneplustwo)
rm(oneANDtwo)
Everything in R can be automated, which makes it really powerful. In RStudio, make a new .R script by going to File -> New File -> R Script, and paste in the code we typed in above. If you save this script, it can be run and re-run whenever you need it.
You have been provided with a script with all the commands we will use in this session, downloadable on the left of this page.
RStudio has some handy tools that make it easier to write a script, especially under the “Code” menu. Another helpful menu is Session -> Set Working Directory.
One of the strengths of R is the huge number of user-contributed packages that extend its functionality. It’s also one of its weaknesses, in that there are so many packages to keep track of, and many ways of doing many tasks.
Packages only need to be installed once. Below are a few of the packages we’ll use today. In RStudio, the Packages window is quite handy.
install.packages("pdfetch")
install.packages("xts")
install.packages("stargazer")
install.packages("zoo")
install.packages("ggplot2")
But packages need to be loaded in every session, and by session I mean every time you close and reopen RStudio. Typically the library() function is used at the top of a script.
library(pdfetch)
library(stargazer)
library(xts)
library(zoo)
library(ggplot2)
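If you re-run a script often, you may not want to reinstall packages every time. One possible pattern, sketched below, is to install only the packages that are missing. The helper names missing_packages() and install_if_missing() are made up for this example; requireNamespace() and install.packages() are standard R functions.

```r
# Packages used in this session
pkgs <- c("pdfetch", "xts", "stargazer", "zoo", "ggplot2")

# Hypothetical helper: return the names of packages that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install only what is missing
install_if_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) install.packages(to_install)
  invisible(to_install)
}

# Uncomment to run; this needs an internet connection
# install_if_missing(pkgs)
```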
Installing new packages can take time, so let’s have a little economics interlude.
R can do all of the things that a statistical package like Stata can do, plus more sophisticated modeling and machine learning techniques. Stata, however, will most likely continue to be used for typical regression analysis, because it is built for that and is so easy to use for those cases. R is most likely to outcompete a mathematical language like Matlab.
Where R really shines for economists is machine learning. Machine learning describes (ahem) Big Data (ahem) techniques, such as decision trees and LASSO, that are used for prediction.
For an example, see the scripts that accompany this article from the Quarterly Journal of Economics:
Kleinberg, Jon; Lakkaraju, Himabindu; Leskovec, Jure; Ludwig, Jens; Mullainathan, Sendhil, 2017, “Replication Data for: ‘Human Decisions and Machine Predictions’”, https://doi.org/10.7910/DVN/VWDGHT, Harvard Dataverse, V1
R has some built-in datasets that can be used for demonstration purposes (run data() in the console to see the list).
data("mtcars")
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
This is a data frame: the R object most similar to a spreadsheet or other kind of dataset, and the kind of object you’d usually use for data analysis. It’s easy to import simple tables, such as CSV files, using read.csv() or read.table(). However, in RStudio it is also simple to use the Import Dataset window, which you can find under Environment -> Import Dataset, or File -> Import Dataset.
realGDPgrowth <- read.csv("K:/My Drive/classes/ECON 456/R/realGDPgrowth.csv")
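The path above points to a file on the instructor's drive, so it will not work unchanged on another computer. To try read.csv() without any file at all, here is a minimal sketch that reads made-up CSV content from a string. The object names and the numbers are invented for illustration.

```r
# Made-up CSV content, standing in for a file on disk
csv_text <- "year,growth
2015,1.0
2016,2.0
2017,3.0"

# read.csv() can read from a string through the text argument
toy_growth <- read.csv(text = csv_text)

str(toy_growth)          # check that each column has the type you expect
mean(toy_growth$growth)  # columns are ordinary vectors
```

With a real file, replace the text argument with the path to your own CSV file, in quotation marks.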
The foreign and haven packages can help you get datasets in many, many formats into R. I’ve found them especially useful for opening datasets saved in the format of software for which we do not have a license. A lifesaver!
What is really fun is using special packages for pulling in data automatically. These packages take the guesswork out of using an API to connect to a data source on the web.
We’ll look at pdfetch. Type ?pdfetch into the console to access help. FRED is just one of the sources from which it can fetch data.
pdfetch_FRED("THREEFY10")
Assuming I intend to do something with this data, what did I do wrong here?
We can get multiple series from FRED at a time by concatenating them with c().
treasury <- pdfetch_FRED(c("THREEFY10", "THREEFYTP10"))
plot(treasury)
It’s not a great plot, but could it have been any easier to make?
Look through the list of data series in FRED and find your own series to import. When you go to the landing page of the series, the identifier is next to the title in parentheses. These series identifiers are case sensitive and should be enclosed in quotation marks within the pdfetch_FRED() function. Give the resulting object a meaningful name so we can refer to it later.
There are other handy packages that can import data directly into R, including:
Additional data packages are listed in the following:
You might be thinking, well this made it easy to get data in, but can I ever get it out? Yes, you can!
write.csv(as.data.frame(treasury),"treasury.csv")
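To check that an export worked, you can read the file straight back in. The sketch below uses a small made-up data frame with dates as row names, which is assumed here to resemble what as.data.frame() returns for a time series. All names and values are invented.

```r
# Made-up data with dates as row names
toy <- data.frame(rate = c(1.5, 1.6, 1.7),
                  row.names = c("2020-01-01", "2020-01-02", "2020-01-03"))

# Write to a temporary file, so nothing in your working directory is overwritten
out_file <- tempfile(fileext = ".csv")
write.csv(toy, out_file)   # row names are written as the first column

# Tell read.csv() that column 1 holds the row names
back <- read.csv(out_file, row.names = 1)
back
```

When you write to a plain file name such as "treasury.csv", the file lands in the current working directory, which getwd() will report.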
There are a number of ways to store and work with time series data in R. Both ts objects and xts objects are used to store time series. Unlike the basic data frame objects mentioned above, these are specifically indexed by time. We’ll take a look at xts objects because that is what pdfetch will fetch for you.
Let’s find out a little about our treasury object. class(), dim(), and names() are functions you can use with other kinds of objects; start() and end() are meant for time series objects; and first(), last(), and periodicity() come from the xts package.
class(treasury)
## [1] "xts" "zoo"
dim(treasury)
## [1] 7358 2
names(treasury)
## [1] "THREEFY10" "THREEFYTP10"
start(treasury)
## [1] "1990-07-18"
end(treasury)
## [1] "2018-09-28"
first(treasury)
## THREEFY10 THREEFYTP10
## 1990-07-18 8.4931 2.2688
last(treasury)
## THREEFY10 THREEFYTP10
## 2018-09-28 3.2447 -0.1119
periodicity(treasury)
## Daily periodicity from 1990-07-18 to 2018-09-28
The dollar sign and square bracket are important for selecting certain parts of the xts object.
head(treasury$THREEFY10)
treasury["2000"]
treasury["2000-07"]
treasury_subset <- treasury["2008/2011"]
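The same date-string selection can be tried offline on a small made-up series. This is only a sketch: the object toy and its columns a and b are invented, and the xts package is assumed to be installed.

```r
library(xts)

# Made-up daily series for the first quarter of 2020 (91 days)
dates <- seq(as.Date("2020-01-01"), as.Date("2020-03-31"), by = "day")
toy <- xts(cbind(a = seq_along(dates), b = rev(seq_along(dates))),
           order.by = dates)

feb      <- toy["2020-02"]                 # every row in February 2020
feb_a    <- toy$a["2020-02"]               # one column, one month
late_feb <- toy["2020-02-15/2020-03-01"]   # a range of dates, both ends included

nrow(feb)       # 29, because 2020 was a leap year
nrow(late_feb)  # 16
```

A slash inside the quotation marks separates the start and the end of a range, and either side can be a year, a month, or a full date.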
Try to extract the values of THREEFY10 for October 2008.
You can change the periodicity of your series. Try the code below, then use head() on one to see the first six observations.
THREEFY10.monthly <- to.monthly(treasury$THREEFY10)
THREEFYTP10.monthly <- to.monthly(treasury$THREEFYTP10)
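By default, to.monthly() summarises each month with open, high, low, and close values rather than an average. If a monthly mean is what you need, apply.monthly() from xts is one alternative. The sketch below uses a made-up series named toy and assumes the xts package is installed.

```r
library(xts)

# Made-up daily series: the values 1, 2, ..., 91 over the first quarter of 2020
dates <- seq(as.Date("2020-01-01"), as.Date("2020-03-31"), by = "day")
toy <- xts(seq_along(dates), order.by = dates)

toy_ohlc <- to.monthly(toy)            # one row per month: open, high, low, close
toy_mean <- apply.monthly(toy, mean)   # one row per month: the average value

dim(toy_ohlc)   # 3 rows, 4 columns
toy_mean
```

Any function that reduces a set of values to a single number, such as median or max, can take the place of mean.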
We mentioned that xts objects are just one way of storing time series; ts is another. If necessary it is possible to convert between them.
gdp.FRED <- pdfetch_FRED("A191RO1Q156NBEA") # Real GDP growth: Percent Change from Quarter One Year Ago, Seasonally Adjusted
gdp <- ts(gdp.FRED, start = start(to.quarterly(gdp.FRED)), end = end(to.quarterly(gdp.FRED)), frequency = 4)
plot(gdp)
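To see how a ts object keeps track of time without downloading anything, here is a minimal sketch with made-up quarterly values. The name toy_ts and the numbers are invented; ts(), start(), end(), frequency(), and window() are all part of base R.

```r
# Made-up quarterly values starting in the first quarter of 2015
toy_ts <- ts(c(2.1, 2.4, 1.9, 2.2, 2.6, 2.8, 2.5, 2.3),
             start = c(2015, 1), frequency = 4)

start(toy_ts)       # 2015 1
end(toy_ts)         # 2016 4
frequency(toy_ts)   # 4 observations per year

# window() keeps the observations between two dates
toy_window <- window(toy_ts, start = c(2015, 3), end = c(2016, 2))
toy_window
```

The start argument is given as a year and a period within that year, so c(2015, 1) with frequency = 4 means the first quarter of 2015.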
The plot() function works passably well, but I’m attached to ggplot2-style graphics. ggplot2 works on data frames, which xts objects are not. However, autoplot.zoo() will convert the xts object to a data frame and feed it into ggplot2 for you.
p <- autoplot.zoo(treasury, facets=NULL)
p
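If you would rather do the conversion yourself, for example to control the plot more closely, the zoo package also provides fortify.zoo(). The sketch below shows roughly the kind of reshaping involved, using a made-up two-column series; it is not a description of what autoplot.zoo() does internally. It assumes zoo and ggplot2 are installed.

```r
library(zoo)
library(ggplot2)

# Made-up series with two columns
dates <- seq(as.Date("2020-01-01"), by = "day", length.out = 10)
toy <- zoo(cbind(a = 1:10, b = 10:1), order.by = dates)

# Long format: one row per date and series
toy_long <- fortify.zoo(toy, melt = TRUE)
names(toy_long)   # "Index" "Series" "Value"

# An ordinary ggplot2 call on the resulting data frame
p_toy <- ggplot(toy_long, aes(x = Index, y = Value, colour = Series)) +
  geom_line()
p_toy
```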
These plots can be customized in many ways.
p1 <- p +
  labs(title = "Ten Year Treasury Yield and Term Premium, 1990-2018",
       caption = "Sources: Federal Reserve Bank of New York, Federal Reserve",
       x = "Year") +
  scale_colour_grey() +
  theme(legend.title = element_blank(),
        legend.justification = c(1, 1),
        legend.position = c(.95, .95))
p1
p1 + theme_bw()
The last line creates the plot with the black and white theme by adding theme_bw(). You can find your options for themes at ggplot2: Complete themes. Try the line p1 + theme_bw() but with a different theme instead of theme_bw().
For any kind of regression, you first create a model object, then get summary information out of it.
model <- lm(mpg ~ wt + cyl, data = mtcars)
str(model)
## List of 12
## $ coefficients : Named num [1:3] 39.69 -3.19 -1.51
## ..- attr(*, "names")= chr [1:3] "(Intercept)" "wt" "cyl"
## $ residuals : Named num [1:32] -1.279 -0.465 -3.452 1.019 2.053 ...
## ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## $ effects : Named num [1:32] -113.65 -29.12 -9.34 1.33 1.6 ...
## ..- attr(*, "names")= chr [1:32] "(Intercept)" "wt" "cyl" "" ...
## $ rank : int 3
## $ fitted.values: Named num [1:32] 22.3 21.5 26.3 20.4 16.6 ...
## ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## $ assign : int [1:3] 0 1 2
## $ qr :List of 5
## ..$ qr : num [1:32, 1:3] -5.657 0.177 0.177 0.177 0.177 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## .. .. ..$ : chr [1:3] "(Intercept)" "wt" "cyl"
## .. ..- attr(*, "assign")= int [1:3] 0 1 2
## ..$ qraux: num [1:3] 1.18 1.05 1.17
## ..$ pivot: int [1:3] 1 2 3
## ..$ tol : num 1e-07
## ..$ rank : int 3
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 29
## $ xlevels : Named list()
## $ call : language lm(formula = mpg ~ wt + cyl, data = mtcars)
## $ terms :Classes 'terms', 'formula' language mpg ~ wt + cyl
## .. ..- attr(*, "variables")= language list(mpg, wt, cyl)
## .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:3] "mpg" "wt" "cyl"
## .. .. .. ..$ : chr [1:2] "wt" "cyl"
## .. ..- attr(*, "term.labels")= chr [1:2] "wt" "cyl"
## .. ..- attr(*, "order")= int [1:2] 1 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(mpg, wt, cyl)
## .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:3] "mpg" "wt" "cyl"
## $ model :'data.frame': 32 obs. of 3 variables:
## ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
## ..$ cyl: num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language mpg ~ wt + cyl
## .. .. ..- attr(*, "variables")= language list(mpg, wt, cyl)
## .. .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:3] "mpg" "wt" "cyl"
## .. .. .. .. ..$ : chr [1:2] "wt" "cyl"
## .. .. ..- attr(*, "term.labels")= chr [1:2] "wt" "cyl"
## .. .. ..- attr(*, "order")= int [1:2] 1 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(mpg, wt, cyl)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:3] "mpg" "wt" "cyl"
## - attr(*, "class")= chr "lm"
class(model)
## [1] "lm"
summary(model)
##
## Call:
## lm(formula = mpg ~ wt + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2893 -1.5512 -0.4684 1.5743 6.1004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.6863 1.7150 23.141 < 2e-16 ***
## wt -3.1910 0.7569 -4.216 0.000222 ***
## cyl -1.5078 0.4147 -3.636 0.001064 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.568 on 29 degrees of freedom
## Multiple R-squared: 0.8302, Adjusted R-squared: 0.8185
## F-statistic: 70.91 on 2 and 29 DF, p-value: 6.809e-12
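Because the model is an object, individual pieces can be pulled out of it instead of being read off the printed summary. The lines below are a short sketch using standard extractor functions from base R. The car described in new_car is hypothetical.

```r
model <- lm(mpg ~ wt + cyl, data = mtcars)

coef(model)                 # named vector of coefficient estimates
coef(model)["wt"]           # just the slope on weight
summary(model)$r.squared    # R-squared as a plain number
head(residuals(model))      # first few residuals

# Prediction for a hypothetical car: wt = 3 (wt is in 1000s of lbs), 6 cylinders
new_car <- data.frame(wt = 3, cyl = 6)
predicted_mpg <- predict(model, newdata = new_car)
predicted_mpg
```

Extracting numbers this way is what makes it possible to reuse them later in a script, for instance in a table or a plot.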
There are a variety of packages that can turn your model into a nice regression table.
stargazer(model, type="html")
 | Dependent variable: |
 | mpg |
wt | -3.191*** |
 | (0.757) |
cyl | -1.508*** |
 | (0.415) |
Constant | 39.686*** |
 | (1.715) |
Observations | 32 |
R² | 0.830 |
Adjusted R² | 0.819 |
Residual Std. Error | 2.568 (df = 29) |
F Statistic | 70.908*** (df = 2; 29) |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
Making reports with R Markdown is pretty easy if you are using RStudio (and I assume you are).
It starts with an .Rmd file. In RStudio, go to File -> New File -> R Markdown…
Little chunks of R code are inserted after ```{r}. Go ahead and paste some of the code we have been working on into a chunk.
Markdown is similar to HTML in that it structures documents, but it is much easier. Take a look at the RMarkdown Cheatsheet (PDF) for pointers.
When you have something in your .Rmd file, you are ready to knit! This will create an HTML document when you click on the ball of yarn.
You can also make PDFs and Word documents, but it’s a little touchier.
Basic R fluency
In-depth with R programming
Data manipulation, data cleaning
Reports, Markdown, Latex, and all that
Time series, econometrics, etc.
Geospatial analysis
Data visualization
Making apps