R: Tests

Resources for learning and using the R programming language.

Tests for analyzing a single categorical variable

Binomial test
Chi square (X²) goodness of fit test

Tests for analyzing relationship between two categorical variables

Chi square (X²) contingency test
Fisher's exact test

Tests for analyzing a single numerical variable

One-sample t-test
Sign test for median

Tests with a numerical response variable and explanatory categorical variable(s) (Parametric)

Two-sample t-test
Paired t-test
One-way ANOVA and Tukey-Kramer test
Welch's t-test
Multiway ANOVA
General linear model

Tests with a numerical response variable and an explanatory categorical variable (Non-parametric)

Mann-Whitney U-test
Kruskal-Wallis Test

Tests for analyzing the relationship between numerical variables

Simple linear regression
Linear correlation
Spearman's rank correlation

Tests for analyzing a single categorical variable

Binomial test

binom.test()

This test requires a number of successes, a number of trials, and an expected proportion of successes.

Data set-up: Known values

The simplest way to run this test does not require a dataset at all, just these three numbers. In this example, there were 110 trials with 64 successes, and the expected proportion is 50%.

binom.test(x = 64, n = 110, p = 0.5)

Data set-up: Disaggregated (Raw data)

More commonly, you may have one dichotomous variable representing success or failure, as well as an expected proportion.

In this example, you calculate the number of successes from the vector data$Ran, in which 2 represents a success.

data <- read.delim("http://www.statsci.org/data/oz/ms212.txt")
binom.test(sum(data$Ran == 2), length(data$Ran), p = 0.5)

data$Ran == 2 creates a vector with TRUE or FALSE for each observation in data$Ran, depending on whether the observation is 2 (success) or not. sum() of that vector gives the number of successes, because TRUE is counted as 1. length(data$Ran) is the number of trials because it outputs the number of observations in data$Ran.
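The counting step can be sketched on a small made-up vector. The vector ran below is a hypothetical stand-in for data$Ran and is not taken from the dataset.

```r
# Hypothetical stand-in for data$Ran: 1 = failure, 2 = success
ran <- c(2, 1, 2, 2, 1)

is_success <- ran == 2                    # logical vector, one entry per trial
number_of_successes <- sum(is_success)    # TRUE counts as 1, FALSE as 0
number_of_trials <- length(ran)           # one observation per trial

result <- binom.test(x = number_of_successes, n = number_of_trials, p = 0.5)
print(result$estimate)                    # observed proportion of successes
```

If the vector contains missing values, sum() returns NA unless na.rm = TRUE is set, and length() still counts the missing entries, so it may be worth removing them first.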

Data set-up: Tabulated (Contingency table)

The data may already be tabulated as a frequency table rather than as a column with one row for each trial.

frequency_table <- data.frame(success = c(1,2), frequency = c(46,64))

number_of_successes is the second value in the frequency column, and number_of_trials is the sum: successes plus failures.

number_of_successes <- frequency_table$frequency[2]
number_of_trials <- sum(frequency_table$frequency)
binom.test(x = number_of_successes, n = number_of_trials, p = 0.5)

See example output

	Exact binomial test

data:  64 and 110
number of successes = 64, number of trials = 110, p-value = 0.1046
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.4839459 0.6751724
sample estimates:
probability of success 
             0.5818182

Chi square (X²) goodness of fit test

chisq.test()

Requires one categorical variable and the expected frequency (or probability) of each category.

Data set-up: Disaggregated (Raw data)

data <- read.delim("http://www.statsci.org/data/oz/ms212.txt")

table(data$Ran) creates a frequency table. If no probability is provided, it is assumed that each category is equally likely.

chisq.test(table(data$Ran))

To indicate different expected probabilities, create a vector of probabilities inside c(), in the order the categories appear in the table.

chisq.test(table(data$Ran), p = c(0.75, 0.25))

Data set-up: Tabulated (Contingency table)

If the data are available only as a frequency table, and not as a column with a value for each observation as shown above, you can simply use the vector that represents the frequency of each category.

frequency_table <- data.frame(category = c(1,2), frequency = c(46,64))
chisq.test(frequency_table$frequency)

See example output

	Chi-squared test for given probabilities

data:  table(data$Ran)
X-squared = 2.9455, df = 1, p-value = 0.08612
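As a rough check on the output above, the statistic can be worked out by hand from the two frequencies (46 and 64) used in the tabulated example. This is only a sketch of the arithmetic under the default assumption of equal probabilities; the variable names are illustrative.

```r
observed <- c(46, 64)                      # frequencies from the tabulated example
expected <- sum(observed) * c(0.5, 0.5)    # equal probabilities, the default

# Sum of (observed - expected)^2 / expected over the categories
x_squared <- sum((observed - expected)^2 / expected)
p_value <- pchisq(x_squared, df = length(observed) - 1, lower.tail = FALSE)

result <- chisq.test(observed)
print(c(by_hand = x_squared, chisq_test = unname(result$statistic)))
print(c(by_hand = p_value, chisq_test = result$p.value))
```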

Tests for analyzing relationship between two categorical variables

Requires two categorical variables, with two or more possible values.

Chi square (X²) contingency test

chisq.test()

Data set-up: Disaggregated (Raw data)

data <- read.csv("http://users.stat.ufl.edu/~winner/data/marij1_indiv.csv")

chisq.test() expects a matrix where the rows and columns are possible values of the two variables, and the cells are the number of observations with each combination of values. This can be created in R using the table() function.

# table(data$marijUse, data$party) creates a matrix like this:

#      1   2   3
#  1  40 213 118
#  2   3  55  40
#  3   1  44  54
#  4   0  17  32

chisq.test(table(data$marijUse, data$party))

Data set-up: Tabulated (Contingency table)

data <- read.csv("http://users.stat.ufl.edu/~winner/data/marij1.csv")

Data in this structure will require reshaping to create a suitable matrix for chisq.test(). We can accomplish this with the reshape2 package.

library(reshape2)
data_as_matrix <- acast(data, marijUse ~ party)
chisq.test(data_as_matrix)

See example output

	Pearson's Chi-squared test

data:  data$marijUse and data$party
X-squared = 43.38, df = 6, p-value = 9.81e-08
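When the cell counts are already known, another option is to type them into a matrix directly. The sketch below copies the counts from the table(data$marijUse, data$party) comment shown earlier; the object name counts and the dimnames labels are illustrative choices.

```r
# Counts copied from the commented table above (rows: marijUse, columns: party)
counts <- matrix(c(40, 213, 118,
                    3,  55,  40,
                    1,  44,  54,
                    0,  17,  32),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(marijUse = as.character(1:4),
                                 party = as.character(1:3)))

result <- chisq.test(counts)
print(result$parameter)            # degrees of freedom: (4 - 1) * (3 - 1)
print(round(result$expected, 1))   # expected counts under independence
```

One of the expected counts here is below 5, so R will likely warn that the chi-squared approximation may be incorrect. Inspecting result$expected is a quick way to see which cells are sparse.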

Fisher's exact test

fisher.test()

Requires two categorical variables with two possible values each.

Data set-up: Disaggregated (Raw data)

data <- read.delim("http://www.statsci.org/data/oz/ms212.txt")
fisher.test(data$Gender, data$Smokes)

fisher.test() can accept either a matrix or two vectors. In this example, two vectors are taken from a data frame.

Data set-up: Tabulated (Contingency table)

data <- data.frame(gender = c(1, 2, 1, 2),
                   smokes = c(1, 1, 2, 2),
                   frequency = c(8, 3, 51, 48))

With data in this tabulated form, it is easier to create a matrix using the reshape2 package.

library(reshape2)
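# value.var names the column holding the counts; without it acast() guesses and prints a message
data_as_matrix <- acast(data, gender ~ smokes, value.var = "frequency")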

# the matrix looks like this:
#   1  2
# 1 8 51
# 2 3 48

fisher.test(data_as_matrix)

See example output

	Fisher's Exact Test for Count Data

data:  data
p-value = 0.2167
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  0.5559784 15.4198906
sample estimates:
odds ratio 
  2.490131
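To illustrate that the vector form and the matrix form lead to the same test, the sketch below expands the four cell counts from the tabulated example (8, 3, 51, 48) back into one row per respondent. The expanded vectors are reconstructed for illustration only and are not read from the original file.

```r
# Cell counts from the tabulated example above
cell_counts <- c(8, 3, 51, 48)

# Expand the counts into one entry per respondent
gender <- rep(c(1, 2, 1, 2), times = cell_counts)
smokes <- rep(c(1, 1, 2, 2), times = cell_counts)

from_vectors <- fisher.test(gender, smokes)
from_matrix <- fisher.test(matrix(cell_counts, nrow = 2))  # filled column by column

print(c(vectors = from_vectors$p.value, matrix = from_matrix$p.value))
```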

Tests for analyzing a single numerical variable

One-sample t-test

t.test()

Requires one normally distributed numerical variable and a hypothesized mean. See instructions for checking for normality.

data <- read.table("http://www.statsci.org/data/general/balaconc.txt", 
                   stringsAsFactors = FALSE, header = TRUE)
t.test(data$SideSway, mu = 11)

In this case, the hypothesized mean is 11.

See example output

One Sample t-test

data:  data$SideSway
t = 3.8107, df = 16, p-value = 0.001538
alternative hypothesis: true mean is not equal to 11
95 percent confidence interval:
 14.4974 23.2673
sample estimates:
mean of x 
 18.88235

Sign test for median

SignTest()

Requires one numerical variable and a hypothesized median. The numerical variable does not need to be normally distributed.

data <- read.table("http://www.statsci.org/data/general/balaconc.txt", 
                   stringsAsFactors = FALSE, header = TRUE)
library(DescTools)
SignTest(data$SideSway, mu = 22)

See example output

	One-sample Sign-Test

data:  data$SideSway
S = 3, number of differences = 16, p-value = 0.02127
alternative hypothesis: true median is not equal to 22
95.1 percent confidence interval:
 14 21
sample estimates:
median of the differences 
                       17
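The sign test is closely related to the binomial test: it asks whether the observations that differ from the hypothesized median fall above it about half the time. As a sketch, and assuming S in the output above is the count of observations above 22, the p-value can be approximated with base R alone.

```r
# Values read from the SignTest() output above
s <- 3                        # assumed: observations above the hypothesized median
number_of_differences <- 16   # observations not equal to the hypothesized median

result <- binom.test(x = s, n = number_of_differences, p = 0.5)
print(round(result$p.value, 5))
```

Because the null proportion is 0.5, the two-sided p-value is the same whether S counts the observations above or below the hypothesized median.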

Tests with a numerical response variable and explanatory categorical variable(s) (Parametric)

Two-sample t-test

t.test()

Requires one normally distributed, numerical variable and one grouping variable with two values. The grouping variable may be numeric-type or string-type.

data <- read.table("http://www.statsci.org/data/general/balaconc.txt", 
                   stringsAsFactors = FALSE, header = TRUE)
t.test(SideSway ~ Age, data = data, var.equal = TRUE)

See example output

	Two Sample t-test

data:  SideSway by Age
t = 1.8349, df = 15, p-value = 0.08643
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.146965 15.341409
sample estimates:
mean in group Elderly   mean in group Young 
             22.22222              15.12500

Paired t-test

t.test()

Requires two numerical variables that are paired. Paired samples are matched in some way; often they represent the same object or respondent tested at different points in time.

data <- read.table("http://www.statsci.org/data/oz/stroke.txt",
                   header = TRUE)
t.test(data$Bart1, data$Bart8, paired = TRUE)

See example output

	Paired t-test

data:  data$Bart1 and data$Bart8
t = -7.4941, df = 23, p-value = 1.291e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -41.20539 -23.37795
sample estimates:
mean of the differences 
              -32.29167
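A paired t-test is equivalent to a one-sample t-test on the pairwise differences with a hypothesized mean of 0. The sketch below shows this on made-up measurements; the vectors before and after are hypothetical and not taken from the stroke dataset.

```r
# Hypothetical paired measurements
before <- c(10, 12, 9, 14, 11)
after  <- c(13, 15, 9, 18, 12)

paired_result <- t.test(before, after, paired = TRUE)
difference_result <- t.test(before - after, mu = 0)   # one-sample test on differences

print(c(paired = unname(paired_result$statistic),
        differences = unname(difference_result$statistic)))
```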

One-way ANOVA

aov()

Requires one normally distributed, numerical response variable and one categorical grouping variable with two or more values.

data <- read.table("http://www.statsci.org/data/general/wolfrive.txt", 
                   stringsAsFactors = FALSE, header = TRUE)
summary(aov(data$HCB ~ data$Depth))

See example output

            Df Sum Sq Mean Sq F value Pr(>F)  
data$Depth   2  5.357  2.6783   3.032 0.0649 .
Residuals   27 23.848  0.8833                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
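The contents list pairs one-way ANOVA with the Tukey-Kramer test. One common way to run pairwise comparisons after aov() in base R is TukeyHSD(). The sketch below uses made-up data with a grouping factor; the data frame df and its columns are hypothetical, and this is not necessarily the procedure the guide's authors had in mind.

```r
# Hypothetical measurements for three groups
df <- data.frame(
  y = c(1.0, 1.2, 0.9, 1.1,
        2.0, 2.1, 1.9, 2.2,
        3.0, 3.1, 2.9, 3.2),
  group = factor(rep(c("a", "b", "c"), each = 4))
)

fit <- aov(y ~ group, data = df)
tukey <- TukeyHSD(fit)      # pairwise comparisons with adjusted p-values
print(tukey$group)          # columns: diff, lwr, upr, p adj
```

Each row of the result compares two groups, so three groups give three comparisons.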

Welch's t-test

t.test()

Requires one normally distributed, numerical variable and one grouping variable with two values. The grouping variable may be numeric-type or string-type.

data <- read.table("http://www.statsci.org/data/general/balaconc.txt", 
                   stringsAsFactors = FALSE, header = TRUE)
t.test(data$SideSway ~ data$Age)

See example output

	Welch Two Sample t-test

data:  data$SideSway by data$Age
t = 1.9228, df = 10.5, p-value = 0.08204
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.07436 15.26880
sample estimates:
mean in group Elderly   mean in group Young 
             22.22222              15.12500
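The only difference between this call and the two-sample t-test shown earlier is the var.equal argument: t.test() runs Welch's test unless var.equal = TRUE is given. The sketch below compares the two on made-up samples with visibly different spreads; both vectors are hypothetical.

```r
# Hypothetical samples with unequal spread
group_a <- c(10, 11, 9, 10, 10)
group_b <- c(5, 15, 20, 2, 18)

pooled <- t.test(group_a, group_b, var.equal = TRUE)  # two-sample t-test
welch  <- t.test(group_a, group_b)                    # Welch's t-test (the default)

print(c(pooled_df = unname(pooled$parameter),
        welch_df = unname(welch$parameter)))
```

The pooled test uses n1 + n2 - 2 degrees of freedom, while Welch's test adjusts the degrees of freedom downward when the group variances differ.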

Multiway ANOVA

aov()

Requires one normally distributed numerical response variable and two categorical grouping variables with two or more values.

data <- read.table("http://www.statsci.org/data/general/fullmoon.txt", 
                   stringsAsFactors = FALSE, header = TRUE)
summary(aov(data$Admission ~ data$Month + data$Moon))

See example output

            Df Sum Sq Mean Sq F value   Pr(>F)    
data$Month  11  455.6   41.42   7.129 5.08e-05 ***
data$Moon    2   41.5   20.76   3.573   0.0453 *  
Residuals   22  127.8    5.81                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Tests with a numerical response variable and an explanatory categorical variable (Non-parametric)

Mann-Whitney U-test

wilcox.test()

Requires one numerical or ordinal variable, and one grouping variable with two values.

data <- read.table("http://www.statsci.org/data/general/balaconc.txt", 
                   stringsAsFactors = FALSE, header = TRUE)
wilcox.test(data$FBSway ~ data$Age)

See example output

	Wilcoxon rank sum test with continuity correction

data:  data$FBSway by data$Age
W = 59, p-value = 0.02988
alternative hypothesis: true location shift is not equal to 0

Kruskal-Wallis Test

kruskal.test()

Requires one numerical or ordinal variable, and one grouping variable with two or more values.

data <- read.table("http://www.statsci.org/data/general/balaconc.txt", 
                   stringsAsFactors = FALSE, header = TRUE)
kruskal.test(data$FBSway ~ data$Age)

See example output

	Kruskal-Wallis rank sum test

data:  data$FBSway by data$Age
Kruskal-Wallis chi-squared = 4.9283, df = 1, p-value = 0.02642

Tests for analyzing the relationship between numerical variables

Simple linear regression

lm()

Requires two numerical variables.

data <- read.table("http://www.statsci.org/data/general/kittiwak.txt", 
                   stringsAsFactors = FALSE, header = TRUE)
linear_model <- lm(data$Population ~ data$Area)
summary(linear_model)

See example output

Call:
lm(formula = data$Population ~ data$Area)

Residuals:
    Min      1Q  Median      3Q     Max 
-4317.3  -858.2    33.6   670.8  7153.0 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -734.8063   786.9873  -0.934    0.362    
data$Area      3.3021     0.5832   5.662 1.53e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2579 on 20 degrees of freedom
Multiple R-squared:  0.6158,	Adjusted R-squared:  0.5966 
F-statistic: 32.06 on 1 and 20 DF,  p-value: 1.532e-05
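An alternative way to write the model is to pass column names in the formula together with a data argument, which also makes it possible to predict for new values. The sketch below uses a small made-up data frame; df, its columns, and the new area of 6 are all hypothetical and unrelated to the kittiwake data.

```r
# Hypothetical data
df <- data.frame(area = c(1, 2, 3, 4, 5),
                 population = c(3.1, 4.9, 7.2, 8.8, 11.0))

model <- lm(population ~ area, data = df)   # data = avoids repeating df$
estimates <- coef(model)                    # named vector: (Intercept), area

# Predicted response for a new, hypothetical area of 6
new_area <- data.frame(area = 6)
predicted <- predict(model, newdata = new_area)
print(predicted)
```

With the data$Population ~ data$Area form, predict() cannot match columns in newdata to the model terms, so the data argument form is usually the more convenient one when predictions are needed.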

Linear correlation

cor.test(x, y, method=c("pearson"))

Requires two numerical variables. See setup for Simple linear regression above.

data <- read.table("http://www.statsci.org/data/general/kittiwak.txt", 
                   stringsAsFactors = FALSE, header = TRUE)
cor.test(data$Population, data$Area, method=c("pearson"))

See example output

	Pearson's product-moment correlation

data:  data$Population and data$Area
t = 5.6619, df = 20, p-value = 1.532e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5426711 0.9064457
sample estimates:
      cor 
0.7847361
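For a regression with a single predictor, the squared Pearson correlation equals the Multiple R-squared reported by summary(lm()). As a quick check, squaring the correlation above should give the 0.6158 shown in the regression output. The second half of the sketch repeats the comparison on made-up vectors x and y.

```r
r <- 0.7847361                 # cor from the cor.test() output above
r_squared <- r^2
print(round(r_squared, 4))     # compare with Multiple R-squared in the lm() output

# The same relationship on hypothetical data
x <- c(1, 2, 3, 4, 5)
y <- c(3.1, 4.9, 7.2, 8.8, 11.0)
from_cor <- cor(x, y)^2
from_lm <- summary(lm(y ~ x))$r.squared
print(c(from_cor = from_cor, from_lm = from_lm))
```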

Spearman's rank correlation

cor.test(x, y, method=c("spearman"))

Requires two numerical variables. See setup for Simple linear regression above.

data <- read.table("http://www.statsci.org/data/general/kittiwak.txt", 
                   stringsAsFactors = FALSE, header = TRUE)
cor.test(data$Population, data$Area, method=c("spearman"))

See example output

	Spearman's rank correlation rho

data:  data$Population and data$Area
S = 815.73, p-value = 0.009578
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.5393957
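Spearman's rho is the Pearson correlation computed on the ranks of the two variables. A minimal sketch on made-up vectors (x and y are hypothetical):

```r
x <- c(10, 20, 30, 40, 50)
y <- c(1, 4, 3, 10, 8)

spearman <- cor(x, y, method = "spearman")
pearson_on_ranks <- cor(rank(x), rank(y), method = "pearson")

print(c(spearman = spearman, pearson_on_ranks = pearson_on_ranks))
```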