10  CORRELATION AND REGRESSION

Author

Jieun Park

r-function Description
data('dataset_name') Load a R built-in dataset named by ‘dataset_name’
names(x) Retrieve or set the names of elements in x
attach(df) Add a data frame df to the search path, which allows you to access the variables within the data frame df directly by their names instead of using a normal way such as df$var.
cor(x,y) Find the correlation of two vectors x and y
qplot(x,y,data) Create a quick plot data (x,y) in the dataframe data
geom_text(aes(x,y, label)) Add a label at the coordinate (x,y) in the current plot.
lm(y~x, data) Perform the linear regression of y~x, where y,x are column names in the dataframe data.
summary(lm_model) Summarize the linear model lm_model obtained by the R-function lm.

10.1 Correlation

We check if a linear correlation exists between two variables using cor() function.

Code
# We can calculate the correlation coefficient between x and y with the 
# following code.
# cor(x, y)
Code
library(tidyverse)
library(patchwork)
data("mtcars")
names(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"
Code
attach(mtcars)
# positive correlation
qplot(wt, disp, data = mtcars) +
  geom_text(aes(x=2, y=400, label="r = 0.888"))

Code
cor(wt, disp)
[1] 0.8879799
Code
# negative correlation
qplot(mpg, wt, data = mtcars)  +
  geom_text(aes(x=30, y=5, label="r = - 0.868"))

Code
cor(mpg, wt)
[1] -0.8676594
Code
# no correlation
qplot(drat, qsec, data = mtcars)  +
  geom_text(aes(x=4.5, y=22, label="r = 0.091"))

Code
cor(drat, qsec)
[1] 0.09120476
  • wt and disp have a positive correlation with r =0.888.
  • wt and disp have a negative correlation with r = -0.868.
  • wt and disp does not have a significant correlation with r = -0.175.

10.2 Linear regression

Assume we have a data set data with x and y variables and we model their relationship by linear regression. We can find the slope and the intercept of the estimated regression line using the following code.

Code
# res <- lm(y ~ x, data)
# summary(res)

For example, we can find the regression line equation between disp (x-variable, predictor) and wt (y-variable, response) as below.

Code
data("mtcars")
res <- lm(wt ~ disp, mtcars)
summary(res)

Call:
lm(formula = wt ~ disp, data = mtcars)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.89044 -0.29775 -0.00684  0.33428  0.66525 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.5998146  0.1729964   9.248 2.74e-10 ***
disp        0.0070103  0.0006629  10.576 1.22e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4574 on 30 degrees of freedom
Multiple R-squared:  0.7885,    Adjusted R-squared:  0.7815 
F-statistic: 111.8 on 1 and 30 DF,  p-value: 1.222e-11

The estimated regression line is wt = 1.600 + 0.007 disp since the intercept is 1.6 and the slope is 0.007. Both of them are significantly different from 0 with a significance level \alpha = 0.05 because their p-values are almost 0. The linear relation means that one inch increase in disp (displacement) makes 7 lbs increase in wt (weight). On average, if a car has a one-inch longer displacement, it is 7 pounds heavier.

If a car has 200 inches displacement, then its estimated weight can be calculated as 1.600 + 0.007\cdot200 = 3000 \textrm{ lbs}

We next use the R package ggplot to visualize the data set and the regression line.

Code
ggplot(mtcars, aes(x=disp, y=wt)) +  # define x and y
  geom_point()+                      # scatter  plot
  geom_smooth(method=lm, se=FALSE) + # add a regression line
  geom_text(aes(x = 150, y = 4.5, label = "wt = 1.600 + 0.007disp")) #add a label