R Basics

http://rafalab.dfci.harvard.edu/dsbook-part-1/R/R-basics.html

Packages

The base package comes with the installation of R.
Add-on packages can be obtained from CRAN.
Use install.packages to install the dslabs package from CRAN.
Try: sessionInfo, installed.packages

Prebuilt functions

R base includes automatically loaded packages: stats, graphics, grDevices, utils, datasets, methods.
Very popular packages not included in base R: ggplot2, dplyr, tidyr, data.table, caret, etc. They can be installed on the fly when RStudio prompts you.

Base R functions

Examples: ls, rm, library, search, factor, list, exists, str, typeof, and class.
You can see the raw code for a function by typing its name without the parentheses.
- type ls on your console to see an example.

Help system

You can learn about a function using

help("ls")

?ls

Many packages provide vignettes, which are like how-to manuals for different analyses.

Variables

Define a variable.

a <- 2

Use ls to see if it’s there. Also take a look at the Environment tab in RStudio.

ls()

[1] "a"               "has_annotations"

Use rm to remove the variable you defined.

rm(a)

The workspace

Each time you start R, a new (clean slate) workspace is loaded that does not have any variables or libraries.
When you exit R, you will be asked if you want to save the workspace.
- In most cases, you are better off saying NO.
If you save a workspace, it is stored as a hidden file .RData, and whenever R is started in a directory with a saved workspace, that workspace is restored and used.
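A common alternative to saving the whole workspace is to save only the objects you need, explicitly. The sketch below assumes a toy vector named scores and a temporary file path, both invented for illustration; saveRDS and readRDS are base R functions.

```r
# Save one object explicitly instead of saving the whole workspace.
scores <- c(90, 85, 77)              # toy data for illustration
path <- tempfile(fileext = ".rds")   # temporary file used only in this sketch
saveRDS(scores, path)

# Later (or in a fresh session), read the object back under any name.
restored <- readRDS(path)
identical(scores, restored)   # TRUE
```

Because the file is read back by an explicit call, nothing is loaded silently when R starts.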

The search paths

search() # return the search path that R checks in order

[1] ".GlobalEnv"        "package:stats"     "package:graphics" 
[4] "package:grDevices" "package:utils"     "package:datasets" 
[7] "package:methods"   "Autoloads"         "package:base"

If two functions have the same name, R uses the one found first in the order defined by the search path.
When you run

library(dplyr)

package dplyr is added near the top of the search path.

Variable naming convention

Use meaningful words (nouns) that describe what is stored, with only lower case letters and underscores, and no spaces.
Do not use the period . in variable names. R's S3 system uses the period to connect a generic function to a class (for example, plot.ts), so such names can cause unintended behavior.
For more, see this guide.
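To see why a period in a name can matter, here is a minimal sketch of an accidental S3 method. The function name summary.stats and the class name "stats" are made up for illustration.

```r
# A function that was only meant to be a helper named "summary.stats" ...
summary.stats <- function(object, ...) "summary.stats was called"

# ... is picked up as the summary() method for any object of class "stats".
x <- structure(list(), class = "stats")
result <- summary(x)
result   # "summary.stats was called"
```

A name such as summary_stats would not be found by method dispatch.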

Main data types

One dimensional vectors: double, integer, logical, complex, character.
integer and double are both numeric
Factors (categorical variables): take finite discrete values
Lists: this includes data frames.
Arrays (matrices): all elements are of the same type
Date and time
tibble (comes from tidyverse, behaves like a dataframe, print friendly, no automatic type change, avoids silent bugs)
S3 objects (default R system, flexible, but can be fragile) vs. S4 objects (strongly typed classes, avoid silent bugs)

Data types

str stands for structure, gives us information about an object.
typeof gives you the lower-level data storage type of the object: double, integer, character, logical, list, closure (functions), environment.
class returns the object-oriented class attribute of an object at a higher level, often user-facing level.

`typeof` and `class`

d <- as.Date("2026-02-02")
typeof(d)  # "double"

[1] "double"

class(d)   # "Date"

[1] "Date"

unclass(d) # a number (days since 1970-01-01)

[1] 20486
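The same split between storage type and class shows up for other objects. The sketch below uses small objects invented for illustration and only base R functions.

```r
f <- factor(c("low", "high", "low"))
typeof(f)   # "integer": factors are stored as integer codes
class(f)    # "factor"

m <- matrix(1:4, nrow = 2)
typeof(m)               # "integer"
inherits(m, "matrix")   # TRUE

typeof(mean)   # "closure": a function written in R
class(mean)    # "function"
```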

Data types

Let’s look at an example:

library(dslabs)
typeof(murders)

[1] "list"

class(murders)

[1] "data.frame"

str(murders)

'data.frame':   51 obs. of  5 variables:
 $ state     : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
 $ abb       : chr  "AL" "AK" "AZ" "AR" ...
 $ region    : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
 $ population: num  4779736 710231 6392017 2915918 37253956 ...
 $ total     : num  135 19 232 93 1257 ...

dim(murders)

[1] 51  5

Data frames

Data frames are the most common class used in data analysis.
Data frames are like a matrix, but where the columns can have different types.
Usually, rows represent observations and columns represent variables.
You can index them like you would a matrix: x[i, j] refers to the element in row i, column j.
You can see part of the content like this:

head(murders)

       state abb region population total
1    Alabama  AL  South    4779736   135
2     Alaska  AK   West     710231    19
3    Arizona  AZ   West    6392017   232
4   Arkansas  AR  South    2915918    93
5 California  CA   West   37253956  1257
6   Colorado  CO   West    5029196    65

View a data frame

View(murders)

Add a column

murders$pop_rank <- rank(murders$population)
head(murders)

       state abb region population total pop_rank
1    Alabama  AL  South    4779736   135       29
2     Alaska  AK   West     710231    19        5
3    Arizona  AZ   West    6392017   232       36
4   Arkansas  AR  South    2915918    93       20
5 California  CA   West   37253956  1257       51
6   Colorado  CO   West    5029196    65       30

Data frames: Accessor `$`

$ is called an accessor because it lets us access columns.

murders$population

 [1]  4779736   710231  6392017  2915918 37253956  5029196  3574097   897934
 [9]   601723 19687653  9920000  1360301  1567582 12830632  6483802  3046355
[17]  2853118  4339367  4533372  1328361  5773552  6547629  9883640  5303925
[25]  2967297  5988927   989415  1826341  2700551  1316470  8791894  2059179
[33] 19378102  9535483   672591 11536504  3751351  3831074 12702379  1052567
[41]  4625364   814180  6346105 25145561  2763885   625741  8001024  6724540
[49]  1852994  5686986   563626

More generally: $ can be used to access named components of a list.

Alternative ways to access columns (confusing)

murders$population
murders[, "population"]
murders[, 4] 
murders[["population"]]
murders[[4]]

In general, use the name rather than the number: adding or removing columns changes index values, but not names.

Gotcha

Compare

murders$population
murders[, "population"]
murders[["population"]] # access the column content
murders['population']   # access the column as a data frame

class(murders['population'])

[1] "data.frame"

class(murders[["population"]])

[1] "numeric"
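The same distinction can be checked on any data frame. This sketch uses a toy data frame df (not part of dslabs) and also shows the drop argument of `[`, which controls whether a single column is simplified to a vector.

```r
df <- data.frame(id = 1:3, score = c(90, 85, 77))   # toy data for illustration

class(df[["score"]])                 # "numeric": the column itself
class(df["score"])                   # "data.frame": a one-column data frame
class(df[, "score"])                 # "numeric": [i, j] simplifies to a vector by default
class(df[, "score", drop = FALSE])   # "data.frame": simplification turned off
```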

`with` clause

with lets us use the column names as symbols to access the data.

rate <- with(murders, total/population)

Note that you can write entire code chunks by enclosing them in curly brackets:

with(murders, {
   rate <- total/population
   rate <- round(rate*10^5)
   print(rate[1:5])
})

[1] 3 3 4 3 3

Atomic vectors

The columns of data frames are one dimensional (atomic) vectors.
An atomic vector is a vector where every element must be the same type.

length(murders$population)

[1] 51

typeof(murders$population)

[1] "double"

Create vectors

Use the concatenate function c to create a vector by listing its elements.

x <- c("s", "t", "a", "t",  " ", "3", "0", "0","0")

We access the elements using []

x[5]

[1] " "

typeof(x[5])

[1] "character"

x[12]

[1] NA

NOTE: R does not do array bounds checking; it silently returns a missing value (NA) for an out-of-range index.
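The lack of bounds checking applies to assignment as well as reading. A minimal sketch with a toy vector v:

```r
v <- c(10, 20, 30)
v[5]          # NA: reading past the end returns a missing value
v[6] <- 60    # assigning past the end grows the vector
v             # 10 20 30 NA NA 60
length(v)     # 6
```

Checking length(v) before indexing is a simple guard when an out-of-range index would indicate a bug.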

Sequences

Sequences are vectors with equally spaced elements.

seq(1, 10)

 [1]  1  2  3  4  5  6  7  8  9 10

seq(1, 9, by=3)  # why the output?

[1] 1 4 7

When you want a sequence that increases/decreases by 1 you can use the colon :

1:10

 [1]  1  2  3  4  5  6  7  8  9 10

3.3:7.8

[1] 3.3 4.3 5.3 6.3 7.3

4:-1

[1]  4  3  2  1  0 -1

Sequences

To quickly generate a sequence the same length as a vector x, use seq_along(x) rather than 1:length(x).

x <- c("b", "s", "t", " ", "2", "6", "0")
seq_along(x)

[1] 1 2 3 4 5 6 7

for (i in seq_along(x)) {
  cat(toupper(x[i])) #concatenate and print
}

BST 260

But if the length of x is zero, then 1:length(x) does not work for the above loop:

w <- x[x == "W"]
print(w) # character(0): a character vector of length 0

character(0)

1:length(w) # a sequence [1 0]

[1] 1 0

seq_along(w) # an integer vector of length 0

integer(0)
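The practical consequence is the number of times a loop body runs. The helper count_iterations below is a hypothetical function written for this sketch.

```r
# Hypothetical helper: counts how many times a for loop over idx would run.
count_iterations <- function(idx) {
  n <- 0
  for (i in idx) n <- n + 1
  n
}

w <- character(0)                  # an empty vector
count_iterations(1:length(w))      # 2: the loop runs for i = 1 and i = 0
count_iterations(seq_along(w))     # 0: the loop body is skipped
```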

Data types and coercion

Basic types: double, integer, logical, complex, and character; numeric covers both double and integer.
Each basic type has its own version of NA (a missing value).
Testing for types: is.TYPE.
Coercing: as.TYPE, which results in NA if the conversion is not possible.

is.numeric("a")

[1] FALSE

is.double(1L)

[1] FALSE

as.double("6")

[1] 6

as.numeric(x) + 3

[1] NA NA NA NA  5  9  3

typeof(1:10)

[1] "integer"

Coercing

When you do something inconsistent with data types, R tries to figure out what you mean and change it accordingly.
We call this coercing.
R does not return an error and in some cases does not return a warning either.
This can cause confusion and unnoticed errors.

Vector types and coercion

Coercion is performed automatically when it is possible.
Coercion hierarchy: logical → integer → double → character
TRUE coerces to 1 and FALSE to 0.
In the other direction, any non-zero number coerces to TRUE; only 0 coerces to FALSE.
as.logical converts 0 and only 0 to FALSE, and every other non-missing number to TRUE.
The character string “NA” is not a missing value.

typeof(1:10 + 0.1)

[1] "double"

typeof(TRUE+1) # 1 is double by default in R; integer must be `1L`

[1] "double"

as.character(TRUE)

[1] "TRUE"

as.numeric(TRUE)

[1] 1

as.logical(1)

[1] TRUE

as.logical(.5)

[1] TRUE
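The hierarchy can be observed directly by combining types with c, which coerces every element to the most general type present. The vector x at the end is toy data for illustration.

```r
typeof(c(TRUE, 2L))      # "integer"
typeof(c(2L, 3.5))       # "double"
typeof(c(3.5, "a"))      # "character"
c(TRUE, 2L, 3.5, "a")    # "TRUE" "2" "3.5" "a"

# Logical-to-number coercion makes counting easy.
x <- c(4, 11, 7, 15)
sum(x > 5)     # 3: number of elements greater than 5
mean(x > 5)    # 0.75: proportion of elements greater than 5
```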

More coercing examples

typeof(1L)

[1] "integer"

typeof(1)

[1] "double"

typeof(1 + 1L)

[1] "double"

c("a", 1, 2)

[1] "a" "1" "2"

TRUE + FALSE

[1] 1

factor("a") == "a" # `==` compares values

[1] TRUE

identical(factor("a"), "a") # checks whether they are exactly the same object

[1] FALSE

When coercing fails

When R can’t figure out how to coerce, rather than an error it returns an NA (with a warning):

as.numeric("a")

[1] NA

Note that including NAs in arithmetical operations usually returns an NA.

1 + 2 + NA

[1] NA
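One way to handle failed coercions is to locate the entries that became NA and then decide how to treat them. The vector raw below is toy data for illustration.

```r
raw <- c("1.5", "2", "n/a", "4")            # toy input with one unparseable entry
num <- suppressWarnings(as.numeric(raw))    # silence "NAs introduced by coercion"

raw[is.na(num)]           # "n/a": the entry that failed to convert
sum(num)                  # NA: the missing value propagates
sum(num, na.rm = TRUE)    # 7.5: missing values removed before summing
```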

Explicitly coercing

Most coercion functions start with as.

x <- factor(c("a","b","b","c"))
print(x)

[1] a b b c
Levels: a b c

as.character(x)

[1] "a" "b" "b" "c"

as.numeric(x)

[1] 1 2 2 3

Coercing and parsing by `readr`

The readr package provides tools for parsing strings that as.numeric cannot convert:

x <- c("12323", "12,323")
as.numeric(x)

[1] 12323    NA

library(readr)
parse_guess(x)

[1] 12323 12323

Atomic vector and array

An atomic vector is 1-D, contains only one type of data, and has no dimensions (`dim = NULL`).

x_num  <- c(1, 2, 3)
x_chr  <- c("a", "b", "c")
x_log  <- c(TRUE, FALSE, TRUE)

typeof(x_num) # "double"

[1] "double"

is.atomic(x_num) # TRUE

[1] TRUE

dim(x_num) # NULL

NULL

Array: $\ge 2$-D, one atomic type, has a dim attribute

m <- matrix(1:6, nrow = 2)
dim(m)

[1] 2 3

# 2 3

is.atomic(m) # TRUE   (important!)

[1] TRUE

Factors and Characters

The murders dataset has examples of both.

class(murders$state)

[1] "character"

class(murders$region)

[1] "factor"

What is a factor?

A factor is an R representation of a categorical variable that has a fixed set of non-numeric values.
- ex: region has the levels Northeast, South, North Central, and West in the murders dataset
Usually not good for variables that have lots of levels (like state names in the murders dataset).
Internally a factor is stored as the unique set of labels (called levels) and an integer vector with values in 1 to length(levels).
- if the ith entry of the integer vector is k, the ith value is the kth element of the levels
Usually order does not matter, but if it does you can have ordered factors.

x <- murders$region
levels(x) # the default order is lexicographic, but these levels were set explicitly

[1] "Northeast"     "South"         "North Central" "West"
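The internal representation can be inspected with as.integer and levels. The factor f below is a toy example, so its levels fall in the default lexicographic order.

```r
f <- factor(c("West", "South", "West", "Northeast"))   # toy factor for illustration

levels(f)                   # "Northeast" "South" "West"
as.integer(f)               # 3 2 3 1: position of each value in levels(f)
levels(f)[as.integer(f)]    # "West" "South" "West" "Northeast"
```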

Setting Levels

You can set the levels (and their order) as you like when creating a factor.
If you do not set them, they are created in lexicographic order (in the locale you are using).

x = sample(c("Male", "Female"), 50, replace =TRUE)
y1 = factor(x, levels=c("Male", "Female"))
y2 = factor(x, levels = c("Female", "Male"))
y1[1:10]

 [1] Female Male   Female Female Female Male   Male   Male   Male   Male  
Levels: Male Female

y2[1:10]

 [1] Female Male   Female Female Female Male   Male   Male   Male   Male  
Levels: Female Male

The previous example continued

x = sample(c("Male", "Female"), 50, replace =TRUE)
y1 = factor(x, levels=c("Male", "Female"))
y2 = factor(x, levels = c("Female", "Male"))
y1[1:10]

 [1] Male   Female Female Female Female Female Male   Male   Male   Female
Levels: Male Female

y2[1:10]

 [1] Male   Female Female Female Female Female Male   Male   Male   Female
Levels: Female Male

y1==y2 # value comparison

 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[46] TRUE TRUE TRUE TRUE TRUE

identical(y1,y2) # object comparison

[1] FALSE

Categories based on strata

The function cut bins a numeric vector into intervals and returns a factor

age <- c(5, 93, 18, 102, 14, 22, 45, 65, 67, 25, 30, 16, 21)
cut(age, c(0, 11, 27, 43, 59, 78, 96, Inf)) # number of categories =length(breaks)-1=7

 [1] (0,11]   (78,96]  (11,27]  (96,Inf] (11,27]  (11,27]  (43,59]  (59,78] 
 [9] (59,78]  (11,27]  (27,43]  (11,27]  (11,27] 
Levels: (0,11] (11,27] (27,43] (43,59] (59,78] (78,96] (96,Inf]

# cut(age,
#     c(0,11,27,43,59,78,96,Inf),
#     labels = c("Child","Young","Adult","Mid","Senior","Old","Very Old"))

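breaks <- c(0, 11, 27, 43, 59, 78, 96, Inf) # same breaks as above
cut(age, breaks, right = FALSE) # intervals become [a,b)
cut(age, breaks, include.lowest = TRUE) # the first interval becomes [0,11], so an age of exactly 0 is included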
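The effect of right and include.lowest is easiest to see on values that sit exactly on a break. This sketch uses a shortened set of breaks, chosen only to keep the output small.

```r
breaks <- c(0, 11, 27, 43)   # shortened set of breaks for illustration

as.character(cut(c(11, 27), breaks))                   # "(0,11]" "(11,27]"
as.character(cut(c(11, 27), breaks, right = FALSE))    # "[11,27)" "[27,43)"
as.character(cut(0, breaks))                           # NA: 0 is outside (0,11]
as.character(cut(0, breaks, include.lowest = TRUE))    # "[0,11]"
```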

Assigning labels

We can assign more meaningful level names:

age <- c(5, 93, 18, 102, 14, 22, 45, 65, 67, 25, 30, 16, 21)
cut(age, c(0, 11, 27, 43, 59, 78, 96, Inf), 
    labels = c("Alpha", "Zoomer", "Millennial", "X", "Boomer", "Silent", "Greatest"))

 [1] Alpha      Silent     Zoomer     Greatest   Zoomer     Zoomer    
 [7] X          Boomer     Boomer     Zoomer     Millennial Zoomer    
[13] Zoomer    
Levels: Alpha Zoomer Millennial X Boomer Silent Greatest

Changing levels

This is often needed for ordinal data because R defaults to alphabetical order

gen <- factor(c("Alpha", "Zoomer", "Millennial"))
levels(gen)

[1] "Alpha"      "Millennial" "Zoomer"

You can change this with the levels argument:

gen <- factor(gen, levels = c("Alpha", "Zoomer", "Millennial", "X", "Boomer", "Silent", "Greatest"))
levels(gen)

[1] "Alpha"      "Zoomer"     "Millennial" "X"          "Boomer"    
[6] "Silent"     "Greatest"

Reference level

A common reason we want to change levels is to ensure R uses the correct reference stratum.
This is important for linear models (lm) because the first level is assumed to be the reference.

x <- factor(c("no drug", "drug 1", "drug 2"))
levels(x)

[1] "drug 1"  "drug 2"  "no drug"

x <- relevel(x, ref = "no drug") 
# factor(c("no drug", "drug 1", "drug 2"), levels = c("no drug", "drug 1", "drug 2"))
levels(x)

[1] "no drug" "drug 1"  "drug 2"
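The effect on a linear model can be seen in the coefficient names: every level except the reference gets its own coefficient. The data frame d below contains made-up outcome values, used only to show which level is treated as the reference.

```r
# Toy data: two made-up observations per treatment.
d <- data.frame(
  trt = factor(rep(c("no drug", "drug 1", "drug 2"), times = 2)),
  y   = c(1.0, 2.0, 3.0, 1.5, 2.5, 3.5)
)

# Default levels are alphabetical, so "drug 1" is the reference.
before <- names(coef(lm(y ~ trt, data = d)))
before   # "(Intercept)" "trtdrug 2" "trtno drug"

# After relevel, "no drug" is the reference.
d$trt <- relevel(d$trt, ref = "no drug")
after <- names(coef(lm(y ~ trt, data = d)))
after    # "(Intercept)" "trtdrug 1" "trtdrug 2"
```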

Reorder a factor (categorical) variable

We often want to order strata (factor levels) based on a summary statistic.

x <- reorder(murders$region, murders$population, sum) # reorder(x, by, FUN)
# orders the regions by their total population, ascending (default)
# reorder(murders$region, murders$population, sum, decreasing = TRUE)
library(ggplot2)
ggplot(murders, aes(x = reorder(region, population, sum),
                    y = population)) +
  geom_bar(stat = "summary", fun = "sum") #summarize y for each x using "sum"
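The reordering itself can be checked without a plot by looking at the levels. The vectors region and pop below are toy data, not the murders dataset.

```r
region <- factor(c("A", "A", "B", "C", "C"))   # toy groups
pop    <- c(5, 5, 1, 2, 2)                     # toy values; group sums are A = 10, B = 1, C = 4

levels(region)                              # "A" "B" "C": default order
levels(reorder(region, pop, FUN = sum))     # "B" "C" "A": ascending by group sum
```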

Factors (integers) use less memory

x <- sample(murders$state[c(5,33,44)], 10^7, replace = TRUE) #sample(x,size,replace)
y <- factor(x)
object.size(x)

80000232 bytes

object.size(y)

40000648 bytes

length(x) #  10000000

[1] 10000000

table(x) / length(x) #≈ 0.333, 0.333, 0.333

x
California   New York      Texas 
 0.3335301  0.3332706  0.3331993

# table counts the frequency of each unique value

Factors can be confusing

x <- factor(c("3","2","1"), levels = c("3","2","1"))
as.numeric(x)

[1] 1 2 3

x[1]

[1] 3
Levels: 3 2 1

levels(x[1])

[1] "3" "2" "1"

table(x[1])


3 2 1 
1 0 0
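When the labels of a factor are numbers stored as text, converting through character recovers the values rather than the integer codes. A minimal sketch using the same factor:

```r
x <- factor(c("3", "2", "1"), levels = c("3", "2", "1"))

as.numeric(x)                  # 1 2 3: the internal integer codes
as.numeric(as.character(x))    # 3 2 1: the labels converted to numbers
as.numeric(levels(x))[x]       # 3 2 1: same result, converting each level only once
```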

Drop extra levels with `droplevels`

z <- x[1]
z <- droplevels(z)
z

[1] 3
Levels: 3

But note what happens if we change to another level:

z[1] <- "1" # Factors only allow values that exist in levels(z)
z

[1] <NA>
Levels: 3

NAs and NULLs

NA stands for not available and represents missing data.
In R there is a different kind of NA for each of the basic vector data types.
NULL represents the null object (it behaves like a zero-length list) and is often returned by functions or expressions that do not have a specified return value.

Checking NAs by `is.na`

library(dslabs)
na_example[1:20]

 [1]  2  1  3  2  1  3  1  4  3  2  2 NA  2  2  1  4 NA  1  1  2

The is.na function is key for dealing with NAs

is.na(na_example[1])

[1] FALSE

is.na(na_example[17])

[1] TRUE

is.na(NA)

[1] TRUE

is.na("NA")

[1] FALSE

NAs and logical operators

Logical operators like and (&) and or (|) coerce their arguments when needed and possible.
With a missing value, the result is NA only when the NA could change the answer, so TRUE | NA is TRUE but TRUE & NA is NA.

TRUE & NA

[1] NA

TRUE & 0

[1] FALSE

TRUE | NA

[1] TRUE
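The remaining combinations follow the same rule: the result is known whenever one operand settles it. The sketch also shows how all and any treat missing values; the vector x is a toy example.

```r
FALSE & NA    # FALSE: the result is FALSE whatever the missing value is
TRUE  & NA    # NA: the result depends on the missing value
FALSE | NA    # NA: the result depends on the missing value
TRUE  | NA    # TRUE: the result is TRUE whatever the missing value is

x <- c(TRUE, NA, TRUE)     # toy logical vector with one missing value
all(x)                     # NA
all(x, na.rm = TRUE)       # TRUE
any(x)                     # TRUE
```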

NaNs and Inf

NaN: Not a Number
NaN is a double; coercing it to integer yields an NA.
Inf and -Inf represent values of infinity and minus infinity

0/0

[1] NaN

class(0/0)

[1] "numeric"

sqrt(-1)

[1] NaN

log(-1)

[1] NaN

1/Inf

[1] 0

Inf-Inf

[1] NaN
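NaN and NA are tested with different functions, and the two tests are not symmetric. A short sketch using base R only:

```r
is.na(NaN)     # TRUE: is.na treats NaN as missing
is.nan(NA)     # FALSE: NA is not NaN
is.nan(0/0)    # TRUE

is.finite(c(1, Inf, NA, NaN))   # TRUE FALSE FALSE FALSE
```

is.finite is a convenient single check when a computation should produce ordinary numbers only.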

Lists

Data frames are a type of list.
Lists permit components of different types and, unlike data frames, different lengths:

x <- list(name = "John", id = 112, grades = c(95, 87, 92))

The JSON format is best represented as a list in R.

Access components of a list

You can access components in different ways:

x$name

[1] "John"

x[[1]]

[1] "John"

x[["name"]]

[1] "John"

Subsetting (sublist) `[]` and extraction `[[]]`

x$name == x[[1]]

[1] TRUE

class(x[[1]])

[1] "character"

x[1]

$name
[1] "John"

class(x[1])

[1] "list"

x[[1]] == x[1]

name 
TRUE

identical(x[1],x[[1]])

[1] FALSE

Matrices: Faster computation

Similar to data frames, except all entries need to be of the same type.

mat <- matrix(1:12, 4, 3)
mat

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

mat[2, 3]  # 10

[1] 10

mat[2,]

[1]  2  6 10

mat[,3]

[1]  9 10 11 12

mat[,2:3]

     [,1] [,2]
[1,]    5    9
[2,]    6   10
[3,]    7   11
[4,]    8   12

mat[1:2,2:3]

     [,1] [,2]
[1,]    5    9
[2,]    6   10

as.data.frame(mat)
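Because all entries share one type, whole-matrix operations are available as single vectorized calls. A brief sketch on the same matrix:

```r
mat <- matrix(1:12, 4, 3)

dim(t(mat))          # 3 4: the transpose swaps rows and columns
colSums(mat)         # 10 26 42
rowMeans(mat)        # 5 6 7 8
mat %*% c(1, 0, 0)   # matrix product; picks out the first column as a 4 x 1 matrix
```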

Define a custom function

f <- function(x, y, z = 0) {
  # do calculations with x, y, z to compute object
  object <- x + y + z # placeholder calculation
  return(object)
}

Arguments can be passed by position (positional arguments) or by name (keyword arguments).
Arguments are matched by either exact name (which takes precedence) or position.
Returned value: the value specified in a call to return, or the value of the last statement evaluated.
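These rules can be seen in a small function. rescale is a hypothetical function written for this sketch; it has two default arguments and relies on the value of its last expression as the return value.

```r
# Hypothetical example function with default arguments.
rescale <- function(x, center = 0, scale = 1) {
  (x - center) / scale   # value of the last expression is returned
}

rescale(10)                  # 10: both defaults used
rescale(10, 4, 2)            # 3: arguments matched by position
rescale(scale = 2, x = 10)   # 5: arguments matched by name, in any order
```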

Functions

Any symbol found in the body of the function that does not match an argument has to be matched to a value by a process called scoping.

x <- 10
s <- function(n) {
  print(x)
  return(sum(1:n))
}
s(2)

[1] 10

[1] 3

R matches named arguments first, positional arguments last

f <- function(a, b, c) {
  c(a, b, c)
}

f(1, c = 3, 2)

[1] 1 2 3

mean(na.rm = TRUE, c(1, NA, 3))

[1] 2

f <- function(a, b, c) a + b + c
f(1, 2, c = 3, b = 4)
# Error in f(1, 2, c = 3, b = 4) : unused argument (2)

Flow-control and operators

if/else; while; repeat; for; break; next
you can read the manual pages by calling help or using ? (sometimes you must quote the argument)

 help("for")
 ?"break"
 ?"&"
 ?"/"

if-else

a <- 0

if (a != 0) {
  print(1/a)
} else {
  print("No reciprocal for 0.")
}

[1] "No reciprocal for 0."

a <- 0
ifelse(a > 0, 1/a, NA)

[1] NA
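Unlike if, ifelse is vectorized: it tests every element and returns a vector of the same length. A minimal sketch with a toy vector:

```r
a <- c(0, 1, 2, 4)                  # toy vector for illustration
recip <- ifelse(a > 0, 1 / a, NA)   # one result per element of a
recip                               # NA 1.00 0.50 0.25
```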

For loop

for (i in 1:3) {
  print(i)
}

[1] 1
[1] 2
[1] 3

compute_s_n <- function(n) { 
  sum(1:n)
}
m <- 5
s_n <- vector(length = m) # pre-allocate a vector of length m
for (n in 1:m) {
  s_n[n] <- compute_s_n(n)
}
n <- 1:m
plot(n, s_n)
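Many loops of this kind can also be written without an explicit for. The sketch below computes the same sums three ways; the names s_loop, s_apply, and s_cumsum are introduced here for comparison.

```r
compute_s_n <- function(n) sum(1:n)
m <- 5

# Explicit loop with a pre-allocated numeric vector.
s_loop <- numeric(m)
for (n in seq_len(m)) s_loop[n] <- compute_s_n(n)

# Apply the function to each n.
s_apply <- sapply(seq_len(m), compute_s_n)

# Fully vectorized: running total of 1, 2, ..., m.
s_cumsum <- cumsum(seq_len(m))

s_loop   # 1 3 6 10 15
```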

Logical Operators `&` and `|`

& and | perform element-wise comparison; use them for vector operations, filtering, and subsetting.

x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

x & y

[1]  TRUE FALSE FALSE

x | y

[1] TRUE TRUE TRUE

Typical use: Subsetting

x[x > 0 & x < 10]

Logical operators `&&` and `||`

&& and || use lazy (short-circuit) evaluation: they move left to right and return as soon as the result is determined. Use them for if, while, and guards.

x <- c(TRUE, FALSE)
y <- c(FALSE, TRUE)

x && y  # wrong: each side must be a single TRUE/FALSE

x || y  # wrong for the same reason

if (x > 0 & x < 10) { ... } # wrong: & returns a vector, but if needs a single value
if (x > 0 && x < 10) { ... } # wrong if x is a vector of length > 1
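A guard built with && can check the length first so that the comparisons are only evaluated for a single value. is_small_positive is a hypothetical helper written for this sketch; for whole vectors, any or all reduce an element-wise result to one value.

```r
# Hypothetical guard: TRUE only for a single number strictly between 0 and 10.
is_small_positive <- function(x) {
  # The type and length checks run first; the comparisons are skipped if they fail.
  is.numeric(x) && length(x) == 1 && x > 0 && x < 10
}

is_small_positive(5)           # TRUE
is_small_positive(c(5, 50))    # FALSE: evaluation stops at length(x) == 1
is_small_positive(NULL)        # FALSE: evaluation stops at is.numeric(x)

# For a vector, reduce the element-wise result to a single value.
x <- c(5, 50)
any(x > 0 & x < 10)   # TRUE
all(x > 0 & x < 10)   # FALSE
```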

Arithmetic Operators

Operators like ^ or + follow precedence rules.
?Syntax will get you the manual page on operator precedence.
When in doubt, always use parentheses.

2^1+1

[1] 3

2^(1+1)

[1] 4

TRUE || TRUE && FALSE   # is the same as

[1] TRUE

TRUE || (TRUE && FALSE) # and different from

[1] TRUE

(TRUE || TRUE) && FALSE

[1] FALSE

Function as an argument

In R, functions are first-class objects: they can be assigned to symbols, stored in other data structures, passed as arguments to a function, and returned as values.
In some languages (e.g. C, or Java before version 8) functions are not first-class objects and cannot be passed directly as arguments.
Python takes a fairly similar approach to R.
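A short sketch of functions being passed as arguments and stored in a list. apply_twice and ops are hypothetical names introduced for illustration.

```r
# Hypothetical higher-order function: applies f to x two times.
apply_twice <- function(f, x) f(f(x))

apply_twice(sqrt, 16)                 # 2: sqrt(sqrt(16))
apply_twice(function(v) v + 1, 0)     # 2: an anonymous function as the argument

# Functions stored in a list and called in turn.
ops <- list(total = sum, average = mean)
results <- sapply(ops, function(op) op(c(1, 2, 3)))
results   # total = 6, average = 2
```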

Scope

f <- function(x){
  cat("y is", y,"\n")
  y <- x
  cat("y is", y,"\n")
  return(y)
}
y <- 2
f(3)

y is 2 
y is 3

[1] 3

Example: function as a returned value

rexp: generate values from an Exponential distribution $f(x;\lambda)=\begin{cases} \lambda e^{-\lambda x} & x\ge 0 \\ 0 & x<0. \end{cases}$.
Notice that llExp returns a function.

x = rexp(100, rate = 4)

llExp = function(DATA) { # log-likelihood
   n = length(DATA)
   sumx = sum(DATA)
   return(function(mu) {n * log(mu) - mu * sumx})
}

myLL = llExp(x)
myLL

function (mu) 
{
    n * log(mu) - mu * sumx
}
<environment: 0x0000024b1764be80>

Function call as an argument

##possible values for mu
y = seq(3,5,by = 0.1)

plot(y, myLL(y), type="l", xlab="mu", ylab="log likelihood") 
# "l"-line; "p"-points(default), "b"-both; "h"-vertical lines
abline(v = y[which.max(myLL(y))], col = "red")

In this run, the MLE occurs at mu = 3.7.
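The grid search can be cross-checked against the closed-form maximizer. Setting the derivative of the log-likelihood to zero, $n/\mu - \sum_i x_i = 0$, gives $\hat{\mu} = n / \sum_i x_i = 1/\bar{x}$. The sketch below repeats the setup with an arbitrary seed (an assumption, so the numbers will differ from the run above) and compares the closed form with a numerical maximum from optimize.

```r
set.seed(1)   # arbitrary seed so the sketch is reproducible
x <- rexp(100, rate = 4)

llExp <- function(DATA) {
  n <- length(DATA)
  sumx <- sum(DATA)
  function(mu) n * log(mu) - mu * sumx   # log-likelihood as a function of mu
}
myLL <- llExp(x)

mle_closed  <- 1 / mean(x)   # solves n / mu - sum(x) = 0
mle_numeric <- optimize(myLL, interval = c(0.1, 20), maximum = TRUE)$maximum
c(closed = mle_closed, numeric = mle_numeric)
```

The search interval c(0.1, 20) is also an assumption; it only needs to contain the maximizer.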

Order of binding

search()
filter     #stats::filter()->time-series filtering
library(dplyr)
search()
filter # dplyr::filter(): ->row filtering

By calling library(dplyr), the package dplyr has been put near the top of the search list. You can call search() before and after the call to library to check this.
Users could inadvertently alter computations this way, and we want to protect against that.

Namespaces

To avoid binding conflicts, use <pkgname>::<functionname>:

stats::filter
dplyr::filter
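Masking can be reproduced with base R alone. In this sketch a user-defined function named sd (written only for illustration) hides stats::sd until it is removed, while the :: form always reaches the package version.

```r
sd <- function(x) "masked sd"        # user-defined function that masks stats::sd

masked   <- sd(c(1, 2, 3))           # "masked sd": the user-defined version is found first
explicit <- stats::sd(c(1, 2, 3))    # 1: the namespace prefix bypasses the mask

rm(sd)                               # remove the user-defined version
restored <- sd(c(1, 2, 3))           # 1: stats::sd is found again
```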

Examples

Restart your R Console and study this example:

library(dslabs) # attach dslabs to the search path
exists("murders")  # `murders` exists in `dslabs`

[1] TRUE

murders <- murders # create a copy `murders` in .GlobalEnv
murders2 <- murders # 2nd copy
rm(murders) # removes `murders` from .GlobalEnv
exists("murders") #R finds it in `dslabs`

[1] TRUE

detach("package:dslabs") #remove from the search path
exists("murders") # does not exist on the search path

[1] FALSE

exists("murders2")

[1] TRUE

Object Oriented Programming

R uses object oriented programming (OOP).
Base R uses two approaches referred to as S3 and S4
S3, the original approach, is more common, but has some limitations
The S4 approach is more similar to the conventions used by the Lisp family of languages.
In S4 there are classes that have attributes describing data structures and methods that are functions

Object Oriented Programming

Time series

class(co2) # ts object

[1] "ts"

plot(co2) # R calls plot.ts(co2); x-axis: year

Numeric

plot(as.numeric(co2)) # strips away the time index; R calls plot.default(); x-axis: index (1, 2, 3, ...)
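The dispatch that sends plot(co2) to plot.ts can be sketched with a small S3 generic of our own. The generic area and the classes "circle" and "square" are hypothetical names used only for illustration.

```r
# Hypothetical S3 generic: picks a method based on the class of its argument.
area <- function(shape, ...) UseMethod("area")

# Methods are ordinary functions named <generic>.<class>.
area.circle <- function(shape, ...) pi * shape$r^2
area.square <- function(shape, ...) shape$side^2

circ <- structure(list(r = 1), class = "circle")
sq   <- structure(list(side = 2), class = "square")

area(circ)   # 3.141593: dispatches to area.circle
area(sq)     # 4: dispatches to area.square
```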

Plots

Soon we will learn how to use the ggplot2 package to make plots.
Base R does have plotting functions:
- plot - mainly for making scatterplots.
- lines - add lines/curves to an existing plot.
- hist - to make a histogram.
- boxplot - makes boxplots.
- image - uses color to represent entries in a matrix.

Plots

In general, we recommend using ggplot2, but base R plots are often better for quick exploratory plots.
For example, to make a histogram of values in x simply type:

hist(x)

To make a scatter plot of y versus x and then add a connecting line:

plot(x,y) # opens a new figure (canvas)
lines(x,y) # adds a connecting line to the existing plot; it does not create a new figure (canvas)

scatterplot

library(dslabs)
with(murders, plot(population, total))

histogram

x <- with(murders, total / population * 100000)
hist(x)

boxplot

murders$rate <- with(murders, total / population * 100000)
boxplot(rate~region, data = murders) #y:rate x:region
