Vectorization

http://rafalab.dfci.harvard.edu/dsbook-part-1/R/programming-basics.html#sec-vectorization

Vectorization

We will be using the murders dataset in the dslabs package.
Includes data on 2010 gun murders for the US 50 states and DC.
We will answer questions such as “Which state is with the lowest crime rate in the Western part of the US?”

Multiply/divide by a constant

Let’s convert the following heights in inches to meters:

heights <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)

Rather than loop we use vectorization:

heights*2.54/100

 [1] 1.7526 1.5748 1.6764 1.7780 1.7780 1.8542 1.7018 1.8542 1.7018 1.7780

Add/subtract a constant

We can subtract a constant from each element of a vector.
This is convenient for computing residuals or deviations from an average:

avg <- mean(heights)
heights - avg

 [1]  0.3 -6.7 -2.7  1.3  1.3  4.3 -1.7  4.3 -1.7  1.3

Compute SD

s <- sd(heights)
(heights - avg)/s

 [1]  0.08995503 -2.00899575 -0.80959530  0.38980515  0.38980515  1.28935548
 [7] -0.50974519  1.28935548 -0.50974519  0.38980515

There is actually a function, scale, that does this.

Add two vectors

If we operate on two vectors, vectorization is componentwise.

heights <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)
error <- rnorm(length(heights), 0, 0.1)
heights + error

 [1] 68.92307 61.95413 65.92478 69.88652 69.93367 73.00120 66.98732 72.99954
 [9] 67.06828 69.97925

Exercise

Add a column to the murders dataset with the murder rate.
Use murders per 100,000 persons as the unit.

Most Arithmetic functions vectorize

Most arithmetic functions work on vectors.

x <- 1:10
sqrt(x)

 [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
 [9] 3.000000 3.162278

log(x)

 [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
 [8] 2.0794415 2.1972246 2.3025851

2^x

 [1]    2    4    8   16   32   64  128  256  512 1024

Compute standard units using `scale`

scale(heights)

             [,1]
 [1,]  0.08995503
 [2,] -2.00899575
 [3,] -0.80959530
 [4,]  0.38980515
 [5,]  0.38980515
 [6,]  1.28935548
 [7,] -0.50974519
 [8,]  1.28935548
 [9,] -0.50974519
[10,]  0.38980515
attr(,"scaled:center")
[1] 68.7
attr(,"scaled:scale")
[1] 3.335

provides the same results,

(heights - mean(heights))/sd(heights)

 [1]  0.08995503 -2.00899575 -0.80959530  0.38980515  0.38980515  1.28935548
 [7] -0.50974519  1.28935548 -0.50974519  0.38980515

`scale` coerces to a column matrix

class(scale(heights))

[1] "matrix" "array"

Most functions vectorize

v <- c(3, 7, 10, -2)
any(v < 0) # TRUE   (there is a negative)

[1] TRUE

all(v > 0) # FALSE  (not all are positive)

[1] FALSE

x <- c(1, 3, 5)
if (all(x %% 2 == 1)) {
  print("All numbers are odd")
}

[1] "All numbers are odd"

if (!is.null(x) && length(x) > 0 && all(x > 0)) {
  print("x is a non-empty positive vector")
}

[1] "x is a non-empty positive vector"

a <- c(0, 1, 2, -4, 5)
ifelse(a > 0, 1/a, NA) # ifelse: vectorized function

[1]  NA 1.0 0.5  NA 0.2

Subsetting using alogical vector

Indexing with a logical vector subsets to the set corresponding to the TRUE values

library(dslabs)
ind <- murders$population < 10^6

subset a vector using this logical vector for indexing:

murders$state[ind]

[1] "Alaska"               "Delaware"             "District of Columbia"
[4] "Montana"              "North Dakota"         "South Dakota"        
[7] "Vermont"              "Wyoming"

Indexing by a logical vector

You can also use vectorization to apply logical operators:

ind <- murders$population < 10^6 & murders$region == "West"
murders$state[ind]

[1] "Alaska"  "Montana" "Wyoming"

`split`

split(ind, f): split ind into groups defined by factor f:

inds <- with(murders, split(seq_along(region), region))
str(inds) # a list of row indices, one per region

List of 4
 $ Northeast    : int [1:9] 7 20 22 30 31 33 39 40 46
 $ South        : int [1:17] 1 4 8 9 10 11 18 19 21 25 ...
 $ North Central: int [1:12] 14 15 16 17 23 24 26 28 35 36 ...
 $ West         : int [1:13] 2 3 5 6 12 13 27 29 32 38 ...

murders$state[inds$West] # states where region=="West"

 [1] "Alaska"     "Arizona"    "California" "Colorado"   "Hawaii"    
 [6] "Idaho"      "Montana"    "Nevada"     "New Mexico" "Oregon"    
[11] "Utah"       "Washington" "Wyoming"

Functions for subsetting

The functions which, match and the operator %in% are useful for sub-setting

which

x <- c(10, 5, 8, 5)

which(x == 5) # [1] 2 4 return all indices where the condtion is true

[1] 2 4

x==5

[1] FALSE  TRUE FALSE  TRUE

ind <- which(murders$state == "California")
ind

[1] 5

murders[ind,]

       state abb region population total
5 California  CA   West   37253956  1257

match

x <- c("b", "a", "c")
y <- c("a", "b", "d")

match(x, y) # [1] 2 1 NA.  returns one index per element of x, gives the first match only.

[1]  2  1 NA

v <- c("a", "b", "a", "c")
which(v == "a") # [1] 1 3

[1] 1 3

match("a", v) # [1] 1 first postion only

[1] 1

ind <- match(c("New York", "Florida", "Texas"), murders$state)
ind

[1] 33 10 44

murders[ind,]

      state abb    region population total
33 New York  NY Northeast   19378102   517
10  Florida  FL     South   19687653   669
44    Texas  TX     South   25145561   805

Handle NA with which and match

x <- c(1, NA, 3)

which(is.na(x)) # [1] 2

[1] 2

match(NA, x) # [1] 2

[1] 2

%in%

ind <- which(murders$state %in% c("New York", "Florida", "Texas"))
ind

[1] 10 33 44

murders[ind,]

      state abb    region population total
10  Florida  FL     South   19687653   669
33 New York  NY Northeast   19378102   517
44    Texas  TX     South   25145561   805

Note this is similar to using match.
But note the order is different.

match versus %in%

c("Boston", "Dakota", "Washington") %in% murders$state

[1] FALSE FALSE  TRUE

match(c("Boston", "Dakota", "Washington"), murders$state)

[1] NA NA 48

match(murders$state, c("Boston", "Dakota", "Washington"))

 [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA  3 NA NA
[51] NA

there are also set manipulation functions intersect, union, setdiff, setequal

The apply functions

Function	Apply over…	Typical input	Output shape
`lapply(X, FUN)`	list/vector elements	list or atomic vector	list
`sapply(X, FUN)`	same as `lapply`	list or atomic vector	tries to simplify (vector/matrix)
`apply(X, MARGIN, FUN)`	rows/cols of an array	matrix/array	vector/matrix/array
`tapply(X, INDEX, FUN)`	groups	vector + factor(s)	array/table-like
`mapply(FUN, ...)`	parallel elements	multiple vectors/lists	like `sapply`, simplified

sapply

x <- 1:10
sapply(x, sqrt) #>  [1] 1.00 1.41 1.73 2.00 2.24 2.45 2.65 2.83 3.00 3.16

 [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
 [9] 3.000000 3.162278

x <- list(a = 1:3, b = 4:5)
lapply(x, sum)

$a
[1] 6

$b
[1] 9

sapply(x, sum)

a b 
6 9

Apply Examples

 mm = matrix(1:10, nc=2, byrow=T) # rows are filled first
 mm

     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8
[5,]    9   10

 apply(mm, 1, sum) # operate on row. rowSums(mm) is faster

[1]  3  7 11 15 19

 apply(mm, 2, \(x) x^2) #> operate on column; `\(x) x^2: anonymous function

     [,1] [,2]
[1,]    1    4
[2,]    9   16
[3,]   25   36
[4,]   49   64
[5,]   81  100

 # mm^2  # is faster

For simply operations avoid apply(). apply() is best for non-vectorized functions.

lapply/tapply

sp1 = split(murders$population, murders$region) # split(x,f): split vector x into groups by f.
# sp1 is a list
lapply(sp1, sum) #apply sum to each list element; obtain a list

$Northeast
[1] 55317240

$South
[1] 115674434

$`North Central`
[1] 66927001

$West
[1] 71945553

alternatively:

with(murders, tapply(population, region, sum)) #result is an array

    Northeast         South North Central          West 
     55317240     115674434      66927001      71945553

tapply

Great for group summaries (base R version of group_by)

v <- c(5, 7, 4, 9, 6)
g <- factor(c("A","A","B","B","B"))
tapply(v, g, mean)

       A        B 
6.000000 6.333333

Multiple group variables

g2 <- factor(c("M","F","M","F","M"))
tapply(v, list(g, g2), mean)

  F M
A 7 5
B 9 5

mapply()

multivariate/parallel apply (map over several inputs)

mapply(rep, 1:3, 1:3) # rep(x,times)

[[1]]
[1] 1

[[2]]
[1] 2 2

[[3]]
[1] 3 3 3

# equivilent to
lapply(1:3, function(i) rep(i, i))

[[1]]
[1] 1

[[2]]
[1] 2 2

[[3]]
[1] 3 3 3

mapply(function(a, b) a + b^2, 1:4, 1:4) # 2 6 12 20

[1]  2  6 12 20

# apply the function to(a,b)= (1,1), (2,2), (3,3),(4,4)

Vectorization

Vectorization

Multiply/divide by a constant

Add/subtract a constant

Compute SD

Add two vectors

Exercise

Most Arithmetic functions vectorize

Compute standard units using scale

scale coerces to a column matrix

Most functions vectorize

Subsetting using alogical vector

Indexing by a logical vector

split

Functions for subsetting

which

match

Handle NA with which and match

%in%

match versus %in%

The apply functions

sapply

Apply Examples

lapply/tapply

tapply

mapply()

Compute standard units using `scale`

`scale` coerces to a column matrix

`split`