Vectorization

http://rafalab.dfci.harvard.edu/dsbook-part-1/R/programming-basics.html#sec-vectorization

Vectorization

Reading

  • We will be using the murders dataset in the dslabs package.

  • Includes data on 2010 gun murders for the US 50 states and DC.

  • We will answer questions such as “Which state is with the lowest crime rate in the Western part of the US?”

Multiply/divide by a constant

  • Let’s convert the following heights in inches to meters:
heights <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)
  • Rather than loop we use vectorization:
heights*2.54/100
 [1] 1.7526 1.5748 1.6764 1.7780 1.7780 1.8542 1.7018 1.8542 1.7018 1.7780

Add/subtract a constant

  • We can subtract a constant from each element of a vector.

  • This is convenient for computing residuals or deviations from an average:

avg <- mean(heights)
heights - avg 
 [1]  0.3 -6.7 -2.7  1.3  1.3  4.3 -1.7  4.3 -1.7  1.3

Compute SD

s <- sd(heights)
(heights - avg)/s
 [1]  0.08995503 -2.00899575 -0.80959530  0.38980515  0.38980515  1.28935548
 [7] -0.50974519  1.28935548 -0.50974519  0.38980515
  • There is actually a function, scale, that does this.

Add two vectors

  • If we operate on two vectors, vectorization is componentwise.
heights <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)
error <- rnorm(length(heights), 0, 0.1)
heights + error
 [1] 68.92307 61.95413 65.92478 69.88652 69.93367 73.00120 66.98732 72.99954
 [9] 67.06828 69.97925

Exercise

  • Add a column to the murders dataset with the murder rate.

  • Use murders per 100,000 persons as the unit.

Most Arithmetic functions vectorize

  • Most arithmetic functions work on vectors.
x <- 1:10
sqrt(x)
 [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
 [9] 3.000000 3.162278
log(x)
 [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
 [8] 2.0794415 2.1972246 2.3025851
2^x
 [1]    2    4    8   16   32   64  128  256  512 1024

Compute standard units using scale

scale(heights)
             [,1]
 [1,]  0.08995503
 [2,] -2.00899575
 [3,] -0.80959530
 [4,]  0.38980515
 [5,]  0.38980515
 [6,]  1.28935548
 [7,] -0.50974519
 [8,]  1.28935548
 [9,] -0.50974519
[10,]  0.38980515
attr(,"scaled:center")
[1] 68.7
attr(,"scaled:scale")
[1] 3.335

provides the same results,

(heights - mean(heights))/sd(heights)
 [1]  0.08995503 -2.00899575 -0.80959530  0.38980515  0.38980515  1.28935548
 [7] -0.50974519  1.28935548 -0.50974519  0.38980515

scale coerces to a column matrix

class(scale(heights))
[1] "matrix" "array" 

Most functions vectorize

v <- c(3, 7, 10, -2)
any(v < 0) # TRUE   (there is a negative)
[1] TRUE
all(v > 0) # FALSE  (not all are positive)
[1] FALSE
x <- c(1, 3, 5)
if (all(x %% 2 == 1)) {
  print("All numbers are odd")
}
[1] "All numbers are odd"
if (!is.null(x) && length(x) > 0 && all(x > 0)) {
  print("x is a non-empty positive vector")
}
[1] "x is a non-empty positive vector"
a <- c(0, 1, 2, -4, 5)
ifelse(a > 0, 1/a, NA) # ifelse: vectorized function
[1]  NA 1.0 0.5  NA 0.2

Subsetting using alogical vector

  • Indexing with a logical vector subsets to the set corresponding to the TRUE values
library(dslabs)
ind <- murders$population < 10^6
  • subset a vector using this logical vector for indexing:
murders$state[ind]
[1] "Alaska"               "Delaware"             "District of Columbia"
[4] "Montana"              "North Dakota"         "South Dakota"        
[7] "Vermont"              "Wyoming"             

Indexing by a logical vector

  • You can also use vectorization to apply logical operators:
ind <- murders$population < 10^6 & murders$region == "West"
murders$state[ind]
[1] "Alaska"  "Montana" "Wyoming"

split

  • split(ind, f): split ind into groups defined by factor f:
inds <- with(murders, split(seq_along(region), region))
str(inds) # a list of row indices, one per region
List of 4
 $ Northeast    : int [1:9] 7 20 22 30 31 33 39 40 46
 $ South        : int [1:17] 1 4 8 9 10 11 18 19 21 25 ...
 $ North Central: int [1:12] 14 15 16 17 23 24 26 28 35 36 ...
 $ West         : int [1:13] 2 3 5 6 12 13 27 29 32 38 ...
murders$state[inds$West] # states where region=="West"
 [1] "Alaska"     "Arizona"    "California" "Colorado"   "Hawaii"    
 [6] "Idaho"      "Montana"    "Nevada"     "New Mexico" "Oregon"    
[11] "Utah"       "Washington" "Wyoming"   

Functions for subsetting

  • The functions which, match and the operator %in% are useful for sub-setting

which

x <- c(10, 5, 8, 5)

which(x == 5) # [1] 2 4 return all indices where the condtion is true
[1] 2 4
x==5 
[1] FALSE  TRUE FALSE  TRUE
ind <- which(murders$state == "California")
ind
[1] 5
murders[ind,]
       state abb region population total
5 California  CA   West   37253956  1257

match

x <- c("b", "a", "c")
y <- c("a", "b", "d")

match(x, y) # [1] 2 1 NA.  returns one index per element of x, gives the first match only. 
[1]  2  1 NA
v <- c("a", "b", "a", "c")
which(v == "a") # [1] 1 3
[1] 1 3
match("a", v) # [1] 1 first postion only
[1] 1
ind <- match(c("New York", "Florida", "Texas"), murders$state)
ind
[1] 33 10 44
murders[ind,]
      state abb    region population total
33 New York  NY Northeast   19378102   517
10  Florida  FL     South   19687653   669
44    Texas  TX     South   25145561   805

Handle NA with which and match

x <- c(1, NA, 3)

which(is.na(x)) # [1] 2
[1] 2
match(NA, x) # [1] 2    
[1] 2

%in%

ind <- which(murders$state %in% c("New York", "Florida", "Texas"))
ind
[1] 10 33 44
murders[ind,]
      state abb    region population total
10  Florida  FL     South   19687653   669
33 New York  NY Northeast   19378102   517
44    Texas  TX     South   25145561   805
  • Note this is similar to using match.

  • But note the order is different.

match versus %in%

c("Boston", "Dakota", "Washington") %in% murders$state
[1] FALSE FALSE  TRUE
match(c("Boston", "Dakota", "Washington"), murders$state)
[1] NA NA 48
match(murders$state, c("Boston", "Dakota", "Washington"))
 [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA  3 NA NA
[51] NA
  • there are also set manipulation functions intersect, union, setdiff, setequal

The apply functions

Function Apply over… Typical input Output shape
lapply(X, FUN) list/vector elements list or atomic vector list
sapply(X, FUN) same as lapply list or atomic vector tries to simplify (vector/matrix)
apply(X, MARGIN, FUN) rows/cols of an array matrix/array vector/matrix/array
tapply(X, INDEX, FUN) groups vector + factor(s) array/table-like
mapply(FUN, ...) parallel elements multiple vectors/lists like sapply, simplified

sapply

x <- 1:10
sapply(x, sqrt) #>  [1] 1.00 1.41 1.73 2.00 2.24 2.45 2.65 2.83 3.00 3.16
 [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
 [9] 3.000000 3.162278
x <- list(a = 1:3, b = 4:5)
lapply(x, sum)
$a
[1] 6

$b
[1] 9
sapply(x, sum)
a b 
6 9 

Apply Examples

 mm = matrix(1:10, nc=2, byrow=T) # rows are filled first
 mm
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8
[5,]    9   10
 apply(mm, 1, sum) # operate on row. rowSums(mm) is faster
[1]  3  7 11 15 19
 apply(mm, 2, \(x) x^2) #> operate on column; `\(x) x^2: anonymous function
     [,1] [,2]
[1,]    1    4
[2,]    9   16
[3,]   25   36
[4,]   49   64
[5,]   81  100
 # mm^2  # is faster
  • For simply operations avoid apply(). apply() is best for non-vectorized functions.

lapply/tapply

sp1 = split(murders$population, murders$region) # split(x,f): split vector x into groups by f.
# sp1 is a list
lapply(sp1, sum) #apply sum to each list element; obtain a list
$Northeast
[1] 55317240

$South
[1] 115674434

$`North Central`
[1] 66927001

$West
[1] 71945553
  • alternatively:
with(murders, tapply(population, region, sum)) #result is an array
    Northeast         South North Central          West 
     55317240     115674434      66927001      71945553 

tapply

  • Great for group summaries (base R version of group_by)
v <- c(5, 7, 4, 9, 6)
g <- factor(c("A","A","B","B","B"))
tapply(v, g, mean)
       A        B 
6.000000 6.333333 
  • Multiple group variables
g2 <- factor(c("M","F","M","F","M"))
tapply(v, list(g, g2), mean)
  F M
A 7 5
B 9 5

mapply()

  • multivariate/parallel apply (map over several inputs)
mapply(rep, 1:3, 1:3) # rep(x,times)
[[1]]
[1] 1

[[2]]
[1] 2 2

[[3]]
[1] 3 3 3
# equivilent to
lapply(1:3, function(i) rep(i, i))
[[1]]
[1] 1

[[2]]
[1] 2 2

[[3]]
[1] 3 3 3
mapply(function(a, b) a + b^2, 1:4, 1:4) # 2 6 12 20
[1]  2  6 12 20
# apply the function to(a,b)= (1,1), (2,2), (3,3),(4,4)