http://rafalab.dfci.harvard.edu/dsbook-part-1/R/programming-basics.html#sec-vectorization
We will be using the murders dataset in the dslabs package.
It includes data on 2010 gun murders for the 50 US states and DC.
We will answer questions such as “Which state has the lowest crime rate in the Western part of the US?”
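Comparing states of very different sizes calls for a rate rather than a raw count. The sketch below is only an illustration: it rebuilds a three-row stand-in (the name toy is hypothetical) from the Florida, New York and Texas rows printed further down in these notes, so it does not need the dslabs package, and it computes murders per 100,000 people.

```r
# Three rows copied from the data frame output printed later in these notes;
# the real murders table in dslabs has one row per state plus DC.
toy <- data.frame(
  state      = c("Florida", "New York", "Texas"),
  region     = c("South", "Northeast", "South"),
  population = c(19687653, 19378102, 25145561),
  total      = c(669, 517, 805)
)

# Murders per 100,000 people: the arithmetic is applied to whole columns at once
toy$rate <- toy$total / toy$population * 1e5

# State with the lowest rate among these three rows only
lowest <- toy$state[which.min(toy$rate)]
lowest
```

With the full dataset the same two steps apply, with murders in place of toy.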
We can subtract a constant from each element of a vector.
This is convenient for computing residuals or deviations from an average:
[1] 0.08995503 -2.00899575 -0.80959530 0.38980515 0.38980515 1.28935548
[7] -0.50974519 1.28935548 -0.50974519 0.38980515
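The code that produced the output above is not shown in these notes. As a sketch, the printed values are consistent with standardizing ten heights in inches; the vector below is inferred from the output (together with the center 68.7 and scale 3.335 reported further down), so treat it as an assumption.

```r
# Heights in inches: inferred from the printed output, not given in the notes
inches <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)

# Subtracting a constant (the mean) from every element gives the deviations
deviations <- inches - mean(inches)

# Dividing every element by a constant (the standard deviation) standardizes them
z <- deviations / sd(inches)
z
```

Both operations are element-wise, so no loop is needed.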
R has a function, scale, that does this.
Add a column to the murders dataset with the murder rate.
Use murders per 100,000 persons as the unit.
      [,1]
[1,] 0.08995503
[2,] -2.00899575
[3,] -0.80959530
[4,] 0.38980515
[5,] 0.38980515
[6,] 1.28935548
[7,] -0.50974519
[8,] 1.28935548
[9,] -0.50974519
[10,] 0.38980515
attr(,"scaled:center")
[1] 68.7
attr(,"scaled:scale")
[1] 3.335
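The matrix output above has the shape that scale() returns for a plain numeric vector. A minimal sketch of the comparison with the manual calculation follows; the inches vector is inferred from the printed values and is an assumption, not something given in the notes.

```r
# Heights in inches: inferred from the printed output (assumption)
inches <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)

z_manual <- (inches - mean(inches)) / sd(inches)  # plain numeric vector
z_scale  <- scale(inches)                         # one-column matrix with attributes

is.matrix(z_scale)                       # TRUE
dim(z_scale)                             # 10 rows, 1 column

# as.vector() drops the dim and scaled:* attributes before comparing
same <- all.equal(as.vector(z_scale), z_manual)
same

# The centering and scaling constants are kept as attributes
attr(z_scale, "scaled:center")
attr(z_scale, "scaled:scale")
```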
Computing the standardized values by hand provides the same results, but scale coerces its output to a column matrix.
[1] TRUE
[1] FALSE
[1] "All numbers are odd"
[1] "x is a non-empty positive vector"
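The code behind the messages above is not included in these notes. One way such output could arise is from conditionals built on any() and all(); the input x and the else message below are hypothetical, chosen only for illustration.

```r
x <- c(1, 3, 5)   # hypothetical input, not from the notes

any(x %% 2 == 0)  # is at least one element even?
all(x > 0)        # are all elements positive?

# if () needs a single TRUE/FALSE, so the element-wise test is collapsed with all()
if (all(x %% 2 == 1)) {
  msg_odd <- "All numbers are odd"
} else {
  msg_odd <- "At least one number is even"  # hypothetical wording
}
print(msg_odd)

# && short-circuits: all(x > 0) is only evaluated when x is non-empty
is_pos <- length(x) > 0 && all(x > 0)
if (is_pos) {
  print("x is a non-empty positive vector")
}
```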
TRUE values
split
split(ind, f): split ind into groups defined by factor f:
inds <- with(murders, split(seq_along(region), region))
str(inds) # a list of row indices, one per region
List of 4
$ Northeast : int [1:9] 7 20 22 30 31 33 39 40 46
$ South : int [1:17] 1 4 8 9 10 11 18 19 21 25 ...
$ North Central: int [1:12] 14 15 16 17 23 24 26 28 35 36 ...
$ West : int [1:13] 2 3 5 6 12 13 27 29 32 38 ...
[1] "Alaska" "Arizona" "California" "Colorado" "Hawaii"
[6] "Idaho" "Montana" "Nevada" "New Mexico" "Oregon"
[11] "Utah" "Washington" "Wyoming"
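To see how the index list turns into the state names above without loading dslabs, here is a small self-contained sketch. It uses only the first five states in alphabetical order, with regions read off the index lists printed above; the five-row cut is an assumption made for illustration.

```r
# First five states; regions read off the str(inds) output above
state  <- c("Alabama", "Alaska", "Arizona", "Arkansas", "California")
region <- factor(c("South", "West", "West", "South", "West"))

# One integer vector of positions per level of the factor
inds <- split(seq_along(region), region)
inds$West          # 2 3 5

# Use the positions to subset any parallel vector
west_states <- state[inds$West]
west_states        # "Alaska" "Arizona" "California"
```

Splitting positions rather than values means the same inds can be reused to subset several columns.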
which, match, and the operator %in% are useful for subsetting.
[1] 2 4
[1] FALSE TRUE FALSE TRUE
x <- c("b", "a", "c")
y <- c("a", "b", "d")
match(x, y) # returns one index per element of x, giving the first match only
[1] 2 1 NA
[1] 1 3
[1] 1
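As a compact side-by-side of the three tools, the sketch below reuses the x and y vectors defined above. It is illustrative only and does not try to reproduce every output line in this section.

```r
x <- c("b", "a", "c")
y <- c("a", "b", "d")

in_y <- x %in% y        # logical, one value per element of x: TRUE TRUE FALSE
pos  <- which(in_y)     # positions in x that are TRUE: 1 2
idx  <- match(x, y)     # position in y of each element of x: 2 1 NA

x[in_y]                 # elements of x found in y: "b" "a"
y[idx]                  # y reordered to line up with x: "b" "a" NA
```

%in% answers "is it there?", which converts that answer into positions, and match answers "where is it?".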
[1] 10 33 44
state abb region population total
10 Florida FL South 19687653 669
33 New York NY Northeast 19378102 517
44 Texas TX South 25145561 805
Note that this is similar to using match, but the order is different.
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 3 NA NA
[51] NA
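One way the order can differ is sketched below on a small stand-in vector (states and wanted are hypothetical names, and the four-element vector is an assumption in place of murders$state). which() with %in% reports positions in the order of the data, match() reports them in the order of the lookup values, and swapping the arguments of match() yields a vector as long as the data that is mostly NA.

```r
states <- c("Alabama", "Florida", "New York", "Texas")  # stand-in for murders$state
wanted <- c("New York", "Florida", "Texas")

by_data  <- which(states %in% wanted)  # data order: 2 3 4
by_query <- match(wanted, states)      # query order: 3 2 4
swapped  <- match(states, wanted)      # one entry per state: NA 2 1 3

by_data
by_query
swapped
```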
intersect, union, setdiff, setequal

| Function | Apply over… | Typical input | Output shape |
|---|---|---|---|
| lapply(X, FUN) | list/vector elements | list or atomic vector | list |
| sapply(X, FUN) | same as lapply | list or atomic vector | tries to simplify (vector/matrix) |
| apply(X, MARGIN, FUN) | rows/cols of an array | matrix/array | vector/matrix/array |
| tapply(X, INDEX, FUN) | groups | vector + factor(s) | array/table-like |
| mapply(FUN, ...) | parallel elements | multiple vectors/lists | like sapply, simplified |
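The output shapes in the table can be checked on tiny inputs. The sketch below uses made-up numbers chosen for illustration; none of them come from the murders data.

```r
x <- 1:3

as_list   <- lapply(x, function(i) i^2)        # always a list
as_vector <- sapply(x, function(i) i^2)        # simplified to a numeric vector
as_matrix <- sapply(x, function(i) c(i, i^2))  # length-2 results become a 2 x 3 matrix

col_sums  <- apply(matrix(1:6, nrow = 2), 2, sum)        # MARGIN = 2: one value per column
by_group  <- tapply(c(10, 20, 30), c("a", "b", "a"), sum) # one value per group
pair_sums <- mapply(function(a, b) a + b, 1:3, 4:6)      # walks both vectors in parallel

str(as_list)
as_vector
dim(as_matrix)
col_sums
by_group
pair_sums
```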
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
[9] 3.000000 3.162278
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
[4,] 7 8
[5,] 9 10
[1] 3 7 11 15 19
[,1] [,2]
[1,] 1 4
[2,] 9 16
[3,] 25 36
[4,] 49 64
[5,] 81 100
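The calls behind the outputs above are not shown in these notes. The values are consistent with calls like the following, which is a reconstruction and therefore an assumption about the original code.

```r
# Square roots of 1..10; sqrt() is vectorized, so sapply() gives the same values
roots <- sapply(1:10, sqrt)
all.equal(roots, sqrt(1:10))

# A 5 x 2 matrix filled row by row
m <- matrix(1:10, ncol = 2, byrow = TRUE)
m

# Row sums with apply (MARGIN = 1 means rows); rowSums() is the vectorized shortcut
row_totals <- apply(m, 1, sum)
row_totals
rowSums(m)

# Element-wise squaring is already vectorized for matrices
squares <- m^2
squares
```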
apply(). apply() is best for non-vectorized functions.
sp1 <- split(murders$population, murders$region) # split(x, f): split vector x into groups by f
# sp1 is a list
lapply(sp1, sum) # apply sum to each list element; obtain a list
$Northeast
[1] 55317240
$South
[1] 115674434
$`North Central`
[1] 66927001
$West
[1] 71945553
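The split-then-lapply pattern above returns a list. As a sketch of two alternatives that return a named vector or array instead, the code below uses only the three populations printed earlier in these notes (Florida, New York, Texas) as a stand-in for the full dataset, so the totals differ from the ones above.

```r
# Populations and regions for Florida, New York, Texas (from the rows printed earlier)
population <- c(19687653, 19378102, 25145561)
region     <- c("South", "Northeast", "South")

groups <- split(population, region)   # list with one numeric vector per region

as_list    <- lapply(groups, sum)     # list, as in the notes
as_vector  <- sapply(groups, sum)     # named numeric vector
via_tapply <- tapply(population, region, sum)  # split and apply in one call

as_vector
via_tapply
```

sapply() is often more convenient when each group yields a single number, since the result can be indexed or plotted directly.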
group_by)
[[1]]
[1] 1
[[2]]
[1] 2 2
[[3]]
[1] 3 3 3
[[1]]
[1] 1
[[2]]
[1] 2 2
[[3]]
[1] 3 3 3
[1] 2 6 12 20
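The calls that produced the three outputs above are not shown. They are consistent with mapply() and Map() walking two vectors in parallel, as in the sketch below; treat the exact arguments as a reconstruction rather than the original code.

```r
# rep(1, 1), rep(2, 2), rep(3, 3): results differ in length, so mapply returns a list
reps_mapply <- mapply(rep, 1:3, 1:3)
reps_mapply

# Map() is mapply() that never simplifies, so it always returns a list
reps_map <- Map(rep, 1:3, 1:3)
reps_map

# When every result has length 1, mapply simplifies to a vector: 1*2, 2*3, 3*4, 4*5
products <- mapply(function(x, y) x * y, 1:4, 2:5)
products
```

Because multiplication is already vectorized, 1:4 * 2:5 gives the same numbers; mapply() earns its keep with functions that are not vectorized over their arguments.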