Module 5: FunctionalProgramming

Learning goals

  1. Define functional programming
  2. Use functional programming tools in base R to repeat multiple functional calls.
  3. Use the purrr package for easily mapping and reducing functions.

What is functional programming?

  • All of our code is functional if it does its job?
  • Yes, but we are using the technical definition of functional programming: programming that is based on functions.
  • Specifically, we want to be able to compose functions, which you might remember hating in precalculus.

\[g \circ f = g(f(x))\]

  • In computer science, a functional is a function that accepts a function as an argument.

Functional vs. imperative programming

  • So far, we’ve been using imperative programming: we save variables, and we update them with new commands. Our code is structured as a list of instructions.

  • In functional programming, we write everything as a function, and we get the results we want by composing many functions.

  • But we’re just going to show you some useful parts of functional programming that you can include in your regular code.

Ok, but why?

  • Recall our function from the previous Module.
get_country_stats <- function(df, iso3_code){
    
    country_data <- subset(df, iso3c == iso3_code)
    
    # Get the summary statistics for this country
    country_cases <- country_data$measles_cases
    country_quart <- quantile(
        country_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)
    )
    country_range <- range(country_cases, na.rm = TRUE)
    
    country_name <- unique(country_data$country)
    
    country_summary <- data.frame(
        country = country_name,
        min = country_range[[1]],
        Q1 = country_quart[[1]],
        median = country_quart[[2]],
        Q3 = country_quart[[3]],
        max = country_range[[2]]
    )
    
    return(country_summary)
}
  • We could write a loop to get stats for many countries.
meas <- readRDS(here::here("data", "measles_final.Rds"))
country_codes <- c("IND", "PAK", "BGD", "NPL")

# Loop setup
out <- vector(mode = "list", length = length(country_codes))
for (i in 1:length(out)) {
    out[[i]] <- get_country_stats(meas, country_codes[[i]]) 
}

out
[[1]]
  country  min      Q1  median       Q3    max
1   India 3305 31135.5 47109.5 80797.25 252940

[[2]]
   country min     Q1 median    Q3   max
1 Pakistan 386 2065.5 4075.5 17422 55543

[[3]]
     country min      Q1 median     Q3   max
1 Bangladesh 203 2193.75 5270.5 9889.5 27327

[[4]]
  country min  Q1 median   Q3   max
1   Nepal  59 190   1268 3100 13344
  • But if we use a functional programming tool called lapply, look how easy it is!
out2 <- lapply(country_codes, function(c) get_country_stats(meas, c))
out2
[[1]]
  country  min      Q1  median       Q3    max
1   India 3305 31135.5 47109.5 80797.25 252940

[[2]]
   country min     Q1 median    Q3   max
1 Pakistan 386 2065.5 4075.5 17422 55543

[[3]]
     country min      Q1 median     Q3   max
1 Bangladesh 203 2193.75 5270.5 9889.5 27327

[[4]]
  country min  Q1 median   Q3   max
1   Nepal  59 190   1268 3100 13344
  • Functional programming techniques can help us avoid writing messy loops and clean up our code.

Ok, but why?

  • R is a functional programming language at its core.
  • In R, functions are objects like everything else.
  • You never have to use FP, but it can help you write neater code.
  • More information: https://adv-r.hadley.nz/fp.html.

Anonymous functions

  • In this code, you can see the keyword function.
out2 <- lapply(country_codes, function(c) get_country_stats(meas, c))
  • The function function(c) get_country_stats(meas, c) is called an anonymous function (or sometimes a lambda).
  • It’s anonymous because it has no name assigned – this is just a shortcut for when you only need a function once.
  • In R, you can also write \(c) ... as a shortcut for writing function(c) ... as an anonymous function.
  • This code works just as well, but can be a bit of a hassle.
get_country_stats_from_meas <- function(c) get_country_stats(meas, c)
out3 <- lapply(country_codes, get_country_stats_from_meas)

*apply() and friends

  • R includes many functional programming tools collectively called *apply(). Here are the specific ones that are most useful:
  • lapply(): repeat the function on a list of things.
  • sapply(): same as lapply() but try to simplify the output to a matrix or vector.
  • apply(): repeat the function over the rows (margin = 1) or columns (margin = 2) of a matrix.
  • tapply(): repeat the function over combinations of grouping factors.

A tapply() example

  • tapply(): repeat the function over combinations of grouping factors.
  • We want to get the average vaccine coverage for each country in the measles dataset – we need to separate this by the two vaccines as well.
  • This is pretty easy to do with tapply().
meas <- readRDS(here::here("data", "measles_final.Rds"))
meas_long <- meas |>
    tidyr::pivot_longer(
        dplyr::starts_with("MCV"),
        names_to = "vaccine_antigen",
        values_to = "vaccine_coverage"
    )
out <- tapply(
    meas_long$vaccine_coverage,
    list(meas_long$iso3c, meas_long$vaccine_antigen),
    FUN = mean
)
head(out)
MCV1_coverage MCV2_coverage
ABW NA NA
AFG NA NA
AGO NA NA
AIA NA NA
ALB NA NA
AND NA NA
  • Uh oh.
  • If we want to handle the missing values, we can use an anonymous function.
out <- tapply(
    meas_long$vaccine_coverage,
    list(meas_long$iso3c, meas_long$vaccine_antigen),
    FUN = \(x) mean(x, na.rm = TRUE)
)
head(out)
MCV1_coverage MCV2_coverage
ABW NaN NaN
AFG 41.80952 31.73684
AGO 47.17500 22.75000
AIA NaN NaN
ALB 93.30233 96.00000
AND 96.53846 89.56250

You try it with lapply()!

  • Read in all sheets of the QCRC_FINAL_Deidentified.xlsx data using a vector of sheet names and lapply.
  • Hint:
data_file <- here::here("data", "QCRC_FINAL_Deidentified.xlsx")
sheet_names <- readxl::excel_sheets(data_file)
sheet_list <- lapply(
    sheet_names,
    \(name) ...
)
data_file <- here::here("data", "QCRC_FINAL_Deidentified.xlsx")
sheet_names <- readxl::excel_sheets(data_file)
sheet_list <- lapply(
    sheet_names,
    \(name) readxl::read_excel(data_file, sheet = name)
)
Warning: Expecting numeric in D6716 / R6716C4: got 'Date\Time Correction'
Warning: Expecting numeric in G7843 / R7843C7: got 'Date\Time Correction'
Warning: Expecting numeric in D7844 / R7844C4: got 'Date\Time Correction'
Warning: Expecting numeric in E7845 / R7845C5: got 'Date\Time Correction'
Warning: Expecting numeric in G10671 / R10671C7: got 'Date\Time Correction'
Warning: Expecting numeric in D10672 / R10672C4: got 'Date\Time Correction'
Warning: Expecting numeric in E10673 / R10673C5: got 'Date\Time Correction'
Warning: Expecting numeric in G11539 / R11539C7: got 'Date\Time Correction'
Warning: Expecting numeric in D11540 / R11540C4: got 'Date\Time Correction'
Warning: Expecting numeric in G17408 / R17408C7: got 'Date\Time Correction'
Warning: Expecting numeric in D17409 / R17409C4: got 'Date\Time Correction'
Warning: Expecting numeric in E17410 / R17410C5: got 'Date\Time Correction'
str(sheet_list, 1)
List of 10
 $ : tibble [288 × 37] (S3: tbl_df/tbl/data.frame)
 $ : tibble [42,606 × 6] (S3: tbl_df/tbl/data.frame)
 $ : tibble [288 × 8] (S3: tbl_df/tbl/data.frame)
 $ : tibble [15,919 × 29] (S3: tbl_df/tbl/data.frame)
 $ : tibble [143,220 × 11] (S3: tbl_df/tbl/data.frame)
 $ : tibble [18,281 × 11] (S3: tbl_df/tbl/data.frame)
 $ : tibble [5,252 × 5] (S3: tbl_df/tbl/data.frame)
 $ : tibble [869,554 × 7] (S3: tbl_df/tbl/data.frame)
 $ : tibble [288 × 71] (S3: tbl_df/tbl/data.frame)
 $ : tibble [3,612 × 36] (S3: tbl_df/tbl/data.frame)

An apply() example

  • You’ll use apply() less often because it’s specifically for matrices, where all data types have to be the same.
  • First, make a matrix from the QCRC main dataset that only has the columns Died, 30D_Mortality, and 60D_Mortality.
d_matrix <- sheet_list[[1]] |>
    dplyr::select(Died, `30D_Mortality`, `60D_Mortality`) |>
    as.matrix()

head(d_matrix)
Died 30D_Mortality 60D_Mortality
1 1 1
0 0 0
0 0 0
1 1 1
1 1 1
1 1 1
  • Next use apply with the correct choice of MARGIN and FUN to get the proportion of cases that resulted in death at all three time points in one function call.
  • Hint:
apply(
    d_matrix,
    MARGIN = (1 or 2),
    FUN = some_function
)
  • Another hint: you can tell if you chose the correct MARGIN argument by counting the number of elements in the output. There should be 3, one for each column.
  • Solution:
apply(
    d_matrix,
    MARGIN = 2,
    FUN = mean
)
         Died 30D_Mortality 60D_Mortality 
    0.3402778     0.3263889     0.3333333 

You try it!

  • Load the measles dataset. Filter the dataset so you only have records from 2005, and only have complete cases.
  • Then, use split to split the dataframe into a list by region.
  • Now, use lapply() to fit a Poisson glm for each region separately that includes effects of MCV1 and MCV2 coverage.
  • Find a way, using lapply() or sapply(), coef(), and do.call() to make a nice matrix of the coefficients.
  • Solution for data processing
meas <- readr::read_rds(here::here("data", "measles_final.Rds"))
meas_2005 <- meas |>
    dplyr::filter(year == 2005) |>
    tidyr::drop_na()

meas_regions <- split(meas_2005, meas_2005$region)
  • Hint
model_list <- lapply(
    meas_regions,
    \(d) glm(...)
)
model_coefs_list <- lapply(model_list, ...)
model_coefs_mat <- do.call(...)
  • Solution
model_list <- lapply(
    meas_regions,
    \(d) glm(
        measles_cases ~ MCV1_coverage + MCV2_coverage,
        data = d, family = "poisson"
    )
)
model_coefs_list <- lapply(model_list, coef)
model_coefs_mat <- do.call(rbind, model_coefs_list)

model_coefs_mat
(Intercept) MCV1_coverage MCV2_coverage
Africa 17.676171 -0.2813935 0.1419766
Americas 7.278677 -0.0790399 0.0169828
Asia 11.417256 -0.0440387 0.0143116
Europe -12.027052 0.1838390 0.0008544
Oceania -8.355161 0.1176879 -0.0257241

Summary

  • Functional programming tools like *apply() take a function as an input and use the same function multiple times.
  • In R, functions are objects like everything else and manipulating them like objects can help us write readable, fast code.
  • Side note: purrr is a modern version of *apply() with a more consistent interface. It’s worth learning and covered in the Advanced R book, but most people still use *apply().
  • This book also covers more standard FP tools like reduction and filtering that we didn’t have time to talk about.