Chapter 3: Linear regression: the basics

Author

Zane Billings

Code
# Load the arm package (used below for display(), sim(), and sigma.hat())
box::use(
    arm
)

Exercises

Q1

The folder pyth contains outcome \(y\) and inputs \(x_1, x_2\) for \(40\) data points, with a further \(20\) points with the inputs but no observed outcome. Save the file to your working directory and read it into R using the read.table() function.

Well, my first problem here was that I downloaded the data from the textbook website and the pyth folder wasn't in it. The file is on the website, just not in the zip archive. Anyway.

Code
pyth <- read.table(
    here::here("arm", "data", "pyth", "exercise2.1.dat"),
    header = TRUE
)

# Get only data with observed outcomes
pyth_obs <- pyth[1:40, ]

# Get only data with unobserved outcomes
pyth_unobs <- pyth[41:60, ]
  1. Use R to fit a linear regression model predicting \(y\) from \(x_1, x_2,\) using the first \(40\) data points in the file. Summarize the inferences and check the fit of your model.
Code
fit_q1a <- lm(y ~ x1 + x2, data = pyth_obs)
arm::display(fit_q1a)
lm(formula = y ~ x1 + x2, data = pyth_obs)
            coef.est coef.se
(Intercept) 1.32     0.39   
x1          0.51     0.05   
x2          0.81     0.02   
---
n = 40, k = 3
residual sd = 0.90, R-Squared = 0.97

From the model output, all of the coefficient estimates are more than two standard errors away from zero, so they would be statistically significant at the \(0.05\) level. The \(R^2\) of \(0.97\) indicates that the model explains nearly all of the variation in \(y\). In addition, the mean of the observed \(y\) values is \(\bar{y} = 13.59\), which is large compared to the residual SD of \(0.9\), another sign that the model fits the data well.
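Both quantities are easy to check directly; arm::sigma.hat() extracts the residual SD from the fitted model (it is the same helper used for the residual plot below).

Code
# Compare the scale of the outcome to the residual SD
mean(pyth_obs$y)
arm::sigma.hat(fit_q1a)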

  2. Display the estimated model graphically as in Figure 3.2.
Code
set.seed(370)
# Point estimates and 100 simulated coefficient draws for the uncertainty bands
beta_hat <- coef(fit_q1a)
beta_sim <- arm::sim(fit_q1a)@coef
# Two side-by-side panels
layout(matrix(c(1, 2), nrow = 1))
# First panel: y vs x1, holding x2 at its mean
plot(
    pyth_obs$x1, pyth_obs$y,
    xlab = "x1", ylab = "y"
)
for (i in 1:100) {
    curve(
        cbind(1, x, mean(pyth_obs$x2)) %*% beta_sim[i, ],
        lwd = 0.5, col = adjustcolor("gray", alpha.f = 0.5),
        add = TRUE, n = 1000L
    )
}
curve(
    cbind(1, x, mean(pyth_obs$x2)) %*% beta_hat,
    lwd = 1, col = "black",
    add = TRUE, n = 1000L
)

# Second panel: y vs x2, holding x1 at its mean
plot(
    pyth_obs$x2, pyth_obs$y,
    xlab = "x2", ylab = "y"
)
for (i in 1:100) {
    curve(
        cbind(1, mean(pyth_obs$x1), x) %*% beta_sim[i, ],
        lwd = 0.5, col = adjustcolor("gray", alpha.f = 0.5),
        add = TRUE, n = 1000L
    )
}
curve(
    cbind(1, mean(pyth_obs$x1), x) %*% beta_hat,
    lwd = 1, col = "black",
    add = TRUE, n = 1000L
)

Code
layout(1)
  3. Make a residual plot for this model. Do the assumptions appear to be met?
Code
plot(
    fitted.values(fit_q1a), residuals(fit_q1a),
    xlab = "Fitted values", ylab = "Residuals",
    ylim = c(-3, 3)
)
abline(h = 0, lty = 2)
# Dotted reference lines at +/- 1 residual SD
abline(h = arm::sigma.hat(fit_q1a), col = "darkgray", lty = 3)
abline(h = -arm::sigma.hat(fit_q1a), col = "darkgray", lty = 3)

The residuals are not evenly distributed above and below zero, as they should be under the model assumptions. Underpredictions tend to be smaller in magnitude than overpredictions, which suggests that at least one of the additivity, linearity, or equal-variance assumptions is violated.
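One quick way to probe this further (a sketch beyond what the exercise asks for) is to plot the residuals against each predictor separately, to see whether the asymmetry tracks x1 or x2:

Code
layout(matrix(c(1, 2), nrow = 1))
plot(
    pyth_obs$x1, residuals(fit_q1a),
    xlab = "x1", ylab = "Residuals"
)
abline(h = 0, lty = 2)
plot(
    pyth_obs$x2, residuals(fit_q1a),
    xlab = "x2", ylab = "Residuals"
)
abline(h = 0, lty = 2)
layout(1)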

  4. Make predictions for the remaining 20 data points in the file. How confident do you feel about these predictions?
Code
layout(matrix(c(1, 2), nrow = 1))
pyth_pred <- predict(fit_q1a, newdata = pyth_unobs, interval = "prediction")
# x1 plot: observed data, prediction intervals, and fitted line
plot(
    NULL, NULL,
    xlim = c(0, 10), ylim = c(3, 22),
    xlab = "x1", ylab = "y"
)
points(
    pyth_obs$x1, pyth_obs$y,
    pch = 1, col = "black"
)
segments(
    pyth_unobs$x1,
    pyth_pred[, 2],
    pyth_unobs$x1,
    pyth_pred[, 3],
    col = "firebrick4"
)
points(
    pyth_unobs$x1, pyth_pred[, 1],
    pch = 21, col = "firebrick4", bg = "white"
)
for (i in 1:100) {
    curve(
        cbind(1, x, mean(pyth_obs$x2)) %*% beta_sim[i, ],
        lwd = 0.5, col = adjustcolor("gray", alpha.f = 0.5),
        add = TRUE, n = 1000L
    )
}
curve(
    cbind(1, x, mean(pyth_obs$x2)) %*% beta_hat,
    lwd = 1, col = "black",
    add = TRUE, n = 1000L
)

# x2 plot: observed data, prediction intervals, and fitted line
plot(
    NULL, NULL,
    xlim = c(0, 20), ylim = c(3, 22),
    xlab = "x2", ylab = "y"
)
points(
    pyth_obs$x2, pyth_obs$y,
    pch = 1, col = "black"
)
segments(
    pyth_unobs$x2,
    pyth_pred[, 2],
    pyth_unobs$x2,
    pyth_pred[, 3],
    col = "firebrick4"
)
points(
    pyth_unobs$x2, pyth_pred[, 1],
    pch = 21, col = "firebrick4", bg = "white"
)
for (i in 1:100) {
    curve(
        cbind(1, mean(pyth_obs$x1), x) %*% beta_sim[i, ],
        lwd = 0.5, col = adjustcolor("gray", alpha.f = 0.5),
        add = TRUE, n = 1000L
    )
}
curve(
    cbind(1, mean(pyth_obs$x1), x) %*% beta_hat,
    lwd = 1, col = "black",
    add = TRUE, n = 1000L
)

From the plot, we can see that the new predictions are fairly similar to the original data, and none of them look extreme. So these predictions are probably OK, as long as the new points come from the same data-generating process as the observed outcomes. Fixing whatever is causing the asymmetric residuals would give us some more peace of mind, though.
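We can also quantify the uncertainty directly from the widths of the 95% prediction intervals computed above (a quick check using only the pyth_pred object we already have):

Code
# Width of each 95% prediction interval
pred_width <- pyth_pred[, "upr"] - pyth_pred[, "lwr"]
summary(pred_width)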

Q2

Suppose that, for a certain population, we can predict log earnings from log height as follows:

  • A person who is \(66\) inches tall is predicted to have earnings of \(\$30,000\).
  • Every increase of \(1\%\) in height corresponds to a predicted increase of \(0.8\%\) in earnings.
  • The earnings of approximately \(95\%\) of people fall within a factor of 1.1 of predicted values.
  1. Give the equation of the regression line and the residual standard deviation of the regression.

Since both earnings and height are on the log scale, the slope from the second bullet point is an elasticity: a \(1\%\) increase in height predicts a \(0.8\%\) increase in earnings, so

\[\hat{\beta} = \frac{0.8\%}{1\%} = 0.8.\]

Then, using the information from the first bullet point, and remembering that both sides of the regression are on the log scale, we get

\[ \log(30{,}000) = \hat{\alpha} + 0.8 \log(66) \implies \hat{\alpha} = \log(30{,}000) - 0.8 \log(66) \approx 10.31 - 3.35 = 6.96, \]

so the regression line is \(\log(\text{earnings}) = 6.96 + 0.8 \log(\text{height})\).
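The third bullet point gives the residual standard deviation. If roughly \(95\%\) of earnings fall within a factor of \(1.1\) of the predicted values, then on the log scale roughly \(95\%\) of residuals fall within \(\pm\log(1.1)\), which corresponds to \(\pm 2\hat{\sigma}\):

\[ 2\hat{\sigma} \approx \log(1.1) \approx 0.095 \implies \hat{\sigma} \approx 0.05. \]

A quick arithmetic check in R (no new data, just the numbers above):

Code
log(30000) - 0.8 * log(66)  # intercept: approximately 6.96
log(1.1) / 2                # residual SD: approximately 0.048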