box::use(
  arm
)
Zane Billings
The folder pyth contains outcome \(y\) and inputs \(x_1, x_2\) for \(40\) data points, with a further \(20\) points with the inputs but no observed outcome. Save the file to your working directory and read it into R using the read.table() function.
Well, my first problem here was that I downloaded the data from the textbook website and the pyth
folder wasn’t in there. It’s on the website, just not in the zip folder. Anyways.
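For reference, here is a minimal sketch of reading in and splitting the data. The file name pyth.txt and the header row are assumptions; the object names pyth_obs and pyth_unobs are the ones used in the code below.

# Read the data (assumed file name and layout: 60 rows of y, x1, x2,
# with y unobserved for the last 20 rows)
pyth <- read.table("pyth.txt", header = TRUE)
pyth_obs <- pyth[1:40, ]    # first 40 points: outcome observed
pyth_unobs <- pyth[41:60, ] # last 20 points: inputs only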
- Use R to fit a linear regression model predicting \(y\) from \(x_1, x_2\), using the first \(40\) data points in the file. Summarize the inferences and check the fit of your model.
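The summary below comes from a call along these lines (a sketch; fit_q1a is the model object used throughout, and arm::display() produces output in this format):

fit_q1a <- lm(y ~ x1 + x2, data = pyth_obs)
arm::display(fit_q1a)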
lm(formula = y ~ x1 + x2, data = pyth_obs)
coef.est coef.se
(Intercept) 1.32 0.39
x1 0.51 0.05
x2 0.81 0.02
---
n = 40, k = 3
residual sd = 0.90, R-Squared = 0.97
From the model output, it looks like all of the coefficients are statistically significant (at a \(0.05\) significance level). The \(R^2\) of \(0.97\) indicates that the model explains almost all of the variation in \(y\). The mean of the observed \(y\) values is \(\bar{y} = 13.59\), which is quite large compared to the residual SD of \(0.9\), another sign that the model fits the data well.
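For reference, both comparison numbers can be pulled straight from the data and the fit (a quick sketch, assuming the pyth_obs data frame and fit_q1a object from above):

mean(pyth_obs$y)  # mean of the observed outcomes, about 13.59
sigma(fit_q1a)    # residual standard deviation of the fit, about 0.9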
- Display the estimated model graphically as in Figure 3.2.
set.seed(370)

# Point estimates and simulated coefficient draws (to show uncertainty in the fit)
beta_hat <- coef(fit_q1a)
beta_sim <- arm::sim(fit_q1a)@coef

# Two panels side by side
layout(matrix(c(1, 2), nrow = 1))

# Left panel: y vs x1, with x2 held at its mean
plot(
  pyth_obs$x1, pyth_obs$y,
  xlab = "x1", ylab = "y"
)
for (i in 1:100) {
  curve(
    cbind(1, x, mean(pyth_obs$x2)) %*% beta_sim[i, ],
    lwd = 0.5, col = adjustcolor("gray", alpha.f = 0.5),
    add = TRUE, n = 1000L
  )
}
curve(
  cbind(1, x, mean(pyth_obs$x2)) %*% beta_hat,
  lwd = 1, col = "black",
  add = TRUE, n = 1000L
)

# Right panel: y vs x2, with x1 held at its mean
plot(
  pyth_obs$x2, pyth_obs$y,
  xlab = "x2", ylab = "y"
)
for (i in 1:100) {
  curve(
    cbind(1, mean(pyth_obs$x1), x) %*% beta_sim[i, ],
    lwd = 0.5, col = adjustcolor("gray", alpha.f = 0.5),
    add = TRUE, n = 1000L
  )
}
curve(
  cbind(1, mean(pyth_obs$x1), x) %*% beta_hat,
  lwd = 1, col = "black",
  add = TRUE, n = 1000L
)
- Make a residual plot for this model. Do the assumptions appear to be met?
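The residual plot itself can be made with something like the following sketch (assuming the fitted model object fit_q1a from above):

plot(
  fitted(fit_q1a), resid(fit_q1a),
  xlab = "Fitted values", ylab = "Residuals"
)
abline(h = 0, lty = 2)  # reference line at zero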
The residuals are clearly not evenly distributed above and below 0, as they should be: underpredictions tend to be smaller in magnitude than overpredictions. This suggests that the additivity, linearity, or equal-variance assumptions may be violated.
- Make predictions for the remaining 20 data points in the file. How confident do you feel about these predictions?
# Two panels side by side
layout(matrix(c(1, 2), nrow = 1))

# Point predictions and prediction intervals for the 20 unobserved points
pyth_pred <- predict(fit_q1a, newdata = pyth_unobs, interval = "prediction")

# x1 plot: observed data, predictions with intervals, and fitted line (x2 at its mean)
plot(
  NULL, NULL,
  xlim = c(0, 10), ylim = c(3, 22),
  xlab = "x1", ylab = "y"
)
points(
  pyth_obs$x1, pyth_obs$y,
  pch = 1, col = "black"
)
# Prediction intervals as vertical segments
segments(
  pyth_unobs$x1,
  pyth_pred[, 2],
  pyth_unobs$x1,
  pyth_pred[, 3],
  col = "firebrick4"
)
# Point predictions
points(
  pyth_unobs$x1, pyth_pred[, 1],
  pch = 21, col = "firebrick4", bg = "white"
)
# Simulated regression lines to show uncertainty in the fit
for (i in 1:100) {
  curve(
    cbind(1, x, mean(pyth_obs$x2)) %*% beta_sim[i, ],
    lwd = 0.5, col = adjustcolor("gray", alpha.f = 0.5),
    add = TRUE, n = 1000L
  )
}
curve(
  cbind(1, x, mean(pyth_obs$x2)) %*% beta_hat,
  lwd = 1, col = "black",
  add = TRUE, n = 1000L
)

# x2 plot: same idea, with x1 held at its mean
plot(
  NULL, NULL,
  xlim = c(0, 20), ylim = c(3, 22),
  xlab = "x2", ylab = "y"
)
points(
  pyth_obs$x2, pyth_obs$y,
  pch = 1, col = "black"
)
segments(
  pyth_unobs$x2,
  pyth_pred[, 2],
  pyth_unobs$x2,
  pyth_pred[, 3],
  col = "firebrick4"
)
points(
  pyth_unobs$x2, pyth_pred[, 1],
  pch = 21, col = "firebrick4", bg = "white"
)
for (i in 1:100) {
  curve(
    cbind(1, mean(pyth_obs$x1), x) %*% beta_sim[i, ],
    lwd = 0.5, col = adjustcolor("gray", alpha.f = 0.5),
    add = TRUE, n = 1000L
  )
}
curve(
  cbind(1, mean(pyth_obs$x1), x) %*% beta_hat,
  lwd = 1, col = "black",
  add = TRUE, n = 1000L
)
From the plot, we can see that the predictions for the new points are broadly consistent with the original data, and none of them look unreasonable. So, given the mostly good diagnostics, these predictions are probably OK, as long as we assume the new points came from the same data-generating process as the observed outcomes. Tracking down whatever causes the somewhat strange residual pattern would give us a bit more peace of mind, though.
Suppose that, for a certain population, we can predict log earnings from log height as follows:
- Give the equation of the regression line and the residual standard deviation of the regression.
From the second bullet point, and the fact that the slope of a log-log regression is (approximately) the ratio of the corresponding percentage changes, we get
\[\hat{\beta} = \frac{0.8\%}{0.1\%} = 8.\]
Then, using the information from the first bullet point, and remembering that the regression is on the log scale (so the \$30,000 prediction enters as \(\log(30{,}000)\)), we get
\[ \log(30{,}000) = \hat{\alpha} + \hat{\beta} \log(66) \implies \hat{\alpha} = \log(30{,}000) - \hat{\beta} \log(66). \]