On Monday, we will go over last week’s case study on estimating a demand curve for milk and choosing an optimal price. Download “milk.R” from the R Scripts tab above.
We will then discuss two advanced topics regarding prediction intervals:
- how to form prediction intervals in models that involve a transformation of the y variable. We’ll learn this in the context of the homework problem on supply and demand for milk.
- diagnosing when “simple” prediction intervals break down (in the presence of heteroskedasticity, or non-constant variance), and fixing them using quantile regression. Download “hetpred.R” from the R Scripts tab above. For those who want to read a simple (and entirely optional) overview of quantile regression, this page is pretty accessible.
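The first topic above can be illustrated with a minimal sketch. It is written in Python rather than R and is not taken from “milk.R”. It assumes a power-law model fit on the log scale, uses simulated price and quantity data (not the milk data), and forms a rough interval as the fitted value plus or minus two residual standard deviations, ignoring uncertainty in the fitted coefficients. The names `fit_line` and `predict_interval` are hypothetical. The key step is that the interval is built on the log scale and its endpoints are then exponentiated.

```python
import math
import random

random.seed(1)

# Simulated (price, quantity) pairs from an assumed power law; NOT the milk data.
prices = [1.0 + 0.1 * i for i in range(40)]
quantities = [math.exp(4.0 - 1.5 * math.log(p) + random.gauss(0, 0.2))
              for p in prices]


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual sd)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx
    a = ybar - b * xbar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sigma = math.sqrt(sum(r * r for r in resid) / (n - 2))
    return a, b, sigma


# Fit log(quantity) = a + b * log(price).
log_p = [math.log(p) for p in prices]
log_q = [math.log(q) for q in quantities]
a, b, sigma = fit_line(log_p, log_q)


def predict_interval(price, k=2.0):
    """Rough interval: fitted log value +/- k residual sds, then exponentiate."""
    center = a + b * math.log(price)
    return (math.exp(center - k * sigma),
            math.exp(center),
            math.exp(center + k * sigma))


lower, point, upper = predict_interval(2.5)
print(f"price 2.50: predicted quantity {point:.1f}, "
      f"interval ({lower:.1f}, {upper:.1f})")
```

Because the endpoints are exponentiated, the interval is symmetric on the log scale but asymmetric on the original scale: the upper endpoint sits farther from the point prediction than the lower one does.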
On Wednesday, we will start class by working on a case study anchored in the first half of Chapter 4 of the course packet, about grouping variables in regression models. The goal is to build a model to predict flaws in a manufacturing process for printed circuit boards. We’ll spend 15 minutes at the start of class on this one. See if you can reach three intermediate goals by the end of those 15 minutes:
1) three boxplots showing the effects of the three main variables on the outcome (skips).
2) at least one boxplot that can help you assess whether there ought to be an interaction in the model.
3) one model for skips, incorporating a combination of main effects and interactions that you can explain using the plots from steps 1 and 2.
Using this data set, we will spend a lot of time in class talking about main effects and interactions. This will finish off the material from Chapter 4. See the “solder.R” script above. Key topics we’ll cover include:
- interpreting main effects and interactions in a model with only grouping variables as predictors
- detecting interactions graphically (e.g. boxplots on combinations of features)
- using an ANOVA table to reason about the practical significance of a main effect or interaction in improving the fit of a model
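As a small illustration of the first topic, here is a sketch in Python (the course scripts are in R, and this is not taken from “solder.R”). It assumes two hypothetical two-level grouping variables and made-up skip counts, and reads the baseline, main effects, and interaction directly off the group means, which is what a model with both dummy variables and their product would reproduce in a 2x2 design.

```python
from statistics import mean

# Made-up skip counts for two hypothetical two-level factors; NOT the solder data.
skips = {
    ("small", "thin"): [2, 3, 4, 3],
    ("small", "thick"): [5, 6, 7, 6],
    ("large", "thin"): [4, 5, 6, 5],
    ("large", "thick"): [12, 13, 14, 13],
}

# Group mean for each combination of the two factors.
cell = {combo: mean(values) for combo, values in skips.items()}

# Baseline/offset form: everything is measured relative to one reference group.
baseline = cell[("small", "thin")]
effect_large = cell[("large", "thin")] - baseline   # main effect of size
effect_thick = cell[("small", "thick")] - baseline  # main effect of thickness

# What we would predict for (large, thick) if the two effects simply added up.
additive = baseline + effect_large + effect_thick

# The interaction is whatever the additive prediction misses.
interaction = cell[("large", "thick")] - additive

print("baseline:", baseline)
print("main effect (large):", effect_large)
print("main effect (thick):", effect_thick)
print("interaction (large and thick):", interaction)
```

A nonzero interaction term means the effect of one variable depends on the level of the other, which is the same thing a boxplot on combinations of the two variables would show visually.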
For the last 15 minutes of class, we’ll spend some time with the “TenMileRace.R” script from the R Scripts tab above. The idea here is to understand why we get a different slope on the “age” variable depending on whether we fit a single line to the whole data set or stratify the data into male and female subsets.
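The effect can be reproduced on simulated data. The sketch below is in Python, is not taken from “TenMileRace.R”, and uses made-up, noise-free numbers rather than the race data. It assumes two groups with the same within-group slope of 0.2 but different typical ages and different intercepts, and compares the stratified slopes with the slope from a single pooled line. The helper `slope` is a hypothetical name.

```python
def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx


# Simulated runners; NOT the race data. Group F is younger with a higher
# intercept, group M is older with a lower intercept. Both groups have the
# same within-group slope of 0.2 minutes per year of age.
age_f = list(range(20, 40))
time_f = [85 + 0.2 * a for a in age_f]
age_m = list(range(30, 50))
time_m = [75 + 0.2 * a for a in age_m]

slope_f = slope(age_f, time_f)                     # stratified: F only
slope_m = slope(age_m, time_m)                     # stratified: M only
slope_pooled = slope(age_f + age_m, time_f + time_m)  # one line, all data

print(f"slope within F: {slope_f:.3f}")
print(f"slope within M: {slope_m:.3f}")
print(f"pooled slope:   {slope_pooled:.3f}")
```

The pooled line mixes the within-group age trend with the between-group difference, so its slope on age need not match either stratified slope.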
Readings
For Wednesday, please finish Chapter 4 of the course packet (picking up with the section on “Numerical and grouping variables together”).
For Monday of next week, please read Chapter 5, on quantifying uncertainty using the bootstrap. In class next Monday we will cover the sections from “Confidence intervals and coverage” onward, so reading that far beforehand is optional. You can skip the section at the very end on “Bootstrapped Prediction Intervals,” which we will not cover in this course.
Videos
For Monday of next week, please watch the following videos:
Software
Outside of class, complete the following R walkthroughs.
For Wednesday of this week (week 4):
- House prices: modeling numerical outcomes with both numerical and categorical predictors.
For next week (week 5):
- Gone fishing: using the Monte Carlo method to simulate the sampling distributions of the sample mean and of the least-squares estimator of a regression line.
- Creatinine, revisited: bootstrapping the sample mean and the OLS estimator; computing confidence intervals from bootstrapped samples; standard errors and confidence intervals from the normality assumption.
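For orientation before the walkthroughs, here is a minimal sketch of bootstrapping a sample mean. It is written in Python rather than R, is not part of either walkthrough, and uses a simulated sample rather than the creatinine data. It assumes a percentile-style 95% interval and compares the bootstrap standard error with the usual normal-theory formula. The name `bootstrap_means` is hypothetical.

```python
import random
import statistics

random.seed(42)

# Simulated sample of 50 measurements; NOT the creatinine data.
sample = [random.gauss(100, 15) for _ in range(50)]


def bootstrap_means(data, n_boot=2000):
    """Resample the data with replacement n_boot times; return each mean."""
    n = len(data)
    return [statistics.fmean(random.choices(data, k=n)) for _ in range(n_boot)]


boot = sorted(bootstrap_means(sample))

# Percentile interval: the middle 95% of the bootstrapped means.
ci_low = boot[int(0.025 * len(boot))]
ci_high = boot[int(0.975 * len(boot)) - 1]

# Two estimates of the standard error of the sample mean.
boot_se = statistics.stdev(boot)
normal_se = statistics.stdev(sample) / len(sample) ** 0.5

print(f"sample mean:      {statistics.fmean(sample):.2f}")
print(f"bootstrap 95% CI: ({ci_low:.2f}, {ci_high:.2f})")
print(f"bootstrap SE:     {boot_se:.2f}")
print(f"normal-theory SE: {normal_se:.2f}")
```

For a sample mean the two standard errors should land close together. The bootstrap becomes more useful for estimators that have no simple standard-error formula.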
Exercises
Exercises 3 this week are about regression models incorporating grouping variables. They are due in class on February 12.