University of Minnesota, Twin Cities School of Statistics Stat 3011 Rweb Textbook (Wild and Seber)
For an example we will use the data for Example 12.3.1 in Wild and Seber,
which is in the file
cmputer1.txt.
This file contains three variables. The two of interest in this problem are no, the number of computer terminals, and pertask, the time per task (p. 104 and p. 511 in Wild and Seber). The following R commands do the regression.
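A minimal sketch of such commands, assuming the data file is read from a URL (the URL here is hypothetical; substitute the actual location of cmputer1.txt):

X <- read.table("http://www.stat.umn.edu/data/cmputer1.txt", header = TRUE)
attach(X)                 # makes the variables no and pertask visible
out <- lm(pertask ~ no)   # least-squares regression of pertask on no
summary(out)              # prints the regression table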
Since the output is rather complicated, we copy it below with the regression coefficients highlighted. The least-squares regression line is
y = 3.04963 + 0.26034 x
Rweb:> summary(out)
Call:
lm(formula = pertask ~ no)
Residuals:
Min 1Q Median 3Q Max
-3.5631 -0.7923 -0.1547 1.3844 2.4403
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.04963 1.26045 2.419 0.0323 *
no 0.26034 0.02705 9.625 5.41e-07 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 1.672 on 12 degrees of freedom
Multiple R-Squared: 0.8853, Adjusted R-squared: 0.8758
F-statistic: 92.63 on 1 and 12 degrees of freedom, p-value: 5.406e-07
A test of whether a regression coefficient is zero can be read off the regression printout. No additional work is required.
We repeat here the same printout as in the preceding section, but here the test statistic and P-value for a test of the hypotheses H0: beta1 = 0 versus HA: beta1 != 0 are highlighted.
Rweb:> summary(out)
Call:
lm(formula = pertask ~ no)
Residuals:
Min 1Q Median 3Q Max
-3.5631 -0.7923 -0.1547 1.3844 2.4403
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.04963 1.26045 2.419 0.0323 *
no 0.26034 0.02705 9.625 5.41e-07 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 1.672 on 12 degrees of freedom
Multiple R-Squared: 0.8853, Adjusted R-squared: 0.8758
F-statistic: 92.63 on 1 and 12 degrees of freedom, p-value: 5.406e-07
The P-value read off the printout, P = 5.41e-07 (meaning 5.41 times 10 to the minus 7), is exceedingly small. So there is a "highly statistically significant" linear relationship between the two variables.
R has no procedure specifically for producing confidence intervals for regression coefficients. However, all the information necessary is in the regression printout. Again we repeat the same regression printout, this time with the information necessary for doing a confidence interval for the slope highlighted.
Rweb:> summary(out)
Call:
lm(formula = pertask ~ no)
Residuals:
Min 1Q Median 3Q Max
-3.5631 -0.7923 -0.1547 1.3844 2.4403
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.04963 1.26045 2.419 0.0323 *
no 0.26034 0.02705 9.625 5.41e-07 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 1.672 on 12 degrees of freedom
Multiple R-Squared: 0.8853, Adjusted R-squared: 0.8758
F-statistic: 92.63 on 1 and 12 degrees of freedom, p-value: 5.406e-07
In the Estimate column is the estimate of the slope. In the Std. Error column is the standard error of this estimate. We multiply the standard error by a t critical value derived from a t distribution with 12 degrees of freedom (n - 2, but we can also read that directly off the regression printout). The product gives the margin of error, and hence the interval estimate for the true slope of the population regression line associated with a 95% confidence level.
To get a different confidence level, you would use a different argument in the qt function.
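A sketch of this computation, using the estimate and standard error read off the printout:

est <- 0.26034            # slope estimate from the printout
se  <- 0.02705            # its standard error
tcrit <- qt(0.975, 12)    # t critical value, 12 degrees of freedom
moe <- tcrit * se         # margin of error, about 0.0589
c(est - moe, est + moe)   # 95% confidence interval, about (0.2014, 0.3193)

For a 90% interval, the first argument of qt would be 0.95 instead of 0.975.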
R also has no procedure specifically for performing general hypothesis tests for regression coefficients. The P-value for the test of whether a regression coefficient is zero is given in the regression printout. If we want to test any nonzero hypothesized value for a regression coefficient, we have to do it "by hand" (assisted perhaps by the computer).
Again, all the information necessary is in the regression printout. In fact we use exactly the same information that was highlighted in the preceding printout. We just use it in a different way.
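A sketch of such a test by hand; the hypothesized slope value 0.25 here is purely illustrative, not from the book:

est <- 0.26034            # slope estimate from the printout
se  <- 0.02705            # its standard error
t <- (est - 0.25) / se    # test statistic for H0: slope = 0.25
2 * pt(-abs(t), 12)       # two-tailed P-value, 12 degrees of freedom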
For an example we will use the data for Example 12.4.3 in Wild and Seber, which is in the file gauge.txt. This file contains three variables. The two of interest in this problem are can and gauge. The R statements do the plot shown on p. 530 in Wild and Seber and give the accompanying regression printout, which does not match the regression printout on that page.
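A sketch of such statements, assuming the variables can and gauge have already been read in from gauge.txt:

plot(can, gauge)         # scatterplot as on p. 530 in Wild and Seber
out <- lm(gauge ~ can)   # least-squares regression of gauge on can
abline(out)              # add the regression line to the plot
summary(out)             # regression printout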
The regression printout on p. 530 in Wild and Seber corresponds to the following regression, which deletes case 35 on the grounds that it is an apparent outlier.
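The formula in the printout below shows how the deletion was done; negative subscripts in R drop the indicated cases:

out <- lm(gauge[-35] ~ can[-35])   # refit with case 35 deleted
summary(out)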
Again, we present the R regression output with the relevant numbers highlighted.
Rweb:> summary(out)
Call:
lm(formula = gauge[-35] ~ can[-35])
Residuals:
Min 1Q Median 3Q Max
-0.074413 -0.020389 -0.008787 0.021637 0.132076
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.034438 0.007612 4.524 3.37e-05 ***
can[-35] 0.439803 0.003649 120.519 < 2e-16 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 0.04156 on 54 degrees of freedom
Multiple R-Squared: 0.9963, Adjusted R-squared: 0.9962
F-statistic: 1.452e+04 on 1 and 54 degrees of freedom, p-value: 0
As usual, the test statistic is the estimate minus the hypothesized value, divided by its standard error. The R statements computing this give the test statistic (which agrees with the book) and the one-tailed and two-tailed P-values (which don't exactly agree). The book is apparently wrong here, but since no details are given of exactly how they got their number (they say "the two-sided P-value is 0.98"), we can't resolve the discrepancy.
Note: This example was not covered in class and won't be on the test or homework. I've just left it on this web page to have it written down somewhere.
For an example we again use the data for Example 12.3.1 in Wild and Seber, which is in the file cmputer1.txt. This file contains three variables. The two of interest in this problem are no and time. Now we want to fit a quadratic model and see if it fits better than the linear model. The following R commands do this.
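A sketch of the commands, assuming the variables no and time have been read in from cmputer1.txt (the model formula matches the printout that follows):

out <- lm(time ~ no + I(no^2))   # quadratic model; I() protects the ^
summary(out)                     # regression printout
plot(no, time)                   # scatterplot of the data
curve(predict(out, data.frame(no = x)), add = TRUE)  # add the fitted curve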
A couple of fine points. I( ) is used to indicate that no^2 means what it appears to, the square of the variable no. Otherwise R treats ^ as a magic character meaning something quite different, as explained in the on-line help for model formulas. The no in

curve(predict(out, data.frame(no=x)), add=TRUE)

must be replaced with the actual name of the predictor variable in the regression actually done.
The regression printout is repeated below with the coefficient of the quadratic term highlighted and also the P-value for testing whether this coefficient is nonzero.
Rweb:> summary(out)
Call:
lm(formula = time ~ no + I(no^2))
Residuals:
Min 1Q Median 3Q Max
-2.3254 -0.6556 -0.0973 1.0512 1.8757
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.215067 1.941166 0.111 0.91378
no 0.036714 0.100780 0.364 0.72254
I(no^2) 0.004526 0.001209 3.745 0.00324 **
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 1.319 on 11 degrees of freedom
Multiple R-Squared: 0.9715, Adjusted R-squared: 0.9663
F-statistic: 187.7 on 2 and 11 degrees of freedom, p-value: 3.159e-09
The actual value of the coefficient 0.004526 is small, but highly statistically significant (P = 0.003). Hence we conclude that the quadratic model is better and that the linear model is inadequate.
The regression function is also a parameter of interest. A standard problem asks for an estimate of this parameter for some particular value of x.
The predict function in R also does this.
We return to the data used in the first section of this page (Example 12.3.1 in Wild and Seber, data file cmputer1.txt with variables no and pertask). The top of p. 533 in Wild and Seber has the estimate of the regression function for no = 70. The following R statements do this.
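A sketch of the statements, assuming out holds the regression of pertask on no:

out <- lm(pertask ~ no)   # refit if not already done
predict(out, data.frame(no = 70), interval = "confidence")
# point estimate 3.04963 + 0.26034 * 70, about 21.27,
# together with 95% confidence limits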
As always, the on-line help for the function in question, here predict, gives optional arguments and other information about the function.
In particular, the optional argument level=0.90 changes the confidence level to 90%.
This section is very similar to the preceding section. The only difference is that the optional argument interval is changed to interval="prediction". This gives a prediction interval that has a specified probability (95% by default) of covering a new data value associated with a specified predictor value. The prediction interval at the top of p. 533 in Wild and Seber for a new pertask value associated with no = 70 is produced by the following R statements.
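A sketch of the statements, again assuming out holds the regression of pertask on no:

predict(out, data.frame(no = 70), interval = "prediction")
# 95% prediction interval for a new pertask value at no = 70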
Yet again we reuse the data used in the first section of this page (Example 12.3.1 in Wild and Seber, data file cmputer1.txt with variables no and pertask).
The R statements do a QQ plot of the residuals.
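A sketch of such statements; the qqline reference line is an optional extra, not from the original:

out <- lm(pertask ~ no)
qqnorm(residuals(out))   # normal QQ plot of the residuals
qqline(residuals(out))   # reference line through the quartiles (optional)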
For the same data, the R statements do a plot of the residuals versus the fitted values.
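A sketch of such statements; the horizontal reference line is an optional extra:

plot(fitted(out), residuals(out))   # residuals versus fitted values
abline(h = 0)                       # horizontal reference line at zero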
Again, for the same data, the R statements do a plot of the residuals versus the predictor variable no.
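A sketch of such statements, again with an optional reference line:

plot(no, residuals(out))   # residuals versus the predictor no
abline(h = 0)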
The plots in this section and the preceding section are essentially the same for simple linear regression. The difference between them only arises in multiple regression (which we haven't done). If there are k predictor variables, then there are k residuals-versus-predictor plots (one for each of the k predictors), but only one residuals-versus-fitted-values plot. Since the latter shows most of what you want, it is preferred. Certainly, if you are going to make only one plot of this type, it should be a residuals versus fitted values plot.
For an example we will use the data for Example 12.4.1 in Wild and Seber, which is in the file rating2.txt. This file contains two variables: parent and teacher.
For comparison with the correlation analysis, we first do a regression analysis, reproducing the analysis in Example 12.4.1.
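A sketch of the regression, assuming parent and teacher have been read in from rating2.txt (the model formula matches the printout below):

out <- lm(teacher ~ parent)   # regression of teacher rating on parent rating
summary(out)                  # regression printout
plot(parent, teacher)
abline(out)                   # add the regression line to the plot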
Now we return to Section 12.5 in Wild and Seber, where we just want to examine the correlation of these two variables. The R function that calculates correlations is cor.
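A sketch of its use here:

cor(parent, teacher)   # sample correlation, 0.2970181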
This does produce the correlation 0.2970181 found in Wild and Seber.
If we want to test whether the correlation is zero, we look at the test whether the regression slope is zero. Both are exactly the same test, because the correlation is zero if and only if the regression slope is zero.
Thus the P-value for the test of whether the correlation is zero can be read out of the regression printout of the preceding Rweb submission.
Rweb:> summary(out)
Call:
lm(formula = teacher ~ parent)
Residuals:
Min 1Q Median 3Q Max
-27.936 -13.562 -1.398 11.688 49.945
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3659 11.3561 0.120 0.9047
parent 0.4188 0.1799 2.328 0.0236 *
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 16.85 on 56 degrees of freedom
Multiple R-Squared: 0.08822, Adjusted R-squared: 0.07194
F-statistic: 5.418 on 1 and 56 degrees of freedom, p-value: 0.02357