*** A brief guide to using the Regression Diagnostics ***
{T J B Holland and S A T Redfern (1997) "Unit cell refinement from
powder diffraction data: the use of regression diagnostics".
Mineralogical Magazine, 61: 65-77.}
The original reference on regression diagnostics is Belsley, Kuh and Welsch
(1980) Regression Diagnostics: Identifying Influential Data and Sources of
Collinearity. J. Wiley.
They were introduced to least squares (LSQ) problems in geology by Powell
(1985) J. Met. Geol. 3, 231-243, and are briefly described there.
Regression diagnostics are numbers, calculated during the regression,
which furnish valuable information on the influence of each observation on
the least squares result and on the estimated parameters. Usually it is
deletion diagnostics which are calculated, and these give information on the
changes which would result from deletion of each observation from the
regression. In the context of the least squares programs used here, the main
diagnostics are briefly described below:
(in what follows n=number of observations and p=number of parameters)
* Hat. Hat values are listed for each observation and give information on the
amount of influence each observation has on the least squares result. A hat
value of 0.0 implies no influence whatever, whereas a hat of 1.0 implies
extreme influence (that observation is effectively fixing one parameter in
the regression). The sum of the hat values is equal to the number of
parameters being estimated, so the average hat value is p/n. Hat values which
are greater than a cutoff value of 2p/n are flagged as potential leverage
points (highly influential).
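As an illustration (not the program's own code), the hat values can be
computed in a few lines of Python/NumPy; the straight-line data here are
invented, with one point deliberately placed far from the rest:

```python
import numpy as np

# Hypothetical data: n=5 observations, p=2 parameters (straight line).
# The point at x=10 sits far from the others and should have high leverage.
X = np.column_stack([np.ones(5), np.array([0.0, 1.0, 2.0, 3.0, 10.0])])
n, p = X.shape

# Diagonal of the hat matrix H = X (X'X)^-1 X'
hat = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

print(hat.sum())         # sums to p (= 2), so the average hat is p/n
print(hat > 2 * p / n)   # only the x=10 point exceeds the 2p/n cutoff
```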
* Rstudent. Ordinary residuals (y-ycalc) are not always very useful because
influential data often have very small residuals. Rstudent takes influence
into account by dividing each residual by sig(i)*sqrt(1-h), where sig(i) is
the standard error of the fit with observation i deleted (full definitions
are given in Belsley et al. 1980 and Powell 1985). A suitable cutoff at the
95% confidence level is 2.0; any observation whose Rstudent exceeds this in
magnitude may be potentially deleterious.
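A sketch of the Rstudent calculation, again in Python/NumPy with invented
data (a straight-line fit with one deliberate outlier at x=2); the closed
form for sig(i)^2 avoids refitting n times:

```python
import numpy as np

# Hypothetical data: y follows x closely except for the outlier y[2].
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 1.0, 4.0, 3.1, 3.9, 5.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
hat = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# sig(i)^2 = (SSE - e_i^2/(1-h_i)) / (n-p-1): the fit variance with
# observation i deleted, obtained without re-running the regression.
s2_i = (resid @ resid - resid**2 / (1 - hat)) / (n - p - 1)

# Rstudent: residual divided by sig(i)*sqrt(1-h)
rstudent = resid / np.sqrt(s2_i * (1 - hat))

print(np.abs(rstudent) > 2.0)   # only the outlier at x=2 is flagged
```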
* Dfits. This deletion diagnostic measures the change in the predicted value
of y upon deletion of an observation. The value printed gives the change in
calculated y upon deletion of that observation as a multiple of the standard
deviation of the calculated value. Values greater than the cutoff of
2*sqrt(p/n) should be treated as potentially suspicious.
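Dfits combines Rstudent with leverage: dfits_i = rstudent_i * sqrt(h_i/(1-h_i)).
A Python/NumPy sketch on the same invented outlier data as above:

```python
import numpy as np

# Hypothetical straight-line data with one outlier at x=2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 1.0, 4.0, 3.1, 3.9, 5.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
hat = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
s2_i = (resid @ resid - resid**2 / (1 - hat)) / (n - p - 1)
rstudent = resid / np.sqrt(s2_i * (1 - hat))

# Dfits: change in the predicted y on deletion, in standard deviations
dfits = rstudent * np.sqrt(hat / (1 - hat))
print(np.abs(dfits) > 2 * np.sqrt(p / n))   # only the outlier is flagged
```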
* sig(i). This is simply the value that sigmafit would take upon deletion of
observation i. If this value falls significantly below sigmafit (the standard
error of the fit), then deleting that observation would improve the overall
fit.
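The same deletion formula used for Rstudent gives sig(i) directly; a
Python/NumPy sketch on the invented outlier data:

```python
import numpy as np

# Hypothetical straight-line data with one outlier at x=2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 1.0, 4.0, 3.1, 3.9, 5.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
hat = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# sigmafit: standard error of the full fit
sigmafit = np.sqrt(resid @ resid / (n - p))
# sig(i): standard error of the fit with observation i deleted
sig_i = np.sqrt((resid @ resid - resid**2 / (1 - hat)) / (n - p - 1))

# sig_i[2] falls far below sigmafit: deleting the outlier improves the fit
print(sigmafit, sig_i)
```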
* DFbetai. The change in each fitted parameter upon deletion of observation i
is flagged by this diagnostic. In the output it is given as a percentage of
that parameter's standard error. Observations which would cause any parameter
to change by more than 30% of its standard error are flagged by the program
as potentially suspicious.
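The parameter shift on deleting observation i has the closed form
beta - beta(i) = (X'X)^-1 x_i e_i / (1-h_i). A Python/NumPy sketch of the
DFbeta calculation, expressed as a percentage of the standard error as in the
program's output (data invented; the 30% cutoff is the one quoted above):

```python
import numpy as np

# Hypothetical straight-line data with one outlier at x=2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 1.0, 4.0, 3.1, 3.9, 5.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta
hat = np.diag(X @ XtX_inv @ X.T)

sigmafit = np.sqrt(resid @ resid / (n - p))
se_beta = sigmafit * np.sqrt(np.diag(XtX_inv))   # standard error of each parameter

# Change in each parameter on deleting observation i (p x n matrix):
# column i is (X'X)^-1 x_i e_i / (1 - h_i)
dbeta = (XtX_inv @ X.T) * (resid / (1 - hat))
dfbeta_pct = 100 * dbeta / se_beta[:, None]

# Flag observations shifting any parameter by more than 30% of its s.e.
print(np.any(np.abs(dfbeta_pct) > 30, axis=0))
```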
The usefulness of diagnostics is that without re-running the regression it is
possible to gain an understanding of which observations may be deleterious to
the analysis. Outliers (large residuals) may not be a problem if they have a
low influence (small hat). It may be a good strategy to remove the offending
observations one at a time until you are satisfied that the deleterious data
have been removed. However, these are single-observation diagnostics and cannot
detect deleterious effects of several observations acting together - there
may be a masking effect.