In a general linear model
$$y_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi} + \epsilon_i$$
the response, $y_i$, is modelled by a linear function of explanatory variables plus an error term.
We assume that the errors $\epsilon_i$ are independent and identically distributed such that
$$E[\epsilon_i] = 0 \quad \text{and} \quad \operatorname{var}[\epsilon_i] = \sigma^2.$$
Typically we assume
$$\epsilon_i \sim N(0, \sigma^2)$$
as a basis for inference, e.g. t-tests on parameters.
Most of the commonly used statistical distributions, e.g. Normal, Binomial and Poisson, are members of the exponential family of distributions, whose densities can be written in the form
$$f(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{\phi} + c(y, \phi)\right\}$$
where $\phi$ is the dispersion parameter and $\theta$ is the canonical parameter.
It can be shown that
$$E[Y] = b'(\theta) = \mu \quad \text{and} \quad \operatorname{var}[Y] = \phi\, b''(\theta) = \phi\, V(\mu).$$
A generalized linear model is made up of a linear predictor
$$\eta_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi}$$
and two functions:
- a link function that describes how the mean, $E[Y_i] = \mu_i$, depends on the linear predictor
$$g(\mu_i) = \eta_i$$
- a variance function that describes how the variance, $\operatorname{var}[Y_i]$, depends on the mean
$$\operatorname{var}[Y_i] = \phi\, V(\mu_i)$$
where the dispersion parameter $\phi$ is a constant.
For a glm where the response follows an exponential family, we have
$$g(\mu_i) = g(b'(\theta_i)) = \eta_i \quad \text{and} \quad \operatorname{var}[Y_i] = \phi\, b''(\theta_i) = \phi\, V(\mu_i).$$
The canonical link is defined as
$$g = (b')^{-1}, \quad \text{so that} \quad g(\mu_i) = \theta_i = \eta_i.$$
Canonical links lead to desirable statistical properties of the glm, hence they tend to be used by default. However, there is no a priori reason why the systematic effects in the model should be additive on the scale given by this link.
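For the common exponential-family distributions, the canonical links work out to the familiar default choices (for the Gamma, the canonical parameter is strictly $-1/\mu$, but the inverse link $1/\mu$ is what is used in practice):

```latex
\begin{array}{lll}
\text{Normal}   & g(\mu) = \mu                      & \text{(identity)} \\
\text{Poisson}  & g(\mu) = \log \mu                 & \text{(log)} \\
\text{Binomial} & g(\mu) = \log\{\mu / (1 - \mu)\}  & \text{(logit)} \\
\text{Gamma}    & g(\mu) = 1/\mu                    & \text{(inverse)}
\end{array}
```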
For the general linear model with $Y_i \sim N(\mu_i, \sigma^2)$ we have the linear predictor
$$\eta_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi},$$
the link function
$$g(\mu_i) = \mu_i,$$
and the variance function
$$V(\mu_i) = 1.$$
Suppose
$$Y_i \sim \text{Binomial}(n_i, p_i)$$
and we wish to model the proportions $Y_i / n_i$. Then
$$E[Y_i / n_i] = p_i \quad \text{and} \quad \operatorname{var}[Y_i / n_i] = \frac{p_i (1 - p_i)}{n_i}.$$
So our variance function is
$$V(\mu_i) = \mu_i (1 - \mu_i).$$
Our link function must map from $(0, 1)$ to $(-\infty, \infty)$. A common choice is the logit link
$$g(\mu_i) = \operatorname{logit}(\mu_i) = \log\left(\frac{\mu_i}{1 - \mu_i}\right).$$
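As a minimal illustration (Python sketch; the function names are ours, not part of any library), the logit link, its inverse, and the binomial variance function can be written as:

```python
import math

def logit(mu):
    """Logit link: maps a proportion in (0, 1) to the whole real line."""
    return math.log(mu / (1 - mu))

def inv_logit(eta):
    """Inverse logit: maps a linear predictor in (-inf, inf) back to (0, 1)."""
    return 1 / (1 + math.exp(-eta))

def binomial_variance(mu, n):
    """var(Y/n) = mu * (1 - mu) / n, i.e. V(mu) = mu * (1 - mu) with prior weight n."""
    return mu * (1 - mu) / n

# logit and inv_logit are mutual inverses on (0, 1)
for p in (0.1, 0.5, 0.9):
    assert abs(inv_logit(logit(p)) - p) < 1e-12
```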
Suppose
$$Y_i \sim \text{Poisson}(\lambda_i).$$
Then
$$E[Y_i] = \operatorname{var}[Y_i] = \lambda_i.$$
So our variance function is
$$V(\mu_i) = \mu_i.$$
Our link function must map from $(0, \infty)$ to $(-\infty, \infty)$. A natural choice is
$$g(\mu_i) = \log(\mu_i).$$
A single algorithm can be used to estimate the parameters of an exponential family glm using maximum likelihood.
The log-likelihood for the sample $y_1, \dots, y_n$ is
$$l = \sum_i \left\{\frac{y_i \theta_i - b(\theta_i)}{\phi_i} + c(y_i, \phi_i)\right\}.$$
The maximum likelihood estimates are obtained by solving the score equations
$$s(\beta_j) = \frac{\partial l}{\partial \beta_j} = \sum_i \frac{y_i - \mu_i}{\phi_i\, V(\mu_i)} \cdot \frac{x_{ij}}{g'(\mu_i)} = 0$$
for parameters $\beta_j$.
We assume that
$$\phi_i = \frac{\phi}{a_i},$$
where $\phi$ is a single dispersion parameter and $a_i$ are known prior weights; for example, binomial proportions with known index $n_i$ have $\phi = 1$ and $a_i = n_i$.
The estimating equations are then
$$\sum_i \frac{a_i (y_i - \mu_i)}{V(\mu_i)} \cdot \frac{x_{ij}}{g'(\mu_i)} = 0,$$
which do not depend on $\phi$ (which may be unknown).
A general method of solving score equations is the iterative algorithm Fisher's Method of Scoring (derived from a Taylor expansion of the score function $s(\beta)$).
In the $r$-th iteration, the new estimate $\beta^{(r+1)}$ is obtained from the previous estimate $\beta^{(r)}$ by
$$\beta^{(r+1)} = \beta^{(r)} - E\left[H\left(\beta^{(r)}\right)\right]^{-1} s\left(\beta^{(r)}\right),$$
where $H$ is the Hessian matrix: the matrix of second derivatives of the log-likelihood.
It turns out that the updates can be written as
$$\beta^{(r+1)} = \left(X^T W^{(r)} X\right)^{-1} X^T W^{(r)} z^{(r)},$$
i.e. the solution of the score equations for a weighted least squares regression of $z^{(r)}$ on $X$ with weights $W^{(r)} = \operatorname{diag}(w_i)$, where
$$z_i = \eta_i + (y_i - \mu_i)\, g'(\mu_i) \quad \text{and} \quad w_i = \frac{a_i}{V(\mu_i)\left[g'(\mu_i)\right]^2}.$$
Hence the estimates can be found using an Iteratively (Re-)Weighted Least Squares (IWLS) algorithm:
1. Start with initial estimates of the linear predictors $\eta_i$.
2. Calculate the working responses $z_i$ and the working weights $w_i$.
3. Calculate $\beta$ by weighted least squares.
4. Repeat steps 2 and 3 until convergence.
For models with the canonical link, this is simply the Newton-Raphson method.
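To make the algorithm concrete, here is a minimal pure-Python sketch of IWLS for a Poisson glm with the canonical log link and a single covariate (function name and data are ours, for illustration only). With the log link, $g'(\mu) = 1/\mu$ and $V(\mu) = \mu$, so the working weight simplifies to $w_i = \mu_i$:

```python
import math

def irls_poisson(x, y, n_iter=25):
    """IWLS sketch for a Poisson glm with log link: log(mu_i) = b0 + b1 * x_i."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # crude starting values
    for _ in range(n_iter):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        # working response z_i = eta_i + (y_i - mu_i) * g'(mu_i), with g'(mu) = 1/mu
        z = [e + (yi - mi) / mi for e, yi, mi in zip(eta, y, mu)]
        # working weight w_i = 1 / (V(mu_i) g'(mu_i)^2) = mu_i for the log link
        w = mu
        # weighted least squares of z on (1, x): solve the 2x2 normal equations
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        b0 = (swxx * swz - swx * swxz) / det
        b1 = (sw * swxz - swx * swz) / det
    return b0, b1

x = [0, 1, 2, 3, 4, 5]
y = [1, 2, 2, 4, 7, 11]
b0, b1 = irls_poisson(x, y)
mu = [math.exp(b0 + b1 * xi) for xi in x]
# at the MLE the estimating equations are satisfied
assert abs(sum(yi - mi for yi, mi in zip(y, mu))) < 1e-8
assert abs(sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))) < 1e-8
```

Since the log link is canonical for the Poisson, these iterations are exactly Newton-Raphson steps.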
The estimates $\hat{\beta}$ have the usual properties of maximum likelihood estimators. In particular, $\hat{\beta}$ is asymptotically
$$N\left(\beta,\; \phi \left(X^T W X\right)^{-1}\right),$$
where $W = \operatorname{diag}(w_i)$ as in the IWLS algorithm.
Standard errors for the $\hat{\beta}_j$ may therefore be calculated as the square roots of the diagonal elements of
$$\widehat{\operatorname{cov}}(\hat{\beta}) = \phi \left(X^T \hat{W} X\right)^{-1},$$
in which $\left(X^T \hat{W} X\right)^{-1}$ is a by-product of the final IWLS iteration. If $\phi$ is unknown, an estimate is required.
There are practical difficulties in estimating the dispersion $\phi$ by maximum likelihood.
Therefore it is usually estimated by the method of moments. If $\beta$ was known, an unbiased estimate of $\phi = a_i \operatorname{var}[Y_i] / V(\mu_i)$ would be
$$\frac{1}{n} \sum_i \frac{a_i (y_i - \mu_i)^2}{V(\mu_i)}.$$
Allowing for the fact that $\beta$ must be estimated, we obtain
$$\hat{\phi} = \frac{1}{n - p} \sum_i \frac{a_i (y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)}.$$
For non-Normal data, we can use the fact that asymptotically
$$\hat{\beta}_j \sim N\left(\beta_j,\; \phi \left[\left(X^T W X\right)^{-1}\right]_{jj}\right)$$
and use a z-test to test the significance of a coefficient. Specifically, we test
$$H_0: \beta_j = 0 \quad \text{vs.} \quad H_1: \beta_j \neq 0$$
using the test statistic
$$z = \frac{\hat{\beta}_j}{\text{s.e.}(\hat{\beta}_j)},$$
which is asymptotically $N(0, 1)$ under $H_0$.
The deviance of a model is defined as
$$D = 2\phi\,(l_{\text{sat}} - l_{\text{mod}}),$$
where $l_{\text{mod}}$ is the log-likelihood of the fitted model and $l_{\text{sat}}$ is the log-likelihood of the saturated model.
In the saturated model, the number of parameters is equal to the number of observations, so $\hat{\mu}_i = y_i$.
For linear regression with Normal data, the deviance is equal to the residual sum of squares.
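We can verify the last claim numerically: with $\phi = \sigma^2$ fixed, $2\phi\,(l_{\text{sat}} - l_{\text{mod}})$ computed from Normal log-likelihoods reduces to the residual sum of squares, since the normalising constants cancel. A short Python check (the data and fitted values are made up for illustration):

```python
import math

def normal_loglik(y, mu, sigma2):
    """Log-likelihood of independent N(mu_i, sigma2) observations."""
    return sum(-0.5 * math.log(2 * math.pi * sigma2)
               - (yi - mi) ** 2 / (2 * sigma2)
               for yi, mi in zip(y, mu))

y = [1.2, 0.7, 2.9, 3.1]
mu = [1.0, 1.0, 3.0, 3.0]   # fitted values from some model (illustrative)
sigma2 = 1.7                # any fixed dispersion phi = sigma^2

l_mod = normal_loglik(y, mu, sigma2)
l_sat = normal_loglik(y, y, sigma2)   # saturated model: mu_i = y_i
deviance = 2 * sigma2 * (l_sat - l_mod)

rss = sum((yi - mi) ** 2 for yi, mi in zip(y, mu))
assert abs(deviance - rss) < 1e-12
```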
- response: $y_i - \hat{\mu}_i$
- working: $(y_i - \hat{\mu}_i)\, g'(\hat{\mu}_i)$, from the working response in the IWLS algorithm
- Pearson:
$$r^P_i = \frac{y_i - \hat{\mu}_i}{\sqrt{V(\hat{\mu}_i)/a_i}},$$
s.t. $\sum_i \left(r^P_i\right)^2$ equals the generalized Pearson statistic
- deviance: $r^D_i = \operatorname{sign}(y_i - \hat{\mu}_i)\sqrt{d_i}$, where $d_i$ is the contribution of observation $i$ to the deviance, s.t. $\sum_i \left(r^D_i\right)^2$ equals the deviance
When fitting a logistic model, it can also be helpful to plot the data on the logit scale.
To avoid dividing by zero, we calculate the empirical logits
$$\log\left(\frac{y_i + 0.5}{n_i - y_i + 0.5}\right).$$
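A small Python helper (the function name is ours, for illustration) shows why the 0.5 adjustments help: the empirical logit stays finite even when $y_i = 0$ or $y_i = n_i$, where the raw logit $\log\{y_i/(n_i - y_i)\}$ is undefined:

```python
import math

def empirical_logit(y, n):
    """Empirical logit log((y + 0.5) / (n - y + 0.5));
    finite even at y = 0 or y = n, unlike log(y / (n - y))."""
    return math.log((y + 0.5) / (n - y + 0.5))

# finite at the boundaries, zero at y = n/2
assert math.isfinite(empirical_logit(0, 20))
assert math.isfinite(empirical_logit(20, 20))
assert empirical_logit(10, 20) == 0.0
```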
Nested models: Each model is a special case of the models that have a greater number of terms.
We can compare nested models by testing the hypothesis that some of the parameters of a larger model are equal to zero.
For example, suppose we have the model
$$\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i};$$
we can test
$$H_0: \beta_2 = 0 \quad \text{vs.} \quad H_1: \beta_2 \neq 0$$
using the likelihood ratio statistic
$$LR = 2\,(l_1 - l_2),$$
where $l_m$ is the maximised log-likelihood under model $m$, model 1 being the larger model.
Under the null hypothesis, $LR$ is approximately $\chi^2_q$, where $q$ is the difference in the number of parameters between the two models.
The response of a binomial model can be specified to glm in three ways: as a proportion, with the binomial denominators supplied via the weights argument; as a two-column matrix giving the numbers of successes and failures; or as a factor (for binary data).

Notice that the deviance
$$D = 2\phi\,(l_{\text{sat}} - l_{\text{mod}})$$
is $\phi$ times the likelihood ratio statistic comparing the fitted model to the saturated model.
Therefore the deviance can be used as a goodness-of-fit statistic, tested against $\chi^2_{n-p}$.
A good-fitting model will have $D \approx n - p$.
Consider the logistic model
$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2.$$
If we increase $x_2$ by one unit, the linear predictor increases by $\beta_2$, so the odds
$$\frac{p}{1 - p} = \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2)$$
are multiplied by $e^{\beta_2}$.
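Since the odds equal $\exp(\eta)$, a one-unit increase in $x_2$ multiplies them by exactly $e^{\beta_2}$. A quick numeric check in Python (the coefficients are made up for illustration):

```python
import math

def logistic_odds(eta):
    """Odds p / (1 - p) implied by a linear predictor eta on the logit scale."""
    p = 1 / (1 + math.exp(-eta))
    return p / (1 - p)

b0, b2 = -0.4, 0.8   # hypothetical intercept and coefficient
x2 = 1.3
ratio = logistic_odds(b0 + b2 * (x2 + 1)) / logistic_odds(b0 + b2 * x2)
assert abs(ratio - math.exp(b2)) < 1e-9   # odds multiplied by exp(b2)
```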
Confidence intervals for the parameters can be based on the asymptotic normal distribution for $\hat{\beta}_j$.
For example, a 95% confidence interval would be given by
$$\hat{\beta}_j \pm 1.96 \times \text{s.e.}(\hat{\beta}_j).$$
Such confidence intervals can be obtained as follows:
confint.default(parr)
The Wald confidence intervals use standard errors based on the second-order Taylor expansion of the log-likelihood at $\hat{\beta}$.
An alternative approach is to profile the log-likelihood, or equivalently the deviance, around each $\hat{\beta}_j$ and base confidence intervals on this.
We set $\beta_j$ to a value $\tilde{\beta}_j \neq \hat{\beta}_j$ and re-fit the model to maximise the likelihood/minimise the deviance under this constraint. Repeating this for a range of values around $\hat{\beta}_j$ gives a deviance profile for that parameter.
To test the hypothesis
$$H_0: \beta_j = \tilde{\beta}_j \quad \text{vs.} \quad H_1: \beta_j \neq \tilde{\beta}_j,$$
we can use the likelihood ratio statistic
$$LR = 2\left(l(\hat{\beta}) - l(\tilde{\beta})\right),$$
where $l(\tilde{\beta})$ is the maximised log-likelihood with $\beta_j$ fixed at $\tilde{\beta}_j$. This statistic is asymptotically distributed $\chi^2_1$. Thus
$$\operatorname{sign}(\hat{\beta}_j - \tilde{\beta}_j)\sqrt{LR}$$
is asymptotically $N(0, 1)$ and is analogous to the Wald statistic.
If the log-likelihood were quadratic about $\hat{\beta}_j$, then a plot of this signed square root vs. $\beta_j$ would be a straight line.
We can obtain such a plot as follows
plot(profile(parr, "ldose"))
Rather than use the quadratic approximation, we can directly estimate the values of $\tilde{\beta}_j$ for which $LR$ equals the critical value (e.g. the 95% point of $\chi^2_1$), to obtain a confidence interval for $\beta_j$.
This is the method used by confint.glm:
confint(parr)
Notice the confidence intervals are asymmetric.
The predict method for GLMs has a type argument, which may be specified as "link" for predictions of $\eta$, or "response" for predictions of $\mu$. If no new data is passed to predict, these options return object$linear.predictors and object$fitted.values respectively.
The deviance residuals can be used to check the model as with Normal models.
The standardized residuals for binomial data should have an approximate normal distribution, provided the number of observations for each covariate pattern is not too small.
par(mfrow = c(2, 2))
plot(parr, 1:4)
We have seen that the deviance may be viewed as a likelihood ratio statistic, with approximate distribution $\chi^2_{n-p}$.
However, the $\chi^2$ distribution of the likelihood ratio statistic is based on the limit as $n \to \infty$, with the number of parameters in the nested models both fixed. This does not apply to the deviance, since the number of parameters in the saturated model grows with $n$.
The $\chi^2_{n-p}$ distribution is still reasonable where the information content of each observation is large, e.g. Binomial models with large $n_i$, Poisson models with large $\mu_i$, Gamma models with small $\phi$.
For binary data, the $\chi^2$ approximation does not apply.
In fact, for the logistic regression model it can be shown that
$$D = -2\sum_i \left\{\hat{p}_i \operatorname{logit}(\hat{p}_i) + \log(1 - \hat{p}_i)\right\},$$
which depends on the $y_i$ only through the $\hat{p}_i$ and therefore can tell us nothing about the agreement between the $y_i$ and the $\hat{p}_i$.
Instead we shall analyse the residuals and consider alternative models.
For binary data, or binomial data where $n_i$ is small for most covariate patterns, there are few distinct values of the residuals and the plots may be uninformative.
Therefore we concentrate on "large" residuals.
A useful technique in evaluating models fitted to binary data is to group the data and treat them as binomial instead.
We select category boundaries to give roughly equal numbers in each category, then compute the proportions from the original data.
Often counts are based on events that may be assumed to arise from a Poisson process.
In such situations, the counts can be assumed to follow a Poisson distribution, say
$$Y_i \sim \text{Poisson}(\lambda_i).$$
In many cases we are making comparisons across observation units with different levels of exposure to the event, and hence the measure of interest is the rate of occurrence per unit of exposure.
Since the counts are Poisson distributed, we would like to use a glm to model the expected rate, $\lambda_i / t_i$, where $t_i$ is the exposure for unit $i$.
Typically explanatory variables have a multiplicative effect rather than an additive effect on the expected rate, therefore a suitable model is
$$\log\left(\frac{\lambda_i}{t_i}\right) = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi},$$
equivalently
$$\log \lambda_i = \log t_i + \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi},$$
i.e. a Poisson glm with the canonical log link and the known term $\log t_i$, an offset, included in the linear predictor.
This is known as a log-linear model.
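In the simplest case of an intercept-only rate model, $\log \lambda_i = \log t_i + \beta_0$, the score equation $\sum_i (y_i - t_i e^{\beta_0}) = 0$ can be solved in closed form: $e^{\hat{\beta}_0} = \sum_i y_i / \sum_i t_i$, the overall observed rate. A short Python check (the data are made up for illustration):

```python
import math

# counts y_i observed over exposures t_i (illustrative data)
y = [3, 1, 7, 2]
t = [10.0, 4.0, 25.0, 8.0]

# score equation: sum_i (y_i - t_i * exp(b0)) = 0  =>  exp(b0) = sum(y) / sum(t)
b0 = math.log(sum(y) / sum(t))
rate = math.exp(b0)
assert abs(rate - sum(y) / sum(t)) < 1e-12

mu = [ti * rate for ti in t]               # fitted counts
assert abs(sum(y) - sum(mu)) < 1e-9        # totals are reproduced at the MLE
```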
glm
, either using the offset
argument or using the offset
function in the formula, e.g. offset(time)
.Lack of fit may be due to inadequate specification of the model, but another possibility when modelling discrete data is overdispersion.
Under the Poisson or Binomial model, we have a fixed mean-variance relationship: $\phi = 1$, so that
$$\operatorname{var}[Y_i] = \frac{V(\mu_i)}{a_i}.$$
Overdispersion occurs when
$$\operatorname{var}[Y_i] > \frac{V(\mu_i)}{a_i}.$$
This may occur due to correlated responses or variability between observational units.
We can adjust for over-dispersion by estimating a dispersion parameter $\phi > 1$, so that $\operatorname{var}[Y_i] = \phi\, V(\mu_i)/a_i$.
This changes the assumed distribution of our response to a distribution for which we do not have the full likelihood.
However, the score equations in the IWLS algorithm only require the variance function, so we can still obtain estimates for the parameters. Note the score equations do not depend on $\phi$, so we will obtain the same estimates as if $\phi = 1$.
This approach is known as quasi-likelihood estimation. Whilst estimating $\phi$ does not affect the parameter estimates, it will change inference based on the model.
The asymptotic theory for maximum likelihood also applies to quasi-likelihood; in particular, $\hat{\beta}$ is approximately distributed as
$$N\left(\beta,\; \phi \left(X^T W X\right)^{-1}\right),$$
so compared to the case with $\phi = 1$, the standard errors of the parameters are multiplied by $\sqrt{\phi}$.
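A sketch of the method-of-moments dispersion estimate and the resulting standard-error inflation, in Python (the data, fitted means, and the example standard error are made up for illustration):

```python
import math

def pearson_dispersion(y, mu, V, a, p):
    """Method-of-moments dispersion estimate:
    phi_hat = (1/(n-p)) * sum_i a_i * (y_i - mu_i)^2 / V(mu_i)."""
    n = len(y)
    return sum(ai * (yi - mi) ** 2 / V(mi)
               for ai, yi, mi in zip(a, y, mu)) / (n - p)

# Poisson-type data with fitted means mu (illustrative), V(mu) = mu, a_i = 1
y = [0, 3, 8, 5, 12, 2]
mu = [1.5, 2.5, 5.0, 6.0, 8.0, 3.0]
phi = pearson_dispersion(y, mu, V=lambda m: m, a=[1] * 6, p=2)

se = 0.11                       # a standard error computed under phi = 1 (illustrative)
se_quasi = se * math.sqrt(phi)  # quasi-Poisson standard error: inflated by sqrt(phi)
```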
Since $\phi$ is estimated, Wald tests based on the Normal assumption are $t$-tests rather than $z$-tests.
The deviance based on the likelihood of the exponential family distribution with the same variance function may be used as a quasi-deviance. Since $\phi$ is estimated rather than fixed at 1, nested models are compared by referring
$$F = \frac{(D_1 - D_2)/(p_2 - p_1)}{\hat{\phi}}$$
to the $F$ distribution with $p_2 - p_1$ and $n - p_2$ degrees of freedom.
The AIC is undefined for quasi-likelihood models.
— Jul 15, 2022