
§2 Generalized Linear Models

  1. Introduction to Generalized Linear Models
    1. The General Linear Model
    2. Error structure
    3. Restrictions of Linear Models
    4. Exponential Family
    5. Generalized Linear Models (GLMs)
    6. Canonical Links
    7. Normal General Linear Model as a Special Case
    8. Modelling Binomial Data
    9. Modelling Poisson Data
    10. Transformation vs. GLM
    11. Estimation of the Model Parameters
    12. Standard Errors
    13. Wald Tests
    14. Deviance
    15. Residual Analysis
  2. Binary Data
    1. Exploring Binary Data
    2. Models for Binary Data
    3. Choice of Link
    4. Scatterplot Scales
    5. Nested models
    6. Binomial Responses and glm
    7. Goodness-of-fit
    8. Interpretation of Logistic Models
    9. Wald Confidence Intervals
    10. Profiling the Deviance
    11. Likelihood Ratio Test
    12. Profile Plots
    13. Profile Confidence Intervals
    14. Prediction
    15. Residual Analysis
    16. Deviance for Binary Data
    17. Residual Plots for Binary Data
    18. Grouping the Data
  3. Count Data
    1. Rate Data
    2. Offsets
    3. Overdispersion

Introduction to Generalized Linear Models

The General Linear Model

  • In a general linear model

    y_{i}=\beta_{0}+\beta_{1} x_{1 i}+\ldots+\beta_{p} x_{p i}+\epsilon_{i}

    the response y_i, i = 1, …, n, is modelled by a linear function of the explanatory variables x_j, j = 1, …, p, plus an error term.

Error structure

  • We assume that the errors ϵ_i are independent and identically distributed such that

    E[\epsilon_i] = 0 \quad \text{and} \quad \operatorname{var}[\epsilon_i] = \sigma^2

    Typically we assume

    \epsilon_i \sim N(0, \sigma^2)

    as a basis for inference, e.g. t-tests on parameters.

Restrictions of Linear Models

  • Although a very useful framework, there are some situations where general linear models are not appropriate
    • the range of Y is restricted (e.g. binary, count)
    • the variance of Y depends on the mean
  • Generalized linear models extend the general linear model framework to address both of these issues

Exponential Family

  • Most of the commonly used statistical distributions, e.g. Normal, Binomial and Poisson, are members of the exponential family of distributions whose densities can be written in the form

    f(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{\phi}\right\} \exp\left[c(y, \phi)\right]

    where ϕ is the dispersion parameter and θ is the canonical parameter.

  • It can be shown that

    E(Y) = b'(\theta) = \mu \quad \text{and} \quad \operatorname{var}(Y) = \phi\, b''(\theta) = \phi V(\mu)
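
  • For example, the Poisson(λ) probability mass function can be written in this form:

    f(y; \lambda) = \frac{e^{-\lambda} \lambda^{y}}{y!} = \exp\left\{y \log \lambda - \lambda\right\} \exp\left\{-\log y!\right\}

    so θ = log λ, b(θ) = e^θ, ϕ = 1 and c(y, ϕ) = −log y!. The identities above then give E(Y) = b'(θ) = e^θ = λ and var(Y) = b''(θ) = λ, i.e. V(μ) = μ.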

Generalized Linear Models (GLMs)

  • A generalized linear model is made up of a linear predictor

    \eta_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi}

    and two functions

    • a link function that describes how the mean, E(Y_i) = μ_i, depends on the linear predictor

      g(\mu_i) = \eta_i

    • a variance function that describes how the variance, var(Y_i), depends on the mean

      \operatorname{var}(Y_i) = \phi V(\mu_i)

      where the dispersion parameter ϕ is a constant.

  • For a glm where the response follows an exponential family distribution, we have

    g(\mu_i) = g(b'(\theta_i)) = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi}

  • The canonical link is defined as

    g = (b')^{-1} \quad \Rightarrow \quad g(\mu_i) = \theta_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi}

  • Canonical links lead to desirable statistical properties of the glm and hence tend to be used by default. However, there is no a priori reason why the systematic effects in the model should be additive on the scale given by this link.

Normal General Linear Model as a Special Case

  • For the general linear model with ϵ ∼ N(0, σ²) we have the linear predictor

    \eta_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi}

    the link function

    g(\mu_i) = \mu_i

    and the variance function

    V(\mu_i) = 1

Modelling Binomial Data

  • Suppose

    Y_i \sim \operatorname{Binomial}(n_i, p_i)

    and we wish to model the proportions Y_i / n_i. Then

    E(Y_i / n_i) = p_i \qquad \operatorname{var}(Y_i / n_i) = \frac{1}{n_i} p_i(1 - p_i)

    So our variance function is

    V(\mu_i) = \mu_i(1 - \mu_i)

    Our link function must map from (0, 1) to (−∞, ∞). A common choice is

    g(\mu_i) = \operatorname{logit}(\mu_i) = \log\left(\frac{\mu_i}{1 - \mu_i}\right)

Modelling Poisson Data

  • Suppose

    Y_i \sim \operatorname{Poisson}(\lambda_i)

    Then

    E(Y_i) = \lambda_i \qquad \operatorname{var}(Y_i) = \lambda_i

    So our variance function is

    V(\mu_i) = \mu_i

    Our link function must map from (0, ∞) to (−∞, ∞). A natural choice is

    g(\mu_i) = \log(\mu_i)

Transformation vs. GLM

  • In some situations a response variable can be transformed to improve linearity and homogeneity of variance so that a general linear model can be applied.
  • This approach has some drawbacks
    • response variable has changed!
    • transformation must simultaneously improve linearity and homogeneity of variance
    • transformation may not be defined on the boundaries of the sample space

Estimation of the Model Parameters

  • A single algorithm can be used to estimate the parameters of an exponential family glm using maximum likelihood.

  • The log-likelihood for the sample y_1, …, y_n is

    l = \sum_{i=1}^{n} \frac{y_i \theta_i - b(\theta_i)}{\phi_i} + c(y_i, \phi_i)

  • The maximum likelihood estimates are obtained by solving the score equations

    s(\beta_j) = \frac{\partial l}{\partial \beta_j} = \sum_{i=1}^{n} \frac{y_i - \mu_i}{\phi_i V(\mu_i)} \times \frac{x_{ji}}{g'(\mu_i)} = 0

    for the parameters β_j.

  • We assume that

    \phi_i = \frac{\phi}{a_i}

    where ϕ is a single dispersion parameter and the a_i are known prior weights; for example, binomial proportions with known index n_i have ϕ = 1 and a_i = n_i.

    The estimating equations are then

    \frac{\partial l}{\partial \beta_j} = \sum_{i=1}^{n} \frac{a_i (y_i - \mu_i)}{V(\mu_i)} \times \frac{x_{ij}}{g'(\mu_i)} = 0

    which do not depend on ϕ (which may be unknown).

  • A general method of solving score equations is the iterative algorithm known as Fisher's Method of Scoring (derived from a Taylor expansion of s(β)).

    In the r-th iteration, the new estimate β^{(r+1)} is obtained from the previous estimate β^{(r)} by

    \boldsymbol{\beta}^{(r+1)} = \boldsymbol{\beta}^{(r)} - E\left(H(\boldsymbol{\beta}^{(r)})\right)^{-1} s(\boldsymbol{\beta}^{(r)})

    where H is the Hessian matrix: the matrix of second derivatives of the log-likelihood.

  • It turns out that the updates can be written as

    \boldsymbol{\beta}^{(r+1)} = \left(X^T W^{(r)} X\right)^{-1} X^T W^{(r)} \boldsymbol{z}^{(r)}

    i.e. the solution of the score equations for a weighted least squares regression of z^{(r)} on X with weights W^{(r)} = diag(w_i^{(r)}), where

    z_i^{(r)} = \eta_i^{(r)} + \left(y_i - \mu_i^{(r)}\right) g'\left(\mu_i^{(r)}\right) \quad \text{and} \quad w_i^{(r)} = \frac{a_i}{V\left(\mu_i^{(r)}\right) \left(g'\left(\mu_i^{(r)}\right)\right)^2}

  • Hence the estimates can be found using an Iteratively (Re-)Weighted Least Squares algorithm:

    1. Start with initial estimates μ_i^{(0)}
    2. Calculate the working responses z_i^{(r)} and working weights w_i^{(r)}
    3. Calculate β^{(r+1)} by weighted least squares
    4. Repeat 2 and 3 until convergence

    For models with the canonical link, this is simply the Newton-Raphson method.
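
  • As an illustration, here is a minimal IWLS sketch in R for a Poisson glm with the canonical log link (so a_i = 1, V(μ) = μ and g'(μ) = 1/μ); the data x and y are simulated purely for the example:

    # Minimal IWLS sketch for a Poisson log-link model (illustrative only)
    set.seed(1)
    x <- runif(50)
    y <- rpois(50, exp(1 + 2 * x))          # simulated data for illustration
    X <- cbind(1, x)                        # design matrix
    beta <- c(log(mean(y)), 0)              # crude starting values
    for (r in 1:25) {
      eta <- drop(X %*% beta)               # linear predictor
      mu  <- exp(eta)                       # inverse link
      z   <- eta + (y - mu) / mu            # working response: eta + (y - mu) g'(mu)
      w   <- mu                             # working weight: a / (V(mu) g'(mu)^2)
      beta_old <- beta
      beta <- drop(solve(t(X) %*% (w * X), t(X) %*% (w * z)))
      if (max(abs(beta - beta_old)) < 1e-8) break
    }
    beta                                    # agrees with coef(glm(y ~ x, family = poisson))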

Standard Errors

  • The estimates β̂ have the usual properties of maximum likelihood estimators. In particular, β̂ is asymptotically

    N\left(\boldsymbol{\beta}, i^{-1}\right)

    where

    i(\boldsymbol{\beta}) = \phi^{-1} X^T W X

    Standard errors for the β_j may therefore be calculated as the square roots of the diagonal elements of

    \widehat{\operatorname{cov}}(\hat{\boldsymbol{\beta}}) = \phi\left(X^T \hat{W} X\right)^{-1}

    in which (X^T Ŵ X)^{-1} is a by-product of the final IWLS iteration. If ϕ is unknown, an estimate is required.

  • There are practical difficulties in estimating the dispersion ϕ by maximum likelihood.

    Therefore it is usually estimated by the method of moments. If β were known, an unbiased estimator of ϕ = a_i var(Y_i) / V(μ_i) would be

    \frac{1}{n} \sum_{i=1}^{n} \frac{a_i (y_i - \mu_i)^2}{V(\mu_i)}

    Allowing for the fact that β must be estimated, we obtain

    \frac{1}{n - p} \sum_{i=1}^{n} \frac{a_i (y_i - \mu_i)^2}{V(\mu_i)}
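
  • In R, this moment estimator can be computed from the Pearson residuals of a fitted model; a sketch, assuming a fitted glm object fit:

    # Moment (Pearson) estimate of the dispersion: sum of squared
    # Pearson residuals divided by the residual degrees of freedom
    phi_hat <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)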

Wald Tests

  • For non-Normal data, we can use the fact that asymptotically

    \hat{\boldsymbol{\beta}} \sim N\left(\boldsymbol{\beta}, \phi\left(X^T W X\right)^{-1}\right)

    and use a z-test to test the significance of a coefficient. Specifically, we test

    H_0: \beta_j = 0 \quad \text{versus} \quad H_1: \beta_j \neq 0

    using the test statistic

    z_j = \frac{\hat{\beta}_j}{\sqrt{\phi\left(X^T \hat{W} X\right)^{-1}_{jj}}}

    which is asymptotically N(0, 1) under H_0.
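
  • These are the statistics reported by summary for a fitted glm; a sketch, again assuming a fitted object fit:

    # Wald statistics and p-values for each coefficient
    summary(fit)$coefficients
    # equivalently, by hand (vcov returns phi * (X^T W X)^(-1)):
    coef(fit) / sqrt(diag(vcov(fit)))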

Deviance

  • The deviance of a model is defined as

    D = 2\phi\left(l_{sat} - l_{mod}\right)

    where l_{mod} is the log-likelihood of the fitted model and l_{sat} is the log-likelihood of the saturated model.

  • In the saturated model, the number of parameters is equal to the number of observations, so ŷ = y.

  • For linear regression with Normal data, the deviance is equal to the residual sum of squares.

Residual Analysis

  • Several kinds of residuals can be defined for GLMs:
    • response: y_i − μ̂_i

    • working: from the working response in the IWLS algorithm

    • Pearson:

      r_i^P = \frac{y_i - \hat{\mu}_i}{\sqrt{V(\hat{\mu}_i)}}

      such that Σ_i (r_i^P)² equals the generalized Pearson statistic.

    • deviance: r_i^D, such that Σ_i (r_i^D)² equals the deviance.

  • These definitions are all equivalent for Normal models.
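
  • In R, all four types can be extracted from a fitted glm object (fit is again a generic fitted model):

    # Residual types for a fitted glm; "deviance" is the default
    residuals(fit, type = "response")
    residuals(fit, type = "working")
    residuals(fit, type = "pearson")
    residuals(fit, type = "deviance")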

Binary Data

  • Binary data may occur in two forms:
    • ungrouped in which the variable can take one of two values, say success/failure
    • grouped in which the variable is the number of successes in a given number of trials
  • The natural distribution for such data is the Binomial(n, p) distribution, where in the first case n = 1.

Exploring Binary Data

  • If our aim is to model a binary response, we would first like to explore the relationship between that response and potential explanatory variables.
  • When the explanatory variables are categorical, a simple approach is to calculate proportions within subgroups of the data.
  • When some of the explanatory variables are continuous, plots can be more helpful.

Models for Binary Data

  • In Part I we saw that Binomial data may be modelled by a glm, with the canonical logit link. This model is known as the logistic regression model and is the most popular for binary data.
  • There are two other links commonly used in practice:
    • probit link: g(μ_i) = Φ^{-1}(μ_i), where Φ denotes the cumulative distribution function of N(0, 1)
    • complementary log-log link: g(μ_i) = log(−log(1 − μ_i))
  • The logit and probit functions are symmetric and, once their variances are equated, are very similar. Therefore it is usually difficult to choose between them on the grounds of fit.
  • The logit is usually preferred over the probit because of its simple interpretation as the logarithm of the odds of success, p_i / (1 − p_i).
  • The complementary log-log is asymmetric and may therefore be useful when the logit and probit links are inappropriate. We will concentrate on using the logit link.

Scatterplot Scales

  • When fitting a logistic model, it can also be helpful to plot the data on the logit scale.

  • To avoid dividing by zero, we calculate the empirical logits

    \log\left(\frac{(y_i + 0.5)/(n_i + 1)}{1 - (y_i + 0.5)/(n_i + 1)}\right) = \log\left(\frac{y_i + 0.5}{n_i - y_i + 0.5}\right)
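
  • A sketch in R, assuming vectors y (successes), n (trials) and a continuous explanatory variable x:

    # Empirical logits, adding 0.5 to both counts to avoid log(0)
    emp_logit <- log((y + 0.5) / (n - y + 0.5))
    plot(x, emp_logit)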

Nested models

  • Nested models: Each model is a special case of the models that have a greater number of terms.

  • We can compare nested models by testing the hypothesis that some of the parameters of a larger model are equal to zero.

  • For example, suppose we have the model

    \operatorname{logit}(p_i) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p

    we can test

    H_0: \beta_{q+1} = \ldots = \beta_p = 0 \quad \text{versus} \quad H_1: \beta_j \neq 0 \text{ for some } j \in \{q+1, \ldots, p\}

    using the likelihood ratio statistic

    LR = 2\left(l_{big} - l_{small}\right)

    where l_m is the maximised log-likelihood under model m, i.e. l(\hat{\boldsymbol{\beta}}_m).

    Under the null hypothesis, LR is approximately χ²_d, where d = p − q.
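
  • In R, nested glms can be compared with anova; a sketch, with hypothetical covariates x1 and x2:

    # Likelihood ratio (chi-squared) test comparing nested logistic models
    small <- glm(cbind(y, n - y) ~ x1, family = binomial)
    big   <- glm(cbind(y, n - y) ~ x1 + x2, family = binomial)
    anova(small, big, test = "Chisq")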

Binomial Responses and glm

  • Now we would like to fit our candidate models. Binomial responses can be specified to glm in three ways:
    • a numeric vector giving the proportion of successes y_i / n_i, in which case a vector of the prior weights n_i must be passed to the weights argument
    • a numeric 0/1 vector (0 = failure), a logical vector (FALSE = failure), or a factor (first level = failure)
    • a two-column matrix with the number of successes and the number of failures
  • Better starting values are generated when the third format is used.
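
  • For example, with hypothetical y (successes), n (trials), covariate x, and a 0/1 vector outcome:

    # Three equivalent ways to specify a binomial response in glm
    glm(y/n ~ x, family = binomial, weights = n)   # proportions plus prior weights
    glm(outcome ~ x, family = binomial)            # ungrouped 0/1, logical or factor
    glm(cbind(y, n - y) ~ x, family = binomial)    # successes and failures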

Goodness-of-fit

  • Notice that the deviance

    D = 2\phi\left(l_{sat} - l_{mod}\right)

    is ϕ times the likelihood ratio statistic comparing the fitted model to the saturated model.

    Therefore the deviance can be used as a goodness-of-fit statistic, tested against χ²_{n−p}.

    A well-fitting model will have

    \frac{D}{\phi} \approx \text{d.f.}

Interpretation of Logistic Models

  • Consider the logistic model

    \log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1i}

    If we increase x_1 by one unit

    \begin{aligned} \log\left(\frac{p_i}{1 - p_i}\right) &= \beta_0 + \beta_1(x_{1i} + 1) \\ &= \beta_0 + \beta_1 x_{1i} + \beta_1 \\ \Rightarrow \quad \frac{p_i}{1 - p_i} &= \exp(\beta_0 + \beta_1 x_{1i}) \exp(\beta_1) \end{aligned}

    i.e. the odds are multiplied by exp(β_1).
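
  • In R, the estimated odds ratios can be read off a fitted model directly (fit is generic here):

    # Multiplicative effects on the odds of a one-unit increase in each covariate
    exp(coef(fit))
    exp(confint(fit))   # profile confidence intervals on the odds-ratio scale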

Wald Confidence Intervals

  • Confidence intervals for the parameters can be based on the asymptotic normal distribution of β̂_j.

    For example, a 95% confidence interval would be given by

    \hat{\beta}_j \pm 1.96 \times \text{s.e.}(\hat{\beta}_j)

    Such confidence intervals can be obtained as follows:

    confint.default(parr)

Profiling the Deviance

  • The Wald confidence intervals used standard errors based on the second-order Taylor expansion of the log-likelihood at β̂.

  • An alternative approach is to profile the log-likelihood, or equivalently the deviance, around each β̂_j and base confidence intervals on this.

    We set β_j to some β̃_j ≠ β̂_j and re-fit the model to maximise the likelihood (i.e. minimise the deviance) under this constraint. Repeating this for a range of values around β̂_j gives a deviance profile for that parameter.

Likelihood Ratio Test

  • To test the hypothesis

    H_0: \beta_j = \tilde{\beta}_j \quad \text{versus} \quad H_1: \beta_j = \hat{\beta}_j

    we can use the likelihood ratio statistic

    2\left(l(\hat{\beta}_j) - l(\tilde{\beta}_j)\right)

    which is asymptotically distributed as χ²_1. Thus

    \begin{aligned} \tau &= \operatorname{sign}(\tilde{\beta}_j - \hat{\beta}_j) \sqrt{2\left(l(\hat{\beta}_j) - l(\tilde{\beta}_j)\right)} \\ &= \operatorname{sign}(\tilde{\beta}_j - \hat{\beta}_j) \sqrt{\left(D(\tilde{\beta}_j) - D(\hat{\beta}_j)\right)/\phi} \end{aligned}

    is asymptotically N(0, 1) and is analogous to the Wald statistic.

Profile Plots

  • If the log-likelihood were quadratic about β̂_j, then a plot of τ vs. β̃_j would be a straight line.

  • We can obtain such a plot as follows

    plot(profile(parr, "ldose"))

Profile Confidence Intervals

  • Rather than use the quadratic approximation, we can directly estimate the values of β_j for which τ = ±1.96 to obtain a 95% confidence interval for β_j.

  • This is the method used by confint.glm:

    confint(parr)

    Notice the confidence intervals are asymmetric.

Prediction

  • The predict method for GLMs has a type argument, which may be specified as

    • "link" for predictions of η\eta
    • "reponse" for predictions of μ\mu

    If no new data is passed to predict, these options return object$linear.predictor and object$fitted.values respectively.
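
  • For example, with a fitted model fit and a hypothetical data frame newdata of covariate values:

    # Predictions on the linear predictor and response scales
    predict(fit, newdata, type = "link")       # eta-hat
    predict(fit, newdata, type = "response")   # mu-hat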

Residual Analysis

  • The deviance residuals can be used to check the model as with Normal models.

  • The standardized residuals for binomial data should have an approximate normal distribution, provided the number of observations for each covariate pattern is not too small.

    par(mfrow = c(2, 2))

    plot(parr, 1:4)

Deviance for Binary Data

  • We have seen that the deviance may be viewed as a likelihood ratio statistic with approximate distribution χ²_{n−p}.

    However, the χ² distribution of the likelihood ratio statistic is based on the limit as n → ∞ with the numbers of parameters in the two nested models both fixed. This does not apply to the deviance.

    The χ²_{n−p} distribution is still reasonable when the information content of each observation is large, e.g. Binomial models with large n_i, Poisson models with large μ_i, or Gamma models with small ϕ.

  • For binary data, the χ² approximation does not apply.

    In fact, for the logistic regression model it can be shown that

    D = -2 \sum_{i=1}^{n} \left\{ \hat{p}_i \log\left[\hat{p}_i / (1 - \hat{p}_i)\right] + \log(1 - \hat{p}_i) \right\}

    which depends on the y_i only through the fitted values p̂_i, and therefore can tell us nothing about the agreement between y_i and p̂_i.

    Instead we shall analyse the residuals and consider alternative models.

Residual Plots for Binary Data

  • For binary data, or binomial data where n_i is small for most covariate patterns, there are few distinct values of the residuals and the plots may be uninformative.

    Therefore we focus on identifying “large” residuals.

Grouping the Data

  • A useful technique for evaluating models fitted to binary data is to group the data and treat them as binomial instead.

    We select category boundaries to give roughly equal numbers in each category.

    Then we compute the proportions from the original data.
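
  • A sketch in R, assuming a continuous covariate x and a 0/1 response y; the use of deciles here is just one reasonable choice:

    # Group x into ten categories with roughly equal counts, then compute
    # the observed proportion of successes within each group
    breaks <- quantile(x, probs = seq(0, 1, length.out = 11))
    group  <- cut(x, breaks, include.lowest = TRUE)
    tapply(y, group, mean)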

Count Data

  • Often counts are based on events that may be assumed to arise from a Poisson process, where

    • counts are observed over a fixed time interval
    • the probability of an event is approximately proportional to the length of the interval, for small intervals of time
    • for small intervals of time, the probability of more than one event is negligible compared to the probability of one event
    • the numbers of events in non-overlapping time intervals are independent
  • In such situations, the counts can be assumed to follow a Poisson distribution, say

    Y_i \sim \operatorname{Poisson}(\lambda_i)

Rate Data

  • In many cases we are making comparisons across observation units i = 1, …, n with different levels of exposure to the event, and hence the measure of interest is the rate of occurrence, e.g.

    • number of household burglaries per 10,000 households in city ii in a given year
    • number of customers served per hour by salesperson ii in a given month
    • number of train accidents per billion train-kilometers in year ii
  • Since the counts are Poisson distributed, we would like to use a glm to model the expected rate, λ_i / t_i, where t_i is the exposure for unit i.

    Typically explanatory variables have a multiplicative effect rather than an additive effect on the expected rate, so a suitable model is

    \log(\lambda_i / t_i) = \beta_0 + \sum_{r=1}^{p} x_{ir} \beta_r \quad \Rightarrow \quad \log(\lambda_i) = \log(t_i) + \beta_0 + \sum_{r=1}^{p} x_{ir} \beta_r

    i.e. a Poisson glm with the canonical log link.
    This is known as a log-linear model.

Offsets

  • The standardizing term log(t_i) is an example of an offset: a term with a fixed coefficient of 1.
  • Offsets are easily specified to glm, either using the offset argument or using the offset function in the formula, e.g. offset(time).
  • If all the observations have the same exposure, the model does not need an offset term and we can model log(λ_i) directly.
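
  • A sketch, assuming variables count, x and time in the working data:

    # Two equivalent ways to include log-exposure as an offset
    glm(count ~ x, family = poisson, offset = log(time))
    glm(count ~ x + offset(log(time)), family = poisson)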

Overdispersion

  • Lack of fit may be due to inadequate specification of the model, but another possibility when modelling discrete data is overdispersion.

  • Under the Poisson or Binomial model, we have a fixed mean-variance relationship:

    \operatorname{var}(Y_i) = V(\mu_i)

    Overdispersion occurs when

    \operatorname{var}(Y_i) > V(\mu_i)

    This may occur due to correlated responses or variability between observational units.

  • We can adjust for overdispersion by estimating a dispersion parameter

    \operatorname{var}(Y_i) = \phi V(\mu_i)

    This changes the assumed distribution of our response to a distribution for which we do not have the full likelihood. However, the score equations in the IWLS

    \frac{\partial l}{\partial \beta_j} = \sum_{i=1}^{n} \frac{a_i (y_i - \mu_i)}{V(\mu_i)} \times \frac{x_{ij}}{g'(\mu_i)} = 0

    only require the variance function, so we can still obtain estimates for the parameters. Note that the score equations do not depend on ϕ, so we will obtain the same estimates as if ϕ = 1.

  • This approach is known as quasi-likelihood estimation. Whilst estimating ϕ does not affect the parameter estimates, it will change inference based on the model.

    The asymptotic theory for maximum likelihood also applies to quasi-likelihood; in particular, β̂ is approximately distributed as

    N\left(\boldsymbol{\beta}, \phi\left(X^T \hat{W} X\right)^{-1}\right)

    so compared to the case with ϕ = 1, the standard errors of the parameters are multiplied by √ϕ.
    Since ϕ is estimated, Wald tests based on the Normal assumption become t rather than z tests.

  • The deviance based on the likelihood of the exponential family distribution with the same variance function may be used as a quasi-deviance. Since ϕ is estimated rather than fixed at 1, nested models are compared by referring

    \left\{ D_{small} - D_{big} \right\} / \left\{ \hat{\phi}\left(p_{big} - p_{small}\right) \right\}

    to the F distribution with (p_{big} − p_{small}, n − p_{big}) degrees of freedom.

    The AIC is undefined for quasi-likelihood models.
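
  • A sketch in R, with count, x and the nested fits hypothetical:

    # Quasi-Poisson fit: same estimates as Poisson, dispersion estimated
    fit_q <- glm(count ~ x, family = quasipoisson)
    summary(fit_q)   # reports the estimated dispersion; coefficients get t-tests
    # nested quasi models are compared with an F test
    anova(fit_small, fit_big, test = "F")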

— Jul 15, 2022

§2 Generalized Linear Models by Lu Meng is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permissions beyond the scope of this license may be available at About.