
§2 Generalized Linear Models

  1. Introduction to Generalized Linear Models
    1. The General Linear Model
    2. Error structure
    3. Restrictions of Linear Models
    4. Exponential Family
    5. Generalized Linear Models (GLMs)
    6. Canonical Links
    7. Normal General Linear Model as a Special Case
    8. Modelling Binomial Data
    9. Modelling Poisson Data
    10. Transformation vs. GLM
    11. Estimation of the Model Parameters
    12. Standard Errors
    13. Wald Tests
    14. Deviance
    15. Residual Analysis
  2. Binary Data
    1. Exploring Binary Data
    2. Models for Binary Data
    3. Choice of Link
    4. Scatterplot Scales
    5. Nested models
    6. Binomial Responses and glm
    7. Goodness-of-fit
    8. Interpretation of Logistic Models
    9. Wald Confidence Intervals
    10. Profiling the Deviance
    11. Likelihood Ratio Test
    12. Profile Plots
    13. Profile Confidence Intervals
    14. Prediction
    15. Residual Analysis
    16. Deviance for Binary Data
    17. Residual Plots for Binary Data
    18. Grouping the Data
  3. Count Data
    1. Rate Data
    2. Offsets
    3. Overdispersion

Introduction to Generalized Linear Models

The General Linear Model

  • In a general linear model

    y_{i}=\beta_{0}+\beta_{1} x_{1 i}+\ldots+\beta_{p} x_{p i}+\epsilon_{i}

    the response y_i, i = 1, …, n, is modelled by a linear function of the explanatory variables x_j, j = 1, …, p, plus an error term.

Error structure

  • We assume that the errors ϵ_i are independent and identically distributed such that

    E[\epsilon_i] = 0 \quad \text{and} \quad \operatorname{var}[\epsilon_i] = \sigma^2

    Typically we assume

    \epsilon_i \sim N(0, \sigma^2)

    as a basis for inference, e.g. t-tests on parameters.

Restrictions of Linear Models

  • Although a very useful framework, there are some situations where general linear models are not appropriate
    • the range of Y is restricted (e.g. binary, count)
    • the variance of Y depends on the mean
  • Generalized linear models extend the general linear model framework to address both of these issues

Exponential Family

  • Most of the commonly used statistical distributions, e.g. Normal, Binomial and Poisson, are members of the exponential family of distributions whose densities can be written in the form

    f(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{\phi}\right\} \exp\left[c(y, \phi)\right]

    where ϕ is the dispersion parameter and θ is the canonical parameter.

  • It can be shown that

    E(Y) = b'(\theta) = \mu \quad \text{and} \quad \operatorname{var}(Y) = \phi\, b''(\theta) = \phi V(\mu)
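
  • For example, the Poisson(λ) probability mass function can be written in this form:

    f(y; \lambda) = \frac{e^{-\lambda} \lambda^{y}}{y!} = \exp\left\{y \log \lambda - \lambda\right\} \exp\left\{-\log y!\right\}

    so θ = log λ, b(θ) = e^θ, ϕ = 1 and c(y, ϕ) = −log y!. The identities above then give E(Y) = b'(θ) = e^θ = λ and var(Y) = b''(θ) = λ, i.e. V(μ) = μ.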

Generalized Linear Models (GLMs)

  • A generalized linear model is made up of a linear predictor

    \eta_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi}

    and two functions

    • a link function that describes how the mean, E(Y_i) = μ_i, depends on the linear predictor

      g(\mu_i) = \eta_i

    • a variance function that describes how the variance, var(Y_i), depends on the mean

      \operatorname{var}(Y_i) = \phi V(\mu_i)

      where the dispersion parameter ϕ is a constant.

  • For a glm where the response follows an exponential family distribution, we have

    g(\mu_i) = g(b'(\theta_i)) = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi}

  • The canonical link is defined as

    g = (b')^{-1} \quad \Rightarrow \quad g(\mu_i) = \theta_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi}

  • Canonical links lead to desirable statistical properties of the glm and hence tend to be used by default. However, there is no a priori reason why the systematic effects in the model should be additive on the scale given by this link.

Normal General Linear Model as a Special Case

  • For the general linear model with ϵ ∼ N(0, σ²) we have the linear predictor

    \eta_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi}

    the link function

    g(\mu_i) = \mu_i

    and the variance function

    V(\mu_i) = 1

Modelling Binomial Data

  • Suppose

    Y_i \sim \operatorname{Binomial}(n_i, p_i)

    and we wish to model the proportions Y_i / n_i. Then

    E(Y_i / n_i) = p_i \qquad \operatorname{var}(Y_i / n_i) = \frac{1}{n_i} p_i(1 - p_i)

    So our variance function is

    V(\mu_i) = \mu_i(1 - \mu_i)

    Our link function must map from (0, 1) to (−∞, ∞). A common choice is

    g(\mu_i) = \operatorname{logit}(\mu_i) = \log\left(\frac{\mu_i}{1 - \mu_i}\right)

Modelling Poisson Data

  • Suppose

    Y_i \sim \operatorname{Poisson}(\lambda_i)

    Then

    E(Y_i) = \lambda_i \qquad \operatorname{var}(Y_i) = \lambda_i

    So our variance function is

    V(\mu_i) = \mu_i

    Our link function must map from (0, ∞) to (−∞, ∞). A natural choice is

    g(\mu_i) = \log(\mu_i)

Transformation vs. GLM

  • In some situations a response variable can be transformed to improve linearity and homogeneity of variance so that a general linear model can be applied.
  • This approach has some drawbacks
    • response variable has changed!
    • transformation must simultaneously improve linearity and homogeneity of variance
    • transformation may not be defined on the boundaries of the sample space

Estimation of the Model Parameters

  • A single algorithm can be used to estimate the parameters of an exponential family glm using maximum likelihood.

  • The log-likelihood for the sample y_1, …, y_n is

    l = \sum_{i=1}^{n} \frac{y_i \theta_i - b(\theta_i)}{\phi_i} + c(y_i, \phi_i)

  • The maximum likelihood estimates are obtained by solving the score equations

    s(\beta_j) = \frac{\partial l}{\partial \beta_j} = \sum_{i=1}^{n} \frac{y_i - \mu_i}{\phi_i V(\mu_i)} \times \frac{x_{ji}}{g'(\mu_i)} = 0

    for the parameters β_j.

  • We assume that

    \phi_i = \frac{\phi}{a_i}

    where ϕ is a single dispersion parameter and the a_i are known prior weights; for example, binomial proportions with known index n_i have ϕ = 1 and a_i = n_i.

    The estimating equations are then

    \frac{\partial l}{\partial \beta_j} = \sum_{i=1}^{n} \frac{a_i (y_i - \mu_i)}{V(\mu_i)} \times \frac{x_{ij}}{g'(\mu_i)} = 0

    which do not depend on ϕ (which may be unknown).

  • A general method of solving score equations is the iterative algorithm known as Fisher's Method of Scoring (derived from a Taylor expansion of s(β)).

    In the r-th iteration, the new estimate β^{(r+1)} is obtained from the previous estimate β^{(r)} by

    \boldsymbol{\beta}^{(r+1)} = \boldsymbol{\beta}^{(r)} - E\left(H(\boldsymbol{\beta}^{(r)})\right)^{-1} s(\boldsymbol{\beta}^{(r)})

    where H is the Hessian matrix: the matrix of second derivatives of the log-likelihood.

  • It turns out that the updates can be written as

    \boldsymbol{\beta}^{(r+1)} = \left(X^T W^{(r)} X\right)^{-1} X^T W^{(r)} \boldsymbol{z}^{(r)}

    i.e. the solution of the score equations for a weighted least squares regression of z^{(r)} on X with weights W^{(r)} = diag(w_i^{(r)}), where

    z_i^{(r)} = \eta_i^{(r)} + \left(y_i - \mu_i^{(r)}\right) g'\left(\mu_i^{(r)}\right) \quad \text{and} \quad w_i^{(r)} = \frac{a_i}{V\left(\mu_i^{(r)}\right) \left(g'\left(\mu_i^{(r)}\right)\right)^2}

  • Hence the estimates can be found using an Iteratively (Re-)Weighted Least Squares algorithm:

    1. Start with initial estimates μ_i^{(0)}
    2. Calculate the working responses z_i^{(r)} and working weights w_i^{(r)}
    3. Calculate β^{(r+1)} by weighted least squares
    4. Repeat 2 and 3 until convergence

    For models with the canonical link, this is simply the Newton-Raphson method.
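
  • As an illustration, here is a minimal IWLS sketch in R for a Poisson glm with the canonical log link (so a_i = 1, V(μ) = μ and g'(μ) = 1/μ); the data x and y are simulated purely for the example:

    # Minimal IWLS sketch for a Poisson log-link model (illustrative only)
    set.seed(1)
    x <- runif(50)
    y <- rpois(50, exp(1 + 2 * x))          # simulated data for illustration
    X <- cbind(1, x)                        # design matrix
    beta <- c(log(mean(y)), 0)              # crude starting values
    for (r in 1:25) {
      eta <- drop(X %*% beta)               # linear predictor
      mu  <- exp(eta)                       # inverse link
      z   <- eta + (y - mu) / mu            # working response: eta + (y - mu) g'(mu)
      w   <- mu                             # working weight: a / (V(mu) g'(mu)^2)
      beta_old <- beta
      beta <- drop(solve(t(X) %*% (w * X), t(X) %*% (w * z)))
      if (max(abs(beta - beta_old)) < 1e-8) break
    }
    beta                                    # agrees with coef(glm(y ~ x, family = poisson))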

Standard Errors

  • The estimates β̂ have the usual properties of maximum likelihood estimators. In particular, β̂ is asymptotically

    N\left(\boldsymbol{\beta}, i^{-1}\right)

    where

    i(\boldsymbol{\beta}) = \phi^{-1} X^T W X

    Standard errors for the β_j may therefore be calculated as the square roots of the diagonal elements of

    \widehat{\operatorname{cov}}(\hat{\boldsymbol{\beta}}) = \phi\left(X^T \hat{W} X\right)^{-1}

    in which (X^T Ŵ X)^{-1} is a by-product of the final IWLS iteration. If ϕ is unknown, an estimate is required.

  • There are practical difficulties in estimating the dispersion ϕ by maximum likelihood.

    Therefore it is usually estimated by the method of moments. If β were known, an unbiased estimator of ϕ = a_i var(Y_i) / V(μ_i) would be

    \frac{1}{n} \sum_{i=1}^{n} \frac{a_i (y_i - \mu_i)^2}{V(\mu_i)}

    Allowing for the fact that β must be estimated, we obtain

    \frac{1}{n - p} \sum_{i=1}^{n} \frac{a_i (y_i - \mu_i)^2}{V(\mu_i)}
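
  • In R, this moment estimator can be computed from the Pearson residuals of a fitted model; a sketch, assuming a fitted glm object fit:

    # Moment (Pearson) estimate of the dispersion: sum of squared
    # Pearson residuals divided by the residual degrees of freedom
    phi_hat <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)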

Wald Tests

  • For non-Normal data, we can use the fact that asymptotically

    \hat{\boldsymbol{\beta}} \sim N\left(\boldsymbol{\beta}, \phi\left(X^T W X\right)^{-1}\right)

    and use a z-test to test the significance of a coefficient. Specifically, we test

    H_0: \beta_j = 0 \quad \text{versus} \quad H_1: \beta_j \neq 0

    using the test statistic

    z_j = \frac{\hat{\beta}_j}{\sqrt{\phi\left(X^T \hat{W} X\right)^{-1}_{jj}}}

    which is asymptotically N(0, 1) under H_0.
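
  • These are the statistics reported by summary for a fitted glm; a sketch, again assuming a fitted object fit:

    # Wald statistics and p-values for each coefficient
    summary(fit)$coefficients
    # equivalently, by hand (vcov returns phi * (X^T W X)^(-1)):
    coef(fit) / sqrt(diag(vcov(fit)))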

Deviance

  • The deviance of a model is defined as

    D = 2\phi\left(l_{sat} - l_{mod}\right)

    where l_{mod} is the log-likelihood of the fitted model and l_{sat} is the log-likelihood of the saturated model.

  • In the saturated model, the number of parameters is equal to the number of observations, so ŷ = y.

  • For linear regression with Normal data, the deviance is equal to the residual sum of squares.

Residual Analysis

  • Several kinds of residuals can be defined for GLMs:
    • response: y_i − μ̂_i

    • working: from the working response in the IWLS algorithm

    • Pearson:

      r_i^P = \frac{y_i - \hat{\mu}_i}{\sqrt{V(\hat{\mu}_i)}}

      such that Σ_i (r_i^P)² equals the generalized Pearson statistic.

    • deviance: r_i^D, such that Σ_i (r_i^D)² equals the deviance.

  • These definitions are all equivalent for Normal models.
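
  • In R, all four types can be extracted from a fitted glm object (fit is again a generic fitted model):

    # Residual types for a fitted glm; "deviance" is the default
    residuals(fit, type = "response")
    residuals(fit, type = "working")
    residuals(fit, type = "pearson")
    residuals(fit, type = "deviance")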

Binary Data

  • Binary data may occur in two forms:
    • ungrouped in which the variable can take one of two values, say success/failure
    • grouped in which the variable is the number of successes in a given number of trials
  • The natural distribution for such data is the Binomial(n, p) distribution, where in the first case n = 1.

Exploring Binary Data

  • If our aim is to model a binary response, we would first like to explore the relationship between that response and potential explanatory variables.
  • When the explanatory variables are categorical, a simple approach is to calculate proportions within subgroups of the data.
  • When some of the explanatory variables are continuous, plots can be more helpful.

Models for Binary Data

  • In Part I we saw that Binomial data may be modelled by a glm, with the canonical logit link. This model is known as the logistic regression model and is the most popular for binary data.
  • There are two other links commonly used in practice:
    • probit link: g(μ_i) = Φ^{-1}(μ_i), where Φ denotes the cumulative distribution function of N(0, 1)
    • complementary log-log link: g(μ_i) = log(−log(1 − μ_i))
  • The logit and probit functions are symmetric and, once their variances are equated, are very similar. Therefore it is usually difficult to choose between them on the grounds of fit.
  • The logit is usually preferred over the probit because of its simple interpretation as the logarithm of the odds of success, p_i / (1 − p_i).
  • The complementary log-log is asymmetric and may therefore be useful when the logit and probit links are inappropriate. We will concentrate on using the logit link.

Scatterplot Scales

  • When fitting a logistic model, it can also be helpful to plot the data on the logit scale.

  • To avoid dividing by zero, we calculate the empirical logits

    \log\left(\frac{(y_i + 0.5)/(n_i + 1)}{1 - (y_i + 0.5)/(n_i + 1)}\right) = \log\left(\frac{y_i + 0.5}{n_i - y_i + 0.5}\right)
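
  • A sketch in R, assuming vectors y (successes), n (trials) and a continuous explanatory variable x:

    # Empirical logits, adding 0.5 to both counts to avoid log(0)
    emp_logit <- log((y + 0.5) / (n - y + 0.5))
    plot(x, emp_logit)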

Nested models

  • Nested models: Each model is a special case of the models that have a greater number of terms.

  • We can compare nested models by testing the hypothesis that some of the parameters of a larger model are equal to zero.

  • For example, suppose we have the model

    \operatorname{logit}(p_i) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p

    we can test

    H_0: \beta_{q+1} = \ldots = \beta_p = 0 \quad \text{versus} \quad H_1: \beta_j \neq 0 \text{ for some } j \in \{q+1, \ldots, p\}

    using the likelihood ratio statistic

    LR = 2\left(l_{big} - l_{small}\right)

    where l_m is the maximised log-likelihood under model m, i.e. l(\hat{\boldsymbol{\beta}}_m).

    Under the null hypothesis, LR is approximately χ²_d, where d = p − q.
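
  • In R, nested glms can be compared with anova; a sketch, with hypothetical covariates x1 and x2:

    # Likelihood ratio (chi-squared) test comparing nested logistic models
    small <- glm(cbind(y, n - y) ~ x1, family = binomial)
    big   <- glm(cbind(y, n - y) ~ x1 + x2, family = binomial)
    anova(small, big, test = "Chisq")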

Binomial Responses and glm

  • Now we would like to fit our candidate models. Binomial responses can be specified to glm in three ways:
    • a numeric vector giving the proportion of successes y_i / n_i, in which case a vector of the prior weights n_i must be passed to the weights argument
    • a numeric 0/1 vector (0 = failure), a logical vector (FALSE = failure), or a factor (first level = failure)
    • a two-column matrix with the number of successes and the number of failures
  • Better starting values are generated when the third format is used.
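
  • For example, with hypothetical y (successes), n (trials), covariate x, and a 0/1 vector outcome:

    # Three equivalent ways to specify a binomial response in glm
    glm(y/n ~ x, family = binomial, weights = n)   # proportions plus prior weights
    glm(outcome ~ x, family = binomial)            # ungrouped 0/1, logical or factor
    glm(cbind(y, n - y) ~ x, family = binomial)    # successes and failures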

Goodness-of-fit

  • Notice that the deviance

    D = 2\phi\left(l_{sat} - l_{mod}\right)

    is ϕ times the likelihood ratio statistic comparing the fitted model to the saturated model.

    Therefore the deviance can be used as a goodness-of-fit statistic, tested against χ²_{n−p}.

    A well-fitting model will have

    \frac{D}{\phi} \approx \text{d.f.}

Interpretation of Logistic Models

  • Consider the logistic model

    \log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1i}

    If we increase x_1 by one unit

    \begin{aligned} \log\left(\frac{p_i}{1 - p_i}\right) &= \beta_0 + \beta_1(x_{1i} + 1) \\ &= \beta_0 + \beta_1 x_{1i} + \beta_1 \\ \Rightarrow \quad \frac{p_i}{1 - p_i} &= \exp(\beta_0 + \beta_1 x_{1i}) \exp(\beta_1) \end{aligned}

    i.e. the odds are multiplied by exp(β_1).
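
  • In R, the estimated odds ratios can be read off a fitted model directly (fit is generic here):

    # Multiplicative effects on the odds of a one-unit increase in each covariate
    exp(coef(fit))
    exp(confint(fit))   # profile confidence intervals on the odds-ratio scale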

Wald Confidence Intervals

  • Confidence intervals for the parameters can be based on the asymptotic normal distribution of β̂_j.

    For example, a 95% confidence interval would be given by

    \hat{\beta}_j \pm 1.96 \times \text{s.e.}(\hat{\beta}_j)

    Such confidence intervals can be obtained as follows:

    confint.default(parr)

Profiling the Deviance

  • The Wald confidence intervals used standard errors based on the second-order Taylor expansion of the log-likelihood at β̂.

  • An alternative approach is to profile the log-likelihood, or equivalently the deviance, around each β̂_j and base confidence intervals on this.

    We set β_j to some β̃_j ≠ β̂_j and re-fit the model to maximise the likelihood (i.e. minimise the deviance) under this constraint. Repeating this for a range of values around β̂_j gives a deviance profile for that parameter.

Likelihood Ratio Test

  • To test the hypothesis

    H_0: \beta_j = \tilde{\beta}_j \quad \text{versus} \quad H_1: \beta_j = \hat{\beta}_j

    we can use the likelihood ratio statistic

    2\left(l(\hat{\beta}_j) - l(\tilde{\beta}_j)\right)

    which is asymptotically distributed as χ²_1. Thus

    \begin{aligned} \tau &= \operatorname{sign}(\tilde{\beta}_j - \hat{\beta}_j) \sqrt{2\left(l(\hat{\beta}_j) - l(\tilde{\beta}_j)\right)} \\ &= \operatorname{sign}(\tilde{\beta}_j - \hat{\beta}_j) \sqrt{\left(D(\tilde{\beta}_j) - D(\hat{\beta}_j)\right)/\phi} \end{aligned}

    is asymptotically N(0, 1) and is analogous to the Wald statistic.

Profile Plots

  • If the log-likelihood were quadratic about β̂_j, then a plot of τ vs. β̃_j would be a straight line.

  • We can obtain such a plot as follows

    plot(profile(parr, "ldose"))

Profile Confidence Intervals

  • Rather than use the quadratic approximation, we can directly estimate the values of β_j for which τ = ±1.96 to obtain a 95% confidence interval for β_j.

  • This is the method used by confint.glm:

    confint(parr)

    Notice the confidence intervals are asymmetric.

Prediction

  • The predict method for GLMs has a type argument, which may be specified as

    • "link" for predictions of η\eta
    • "reponse" for predictions of μ\mu

    If no new data is passed to predict, these options return object$linear.predictor and object$fitted.values respectively.
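
  • For example, with a fitted model fit and a hypothetical data frame newdata of covariate values:

    # Predictions on the linear predictor and response scales
    predict(fit, newdata, type = "link")       # eta-hat
    predict(fit, newdata, type = "response")   # mu-hat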

Residual Analysis

  • The deviance residuals can be used to check the model as with Normal models.

  • The standardized residuals for binomial data should have an approximate normal distribution, provided the number of observations for each covariate pattern is not too small.

    par(mfrow = c(2, 2))

    plot(parr, 1:4)

Deviance for Binary Data

  • We have seen that the deviance may be viewed as a likelihood ratio statistic with approximate distribution χ²_{n−p}.

    However, the χ² distribution of the likelihood ratio statistic is based on the limit as n → ∞ with the numbers of parameters in the two nested models both fixed. This does not apply to the deviance.

    The χ²_{n−p} distribution is still reasonable when the information content of each observation is large, e.g. Binomial models with large n_i, Poisson models with large μ_i, or Gamma models with small ϕ.

  • For binary data, the χ² approximation does not apply.

    In fact, for the logistic regression model it can be shown that

    D = -2 \sum_{i=1}^{n} \left\{ \hat{p}_i \log\left[\hat{p}_i / (1 - \hat{p}_i)\right] + \log(1 - \hat{p}_i) \right\}

    which depends on the y_i only through the fitted values p̂_i, and therefore can tell us nothing about the agreement between y_i and p̂_i.

    Instead we shall analyse the residuals and consider alternative models.

Residual Plots for Binary Data

  • For binary data, or binomial data where n_i is small for most covariate patterns, there are few distinct values of the residuals and the plots may be uninformative.

    Therefore we focus on identifying “large” residuals.

Grouping the Data

  • A useful technique for evaluating models fitted to binary data is to group the data and treat them as binomial instead.

    We select category boundaries to give roughly equal numbers in each category.

    Then we compute the proportions from the original data.
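
  • A sketch in R, assuming a continuous covariate x and a 0/1 response y; the use of deciles here is just one reasonable choice:

    # Group x into ten categories with roughly equal counts, then compute
    # the observed proportion of successes within each group
    breaks <- quantile(x, probs = seq(0, 1, length.out = 11))
    group  <- cut(x, breaks, include.lowest = TRUE)
    tapply(y, group, mean)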

Count Data

  • Often counts are based on events that may be assumed to arise from a Poisson process, where

    • counts are observed over a fixed time interval
    • the probability of an event is approximately proportional to the length of the interval, for small intervals of time
    • for small intervals of time, the probability of more than one event is negligible compared to the probability of one event
    • the numbers of events in non-overlapping time intervals are independent
  • In such situations, the counts can be assumed to follow a Poisson distribution, say

    Y_i \sim \operatorname{Poisson}(\lambda_i)

Rate Data

  • In many cases we are making comparisons across observation units i = 1, …, n with different levels of exposure to the event, and hence the measure of interest is the rate of occurrence, e.g.

    • number of household burglaries per 10,000 households in city ii in a given year
    • number of customers served per hour by salesperson ii in a given month
    • number of train accidents per billion train-kilometers in year ii
  • Since the counts are Poisson distributed, we would like to use a glm to model the expected rate, λ_i / t_i, where t_i is the exposure for unit i.

    Typically explanatory variables have a multiplicative effect rather than an additive effect on the expected rate, so a suitable model is

    \log(\lambda_i / t_i) = \beta_0 + \sum_{r=1}^{p} x_{ir} \beta_r \quad \Rightarrow \quad \log(\lambda_i) = \log(t_i) + \beta_0 + \sum_{r=1}^{p} x_{ir} \beta_r

    i.e. a Poisson glm with the canonical log link.
    This is known as a log-linear model.

Offsets

  • The standardizing term log(t_i) is an example of an offset: a term with a fixed coefficient of 1.
  • Offsets are easily specified to glm, either using the offset argument or using the offset function in the formula, e.g. offset(time).
  • If all the observations have the same exposure, the model does not need an offset term and we can model log(λ_i) directly.
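
  • A sketch, assuming variables count, x and time in the working data:

    # Two equivalent ways to include log-exposure as an offset
    glm(count ~ x, family = poisson, offset = log(time))
    glm(count ~ x + offset(log(time)), family = poisson)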

Overdispersion

  • Lack of fit may be due to inadequate specification of the model, but another possibility when modelling discrete data is overdispersion.

  • Under the Poisson or Binomial model, we have a fixed mean-variance relationship:

    \operatorname{var}(Y_i) = V(\mu_i)

    Overdispersion occurs when

    \operatorname{var}(Y_i) > V(\mu_i)

    This may occur due to correlated responses or variability between observational units.

  • We can adjust for overdispersion by estimating a dispersion parameter

    \operatorname{var}(Y_i) = \phi V(\mu_i)

    This changes the assumed distribution of our response to a distribution for which we do not have the full likelihood. However, the score equations in the IWLS

    \frac{\partial l}{\partial \beta_j} = \sum_{i=1}^{n} \frac{a_i (y_i - \mu_i)}{V(\mu_i)} \times \frac{x_{ij}}{g'(\mu_i)} = 0

    only require the variance function, so we can still obtain estimates for the parameters. Note that the score equations do not depend on ϕ, so we will obtain the same estimates as if ϕ = 1.

  • This approach is known as quasi-likelihood estimation. Whilst estimating ϕ does not affect the parameter estimates, it will change inference based on the model.

    The asymptotic theory for maximum likelihood also applies to quasi-likelihood; in particular, β̂ is approximately distributed as

    N\left(\boldsymbol{\beta}, \phi\left(X^T \hat{W} X\right)^{-1}\right)

    so compared to the case with ϕ = 1, the standard errors of the parameters are multiplied by √ϕ.
    Since ϕ is estimated, Wald tests based on the Normal assumption become t rather than z tests.

  • The deviance based on the likelihood of the exponential family distribution with the same variance function may be used as a quasi-deviance. Since ϕ is estimated rather than fixed at 1, nested models are compared by referring

    \left\{ D_{small} - D_{big} \right\} / \left\{ \hat{\phi}\left(p_{big} - p_{small}\right) \right\}

    to the F distribution with (p_{big} − p_{small}, n − p_{big}) degrees of freedom.

    The AIC is undefined for quasi-likelihood models.
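
  • A sketch in R, with count, x and the nested fits hypothetical:

    # Quasi-Poisson fit: same estimates as Poisson, dispersion estimated
    fit_q <- glm(count ~ x, family = quasipoisson)
    summary(fit_q)   # reports the estimated dispersion; coefficients get t-tests
    # nested quasi models are compared with an F test
    anova(fit_small, fit_big, test = "F")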

— Jul 15, 2022

§2 Generalized Linear Models by Lu Meng is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permissions beyond the scope of this license may be available at About.