
Probabilistic Reasoning and Learning

  1. Conditional probability tables (CPTs)
    1. Noisy-OR CPT
    2. Sigmoid CPT
  2. d-separation
    1. Blocked paths
  3. Polytrees
    1. Inference
  4. Exact inference in loopy BNs
    1. Node clustering
    2. Cutset conditioning
  5. Approximate inference in loopy BNs
    1. Rejection sampling
    2. Likelihood weighting
    3. Markov chain Monte Carlo
  6. Learning in BNs
    1. Computing the log-likelihood
    2. Maximum likelihood CPTs
  7. Linear regression
    1. Gaussian conditional distribution
  8. Numerical optimization
    1. Gradient ascent
    2. Newton’s method
  9. Logistic regression
  10. Learning from incomplete data
    1. Computing the log-likelihood with incomplete data
  11. Auxiliary functions
  12. EM Algorithm
    1. E-step
    2. M-step (Learning)
  13. Hidden Markov models
    1. Parameters of HMMs
    2. Forward algorithm
    3. Viterbi algorithm
    4. Backward algorithm
  14. EM algorithm for HMMs
    1. E-step
    2. M-step
  15. Multivariate Gaussian distributions
    1. Maximum likelihood (ML) estimation
  16. Clustering with GMMs
    1. ML estimates for complete data
    2. EM updates for incomplete data

Conditional probability tables (CPTs)

Noisy-OR CPT

P\left(Y=1 \mid X_{1}, X_{2}, \ldots, X_{k}\right)=1-\prod_{i=1}^{k}\left(1-p_{i}\right)^{X_{i}}
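
A minimal NumPy sketch of the noisy-OR CPT, assuming p holds the per-parent probabilities p_i and x holds the parent values X_i in {0, 1}:

```python
import numpy as np

# Noisy-OR: each active parent i fails to turn Y on with probability 1 - p[i],
# and Y = 0 only if every active parent fails.
def noisy_or(p, x):
    p = np.asarray(p, dtype=float)
    x = np.asarray(x, dtype=float)
    return 1.0 - np.prod((1.0 - p) ** x)

# Example: parents 1 and 2 are active, parent 3 is not.
print(noisy_or([0.9, 0.5, 0.3], [1, 1, 0]))  # 1 - (0.1)(0.5) = 0.95
```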

Sigmoid CPT

P(Y=1 \mid X_1, X_2, \ldots, X_k) = \sigma\left(\sum_{i=1}^k \theta_i X_i\right)

\sigma(z)=\frac{1}{1+e^{-z}}
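
A matching sketch for the sigmoid CPT; theta is an assumed weight vector, one entry per parent:

```python
import numpy as np

# Sigmoid CPT: squash the weighted sum of the parents through the logistic function.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_cpt(theta, x):
    return sigmoid(np.dot(theta, x))

print(sigmoid_cpt([2.0, -1.0, 0.5], [1, 1, 0]))  # sigma(1.0) ≈ 0.73
```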

d-separation

Blocked paths

A path is blocked at an intermediate node Z, given evidence set E, if any of the following holds:

  1. Z \in E, edges align
  2. Z \in E, edges diverge
  3. Z \notin E, \mathrm{desc}(Z) \cap E = \varnothing, edges converge
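
A minimal sketch of these three rules, under the assumption that a path is given as (left edge, Z, right edge) triples, one per intermediate node, and that `descendants` is a hypothetical dict mapping each node to the set of its descendants:

```python
# E is the set of evidence nodes; '->' and '<-' give each edge's direction
# as the path is traversed from one endpoint to the other.
def path_is_blocked(triples, E, descendants):
    for left, Z, right in triples:
        aligned = left == right                        # prev -> Z -> next, or prev <- Z <- next
        diverging = (left == '<-' and right == '->')   # prev <- Z -> next
        converging = (left == '->' and right == '<-')  # prev -> Z <- next
        if (aligned or diverging) and Z in E:          # rules 1 and 2
            return True
        if converging and Z not in E and not (descendants[Z] & E):
            return True                                # rule 3
    return False

# Example: the collider A -> C <- B is blocked when neither C nor any of its
# descendants is observed.
print(path_is_blocked([('->', 'C', '<-')], E=set(), descendants={'C': set()}))  # True
```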

Polytrees

Inference

P(X=x \mid E) \propto \underbrace{P\left(X=x \mid E_{X}^{+}\right)}_{\text{from above}} \underbrace{P\left(E_{X}^{-} \mid X=x\right)}_{\text{from below}}

P\left(X=x \mid E_{X}^{+}\right)=\sum_{\vec{u}} \underbrace{P(X=x \mid \vec{U}=\vec{u})}_{\text{CPT}} \prod_{i=1}^{m} \underbrace{P\left(U_{i}=u_{i} \mid E_{U_{i} \backslash X}\right)}_{\text{recursive instance}}

P\left(E_{Y_{j} \backslash X} \mid X=x\right) \propto \sum_{y_{j}} \underbrace{P\left(E_{Y_{j}}^{-} \mid Y_{j}=y_{j}\right)}_{\text{recursive instance}} \sum_{\vec{z}_{j}} \underbrace{P\left(y_{j} \mid \vec{z}_{j}, x\right)}_{\text{CPT}} \prod_{k} \underbrace{P\left(z_{jk} \mid E_{Z_{jk} \backslash Y_{j}}\right)}_{\text{recursive instance}}

Exact inference in loopy BNs

Node clustering

  • CPTs grow exponentially when nodes are clustered.

Cutset conditioning

  • The number of times that we must run the polytree algorithm grows exponentially with the size of the cutset.

Approximate inference in loopy BNs

Rejection sampling

P(Q=q \mid E=e) \approx \frac{N(q, e)}{N(e)}=\frac{\sum_{i=1}^{N} I\left(q, q_{i}\right) I\left(e, e_{i}\right)}{\sum_{i=1}^{N} I\left(e, e_{i}\right)}
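
A minimal sketch on a made-up two-node network X -> E with P(X=1) = 0.3, P(E=1|X=1) = 0.8, P(E=1|X=0) = 0.1, estimating P(X=1 | E=1):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

x = rng.random(N) < 0.3               # sample X from its prior
p_e = np.where(x, 0.8, 0.1)           # P(E=1 | X) for each sample
e = rng.random(N) < p_e               # sample E given X

kept = e                              # keep only samples consistent with E = 1
print(kept.sum() / N)                 # acceptance rate, ≈ P(E=1) = 0.31
print((x & kept).sum() / kept.sum())  # ≈ 0.774, the exact posterior
```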

Likelihood weighting

P(Q=q \mid E=e) \approx \frac{\sum_{i=1}^{N} I\left(q, q_{i}\right) P\left(E=e \mid \mathrm{pa}(E)=p_{i}\right)}{\sum_{i=1}^{N} P\left(E=e \mid \mathrm{pa}(E)=p_{i}\right)}
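
The same made-up network X -> E, but with the evidence node clamped to E = 1 and each sample weighted by P(E=1 | pa(E)), so nothing is rejected:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

x = rng.random(N) < 0.3           # sample only the non-evidence node X
w = np.where(x, 0.8, 0.1)         # weight = P(E=1 | X)

print(np.sum(w * x) / np.sum(w))  # ≈ 0.774, matching rejection sampling
```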

Markov chain Monte Carlo

  • Initialization

    Fix evidence nodes to observed values e, e'.

    Initialize non-evidence nodes to random values.

  • Repeat N times

    Pick a non-evidence node X at random.

    Use Bayes' rule to compute P(X \mid B_X), where B_X is the Markov blanket of X.

    Resample x \sim P(X \mid B_X).

    Take a snapshot of all the nodes in the BN.

  • Estimate

    P(Q=q \mid E=e) \approx \frac{N(q)}{N}
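
A minimal Gibbs-style sketch on a made-up chain A -> B -> C with evidence C = 1, estimating P(A=1 | C=1); for simplicity it sweeps both non-evidence nodes in turn rather than picking one at random:

```python
import numpy as np

pA = 0.5                               # P(A=1)
pB_given_A = {1: 0.9, 0: 0.2}          # P(B=1 | A=a)
pC_given_B = {1: 0.7, 0: 0.1}          # P(C=1 | B=b)

rng = np.random.default_rng(0)
a, b, c = 1, 1, 1                      # C is fixed to its observed value; A, B start arbitrarily
N, count_a = 50_000, 0

for _ in range(N):
    # Resample A given its Markov blanket {B}: P(A=1 | b) ∝ P(A=1) P(b | A=1).
    w1 = pA * (pB_given_A[1] if b else 1 - pB_given_A[1])
    w0 = (1 - pA) * (pB_given_A[0] if b else 1 - pB_given_A[0])
    a = int(rng.random() < w1 / (w1 + w0))

    # Resample B given its Markov blanket {A, C}: P(B=1 | a, c) ∝ P(B=1 | a) P(c | B=1).
    w1 = pB_given_A[a] * (pC_given_B[1] if c else 1 - pC_given_B[1])
    w0 = (1 - pB_given_A[a]) * (pC_given_B[0] if c else 1 - pC_given_B[0])
    b = int(rng.random() < w1 / (w1 + w0))

    count_a += a                       # snapshot of the query node

print(count_a / N)                     # ≈ 0.744 by exact enumeration
```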

Learning in BNs

Computing the log-likelihood

\begin{aligned} \mathcal{L} &= \sum_{i=1}^{n} \sum_{t=1}^{T} \log P\left(x_{i}^{(t)} \mid \mathrm{pa}_{i}^{(t)}\right) \\ &= \sum_{i=1}^{n} \sum_{x} \sum_{\pi} \operatorname{count}\left(X_{i}=x, \mathrm{pa}_{i}=\pi\right) \log P\left(X_{i}=x \mid \mathrm{pa}_{i}=\pi\right) \end{aligned}

Maximum likelihood CPTs

P_{\mathrm{ML}}(X_i=x)=\frac{\operatorname{count}(X_i=x)}{T}=\frac{1}{T} \sum_{t} I\left(x_{it}, x\right)

P_{\mathrm{ML}}(X_i=x \mid \mathrm{pa}_i=\pi)=\frac{\operatorname{count}(X_i=x, \mathrm{pa}_i=\pi)}{\operatorname{count}(\mathrm{pa}_i=\pi)}=\frac{\sum_{t} I\left(x_{it}, x\right) I\left(\mathrm{pa}_{it}, \pi\right)}{\sum_{t} I\left(\mathrm{pa}_{it}, \pi\right)}
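
A minimal counting sketch for a single binary node with one binary parent, on made-up data arrays x and pa holding the T observed values:

```python
import numpy as np

x  = np.array([1, 0, 1, 1, 0, 1, 0, 1])   # observed values of X_i
pa = np.array([1, 0, 1, 0, 0, 1, 1, 1])   # observed values of its parent

for parent_value in (0, 1):
    mask = (pa == parent_value)
    p = np.sum(x[mask] == 1) / np.sum(mask)   # count(X=1, pa=π) / count(pa=π)
    print(f"P_ML(X=1 | pa={parent_value}) = {p:.3f}")
```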

Linear regression

Gaussian conditional distribution

P(y \mid \vec{x})=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(y-\vec{w}\cdot\vec{x})^2}{2\sigma^2}\right\}

\mathcal{L}(\vec{w}, \sigma^2)=-\frac{1}{2}\sum_{t=1}^T\left[\log(2\pi\sigma^2)+\frac{(y_t-\vec{w}\cdot\vec{x}_t)^2}{\sigma^2}\right]

\begin{aligned} \mathbf{A} &= \sum_{t} \vec{x}_{t}\, \vec{x}_{t}^{\top} \\ \vec{b} &= \sum_{t} y_{t}\, \vec{x}_{t} \\ \vec{w}_{\mathrm{ML}} &= \mathbf{A}^{-1} \vec{b} \end{aligned}
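
A minimal sketch of the closed-form solution on synthetic data, solving A w = b rather than forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 200, 3
X = rng.normal(size=(T, d))                  # rows are the inputs x_t
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=T)    # noisy targets

A = X.T @ X                   # sum_t x_t x_t^T
b = X.T @ y                   # sum_t y_t x_t
w_ml = np.linalg.solve(A, b)  # w_ML = A^{-1} b
print(w_ml)                   # ≈ [1.5, -2.0, 0.5]
```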

Numerical optimization

Gradient ascent

\vec{\theta} \leftarrow \vec{\theta} + \eta_{t} \left(\frac{\partial f}{\partial \vec{\theta}}\right)
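
A toy sketch with a constant step size on the concave function f(θ) = -(θ - 3)^2:

```python
# Gradient ascent: repeatedly step uphill along df/dθ = -2(θ - 3).
theta, eta = 0.0, 0.1
for _ in range(100):
    grad = -2.0 * (theta - 3.0)
    theta = theta + eta * grad
print(theta)   # ≈ 3.0, the maximizer
```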

Newton’s method

\vec{\theta} \leftarrow \vec{\theta} - \mathbf{H}^{-1} \nabla f

Logistic regression

P(Y=1 \mid \vec{x})=\sigma(\vec{w} \cdot \vec{x})=\frac{1}{1+e^{-\vec{w} \cdot \vec{x}}}

\begin{aligned} \sigma(z) &= \frac{1}{1+e^{-z}} \\ \sigma(-z) &= 1-\sigma(z) \\ \frac{d}{dz} \sigma(z) &= \sigma(z)\, \sigma(-z) \end{aligned}

\mathcal{L}(\vec{w})=\sum_{t=1}^{T}\left[y_{t} \log \sigma\left(\vec{w} \cdot \vec{x}_{t}\right)+\left(1-y_{t}\right) \log \sigma\left(-\vec{w} \cdot \vec{x}_{t}\right)\right]

\frac{\partial \mathcal{L}}{\partial w_{\alpha}}=\sum_{t=1}^{T} x_{\alpha t}\left[y_{t}-\sigma\left(\vec{w} \cdot \vec{x}_{t}\right)\right]

\frac{\partial^{2} \mathcal{L}}{\partial w_{\alpha} \partial w_{\beta}}=-\sum_{t=1}^{T} \sigma\left(\vec{w} \cdot \vec{x}_{t}\right) \sigma\left(-\vec{w} \cdot \vec{x}_{t}\right) x_{\alpha t}\, x_{\beta t}
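
A minimal sketch that plugs this gradient and Hessian into Newton's method, on synthetic data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
T, d = 500, 3
X = rng.normal(size=(T, d))
w_true = np.array([2.0, -1.0, 0.5])
y = (rng.random(T) < sigmoid(X @ w_true)).astype(float)   # labels drawn from the model

w = np.zeros(d)
for _ in range(10):
    p = sigmoid(X @ w)
    grad = X.T @ (y - p)                  # dL/dw_alpha
    H = -(X.T * (p * (1 - p))) @ X        # d^2 L / dw_alpha dw_beta
    w = w - np.linalg.solve(H, grad)      # Newton update: w <- w - H^{-1} grad L
print(w)                                  # roughly recovers w_true
```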

Learning from incomplete data

Computing the log-likelihood with incomplete data

\mathcal{L}=\left.\sum_{t=1}^{T} \log \sum_{h} \prod_{i=1}^{n} P\left(X_{i}=x_{i} \mid \mathrm{pa}_{i}=\pi_{i}\right)\right|_{\left\{H_{t}=h,\, V_{t}=v_{t}\right\}}

Auxiliary functions

\vec{\theta}_{\text{new}} = \arg\max_{\vec{\theta}}\, Q\left(\vec{\theta}, \vec{\theta}_{\text{old}}\right)

  1. Q(\vec{\theta}, \vec{\theta}) = f(\vec{\theta}) for all \vec{\theta}
  2. Q(\vec{\theta}', \vec{\theta}) \leq f(\vec{\theta}') for all \vec{\theta}, \vec{\theta}'
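
Together, these two properties guarantee that maximizing the auxiliary function never decreases f:

f(\vec{\theta}_{\text{new}}) \geq Q(\vec{\theta}_{\text{new}}, \vec{\theta}_{\text{old}}) \geq Q(\vec{\theta}_{\text{old}}, \vec{\theta}_{\text{old}}) = f(\vec{\theta}_{\text{old}})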

EM Algorithm

E-step

P\left(X_{i}=x \mid V_{t}=v_{t}\right)

P\left(X_i=x, \mathrm{pa}_i=\pi \mid V_t=v_t\right)

M-step (Learning)

P(X_i=x) \longleftarrow \frac{1}{T}\sum_{t=1}^T P(X_i=x \mid V_t=v_t)

P(X_i=x \mid \mathrm{pa}_i=\pi) \longleftarrow \frac{\sum_{t=1}^{T} P(X_i=x, \mathrm{pa}_i=\pi \mid V_t=v_t)}{\sum_{t=1}^{T} P(\mathrm{pa}_i=\pi \mid V_t=v_t)}

Hidden Markov models

Parameters of HMMs

a_{ij}=P(S_{t+1}=j \mid S_t=i)

b_{ik}=P(O_t=k \mid S_t=i)

\pi_i=P(S_1=i)

Forward algorithm

\alpha_{it}=P(o_1, o_2, \ldots, o_t, S_t=i)

\begin{aligned} \alpha_{i1} &= \pi_i\, b_i(o_1) \\ \alpha_{j,t+1} &= \sum_{i=1}^n \alpha_{it}\, a_{ij}\, b_j(o_{t+1}) \end{aligned}

P(o_1, o_2, \ldots, o_T)=\sum_{i=1}^n \alpha_{iT}
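
A minimal vectorized sketch with made-up parameters; a is the n x n transition matrix, b the n x m emission matrix, pi the initial distribution, and obs a list of observation indices:

```python
import numpy as np

def forward(a, b, pi, obs):
    n, T = len(pi), len(obs)
    alpha = np.zeros((T, n))
    alpha[0] = pi * b[:, obs[0]]                          # alpha_{i1} = pi_i b_i(o_1)
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ a) * b[:, obs[t + 1]]  # sum_i alpha_{it} a_{ij}, then emit o_{t+1}
    return alpha

a  = np.array([[0.7, 0.3], [0.4, 0.6]])
b  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
alpha = forward(a, b, pi, [0, 1, 0])
print(alpha[-1].sum())                                    # P(o_1, ..., o_T)
```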

Viterbi algorithm

\ell_{it}^* = \max_{s_1, s_2, \ldots, s_{t-1}} \log P(s_1, s_2, \ldots, s_{t-1}, S_t=i, o_1, o_2, \ldots, o_t)

\begin{aligned} \ell_{i1}^* &= \log\pi_i + \log b_i(o_1) \\ \ell_{j,t+1}^* &= \max_i\left[\ell_{it}^* + \log a_{ij}\right] + \log b_j(o_{t+1}) \end{aligned}

\Phi_{t+1}(j) = \arg\max_i\left[\ell_{it}^* + \log a_{ij}\right]

\begin{aligned} s_T^* &= \arg\max_i\left[\ell_{iT}^*\right] \\ s_t^* &= \arg\max_i\left[\ell_{it}^* + \log a_{i s_{t+1}^*}\right] = \Phi_{t+1}(s_{t+1}^*) \end{aligned}
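
A minimal log-space sketch with the same made-up parameter shapes as the forward-algorithm example:

```python
import numpy as np

def viterbi(a, b, pi, obs):
    n, T = len(pi), len(obs)
    log_a, log_b, log_pi = np.log(a), np.log(b), np.log(pi)
    ell = np.zeros((T, n))                          # ell[t, i] = l*_{it}
    phi = np.zeros((T, n), dtype=int)
    ell[0] = log_pi + log_b[:, obs[0]]
    for t in range(T - 1):
        scores = ell[t][:, None] + log_a            # scores[i, j] = l*_{it} + log a_{ij}
        phi[t + 1] = np.argmax(scores, axis=0)      # Phi_{t+1}(j): best predecessor of state j
        ell[t + 1] = np.max(scores, axis=0) + log_b[:, obs[t + 1]]
    states = np.zeros(T, dtype=int)                 # backtrack: s*_T, then s*_t = Phi_{t+1}(s*_{t+1})
    states[-1] = np.argmax(ell[-1])
    for t in range(T - 2, -1, -1):
        states[t] = phi[t + 1][states[t + 1]]
    return states

a  = np.array([[0.7, 0.3], [0.4, 0.6]])
b  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
print(viterbi(a, b, pi, [0, 1, 1, 0]))              # most likely state sequence
```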

Backward algorithm

\beta_{it}=P(o_{t+1}, o_{t+2}, \ldots, o_T \mid S_t=i)

\begin{aligned} \beta_{iT} &= 1 \\ \beta_{it} &= \sum_{j=1}^n a_{ij}\, b_j(o_{t+1})\, \beta_{j,t+1} \end{aligned}
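
A minimal sketch with the same made-up parameter shapes; combined with the forward pass, \sum_i \alpha_{it}\beta_{it} reproduces P(o_1, \ldots, o_T) at every t:

```python
import numpy as np

def backward(a, b, obs):
    n, T = a.shape[0], len(obs)
    beta = np.zeros((T, n))
    beta[-1] = 1.0                                      # beta_{iT} = 1
    for t in range(T - 2, -1, -1):
        beta[t] = a @ (b[:, obs[t + 1]] * beta[t + 1])  # sum_j a_{ij} b_j(o_{t+1}) beta_{j,t+1}
    return beta

a = np.array([[0.7, 0.3], [0.4, 0.6]])
b = np.array([[0.9, 0.1], [0.2, 0.8]])
print(backward(a, b, [0, 1, 0]))
```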

EM algorithm for HMMs

E-step

P(S_t=i \mid o_1, \ldots, o_T)=\frac{\alpha_{it}\,\beta_{it}}{\sum_j \alpha_{jt}\,\beta_{jt}}

P(S_t=i, S_{t+1}=j \mid o_1, \ldots, o_T)=\frac{\alpha_{it}\, a_{ij}\, b_j(o_{t+1})\, \beta_{j,t+1}}{\sum_k \alpha_{kt}\,\beta_{kt}}

M-step

\pi_i \leftarrow P(S_1=i \mid o_1, o_2, \ldots, o_T)

a_{ij} \leftarrow \frac{\sum_t P(S_{t+1}=j, S_t=i \mid o_1, o_2, \ldots, o_T)}{\sum_t P(S_t=i \mid o_1, o_2, \ldots, o_T)}

b_{ik} \leftarrow \frac{\sum_t I(o_t, k)\, P(S_t=i \mid o_1, o_2, \ldots, o_T)}{\sum_t P(S_t=i \mid o_1, o_2, \ldots, o_T)}

Multivariate Gaussian distributions

P(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^d \det(\boldsymbol{\Sigma})}}\exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\}

Maximum likelihood (ML) estimation

\boldsymbol{\mu}_{\mathrm{ML}}=\frac{1}{T}\sum_{t=1}^T \mathbf{x}_t

\boldsymbol{\Sigma}_{\mathrm{ML}}=\frac{1}{T}\sum_{t=1}^T(\mathbf{x}_t-\boldsymbol{\mu}_{\mathrm{ML}})(\mathbf{x}_t-\boldsymbol{\mu}_{\mathrm{ML}})^\top
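
A minimal sketch on synthetic data; note the ML covariance uses the biased 1/T normalization:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.3], [0.3, 0.5]], size=5000)

mu_ml = X.mean(axis=0)                   # sample mean
diff = X - mu_ml
sigma_ml = (diff.T @ diff) / len(X)      # (1/T) sum_t (x_t - mu)(x_t - mu)^T
print(mu_ml)
print(sigma_ml)
```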

Clustering with GMMs

ML estimates for complete data

\boldsymbol{\mu}_i=\frac{\sum_{t=1}^T I(z_t, i)\, \mathbf{x}_t}{\sum_{t=1}^T I(z_t, i)}

\boldsymbol{\Sigma}_{i}=\frac{\sum_{t=1}^{T} I(z_{t}, i)\,(\mathbf{x}_{t}-\boldsymbol{\mu}_{i})(\mathbf{x}_{t}-\boldsymbol{\mu}_{i})^{\top}}{\sum_{t=1}^{T} I(z_{t}, i)}

EM updates for incomplete data

\boldsymbol{\mu}_i \leftarrow \frac{\sum_{t=1}^T P(Z=i \mid \mathbf{x}_t)\, \mathbf{x}_t}{\sum_{t=1}^T P(Z=i \mid \mathbf{x}_t)}

\boldsymbol{\Sigma}_i \leftarrow \frac{\sum_{t=1}^T P(Z=i \mid \mathbf{x}_t)\,(\mathbf{x}_t-\boldsymbol{\mu}_i)(\mathbf{x}_t-\boldsymbol{\mu}_i)^\top}{\sum_{t=1}^T P(Z=i \mid \mathbf{x}_t)}
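
A minimal sketch of these EM updates on made-up two-cluster data; it also re-estimates the mixing weights P(Z=i), the standard companion update not listed above:

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate Gaussian density evaluated at every row of X."""
    d = X.shape[1]
    diff = X - mu
    quad = np.einsum('td,de,te->t', diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(300, 2)),
               rng.normal([4.0, 4.0], 1.0, size=(300, 2))])

k, d = 2, X.shape[1]
weights = np.full(k, 1.0 / k)                     # mixing weights P(Z=i)
mus = X[rng.choice(len(X), k, replace=False)]     # initial means: random data points
Sigmas = np.array([np.eye(d) for _ in range(k)])

for _ in range(50):
    # E-step: responsibilities P(Z=i | x_t).
    dens = np.stack([weights[i] * gaussian_pdf(X, mus[i], Sigmas[i]) for i in range(k)], axis=1)
    resp = dens / dens.sum(axis=1, keepdims=True)  # shape (T, k)

    # M-step: weighted means, covariances, and mixing weights.
    Nk = resp.sum(axis=0)
    mus = (resp.T @ X) / Nk[:, None]
    for i in range(k):
        diff = X - mus[i]
        Sigmas[i] = (resp[:, i, None] * diff).T @ diff / Nk[i]
    weights = Nk / len(X)

print(mus)   # roughly the two cluster centers (0, 0) and (4, 4)
```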

— Dec 14, 2023

Probabilistic Reasoning and Learning by Lu Meng is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permissions beyond the scope of this license may be available at About.