Probabilistic Reasoning and Learning
Conditional probability tables (CPTs)
Noisy-OR CPT
$$P(Y=1 \mid X_1, X_2, \ldots, X_k) = 1 - \prod_{i=1}^{k} (1 - p_i)^{X_i}$$
Sigmoid CPT
$$P(Y=1 \mid X_1, X_2, \ldots, X_k) = \sigma\!\left(\sum_{i=1}^{k} \theta_i X_i\right)$$
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
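A minimal Python sketch of the two CPT parameterizations above, assuming binary parents $X_i \in \{0,1\}$; the parameter values are invented for illustration only.

```python
import numpy as np

def noisy_or(x, p):
    """Noisy-OR CPT: P(Y=1 | X) = 1 - prod_i (1 - p_i)^{X_i}."""
    x, p = np.asarray(x), np.asarray(p)
    return 1.0 - np.prod((1.0 - p) ** x)

def sigmoid_cpt(x, theta):
    """Sigmoid CPT: P(Y=1 | X) = sigma(sum_i theta_i X_i)."""
    z = np.dot(theta, x)
    return 1.0 / (1.0 + np.exp(-z))

# Three binary parents; only the first two are "on".
x = [1, 1, 0]
print(noisy_or(x, p=[0.8, 0.5, 0.9]))          # 1 - (0.2)(0.5) = 0.9
print(sigmoid_cpt(x, theta=[2.0, -1.0, 0.5]))  # sigma(1.0) ≈ 0.731
```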
d-separation
Blocked paths
- $Z \in E$, edges align
- $Z \in E$, edges diverge
- $Z \notin E$ and $\mathrm{desc}(Z) \cap E = \emptyset$, edges converge
Polytrees
Inference
$$P(X=x \mid E) \propto \underbrace{P(X=x \mid E_X^{+})}_{\text{from above}} \; \underbrace{P(E_X^{-} \mid X=x)}_{\text{from below}}$$
$$P(X=x \mid E_X^{+}) = \sum_{u} \underbrace{P(X=x \mid U=u)}_{\text{CPT}} \prod_{i=1}^{m} \underbrace{P(U_i = u_i \mid E_{U_i \setminus X})}_{\text{recursive instance}}$$
$$P(E_{Y_j \setminus X} \mid X=x) \propto \sum_{y_j} \underbrace{P(E_{Y_j}^{-} \mid Y_j = y_j)}_{\text{recursive instance}} \sum_{z_j} \underbrace{P(y_j \mid z_j, x)}_{\text{CPT}} \prod_{k} \underbrace{P(z_{jk} \mid E_{Z_{jk} \setminus Y_j})}_{\text{recursive instance}}$$
Exact inference in loopy BNs
Node clustering
- CPTs grow exponentially when nodes are clustered.
Cutset conditioning
- The number of times that we must run the polytree algorithm grows exponentially with the size of the cutset.
Approximate inference in loopy BNs
Rejection sampling
$$P(Q=q \mid E=e) \approx \frac{N(q,e)}{N(e)} = \frac{\sum_{i=1}^{N} I(q, q_i)\, I(e, e_i)}{\sum_{i=1}^{N} I(e, e_i)}$$
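A minimal sketch of rejection sampling on a hypothetical two-node network $Q \to E$ (the structure and CPT values below are invented for illustration): samples whose evidence node disagrees with the observation are simply discarded.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary network Q -> E with
# P(Q=1) = 0.3, P(E=1|Q=1) = 0.9, P(E=1|Q=0) = 0.2.
def sample_joint():
    q = rng.random() < 0.3
    e = rng.random() < (0.9 if q else 0.2)
    return q, e

def rejection_sample(e_obs, N=100_000):
    """Estimate P(Q=1 | E=e_obs) by keeping only samples that match the evidence."""
    kept = matched = 0
    for _ in range(N):
        q, e = sample_joint()
        if e == e_obs:          # I(e, e_i): reject samples inconsistent with evidence
            kept += 1
            matched += q        # I(q, q_i) for the query value Q=1
    return matched / kept

print(rejection_sample(e_obs=True))   # exact answer: 0.27 / 0.41 ≈ 0.659
```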
Likelihood weighting
$$P(Q=q \mid E=e) \approx \frac{\sum_{i=1}^{N} I(q, q_i)\, P(E=e \mid \mathrm{pa}(E) = p_i)}{\sum_{i=1}^{N} P(E=e \mid \mathrm{pa}(E) = p_i)}$$
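The same hypothetical $Q \to E$ network with likelihood weighting: the evidence node is clamped to its observed value and each sample is weighted by $P(E=e \mid \mathrm{pa}(E))$ rather than rejected.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical network: P(Q=1) = 0.3, P(E=1|Q=1) = 0.9, P(E=1|Q=0) = 0.2.
def likelihood_weighting(e_obs, N=100_000):
    """Clamp the evidence node; weight each sample by P(E=e | pa(E))."""
    num = den = 0.0
    for _ in range(N):
        q = rng.random() < 0.3                  # sample non-evidence nodes from their CPTs
        p_e = 0.9 if q else 0.2                 # P(E=1 | Q=q)
        w = p_e if e_obs else (1.0 - p_e)       # weight = P(E=e_obs | pa(E))
        num += w * q                            # I(q, q_i) * weight
        den += w
    return num / den

print(likelihood_weighting(e_obs=True))  # ≈ 0.659, with lower variance than rejection sampling
```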
Markov chain Monte Carlo
- Initialization
    - Fix evidence nodes to their observed values $e, e'$.
    - Initialize non-evidence nodes to random values.
- Repeat $N$ times
    - Pick a non-evidence node $X$ at random.
    - Use Bayes' rule to compute $P(X \mid B_X)$, where $B_X$ is the Markov blanket of $X$.
    - Resample $x \sim P(X \mid B_X)$.
    - Take a snapshot of all the nodes in the BN.
- Estimate
    - $P(Q=q \mid E=e) \approx \dfrac{N(q)}{N}$
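A minimal Gibbs-sampling sketch of the procedure above on a hypothetical chain $A \to B \to C$ with $C$ observed (all CPT values invented); a burn-in period is added as a common practical choice, though it is not part of the summary above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical chain A -> B -> C with C = 1 observed; query P(A=1 | C=1).
pA1 = 0.5
pB1 = {1: 0.8, 0: 0.3}      # P(B=1 | A=a)
pC1 = {1: 0.9, 0: 0.1}      # P(C=1 | B=b)

def gibbs(N=100_000, burn_in=1_000):
    a, b, c = 1, 1, 1        # evidence c stays fixed; a, b initialized arbitrarily
    count_a1 = kept = 0
    for t in range(N + burn_in):
        # Resample A from P(A | B=b); the Markov blanket of A here is just B.
        num = pA1 * (pB1[1] if b else 1 - pB1[1])
        den = num + (1 - pA1) * (pB1[0] if b else 1 - pB1[0])
        a = int(rng.random() < num / den)
        # Resample B from P(B | A=a, C=c) ∝ P(B | A=a) P(C=c | B).
        num = pB1[a] * (pC1[1] if c else 1 - pC1[1])
        den = num + (1 - pB1[a]) * (pC1[0] if c else 1 - pC1[0])
        b = int(rng.random() < num / den)
        if t >= burn_in:      # snapshot after burn-in
            count_a1 += a
            kept += 1
    return count_a1 / kept

print(gibbs())  # exact answer: 0.37 / 0.54 ≈ 0.685
```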
Learning in BNs
Computing the log-likelihood
$$\mathcal{L} = \sum_{i=1}^{n} \sum_{t=1}^{T} \log P\!\left(x_i^{(t)} \mid \mathrm{pa}_i^{(t)}\right) = \sum_{i=1}^{n} \sum_{x} \sum_{\pi} \mathrm{count}(X_i = x, \mathrm{pa}_i = \pi) \log P(X_i = x \mid \mathrm{pa}_i = \pi)$$
Maximum likelihood CPTs
$$P_{\mathrm{ML}}(X_i = x) = \frac{\mathrm{count}(X_i = x)}{T} = \frac{1}{T} \sum_{t} I\!\left(x_i^{(t)}, x\right)$$
$$P_{\mathrm{ML}}(X_i = x \mid \mathrm{pa}_i = \pi) = \frac{\mathrm{count}(X_i = x, \mathrm{pa}_i = \pi)}{\mathrm{count}(\mathrm{pa}_i = \pi)} = \frac{\sum_{t} I\!\left(x_i^{(t)}, x\right) I\!\left(\mathrm{pa}_i^{(t)}, \pi\right)}{\sum_{t} I\!\left(\mathrm{pa}_i^{(t)}, \pi\right)}$$
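A minimal sketch of the counting estimates above for a single node, assuming fully observed data supplied as (value, parent-configuration) pairs; the sample data are hypothetical.

```python
from collections import Counter

def ml_cpt(data):
    """ML estimate of P(X_i = x | pa_i = pi) from fully observed (x, pa) samples."""
    joint = Counter()   # count(X_i = x, pa_i = pi)
    parent = Counter()  # count(pa_i = pi)
    for x, pa in data:
        joint[(x, pa)] += 1
        parent[pa] += 1
    return {key: joint[key] / parent[key[1]] for key in joint}

# Hypothetical samples of a binary node with one binary parent.
data = [(1, (1,)), (1, (1,)), (0, (1,)), (0, (0,)), (1, (0,))]
print(ml_cpt(data))   # e.g. P(X=1 | pa=(1,)) = 2/3
```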
Linear regression
Gaussian conditional distribution
$$P(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left\{ -\frac{(y - w \cdot x)^2}{2\sigma^2} \right\}$$
$$\mathcal{L}(w, \sigma^2) = -\frac{1}{2} \sum_{t=1}^{T} \left[ \log(2\pi\sigma^2) + \frac{(y_t - w \cdot x_t)^2}{\sigma^2} \right]$$
$$A = \sum_{t} x_t x_t^{\top}, \qquad b = \sum_{t} y_t x_t, \qquad w_{\mathrm{ML}} = A^{-1} b$$
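A minimal sketch of the closed-form ML solution on synthetic data; solving $A w = b$ directly is preferred over forming $A^{-1}$ explicitly for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y_t = w . x_t + Gaussian noise (w_true invented for illustration).
T, d = 200, 3
X = rng.normal(size=(T, d))                 # rows are x_t
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=T)

A = X.T @ X                                 # sum_t x_t x_t^T
b = X.T @ y                                 # sum_t y_t x_t
w_ml = np.linalg.solve(A, b)                # w_ML = A^{-1} b, computed by solving A w = b
print(w_ml)                                 # close to w_true
```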
Numerical optimization
Gradient ascent
$$\theta \leftarrow \theta + \eta_t \left( \frac{\partial f}{\partial \theta} \right)$$
Newton’s method
$$\theta \leftarrow \theta - H^{-1} \nabla f$$
Logistic regression
$$P(Y=1 \mid x) = \sigma(w \cdot x) = \frac{1}{1 + e^{-w \cdot x}}$$
$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma(-z) = 1 - \sigma(z), \qquad \frac{d\sigma}{dz} = \sigma(z)\, \sigma(-z)$$
$$\mathcal{L}(w) = \sum_{t=1}^{T} \left[ y_t \log \sigma(w \cdot x_t) + (1 - y_t) \log \sigma(-w \cdot x_t) \right]$$
$$\frac{\partial \mathcal{L}}{\partial w_\alpha} = \sum_{t=1}^{T} x_{\alpha t} \left[ y_t - \sigma(w \cdot x_t) \right]$$
$$\frac{\partial^2 \mathcal{L}}{\partial w_\alpha\, \partial w_\beta} = -\sum_{t=1}^{T} \sigma(w \cdot x_t)\, \sigma(-w \cdot x_t)\, x_{\alpha t}\, x_{\beta t}$$
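A minimal sketch of Newton's method applied to this log-likelihood, using the gradient and Hessian above on synthetic data (the data-generating weights are invented for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic labels drawn from a logistic model with hypothetical weights.
T, d = 500, 3
X = rng.normal(size=(T, d))
w_true = np.array([1.5, -1.0, 0.5])
y = (rng.random(T) < sigmoid(X @ w_true)).astype(float)

w = np.zeros(d)
for _ in range(10):
    p = sigmoid(X @ w)
    grad = X.T @ (y - p)                    # dL/dw_alpha = sum_t x_t [y_t - sigma(w.x_t)]
    H = -(X.T * (p * (1 - p))) @ X          # Hessian of L (negative definite)
    w = w - np.linalg.solve(H, grad)        # Newton update: w <- w - H^{-1} grad L
print(w)                                    # roughly recovers w_true
```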
Learning from incomplete data
Computing the log-likelihood with incomplete data
$$\mathcal{L} = \sum_{t=1}^{T} \log \sum_{h} \left. \prod_{i=1}^{n} P(X_i = x_i \mid \mathrm{pa}_i = \pi_i) \right|_{\{H_t = h,\ V_t = v_t\}}$$
Auxiliary functions
$$\theta_{\mathrm{new}} = \arg\max_{\theta} Q(\theta, \theta_{\mathrm{old}})$$
- $Q(\theta, \theta) = f(\theta)$ for all $\theta$
- $Q(\theta', \theta) \leq f(\theta')$ for all $\theta, \theta'$
EM Algorithm
E-step
$$P(X_i = x \mid V_t = v_t)$$
$$P(X_i = x, \mathrm{pa}_i = \pi \mid V_t = v_t)$$
M-step (Learning)
$$P(X_i = x) \longleftarrow \frac{1}{T} \sum_{t=1}^{T} P(X_i = x \mid V_t = v_t)$$
$$P(X_i = x \mid \mathrm{pa}_i = \pi) \longleftarrow \frac{\sum_{t=1}^{T} P(X_i = x, \mathrm{pa}_i = \pi \mid V_t = v_t)}{\sum_{t=1}^{T} P(\mathrm{pa}_i = \pi \mid V_t = v_t)}$$
Hidden Markov models
Parameters of HMMs
$$a_{ij} = P(S_{t+1} = j \mid S_t = i)$$
$$b_{ik} = P(O_t = k \mid S_t = i)$$
$$\pi_i = P(S_1 = i)$$
Forward algorithm
$$\alpha_{it} = P(o_1, o_2, \ldots, o_t, S_t = i)$$
$$\alpha_{i1} = \pi_i\, b_i(o_1), \qquad \alpha_{j,t+1} = \left[ \sum_{i=1}^{n} \alpha_{it}\, a_{ij} \right] b_j(o_{t+1})$$
$$P(o_1, o_2, \ldots, o_T) = \sum_{i=1}^{n} \alpha_{iT}$$
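A minimal sketch of the forward recursion for a hypothetical two-state, two-symbol HMM (all parameter values invented).

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward pass: alpha[t, i] = P(o_1..o_t, S_t = i).

    pi : (n,) initial distribution, A : (n, n) transitions a_ij,
    B : (n, m) emissions b_ik, obs : observation indices o_1..o_T.
    """
    T, n = len(obs), len(pi)
    alpha = np.zeros((T, n))
    alpha[0] = pi * B[:, obs[0]]                         # alpha_{i1} = pi_i b_i(o_1)
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]  # sum_i alpha_it a_ij, then emit
    return alpha

# Hypothetical 2-state, 2-symbol HMM.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
alpha = forward(pi, A, B, obs=[0, 1, 0])
print(alpha[-1].sum())   # P(o_1, ..., o_T) = sum_i alpha_{iT}
```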
Viterbi algorithm
$$\ell_{it}^{*} = \max_{s_1, s_2, \ldots, s_{t-1}} \log P(s_1, s_2, \ldots, s_{t-1}, S_t = i, o_1, o_2, \ldots, o_t)$$
$$\ell_{i1}^{*} = \log \pi_i + \log b_i(o_1), \qquad \ell_{j,t+1}^{*} = \max_i \left[ \ell_{it}^{*} + \log a_{ij} \right] + \log b_j(o_{t+1})$$
$$\Phi_{t+1}(j) = \arg\max_i \left[ \ell_{it}^{*} + \log a_{ij} \right]$$
$$s_T^{*} = \arg\max_i \left[ \ell_{iT}^{*} \right], \qquad s_t^{*} = \arg\max_i \left[ \ell_{it}^{*} + \log a_{i s_{t+1}^{*}} \right] = \Phi_{t+1}(s_{t+1}^{*})$$
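A log-domain Viterbi sketch for the same hypothetical HMM, storing the back-pointers $\Phi_{t+1}(j)$ and then backtracking.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state sequence via the log-domain recursion above."""
    T, n = len(obs), len(pi)
    ell = np.zeros((T, n))            # ell[t, i] = l*_{it}
    phi = np.zeros((T, n), dtype=int)
    ell[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(T - 1):
        scores = ell[t][:, None] + np.log(A)          # scores[i, j] = l*_{it} + log a_ij
        phi[t + 1] = scores.argmax(axis=0)            # Phi_{t+1}(j)
        ell[t + 1] = scores.max(axis=0) + np.log(B[:, obs[t + 1]])
    # Backtrack: s*_T = argmax_i l*_{iT}, then s*_t = Phi_{t+1}(s*_{t+1}).
    states = [int(ell[-1].argmax())]
    for t in range(T - 1, 0, -1):
        states.append(int(phi[t][states[-1]]))
    return states[::-1]

# Same hypothetical HMM as in the forward example.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(pi, A, B, obs=[0, 1, 0]))   # [0, 1, 0]
```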
Backward algorithm
$$\beta_{it} = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid S_t = i)$$
$$\beta_{iT} = 1, \qquad \beta_{it} = \sum_{j=1}^{n} a_{ij}\, b_j(o_{t+1})\, \beta_{j,t+1}$$
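A matching backward-pass sketch for the same hypothetical HMM; as a sanity check, $\sum_i \alpha_{it}\,\beta_{it}$ equals the forward likelihood at every $t$.

```python
import numpy as np

def backward(A, B, obs):
    """Backward pass: beta[t, i] = P(o_{t+1}..o_T | S_t = i)."""
    T, n = len(obs), A.shape[0]
    beta = np.ones((T, n))                             # beta_{iT} = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # sum_j a_ij b_j(o_{t+1}) beta_{j,t+1}
    return beta

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 1, 0]
beta = backward(A, B, obs)
alpha1 = pi * B[:, obs[0]]
print((alpha1 * beta[0]).sum())   # equals sum_i alpha_{iT} from the forward pass
```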
EM algorithm for HMMs
E-step
$$P(S_t = i \mid o_1, \ldots, o_T) = \frac{\alpha_{it}\, \beta_{it}}{\sum_j \alpha_{jt}\, \beta_{jt}}$$
$$P(S_t = i, S_{t+1} = j \mid o_1, \ldots, o_T) = \frac{\alpha_{it}\, a_{ij}\, b_j(o_{t+1})\, \beta_{j,t+1}}{\sum_k \alpha_{kt}\, \beta_{kt}}$$
M-step
$$\pi_i \leftarrow P(S_1 = i \mid o_1, o_2, \ldots, o_T)$$
$$a_{ij} \leftarrow \frac{\sum_t P(S_t = i, S_{t+1} = j \mid o_1, o_2, \ldots, o_T)}{\sum_t P(S_t = i \mid o_1, o_2, \ldots, o_T)}$$
$$b_{ik} \leftarrow \frac{\sum_t I(o_t, k)\, P(S_t = i \mid o_1, o_2, \ldots, o_T)}{\sum_t P(S_t = i \mid o_1, o_2, \ldots, o_T)}$$
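A minimal sketch of one EM (Baum-Welch) update for a single observation sequence, computed from the unscaled $\alpha$ and $\beta$ above; long sequences would need the numerical scaling that is omitted here, and the HMM parameters are again hypothetical.

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    T, n = len(obs), len(pi)
    alpha, beta = np.zeros((T, n)), np.ones((T, n))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta

def baum_welch_step(pi, A, B, obs):
    """One EM update of (pi, A, B) from a single observation sequence."""
    T, n = len(obs), len(pi)
    alpha, beta = forward_backward(pi, A, B, obs)
    like = alpha[-1].sum()
    gamma = alpha * beta / like                        # gamma[t, i] = P(S_t=i | o_1..o_T)
    xi = np.zeros((T - 1, n, n))                       # xi[t, i, j] = P(S_t=i, S_{t+1}=j | o)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / like
    # M-step ratios built from the E-step posteriors.
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        mask = (np.array(obs) == k).astype(float)      # I(o_t, k)
        B_new[:, k] = (mask[:, None] * gamma).sum(axis=0) / gamma.sum(axis=0)
    return pi_new, A_new, B_new

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi, A, B = baum_welch_step(pi, A, B, obs=[0, 1, 0, 0, 1])
print(A)   # rows still sum to 1
```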
Multivariate Gaussian distributions
$$P(x) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} \exp\!\left\{ -\frac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right\}$$
Maximum likelihood (ML) estimation
$$\mu_{\mathrm{ML}} = \frac{1}{T} \sum_{t=1}^{T} x_t$$
$$\Sigma_{\mathrm{ML}} = \frac{1}{T} \sum_{t=1}^{T} (x_t - \mu_{\mathrm{ML}})(x_t - \mu_{\mathrm{ML}})^{\top}$$
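A minimal sketch of the ML estimates on synthetic Gaussian data; note the division by $T$ rather than $T-1$ (numpy's `np.cov` uses $T-1$ by default).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic samples from a 2-D Gaussian with invented mean and covariance.
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]], size=5_000)
mu_ml = X.mean(axis=0)                       # (1/T) sum_t x_t
D = X - mu_ml
sigma_ml = (D.T @ D) / len(X)                # (1/T) sum_t (x_t - mu)(x_t - mu)^T
print(mu_ml, sigma_ml, sep="\n")             # close to the true mean and covariance
```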
Clustering with GMMs
ML estimates for complete data
$$\mu_i = \frac{\sum_{t=1}^{T} I(z_t, i)\, x_t}{\sum_{t=1}^{T} I(z_t, i)}$$
$$\Sigma_i = \frac{\sum_{t=1}^{T} I(z_t, i)\, (x_t - \mu_i)(x_t - \mu_i)^{\top}}{\sum_{t=1}^{T} I(z_t, i)}$$
EM updates for incomplete data
$$\mu_i \leftarrow \frac{\sum_{t=1}^{T} P(Z = i \mid x_t)\, x_t}{\sum_{t=1}^{T} P(Z = i \mid x_t)}$$
$$\Sigma_i \leftarrow \frac{\sum_{t=1}^{T} P(Z = i \mid x_t)\, (x_t - \mu_i)(x_t - \mu_i)^{\top}}{\sum_{t=1}^{T} P(Z = i \mid x_t)}$$
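A minimal sketch of the EM updates for a two-component GMM on synthetic 2-D data; the initialization and iteration count are arbitrary choices, and `scipy.stats.multivariate_normal` supplies the component densities.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Synthetic data from two well-separated clusters (centers invented for illustration).
X = np.vstack([rng.normal([0, 0], 1.0, size=(200, 2)),
               rng.normal([4, 4], 1.0, size=(200, 2))])
K, (T, d) = 2, X.shape
w = np.full(K, 1.0 / K)                       # mixing weights P(Z=i)
mu = X[rng.choice(T, K, replace=False)]       # initialize means from the data
cov = np.stack([np.eye(d)] * K)

for _ in range(20):
    # E-step: responsibilities P(Z=i | x_t).
    r = np.stack([w[i] * multivariate_normal.pdf(X, mu[i], cov[i]) for i in range(K)], axis=1)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters from the responsibilities.
    Nk = r.sum(axis=0)
    w = Nk / T
    mu = (r.T @ X) / Nk[:, None]
    for i in range(K):
        D = X - mu[i]
        cov[i] = (r[:, i, None] * D).T @ D / Nk[i]

print(mu)   # close to the two cluster centers
```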
— Dec 14, 2023