PRML Chapter 7 Sparse Kernel Machines

7.1 Maximum Margin Classifiers

We begin with the two-class classification problem, using linear models of the form:

$$y(x) = w^{T}\phi(x) + b$$

The maximum margin solution is found by solving:

$$\arg\max_{w, b} \left\{ \frac{1}{||w||}\min_{n}\left[ t_{n}\left( w^{T}\phi(x_{n}) + b \right) \right] \right\}$$

where, assuming the training data are linearly separable and correctly classified, we have $t_{n}y(x_{n}) > 0$ for all $n$.

The direct solution of this optimization problem would be very complex, so we convert it into the equivalent problem:

$$\arg\min_{w, b}\ \frac{1}{2}||w||^{2}$$

subject to the constraints:

$$t_{n}\left( w^{T}\phi(x_{n}) + b \right) \geq 1, \quad n = 1, \dots, N$$

To solve this constrained optimization problem, we introduce Lagrange multipliers $a_{n} \geq 0$, giving the Lagrangian function:

$$L(w, b, a) = \frac{1}{2}||w||^{2} - \sum_{n=1}^{N}a_{n}\left\{ t_{n}\left( w^{T}\phi(x_{n}) + b \right) - 1 \right\}$$

Setting the derivatives of $L(w, b, a)$ with respect to $w$ and $b$ equal to zero, we obtain:

$$w = \sum_{n=1}^{N}a_{n}t_{n}\phi(x_{n})$$

$$0 = \sum_{n=1}^{N}a_{n}t_{n}$$

Eliminating $w$ and $b$ from $L(w, b, a)$ using these conditions then gives the dual representation of the maximum margin problem, in which we maximize:

$$\tilde{L}(a) = \sum_{n=1}^{N}a_{n} - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}a_{n}a_{m}t_{n}t_{m}k(x_{n}, x_{m})$$

subject to the constraints:

$$a_{n} \geq 0, \quad n = 1, \dots, N$$

$$\sum_{n=1}^{N}a_{n}t_{n} = 0$$

Substituting for $w$ using these results, $y(x)$ can be expressed in terms of the parameters $\{a_{n}\}$ and the kernel function:

$$y(x) = \sum_{n=1}^{N}a_{n}t_{n}k(x, x_{n}) + b$$

A constrained optimization of this form satisfies the KKT conditions:

$$a_{n} \geq 0$$

$$t_{n}y(x_{n}) - 1 \geq 0$$

$$a_{n}\left\{ t_{n}y(x_{n}) - 1 \right\} = 0$$

Thus for every data point, either $a_{n} = 0$ or $t_{n}y(x_{n}) = 1$. The points with $a_{n} > 0$ are the support vectors; writing $S$ for their index set and $N_{S}$ for their number, solving for $b$ gives:

$$b = \frac{1}{N_{S}}\sum_{n\in S}\left( t_{n} - \sum_{m\in S}a_{m}t_{m}k(x_{n}, x_{m}) \right)$$
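As a concrete illustration, here is a minimal NumPy sketch of the kernel-expansion prediction and of the bias formula above. It assumes the dual multipliers $a_n$ have already been found (for example by a quadratic programming solver); the Gaussian kernel, the `gamma` parameter, and the function names are choices of this sketch rather than anything prescribed by the text.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Gaussian kernel matrix with entries k(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def svm_bias_and_predict(X, t, a, X_new, gamma=1.0, tol=1e-8):
    """Given dual multipliers a_n and targets t_n in {-1, +1}, compute b by
    averaging over the support vectors S = {n : a_n > 0}, then evaluate
    y(x) = sum_{n in S} a_n t_n k(x, x_n) + b for new inputs."""
    K = rbf_kernel(X, X, gamma)
    S = a > tol                                   # support-vector indicator
    b = np.mean(t[S] - K[np.ix_(S, S)] @ (a[S] * t[S]))
    y_new = rbf_kernel(X_new, X[S], gamma) @ (a[S] * t[S]) + b
    return b, y_new
```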

7.1.1 Overlapping class distributions

To maximize the margin while softly penalizing points that lie on the wrong side of the margin boundary, we introduce slack variables $\xi_{n}$ and minimize:

$$C\sum_{n=1}^{N}\xi_{n} + \frac{1}{2}||w||^{2}$$

The hard-margin constraints are replaced by:

$$t_{n}y(x_{n}) \geq 1-\xi_{n}, \quad n=1, \dots, N$$

$$\xi_{n} \geq 0$$

The corresponding Lagrangian is given by:

$$L(w, b, \xi, a, \mu) = \frac{1}{2}||w||^{2} + C\sum_{n=1}^{N}\xi_{n} - \sum_{n=1}^{N}a_{n}\left\{ t_{n}y(x_{n}) - 1 + \xi_{n} \right\} - \sum_{n=1}^{N}\mu_{n}\xi_{n}$$

The corresponding set of KKT conditions is given by:

$$a_{n} \geq 0$$

$$t_{n}y(x_{n}) - 1 + \xi_{n} \geq 0$$

$$a_{n}\left( t_{n}y(x_{n}) - 1 + \xi_{n} \right) = 0$$

$$\mu_{n} \geq 0$$

$$\xi_{n} \geq 0$$

$$\mu_{n}\xi_{n} = 0$$

After we eliminate $w$, $b$, and $\{\xi_{n}\}$ from the Lagrangian, we obtain the dual Lagrangian in the form:

$$\tilde{L}(a) = \sum_{n=1}^{N}a_{n} - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}a_{n}a_{m}t_{n}t_{m}k(x_{n}, x_{m})$$

subject to the constraints:

$$0 \leq a_{n} \leq C$$

$$\sum_{n=1}^{N} a_{n}t_{n} = 0$$

Similarly, a numerically stable solution for $b$ is obtained by averaging over the set $M$ of points for which $0 < a_{n} < C$:

$$b = \frac{1}{N_{M}}\sum_{n\in M}\left( t_{n} - \sum_{m\in S}a_{m}t_{m}k(x_{n}, x_{m}) \right)$$
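Because the dual is a quadratic program in the $a_n$, it can be solved for small data sets with a generic constrained optimizer. The sketch below uses `scipy.optimize.minimize` (SLSQP) purely for illustration; real SVM implementations use specialized solvers such as SMO, and the variable names here are assumptions of the sketch.

```python
import numpy as np
from scipy.optimize import minimize

def fit_soft_margin_dual(K, t, C=1.0, tol=1e-6):
    """Maximize L~(a) = sum_n a_n - 1/2 sum_{n,m} a_n a_m t_n t_m k(x_n, x_m)
    subject to 0 <= a_n <= C and sum_n a_n t_n = 0 (small-N illustration only)."""
    N = len(t)
    Q = (t[:, None] * t[None, :]) * K            # Q_nm = t_n t_m k(x_n, x_m)

    def neg_dual(a):                             # minimize the negative dual
        return 0.5 * a @ Q @ a - a.sum()

    def neg_dual_grad(a):
        return Q @ a - np.ones(N)

    res = minimize(neg_dual, x0=np.zeros(N), jac=neg_dual_grad, method="SLSQP",
                   bounds=[(0.0, C)] * N,
                   constraints=[{"type": "eq", "fun": lambda a: a @ t}])
    a = res.x
    S = a > tol                                  # support vectors
    M = S & (a < C - tol)                        # margin points with 0 < a_n < C
    # Assumes at least one margin point exists; average them for numerical stability.
    b = np.mean(t[M] - K[np.ix_(M, S)] @ (a[S] * t[S]))
    return a, b
```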

7.1.2 Relation to logistic regression

The SVM objective function can be written in terms of the hinge error function $E_{SV}(y_{n}t_{n}) = [1 - y_{n}t_{n}]_{+}$ in the form:

$$\sum_{n=1}^N E_{SV}(y_n t_n) + \lambda ||w||^2$$

For logistic regression, we can similarly construct an error function by taking the negative logarithm of the likelihood function, which, with a quadratic regularizer, takes the form:

$$\sum_{n=1}^N E_{LR}(y_n t_n) + \lambda ||w||^2$$
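The two error functions can be compared directly. A small NumPy sketch of both follows; the division by $\ln 2$ rescales the logistic error so that the two curves pass through the same point $(0, 1)$, as in the comparison plot in PRML.

```python
import numpy as np

def hinge_error(yt):
    """Hinge error E_SV(y t) = max(0, 1 - y t), implicitly minimized by the SVM."""
    return np.maximum(0.0, 1.0 - yt)

def logistic_error(yt):
    """Logistic error ln(1 + exp(-y t)) / ln 2, rescaled for comparison with the hinge error."""
    return np.log1p(np.exp(-yt)) / np.log(2.0)

yt = np.linspace(-2.0, 2.0, 9)     # values of the margin quantity y(x_n) t_n
print(np.round(hinge_error(yt), 3))
print(np.round(logistic_error(yt), 3))
```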

7.1.3 Multiclass SVMs

7.1.4 SVMs for regression

To obtain sparse solutions, we minimize a regularized error function based on the $\epsilon$-insensitive error function $E_{\epsilon}$:

$$C\sum_{n=1}^{N}E_{\epsilon}(y(x_{n}) - t_{n}) + \frac{1}{2}||w||^{2}$$

Introducing slack variables, the error function for support vector regression can then be written as:

$$C\sum_{n=1}^{N}(\xi_{n} + \hat{\xi}_{n}) + \frac{1}{2}||w||^{2}$$

with the corresponding conditions (a point lies inside the $\epsilon$-tube when $y_{n} - \epsilon \leq t_{n} \leq y_{n} + \epsilon$; the slack variables allow points to lie outside it):

$$t_{n} \leq y(x_{n}) + \epsilon + \xi_{n}$$

$$t_{n} \geq y(x_{n}) - \epsilon - \hat{\xi}_{n}$$

$$\xi_{n} \geq 0, \quad \hat{\xi}_{n} \geq 0$$

The dual problem involves maximizing:

$$\tilde{L}(a, \hat{a}) = -\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}(a_{n} - \hat{a}_{n})(a_{m} - \hat{a}_{m})k(x_{n}, x_{m}) - \epsilon\sum_{n=1}^{N}(a_{n} + \hat{a}_{n}) + \sum_{n=1}^{N}(a_{n} - \hat{a}_{n})t_{n}$$

subject to the box constraints:

$$0 \leq a_{n} \leq C$$

$$0 \leq \hat{a}_{n} \leq C$$

Solving the dual problem, we see that predictions for new inputs can be made using:

$$y(x) = \sum_{n=1}^{N}(a_{n} - \hat{a}_{n})k(x, x_{n}) + b$$

which is again expressed in terms of the kernel function. The corresponding KKT conditions are:

$$a_{n}(\epsilon + \xi_{n} + y_{n} - t_{n}) = 0$$

$$\hat{a}_{n}(\epsilon + \hat{\xi}_{n} - y_{n} + t_{n}) = 0$$

$$(C - a_{n})\xi_{n} = 0$$

$$(C - \hat{a}_{n})\hat{\xi}_{n} = 0$$

Solving for $b$ using a point for which $0 < a_{n} < C$ (so that $\xi_{n} = 0$), we obtain:

$$b = t_{n} - \epsilon - \sum_{m=1}^{N}(a_{m} - \hat{a}_{m})k(x_{n}, x_{m})$$
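In practice this $\epsilon$-insensitive regression model is available off the shelf, for example as scikit-learn's `SVR`. In the sketch below (data and parameter values are invented for illustration), `dual_coef_` holds the quantities $a_{n} - \hat{a}_{n}$ for the support vectors and `intercept_` plays the role of $b$.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 2.0 * np.pi, 60))[:, None]     # 1-D inputs
t = np.sin(X).ravel() + 0.1 * rng.normal(size=60)            # noisy targets

# epsilon sets the width of the insensitive tube, C the regularization trade-off.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0)
model.fit(X, t)

print("support vectors:", model.support_vectors_.shape[0])
print("(a_n - a^_n):", model.dual_coef_.ravel()[:5])
print("bias b:", model.intercept_)
y_new = model.predict(np.array([[1.0], [2.0]]))              # kernel expansion above
```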

7.1.5 Computational learning theory

7.2 Relevance Vector Machines

7.2.1 RVM for regression

The relevance vector machine for regression is a linear model of the form discussed in Chapter 3, but with a modified prior that results in sparse solutions. The conditional distribution is:

$$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = N(t \mid y(\mathbf{x}), \beta^{-1})$$

where $\beta = \sigma^{-2}$ is the noise precision and the mean is given by a linear model of the form:

$$y(x) = \sum_{i=1}^{M}w_{i}\phi_{i}(x) = w^{T}\phi(x)$$

In the RVM, this general expression takes the SVM-like form:

$$y(x) = \sum_{n=1}^{N}w_{n}k(x, x_{n}) + b$$

Suppose we have $N$ observations; the likelihood function is then given by:

$$p(\mathbf{t} \mid \mathbf{X}, w, \beta) = \prod_{n=1}^{N}p(t_{n} \mid x_{n}, w, \beta)$$

The prior distribution over the parameter vector $w$ introduces a separate precision hyperparameter $\alpha_{i}$ for each weight:

$$p(w \mid \alpha) = \prod_{i=1}^{M}N(w_{i} \mid 0, \alpha_{i}^{-1})$$

It is easy to infer that the posterior distribution for $w$ takes the form:

$$p(w \mid \mathbf{t}, \mathbf{X}, \alpha, \beta) = N(w \mid m, \Sigma)$$

where $m = \beta\Sigma\mathbf{\Phi}^{T}\mathbf{t}$ and $\Sigma = (\mathbf{A} + \beta\mathbf{\Phi}^{T}\mathbf{\Phi})^{-1}$, with $\mathbf{A} = \operatorname{diag}(\alpha_{i})$ and $\mathbf{\Phi}$ the $N \times M$ design matrix.
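A minimal NumPy sketch of this posterior computation, assuming an SVM-like design matrix built from a Gaussian kernel plus a bias column (the kernel and its `gamma` parameter are choices of the sketch):

```python
import numpy as np

def design_matrix(X, gamma=1.0):
    """SVM-like basis: one Gaussian kernel function per training point plus a
    bias column, giving M = N + 1 basis functions."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.hstack([np.ones((len(X), 1)), np.exp(-gamma * sq)])

def rvm_posterior(Phi, t, alpha, beta):
    """Posterior p(w | t, X, alpha, beta) = N(w | m, Sigma), where
    Sigma = (A + beta Phi^T Phi)^{-1}, m = beta Sigma Phi^T t, A = diag(alpha)."""
    Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
    m = beta * Sigma @ Phi.T @ t
    return m, Sigma
```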

To determine the hyperparameters $\alpha$ and $\beta$, we maximize the marginal likelihood function:

$$p(\mathbf{t} \mid \mathbf{X}, \alpha, \beta) = \int p(\mathbf{t} \mid \mathbf{X}, w, \beta)\, p(w \mid \alpha)\, dw$$

The log marginal likelihood takes the form:

$$\ln p(\mathbf{t} \mid \mathbf{X}, \alpha, \beta) = \ln N(\mathbf{t} \mid \mathbf{0}, \mathbf{C})$$

where

$$\mathbf{C} = \beta^{-1}\mathbf{I} + \mathbf{\Phi}\mathbf{A}^{-1}\mathbf{\Phi}^{T}$$
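Evaluated directly, the log marginal likelihood is just the log density of a zero-mean Gaussian with covariance $\mathbf{C}$. A short sketch, which could be handed to a generic optimizer over $\alpha$ and $\beta$ (dedicated re-estimation updates are normally used in practice):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_likelihood(Phi, t, alpha, beta):
    """ln p(t | X, alpha, beta) = ln N(t | 0, C), C = beta^{-1} I + Phi A^{-1} Phi^T."""
    N = len(t)
    C = np.eye(N) / beta + Phi @ np.diag(1.0 / alpha) @ Phi.T
    return multivariate_normal(mean=np.zeros(N), cov=C).logpdf(t)
```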

Having found maximizing values $\alpha^{*}$ and $\beta^{*}$ of the hyperparameters, the predictive distribution for a new input $x$ will be:

$$p(t \mid x, \mathbf{X}, \mathbf{t}, \alpha^{*}, \beta^{*}) = N(t \mid m^{T}\phi(x), \sigma^{2}(x))$$
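where $m$ and $\Sigma$ are evaluated at $\alpha^{*}$ and $\beta^{*}$, and the predictive variance is $\sigma^{2}(x) = (\beta^{*})^{-1} + \phi(x)^{T}\Sigma\phi(x)$, combining the noise variance with the uncertainty in $w$. A one-function sketch of the prediction step, continuing the hypothetical helpers above:

```python
import numpy as np

def rvm_predict(phi_x, m, Sigma, beta):
    """Predictive mean m^T phi(x) and variance 1/beta + phi(x)^T Sigma phi(x)."""
    mean = m @ phi_x
    var = 1.0 / beta + phi_x @ Sigma @ phi_x
    return mean, var
```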

7.2.2 Analysis of sparsity

7.2.3 RVM for classification

We can extend the relevance vector machine framework to classification problems by applying the ARD prior over weights to a probabilistic linear classification model of the kind discussed in Chapter 4.
