Pattern Recognition
PRML Chapter 7 Sparse Kernel Machines
7.1 Maximum Margin Classifiers
We begin by considering the two-class classification problem using linear models of the form:
$$y(x) = w^{T}\phi(x) + b$$
The maximum margin solution is found by solving:
$$\arg\max_{w, b}\left\{ \frac{1}{\|w\|}\min_{n}\left[ t_{n}\left( w^{T}\phi(x_{n}) + b \right) \right] \right\}$$
where we assume that all data points are correctly classified, so that $t_{n}y(x_{n}) > 0$ for all $n$.
The direct solution of this optimization problem would be very complex. Using the freedom to rescale $w$ and $b$ so that the point closest to the decision surface satisfies $t_{n}(w^{T}\phi(x_{n}) + b) = 1$, the problem becomes the equivalent minimization:
$$\arg\min_{w, b}\ \frac{1}{2}\|w\|^{2}$$
subject to the constraints:
$$t_{n}\left( w^{T}\phi(x_{n}) + b \right) \geq 1, \quad n = 1, \dots, N$$
To solve this constrained problem, we introduce Lagrange multipliers $a_{n} \geq 0$, and the Lagrangian function is:
$$L(w, b, a) = \frac{1}{2}\|w\|^{2} - \sum_{n=1}^{N}a_{n}\left\{ t_{n}\left( w^{T}\phi(x_{n}) + b \right) - 1 \right\}$$
Setting the derivatives with respect to $w$ and $b$ equal to zero, we obtain:
$$w = \sum_{n=1}^{N}a_{n}t_{n}\phi(x_{n})$$
$$0 = \sum_{n=1}^{N}a_{n}t_{n}$$
Eliminating $w$ and $b$ from $L(w, b, a)$ using these conditions then gives the dual representation of the maximum margin problem, in which we maximize:
$$\tilde{L}(a) = \sum_{n=1}^{N}a_{n} - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}a_{n}a_{m}t_{n}t_{m}k(x_{n}, x_{m})$$
subject to the constraints:
$$a_{n} \geq 0, \quad n = 1, \dots, N$$
$$\sum_{n=1}^{N}a_{n}t_{n} = 0$$
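As a concrete illustration, this dual problem can be handed to a generic constrained optimizer. The sketch below is not part of PRML: the toy data `X`, `t`, the linear `kernel()` and the use of SciPy's SLSQP solver are illustrative assumptions. It builds the Gram matrix and maximizes $\tilde{L}(a)$ under $a_{n} \geq 0$ and $\sum_{n} a_{n}t_{n} = 0$.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative only).
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [2.0, 0.5], [3.0, 1.0], [4.0, 1.5]])
t = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
N = len(t)

def kernel(x, z):
    # Linear kernel k(x, z) = x^T z; any valid kernel could be substituted.
    return x @ z

# Gram matrix K[n, m] = k(x_n, x_m).
K = np.array([[kernel(X[n], X[m]) for m in range(N)] for n in range(N)])

def neg_dual(a):
    # Negative of L~(a); negated because scipy minimizes.
    return -(a.sum() - 0.5 * (a * t) @ K @ (a * t))

cons = ({'type': 'eq', 'fun': lambda a: a @ t},)   # sum_n a_n t_n = 0
bounds = [(0.0, None)] * N                          # a_n >= 0

res = minimize(neg_dual, np.zeros(N), method='SLSQP',
               bounds=bounds, constraints=cons)
a = res.x   # dual coefficients a_n
```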
Substituting for $w$, the model can be expressed in terms of the parameters $\{a_{n}\}$ and the kernel function:
$$y(x) = \sum_{n=1}^{N}a_{n}t_{n}k(x, x_{n}) + b$$
A constrained optimization of this form satisfies the KKT conditions:
$$a_{n} \geq 0$$
$$t_{n}y(x_{n}) - 1 \geq 0$$
$$a_{n}\left\{ t_{n}y(x_{n}) - 1 \right\} = 0$$
Thus for every data point, either $a_{n} = 0$ or $t_{n}y(x_{n}) = 1$. The data points with $a_{n} > 0$ are the support vectors; denoting this set by $S$ and its size by $N_{S}$, solving for $b$ gives:
$$b = \frac{1}{N_{S}}\sum_{n\in S}\left( t_{n} - \sum_{m\in S}a_{m}t_{m}k(x_{n}, x_{m}) \right)$$
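Continuing the sketch above (the variables `a`, `X`, `t` and `kernel()` come from that illustrative snippet), the support vectors are the points with nonzero $a_{n}$, $b$ is averaged over them, and predictions use the kernel expansion:

```python
# Support vectors: points with a_n > 0 (up to a numerical tolerance).
S = np.where(a > 1e-6)[0]

# b averaged over the support vectors, following the formula above.
b = np.mean([t[n] - sum(a[m] * t[m] * kernel(X[n], X[m]) for m in S)
             for n in S])

def predict(x_new):
    # y(x) = sum_{n in S} a_n t_n k(x, x_n) + b
    return sum(a[n] * t[n] * kernel(x_new, X[n]) for n in S) + b

print(np.sign(predict(np.array([1.5, 2.5]))))   # expected +1 for this toy data
```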
7.1.1 Overlapping class distributions
When the class distributions overlap, we maximize the margin while softly penalizing points that lie on the wrong side of the margin boundary; we therefore minimize:
$$C\sum_{n=1}^{N}\xi_{n} + \frac{1}{2}\|w\|^{2}$$
The constraints will be replaced with:
$$t_{n}y(x_{n}) \geq 1 - \xi_{n}, \quad n = 1, \dots, N$$
$$\xi_{n} \geq 0$$
The corresponding Lagrangian is given by:
$$L(w, b, \xi, a, \mu) = \frac{1}{2}\|w\|^{2} + C\sum_{n=1}^{N}\xi_{n} - \sum_{n=1}^{N}a_{n}\left\{ t_{n}y(x_{n}) - 1 + \xi_{n} \right\} - \sum_{n=1}^{N}\mu_{n}\xi_{n}$$
The corresponding set of KKT conditions is given by:
$$a_{n} \geq 0$$
$$t_{n}y(x_{n}) - 1 + \xi_{n} \geq 0$$
$$a_{n}\left( t_{n}y(x_{n}) - 1 + \xi_{n} \right) = 0$$
$$\mu_{n} \geq 0$$
$$\xi_{n} \geq 0$$
$$\mu_{n}\xi_{n} = 0$$
After we eliminate $w$, $b$ and $\xi_{n}$ from the Lagrangian, we obtain the dual Lagrangian in the form:
$$\tilde{L}(a) = \sum_{n=1}^{N}a_{n} - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}a_{n}a_{m}t_{n}t_{m}k(x_{n}, x_{m})$$
subject to the constraints:
$$0 \leq a_{n} \leq C$$
$$\sum_{n=1}^{N}a_{n}t_{n} = 0$$
Similarly, a numerically stable solution for $b$ is obtained by averaging over the set $M$ of data points having $0 < a_{n} < C$:
$$b = \frac{1}{N_{M}}\sum_{n\in M}\left( t_{n} - \sum_{m\in S}a_{m}t_{m}k(x_{n}, x_{m}) \right)$$
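In the earlier dual sketch, the only change needed for the soft-margin case is the box constraint $0 \leq a_{n} \leq C$; $b$ is then averaged over the set $M$. A hedged continuation of that illustrative snippet (the value of `C` and the tolerances are arbitrary choices):

```python
C = 1.0   # illustrative choice of the regularization parameter

bounds_soft = [(0.0, C)] * N            # box constraints 0 <= a_n <= C
res_soft = minimize(neg_dual, np.zeros(N), method='SLSQP',
                    bounds=bounds_soft, constraints=cons)
a_soft = res_soft.x

S = np.where(a_soft > 1e-6)[0]                           # support vectors
M = np.where((a_soft > 1e-6) & (a_soft < C - 1e-6))[0]   # points with 0 < a_n < C

# Numerically stable bias, averaged over M (assumes M is non-empty).
b_soft = np.mean([t[n] - sum(a_soft[m] * t[m] * kernel(X[n], X[m]) for m in S)
                  for n in M])
```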
7.1.2 Relation to logistic regression
Introducing the hinge error function $E_{SV}(\cdot)$ and setting $\lambda = (2C)^{-1}$, the soft-margin objective function can be written in the form:
$$\sum_{n=1}^{N}E_{SV}(y_{n}t_{n}) + \lambda\|w\|^{2}$$
For logistic regression, we can likewise construct an error function by taking the negative logarithm of the likelihood function which, with a quadratic regularizer, takes the form:
$$\sum_{n=1}^{N}E_{LR}(y_{n}t_{n}) + \lambda\|w\|^{2}$$
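For comparison, the two error functions can be written down directly. The small sketch below uses plain NumPy; the names `E_SV` and `E_LR` simply mirror the formulas, with the hinge error $\max(0, 1 - z)$ and the logistic error $\ln(1 + e^{-z})$:

```python
import numpy as np

def E_SV(z):
    # Hinge error: zero for z >= 1, linear penalty otherwise.
    return np.maximum(0.0, 1.0 - z)

def E_LR(z):
    # Logistic-regression error ln(1 + exp(-z)).
    return np.log1p(np.exp(-z))

z = np.linspace(-2.0, 2.0, 5)
print(E_SV(z))
print(E_LR(z))
```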
7.1.3 Multiclass SVMs
7.1.4 SVMs for regression
To obtain sparse solutions, we therefore minimize a regularized error function:
$$C\sum_{n=1}^{N}E_{\epsilon}\left( y(x_{n}) - t_{n} \right) + \frac{1}{2}\|w\|^{2}$$
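Here $E_{\epsilon}$ is the $\epsilon$-insensitive error, which assigns zero error to points inside a tube of width $\epsilon$ around the prediction and a linear penalty outside it. A minimal sketch (the default $\epsilon = 0.1$ is an arbitrary illustrative choice):

```python
import numpy as np

def E_eps(r, eps=0.1):
    # epsilon-insensitive error: zero inside the tube |r| < eps, linear outside.
    return np.where(np.abs(r) < eps, 0.0, np.abs(r) - eps)

print(E_eps(np.array([-0.3, -0.05, 0.0, 0.2]), eps=0.1))
```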
Introducing slack variables, the error function for support vector regression can then be written as:
$$C\sum_{n=1}^{N}\left( \xi_{n} + \hat{\xi}_{n} \right) + \frac{1}{2}\|w\|^{2}$$
with the corresponding conditions (writing $y_{n} = y(x_{n})$):
$$y_{n} - \epsilon - \hat{\xi}_{n} \leq t_{n} \leq y_{n} + \epsilon + \xi_{n}$$
$$\xi_{n} \geq 0, \quad \hat{\xi}_{n} \geq 0$$
The dual problem involves maximizing:
$$\tilde{L}(a, \hat{a}) = -\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}(a_{n} - \hat{a}_{n})(a_{m} - \hat{a}_{m})k(x_{n}, x_{m}) - \epsilon\sum_{n=1}^{N}(a_{n} + \hat{a}_{n}) + \sum_{n=1}^{N}(a_{n} - \hat{a}_{n})t_{n}$$
subject to the box constraints:
$$0 \leq a_{n} \leq C$$
$$0 \leq \hat{a}_{n} \leq C$$
Solving the dual problem, we see that predictions for new inputs can be made using:
$$y(x) = \sum_{n=1}^{N}(a_{n} - \hat{a}_{n})k(x, x_{n}) + b$$
which is again expressed in terms of the kernel function. The corresponding KKT conditions are:
$$a_{n}(\epsilon + \xi_{n} + y_{n} - t_{n}) = 0$$
$$\hat{a}_{n}(\epsilon + \hat{\xi}_{n} - y_{n} + t_{n}) = 0$$
$$(C - a_{n})\xi_{n} = 0$$
$$(C - \hat{a}_{n})\hat{\xi}_{n} = 0$$
Considering a data point for which $0 < a_{n} < C$, so that $\xi_{n} = 0$ and hence $\epsilon + y_{n} - t_{n} = 0$, and solving for $b$, we obtain:
$$b = t_{n} - \epsilon - \sum_{m=1}^{N}(a_{m} - \hat{a}_{m})k(x_{n}, x_{m})$$
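Given dual coefficients $a_{n}$ and $\hat{a}_{n}$ obtained from a QP solver (as in the classification sketch earlier), the prediction and the bias follow directly from the two formulas above. The helper names below are illustrative and reuse the `kernel()` function defined in that sketch:

```python
def svr_bias(n, a, a_hat, X, t, eps):
    # b from a data point with 0 < a_n < C, so that xi_n = 0 and
    # epsilon + y_n - t_n = 0 is active.
    return t[n] - eps - sum((a[m] - a_hat[m]) * kernel(X[n], X[m])
                            for m in range(len(X)))

def svr_predict(x_new, a, a_hat, X, b):
    # y(x) = sum_n (a_n - a_hat_n) k(x, x_n) + b
    return sum((a[n] - a_hat[n]) * kernel(x_new, X[n])
               for n in range(len(X))) + b
```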
7.1.5 Computational learning theory
7.2 Relevance Vector Machines
7.2.1 RVM for regression
The relevance vector machine for regression is a linear model of the form discussed in Chapter 3, but with a modified prior that results in sparse solutions. The conditional distribution is:
$$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}), \beta^{-1})$$
where $\beta = \sigma^{-2}$ is the noise precision and the mean is given by a linear model of the form:
$$y(x) = \sum_{i=1}^{M}w_{i}\phi_{i}(x) = w^{T}\phi(x)$$
For the RVM, this general expression takes the SVM-like form:
$$y(x) = \sum_{n=1}^{N}w_{n}k(x, x_{n}) + b$$
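In this form the basis functions are one kernel per training point plus a bias, so the design matrix $\Phi$ has $N$ rows and $M = N + 1$ columns. A minimal sketch, assuming a Gaussian kernel as an illustrative choice (the function names and `gamma` parameter are not from PRML):

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # Illustrative Gaussian kernel; the RVM places no restriction on the kernel.
    return np.exp(-gamma * np.sum((x - z) ** 2))

def design_matrix(X, gamma=1.0):
    # Phi is N x (N + 1): a constant bias column followed by one kernel
    # basis function k(., x_m) centred on each training point.
    N = len(X)
    Phi = np.ones((N, N + 1))
    for n in range(N):
        for m in range(N):
            Phi[n, m + 1] = rbf_kernel(X[n], X[m], gamma)
    return Phi
```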
Given $N$ observations, the likelihood function is:
$$p(\mathbf{t} \mid \mathbf{X}, w, \beta) = \prod_{n=1}^{N}p(t_{n} \mid x_{n}, w, \beta)$$
The prior distribution over the parameter $w$ is:
$$p(w \mid \alpha) = \prod_{i=1}^{M}\mathcal{N}(w_{i} \mid 0, \alpha_{i}^{-1})$$
It is easy to infer that the posterior distribution for $w$ takes the form:
$$p(w \mid \mathbf{t}, \mathbf{X}, \alpha, \beta) = \mathcal{N}(w \mid m, \Sigma)$$
where $m = \beta\Sigma\Phi^{T}\mathbf{t}$ and $\Sigma = (\mathbf{A} + \beta\Phi^{T}\Phi)^{-1}$, with $\mathbf{A} = \mathrm{diag}(\alpha_{i})$.
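These two formulas translate directly into code. A sketch assuming a design matrix `Phi` as in the previous snippet and given values of `alpha` and `beta` (a practical implementation would use a numerically stabler solve than an explicit inverse):

```python
import numpy as np

def rvm_posterior(Phi, t, alpha, beta):
    # Sigma = (A + beta Phi^T Phi)^{-1} with A = diag(alpha); m = beta Sigma Phi^T t.
    A = np.diag(alpha)
    Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)
    m = beta * Sigma @ Phi.T @ t
    return m, Sigma
```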
The hyperparameters $\alpha$ and $\beta$ are determined by maximizing the marginal likelihood function:
$$p(\mathbf{t} \mid \mathbf{X}, \alpha, \beta) = \int p(\mathbf{t} \mid \mathbf{X}, w, \beta)\, p(w \mid \alpha)\, dw$$
The log marginal likelihood takes the form:
$$\ln p(\mathbf{t} \mid \mathbf{X}, \alpha, \beta) = \ln \mathcal{N}(\mathbf{t} \mid \mathbf{0}, \mathbf{C})$$
where
$$\mathbf{C} = \beta^{-1}\mathbf{I} + \Phi\mathbf{A}^{-1}\Phi^{T}$$
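The log marginal likelihood can be evaluated directly from $\mathbf{C}$. A sketch using SciPy's multivariate normal density (in a full RVM this quantity is maximized iteratively over $\alpha$ and $\beta$, which is not shown here):

```python
import numpy as np
from scipy.stats import multivariate_normal

def rvm_log_marginal(Phi, t, alpha, beta):
    # ln p(t | X, alpha, beta) = ln N(t | 0, C), C = beta^{-1} I + Phi A^{-1} Phi^T.
    N = len(t)
    C = np.eye(N) / beta + Phi @ np.diag(1.0 / alpha) @ Phi.T
    return multivariate_normal.logpdf(t, mean=np.zeros(N), cov=C)
```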
The predictive distribution, evaluated at the optimized hyperparameter values $\alpha^{*}$ and $\beta^{*}$, is:
$$p(t \mid x, \mathbf{X}, \mathbf{t}, \alpha^{*}, \beta^{*}) = \mathcal{N}(t \mid m^{T}\phi(x), \sigma^{2}(x))$$
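A sketch of the predictive computation; note that the variance expression $\sigma^{2}(x) = (\beta^{*})^{-1} + \phi(x)^{T}\Sigma\phi(x)$ is quoted from PRML rather than derived in these notes:

```python
def rvm_predict(phi_x, m, Sigma, beta):
    # Predictive mean m^T phi(x) and variance 1/beta + phi(x)^T Sigma phi(x).
    mean = m @ phi_x
    var = 1.0 / beta + phi_x @ Sigma @ phi_x
    return mean, var
```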
7.2.2 Analysis of sparsity
7.2.3 RVM for classification
We can extend the relevance vector machine framework to classification problems by applying the ARD prior over the weights to a probabilistic linear classification model of the kind studied in Chapter 4.