Pattern Recognition
PRML Chapter 7 Sparse Kernel Machines
7.1 Maximum Margin Classifiers
We begin by considering the two-class classification problem using linear models of the form:
$$y(x) = w^{T}\phi(x) + b$$
The maximum margin solution is found by solving:
$$\arg\max_{w, b}\left\{ \frac{1}{\|w\|}\min_{n}\left[ t_{n}\left( w^{T}\phi(x_{n}) + b \right) \right] \right\}$$
where we assume that all data points are correctly classified, so that $t_{n}y(x_{n}) > 0$ for all $n$.
The direct solution of this optimization problem would be very complex. Using the freedom to rescale $w$ and $b$ so that the point closest to the decision surface satisfies $t_{n}(w^{T}\phi(x_{n}) + b) = 1$, the problem becomes the equivalent minimization:
$$\arg\min_{w, b}\ \frac{1}{2}\|w\|^{2}$$
subject to the constraints:
$$t_{n}\left( w^{T}\phi(x_{n}) + b \right) \geq 1, \quad n = 1, \dots, N$$
To solve this constrained problem, we introduce Lagrange multipliers $a_{n} \geq 0$, and the Lagrangian function is:
$$L(w, b, a) = \frac{1}{2}\|w\|^{2} - \sum_{n=1}^{N}a_{n}\left\{ t_{n}\left( w^{T}\phi(x_{n}) + b \right) - 1 \right\}$$
Setting the derivatives with respect to $w$ and $b$ equal to zero, we obtain:
$$w = \sum_{n=1}^{N}a_{n}t_{n}\phi(x_{n})$$
$$0 = \sum_{n=1}^{N}a_{n}t_{n}$$
Eliminating $w$ and $b$ from $L(w, b, a)$ using these conditions then gives the dual representation of the maximum margin problem, in which we maximize:
$$\tilde{L}(a) = \sum_{n=1}^{N}a_{n} - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}a_{n}a_{m}t_{n}t_{m}k(x_{n}, x_{m})$$
subject to the constraints:
$$a_{n} \geq 0, \quad n = 1, \dots, N$$
$$\sum_{n=1}^{N}a_{n}t_{n} = 0$$
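As a concrete illustration, this dual problem can be handed to a generic constrained optimizer. The sketch below is not part of PRML: the toy data `X`, `t`, the linear `kernel()` and the use of SciPy's SLSQP solver are illustrative assumptions. It builds the Gram matrix and maximizes $\tilde{L}(a)$ under $a_{n} \geq 0$ and $\sum_{n} a_{n}t_{n} = 0$.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative only).
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [2.0, 0.5], [3.0, 1.0], [4.0, 1.5]])
t = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
N = len(t)

def kernel(x, z):
    # Linear kernel k(x, z) = x^T z; any valid kernel could be substituted.
    return x @ z

# Gram matrix K[n, m] = k(x_n, x_m).
K = np.array([[kernel(X[n], X[m]) for m in range(N)] for n in range(N)])

def neg_dual(a):
    # Negative of L~(a); negated because scipy minimizes.
    return -(a.sum() - 0.5 * (a * t) @ K @ (a * t))

cons = ({'type': 'eq', 'fun': lambda a: a @ t},)   # sum_n a_n t_n = 0
bounds = [(0.0, None)] * N                          # a_n >= 0

res = minimize(neg_dual, np.zeros(N), method='SLSQP',
               bounds=bounds, constraints=cons)
a = res.x   # dual coefficients a_n
```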
Substituting for $w$, the model can be expressed in terms of the parameters $\{a_{n}\}$ and the kernel function:
$$y(x) = \sum_{n=1}^{N}a_{n}t_{n}k(x, x_{n}) + b$$
A constrained optimization of this form satisfies the KKT conditions:
$$a_{n} \geq 0$$
$$t_{n}y(x_{n}) - 1 \geq 0$$
$$a_{n}\left\{ t_{n}y(x_{n}) - 1 \right\} = 0$$
Thus for every data point, either $a_{n} = 0$ or $t_{n}y(x_{n}) = 1$. The data points with $a_{n} > 0$ are the support vectors; denoting this set by $S$ and its size by $N_{S}$, solving for $b$ gives:
$$b = \frac{1}{N_{S}}\sum_{n\in S}\left( t_{n} - \sum_{m\in S}a_{m}t_{m}k(x_{n}, x_{m}) \right)$$
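Continuing the sketch above (the variables `a`, `X`, `t` and `kernel()` come from that illustrative snippet), the support vectors are the points with nonzero $a_{n}$, $b$ is averaged over them, and predictions use the kernel expansion:

```python
# Support vectors: points with a_n > 0 (up to a numerical tolerance).
S = np.where(a > 1e-6)[0]

# b averaged over the support vectors, following the formula above.
b = np.mean([t[n] - sum(a[m] * t[m] * kernel(X[n], X[m]) for m in S)
             for n in S])

def predict(x_new):
    # y(x) = sum_{n in S} a_n t_n k(x, x_n) + b
    return sum(a[n] * t[n] * kernel(x_new, X[n]) for n in S) + b

print(np.sign(predict(np.array([1.5, 2.5]))))   # expected +1 for this toy data
```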
7.1.1 Overlapping class distributions
When the class distributions overlap, we maximize the margin while softly penalizing points that lie on the wrong side of the margin boundary; we therefore minimize:
$$C\sum_{n=1}^{N}\xi_{n} + \frac{1}{2}\|w\|^{2}$$
The constraints will be replaced with:
$$t_{n}y(x_{n}) \geq 1 - \xi_{n}, \quad n = 1, \dots, N$$
$$\xi_{n} \geq 0$$
The corresponding Lagrangian is given by:
$$L(w, b, \xi, a, \mu) = \frac{1}{2}\|w\|^{2} + C\sum_{n=1}^{N}\xi_{n} - \sum_{n=1}^{N}a_{n}\left\{ t_{n}y(x_{n}) - 1 + \xi_{n} \right\} - \sum_{n=1}^{N}\mu_{n}\xi_{n}$$
The corresponding set of KKT conditions is given by:
$$a_{n} \geq 0$$
$$t_{n}y(x_{n}) - 1 + \xi_{n} \geq 0$$
$$a_{n}\left( t_{n}y(x_{n}) - 1 + \xi_{n} \right) = 0$$
$$\mu_{n} \geq 0$$
$$\xi_{n} \geq 0$$
$$\mu_{n}\xi_{n} = 0$$
After we eliminate $w$, $b$ and $\xi_{n}$ from the Lagrangian, we obtain the dual Lagrangian in the form:
$$\tilde{L}(a) = \sum_{n=1}^{N}a_{n} - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}a_{n}a_{m}t_{n}t_{m}k(x_{n}, x_{m})$$
subject to the constraints:
$$0 \leq a_{n} \leq C$$
$$\sum_{n=1}^{N}a_{n}t_{n} = 0$$
Similarly, a numerically stable solution for $b$ is obtained by averaging over the set $M$ of data points having $0 < a_{n} < C$:
$$b = \frac{1}{N_{M}}\sum_{n\in M}\left( t_{n} - \sum_{m\in S}a_{m}t_{m}k(x_{n}, x_{m}) \right)$$
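In the earlier dual sketch, the only change needed for the soft-margin case is the box constraint $0 \leq a_{n} \leq C$; $b$ is then averaged over the set $M$. A hedged continuation of that illustrative snippet (the value of `C` and the tolerances are arbitrary choices):

```python
C = 1.0   # illustrative choice of the regularization parameter

bounds_soft = [(0.0, C)] * N            # box constraints 0 <= a_n <= C
res_soft = minimize(neg_dual, np.zeros(N), method='SLSQP',
                    bounds=bounds_soft, constraints=cons)
a_soft = res_soft.x

S = np.where(a_soft > 1e-6)[0]                           # support vectors
M = np.where((a_soft > 1e-6) & (a_soft < C - 1e-6))[0]   # points with 0 < a_n < C

# Numerically stable bias, averaged over M (assumes M is non-empty).
b_soft = np.mean([t[n] - sum(a_soft[m] * t[m] * kernel(X[n], X[m]) for m in S)
                  for n in M])
```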
7.1.2 Relation to logistic regression
Introducing the hinge error function $E_{SV}(\cdot)$ and setting $\lambda = (2C)^{-1}$, the soft-margin objective function can be written in the form:
$$\sum_{n=1}^{N}E_{SV}(y_{n}t_{n}) + \lambda\|w\|^{2}$$
For logistic regression, we can likewise construct an error function by taking the negative logarithm of the likelihood function which, with a quadratic regularizer, takes the form:
$$\sum_{n=1}^{N}E_{LR}(y_{n}t_{n}) + \lambda\|w\|^{2}$$
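For comparison, the two error functions can be written down directly. The small sketch below uses plain NumPy; the names `E_SV` and `E_LR` simply mirror the formulas, with the hinge error $\max(0, 1 - z)$ and the logistic error $\ln(1 + e^{-z})$:

```python
import numpy as np

def E_SV(z):
    # Hinge error: zero for z >= 1, linear penalty otherwise.
    return np.maximum(0.0, 1.0 - z)

def E_LR(z):
    # Logistic-regression error ln(1 + exp(-z)).
    return np.log1p(np.exp(-z))

z = np.linspace(-2.0, 2.0, 5)
print(E_SV(z))
print(E_LR(z))
```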
7.1.3 Multiclass SVMs
7.1.4 SVMs for regression
To obtain sparse solutions, we therefore minimize a regularized error function:
$$C\sum_{n=1}^{N}E_{\epsilon}\left( y(x_{n}) - t_{n} \right) + \frac{1}{2}\|w\|^{2}$$
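Here $E_{\epsilon}$ is the $\epsilon$-insensitive error, which assigns zero error to points inside a tube of width $\epsilon$ around the prediction and a linear penalty outside it. A minimal sketch (the default $\epsilon = 0.1$ is an arbitrary illustrative choice):

```python
import numpy as np

def E_eps(r, eps=0.1):
    # epsilon-insensitive error: zero inside the tube |r| < eps, linear outside.
    return np.where(np.abs(r) < eps, 0.0, np.abs(r) - eps)

print(E_eps(np.array([-0.3, -0.05, 0.0, 0.2]), eps=0.1))
```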
Introducing slack variables, the error function for support vector regression can then be written as:
$$C\sum_{n=1}^{N}\left( \xi_{n} + \hat{\xi}_{n} \right) + \frac{1}{2}\|w\|^{2}$$
with the corresponding conditions (writing $y_{n} = y(x_{n})$):
$$y_{n} - \epsilon - \hat{\xi}_{n} \leq t_{n} \leq y_{n} + \epsilon + \xi_{n}$$
$$\xi_{n} \geq 0, \quad \hat{\xi}_{n} \geq 0$$
The dual problem involves maximizing:
$$\tilde{L}(a, \hat{a}) = -\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}(a_{n} - \hat{a}_{n})(a_{m} - \hat{a}_{m})k(x_{n}, x_{m}) - \epsilon\sum_{n=1}^{N}(a_{n} + \hat{a}_{n}) + \sum_{n=1}^{N}(a_{n} - \hat{a}_{n})t_{n}$$
subject to the box constraints:
$$0 \leq a_{n} \leq C$$
$$0 \leq \hat{a}_{n} \leq C$$
Solving the dual problem, we see that predictions for new inputs can be made using:
$$y(x) = \sum_{n=1}^{N}(a_{n} - \hat{a}_{n})k(x, x_{n}) + b$$
which is again expressed in terms of the kernel function. The corresponding KKT conditions are:
$$a_{n}(\epsilon + \xi_{n} + y_{n} - t_{n}) = 0$$
$$\hat{a}_{n}(\epsilon + \hat{\xi}_{n} - y_{n} + t_{n}) = 0$$
$$(C - a_{n})\xi_{n} = 0$$
$$(C - \hat{a}_{n})\hat{\xi}_{n} = 0$$
Considering a data point for which $0 < a_{n} < C$, so that $\xi_{n} = 0$ and hence $\epsilon + y_{n} - t_{n} = 0$, and solving for $b$, we obtain:
$$b = t_{n} - \epsilon - \sum_{m=1}^{N}(a_{m} - \hat{a}_{m})k(x_{n}, x_{m})$$
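Given dual coefficients $a_{n}$ and $\hat{a}_{n}$ obtained from a QP solver (as in the classification sketch earlier), the prediction and the bias follow directly from the two formulas above. The helper names below are illustrative and reuse the `kernel()` function defined in that sketch:

```python
def svr_bias(n, a, a_hat, X, t, eps):
    # b from a data point with 0 < a_n < C, so that xi_n = 0 and
    # epsilon + y_n - t_n = 0 is active.
    return t[n] - eps - sum((a[m] - a_hat[m]) * kernel(X[n], X[m])
                            for m in range(len(X)))

def svr_predict(x_new, a, a_hat, X, b):
    # y(x) = sum_n (a_n - a_hat_n) k(x, x_n) + b
    return sum((a[n] - a_hat[n]) * kernel(x_new, X[n])
               for n in range(len(X))) + b
```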
7.1.5 Computational learning theory
7.2 Relevance Vector Machines
7.2.1 RVM for regression
The relevance vector machine for regression is a linear model of the form discussed in Chapter 3, but with a modified prior that results in sparse solutions. The conditional distribution is:
$$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}), \beta^{-1})$$
where $\beta = \sigma^{-2}$ is the noise precision and the mean is given by a linear model of the form:
$$y(x) = \sum_{i=1}^{M}w_{i}\phi_{i}(x) = w^{T}\phi(x)$$
For the RVM, this general expression takes the SVM-like form:
$$y(x) = \sum_{n=1}^{N}w_{n}k(x, x_{n}) + b$$
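In this form the basis functions are one kernel per training point plus a bias, so the design matrix $\Phi$ has $N$ rows and $M = N + 1$ columns. A minimal sketch, assuming a Gaussian kernel as an illustrative choice (the function names and `gamma` parameter are not from PRML):

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # Illustrative Gaussian kernel; the RVM places no restriction on the kernel.
    return np.exp(-gamma * np.sum((x - z) ** 2))

def design_matrix(X, gamma=1.0):
    # Phi is N x (N + 1): a constant bias column followed by one kernel
    # basis function k(., x_m) centred on each training point.
    N = len(X)
    Phi = np.ones((N, N + 1))
    for n in range(N):
        for m in range(N):
            Phi[n, m + 1] = rbf_kernel(X[n], X[m], gamma)
    return Phi
```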
Given $N$ observations, the likelihood function is:
$$p(\mathbf{t} \mid \mathbf{X}, w, \beta) = \prod_{n=1}^{N}p(t_{n} \mid x_{n}, w, \beta)$$
The prior distribution over the parameter $w$ is:
$$p(w \mid \alpha) = \prod_{i=1}^{M}\mathcal{N}(w_{i} \mid 0, \alpha_{i}^{-1})$$
It is easy to infer that the posterior distribution for $w$ takes the form:
$$p(w \mid \mathbf{t}, \mathbf{X}, \alpha, \beta) = \mathcal{N}(w \mid m, \Sigma)$$
where $m = \beta\Sigma\Phi^{T}\mathbf{t}$ and $\Sigma = (\mathbf{A} + \beta\Phi^{T}\Phi)^{-1}$, with $\mathbf{A} = \mathrm{diag}(\alpha_{i})$.
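These two formulas translate directly into code. A sketch assuming a design matrix `Phi` as in the previous snippet and given values of `alpha` and `beta` (a practical implementation would use a numerically stabler solve than an explicit inverse):

```python
import numpy as np

def rvm_posterior(Phi, t, alpha, beta):
    # Sigma = (A + beta Phi^T Phi)^{-1} with A = diag(alpha); m = beta Sigma Phi^T t.
    A = np.diag(alpha)
    Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)
    m = beta * Sigma @ Phi.T @ t
    return m, Sigma
```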
The hyperparameters $\alpha$ and $\beta$ are determined by maximizing the marginal likelihood function:
$$p(\mathbf{t} \mid \mathbf{X}, \alpha, \beta) = \int p(\mathbf{t} \mid \mathbf{X}, w, \beta)\, p(w \mid \alpha)\, dw$$
The log marginal likelihood takes the form:
$$\ln p(\mathbf{t} \mid \mathbf{X}, \alpha, \beta) = \ln \mathcal{N}(\mathbf{t} \mid \mathbf{0}, \mathbf{C})$$
where
$$\mathbf{C} = \beta^{-1}\mathbf{I} + \Phi\mathbf{A}^{-1}\Phi^{T}$$
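The log marginal likelihood can be evaluated directly from $\mathbf{C}$. A sketch using SciPy's multivariate normal density (in a full RVM this quantity is maximized iteratively over $\alpha$ and $\beta$, which is not shown here):

```python
import numpy as np
from scipy.stats import multivariate_normal

def rvm_log_marginal(Phi, t, alpha, beta):
    # ln p(t | X, alpha, beta) = ln N(t | 0, C), C = beta^{-1} I + Phi A^{-1} Phi^T.
    N = len(t)
    C = np.eye(N) / beta + Phi @ np.diag(1.0 / alpha) @ Phi.T
    return multivariate_normal.logpdf(t, mean=np.zeros(N), cov=C)
```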
The predictive distribution, evaluated at the optimized hyperparameter values $\alpha^{*}$ and $\beta^{*}$, is:
$$p(t \mid x, \mathbf{X}, \mathbf{t}, \alpha^{*}, \beta^{*}) = \mathcal{N}(t \mid m^{T}\phi(x), \sigma^{2}(x))$$
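A sketch of the predictive computation; note that the variance expression $\sigma^{2}(x) = (\beta^{*})^{-1} + \phi(x)^{T}\Sigma\phi(x)$ is quoted from PRML rather than derived in these notes:

```python
def rvm_predict(phi_x, m, Sigma, beta):
    # Predictive mean m^T phi(x) and variance 1/beta + phi(x)^T Sigma phi(x).
    mean = m @ phi_x
    var = 1.0 / beta + phi_x @ Sigma @ phi_x
    return mean, var
```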
7.2.2 Analysis of sparsity
7.2.3 RVM for classification
We can extend the relevance vector machine framework to classification problems by applying the ARD prior over the weights to a probabilistic linear classification model of the kind studied in Chapter 4.