Grounding of Textual Phrases in Images by Reconstruction
(Reading image grounding papers to look for inspiration for the moment localization task.)
Note: ECCV 2016; paper available on arXiv.
Motivation:
- Many prior efforts in this area have focused on rather constrained settings with a small number of nouns to ground. On the contrary, we want to tackle the problem of grounding arbitrary natural language phrases in images. (Previous work mostly grounds a small set of nouns in images $\to$ this paper targets arbitrary natural-language phrases.)
- Most parallel corpora of sentence/visual data do not provide localization annotations (e.g. bounding boxes) and the annotation process is costly. We propose an approach which can learn to localize phrases relying only on phrases associated with images, without bounding box annotations, but which is also able to incorporate phrases with bounding box supervision when available. (Annotation is costly $\to$ the approach works for weakly supervised grounding, where the training set only has phrases annotated for each image, without the corresponding image regions.)
Target:
Weakly supervised setting: given a phrase $p$ and the corresponding image $I$, find the region $r_i$ in $I$ (a segment or bounding box) that corresponds to $p$.
Main Idea:
Since there is a mapping from phrase $p$ and image $I$ to region $r_i$, i.e. $f: p, I \to r_i$, then ideally the phrase $p$ can also be reconstructed from image $I$ and region $r_i$ (similar in spirit to an autoencoder).
Contribution:
- The proposed model uses an attention mechanism in the grounding stage.
- A phrase reconstruction module with a reconstruction loss is added, so the model can be trained under various supervision settings: fully supervised, semi-supervised, and unsupervised.
- Good performance.
Model:
1) Grounding
To select the correct bounding box from the region proposals $\{r_i\}_{i=1,...,N}$, we define an attention function $f_{ATT}$ and select the box $j$ which receives the maximum attention (i.e., find the attended region in the image):
$j = \underset{i}{\arg\max}\, f_{ATT}(p, r_i)$
Specifically: each word is encoded as a one-hot vector and then embedded into a low-dimensional space (no pre-trained word embeddings??).
An LSTM is used as the encoder of phrase $p$, and its final hidden state is taken as the phrase representation:
$h = f_{LSTM}(p)$
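Below is a minimal PyTorch sketch of such a phrase encoder, assuming integer word indices as input; the names (`PhraseEncoder`, `vocab_size`, `embed_dim`, `hidden_dim`) are illustrative, not from the paper:

```python
import torch.nn as nn

class PhraseEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        # learned embedding of one-hot word indices into a low-dimensional space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, phrase_tokens):
        # phrase_tokens: (batch, T) integer word indices
        emb = self.embed(phrase_tokens)      # (batch, T, embed_dim)
        _, (h_n, _) = self.lstm(emb)         # h_n: (num_layers, batch, hidden_dim)
        return h_n[-1]                       # h = f_LSTM(p), shape (batch, hidden_dim)
```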
A CNN extracts features for every image region $r_i$:
$v_i = f_{CNN}(r_i)$
Attention module (a fairly standard formulation):
$\bar{a}_i = f_{ATT}(p, r_i) = W_2\,\mathrm{ReLU}(W_h h + W_v v_i + b_1) + b_2$
Applying a softmax over $\bar{a}_i$ yields, for each region $r_i$, the probability $a_i$ of being the correct region $r_{\hat{j}}$.
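A minimal PyTorch sketch of this attention-based grounding step, assuming batched region features and one score per region; all names (`AttentionGrounder`, `W_h`, `W_v`, `W_2`) are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGrounder(nn.Module):
    def __init__(self, phrase_dim, visual_dim, hidden_dim):
        super().__init__()
        self.W_h = nn.Linear(phrase_dim, hidden_dim)               # W_h h (+ part of b_1)
        self.W_v = nn.Linear(visual_dim, hidden_dim, bias=False)   # W_v v_i
        self.W_2 = nn.Linear(hidden_dim, 1)                        # W_2 (.) + b_2

    def forward(self, h, v):
        # h: (batch, phrase_dim) phrase encoding, v: (batch, N, visual_dim) region features
        scores = self.W_2(torch.relu(self.W_h(h).unsqueeze(1) + self.W_v(v)))  # (batch, N, 1)
        a_bar = scores.squeeze(-1)           # raw attention scores \bar{a}_i
        a = F.softmax(a_bar, dim=-1)         # probabilities a_i
        j = a.argmax(dim=-1)                 # index of the attended region
        return a_bar, a, j
```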
2) Reconstruction
For the attended bounding boxes, compute the weighted sum of their visual features:
$v_{att} = \sum_{i=1}^{N} a_i v_i$
The resulting feature $v_{att}$ is then encoded:
$v'_{att} = f_{REC}(v_{att}) = \mathrm{ReLU}(W_a v_{att} + b_a)$
The final visual feature $v'_{att}$ is fed into a single-layer LSTM as its initial state, producing a distribution over phrase $p$ conditioned on $v'_{att}$, i.e. the reconstruction of the phrase:
$P(p \mid v'_{att}) = f_{LSTM}(v'_{att})$
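A sketch of this reconstruction branch, assuming a standard LSTM decoder whose initial hidden state is set to $v'_{att}$ and which is fed the embedded phrase tokens (teacher forcing); module and argument names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhraseReconstructor(nn.Module):
    def __init__(self, visual_dim, hidden_dim, vocab_size, embed_dim):
        super().__init__()
        self.f_rec = nn.Linear(visual_dim, hidden_dim)                 # W_a, b_a
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, a, v, phrase_tokens):
        # a: (batch, N) attention weights, v: (batch, N, visual_dim) region features
        v_att = torch.bmm(a.unsqueeze(1), v).squeeze(1)                # weighted sum of region features
        v_att_enc = F.relu(self.f_rec(v_att))                          # v'_att
        h0 = v_att_enc.unsqueeze(0)                                    # (1, batch, hidden_dim)
        c0 = torch.zeros_like(h0)
        emb = self.embed(phrase_tokens)                                # (batch, T, embed_dim)
        out, _ = self.decoder(emb, (h0, c0))                           # decode the phrase
        return self.out(out)                                           # per-step logits over the vocabulary
```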
3) Loss Function
In the supervised setting, the ground-truth region $r_{\hat{j}}$ is available, so a prediction (attention) loss is introduced:
$L_{att} = -\frac{1}{B}\sum_{b=1}^{B} \log P(\hat{j} \mid \bar{a})$
In the unsupervised setting there is no ground-truth region, so the prediction loss is dropped and only the reconstruction loss is used, i.e. the likelihood of reconstructing phrase $p$ is maximized:
$L_{rec} = -\frac{1}{B}\sum_{b=1}^{B} \log P(\hat{p} \mid v'_{att})$
Naturally, in the semi-supervised setting the two losses can be combined:
$L = \lambda L_{att} + L_{rec}$
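A hedged sketch of how the two losses might be combined in code; `a_bar`, `gt_region`, `logits`, `phrase_targets`, and `lam` are illustrative names:

```python
import torch.nn.functional as F

def grounder_loss(a_bar, gt_region, logits, phrase_targets, lam=1.0, supervised=True):
    # Reconstruction loss: negative log-likelihood of the input phrase under the decoder.
    L_rec = F.cross_entropy(logits.reshape(-1, logits.size(-1)), phrase_targets.reshape(-1))
    if not supervised:
        return L_rec                         # unsupervised: reconstruction loss only
    # Attention loss: cross-entropy between attention scores and the ground-truth region index.
    L_att = F.cross_entropy(a_bar, gt_region)
    return lam * L_att + L_rec               # semi-/fully supervised: L = lambda * L_att + L_rec
```

Setting `supervised=False` recovers the unsupervised objective, while `lam` controls the trade-off in the semi-supervised case.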
Summary
References
[1] Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T., & Schiele, B. (2016). Grounding of Textual Phrases in Images by Reconstruction. European Conference on Computer Vision.