Grounding of Textual Phrases in Images by Reconstruction
(Reading image grounding papers to look for inspiration for the moment localization task.)
Note: ECCV 2016; paper available on arXiv.
Motivation:
- Many prior efforts in this area have focused on rather constrained settings with a small number of nouns to ground. On the contrary, we want to tackle the problem of grounding arbitrary natural language phrases in images. (Previous work mostly grounds a small set of nouns in images $\to$ this paper targets arbitrary natural-language phrases.)
- Most parallel corpora of sentence/visual data do not provide localization annotations (e.g. bounding boxes) and the annotation process is costly. We propose an approach which can learn to localize phrases relying only on phrases associated with images, without bounding box annotations, but which is also able to incorporate phrases with bounding box supervision when available. (Annotation is costly $\to$ the approach works for weakly supervised grounding, where the training set only has phrases annotated for each image, without the corresponding image regions.)
Target:
Weakly supervised setting: given a phrase $p$ and the corresponding image $I$, find the region $r_i$ in $I$ (a segment or bounding box) that corresponds to $p$.
Main Idea:
Since there is a mapping from phrase $p$ and image $I$ to region $r_i$, i.e. $f: p, I \to r_i$, then ideally the phrase $p$ can also be reconstructed from image $I$ and region $r_i$ (similar in spirit to an autoencoder).
Contribution:
- The proposed model uses an attention mechanism in the grounding stage.
- A phrase reconstruction module with a reconstruction loss is added, so the model can be trained under various supervision settings: fully supervised, semi-supervised, and unsupervised.
- Good performance.
Model:
1) Grounding
To select the correct bounding box from the region proposals $\{r_i\}_{i=1,...,N}$, we define an attention function $f_{ATT}$ and select the box $j$ which receives the maximum attention (i.e., find the attended region in the image):
$j = \underset{i}{\arg\max}\, f_{ATT}(p, r_i)$
Specifically: each word is encoded as a one-hot vector and then embedded into a low-dimensional space (no pre-trained word embeddings??).
An LSTM is used as the encoder of phrase $p$, and its final hidden state is taken as the phrase representation:
$h = f_{LSTM}(p)$
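Below is a minimal PyTorch sketch of such a phrase encoder, assuming integer word indices as input; the names (`PhraseEncoder`, `vocab_size`, `embed_dim`, `hidden_dim`) are illustrative, not from the paper:

```python
import torch.nn as nn

class PhraseEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        # learned embedding of one-hot word indices into a low-dimensional space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, phrase_tokens):
        # phrase_tokens: (batch, T) integer word indices
        emb = self.embed(phrase_tokens)      # (batch, T, embed_dim)
        _, (h_n, _) = self.lstm(emb)         # h_n: (num_layers, batch, hidden_dim)
        return h_n[-1]                       # h = f_LSTM(p), shape (batch, hidden_dim)
```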
A CNN extracts features for every image region $r_i$:
$v_i = f_{CNN}(r_i)$
Attention module (a fairly standard formulation):
$\bar{a}_i = f_{ATT}(p, r_i) = W_2\,\mathrm{ReLU}(W_h h + W_v v_i + b_1) + b_2$
Applying a softmax over $\bar{a}_i$ yields, for each region $r_i$, the probability $a_i$ of being the correct region $r_{\hat{j}}$.
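A minimal PyTorch sketch of this attention-based grounding step, assuming batched region features and one score per region; all names (`AttentionGrounder`, `W_h`, `W_v`, `W_2`) are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGrounder(nn.Module):
    def __init__(self, phrase_dim, visual_dim, hidden_dim):
        super().__init__()
        self.W_h = nn.Linear(phrase_dim, hidden_dim)               # W_h h (+ part of b_1)
        self.W_v = nn.Linear(visual_dim, hidden_dim, bias=False)   # W_v v_i
        self.W_2 = nn.Linear(hidden_dim, 1)                        # W_2 (.) + b_2

    def forward(self, h, v):
        # h: (batch, phrase_dim) phrase encoding, v: (batch, N, visual_dim) region features
        scores = self.W_2(torch.relu(self.W_h(h).unsqueeze(1) + self.W_v(v)))  # (batch, N, 1)
        a_bar = scores.squeeze(-1)           # raw attention scores \bar{a}_i
        a = F.softmax(a_bar, dim=-1)         # probabilities a_i
        j = a.argmax(dim=-1)                 # index of the attended region
        return a_bar, a, j
```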
2) Reconstruction
For the attended bounding boxes, compute the weighted sum of their visual features:
$v_{att} = \sum_{i=1}^{N} a_i v_i$
The resulting feature $v_{att}$ is then encoded:
$v'_{att} = f_{REC}(v_{att}) = \mathrm{ReLU}(W_a v_{att} + b_a)$
The final visual feature $v'_{att}$ is fed into a single-layer LSTM as its initial state, producing a distribution over phrase $p$ conditioned on $v'_{att}$, i.e. the reconstruction of the phrase:
$P(p \mid v'_{att}) = f_{LSTM}(v'_{att})$
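A sketch of this reconstruction branch, assuming a standard LSTM decoder whose initial hidden state is set to $v'_{att}$ and which is fed the embedded phrase tokens (teacher forcing); module and argument names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhraseReconstructor(nn.Module):
    def __init__(self, visual_dim, hidden_dim, vocab_size, embed_dim):
        super().__init__()
        self.f_rec = nn.Linear(visual_dim, hidden_dim)                 # W_a, b_a
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, a, v, phrase_tokens):
        # a: (batch, N) attention weights, v: (batch, N, visual_dim) region features
        v_att = torch.bmm(a.unsqueeze(1), v).squeeze(1)                # weighted sum of region features
        v_att_enc = F.relu(self.f_rec(v_att))                          # v'_att
        h0 = v_att_enc.unsqueeze(0)                                    # (1, batch, hidden_dim)
        c0 = torch.zeros_like(h0)
        emb = self.embed(phrase_tokens)                                # (batch, T, embed_dim)
        out, _ = self.decoder(emb, (h0, c0))                           # decode the phrase
        return self.out(out)                                           # per-step logits over the vocabulary
```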
3) Loss Function
In the supervised setting, the ground-truth region $r_{\hat{j}}$ is available, so a prediction (attention) loss is introduced:
$L_{att} = -\frac{1}{B}\sum_{b=1}^{B} \log P(\hat{j} \mid \bar{a})$
In the unsupervised setting there is no ground-truth region, so the prediction loss is dropped and only the reconstruction loss is used, i.e. the likelihood of reconstructing phrase $p$ is maximized:
$L_{rec} = -\frac{1}{B}\sum_{b=1}^{B} \log P(\hat{p} \mid v'_{att})$
Naturally, in the semi-supervised setting the two losses can be combined:
$L = \lambda L_{att} + L_{rec}$
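A hedged sketch of how the two losses might be combined in code; `a_bar`, `gt_region`, `logits`, `phrase_targets`, and `lam` are illustrative names:

```python
import torch.nn.functional as F

def grounder_loss(a_bar, gt_region, logits, phrase_targets, lam=1.0, supervised=True):
    # Reconstruction loss: negative log-likelihood of the input phrase under the decoder.
    L_rec = F.cross_entropy(logits.reshape(-1, logits.size(-1)), phrase_targets.reshape(-1))
    if not supervised:
        return L_rec                         # unsupervised: reconstruction loss only
    # Attention loss: cross-entropy between attention scores and the ground-truth region index.
    L_att = F.cross_entropy(a_bar, gt_region)
    return lam * L_att + L_rec               # semi-/fully supervised: L = lambda * L_att + L_rec
```

Setting `supervised=False` recovers the unsupervised objective, while `lam` controls the trade-off in the semi-supervised case.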
Summary
References
[1] Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T., & Schiele, B. (2016). Grounding of Textual Phrases in Images by Reconstruction. European Conference on Computer Vision.