Grounding of Textual Phrases in Images by Reconstruction

(Reading image grounding papers for inspiration on the moment localization task.)

Note: ECCV 2016. Paper: arXiv link.

Motivation:

  1. "Many prior efforts in this area have focused on rather constrained settings with a small number of nouns to ground. On the contrary, we want to tackle the problem of grounding arbitrary natural language phrases in images." (Prior work mostly grounds a small, fixed set of nouns → this paper targets arbitrary natural-language phrases.)

  2. "Most parallel corpora of sentence/visual data do not provide localization annotations (e.g. bounding boxes) and the annotation process is costly. We propose an approach which can learn to localize phrases relying only on phrases associated with images without bounding box annotations but which is also able to incorporate phrases with bounding box supervision when available." (Annotation is costly → the method handles weakly supervised grounding, where the training set only has phrases associated with each image, without annotations of the corresponding regions inside the image.)

Target:

Weakly supervised setting: given a phrase $p$ and the corresponding image $I$, find the region $r_i$ in $I$ (a segment or bounding box) that corresponds to $p$.

Main Idea:

Since there is a mapping from phrase $p$ and image $I$ to region $r_i$, i.e. $f: p, I \to r_i$, then ideally the phrase $p$ should also be reconstructable from image $I$ and region $r_i$ (a similar idea to an autoencoder).

Contributions:

  1. The proposed model uses an attention mechanism in the grounding stage.

  2. A phrase-reconstruction module with a reconstruction loss is added, so the model can be trained under different supervision settings: fully supervised, semi-supervised, and unsupervised.

  3. Good performance.

Model:

1) Grounding

To select the correct bounding box from the region proposals $\{r_i\}_{i=1,\dots,N}$, we define an attention function $f_{ATT}$ and select the box $j$ which receives the maximum attention (i.e., find the attended region in the image):

$$j = \arg\max_{i} f_{ATT}(p, r_i)$$

Specifically: each word is first encoded as a one-hot vector and then embedded into a low-dimensional space (no pre-trained word embeddings?).

An LSTM is used as the encoder of phrase $p$, and its final hidden state serves as the phrase representation:

$$h = f_{LSTM}(p)$$
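A minimal PyTorch sketch of this phrase encoder, assuming illustrative sizes (`vocab_size`, `embed_dim`, `hidden_dim` and the class name are placeholders, not values from the paper):

```python
import torch
import torch.nn as nn

class PhraseEncoder(nn.Module):
    """Encode a phrase (sequence of word indices) into a single vector h = f_LSTM(p)."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
        super().__init__()
        # A one-hot vector followed by a linear projection into a low-dimensional
        # space is equivalent to a learned embedding lookup.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, word_ids):           # word_ids: (B, T) integer tensor
        x = self.embed(word_ids)           # (B, T, embed_dim)
        _, (h_n, _) = self.lstm(x)         # h_n: (1, B, hidden_dim)
        return h_n.squeeze(0)              # (B, hidden_dim) -- final hidden state
```

This also answers the aside above: encoding each word as a one-hot vector and projecting it into a low-dimensional space is itself a learned word embedding; it is simply trained from scratch rather than pre-trained.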

A CNN extracts features for every image region $r_i$:

$$v_i = f_{CNN}(r_i)$$
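A rough sketch of how the region features $v_i$ could be obtained. The paper uses VGG16 features; the crop-and-forward pipeline below (with a recent torchvision API) is my assumption about the details:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Feature extractor for cropped region proposals (recent torchvision assumed).
backbone = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
# Drop the final classification layer to keep the 4096-d fc7 activations.
backbone.classifier = torch.nn.Sequential(*list(backbone.classifier.children())[:-1])
backbone.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def region_features(crops):
    """crops: list of PIL images, one per region proposal r_i."""
    batch = torch.stack([preprocess(c) for c in crops])   # (N, 3, 224, 224)
    return backbone(batch)                                 # (N, 4096) features v_i
```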

Attention module (a fairly standard formulation):

$$\bar{a}_i = f_{ATT}(p, r_i) = W_2\,\mathrm{ReLU}(W_h h + W_v v_i + b_1) + b_2$$

Applying a softmax over the scores $\bar{a}_i$ gives each region $r_i$ a probability $a_i$ of being the correct region $r_{\hat{j}}$.
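A sketch of $f_{ATT}$ plus the softmax as a small two-layer network (hidden size and variable names are assumptions, and the bias placement mirrors $b_1$, $b_2$ above):

```python
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """Score each region feature v_i against the phrase code h."""
    def __init__(self, phrase_dim=512, region_dim=4096, hidden_dim=512):
        super().__init__()
        self.W_h = nn.Linear(phrase_dim, hidden_dim, bias=False)
        self.W_v = nn.Linear(region_dim, hidden_dim, bias=True)   # bias plays the role of b_1
        self.W_2 = nn.Linear(hidden_dim, 1, bias=True)            # bias plays the role of b_2

    def forward(self, h, v):                # h: (B, phrase_dim), v: (B, N, region_dim)
        # Broadcast the phrase code over the N region proposals.
        s = torch.relu(self.W_h(h).unsqueeze(1) + self.W_v(v))    # (B, N, hidden_dim)
        a_bar = self.W_2(s).squeeze(-1)                           # (B, N) raw scores
        a = torch.softmax(a_bar, dim=1)                           # (B, N) probabilities a_i
        return a_bar, a
```

At test time the grounded box is simply the highest-scoring proposal, `j = a_bar.argmax(dim=1)`, matching the arg max selection above.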

2) Reconstruction

For the attended bounding boxes, compute the attention-weighted sum of their visual features:

$$v_{att} = \sum_{i=1}^{N} a_i v_i$$

The resulting feature $v_{att}$ is then encoded:

$$v'_{att} = f_{REC}(v_{att}) = \mathrm{ReLU}(W_a v_{att} + b_a)$$
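The weighted sum and the reconstruction encoder $f_{REC}$ in a few lines (same assumed dimensions as before):

```python
import torch
import torch.nn as nn

def attended_feature(a, v):
    """v_att = sum_i a_i * v_i ; a: (B, N) weights, v: (B, N, region_dim) features."""
    return torch.bmm(a.unsqueeze(1), v).squeeze(1)        # (B, region_dim)

class RecEncoder(nn.Module):
    """v'_att = f_REC(v_att) = ReLU(W_a v_att + b_a)."""
    def __init__(self, region_dim=4096, hidden_dim=512):
        super().__init__()
        self.W_a = nn.Linear(region_dim, hidden_dim)       # bias is b_a
    def forward(self, v_att):
        return torch.relu(self.W_a(v_att))                 # (B, hidden_dim)
```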

The visual feature $v'_{att}$ is then fed into a single-layer LSTM as its initial state, producing a distribution over phrase $p$ conditioned on $v'_{att}$, i.e., the reconstruction of the phrase:

$$P(p \mid v'_{att}) = f_{LSTM}(v'_{att})$$
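A sketch of the phrase decoder. Injecting $v'_{att}$ as the initial hidden state with teacher forcing on the phrase words is one plausible implementation, not necessarily exactly what the authors do; all names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class PhraseDecoder(nn.Module):
    """Reconstruct the phrase from v'_att, i.e. model P(p | v'_att) with a single-layer LSTM."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(hidden_dim, hidden_dim)    # map v'_att to the initial state
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, v_att_prime, word_ids):
        # word_ids: (B, T) ground-truth phrase, used for teacher forcing during training.
        h0 = torch.tanh(self.init_h(v_att_prime)).unsqueeze(0)   # (1, B, hidden_dim)
        c0 = torch.zeros_like(h0)
        x = self.embed(word_ids)                                  # (B, T, embed_dim)
        out, _ = self.lstm(x, (h0, c0))                           # (B, T, hidden_dim)
        return self.out(out)                                      # (B, T, vocab) word logits
```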

3) Loss Function

In the fully supervised setting, the ground-truth region $r_{\hat{j}}$ is available, so a prediction (attention) loss is introduced:

$$L_{att} = -\frac{1}{B}\sum_{b=1}^{B}\log\big(P(\hat{j} \mid \bar{a})\big)$$

In the unsupervised setting there is no ground-truth region, so the prediction loss is dropped and only the reconstruction loss is used, i.e., the likelihood of reconstructing phrase $p$ is maximized:

$$L_{rec} = -\frac{1}{B}\sum_{b=1}^{B}\log\big(P(\hat{p} \mid v'_{att})\big)$$

Naturally, in the semi-supervised setting both losses can be combined:

$$L = \lambda L_{att} + L_{rec}$$
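A sketch of how the three supervision regimes can share one training loss for a mixed batch; `supervised_mask` and `lam` are hypothetical names, and padding handling for variable-length phrases is omitted:

```python
import torch
import torch.nn.functional as F

def grounder_loss(a_bar, gt_region, logits, gt_words, lam=1.0, supervised_mask=None):
    """L = lam * L_att + L_rec (a sketch).

    a_bar:    (B, N) raw attention scores over region proposals
    gt_region:(B,)   index of the ground-truth region (valid only where supervised)
    logits:   (B, T, V) decoder outputs for the reconstructed phrase
    gt_words: (B, T) target word indices of the phrase
    supervised_mask: (B,) bool tensor marking samples that have box annotations
    """
    # Reconstruction loss: negative log-likelihood of the phrase words.
    l_rec = F.cross_entropy(logits.flatten(0, 1), gt_words.flatten())

    # Attention loss, applied only to the annotated subset of the batch
    # (fully supervised: all True; unsupervised: all False).
    if supervised_mask is not None and supervised_mask.any():
        l_att = F.cross_entropy(a_bar[supervised_mask], gt_region[supervised_mask])
    else:
        l_att = torch.zeros((), device=a_bar.device)

    return lam * l_att + l_rec
```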

Summary

References

[1] Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T., & Schiele, B. (2016). Grounding of Textual Phrases in Images by Reconstruction. European Conference on Computer Vision.

