ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs


Contents

  • Abstract
  • 1 Introduction
  • 2 Related Work
  • 3 BCNN: Basic Bi-CNN
  • 4 ABCNN: Attention-Based BCNN
  • 5 Experiments
    • 5.1 Answer Selection
    • 5.2 Paraphrase Identification
    • 5.3 Textual Entailment
  • 6 Summary

Abstract

How to model a pair of sentences is a critical issue in many NLP tasks such as answer selection (AS), paraphrase identification (PI) and textual entailment (TE). Most prior work (i) deals with one individual task by fine-tuning a specific system; (ii) models each sentence’s representation separately, rarely considering the impact of the other sentence; or (iii) relies fully on manually designed, task-specific linguistic features. This work presents a general Attention Based Convolutional Neural Network (ABCNN) for modeling a pair of sentences. We make three contributions. (i) The ABCNN can be applied to a wide variety of tasks that require modeling of sentence pairs. (ii) We propose three attention schemes that integrate mutual influence between sentences into CNNs; thus, the representation of each sentence takes into consideration its counterpart. These interdependent sentence pair representations are more powerful than isolated sentence representations. (iii) ABCNNs achieve state-of-the-art performance on AS, PI and TE tasks. We release code at: https://github.com/yinwenpeng/Answer_Selection.

1 Introduction

How to model a pair of sentences is a critical issue in many NLP tasks such as answer selection (AS) (Yu et al., 2014; Feng et al., 2015), paraphrase identification (PI) (Madnani et al., 2012; Yin and Schutze, 2015a), textual entailment (TE) (Marelli et al., 2014a; Bowman et al., 2015a) etc.
[Figure 1: example sentence pairs for AS, PI and TE]

Most prior work derives each sentence’s representation separately, rarely considering the impact of the other sentence. This neglects the mutual influence of the two sentences in the context of the task. It also contradicts what humans do when comparing two sentences. We usually focus on key parts of one sentence by extracting parts from the other sentence that are related by identity, synonymy, antonymy and other relations. Thus, human beings model the two sentences together, using the content of one sentence to guide the representation of the other.

Figure 1 demonstrates that each sentence of a pair partially determines which parts of the other sentence we must focus on. For AS, correctly answering $s_0$ requires attention on “gross”: $s_1^+$ contains a corresponding unit (“earned”) while $s_1^-$ does not. For PI, focus should be removed from “today” to correctly recognize $\langle s_0, s_1^+\rangle$ as paraphrases and $\langle s_0, s_1^-\rangle$ as non-paraphrases. For TE, we need to focus on “full of people” (to recognize TE for $\langle s_0, s_1^+\rangle$) and on “outdoors” / “indoors” (to recognize non-TE for $\langle s_0, s_1^-\rangle$). These examples show the need for an architecture that computes different representations of $s_i$ for different $s_{1-i}$ ($i \in \{0,1\}$).

Convolutional Neural Networks (CNNs) (LeCun et al., 1998) are widely used to model sentences (Kalchbrenner et al., 2014; Kim, 2014) and sentence pairs (Socher et al., 2011; Yin and Schutze, 2015a), especially in classification tasks. CNNs are supposed to be good at extracting robust and abstract features of input. This work presents the ABCNN, an attention-based convolutional neural network, that has a powerful mechanism for modeling a sentence pair by taking into account the interdependence between the two sentences. The ABCNN is a general architecture that can handle a wide variety of sentence pair modeling tasks.

Some prior work proposes simple mechanisms that can be interpreted as controlling varying attention; e.g., Yih et al. (2013) employ word alignment to match related parts of the two sentences. In contrast, our attention scheme based on CNNs models relatedness between two parts fully automatically. Moreover, attention at multiple levels of granularity, not only at word level, is achieved as we stack multiple convolution layers that increase abstraction.

Prior work on attention in deep learning (DL) mostly addresses long short-term memory networks (LSTMs) (Hochreiter and Schmidhuber, 1997). LSTMs usually achieve attention in a word-to-word scheme, and word representations mostly encode the whole context within the sentence (Bahdanau et al., 2015; Rocktaschel et al., 2016). It is not clear whether this is the best strategy; e.g., in the AS example in Figure 1, it is possible to determine that “how much” in $s_0$ matches “$161.5 million” in $s_1$ without taking the entire sentence contexts into account. This observation was also investigated by Yao et al. (2013b), where an information retrieval system retrieves sentences with tokens labeled as DATE by named entity recognition or as CD by POS tagging if there is a “when” question. However, labels or POS tags require extra tools. CNNs benefit from incorporating attention into representations of local phrases detected by filters; in contrast, LSTMs encode the whole context to form attention-based word representations – a strategy that is more complex than the CNN strategy and (as our experiments suggest) performs less well for some tasks.

Apart from these differences, it is clear that attention has as much potential for CNNs as it does for LSTMs. As far as we know, this is the first NLP paper that incorporates attention into CNNs. Our ABCNNs achieve state-of-the-art performance on AS and TE tasks and competitive performance on PI, and obtain further improvements on all three tasks when linguistic features are used.

2 Related Work

Non-DL on Sentence Pair Modeling.
Sentence pair modeling has attracted lots of attention in the past decades. Many tasks can be reduced to a semantic text matching problem. In this paper, we adopt the arguments of Yih et al. (2013), who argue against shallow approaches as well as against semantic text matching approaches that can be computationally expensive:

Due to the variety of word choices and inherent ambiguities in natural language, bag-of-word approaches with simple surface-form word matching tend to produce brittle results with poor prediction accuracy (Bilotti et al., 2007). As a result, researchers put more emphasis on exploiting syntactic and semantic structure. Representative examples include methods based on deeper semantic analysis (Shen and Lapata, 2007; Moldovan et al., 2007), tree edit-distance (Punyakanok et al., 2004; Heilman and Smith, 2010) and quasi-synchronous grammars (Wang et al., 2007) that match the dependency parse trees of the two sentences.

Instead of focusing on the high-level semantic representation, Yih et al. (2013) turn their attention to improving the shallow semantic component, lexical semantics, by performing semantic matching based on a latent word-alignment structure (cf. Chang et al. (2010)). Lai and Hockenmaier (2014) explore finer-grained word overlap and alignment between two sentences using negation, hypernym, synonym and antonym relations. Yao et al. (2013a) extend word-to-word alignment to phrase-to-phrase alignment by a semi-Markov CRF. However, such approaches often require more computational resources. In addition, employing syntactic or semantic parsers – which produce errors on many sentences – to find the best match between the structured representations of two sentences is not trivial.

DL on Sentence Pair Modeling.
To address some of the challenges of non-DL work, much recent work uses neural networks to model sentence pairs for AS, PI and TE.

For AS, Yu et al. (2014) present a bigram CNN to model question and answer candidates. Yang et al. (2015) extend this method and get state-of-the-art performance on the WikiQA dataset (Section 5.1). Feng et al. (2015) test various setups of a bi-CNN architecture on an insurance domain QA dataset. Tan et al. (2016) explore bidirectional LSTMs on the same dataset. Our approach is different because we do not model the sentences by two independent neural networks in parallel, but instead as an interdependent sentence pair, using attention.

For PI, Blacoe and Lapata (2012) form sentence representations by summing up word embeddings. Socher et al. (2011) use recursive autoencoders (RAEs) to model representations of local phrases in sentences, then pool similarity values of phrases from the two sentences as features for binary classification. Yin and Schutze (2015a) similarly replace an RAE with a CNN. In all three papers, the representation of one sentence is not influenced by the other – in contrast to our attention-based model.

For TE, Bowman et al. (2015b) use recursive neural networks to encode entailment on SICK (Marelli et al., 2014b). Rocktaschel et al. (2016) present an attention-based LSTM for the Stanford natural language inference corpus (Bowman et al., 2015a). Our system is the first CNN-based work on TE.

Some prior work aims to solve a general sentence matching problem. Hu et al. (2014) present two CNN architectures, ARC-I and ARC-II, for sentence matching. ARC-I focuses on sentence representation learning while ARC-II focuses on matching features on phrase level. Both systems were tested on PI, sentence completion (SC) and tweet-response matching. Yin and Schutze (2015b) propose the MultiGranCNN architecture to model general sentence matching based on phrase matching on multiple levels of granularity and get promising results for PI and SC. Wan et al. (2015) try to match two sentences in AS and SC by multiple sentence representations, each coming from the local representations of two LSTMs. Our work is the first one to investigate attention for the general sentence matching task.

Attention-Based DL in Non-NLP Domains.
Even though there is little if any work on attention mechanisms in CNNs for NLP, attention-based CNNs have been used in computer vision for visual question answering (Chen et al., 2015), image classification (Xiao et al., 2015), caption generation (Xu et al., 2015), image segmentation (Hong et al., 2016) and object localization (Cao et al., 2015).

Mnih et al. (2014) apply attention in recurrent neural networks (RNNs) to extract “information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution”. Gregor et al. (2015) combine a spatial attention mechanism with RNNs for image generation. Ba et al. (2015) investigate attention-based RNNs for recognizing multiple objects in images. Chorowski et al. (2014) and Chorowski et al. (2015) use attention in RNNs for speech recognition.

Attention-Based DL in NLP.
Attention-based DL systems have been applied to NLP after their success in computer vision and speech recognition. They mainly rely on RNNs and end-to-end encoder-decoders for tasks such as machine translation (Bahdanau et al., 2015; Luong et al., 2015) and text reconstruction (Li et al., 2015; Rush et al., 2015). Our work takes the lead in exploring attention mechanisms in CNNs for NLP tasks.

3 BCNN: Basic Bi-CNN

We now introduce our basic (non-attention) CNN that is based on the Siamese architecture (Bromley et al., 1993), i.e., it consists of two weight-sharing CNNs, each processing one of the two sentences, and a final layer that solves the sentence pair task. See Figure 2. We refer to this architecture as the BCNN. The next section will then introduce the ABCNN, an attention architecture that extends the BCNN. Table 1 gives our notational conventions.

In our implementation and also in the mathematical formalization of the model given below, we pad the two sentences to have the same length $s = \max(s_0, s_1)$. However, in the figures we show different lengths because this gives a better intuition of how the model works.

We now describe the BCNN’s four types of layers: input, convolution, average pooling and output.
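To make these layers and the weight sharing concrete, here is a minimal NumPy sketch in the spirit of the BCNN, not the released implementation: the wide convolution with a tanh nonlinearity, averaging over the full width, and a single linear output layer are simplifying assumptions made for illustration, and the names `conv1d_wide`, `avg_pool_all` and `bcnn_forward` are hypothetical.

```python
import numpy as np

def conv1d_wide(X, W, b):
    """Wide 1-D convolution over a (d0, s) feature map X with filter width w.
    W has shape (d1, d0 * w); zero-padding w-1 columns on each side yields
    an output of width s + w - 1. tanh is one common choice of nonlinearity."""
    d0, s = X.shape
    w = W.shape[1] // d0
    Xp = np.pad(X, ((0, 0), (w - 1, w - 1)))            # zero-pad for the wide convolution
    cols = np.stack([Xp[:, i:i + w].reshape(-1)         # unroll each window into one column
                     for i in range(s + w - 1)], axis=1)
    return np.tanh(W @ cols + b[:, None])               # shape (d1, s + w - 1)

def avg_pool_all(F):
    """Average pooling over the whole width: one sentence vector per feature map."""
    return F.mean(axis=1)

def bcnn_forward(X0, X1, W, b, V):
    """Siamese forward pass: the SAME weights (W, b) process both sentences;
    the two pooled representations feed a task-specific linear output layer V."""
    r0 = avg_pool_all(conv1d_wide(X0, W, b))
    r1 = avg_pool_all(conv1d_wide(X1, W, b))
    return V @ np.concatenate([r0, r1])                 # scores / logits for the pair task
```

The essential point of the sketch is the weight sharing: both branches apply the identical `(W, b)`, so the two sentence representations live in the same space before the output layer combines them.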
[Figure 2: architecture of the BCNN]
Input layer.
In the example in the figure, the two input sentences have 5 and 7 words, respectively. Each word is represented as a $d_0$-dimensional precomputed word2vec (Mikolov et al., 2013) embedding, $d_0 = 300$. As a result, each sentence is represented as a feature map of dimension $d_0 \times s$.
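As a small illustration of this input layer, the sketch below builds the $d_0 \times s$ feature map from precomputed word2vec vectors and right-pads each sentence of a pair to the shared length $s = \max(s_0, s_1)$; the embedding lookup `emb` and the zero-vector fallback for out-of-vocabulary words are assumptions made for the example, not details taken from the paper.

```python
import numpy as np

def sentence_feature_map(tokens, emb, d0=300, s=None):
    """Stack one d0-dimensional word2vec vector per token into a (d0, len) map,
    then zero-pad on the right up to the shared pair length s.
    `emb` is a hypothetical dict: token -> np.ndarray of shape (d0,)."""
    vecs = [emb.get(t, np.zeros(d0)) for t in tokens]   # unknown words -> zero vector (an assumption)
    X = np.stack(vecs, axis=1)                          # shape (d0, number of tokens)
    if s is not None and s > X.shape[1]:
        X = np.pad(X, ((0, 0), (0, s - X.shape[1])))    # right-pad so both sentences have width s
    return X

# Usage for one sentence pair, padded to s = max(s0, s1) as described above:
# s = max(len(tokens0), len(tokens1))
# X0 = sentence_feature_map(tokens0, emb, s=s)
# X1 = sentence_feature_map(tokens1, emb, s=s)
```

The resulting `X0` and `X1` are exactly the kind of feature maps the Siamese sketch above consumes.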

