Incorporating Linguistic Knowledge into Neural Machine Translation

Published: 2021-09-22 03:13:04

Blog: http://blog.csdn.net/wangxinginnlp/article/details/56488921




I am preparing a group talk on incorporating linguistic knowledge into neural machine translation (syntax + NMT), so let me first list the relevant papers:


**The notes below are my personal opinions.




1. Linguistic Input Features Improve Neural Machine Translation (WMT 2016)


http://www.statmt.org/wmt16/pdf/W16-2209.pdf


*Proposes enriching the encoder input with various linguistic features


*Designs a (fairly simple) scheme for feeding linguistic features into the encoder




2. Tree-to-Sequence Attentional Neural Machine Translation (ACL 2016)


http://www.aclweb.org/anthology/P/P16/P16-1078.pdf


*Designs a tree-based encoder that explicitly represents the internal nodes of the tree; however, the tree-based encoder seems to use only the binary composition information, not the node category labels (NP, VP, and so on)


*The attention mechanism attends separately over the outputs of the tree-based encoder and the sequential encoder; the two context vectors are then simply summed to form the final context vector used by the decoder
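A rough numpy sketch of this two-encoder attention (the shapes and the dot-product scorer are my own toy choices, not the paper's exact model):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def context(decoder_state, encoder_states):
    # dot-product attention over one encoder's outputs
    scores = encoder_states @ decoder_state
    weights = softmax(scores)
    return weights @ encoder_states

# toy dimensions: 4 sequential states, 3 tree (phrase) states, hidden size 5
rng = np.random.default_rng(0)
seq_states  = rng.normal(size=(4, 5))   # sequential encoder outputs
tree_states = rng.normal(size=(3, 5))   # tree-based encoder phrase vectors
dec_state   = rng.normal(size=5)

# attend to each encoder separately, then sum the two context vectors
c_seq   = context(dec_state, seq_states)
c_tree  = context(dec_state, tree_states)
c_final = c_seq + c_tree                # fed to the decoder at this step
```

The sum is the simplest possible combination; the point is that each encoder gets its own attention distribution before the merge.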




3. Multi-task Sequence to Sequence Learning (ICLR 2016)


https://arxiv.org/pdf/1511.06114.pdf


*In the one-to-many setting, a single encoder is shared by multiple decoders that perform translation, parsing, and auto-encoding respectively


*Source-side syntax is exploited cleverly: using it as supervision for the parsing task sends a signal back into the shared encoder, helping it build a better source sentence representation and thus a better translation, and vice versa
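A toy sketch of the one-to-many sharing (the linear "encoder" and "decoders" below are invented stand-ins, just to show how several task heads read the same shared parameters):

```python
import numpy as np

rng = np.random.default_rng(2)
W_enc = rng.normal(size=(4, 4))              # shared encoder weights

# one task-specific "decoder" weight matrix per task
decoders = {task: rng.normal(size=(4, 4))
            for task in ("translate", "parse", "autoencode")}

x = rng.normal(size=4)                       # a toy source representation
h = np.tanh(W_enc @ x)                       # shared encoding

outputs = {task: W_dec @ h for task, W_dec in decoders.items()}
# every task reads the same h, so each task's loss gradient updates W_enc
```

This is the whole trick: the parsing loss never touches the translation decoder, but it does reshape the shared encoder.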




4. Factored Neural Machine Translation


https://arxiv.org/pdf/1609.04621.pdf


*Mainly targets the limited-vocabulary problem, on the target side


*Uses a morphological analyser to obtain a factored representation of each word (lemma, part-of-speech tag, tense, person, gender, and number)


*The decoder outputs two sequences, a lemma sequence and a factor sequence; a scheme is also proposed to constrain the two sequences to have the same length


*Finally, the original words are recovered from the lemmas plus the factors
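The final lemma-plus-factors → word step can be pictured as a lookup; the factor inventory and the table below are invented for illustration (the paper uses a morphological generator, not a hand-written dictionary):

```python
# toy realization table: (lemma, factors) -> surface word
table = {
    ("go",  ("VERB", "past")):   "went",
    ("cat", ("NOUN", "plural")): "cats",
}

def realize(lemma, factors, table):
    # fall back to the bare lemma when no inflected form is known
    return table.get((lemma, factors), lemma)

# the decoder's two aligned output sequences
lemmas  = ["go", "cat"]
factors = [("VERB", "past"), ("NOUN", "plural")]

words = [realize(l, f, table) for l, f in zip(lemmas, factors)]
# words == ["went", "cats"]
```

The `zip` is why the two sequences must have equal length: each lemma needs exactly one factor tuple.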




5. Learning to Parse and Translate Improves Neural Machine Translation


https://arxiv.org/pdf/1702.03525.pdf


*Builds on recurrent neural network grammars (see reference [1]: http://www.aclweb.org/anthology/N/N16/N16-1024.pdf)


*Uses target-side parse information to help encode the source sentence better


*In reference [1], the output of a recurrent neural network grammar is an action sequence; at each step the decoder uses the hidden states of the Stack, Buffer, and Action components to predict the next action. This work replaces the Buffer hidden state with the NMT decoder hidden state, and a word is output only after a SHIFT action is output. The new decoder output is therefore a mixed sequence of actions and target words.


*Similarly, the action sequence here exploits target-side syntax information to help build a better source sentence representation


*Problem: at test time, the decoded action sequence may be incomplete
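The mixed action/word output stream can be illustrated like this (action names follow the usual RNNG-style transition system; the paper's exact inventory may differ):

```python
# A mixed sequence of parser actions and target words: a word token
# appears only immediately after a SHIFT action.
mixed = ["NT(S)", "NT(NP)", "SHIFT", "the", "SHIFT", "cat",
         "REDUCE", "NT(VP)", "SHIFT", "sleeps", "REDUCE", "REDUCE"]

def split_stream(seq):
    """Separate the mixed stream back into words and actions."""
    words, actions, expect_word = [], [], False
    for tok in seq:
        if expect_word:           # previous token was SHIFT
            words.append(tok)
            expect_word = False
        else:
            actions.append(tok)
            expect_word = (tok == "SHIFT")
    return words, actions

words, actions = split_stream(mixed)
# words == ["the", "cat", "sleeps"]
```

The incompleteness problem above is visible here: if decoding stops before the final REDUCE actions, the action subsequence no longer forms a well-formed tree even though the word subsequence looks fine.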






6. Syntax-aware Neural Machine Translation Using CCG


https://arxiv.org/pdf/1702.01147.pdf


*Syntactic information is used on both the source side and the target side


*Source-side syntactic information is used via the method proposed in paper [1]; the focus is on how target-side syntactic information is used


*Three methods:
    1> Serializing: CCG supertags and words are decoded as a single interleaved sequence
    2> Multitasking (1), shared encoder: one encoder and two decoders, one decoding CCG supertags and the other decoding words
    3> Multitasking (2), distinct softmax: one encoder and one decoder, with words and CCG supertags decoded separately from the same decoder hidden state at each time step


The authors honestly point out: "The serializing approach increases the length of the target sequence which might lead to loss of information learned at lexical level. For the multitasking (1) approach there is no explicit way to constrain the number of predicted words and tags to match. The multitasking (2) approach does not condition the prediction of target words on the syntactic context."
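The serializing variant (method 1) amounts to simple interleaving; a sketch with illustrative supertags:

```python
# Interleave each word's CCG supertag with the word itself, so a single
# decoder emits both in one target sequence.
words = ["Peter", "sleeps"]
tags  = ["NP", r"S\NP"]

serialized = [tok for pair in zip(tags, words) for tok in pair]
# serialized == ["NP", "Peter", "S\\NP", "sleeps"]

# the plain translation is recovered by taking every second token
recovered = serialized[1::2]
```

The length increase the authors criticize is plain here: the target sequence doubles, since every word now costs two decoding steps.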




7. Neural Machine Translation with Source-Side Latent Graph Parsing


https://arxiv.org/pdf/1702.02265.pdf


*The writing is unclear; I did not fully understand some parts


*Builds a multi-layer encoder. The first-layer hidden states are used to predict POS tags, and the second-layer hidden states are used for dependency parsing. In the dependency-parsing layer, each hidden state interacts with every other hidden state to compute the strength of the relation between them; these strengths are then used as weights in a weighted sum over the other hidden states, forming a new hidden state (which captures long-distance relations, since it has interacted directly with distant second-layer hidden states). During decoding, the decoder also computes a context vector over this new hidden-state sequence (a vector taken to encode long-distance relations within the source sentence), and this extra context vector also participates in target word prediction.


*The highlight is that relations are computed between the hidden states of the source-side dependency-parsing layer, helping capture long-distance dependencies
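The relation-weighted mixing over second-layer states is essentially a self-attention-style weighted sum; a toy numpy sketch (the dot-product scoring and the shapes are my own simplification of the paper's parsing layer):

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
H = rng.normal(size=(6, 4))   # second-layer hidden states, sentence length 6

# pairwise "relation strength" between positions (the latent graph)
scores = H @ H.T              # (6, 6)
A = softmax_rows(scores)      # each row: a soft head/dependency distribution

# new states: each position mixes in every other position directly,
# so long-distance pairs interact regardless of their surface distance
H_new = A @ H                 # (6, 4)
```

Position 0 and position 5 contribute to each other's new state in one step here, which is exactly the long-distance shortcut the paper is after.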






Related papers (incorporating tree information into the encoder-decoder architecture):


101. Language to Logical Form with Neural Attention (ACL 2016)


http://www.aclweb.org/anthology/P16-1004


*The task, semantic parsing, maps natural language input utterances to logical forms; the authors tackle it with the popular encoder-decoder architecture.


*The authors propose a hierarchical tree decoder, which they argue explicitly captures the compositional structure of logical forms, i.e. it performs a sequence-to-tree conversion. The idea is quite intuitive.


*The hierarchical tree decoder generates the target tree breadth-first, and a parent-feeding connection is proposed for generating the subtrees of the next level (a next-level subtree is generated only after the previous level is complete, so temporally it is most related to the last terminal node of the previous level, but structurally it is more related to its corresponding nonterminal node in the previous level; see the explanation of Figure 4).
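The breadth-first generation order can be illustrated with a toy grammar (the grammar and labels are invented; the real decoder predicts each node with RNNs rather than looking it up):

```python
from collections import deque

# toy expansions: each nonterminal's children are fixed here, whereas
# the hierarchical tree decoder would predict them step by step
children = {"S": ["NP", "VP"], "NP": ["the", "cat"], "VP": ["sleeps"]}

def generate_bfs(root):
    """Visit the tree level by level, as the tree decoder generates it."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children.get(node, []):
            queue.append(child)   # expanded only after this level is done
    return order

order = generate_bfs("S")
# order == ["S", "NP", "VP", "the", "cat", "sleeps"]
```

Note how "the" is emitted right after "VP" in time, yet its structural parent is "NP": this gap between temporal and structural adjacency is what the parent-feeding connection bridges.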




102. Tree-structured Decoding with Doubly-Recurrent Neural Networks (ICLR 2017)


https://openreview.net/pdf?id=HkYhZDqxg


*Like paper 101, the target tree is generated breadth-first (see Figure 1).


*Differences from paper 101: 1) node generation: two RNNs separately track the parent node and the previous sibling, giving the so-called ancestral and fraternal hidden states; 2) tree-topology prediction: after each node is generated, two predictions are made, (1) whether it has children and (2) whether it has a following sibling, which together determine the tree structure. Paper 101 instead inserts nonterminals at the corresponding positions within each level and a termination symbol after each subtree.


*Why is the tree structure not generated as in paper 101? The authors justify their choice in terms of training cost and probability estimation: "These ideas has been adopted by most tree decoders (Dong & Lapata, 2016). There are two important downsides of using a padding strategy for topology prediction in trees. First, the size of the tree can grow considerably. While in the sequence framework only one stopping token is needed, a tree with n nodes might need up to O(n) padding nodes to be added. This can have important effects in training speed. The second reason is that a single stopping token selected competitively with other tokens requires one to continually update the associated parameters in response to any changes in the distribution over ordinary tokens so as to maintain topological control."




103. Does String-Based Neural MT Learn Source Syntax? (EMNLP 2016)


http://www.aclweb.org/anthology/D/D16/D16-1159.pdf


*Very interesting: the authors probe hidden states from different encoder layers to predict different levels of syntactic information, and reach some interesting conclusions: "We find that both local and global syntactic information about source sentences is captured by the encoder. Different types of syntax is stored in different layers, with different concentration degrees." In short, what the encoder learns contains (implicit) syntactic information to some degree, and the encoder's different levels of sentence abstraction correspond more or less to levels of syntactic information.




104. When Are Tree Structures Necessary for Deep Learning of Representations? (EMNLP 2015)


http://www.aclweb.org/anthology/D/D15/


*Like paper 103, very interesting to read.




105. Tree Memory Networks for Modelling Long-term Temporal Dependencies


https://arxiv.org/pdf/1703.04706.pdf


The title looks interesting; I am saving it to read another day.




Summary:


1) Enriching the information carried by the input units


2) Mutual reinforcement across tasks: multi-task learning


3) Structural changes: sequence-to-tree or tree-to-sequence


4) Slides: http://hlt.suda.edu.cn/~xwang/slides/syntax_NMT.pdf
