Attention Mechanism
To do
- Fix the broken images
Classic Seq2Seq
On the left of the figure is the input: the sentence is fed into the encoder one token at a time.
The encoder processes the whole sequence and the relationships within it into a single context vector, which it hands to the decoder.
The decoder then decodes token by token to produce the complete translated sentence (see the sketch below).
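A minimal sketch of this data flow, assuming plain NumPy, a vanilla tanh-RNN encoder, and toy dimensions (none of which come from the text above): only the encoder's final hidden state survives as the context the decoder gets to see.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, emb_dim, hid_dim = 5, 8, 16          # toy sizes, chosen arbitrarily
inputs = rng.normal(size=(seq_len, emb_dim))  # embedded source tokens

# Encoder parameters (randomly initialised for illustration only)
W_xh = rng.normal(size=(emb_dim, hid_dim)) * 0.1
W_hh = rng.normal(size=(hid_dim, hid_dim)) * 0.1

h = np.zeros(hid_dim)
for x_t in inputs:                            # feed tokens one by one
    h = np.tanh(x_t @ W_xh + h @ W_hh)        # update the hidden state

context = h                                   # only the last hidden state survives
print(context.shape)                          # (16,) -> everything the decoder sees
```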
Attention
An attention model differs from a classic sequence-to-sequence model in two main ways:
- First, the encoder passes a lot more data to the decoder. Instead of passing the last hidden state of the encoding stage, the encoder passes all the hidden states to the decoder:
The attention mechanism keeps all of the hidden states the encoder produces (one per time step) and passes them to the decoder in one go.
- Second, an attention decoder does an extra step before producing its output. In order to focus on the parts of the input that are relevant to this decoding time step, the decoder does the following:
  - Look at the set of encoder hidden states it received – each encoder hidden state is most associated with a certain word in the input sentence
  - Give each hidden state a score (let’s ignore how the scoring is done for now)
  - Multiply each hidden state by its softmaxed score, thus amplifying hidden states with high scores, and drowning out hidden states with low scores
The decoder scores each hidden-state vector it received from the encoder, producing a score vector;
applying softmax to the score vector gives a weight vector;
multiplying the hidden-state vectors by their weights gives the attention (weighted) vectors;
finally, summing these attention vectors completes the step (see the sketch below).
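A sketch of this per-step attention computation, assuming simple dot-product scoring (the quoted text deliberately leaves the scoring function open) and plain NumPy with toy, randomly initialised vectors:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(decoder_hidden, encoder_hiddens):
    """decoder_hidden: (hid,), encoder_hiddens: (src_len, hid)."""
    scores = encoder_hiddens @ decoder_hidden      # one score per source position
    weights = softmax(scores)                      # softmaxed scores
    weighted = weights[:, None] * encoder_hiddens  # amplify / drown out states
    return weighted.sum(axis=0), weights           # context vector + weights

rng = np.random.default_rng(1)
enc_hs = rng.normal(size=(5, 16))                  # all encoder hidden states
dec_h = rng.normal(size=16)                        # current decoder hidden state
context, weights = attention_context(dec_h, enc_hs)
print(weights.round(2), context.shape)             # weights sum to 1, context is (16,)
```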
- Note: once the encoder's hidden states have been passed to the decoder, the decoder uses them to compute attention at every decoding step, as sketched below.
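A sketch of that note, under the same dot-product-scoring and NumPy assumptions as above: the decoder reruns the score, softmax, and weighted-sum computation at every output step, so each step gets its own context vector. The toy decoder update and the fixed output length are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(2)
hid_dim, out_steps = 16, 3
enc_hs = rng.normal(size=(5, hid_dim))             # all encoder hidden states
W_hh = rng.normal(size=(hid_dim, hid_dim)) * 0.1   # toy decoder weights
W_ch = rng.normal(size=(hid_dim, hid_dim)) * 0.1

dec_h = enc_hs[-1]                                 # initialise from the encoder
for t in range(out_steps):                         # fixed length for illustration
    weights = softmax(enc_hs @ dec_h)              # fresh attention at every step
    context = (weights[:, None] * enc_hs).sum(axis=0)
    dec_h = np.tanh(dec_h @ W_hh + context @ W_ch) # fold context into the state
    print(f"step {t}: attends mostly to source position {weights.argmax()}")
```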
As the figure above shows, at output time the mechanism places most of its weight on the highest-scoring vectors, which is why it is called the "attention" mechanism.