To do

  • Add a decoder inference module
  • Preface

The architecture splits into four main blocks

  • encoder
    • encoder-layer
  • decoder
    • decoder-layer

Key detail: three kinds of mask (their shapes are previewed right after this list)

  • encoder-mask
  • decoder-mask
  • cross-mask
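
As a rough preview (a sketch with toy sizes; note it builds the masks as boolean products, while the post itself builds them with matmul further down):

import torch

src_ids = torch.tensor([[3, 1, 4, 0, 0], [2, 2, 5, 1, 3]])  # (bs, src_len), 0 = pad
tgt_ids = torch.tensor([[7, 2, 0, 0], [4, 4, 6, 1]])        # (bs, tgt_len)
tgt_len = tgt_ids.shape[1]

# encoder-mask: which src positions may attend to which -> (bs, src_len, src_len)
enc_mask = (src_ids != 0).unsqueeze(1) & (src_ids != 0).unsqueeze(2)
# decoder-mask: causal lower-triangular mask over the target -> (tgt_len, tgt_len)
dec_mask = torch.tril(torch.ones(tgt_len, tgt_len, dtype=torch.bool))
# cross-mask: target queries against source keys -> (bs, tgt_len, src_len)
cro_mask = (tgt_ids != 0).unsqueeze(2) & (src_ids != 0).unsqueeze(1)

print(enc_mask.shape, dec_mask.shape, cro_mask.shape)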

Embedding

A sentence is represented as [token_1, token_2, … token_s]

sentence 1 = [token_1, token_2, … token_x]

sentence 2 = [token_1, token_2, … token_y], and x is not necessarily equal to y

Constructing the tokens

import torch.nn.functional as F
import torch

src_vocab_size = 16
tgt_vocab_size = 16
batch_size = 4
max_len = 6

src_len = torch.randint(2,7, (batch_size,))
tgt_len = torch.randint(2,7, (batch_size,))
'''
Sample output (randint(2, 7) gives lengths between 2 and 6):
tensor([6, 6, 6, 2])  the first sentence of this batch has length 6, the last only 2
tensor([6, 3, 4, 2])
'''

# Next, pad the variable-length sentences up to max_len
'''
randint generates a tensor of shape (i,)
pad appends between 0 and max_len - i zeros on the right
unsqueeze adds a leading batch dimension of 1
'''

src_seq = [F.pad(torch.randint(1, src_vocab_size, (i,)), (0, max_len-i)).unsqueeze(0) for i in src_len]
tgt_seq = [F.pad(torch.randint(1, tgt_vocab_size, (i,)), (0, max_len-i)).unsqueeze(0) for i in tgt_len]

# Stack the whole batch into a single tensor
src_seq = torch.cat(src_seq)
tgt_seq = torch.cat(tgt_seq)

'''
For example:
tensor([[12, 15, 10, 5, 3, 14],
[ 5, 7, 9, 3, 12, 1],
[ 3, 1, 1, 9, 3, 4],
[ 9, 6, 0, 0, 0, 0]])

tensor([[11, 12, 11, 3, 5, 15],
[ 7, 9, 11, 0, 0, 0],
[12, 6, 13, 11, 0, 0],
[13, 3, 0, 0, 0, 0]])
'''

Mapping into the embedding space

import torch.nn as nn
d_model = 8

src_embedding = nn.Embedding(src_vocab_size+1, d_model)
tgt_embedding = nn.Embedding(tgt_vocab_size+1, d_model)

src_embedding.weight # shape (17, 8)
'''
Row 0 is the embedding reserved for pad:
Parameter containing:
tensor([[ 2.4606, 1.7139, -0.2859, -0.5058, 0.6229, -0.0470, 2.1517, 0.2996],
[ 0.0077, -0.4292, -0.2397, 1.2366, -0.3061, 0.9196, -1.4222, -1.6431],
[-0.6378, -0.7809, -0.4206, 0.5759, -1.4899, 1.2241, 0.9220, -0.6333],
[ 0.0303, -1.4113, 0.9164, -0.1200, 1.7224, -0.4996, -1.6708, -1.8563],
[ 0.0235, 0.0155, -0.1292, -0.9274, -1.1351, -0.9155, 0.4391, -0.0437],
[ 0.8498, 0.4709, -0.9168, -2.1307, 0.1840, 0.3554, -0.3986, 1.2806],
[ 0.7256, 1.2303, -0.8280, -0.2173, 0.8939, 2.4122, 0.4820, -1.9615],
[-0.8607, 2.4886, -0.8877, -0.8852, 0.3905, 0.9511, -0.3732, 0.4872],
[ 0.4882, -0.4518, -0.1945, 0.2857, -0.6832, -0.4870, -1.7165, -2.0987],
[-0.0512, 0.2692, -1.0003, 0.7896, 0.5004, 0.3594, -1.5923, -1.5618],
[ 0.4012, 0.1614, 1.8939, 0.3862, -0.6733, -1.2442, -0.6540, -1.6772],
[ 1.4784, 2.7430, 0.0159, 0.5944, -1.0025, 1.0843, 0.4580, -0.6515],
[ 0.3905, 0.6118, -0.1256, -0.6725, 1.2366, 0.8272, 0.0838, -1.5124],
[-0.1470, 0.2149, -1.4561, 1.8008, 0.7764, -0.8517, -0.3204, -0.2550],
[-1.1534, -0.6837, -1.7165, -1.7905, -1.5423, 1.8812, -0.1794, -0.2357],
[ 1.3046, 1.5021, 1.4846, 1.0622, 1.4066, 0.7299, 0.7929, -1.0107],
[-0.3920, 0.7482, 1.5976, 1.7429, -0.4683, 0.2286, 0.1320, -0.5826]],
requires_grad=True)
'''
src_embedding(src_seq[0]) # looks up the rows for [12, 15, 10, 5, 3, 14]

Key_mask

Because pad positions in an encoder layer still receive some share of attention after the softmax, and pads are not something we want the model to learn from the data, we introduce a Key_mask here to help the encoder focus on the positions that actually matter. It is a particularly important detail; the tiny demo below shows the problem.
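
A two-line illustration of the problem: with toy attention scores where the last two positions are pads scoring 0, those pads still grab a noticeable share of the weight.

import torch

torch.tensor([2.0, 1.0, 0.0, 0.0]).softmax(-1)
# tensor([0.6103, 0.2245, 0.0826, 0.0826])  (approximately)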

  • Setup
max_len = 6
embed_dim = 8
vocab_max = 5
# 6 sentences with random lengths between 1 and max_len, token ids in 1..vocab_max
token_ids = [torch.randint(1, vocab_max + 1, (n,)) for n in torch.randint(1, max_len + 1, (6,))]
token_ids
'''
[tensor([2, 3, 5, 2, 4]),
 tensor([3, 3, 4, 4, 5, 4]),
 tensor([4, 5, 3]),
 tensor([5, 1, 4]),
 tensor([2, 1, 5, 3]),
 tensor([1, 3, 3, 1])]
'''
  • pad
token_pad_ids = [F.pad(x, (0, max_len-x.shape[0])).unsqueeze(0) for x in token_ids]
token_pad_ids = torch.cat(token_pad_ids)
token_pad_ids
'''
tensor([[2, 3, 5, 2, 4, 0],
[3, 3, 4, 4, 5, 4],
[4, 5, 3, 0, 0, 0],
[5, 1, 4, 0, 0, 0],
[2, 1, 5, 3, 0, 0],
[1, 3, 3, 1, 0, 0]])
'''
  • Look up the embeddings
src_embedding = nn.Embedding(vocab_max+1, embed_dim)
tgt_embedding = nn.Embedding(vocab_max+1, embed_dim)
src_embedding.weight, tgt_embedding.weight
'''
(Parameter containing:
tensor([[-1.4019, -0.3245, 0.8569, -1.6555, 1.3478, 0.0979, -1.7458, 1.3138],
[-0.9099, -0.6957, 0.4430, 0.6305, 0.1099, 0.3213, 0.0841, 0.0786],
[-0.1215, -1.4141, 0.8802, -0.3444, 0.3444, -1.4063, -0.5057, 0.1506],
[ 0.9491, 1.7888, 0.3075, -0.6642, 0.3368, 0.3388, -1.2543, -0.8096],
[ 0.7723, -1.2258, -0.4963, 1.4007, -0.8048, -0.1338, 0.0199, 0.4295],
[ 1.3789, -0.9537, 0.3421, 0.0658, -0.7578, -0.7217, -1.3124, 1.6017]],
requires_grad=True),
Parameter containing:
tensor([[ 2.0609, 0.7302, 0.9811, 0.7390, 0.7475, 0.2903, 0.0735, 0.3407],
[ 1.5477, -0.5033, 1.3758, -1.5225, 0.8236, 0.6329, -0.2301, 1.2352],
[-0.2906, -1.8842, -0.9998, 1.6752, 0.7286, -0.4089, -0.0515, 0.5763],
[ 0.2128, 0.7354, -0.4248, 0.7142, 0.4635, 1.1675, 0.7193, 1.3474],
[ 0.3543, 1.2881, -0.8270, 0.6220, -1.6282, 0.1802, -0.9306, -0.2407],
[-1.3339, -0.4192, -0.0800, 0.1614, 0.7026, -0.6851, 0.2386, -0.4954]],
requires_grad=True))
'''
  • Check the True/False pattern; here we take the 4th sentence, token_pad_ids[3], because it contains several zeros (row 0 of the embedding is the pad word vector)
pad = src_embedding.weight[0]
src_embedding(token_pad_ids[3]) == pad, token_pad_ids[3]
'''
(tensor([[False, False, False, False, False, False, False, False],
[False, False, False, False, False, False, False, False],
[False, False, False, False, False, False, False, False],
[ True, True, True, True, True, True, True, True],
[ True, True, True, True, True, True, True, True],
[ True, True, True, True, True, True, True, True]]),
tensor([5, 1, 4, 0, 0, 0]))
'''
  • Since the encoder's Q, K and V all come from src, Q@K.T has shape (bs, src_len, src_len), so the mask can be written down directly
a = token_pad_ids[3].unsqueeze(-1)
b = token_pad_ids[3].unsqueeze(0)
torch.matmul(a,b), a.shape, b.shape
'''
(tensor([[25, 5, 20, 0, 0, 0],
[ 5, 1, 4, 0, 0, 0],
[20, 4, 16, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0]]),
torch.Size([6, 1]),
torch.Size([1, 6]))
'''
  • Build the mask for the whole batch
mask = torch.matmul(token_pad_ids.unsqueeze(-1),token_pad_ids.unsqueeze(1)) ==0
mask
'''
Showing only the first three:
tensor([[[False, False, False, False, False, True],
[False, False, False, False, False, True],
[False, False, False, False, False, True],
[False, False, False, False, False, True],
[False, False, False, False, False, True],
[ True, True, True, True, True, True]],

[[False, False, False, False, False, False],
[False, False, False, False, False, False],
[False, False, False, False, False, False],
[False, False, False, False, False, False],
[False, False, False, False, False, False],
[False, False, False, False, False, False]],

[[False, False, False, True, True, True],
[False, False, False, True, True, True],
[False, False, False, True, True, True],
[ True, True, True, True, True, True],
[ True, True, True, True, True, True],
[ True, True, True, True, True, True]], ...
'''
  • Now apply the mask to real scores
scores = torch.randn(6, 6, 6)
mask = torch.matmul(token_pad_ids.unsqueeze(-1),token_pad_ids.unsqueeze(1))
scores= scores.masked_fill(mask==0, -1e9)
scores.softmax(-1)
'''
Again, only the first three:
tensor([[[0.0356, 0.0985, 0.6987, 0.0902, 0.0770, 0.0000],
[0.4661, 0.0397, 0.3546, 0.0931, 0.0464, 0.0000],
[0.1917, 0.0149, 0.1564, 0.4113, 0.2259, 0.0000],
[0.4269, 0.0352, 0.1605, 0.1334, 0.2441, 0.0000],
[0.0515, 0.4421, 0.0705, 0.2934, 0.1426, 0.0000],
[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667]],

[[0.0803, 0.0330, 0.3310, 0.0243, 0.3612, 0.1701],
[0.2160, 0.1483, 0.0312, 0.1804, 0.3861, 0.0380],
[0.2151, 0.0807, 0.1072, 0.4335, 0.1200, 0.0435],
[0.0285, 0.2684, 0.1558, 0.2210, 0.1880, 0.1383],
[0.0889, 0.4485, 0.1067, 0.1028, 0.1901, 0.0630],
[0.2885, 0.1682, 0.0935, 0.0179, 0.0289, 0.4031]],

[[0.2862, 0.3934, 0.3204, 0.0000, 0.0000, 0.0000],
[0.2426, 0.2206, 0.5369, 0.0000, 0.0000, 0.0000],
[0.1487, 0.2483, 0.6030, 0.0000, 0.0000, 0.0000],
[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667]], ...
'''

Notice also that rows made up entirely of pad positions end up with uniform attention weights (1/6 everywhere), which is no better than random guessing and carries no information.

Position Embedding

Just write it down from the formula in the paper (shown below).
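
For reference, the sinusoidal encoding from the original paper:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right),\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$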

pe = torch.zeros(max_len, d_model)
pos = torch.arange(0, max_len).unsqueeze(1)  # shape (max_len, 1)
idx = torch.pow(10000, torch.arange(0, d_model, 2).unsqueeze(0) / d_model)  # shape (1, 4)
pe[:, 0::2] = torch.sin(pos / idx)  # broadcasting gives (max_len, 4)
pe[:, 1::2] = torch.cos(pos / idx)  # odd columns use cos; the dump below shows sin-only for illustration

'''
For illustration, pe with only the even (sin) columns filled:
tensor([[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[ 0.8415, 0.0000, 0.0998, 0.0000, 0.0100, 0.0000, 0.0010, 0.0000],
[ 0.9093, 0.0000, 0.1987, 0.0000, 0.0200, 0.0000, 0.0020, 0.0000],
[ 0.1411, 0.0000, 0.2955, 0.0000, 0.0300, 0.0000, 0.0030, 0.0000],
[-0.7568, 0.0000, 0.3894, 0.0000, 0.0400, 0.0000, 0.0040, 0.0000],
[-0.9589, 0.0000, 0.4794, 0.0000, 0.0500, 0.0000, 0.0050, 0.0000]])
'''

emb+pe

import torch.nn as nn
import torch
import math

class Embeddings(nn.Module):
    def __init__(self, vocab, d_model):
        super(Embeddings, self).__init__()
        self.emb = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        # scale the embeddings by sqrt(d_model) so they are not swamped by the positional values
        return self.emb(x) * math.sqrt(self.d_model)


class PositionalEncoding(nn.Module):
    def __init__(self, max_len, d_model, dropout=0.1):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)

        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)  # all rows, even columns, step 2
        pe[:, 1::2] = torch.cos(position * div_term)

        pe = pe.unsqueeze(0)  # (max_len, d_model) -> (1, max_len, d_model), batch-first

        self.register_buffer('pe', pe)  # a buffer: saved with the model but never updated by the optimizer

    def forward(self, x):
        # x: (batch_size, seq_len, d_model); add the encodings for the first seq_len positions
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)


if __name__ == '__main__':
    x = torch.randint(1, 40, (2, 28))
    emb = Embeddings(40, 784)
    pe = PositionalEncoding(40, 784)
    print((pe(emb(x))).shape)

Multi-head Attention

Query_mask & Scaled_Attention

def ScaledAttention(query, key, value, d_k, mask=None, dropout=None):
    # scores: (n_head, batch_size, q_len, k_len) with the head-first layout used below
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    scores = scores.softmax(dim=-1)
    if dropout is not None:
        scores = dropout(scores)

    value = torch.matmul(scores, value)

    return value, scores
  • First matmul: Q@K.T changes the shape to (n_head, batch_size, tgt_len, src_len) in the head-first layout used here
    • in the middle, scores.masked_fill(mask == 0, -1e9) applies whichever mask was passed in
  • Second matmul: scores@V gives the attention-weighted values, which after merging the heads back is (batch_size, tgt_len, embed_dim)
    • so in the decoder the output keeps the target sequence length (a quick shape check follows)
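
A minimal shape check for the two matmuls above, with toy sizes and the head-first layout assumed by the code in this post:

import torch

n_head, batch_size, tgt_len, src_len, d_k = 2, 4, 5, 7, 8
Q = torch.randn(n_head, batch_size, tgt_len, d_k)
K = torch.randn(n_head, batch_size, src_len, d_k)
V = torch.randn(n_head, batch_size, src_len, d_k)

scores = torch.matmul(Q, K.transpose(-2, -1))  # (n_head, batch_size, tgt_len, src_len)
out = torch.matmul(scores.softmax(-1), V)      # (n_head, batch_size, tgt_len, d_k)
print(scores.shape, out.shape)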
import torch
import torch.nn as nn
import math
from utils import clones

class MultiheadAttention(nn.Module):
    def __init__(self, d_model, n_head, QKV_O_linear=4, drop_rate=0.1):
        super().__init__()
        assert d_model % n_head == 0

        self.d_model = d_model
        self.n_head = n_head

        self.dropout = nn.Dropout(p=drop_rate)
        self.linear = clones(nn.Linear(d_model, d_model), QKV_O_linear)

    def forward(self, Q, K, V, mask=None):
        batch_size, _, emb_dim = Q.shape  # (batch_size, seq_len, emb_dim)
        d_k = emb_dim // self.n_head

        # split heads: (batch, seq, d_model) -> (batch, seq, n_head, d_k) -> (n_head, batch, seq, d_k)
        Q_heads = self.linear[0](Q).view(batch_size, -1, self.n_head, d_k).permute(2, 0, 1, 3)
        K_heads = self.linear[1](K).view(batch_size, -1, self.n_head, d_k).permute(2, 0, 1, 3)
        V_heads = self.linear[2](V).view(batch_size, -1, self.n_head, d_k).permute(2, 0, 1, 3)

        V_att, scores_att = ScaledAttention(Q_heads, K_heads, V_heads, d_k, mask, self.dropout)
        # merge heads back: (n_head, batch, seq, d_k) -> (batch, seq, n_head * d_k)
        V_att = V_att.permute(1, 2, 0, 3).contiguous().view(batch_size, -1, self.n_head * d_k)
        V_att = self.linear[3](V_att)
        K_lin = K_heads.permute(1, 2, 0, 3).contiguous().view(batch_size, -1, self.n_head * d_k)

        return K_lin, V_att  # scores_att could also be returned for visualization

About contiguous

  • The linear list produced by clones has 4 layers.

    • zip truncates its inputs to the shortest one, so zipping the 4-layer linear list with (Q, K, V) consumes only the first three linears, which leaves the last one in the list free to serve as the output projection in the return statement.

      a = list(range(6)) # the six numbers 0 - 5
      b = list('asdfg')  # 5 letters
      list(zip(a, b))

      '''
      [(0, 'a'), (1, 's'), (2, 'd'), (3, 'f'), (4, 'g')]
      '''

  • contiguous() copies the data into newly allocated, contiguous memory; is_contiguous() tells you whether the underlying storage is laid out contiguously. Here it is used together with view:

      t = torch.arange(12).reshape(3, 4)
      '''
      tensor([[ 0,  1,  2,  3],
              [ 4,  5,  6,  7],
              [ 8,  9, 10, 11]])'''

      t.stride()
      # (4, 1)
      t2 = t.transpose(0, 1)
      '''
      tensor([[ 0,  4,  8],
              [ 1,  5,  9],
              [ 2,  6, 10],
              [ 3,  7, 11]])'''

      t2.stride()
      # (1, 4)
      t.data_ptr() == t2.data_ptr() # both share the same underlying flat 1-D storage
      # True
      t.is_contiguous(), t2.is_contiguous() # t is contiguous, t2 is not
      # (True, False)

    stride() reports, for each dimension, the distance in the flat storage between neighbouring elements along that dimension. In the original layout 0-11 are stored consecutively, so one step along dim 0 of t jumps 4 elements.

  • view(b, -1, self.h * self.d_k) requires its input to be contiguous in memory.

    • That is, the flat storage really has to be [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] laid out in that order.
    • view never moves data, it only reinterprets the existing flat storage, so a plain view on t2 cannot hand back the transposed order [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]; PyTorch raises an error instead, and .contiguous() (which makes that copy) is what fixes it (a quick check follows below).
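
A quick check of that last point, continuing with t2 from the snippet above:

t2.contiguous().view(-1)
# tensor([ 0,  4,  8,  1,  5,  9,  2,  6, 10,  3,  7, 11])  -> the copy is in transposed order

# t2.view(-1) would instead raise:
# RuntimeError: view size is not compatible with input tensor's size and stride ...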

ADD & Norm

  • Shown here as an illustration of how it can be implemented; in practice you can simply call PyTorch's built-in interface.

layer_norm normalizes at the level of a single sample. If Batch_norm on a [B, C, H, W] tensor computes the mean and variance per channel (e.g. per RGB channel) across the whole batch,

then layer_norm computes them per sample: with a batch of 3 you get separate means and variances for samples [0], [1] and [2] (in the transformer the statistics are taken over the last, feature, dimension). A quick check with the built-in nn.LayerNorm follows.
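
A small sanity check with PyTorch's built-in nn.LayerNorm (statistics over the last dimension, independently for every token), assuming the usual (batch, seq, d_model) layout:

import torch
import torch.nn as nn

x = torch.randn(3, 6, 8)  # (batch, seq_len, d_model)
ln = nn.LayerNorm(8)
y = ln(x)
# each token vector is normalized on its own: mean ~0, variance ~1 over the last dim
print(y.mean(-1)[0, 0].item(), y.var(-1, unbiased=False)[0, 0].item())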

class LayerNorm(nn.Module):
    """Construct a layernorm module"""
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        "Norm"
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2


class SublayerConnection(nn.Module):
    """Add + Norm"""
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "add & norm: normalize, apply the sublayer, then add the residual"
        return x + self.dropout(sublayer(self.norm(x)))

Encoder Layer

import torch
import torch.nn as nn
from multihead_attention import MultiheadAttention
from utils import clones

'''
An encoder layer feeds q, k, v into MultiheadAttention,
runs the resulting K and V through add & norm,
and finally hands them to the decoder.
'''
class EncoderLayer(nn.Module):
    def __init__(self, d_model, n_head, drop_rate=0.1):
        super().__init__()
        self.mul = MultiheadAttention(d_model, n_head, drop_rate=drop_rate)
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_model*4),
                                nn.ReLU(),
                                nn.Linear(d_model*4, d_model))

    def forward(self, Q, K, V, MASK):
        K, V = self.mul(Q, K, V, MASK)
        V_1 = self.norm(V + Q)
        K_1 = self.norm(K + Q)

        V_2 = self.ff(V_1)
        K_2 = self.ff(K_1)
        V_2 = self.norm(V_1 + V_2)
        K_2 = self.norm(K_1 + K_2)

        return K_2, V_2

'''
The Encoder simply stacks several layers, feeding each layer's output into the next.
'''
class Encoder(nn.Module):
    def __init__(self, model, N):
        super().__init__()
        self.layers = clones(model, N)

    def forward(self, q, k, v, mask):
        for layer in self.layers:
            k, v = layer(q, k, v, mask)  # each layer's K, V become the next layer's K, V
        return k, v


if __name__ == '__main__':
    source = torch.randn(4, 23, 784)  # (batch, src_len, d_model)
    ecd_layer = EncoderLayer(784, 7)
    ecd = Encoder(ecd_layer, 2)
    print(ecd(source, source, source, None)[0].shape)

Decoder

import torch
import torch.nn as nn
from multihead_attention import MultiheadAttention, clones

class DecoderLayer(nn.Module):
    def __init__(self, d_model, n_head, drop_rate=0.1):
        super().__init__()
        self.mmul = MultiheadAttention(d_model, n_head, drop_rate=drop_rate)
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_model*4),
                                nn.ReLU(),
                                nn.Linear(d_model*4, d_model))

    def forward(self, q, k, v, dec_mask, cro_mask):
        # masked self-attention: Q, K and V all come from the decoder input
        _, V = self.mmul(q, q, q, mask=dec_mask)
        Q_1 = self.norm(V + q)

        '''
        V_2_ is named this way because it is reused in the residual just below;
        Q_1 is called Q because in the previous step the decoder only contributes
        the query to this cross-attention sublayer.
        '''
        _, V_2_ = self.mmul(Q_1, k, v, mask=cro_mask)
        V_2 = self.norm(V_2_ + Q_1)

        V_ff = self.ff(V_2)
        V = self.norm(V_ff + V_2)  # this completes one decoder-layer pass

        return V

class Decoder(nn.Module):
    def __init__(self, model, N):
        super().__init__()
        self.layers = clones(model, N)

    def forward(self, x, k, v, dec_mask, cro_mask):
        for layer in self.layers:
            x = layer(x, k, v, dec_mask, cro_mask)  # each layer's output feeds the next layer
        return x

if __name__ == '__main__':
    x = torch.randn(4, 17, 784)  # (batch, tgt_len, d_model)
    k = torch.randn(4, 23, 784)  # encoder K
    v = torch.randn(4, 23, 784)  # encoder V
    dcd_layer = DecoderLayer(784, 7)
    dcd = Decoder(dcd_layer, 2)
    print(dcd(x, k, v, None, None).shape)

mask

The masks are there to mimic the real scenario: each next word can only be translated on top of the answers produced so far, rather than producing everything (and all of the loss) in one go.

For the loss itself, use the ignore_index argument of torch.nn.functional.cross_entropy: fill the positions you want masked with -100 and their loss is ignored.

If you also want the individual per-position losses back, combine ignore_index with reduction='none',

which returns all the losses instead of their mean or sum. A small demo of this follows.
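
A small demo of that behaviour (toy shapes, not tied to the model above):

import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)                 # 4 positions, vocabulary of 10
target = torch.tensor([3, 7, -100, -100])   # the last two positions are padding
loss = F.cross_entropy(logits, target, ignore_index=-100, reduction='none')
print(loss)  # per-position losses; the ignored positions come back as 0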

import torch
import copy
import torch.nn as nn

def clones(module, N):
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

def attn_mask(src_ids=None, tgt_ids=None):
    # encoder mask: square, (src_len, src_len)
    if tgt_ids is None:
        mask = torch.matmul(src_ids.unsqueeze(-1), src_ids.unsqueeze(1))
        return mask

    # decoder self-attention mask: torch.triu gives ones strictly above the diagonal.
    # Since everywhere else we fill where mask == 0, we flip it here so the strict
    # upper triangle becomes False (i.e. 0) and gets filled inside self-attention.
    # Square, (tgt_len, tgt_len).
    elif src_ids is None:
        mask = torch.triu(torch.ones((tgt_ids.shape[-1], tgt_ids.shape[-1])), diagonal=1).type(torch.uint8)
        return mask == 0

    # cross-attention mask for the decoder's second sublayer: K and V come from the
    # encoder, so Q and K/V may have different lengths -> rectangular (tgt_len, src_len)
    else:
        Q = tgt_ids
        KV = src_ids
        mask = torch.matmul(Q.unsqueeze(-1), KV.unsqueeze(1))
        return mask

Autobots, transform!

import torch
import torch.nn as nn
from utils import attn_mask
from embedding import Embeddings, PositionalEncoding
from encoder import Encoder, EncoderLayer
from decoder import Decoder, DecoderLayer

class Transformer(nn.Module):
    def __init__(self, enc_vocab_size, dec_vocab_size, d_model, n_head, num_layer):
        super(Transformer, self).__init__()

        self.encoder_emb = Embeddings(enc_vocab_size, d_model)
        self.decoder_emb = Embeddings(dec_vocab_size, d_model)
        self.encoder_pos = PositionalEncoding(enc_vocab_size, d_model)  # vocab size doubles as max_len here
        self.decoder_pos = PositionalEncoding(dec_vocab_size, d_model)

        self.encoder = Encoder(EncoderLayer(d_model=d_model, n_head=n_head), N=num_layer)
        self.decoder = Decoder(DecoderLayer(d_model=d_model, n_head=n_head), N=num_layer)
        self.answer = nn.Linear(d_model, dec_vocab_size)

    def forward(self, src_q, tgt_q):
        # build the masks from the raw token ids, before embedding
        encoder_mask = attn_mask(src_ids=src_q)
        decoder_mask = attn_mask(tgt_ids=tgt_q)
        cross_mask = attn_mask(src_q, tgt_q)

        # encoder part: one mask
        src_q = self.encoder_emb(src_q)
        src_q = self.encoder_pos(src_q)
        k, v = self.encoder(src_q, src_q, src_q, encoder_mask)

        # decoder part: needs the other two masks
        tgt_q = self.decoder_emb(tgt_q)
        tgt_q = self.decoder_pos(tgt_q)
        output = self.decoder(tgt_q, k, v, decoder_mask, cross_mask)

        # project onto the target vocabulary; an argmax then gives the output tokens
        output = self.answer(output)

        return output


if __name__ == '__main__':
    x = torch.randint(1, 40, (4, 23))
    y = torch.randint(1, 40, (4, 17))
    trs = Transformer(40, 41, 784, 7, 2)
    output = trs(x, y)
    print(output.argmax(-1).shape)



Output

def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    with torch.no_grad():
        input_tensor = tensorFromSentence(input_lang, sentence)
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder.initHidden()

        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(input_tensor[ei],
                                                     encoder_hidden)
            encoder_outputs[ei] += encoder_output[0, 0]

        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS

        decoder_hidden = encoder_hidden

        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)

        for di in range(max_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            decoder_attentions[di] = decoder_attention.data
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append('<EOS>')
                break
            else:
                decoded_words.append(output_lang.index2word[topi.item()])

            decoder_input = topi.squeeze().detach()

        return decoded_words, decoder_attentions[:di + 1]
  1. Run the encoder once
  2. Feed <BOS> into the decoder
  3. Loop while checking whether <EOS> has been produced
  • encoder: (64, 24, 784), K and V are ready
  • decoder: (64, 1, 784), the input is the SOS token
    • Q@K.T gives scores of shape (64, 1, 24)
    • scores@V gives (64, 1, 784)
    • project it out to (64, 1, 56233), softmax, and take token1
      • the decoding so far, [<BOS>, token1], is fed into the next time step
    • the next step then produces (64, 2, 56233)
      • the decoding so far, [<BOS>, token1, token2], is fed into the next time step
  • decoding continues until [<BOS>, token1, token2, ..., <EOS>]; once the condition holds, the loop ends and the result is returned (a greedy-decoding sketch follows)
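
As a sketch of the "decoder inference module" still on the to-do list, greedy decoding for the Transformer class above could look roughly like this; bos_id and eos_id are hypothetical special-token ids (assumed non-zero and inside the decoder vocabulary), and the loop is only meant to illustrate the idea:

import torch

@torch.no_grad()
def greedy_decode(model, src_ids, bos_id, eos_id, max_steps=20):
    # start every target sequence with <BOS>, then extend it one token at a time
    tgt_ids = torch.full((src_ids.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_steps):
        logits = model(src_ids, tgt_ids)                         # (batch, cur_len, dec_vocab_size)
        next_token = logits[:, -1, :].argmax(-1, keepdim=True)   # greedy pick for the newest position
        tgt_ids = torch.cat([tgt_ids, next_token], dim=1)
        if (next_token == eos_id).all():                         # every sentence has emitted <EOS>
            break
    return tgt_ids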

The attention scores that are returned can be used for an alignment (relevance) matrix analysis, for example plotted as a heat map (see below).
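
For instance, with matplotlib (a sketch reusing the evaluate function above; sentence stands for any source-language string from the dataset):

import matplotlib.pyplot as plt

decoded_words, attentions = evaluate(encoder, decoder, sentence)
plt.matshow(attentions.numpy())  # rows: decoder steps, columns: encoder positions
plt.show()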

A pitfall with os.environ["CUDA_VISIBLE_DEVICES"] = "1": the variable selects which physical GPUs the process is allowed to see, and the visible ones are renumbered from 0. After setting it to "1", physical GPU 1 therefore shows up inside the process as cuda:0, and device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu") will land on it; if the machine only has a single GPU (physical index 0), nothing is visible at all and CUDA becomes unavailable. Alternatively, skip the environment variable and address the card directly with torch.device("cuda:1"). Either way, set it before the first CUDA call. This one cost me a lot of debugging.
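
A minimal pattern that avoids the pitfall (set the variable before the first CUDA call):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # only physical GPU 1 is visible from here on

import torch
# the visible GPU is renumbered, so it shows up as cuda:0 inside this process
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")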

References