
In this chapter we use the fast tokenizer's special features to handle the particularities of NER and QA task data.

Fast Tokenizer

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
encoding = tokenizer(example)
print(type(encoding))
# <class 'transformers.tokenization_utils_base.BatchEncoding'>

The object returned by the tokenizer is not just a plain dictionary mapping; it is a BatchEncoding that also provides a number of extra methods.

tokenizer.is_fast, encoding.is_fast
# (True, True)

encoding.tokens()
'''
['[CLS]', 'My', 'name', 'is', 'S', '##yl', '##va', '##in', 'and', 'I', 'work', 'at', 'Hu', '##gging', 'Face', 'in',
'Brooklyn', '.', '[SEP]']'''

encoding.word_ids()
'''
[None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]'''

The word_ids() method tells us which word of the original sentence each token comes from.

Finally, we can use word_to_chars() or token_to_chars() and char_to_word() or char_to_token() to map between words/tokens and character positions in the original text.

start, end = encoding.word_to_chars(3)
example[start:end]
# Sylvain
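
For example, a quick sketch of the other mapping methods on the same encoding (the expected values follow from the tokens and word_ids shown above):

start, end = encoding.token_to_chars(5)
example[start:end]
# 'yl'  (token 5 is '##yl')

encoding.char_to_word(11), encoding.char_to_token(11)
# (3, 4)  -> character 11 ('S') belongs to word 3 ("Sylvain") and to token 4 ('S')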

NER

In NER, we use the offset mappings to pin down exactly which characters of the original text each entity covers.

The pipeline method

First, let's look at what the NER pipeline produces.

from transformers import pipeline

token_classifier = pipeline("token-classification")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")
'''
[{'entity': 'I-PER', 'score': 0.9993828, 'index': 4, 'word': 'S', 'start': 11, 'end': 12},
{'entity': 'I-PER', 'score': 0.99815476, 'index': 5, 'word': '##yl', 'start': 12, 'end': 14},
{'entity': 'I-PER', 'score': 0.99590725, 'index': 6, 'word': '##va', 'start': 14, 'end': 16},
{'entity': 'I-PER', 'score': 0.9992327, 'index': 7, 'word': '##in', 'start': 16, 'end': 18},
{'entity': 'I-ORG', 'score': 0.97389334, 'index': 12, 'word': 'Hu', 'start': 33, 'end': 35},
{'entity': 'I-ORG', 'score': 0.976115, 'index': 13, 'word': '##gging', 'start': 35, 'end': 40},
{'entity': 'I-ORG', 'score': 0.98879766, 'index': 14, 'word': 'Face', 'start': 41, 'end': 45},
{'entity': 'I-LOC', 'score': 0.99321055, 'index': 16, 'word': 'Brooklyn', 'start': 49, 'end': 57}]'''

A condensed, grouped version:

from transformers import pipeline

token_classifier = pipeline("token-classification", aggregation_strategy="simple")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")
'''
[{'entity_group': 'PER', 'score': 0.9981694, 'word': 'Sylvain', 'start': 11, 'end': 18},
{'entity_group': 'ORG', 'score': 0.97960204, 'word': 'Hugging Face', 'start': 33, 'end': 45},
{'entity_group': 'LOC', 'score': 0.99321055, 'word': 'Brooklyn', 'start': 49, 'end': 57}]'''

aggregation_strategy accepts different values; "simple" takes the mean of the scores of the tokens that make up each entity.

For example, the "Sylvain" score above comes from averaging the four token scores of 'S', '##yl', '##va', '##in' in the non-aggregated output (a quick numeric check follows the list below). The other available strategies are:

  • "first", where the score of each entity is the score of the first token of that entity (so for “Sylvain” it would be 0.993828, the score of the token S)
  • "max", where the score of each entity is the maximum score of the tokens in that entity (so for “Hugging Face” it would be 0.98879766, the score of “Face”)
  • "average", where the score of each entity is the average of the scores of the words composing that entity (so for “Sylvain” there would be no difference from the "simple" strategy, but “Hugging Face” would have a score of 0.9819, the average of the scores for “Hugging”, 0.975, and “Face”, 0.98879)
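
As a quick check, averaging the four token scores from the non-aggregated output above reproduces the "simple" score reported for "Sylvain":

import numpy as np

np.mean([0.9993828, 0.99815476, 0.99590725, 0.9992327])  # 'S', '##yl', '##va', '##in'
# 0.9981693775 ≈ 0.9981694, the aggregated score shown above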

logits

Here we take the argmax over the last dimension of the returned logits to get the predicted class id for each token.

from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)
print(inputs["input_ids"].shape)
print(outputs.logits.shape)
'''
torch.Size([1, 19])
torch.Size([1, 19, 9])'''
import torch

probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist()
predictions = outputs.logits.argmax(dim=-1)[0].tolist()
print(predictions)
# [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 6, 0, 8, 0, 0]

model.config.id2label
'''
{0: 'O',
1: 'B-MISC',
2: 'I-MISC',
3: 'B-PER',
4: 'I-PER',
5: 'B-ORG',
6: 'I-ORG',
7: 'B-LOC',
8: 'I-LOC'}'''
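
To see these predictions lined up with the tokens, a quick sketch:

for token, pred in zip(inputs.tokens(), predictions):
    print(token, model.config.id2label[pred])
# Special tokens and non-entity words come out as 'O'; 'S', '##yl', '##va', '##in'
# are I-PER, 'Hu', '##gging', 'Face' are I-ORG, and 'Brooklyn' is I-LOC.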

Post-processing with offsets

Let's organize the format and reproduce the pipeline output above.

results = []
tokens = inputs.tokens()

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        results.append(
            {
                "entity": label,
                "score": probabilities[idx][pred],
                "index": idx,  # token index, to match the pipeline output
                "word": tokens[idx],
            }
        )

print(results)

'''
[{'entity': 'I-PER', 'score': 0.9993828, 'index': 4, 'word': 'S'},
{'entity': 'I-PER', 'score': 0.99815476, 'index': 5, 'word': '##yl'},
{'entity': 'I-PER', 'score': 0.99590725, 'index': 6, 'word': '##va'},
{'entity': 'I-PER', 'score': 0.9992327, 'index': 7, 'word': '##in'},
{'entity': 'I-ORG', 'score': 0.97389334, 'index': 12, 'word': 'Hu'},
{'entity': 'I-ORG', 'score': 0.976115, 'index': 13, 'word': '##gging'},
{'entity': 'I-ORG', 'score': 0.98879766, 'index': 14, 'word': 'Face'},
{'entity': 'I-LOC', 'score': 0.99321055, 'index': 16, 'word': 'Brooklyn'}]'''

Offsets (offset_mapping)

inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
inputs_with_offsets["offset_mapping"]

'''
[(0, 0), (0, 2), (3, 7), (8, 10), (11, 12), (12, 14), (14, 16), (16, 18), (19, 22), (23, 24), (25, 29), (30, 32),
(33, 35), (35, 40), (41, 45), (46, 48), (49, 57), (57, 58), (0, 0)]'''

These 19 tuples correspond one-to-one to the 19 tokens produced by the tokenizer:

['[CLS]', 'My', 'name', 'is', 'S', '##yl', '##va', '##in', 'and', 'I', 'work', 'at', 'Hu', '##gging', 'Face', 'in','Brooklyn', '.', '[SEP]']

For example, (0, 0) is reserved for [CLS]; the token at index 5 is '##yl', so its span in the original text is (12, 14), as shown below:

example[12:14]
# yl

Let's continue reproducing the pipeline:

results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        start, end = offsets[idx]
        results.append(
            {
                "entity": label,
                "score": probabilities[idx][pred],
                "index": idx,  # token index, to match the pipeline output
                "word": tokens[idx],
                "start": start,
                "end": end,
            }
        )

print(results)

'''
[{'entity': 'I-PER', 'score': 0.9993828, 'index': 4, 'word': 'S', 'start': 11, 'end': 12},
{'entity': 'I-PER', 'score': 0.99815476, 'index': 5, 'word': '##yl', 'start': 12, 'end': 14},
{'entity': 'I-PER', 'score': 0.99590725, 'index': 6, 'word': '##va', 'start': 14, 'end': 16},
{'entity': 'I-PER', 'score': 0.9992327, 'index': 7, 'word': '##in', 'start': 16, 'end': 18},
{'entity': 'I-ORG', 'score': 0.97389334, 'index': 12, 'word': 'Hu', 'start': 33, 'end': 35},
{'entity': 'I-ORG', 'score': 0.976115, 'index': 13, 'word': '##gging', 'start': 35, 'end': 40},
{'entity': 'I-ORG', 'score': 0.98879766, 'index': 14, 'word': 'Face', 'start': 41, 'end': 45},
{'entity': 'I-LOC', 'score': 0.99321055, 'index': 16, 'word': 'Brooklyn', 'start': 49, 'end': 57}]'''

Finally, we group the tokens that belong to the same entity (so that, for example, 'Hu', '##gging' and 'Face' become a single 'Hugging Face' entity):

import numpy as np

results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

idx = 0
while idx < len(predictions):
    pred = predictions[idx]
    label = model.config.id2label[pred]
    if label != "O":
        # Remove the B- or I-
        label = label[2:]
        start, _ = offsets[idx]

        # Grab all the tokens labeled with I-label
        all_scores = []
        while (
            idx < len(predictions)
            and model.config.id2label[predictions[idx]] == f"I-{label}"
        ):
            all_scores.append(probabilities[idx][pred])
            _, end = offsets[idx]
            idx += 1

        # The score is the mean of all the scores of the tokens in that grouped entity
        score = np.mean(all_scores).item()
        word = example[start:end]
        results.append(
            {
                "entity_group": label,
                "score": score,
                "word": word,
                "start": start,
                "end": end,
            }
        )
    idx += 1

print(results)
'''
[{'entity_group': 'PER', 'score': 0.9981694, 'word': 'Sylvain', 'start': 11, 'end': 18},
{'entity_group': 'ORG', 'score': 0.97960204, 'word': 'Hugging Face', 'start': 33, 'end': 45},
{'entity_group': 'LOC', 'score': 0.99321055, 'word': 'Brooklyn', 'start': 49, 'end': 57}]'''

QA

For QA, the main problem we tackle is that the answer span may cross the boundaries of the chunks a long text gets split into.

The pipeline method

from transformers import pipeline

question_answerer = pipeline("question-answering")
context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch, and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back 🤗 Transformers?"
question_answerer(question=question, context=context)
'''
{'score': 0.97773,
'start': 78,
'end': 105,
'answer': 'Jax, PyTorch and TensorFlow'}'''

This pipeline can handle very long sequences and will mostly avoid truncating your text; but if the text is so long that part of it does get cut off and the answer happens to sit in that part, there is nothing it can do.

logits

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)

In a QA task the model returns two sets of logits: one for the start position of the answer and one for the end position.

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)
# torch.Size([1, 66]) torch.Size([1, 66])

Since the input sequence contains special tokens ([CLS] and one or more [SEP]) as well as the question tokens, we need to mask out everything that is not part of the context (while keeping [CLS], which is used to signal that the answer is not present). Because of how softmax works, we do this by filling the masked logits with a large negative number.

import torch

sequence_ids = inputs.sequence_ids()
# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]
# Unmask the [CLS] token
mask[0] = False
mask = torch.tensor(mask)[None]

start_logits[mask] = -10000
end_logits[mask] = -10000

sequence_ids() maps every token back to the sequence it came from (0 for the question, 1 for the context), while special tokens are marked as None.

For example, [None, 0, 0, 0, None, 1, 1, 1, None] corresponds to [CLS] question [SEP] context [SEP].

The mask has the same shape as the logits and holds True/False values; start_logits[mask] = -10000 fills every position where the mask is True with -10000.
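
A tiny illustration of this kind of boolean-mask assignment (with made-up values):

t = torch.tensor([[1.0, 2.0, 3.0]])
m = torch.tensor([[True, False, True]])
t[m] = -10000
t
# tensor([[-10000., 2., -10000.]])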

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]

We now have two probability tensors of shape (batch_size, seq_len); each position along seq_len is the probability that the answer starts (or ends) at that token.

Next, much like building an attention score matrix, we pair every start position with every end position, mask out the combinations where start > end, and keep the pair with the highest probability product. The toy example below illustrates the idea; we apply the same steps to the real input further below.

start = torch.randn(2, 4).softmax(dim=-1)
end = torch.randn(2, 4).softmax(dim=-1)
start, end
'''
(tensor([[0.1310, 0.3604, 0.1726, 0.3360],
         [0.1955, 0.3245, 0.0700, 0.4101]]),
 tensor([[0.2199, 0.1130, 0.3976, 0.2695],
         [0.0455, 0.0804, 0.0514, 0.8227]]))'''

# Pair every start position with every end position: (2, 4, 1) @ (2, 1, 4) = (2, 4, 4)
scores = start.unsqueeze(1).transpose(-1, -2) @ end.unsqueeze(1)
scores
'''
tensor([[[0.0288, 0.0148, 0.0521, 0.0353],
         [0.0792, 0.0407, 0.1433, 0.0971],
         [0.0380, 0.0195, 0.0686, 0.0465],
         [0.0739, 0.0380, 0.1336, 0.0905]],

        [[0.0089, 0.0157, 0.0100, 0.1608],
         [0.0148, 0.0261, 0.0167, 0.2670],
         [0.0032, 0.0056, 0.0036, 0.0576],
         [0.0187, 0.0330, 0.0211, 0.3374]]])'''

# Each entry is the product of a start probability and an end probability
start[0, 0] * end[0, 1]
# tensor(0.0148)

As above: start reshaped to (2, 4, 1) matrix-multiplied by end reshaped to (2, 1, 4) gives (2, 4, 4). Looking at a single (4, 4) matrix, the rows index the start position and the columns index the end position, and each entry is the product of the corresponding probabilities.

Next, we mask out the invalid combinations (those where the end position comes before the start):

fin = torch.triu(scores)
fin
'''
tensor([[[0.0288, 0.0148, 0.0521, 0.0353],
         [0.0000, 0.0407, 0.1433, 0.0971],
         [0.0000, 0.0000, 0.0686, 0.0465],
         [0.0000, 0.0000, 0.0000, 0.0905]],

        [[0.0089, 0.0157, 0.0100, 0.1608],
         [0.0000, 0.0261, 0.0167, 0.2670],
         [0.0000, 0.0000, 0.0036, 0.0576],
         [0.0000, 0.0000, 0.0000, 0.3374]]])'''

Here the official course plays a small trick: the code below only processes the indices of a single example.

fin[0]
'''
tensor([[0.0288, 0.0148, 0.0521, 0.0353],
        [0.0000, 0.0407, 0.1433, 0.0971],
        [0.0000, 0.0000, 0.0686, 0.0465],
        [0.0000, 0.0000, 0.0000, 0.0905]])'''


start_index = fin[0].argmax() // 4
end_index = fin[0].argmax() % 4
start_index, end_index, fin[0].argmax()
# (tensor(1), tensor(2), tensor(6))

(1, 2) corresponds to 0.1433, i.e. the start index is 1 and the end index is 2. Note that argmax returns the flattened (absolute) index, which is why we recover the row and column with // and %.

To handle a batch, you simply wrap the same computation in a loop over the examples.
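
Applying the same steps to the real start_probabilities / end_probabilities of our single example gives the start_index and end_index used in the next step (a minimal sketch, mirroring the batched loop used later for long contexts):

scores = start_probabilities[:, None] * end_probabilities[None, :]  # (seq_len, seq_len)
scores = torch.triu(scores)          # zero out pairs where end < start
best = scores.argmax().item()        # flattened index of the best (start, end) pair

start_index = best // scores.shape[1]
end_index = best % scores.shape[1]
score = scores[start_index, end_index].item()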

offset_mapping

inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]
answer = context[start_char:end_char]

Here offsets is a list of tuples giving each token's start and end character positions; (3, 7), for example, means the token starts at character 3 and ends at character 7 of the original text.

We take the start character of the start token and the end character of the end token; the slice between them is the answer string in the original text.

Also, the offsets ignore prefixes such as ## added by the tokenizer, so subword splitting does not affect the character spans.
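
Putting it together in the same format as the pipeline output (using the score from the sketch above):

result = {"answer": answer, "start": start_char, "end": end_char, "score": score}
result
# expected to match the pipeline result above: answer 'Jax, PyTorch and TensorFlow', start 78, end 105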

Handling very long contexts

For a context that is too long we apply truncation into chunks and add an overlap (stride), so the answer cannot be cut into two pieces that are unanswerable in both chunks.

sentence = "This sentence is not too long but we are going to split it anyway."
inputs = tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))

'''
'[CLS] This sentence is not [SEP]'
'[CLS] is not too long [SEP]'
'[CLS] too long but we [SEP]'
'[CLS] but we are going [SEP]'
'[CLS] are going to split [SEP]'
'[CLS] to split it anyway [SEP]'
'[CLS] it anyway. [SEP]'
'''

As shown above, we set the overlap window (stride) to 2, so every chunk shares two tokens with the previous one; note that the special tokens are not counted as part of this window.

print(inputs.keys())
# dict_keys(['input_ids', 'attention_mask', 'overflow_to_sample_mapping'])
print(inputs["overflow_to_sample_mapping"])
# [0, 0, 0, 0, 0, 0, 0]

overflow_to_sample_mapping records which of the original sentences each chunk comes from; with several input sentences this becomes clearer:

sentences = [
    "This sentence is not too long but we are going to split it anyway.",
    "This sentence is shorter but will still get split.",
]
inputs = tokenizer(
    sentences, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

print(inputs["overflow_to_sample_mapping"])
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

The mapping is now obvious: the first seven chunks come from the first sentence and the last four from the second.

Now back to our long-context processing:

inputs = tokenizer(
    question,
    long_context,  # the long document we want to query (its text is not shown here)
    stride=128,
    max_length=384,
    padding="longest",
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

For QA the question usually comes first, and truncation="only_second" means only the second sequence (long_context) is truncated; the question itself is never cut.
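
A quick way to confirm this: every chunk starts with the full question, and only the context part differs.

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids)[:80])
# Both chunks begin with '[CLS] Which deep learning libraries back ...? [SEP]'
# followed by a different slice of the context.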

Next we clean up the tokenizer output, popping out the entries the model does not expect as input:

_ = inputs.pop("overflow_to_sample_mapping")
offsets = inputs.pop("offset_mapping")

inputs = inputs.convert_to_tensors("pt")
print(inputs["input_ids"].shape)
# torch.Size([2, 384])

Let's check the model outputs:

outputs = model(**inputs)

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)
# torch.Size([2, 384]) torch.Size([2, 384])

Processing the coordinates

sequence_ids = inputs.sequence_ids()
mask = [i != 1 for i in sequence_ids]
mask[0] = False
mask = torch.logical_or(torch.tensor(mask)[None], (inputs["attention_mask"] == 0))

start_logits[mask] = -10000
end_logits[mask] = -10000

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)

torch.logical_or performs an element-wise OR: a position is True if either input is True (non-zero) and False only when both are False. Here it combines the question/special-token mask with the padding mask derived from attention_mask.
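
A tiny illustration with made-up values:

a = torch.tensor([True, False, False])
b = torch.tensor([False, False, True])
torch.logical_or(a, b)
# tensor([ True, False,  True])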

Next we loop over the chunks to find the best (start, end) pair and its score for each:

candidates = []
for start_probs, end_probs in zip(start_probabilities, end_probabilities):
    scores = start_probs[:, None] * end_probs[None, :]
    idx = torch.triu(scores).argmax().item()

    start_idx = idx // scores.shape[1]
    end_idx = idx % scores.shape[1]
    score = scores[start_idx, end_idx].item()
    candidates.append((start_idx, end_idx, score))

print(candidates)
# [(0, 18, 0.33867), (173, 184, 0.97149)]

Then we map back to the original text using the offsets we saved earlier:

for candidate, offset in zip(candidates, offsets):
    start_token, end_token, score = candidate
    start_char, _ = offset[start_token]
    _, end_char = offset[end_token]
    answer = long_context[start_char:end_char]
    result = {"answer": answer, "start": start_char, "end": end_char, "score": score}
    print(result)

'''
{'answer': '\n🤗 Transformers: State of the Art NLP', 'start': 0, 'end': 37, 'score': 0.33867}
{'answer': 'Jax, PyTorch and TensorFlow', 'start': 1892, 'end': 1919, 'score': 0.97149}'''

That is the general QA post-processing. For a real dataset, however, the gold answer positions refer to the original text and no longer line up with the token positions after truncation into chunks, so mapping the answers onto each chunk needs extra handling.
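
As a preview of that extra handling, here is a minimal sketch, with a made-up helper name, of how the offsets and sequence_ids can be used to map a gold answer (given as character positions answer_start / answer_end in the original context) onto token positions inside one chunk; when the answer is not fully contained, the usual convention is to point at the [CLS] position (0, 0):

def find_answer_positions(encoding, chunk_index, answer_start, answer_end):
    """Hypothetical helper: locate a gold answer (given as character positions in the
    original context) inside one chunk of a tokenized (question, context) pair.
    `encoding` must be built with return_offsets_mapping=True and
    return_overflowing_tokens=True. Returns (0, 0) when the answer is not fully
    contained in this chunk."""
    offsets = encoding["offset_mapping"][chunk_index]
    sequence_ids = encoding.sequence_ids(chunk_index)

    # First and last token that belong to the context (sequence id 1)
    context_start = sequence_ids.index(1)
    context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

    # Answer not fully inside this chunk -> label it with the [CLS] position
    if offsets[context_start][0] > answer_start or offsets[context_end][1] < answer_end:
        return 0, 0

    # Otherwise walk the token boundaries inward until they frame the answer characters
    start_token = context_start
    while start_token <= context_end and offsets[start_token][0] <= answer_start:
        start_token += 1
    end_token = context_end
    while end_token >= context_start and offsets[end_token][1] >= answer_end:
        end_token -= 1
    return start_token - 1, end_token + 1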