Training from scratch rarely beats fine-tuning; if you have more money than Google & Meta, pretend I said nothing.

To do

  • Accelerator
  • get_scheduler
  • custom_wandb

Translation

Here we use a zh-en dataset and model to run a translation task.

You will need a wandb account for this, so register one; don't forget.
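custom_wandb is still on the TODO list above, but if you want logging wired up from the start, a minimal sketch looks like this (the project name and logged keys are placeholders, not part of this tutorial):

import wandb

wandb.login()  # paste the API key from wandb.ai/authorize when prompted

run = wandb.init(project="opus-mt-zh-en-finetune",   # hypothetical project name
                 config={"lr": 2e-5, "epochs": 3})   # anything you want tracked
# inside the training loop you would call, e.g., wandb.log({"bleu": score})
run.finish()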

A quick look at an example

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from datasets import load_dataset

prx = {'https': 'http://127.0.0.1:7890'}
model_name = "Helsinki-NLP/opus-mt-zh-en"
save_path = r'D:\00mydataset\huggingface model'
data_path = r'D:\00mydataset\huggingface dataset'

tokenizer = AutoTokenizer.from_pretrained(model_name, proxies=prx, cache_dir=save_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, proxies=prx, cache_dir=save_path)

dataset = load_dataset('news_commentary', 'en-zh', cache_dir=data_path)
dataset
'''
DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 69206
    })
})'''

Route this through a proxy to speed up the download.

tokenizer
'''
PreTrainedTokenizer(name_or_path='Helsinki-NLP/opus-mt-zh-en', vocab_size=65001, model_max_len=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'})
'''
dataset['train'][1]
'''
{'id': '1',
 'translation': {'en': 'PARIS – As the economic crisis deepens and widens, the world has been searching for historical analogies to help us understand what has been happening. At the start of the crisis, many people likened it to 1982 or 1973, which was reassuring, because both dates refer to classical cyclical downturns.',
  'zh': '巴黎-随着经济危机不断加深和蔓延,整个世界一直在寻找历史上的类似事件希望有助于我们了解目前正在发生的情况。一开始,很多人把这次危机比作1982年或1973年所发生的情况,这样得类比是令人宽心的,因为这两段时期意味着典型的周期性衰退。'}}
'''

Take a look at the data: each example comes back as a dict, and we mainly use the en and zh fields under translation.

s1 = '天下第一美少女, 罢了'
inputs = tokenizer(s1, return_tensors='pt')
inputs
'''
{'input_ids': tensor([[ 9705, 359, 3615, 2797, 14889, 2, 7, 40798, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}
'''
outputs = model.generate(**inputs)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
'''
["The most beautiful girl in the world, that's all."]
'''

The output looks decent.

Note that what AutoModelForSeq2SeqLM adds on top of AutoModel is the model.generate() capability.

Without it, a plain forward pass model(**inputs) expects you to supply the target-language side (the labels) yourself.
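As a quick sketch of that point (reusing the inputs above; the printed loss value will vary), passing the tokenized target as labels lets the model build the shifted decoder inputs itself and return a training loss:

# Sketch: a plain forward pass needs target-side labels (teacher forcing), unlike generate()
target = tokenizer(text_target="The most beautiful girl in the world, that's all.",
                   return_tensors="pt")
out = model(**inputs, labels=target["input_ids"])
print(out.loss, out.logits.shape)  # scalar loss plus per-token logits over the vocabulary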

If you are using a multilingual tokenizer such as mBART, mBART-50, or M2M100, you will need to set the language codes of your inputs and targets in the tokenizer by setting tokenizer.src_lang and tokenizer.tgt_lang to the right values.

  • If you use a multilingual tokenizer, you have to set the source- and target-language codes (tokenizer.src_lang / tokenizer.tgt_lang); a sketch follows.
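The Marian checkpoint used here is bilingual, so there is nothing to set. Purely as a sketch (not run in this post), a multilingual model such as mBART-50 would need the codes set roughly like this:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

mbart_name = "facebook/mbart-large-50-many-to-many-mmt"
mbart_tok = AutoTokenizer.from_pretrained(mbart_name, src_lang="zh_CN", tgt_lang="en_XX")
mbart = AutoModelForSeq2SeqLM.from_pretrained(mbart_name)

enc = mbart_tok("天下第一美少女, 罢了", return_tensors="pt")
# generation must be told which language to produce via the forced BOS token
out = mbart.generate(**enc, forced_bos_token_id=mbart_tok.lang_code_to_id["en_XX"])
print(mbart_tok.batch_decode(out, skip_special_tokens=True))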

Preprocessing

split_datasets = dataset["train"].train_test_split(train_size=0.9, seed=20)
split_datasets
'''
DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 189155
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 21018
    })
})
'''

split_datasets["validation"] = split_datasets.pop("test")

  • An HF Dataset can call .train_test_split(train_size=0.9, seed=20) directly

    • An HF Dataset can also be converted straight to a DataFrame, so you can use it together with sklearn (see the sketch after this list)
  • Rename the test split to validation
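A small sketch of the DataFrame point above (nothing here is needed for the rest of the tutorial; to_pandas() is the standard Dataset method):

# Dataset -> pandas DataFrame; each row's "translation" cell is a {'en': ..., 'zh': ...} dict
df = split_datasets["train"].to_pandas()
print(df.shape)
print(df["translation"].iloc[0])
# from here you could feed features into sklearn, run quick EDA, etc.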

DataCollatorForSeq2Seq

max_length = 128


def preprocess_function(examples):
    inputs = [ex["zh"] for ex in examples["translation"]]
    targets = [ex["en"] for ex in examples["translation"]]
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )
    return model_inputs

tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,)
'''
The ['id', 'translation'] columns would otherwise still be here; the remove_columns setting drops them.
remove_columns:
    Remove a selection of columns while doing the mapping.
    Columns will be removed before updating the examples with the output of `function`, i.e.
    if `function` is adding columns with names in `remove_columns`, these columns will be kept.
'''

We don’t pay attention to the attention mask of the targets, as the model won’t expect it. Instead, the labels corresponding to a padding token should be set to -100 so they are ignored in the loss computation. This will be done by our data collator later on since we are applying dynamic padding, but if you use padding here, you should adapt the preprocessing function to set all labels that correspond to the padding token to -100.

We do not pad here. Later, the label positions that correspond to padding tokens are set to -100 so they do not contribute to the loss; that happens in the next step, inside the data collator.
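If you did pad inside the preprocessing function instead of relying on the collator's dynamic padding, the adaptation the quote describes would look roughly like this sketch (not used further in this post):

def preprocess_with_padding(examples):
    # sketch only: pad to max_length here, then mask the padded label positions with -100 ourselves
    inputs = [ex["zh"] for ex in examples["translation"]]
    targets = [ex["en"] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=max_length,
                             padding="max_length", truncation=True)
    model_inputs["labels"] = [
        [(tok if tok != tokenizer.pad_token_id else -100) for tok in label]
        for label in model_inputs["labels"]
    ]
    return model_inputs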

PS: I ran into a bug today, probably from not being on the latest version: the tokenizer output had no labels field. If you hit it, swap in the following:

max_input_length = max_target_length = 128  # same 128 as max_length above

def preprocess_function(examples):
    inputs = [ex["zh"] for ex in examples["translation"]]
    targets = [ex["en"] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,)
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 3)])
batch.keys()
# dict_keys(['attention_mask', 'input_ids', 'labels', 'decoder_input_ids'])

batch["labels"]
'''
tensor([[57483, 7, 3241, 403, 3, 289, 1817, 25787, 22, 6,
38697, 22, 2, 3, 426, 64, 72, 27734, 14, 9054,
56467, 6667, 8, 721, 512, 2498, 209, 64, 72, 11468,
5, 393, 3, 2597, 4, 3, 1817, 2, 469, 235,
238, 24898, 39, 8, 13579, 50, 17528, 2, 60, 42,
56548, 2, 695, 443, 10119, 5543, 8, 53617, 7, 38261,
40490, 22, 5, 0],
[ 24, 22026, 30, 2329, 10349, 22901, 20, 52813, 17, 50,
12, 29940, 4, 3, 2121, 20, 1843, 45, 67, 243,
1945, 30, 368, 36681, 10, 3, 1796, 4, 14961, 2203,
6, 28291, 3, 22986, 2, 11355, 3, 3368, 64, 8700,
18, 469, 38575, 10, 278, 54, 8, 4291, 57, 22301,
1718, 8, 959, 30229, 1294, 6855, 4298, 5, 0, -100,
-100, -100, -100, -100]])'''

# Check the original label tokens
for i in range(1, 3):
    print(tokenized_datasets["train"][i]["labels"])
'''
[57483, 7, 3241, 403, 3, 289, 1817, 25787, 22, 6, 38697, 22, 2, 3, 426, 64, 72, 27734, 14, 9054, 56467, 6667, 8, 721, 512, 2498, 209, 64, 72, 11468, 5, 393, 3, 2597, 4, 3, 1817, 2, 469, 235, 238, 24898, 39, 8, 13579, 50, 17528, 2, 60, 42, 56548, 2, 695, 443, 10119, 5543, 8, 53617, 7, 38261, 40490, 22, 5, 0]
[24, 22026, 30, 2329, 10349, 22901, 20, 52813, 17, 50, 12, 29940, 4, 3, 2121, 20, 1843, 45, 67, 243, 1945, 30, 368, 36681, 10, 3, 1796, 4, 14961, 2203, 6, 28291, 3, 22986, 2, 11355, 3, 3368, 64, 8700, 18, 469, 38575, 10, 278, 54, 8, 4291, 57, 22301, 1718, 8, 959, 30229, 1294, 6855, 4298, 5, 0]'''

This is all done by a DataCollatorForSeq2Seq. Like the DataCollatorWithPadding, it takes the tokenizer used to preprocess the inputs, but it also takes the model. This is because this data collator will also be responsible for preparing the decoder input IDs, which are shifted versions of the labels with a special token at the beginning. Since this shift is done slightly differently for different architectures, the DataCollatorForSeq2Seq needs to know the model object. Quite involved, honestly.
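You can see that shift on the batch built above: decoder_input_ids should be the labels shifted one position to the right, with the model's decoder start token in front (exact ids depend on the checkpoint, so treat this as a sketch):

print(batch["labels"][0][:8])
print(batch["decoder_input_ids"][0][:8])    # same ids, shifted right by one
print(model.config.decoder_start_token_id)  # the special token placed at position 0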


Metrics

One weakness with BLEU is that it expects the text to already be tokenized, which makes it difficult to compare scores between models that use different tokenizers. So instead, the most commonly used metric for benchmarking translation models today is SacreBLEU, which addresses this weakness (and others) by standardizing the tokenization step

  • Here we bring in SacreBLEU as the scoring metric
!pip install sacrebleu

import evaluate
metric = evaluate.load("sacrebleu")

Example 1

predictions = [
    "This plugin lets you translate web pages between several languages automatically."
]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)

'''
{'score': 46.750469682990165,
 'counts': [11, 6, 4, 3],
 'totals': [12, 11, 10, 9],
 'precisions': [91.67, 54.54, 40.0, 33.33],
 'bp': 0.9200444146293233,
 'sys_len': 12,
 'ref_len': 13}'''

Example 2

predictions = ["This This This This"]
references = [
[
"This plugin allows you to automatically translate web pages between several languages."
]
]
metric.compute(predictions=predictions, references=references)
'''
{'score': 1.683602693167689,
'counts': [1, 0, 0, 0],
'totals': [4, 3, 2, 1],
'precisions': [25.0, 16.67, 12.5, 12.5],
'bp': 0.10539922456186433,
'sys_len': 4,
'ref_len': 13}'''

This gets a BLEU score of 46.75, which is rather good — for reference, the original Transformer model in the “Attention Is All You Need” paper achieved a BLEU score of 41.8 on a similar translation task between English and French! (For more information about the individual metrics, like counts and bp, see the SacreBLEU repository.)

  • Already beating Optimus Prime (the original Transformer), then. The other fields are as follows (a quick numeric sanity check comes after this list):
    • score: The BLEU score.
    • counts: List of counts of correct ngrams, 1 <= n <= max_ngram_order
    • totals: List of counts of total ngrams, 1 <= n <= max_ngram_order
    • precisions: List of precisions, 1 <= n <= max_ngram_order
    • bp: The brevity penalty.
    • sys_len: The cumulative system length.
    • ref_len: The cumulative reference length.
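As the sanity check promised above, the score in Example 1 can be rebuilt from the other fields: BLEU = bp * geometric mean of the four n-gram precisions * 100, where bp = exp(1 - ref_len / sys_len) when the system output is shorter than the reference.

import math

precisions = [91.67, 54.54, 40.0, 33.33]    # from Example 1
sys_len, ref_len = 12, 13

bp = math.exp(1 - ref_len / sys_len)        # ~0.9200, matches the reported bp
geo_mean = math.exp(sum(math.log(p / 100) for p in precisions) / 4)
print(100 * bp * geo_mean)                  # ~46.75, matches the reported score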

Compute_metrics

import numpy as np


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}

  • Because decoding already handles pad_token (skip_special_tokens strips it), we use np.where to replace every -100 in the labels with pad_token_id before decoding (tiny demo after this list)
    • numpy.where(condition, x, y) takes elements from x where the condition holds and from y where it doesn't
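The tiny demo mentioned above (the label values are made up for illustration):

import numpy as np

labels = np.array([[57483, 7, 0, -100, -100]])
print(np.where(labels != -100, labels, tokenizer.pad_token_id))
# the -100 positions become tokenizer.pad_token_id; everything else is untouched,
# and skip_special_tokens=True later drops those pad tokens during decoding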

Training Loop

First a look at Seq2SeqTrainer, then back to the custom training loop.

Seq2SeqTrainer

This part is not the focus, but a few details are worth noting.

from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    "marian-finetuned-kde4-en-to-fr",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

Define the training arguments:

  • We don’t set any regular evaluation, as evaluation takes a while; we will just evaluate our model once before training and after.
    • Since we define our own metric and evaluation is slow, we compute the score once before training and once more after it finishes
  • We set fp16=True, which speeds up training on modern GPUs.
  • We set predict_with_generate=True, as discussed above.
    • the decoder performs inference by predicting tokens one by one — something that’s implemented behind the scenes in 🤗 Transformers by the generate() method. The Seq2SeqTrainer will let us use that method for evaluation if we set predict_with_generate=True.
  • We use push_to_hub=True to upload the model to the Hub at the end of each epoch
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.evaluate(max_length=max_length)
'''
{'eval_loss': 1.6964408159255981,
 'eval_bleu': 39.26865061007616,
 'eval_runtime': 965.8884,
 'eval_samples_per_second': 21.76,
 'eval_steps_per_second': 0.341}'''

trainer.train()
trainer.evaluate(max_length=max_length)
'''
{'eval_loss': 0.8558505773544312,
 'eval_bleu': 52.94161337775576,
 'eval_runtime': 714.2576,
 'eval_samples_per_second': 29.426,
 'eval_steps_per_second': 0.461,
 'epoch': 3.0}'''

That’s a nearly 14-point improvement, which is great.

Custom Training Loop

Now for the main part.

from torch.utils.data import DataLoader
from transformers import AdamW

tokenized_datasets.set_format("torch")
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,  # the DataCollatorForSeq2Seq(tokenizer, model=model) from above
    batch_size=8,
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], collate_fn=data_collator, batch_size=8
)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)  # reload a fresh pretrained model
optimizer = AdamW(model.parameters(), lr=2e-5)

The above is the standard setup.

from accelerate import Accelerator
from transformers import get_scheduler

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader)

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

Both of these pieces need attention:

  • Once we have all those objects, we can send them to the accelerator.prepare() method. Remember that if you want to train on TPUs in a Colab notebook, you will need to move all of this code into a training function, and that shouldn’t execute any cell that instantiates an Accelerator
    • In other words, on TPU, don't instantiate the Accelerator before everything has been moved into the training function (see the launcher sketch after this list)
  • we can use its length to compute the number of training steps. Remember we should always do this after preparing the dataloader, as that method will change the length of the DataLoader
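For the TPU caveat above, the pattern from the Accelerate docs is to wrap everything in a function and launch it; a minimal sketch (the function body is elided):

from accelerate import notebook_launcher

def training_function():
    # build the Accelerator, dataloaders, optimizer and scheduler *inside* this function,
    # call accelerator.prepare(...), then run the epoch loop shown further below
    ...

notebook_launcher(training_function)  # spawns the function on all available TPU cores

On a single GPU you can skip the launcher and just run the loop below directly.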
def postprocess(predictions, labels):
    predictions = predictions.cpu().numpy()
    labels = labels.cpu().numpy()

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    return decoded_preds, decoded_labels

Before training proper, we define a post-processing function that turns predictions and labels into the text form SacreBLEU expects.

from tqdm.auto import tqdm
import torch

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for batch in tqdm(eval_dataloader):
        with torch.no_grad():
            generated_tokens = accelerator.unwrap_model(model).generate(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
                max_length=128,
            )
        labels = batch["labels"]

        # Necessary to pad predictions and labels for being gathered
        generated_tokens = accelerator.pad_across_processes(
            generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
        )
        labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)

        predictions_gathered = accelerator.gather(generated_tokens)
        labels_gathered = accelerator.gather(labels)

        decoded_preds, decoded_labels = postprocess(predictions_gathered, labels_gathered)
        metric.add_batch(predictions=decoded_preds, references=decoded_labels)

    results = metric.compute()
    print(f"epoch {epoch}, BLEU score: {results['score']:.2f}")

    # Save and upload
    # accelerator.wait_for_everyone()
    # unwrapped_model = accelerator.unwrap_model(model)
    # unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    # if accelerator.is_main_process:
    #     tokenizer.save_pretrained(output_dir)
    #     repo.push_to_hub(
    #         commit_message=f"Training in progress epoch {epoch}", blocking=False
    #     )

Finally, test the output with a pipeline.

from transformers import pipeline

# Replace this with your own checkpoint
model_checkpoint = "huggingface-course/marian-finetuned-kde4-en-to-fr"
translator = pipeline("translation", model=model_checkpoint)
translator("Default to expanded threads")

Summary

  • First, the end-to-end pipeline: installs, imports, config, and loading
!pip install sacrebleu
!pip install evaluate
!pip install accelerate
!pip install --upgrade transformers


import os
import evaluate
import numpy as np
import torch
from datasets import load_metric
from accelerate import Accelerator
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import (AdamW, AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments, get_scheduler, pipeline)

import warnings
import logging
warnings.filterwarnings('ignore')
logging.disable(logging.WARNING)
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

CONFIG = {"seed": 2021,
          "epochs": 3,
          "model_name": "roberta-base",  # template value; the translation checkpoint is set separately below
          "train_batch_size": 32,
          "valid_batch_size": 64,
          "max_length": 128,
          "learning_rate": 1e-4,
          "scheduler": 'CosineAnnealingLR',
          "min_lr": 1e-6,
          "T_max": 500,
          "weight_decay": 1e-6,
          "n_fold": 5,
          "n_accumulate": 1,
          "num_classes": 1,
          "margin": 0.5,
          "device": torch.device("cuda:0" if torch.cuda.is_available() else "cpu"),
          "hash_name": HASH_NAME  # define HASH_NAME yourself, e.g. an experiment tag for wandb runs
          }

def set_seed(seed=42):
    '''Sets the seed of the entire notebook so results are the same every time we run.
    This is for REPRODUCIBILITY.'''
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    # When running on the CuDNN backend, two further options must be set
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Set a fixed value for the hash seed
    os.environ['PYTHONHASHSEED'] = str(seed)

set_seed(CONFIG['seed'])
model_name = "Helsinki-NLP/opus-mt-zh-en"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

dataset_raw = load_dataset('news_commentary', 'en-zh')
dataset_raw, dataset_raw['train']['translation'][1]