NLP Baseline 01: Translation
Training from scratch is rarely better than fine-tuning; if you have more money than Google & Meta, pretend I said nothing.
To do:
- Accelerator
- get_scheduler
- custom_wandb
Translation
Here we use a zh-en dataset and model for the translation task.
You will need a wandb account for this, so remember to register one.
A quick look at an example
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
```
Use a proxy here to speed up the download if needed.
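For reference, a minimal sketch of the loading step, assuming the Helsinki-NLP/opus-mt-zh-en checkpoint that appears again in the summary at the end of this post:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Marian zh-en checkpoint; requires the sentencepiece package to be installed.
model_checkpoint = "Helsinki-NLP/opus-mt-zh-en"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
```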
```python
tokenizer
```
Take a look at the data: it comes back as a dictionary, and we mainly use the en and zh fields under translation.
```python
s1 = '天下第一美少女, 罢了'
```
Check the output; it looks decent.
Note: what AutoModelForSeq2SeqLM adds compared to AutoModel is the `model.generate` method. Without it, calling `model(**inputs)` would expect you to supply the target-language side yourself.
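A small sketch of what that looks like with the sentence above, assuming the tokenizer and model loaded earlier:

```python
# Translate one sentence; generate() handles the decoding loop for us.
inputs = tokenizer(s1, return_tensors="pt")
generated_ids = model.generate(**inputs, max_length=128)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```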
If you are using a multilingual tokenizer such as mBART, mBART-50, or M2M100, you will need to set the language codes of your inputs and targets in the tokenizer by setting `tokenizer.src_lang` and `tokenizer.tgt_lang` to the right values.
- If you use a multilingual model, you have to specify the source- and target-language parameters (sketched below).
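This does not apply to the Marian checkpoint used here, but as an illustration with mBART-50 (language codes taken from its documentation):

```python
from transformers import AutoTokenizer

# mBART-50 needs explicit source/target language codes set on the tokenizer.
mbart_tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
mbart_tokenizer.src_lang = "zh_CN"  # source: Chinese
mbart_tokenizer.tgt_lang = "en_XX"  # target: English
```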
Preprocessing
```python
split_datasets = raw_datasets["train"].train_test_split(train_size=0.9, seed=20)
```
An HF Dataset can call `.train_test_split(train_size=0.9, seed=20)` directly.
- An HF Dataset can also be converted straight to a DataFrame, so you can use it together with scikit-learn.
Rename the test split to validation.
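A sketch of that split and rename; the dataset name is an assumption here (any zh-en corpus with a translation column works, e.g. opus100):

```python
from datasets import load_dataset

# Assumption: opus100 "en-zh" yields examples like {"translation": {"en": ..., "zh": ...}}.
raw_datasets = load_dataset("opus100", "en-zh")

split_datasets = raw_datasets["train"].train_test_split(train_size=0.9, seed=20)
split_datasets["validation"] = split_datasets.pop("test")  # rename test -> validation
```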
DataCollatorForSeq2Seq
```python
max_length = 128
```
We don’t pay attention to the attention mask of the targets, as the model won’t expect it. Instead, the labels corresponding to a padding token should be set to -100 so they are ignored in the loss computation. This will be done by our data collator later on since we are applying dynamic padding, but if you use padding here, you should adapt the preprocessing function to set all labels that correspond to the padding token to -100.
We don't add padding or a mask for the labels here. Later, the padded label positions will be set to -100 so they don't contribute to the loss; all of that happens in the next step.
PS: I hit a bug today, probably because I hadn't updated to the latest version: after tokenization there was no labels field. If you run into it, you can swap in the following replacement.
```python
def preprocess_function(examples):
```
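A sketch of both variants, assuming the tokenizer and split from above: the `text_target=` form for recent versions of transformers, and an `as_target_tokenizer()` fallback for older ones (presumably the replacement the bug calls for):

```python
max_length = 128

def preprocess_function(examples):
    # Recent transformers: text_target= produces the "labels" field directly.
    inputs = [ex["zh"] for ex in examples["translation"]]
    targets = [ex["en"] for ex in examples["translation"]]
    return tokenizer(inputs, text_target=targets, max_length=max_length, truncation=True)

def preprocess_function_old(examples):
    # Older transformers: tokenize the targets inside as_target_tokenizer().
    inputs = [ex["zh"] for ex in examples["translation"]]
    targets = [ex["en"] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_length, truncation=True)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)
```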
```python
from transformers import DataCollatorForSeq2Seq
```
- You can see that the padded positions have all become -100; PyTorch has the same convention (see my earlier post on the Transformer).
This is all done by a `DataCollatorForSeq2Seq`. Like the `DataCollatorWithPadding`, it takes the `tokenizer` used to preprocess the inputs, but it also takes the `model`. This is because this data collator will also be responsible for preparing the decoder input IDs, which are shifted versions of the labels with a special token at the beginning. Since this shift is done slightly differently for different architectures, the `DataCollatorForSeq2Seq` needs to know the `model` object.
Quite involved.
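A quick sketch of what that looks like in practice, assuming the model and tokenized datasets from above:

```python
from transformers import DataCollatorForSeq2Seq

# The collator needs the model so it can build decoder_input_ids (shifted labels).
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 3)])
print(batch.keys())      # input_ids, attention_mask, labels, decoder_input_ids
print(batch["labels"])   # label padding shows up as -100
```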
Metrics
One weakness with BLEU is that it expects the text to already be tokenized, which makes it difficult to compare scores between models that use different tokenizers. So instead, the most commonly used metric for benchmarking translation models today is SacreBLEU, which addresses this weakness (and others) by standardizing the tokenization step.
- Here we use SacreBLEU as the scoring metric.
```
!pip install sacrebleu
```
Example 1
```python
predictions = [
```
Example 2
```python
predictions = ["This This This This"]
```
This gets a BLEU score of 46.75, which is rather good; for reference, the original Transformer model in the “Attention Is All You Need” paper achieved a BLEU score of 41.8 on a similar translation task between English and French! (For more information about the individual metrics, like `counts` and `bp`, see the SacreBLEU repository.)
- Already stronger than Optimus Prime. The other fields are as follows (illustrated in the sketch after this list):
- score: The BLEU score.
- counts: List of counts of correct ngrams, 1 <= n <= max_ngram_order
- totals: List of counts of total ngrams, 1 <= n <= max_ngram_order
- precisions: List of precisions, 1 <= n <= max_ngram_order
- bp: The brevity penalty.
- sys_len: The cumulative system length.
- ref_len: The cumulative reference length.
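Putting the two examples together, a sketch of the metric call; I'm using the evaluate library here, while older code may use datasets.load_metric("sacrebleu") instead:

```python
import evaluate

metric = evaluate.load("sacrebleu")

predictions = ["This plugin lets you translate web pages between several languages automatically."]
references = [
    ["This plugin allows you to automatically translate web pages between several languages."]
]
result = metric.compute(predictions=predictions, references=references)
print(result["score"])                                    # the BLEU score itself
print(result["counts"], result["precisions"], result["bp"])
```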
Compute_metrics
```python
import numpy as np
```
- Because decode handles pad_token automatically, we use `np.where` to replace every -100 with pad_token (see the sketch below).
- `numpy.where(condition, x, y)`: positions where the condition does not hold are filled from y.
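A sketch of the resulting compute_metrics, assuming the tokenizer and the SacreBLEU metric object from above:

```python
import numpy as np

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # Some models return extra outputs alongside the predictions.
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # The tokenizer cannot decode -100, so swap it back to the real pad token first.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}
```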
Training Loop
First a look at Seq2SeqTrainer, then back to the custom loop.
Seq2SeqTrainer
This part isn't the focus, but some of the details are worth noting.
```python
from transformers import Seq2SeqTrainingArguments
```
Define the training arguments:
- We don’t set any regular evaluation, as evaluation takes a while; we will just evaluate our model once before training and after.
  - Since we use a custom metric, we compute the scores once before training and once again after training finishes.
- We set `fp16=True`, which speeds up training on modern GPUs.
- We set `predict_with_generate=True`, as discussed above.
  - The decoder performs inference by predicting tokens one by one, something that's implemented behind the scenes in 🤗 Transformers by the `generate()` method. The `Seq2SeqTrainer` will let us use that method for evaluation if we set `predict_with_generate=True`.
- We use `push_to_hub=True` to upload the model to the Hub at the end of each epoch. These arguments come together in the sketch after this list.
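A sketch of those arguments (the output directory and hyperparameter values are my assumptions):

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    "opus-mt-zh-en-finetuned",     # assumed output/repo name
    evaluation_strategy="no",      # no regular evaluation during training
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,    # evaluate with generate()
    fp16=True,                     # mixed precision on modern GPUs
    push_to_hub=True,              # upload to the Hub on each save
)
```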
```python
from transformers import Seq2SeqTrainer
```
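A sketch of wiring everything into the trainer and scoring before and after fine-tuning, assuming the pieces defined above:

```python
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

print(trainer.evaluate(max_length=max_length))  # BLEU before fine-tuning
trainer.train()
print(trainer.evaluate(max_length=max_length))  # BLEU after fine-tuning
```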
That’s a nearly 14-point improvement, which is great.
Custom Training Loop
Now for the important part.
```python
from torch.utils.data import DataLoader
```
The above is the standard setup.
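For reference, a minimal sketch of that standard setup (batch sizes are assumptions):

```python
from torch.utils.data import DataLoader

tokenized_datasets.set_format("torch")
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,  # dynamic padding, -100 labels, decoder_input_ids
    batch_size=8,
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"],
    collate_fn=data_collator,
    batch_size=8,
)
```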
```python
from accelerate import Accelerator
```
Both of the following points need attention:
- Once we have all those objects, we can send them to the `accelerator.prepare()` method. Remember that if you want to train on TPUs in a Colab notebook, you will need to move all of this code into a training function, and that shouldn't execute any cell that instantiates an `Accelerator`.
  - In other words, don't instantiate the `Accelerator` before all the components are ready to be handed over to the TPU.
- We can use its length to compute the number of training steps. Remember we should always do this after preparing the dataloader, as that method will change the length of the `DataLoader`. Both points appear in the sketch after this list.
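A sketch of both points, assuming the dataloaders above (the optimizer, epoch count, and scheduler type are assumptions):

```python
from torch.optim import AdamW
from accelerate import Accelerator
from transformers import get_scheduler

optimizer = AdamW(model.parameters(), lr=2e-5)

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

# Compute the step count only after prepare(), since it may change len(train_dataloader).
num_train_epochs = 3
num_training_steps = num_train_epochs * len(train_dataloader)

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
```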
```python
def postprocess(predictions, labels):
```
Before the actual training starts, define a post-processing function that turns our predictions and labels into the text SacreBLEU expects.
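A sketch of that post-processing step, assuming the tokenizer from above; references are wrapped in lists because SacreBLEU accepts multiple references per prediction:

```python
import numpy as np

def postprocess(predictions, labels):
    predictions = predictions.cpu().numpy()
    labels = labels.cpu().numpy()

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 in the labels, since the tokenizer cannot decode it.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    return decoded_preds, decoded_labels
```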
```python
from tqdm.auto import tqdm
```
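A condensed sketch of the loop itself, assuming everything prepared above (the generation length and logging are assumptions; saving and pushing to the Hub are left out):

```python
import torch
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        accelerator.backward(outputs.loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation: generate token by token, then gather results across processes.
    model.eval()
    for batch in tqdm(eval_dataloader):
        with torch.no_grad():
            generated_tokens = accelerator.unwrap_model(model).generate(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
                max_length=128,
            )
        labels = batch["labels"]

        # Pad predictions and labels before gathering them across processes.
        generated_tokens = accelerator.pad_across_processes(
            generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
        )
        labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)

        decoded_preds, decoded_labels = postprocess(
            accelerator.gather(generated_tokens), accelerator.gather(labels)
        )
        metric.add_batch(predictions=decoded_preds, references=decoded_labels)

    results = metric.compute()
    print(f"epoch {epoch}, BLEU score: {results['score']:.2f}")
```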
Finally, test the output with a pipeline.
```python
from transformers import pipeline
```
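A sketch of that check, assuming the fine-tuned model and tokenizer objects from above; for a single-pair Marian model the plain "translation" task should be enough:

```python
from transformers import pipeline

translator = pipeline("translation", model=model, tokenizer=tokenizer)
print(translator(s1))
```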
Summary
- First, the pipeline:
```
!pip install sacrebleu
```
```python
CONFIG = {"seed": 2021,
```
```python
model_name = "Helsinki-NLP/opus-mt-zh-en"
```