HF Course 03: The Fine-tuning Paradigm
data
In this section we will use as an example the MRPC (Microsoft Research Paraphrase Corpus) dataset, introduced in a paper by William B. Dolan and Chris Brockett. The dataset consists of 5,801 pairs of sentences, with a label indicating if they are paraphrases or not (i.e., if both sentences mean the same thing). We’ve selected it for this chapter because it’s a small dataset, so it’s easy to experiment with training on it.
Let’s focus on the MRPC dataset! This is one of the 10 datasets composing the GLUE benchmark.
We use MRPC because it is small and easy to experiment with.
It is part of the GLUE benchmark.
Next, let's take a look at the data.
```python
from datasets import load_dataset

# MRPC is one of the GLUE tasks
raw_datasets = load_dataset("glue", "mrpc")

raw_train_dataset = raw_datasets["train"]
raw_train_dataset.features
```
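For example, the `label` column is a `ClassLabel`, so the id-to-name mapping can be read straight from the features:

```python
label_feature = raw_train_dataset.features["label"]
label_feature.names       # ['not_equivalent', 'equivalent']
label_feature.int2str(1)  # 'equivalent'
```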
Preprocessing
Calling the tokenizer directly
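Before calling it, a tokenizer has to be loaded; a minimal setup, assuming a BERT base checkpoint as in the course, would be:

```python
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"  # assumed checkpoint; swap in another BERT variant if you prefer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
```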
```python
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)
```
This works well, but it has the disadvantage of returning a dictionary (with our keys, `input_ids`, `attention_mask`, and `token_type_ids`, and values that are lists of lists). It will also only work if you have enough RAM to store your whole dataset during the tokenization (whereas the datasets from the 🤗 Datasets library are Apache Arrow files stored on the disk, so you only keep the samples you ask for loaded in memory).
- Memory-heavy: the whole dataset must fit in RAM during tokenization.
The `dataset.map()` approach
```python
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
```
The `tokenizer` works on lists of pairs of sentences, as seen before. This will allow us to use the option `batched=True` in our call to `map()`, which will greatly speed up the tokenization. The `tokenizer` is backed by a tokenizer written in Rust from the 🤗 Tokenizers library. This tokenizer can be very fast, but only if we give it lots of inputs at once.
- Use `batched=True` in `map()` so the fast (Rust) tokenizer gets many inputs at once.
This is because padding all the samples to the maximum length is not efficient: it’s better to pad the samples when we’re building a batch, as then we only need to pad to the maximum length in that batch, and not the maximum length in the entire dataset. This can save a lot of time and processing power when the inputs have very variable lengths!
- Pad only to the longest sequence within each batch, using `DataCollatorWithPadding` (introduced in the next subsection).
You can even use multiprocessing when applying your preprocessing function with `map()` by passing along a `num_proc` argument. We didn’t do this here because the 🤗 Tokenizers library already uses multiple threads to tokenize our samples faster, but if you are not using a fast tokenizer backed by this library, this could speed up your preprocessing.
- Multiprocessing: only worth it with a slow tokenizer (see the sketch below).
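If you are stuck with a slow, pure-Python tokenizer, the multiprocessing variant is a one-argument change (a sketch; `num_proc=4` is an arbitrary choice):

```python
# Not needed with a fast Rust-backed tokenizer, which already parallelizes internally
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, num_proc=4)
```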
Note that if you’re training on a TPU, dynamic padding can cause problems: TPUs prefer fixed shapes, even when that requires extra padding.
- TPUs prefer constant shapes, so it's fine even if short samples end up with a lot of padding.
Dynamic padding
The function that is responsible for putting together samples inside a batch is called a collate function. It’s an argument you can pass when you build a `DataLoader`, the default being a function that will just convert your samples to PyTorch tensors and concatenate them (recursively if your elements are lists, tuples, or dictionaries). This won’t be possible in our case since the inputs we have won’t all be of the same size.
- A data collator can be seen as a function that you pass to PyTorch's `DataLoader` via the `collate_fn` argument.
```python
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```
A quick demo of the collator in action:
```python
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]  # varying lengths

batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}  # everything padded to the longest sample in the batch
```
Fine-tune
Trainer
First, define the `TrainingArguments`.
The first step before we can define our `Trainer` is to define a `TrainingArguments` class that will contain all the hyperparameters the `Trainer` will use for training and evaluation. The only argument you have to provide is a directory where the trained model will be saved, as well as the checkpoints along the way. For all the rest, you can leave the defaults, which should work pretty well for a basic fine-tuning.
```python
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")
```
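The model whose instantiation triggers the warning described next is loaded like this (assuming the same `checkpoint` as in the tokenizer setup):

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
```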
You will notice that unlike in Chapter 2, you get a warning after instantiating this pretrained model. This is because BERT has not been pretrained on classifying pairs of sentences, so the head of the pretrained model has been discarded and a new head suitable for sequence classification has been added instead. The warnings indicate that some weights were not used (the ones corresponding to the dropped pretraining head) and that some others were randomly initialized (the ones for the new head). It concludes by encouraging you to train the model, which is exactly what we are going to do now.
- If you load a checkpoint that already has a head for the task (through the `AutoModelFor...` class matching that task), you don't get this warning, because the final classification weights are loaded for you as well, which also gives fine-tuning a bit of a head start. The "head" mentioned above presumably refers to those final classifier layers.
Now we can train.
```python
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
```
Note that when you pass the `tokenizer` as we did here, the default `data_collator` used by the `Trainer` will be a `DataCollatorWithPadding` as defined previously, so you can skip the line `data_collator=data_collator` in this call.
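Training is then kicked off with a single call:

```python
trainer.train()
```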
This will start the fine-tuning (which should take a couple of minutes on a GPU) and report the training loss every 500 steps.
- If you don't pass a collator, the default is already `DataCollatorWithPadding`; declaring it explicitly is still nicer for readability.
- The training loss is reported back every 500 steps.
Let's see how good the results actually are.
Evaluation
```python
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)
```
- The output of `predict()` is just the two-class logits; taking an `argmax` over them gives the predicted class index.
```python
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)
```
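The score itself comes from the GLUE MRPC metric; a sketch using the 🤗 Evaluate library (older course versions used `load_metric` from 🤗 Datasets):

```python
import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)
# returns a dict with 'accuracy' and 'f1'
```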
The table in the BERT paper reported an F1 score of 88.9 for the base model. That was the `uncased` model while we are currently using the `cased` model, which explains the better result.
- As noted above: the power of fine-tuning!
```python
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
```
Wrapping up the code
```python
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
```
- Here we set evaluation to run once per epoch, using the `compute_metrics` function wrapped above as the metric; the rebuilt `Trainer` is sketched below.
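For completeness, the `Trainer` is then rebuilt with `compute_metrics` plugged in, starting from a fresh pretrained model (a sketch following the earlier setup; argument names are the same as before):

```python
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
```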
The `Trainer` will work out of the box on multiple GPUs or TPUs and provides lots of options, like mixed-precision training (use `fp16=True` in your training arguments).
- If your hardware supports float16, training will be faster.
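For example (assuming a GPU with fp16 support):

```python
training_args = TrainingArguments(
    "test-trainer",
    evaluation_strategy="epoch",
    fp16=True,  # enable mixed-precision training
)
```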
Recap
The above is training defined through the `Trainer`, which works well as a quick test run before the real work; next we do the real work in plain PyTorch.
full training
preprocessing
We need to apply a bit of postprocessing to our `tokenized_datasets`, to take care of some things that the `Trainer` did for us automatically.
- Remove the columns corresponding to values the model does not expect (like the `sentence1` and `sentence2` columns).
- Rename the column `label` to `labels` (because the model expects the argument to be named `labels`).
- Set the format of the datasets so they return PyTorch tensors instead of lists.
In short, a bit of postprocessing: remove columns, rename the label column, and return PyTorch tensors.
```python
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names
```
The remove step can also be done inside `dataset.map` by setting `remove_columns=split_datasets["train"].column_names`; remember to check the column names first so you don't delete something you shouldn't.
```python
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)
```
Load the data and check the shapes, as shown below.
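The check is just pulling one batch and looking at the tensor shapes (exact sequence lengths vary because of shuffling and dynamic padding):

```python
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}
# input_ids / attention_mask / token_type_ids / labels, padded to the longest sample in this batch
```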
```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
```
optimizer
```python
from transformers import AdamW  # newer transformers versions: use torch.optim.AdamW instead

optimizer = AdamW(model.parameters(), lr=5e-5)
```
Finally, the learning rate scheduler used by default is just a linear decay from the maximum value (5e-5) to 0. To properly define it, we need to know the number of training steps we will take, which is the number of epochs we want to run multiplied by the number of training batches (which is the length of our training dataloader).
- Personally, I'd recommend the 3e-4 learning rate + cosine schedule combination (a sketch of that variant follows the next block).
```python
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
```
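If you want to try the 3e-4 + cosine combination from the note above, the changes are small (a sketch; these values come from the note, not from the course defaults):

```python
optimizer = AdamW(model.parameters(), lr=3e-4)
lr_scheduler = get_scheduler(
    "cosine",  # cosine decay instead of linear
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
```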
loop & evaluation
The training phase is the usual PyTorch routine; evaluation is swapped out for the HF API.
```python
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
```
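A sketch of the full loop (plain PyTorch for training; the evaluation part accumulates batches into the GLUE metric with `add_batch` and calls `compute()` at the end):

```python
from tqdm.auto import tqdm
import evaluate

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

# Evaluation: feed batches into the metric, then compute at the end
metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    predictions = torch.argmax(outputs.logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()
```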
Accelerate
```diff
+ from accelerate import Accelerator

+ accelerator = Accelerator()
```
Apply the changes above to the code: lines with a minus sign are removed from the original, lines with a plus sign are added.
The first line to add is the import line. The second line instantiates an `Accelerator` object that will look at the environment and initialize the proper distributed setup. 🤗 Accelerate handles the device placement for you, so you can remove the lines that put the model on the device (or, if you prefer, change them to use `accelerator.device` instead of `device`).

Then the main bulk of the work is done in the line that sends the dataloaders, the model, and the optimizer to `accelerator.prepare()`. This will wrap those objects in the proper container to make sure your distributed training works as intended. The remaining changes to make are removing the line that puts the batch on the device (again, if you want to keep this you can just change it to use `accelerator.device`) and replacing `loss.backward()` with `accelerator.backward(loss)`.
- Import and instantiate an `Accelerator`.
- Send the dataloaders, the model, and the optimizer through `accelerator.prepare()`.
- Use `accelerator.backward(loss)` instead of `loss.backward()` (see the sketch below).
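Applying those three changes to the earlier training loop gives roughly the following (a sketch, not the course's literal diff; the scheduler is recreated after `prepare()` because the dataloader length can change in a distributed setup):

```python
from accelerate import Accelerator
from transformers import get_scheduler
from tqdm.auto import tqdm

accelerator = Accelerator()

# prepare() wraps the objects for the current setup (single GPU, multi-GPU, TPU, ...)
# and handles device placement, so the manual .to(device) calls go away
train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        outputs = model(**batch)      # no batch.to(device) needed anymore
        loss = outputs.loss
        accelerator.backward(loss)    # instead of loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
```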
In order to benefit from the speed-up offered by Cloud TPUs, we recommend padding your samples to a fixed length with the `padding="max_length"` and `max_length` arguments of the tokenizer.
- If you are on a TPU, just pad everything to a fixed max length, for example as sketched below.
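A fixed-length variant of the tokenize function could look like this (`max_length=128` is an assumed value, not from the course):

```python
def tokenize_function(example):
    return tokenizer(
        example["sentence1"],
        example["sentence2"],
        padding="max_length",  # pad every sample to the same length
        max_length=128,        # assumed value; pick one that covers your data
        truncation=True,
    )
```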