HF Course 02 API Overview
outputs
Note that the outputs of 🤗 Transformers models behave like namedtuples or dictionaries. You can access the elements by attributes (like we did) or by key (outputs["last_hidden_state"]), or even by index if you know exactly where the thing you are looking for is (outputs[0]).
- The values returned by HF models are mostly namedtuple-like objects or dictionaries; keep that in mind when handling them.
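A quick sketch of the three access patterns; the checkpoint and sentence below are just illustrative, in the spirit of the course examples:

```python
from transformers import AutoModel, AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

inputs = tokenizer("I've been waiting for a HuggingFace course my whole life.", return_tensors="pt")
outputs = model(**inputs)

# All three expressions return the same tensor
print(outputs.last_hidden_state.shape)
print(outputs["last_hidden_state"].shape)
print(outputs[0].shape)
```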
Going from binary to multi-class classification: each class gets its own binary classification, and the output says whether the sample belongs to that class:
[[0.2, 0.8], [0.4, 0.6], [0.7, 0.3]] = [[1], [1], [0]]
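One way to read that example as code, a minimal sketch with made-up tensors:

```python
import torch

# For each of the 3 classes: [p(not this class), p(this class)]
per_class_probs = torch.tensor([[0.2, 0.8], [0.4, 0.6], [0.7, 0.3]])

# argmax over the last dimension gives a 0/1 decision per class
decisions = per_class_probs.argmax(dim=-1)
print(decisions)  # tensor([1, 1, 0])
```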
Config
```python
from transformers import BertConfig, BertModel
```
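A short sketch of how the config and model classes fit together (standard usage; the checkpoint name is just the usual BERT example):

```python
from transformers import BertConfig, BertModel

# Build a config with default hyperparameters
config = BertConfig()

# Build a model from the config: weights are randomly initialized
model = BertModel(config)

# Or load a pretrained checkpoint instead
model = BertModel.from_pretrained("bert-base-cased")
```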
Environment variables
The weights have been downloaded and cached (so future calls to the from_pretrained() method won’t re-download them) in the cache folder, which defaults to ~/.cache/huggingface/transformers. You can customize your cache folder by setting the HF_HOME environment variable.
- Set the environment variable for your current environment (before importing transformers):
import os
os.environ['HF_HOME'] = '~/.cache/huggingface/transformers'
Saving
```python
model.save_pretrained("directory_on_my_computer")
```
If you take a look at the config.json file, you’ll recognize the attributes necessary to build the model architecture. This file also contains some metadata, such as where the checkpoint originated and what 🤗 Transformers version you were using when you last saved the checkpoint.
The pytorch_model.bin file is known as the state dictionary; it contains all your model’s weights. The two files go hand in hand; the configuration is necessary to know your model’s architecture, while the model weights are your model’s parameters.
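A minimal save-and-reload sketch, assuming a BERT checkpoint as in the Config section:

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")
model.save_pretrained("directory_on_my_computer")  # writes config.json and the weights file described above

# Reload later from the local files instead of downloading from the Hub
model = BertModel.from_pretrained("directory_on_my_computer")
```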
Tokenizer
encode
Use tokenizer.tokenize(sequence) to see the result after tokenization:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
```
```python
ids = tokenizer.convert_tokens_to_ids(tokens)
```
You can also go the other way and recover the tokens from the IDs (with tokenizer.convert_ids_to_tokens(ids)), which gives the same result as tokenize(sequence) above.
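A self-contained round-trip sketch, assuming the bert-base-cased checkpoint used above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"

tokens = tokenizer.tokenize(sequence)          # subword tokens as strings
ids = tokenizer.convert_tokens_to_ids(tokens)  # vocabulary indices

# And back again: the same tokens that tokenize() produced
print(tokens)
print(ids)
print(tokenizer.convert_ids_to_tokens(ids))
```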
decode
```python
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
```
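For comparison, calling the tokenizer object directly does both steps at once and also inserts the model's special tokens, which decode makes visible; a sketch assuming the same bert-base-cased tokenizer:

```python
model_inputs = tokenizer("Using a Transformer network is simple")
print(tokenizer.decode(model_inputs["input_ids"]))
# "[CLS] Using a Transformer network is simple [SEP]"
```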
padding
Check tokenizer.pad_token_id to see the ID used for padding.
```python
# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")
```
- Using padding="max_length" directly pads to the model's maximum length, while padding="longest" pads to the longest sequence in the batch; the other options are sketched below.
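The other padding options follow the same pattern; a sketch assuming tokenizer is loaded and sequences is a list of strings:

```python
# Pad up to the model's maximum length (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Pad up to an explicitly specified length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
```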
```python
sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
```
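The comparison then looks roughly like this; a sketch assuming model is a sequence-classification model and tokenizer is loaded as before:

```python
import torch

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
# The second row's logits differ from the sequence2_ids result above,
# because attention also covers the padding token
print(model(torch.tensor(batched_ids)).logits)
```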
This is because the key feature of Transformer models is attention layers that contextualize each token. These will take into account the padding tokens since they attend to all of the tokens of a sequence. To get the same result when passing individual sentences of different lengths through the model or when passing a batch with the same sentences and padding applied, we need to tell those attention layers to ignore the padding tokens. This is done by using an attention mask.
- The results for the two sequences on their own differ from the batched, padded version, because the padding positions also receive the model's attention, and that is not something we want the model to learn from.
ATTENTION MASK
```python
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]
```
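A sketch of passing the mask in, continuing the snippet above (0 marks positions the attention layers should ignore):

```python
import torch

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],  # mask out the padding token in the second sequence
]

outputs = model(
    torch.tensor(batched_ids),
    attention_mask=torch.tensor(attention_mask),
)
# Now the second row's logits match the single-sequence result
print(outputs.logits)
```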
Handling long sequences
Models have different supported sequence lengths, and some specialize in handling very long sequences. Longformer is one example, and another is LED. If you’re working on a task that requires very long sequences, we recommend you take a look at those models.
Usually this is handled by setting truncation and max_length in the tokenizer; it isn't covered here, so go take a look at those models if you need very long inputs.
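Truncation is set the same way as padding; a sketch assuming tokenizer and sequences as above:

```python
# Truncate sequences longer than the model's maximum length
model_inputs = tokenizer(sequences, truncation=True)

# Truncate sequences longer than an explicitly specified length
model_inputs = tokenizer(sequences, truncation=True, max_length=8)
```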