HF Course 02 API Overview
outputs
Note that the outputs of 🤗 Transformers models behave like namedtuples or dictionaries. You can access the elements by attributes (like we did) or by key (outputs["last_hidden_state"]), or even by index if you know exactly where the thing you are looking for is (outputs[0]).
- The values returned by HF models are mostly namedtuple-like objects or dictionaries; keep that in mind when handling them.
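A quick sketch of the three access patterns; the checkpoint and sentence below are just illustrative, in the spirit of the course examples:

```python
from transformers import AutoModel, AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

inputs = tokenizer("I've been waiting for a HuggingFace course my whole life.", return_tensors="pt")
outputs = model(**inputs)

# All three expressions return the same tensor
print(outputs.last_hidden_state.shape)
print(outputs["last_hidden_state"].shape)
print(outputs[0].shape)
```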
Going from binary to multi-class classification: each class gets its own binary classification, and the output says whether the sample belongs to that class:
[[0.2, 0.8], [0.4, 0.6], [0.7, 0.3]] = [[1], [1], [0]]
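One way to read that example as code, a minimal sketch with made-up tensors:

```python
import torch

# For each of the 3 classes: [p(not this class), p(this class)]
per_class_probs = torch.tensor([[0.2, 0.8], [0.4, 0.6], [0.7, 0.3]])

# argmax over the last dimension gives a 0/1 decision per class
decisions = per_class_probs.argmax(dim=-1)
print(decisions)  # tensor([1, 1, 0])
```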
Config
```python
from transformers import BertConfig, BertModel
```
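A short sketch of how the config and model classes fit together (standard usage; the checkpoint name is just the usual BERT example):

```python
from transformers import BertConfig, BertModel

# Build a config with default hyperparameters
config = BertConfig()

# Build a model from the config: weights are randomly initialized
model = BertModel(config)

# Or load a pretrained checkpoint instead
model = BertModel.from_pretrained("bert-base-cased")
```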
Environment variables
The weights have been downloaded and cached (so future calls to the from_pretrained() method won’t re-download them) in the cache folder, which defaults to ~/.cache/huggingface/transformers. You can customize your cache folder by setting the HF_HOME environment variable.
- Set the environment variable for your current environment (before importing transformers):
import os
os.environ['HF_HOME'] = '~/.cache/huggingface/transformers'
Saving
```python
model.save_pretrained("directory_on_my_computer")
```
If you take a look at the config.json file, you’ll recognize the attributes necessary to build the model architecture. This file also contains some metadata, such as where the checkpoint originated and what 🤗 Transformers version you were using when you last saved the checkpoint.
The pytorch_model.bin file is known as the state dictionary; it contains all your model’s weights. The two files go hand in hand; the configuration is necessary to know your model’s architecture, while the model weights are your model’s parameters.
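A minimal save-and-reload sketch, assuming a BERT checkpoint as in the Config section:

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")
model.save_pretrained("directory_on_my_computer")  # writes config.json and the weights file described above

# Reload later from the local files instead of downloading from the Hub
model = BertModel.from_pretrained("directory_on_my_computer")
```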
Tokenizer
encode
Use tokenizer.tokenize(sequence) to see the result after tokenization:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
```
```python
ids = tokenizer.convert_tokens_to_ids(tokens)
```
You can also go the other way and recover the tokens from the IDs (with tokenizer.convert_ids_to_tokens(ids)), which gives the same result as tokenize(sequence) above.
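A self-contained round-trip sketch, assuming the bert-base-cased checkpoint used above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"

tokens = tokenizer.tokenize(sequence)          # subword tokens as strings
ids = tokenizer.convert_tokens_to_ids(tokens)  # vocabulary indices

# And back again: the same tokens that tokenize() produced
print(tokens)
print(ids)
print(tokenizer.convert_ids_to_tokens(ids))
```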
decode
```python
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
```
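For comparison, calling the tokenizer object directly does both steps at once and also inserts the model's special tokens, which decode makes visible; a sketch assuming the same bert-base-cased tokenizer:

```python
model_inputs = tokenizer("Using a Transformer network is simple")
print(tokenizer.decode(model_inputs["input_ids"]))
# "[CLS] Using a Transformer network is simple [SEP]"
```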
padding
Check tokenizer.pad_token_id to see the ID used for padding.
```python
# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")
```
- Using padding="max_length" directly pads to the model's maximum length, while padding="longest" pads to the longest sequence in the batch; the other options are sketched below.
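The other padding options follow the same pattern; a sketch assuming tokenizer is loaded and sequences is a list of strings:

```python
# Pad up to the model's maximum length (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Pad up to an explicitly specified length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
```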
```python
sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
```
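The comparison then looks roughly like this; a sketch assuming model is a sequence-classification model and tokenizer is loaded as before:

```python
import torch

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
# The second row's logits differ from the sequence2_ids result above,
# because attention also covers the padding token
print(model(torch.tensor(batched_ids)).logits)
```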
This is because the key feature of Transformer models is attention layers that contextualize each token. These will take into account the padding tokens since they attend to all of the tokens of a sequence. To get the same result when passing individual sentences of different lengths through the model or when passing a batch with the same sentences and padding applied, we need to tell those attention layers to ignore the padding tokens. This is done by using an attention mask.
- The results for the two sequences on their own differ from the batched, padded version, because the padding positions also receive the model's attention, and that is not something we want the model to learn from.
ATTENTION MASK
```python
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]
```
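A sketch of passing the mask in, continuing the snippet above (0 marks positions the attention layers should ignore):

```python
import torch

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],  # mask out the padding token in the second sequence
]

outputs = model(
    torch.tensor(batched_ids),
    attention_mask=torch.tensor(attention_mask),
)
# Now the second row's logits match the single-sequence result
print(outputs.logits)
```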
Handling long sequences
Models have different supported sequence lengths, and some specialize in handling very long sequences. Longformer is one example, and another is LED. If you’re working on a task that requires very long sequences, we recommend you take a look at those models.
Usually this is handled by setting truncation and max_length in the tokenizer; it isn't covered here, so go take a look at those models if you need very long inputs.
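Truncation is set the same way as padding; a sketch assuming tokenizer and sequences as above:

```python
# Truncate sequences longer than the model's maximum length
model_inputs = tokenizer(sequences, truncation=True)

# Truncate sequences longer than an explicitly specified length
model_inputs = tokenizer(sequences, truncation=True, max_length=8)
```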