Unlike before, where we trained from an existing tokenizer, this time we will train a brand-new tokenizer from scratch on our corpus.
We start by building a WordPiece-style tokenizer.
Loading the dataset

```python
from datasets import load_dataset

dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")


def get_training_corpus():
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]


with open("wikitext-2.txt", "w", encoding="utf-8") as f:
    for i in range(len(dataset)):
        f.write(dataset[i]["text"] + "\n")
```
Loading the building blocks

```python
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
```
We load the model building block from the tokenizers library and instantiate it with the WordPiece algorithm. The unk_token argument sets the token emitted for words the model has never seen ([UNK]); we can also pass max_input_chars_per_word to cap the maximum word length (anything longer is mapped to the unknown token).
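For instance, a minimal sketch of passing that option (the value 100 is an illustrative choice, not taken from the original post):

```python
from tokenizers import Tokenizer, models

# Words longer than 100 characters are replaced by [UNK] as a whole
tokenizer = Tokenizer(
    models.WordPiece(unk_token="[UNK]", max_input_chars_per_word=100)
)
```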
Setting the normalizer

Here we could simply reuse BERT's normalizer, which lowercases all letters, strips accents (strip_accents), removes control characters, collapses runs of whitespace into a single space, and places spaces around Chinese characters.
```python
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)

# Or build a custom normalizer by hand:
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)

print(tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))
```
In the custom version above, we use Sequence to compose our own normalization rules.
Pre-tokenization

As above, we could reuse BERT's setup via tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer(). Below is a custom version:
```python
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")
'''
[('Let', (0, 3)), ("'", (3, 4)), ('s', (4, 5)), ('test', (6, 10)), ('my', (11, 13)),
 ('pre', (14, 17)), ('-', (17, 18)), ('tokenizer', (18, 27)), ('.', (27, 28))]
'''
```
Whitespace() splits on whitespace and punctuation. If you want to split on whitespace only, use the following instead:
```python
pre_tokenizer = pre_tokenizers.WhitespaceSplit()
pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")
'''
[("Let's", (0, 5)), ('test', (6, 10)), ('my', (11, 13)), ('pre-tokenizer.', (14, 28))]
'''
```
You can also use Sequence to combine several pre-tokenizers:
```python
pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Punctuation()]
)
pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")
'''
[('Let', (0, 3)), ("'", (3, 4)), ('s', (4, 5)), ('test', (6, 10)), ('my', (11, 13)),
 ('pre', (14, 17)), ('-', (17, 18)), ('tokenizer', (18, 27)), ('.', (27, 28))]
'''
```
Trainer

Before training we need to declare the special tokens, because they do not appear in the corpus and would otherwise never make it into the vocabulary.
```python
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)
```
Besides vocab_size and special_tokens, we can set min_frequency (the number of times a token must appear to be included in the vocabulary) or change continuing_subword_prefix (if we want to use something different from ##), as the sketch below shows.
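A minimal sketch of passing these options (the concrete values are illustrative, not from the original post):

```python
trainer = trainers.WordPieceTrainer(
    vocab_size=25000,
    special_tokens=special_tokens,
    min_frequency=2,                 # keep only tokens seen at least twice
    continuing_subword_prefix="##",  # prefix marking non-initial subwords
)
```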
Start training:
```python
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

tokenizer.model = models.WordPiece(unk_token="[UNK]")
tokenizer.train(["wikitext-2.txt"], trainer=trainer)
```
The first call trains from the generator defined above; the second resets the model and trains from the "wikitext-2.txt" file instead.
At this point our tokenizer has all the usual methods, such as encode:
```python
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)
```
Post-processing

Finally we need to wrap our tokens in the special format, e.g. [CLS] … [SEP] … [SEP]. First we look up the ids of the special tokens we need:
```python
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)
```
Next we set up our template:
```python
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)
```
The template needs two modes: single for a single sentence and pair for a pair of sentences (the :0/:1 suffixes set the token type ids). Finally we pass the ids of the special tokens used in the template.
Let's check the result:
```python
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences.")
print(encoding.tokens)
print(encoding.type_ids)
'''
['[CLS]', 'let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '...', '[SEP]', 'on', 'a', 'pair', 'of', 'sentences', '.', '[SEP]']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
'''
```
Decoder

Next we configure the decoder:
```python
tokenizer.decoder = decoders.WordPiece(prefix="##")
tokenizer.decode(encoding.ids)
```
Saving & loading the custom tokenizer

```python
tokenizer.save("tokenizer.json")
new_tokenizer = Tokenizer.from_file("tokenizer.json")
```
Converting to a fast tokenizer

To use this tokenizer in 🤗 Transformers, we have to wrap it in a PreTrainedTokenizerFast. We can either use that generic class or, if our tokenizer corresponds to an existing model, use the model-specific class (here, BertTokenizerFast). If you are building a brand-new tokenizer, you have to use the generic class.
```python
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)
```
Here you can either pass the tokenizer object directly or load it from the saved file; note that the special tokens must be declared again, since the wrapper does not pick them up automatically.
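For the file-based variant, a minimal sketch, assuming the tokenizer was saved to tokenizer.json as above:

```python
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",  # load the saved tokenizer instead of passing the object
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)
```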
A BPE tokenizer

```python
tokenizer = Tokenizer(models.BPE())

tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test pre-tokenization!")
'''
[('Let', (0, 3)), ("'s", (3, 5)), ('Ġtest', (5, 10)), ('Ġpre', (10, 14)), ('-', (14, 15)),
 ('tokenization', (15, 27)), ('!', (27, 28))]
'''
```
GPT-2 only needs its single start/end special token, <|endoftext|>:
```python
trainer = trainers.BpeTrainer(vocab_size=25000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

tokenizer.model = models.BPE()
tokenizer.train(["wikitext-2.txt"], trainer=trainer)

encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)
'''
['L', 'et', "'", 's', 'Ġtest', 'Ġthis', 'Ġto', 'ken', 'izer', '.']
'''
```
```python
tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)

sentence = "Let's test this tokenizer."
encoding = tokenizer.encode(sentence)
start, end = encoding.offsets[4]
sentence[start:end]
```
The trim_offsets=False option tells the post-processor to leave the offsets of tokens that begin with 'Ġ' as they are: the start of the offset then points to the space before the word rather than its first character (since the space is technically part of the token). In the text we just encoded, 'Ġtest' is the token at index 4, so the slice above returns the word together with its leading space. In short, trim_offsets controls whether that leading space is trimmed out of the offsets.
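To make the contrast concrete, here is a small sketch of what I would expect with the default trimming behaviour (the expected slices are my annotations, not output from the original post):

```python
# With trim_offsets=False (as set above), the offset keeps the leading space,
# so sentence[start:end] should give " test".

# With the default trim_offsets=True, the space is trimmed out of the offset:
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)
encoding = tokenizer.encode(sentence)
start, end = encoding.offsets[4]
sentence[start:end]  # expected: "test"

# Restore the setting used in the rest of the post
tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)
```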
```python
tokenizer.decoder = decoders.ByteLevel()
tokenizer.decode(encoding.ids)
```
Wrapping it for 🤗 Transformers:
```python
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<|endoftext|>",
    eos_token="<|endoftext|>",
)

# Or use the model-specific class:
from transformers import GPT2TokenizerFast

wrapped_tokenizer = GPT2TokenizerFast(tokenizer_object=tokenizer)
```
A Unigram tokenizer

```python
tokenizer = Tokenizer(models.Unigram())

from tokenizers import Regex

tokenizer.normalizer = normalizers.Sequence(
    [
        normalizers.Replace("``", '"'),
        normalizers.Replace("''", '"'),
        normalizers.NFKD(),
        normalizers.StripAccents(),
        normalizers.Replace(Regex(" {2,}"), " "),
    ]
)
```
The first two Replace normalizers turn the `` and '' quote marks into regular double quotes; the last one collapses multiple spaces into a single space.
```python
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test the pre-tokenizer!")
```
```python
special_tokens = ["<cls>", "<sep>", "<unk>", "<pad>", "<mask>", "<s>", "</s>"]
trainer = trainers.UnigramTrainer(
    vocab_size=25000, special_tokens=special_tokens, unk_token="<unk>"
)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

tokenizer.model = models.Unigram()
tokenizer.train(["wikitext-2.txt"], trainer=trainer)
```
```python
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)
'''
['▁Let', "'", 's', '▁test', '▁this', '▁to', 'ken', 'izer', '.']
'''
```
```python
cls_token_id = tokenizer.token_to_id("<cls>")
sep_token_id = tokenizer.token_to_id("<sep>")
print(cls_token_id, sep_token_id)

tokenizer.post_processor = processors.TemplateProcessing(
    single="$A:0 <sep>:0 <cls>:2",
    pair="$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",
    special_tokens=[("<sep>", sep_token_id), ("<cls>", cls_token_id)],
)

encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences!")
print(encoding.tokens)
print(encoding.type_ids)
'''
['▁Let', "'", 's', '▁test', '▁this', '▁to', 'ken', 'izer', '.', '.', '.', '<sep>', '▁', 'on', '▁', 'a', '▁pair', '▁of', '▁sentence', 's', '!', '<sep>', '<cls>']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2]
'''
```
XLNet pads on the left and places <cls> at the end, and we have to declare both to the fast tokenizer wrapper; a short usage sketch after the wrapping code below shows the effect of left padding.
```python
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    cls_token="<cls>",
    sep_token="<sep>",
    mask_token="<mask>",
    padding_side="left",
)
```
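As a quick check of the left padding, a small usage sketch (the example sentences are placeholders of my own, not from the original post):

```python
batch = wrapped_tokenizer(
    ["Short sentence.", "A somewhat longer pair of sentences here."],
    padding=True,
)
# With padding_side="left", the <pad> ids of the shorter sequence come first
print(batch["input_ids"])
```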
Alternatively, use the model-specific class:

```python
from transformers import XLNetTokenizerFast

wrapped_tokenizer = XLNetTokenizerFast(tokenizer_object=tokenizer)
```