```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are you?")
# [('▁Hello,', (0, 6)), ('▁how', (7, 10)), ('▁are', (11, 14)), ('▁you?', (16, 20))]
```
Overview of the three tokenization algorithms
As shown above, different models rely on different tokenization algorithms.
SentencePiece
It is usually paired with the Unigram algorithm and does not require a pre-tokenization step, which makes it well suited to languages such as Chinese and Japanese, where words are not separated by spaces.
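As a quick illustration, here is a minimal sketch assuming the xlnet-base-cased checkpoint (which uses SentencePiece with Unigram) can be downloaded; the exact token splits depend on its learned vocabulary:

```python
from transformers import AutoTokenizer

# XLNet's tokenizer is SentencePiece-based (Unigram): spaces are encoded as "▁"
# and no rule-based pre-tokenization is required.
xlnet_tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
print(xlnet_tokenizer.tokenize("Hello, how are you?"))
# Expected to look roughly like ['▁Hello', ',', '▁how', '▁are', '▁you', '?']
```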
Algorithm overview
| Model | BPE | WordPiece | Unigram |
| --- | --- | --- | --- |
| Training | Starts from a small vocabulary and learns rules to merge tokens | Starts from a small vocabulary and learns rules to merge tokens | Starts from a large vocabulary and learns rules to remove tokens |
| Training step | Merges the tokens corresponding to the most common pair | Merges the tokens corresponding to the pair with the best score based on the frequency of the pair, privileging pairs where each individual token is less frequent | Removes all the tokens in the vocabulary that will minimize the loss computed on the whole corpus |
| Learns | Merge rules and a vocabulary | Just a vocabulary | A vocabulary with a score for each token |
| Encoding | Splits a word into characters and applies the merges learned during training | Finds the longest subword starting from the beginning that is in the vocabulary, then does the same for the rest of the word | Finds the most likely split into tokens, using the scores learned during training |
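To see how the three encoding strategies differ in practice, here is a small comparison sketch; the checkpoint names are illustrative choices (gpt2 uses byte-level BPE, bert-base-uncased uses WordPiece, xlnet-base-cased uses SentencePiece/Unigram), and the exact splits depend on each model's vocabulary:

```python
from transformers import AutoTokenizer

# One representative checkpoint per algorithm.
checkpoints = {
    "BPE": "gpt2",
    "WordPiece": "bert-base-uncased",
    "Unigram": "xlnet-base-cased",
}
for algo, name in checkpoints.items():
    tok = AutoTokenizer.from_pretrained(name)
    print(algo, tok.tokenize("tokenization"))
```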
BPE
In brief
BPE is short for Byte-Pair Encoding, and it works in three steps:
1. Extract every unique character in the corpus, e.g. the 26 English letters plus punctuation and other special symbols; these form the base vocabulary.
2. Starting from these base characters, count how often each adjacent pair of symbols occurs, and add the most frequent pair to the vocabulary as a new merged token (see the sketch after these steps).
3. Repeat step 2 until the vocabulary reaches the size you set.
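A toy sketch of step 2, with made-up word frequencies, showing how the most frequent pair is found:

```python
from collections import Counter

# Made-up word frequencies for illustration only.
toy_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
splits = {word: list(word) for word in toy_freqs}

# Count how often each adjacent pair of symbols appears across the corpus.
pair_freqs = Counter()
for word, freq in toy_freqs.items():
    symbols = splits[word]
    for a, b in zip(symbols, symbols[1:]):
        pair_freqs[(a, b)] += freq

print(pair_freqs.most_common(1))
# [(('u', 'g'), 20)] -> 'u' and 'g' are merged into the new token 'ug'
```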
The GPT-2 and RoBERTa tokenizers (which are pretty similar) have a clever way to deal with this: they don’t look at words as being written with Unicode characters, but with bytes. This way the base vocabulary has a small size (256), but every character you can think of will still be included and not end up being converted to the unknown token. This trick is called byte-level BPE.
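A quick way to see this behavior (a sketch; the printed byte-level tokens are intentionally hard to read, since non-ASCII bytes are remapped to printable stand-ins):

```python
from transformers import AutoTokenizer

gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Non-ASCII characters are broken down into byte-level tokens instead of
# being mapped to an unknown token.
print(gpt2_tokenizer.tokenize("héllo 😀"))
```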
```python
corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]
```
Count word frequencies
```python
from transformers import AutoTokenizer
from collections import defaultdict

tokenizer = AutoTokenizer.from_pretrained("gpt2")

word_freqs = defaultdict(int)

for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1
```
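The tokenize function below relies on a merges dict learned during training, which this excerpt skips. Here is a minimal training sketch in the same spirit as the counting code above (it reuses word_freqs and defaultdict from the previous snippet; the target vocab_size of 50 is an arbitrary choice):

```python
# Build the base alphabet from all characters seen in the corpus.
alphabet = sorted({letter for word in word_freqs for letter in word})
vocab = ["<|endoftext|>"] + alphabet.copy()

# Start with every word split into individual characters.
splits = {word: [c for c in word] for word in word_freqs}

def compute_pair_freqs(splits):
    # Frequency of each adjacent symbol pair, weighted by word frequency.
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        for i in range(len(split) - 1):
            pair_freqs[(split[i], split[i + 1])] += freq
    return pair_freqs

def merge_pair(a, b, splits):
    # Replace every occurrence of the pair (a, b) with the merged symbol a+b.
    for word in word_freqs:
        split = splits[word]
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                split = split[:i] + [a + b] + split[i + 2 :]
            else:
                i += 1
        splits[word] = split
    return splits

vocab_size = 50  # arbitrary target size for this toy corpus
merges = {}
while len(vocab) < vocab_size:
    pair_freqs = compute_pair_freqs(splits)
    best_pair = max(pair_freqs, key=pair_freqs.get)
    splits = merge_pair(*best_pair, splits)
    merges[best_pair] = best_pair[0] + best_pair[1]
    vocab.append(merges[best_pair])
```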
```python
def tokenize(text):
    pre_tokenize_result = tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    splits = [[l for l in word] for word in pre_tokenized_text]
    # splits = [list(word) for word in pre_tokenized_text]
    for pair, merge in merges.items():
        for idx, split in enumerate(splits):
            i = 0
            while i < len(split) - 1:
                if split[i] == pair[0] and split[i + 1] == pair[1]:
                    split = split[:i] + [merge] + split[i + 2 :]
                else:
                    i += 1
            splits[idx] = split

    return sum(splits, [])
```
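A quick usage check once merges has been learned; the exact output depends on the merges learned from the toy corpus, so it is not shown here:

```python
# Words are rebuilt from their characters by replaying the merge rules in order.
print(tokenize("This is not a token."))
```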
This hand-rolled implementation is rarely used in practice, so it is fine if you do not fully understand it.
WordPiece
It is used in BERT and BERT-derived models such as DistilBERT, MobileBERT, Funnel Transformers, and MPNet.
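To see WordPiece's characteristic "##" continuation prefix, here is a sketch assuming the bert-base-uncased checkpoint; the exact split depends on its vocabulary:

```python
from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# WordPiece greedily finds the longest subword in the vocabulary, then marks
# the remaining word-internal pieces with a "##" prefix.
print(bert_tokenizer.tokenize("tokenization"))
# Expected to look roughly like ['token', '##ization']
```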