Main Types of Tokenizers

spaCy and Moses are two popular rule-based tokenizers.


Subword

Word-level vocabularies are too large, while character-level tokens carry very little semantic meaning.

So to get the best of both worlds, transformers models use a hybrid between word-level and character-level tokenization called subword tokenization.

  • Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords.
    • Low-frequency words (e.g., proper nouns) are split into smaller, meaningful subwords.

For example:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.tokenize("I have a new GPU!")

# ["i", "have", "a", "new", "gp", "##u", "!"]
  • A smaller vocabulary means giving up more word-level meaning; balancing semantics against computational resources leads to the algorithms below.

Byte-Pair Encoding (BPE)

Used by GPT, GPT-2, RoBERTa, XLM, and FlauBERT.

BPE was introduced in Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015).

  • After pre-tokenization, a set of unique words has been created and the frequency of each word it occurred in the training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so until the vocabulary has attained the desired vocabulary size. Note that the desired vocabulary size is a hyperparameter to define before training the tokenizer.
    • First count word frequencies → build a base vocabulary from every symbol appearing in those words → merge the most frequent symbol pair into a new symbol and add it to the vocabulary → repeat the merging on the updated vocabulary → stop once the vocabulary reaches the desired vocabulary size, which is a tunable hyperparameter.

For example, where the numbers denote word frequencies:

("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)

  1. First, the base vocabulary is ["b", "g", "h", "n", "p", "s", "u"].

  2. The pair "u" + "g" occurs 20 times (10 + 5 + 5), so "ug" is the first symbol merged and added to the vocabulary:

    • ("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)
  3. The next most frequent pair is "u" + "n" (16 times), merged into "un", followed by "h" + "ug" (15 times), merged into "hug":

    • ("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)
  • Additionally, the word "bug" would be tokenized to ["b", "ug"], but "mug" would be tokenized as ["<unk>", "ug"], since the symbol "m" is not in the base vocabulary.

The vocabulary size is therefore the base vocabulary size plus the number of merges.

For instance GPT has a vocabulary size of 40,478 since they have 478 base characters and chose to stop training after 40,000 merges.

GPT-2 has a vocabulary size of 50,257, which corresponds to the 256 bytes base tokens, a special end-of-text token and the symbols learned with 50,000 merges.
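
The merge loop is simple enough to sketch directly. Below is a minimal, illustrative Python version of BPE training on the toy corpus above; the helper names and the choice of two merges are assumptions for demonstration, not the actual Hugging Face implementation.

from collections import Counter

# Toy corpus: word (pre-split into base symbols) -> frequency
corpus = {("h", "u", "g"): 10, ("p", "u", "g"): 5, ("p", "u", "n"): 12,
          ("b", "u", "n"): 4, ("h", "u", "g", "s"): 5}

def most_frequent_pair(corpus):
    # Count every adjacent symbol pair, weighted by word frequency
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(corpus, pair):
    # Replace every occurrence of the pair with the merged symbol
    merged = {}
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

num_merges = 2  # assumed; in practice this is roughly vocab_size - base_vocab_size
for _ in range(num_merges):
    pair = most_frequent_pair(corpus)
    print("merging", pair)  # ('u', 'g') first, then ('u', 'n'), matching the example above
    corpus = merge_pair(corpus, pair)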


Unigram

Unigram was introduced in Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Kudo, 2018).

It is used in conjunction with SentencePiece.

  • In contrast to BPE or WordPiece, Unigram initializes its base vocabulary to a large number of symbols and progressively trims down each symbol to obtain a smaller vocabulary.

  • At each training step, the Unigram algorithm defines a loss (often defined as the log-likelihood) over the training data given the current vocabulary and a unigram language model. Then, for each symbol in the vocabulary, the algorithm computes how much the overall loss would increase if the symbol was to be removed from the vocabulary. Unigram then removes p (with p usually being 10% or 20%) percent of the symbols whose loss increase is the lowest, i.e. those symbols that least affect the overall loss over the training data. This process is repeated until the vocabulary has reached the desired size. The Unigram algorithm always keeps the base characters so that any word can be tokenized.

    • Start from a large base vocabulary (always keeping the single characters), define a log-likelihood loss over the training data, compute how much the loss would rise if each symbol were removed, and repeatedly drop the least useful symbols until the vocabulary shrinks to the desired size.
  • Because Unigram is not based on merge rules (in contrast to BPE and WordPiece), the algorithm has several ways of tokenizing new text after training (see the sketch below).
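
To illustrate that last point: once trained, a unigram model scores every possible segmentation of a word and can pick the most probable one (or sample among them). Here is a minimal sketch of that segmentation step; the vocabulary and log-probabilities are made-up toy values, not output from a real tokenizer.

import math

# Assumed toy unigram vocabulary: token -> log-probability (made-up values)
log_prob = {"h": -4.0, "u": -3.5, "g": -4.2, "s": -3.8,
            "hu": -3.0, "ug": -2.2, "hug": -1.5, "gs": -3.6}

def viterbi_segment(word):
    # best[i] = (score, segmentation) for the prefix word[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_prob and best[start][1] is not None:
                score = best[start][0] + log_prob[piece]
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1]

print(viterbi_segment("hugs"))  # highest-scoring split under these toy values: (-5.3, ['hug', 's'])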


WordPiece

Used by BERT, DistilBERT, and ELECTRA, a highly influential lineup.

WordPiece was outlined in Japanese and Korean Voice Search (Schuster et al., 2012).

  • WordPiece first initializes the vocabulary to include every character present in the training data and progressively learns a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.
    • WordPiece builds its base vocabulary from the characters seen in the training data, then picks merges using the training-data likelihood as the criterion instead of raw pair frequency; it can be read as combining ideas from BPE and Unigram. A sketch of this selection score follows below.
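
A minimal sketch of that selection criterion, using the frequency-ratio score commonly used to describe WordPiece (pair frequency divided by the product of the frequencies of its parts); the toy corpus is the same assumed one as in the BPE example.

from collections import Counter

# Toy corpus of words split into symbols, with word frequencies (assumed values)
corpus = {("h", "u", "g"): 10, ("p", "u", "g"): 5, ("p", "u", "n"): 12,
          ("b", "u", "n"): 4, ("h", "u", "g", "s"): 5}

pair_freq, symbol_freq = Counter(), Counter()
for word, freq in corpus.items():
    for symbol in word:
        symbol_freq[symbol] += freq
    for a, b in zip(word, word[1:]):
        pair_freq[(a, b)] += freq

# WordPiece-style score: freq(pair) / (freq(first) * freq(second)),
# so it prefers pairs whose parts rarely occur apart from each other
scores = {pair: freq / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
          for pair, freq in pair_freq.items()}
best = max(scores, key=scores.get)
print(best, scores[best])  # ('g', 's') 0.05 — unlike BPE, which would merge the most frequent pair ('u', 'g')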

SentencePiece

Used by XLM, ALBERT, XLNet, Marian, and T5.

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo et al., 2018) treats the input as a raw input stream, thus including the space in the set of characters to use. It then uses the BPE or unigram algorithm to construct the appropriate vocabulary.

  • All transformers models in the library that use SentencePiece use it in combination with unigram.
    • It treats the raw text as a stream of Unicode characters, with the space included as an ordinary character (written as "▁"), so no language-specific pre-tokenization is required.
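
To see the effect, load any SentencePiece-based tokenizer and note the "▁" (U+2581) character that marks where spaces were. The exact split depends on the pretrained vocabulary (and the snippet needs the sentencepiece package installed), so treat the output as indicative.

from transformers import XLNetTokenizer

# XLNet uses SentencePiece with the unigram algorithm
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
print(tokenizer.tokenize("Don't you love 🤗 Transformers?"))
# Tokens start with '▁' where a space preceded them, e.g. ['▁Don', "'", 't', '▁you', '▁love', ...]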

Customizing a Tokenizer

As seen above, different models pick different tokenization algorithms; choose the tokenizer that fits your needs.

from tokenizers import models, Tokenizer

# You could also load a pretrained model's tokenizer directly by name; that is the usual approach.
tokenizer = Tokenizer(models.WordPiece())  # or models.BPE(), models.Unigram()

With a tokenizer instantiated like this, you can customize the following components.

  • Normalization: Executes all the initial transformations over the initial input string. For example when you need to lowercase some text, maybe strip it, or even apply one of the common unicode normalization process, you will add a Normalizer.
    • Cleans up the text: lowercasing, stripping certain symbols, Unicode normalization, and so on.
  • Pre-tokenization: In charge of splitting the initial input string. That’s the component that decides where and how to pre-segment the origin string. The simplest example would be to simply split on spaces.
    • Defines custom rules for pre-splitting the text.
  • Model: Handles all the sub-token discovery and generation, this is the part that is trainable and really dependent of your input data.
    • The trainable core that learns and applies the subword vocabulary.
  • Post-Processing: Provides advanced construction features to be compatible with some of the Transformers-based SoTA models. For instance, for BERT it would wrap the tokenized sentence around [CLS] and [SEP] tokens.

  • First, set up the dataset and a batch generator.
from datasets import load_dataset

dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")

batch_size = 1000
def batch_iterator():
    for i in range(0, len(dataset), batch_size):  # step through the dataset in chunks of batch_size
        yield dataset[i : i + batch_size]["text"]
  1. WordPiece

pre-processing

from tokenizers import decoders, models, normalizers, pre_tokenizers, processors, trainers, Tokenizer

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)

# Use BERT's pre-tokenizer
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
tokenizer.pre_tokenizer.pre_tokenize_str("This is an example!")
'''
[('This', (0, 4)),
 ('is', (5, 7)),
 ('an', (8, 10)),
 ('example', (11, 18)),
 ('!', (18, 19))]
'''
  • Note that the pre-tokenizer not only splits the text into words but also keeps the offsets,
    • that is, the start and end position of each of those words inside the original text. These offsets are what enable answer-span mapping in tasks such as QA.
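
A quick way to see what the offsets give you is to slice the original string with them, using the pre-tokenizer configured above:

text = "This is an example!"
for word, (start, end) in tokenizer.pre_tokenizer.pre_tokenize_str(text):
    # Each offset pair maps the word back to its span in the original string
    assert text[start:end] == word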

processing

special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
# The trainer comes ready-made; we only configure the vocab size and register our special tokens
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)

# Train the tokenizer on the batch iterator defined above
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)
  • The tokenizer is now trained; next comes post-processing.

post-processing

cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")

tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", cls_token_id),
        ("[SEP]", sep_token_id),
    ],
)

# Inspect the result
encoding = tokenizer.encode("This is one sentence.", "With this one we have a pair.")
encoding.tokens
'''
['[CLS]',
 'this',
 'is',
 'one',
 'sentence',
 '.',
 '[SEP]',
 'with',
 'this',
 'one',
 'we',
 'have',
 'a',
 'pair',
 '.',
 '[SEP]']
'''

encoding.type_ids
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

  • We have to indicate in the template how to organize the special tokens with one sentence ($A) or two sentences ($A and $B). The : followed by a number indicates the token type ID to give to each part.
    • The post-processor wraps the trained tokenizer's output: a single sentence gets type ID 0 throughout, while the second sentence of a pair gets type ID 1. A single-sentence check follows below.
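
For comparison, encoding a single sentence only ever uses segment ID 0; the exact subword split depends on the vocabulary learned from wikitext-2, so the tokens shown are indicative.

single = tokenizer.encode("This is one sentence.")
print(single.tokens)    # something like ['[CLS]', 'this', 'is', 'one', 'sentence', '.', '[SEP]']
print(single.type_ids)  # all zeros, e.g. [0, 0, 0, 0, 0, 0, 0]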

Wrap your tokenizer in a Transformers object

from transformers import PreTrainedTokenizerFast

new_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
# The original example uses new_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)
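
One detail worth noting: the generic PreTrainedTokenizerFast wrapper does not know which strings are your special tokens, so it is safer to pass them explicitly using the standard special-token keyword arguments.

from transformers import PreTrainedTokenizerFast

new_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)
new_tokenizer("This is one sentence.")  # returns input_ids, token_type_ids, attention_mask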
  2. BPE
tokenizer = Tokenizer(models.BPE())

tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

tokenizer.pre_tokenizer.pre_tokenize_str("This is an example!")

trainer = trainers.BpeTrainer(vocab_size=25000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)

tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)
tokenizer.decoder = decoders.ByteLevel()

encoding = tokenizer.encode("This is one sentence.", "With this one we have a pair.")
encoding
# Encoding(num_tokens=13, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

# Wrap it the same way as before
from transformers import PreTrainedTokenizerFast
new_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
  • Unlike before, we use the pre_tokenizers.ByteLevel pre-tokenizer directly instead of BERT's pre-tokenizer.
  • Likewise, the trainer is the BPE-specific trainers.BpeTrainer.
  • trim_offsets=False: whether the post-processing step should trim offsets to avoid including whitespace.
    • In other words, whether whitespace is counted inside the offsets. A decode round trip is sketched below.
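
Since we attached a ByteLevel decoder, a decode round trip is an easy sanity check; with no template post-processor adding special tokens, you should get the original sentence back (the expected output is shown as an assumed comment).

ids = tokenizer.encode("This is one sentence.").ids
print(tokenizer.decode(ids))  # expected: "This is one sentence."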
  3. Unigram
tokenizer = Tokenizer(models.Unigram())

tokenizer.normalizer = normalizers.Sequence(
    [normalizers.Replace("``", '"'), normalizers.Replace("''", '"'), normalizers.Lowercase()]
)
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

tokenizer.pre_tokenizer.pre_tokenize_str("This is an example!")
# [('▁This', (0, 4)), ('▁is', (4, 7)), ('▁an', (7, 10)), ('▁example!', (10, 19))]

trainer = trainers.UnigramTrainer(
    vocab_size=25000,
    special_tokens=["[CLS]", "[SEP]", "<unk>", "<pad>", "[MASK]"],
    unk_token="<unk>",
)
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)

cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")

tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", cls_token_id),
        ("[SEP]", sep_token_id),
    ],
)
tokenizer.decoder = decoders.Metaspace()

tokenizer.encode("This is one sentence.", "With this one we have a pair.").tokens
'''
['[CLS]',
 '▁this',
 '▁is',
 '▁one',
 '▁sentence',
 '.',
 '[SEP]',
 '▁with',
 '▁this',
 '▁one',
 '▁we',
 '▁have',
 '▁a',
 '▁pair',
 '.',
 '[SEP]']
'''

# Wrap it
from transformers import AlbertTokenizerFast
new_tokenizer = AlbertTokenizerFast(tokenizer_object=tokenizer)
  • pre_tokenizers.Metaspace() replaces spaces with the "▁" character, the same convention SentencePiece uses.

Summary

  • From one-hot word-level and character-level encodings we arrive at subword tokenization.
  • On top of subwords, tuning the vocabulary size gives BPE and Unigram.
  • Combining the strengths of both gives WordPiece, while SentencePiece targets languages such as Chinese, Japanese, and Korean that lack whitespace word boundaries.
  • The rest is API detail for building your own tokenizer.

References

BPE model · notebook (Colab) · Conceptual Guide (Hugging Face Tokenizers documentation)