Loading the data

from datasets import load_dataset

# This can take a few minutes to load, so grab a coffee or tea while you wait!
raw_datasets = load_dataset("code_search_net", "python")
raw_datasets["train"]
'''
Dataset({
    features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 
      'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 
      'func_code_url'
    ],
    num_rows: 412178
})'''

Loading data with a generator

The following approach loads all of the data into memory at once:

# Don't uncomment the following line unless your dataset is small!
# training_corpus = [raw_datasets["train"][i: i + 1000]["whole_func_string"] for i in range(0, len(raw_datasets["train"]), 1000)]

In general, use a Python generator instead:

training_corpus = (
    raw_datasets["train"][i : i + 1000]["whole_func_string"]
    for i in range(0, len(raw_datasets["train"]), 1000)
)

Swapping the square brackets of the list comprehension for parentheses turns it into a generator expression, which is neat.

>gen = (i for i in range(10))
>print(list(gen))
>print(list(gen))
>'''
>[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>[]'''

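A generator can be consumed only once. After the first pass it is exhausted, which is why the second list(gen) above returns an empty list.

A more general generator function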
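To iterate more than once, one option is to wrap the generator expression in a function and call it again for each pass. A minimal sketch on a toy range (the name `make_gen` is made up for illustration):

```python
def make_gen():
    # Each call builds a brand-new generator object.
    return (i for i in range(5))

gen = make_gen()
first_pass = list(gen)          # consumes the generator
second_pass = list(gen)         # already exhausted, nothing is left
fresh_pass = list(make_gen())   # a new generator starts from the beginning

print(first_pass)
print(second_pass)
print(fresh_pass)
```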

def get_training_corpus():
    dataset = raw_datasets["train"]
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples["whole_func_string"]
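The same batching pattern can be tried on a plain Python list before running it on the full dataset. This is only a sketch: `batch_iterator`, `toy_corpus`, and the batch size of 3 are hypothetical names and values chosen for illustration, not part of the original walkthrough.

```python
def batch_iterator(texts, batch_size):
    # Yield consecutive slices of `texts`, each holding at most `batch_size` items.
    for start_idx in range(0, len(texts), batch_size):
        yield texts[start_idx : start_idx + batch_size]

# Seven tiny fake "functions" standing in for the real corpus.
toy_corpus = [f"def f{i}(): pass" for i in range(7)]

batches = list(batch_iterator(toy_corpus, batch_size=3))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)
```

The last batch is simply shorter when the corpus length is not a multiple of the batch size, because slicing past the end of a list is allowed in Python.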

train_new_from_iterator()

Loading the pretrained tokenizer

from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

tokens = old_tokenizer.tokenize(example)
tokens

'''
['def', 'Ġadd', '_', 'n', 'umbers', '(', 'a', ',', 'Ġb', '):', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo',
 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`', '."', '""', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']'''

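This tokenizer has a few special symbols, like Ġ and Ċ, which denote spaces and newlines, respectively.

It also encodes every space in a run of spaces as its own token, and it does not recognize names containing underscores (add_numbers is split into 'Ġadd', '_', 'n', 'umbers'), so it is a poor fit for code.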
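Where Ġ and Ċ come from: GPT-2 uses byte-level BPE, and bytes that do not print cleanly are displayed with stand-in characters. The sketch below is modeled on the `bytes_to_unicode` helper from GPT-2's reference encoder, reproduced from memory, so treat it as an approximation rather than the library's exact code. `show_bytes` is a hypothetical helper added for illustration.

```python
def bytes_to_unicode():
    # Bytes that already print nicely keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Whitespace and control bytes are moved to code points 256 and up.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

table = bytes_to_unicode()

def show_bytes(text):
    # Render each UTF-8 byte of `text` with its visible stand-in character.
    return "".join(table[b] for b in text.encode("utf-8"))

space_symbol = table[ord(" ")]
newline_symbol = table[ord("\n")]
rendered = show_bytes("a b\n")
print(space_symbol, newline_symbol, rendered)
```

Under this mapping the space byte (32) lands on code point 288 (Ġ) and the newline byte (10) on code point 266 (Ċ).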

Training a new tokenizer

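tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)

Note that only fast tokenizers support the train_new_from_iterator method. They are backed by the Tokenizers library, which is written in Rust, while the slow (non-fast) tokenizers are written in pure Python.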

tokens = tokenizer.tokenize(example)
tokens
'''
['def', 'Ġadd', '_', 'numbers', '(', 'a', ',', 'Ġb', '):', 'ĊĠĠĠ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`',
 'a', '`', 'Ġand', 'Ġ`', 'b', '`."""', 'ĊĠĠĠ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']'''

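At least it has learned to handle runs of spaces: the indentation after each newline is now a single token, ĊĠĠĠ.

Saving the new tokenizer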
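A quick way to quantify the difference is to compare the two token lists directly. The lists below are copied from the two outputs shown earlier in these notes. Nothing is recomputed here, so no model download is needed.

```python
old_tokens = [
    'def', 'Ġadd', '_', 'n', 'umbers', '(', 'a', ',', 'Ġb', '):', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo',
    'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`', '."', '""', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb',
]
new_tokens = [
    'def', 'Ġadd', '_', 'numbers', '(', 'a', ',', 'Ġb', '):', 'ĊĠĠĠ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`',
    'a', '`', 'Ġand', 'Ġ`', 'b', '`."""', 'ĊĠĠĠ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb',
]

# Both tokenizations should spell out exactly the same text.
same_text = "".join(old_tokens) == "".join(new_tokens)

print(len(old_tokens), len(new_tokens), same_text)
```

Fewer tokens for the same text means more code fits into a fixed context length.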

tokenizer.save_pretrained("code-search-net-tokenizer")