from datasets import load_dataset

# This can take a few minutes to load, so grab a coffee or tea while you wait!
raw_datasets = load_dataset("code_search_net", "python")
raw_datasets["train"]
'''
Dataset({
    features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
    num_rows: 412178
})'''
Loading data with a generator
The following approach loads all of the data into memory at once:
# Don't uncomment the following line unless your dataset is small!
# training_corpus = [raw_datasets["train"][i: i + 1000]["whole_func_string"] for i in range(0, len(raw_datasets["train"]), 1000)]
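In practice, use a Python generator instead, so that each batch of 1,000 functions is loaded only when it is needed: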
training_corpus = (
    raw_datasets["train"][i : i + 1000]["whole_func_string"]
    for i in range(0, len(raw_datasets["train"]), 1000)
)
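Replacing the square brackets of a list comprehension with parentheses turns it into a generator expression, which is pretty neat. Keep in mind that a generator can only be consumed once: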
>gen = (i for i in range(10))
>print(list(gen))
>print(list(gen))
>'''
>[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>[]'''
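Because the second `list(gen)` comes back empty, a generator expression is awkward when the corpus has to be walked more than once. One possible workaround, sketched below, is to wrap the batching logic in a generator function so that every call returns a fresh generator. The name `batch_iterator` and the small `functions` list are hypothetical stand-ins (the list takes the place of `raw_datasets["train"]["whole_func_string"]`), chosen so the sketch runs without downloading the dataset.

```python
def batch_iterator(items, batch_size):
    """Yield successive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


# Hypothetical stand-in data; in the real case this would be the dataset column.
functions = [f"def f{i}(): pass" for i in range(5)]

# Each call to batch_iterator() creates a new generator, so both passes see all batches.
first_pass = list(batch_iterator(functions, 2))
second_pass = list(batch_iterator(functions, 2))

print([len(batch) for batch in first_pass])  # [2, 2, 1]
print(first_pass == second_pass)             # True
```

The batches are still produced lazily, one slice at a time; the only difference from the generator expression is that the recipe for producing them can be replayed by calling the function again.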