Scrapy 01 tutorial
If the terminal in VS Code cannot find scrapy, add the path to scrapy.exe to your environment variables.
After installation, run `scrapy startproject tutorial` inside the target folder; this creates the following files:
```
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
```
Create our first spider in the tutorial/spiders directory and name it quotes_spider.py:
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
```
Run the spider from the terminal with `scrapy crawl quotes`; you will get two files, quotes-1.html and quotes-2.html.
Scrapy shell: before parsing those two files, let's introduce the Scrapy shell, which we use to debug our extraction: `scrapy shell <url>`
After `pip install ipython`, open the scrapy.cfg file in the project root and add, under the [settings] section,
`shell = ipython` (or `shell = bpython` if ipython does not work for you).
Type `exit` to leave the shell.
```
scrapy shell "https://quotes.toscrape.com/page/1/"
'''
[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET https://quotes.toscrape.com/page/1/>
[s]   response   <200 https://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
'''
```
The listing above shows the objects available for interactive use.
```python
response.css('title')
# [<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
```
This lets us explore the response interactively.
CSS syntax: `::text`
```python
response.css('title::text').getall()
# ['Quotes to Scrape']
```
```python
response.css('title').getall()
# ['<title>Quotes to Scrape</title>']
```
get() returns a single result; getall() returns a list of all results.
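As a quick illustration (run inside the same scrapy shell session as above; the outputs are what the quotes page returns for its title):

```python
response.css('title::text').get()     # 'Quotes to Scrape'    -> first match, as a string
response.css('title::text').getall()  # ['Quotes to Scrape']  -> every match, as a list
```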
Regular expressions
```python
response.css('title::text').re(r'Quotes.*')
# ['Quotes to Scrape']
response.css('title::text').re(r'Q\w+')
# ['Quotes']
response.css('title::text').re(r'(\w+) to (\w+)')
# ['Quotes', 'Scrape']
```
XPath: the official docs recommend it, but I find CSS selectors more convenient to write.
```python
response.xpath('//title')
# [<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
response.xpath('//title/text()').get()
# 'Quotes to Scrape'
```
/html/head/title: selects the <title> element inside the <head> of the HTML document
/html/head/title/text(): selects the text of that <title> element
//td: selects all <td> elements
//div[@class="mine"]: selects all div elements with class="mine" (examples below)
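For example, trying these expressions in the shell session above (the quotes page has no <td> elements or class="mine" divs, so those two selectors are expected to come back empty):

```python
response.xpath('/html/head/title').get()         # '<title>Quotes to Scrape</title>'
response.xpath('/html/head/title/text()').get()  # 'Quotes to Scrape'
response.xpath('//td').getall()                  # [] on this page
response.xpath('//div[@class="mine"]').getall()  # [] on this page
```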
Extracting data
```html
<div class="quote">
    <span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>
```
scrapy shell 'https://quotes.toscrape.com'
Extracting a single quote
```python
response.css("div.quote")
'''
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 ...]
'''
```
Each result has two parts, selector and data; data is the part we work with.
```python
quote = response.css("div.quote")[0]
text = quote.css("span.text::text").get()
text
'''
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
'''
author = quote.css("small.author::text").get()
author
# 'Albert Einstein'
```
response.css matches with the pattern 'tag.class-name'.
quote here is the group of all elements with class="quote" in our HTML;
inside each one, span.text holds the quote text and small.author holds the author.
Extracting the tags, using the same quote HTML shown above:
```python
tags = quote.css("div.tags a.tag::text").getall()
tags
# ['change', 'deep-thoughts', 'thinking', 'world']
```
Extracting all quotes
```python
for quote in response.css("div.quote"):
    text = quote.css("span.text::text").get()
    author = quote.css("small.author::text").get()
    tags = quote.css("div.tags a.tag::text").getall()
    print(dict(text=text, author=author, tags=tags))
'''
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
'''
```
Saving data: `scrapy crawl spiderman -O spn.json`
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
```
Running the spider produces output like this.
Note: start the spider from the project root directory.
```
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}
```
Output formats: `scrapy crawl quotes -O quotes.json`
-O overwrites any existing file with the same name,
while -o appends to an existing file; the appended records may not combine cleanly with the old JSON, so when appending use the JSON Lines format instead:
`scrapy crawl quotes -o quotes.jsonl`
Four feed formats are available: json, jsonl, csv, and xml.
Crawling the whole site
```html
<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true" ...
```
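The notes are cut off at the pager markup, so here is a minimal sketch of how the parse method could follow that Next link. It uses Scrapy's response.follow; the `li.next a::attr(href)` selector is an assumption based on the HTML shown above, not code from the original notes.

```python
def parse(self, response):
    for quote in response.css("div.quote"):
        yield {
            "text": quote.css("span.text::text").get(),
            "author": quote.css("small.author::text").get(),
            "tags": quote.css("div.tags a.tag::text").getall(),
        }

    # Follow the "Next" link in <ul class="pager"> until there is no next page
    next_page = response.css("li.next a::attr(href)").get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)
```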
Diffusion survey
(To be formatted.)
Part one covers image editing in the DDPM era. Since no guidance techniques existed yet, the papers of this stage all follow the paradigm of using the input image itself to guide generation.
Part two surveys multimodal guided generation based on the CLIP model, after explicit classifier guidance appeared.
Part three surveys the image-editing techniques of the last month or two (as of 2022.11) built on models such as Stable Diffusion and Imagen.
In plainer terms:
Stage one: DDPM
Add noise with diffusion, then denoise to reconstruct; this only allows global edits.
It gradually became clear that controlling the gradient matters, so gradient control was added: DDPM -> ILVR -> SDEdit -> RePaint.
Finally, starting from patch-style (inpainting) control of generation, this led to control over the gradients themselves.
Stage two: DDIM
Diffusion Models Beat GANs on Image Synthesis: adds directional gradient (classifier) guidance, scaled up by a factor of about 10.
More Control for Free! Image Synthesis with Semantic Diffusion Guidance: CLIP guidance, which enables local edits.
To guide image generation with a text prompt, we can compute, at every step, the distance between the current image representation and the text representation, and use the gradient of that function to find the direction that shrinks the distance.
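A minimal PyTorch-style sketch of that idea (illustrative only; image_encoder, text_emb, and guidance_scale are assumed stand-ins, not any paper's actual code):

```python
import torch


def clip_guidance_grad(x_t, text_emb, image_encoder, guidance_scale=100.0):
    """Gradient that nudges the current sample x_t toward the text embedding."""
    x = x_t.detach().requires_grad_(True)
    img_emb = image_encoder(x)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    # cosine distance between the current image and the text description
    loss = 1.0 - (img_emb * txt_emb).sum(dim=-1).mean()
    grad = torch.autograd.grad(loss, x)[0]
    # moving against the gradient shrinks the image-text distance
    return -guidance_scale * grad
```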
But just ten days later OpenAI released GLIDE, which uses the implicit-classifier (classifier-free) guidance discussed below.
As newer, more powerful, and more convenient models such as Stable Diffusion and Imagen sprang up, the works above are probably left with mostly reference value.
Classifier-Free Diffusion Guidance: the basis of text-to-image models built on an implicit classifier.
Classifier-free guidance is arguably one of the direct foundations of the approach taken by GLIDE, Stable Diffusion, and Imagen.
Stage three:
Steering generation during the sampling process on top of classifier-free guidance.
The first approach exploits the iterative denoising of diffusion models: we continue generating on top of the image's low-frequency details. This preserves most geometric features, but by the same token cannot change them.
Imagic: Text-Based Real Image Editing with Diffusion Models:
Concretely, Imagic splits concept binding into three steps. For an input image x and the target description text_target we want to generate:
1. First we freeze the whole model and fine-tune the text embedding of text_target with the model's original training objective, pulling it close to the image's representation.
2. We then unfreeze the model and fine-tune all of its weights, still with the training objective, feeding in the image x and the fine-tuned text embedding. This step is needed because even if the target text embedding is now close to the image's representation, there is no guarantee it can actually regenerate the original image; so we train the two together until the fine-tuned target embedding reproduces the original image.
3. With the original image and the fine-tuned text embedding bound together, we interpolate between the original target text embedding and the fine-tuned one to apply the edit to the image.
In short: freeze the trained model and fit the text embedding -> unfreeze all parameters and train the two together -> with the two concepts bound, the image is open to editing.
Put simply, the fine-tuned target text embedding can be treated as an approximation of the image's own native text embedding, so the last step, using the target embedding to influence the native one, follows naturally.
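Step 3 boils down to a linear interpolation between the two embeddings. A tiny sketch (the names and the default eta value are illustrative, not Imagic's code):

```python
def interpolate_embeddings(e_opt, e_tgt, eta=0.7):
    """e_opt: embedding fine-tuned to reconstruct the input image.
    e_tgt: embedding of the original target prompt.
    Larger eta pushes the result further toward the requested edit."""
    return eta * e_tgt + (1.0 - eta) * e_opt
```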
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
Concretely, the authors fine-tune on a small set of photos bound to a composite text of a rare token plus a class word, e.g. "beikkpic dog". But fine-tuning a model with this many parameters on only a few photos obviously causes severe overfitting, plus a problem very familiar from language models: catastrophic forgetting. These show up as, first, the bound subject becoming hard to vary in pose and form (just as in Unitune from the previous write-up), and second, generations for the class word rapidly losing diversity and variation. To counter this, the authors propose a loss called the class-specific prior preservation loss.
The idea is: after the user supplies a class and a set of images of that class (e.g. several photos of their own dog), the model is trained simultaneously on "special token + class" paired with the user photos and on "class" alone paired with images of that class generated by the model itself. The benefit is that while binding the specific subject to the special token, the model also learns its relation to the class, and the class information keeps being reasserted to resist the pull of the user photos. The authors train with the two losses at a 1:1 ratio for about 200 iterations (roughly 15 minutes on a single GPU).
Prompt-to-Prompt Image Editing with Cross-Attention Control
The insight of this paper comes from an important question: in multimodal text-to-image generation, how exactly does the text influence the generation process?
Classifier-free text-to-image models are trained so that a single model can predict both the unconditional and the conditional gradient. In both Imagen and Stable Diffusion the conditioning is injected through cross-attention, so cross-attention is clearly where we should look.
The authors' observation is that there is a spatial correspondence between the input text tokens and the pixels. By manipulating the attention maps between tokens and pixels, we can guide different regions of the image precisely.
With this insight, guided editing becomes straightforward. The authors split it into three main scenarios: word swap (e.g. turning a bear into a cat by giving the cat token the bear's attention map), word addition (adding the new word's map on top of the existing ones), and attention re-weighting (multiplying a word's map by a new weight to strengthen or weaken its effect on the image, e.g. less snow or more blossoms).
Original link
HF Course 09 Custom Tokenizer
Unlike the inherited tokenizer from the previous chapter, this time we will train a brand-new tokenizer from a corpus.
First we set up a WordPiece-style tokenizer.
Loading the data
```python
from datasets import load_dataset

dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")


def get_training_corpus():
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]


# Alternatively, dump the dataset to a local text file
with open("wikitext-2.txt", "w", encoding="utf-8") as f:
    for i in range(len(dataset)):
        f.write(dataset[i]["text"] + "\n")
```
Loading the building blocks
```python
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
```
We load the model component from the tokenizers library in order to use the WordPiece algorithm.
Unseen words are mapped to [UNK]; you can also set max_input_chars_per_word as the maximum word length.
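For instance (the value 100 here is only an illustration, not a recommendation):

```python
from tokenizers import Tokenizer, models

# Words longer than 100 characters are mapped to [UNK] instead of being split
limited_tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]", max_input_chars_per_word=100))
```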
Setting the normalizer: here we adopt BERT's settings, which include
lowercasing all letters, strip_accents to remove accents, removing control characters, collapsing runs of whitespace into a single space, and putting spaces around CJK characters.
```python
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)

# You can also define your own
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)

print(tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))
# hello how are u?
```
In the custom version above we use Sequence to compose our own normalization rules.
Pre-tokenization: as above, you can reuse BERT's setup with `tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()`.
Below is the custom version.
```python
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")
'''
[('Let', (0, 3)), ("'", (3, 4)), ('s', (4, 5)), ('test', (6, 10)), ('my', (11, 13)), ('pre', (14, 17)),
 ('-', (17, 18)), ('tokenizer', (18, 27)), ('.', (27, 28))]
'''
```
Whitespace() splits on whitespace and punctuation; to split on whitespace only, use the pre-tokenizer below.
```python
pre_tokenizer = pre_tokenizers.WhitespaceSplit()
pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")
'''
[("Let's", (0, 5)), ('test', (6, 10)), ('my', (11, 13)), ('pre-tokenizer.', (14, 28))]
'''
```
It is recommended to combine your pre-tokenizers with Sequence.
```python
pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Punctuation()]
)
pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")
'''
[('Let', (0, 3)), ("'", (3, 4)), ('s', (4, 5)), ('test', (6, 10)), ('my', (11, 13)), ('pre', (14, 17)),
 ('-', (17, 18)), ('tokenizer', (18, 27)), ('.', (27, 28))]
'''
```
Trainer: before training we need to pass in the special tokens, because they are not in the vocabulary.
```python
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)
```
As well as specifying the vocab_size and special_tokens, we can set the min_frequency (the number of times a token must appear to be included in the vocabulary) or change the continuing_subword_prefix (if we want to use something different from ##).
That is, you can require a token to appear a minimum number of times, or change the continuation prefix ## to something else.
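For example, a trainer with those two options set (the specific values are illustrative):

```python
from tokenizers import trainers

custom_trainer = trainers.WordPieceTrainer(
    vocab_size=25000,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
    min_frequency=2,                 # a token must appear at least twice to enter the vocabulary
    continuing_subword_prefix="++",  # use "++" instead of the default "##"
)
```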
Start training:
```python
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

# Alternative version
tokenizer.model = models.WordPiece(unk_token="[UNK]")
tokenizer.train(["wikitext-2.txt"], trainer=trainer)
```
The first call uses the generator defined above;
the second trains from the "wikitext-2.txt" file.
At this point our tokenizer has all the usual tokenizer methods, such as encode.
```python
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)
# ['let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '.']
```
Post-processing: finally we wrap our tokens in the special format, e.g. [CLS] … [SEP] … [SEP].
First we look up the IDs of the special tokens we need.
```python
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)
# (2, 3)
```
Next we define the template.
```python
tokenizer.post_processor = processors.TemplateProcessing(
    single=f"[CLS]:0 $A:0 [SEP]:0",
    pair=f"[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)
```
The template needs two patterns:
single: a single sentence, giving token type IDs like [0, 0, 0, 0]
pair: a pair of sentences, giving token type IDs like [0, 0, 0, 1, 1, 1]
Finally we pass in the IDs of the special tokens.
Check the result:
```python
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)
# ['[CLS]', 'let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '.', '[SEP]']

encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences.")
print(encoding.tokens)
print(encoding.type_ids)
'''
['[CLS]', 'let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '...', '[SEP]', 'on', 'a', 'pair', 'of', 'sentences', '.', '[SEP]']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
'''
```
Decoder: next we configure the decoder.
```python
tokenizer.decoder = decoders.WordPiece(prefix="##")
tokenizer.decode(encoding.ids)
# "let's test this tokenizer... on a pair of sentences."
```
Saving & loading the custom tokenizer
```python
tokenizer.save("tokenizer.json")
new_tokenizer = Tokenizer.from_file("tokenizer.json")
```
Converting to a fast tokenizer: to use this tokenizer in 🤗 Transformers, we have to wrap it in a PreTrainedTokenizerFast. We can either use the generic class or, if our tokenizer corresponds to an existing model, use that class (here, BertTokenizerFast). If you apply this lesson to build a brand-new tokenizer, you will have to use the first option.
That is, you can use the model-specific class BertTokenizerFast, or the generic class PreTrainedTokenizerFast.
```python
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    # tokenizer_file="tokenizer.json",  # You can load from the tokenizer file, alternatively
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)
```
Here you can load the tokenizer from the saved file or pass the object directly; note that the special tokens have to be declared again.
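A quick usage sketch, building on the wrapped_tokenizer defined above: once wrapped, it behaves like any other 🤗 Transformers tokenizer.

```python
encoded = wrapped_tokenizer("Let's test this tokenizer.")
print(encoded["input_ids"])  # ids drawn from our custom WordPiece vocabulary
```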
A BPE-style tokenizer
```python
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test pre-tokenization!")
'''
[('Let', (0, 3)), ("'s", (3, 5)), ('Ġtest', (5, 10)), ('Ġpre', (10, 14)), ('-', (14, 15)),
 ('tokenization', (15, 27)), ('!', (27, 28))]
'''
```
GPT-2 only needs the special token marking the beginning and end of text.
```python
trainer = trainers.BpeTrainer(vocab_size=25000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

tokenizer.model = models.BPE()
tokenizer.train(["wikitext-2.txt"], trainer=trainer)

encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)
'''
['L', 'et', "'", 's', 'Ġtest', 'Ġthis', 'Ġto', 'ken', 'izer', '.']
'''
```
```python
tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)

sentence = "Let's test this tokenizer."
encoding = tokenizer.encode(sentence)
start, end = encoding.offsets[4]
sentence[start:end]
# ' test'
```
The trim_offsets = False option indicates to the post-processor that we should leave the offsets of tokens that begin with ‘Ġ’ as they are: this way the start of the offsets will point to the space before the word, not the first character of the word (since the space is technically part of the token). Let’s have a look at the result with the text we just encoded, where 'Ġtest' is the token at index 4
trim_offsets controls whether the leading 'Ġ' space is trimmed out of a token's offsets.
```python
tokenizer.decoder = decoders.ByteLevel()
tokenizer.decode(encoding.ids)
# "Let's test this tokenizer."
```
Wrapping it up:
```python
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<|endoftext|>",
    eos_token="<|endoftext|>",
)

# or
from transformers import GPT2TokenizerFast

wrapped_tokenizer = GPT2TokenizerFast(tokenizer_object=tokenizer)
```
A Unigram-style tokenizer
```python
tokenizer = Tokenizer(models.Unigram())

from tokenizers import Regex

tokenizer.normalizer = normalizers.Sequence(
    [
        normalizers.Replace("``", '"'),
        normalizers.Replace("''", '"'),
        normalizers.NFKD(),
        normalizers.StripAccents(),
        normalizers.Replace(Regex(" {2,}"), " "),
    ]
)
```
The first two normalizers replace the backtick/apostrophe quote symbols; the last one collapses runs of spaces into a single space.
```python
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test the pre-tokenizer!")
# [("▁Let's", (0, 5)), ('▁test', (5, 10)), ('▁the', (10, 14)), ('▁pre-tokenizer!', (14, 29))]
```
```python
special_tokens = ["<cls>", "<sep>", "<unk>", "<pad>", "<mask>", "<s>", ...
```
HF Course 08 Tokenizer Algorithms Under the Hood
Pipeline: in general, a tokenizer runs through the following steps.
Normalization: normalization handles things like character case. We can inspect the underlying normalization through the following API.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(type(tokenizer.backend_tokenizer))
# <class 'tokenizers.Tokenizer'>

print(tokenizer.backend_tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))
# 'hello how are u?'
```
Pre-tokenization: the following API shows how the tokenizer pre-tokenizes text.
```python
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are you?")
'''
[('Hello', (0, 5)), (',', (5, 6)), ('how', (7, 10)), ('are', (11, 14)), ('you', (16, 19)), ('?', (19, 20))]
'''
```
Note the offset coordinates after each token; this is where the offset mapping of the previous chapter comes from.
Different pre-tokenizers: GPT-2
```python
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are you?")
'''
[('Hello', (0, 5)), (',', (5, 6)), ('Ġhow', (6, 10)), ('Ġare', (10, 14)), ('Ġ', (14, 15)), ('Ġyou', (15, 19)), ('?', (19, 20))]
'''
```
T5
```python
tokenizer = AutoTokenizer.from_pretrained("t5-small")
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are you?")
# [('▁Hello,', (0, 6)), ('▁how', (7, 10)), ('▁are', (11, 14)), ('▁you?', (16, 20))]
```
Overview of the three subword algorithms: as above, different models use different tokenization algorithms.
SentencePiece is often paired with the Unigram algorithm; it needs no pre-tokenization, which makes it well suited to languages such as Chinese and Japanese that cannot be split on whitespace.
Algorithm overview:
| Model | BPE | WordPiece | Unigram |
| --- | --- | --- | --- |
| Training | Starts from a small vocabulary and learns rules to merge tokens | Starts from a small vocabulary and learns rules to merge tokens | Starts from a large vocabulary and learns rules to remove tokens |
| Training step | Merges the tokens corresponding to the most common pair | Merges the tokens corresponding to the pair with the best score based on the frequency of the pair, privileging pairs where each individual token is less frequent | Removes all the tokens in the vocabulary that will minimize the loss computed on the whole corpus |
| Learns | Merge rules and a vocabulary | Just a vocabulary | A vocabulary with a score for each token |
| Encoding | Splits a word into characters and applies the merges learned during training | Finds the longest subword starting from the beginning that is in the vocabulary, then does the same for the rest of the word | Finds the most likely split into tokens, using the scores learned during training |
BPE in brief: BPE stands for Byte-Pair Encoding, and it has three steps:
Split out every unique character in the corpus, e.g. the 26 English letters, punctuation, and other special symbols.
On top of these base characters, pair adjacent symbols and, using frequency as the criterion, add the most frequent pair to the vocabulary.
Repeat step two until the vocabulary reaches the size you set.
The GPT-2 and RoBERTa tokenizers (which are pretty similar) have a clever way to deal with this: they don’t look at words as being written with Unicode characters, but with bytes. This way the base vocabulary has a small size (256), but every character you can think of will still be included and not end up being converted to the unknown token. This trick is called byte-level BPE.
GPT-2 and RoBERTa use byte-level symbols, i.e. the 256 possible byte values, as their base alphabet, and then build the vocabulary by merging on top of it.
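A quick way to see the byte-level idea: every character becomes one or more bytes in the range 0-255, so a 256-symbol base alphabet covers any text with no unknown token.

```python
# 'é' would be missing from a plain character alphabet, but at the byte level it is just two bytes
print(list("héllo".encode("utf-8")))
# [104, 195, 169, 108, 108, 111]
```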
Example: let's walk through a concrete case with the corpus below.
Corpus: "hug", "pug", "pun", "bun", "hugs"
Word frequencies: ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
```python
# ("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)
```
Round 1
The most frequent pair is ("u", "g"), occurring 20 times.
```
Vocabulary: ["b", "g", "h", "n", "p", "s", "u", "ug"]
Corpus: ("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)
```
Round 2
The most frequent pair is now ("u", "n").
```
Vocabulary: ["b", "g", "h", "n", "p", "s", "u", "ug", "un"]
Corpus: ("h" "ug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("h" "ug" "s", 5)
```
Round 3
The most frequent pair is ("h", "ug"), giving "hug".
```
Vocabulary: ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]
Corpus: ("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)
```
…and so on, looping until the vocabulary reaches the target size.
Minimal code. The corpus:
```python
corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]
```
Count the word frequencies:
```python
from transformers import AutoTokenizer
from collections import defaultdict

tokenizer = AutoTokenizer.from_pretrained("gpt2")

word_freqs = defaultdict(int)

for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1

print(word_freqs)
'''
defaultdict(int, {'This': 3, 'Ġis': 2, 'Ġthe': 1, 'ĠHugging': 1, 'ĠFace': 1, 'ĠCourse': 1, '.': 4, 'Ġchapter': 1,
    'Ġabout': 1, 'Ġtokenization': 1, 'Ġsection': 1, 'Ġshows': 1, 'Ġseveral': 1, 'Ġtokenizer': 1, 'Ġalgorithms': 1,
    'Hopefully': 1, ',': 1, 'Ġyou': 1, 'Ġwill': 1, 'Ġbe': 1, 'Ġable': 1, 'Ġto': 1, 'Ġunderstand': 1, 'Ġhow': 1,
    'Ġthey': 1, 'Ġare': 1, 'Ġtrained': 1, 'Ġand': 1, 'Ġgenerate': 1, 'Ġtokens': 1})
'''
```
First we load the GPT-2 tokenizer and use it for pre-tokenization,
then count frequencies with a defaultdict(int) from collections.
The base vocabulary:
```python
alphabet = []

for word in word_freqs.keys():
    for letter in word:
        if letter not in alphabet:
            alphabet.append(letter)
alphabet.sort()

print(alphabet)
'''
[ ',', '.', 'C', 'F', 'H', 'T', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's',
  't', 'u', 'v', 'w', 'y', 'z', 'Ġ']
'''
```
Add the special token at the front: `vocab = ["<|endoftext|>"] + alphabet.copy()`
Map each word to the form {'word': ['w', 'o', 'r', 'd']} for training:
```python
splits = {word: [c for c in word] for word in word_freqs.keys()}

# I think this could also be written as:
# splits = {word: list(word) for word in word_freqs.keys()}
```
A function to compute pair frequencies:
```python
def compute_pair_freqs(splits):
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]  # get the split for word, e.g. ['w', 'o', 'r', 'd']
        if len(split) == 1:
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            pair_freqs[pair] += freq  # accumulate the pair's frequency
    return pair_freqs


# Example
pair_freqs = compute_pair_freqs(splits)

for i, key in enumerate(pair_freqs.keys()):
    print(f"{key}: {pair_freqs[key]}")
    if i >= 5:
        break
'''
('T', 'h'): 3
('h', 'i'): 3
('i', 's'): 5
('Ġ', 'i'): 2
('Ġ', 't'): 7
('t', 'h'): 3
'''

# Take the most frequent pair
best_pair = ""
max_freq = None

for pair, freq in pair_freqs.items():
    if max_freq is None or max_freq < freq:
        best_pair = pair
        max_freq = freq

print(best_pair, max_freq)
# ('Ġ', 't') 7

# Merge it and add it to the vocabulary
merges = {("Ġ", "t"): "Ġt"}
vocab.append("Ġt")
```
Apply the merged pair to the splits (note: splits, not vocab):
```python
def merge_pair(a, b, splits):
    for word in word_freqs:
        split = splits[word]  # get the split for word, e.g. ['w', 'o', 'r', 'd']
        if len(split) == 1:
            continue

        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                # found the pair: join a and b into one token and splice the list
                split = split[:i] + [a + b] + split[i + 2 :]
            else:
                i += 1
        splits[word] = split  # update, e.g. ['w', 'o', 'r', 'd'] -> ['wo', 'r', 'd']
    return splits


splits = merge_pair("Ġ", "t", splits)
print(splits["Ġtrained"])
# ['Ġt', 'r', 'a', 'i', 'n', 'e', 'd']
```
Build the training loop:
```python
vocab_size = 50

while len(vocab) < v ...
```
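The block above is cut off mid-loop; the following is my reconstruction of how such a loop can continue, reusing the compute_pair_freqs and merge_pair helpers defined earlier (a sketch, not the original notes):

```python
vocab_size = 50

while len(vocab) < vocab_size:
    pair_freqs = compute_pair_freqs(splits)
    # pick the most frequent pair
    best_pair, max_freq = "", None
    for pair, freq in pair_freqs.items():
        if max_freq is None or max_freq < freq:
            best_pair, max_freq = pair, freq
    # apply the merge and record it
    splits = merge_pair(*best_pair, splits)
    merges[best_pair] = best_pair[0] + best_pair[1]
    vocab.append(best_pair[0] + best_pair[1])
```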
HF Course 07 NER QA Tokenizer
(Reminder: formatting and section dividers still to be done.)
QA part
In this chapter we need special tokenization to handle the particularities of NER and QA task data.
Fast tokenizers
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
encoding = tokenizer(example)
print(type(encoding))
# <class 'transformers.tokenization_utils_base.BatchEncoding'>
```
The object returned after tokenization is not just a simple dict mapping;
it also carries a number of methods.
```python
tokenizer.is_fast, encoding.is_fast
# (True, True)

encoding.tokens()
'''
['[CLS]', 'My', 'name', 'is', 'S', '##yl', '##va', '##in', 'and', 'I', 'work', 'at', 'Hu', '##gging', 'Face', 'in',
 'Brooklyn', '.', '[SEP]']
'''

encoding.word_ids()
'''
[None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]
'''
```
The word_ids() method shows which word each token came from.
Finally, we can use word_to_chars() or token_to_chars(), and char_to_word() or char_to_token(), to map between words, tokens, and characters.
```python
start, end = encoding.word_to_chars(3)
example[start:end]
# Sylvain
```
NER
In NER we use offsets to pin tokens to the characters of the original text.
The pipeline approach: first, look at how the pipeline API handles NER.
```python
from transformers import pipeline

token_classifier = pipeline("token-classification")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")
'''
[{'entity': 'I-PER', 'score': 0.9993828, 'index': 4, 'word': 'S', 'start': 11, 'end': 12},
 {'entity': 'I-PER', 'score': 0.99815476, 'index': 5, 'word': '##yl', 'start': 12, 'end': 14},
 {'entity': 'I-PER', 'score': 0.99590725, 'index': 6, 'word': '##va', 'start': 14, 'end': 16},
 {'entity': 'I-PER', 'score': 0.9992327, 'index': 7, 'word': '##in', 'start': 16, 'end': 18},
 {'entity': 'I-ORG', 'score': 0.97389334, 'index': 12, 'word': 'Hu', 'start': 33, 'end': 35},
 {'entity': 'I-ORG', 'score': 0.976115, 'index': 13, 'word': '##gging', 'start': 35, 'end': 40},
 {'entity': 'I-ORG', 'score': 0.98879766, 'index': 14, 'word': 'Face', 'start': 41, 'end': 45},
 {'entity': 'I-LOC', 'score': 0.99321055, 'index': 16, 'word': 'Brooklyn', 'start': 49, 'end': 57}]
'''
```
The aggregated version:
```python
from transformers import pipeline

token_classifier = pipeline("token-classification", aggregation_strategy="simple")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")
'''
[{'entity_group': 'PER', 'score': 0.9981694, 'word': 'Sylvain', 'start': 11, 'end': 18},
 {'entity_group': 'ORG', 'score': 0.97960204, 'word': 'Hugging Face', 'start': 33, 'end': 45},
 {'entity_group': 'LOC', 'score': 0.99321055, 'word': 'Brooklyn', 'start': 49, 'end': 57}]
'''
```
aggregation_strategy accepts several options; simple averages the scores of the tokens after tokenization.
For example, the score for Sylvain above comes from averaging the four token scores of 'S', '##yl', '##va', '##in' in the plain version.
"first", where the score of each entity is the score of the first token of that entity (so for “Sylvain” it would be 0.993828, the score of the token S)
"max", where the score of each entity is the maximum score of the tokens in that entity (so for “Hugging Face” it would be 0.98879766, the score of “Face”)
"average", where the score of each entity is the average of the scores of the words composing that entity (so for “Sylvain” there would be no difference from the "simple" strategy, but “Hugging Face” would have a score of 0.9819, the average of the scores for “Hugging”, 0.975, and “Face”, 0.98879)
Logits: here we apply argmax(-1) to the returned logits to get the predicted class for each token.
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)
```

```python
print(inputs["input_ids"].shape)
print(outputs.logits.shape)
'''
torch.Size([1, 19])
torch.Size([1, 19, 9])
'''
```

```python
import torch

probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist()
predictions = outputs.logits.argmax(dim=-1)[0].tolist()
print(predictions)
# [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 6, 0, 8, 0, 0]

model.config.id2label
'''
{0: 'O', 1: 'B-MISC', 2: 'I-MISC', 3: 'B-PER', 4: 'I-PER', 5: 'B-ORG', 6: 'I-ORG', 7: 'B-LOC', 8: 'I-LOC'}
'''
```
Post-processing with offsets: let's organize the format and reproduce the pipeline output above.
```python
results = []
tokens = inputs.tokens()

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        results.append(
            {"entity": label, "score": probabilities[idx][pred], "word": tokens[idx]}
        )

print(results)
'''
[{'entity': 'I-PER', 'score': 0.9993828, 'index': 4, 'word': 'S'},
 {'entity': 'I-PER', 'score': 0.99815476, 'index': 5, 'word': '##yl'},
 {'entity': 'I-PER', 'score': 0.99590725, 'index': 6, 'word': '##va'},
 {'entity': 'I-PER', 'score': 0.9992327, 'index': 7, 'word': '##in'},
 {'entity': 'I-ORG', 'score': 0.97389334, 'index': 12, 'word': 'Hu'},
 {'entity': 'I-ORG', 'score': 0.976115, 'index': 13, 'word': '##gging'},
 {'entity': 'I-ORG', 'score': 0.98879766, 'index': 14, 'word': 'Face'},
 {'entity': 'I-LOC', 'score': 0.99321055, 'index': 16, 'word': 'Brooklyn'}]
'''
```
Offsets: offset_mapping
```python
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
inputs_with_offsets["offset_mapping"]
'''
[(0, 0), (0, 2), (3, 7), (8, 10), (11, 12), (12, 14), (14, 16), (16, 18), (19, 22), (23, 24), (25, 29), (30, 32),
 (33, 35), (35, 40), (41, 45), (46, 48), (49, 57), (57, 58), (0, 0)]
'''
```
The 19 tuples here are the character spans of the 19 tokens:
```python
['[CLS]', 'My', 'name', 'is', 'S', '##yl', '##va', '##in', 'and', 'I', 'work', 'at', 'Hu', '##gging', 'Face', 'in',
 'Brooklyn', '.', '[SEP]']
```
For example, (0, 0) is reserved for [CLS]; the sixth token is '##yl', so its span in the original text is (12, 14), as shown below.
```python
example[12:14]
# yl
```
Continuing our reproduction of the pipeline:
```python
results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        start, end = offsets[idx]
        results.append(
            {
                "entity": label,
                "score": probabilities[idx][pred],
                "word": tokens[idx],
                "start": start,
                "end": end,
            }
        )

print(results)
'''
[{'entity': 'I-PER', 'score': 0.9993828, 'index': 4, 'word': 'S', 'start': 11, 'end': 12},
 {'entity': 'I-PER', 'score': 0.99815476, 'index': 5, 'word': '##yl', 'start': 12, 'end': 14},
 {'entity': 'I-PER', 'score': 0.99590725, 'index': 6, 'word': '##va', 'start': 14, 'end': 16},
 {'entity': 'I-PER', 'score': 0.9992327, 'index': 7, 'word': '##in', 'start': 16, 'end': 18},
 {'entity': ...
'''
```
HF Course 06 Training a Tokenizer from an Existing One
This approach takes an old model's tokenizer and trains a new one on your own corpus. Body-snatching, basically.
Here we use the GPT-2 tokenizer as our example; it tokenizes with byte-level BPE.
Loading the data
```python
from datasets import load_dataset

# This can take a few minutes to load, so grab a coffee or tea while you wait!
raw_datasets = load_dataset("code_search_net", "python")

raw_datasets["train"]
'''
Dataset({
    features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language',
      'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name',
      'func_code_url'
    ],
    num_rows: 412178
})
'''
```
Loading data with a generator: the approach below would load the whole dataset at once.
```python
# Don't uncomment the following line unless your dataset is small!
# training_corpus = [raw_datasets["train"][i: i + 1000]["whole_func_string"] for i in range(0, len(raw_datasets["train"]), 1000)]
```
Normally we use a Python generator instead.
```python
training_corpus = (
    raw_datasets["train"][i : i + 1000]["whole_func_string"]
    for i in range(0, len(raw_datasets["train"]), 1000)
)
```
Swapping the list comprehension's square brackets for parentheses turns it into a generator. Neat.
```python
gen = (i for i in range(10))
print(list(gen))
print(list(gen))
'''
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[]
'''
```
As shown above, a generator is exhausted after a single pass.
A more general generator:
```python
def get_training_corpus():
    dataset = raw_datasets["train"]
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples["whole_func_string"]
```
train_new_from_iterator(): loading the model
```python
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

tokens = old_tokenizer.tokenize(example)
tokens
'''
['def', 'Ġadd', '_', 'n', 'umbers', '(', 'a', ',', 'Ġb', '):', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo',
 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`', '."', '""', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']
'''
```
This tokenizer has a few special symbols, like Ġ and Ċ, which denote spaces and newlines, respectively.
Ġ marks a space and Ċ a newline. The tokenizer also encodes each space in a run of spaces separately and does not recognize words with underscores, so it is not a great fit here.
Training the new tokenizer
```python
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)
```
Note that only fast tokenizers support train_new_from_iterator; they are backed by Rust, while non-fast tokenizers are written in pure Python.
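A quick check (illustrative) that the GPT-2 tokenizer we loaded is indeed a fast one, and therefore supports this method:

```python
print(old_tokenizer.is_fast)
# True
```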
```python
tokens = tokenizer.tokenize(example)
tokens
'''
['def', 'Ġadd', '_', 'numbers', '(', 'a', ',', 'Ġb', '):', 'ĊĠĠĠ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`',
 'a', '`', 'Ġand', 'Ġ`', 'b', '`."""', 'ĊĠĠĠ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']
'''
```
At least it has learned to handle runs of spaces now.
Saving the new tokenizer
```python
tokenizer.save_pretrained("code-search-net-tokenizer")
```
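If needed later, the saved tokenizer can be reloaded from that directory (a usage sketch following the save call above):

```python
from transformers import AutoTokenizer

# Reload the tokenizer saved to "code-search-net-tokenizer"
reloaded_tokenizer = AutoTokenizer.from_pretrained("code-search-net-tokenizer")
```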