HF Course 05: Semantic Search with FAISS

After building our own dataset, we can search it using FAISS together with the Hugging Face libraries.

We embed the data with the multi-qa-mpnet-base-dot-v1 model, then build a FAISS index over the embeddings.

Finally, we run our query through the tokenizer, feed it to the model, and retrieve the entries that best match the question.
Fortunately, there’s a library called sentence-transformers that is dedicated to creating embeddings. As described in the library’s documentation, our use case is an example of asymmetric semantic search because we have a short query whose answer we’d like to find in a longer document, like an issue comment. The handy model overview table in the documentation indicates that the multi-qa-mpnet-base-dot-v1 checkpoint has the best performance for semantic search, so we’ll use that for our application.
The two extra libraries we rely on are sentence-transformers and faiss.
Loading the model

```python
from transformers import AutoTokenizer, AutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)
```

Processing the data

```python
import torch

device = torch.device("cuda")
model.to(device)


def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]


def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)
```

Adding a FAISS index

```python
embeddings_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).detach().cpu().numpy()[0]}
)
embeddings_dataset.add_faiss_index(column="embeddings")
```

Testing a query

```python
question = "How can I load a dataset offline?"
question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape
# torch.Size([1, 768])
```

```python
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)
```

Inspecting the results

```python
import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)
```

```python
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()
```

This prints the comments that best match the question:
```python
"""
COMMENT: Requiring online connection is a deal breaker in some cases unfortunately so it'd be great if offline mode is added similar to how `transformers` loads models offline fine.

@mandubian's second bullet point suggests that there's a workaround allowing you to use your offline (custom?) dataset with `datasets`. Could you please elaborate on how that should look like?
SCORE: 25.505046844482422
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================

COMMENT: The local dataset builders (csv, text , json and pandas) are now part of the `datasets` package since #1726 :)
You can now use them offline
\`\`\`python
datasets = load_dataset("text", data_files=data_files)
\`\`\`

We'll do a new release soon
SCORE: 24.555509567260742
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================

COMMENT: I opened a PR that allows to reload modules that have already been loaded once even if there's no internet.

Let me know if you know other ways that can make the offline mode experience better. I'd be happy to add them :)

I already note the "freeze" modules option, to prevent local modules updates. It would be a cool feature.

----------

> @mandubian's second bullet point suggests that there's a workaround allowing you to use your offline (custom?) dataset with `datasets`. Could you please elaborate on how that should look like?

Indeed `load_dataset` allows to load remote dataset script (squad, glue, etc.) but also you own local ones.
For example if you have a dataset script at `./my_dataset/my_dataset.py` then you can do
\`\`\`python
load_dataset("./my_dataset")
\`\`\`
and the dataset script will generate your dataset once and for all.

----------

About I'm looking into having `csv`, `json`, `text`, `pandas` dataset builders already included in the `datasets` package, so that they are available offline by default, as opposed to the other datasets that require the script to be downloaded.
cf #1724
SCORE: 24.14896583557129
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================

COMMENT: > here is my way to load a dataset offline, but it **requires** an online machine
>
> 1. (online machine)
>
import datasets
data = datasets.load_dataset(…)
data.save_to_disk(/YOUR/DATASET/DIR)

2. copy the dir from online to the offline machine
3. (offline machine)
import datasets
data = datasets.load_from_disk(/SAVED/DATA/DIR)

HTH.
SCORE: 22.893993377685547
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================

COMMENT: here is my way to load a dataset offline, but it **requires** an online machine
1. (online machine)
\`\`\`
import datasets
data = datasets.load_dataset(...)
data.save_to_disk(/YOUR/DATASET/DIR)
\`\`\`
2. copy the dir from online to the offline machine
3. (offline machine)
\`\`\`
import datasets
data = datasets.load_from_disk(/SAVED/DATA/DIR)
\`\`\`
HTH.
SCORE: 22.406635284423828
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================
"""
```
HF Course 04 Dataset
Loading local data
| Data format | Loading script | Example |
| --- | --- | --- |
| CSV & TSV | `csv` | `load_dataset("csv", data_files="my_file.csv")` |
| Text files | `text` | `load_dataset("text", data_files="my_file.txt")` |
| JSON & JSON Lines | `json` | `load_dataset("json", data_files="my_file.jsonl")` |
| Pickled DataFrames | `pandas` | `load_dataset("pandas", data_files="my_dataframe.pkl")` |
In each case you need to specify the data format and the file path.

The data_files argument

The data_files argument of the load_dataset() function is quite flexible and can be either a single file path, a list of file paths, or a dictionary that maps split names to file paths. You can also glob files that match a specified pattern according to the rules used by the Unix shell (e.g., you can glob all the JSON files in a directory as a single split by setting data_files="*.json"). See the 🤗 Datasets documentation for more details.

- It can be a single file path (or a list of paths).
- It can map split names to files, so the data comes back as the DatasetDict you want.

```python
data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset
'''
DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})
'''
```
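The other two forms mentioned above, as a small sketch (the file names simply reuse the SQuAD-it example; both calls are illustrative):

```python
# data_files as a list of files, or as a Unix-style glob pattern
squad_it_dataset = load_dataset("json", data_files=["SQuAD_it-train.json", "SQuAD_it-test.json"], field="data")
squad_it_dataset = load_dataset("json", data_files="SQuAD_it-*.json", field="data")
```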
Loading data from a remote server

```python
url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
```

Data processing: delimiters

If your data is not a conventional comma-separated CSV file, you can specify the delimiter yourself:

```python
from datasets import load_dataset

data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")
```

Sampling at random

```python
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))
# Peek at the first few examples
drug_sample[:3]
'''
{'Unnamed: 0': [87571, 178045, 80482],
 'drugName': ['Naproxen', 'Duloxetine', 'Mobic'],
 'condition': ['Gout, Acute', 'ibromyalgia', 'Inflammatory Conditions'],
 'review': ['"like the previous person mention, I'm a strong believer of aleve, it works faster for my gout than the prescription meds I take. No more going to the doctor for refills.....Aleve works!"',
  '"I have taken Cymbalta for about a year and a half for fibromyalgia pain. It is great\r\nas a pain reducer and an anti-depressant, however, the side effects outweighed \r\nany benefit I got from it. I had trouble with restlessness, being tired constantly,\r\ndizziness, dry mouth, numbness and tingling in my feet, and horrible sweating. I am\r\nbeing weaned off of it now. Went from 60 mg to 30mg and now to 15 mg. I will be\r\noff completely in about a week. The fibro pain is coming back, but I would rather deal with it than the side effects."',
  '"I have been taking Mobic for over a year with no side effects other than an elevated blood pressure. I had severe knee and ankle pain which completely went away after taking Mobic. I attempted to stop the medication however pain returned after a few days."'],
 'rating': [9.0, 3.0, 10.0],
 'date': ['September 2, 2015', 'November 7, 2011', 'June 5, 2013'],
 'usefulCount': [36, 13, 128]}
'''
```

Renaming columns

```python
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="patient_id"
)
drug_dataset
'''
DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})
'''
```
A quick aside on anonymous (lambda) expressions:

```python
(lambda base, height: 0.5 * base * height)(4, 8)
# 16
```
Lowercasing a column

```python
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}


drug_dataset.map(lowercase_condition)
'''
AttributeError: 'NoneType' object has no attribute 'lower'
'''
```

This raises an error because some condition values are None.

Filtering with Dataset.filter

```python
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)
drug_dataset = drug_dataset.map(lowercase_condition)
# Check that lowercasing worked
drug_dataset["train"]["condition"][:3]
'''
['left ventricular dysfunction', 'adhd', 'birth control']
'''
```

filter keeps only the samples that satisfy the predicate.

Adding new columns

```python
def compute_review_length(example):
    return {"review_length": len(example["review"].split())}


drug_dataset = drug_dataset.map(compute_review_length)
# Inspect the first training example
drug_dataset["train"][0]
'''
{'patient_id': 206461,
 'drugName': 'Valsartan',
 'condition': 'left ventricular dysfunction',
 'review': '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
 'rating': 9.0,
 'date': 'May 20, 2012',
 'usefulCount': 27,
 'review_length': 17}
'''
```

A quick look at sort:

```python
drug_dataset["train"].sort("review_length")[:3]
'''
{'patient_id': [103488, 23627, 20558],
 'drugName': ['Loestrin 21 1 / 20', 'Chlorzoxazone', 'Nucynta'],
 'condition': ['birth control', 'muscle spasm', 'pain'],
 'review': ['"Excellent."', '"useless"', '"ok"'],
 'rating': [10.0, 1.0, 6.0],
 'date': ['November 4, 2008', 'March 24, 2017', 'August 20, 2016'],
 'usefulCount': [5, 2, 10],
 'review_length': [1, 1, 1]}
'''
```
sort also takes a reverse option; for serious EDA, though, Pandas is still the more convenient choice. Check the configurable parameters in the docs.
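A small sketch of that reverse option (assuming a reasonably recent datasets release):

```python
# Longest reviews first instead of shortest
drug_dataset["train"].sort("review_length", reverse=True)[:3]
```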
One more addition: Dataset.add_column().
An alternative way to add new columns to a dataset is with the Dataset.add_column() function. This allows you to provide the column as a Python list or NumPy array and can be handy in situations where Dataset.map() is not well suited for your analysis.
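A minimal sketch of Dataset.add_column(); the column added here is a made-up placeholder just to show the call:

```python
import numpy as np

# The new column needs one value per row
dummy_score = np.zeros(len(drug_dataset["train"]))
train_with_dummy = drug_dataset["train"].add_column("dummy_score", dummy_score)
```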
Unescaping HTML characters

```python
import html

text = "I&#039;m a transformer called BERT"
html.unescape(text)
'''
"I'm a transformer called BERT"
'''
```

```python
drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x["review"])})
```

Batched map

When you specify batched=True, the function receives a dictionary with the fields of the dataset, but each value is now a list of values, and not just a single value. The return value of Dataset.map() should be the same: a dictionary with the fields we want to update or add to our dataset, and a list of values.

```python
new_drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True
)
```

With batched=True, each call receives a batch as a dictionary of lists; we usually just return the updated fields for that batch.
If you’re running this code in a notebook, you’ll see that this command executes way faster than the previous one. And it’s not because our reviews have already been HTML-unescaped — if you re-execute the instruction from the previous section (without batched=True), it will take the same amount of time as before. This is because list comprehensions are usually faster than executing the same code in a for loop, and we also gain some performance by accessing lots of elements at the same time instead of one by one.
The earlier per-example version effectively runs a Python for loop; with batching we can use a list comprehension over the whole batch, which is much faster.

Using map with a tokenizer

```python
def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )


result = tokenize_and_split(drug_dataset["train"][0])
[len(inp) for inp in result["input_ids"]]
# [128, 49]
```

The return_overflowing_tokens argument keeps the truncated part as additional chunks; here a review of 177 tokens becomes two chunks of 128 and 49.

```python
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
```
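Since each review can now produce several chunks, this map may complain that the new columns have a different length than the existing ones; the fix used in the course is to drop the old columns while mapping, roughly:

```python
tokenized_dataset = drug_dataset.map(
    tokenize_and_split, batched=True, remove_columns=drug_dataset["train"].column_names
)
```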
Converting to other formats: Pandas

To enable the conversion between various third-party libraries, 🤗 Datasets provides a Dataset.set_format() function. This function only changes the output format of the dataset, so you can easily switch to another format without affecting the underlying data format, which is Apache Arrow. The formatting is done in place. To demonstrate, let’s convert our dataset to Pandas:

```python
drug_dataset.set_format("pandas")
```

You would then typically take a full slice, train_df = drug_dataset["train"][:], to get the whole split as a new DataFrame; try it yourself to check whether the returned object is still an HF Dataset or a pandas DataFrame.

```python
from datasets import Dataset

freq_dataset = Dataset.from_pandas(frequencies)
freq_dataset
'''
Dataset({
    features: ['condition', ...
```
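Once you are done with the DataFrame view, you can switch the output format back; a one-line sketch:

```python
# Restore the default Arrow output format
drug_dataset.reset_format()
```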
NLP Baseline 01: Translation

Training from scratch is rarely better than fine-tuning; if you have more money than Google & Meta, pretend I said nothing.

To do:

Accelerator
get_scheduler
custom_wandb
Translation

Here we use a zh-en dataset and model for the translation task.

You will need to register a wandb account for this; don't forget.

A first look

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from datasets import load_dataset

prx = {'https': 'http://127.0.0.1:7890'}
model_name = "Helsinki-NLP/opus-mt-zh-en"
save_path = r'D:\00mydataset\huggingface model'
data_path = r'D:\00mydataset\huggingface dataset'

tokenizer = AutoTokenizer.from_pretrained(model_name, proxies=prx, cache_dir=save_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, proxies=prx, cache_dir=save_path)

dataset = load_dataset('news_commentary', 'en-zh', cache_dir=data_path)
dataset
'''
DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 69206
    })
})
'''
```

Going through a proxy here speeds up the download.

```python
tokenizer
'''
PreTrainedTokenizer(name_or_path='Helsinki-NLP/opus-mt-zh-en', vocab_size=65001, model_max_len=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'})
'''

dataset['train'][1]['translation']
'''
{'id': '1',
 'translation': {'en': 'PARIS – As the economic crisis deepens and widens, the world has been searching for historical analogies to help us understand what has been happening. At the start of the crisis, many people likened it to 1982 or 1973, which was reassuring, because both dates refer to classical cyclical downturns.',
  'zh': '巴黎-随着经济危机不断加深和蔓延,整个世界一直在寻找历史上的类似事件希望有助于我们了解目前正在发生的情况。一开始,很多人把这次危机比作1982年或1973年所发生的情况,这样得类比是令人宽心的,因为这两段时期意味着典型的周期性衰退。'}}
'''
```

Looking at the data, each example comes back as a dict; we mainly use the en and zh fields under translation.

```python
s1 = '天下第一美少女, 罢了'
inputs = tokenizer(s1, return_tensors='pt',)
inputs
'''
({'input_ids': tensor([[ 9705,   359,  3615,  2797, 14889,     2,     7, 40798,     0]]),
  'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])},)
'''

outputs = model.generate(**inputs)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
'''
["The most beautiful girl in the world, that's all."]
'''
```

The output looks decent.

Note that what AutoModelForSeq2SeqLM adds over AutoModel is the model.generate() method; otherwise, calling model(**inputs) would also require you to supply the target-language labels.
If you are using a multilingual tokenizer such as mBART, mBART-50, or M2M100, you will need to set the language codes of your inputs and targets in the tokenizer by setting tokenizer.src_lang and tokenizer.tgt_lang to the right values.
If you use a multilingual model, you have to set the source- and target-language codes.
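For instance, with mBART-50 the codes are set directly on the tokenizer. A hedged sketch (this checkpoint and these codes are illustrative, not part of the pipeline above):

```python
from transformers import AutoTokenizer

mbart_tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
mbart_tokenizer.src_lang = "zh_CN"  # source language code
mbart_tokenizer.tgt_lang = "en_XX"  # target language code
```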
Preprocessing

```python
split_datasets = raw_datasets["train"].train_test_split(train_size=0.9, seed=20)
split_datasets
'''
DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 189155
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 21018
    })
})
'''

split_datasets["validation"] = split_datasets.pop("test")
```

An HF Dataset can call .train_test_split(train_size=0.9, seed=20) directly.

It can also be converted straight to a DataFrame, so you can work with scikit-learn as well.

Finally, the test split is renamed to validation.
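A quick sketch of the DataFrame route:

```python
# Dataset -> pandas DataFrame, ready for pandas EDA or scikit-learn utilities
train_df = split_datasets["train"].to_pandas()
train_df.head()
```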
DataCollatorForSeq2Seq

```python
max_length = 128


def preprocess_function(examples):
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["fr"] for ex in examples["translation"]]
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )
    return model_inputs


tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)
'''
The columns ['id', 'translation'] would otherwise still be there; the setting below drops them.
remove_columns: Remove a selection of columns while doing the mapping. Columns will be removed
before updating the examples with the output of `function`, i.e. if `function` is adding columns
with names in `remove_columns`, these columns will be kept.
'''
```
We don’t pay attention to the attention mask of the targets, as the model won’t expect it. Instead, the labels corresponding to a padding token should be set to -100 so they are ignored in the loss computation. This will be done by our data collator later on since we are applying dynamic padding, but if you use padding here, you should adapt the preprocessing function to set all labels that correspond to the padding token to -100.
We do not pad here and we ignore the targets' attention mask; later, the label positions that correspond to padding are set to -100 so they are excluded from the loss. All of that happens in the next step.
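If you do pad during preprocessing instead, a minimal sketch of the extra step (the variable names follow the function above):

```python
import torch

# Only needed when padding here rather than in the data collator
labels = torch.tensor(model_inputs["labels"])
labels[labels == tokenizer.pad_token_id] = -100
model_inputs["labels"] = labels.tolist()
```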
P.S. I ran into a bug today, probably because my installation was not up to date: there were no labels in the tokenizer output. If you hit the same bug, swap in the following:

```python
def preprocess_function(examples):
    inputs = [ex["zh"] for ex in examples["translation"]]
    targets = [ex["en"] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)
```

```python
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 3)])
batch.keys()
# dict_keys(['attention_mask', 'input_ids', 'labels', 'decoder_input_ids'])

batch["labels"]
'''
tensor([[57483, 7, 3241, 403, 3, 289, 1817, 25787, 22, 6, 38697, 22, 2, 3,
         426, 64, 72, 27734, 14, 9054, 56467, 6667, 8, 721, 512, 2498, 209, 64,
         72, 11468, 5, 393, 3, 2597, 4, 3, 1817, 2, 469, 235, 238, 24898,
         39, 8, 13579, 50, 17528, 2, 60, 42, 56548, 2, 695, 443, 10119, 5543,
         8, 53617, 7, 38261, 40490, 22, 5, 0],
        [24, 22026, 30, 2329, 10349, 22901, 20, 52813, 17, 50, 12, 29940, 4, 3,
         2121, 20, 1843, 45, 67, 243, 1945, 30, 368, 36681, 10, 3, 1796, 4,
         14961, 2203, 6, 28291, 3, 22986, 2, 11355, 3, 3368, 64, 8700, 18, 469,
         38575, 10, 278, 54, 8, 4291, 57, 22301, 1718, 8, 959, 30229, 1294, 6855,
         4298, 5, 0, -100, -100, -100, -100, -100]])
'''

# Look at the original labels
for i in range(1, 3):
    print(tokenized_datasets["train"][i]["labels"])
'''
[57483, 7, 3241, 403, 3, 289, 1817, 25787, 22, 6, 38697, 22, 2, 3, 426, 64, 72, 27734, 14, 9054, 56467, 6667, 8, 721, 512, 2498, 209, 64, 72, 11468, 5, 393, 3, 2597, 4, 3, 1817, 2, 469, 235, 238, 24898, 39, 8, 13579, 50, 17528, 2, 60, 42, 56548, 2, 695, 443, 10119, 5543, 8, 53617, 7, 38261, 40490, 22, 5, 0]
[24, 22026, 30, 2329, 10349, 22901, 20, 52813, 17, 50, 12, 29940, 4, 3, 2121, 20, 1843, 45, 67, 243, 1945, 30, 368, 36681, 10, 3, 1796, 4, 14961, 2203, 6, 28291, 3, 22986, 2, 11355, 3, 3368, 64, 8700, 18, 469, 38575, 10, 278, 54, 8, 4291, 57, 22301, 1718, 8, 959, 30229, 1294, 6855, 4298, 5, 0]
'''
```
You can see that the padded positions have all become -100. PyTorch has the same convention; see my earlier Transformer post.
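A minimal illustration of that PyTorch-side convention:

```python
import torch
import torch.nn as nn

# CrossEntropyLoss skips every target equal to ignore_index (the default is already -100)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
logits = torch.randn(2, 5, 10)                      # (batch, seq_len, vocab)
targets = torch.tensor([[1, 2, 3, -100, -100],
                        [4, 5, -100, -100, -100]])
loss = loss_fn(logits.transpose(1, 2), targets)     # the loss expects (batch, vocab, seq_len)
```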
This is all done by a DataCollatorForSeq2Seq. Like the DataCollatorWithPadding, it takes the tokenizer used to preprocess the inputs, but it also takes the model. This is because this data collator will also be responsible for preparing the decoder input IDs, which are shifted versions of the labels with a special token at the beginning. Since this shift is done slightly differently for different architectures, the DataCollatorForSeq2Seq needs to know the model object. It all gets fairly involved.
Metrics
One weakness with BLEU is that it expects the text to already be tokenized, which makes it difficult to compare scores between models that use different tokenizers. So instead, the most commonly used metric for benchmarking translation models today is SacreBLEU, which addresses this weakness (and others) by standardizing the tokenization step
Here we use SacreBLEU as the evaluation metric.

```python
!pip install sacrebleu

import evaluate

metric = evaluate.load("sacrebleu")
```

Example 1

```python
predictions = [
    "This plugin lets you translate web pages between several languages automatically."
]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)
'''
{'score': 46.750469682990165,
 'counts': [11, 6, 4, 3],
 'totals': [12, 11, 1 ...
```

Huggingface Tokenizer in Detail
Main Types of Tokenizers
Byte-Pair Encoding (BPE)
WordPiece
SentencePiece
spaCy and Moses are two popular rule-based tokenizers.
Subword
Word-level vocabularies get too large, while character-level tokens carry too little meaning.
So to get the best of both worlds, transformers models use a hybrid between word-level and character-level tokenization called subword tokenization.
Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords.
Low-frequency words (proper nouns and the like) get decomposed into smaller, meaningful subwords.

For example:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.tokenize("I have a new GPU!")
# ["i", "have", "a", "new", "gp", "##u", "!"]
```

A smaller vocabulary means losing more word-level meaning; the trade-off between semantics and compute leads to the algorithms below.

Byte-Pair Encoding (BPE)

Used by GPT, GPT-2, RoBERTa, XLM, and FlauBERT.

BPE was introduced in Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015).
After pre-tokenization, a set of unique words has been created and the frequency of each word it occurred in the training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so until the vocabulary has attained the desired vocabulary size. Note that the desired vocabulary size is a hyperparameter to define before training the tokenizer.
First count word frequencies → build a base vocabulary from the symbols that occur in those words → learn merge rules that combine two symbols into a new one → keep merging on top of the updated vocabulary → stop once the vocabulary reaches the desired vocabulary size, which is a hyperparameter chosen before training the tokenizer.

For example (the numbers are frequencies):
("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
First we get the base vocabulary ["b", "g", "h", "n", "p", "s", "u"].

The pair "u" + "g" occurs 20 times (hug 10 + pug 5 + hugs 5), so it is the first to be merged:
("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)
The next most frequent pair is "u" + "n" (16 times), followed by "h" + "ug" (15 times), which gives "hug":
("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)
As a side note: For instance, the word "bug" would be tokenized to ["b", "ug"], but "mug" would be tokenized as ["<unk>", "ug"], since the symbol "m" is not in the base vocabulary.
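A toy sketch of the pair-counting and merge step described above (an illustration only, not the actual tokenizers implementation):

```python
from collections import Counter

# Toy corpus: each word is a tuple of its current symbols, mapped to its frequency
corpus = {("h", "u", "g"): 10, ("p", "u", "g"): 5, ("p", "u", "n"): 12,
          ("b", "u", "n"): 4, ("h", "u", "g", "s"): 5}


def most_frequent_pair(corpus):
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]


def merge_pair(corpus, pair):
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])  # fuse the two symbols
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged


pair = most_frequent_pair(corpus)   # ('u', 'g'), seen 20 times
corpus = merge_pair(corpus, pair)   # ("h","ug"), ("p","ug"), ("p","u","n"), ("b","u","n"), ("h","ug","s")
```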
The final vocabulary size is the base vocabulary size plus the number of merges.
For instance GPT has a vocabulary size of 40,478 since they have 478 base characters and chose to stop training after 40,000 merges.
GPT-2 has a vocabulary size of 50,257, which corresponds to the 256 bytes base tokens, a special end-of-text token and the symbols learned with 50,000 merges.
Unigram

Unigram was introduced in Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Kudo, 2018).

It is used in conjunction with SentencePiece.
In contrast to BPE or WordPiece, Unigram initializes its base vocabulary to a large number of symbols and progressively trims down each symbol to obtain a smaller vocabulary.
At each training step, the Unigram algorithm defines a loss (often defined as the log-likelihood) over the training data given the current vocabulary and a unigram language model. Then, for each symbol in the vocabulary, the algorithm computes how much the overall loss would increase if the symbol was to be removed from the vocabulary. Unigram then removes p (with p usually being 10% or 20%) percent of the symbols whose loss increase is the lowest, i.e. those symbols that least affect the overall loss over the training data. This process is repeated until the vocabulary has reached the desired size. The Unigram algorithm always keeps the base characters so that any word can be tokenized.
It defines a log-likelihood loss over the training data, starts from a large base vocabulary, computes how much the loss would change if each symbol were removed, and repeatedly trims the symbols whose removal hurts the least, until the vocabulary reaches the desired size.
Because Unigram is not based on merge rules (in contrast to BPE and WordPiece), the algorithm has several ways of tokenizing new text after training
WordPiece

Used by BERT, DistilBERT, and ELECTRA; quite the pedigree.
WordPiece was outlined in Japanese and Korean Voice Search (Schuster et al., 2012)
WordPiece first initializes the vocabulary to include every character present in the training data and progressively learns a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.
WordPiece builds its base vocabulary from the characters in the training data and then merges symbols, but it chooses the pair that maximizes the likelihood of the training data rather than the most frequent pair. You can think of it as sitting between BPE and Unigram.

SentencePiece

Used by XLM, ALBERT, XLNet, Marian, and T5.
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo et al., 2018) treats the input as a raw input stream, thus including the space in the set of characters to use. It then uses the BPE or unigram algorithm to construct the appropriate vocabulary.
All transformers models in the library that use SentencePiece use it in combination with unigram
In short, it treats the raw input (spaces included) as a stream of Unicode characters and then applies BPE or Unigram on top of it.

Customizing a tokenizer

As seen above, different models choose different tokenization algorithms; pick the one that fits your needs.

```python
from tokenizers import models, Tokenizer

# Of course, you can also get a ready-made tokenizer directly from a model name,
# which is the usual way to do it.
tokenizer = Tokenizer(models.WordPiece())  # or models.BPE(), models.Unigram()
```

With the tokenizer object above you can customize the components below.
Normalization: Executes all the initial transformations over the initial input string. For example when you need to lowercase some text, maybe strip it, or even apply one of the common unicode normalization process, you will add a Normalizer.
This cleans up your text: lowercasing, stripping certain characters, and so on.
Pre-tokenization: In charge of splitting the initial input string. That’s the component that decides where and how to pre-segment the origin string. The simplest example would be to simply split on spaces.
Here you customize the pre-splitting rules.
Model: Handles all the sub-token discovery and generation, this is the part that is trainable and really dependent of your input data.
This is the part that gets trained iteratively on your data.
Post-Processing: Provides advanced construction features to be compatible with some of the Transformers-based SoTA models. For instance, for BERT it would wrap the tokenized sentence around [CLS] and [SEP] tokens.
First, set up the dataset and a batch generator.

```python
from datasets import load_dataset

dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")

batch_size = 1000


def batch_iterator():
    # from 0 to the dataset length, in steps of batch_size
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]
```
WordPiece
pre-processing
```python
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)

# Set up BERT's pre-tokenizer
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
tokenizer.pre_tokenizer.pre_tokenize_str("This is an example!")
'''
[('This', (0, 4)),
 ('is', (5, 7)),
 ('an', (8, 10)),
 ('example', (11, 18)),
 ('!', (18, 19))]
'''
```

Note that the pre-tokenizer not only splits the text into words but also keeps the offsets, i.e. where each word begins and ends inside the original text. These offsets are what make features like QA span extraction possible.
processing
```python
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]

# The trainer is not something we write from scratch; we use the ready-made
# trainer class and pass in our special_tokens.
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)

# Start training
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)
```

At this point the tokenizer is trained; next comes post-processing.
post-processing
```python
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")

tokenizer.post_processor = processors.TemplateProcessing(
    single=f"[CLS]:0 $A:0 [SEP]:0",
    pair=f"[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", cls_token_id),
        ("[SEP]", sep_token_id),
    ],
)

# Check the result
encoding = tokenizer.encode("This is one sentence.", "With this one we have a pair.")
encoding.tokens
'''
['[CLS]', 'this', 'is', 'one', 'sentence', '.', '[SEP]',
 'with', 'this', 'one', 'we', 'have', 'a', 'pair', '.', '[SEP]']
'''

encoding.type_ids
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```
We have to indicate in the template how to organize the special tokens with one sentence ($A) or two sentences ($A and $B). The : followed by a number indicates the token type ID to give to each part.
The post-processor wraps the sentences produced in the previous stage: a single sentence gets token type IDs of all 0, while a pair gets 0 for the first sentence and 1 for the second.

Wrapping your tokenizer in a transformers object

```python
from transformers import PreTrainedTokenizerFast

new_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
# The original example used: new_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)
```
BPE
```python
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.pre_tokenizer.pre_tokenize_str("This is an example!")
trainer = trainers.BpeTrainer(vocab ...
```
Transformer From Scratch
To do:

Add a decoder inference module

Preface

The architecture splits into four main blocks:
encoder
encoder-layer
decoder
decoder-layer
Details: three kinds of mask
encoder-mask
decoder-mask
cross-mask
Embedding

A sentence is represented as [token_1, token_2, ..., token_s].

Sentence 1 = [token_1, token_2, ..., token_x]
Sentence 2 = [token_1, token_2, ..., token_y], where x is not necessarily equal to y.

Constructing the tokens

```python
import torch
import torch.nn.functional as F

src_vocab_size = 16
tgt_vocab_size = 16
batch_size = 4
max_len = 6

src_len = torch.randint(2, 7, (batch_size,))
tgt_len = torch.randint(2, 7, (batch_size,))
'''
Example result:
tensor([8, 6, 5, 9])   # the first sentence of this batch has length 8, the second 6
tensor([6, 5, 9, 5])
'''

# Next we pad the variable-length sentences
'''
randint generates data of shape (i,)
pad appends 0 as many times as max_len - len(i) (possibly zero times)
unsqueeze adds a dimension
'''
src_seq = [F.pad(torch.randint(1, src_vocab_size, (i,)), (0, max_len - i)).unsqueeze(0) for i in src_len]
tgt_seq = [F.pad(torch.randint(1, tgt_vocab_size, (i,)), (0, max_len - i)).unsqueeze(0) for i in tgt_len]

# Stack the whole batch together
src_seq = torch.cat(src_seq)
tgt_seq = torch.cat(tgt_seq)
'''
For example:
tensor([[12, 15, 10,  5,  3, 14],
        [ 5,  7,  9,  3, 12,  1],
        [ 3,  1,  1,  9,  3,  4],
        [ 9,  6,  0,  0,  0,  0]])

tensor([[11, 12, 11,  3,  5, 15],
        [ 7,  9, 11,  0,  0,  0],
        [12,  6, 13, 11,  0,  0],
        [13,  3,  0,  0,  0,  0]])
'''
```

Mapping into the embedding space

```python
import torch.nn as nn

d_model = 8
src_embedding = nn.Embedding(src_vocab_size + 1, d_model)
tgt_embedding = nn.Embedding(tgt_vocab_size + 1, d_model)

src_embedding.weight  # shape (17, 8)
'''
Row 0 holds the pad embedding.
Parameter containing:
tensor([[ 2.4606,  1.7139, -0.2859, -0.5058,  0.6229, -0.0470,  2.1517,  0.2996],
        [ 0.0077, -0.4292, -0.2397,  1.2366, -0.3061,  0.9196, -1.4222, -1.6431],
        [-0.6378, -0.7809, -0.4206,  0.5759, -1.4899,  1.2241,  0.9220, -0.6333],
        [ 0.0303, -1.4113,  0.9164, -0.1200,  1.7224, -0.4996, -1.6708, -1.8563],
        [ 0.0235,  0.0155, -0.1292, -0.9274, -1.1351, -0.9155,  0.4391, -0.0437],
        [ 0.8498,  0.4709, -0.9168, -2.1307,  0.1840,  0.3554, -0.3986,  1.2806],
        [ 0.7256,  1.2303, -0.8280, -0.2173,  0.8939,  2.4122,  0.4820, -1.9615],
        [-0.8607,  2.4886, -0.8877, -0.8852,  0.3905,  0.9511, -0.3732,  0.4872],
        [ 0.4882, -0.4518, -0.1945,  0.2857, -0.6832, -0.4870, -1.7165, -2.0987],
        [-0.0512,  0.2692, -1.0003,  0.7896,  0.5004,  0.3594, -1.5923, -1.5618],
        [ 0.4012,  0.1614,  1.8939,  0.3862, -0.6733, -1.2442, -0.6540, -1.6772],
        [ 1.4784,  2.7430,  0.0159,  0.5944, -1.0025,  1.0843,  0.4580, -0.6515],
        [ 0.3905,  0.6118, -0.1256, -0.6725,  1.2366,  0.8272,  0.0838, -1.5124],
        [-0.1470,  0.2149, -1.4561,  1.8008,  0.7764, -0.8517, -0.3204, -0.2550],
        [-1.1534, -0.6837, -1.7165, -1.7905, -1.5423,  1.8812, -0.1794, -0.2357],
        [ 1.3046,  1.5021,  1.4846,  1.0622,  1.4066,  0.7299,  0.7929, -1.0107],
        [-0.3920,  0.7482,  1.5976,  1.7429, -0.4683,  0.2286,  0.1320, -0.5826]],
       requires_grad=True)
'''

src_embedding(src_seq[0])  # picks out the rows for [12, 15, 10, 5, 3, 14]
```

Key mask

In the encoder layer, padded positions would still receive some attention after the softmax. Those pads are not something we want the model to learn from, so we introduce a key mask to help the encoder focus on the positions that actually matter. This is an especially important detail.
Setup

```python
max_len = 6
embed_dim = 8
vacab_max = 5

token_ids = [torch.randint(1, 6, (len,)) for len in torch.randint(1, 7, (max_len,))]
token_ids
'''
[tensor([2, 3, 5, 2, 4]),
 tensor([3, 3, 4, 4, 5, 4]),
 tensor([4, 5, 3]),
 tensor([5, 1, 4]),
 tensor([2, 1, 5, 3]),
 tensor([1, 3, 3, 1])]
'''
```
pad
```python
token_pad_ids = [F.pad(x, (0, max_len - x.shape[0])).unsqueeze(0) for x in token_ids]
token_pad_ids = torch.cat(token_pad_ids)
token_pad_ids
'''
tensor([[2, 3, 5, 2, 4, 0],
        [3, 3, 4, 4, 5, 4],
        [4, 5, 3, 0, 0, 0],
        [5, 1, 4, 0, 0, 0],
        [2, 1, 5, 3, 0, 0],
        [1, 3, 3, 1, 0, 0]])
'''
```

Getting the embeddings

```python
src_embedding = nn.Embedding(vacab_max + 1, embed_dim)
tgt_embedding = nn.Embedding(vacab_max + 1, embed_dim)
src_embedding.weight, tgt_embedding.weight
'''
(Parameter containing:
 tensor([[-1.4019, -0.3245,  0.8569, -1.6555,  1.3478,  0.0979, -1.7458,  1.3138],
         [-0.9099, -0.6957,  0.4430,  0.6305,  0.1099,  0.3213,  0.0841,  0.0786],
         [-0.1215, -1.4141,  0.8802, -0.3444,  0.3444, -1.4063, -0.5057,  0.1506],
         [ 0.9491,  1.7888,  0.3075, -0.6642,  0.3368,  0.3388, -1.2543, -0.8096],
         [ 0.7723, -1.2258, -0.4963,  1.4007, -0.8048, -0.1338,  0.0199,  0.4295],
         [ 1.3789, -0.9537,  0.3421,  0.0658, -0.7578, -0.7217, -1.3124,  1.6017]],
        requires_grad=True),
 Parameter containing:
 tensor([[ 2.0609,  0.7302,  0.9811,  0.7390,  0.7475,  0.2903,  0.0735,  0.3407],
         [ 1.5477, -0.5033,  1.3758, -1.5225,  0.8236,  0.6329, -0.2301,  1.2352],
         [-0.2906, -1.8842, -0.9998,  1.6752,  0.7286, -0.4089, -0.0515,  0.5763],
         [ 0.2128,  0.7354, -0.4248,  0.7142,  0.4635,  1.1675,  0.7193,  1.3474],
         [ 0.3543,  1.2881, -0.8270,  0.6220, -1.6282,  0.1802, -0.9306, -0.2407],
         [-1.3339, -0.4192, -0.0800,  0.1614,  0.7026, -0.6851,  0.2386, -0.4954]],
        requires_grad=True))
'''
```

Checking True/False: here we take the fourth row of token_pad_ids, since it has more zeros (row 0 of the embedding is the pad vector).

```python
pad = src_embedding.weight[0]
src_embedding(token_pad_ids[3]) == pad, token_pad_ids[3]
'''
(tensor([[False, False, False, False, False, False, False, False],
         [False, False, False, False, False, False, False, False],
         [False, False, False, False, False, False, False, False],
         [ True,  True,  True,  True,  True,  True,  True,  True],
         [ True,  True,  True,  True,  True,  True,  True,  True],
         [ True,  True,  True,  True,  True,  True,  True,  True]]),
 tensor([5, 1, 4, 0, 0, 0]))
'''
```
Since the encoder's inputs are all src, Q @ K.T has shape (bs, src_len, src_len), so the mask can be written directly.

```python
a = token_pad_ids[3].unsqueeze(-1)
b = token_pad_ids[3].unsqueeze(0)
torch.matmul(a, b), a.shape, b.shape
'''
(tensor([[25,  5, 20,  0,  0,  0],
         [ 5,  1,  4,  0,  0,  0],
         [20,  4, 16,  0,  0,  0],
         [ 0,  0,  0,  0,  0,  0],
         [ 0,  0,  0,  0,  0,  0],
         [ 0,  0,  0,  0,  0,  0]]),
 torch.Size([6, 1]),
 torch.Size([1, 6]))
'''
```

Viewing the mask for the whole batch

```python
mask = torch.matmul(token_pad_ids.unsqueeze(-1), token_pad_ids.unsqueeze(1)) == 0
mask
'''
Only the first three shown:
tensor([[[False, False, False, False, False,  True],
         [False, False, False, False, False,  True],
         [False, False, False, False, False,  True],
         [False, False, False, False, False,  True],
         [False, False, False, False, False,  True],
         [ True,  True,  True,  True,  True,  True]],

        [[False, False, False, False, False, False],
         [False, False, False, False, False, False],
         [False, False, False, False, False, False],
         [False, False, False, False, False, False],
         [False, False, False, False, False, False],
         [False, False, False, False, False, False]],

        [[False, False, False,  True,  True,  True],
         [False, False, False,  True,  True,  True],
         [False, False, False,  True,  True,  True],
         [ True,  True,  True,  True,  True,  True],
         [ True,  True,  True,  True,  True,  True],
         [ True,  True,  True,  True,  True,  True]],
        ...
'''
```
Putting the mask to work

```python
scores = torch.randn(6, 6, 6)
mask = torch.matmul(token_pad_ids.unsqueeze(-1), token_pad_ids.unsqueeze(1))
scores = scores.masked_fill(mask == 0, -1e9)
scores.softmax(-1)
'''
Again, only the first three shown:
tensor([[[0.0356, 0.0985, 0.6987, 0.0902, 0.0770, 0.0000],
         [0.4661, 0.0397, 0.3546, 0.0931, 0.0464, 0.0000],
         [0.1917, 0.0149, 0.1564, 0.4113, 0.2259, 0.0000],
         [0.4269, 0.0352, 0.1605, 0.1334, 0.2441, 0.0000],
         [0.0515, 0.4421, 0.0705, 0.2934, 0.1426, 0.0000],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667]],

        [[0.0803, 0.0330, 0.3310, 0.0243, 0.3612, 0.1701],
         [0.2160, 0.1483, 0.0312, 0.1804, 0.3861, 0.0380],
         [0.2151, 0.0807, 0.1072, 0.4335, 0.1200, 0.0435],
         [0.0285, 0.2684, 0.1558, 0.2210, 0.1880, 0.1383],
         [0.0889, 0.4485, 0.1067, 0.1028, 0.1901, 0.0630],
         [0.2885, 0.1682, 0.0935, 0.0179, 0.0289, 0.4031]],

        [[0.2862, 0.3934, 0.3204, 0.0000, 0.0000, 0.0000],
         [0.2426, 0.2206, 0.5369, 0.0000, 0.0000, 0.0000],
         [0.1487, 0.2483, 0.6030, 0.0000, 0.0000, 0.0000],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667]],
        ...
'''
```

Notice also that rows consisting purely of pads end up with uniform attention scores, which is no better than random guessing and carries no meaning.
Position Embedding

Just write it out following the formula.
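For reference, the formula from the Transformer paper (the code below implements it):

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right),\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$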
```python
pe = torch.zeros(max_len, d_model)
pos = torch.arange(0, max_len).unsqueeze(1)                            # shape (max_len, 1)
idx = torch.pow(10000, torch.arange(0, 8, 2).unsqueeze(0) / d_model)  # shape (1, 4)

pe[:, 0::2] = torch.sin(pos / idx)  # broadcasting gives (max_len, 4)
pe[:, 1::2] = torch.cos(pos / idx)
'''
For demonstration, the output below shows the effect of filling only the sin columns:
tensor([[ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.8415,  0.0000,  0.0998,  0.0000,  0.0100,  0.0000,  0.0010,  0.0000],
        [ 0.9093,  0.0000,  0.1987,  0.0000,  0.0200,  0.0000,  0.0020,  0.0000],
        [ 0.1411,  0.0000,  0.2955,  0.0000,  0.0300,  0.0000,  0.003 ...
```
CV 02: ViT Leaf Image Classification

Preface

ViT stands for Vision Transformer.

Here we take Kaggle's leaf classification task and use the pre-trained ViT model to improve our results.

1. Inspecting the model & processing the data

1.1 Exploring the model

Whether it is Python's ecosystem of domain-specific packages or the pre-training paradigm that dominates NLP, being able to quickly understand a package and adapt it to our own work greatly improves both efficiency and results. So below we take a quick look at the example HuggingFace gives for this model, and then process our data to match it.

```python
from transformers import ViTFeatureExtractor, ViTForImageClassification
from PIL import Image
import requests

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

# model predicts one of the 1000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```

You can run the code above yourself.
1.1.1 Reading the example

The first ten or so lines: we fetch an image with the requests library and open it with PIL, then the next two lines load the ViT-16 feature extractor and HF's image-classification model built on top of ViT.

Next, let's look at the inputs produced by the feature extractor.

```python
# The inputs look like this
'''
{'pixel_values': tensor([[[[ 0.1137,  0.1686,  0.1843,  ..., -0.1922, -0.1843, -0.1843],
          [ 0.1373,  0.1686,  0.1843,  ..., -0.1922, -0.1922, -0.2078],
          [ 0.1137,  0.1529,  0.1608,  ..., -0.2314, -0.2235, -0.2157],
          ...,
          [ 0.8353,  0.7882,  0.7333,  ...,  0.7020,  0.6471,  0.6157],
          [ 0.8275,  0.7961,  0.7725,  ...,  0.5843,  0.4667,  0.3961],
          [ 0.8196,  0.7569,  0.7569,  ...,  0.0745, -0.0510, -0.1922]],

         [[-0.8039, -0.8118, -0.8118,  ..., -0.8902, -0.8902, -0.8980],
          [-0.7882, -0.7882, -0.7882,  ..., -0.8745, -0.8745, -0.8824],
          [-0.8118, -0.8039, -0.7882,  ..., -0.8902, -0.8902, -0.8902],
          ...,
          [-0.2706, -0.3176, -0.3647,  ..., -0.4275, -0.4588, -0.4824],
          [-0.2706, -0.2941, -0.3412,  ..., -0.4824, -0.5451, -0.5765],
          [-0.2784, -0.3412, -0.3490,  ..., -0.7333, -0.7804, -0.8353]],

         [[-0.5451, -0.4667, -0.4824,  ..., -0.7412, -0.6941, -0.7176],
          [-0.5529, -0.5137, -0.4902,  ..., -0.7412, -0.7098, -0.7412],
          [-0.5216, -0.4824, -0.4667,  ..., -0.7490, -0.7490, -0.7647],
          ...,
          [ 0.5686,  0.5529,  0.4510,  ...,  0.4431,  0.3882,  0.3255],
          [ 0.5451,  0.4902,  0.5137,  ...,  0.3020,  0.2078,  0.1294],
          [ 0.5686,  0.5608,  0.5137,  ..., -0.2000, -0.4275, -0.5294]]]])}
'''

inputs['pixel_values'].size()
# torch.Size([1, 3, 224, 224])
```

You can see it is a dict of tensors whose shape is (batch, channels, height, width).

So the data we feed the model must also be a 4-dimensional tensor.
Now let's look at what the model returns.

```python
# The outputs look like this
'''
MaskedLMOutput(loss=tensor(0.4776, grad_fn=<DivBackward0>), logits=tensor([[[[-0.0630, -0.0475, -0.1557,  ...,  0.0950,  0.0216, -0.0084],
          [-0.1219, -0.0329, -0.0849,  ..., -0.0152, -0.0143, -0.0663],
          [-0.1063, -0.0925, -0.0350,  ...,  0.0238, -0.0206, -0.2159],
          ...,
          [ 0.2204,  0.0593, -0.2771,  ...,  0.0819,  0.0535, -0.1783],
          [-0.0302, -0.1537, -0.1370,  ..., -0.1245, -0.1181, -0.0070],
          [ 0.0875,  0.0626, -0.0693,  ...,  0.1331,  0.1088, -0.0835]],

         [[ 0.1977, -0.2163,  0.0469,  ...,  0.0802, -0.0414,  0.0552],
          [ 0.1125, -0.0369,  0.0175,  ...,  0.0598, -0.0843,  0.0774],
          [ 0.1559, -0.0994, -0.0055,  ..., -0.0215,  0.2452, -0.0603],
          ...,
          [ 0.0603,  0.1887,  0.2060,  ...,  0.0415, -0.0383,  0.0990],
          [ 0.2106,  0.0992, -0.1562,  ..., -0.1254, -0.0603,  0.0685],
          [ 0.0256,  0.1578,  0.0304,  ..., -0.0894,  0.0659,  0.1493]],

         [[-0.0348, -0.0362, -0.1617,  ...,  0.0527,  0.1927,  0.1431],
          [-0.0447,  0.0137, -0.0798,  ...,  0.1057, -0.0299, -0.0742],
          [-0.0725,  0.1473, -0.0118,  ..., -0.1284,  0.0010, -0.0773],
          ...,
          [-0.0315,  0.1065, -0.1130,  ...,  0.0091, -0.0650,  0.0688],
          [ 0.0314,  0.1034, -0.0964,  ...,  0.0144,  0.0532, -0.0415],
          [-0.0205,  0.0046, -0.0987,  ...,  0.1317, -0.0065, -0.1617]]]],
       grad_fn=<UnsafeViewBackward0>), hidden_states=None, attentions=None)
'''
```

You can see loss, logits, hidden_states and attentions; the example only takes logits as its output. That is not to say the other parts are useless, just that we only need the output that fits our downstream task. See the ViT paper for details.

Finally, argmax together with model.config.id2label gives the human-readable label.

argmax returns the index of the largest value, and model.config.id2label maps indices to label names; from it we also learn that the final classifier layer has 1000 outputs.
1.1.2 Summary

From the exploration above we can conclude:

- The input must have shape (batch_size, 3, 224, 224).
- The final classifier has to be changed from 1000 classes to the number of leaf classes (see the sketch after this list).
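A hedged sketch of the second point (num_classes here is a hypothetical placeholder; use the number of distinct labels in train.csv):

```python
import torch.nn as nn
from transformers import ViTForImageClassification

num_classes = 176  # hypothetical: replace with the number of unique leaf labels
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")
model.classifier = nn.Linear(model.config.hidden_size, num_classes)  # replace the 1000-way ImageNet head
```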
1.2 Data processing

Next we explore the data and adapt it to our model.

1.2.1 EDA, i.e. Exploratory Data Analysis

First, import the required packages.

```python
# Import everything we need
import torch
import torch.nn as nn
from torch.nn import functional as F
import random
import copy
from fastprogress.fastprogress import master_bar, progress_bar
from torch.cuda.amp import autocast
import pandas as pd
import numpy as np
from torch.utils.data import Dataset, DataLoader, TensorDataset
from torchvision import transforms
from sklearn.model_selection import KFold
from PIL import Image
import os
import matplotlib.pyplot as plt
import torchvision.models as models
from torch.optim.lr_scheduler import CosineAnnealingLR
from transformers import (AdamW, get_scheduler)
from transformers import ViTFeatureExtractor, ViTForImageClassification
```

Take a look at the raw data:

```python
train_df = pd.read_csv('/kaggle/input/classify-leaves/train.csv')
```

Use the code below to assign a numeric ID to each class:

```python
def num_map(file_path):
    data_df = pd.read_csv(file_path)
    categories = data_df.label.unique().tolist()
    categories_zip = list(zip(range(len(categories)), categories))
    categories_dict = {v: k for k, v in categories_zip}
    data_df['num_label'] = data_df.label.map(categories_dict)
    return data_df


show_df = num_map('/kaggle/input/classify-leaves/train.csv')
show_df.to_csv('train_valid_dataset.csv')
```

1.2.2 Viewing the image data

```python
path = '/kaggle/input/classify-leaves/'
img = Image.open(path + train_df.image[1])

# plt.figure("Image")  # window title
plt.imshow(img)
plt.axis('off')     # turn the axes off
plt.title('image')  # figure title
plt.show()
```
Here we transpose the dimensions, i.e. [0, 1, 2] to [2, 1, 0], and look at a single channel.

```python
# np.asarray(img).shape
# The image comes out flipped; the correct order would be .transpose([2, 0, 1])
img_trans = np.asarray(img).transpose([2, 1, 0])
plt.imshow(img_trans[0])
plt.show()
```

2. Preprocessing

Next we will do data augmentation, define the Dataset class, and test the DataLoader.

2.1.1 First, compute the mean and standard deviation

The mean and std computed here are used to Normalize the data, which makes training more stable.

```python
import os
import cv2
import numpy as np
import math


def get_image_list(img_dir, isclasses=False):
    """Build the list of image file names.
    args:
        img_dir: directory that holds the images
        isclasses: whether the images are stored in per-class subdirectories
    return:
        list of image file names
    """
    img_list = []
    # Are the images under this path grouped by class?
    if isclasses:
        img_file = os.listdir(img_dir)
        for class_name in img_file:
            if not os.path.isfile(os.path.join(img_dir, class_name)):
                class_img_list = os.listdir(os.path.join(img_dir, class_name))
                img_list.extend(class_img_list)
    else:
        img_list = os.listdir(img_dir)
    print(img_list)
    print('image numbers: {}'.format(len(img_list)))
    return img_list


def get_image_pixel_mean(img_dir, img_list, img_size):
    """Compute the R, G, B means over the dataset images.
    args:
        img_dir:
        img_list:
        img_size:
    """
    R_sum = 0
    G_sum = 0
    B_sum = 0
    count = 0
    # Loop over all images
    for img_name in img_list:
        img_path = os.path.join(img_dir, img_name)
        if not os.path.isdir(img_path):
            image = cv2.imread(img_path)
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
            image = cv2.resize(image, (img_size, img_size))  # <class 'numpy.ndarray'>
            R_sum += image[:, :, 0].mean()
            G_sum += image[:, :, 1].mean()
            B_sum += image[:, :, 2].mean()
            count += 1
    R_mean = R_sum / count
    G_mean = G_sum / count
    B_mean = B_sum / count
    print('R_mean:{}, G_mean:{}, B_mean:{}'.format(R_mean, G_mean, B_mean))
    RGB_mean = [R_mean, G_mean, B_mean]
    return RGB_mean


def get_image_pixel_std(img_dir, img_mean, img_list, img_size):
    R_squared_mean = 0
    G_squared_mean = 0
    B_squared_mean = 0
    count = 0
    image_mean = np.array(img_mean)
    # Loop over all images
    for img_name in img_list:
        img_path = os.path.join(img_dir, img_name)
        if not os.path.isdir(img_path):
            image = cv2.imread(img_path)  # read the image
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
            image = cv2.resize(image, (img_size, img_size))  # <class 'numpy.ndarray'>
            image = image - image_mean  # zero-mean
            # Per-channel variance of a single image
            R_squared_mean += np.mean(np.square(image[:, :, 0]).flatten())
            G_squared_mean += np.mean(np.square(image[:, :, 1]).flatten())
            B_squared_mean += np.mean(np.square(image[:, :, 2]).flatten())
            count += 1
    R_std = math.sqrt(R_squared_mean / count)
    G_std = math.sqrt(G_squared_mean / count)
    B_std = math.sqrt(B_squared_mean / count)
    print('R_std:{}, G_std:{}, B_std:{}'.format(R_std, G_std, B_std))
    RGB_std = [R_std, G_std, B_std]
    return R ...
```
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586878889import osimport cv2import numpy as npimport mathdef get_image_list(img_dir, isclasses=False): """将图像的名称列表 args: img_dir:存放图片的目录 isclasses:图片是否按类别存放标志 return: 图片文件名称列表 """ img_list = [] # 路径下图像是否按类别分类存放 if isclasses: img_file = os.listdir(img_dir) for class_name in img_file: if not os.path.isfile(os.path.join(img_dir, class_name)): class_img_list = os.listdir(os.path.join(img_dir, class_name)) img_list.extend(class_img_list) else: img_list = os.listdir(img_dir) print(img_list) print('image numbers: {}'.format(len(img_list))) return img_listdef get_image_pixel_mean(img_dir, img_list, img_size): """求数据集图像的R、G、B均值 args: img_dir: img_list: img_size: """ R_sum = 0 G_sum = 0 B_sum = 0 count = 0 # 循环读取所有图片 for img_name in img_list: img_path = os.path.join(img_dir, img_name) if not os.path.isdir(img_path): image = cv2.imread(img_path) image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) image = cv2.resize(image, (img_size, img_size)) # <class 'numpy.ndarray'> R_sum += image[:, :, 0].mean() G_sum += image[:, :, 1].mean() B_sum += image[:, :, 2].mean() count += 1 R_mean = R_sum / count G_mean = G_sum / count B_mean = B_sum / count print('R_mean:{}, G_mean:{}, B_mean:{}'.format(R_mean,G_mean,B_mean)) RGB_mean = [R_mean, G_mean, B_mean] return RGB_meandef get_image_pixel_std(img_dir, img_mean, img_list, img_size): R_squared_mean = 0 G_squared_mean = 0 B_squared_mean = 0 count = 0 image_mean = np.array(img_mean) # 循环读取所有图片 for img_name in img_list: img_path = os.path.join(img_dir, img_name) if not os.path.isdir(img_path): image = cv2.imread(img_path) # 读取图片 image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) image = cv2.resize(image, (img_size, img_size)) # <class 'numpy.ndarray'> image = image - image_mean # 零均值 # 求单张图片的方差 R_squared_mean += np.mean(np.square(image[:, :, 0]).flatten()) G_squared_mean += np.mean(np.square(image[:, :, 1]).flatten()) B_squared_mean += np.mean(np.square(image[:, :, 2]).flatten()) count += 1 R_std = math.sqrt(R_squared_mean / count) G_std = math.sqrt(G_squared_mean / count) B_std = math.sqrt(B_squared_mean / count) print('R_std:{}, G_std:{}, B_std:{}'.format(R_std, G_std, B_std)) RGB_std = [R_std, G_std, B_std] return R ...