HF Course 02 API Overview
outputs

Note that the outputs of 🤗 Transformers models behave like namedtuples or dictionaries. You can access the elements by attributes (like we did) or by key (outputs["last_hidden_state"]), or even by index if you know exactly where the thing you are looking for is (outputs[0]).
HF model outputs are mostly returned as tuples or dictionaries; keep that in mind when handling them.
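A minimal sketch of the three access styles (the checkpoint name is just an example):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")

inputs = tokenizer("Hello!", return_tensors="pt")
outputs = model(**inputs)

# the same tensor, accessed three different ways
a = outputs.last_hidden_state     # by attribute
b = outputs["last_hidden_state"]  # by key
c = outputs[0]                    # by index
assert torch.equal(a, b) and torch.equal(b, c)
```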
Going from binary to multi-label classification: each class gets its own binary decision about whether the sample belongs to it.
[[0.2, 0.8], [0.4, 0.6], [0.7, 0.3]] = [[1], [1], [0]]
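A minimal sketch of how those per-class scores map to the binary outputs (pure illustration, no particular model assumed):

```python
# each inner pair is [p(not this class), p(this class)] for one class
scores = [[0.2, 0.8], [0.4, 0.6], [0.7, 0.3]]

# one binary decision per class
labels = [1 if p_yes > p_no else 0 for p_no, p_yes in scores]
print(labels)  # [1, 1, 0]
```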
Config

```python
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)

config
'''
BertConfig {
  [...]
  "hidden_size": 768,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  [...]
}
'''
```
Environment variables

The weights have been downloaded and cached (so future calls to the from_pretrained() method won't re-download them) in the cache folder, which defaults to ~/.cache/huggingface/transformers. You can customize your cache folder by setting the HF_HOME environment variable.
To set the environment variable in your current session: os.environ['HF_HOME'] = '~/.cache/huggingface/transformers'
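A minimal sketch; note that setting HF_HOME typically has to happen before transformers is imported for the cache location to take effect:

```python
import os

# set the cache location before importing transformers
os.environ['HF_HOME'] = '~/.cache/huggingface/transformers'

from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-cased")  # cached under HF_HOME
```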
Saving

```python
model.save_pretrained("directory_on_my_computer")
'''
This saves two files to your disk:

ls directory_on_my_computer
config.json pytorch_model.bin
'''
```
If you take a look at the config.json file, you'll recognize the attributes necessary to build the model architecture. This file also contains some metadata, such as where the checkpoint originated and what 🤗 Transformers version you were using when you last saved the checkpoint.
The pytorch_model.bin file is known as the state dictionary; it contains all your model’s weights. The two files go hand in hand; the configuration is necessary to know your model’s architecture, while the model weights are your model’s parameters.
Tokenizer

encode

Use tokenizer.tokenize(sequence) to inspect the result of tokenization.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)
'''
['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']
'''
```
ids = tokenizer.convert_tokens_to_ids(tokens)
You can also go the other way and recover the tokens from the ids, which gives the same result as tokenize(seq) above.
```python
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
```
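For the reverse direction mentioned above, a minimal sketch (same tokenizer assumed):

```python
tokens_back = tokenizer.convert_ids_to_tokens(ids)
print(tokens_back)
# ['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']
```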
decode

```python
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)
'''
'Using a Transformer network is simple'
'''
```
padding

Check tokenizer.pad_token_id to see the id of the padding token.
```python
# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
```
padding="max_length" pads up to the model's maximum length, while "longest" pads up to the longest sentence in the batch.
```python
sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]
'''
tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
tensor([[ 1.5694, -1.3895],
        [ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)
'''
```
This is because the key feature of Transformer models is attention layers that contextualize each token. These will take into account the padding tokens since they attend to all of the tokens of a sequence. To get the same result when passing individual sentences of different lengths through the model or when passing a batch with the same sentences and padding applied, we need to tell those attention layers to ignore the padding tokens. This is done by using an attention mask.
The results for the two sentences passed individually differ from those in the batched version because the padding positions also attract the model's attention, which is not where we want the model to look.
ATTENTION MASK

```python
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]
attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)
'''
tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
'''
```
Handling long sequences

Models have different supported sequence lengths, and some specialize in handling very long sequences. Longformer is one example, and another is LED. If you're working on a task that requires very long sequences, we recommend you take a look at those models.
Usually you handle this by setting truncation and max_length in the tokenizer; the course doesn't cover it here, so look at those models if you need more.
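A minimal sketch of the usual tokenizer-side fix (sequences is assumed to be a list of strings):

```python
# truncate anything longer than the model's limit (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True, max_length=512, padding="longest")
```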
HF Course 01 Basic Concepts
Some of the currently available pipelines are:
feature-extraction (get the vector representation of a text)
fill-mask
ner (named entity recognition)
question-answering
sentiment-analysis
summarization
text-generation
translation
zero-shot-classification
pipeline

```python
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)
```
model

A short timeline of model releases:
June 2018: GPT, the first pretrained Transformer model, used for fine-tuning on various NLP tasks and obtained state-of-the-art results
October 2018: BERT, another large pretrained model, this one designed to produce better summaries of sentences (more on this in the next chapter!)
February 2019: GPT-2, an improved (and bigger) version of GPT that was not immediately publicly released due to ethical concerns
October 2019: DistilBERT, a distilled version of BERT that is 60% faster, 40% lighter in memory, and still retains 97% of BERT’s performance
October 2019: BART and T5, two large pretrained models using the same architecture as the original Transformer model (the first to do so)
May 2020: GPT-3, an even bigger version of GPT-2 that is able to perform well on a variety of tasks without the need for fine-tuning (called zero-shot learning)
GPT-like (also called auto-regressive Transformer models)
BERT-like (also called auto-encoding Transformer models)
BART/T5-like (also called sequence-to-sequence Transformer models)
encoder-decoder

The main uses of each model family:
Encoder-only models: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.
ALBERT
BERT
DistilBERT
ELECTRA
RoBERTa
Decoder-only models: Good for generative tasks such as text generation
CTRL
GPT
GPT-2
Transformer XL
Encoder-decoder models or sequence-to-sequence models: Good for generative tasks that require an input, such as translation or summarization.
BART
mBART
Marian
T5
The cross-attention layer lets the decoder look at the entire input sentence, so it can reorder words as it produces the translation.
Note that the first attention layer in a decoder block pays attention to all (past) inputs to the decoder, but the second attention layer uses the output of the encoder. It can thus access the whole input sentence to best predict the current word. This is very useful as different languages can have grammatical rules that put the words in different orders, or some context provided later in the sentence may be helpful to determine the best translation of a given word.
Architectures and checkpoints
Architecture: This is the skeleton of the model — the definition of each layer and each operation that happens within the model.
Checkpoints: These are the weights that will be loaded in a given architecture.
Model: This is an umbrella term that isn’t as precise as “architecture” or “checkpoint”: it can mean both. This course will specify architecture or checkpoint when it matters to reduce ambiguity.
For example, BERT is an architecture while bert-base-cased, a set of weights trained by the Google team for the first release of BERT, is a checkpoint. However, one can say “the BERT model” and “the bert-base-cased model.”
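A minimal sketch of the distinction in code, reusing the names from the example above:

```python
from transformers import BertConfig, BertModel

# architecture only: a BERT-shaped model with randomly initialized weights
config = BertConfig()
model_random = BertModel(config)

# architecture + checkpoint: the bert-base-cased weights loaded into that architecture
model_pretrained = BertModel.from_pretrained("bert-base-cased")
```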
Even a model trained on an apparently clean corpus can still produce sexist or racist output.
When asked to fill in the missing word in these two sentences, the model gives only one gender-free answer (waiter/waitress). The others are work occupations usually associated with one specific gender — and yes, prostitute ended up in the top 5 possibilities the model associates with “woman” and “work.” This happens even though BERT is one of the rare Transformer models not built by scraping data from all over the internet, but rather using apparently neutral data (it’s trained on the English Wikipedia and BookCorpus datasets).
When you use these tools, you therefore need to keep in the back of your mind that the original model you are using could very easily generate sexist, racist, or homophobic content. Fine-tuning the model on your data won’t make this intrinsic bias disappear.
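A minimal sketch of the kind of fill-mask probe described above; the exact prompt sentences and checkpoint are assumptions, not quotes from the course:

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# compare the top-5 completions for two otherwise identical prompts
print([r["token_str"] for r in unmasker("This man works as a [MASK].", top_k=5)])
print([r["token_str"] for r in unmasker("This woman works as a [MASK].", top_k=5)])
```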
huggingface proxy
Using Hugging Face behind a proxy
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

prx = {'https': 'http://127.0.0.1:7890'}
model_name = "Helsinki-NLP/opus-mt-zh-en"
save_path = r'D:\00mydataset\huggingface model'

tokenizer = AutoTokenizer.from_pretrained(model_name, proxies=prx, cache_dir=save_path)
```
With the proxy set like this you can reach huggingface directly.
In VS Code, set the first proxy field in Settings to http://127.0.0.1:7890; 7890 is your port number, which you can find in your proxy client.
pip: following that article, I simply ran set http_proxy='http://127.0.0.1:7890' in the terminal and that was enough.
```
# Create and edit pip.ini in the pip directory (if the config file does not exist yet)
cd C:\Users\(your username)
mkdir pip      # create the pip folder
cd pip         # enter the pip directory
cd.>pip.ini    # create the pip.ini file

# Then open C:\Users\(username)\pip\pip.ini and add:
[global]
proxy=http://10.20.217.2:8080
```
Sentence-pair similarity: pipeline summary
Mainly focused on optimizing the training framework:
End-to-end ML implementation (training, validation, prediction, evaluation)
Easily adapted to your own dataset
Makes it easy to experiment quickly with other BERT-based models (BERT, ALBERT, ...)
Fast training with limited compute resources (mixed precision, gradient accumulation, ...)
Multi-GPU execution
Threshold selection for the classification decision (not necessarily 0.5)
Freeze the BERT layers and update only the classifier weights, or update all weights
Seed setting for reproducible results
Pipeline imports

```python
import torch
import torch.nn as nn
import os
import copy
import torch.optim as optim
import random
import numpy as np
import pandas as pd
from torch.utils.data import DataLoader, Dataset
from torch.cuda.amp import autocast, GradScaler
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel, AdamW, get_linear_schedule_with_warmup
from datasets import load_dataset, load_metric
```
Dataset

```python
class CustomDataset(Dataset):

    def __init__(self, data, maxlen, with_labels=True, bert_model='albert-base-v2'):
        self.data = data  # pandas dataframe
        # Initialize the tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(bert_model)

        self.maxlen = maxlen
        self.with_labels = with_labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # Fetch sentence1 and sentence2 from the DataFrame by index
        sent1 = str(self.data.loc[index, 'sentence1'])
        sent2 = str(self.data.loc[index, 'sentence2'])

        # Tokenize the sentence pair to get input_ids, attention_mask and token_type_ids
        encoded_pair = self.tokenizer(sent1, sent2,
                                      padding='max_length',  # pad to max_length
                                      truncation=True,       # truncate to max_length
                                      max_length=self.maxlen,
                                      return_tensors='pt')   # return torch.Tensor objects

        token_ids = encoded_pair['input_ids'].squeeze(0)            # tensor of token ids
        attn_masks = encoded_pair['attention_mask'].squeeze(0)      # 0 for padded values, 1 for other tokens
        token_type_ids = encoded_pair['token_type_ids'].squeeze(0)  # 0 for the first sentence, 1 for the second
                                                                    # (all 0 if there is only one sentence)

        if self.with_labels:  # True if the dataset has labels
            label = self.data.loc[index, 'label']
            return token_ids, attn_masks, token_type_ids, label
        else:
            return token_ids, attn_masks, token_type_ids
```
Recommended: run a quick sanity check.
```python
sample = next(iter(DataLoader(tr_dataset, batch_size=2)))
sample
```
```python
tr_model = SentencePairClassifier(freeze_bert=True)
tr_model(sample[0], sample[1], sample[2])
```
This makes the final dimension handling convenient: squeeze, flatten, view, or even reshape all work.
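A tiny sketch of those equivalent reshaping calls (illustrative tensor only):

```python
import torch

t = torch.zeros(1, 4)       # e.g. a leftover batch dimension of 1 from the tokenizer
print(t.squeeze(0).shape)   # torch.Size([4]) - drop the size-1 dim
print(t.flatten().shape)    # torch.Size([4])
print(t.view(-1).shape)     # torch.Size([4])
print(t.reshape(-1).shape)  # torch.Size([4])
```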
Model definition

```python
class SentencePairClassifier(nn.Module):

    def __init__(self, bert_model="albert-base-v2", freeze_bert=False):
        super(SentencePairClassifier, self).__init__()
        # Initialize the pretrained BERT-family model
        self.bert_layer = AutoModel.from_pretrained(bert_model)

        # Encoder hidden size
        if bert_model == "albert-base-v2":       # 12M parameters
            hidden_size = 768
        elif bert_model == "albert-large-v2":    # 18M parameters
            hidden_size = 1024
        elif bert_model == "albert-xlarge-v2":   # 60M parameters
            hidden_size = 2048
        elif bert_model == "albert-xxlarge-v2":  # 235M parameters
            hidden_size = 4096
        elif bert_model == "bert-base-uncased":  # 110M parameters
            hidden_size = 768
        elif bert_model == "roberta-base":
            hidden_size = 768

        # Freeze the BERT layers and only update the classification head
        if freeze_bert:
            for p in self.bert_layer.parameters():
                p.requires_grad = False

        self.dropout = nn.Dropout(p=0.1)

        # Classification head
        self.cls_layer = nn.Linear(hidden_size, 1)

    @autocast()  # mixed-precision training
    def forward(self, input_ids, attn_masks, token_type_ids):
        '''
        Inputs:
            -input_ids : Tensor containing token ids
            -attn_masks : Tensor containing attention masks to be used to focus on non-padded values
            -token_type_ids : Tensor containing token type ids to be used to identify sentence1 and sentence2
        '''
        # Feed the inputs to BERT to get the contextualized representations
        # cont_reps, pooler_output = self.bert_layer(input_ids, attn_masks, token_type_ids)
        outputs = self.bert_layer(input_ids, attn_masks, token_type_ids)
        # outputs: last_hidden_state, pooler_output, (all_hidden_states over the 12 layers)

        # Feed the last-layer hidden state of [CLS] to the classifier layer. Alternatives:
        # - average the last_hidden_state vectors
        # - take the last four layers of all_hidden_states and average (or weighted-average) them
        # - last_hidden_state + LSTM
        logits = self.cls_layer(self.dropout(outputs['pooler_output']))

        return logits
```
Fixing the random seed

```python
def set_seed(seed):
    """ Fix the random seeds so results are reproducible. """
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
```
Training and evaluation

```python
!mkdir models   # you can prepend an absolute path here
!mkdir results
```
```python
def train_bert(net, criterion, opti, lr, lr_scheduler, train_loader, val_loader, epochs, iters_to_accumulate):

    best_loss = np.Inf
    best_ep = 1
    nb_iterations = len(train_loader)
    print_every = nb_iterations // 5  # print frequency
    iters = []
    train_losses = []
    val_losses = []

    scaler = GradScaler()

    for ep in range(epochs):

        net.train()
        running_loss = 0.0
        for it, (seq, attn_masks, token_type_ids, labels) in enumerate(tqdm(train_loader)):

            # Move everything to CUDA tensors
            seq, attn_masks, token_type_ids, labels = \
                seq.to(device), attn_masks.to(device), token_type_ids.to(device), labels.to(device)

            # Mixed precision for faster training
            with autocast():
                # Obtaining the logits from the model
                logits = net(seq, attn_masks, token_type_ids)

                # Computing loss
                loss = criterion(logits.squeeze(-1), labels.float())
                loss = loss / iters_to_accumulate  # Normalize the loss because it is averaged

            # Backpropagating the gradients
            # Scales loss. Calls backward() on scaled loss to create scaled gradients.
            scaler.scale(loss).backward()

            if (it + 1) % iters_to_accumulate == 0:
                # Optimization step
                # scaler.step() first unscales the gradients of the optimizer's assigned params.
                # If these gradients do not contain infs or NaNs, opti.step() is then called,
                # otherwise, opti.step() is skipped.
                scaler.step(opti)
                # Updates the scale for next iteration.
                scaler.update()
                # Adjust the learning rate based on the iteration count.
                lr_scheduler.step()
                # Zero the gradients
                opti.zero_grad()

            running_loss += loss.item()

            if (it + 1) % print_every == 0:  # Print training loss information
                print()
                print(f"Iteration {it+1}/{nb_iterations} of epoch {ep+1} complete. "
                      f"Loss : {running_loss / print_every} ")
                running_loss = 0.0

        val_loss = evaluate_loss(net, device, criterion, val_loader)  # Compute validation loss
        print()
        print(f"Epoch {ep+1} complete! Validation Loss : {val_loss}")

        if val_loss < best_loss:
            print("Best validation loss improved from {} to {}".format(best_loss, val_loss))
            print()
            net_copy = copy.deepcopy(net)  # keep a copy of the best model
            best_loss = val_loss
            best_ep = ep + 1

    # Save the best model
    path_to_model = f'models/{bert_model}_lr_{lr}_val_loss_{round(best_loss, 5)}_ep_{best_ep}.pt'
    torch.save(net_copy.state_dict(), path_to_model)
    print("The model has been saved in {}".format(path_to_model))

    del loss
    torch.cuda.empty_cache()  # free GPU memory


def evaluate_loss(net, device, criterion, dataloader):
    """ Evaluate the mean loss over a dataloader. """
    net.eval()

    mean_loss = 0
    count = 0

    with torch.no_grad():
        for it, (seq, attn_masks, token_type_ids, labels) in enumerate(tqdm(dataloader)):
            seq, attn_masks, token_type_ids, labels = \
                seq.to(device), attn_masks.to(device), token_type_ids.to(device), labels.to(device)
            logits = net(seq, attn_masks, token_type_ids)
            mean_loss += criterion(logits.squeeze(-1), labels.float()).item()
            count += 1

    return mean_loss / count
```
Note autocast and gradient accumulation, the two techniques used here to speed up training.
When evaluating, watch the tensor dimensions and the label dtype.
Hyperparameters & starting training

```python
bert_model = "albert-base-v2"  # 'albert-base-v2', 'albert-large-v2'
freeze_bert = False            # whether to freeze BERT
maxlen = 128                   # maximum sequence length
bs = 16                        # batch size
iters_to_accumulate = 2        # gradient accumulation steps
lr = 2e-5                      # learning rate
epochs = 2                     # number of training epochs
```
```python
# Fix the random seed for reproducibility
set_seed(1)  # 2022

# Build the training and validation sets
print("Reading training data...")
train_set = CustomDataset(df_train, maxlen, bert_model)
print("Reading validation data...")
val_set = CustomDataset(df_val, maxlen, bert_model)

# Build the training and validation DataLoaders
train_loader = DataLoader(train_set, batch_size=bs, num_workers=0)
val_loader = DataLoader(val_set, ba ...
```
Weights & Biases

TODO
Source-code details
torch.inference_mode(): a speed-up over torch.no_grad(); see the reference docs.
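A minimal sketch (toy model, just to show the context manager):

```python
import torch

model = torch.nn.Linear(4, 2)
x = torch.randn(8, 4)

with torch.inference_mode():  # like torch.no_grad(), but also skips autograd bookkeeping
    y = model(x)

print(y.requires_grad)  # False
```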
nn.MarginRankingLoss() (docs): with margin = 0, the target is y = 1 when x1 should be ranked higher than x2, and y = -1 in the opposite case.
loss(x1, x2, y) = max(0, -y * (x1 - x2) + margin)
The final loss here is the mean over the batch.
```python
loss = nn.MarginRankingLoss()
input1 = torch.randn(3, requires_grad=True)
input2 = torch.randn(3, requires_grad=True)
target = torch.randn(3).sign()
output = loss(input1, input2, target)
output.backward()
```

```
input1, input2, target, output
(tensor([ 0.0277, -0.3806,  1.0405], requires_grad=True),
 tensor([-0.9075,  0.3271,  0.1156], requires_grad=True),
 tensor([ 1., -1., -1.]),
 tensor(0.3083, grad_fn=<MeanBackward0>))

input1 - input2, (input1 - input2) * (-target)
(tensor([ 0.9352, -0.7077,  0.9249], grad_fn=<SubBackward0>),
 tensor([-0.9352, -0.7077,  0.9249], grad_fn=<MulBackward0>))

loss = 0.9249 / 3
```
gc.collect() frees memory.
defaultdict gives you a dict that creates a missing key on access instead of raising an error.
```python
from collections import defaultdict

history = defaultdict(list)
history['Train Loss'].append(1.1)
```
StratifiedKFold()

```python
from sklearn.model_selection import StratifiedKFold, KFold

skf = StratifiedKFold(n_splits=CONFIG['n_fold'], shuffle=True, random_state=CONFIG['seed'])

for fold, (_, val_) in enumerate(skf.split(X=df, y=df.worker)):
    df.loc[val_, "kfold"] = int(fold)

df["kfold"] = df["kfold"].astype(int)
```
The skf.split(...) loop splits X into k folds stratified by y; val_ holds the sample indices of each fold, and fold runs from 0 up to n_fold - 1.
The result is a df["kfold"] column recording which fold each row belongs to as validation data.
The function below then takes all rows outside the current fold as train and the rest as valid:
```python
df_train = df[df.kfold != fold].reset_index(drop=True)
df_valid = df[df.kfold == fold].reset_index(drop=True)
```
```python
def prepare_loaders(fold):
    df_train = df[df.kfold != fold].reset_index(drop=True)
    df_valid = df[df.kfold == fold].reset_index(drop=True)

    train_dataset = JigsawDataset(df_train, tokenizer=CONFIG['tokenizer'], max_length=CONFIG['max_length'])
    valid_dataset = JigsawDataset(df_valid, tokenizer=CONFIG['tokenizer'], max_length=CONFIG['max_length'])

    train_loader = DataLoader(train_dataset, batch_size=CONFIG['train_batch_size'],
                              num_workers=2, shuffle=True, pin_memory=True, drop_last=True)
    valid_loader = DataLoader(valid_dataset, batch_size=CONFIG['valid_batch_size'],
                              num_workers=2, shuffle=False, pin_memory=True)

    return train_loader, valid_loader
```
tqdm

```python
bar = tqdm(enumerate(dataloader), total=len(dataloader))
```
Within a single epoch, update the bar like this:
```python
bar.set_postfix(Epoch=epoch, Valid_Loss=epoch_loss,
                LR=optimizer.param_groups[0]['lr'])
```
Weights & Biases (W&B)
hash: an id used to name and group the runs of one project
For train and valid, define a one-epoch function for each that returns its loss
wandb.log({"Train Loss": train_epoch_loss}) records the loss via the log call
```python
run = wandb.init(project='Jigsaw',
                 config=CONFIG,
                 job_type='Train',
                 group=CONFIG['group'],
                 tags=['roberta-base', f'{HASH_NAME}', 'margin-loss'],
                 name=f'{HASH_NAME}-fold-{fold}',
                 anonymous='must')
```
TRAIN PART (the training loop runs here)
run.finish()
The console output looks like 'hash--------name':

```
Syncing run k5nu8k69390a-fold-0 to Weights & Biases (docs).
```
Distilled training workflow
for fold in range(0, CONFIG['n_fold'])
wandb.init
prepare_loaders, fetch_scheduler
run_training
train_one_epoch, valid_one_epoch --> to get the model and the loss for wandb
Interleave the W&B logging so the metrics stream in for real-time analysis (see the sketch below)
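A hedged sketch of how those pieces fit together; the function signatures and the model/optimizer setup are assumptions based on the notes above:

```python
# assumes model, optimizer, CONFIG, HASH_NAME, prepare_loaders, fetch_scheduler,
# train_one_epoch and valid_one_epoch are defined as sketched in this section
for fold in range(0, CONFIG['n_fold']):
    run = wandb.init(project='Jigsaw', config=CONFIG, job_type='Train',
                     group=CONFIG['group'], name=f'{HASH_NAME}-fold-{fold}', anonymous='must')

    train_loader, valid_loader = prepare_loaders(fold)
    scheduler = fetch_scheduler(optimizer)

    for epoch in range(CONFIG['epochs']):
        train_epoch_loss = train_one_epoch(model, optimizer, scheduler, train_loader, epoch)
        valid_epoch_loss = valid_one_epoch(model, valid_loader, epoch)
        wandb.log({"Train Loss": train_epoch_loss, "Valid Loss": valid_epoch_loss})

    run.finish()
```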
df['y'].value_counts(normalize=True) to get the percentage of each value
Link to the original article
HF 02
TODO
Worked examples
The Transformer splits into two halves, and both BERT and GPT are strong
BERT uses the Transformer encoder
BERT uses the encoder side of the Transformer. The self-attention mechanism in the encoder uses a token's context on both sides at the same time while encoding it; this "using both sides of the context simultaneously" is what makes it bidirectional, rather than feeding the sentence in again in reverse the way a Bi-LSTM does.
GPT uses the Transformer decoder
Before BERT there was GPT, which uses the decoder side of the Transformer. GPT is pretrained as a unidirectional language model that predicts the current token from the preceding text, which makes it better suited to text generation.
BERT embeddings

The input embedding is the sum of three embeddings:
Token Embeddings are the word vectors; the first token is the [CLS] marker, which can be used for downstream classification tasks.
BERT prepends a [CLS] token to the first sentence, and the last-layer vector at that position can serve as a representation of the whole sentence for downstream classification and similar tasks. Compared with the actual words in the text, this token carries no inherent meaning, so it blends the semantics of all the words more "fairly" and therefore represents the whole sentence better. Concretely, self-attention enriches a target token's representation with the other tokens, but the target token's own meaning still dominates; after BERT's 12 layers (taking BERT-base as the example), each word's embedding has fused information from every token and represents its own meaning better, whereas the [CLS] position, having no meaning of its own, ends up as a better sentence-level representation than any ordinary token.
Segment Embeddings distinguish the two sentences, because pretraining includes not only language modeling but also a classification task that takes two sentences as input.
Position Embeddings differ from the Transformer in earlier posts: they are learned rather than sinusoidal.
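A minimal sketch of summing the three embeddings (BERT-base sizes assumed: vocab 30522, hidden 768, 512 positions, 2 segments):

```python
import torch
import torch.nn as nn

vocab_size, hidden, max_pos, n_seg = 30522, 768, 512, 2

tok_emb = nn.Embedding(vocab_size, hidden)  # token embeddings
seg_emb = nn.Embedding(n_seg, hidden)       # segment embeddings (sentence A / B)
pos_emb = nn.Embedding(max_pos, hidden)     # learned position embeddings

input_ids = torch.randint(0, vocab_size, (1, 10))
token_type_ids = torch.zeros(1, 10, dtype=torch.long)
position_ids = torch.arange(10).unsqueeze(0)

# BERT's input representation is the element-wise sum of the three
embeddings = tok_emb(input_ids) + seg_emb(token_type_ids) + pos_emb(position_ids)
print(embeddings.shape)  # torch.Size([1, 10, 768])
```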
API

tokenizer

```python
from transformers import AutoConfig, AutoModel, AutoTokenizer, AdamW, get_linear_schedule_with_warmup, logging

# config
MODEL_NAME = "bert-base-chinese"
config = AutoConfig.from_pretrained(MODEL_NAME)  # config holds the model configuration

# tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

tokenizer.all_special_ids     # ids of the special tokens: [100, 102, 0, 101, 103]
tokenizer.all_special_tokens  # the special tokens: ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
tokenizer.vocab_size          # vocabulary size
tokenizer.vocab               # the vocabulary as a dict

## tokenizing
text = "我在北京工作"
token_ids = tokenizer.encode(text)
token_ids  # [101, 2769, 1762, 1266, 776, 2339, 868, 102]

tokenizer.convert_ids_to_tokens(token_ids)  # ['[CLS]', '我', '在', '北', '京', '工', '作', '[SEP]']
                                            # convert_tokens_to_ids(tokens) is the inverse

## padding
token_ids = tokenizer.encode(text, padding=True, max_length=30, add_special_tokens=True)

## encode_plus
token_ids = tokenizer.encode_plus(
    text,
    padding="max_length",
    max_length=30,
    add_special_tokens=True,
    return_tensors='pt',
    return_token_type_ids=True,
    return_attention_mask=True)
```
Loading data into the pretrained model

```python
model = AutoModel.from_pretrained(MODEL_NAME)
outputs = model(token_ids['input_ids'], token_ids['attention_mask'])
```
Dataset definition

```python
class EnterpriseDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, item):
        """
        item is the data index; iteration fetches the item-th record
        """
        text = str(self.texts[item])
        label = self.labels[item]

        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=True,
            pad_to_max_length=True,
            return_attention_mask=True,
            return_tensors='pt',  # return PyTorch tensors
        )

        # print(encoding['input_ids'])
        return {
            'texts': text,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            # token_type_ids: 0
            'labels': torch.tensor(label, dtype=torch.long)
        }
```