Fortunately, there’s a library called sentence-transformers that is dedicated to creating embeddings. As described in the library’s documentation, our use case is an example of asymmetric semantic search because we have a short query whose answer we’d like to find in a longer document, such as an issue comment. The handy model overview table in the documentation indicates that the multi-qa-mpnet-base-dot-v1 checkpoint has the best performance for semantic search, so we’ll use that for our application.
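To make the idea of asymmetric semantic search concrete, here is a minimal sketch that scores a short query against longer documents by the dot product of their embedding vectors. Dot-product scoring is an assumption based on the `dot` in the checkpoint name; the 3-dimensional vectors and the helper names `dot` and `rank_documents` are made up for illustration, and real embeddings would come from the model.

```python
# Toy illustration: rank documents by dot-product similarity to a query.
# All vectors are invented; a real model produces much larger embeddings.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_documents(query_vec, doc_vecs):
    """Return document indices ordered from highest to lowest score."""
    scores = [dot(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


query_vec = [1.0, 0.0, 1.0]      # hypothetical embedding of a short query
doc_vecs = [
    [0.1, 0.9, 0.0],             # hypothetical embedding of comment 0
    [0.8, 0.1, 0.7],             # hypothetical embedding of comment 1
    [0.4, 0.4, 0.4],             # hypothetical embedding of comment 2
]

ranking = rank_documents(query_vec, doc_vecs)
print(ranking)
```

With real data, the brute-force loop would normally be replaced by a nearest-neighbour index, which is the role faiss plays in this kind of setup.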
We mainly use two additional libraries for this: sentence-transformers and faiss.
Loading the model
from transformers import AutoTokenizer, AutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)
Data processing
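import torch

# Move the model to the GPU (this assumes a CUDA device is available).
device = torch.device("cuda")
model.to(device)

def cls_pooling(model_output):
    # Keep the hidden state of the first token of each sequence.
    return model_output.last_hidden_state[:, 0]

def get_embeddings(text_list):
    # Tokenize the batch, padding and truncating to a common length.
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    # Move every input tensor to the same device as the model.
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)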
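To show what the indexing in cls_pooling selects, here is a minimal pure-Python sketch in which nested lists stand in for the `last_hidden_state` tensor of shape (batch, tokens, hidden). The name `cls_pooling_lists` and the toy numbers are assumptions for illustration only; the real function operates on PyTorch tensors.

```python
# Nested lists stand in for a tensor of shape (batch, tokens, hidden).

def cls_pooling_lists(last_hidden_state):
    """Keep only the first token's vector for each sequence in the batch."""
    return [sequence[0] for sequence in last_hidden_state]


toy_hidden_state = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],   # sequence 0: 3 tokens, hidden size 2
    [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]],   # sequence 1: 3 tokens, hidden size 2
]

pooled = cls_pooling_lists(toy_hidden_state)
print(pooled)  # one vector per sequence
```

The first position holds the special start-of-sequence token, so each text in the batch is summarized by a single vector regardless of how many tokens it contains.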
""" COMMENT: Requiring online connection is a deal breaker in some cases unfortunately so it'd be great if offline mode is added similar to how `transformers` loads models offline fine. @mandubian's second bullet point suggests that there's a workaround allowing you to use your offline (custom?) dataset with `datasets`. Could you please elaborate on how that should look like? SCORE: 25.505046844482422 TITLE: Discussion using datasets in offline mode URL: https://github.com/huggingface/datasets/issues/824 ================================================== COMMENT: The local dataset builders (csv, text , json and pandas) are now part of the `datasets` package since #1726 :) You can now use them offline \`\`\`python datasets = load_dataset("text", data_files=data_files) \`\`\` We'll do a new release soon SCORE: 24.555509567260742 TITLE: Discussion using datasets in offline mode URL: https://github.com/huggingface/datasets/issues/824 ================================================== COMMENT: I opened a PR that allows to reload modules that have already been loaded once even if there's no internet. Let me know if you know other ways that can make the offline mode experience better. I'd be happy to add them :) I already note the "freeze" modules option, to prevent local modules updates. It would be a cool feature. ---------- > @mandubian's second bullet point suggests that there's a workaround allowing you to use your offline (custom?) dataset with `datasets`. Could you please elaborate on how that should look like? Indeed `load_dataset` allows to load remote dataset script (squad, glue, etc.) but also you own local ones. For example if you have a dataset script at `./my_dataset/my_dataset.py` then you can do \`\`\`python load_dataset("./my_dataset") \`\`\` and the dataset script will generate your dataset once and for all. 
---------- About I'm looking into having `csv`, `json`, `text`, `pandas` dataset builders already included in the `datasets` package, so that they are available offline by default, as opposed to the other datasets that require the script to be downloaded. cf #1724 SCORE: 24.14896583557129 TITLE: Discussion using datasets in offline mode URL: https://github.com/huggingface/datasets/issues/824 ================================================== COMMENT: > here is my way to load a dataset offline, but it **requires** an online machine > > 1. (online machine) >
import datasets
data = datasets.load_dataset(…)
data.save_to_disk(/YOUR/DATASET/DIR)
1 2 3 4 5
2. copy the dir from online to the offline machine
SCORE: 22.893993377685547 TITLE: Discussion using datasets in offline mode URL: https://github.com/huggingface/datasets/issues/824 ==================================================
COMMENT: here is my way to load a dataset offline, but it **requires** an online machine 1. (online machine) \`\`\` import datasets data = datasets.load_dataset(...) data.save_to_disk(/YOUR/DATASET/DIR) \`\`\` 2. copy the dir from online to the offline machine 3. (offline machine) \`\`\` import datasets data = datasets.load_from_disk(/SAVED/DATA/DIR) \`\`\`
HTH. SCORE: 22.406635284423828 TITLE: Discussion using datasets in offline mode URL: https://github.com/huggingface/datasets/issues/824 ================================================== """
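The listing above is ordered from the highest score to the lowest, with one COMMENT/SCORE/TITLE/URL block per match. A minimal sketch of how such a report could be assembled from parallel lists of scores and records follows; the helper name `format_results`, the field names `comments`, `title` and `html_url`, the separator width, and the toy data are all assumptions for illustration, not the code that produced the output above.

```python
def format_results(scores, samples):
    """Return a text report with one block per sample, highest score first."""
    # Pair each score with its record and sort by score, descending.
    ranked = sorted(zip(scores, samples), key=lambda pair: pair[0], reverse=True)
    blocks = []
    for score, sample in ranked:
        blocks.append(
            f"COMMENT: {sample['comments']}\n"
            f"SCORE: {score}\n"
            f"TITLE: {sample['title']}\n"
            f"URL: {sample['html_url']}\n"
            + "=" * 50  # separator width is an arbitrary choice
        )
    return "\n\n".join(blocks)


# Hypothetical scores and records standing in for real search results.
scores = [22.4, 25.5]
samples = [
    {"comments": "first toy comment", "title": "Toy issue",
     "html_url": "https://example.com/1"},
    {"comments": "second toy comment", "title": "Toy issue",
     "html_url": "https://example.com/2"},
]

report = format_results(scores, samples)
print(report)
```

Sorting the (score, record) pairs together keeps each score attached to its comment, which is easy to get wrong when the two lists are sorted separately.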