HF Course 04 Dataset
## Loading local data
Data format | Loading script | Example |
---|---|---|
CSV & TSV | csv | load_dataset("csv", data_files="my_file.csv") |
Text files | text | load_dataset("text", data_files="my_file.txt") |
JSON & JSON Lines | json | load_dataset("json", data_files="my_file.jsonl") |
Pickled DataFrames | pandas | load_dataset("pandas", data_files="my_dataframe.pkl") |
In each case you need to tell `load_dataset()` the data format (the loading script) and the path to the file(s).

### The `data_files` argument
The `data_files` argument of the `load_dataset()` function is quite flexible and can be either a single file path, a list of file paths, or a dictionary that maps split names to file paths. You can also glob files that match a specified pattern according to the rules used by the Unix shell (e.g., you can glob all the JSON files in a directory as a single split by setting `data_files="*.json"`). See the 🤗 Datasets documentation for more details.
So it can take plain file paths, or a dictionary keyed by split name to map the data into the splits you want:
```python
from datasets import load_dataset

data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset
```
```
DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})
```
## Loading data from a remote server
```python
url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
"train": url + "SQuAD_it-train.json.gz",
"test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
```
## Data processing

### Delimiters

If your data is not standard comma-separated CSV, you can specify the delimiter when loading it; see the sketch below.
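A minimal sketch, assuming the tab-separated drug-review files used in the course notebook (the file names here are illustrative):

```python
from datasets import load_dataset

# TSV files are loaded with the "csv" script plus a tab delimiter
data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")
```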
### Randomly sampling examples

```python
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))
```
### Renaming columns

Columns are renamed with `Dataset.rename_column()`; see the sketch below.
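A sketch of the call, following the course notebook, where the CSV's unnamed index column is renamed to `patient_id` (the argument values are carried over from there, so treat them as an assumption):

```python
# rename the CSV's unnamed index column to something meaningful
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="patient_id"
)
```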
A small aside on anonymous (lambda) expressions: a lambda can be defined and called inline.

```python
(lambda base, height: 0.5 * base * height)(4, 8)
```
```
16.0
```
### Case conversion

Lowercasing the `condition` column with a `lowercase_condition` function raised an error here, because some entries are `None`; see the sketch below.
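A sketch of the step as it appears in the course (the `AttributeError` comes from calling `.lower()` on the `None` values in the `condition` column):

```python
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}

# fails with AttributeError: 'NoneType' object has no attribute 'lower'
drug_dataset.map(lowercase_condition)
```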
### Dataset.filter()

```python
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)
```

This filters the dataset down to the valid samples (here, the rows whose condition is not missing).
### Adding a column

A new `review_length` column can be computed with a `compute_review_length` function and `Dataset.map()`; see the sketch below.
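A sketch of the step, following the course notebook (the word-count definition of review length is an assumption carried over from there):

```python
def compute_review_length(example):
    return {"review_length": len(example["review"].split())}

# map() adds the returned key as a new column
drug_dataset = drug_dataset.map(compute_review_length)
```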
### sort
```python
drug_dataset["train"].sort("review_length")[:3]
```
```
{'patient_id': [103488, 23627, 20558],
 'drugName': ['Loestrin 21 1 / 20', 'Chlorzoxazone', 'Nucynta'],
 'condition': ['birth control', 'muscle spasm', 'pain'],
 'review': ['"Excellent."', '"useless"', '"ok"'],
 'rating': [10.0, 1.0, 6.0],
 'date': ['November 4, 2008', 'March 24, 2017', 'August 20, 2016'],
 'usefulCount': [5, 2, 10],
 'review_length': [1, 1, 1]}
```

`sort()` should also have a `reverse` option; if you really want to do exploratory analysis, Pandas is probably the better tool. Check the configurable parameters in the docs.
### Dataset.add_column()

An alternative way to add new columns to a dataset is with the `Dataset.add_column()` function. This allows you to provide the column as a Python list or NumPy array and can be handy in situations where `Dataset.map()` is not well suited for your analysis.
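A minimal sketch of the call (the column name and values are made up for illustration):

```python
# attach a hypothetical sequential id column to the train split
train_split = drug_dataset["train"]
train_split = train_split.add_column("row_id", list(range(train_split.num_rows)))
```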
### Unescaping HTML characters

```python
import html

drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x["review"])})
```
### The map() method and batching
When you specify `batched=True`, the function receives a dictionary with the fields of the dataset, but each value is now a list of values, and not just a single value. The return value of `Dataset.map()` should be the same: a dictionary with the fields we want to update or add to our dataset, and a list of values.
With `batched=True`, each call receives a whole batch as a dictionary of lists, and typically we just return the updated fields to rewrite the dataset; a batched version of the HTML-unescaping step is sketched below.
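A sketch of the batched version of the HTML-unescaping step from the course (since each `x["review"]` is now a list of strings, the function unescapes them one by one):

```python
new_drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]},
    batched=True,
)
```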
If you're running this code in a notebook, you'll see that this command executes way faster than the previous one. And it's not because our reviews have already been HTML-unescaped; if you re-execute the instruction from the previous section (without `batched=True`), it will take the same amount of time as before. This is because list comprehensions are usually faster than executing the same code in a `for` loop, and we also gain some performance by accessing lots of elements at the same time instead of one by one.
The earlier, per-example version used a for loop; the batched version can use a list comprehension instead, which is much faster.
### Using map() with a tokenizer

The `return_overflowing_tokens` argument keeps the part that would otherwise be cut off by truncation; here a review of length 177 is split into two chunks of 128 and 49 tokens. A sketch of the `tokenize_and_split` function is shown below.
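A sketch of the function, following the course notebook (`max_length=128` matches the 128/49 split mentioned above; the tokenizer object is assumed to be defined earlier):

```python
def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )
```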
```python
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
```
## Converting between formats

### Pandas
To enable the conversion between various third-party libraries, 🤗 Datasets provides a `Dataset.set_format()` function. This function only changes the output format of the dataset, so you can easily switch to another format without affecting the underlying data format, which is Apache Arrow. The formatting is done in place. To demonstrate, let's convert our dataset to Pandas:
```python
drug_dataset.set_format("pandas")
```
Typically you then take a full slice, `train_df = drug_dataset["train"][:]`, to get the whole split as a new DataFrame (you can check for yourself whether the object returned here is a 🤗 Dataset or a pandas DataFrame). You can also convert back with `Dataset.from_pandas()`; see the sketch below.
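A sketch of the round trip, assuming `train_df` as defined above:

```python
from datasets import Dataset

# back from a pandas DataFrame to a 🤗 Dataset
train_dataset = Dataset.from_pandas(train_df)

# reset the output format to plain Python objects when done
drug_dataset.reset_format()
```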
### train_test_split

```python
drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)
```
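In the course the new `test` split is then renamed to `validation` and the original test set is added back; a sketch of that step:

```python
# rename the default "test" split to "validation"
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")
# add the original "test" set back to the DatasetDict
drug_dataset_clean["test"] = drug_dataset["test"]
```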
## Saving a dataset
Data format | Function |
---|---|
Arrow | Dataset.save_to_disk() |
CSV | Dataset.to_csv() |
JSON | Dataset.to_json() |
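A sketch of exporting each split to JSON Lines (the file-name pattern is illustrative; `drug_dataset_clean` is the DatasetDict from the split above):

```python
# one JSONL file per split
for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"drug-reviews-{split}.jsonl")
```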
## Handling very large datasets

### Checking memory with psutil

Install it with `pip install psutil`; it can report how much RAM the current process is using, as sketched below.
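A minimal sketch (resident set size of the current Python process, in megabytes):

```python
import psutil

# Process() with no argument refers to the current process
ram_mb = psutil.Process().memory_info().rss / (1024 * 1024)
print(f"RAM used: {ram_mb:.2f} MB")
```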
If you're familiar with Pandas, this result might come as a surprise because of Wes McKinney's famous rule of thumb that you typically need 5 to 10 times as much RAM as the size of your dataset. So how does 🤗 Datasets solve this memory management problem? 🤗 Datasets treats each dataset as a memory-mapped file, which provides a mapping between RAM and filesystem storage that allows the library to access and operate on elements of the dataset without needing to fully load it into memory. Memory-mapped files can also be shared across multiple processes, which enables methods like `Dataset.map()` to be parallelized without needing to move or copy the dataset. Under the hood, these capabilities are all realized by the Apache Arrow memory format and the `pyarrow` library, which make the data loading and processing lightning fast.
In short: the usual rule of thumb is that you need five to ten times your file size in RAM, but 🤗 Datasets memory-maps the data via pyarrow, so you can work with huge datasets while only touching the parts you need, and `map()` can be parallelized across processes.
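A quick way to see this in practice (a sketch; `pubmed_dataset` stands for a large dataset loaded earlier, as in the course, and `dataset_size` reports the size of the Arrow cache on disk):

```python
import psutil

size_gb = pubmed_dataset.dataset_size / (1024**3)
ram_gb = psutil.Process().memory_info().rss / (1024**3)
print(f"Dataset size on disk: {size_gb:.2f} GB, process RAM: {ram_gb:.2f} GB")
```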
### Streaming data

If we tried to download the Pile in its entirety, we'd need 825 GB of free disk space! To handle these cases, 🤗 Datasets provides a streaming feature that allows us to download and access elements on the fly, without needing to download the whole dataset. Let's take a look at how this works.
Passing `streaming=True` to `load_dataset()` returns an `IterableDataset` object; see the sketch below.
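A sketch of the streaming call (the `data_files` value is a placeholder for the PubMed-abstracts subset used in the course):

```python
# streaming=True means nothing is downloaded up front
pubmed_dataset_streamed = load_dataset(
    "json",
    data_files="path/or/url/to/pubmed_abstracts_subset.jsonl",  # placeholder
    split="train",
    streaming=True,
)

# elements are fetched lazily, one at a time
next(iter(pubmed_dataset_streamed))
```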
A streamed dataset can also be tokenized on the fly; a sketch follows below.
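A sketch of lazy tokenization on the streamed dataset (the checkpoint name follows the course; treat it as an assumption):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# map() on an IterableDataset is applied lazily as you iterate
tokenized_dataset = pubmed_dataset_streamed.map(lambda x: tokenizer(x["text"]))
next(iter(tokenized_dataset))
```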
💡 To speed up tokenization with streaming you can pass `batched=True`, as we saw in the last section. It will process the examples batch by batch; the default batch size is 1,000 and can be specified with the `batch_size` argument.
### Combining datasets

🤗 Datasets provides an `interleave_datasets()` function that converts a list of `IterableDataset` objects into a single `IterableDataset`, where the elements of the new dataset are obtained by alternating among the source examples.
To combine the PubMed stream with a second corpus, another streamed dataset (e.g. `law_dataset_streamed`) is loaded the same way and then interleaved; see the sketch below.
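A sketch of the step (the second corpus's `data_files` value is a placeholder; in the course it is the FreeLaw subset of the Pile):

```python
from itertools import islice
from datasets import interleave_datasets, load_dataset

law_dataset_streamed = load_dataset(
    "json",
    data_files="path/or/url/to/free_law_subset.jsonl",  # placeholder
    split="train",
    streaming=True,
)

# the combined stream alternates between the two source streams
combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])
list(islice(combined_dataset, 2))
```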