HF Course 04 Dataset
## Loading local data
Data format | Loading script | Example |
---|---|---|
CSV & TSV | csv | load_dataset("csv", data_files="my_file.csv") |
Text files | text | load_dataset("text", data_files="my_file.txt") |
JSON & JSON Lines | json | load_dataset("json", data_files="my_file.jsonl") |
Pickled DataFrames | pandas | load_dataset("pandas", data_files="my_dataframe.pkl") |
In each case you need to tell `load_dataset()` the data format (the loading script) and the path to the file(s).

### The `data_files` argument
The `data_files` argument of the `load_dataset()` function is quite flexible and can be either a single file path, a list of file paths, or a dictionary that maps split names to file paths. You can also glob files that match a specified pattern according to the rules used by the Unix shell (e.g., you can glob all the JSON files in a directory as a single split by setting `data_files="*.json"`). See the 🤗 Datasets documentation for more details.
So it can take plain file paths, or a dictionary keyed by split name to map the data into the splits you want:
```python
from datasets import load_dataset

data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset
```
```
DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})
```
## Loading data from a remote server
```python
url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
"train": url + "SQuAD_it-train.json.gz",
"test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
```
## Data processing

### Delimiters

If your data is not standard comma-separated CSV, you can specify the delimiter when loading it; see the sketch below.
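A minimal sketch, assuming the tab-separated drug-review files used in the course notebook (the file names here are illustrative):

```python
from datasets import load_dataset

# TSV files are loaded with the "csv" script plus a tab delimiter
data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")
```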
### Randomly sampling examples

```python
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))
```
### Renaming columns

Columns are renamed with `Dataset.rename_column()`; see the sketch below.
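A sketch of the call, following the course notebook, where the CSV's unnamed index column is renamed to `patient_id` (the argument values are carried over from there, so treat them as an assumption):

```python
# rename the CSV's unnamed index column to something meaningful
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="patient_id"
)
```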
A small aside on anonymous (lambda) expressions: a lambda can be defined and called inline.

```python
(lambda base, height: 0.5 * base * height)(4, 8)
```
```
16.0
```
### Case conversion

Lowercasing the `condition` column with a `lowercase_condition` function raised an error here, because some entries are `None`; see the sketch below.
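A sketch of the step as it appears in the course (the `AttributeError` comes from calling `.lower()` on the `None` values in the `condition` column):

```python
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}

# fails with AttributeError: 'NoneType' object has no attribute 'lower'
drug_dataset.map(lowercase_condition)
```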
### Dataset.filter()

```python
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)
```

This filters the dataset down to the valid samples (here, the rows whose condition is not missing).
### Adding a column

A new `review_length` column can be computed with a `compute_review_length` function and `Dataset.map()`; see the sketch below.
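A sketch of the step, following the course notebook (the word-count definition of review length is an assumption carried over from there):

```python
def compute_review_length(example):
    return {"review_length": len(example["review"].split())}

# map() adds the returned key as a new column
drug_dataset = drug_dataset.map(compute_review_length)
```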
### sort
```python
drug_dataset["train"].sort("review_length")[:3]
```
```
{'patient_id': [103488, 23627, 20558],
 'drugName': ['Loestrin 21 1 / 20', 'Chlorzoxazone', 'Nucynta'],
 'condition': ['birth control', 'muscle spasm', 'pain'],
 'review': ['"Excellent."', '"useless"', '"ok"'],
 'rating': [10.0, 1.0, 6.0],
 'date': ['November 4, 2008', 'March 24, 2017', 'August 20, 2016'],
 'usefulCount': [5, 2, 10],
 'review_length': [1, 1, 1]}
```

`sort()` should also have a `reverse` option; if you really want to do exploratory analysis, Pandas is probably the better tool. Check the configurable parameters in the docs.
### Dataset.add_column()

An alternative way to add new columns to a dataset is with the `Dataset.add_column()` function. This allows you to provide the column as a Python list or NumPy array and can be handy in situations where `Dataset.map()` is not well suited for your analysis.
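A minimal sketch of the call (the column name and values are made up for illustration):

```python
# attach a hypothetical sequential id column to the train split
train_split = drug_dataset["train"]
train_split = train_split.add_column("row_id", list(range(train_split.num_rows)))
```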
### Unescaping HTML characters

```python
import html

drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x["review"])})
```
### The map() method and batching
When you specify `batched=True`, the function receives a dictionary with the fields of the dataset, but each value is now a list of values, and not just a single value. The return value of `Dataset.map()` should be the same: a dictionary with the fields we want to update or add to our dataset, and a list of values.
With `batched=True`, each call receives a whole batch as a dictionary of lists, and typically we just return the updated fields to rewrite the dataset; a batched version of the HTML-unescaping step is sketched below.
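A sketch of the batched version of the HTML-unescaping step from the course (since each `x["review"]` is now a list of strings, the function unescapes them one by one):

```python
new_drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]},
    batched=True,
)
```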
If you're running this code in a notebook, you'll see that this command executes way faster than the previous one. And it's not because our reviews have already been HTML-unescaped; if you re-execute the instruction from the previous section (without `batched=True`), it will take the same amount of time as before. This is because list comprehensions are usually faster than executing the same code in a `for` loop, and we also gain some performance by accessing lots of elements at the same time instead of one by one.
The earlier, per-example version used a for loop; the batched version can use a list comprehension instead, which is much faster.
### Using map() with a tokenizer

The `return_overflowing_tokens` argument keeps the part that would otherwise be cut off by truncation; here a review of length 177 is split into two chunks of 128 and 49 tokens. A sketch of the `tokenize_and_split` function is shown below.
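A sketch of the function, following the course notebook (`max_length=128` matches the 128/49 split mentioned above; the tokenizer object is assumed to be defined earlier):

```python
def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )
```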
```python
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
```
## Converting between formats

### Pandas
To enable the conversion between various third-party libraries, 🤗 Datasets provides a `Dataset.set_format()` function. This function only changes the output format of the dataset, so you can easily switch to another format without affecting the underlying data format, which is Apache Arrow. The formatting is done in place. To demonstrate, let's convert our dataset to Pandas:
```python
drug_dataset.set_format("pandas")
```
Typically you then take a full slice, `train_df = drug_dataset["train"][:]`, to get the whole split as a new DataFrame (you can check for yourself whether the object returned here is a 🤗 Dataset or a pandas DataFrame). You can also convert back with `Dataset.from_pandas()`; see the sketch below.
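A sketch of the round trip, assuming `train_df` as defined above:

```python
from datasets import Dataset

# back from a pandas DataFrame to a 🤗 Dataset
train_dataset = Dataset.from_pandas(train_df)

# reset the output format to plain Python objects when done
drug_dataset.reset_format()
```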
### train_test_split

```python
drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)
```
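In the course the new `test` split is then renamed to `validation` and the original test set is added back; a sketch of that step:

```python
# rename the default "test" split to "validation"
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")
# add the original "test" set back to the DatasetDict
drug_dataset_clean["test"] = drug_dataset["test"]
```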
## Saving a dataset
Data format | Function |
---|---|
Arrow | Dataset.save_to_disk() |
CSV | Dataset.to_csv() |
JSON | Dataset.to_json() |
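A sketch of exporting each split to JSON Lines (the file-name pattern is illustrative; `drug_dataset_clean` is the DatasetDict from the split above):

```python
# one JSONL file per split
for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"drug-reviews-{split}.jsonl")
```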
## Handling very large datasets

### Checking memory with psutil

Install it with `pip install psutil`; it can report how much RAM the current process is using, as sketched below.
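A minimal sketch (resident set size of the current Python process, in megabytes):

```python
import psutil

# Process() with no argument refers to the current process
ram_mb = psutil.Process().memory_info().rss / (1024 * 1024)
print(f"RAM used: {ram_mb:.2f} MB")
```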
If you're familiar with Pandas, this result might come as a surprise because of Wes McKinney's famous rule of thumb that you typically need 5 to 10 times as much RAM as the size of your dataset. So how does 🤗 Datasets solve this memory management problem? 🤗 Datasets treats each dataset as a memory-mapped file, which provides a mapping between RAM and filesystem storage that allows the library to access and operate on elements of the dataset without needing to fully load it into memory. Memory-mapped files can also be shared across multiple processes, which enables methods like `Dataset.map()` to be parallelized without needing to move or copy the dataset. Under the hood, these capabilities are all realized by the Apache Arrow memory format and the `pyarrow` library, which make the data loading and processing lightning fast.
In short: the usual rule of thumb is that you need five to ten times your file size in RAM, but 🤗 Datasets memory-maps the data via pyarrow, so you can work with huge datasets while only touching the parts you need, and `map()` can be parallelized across processes.
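A quick way to see this in practice (a sketch; `pubmed_dataset` stands for a large dataset loaded earlier, as in the course, and `dataset_size` reports the size of the Arrow cache on disk):

```python
import psutil

size_gb = pubmed_dataset.dataset_size / (1024**3)
ram_gb = psutil.Process().memory_info().rss / (1024**3)
print(f"Dataset size on disk: {size_gb:.2f} GB, process RAM: {ram_gb:.2f} GB")
```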
### Streaming data

If we tried to download the Pile in its entirety, we'd need 825 GB of free disk space! To handle these cases, 🤗 Datasets provides a streaming feature that allows us to download and access elements on the fly, without needing to download the whole dataset. Let's take a look at how this works.
Passing `streaming=True` to `load_dataset()` returns an `IterableDataset` object; see the sketch below.
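A sketch of the streaming call (the `data_files` value is a placeholder for the PubMed-abstracts subset used in the course):

```python
# streaming=True means nothing is downloaded up front
pubmed_dataset_streamed = load_dataset(
    "json",
    data_files="path/or/url/to/pubmed_abstracts_subset.jsonl",  # placeholder
    split="train",
    streaming=True,
)

# elements are fetched lazily, one at a time
next(iter(pubmed_dataset_streamed))
```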
A streamed dataset can also be tokenized on the fly; a sketch follows below.
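A sketch of lazy tokenization on the streamed dataset (the checkpoint name follows the course; treat it as an assumption):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# map() on an IterableDataset is applied lazily as you iterate
tokenized_dataset = pubmed_dataset_streamed.map(lambda x: tokenizer(x["text"]))
next(iter(tokenized_dataset))
```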
💡 To speed up tokenization with streaming you can pass `batched=True`, as we saw in the last section. It will process the examples batch by batch; the default batch size is 1,000 and can be specified with the `batch_size` argument.
### Combining datasets

🤗 Datasets provides an `interleave_datasets()` function that converts a list of `IterableDataset` objects into a single `IterableDataset`, where the elements of the new dataset are obtained by alternating among the source examples.
To combine the PubMed stream with a second corpus, another streamed dataset (e.g. `law_dataset_streamed`) is loaded the same way and then interleaved; see the sketch below.
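A sketch of the step (the second corpus's `data_files` value is a placeholder; in the course it is the FreeLaw subset of the Pile):

```python
from itertools import islice
from datasets import interleave_datasets, load_dataset

law_dataset_streamed = load_dataset(
    "json",
    data_files="path/or/url/to/free_law_subset.jsonl",  # placeholder
    split="train",
    streaming=True,
)

# the combined stream alternates between the two source streams
combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])
list(islice(combined_dataset, 2))
```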