Preface

ViT: Vision Transformer

Here we use Kaggle's leaf-classification task to show how a pretrained (pre-train) ViT model can lift our results on the task.

1. Inspecting the Model & Processing the Data

1.1 Exploring the Model

Whether it is Python's nature (a package for every domain) or the pretrain paradigm that now dominates NLP, being able to quickly understand a package and bend it to our work greatly improves both our efficiency and our results. So below we take a quick look at the example HuggingFace provides for this model, and then process our data for our task accordingly.

from transformers import ViTFeatureExtractor, ViTForImageClassification
from PIL import Image
import requests

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
# model predicts one of the 1000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])

You can run the code above as-is.

1.1.1 Walking Through the Example

  • The first ten or so lines: we fetch an image with the requests library and turn it into an Image object; the two lines after that load the ViT patch-16 feature extractor and the image-classification variant of the model that HF provides.

  • Next, let's look at the feature-extracted input (inputs)

    # inputs looks like this
    '''
    {'pixel_values': tensor([[[[ 0.1137, 0.1686, 0.1843, ..., -0.1922, -0.1843, -0.1843],
    [ 0.1373, 0.1686, 0.1843, ..., -0.1922, -0.1922, -0.2078],
    [ 0.1137, 0.1529, 0.1608, ..., -0.2314, -0.2235, -0.2157],
    ...,
    [ 0.8353, 0.7882, 0.7333, ..., 0.7020, 0.6471, 0.6157],
    [ 0.8275, 0.7961, 0.7725, ..., 0.5843, 0.4667, 0.3961],
    [ 0.8196, 0.7569, 0.7569, ..., 0.0745, -0.0510, -0.1922]],

    [[-0.8039, -0.8118, -0.8118, ..., -0.8902, -0.8902, -0.8980],
    [-0.7882, -0.7882, -0.7882, ..., -0.8745, -0.8745, -0.8824],
    [-0.8118, -0.8039, -0.7882, ..., -0.8902, -0.8902, -0.8902],
    ...,
    [-0.2706, -0.3176, -0.3647, ..., -0.4275, -0.4588, -0.4824],
    [-0.2706, -0.2941, -0.3412, ..., -0.4824, -0.5451, -0.5765],
    [-0.2784, -0.3412, -0.3490, ..., -0.7333, -0.7804, -0.8353]],

    [[-0.5451, -0.4667, -0.4824, ..., -0.7412, -0.6941, -0.7176],
    [-0.5529, -0.5137, -0.4902, ..., -0.7412, -0.7098, -0.7412],
    [-0.5216, -0.4824, -0.4667, ..., -0.7490, -0.7490, -0.7647],
    ...,
    [ 0.5686, 0.5529, 0.4510, ..., 0.4431, 0.3882, 0.3255],
    [ 0.5451, 0.4902, 0.5137, ..., 0.3020, 0.2078, 0.1294],
    [ 0.5686, 0.5608, 0.5137, ..., -0.2000, -0.4275, -0.5294]]]])}
    '''

    inputs['pixel_values'].size()
    # torch.Size([1, 3, 224, 224])

    You can see it is a dict holding a tensor whose shape is (B, C, H, W).

    So the data we feed the model must also have this four-dimensional structure.

  • Now let's look at what the model spits out

    # outputs looks like this
    '''
    MaskedLMOutput(loss=tensor(0.4776, grad_fn=<DivBackward0>), logits=tensor([[[[-0.0630, -0.0475, -0.1557, ..., 0.0950, 0.0216, -0.0084],
    [-0.1219, -0.0329, -0.0849, ..., -0.0152, -0.0143, -0.0663],
    [-0.1063, -0.0925, -0.0350, ..., 0.0238, -0.0206, -0.2159],
    ...,
    [ 0.2204, 0.0593, -0.2771, ..., 0.0819, 0.0535, -0.1783],
    [-0.0302, -0.1537, -0.1370, ..., -0.1245, -0.1181, -0.0070],
    [ 0.0875, 0.0626, -0.0693, ..., 0.1331, 0.1088, -0.0835]],

    [[ 0.1977, -0.2163, 0.0469, ..., 0.0802, -0.0414, 0.0552],
    [ 0.1125, -0.0369, 0.0175, ..., 0.0598, -0.0843, 0.0774],
    [ 0.1559, -0.0994, -0.0055, ..., -0.0215, 0.2452, -0.0603],
    ...,
    [ 0.0603, 0.1887, 0.2060, ..., 0.0415, -0.0383, 0.0990],
    [ 0.2106, 0.0992, -0.1562, ..., -0.1254, -0.0603, 0.0685],
    [ 0.0256, 0.1578, 0.0304, ..., -0.0894, 0.0659, 0.1493]],

    [[-0.0348, -0.0362, -0.1617, ..., 0.0527, 0.1927, 0.1431],
    [-0.0447, 0.0137, -0.0798, ..., 0.1057, -0.0299, -0.0742],
    [-0.0725, 0.1473, -0.0118, ..., -0.1284, 0.0010, -0.0773],
    ...,
    [-0.0315, 0.1065, -0.1130, ..., 0.0091, -0.0650, 0.0688],
    [ 0.0314, 0.1034, -0.0964, ..., 0.0144, 0.0532, -0.0415],
    [-0.0205, 0.0046, -0.0987, ..., 0.1317, -0.0065, -0.1617]]]],
    grad_fn=<UnsafeViewBackward0>), hidden_states=None, attentions=None) '''

    You can see the output carries loss, logits, hidden_states and attentions, and the example only takes logits as its result. That is not to say the other parts are useless; we simply take whichever output fits the downstream task. See the ViT paper for the details.

  • Finally, the argmax function and model.config.id2label turn the prediction into its text label

    argmax returns the index of the maximum value, and model.config.id2label maps each index to a label name; it also tells us the final classifier layer is 1000-dimensional. A small sketch of this step follows below.
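
As a small, hedged extension of the example above (reusing the model and inputs objects), we can also softmax the logits and read off the top-5 ImageNet guesses:

import torch

with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)   # scores -> probabilities
top5 = probs.topk(5, dim=-1)
for p, idx in zip(top5.values[0], top5.indices[0]):
    print(f"{model.config.id2label[idx.item()]}: {p.item():.4f}")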

1.1.2 Takeaways

From the exploration above we can conclude:

  1. The input shape is (batch_size, 3, 224, 224)
  2. The final classifier must be changed from 1000 outputs to our number of leaf classes (see the sketch after this list)
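
Point 2 previews section 3.4. A minimal sketch, assuming the 176 leaf classes of this competition and ViT-base's hidden size of 768:

import torch.nn as nn
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
model.classifier = nn.Linear(768, 176)   # replace the 1000-class ImageNet head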

1.2 Data Processing

Next we explore the characteristics of the data and reshape it to fit our model.

1.2.1 EDA

That is, Exploratory Data Analysis.

First, import the required packages:

# import the packages we need
import torch
import torch.nn as nn
from torch.nn import functional as F
import random
import copy


from fastprogress.fastprogress import master_bar, progress_bar
from torch.cuda.amp import autocast
import pandas as pd
import numpy as np
from torch.utils.data import Dataset, DataLoader, TensorDataset
from torchvision import transforms
from sklearn.model_selection import KFold
from PIL import Image
import os
import matplotlib.pyplot as plt
import torchvision.models as models
from torch.optim.lr_scheduler import CosineAnnealingLR


from transformers import (AdamW, get_scheduler)
from transformers import ViTFeatureExtractor, ViTForImageClassification
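
One note: section 3.4 below calls a set_seed helper that the original snippets never define. A minimal sketch of one, assuming we want to seed every RNG in play:

def set_seed(seed):
    # hypothetical helper: seed Python, NumPy and PyTorch RNGs for reproducibility
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)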

Take a look at the raw data:

train_df = pd.read_csv('/kaggle/input/classify-leaves/train.csv')

Use the code below to assign each class a numeric index:

def num_map(file_path):
    data_df = pd.read_csv(file_path)  # actually use the argument instead of a hard-coded path

    categories = data_df.label.unique().tolist()
    categories_zip = list(zip(range(len(categories)), categories))
    categories_dict = {v: k for k, v in categories_zip}

    data_df['num_label'] = data_df.label.map(categories_dict)

    return data_df

show_df = num_map('/kaggle/input/classify-leaves/train.csv')
show_df.to_csv('train_valid_dataset.csv', index=False)  # index=False keeps the columns clean on re-read
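
We will need the inverse mapping later to turn predicted indices back into label names; a minimal sketch:

# invert the mapping: numeric index -> label name (used again at submission time)
num2label = dict(zip(show_df.num_label, show_df.label))
print(num2label[0])   # e.g. 'maclura_pomifera', whichever label appears first in the CSV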

1.2.2 Inspecting the Images

path = '/kaggle/input/classify-leaves/'
img = Image.open(path + train_df.image[1])

# plt.figure("Image")  # name of the figure window
plt.imshow(img)
plt.axis('off')  # turn the axes off
plt.title('image')  # figure title
plt.show()

Here we transpose the dimensions, i.e. reorder [0, 1, 2] into [2, 1, 0], and then look at a single channel:

# np.asarray(img).shape
# note the image comes out rotated; the correct order would be .transpose([2, 0, 1])
img_trans = np.asarray(img).transpose([2, 1, 0])
plt.imshow(img_trans[0])
plt.show()

2. Preprocessing

Next we do three things: data augmentation, dataset class definition, and a data-loader test.

2.1.1 First, Compute the Mean and Std

We compute the mean and std here so we can Normalize the data, which makes training more stable.

import os
import cv2
import numpy as np
import math


def get_image_list(img_dir, isclasses=False):
    """Build the list of image file names.
    args: img_dir: directory holding the images
          isclasses: whether the images are stored in per-class subfolders
    return: list of image file names
    """
    img_list = []
    # are the images under this path stored per class?
    if isclasses:
        img_file = os.listdir(img_dir)
        for class_name in img_file:
            if not os.path.isfile(os.path.join(img_dir, class_name)):
                class_img_list = os.listdir(os.path.join(img_dir, class_name))
                img_list.extend(class_img_list)
    else:
        img_list = os.listdir(img_dir)
    print(img_list)
    print('image numbers: {}'.format(len(img_list)))
    return img_list


def get_image_pixel_mean(img_dir, img_list, img_size):
    """Compute the dataset's per-channel R, G, B means.
    args: img_dir: directory holding the images
          img_list: list of image file names
          img_size: side length the images are resized to
    """
    R_sum = 0
    G_sum = 0
    B_sum = 0
    count = 0
    # loop over every image
    for img_name in img_list:
        img_path = os.path.join(img_dir, img_name)
        if not os.path.isdir(img_path):
            image = cv2.imread(img_path)
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
            image = cv2.resize(image, (img_size, img_size))  # <class 'numpy.ndarray'>
            R_sum += image[:, :, 0].mean()
            G_sum += image[:, :, 1].mean()
            B_sum += image[:, :, 2].mean()
            count += 1
    R_mean = R_sum / count
    G_mean = G_sum / count
    B_mean = B_sum / count
    print('R_mean:{}, G_mean:{}, B_mean:{}'.format(R_mean, G_mean, B_mean))
    RGB_mean = [R_mean, G_mean, B_mean]
    return RGB_mean


def get_image_pixel_std(img_dir, img_mean, img_list, img_size):
    R_squared_mean = 0
    G_squared_mean = 0
    B_squared_mean = 0
    count = 0
    image_mean = np.array(img_mean)
    # loop over every image
    for img_name in img_list:
        img_path = os.path.join(img_dir, img_name)
        if not os.path.isdir(img_path):
            image = cv2.imread(img_path)  # read the image
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
            image = cv2.resize(image, (img_size, img_size))  # <class 'numpy.ndarray'>
            image = image - image_mean  # zero-center
            # accumulate each image's per-channel variance
            R_squared_mean += np.mean(np.square(image[:, :, 0]).flatten())
            G_squared_mean += np.mean(np.square(image[:, :, 1]).flatten())
            B_squared_mean += np.mean(np.square(image[:, :, 2]).flatten())
            count += 1
    R_std = math.sqrt(R_squared_mean / count)
    G_std = math.sqrt(G_squared_mean / count)
    B_std = math.sqrt(B_squared_mean / count)
    print('R_std:{}, G_std:{}, B_std:{}'.format(R_std, G_std, B_std))
    RGB_std = [R_std, G_std, B_std]
    return RGB_std

if __name__ == '__main__':
    image_dir = '/path/to/your/images'
    image_list = get_image_list(image_dir, isclasses=False)
    RGB_mean = get_image_pixel_mean(image_dir, image_list, img_size=224)
    get_image_pixel_std(image_dir, RGB_mean, image_list, img_size=224)
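
One caveat: these statistics are computed on [0, 255] pixels, while ToTensor rescales images to [0, 1] before Normalize runs. A minimal sketch of plugging them in (reusing the names from the script above, this time keeping the std call's return value):

RGB_std = get_image_pixel_std(image_dir, RGB_mean, image_list, img_size=224)

# divide by 255 so the statistics match ToTensor's [0, 1] output range
normalize = transforms.Normalize([m / 255 for m in RGB_mean],
                                 [s / 255 for s in RGB_std])

The transform in 2.1.2 below simply uses the standard ImageNet statistics instead, which is also a reasonable choice when fine-tuning an ImageNet-pretrained backbone.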

2.1.2 Data Augmentation

transforms.Compose

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3.0 / 4.0, 4.0 / 3.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

See the official torchvision docs for each transform's details. A quick sanity check follows below.
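
As that sanity check (reusing the img opened in 1.2.2), the pipeline maps a PIL image to a normalized 3 x 224 x 224 tensor:

aug = train_transform(img)   # PIL image in, normalized tensor out
print(aug.size())            # torch.Size([3, 224, 224])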

2.2.1 Dataset

class imgdataset(Dataset):
    def __init__(self, os_path, file_path, transform):
        self.os_path = os_path
        self.data = pd.read_csv(file_path)
        self.transform = transform

    def __getitem__(self, idx):
        img = Image.open(self.os_path + self.data.image[idx])
        label = self.data.num_label[idx]

        if self.transform is not None:
            img = self.transform(img)
        label = torch.tensor(label)

        return img, label

    def __len__(self):
        return self.data.shape[0]

2.2.2 Testing the Model

train_dataset = imgdataset('/kaggle/input/classify-leaves/',
                           '/kaggle/working/train_valid_dataset.csv',
                           transform=train_transform)
train_dataloader = DataLoader(train_dataset, batch_size=1, shuffle=True)

samples = next(iter(train_dataloader))
samples[0], samples[1]

The output looks like this:
(tensor([[[[0.7608, 0.7608, 0.7608, ..., 0.8353, 0.8353, 0.8392],
[0.7608, 0.7608, 0.7608, ..., 0.8392, 0.8353, 0.8431],
[0.7608, 0.7608, 0.7608, ..., 0.8392, 0.8392, 0.8431],
...,
[0.7725, 0.7725, 0.7725, ..., 0.7725, 0.7725, 0.7725],
[0.7725, 0.7725, 0.7725, ..., 0.7725, 0.7725, 0.7725],
[0.7725, 0.7725, 0.7725, ..., 0.7725, 0.7725, 0.7725]],

[[0.8118, 0.8118, 0.8118, ..., 0.8588, 0.8588, 0.8627],
[0.8118, 0.8118, 0.8118, ..., 0.8627, 0.8588, 0.8667],
[0.8118, 0.8118, 0.8118, ..., 0.8627, 0.8627, 0.8667],
...,
[0.7725, 0.7725, 0.7725, ..., 0.7725, 0.7725, 0.7725],
[0.7725, 0.7725, 0.7725, ..., 0.7725, 0.7725, 0.7725],
[0.7725, 0.7725, 0.7725, ..., 0.7725, 0.7725, 0.7725]],

[[0.7725, 0.7725, 0.7725, ..., 0.8510, 0.8510, 0.8549],
[0.7725, 0.7725, 0.7725, ..., 0.8549, 0.8510, 0.8588],
[0.7725, 0.7725, 0.7725, ..., 0.8549, 0.8549, 0.8588],
...,
[0.7725, 0.7725, 0.7725, ..., 0.7725, 0.7725, 0.7725],
[0.7725, 0.7725, 0.7725, ..., 0.7725, 0.7725, 0.7725],
[0.7725, 0.7725, 0.7725, ..., 0.7725, 0.7725, 0.7725]]]]),
tensor([55]))

Here you can see directly that transforms' ToTensor has already converted our data into the (C, H, W) layout (originally the channel dimension came last).
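
A one-line check of that claim (the leading 1 is just the loader's batch_size):

print(samples[0].size())   # torch.Size([1, 3, 224, 224])

Note that the quick test below still runs through the stock 1000-class ImageNet head; the 176-class head is only swapped in at section 3.4.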

test_ot = model(samples[0])
test_pred = test_ot.logits.argmax(-1)
test_pred, test_ot.logits

3. The Training Loop

3.1 plot

def plot_loss_update(epochs, mb, train_acc, valid_acc):
    """Redraw the train/valid accuracy curves with fastprogress."""
    x = range(1, len(train_acc) + 1)
    graphs = [[x, train_acc], [x, valid_acc]]
    x_margin = 0.2
    y_margin = 0.05
    x_bounds = [1 - x_margin, epochs + x_margin]
    # accuracy lives in [0, 1], so fix the y bounds instead of deriving them from the data
    y_bounds = [0 - y_margin, 1 + y_margin]

    mb.update_graph(graphs, x_bounds, y_bounds)

plot_loss_update above relies on fastprogress, a package that plots the accuracy live during training.

3.2 train_valid

def train_loop(net, device, criterion, opti, lr, lr_scheduler, batch_size,
               train_loader, val_loader, epochs, model_name):

    train_acc, valid_acc = [], []
    mb = master_bar(range(1, epochs + 1))

    for epoch in mb:
        train_correct, valid_correct = 0, 0

        # train part
        net.train()
        for batch_data in progress_bar(train_loader, parent=mb):

            x, y = tuple(k.to(device) for k in batch_data)
            outputs = net(x).logits
            loss = criterion(outputs, y)
            train_correct += (outputs.argmax(dim=-1) == y).sum().item()

            opti.zero_grad()
            loss.backward()
            opti.step()
            lr_scheduler.step()  # step the scheduler after the optimizer
            lr_now = lr_scheduler.get_last_lr()[0]
        train_acc.append(train_correct / (len(train_loader) * batch_size))

        # valid part
        net.eval()
        with torch.no_grad():
            for batch_data in progress_bar(val_loader, parent=mb):

                x, y = tuple(k.to(device) for k in batch_data)
                outputs = net(x).logits
                valid_correct += (outputs.argmax(dim=-1) == y).sum().item()

        valid_acc.append(valid_correct / (len(val_loader) * batch_size))

        # plot
        plot_loss_update(epochs, mb, train_acc, valid_acc)

        # print info
        print(f"Epoch {epoch} complete! Train Acc: {train_acc[-1]*100:.5f}% with lr {lr_now:.6f}", '\n')
        print(f"Epoch {epoch} complete! Validation Acc: {valid_acc[-1]*100:.5f}%", '\n')

    return valid_acc[-1]

Above we keep two lists of accuracy values so the curves can be drawn.

3.3 kfold_save

def kfold_loop(data, save_path, config):
    best_acc = 0

    for fold, (train_ids, valid_ids) in enumerate(kfold.split(data)):
        print(f'FOLD {fold}')
        print('--------------------------------------')

        # fill in the per-fold parts of the config
        train_subsampler = torch.utils.data.SubsetRandomSampler(train_ids)
        valid_subsampler = torch.utils.data.SubsetRandomSampler(valid_ids)

        config['train_loader'] = torch.utils.data.DataLoader(data, batch_size=32,
                                     sampler=train_subsampler, num_workers=2)
        config['val_loader'] = torch.utils.data.DataLoader(data, batch_size=32,
                                     sampler=valid_subsampler, num_workers=2)
        config['opti'] = torch.optim.AdamW(config['net'].parameters(), lr=config['lr'])
        config['lr_scheduler'] = CosineAnnealingLR(config['opti'], T_max=10)

        config['net'].to(device)
        valid_acc_now = train_loop(**config)

        # save the best model so far
        if valid_acc_now > best_acc:
            print(f"Best validation Acc improved from {best_acc:.5f} to {valid_acc_now:.5f}")
            net_copy = copy.deepcopy(config['net'])
            best_acc = valid_acc_now

            model_name = config['model_name']
            path_to_model = f'{save_path}/{model_name}_lr_{config["lr"]}_valid_acc_{best_acc:.5f}.pt'
            torch.save(net_copy.state_dict(), path_to_model)
            print(f"The model has been saved in {path_to_model}")

Here we run k-fold cross-validation; see the sketch below for what the splitter produces.
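
A minimal sketch of what kfold.split yields (KFold only needs the input's length, which is why passing a Dataset works): each fold is a pair of disjoint train/valid index arrays.

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_ids, valid_ids) in enumerate(kf.split(range(10))):
    print(fold, train_ids, valid_ids)   # 5 folds, each holding out 2 of the 10 indices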

3.4 config

Finally, we define the hyperparameters and the remaining building blocks.

seed = 1222
bs = 32
lr = 3e-4
epochs = 10
warm_steps = 122 * epochs    # prepared for transformers' get_scheduler; unused with CosineAnnealingLR below
total_steps = 458 * epochs

set_seed(seed)

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3.0 / 4.0, 4.0 / 3.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

os_path = '/kaggle/input/classify-leaves/'
file_path = '/kaggle/working/train_valid_dataset.csv'
dataset = imgdataset(os_path, file_path, train_transform)
kfold = KFold(n_splits=5, shuffle=True)

model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
model.classifier = nn.Linear(768, 176)  # 176 leaf classes instead of the 1000 ImageNet ones
for idx, para in enumerate(model.parameters()):  # freeze the early parameters
    para.requires_grad = False
    if idx == 197:
        break

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
criter = nn.CrossEntropyLoss()
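
To see what freezing through index 197 actually covers, you can list the parameter tensors (a hedged aside: the exact indices depend on the checkpoint, but for google/vit-base-patch16-224 this appears to leave only the new classifier head trainable):

for idx, (name, para) in enumerate(model.named_parameters()):
    print(idx, name, tuple(para.shape))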

Bundle everything into a config:

config = {
    'net': model,
    'device': device,

    'lr': lr,
    'opti': 1,            # placeholder, rebuilt per fold in kfold_loop
    'lr_scheduler': 1,    # placeholder, rebuilt per fold in kfold_loop
    'criterion': criter,

    'batch_size': bs,
    'train_loader': 1,    # placeholder, rebuilt per fold in kfold_loop
    'val_loader': 1,      # placeholder, rebuilt per fold in kfold_loop
    'epochs': epochs,
    'model_name': 'leaves_classifier_model'
}

4. Training & Analyzing the Results

!mkdir model_save
save_dir = os.getcwd()+'/model_save'

kfold_loop(dataset, save_dir, config)

The first fold hit a snag; in any case valid_acc went from roughly 6% to 23%, and from then on it behaved like the figure below.

I grabbed two folds' worth of numbers for inspection (one fold takes roughly 40 minutes on a P100).

  • As the model's training-set accuracy rises, its validation accuracy gradually tracks train_acc. Since the validation data was never trained on, there is some jitter along the way, but the model clearly learned something.

Recap

Hardware limits mean we stop training here (the model was still improving), and we skip model ensembling and submission validation. Briefly, though:

  • Take voting-based model ensembling as an example: weight ViT's vote at 0.4, ResNeSt and ResNeXt at 0.25 each, and VGG at 0.1, then output the winning label
  • On the test set, proceed exactly like the valid part: just output the position of the maximum over the 176 dimensions. A hedged sketch of both steps follows.
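
A minimal sketch of that weighted soft vote plus the index-to-name lookup. None of this ships with the post: vit, resnest, resnext, vgg are assumed fine-tuned 176-class models, test_loader an assumed DataLoader over the test images, and num2label the inverse mapping built in section 1.2.1:

def get_logits(m, x):
    out = m(x)
    return out.logits if hasattr(out, 'logits') else out  # HF wraps logits; torchvision returns them

models_and_weights = [(vit, 0.4), (resnest, 0.25), (resnext, 0.25), (vgg, 0.1)]

pred_labels = []
with torch.no_grad():
    for x in test_loader:
        probs = sum(w * get_logits(m, x).softmax(dim=-1) for m, w in models_and_weights)
        idxs = probs.argmax(dim=-1)                   # position of the max over the 176 dims
        pred_labels.extend(num2label[i.item()] for i in idxs)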

Conclusion

This time we covered the pretrain paradigm, K-fold validation, and data augmentation.

  • The key point is the 'take and reuse' mindset: grab what already exists and put it to work
  • K-fold cross-validation is just one of several validation strategies
  • Data augmentation has to build on an understanding of the dataset; it is required study for every practitioner, and there are many more augmentation techniques than the ones used here

Next, we will move on to contrastive learning.