Object detection is a core task in computer vision. This tutorial implements the SSD object detection model with the Jittor framework.
SSD paper: https://arxiv.org/pdf/1512.02325.pdf
1. Dataset
1.1 Data Preparation
The VOC dataset is one of the standard datasets for object detection, semantic segmentation, and related tasks. This tutorial trains on the union of VOC2007 trainval and VOC2012 trainval, and uses VOC2007 test for validation and testing. You can download the data from the links below.
The VOC dataset contains 20 object classes: 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor'.

Extract the three archives into the same folder, then generate the JSON files needed for training with the create_data_lists() function in utils.py.
Its voc07_path and voc12_path arguments should point to ./data/VOCdevkit/VOC2007/ and ./data/VOCdevkit/VOC2012/ respectively; output_folder can be any path you like, e.g. ./dataset/. After the call, output_folder contains five files: label_map.json, TEST_images.json, TEST_objects.json, TRAIN_images.json, and TRAIN_objects.json.
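For example, the call might look like this (the paths match the layout shown below):

from utils import create_data_lists

create_data_lists(voc07_path='./data/VOCdevkit/VOC2007/',
                  voc12_path='./data/VOCdevkit/VOC2012/',
                  output_folder='./dataset/')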
The final directory layout is shown below.
# Directory layout
root
|----data
| |----VOCdevkit
| | |----VOC2007
| | | |----Annotations
| | | |----ImageSets
| | | |----JPEGImages
| | | |----SegmentationClass
| | | |----SegmentationObject
| | |----VOC2012
| | | |----Annotations
| | | |----ImageSets
| | | |----JPEGImages
| | | |----SegmentationClass
| | | |----SegmentationObject
|----dataset
| |----label_map.json
| |----TEST_images.json
| |----TEST_objects.json
| |----TRAIN_images.json
| |----TRAIN_objects.json
1.2 Data Loading
You can build a custom dataset on top of the Dataset base class from jittor.dataset.dataset; it needs to implement __init__, __getitem__, __len__, and collate_batch:
- __init__: defines the data paths; data_folder here must be the output_folder you chose earlier. It also has to call self.set_attr to register the loader parameters batch_size, total_len, and shuffle.
- __getitem__: returns the data of a single item.
- __len__: returns the total number of items in the dataset.
- collate_batch: because different training images contain different numbers of ground-truth boxes, this function is overridden to collect each image's boxes and labels into lists and return one batch of batch_size items.
from jittor.dataset.dataset import Dataset
import json
import os
from PIL import Image
import numpy as np

class PascalVOCDataset(Dataset):
    def __init__(self, data_folder, split, keep_difficult=False, batch_size=1, shuffle=False):
        super(PascalVOCDataset, self).__init__()
        self.split = split.upper()
        assert self.split in {'TRAIN', 'TEST'}
        self.data_folder = data_folder  # the output_folder used in create_data_lists
        self.keep_difficult = keep_difficult  # keep or discard objects marked as difficult to detect
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.mean = [0.485, 0.456, 0.406]
        self.std = [0.229, 0.224, 0.225]
        with open(os.path.join(data_folder, self.split + '_images.json'), 'r') as j:
            self.images = json.load(j)
        with open(os.path.join(data_folder, self.split + '_objects.json'), 'r') as j:
            self.objects = json.load(j)
        assert len(self.images) == len(self.objects)
        self.total_len = len(self.images)
        # register batch_size, total_len and shuffle with the Dataset base class
        self.set_attr(batch_size=self.batch_size, total_len=self.total_len, shuffle=self.shuffle)

    def __getitem__(self, i):
        image = Image.open(self.images[i], mode='r')
        width, height = image.size
        image = image.resize((300, 300))
        image = np.array(image.convert('RGB')) / 255.
        image = (image - self.mean) / self.std
        image = image.transpose((2, 0, 1)).astype("float32")
        objects = self.objects[i]
        boxes = np.array(objects['boxes']).astype("float32")
        # normalize box coordinates to [0, 1]
        boxes[:, [0, 2]] /= width
        boxes[:, [1, 3]] /= height
        labels = np.array(objects['labels'])
        difficulties = np.array(objects['difficulties'])
        # Discard difficult objects, if desired (use a boolean mask, not integer indexing)
        if not self.keep_difficult:
            keep = difficulties == 0
            boxes = boxes[keep]
            labels = labels[keep]
            difficulties = difficulties[keep]
        return image, boxes, labels, difficulties

    def __len__(self):
        return len(self.images)

    def collate_batch(self, batch):
        # images all have the same shape and can be stacked into one array;
        # boxes/labels/difficulties vary in length per image, so keep them as lists
        images = list()
        boxes = list()
        labels = list()
        difficulties = list()
        for b in batch:
            images.append(b[0])
            boxes.append(b[1])
            labels.append(b[2])
            difficulties.append(b[3])
        images = np.stack(images, axis=0)
        return images, boxes, labels, difficulties
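A minimal usage sketch (values are illustrative): a jittor Dataset instance acts as its own data loader, so iterating it directly yields collated batches. './dataset/' is the output_folder chosen in Section 1.1.

train_data = PascalVOCDataset('./dataset/', split='train',
                              keep_difficult=False,
                              batch_size=8, shuffle=True)
for images, boxes, labels, difficulties in train_data:
    print(images.shape)  # one batch of 8 images, each 3x300x300
    print(len(boxes))    # 8 per-image arrays of ground-truth boxes
    break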
2. Model Definition

The figure above, taken from the SSD paper, shows the network architecture. This tutorial uses VGG-16 as the backbone, with some changes to the architecture; the input image size is 300*300. Note that:
- The tutorial taps the VGG-16 intermediate feature maps conv4_3 and conv7, plus the feature maps conv8_2, conv9_2, conv10_2, and conv11_2 from the Extra Feature Layers (AuxiliaryConvolutions). Their spatial sizes are 38*38, 19*19, 10*10, 5*5, 3*3, and 1*1; the prior (anchor) scales are 0.1, 0.2, 0.375, 0.55, 0.725, and 0.9; each feature-map cell produces 4, 6, 6, 6, 4, and 4 priors, so the maps contribute 5776, 2166, 600, 150, 36, and 4 priors respectively, for a total of 8732 priors (verified by the snippet below).
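As a quick sanity check, each feature-map cell produces n_boxes priors, so the per-map totals and the 8732 sum follow directly from the map sizes (an illustrative snippet, not part of the model code):

fmap_dims = {'conv4_3': 38, 'conv7': 19, 'conv8_2': 10, 'conv9_2': 5, 'conv10_2': 3, 'conv11_2': 1}
n_boxes = {'conv4_3': 4, 'conv7': 6, 'conv8_2': 6, 'conv9_2': 6, 'conv10_2': 4, 'conv11_2': 4}
totals = {k: fmap_dims[k] ** 2 * n_boxes[k] for k in fmap_dims}
print(totals)                # {'conv4_3': 5776, 'conv7': 2166, ..., 'conv11_2': 4}
print(sum(totals.values()))  # 8732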
import jittor as jt
from jittor import nn

class VGGBase(nn.Module):
    def __init__(self):
        super(VGGBase, self).__init__()
        self.conv1_1 = nn.Conv(3, 64, kernel_size=3, padding=1)
        self.conv1_2 = nn.Conv(64, 64, kernel_size=3, padding=1)
        self.pool1 = nn.Pool(kernel_size=2, stride=2, op='maximum')
        self.conv2_1 = nn.Conv(64, 128, kernel_size=3, padding=1)
        self.conv2_2 = nn.Conv(128, 128, kernel_size=3, padding=1)
        self.pool2 = nn.Pool(kernel_size=2, stride=2, op='maximum')
        self.conv3_1 = nn.Conv(128, 256, kernel_size=3, padding=1)
        self.conv3_2 = nn.Conv(256, 256, kernel_size=3, padding=1)
        self.conv3_3 = nn.Conv(256, 256, kernel_size=3, padding=1)
        self.pool3 = nn.Pool(kernel_size=2, stride=2, ceil_mode=True, op='maximum')  # ceil_mode: 75 -> 38
        self.conv4_1 = nn.Conv(256, 512, kernel_size=3, padding=1)
        self.conv4_2 = nn.Conv(512, 512, kernel_size=3, padding=1)
        self.conv4_3 = nn.Conv(512, 512, kernel_size=3, padding=1)
        self.pool4 = nn.Pool(kernel_size=2, stride=2, op='maximum')
        self.conv5_1 = nn.Conv(512, 512, kernel_size=3, padding=1)
        self.conv5_2 = nn.Conv(512, 512, kernel_size=3, padding=1)
        self.conv5_3 = nn.Conv(512, 512, kernel_size=3, padding=1)
        self.pool5 = nn.Pool(kernel_size=3, stride=1, padding=1, op='maximum')  # keeps 19x19
        self.conv6 = nn.Conv(512, 1024, kernel_size=3, padding=6, dilation=6)  # dilated (atrous) convolution
        self.conv7 = nn.Conv(1024, 1024, kernel_size=1)

    def execute(self, image):
        out = nn.relu(self.conv1_1(image))
        out = nn.relu(self.conv1_2(out))
        out = self.pool1(out)
        out = nn.relu(self.conv2_1(out))
        out = nn.relu(self.conv2_2(out))
        out = self.pool2(out)
        out = nn.relu(self.conv3_1(out))
        out = nn.relu(self.conv3_2(out))
        out = nn.relu(self.conv3_3(out))
        out = self.pool3(out)
        out = nn.relu(self.conv4_1(out))
        out = nn.relu(self.conv4_2(out))
        out = nn.relu(self.conv4_3(out))
        conv4_3_feats = out  # (N, 512, 38, 38)
        out = self.pool4(out)
        out = nn.relu(self.conv5_1(out))
        out = nn.relu(self.conv5_2(out))
        out = nn.relu(self.conv5_3(out))
        out = self.pool5(out)
        out = nn.relu(self.conv6(out))
        conv7_feats = nn.relu(self.conv7(out))  # (N, 1024, 19, 19)
        return (conv4_3_feats, conv7_feats)
class AuxiliaryConvolutions(nn.Module):
    def __init__(self):
        super(AuxiliaryConvolutions, self).__init__()
        self.conv8_1 = nn.Conv(1024, 256, kernel_size=1, padding=0)
        self.conv8_2 = nn.Conv(256, 512, kernel_size=3, stride=2, padding=1)
        self.conv9_1 = nn.Conv(512, 128, kernel_size=1, padding=0)
        self.conv9_2 = nn.Conv(128, 256, kernel_size=3, stride=2, padding=1)
        self.conv10_1 = nn.Conv(256, 128, kernel_size=1, padding=0)
        self.conv10_2 = nn.Conv(128, 256, kernel_size=3, padding=0)
        self.conv11_1 = nn.Conv(256, 128, kernel_size=1, padding=0)
        self.conv11_2 = nn.Conv(128, 256, kernel_size=3, padding=0)

    def execute(self, conv7_feats):
        out = nn.relu(self.conv8_1(conv7_feats))
        out = nn.relu(self.conv8_2(out))
        conv8_2_feats = out  # (N, 512, 10, 10)
        out = nn.relu(self.conv9_1(out))
        out = nn.relu(self.conv9_2(out))
        conv9_2_feats = out  # (N, 256, 5, 5)
        out = nn.relu(self.conv10_1(out))
        out = nn.relu(self.conv10_2(out))
        conv10_2_feats = out  # (N, 256, 3, 3)
        out = nn.relu(self.conv11_1(out))
        conv11_2_feats = nn.relu(self.conv11_2(out))  # (N, 256, 1, 1)
        return (conv8_2_feats, conv9_2_feats, conv10_2_feats, conv11_2_feats)
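A hypothetical smoke test (not part of the original tutorial) that pushes a dummy 300*300 image through both modules to confirm the feature-map sizes listed above:

base = VGGBase()
aux_convs = AuxiliaryConvolutions()
x = jt.randn(1, 3, 300, 300)  # dummy input image
conv4_3_feats, conv7_feats = base(x)
print(conv4_3_feats.shape)  # (1, 512, 38, 38)
print(conv7_feats.shape)    # (1, 1024, 19, 19)
for feats in aux_convs(conv7_feats):
    print(feats.shape)      # 10*10, 5*5, 3*3, 1*1 in turn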
PredictionConvolutions runs each of the six feature maps above through convolutional heads and concatenates the outputs into locs of shape [bs, 8732, 4] and classes_scores of shape [bs, 8732, n_classes].
class PredictionConvolutions(nn.Module):
    def __init__(self, n_classes):
        super(PredictionConvolutions, self).__init__()
        self.n_classes = n_classes
        # number of priors per cell of each feature map
        n_boxes = {
            'conv4_3': 4,
            'conv7': 6,
            'conv8_2': 6,
            'conv9_2': 6,
            'conv10_2': 4,
            'conv11_2': 4,
        }
        # localization heads: 4 box offsets per prior
        self.loc_conv4_3 = nn.Conv(512, (n_boxes['conv4_3'] * 4), kernel_size=3, padding=1)
        self.loc_conv7 = nn.Conv(1024, (n_boxes['conv7'] * 4), kernel_size=3, padding=1)
        self.loc_conv8_2 = nn.Conv(512, (n_boxes['conv8_2'] * 4), kernel_size=3, padding=1)
        self.loc_conv9_2 = nn.Conv(256, (n_boxes['conv9_2'] * 4), kernel_size=3, padding=1)
        self.loc_conv10_2 = nn.Conv(256, (n_boxes['conv10_2'] * 4), kernel_size=3, padding=1)
        self.loc_conv11_2 = nn.Conv(256, (n_boxes['conv11_2'] * 4), kernel_size=3, padding=1)
        # classification heads: n_classes scores per prior
        self.cl_conv4_3 = nn.Conv(512, (n_boxes['conv4_3'] * n_classes), kernel_size=3, padding=1)
        self.cl_conv7 = nn.Conv(1024, (n_boxes['conv7'] * n_classes), kernel_size=3, padding=1)
        self.cl_conv8_2 = nn.Conv(512, (n_boxes['conv8_2'] * n_classes), kernel_size=3, padding=1)
        self.cl_conv9_2 = nn.Conv(256, (n_boxes['conv9_2'] * n_classes), kernel_size=3, padding=1)
        self.cl_conv10_2 = nn.Conv(256, (n_boxes['conv10_2'] * n_classes), kernel_size=3, padding=1)
        self.cl_conv11_2 = nn.Conv(256, (n_boxes['conv11_2'] * n_classes), kernel_size=3, padding=1)

    def execute(self, conv4_3_feats, conv7_feats, conv8_2_feats, conv9_2_feats, conv10_2_feats, conv11_2_feats):
        batch_size = conv4_3_feats.shape[0]
        # each loc head: (N, n_boxes*4, H, W) -> (N, H, W, n_boxes*4) -> (N, H*W*n_boxes, 4)
        l_conv4_3 = self.loc_conv4_3(conv4_3_feats)
        l_conv4_3 = jt.transpose(l_conv4_3, [0, 2, 3, 1])
        l_conv4_3 = jt.reshape(l_conv4_3, [batch_size, -1, 4])
        l_conv7 = self.loc_conv7(conv7_feats)
        l_conv7 = jt.transpose(l_conv7, [0, 2, 3, 1])
        l_conv7 = jt.reshape(l_conv7, [batch_size, -1, 4])
        l_conv8_2 = self.loc_conv8_2(conv8_2_feats)
        l_conv8_2 = jt.transpose(l_conv8_2, [0, 2, 3, 1])
        l_conv8_2 = jt.reshape(l_conv8_2, [batch_size, -1, 4])
        l_conv9_2 = self.loc_conv9_2(conv9_2_feats)
        l_conv9_2 = jt.transpose(l_conv9_2, [0, 2, 3, 1])
        l_conv9_2 = jt.reshape(l_conv9_2, [batch_size, -1, 4])
        l_conv10_2 = self.loc_conv10_2(conv10_2_feats)
        l_conv10_2 = jt.transpose(l_conv10_2, [0, 2, 3, 1])
        l_conv10_2 = jt.reshape(l_conv10_2, [batch_size, -1, 4])
        l_conv11_2 = self.loc_conv11_2(conv11_2_feats)
        l_conv11_2 = jt.transpose(l_conv11_2, [0, 2, 3, 1])
        l_conv11_2 = jt.reshape(l_conv11_2, [batch_size, -1, 4])
        # same layout change for the class scores, ending in (N, H*W*n_boxes, n_classes)
        c_conv4_3 = self.cl_conv4_3(conv4_3_feats)
        c_conv4_3 = jt.transpose(c_conv4_3, [0, 2, 3, 1])
        c_conv4_3 = jt.reshape(c_conv4_3, [batch_size, -1, self.n_classes])
        c_conv7 = self.cl_conv7(conv7_feats)
        c_conv7 = jt.transpose(c_conv7, [0, 2, 3, 1])
        c_conv7 = jt.reshape(c_conv7, [batch_size, -1, self.n_classes])
        c_conv8_2 = self.cl_conv8_2(conv8_2_feats)
        c_conv8_2 = jt.transpose(c_conv8_2, [0, 2, 3, 1])
        c_conv8_2 = jt.reshape(c_conv8_2, [batch_size, -1, self.n_classes])
        c_conv9_2 = self.cl_conv9_2(conv9_2_feats)
        c_conv9_2 = jt.transpose(c_conv9_2, [0, 2, 3, 1])
        c_conv9_2 = jt.reshape(c_conv9_2, [batch_size, -1, self.n_classes])
        c_conv10_2 = self.cl_conv10_2(conv10_2_feats)
        c_conv10_2 = jt.transpose(c_conv10_2, [0, 2, 3, 1])
        c_conv10_2 = jt.reshape(c_conv10_2, [batch_size, -1, self.n_classes])
        c_conv11_2 = self.cl_conv11_2(conv11_2_feats)
        c_conv11_2 = jt.transpose(c_conv11_2, [0, 2, 3, 1])
        c_conv11_2 = jt.reshape(c_conv11_2, [batch_size, -1, self.n_classes])
        # concatenate in prior order: (N, 8732, 4) and (N, 8732, n_classes)
        locs = jt.contrib.concat([l_conv4_3, l_conv7, l_conv8_2, l_conv9_2, l_conv10_2, l_conv11_2], dim=1)
        classes_scores = jt.contrib.concat([c_conv4_3, c_conv7, c_conv8_2, c_conv9_2, c_conv10_2, c_conv11_2], dim=1)
        return (locs, classes_scores)
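The SSD300 class used in Section 3 is not shown in this excerpt. Below is a minimal sketch (an assumption, not the tutorial's full code) of how the three modules compose; the complete model typically also rescales the conv4_3 features with a learned L2-norm factor and builds the 8732 priors exposed as model.priors_cxcy, both omitted here:

class SSD300Sketch(nn.Module):
    def __init__(self, n_classes):
        super(SSD300Sketch, self).__init__()
        self.base = VGGBase()
        self.aux_convs = AuxiliaryConvolutions()
        self.pred_convs = PredictionConvolutions(n_classes)

    def execute(self, image):
        # 38*38 and 19*19 backbone feature maps
        conv4_3_feats, conv7_feats = self.base(image)
        # 10*10, 5*5, 3*3, 1*1 auxiliary feature maps
        conv8_2_feats, conv9_2_feats, conv10_2_feats, conv11_2_feats = self.aux_convs(conv7_feats)
        # locs: (N, 8732, 4); classes_scores: (N, 8732, n_classes)
        return self.pred_convs(conv4_3_feats, conv7_feats, conv8_2_feats,
                               conv9_2_feats, conv10_2_feats, conv11_2_feats)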
3. Model Training
The training hyperparameters are set as follows:
# Learning parameters
batch_size = 16 # batch size
epochs = 200 # number of epochs to run without early-stopping
workers = 4 # number of workers for loading data in the DataLoader
print_freq = 5 # print training or validation status every __ batches
lr = 1e-3 # learning rate
momentum = 0.9 # momentum
weight_decay = 5e-4 # weight decay
grad_clip = 1 # clip if gradients are exploding, which may happen at larger batch sizes (sometimes at 32) - you will recognize it by a sorting error in the MultiBox loss calculation
Define the model, optimizer, loss function, and the training/validation data loaders.
model = SSD300(n_classes=n_classes)
optimizer = nn.SGD(model.parameters(),
                   lr,
                   momentum=momentum,
                   weight_decay=weight_decay)
criterion = MultiBoxLoss(priors_cxcy=model.priors_cxcy)
train_loader = PascalVOCDataset(data_folder,
                                split='train',
                                keep_difficult=keep_difficult,
                                batch_size=batch_size,
                                shuffle=False)
val_loader = PascalVOCDataset(data_folder,
                              split='test',
                              keep_difficult=keep_difficult,
                              batch_size=batch_size,
                              shuffle=False)

for epoch in range(epochs):
    train(train_loader=train_loader,
          model=model,
          criterion=criterion,
          optimizer=optimizer,
          epoch=epoch)
    validate(val_loader=val_loader,
             model=model,
             criterion=criterion)
    # decay the learning rate by 10x every 100 epochs
    if epoch % 100 == 0 and epoch != 0:
        optimizer.lr *= 0.1
    model.save(f"model_{epoch}.pkl")
Loss design: predicted_locs are supervised with L1Loss and predicted_scores with CrossEntropyLoss, keeping a 1:3 ratio of positive to negative samples (hard negative mining).
class L1Loss(nn.Module):
    def __init__(self, size_average=None, reduce=None, reduction='mean'):
        super(L1Loss, self).__init__()
        self.size_average = size_average
        self.reduce = reduce
        self.reduction = reduction

    def execute(self, input, target):
        ret = jt.abs(input - target)
        if self.reduction is not None:
            ret = jt.mean(ret) if self.reduction == 'mean' else jt.sum(ret)
        return ret

class CrossEntropyLoss(nn.Module):
    def __init__(self, weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean'):
        super(CrossEntropyLoss, self).__init__()
        self.ignore_index = ignore_index
        self.reduction = reduction

    def execute(self, input, target):
        # pick the negative log-softmax at each sample's target index
        bs_idx = jt.array(range(input.shape[0]))
        ret = (- jt.log(nn.softmax(input, dim=1)))[bs_idx, target]
        if self.reduction is not None:
            ret = jt.mean(ret) if self.reduction == 'mean' else jt.sum(ret)
        return ret
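A tiny check of the class above (hypothetical values): with reduction=None it returns the unreduced per-sample loss vector, which is exactly what MultiBoxLoss consumes for hard negative mining.

ce = CrossEntropyLoss(reduce=False, reduction=None)
logits = jt.array([[2.0, 0.5, 0.1],
                   [0.2, 1.5, 0.3]])
targets = jt.array([0, 2])
print(ce(logits, targets))  # two per-sample losses, no reduction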
class MultiBoxLoss(nn.Module):
    def __init__(self, priors_cxcy, threshold=0.5, neg_pos_ratio=3, alpha=1.0):
        super(MultiBoxLoss, self).__init__()
        self.priors_cxcy = priors_cxcy
        self.priors_xy = cxcy_to_xy(priors_cxcy)
        self.threshold = threshold
        self.neg_pos_ratio = neg_pos_ratio
        self.alpha = alpha
        self.smooth_l1 = L1Loss()
        self.cross_entropy = CrossEntropyLoss(reduce=False, reduction=None)

    def execute(self, predicted_locs, predicted_scores, boxes, labels):
        # ... (code omitted)
        # localization loss over positive priors only
        loc_loss = self.smooth_l1(
            (predicted_locs * positive_priors.broadcast([1, 1, 4], [2])),
            (true_locs * positive_priors.broadcast([1, 1, 4], [2]))
        )
        # ... (code omitted)
        # per-prior confidence loss, kept unreduced for hard negative mining
        conf_loss_all = self.cross_entropy(
            jt.reshape(predicted_scores, [-1, n_classes]), jt.reshape(true_classes, [-1, ])
        )
        # ... (code omitted)
        conf_loss = ((conf_loss_hard_neg.sum() + conf_loss_pos.sum()) / n_positives.float32().sum())
        return (conf_loss + (self.alpha * loc_loss))
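The elided part of execute() performs the 1:3 hard negative mining described above: every positive prior contributes, but only the neg_pos_ratio * n_positives negatives with the highest confidence loss do, so the many easy background priors do not swamp the gradient. An illustrative numpy version of the selection step (hypothetical helper, not the tutorial's actual code):

import numpy as np

def hard_negative_mask(conf_loss_all, positive_mask, neg_pos_ratio=3):
    # conf_loss_all: (N, 8732) per-prior CE loss; positive_mask: (N, 8732) bool
    neg_loss = np.where(positive_mask, -np.inf, conf_loss_all)  # exclude positives
    order = np.argsort(-neg_loss, axis=1)  # prior indices, hardest negatives first
    rank = np.argsort(order, axis=1)       # rank of each prior in that ordering
    n_hard = neg_pos_ratio * positive_mask.sum(axis=1, keepdims=True)
    return rank < n_hard                   # True for the selected hard negatives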
import time

def train(train_loader, model, criterion, optimizer, epoch):
    model.train()
    for i, (images, boxes, labels, _) in enumerate(train_loader):
        start = time.time()
        images = jt.array(images)  # (batch_size (N), 3, 300, 300)
        boxes = [jt.array(b) for b in boxes]
        labels = [jt.array(l) for l in labels]
        predicted_locs, predicted_scores = model(images)
        loss = criterion(predicted_locs, predicted_scores, boxes, labels)
        # if grad_clip is not None, clip gradients to [-grad_clip, grad_clip]
        if grad_clip is not None:
            optimizer.grad_clip = grad_clip
        optimizer.step(loss)
        if i % print_freq == 0:
            print(jt.liveness_info())
            print("epoch: ", epoch, "loss: ", loss.data, "batch_time: ", time.time() - start)
4. Results
| Class | AP |
| --- | --- |
| aeroplane | 0.7928 |
| bicycle | 0.8308 |
| bird | 0.7492 |
| boat | 0.6989 |
| bottle | 0.4397 |
| bus | 0.8564 |
| car | 0.8465 |
| cat | 0.8812 |
| chair | 0.5569 |
| cow | 0.8208 |
| diningtable | 0.7500 |
| dog | 0.8366 |
| horse | 0.8711 |
| motorbike | 0.8147 |
| person | 0.7736 |
| pottedplant | 0.4946 |
| sheep | 0.7608 |
| sofa | 0.7497 |
| train | 0.8414 |
| tvmonitor | 0.7584 |
The mean AP (mAP) over the 20 classes is 0.7562.
