Training Digit Recognition on the MNIST Dataset

Environment
OS: ubuntu-22.04
Kernel: 5.15.0-101-generic
GPU: NVIDIA T4
Python: 3.10.12
Docker: 24.0.5

This walkthrough trains a handwritten digit recognizer on the MNIST dataset.
The dataset is downloaded with a script shown later, after the environment is set up.

Environment setup

First install torch and torchvision:

pip install torch torchvision

Install CUDA and the GPU driver by following the official guide. CUDA 12.1 is used here, and by default the installer also installs the matching GPU driver.
https://developer.nvidia.com/cuda-12-1-0-download-archive
CUDA 12.4 also works; download it from the same archive in the same way.
After installation, running nvidia-smi should show the GPU:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
nvidia-smi 
Sun May 26 13:53:22 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 On | 00000000:00:08.0 Off | 0 |
| N/A 29C P8 11W / 70W| 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
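
With the driver in place, a quick Python check (a minimal sketch, run in the environment where torch was installed) confirms that PyTorch can also see the GPU:

import torch

# True once the driver and a compatible GPU are detected
print(torch.cuda.is_available())
# Name of the first GPU, e.g. "Tesla T4"
print(torch.cuda.get_device_name(0))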

Install docker-ce

Reference: https://docs.docker.com/engine/install/ubuntu/

The installed version is docker-ce v24.0.5.

To let containers use the GPU, install nvidia-container-toolkit:

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -

curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu22.04/nvidia-docker.list > /etc/apt/sources.list.d/nvidia-docker.list

apt update

apt -y install nvidia-container-toolkit


systemctl restart docker

Verification
Start an nvidia/cuda:12.1.0-base-ubuntu20.04 container, passing the host GPU through with the --gpus flag, and run nvidia-smi inside it to check that the GPU is visible:

docker run --gpus all nvidia/cuda:12.1.0-base-ubuntu20.04 nvidia-smi

Sun May 26 06:03:56 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 On | 00000000:00:08.0 Off | 0 |
| N/A 29C P8 11W / 70W| 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+

Download the MNIST training data

import os
from torchvision import datasets

rootdir = "/home/mnist-data/"
traindir = rootdir + "/train"
testdir = rootdir + "/test"

train_dataset = datasets.MNIST(root=rootdir, train=True, download=True)
test_dataset = datasets.MNIST(root=rootdir, train=False, download=True)

# Save every training image as a PNG under train/<label>/
number = 0
for img, label in train_dataset:
    savedir = traindir + "/" + str(label)
    os.makedirs(savedir, exist_ok=True)
    savepath = savedir + "/" + str(number).zfill(5) + ".png"
    img.save(savepath)
    number = number + 1
    print(savepath)

# Save every test image as a PNG under test/<label>/
number = 0
for img, label in test_dataset:
    savedir = testdir + "/" + str(label)
    os.makedirs(savedir, exist_ok=True)
    savepath = savedir + "/" + str(number).zfill(5) + ".png"
    img.save(savepath)
    number = number + 1
    print(savepath)

Save this as a file and run it to download the data.

After the download, the directory contains three folders:

ls /home/mnist-data/
MNIST  test  train

MNIST folder: holds the raw MNIST training and test data downloaded by torchvision (the IDX files plus their .gz archives), including

  • train-images-idx3-ubyte: training set images.

  • train-labels-idx1-ubyte: training set labels.

  • t10k-images-idx3-ubyte: test set images.

  • t10k-labels-idx1-ubyte: test set labels.

train folder: the training set, with 60,000 28x28-pixel handwritten digit images saved as PNGs, one subfolder per label. These images are used to train the model.

test folder: the test set, with 10,000 28x28-pixel handwritten digit images organized the same way. These images are used to evaluate the trained model.

Characteristics:

  • Labels: every image has a label indicating which digit (0 to 9) it shows.
  • Normalization: all images are normalized to 28x28 pixels and centered, so the digit sits in the middle of the frame.
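
The following sketch counts the exported images per class, assuming the paths produced by the download script above:

import os

rootdir = "/home/mnist-data"
for split in ("train", "test"):
    splitdir = os.path.join(rootdir, split)
    counts = {label: len(os.listdir(os.path.join(splitdir, label)))
              for label in sorted(os.listdir(splitdir))}
    # Expect 60,000 images in train and 10,000 in test, spread over folders 0-9
    print(split, sum(counts.values()), counts)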

Configuration

Start a PyTorch NGC container with the GPU enabled and the dataset directory mounted at /workspace/data:

docker run --gpus all -itd --rm -v /home/mnist-data:/workspace/data nvcr.io/nvidia/pytorch:24.05-py3
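
Before launching training, it is worth confirming from inside the running container (via docker exec) that both the data mount and the GPU are visible. A minimal sketch:

import os
import torch

# /home/mnist-data on the host is mounted at /workspace/data by the docker run command above
print(os.listdir("/workspace/data"))   # expect: ['MNIST', 'test', 'train']
print(torch.cuda.is_available())       # expect: True when the GPU is passed through correctly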

Train inside the container

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Define the network architecture
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output

# Define the data preprocessing
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Load the training and test sets
train_dataset = datasets.MNIST(root='/workspace/data', train=True, download=False, transform=transform)
test_dataset = datasets.MNIST(root='/workspace/data', train=False, download=False, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=1000, shuffle=False)

# Check whether a GPU is available and select the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Net().to(device)
optimizer = optim.Adam(model.parameters())

# Train the model
def train(model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % 100 == 0:
            print(f'Train Epoch: {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)} '
                  f'({100. * batch_idx / len(train_loader):.0f}%)]\tLoss: {loss.item():.6f}')

# Test the model
def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    print(f'\nTest set: Average loss: {test_loss:.4f}, Accuracy: {correct}/{len(test_loader.dataset)} '
          f'({100. * correct / len(test_loader.dataset):.0f}%)\n')

# Run training and testing, then save the model
for epoch in range(1, 11):
    train(model, device, train_loader, optimizer, epoch)
    test(model, device, test_loader)

# Save the model
torch.save(model.state_dict(), "/workspace/mnist_cnn.pt")
print("Model saved to /workspace/mnist_cnn.pt")

Save this as mnist_train.py and run python mnist_train.py.
It loads the MNIST dataset that we downloaded and mounted into the container, trains the model, and writes the trained weights to mnist_cnn.pt in the /workspace directory.
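
Before writing the standalone prediction script, the saved checkpoint can be inspected quickly to confirm it contains the expected parameters. A minimal sketch:

import torch

# Load the raw state dict and list its parameter tensors and shapes
state = torch.load("/workspace/mnist_cnn.pt", map_location="cpu")
for name, tensor in state.items():
    print(name, tuple(tensor.shape))
# Expect conv1/conv2/fc1/fc2 weights and biases matching the Net definition above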

Load the model for test verification

Save the following as test.py:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import transforms
from PIL import Image
import argparse

# Define the same network architecture
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output

# Check whether a GPU is available and select the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model
model = Net().to(device)
model.load_state_dict(torch.load("/workspace/mnist_cnn.pt"))
model.eval()

# Define the data preprocessing
transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((28, 28)),
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

def predict_image(image_path):
    image = Image.open(image_path)
    image = transform(image).unsqueeze(0).to(device)

    with torch.no_grad():
        output = model(image)
        pred = output.argmax(dim=1, keepdim=True)

    return pred.item()

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='MNIST Image Prediction')
    parser.add_argument('image_path', type=str, help='Path to the image to be predicted')
    args = parser.parse_args()

    # Predict the image
    prediction = predict_image(args.image_path)
    print(f'The predicted digit is: {prediction}')

Run the verification, specifying the path of an image:

python test.py data/test/8/00527.png 


The result:
The predicted digit is: 8

python test.py data/test/1/00239.png

The predicted digit is: 1

Images under the test directory can be used for quick verification.
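
To check more than a couple of images at a time, a short script can sample a few files per class and reuse the predict_image helper from test.py (a sketch, assuming it is run from the same directory as test.py inside the container):

import os
import random
from test import predict_image  # reuses the model and transform loaded in test.py

testdir = "/workspace/data/test"
total, correct = 0, 0
for label in sorted(os.listdir(testdir)):
    files = os.listdir(os.path.join(testdir, label))
    # sample a handful of images per class to keep the check fast
    for name in random.sample(files, min(10, len(files))):
        pred = predict_image(os.path.join(testdir, label, name))
        total += 1
        correct += int(pred == int(label))

print(f"Sampled {total} images, accuracy: {correct / total:.2%}")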

DIGITS can also be used to load the model and verify it through a graphical interface.

https://licensecounter.jp/engineer-voice/blog/articles/20240408_ngc_nvidia_gpu_cloud.html