Getting Started with PyTorch (5): Chinese Text Classification with a CNN Model

This article introduces how to use a CNN model in PyTorch for Chinese text classification.

The basic workflow for Chinese text classification with a CNN:

  • Text preprocessing

  • Collect the characters (or tokens) into a dictionary file, keeping only the top n characters

  • Convert characters to numbers, representing characters not in the dictionary with UNK

  • Truncate and pad the texts (padding with PAD) so that all text vectors have the same length

  • Build the Embedding layer

  • Build the CNN model

  • Train the model, tune the parameters to obtain the best-performing model, and compute evaluation metrics

  • Save the model and run predictions on new samples

We take the Sogou mini classification dataset as an example and use a CNN model to classify its texts.

Dataset Introduction

The Sogou mini classification dataset has 5 categories: sports (体育), health (健康), military (军事), education (教育), and automobile (汽车). It is split into a training set and a test set, with 800 samples per category in the training set and 100 samples per category in the test set.
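All of the scripts in this post import their configuration from a params.py module that is not listed. Below is a minimal sketch of what it might contain; the constant names follow the imports used later, but the values are illustrative guesses rather than the author's actual settings:

# params.py -- hypothetical configuration module; values are illustrative
TRAIN_FILE_PATH = 'data/train.csv'  # CSV files with 'label' and 'content' columns
TEST_FILE_PATH = 'data/test.csv'

NUM_WORDS = 5500      # number of characters kept in the dictionary
PAD_NO = 0            # id used for padding
UNK_NO = 1            # id for characters outside the dictionary
START_NO = 2          # offset added to dictionary indices (ids 0 and 1 are reserved)
SENT_LENGTH = 200     # unified text vector length

EMBEDDING_SIZE = 100  # embedding dimension
TRAIN_BATCH_SIZE = 32
TEST_BATCH_SIZE = 32
LEARNING_RATE = 1e-3
EPOCHS = 10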

Text Preprocessing

Read the text data from the training set, build a character list, shuffle it, keep the top N characters, and save them to a pickle file; also save the label list to a pickle file for later use. The Python code is as follows:

# -*- coding: utf-8 -*-
# @Time : 2023/3/16 10:32
# @Author : Jclian91
# @File : preprocessing.py
# @Place : Minghang, Shanghai
from random import shuffle
import pandas as pd
from collections import Counter, defaultdict

from params import TRAIN_FILE_PATH, NUM_WORDS
from pickle_file_operaor import PickleFileOperator


class FilePreprossing(object):
    def __init__(self, n):
        # keep the top n characters
        self.__n = n

    def _read_train_file(self):
        train_pd = pd.read_csv(TRAIN_FILE_PATH)
        label_list = train_pd['label'].unique().tolist()
        # count character frequencies
        character_dict = defaultdict(int)
        for content in train_pd['content']:
            for key, value in Counter(content).items():
                character_dict[key] += value
        # no sorting: shuffle the (char, count) pairs instead
        sort_char_list = [(k, v) for k, v in character_dict.items()]
        shuffle(sort_char_list)
        print(f'total {len(character_dict)} characters.')
        print('top 10 chars: ', sort_char_list[:10])
        # keep the top n characters
        top_n_chars = [_[0] for _ in sort_char_list[:self.__n]]

        return label_list, top_n_chars

    def run(self):
        label_list, top_n_chars = self._read_train_file()
        PickleFileOperator(data=label_list, file_path='labels.pk').save()
        PickleFileOperator(data=top_n_chars, file_path='chars.pk').save()


if __name__ == '__main__':
    processor = FilePreprossing(NUM_WORDS)
    processor.run()
    # read back the pickle files
    labels = PickleFileOperator(file_path='labels.pk').read()
    print(labels)
    content = PickleFileOperator(file_path='chars.pk').read()
    print(content)
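The scripts also import PickleFileOperator from pickle_file_operaor.py (the module's spelling follows the repository), whose implementation is not shown in the post. A minimal sketch, assuming it is a thin wrapper around Python's pickle module:

# pickle_file_operaor.py -- hypothetical minimal implementation
import pickle


class PickleFileOperator(object):
    def __init__(self, data=None, file_path=''):
        self.data = data
        self.file_path = file_path

    def save(self):
        # serialize self.data to self.file_path
        with open(self.file_path, 'wb') as f:
            pickle.dump(self.data, f)

    def read(self):
        # deserialize and return the stored object
        with open(self.file_path, 'rb') as f:
            return pickle.load(f)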

Characters are converted to numbers, with characters not in the dictionary represented by UNK. Texts are truncated and padded (padding with PAD) to unify the text vector length. The Python implementation is as follows:

# -*- coding: utf-8 -*-
# @Time : 2023/3/16 11:15
# @Author : Jclian91
# @File : text_featuring.py
# @Place : Minghang, Shanghai
import pandas as pd
import numpy as np
import torch as T
from torch.utils.data import Dataset, DataLoader, random_split

from params import (PAD_NO,
                    UNK_NO,
                    START_NO,
                    SENT_LENGTH,
                    TEST_FILE_PATH,
                    TRAIN_FILE_PATH)
from pickle_file_operaor import PickleFileOperator


# load csv file
def load_csv_file(file_path):
    df = pd.read_csv(file_path)
    samples, y_true = [], []
    for index, row in df.iterrows():
        y_true.append(row['label'])
        samples.append(row['content'])
    return samples, y_true


# read the pickle files and build label/character lookup tables
def load_file_file():
    labels = PickleFileOperator(file_path='labels.pk').read()
    chars = PickleFileOperator(file_path='chars.pk').read()
    label_dict = dict(zip(labels, range(len(labels))))
    char_dict = dict(zip(chars, range(len(chars))))
    return label_dict, char_dict


# text featurization
def text_feature(labels, contents, label_dict, char_dict):
    samples, y_true = [], []
    for s_label, s_content in zip(labels, contents):
        # one_hot_vector = [0] * len(label_dict)
        # one_hot_vector[label_dict[s_label]] = 1
        # y_true.append([one_hot_vector])
        y_true.append(label_dict[s_label])
        train_sample = []
        for char in s_content:
            if char in char_dict:
                train_sample.append(START_NO + char_dict[char])
            else:
                train_sample.append(UNK_NO)
        # pad or truncate to SENT_LENGTH
        if len(train_sample) < SENT_LENGTH:
            samples.append(train_sample + [PAD_NO] * (SENT_LENGTH - len(train_sample)))
        else:
            samples.append(train_sample[:SENT_LENGTH])

    return samples, y_true


# dataset
class CSVDataset(Dataset):
    # load the dataset
    def __init__(self, file_path):
        label_dict, char_dict = load_file_file()
        samples, y_true = load_csv_file(file_path)
        x, y = text_feature(y_true, samples, label_dict, char_dict)
        self.X = T.from_numpy(np.array(x)).long()
        self.y = T.from_numpy(np.array(y))

    # number of rows in the dataset
    def __len__(self):
        return len(self.X)

    # get a row at an index
    def __getitem__(self, idx):
        return [self.X[idx], self.y[idx]]

    # get indexes for train and test rows
    def get_splits(self, n_test=0.3):
        # determine sizes
        test_size = round(n_test * len(self.X))
        train_size = len(self.X) - test_size
        # calculate the split
        return random_split(self, [train_size, test_size])


if __name__ == '__main__':
    p = CSVDataset(TRAIN_FILE_PATH).__getitem__(1)
    print(p)

Taking the following text as an example, the result of converting it to a vector (assuming a maximum length of 40) is:

盖世汽车讯,特斯拉去年击败了宝马,夺得了美国豪华汽车市场的桂

[3899, 4131, 2497, 496, 3746, 221, 3273, 1986, 4002, 4882, 3238, 5114, 1516, 353, 4767, 2357, 221, 2920, 387, 353, 4434, 4930, 4079, 4187, 2497, 496, 883, 1325, 1061, 3901, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
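For illustration, a minimal sketch of producing such a vector with the text_feature function above. This assumes SENT_LENGTH is set to 40 for the example; since the dictionary is shuffled, the exact ids will differ between runs:

# featurization example; ids depend on the shuffled dictionary
from text_featuring import load_file_file, text_feature

label_dict, char_dict = load_file_file()
text = '盖世汽车讯,特斯拉去年击败了宝马,夺得了美国豪华汽车市场的桂'
samples, y_true = text_feature(['汽车'], [text], label_dict, char_dict)
print(samples[0])  # 30 character ids followed by 10 PAD zeros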

Building the Model

Build the model: create the Embedding layer, then the CNN model.

[Figure: CNN model architecture]

The Python implementation is as follows:

# -*- coding: utf-8 -*-
import math
import torch
import torch.nn as nn

from params import NUM_WORDS, SENT_LENGTH, EMBEDDING_SIZE


class TextClassifier(nn.Module):

    def __init__(self):
        super(TextClassifier, self).__init__()
        # Parameters regarding text preprocessing
        self.seq_len = SENT_LENGTH
        self.num_words = NUM_WORDS
        self.embedding_size = EMBEDDING_SIZE

        # Dropout definition
        self.dropout = nn.Dropout(0.25)

        # CNN parameters definition
        # Kernel sizes
        self.kernel_1 = 2
        self.kernel_2 = 3
        self.kernel_3 = 4
        self.kernel_4 = 5

        # Output size for each convolution
        self.out_size = 32
        # Number of strides for each convolution
        self.stride = 2

        # Embedding layer definition (+2 reserves ids for PAD and UNK)
        self.embedding = nn.Embedding(self.num_words + 2, self.embedding_size)

        # Convolution layers definition: the sequence length acts as the
        # channel dimension, so each kernel slides along the embedding axis
        self.conv_1 = nn.Conv1d(self.seq_len, self.out_size, self.kernel_1, self.stride)
        self.conv_2 = nn.Conv1d(self.seq_len, self.out_size, self.kernel_2, self.stride)
        self.conv_3 = nn.Conv1d(self.seq_len, self.out_size, self.kernel_3, self.stride)
        self.conv_4 = nn.Conv1d(self.seq_len, self.out_size, self.kernel_4, self.stride)

        # Max pooling layers definition
        self.pool_1 = nn.MaxPool1d(self.kernel_1, self.stride)
        self.pool_2 = nn.MaxPool1d(self.kernel_2, self.stride)
        self.pool_3 = nn.MaxPool1d(self.kernel_3, self.stride)
        self.pool_4 = nn.MaxPool1d(self.kernel_4, self.stride)

        # Fully connected layer definition
        self.fc = nn.Linear(self.in_features_fc(), 5)

    def in_features_fc(self):
        """
        Calculates the number of output features after convolution + max pooling.
        Convolved_Features = ((embedding_size + (2 * padding) - dilation * (kernel - 1) - 1) / stride) + 1
        Pooled_Features = ((conv_size + (2 * padding) - dilation * (kernel - 1) - 1) / stride) + 1
        source: https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html
        """

        # Calculate size of convolved/pooled features for convolution_1/max_pooling_1
        out_conv_1 = ((self.embedding_size - 1 * (self.kernel_1 - 1) - 1) / self.stride) + 1
        out_conv_1 = math.floor(out_conv_1)
        out_pool_1 = ((out_conv_1 - 1 * (self.kernel_1 - 1) - 1) / self.stride) + 1
        out_pool_1 = math.floor(out_pool_1)

        # Calculate size of convolved/pooled features for convolution_2/max_pooling_2
        out_conv_2 = ((self.embedding_size - 1 * (self.kernel_2 - 1) - 1) / self.stride) + 1
        out_conv_2 = math.floor(out_conv_2)
        out_pool_2 = ((out_conv_2 - 1 * (self.kernel_2 - 1) - 1) / self.stride) + 1
        out_pool_2 = math.floor(out_pool_2)

        # Calculate size of convolved/pooled features for convolution_3/max_pooling_3
        out_conv_3 = ((self.embedding_size - 1 * (self.kernel_3 - 1) - 1) / self.stride) + 1
        out_conv_3 = math.floor(out_conv_3)
        out_pool_3 = ((out_conv_3 - 1 * (self.kernel_3 - 1) - 1) / self.stride) + 1
        out_pool_3 = math.floor(out_pool_3)

        # Calculate size of convolved/pooled features for convolution_4/max_pooling_4
        out_conv_4 = ((self.embedding_size - 1 * (self.kernel_4 - 1) - 1) / self.stride) + 1
        out_conv_4 = math.floor(out_conv_4)
        out_pool_4 = ((out_conv_4 - 1 * (self.kernel_4 - 1) - 1) / self.stride) + 1
        out_pool_4 = math.floor(out_pool_4)

        # Returns the "flattened" vector size (input for the fully connected layer)
        return (out_pool_1 + out_pool_2 + out_pool_3 + out_pool_4) * self.out_size

    def forward(self, x):
        # The sequence of token ids is passed through the embedding layer
        x = self.embedding(x)

        # Convolution layer 1 is applied
        x1 = self.conv_1(x)
        x1 = torch.relu(x1)
        x1 = self.pool_1(x1)

        # Convolution layer 2 is applied
        x2 = self.conv_2(x)
        x2 = torch.relu(x2)
        x2 = self.pool_2(x2)

        # Convolution layer 3 is applied
        x3 = self.conv_3(x)
        x3 = torch.relu(x3)
        x3 = self.pool_3(x3)

        # Convolution layer 4 is applied
        x4 = self.conv_4(x)
        x4 = torch.relu(x4)
        x4 = self.pool_4(x4)

        # The outputs of the convolutional branches are concatenated into a single vector
        union = torch.cat((x1, x2, x3, x4), 2)
        union = union.reshape(union.size(0), -1)

        # The "flattened" vector is passed through a fully connected layer
        out = self.fc(union)
        # Dropout is applied
        out = self.dropout(out)
        # Activation function is applied
        # out = nn.Softmax(dim=1)(out)

        return out
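As a quick sanity check of the tensor shapes, the model can be fed a batch of random token ids. A minimal sketch, assuming the illustrative params.py values given earlier (NUM_WORDS=5500, SENT_LENGTH=200):

# shape smoke test; the model outputs 5 logits (one per class) per sample
import torch
from model import TextClassifier

model = TextClassifier()
dummy = torch.randint(0, 5500, (4, 200))  # batch of 4 sequences of SENT_LENGTH token ids
logits = model(dummy)
print(logits.shape)  # expected: torch.Size([4, 5])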

Train the model, tune the parameters to obtain the best-performing model, and compute the evaluation metrics. The Python implementation is as follows:

# -*- coding: utf-8 -*-
import torch
from torch.optim import RMSprop, Adam
from torch.nn import CrossEntropyLoss, Softmax
import torch.nn.functional as F
from torch.utils.data import DataLoader
from numpy import vstack, argmax
from sklearn.metrics import accuracy_score

from model import TextClassifier
from text_featuring import CSVDataset
from params import TRAIN_BATCH_SIZE, TEST_BATCH_SIZE, LEARNING_RATE, EPOCHS, TRAIN_FILE_PATH, TEST_FILE_PATH


# model training
class ModelTrainer(object):
    # evaluate the model
    @staticmethod
    def evaluate_model(test_dl, model):
        model.eval()  # disable dropout during evaluation
        predictions, actuals = [], []
        with torch.no_grad():
            for i, (inputs, targets) in enumerate(test_dl):
                # evaluate the model on the test set
                yhat = model(inputs)
                # retrieve numpy array
                yhat = yhat.detach().numpy()
                actual = targets.numpy()
                # convert to class labels
                yhat = argmax(yhat, axis=1)
                # reshape for stacking
                actual = actual.reshape((len(actual), 1))
                yhat = yhat.reshape((len(yhat), 1))
                # store
                predictions.append(yhat)
                actuals.append(actual)
        model.train()  # restore training mode
        predictions, actuals = vstack(predictions), vstack(actuals)
        # calculate accuracy
        acc = accuracy_score(actuals, predictions)
        return acc

    # Model training, evaluation and metrics calculation
    def train(self, model):
        # load the datasets
        train = CSVDataset(TRAIN_FILE_PATH)
        test = CSVDataset(TEST_FILE_PATH)
        # prepare data loaders
        train_dl = DataLoader(train, batch_size=TRAIN_BATCH_SIZE, shuffle=True)
        test_dl = DataLoader(test, batch_size=TEST_BATCH_SIZE)

        # Define optimizer
        optimizer = Adam(model.parameters(), lr=LEARNING_RATE)
        # Start the training phase
        for epoch in range(EPOCHS):
            # Batch training
            for x_batch, y_batch in train_dl:
                y_batch = y_batch.long()
                # Clean gradients
                optimizer.zero_grad()
                # Feed the model
                y_pred = model(x_batch)
                # Loss calculation
                loss = CrossEntropyLoss()(y_pred, y_batch)
                # Gradients calculation
                loss.backward()
                # Gradients update
                optimizer.step()

            # Evaluation
            test_accuracy = self.evaluate_model(test_dl, model)
            print("Epoch: %d, loss: %.5f, Test accuracy: %.5f" % (epoch + 1, loss.item(), test_accuracy))


if __name__ == '__main__':
    model = TextClassifier()
    # count the number of parameters
    num_params = sum(param.numel() for param in model.parameters())
    print(num_params)
    ModelTrainer().train(model)
    torch.save(model, 'sougou_mini_cls.pth')

Model Prediction

Evaluating the saved model on the validation set gives an accuracy of 0.7960, precision and recall of 0.7960, and an F1-score of 0.7953. The confusion matrix is shown below:

[Figure: confusion matrix on the validation set]
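These metrics can be reproduced with scikit-learn. A minimal sketch, assuming the model file and test CSV produced above; classification_report and confusion_matrix are standard scikit-learn utilities, and the exact numbers will vary between training runs:

# evaluation sketch: per-class precision/recall/F1 and the confusion matrix
import torch
from numpy import vstack, argmax
from torch.utils.data import DataLoader
from sklearn.metrics import classification_report, confusion_matrix

from text_featuring import CSVDataset
from params import TEST_FILE_PATH, TEST_BATCH_SIZE

model = torch.load('sougou_mini_cls.pth')
model.eval()  # disable dropout
test_dl = DataLoader(CSVDataset(TEST_FILE_PATH), batch_size=TEST_BATCH_SIZE)

y_true, y_pred = [], []
with torch.no_grad():
    for inputs, targets in test_dl:
        y_pred.append(argmax(model(inputs).numpy(), axis=1).reshape(-1, 1))
        y_true.append(targets.numpy().reshape(-1, 1))
y_true, y_pred = vstack(y_true).ravel(), vstack(y_pred).ravel()

print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))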

The Python code for predicting on new samples is as follows:

# -*- coding: utf-8 -*-
# @Time : 2023/3/16 16:42
# @Author : Jclian91
# @File : model_predict.py
# @Place : Minghang, Shanghai
import torch as T
import numpy as np

from text_featuring import load_file_file, text_feature
from model import TextClassifier

model = T.load('sougou_mini_cls.pth')
model.eval()  # disable dropout for inference

label_dict, char_dict = load_file_file()
print(label_dict)

text = '盖世汽车讯,特斯拉去年击败了宝马,夺得了美国豪华汽车市场的桂冠,并在今年实现了开门红。1月份,得益于大幅降价和7500美元美国电动汽车税收抵免,特斯拉再度击败宝马,蝉联了美国豪华车销冠,并且注册量超过了排名第三的梅赛德斯-奔驰和排名第四的雷克萨斯的总和。根据Experian的数据,在所有豪华品牌中,1月份,特斯拉在美国的豪华车注册量为49,917辆,同比增长34%;宝马的注册量为31,070辆,同比增长2.5%;奔驰的注册量为23,345辆,同比增长7.3%;雷克萨斯的注册量为23,082辆,同比下降6.6%。奥迪以19,113辆的注册量排名第五,同比增长38%。凯迪拉克注册量为13,220辆,较去年同期增长36%,排名第六。排名第七的讴歌的注册量为10,833辆,同比增长32%。沃尔沃汽车排名第八,注册量为8,864辆,同比增长1.8%。路虎以7,003辆的注册量排名第九,林肯以6,964辆的注册量排名第十。'

label, sample = ['汽车'], [text]
samples, y_true = text_feature(label, sample, label_dict, char_dict)
print(text)
print(samples, y_true)
x = T.from_numpy(np.array(samples)).long()
y_pred = model(x)
print(y_pred)
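The script prints the raw logits. To turn them into a label name, one can invert label_dict and apply a softmax; a short follow-up sketch (the id2label helper is introduced here for illustration):

# map the raw logits to a class name (illustrative follow-up to the script above)
import torch.nn.functional as F

id2label = {v: k for k, v in label_dict.items()}
probs = F.softmax(y_pred, dim=1)        # convert logits to probabilities
pred_id = int(probs.argmax(dim=1)[0])   # index of the most probable class
print(id2label[pred_id], float(probs[0][pred_id]))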

The prediction results are as follows:

New text → Predicted class

1. 盖世汽车讯,特斯拉去年击败了宝马,夺得了美国豪华汽车市场的桂冠,并在今年实现了开门红。1月份,得益于大幅降价和7500美元美国电动汽车税收抵免,特斯拉再度击败宝马,蝉联了美国豪华车销冠,并且注册量超过了排名第三的梅赛德斯-奔驰和排名第四的雷克萨斯的总和。根据Experian的数据,在所有豪华品牌中,1月份,特斯拉在美国的豪华车注册量为49,917辆,同比增长34%;宝马的注册量为31,070辆,同比增长2.5%;奔驰的注册量为23,345辆,同比增长7.3%;雷克萨斯的注册量为23,082辆,同比下降6.6%。奥迪以19,113辆的注册量排名第五,同比增长38%。凯迪拉克注册量为13,220辆,较去年同期增长36%,排名第六。排名第七的讴歌的注册量为10,833辆,同比增长32%。沃尔沃汽车排名第八,注册量为8,864辆,同比增长1.8%。路虎以7,003辆的注册量排名第九,林肯以6,964辆的注册量排名第十。 → 汽车 (automobile)
2. 北京时间3月16日,NBA官方公布了对于灰熊球星贾-莫兰特直播中持枪事件的调查结果灰熊,由于无法确定枪支是否为莫兰特所有,也无法证明他曾持枪到过NBA场馆,因为对他处以禁赛八场的处罚,且此前已禁赛场次将算在禁赛八场的场次内,他最早将在下周复出。 → 体育 (sports)
3. 3月11日,由新浪教育、微博教育、择校行联合主办的“新浪&微博2023国际教育春季巡展•深圳站”于深圳凯宾斯基酒店成功举办。深圳优质学校亮相展会,上千组家庭前来逛展。近30所全国及深圳民办国际化学校、外籍人员子女学校、公办学校国际部等多元化、多类型优质学校参与了本次活动。此外,近10位国际化学校校长分享了学校的办学特色、教育理念及学生的成长案例,参展家庭纷纷表示受益匪浅。展会搭建家校沟通桥梁,帮助家长们合理规划孩子的国际教育之路。深圳国际预科书院/招生办主任沈兰Nancy Shen参加了本次活动并带来了精彩的演讲,以下为演讲实录:" → 教育 (education)
4. 指导专家:皮肤科教研室副主任、武汉协和医院皮肤性病科主任医师冯爱平教授在临床上,经常能看到有些人出现反复发作的口腔溃疡,四季不断,深受其扰。其实这已不单单是口腔问题,而是全身疾病的体现,特别是一些免疫系统疾病,不仅表现在皮肤还会损害黏膜,下列几种情况是造成“复发性口腔溃疡”的原因。缺乏维生素及微量元素。缺乏微量元素锌、铁、叶酸、维生素B12等时,会引发口角炎。很多日常生活行为可能造成维生素的缺乏,如过分淘洗米、长期进食精米面、吃素食等,很容易造成B族维生素的缺失。 → 健康 (health)

Summary

This project has been uploaded to GitHub: https://github.com/percent4/PyTorch_Learning/tree/master/cnn_text_classification

References

  1. Text-Classification-CNN-PyTorch: https://github.com/FernandoLpz/Text-Classification-CNN-PyTorch