NLP (36) Multi-Label Text Classification with keras-bert

This article describes how to use keras-bert to implement a multi-label text classification task by fine-tuning BERT.

Project Structure

The structure of the project is as follows:

(Figure: project structure)

The required third-party Python packages are:

pandas==0.23.4
Keras==2.3.1
keras_bert==0.83.0
numpy==1.16.4

Dataset

The dataset used here is the same as in the article NLP (28) Multi-Label Text Classification. It was built from an event-extraction competition dataset, pairing each text with its event types to form a multi-label dataset with 65 event types in total. Sample data (CSV format):

label,content
司法行为-起诉|组织关系-裁员,最近,一位前便利蜂员工就因公司违规裁员,将便利蜂所在的公司虫极科技(北京)有限公司告上法庭。
组织关系-裁员,思科上海大规模裁员人均可获赔100万官方澄清事实
组织关系-裁员,日本巨头面临危机,已裁员1000多人,苹果也救不了它!
组织关系-裁员|组织关系-解散,在硅谷镀金失败的造车新势力们:蔚来裁员、奇点被偷窃、拜腾解散

In the label column, event types are separated by |.

The dataset contains 11,958 training samples and 1,498 test samples.
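As a quick illustration of how such a |-separated label string becomes a training target (the training script below does essentially the same thing), here is a minimal sketch; the three labels and the helper to_multi_hot are only for illustration, not part of the project code:

# Minimal sketch: turn a "|"-separated label string into a multi-hot vector.
# The three labels below are just a toy subset of the 65 event types.
labels = ["司法行为-起诉", "组织关系-裁员", "组织关系-解散"]

def to_multi_hot(label_str):
    parts = label_str.split("|")
    return [1 if label in parts else 0 for label in labels]

print(to_multi_hot("司法行为-起诉|组织关系-裁员"))  # [1, 1, 0]
print(to_multi_hot("组织关系-裁员"))               # [0, 1, 0]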

Model Training

The complete code of the training script model_train.py is as follows:

# -*- coding: utf-8 -*-
import json
import codecs
import pandas as pd
import numpy as np
from keras_bert import load_trained_model_from_checkpoint, Tokenizer
from keras.layers import *
from keras.models import Model
from keras.optimizers import Adam

# recommended sequence length <= 510
maxlen = 256
BATCH_SIZE = 8
config_path = './chinese_L-12_H-768_A-12/bert_config.json'
checkpoint_path = './chinese_L-12_H-768_A-12/bert_model.ckpt'
dict_path = './chinese_L-12_H-768_A-12/vocab.txt'


token_dict = {}
with codecs.open(dict_path, 'r', 'utf-8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)


class OurTokenizer(Tokenizer):
    def _tokenize(self, text):
        R = []
        for c in text:
            if c in self._token_dict:
                R.append(c)
            else:
                R.append('[UNK]')  # map any remaining character to [UNK]
        return R


tokenizer = OurTokenizer(token_dict)


def seq_padding(X, padding=0):
    # pad every sequence in the batch to the length of the longest one
    L = [len(x) for x in X]
    ML = max(L)
    return np.array([
        np.concatenate([x, [padding] * (ML - len(x))]) if len(x) < ML else x for x in X
    ])


class DataGenerator:
    # yields batches of ([token ids, segment ids], multi-hot labels)

    def __init__(self, data, batch_size=BATCH_SIZE):
        self.data = data
        self.batch_size = batch_size
        self.steps = len(self.data) // self.batch_size
        if len(self.data) % self.batch_size != 0:
            self.steps += 1

    def __len__(self):
        return self.steps

    def __iter__(self):
        while True:
            idxs = list(range(len(self.data)))
            np.random.shuffle(idxs)
            X1, X2, Y = [], [], []
            for i in idxs:
                d = self.data[i]
                text = d[0][:maxlen]
                x1, x2 = tokenizer.encode(first=text)
                y = d[1]
                X1.append(x1)
                X2.append(x2)
                Y.append(y)
                if len(X1) == self.batch_size or i == idxs[-1]:
                    X1 = seq_padding(X1)
                    X2 = seq_padding(X2)
                    Y = seq_padding(Y)
                    yield [X1, X2], Y
                    X1, X2, Y = [], [], []


# build the model
def create_cls_model(num_labels):
    bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path, seq_len=None)

    for layer in bert_model.layers:
        layer.trainable = True

    x1_in = Input(shape=(None,))
    x2_in = Input(shape=(None,))

    x = bert_model([x1_in, x2_in])
    cls_layer = Lambda(lambda x: x[:, 0])(x)  # take the [CLS] vector for classification
    p = Dense(num_labels, activation='sigmoid')(cls_layer)  # one sigmoid output per label (multi-label)

    model = Model([x1_in, x2_in], p)
    model.compile(
        loss='binary_crossentropy',
        optimizer=Adam(1e-5),  # use a sufficiently small learning rate
        metrics=['accuracy']
    )
    model.summary()

    return model


if __name__ == '__main__':

    # data processing: read the training and test sets
    print("begin data processing...")
    train_df = pd.read_csv("data/train.csv").fillna(value="")
    test_df = pd.read_csv("data/test.csv").fillna(value="")

    select_labels = train_df["label"].unique()
    labels = []
    for label in select_labels:
        if "|" not in label:
            if label not in labels:
                labels.append(label)
        else:
            for _ in label.split("|"):
                if _ not in labels:
                    labels.append(_)
    with open("label.json", "w", encoding="utf-8") as f:
        f.write(json.dumps(dict(zip(range(len(labels)), labels)), ensure_ascii=False, indent=2))

    train_data = []
    test_data = []
    for i in range(train_df.shape[0]):
        label, content = train_df.iloc[i, :]
        label_id = [0] * len(labels)
        for j, _ in enumerate(labels):
            for separate_label in label.split("|"):
                if _ == separate_label:
                    label_id[j] = 1
        train_data.append((content, label_id))

    for i in range(test_df.shape[0]):
        label, content = test_df.iloc[i, :]
        label_id = [0] * len(labels)
        for j, _ in enumerate(labels):
            for separate_label in label.split("|"):
                if _ == separate_label:
                    label_id[j] = 1
        test_data.append((content, label_id))

    # print(train_data[:10])
    print("finish data processing!")

    # model training
    model = create_cls_model(len(labels))
    train_D = DataGenerator(train_data)
    test_D = DataGenerator(test_data)

    print("begin model training...")
    model.fit_generator(
        train_D.__iter__(),
        steps_per_epoch=len(train_D),
        epochs=10,
        validation_data=test_D.__iter__(),
        validation_steps=len(test_D)
    )

    print("finish model training!")

    # save the model
    model.save('multi-label-ee.h5')
    print("Model saved!")

    result = model.evaluate_generator(test_D.__iter__(), steps=len(test_D))
    print("Model evaluation result:", result)

The model structure is as follows:

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            (None, None)         0
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None)         0
__________________________________________________________________________________________________
model_2 (Model)                 (None, None, 768)    101677056   input_1[0][0]
                                                                 input_2[0][0]
__________________________________________________________________________________________________
lambda_1 (Lambda)               (None, 768)          0           model_2[1][0]
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 65)           49985       lambda_1[0][0]
==================================================================================================
Total params: 101,727,041
Trainable params: 101,727,041
Non-trainable params: 0
__________________________________________________________________________________________________

We can see that this model is largely the same as the multi-class text classification model presented in NLP (35) Multi-Class Text Classification with keras-bert. The difference lies in the layer placed on top of BERT: it is still a Dense layer, but the activation function is sigmoid and the loss function is binary_crossentropy. In essence, the model makes an independent 0-1 decision for each of the 65 outputs, which is why sigmoid is used. This is the simplest way to turn a multi-class text classification model into a multi-label one; its drawback is that the model does not take dependencies between labels into account.
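To make the difference from the multi-class setting concrete, here is a minimal sketch (not part of the project code) contrasting the two classification heads placed on top of a 768-dimensional [CLS] vector; the input placeholder and variable names are illustrative:

# Minimal sketch: multi-class head vs. multi-label head on a 768-dim [CLS] vector.
from keras.layers import Input, Dense
from keras.models import Model

num_labels = 65
cls_vector = Input(shape=(768,))  # stands in for the [CLS] vector from BERT

# Multi-class head (previous article): exactly one label per text,
# so softmax + categorical_crossentropy.
multi_class = Model(cls_vector, Dense(num_labels, activation='softmax')(cls_vector))
multi_class.compile(loss='categorical_crossentropy', optimizer='adam')

# Multi-label head (this article): each label is an independent 0-1 decision,
# so sigmoid + binary_crossentropy.
multi_label = Model(cls_vector, Dense(num_labels, activation='sigmoid')(cls_vector))
multi_label.compile(loss='binary_crossentropy', optimizer='adam')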

Model Evaluation

The complete code of the evaluation script model_evaluate.py is as follows:

# -*- coding: utf-8 -*-
# @Time : 2020/12/23 15:28
# @Author : Jclian91
# @File : model_evaluate.py
# @Place : Yangpu, Shanghai
# Model evaluation script; hamming_loss is used as the multi-label metric (the smaller, the better).
import json
import numpy as np
import pandas as pd
from keras.models import load_model
from keras_bert import get_custom_objects
from sklearn.metrics import hamming_loss, classification_report

from model_train import token_dict, OurTokenizer

maxlen = 256

# load the trained model
model = load_model("multi-label-ee.h5", custom_objects=get_custom_objects())
tokenizer = OurTokenizer(token_dict)
with open("label.json", "r", encoding="utf-8") as f:
    label_dict = json.loads(f.read())


# predict a single sentence
def predict_single_text(text):
    # tokenize with the BERT tokenizer
    text = text[:maxlen]
    x1, x2 = tokenizer.encode(first=text)
    X1 = x1 + [0] * (maxlen - len(x1)) if len(x1) < maxlen else x1
    X2 = x2 + [0] * (maxlen - len(x2)) if len(x2) < maxlen else x2

    # run the model and decode the prediction
    prediction = model.predict([[X1], [X2]])
    one_hot = np.where(prediction > 0.5, 1, 0)[0]
    return one_hot, "|".join([label_dict[str(i)] for i in range(len(one_hot)) if one_hot[i]])


# evaluate the model on the test set
def evaluate():
    test_df = pd.read_csv("data/test.csv").fillna(value="")
    true_y_list, pred_y_list = [], []
    true_label_list, pred_label_list = [], []
    common_cnt = 0
    for i in range(test_df.shape[0]):
        print("predict %d samples" % (i+1))
        true_label, content = test_df.iloc[i, :]
        true_y = [0] * len(label_dict.keys())
        # mark every label whose name appears in the true label string
        for key, value in label_dict.items():
            if value in true_label:
                true_y[int(key)] = 1

        pred_y, pred_label = predict_single_text(content)
        if true_label == pred_label:
            common_cnt += 1
        true_y_list.append(true_y)
        pred_y_list.append(pred_y)
        true_label_list.append(true_label)
        pred_label_list.append(pred_label)

    # F1 scores
    print(classification_report(true_y_list, pred_y_list, digits=4))
    return true_label_list, pred_label_list, hamming_loss(true_y_list, pred_y_list), common_cnt/len(true_y_list)


true_labels, pred_labels, h_loss, accuracy = evaluate()
df = pd.DataFrame({"y_true": true_labels, "y_pred": pred_labels})
df.to_csv("pred_result.csv")

print("accuracy: ", accuracy)
print("hamming loss: ", h_loss)

Hamming Loss is an evaluation metric specific to multi-label classification; the smaller its value, the better the model performs. Running the evaluation code above produces the following output:

   micro avg     0.9341    0.9578    0.9458      1657
   macro avg     0.9336    0.9462    0.9370      1657
weighted avg     0.9367    0.9578    0.9456      1657
 samples avg     0.9520    0.9672    0.9531      1657

accuracy: 0.8985313751668892
hamming loss: 0.001869158878504673
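For intuition, Hamming Loss is simply the fraction of individual label decisions that are wrong. A tiny illustration with sklearn on toy labels (not from this dataset):

# Toy example: 2 samples x 3 labels = 6 label decisions, 1 of them wrong.
import numpy as np
from sklearn.metrics import hamming_loss

y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 1],
                   [0, 1, 1]])
print(hamming_loss(y_true, y_pred))  # 1/6 ≈ 0.1667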

Here it is worth comparing against the model from the earlier article NLP (28) Multi-Label Text Classification, which used ALBERT to extract feature vectors and classified them with a Bi-GRU + Attention + FCN network. Its structure is shown below:

(Figure: Bi-GRU + Attention + FCN model structure)

Evaluating that model with the same procedure gives the following results:

   micro avg     0.9424    0.8292    0.8822      1657
   macro avg     0.8983    0.7218    0.7791      1657
weighted avg     0.9308    0.8292    0.8669      1657
 samples avg     0.8675    0.8496    0.8517      1657
accuracy: 0.7983978638184246
hamming loss: 0.0037691280681934887

We can see that the fine-tuned BERT model is about 10 percentage points higher in accuracy, about 5-10 points higher across the F1 scores, and has a much smaller Hamming Loss. The fine-tuned BERT model therefore performs considerably better than the previous one.

Summary

This project has been open-sourced on GitHub: https://github.com/percent4/keras_bert_multi_label_cls

Written in Pudong, Shanghai, on December 27, 2020.

References

  1. NLP (28) Multi-Label Text Classification: https://blog.csdn.net/jclian91/article/details/105386190
  2. NLP (35) Multi-Class Text Classification with keras-bert: https://blog.csdn.net/jclian91/article/details/111742576

