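The first kind of template is the Unigram template. Its first character is U, and it describes unigram features. Each %x[#,#] line generates a state (node) function in the CRF, f(s, o), where s is the label (output) at time t and o is the context at time t. Suppose the line containing home NN O is the CURRENT TOKEN: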
played VBD O
on IN O
Monday NNP O
( ( O
home NN O << CURRENT TOKEN
team NN O
in IN O
CAPS NNP O
) ) O
: : O
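Then each %x[#,#] macro expands as follows, where the first number is the row offset relative to the current token and the second is the column index: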
template              expanded feature
%x[0,0]               home
%x[0,1]               NN
%x[-1,0]              (
%x[-2,1]              NNP
%x[0,0]/%x[0,1]       home/NN
ABC%x[0,1]123         ABCNN123
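The expansion rule in the table can be reproduced in a few lines of Python. This is only a sketch of the substitution, not CRF++'s own code: the names `ROWS`, `MACRO`, and `expand` are made up for illustration, and the `<pad>` string returned for rows outside the sentence is an assumption of this sketch.

```python
import re

# Rows of the example corpus: (word, POS tag, NER tag)
ROWS = [
    ("played", "VBD", "O"),
    ("on", "IN", "O"),
    ("Monday", "NNP", "O"),
    ("(", "(", "O"),
    ("home", "NN", "O"),   # current token, index 4
    ("team", "NN", "O"),
    ("in", "IN", "O"),
    ("CAPS", "NNP", "O"),
    (")", ")", "O"),
    (":", ":", "O"),
]

# Matches %x[row,col], where row may be negative
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")

def expand(template, rows, current):
    """Replace every %x[row,col] macro with the cell at (current + row, col)."""
    def substitute(match):
        row = current + int(match.group(1))
        col = int(match.group(2))
        if 0 <= row < len(rows):
            return rows[row][col]
        return "<pad>"  # placeholder chosen for this sketch only
    return MACRO.sub(substitute, template)

for template in ["%x[0,0]", "%x[-2,1]", "%x[0,0]/%x[0,1]", "ABC%x[0,1]123"]:
    print(template, "->", expand(template, ROWS, 4))
```

Text outside the macros (such as ABC and 123) is copied through unchanged, which is why ABC%x[0,1]123 becomes ABCNN123.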
Taking "U01:%x[0,1]" as an example, the feature functions it generates on this corpus look like this:
func1 = if (output = O and feature="U01:NN") return 1 else return 0
func2 = if (output = O and feature="U01:N") return 1 else return 0
func3 = if (output = O and feature="U01:NNP") return 1 else return 0
....
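Each function above is an indicator over one (output label, expanded feature string) pair, so a template yields one function per label per distinct expanded string. The sketch below builds these indicators for U01 on the example sentence. The label set `outputs` is hypothetical (the example sentence itself only contains O), and `make_feature_function` is a name invented here.

```python
from itertools import product

# POS column of the example sentence
pos_tags = ["VBD", "IN", "NNP", "(", "NN", "NN", "IN", "NNP", ")", ":"]

# Hypothetical label set, chosen only to illustrate the counting
outputs = ["O", "B-ORG", "I-ORG"]

def make_feature_function(output, feature):
    """Return f(s, o): 1 when the label is `output` and the feature string is `feature`."""
    def f(s, o):
        return 1 if (s == output and o == feature) else 0
    return f

# Distinct strings produced by expanding U01:%x[0,1] at every position
features = sorted({"U01:" + tag for tag in pos_tags})

# One indicator function per (label, feature string) pair
functions = {
    (out, feat): make_feature_function(out, feat)
    for out, feat in product(outputs, features)
}

print(len(features), "distinct features,", len(functions), "functions")
print(functions[("O", "U01:NN")]("O", "U01:NN"))
print(functions[("O", "U01:NN")]("B-ORG", "U01:NN"))
```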
played on Monday ( home team in CAPS ) :
VBD IN NNP ( NN NN IN NNP ) :
O O O O O O O O O O
American League
NNP NNP
B-MISC I-MISC
Cleveland 2 DETROIT 1
NNP CD NNP CD
B-ORG O B-ORG O
BALTIMORE 12 Oakland 11 ( 10 innings )
VB CD NNP CD ( CD NN )
B-ORG O B-ORG O O O O O
TORONTO 5 Minnesota 3
TO CD NNP CD
B-ORG O B-ORG O
......
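In this layout each sentence takes three consecutive lines: tokens, POS tags, and NER tags. Before converting the file, it can help to check that the three lines of every sentence have the same number of fields. The sketch below does that, assuming tab-separated fields and no blank lines between sentences; `read_sentences` is a helper name invented here.

```python
def read_sentences(lines, sep="\t"):
    """Group lines three at a time into lists of (word, POS tag, NER tag) triples."""
    records = []
    usable = len(lines) - len(lines) % 3  # ignore an incomplete trailing group
    for i in range(0, usable, 3):
        words, postags, tags = (lines[i + k].strip().split(sep) for k in range(3))
        if not (len(words) == len(postags) == len(tags)):
            raise ValueError(
                "sentence starting at line %d has mismatched columns" % (i + 1)
            )
        records.append(list(zip(words, postags, tags)))
    return records

sample = [
    "American\tLeague",
    "NNP\tNNP",
    "B-MISC\tI-MISC",
    "Cleveland\t2\tDETROIT\t1",
    "NNP\tCD\tNNP\tCD",
    "B-ORG\tO\tB-ORG\tO",
]
for record in read_sentences(sample):
    print(record)
```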
# Directory containing the NER corpus train.txt
dir = "/Users/Shared/CRF_4_NER/CRF_TEST"

with open("%s/train.txt" % dir, "r") as f:
    sents = [line.strip() for line in f.readlines()]

# Split into training and test sets at a ratio of 9:1
# (each sentence occupies three lines: words, POS tags, NER tags)
RATIO = 0.9
train_num = int((len(sents)//3)*RATIO)

# Write the training set, one "word POS tag" line per token
with open("%s/NER_train.data" % dir, "w") as g:
    for i in range(train_num):
        words = sents[3*i].split('\t')
        postags = sents[3*i+1].split('\t')
        tags = sents[3*i+2].split('\t')
        for word, postag, tag in zip(words, postags, tags):
            g.write(word+' '+postag+' '+tag+'\n')
        g.write('\n')  # blank line separates sentences

# Write the test set; start at train_num so that no sentence is skipped
with open("%s/NER_test.data" % dir, "w") as h:
    for i in range(train_num, len(sents)//3):
        words = sents[3*i].split('\t')
        postags = sents[3*i+1].split('\t')
        tags = sents[3*i+2].split('\t')
        for word, postag, tag in zip(words, postags, tags):
            h.write(word+' '+postag+' '+tag+'\n')
        h.write('\n')
import os

sentence = "Venezuelan opposition leader and self-proclaimed interim president Juan Guaidó said Thursday he will return to his country by Monday, and that a dialogue with President Nicolas Maduro won't be possible without discussing elections."
# sentence = "Real Madrid's season on the brink after 3-0 Barcelona defeat"
# sentence = "British artist David Hockney is known as a voracious smoker, but the habit got him into a scrape in Amsterdam on Wednesday."
# sentence = "India is waiting for the release of an pilot who has been in Pakistani custody since he was shot down over Kashmir on Wednesday, a goodwill gesture which could defuse the gravest crisis in the disputed border region in years."
# sentence = "Instead, President Donald Trump's second meeting with North Korean despot Kim Jong Un ended in a most uncharacteristic fashion for a showman commander in chief: fizzle."
# sentence = "And in a press conference at the Civic Leadership Academy in Queens, de Blasio said the program is already working."
# sentence = "The United States is a founding member of the United Nations, World Bank, International Monetary Fund."

# postags must hold (token, POS tag) pairs for `sentence`;
# the tagging step that produces it is not shown here.
# Every token gets the placeholder tag O so the file has the same columns as the training data.
with open("%s/NER_predict.data" % dir, 'w', encoding='utf-8') as f:
    for item in postags:
        f.write(item[0]+' '+item[1]+' O\n')

print("write successfully!")

# Run the trained CRF++ model on the file and save its output
os.chdir(dir)
os.system("crf_test -m model NER_predict.data > predict.txt")
print("get predict file!")
# Read the prediction file predict.txt
with open("%s/predict.txt" % dir, 'r', encoding='utf-8') as f:
    sents = [line.strip() for line in f.readlines() if line.strip()]

tokens = []
predict = []

# The first column is the token, the last column is the predicted tag
for sent in sents:
    words = sent.split()
    tokens.append(words[0])
    predict.append(words[-1])

# print(tokens)
# print(predict)

# Drop the tokens tagged O
ner_reg_list = []
for word, tag in zip(tokens, predict):
    if tag != 'O':
        ner_reg_list.append((word, tag))

# Print the entities recognized by the model
print("NER results:")
if ner_reg_list:
    for i, item in enumerate(ner_reg_list):
        if item[1].startswith('B'):
            # Extend the entity across the I- tags that follow the B- tag
            end = i+1
            while end <= len(ner_reg_list)-1 and ner_reg_list[end][1].startswith('I'):
                end += 1
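The loop above finds the end of each entity by scanning from a B- tag across the I- tags that follow it. A minimal sketch of that grouping as a standalone function is given below. It is not the author's code: `group_entities` is a name invented here, and the `LABELS` mapping from tag suffixes to the display names seen in the results (the PER and LOC suffixes in particular) is an assumption. Unlike the loop above, this version scans the full token sequence, so two entities separated by an O token are never merged.

```python
# Assumed mapping from tag suffix to display name
LABELS = {"PER": "PERSON", "LOC": "LOCATION", "ORG": "ORGANIZATION", "MISC": "MISC"}

def group_entities(tokens, tags):
    """Merge B-/I- tagged tokens into (display label, entity text) pairs."""
    entities = []
    current_words, current_type = [], None

    def flush():
        # Store the entity collected so far, if any
        if current_words:
            label = LABELS.get(current_type, current_type)
            entities.append((label, " ".join(current_words)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_words, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            current_words.append(token)
        else:
            flush()
            current_words, current_type = [], None
    flush()
    return entities

tokens = ["Real", "Madrid", "'s", "season", "after", "3-0", "Barcelona", "defeat"]
tags = ["B-ORG", "I-ORG", "O", "O", "O", "O", "B-LOC", "O"]
for label, text in group_entities(tokens, tags):
    print("%s: %s" % (label, text))
```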
Input sentence 1:
Real Madrid's season on the brink after 3-0 Barcelona defeat
Result 1:
ORGANIZATION: Real Madrid
LOCATION: Barcelona
Input sentence 2:
British artist David Hockney is known as a voracious smoker, but the
habit got him into a scrape in Amsterdam on Wednesday.
Result 2:
MISC: British
PERSON: David Hockney
LOCATION: Amsterdam
Input sentence 3:
India is waiting for the release of an pilot who has been in
Pakistani custody since he was shot down over Kashmir on Wednesday, a
goodwill gesture which could defuse the gravest crisis in the disputed
border region in years.
Result 3:
LOCATION: India
LOCATION: Pakistani
LOCATION: Kashmir
Input sentence 4:
Instead, President Donald Trump's second meeting with North Korean
despot Kim Jong Un ended in a most uncharacteristic fashion for a
showman commander in chief: fizzle.
Result 4:
PERSON: Donald Trump
PERSON: Kim Jong Un
Input sentence 5:
And in a press conference at the Civic Leadership Academy in Queens,
de Blasio said the program is already working.
Result 5:
ORGANIZATION: Civic Leadership Academy
LOCATION: Queens
PERSON: de Blasio
Input sentence 6:
The United States is a founding member of the United Nations, World
Bank, International Monetary Fund.
Result 6:
LOCATION: United States
ORGANIZATION: United Nations
PERSON: World Bank
ORGANIZATION: International Monetary Fund
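These examples contain some pleasant surprises: the model recognized the persons Donald Trump and Kim Jong Un. There are also shortcomings, such as World Bank being labeled a person rather than an organization. Overall, the recognition quality is satisfactory.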