Stanford CoreNLP is one of the most commonly used tokenizers/NLP toolkits for Java. Using it requires JDK 1.8.

Project Setup

Maven dependencies

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.8.0</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.8.0</version>
    <classifier>models</classifier>
</dependency>
<!-- Chinese models package -->
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.8.0</version>
    <classifier>models-chinese</classifier>
</dependency>

Properties file

Option 1:
Copy StanfordCoreNLP-chinese.properties from the models-chinese JAR into the resources directory.
Then add in the code:

StanfordCoreNLP nlp = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties");

Option 2:
Build the pipeline from a custom Properties object:

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP nlp = new StanfordCoreNLP(props);
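
Note that the annotator list above is the English default. For a Chinese pipeline, one approach (a sketch, assuming the models-chinese JAR from the Maven setup above is on the classpath) is to load the bundled Chinese properties first and then override individual entries:

import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class ChinesePipelineBuilder {
    public static StanfordCoreNLP build() throws IOException {
        Properties props = new Properties();
        // Load the defaults shipped in the models-chinese JAR ...
        try (InputStream in = ChinesePipelineBuilder.class.getClassLoader()
                .getResourceAsStream("StanfordCoreNLP-chinese.properties")) {
            props.load(in);
        }
        // ... then override individual entries, e.g. drop parsing and coreference for speed.
        props.setProperty("annotators", "segment, ssplit, pos, lemma, ner");
        return new StanfordCoreNLP(props);
    }
}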

* A custom dictionary can be configured in the properties file:

segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz,mydic.txt
# Create a mydic.txt file under the resources directory to hold the custom dictionary,
# one word per line; a custom NER file can be configured the same way.
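
For example, mydic.txt might contain entries like these (hypothetical, one word per line; the words come from the demo text used later):

碳碳键
键能
定律四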

Basic Concepts

In CoreNLP, one complete pass over a text is called a pipeline, and each annotator is a processing node, much like a function: segment performs word segmentation, ssplit splits a passage into sentences, pos tags parts of speech, ner recognizes named entities, regexner labels entity types with custom regular expressions, and parse analyzes sentence structure.
Each annotator's behavior is configured through the properties.
Every run produces an Annotation, which is essentially a Map: the output of each annotator is retrieved by its key, and the result is a List of CoreMaps.
See the Demo below for details.
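
The regexner annotator mentioned above reads rules from a tab-separated mapping file (a token pattern, then the entity label to assign). A minimal sketch, assuming a hypothetical myner.txt placed in the resources directory:

annotators = segment, ssplit, pos, lemma, ner, regexner
regexner.mapping = myner.txt

where myner.txt contains one rule per line, with the columns separated by a tab:

马化腾	PERSON
腾讯	ORGANIZATION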

Demo

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.process.*;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.util.CoreMap;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.List;

public class SParserTest {

    // A deliberately ambiguous test string: adjacent characters can pair into different words.
    private static final String text = "碳碳键键能能否否定定律四";

    public void testPipeline() {
        StanfordCoreNLP pipeline = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties");
        Annotation annotation = pipeline.process("腾讯公司的马化腾特别的厉害。");

        List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
        CoreMap sentence = sentences.get(0);

        List<CoreLabel> tokens = sentence.get(CoreAnnotations.TokensAnnotation.class);
        System.out.println("Token" + "\t " + "POS" + "\t " + "NER");
        System.out.println("-----------------------------");
        for (CoreLabel token : tokens) {
            String word = token.getString(TextAnnotation.class);
            String pos = token.getString(PartOfSpeechAnnotation.class);
            String ner = token.getString(NamedEntityTagAnnotation.class);
            System.out.println(word + "\t " + pos + "\t " + ner);
        }
    }

    public void testTagger() {
        MaxentTagger tagger = new MaxentTagger("edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger");
        String text = "碳碳键 键能 能否 否定 定律四";
        // The input is already pre-segmented; tokenizeText splits it on whitespace.
        List<List<HasWord>> sentences = tagger.tokenizeText(new BufferedReader(new StringReader(text)));
        for (List<? extends HasWord> sentence : sentences) {
            List<edu.stanford.nlp.ling.TaggedWord> tSentence = tagger.tagSentence(sentence);
            System.out.println(tSentence);
        }
    }

    public void testTokenizer() throws IOException {
        // Chinese sentence splitting (no word segmentation)
        for (String s : ChineseDocumentToSentenceProcessor.fromPlainText(text)) {
            System.out.println(s);
        }

        // English tokenization
        PTBTokenizer<CoreLabel> ptbTokenizer = new PTBTokenizer<CoreLabel>(
                new StringReader("i love machine learning and i'm stupid"),
                new CoreLabelTokenFactory(), "");
        while (ptbTokenizer.hasNext()) {
            CoreLabel word = ptbTokenizer.next();
            System.out.println(word);
        }
    }

    public void testSegmenter() throws IOException {
        StanfordCoreNLP nlp = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties");
        Annotation document = nlp.process(text);
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                System.out.println(word);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        SParserTest test = new SParserTest();
        System.out.println("......test tokenizer......");
        test.testTokenizer();
        System.out.println("......test segmenter......");
        test.testSegmenter();
        System.out.println("......test tagger......");
        test.testTagger();
        System.out.println("......test pipeline......");
        test.testPipeline();
    }
}

Program output:

......test tokenizer......
碳碳键键能能否否定定律四
i
love
machine
learning
and
i
'm
stupid
......test segmenter......
碳碳键
键能
能否
否定
定律

......test tagger......
[碳碳键/NR, 键能/NN, 能否/VV, 否定/VV, 定律四/NN]
......test pipeline......
Token POS NER
-----------------------------
腾讯 NR ORGANIZATION
公司 NN ORGANIZATION
的 DEC O
马化腾 NR PERSON
特别 JJ O
的 DEG O
厉害 NN O
。 PU O

Common annotators

Name: description; generated annotations; notes

ssplit: splits the text into sentences; the sentence-boundary regex is configurable.
    Generates: SentencesAnnotation
tokenize: word segmentation.
    Generates: TokensAnnotation (a list of tokens); CharacterOffsetBeginAnnotation, CharacterOffsetEndAnnotation, TextAnnotation (for each token)
pos: part-of-speech tagging, using the Penn Chinese Treebank tag set for Chinese.
    Generates: PartOfSpeechAnnotation
ner: named-entity recognition; common entity types include names (PERSON, LOCATION, ORGANIZATION, MISC), numeric values (MONEY, NUMBER, ORDINAL, PERCENT), and times (DATE, TIME, DURATION, SET).
    Generates: NamedEntityTagAnnotation, NormalizedNamedEntityTagAnnotation
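
The boundary regex for ssplit can be overridden in the properties, for example (this mirrors the value shipped in the bundled Chinese properties, though the exact default may differ between versions):

ssplit.boundaryTokenRegex = [.。]|[!?!?]+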

To use the pipeline, first configure the annotators in the properties; then pass the corresponding Annotation class to the generated Annotation to retrieve each annotator's result.
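
For example, the character offsets generated alongside tokenization can be read per token; a minimal sketch reusing the Chinese pipeline from the Demo:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class OffsetDemo {
    public static void main(String[] args) {
        StanfordCoreNLP nlp = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties");
        Annotation document = nlp.process("腾讯公司的马化腾特别的厉害。");
        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // Offsets of the token in the original text, as [begin, end)
                int begin = token.get(CoreAnnotations.CharacterOffsetBeginAnnotation.class);
                int end = token.get(CoreAnnotations.CharacterOffsetEndAnnotation.class);
                System.out.println(token.word() + "\t[" + begin + "," + end + ")");
            }
        }
    }
}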