Document Analysis: Analyzers and Tokenizers

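Analysis consists of two steps:

- Split a block of text into individual terms suitable for an inverted index
- Normalize those terms into a standard form to improve their searchability, or recall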

# Analyzer Components

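An analyzer performs the work described above.

An analyzer actually bundles three functions, which run in order:

Character filters
- The string passes through each character filter in turn
- They tidy up the string before tokenization
- A character filter can strip HTML or convert & to and
Tokenizer
- The tokenizer splits the string into individual terms
- A simple tokenizer might split the text into terms whenever it encounters whitespace or punctuation
Token filters
- Each term passes through every token filter in turn
- This stage may
  - change terms, such as lowercasing Quick
  - remove terms, such as stopwords like a, and, the
  - add terms, such as synonyms like jump and leap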
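
The three stages above can be sketched in plain Python. This is a toy model rather than Elasticsearch code: the function names, the regular expressions, and the small stopword list are assumptions made for illustration only.

```python
import re

STOPWORDS = {"a", "and", "the"}  # assumed toy list, taken from the examples above


def char_filter(text):
    """Character filter: tidy the raw string (here, replace HTML tags with spaces)."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenizer(text):
    """Tokenizer: split on whitespace and punctuation."""
    return re.findall(r"\w+", text)


def token_filters(tokens):
    """Token filters: lowercase every term, then drop stopwords."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOPWORDS]


def analyze(text):
    """Run the three stages in order."""
    return token_filters(tokenizer(char_filter(text)))


print(analyze("<p>The Quick fox</p>"))  # ['quick', 'fox']
```

The order matters: the tokenizer never sees the HTML tags, and the stopword check only works because lowercasing runs first.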

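# Built-in Analyzers

Elasticsearch ships with prepackaged analyzers that can be used directly. The most important ones are listed below.

Sample sentence used in the examples: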

Set the shape to semi-transparent by calling set_trans(5)

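# Standard Analyzer

The analyzer Elasticsearch uses by default

The most common choice for analyzing text in any language

Splits text on word boundaries as defined by the Unicode Consortium

Removes most punctuation and lowercases the terms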

set, the, shape, to, semi, transparent, by, calling, set_trans, 5

# Simple Analyzer

Splits text on anything that is not a letter and lowercases the terms

set, the, shape, to, semi, transparent, by, calling, set, trans

# Whitespace Analyzer

The whitespace analyzer splits text on whitespace only and does not lowercase the terms

Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
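
The differences between the three analyzers above can be approximated with a few regular expressions. This is only a rough sketch: the function names are made up, and `standard_like` is a crude stand-in that happens to agree with the standard analyzer on this sentence, not an implementation of the Unicode word-boundary rules.

```python
import re

SAMPLE = "Set the shape to semi-transparent by calling set_trans(5)"


def standard_like(text):
    # Rough stand-in: runs of letters, digits, and underscores, lowercased.
    return re.findall(r"\w+", text.lower())


def simple_like(text):
    # Runs of letters only, lowercased.
    return re.findall(r"[^\W\d_]+", text.lower())


def whitespace_like(text):
    # Split on whitespace; case and punctuation are kept.
    return text.split()


for analyzer in (standard_like, simple_like, whitespace_like):
    print(analyzer.__name__, analyzer(SAMPLE))
```

Running it reproduces the three term lists shown above, which makes it easy to see where `set_trans(5)` and `semi-transparent` are treated differently.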

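# Language Analyzers

Language-specific analyzers are available for many languages

They can take the peculiarities of a given language into account. The english analyzer, for example:

- Comes with a set of English stopwords (common words such as and or the that have little impact on relevance), which it removes
- Can stem English words, because it understands the rules of English grammar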

set, shape, semi, transpar, call, set_tran, 5

Note: transparent, calling, and set_trans have been reduced to their stems

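# When Analyzers Are Used

When a document is indexed, its full-text fields are analyzed into terms that are used to build the inverted index

When searching a full-text field, the query string must go through the same analysis process, so that the terms being searched for have the same form as the terms in the index

Full-text queries understand how each field is defined, so they can do the right thing:

- When you query a full-text field, the same analyzer is applied to the query string to produce the correct list of search terms
- When you query an exact-value field, the query string is not analyzed; the query searches for the exact value you specified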
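
To see why the query has to go through the same analysis as the indexed text, here is a minimal in-memory inverted index. It is a sketch under simplifying assumptions: the `analyze` function is a stand-in (lowercased word tokens), and all names are hypothetical.

```python
import re
from collections import defaultdict


def analyze(text):
    # Stand-in analyzer: lowercased word tokens.
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map each analyzed term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search_analyzed(index, query):
    """Analyze the query the same way, then look up each term (OR semantics)."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits


def search_unanalyzed(index, value):
    """Look up the value exactly as given, without analysis."""
    return set(index.get(value, set()))


docs = {1: "The Quick Brown Fox", 2: "A lazy dog"}
index = build_index(docs)
print(search_analyzed(index, "QUICK"))    # {1}
print(search_unanalyzed(index, "QUICK"))  # set(): the index holds "quick"
```

The second lookup misses because the index only contains the normalized term `quick`, which is the mismatch the matching analysis step is meant to prevent.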

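# Testing Analyzers

Use the analyze API to see how text is analyzed. Specify the analyzer and the text to analyze in the request body

In Postman, send a GET request to the ES server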

GET http://localhost:9200/_analyze

body

{
    "analyzer": "standard",
    "text": "Text to analyze"
}
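
The same request can be prepared without Postman. The sketch below uses only the Python standard library; the helper names are hypothetical, and `run_analyze` assumes a node is listening on `http://localhost:9200` as in the example above.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer=None, index=None,
                          base_url="http://localhost:9200"):
    """Build (but do not send) a request for the analyze API."""
    path = f"/{index}/_analyze" if index else "/_analyze"
    body = {"text": text}
    if analyzer:
        body["analyzer"] = analyzer
    return urllib.request.Request(
        base_url + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="GET",
    )


def run_analyze(request):
    """Send the request and decode the JSON response (needs a running node)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


req = build_analyze_request("Text to analyze", analyzer="standard")
print(req.get_method(), req.full_url)
```

Building and sending are kept separate so the request can be inspected without a server; call `run_analyze(req)` once a node is available.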

response

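- Each element in the result represents a single term
- token is the actual term stored in the index
- start_offset and end_offset give the character offsets of the term in the original string
- position gives the order in which the term appeared in the original text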

{
    "tokens": [
        {
            "token": "text",
            "start_offset": 0,
            "end_offset": 4,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "to",
            "start_offset": 5,
            "end_offset": 7,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "analyze",
            "start_offset": 8,
            "end_offset": 15,
            "type": "<ALPHANUM>",
            "position": 2
        }
    ]
}
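
A short sketch of how the fields in this response relate to the input: slicing the original string with `start_offset` and `end_offset` recovers the text each term came from. The token list is copied from the response above; the helper name is hypothetical.

```python
text = "Text to analyze"

# Token list copied from the response above (the type field is omitted for brevity).
tokens = [
    {"token": "text", "start_offset": 0, "end_offset": 4, "position": 0},
    {"token": "to", "start_offset": 5, "end_offset": 7, "position": 1},
    {"token": "analyze", "start_offset": 8, "end_offset": 15, "position": 2},
]


def source_slice(original, token):
    """Return the span of the original string that produced the token."""
    return original[token["start_offset"]:token["end_offset"]]


for tok in tokens:
    print(tok["position"], tok["token"], repr(source_slice(text, tok)))
```

The first line of output pairs the stored term `text` with the original `'Text'`: the term is normalized, while the offsets still point at the untouched input.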

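# Specifying Analyzers

When Elasticsearch detects a new string field in a document, it automatically maps it as a full-text string field and analyzes it with the standard analyzer

In some cases you need to choose the analyzer yourself by defining the mapping for these fields manually:

- To use an analyzer suited to the language of your business data
- To keep a string field as just a string field, such as a user ID, an internal status field, or a tag: the field is not analyzed, and the exact value passed in is indexed as-is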
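
As an illustration of the two cases above, the following sketch builds an index body with an explicit mapping. The field names (`title`, `user_id`, `status`) and the choice of the `english` analyzer are assumptions for the example, not part of the original notes.

```python
import json

# Hypothetical index body: the field names and the "english" analyzer are examples.
index_body = {
    "mappings": {
        "properties": {
            # Full-text field analyzed with a language analyzer.
            "title": {"type": "text", "analyzer": "english"},
            # Exact-value fields: indexed as-is, never analyzed.
            "user_id": {"type": "keyword"},
            "status": {"type": "keyword"},
        }
    }
}

print(json.dumps(index_body, indent=2))
```

The printed JSON could be used as the body of a PUT request that creates the index, in the same way as the other request bodies in these notes.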

# Default Analyzer

Send a GET request through Postman to check the tokenization result

GET http://localhost:9200/_analyze

body

{
    "text":"测试单词"
}

response

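The default ES analyzer does not recognize Chinese words such as 测试 (test) and 单词 (word); it simply splits the text into one term per character
This result clearly does not meet the needs of Chinese text, so a Chinese analyzer plugin matching your ES version needs to be downloaded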

{
    "tokens": [
        {
            "token": "测",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "试",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "单",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "词",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        }
    ]
}

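# IK Analyzer

For Chinese text, use the IK Chinese analyzer for tokenization

Download it from https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v7.8.0
Unzip it, place the extracted folder in the plugins directory under the ES root directory, and restart ES

This time, add a new parameter to the request body: "analyzer":"ik_max_word"

Send a GET request through Postman to check the tokenization result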

GET http://localhost:9200/_analyze

body

ik_max_word: splits the text at the finest granularity
ik_smart: splits the text at the coarsest granularity

{
	"text":"测试单词",
	"analyzer":"ik_max_word"
}

response

{
    "tokens": [
        {
            "token": "测试",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "单词",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}

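# Extending the Dictionary

In Chinese tokenization, some personal names and newly coined words are not in the tokenizer's dictionary, so they are still split, but only into single characters. In that case the dictionary needs to be extended

For example, tokenizing the name 弗雷尔卓德 with IK gives the result below

Send a GET request through Postman to check the tokenization result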

GET http://localhost:9200/_analyze

body

{
	"text":"弗雷尔卓德",
	"analyzer":"ik_max_word"
}

response

Only single-character terms are returned. What is needed is for the tokenizer to recognize 弗雷尔卓德 as a word as well

{
    "tokens": [
        {
            "token": "弗",
            "start_offset": 0,
            "end_offset": 1,
            "type": "CN_CHAR",
            "position": 0
        },
        {
            "token": "雷",
            "start_offset": 1,
            "end_offset": 2,
            "type": "CN_CHAR",
            "position": 1
        },
        {
            "token": "尔",
            "start_offset": 2,
            "end_offset": 3,
            "type": "CN_CHAR",
            "position": 2
        },
        {
            "token": "卓",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_CHAR",
            "position": 3
        },
        {
            "token": "德",
            "start_offset": 4,
            "end_offset": 5,
            "type": "CN_CHAR",
            "position": 4
        }
    ]
}

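# Configuring the Extension Dictionary

Go to the ik folder inside the plugins folder under the ES root directory, then open its config directory
- For example, the local directory D:\tool\elasticsearch-7.8.0\plugins\elasticsearch-analysis-ik-7.8.0\config
Create a custom.dic file and write 弗雷尔卓德 into it
- To add several words, put each one on its own line
Open config/IKAnalyzer.cfg.xml and register the new custom.dic in it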

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
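	<comment>IK Analyzer extension configuration</comment>
	<!-- Configure your own extension dictionary here -->
	<entry key="ext_dict">custom.dic</entry>
	<!-- Configure your own extension stopword dictionary here -->
	<entry key="ext_stopwords"></entry>
	<!-- Configure a remote extension dictionary here -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- Configure a remote extension stopword dictionary here -->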
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
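
The dictionary file from the steps above can also be generated by a small script. This is a sketch with a hypothetical helper name; it assumes the file should be UTF-8 encoded with one word per line, and the demo writes to a temporary directory instead of a real plugin folder.

```python
import tempfile
from pathlib import Path


def write_ik_dictionary(config_dir, words, filename="custom.dic"):
    """Write one word per line, UTF-8 encoded, and return the file path."""
    path = Path(config_dir) / filename
    path.write_text("\n".join(words) + "\n", encoding="utf-8")
    return path


# Demo against a temporary directory; pass the plugin's config directory instead.
with tempfile.TemporaryDirectory() as tmp:
    dic = write_ik_dictionary(tmp, ["弗雷尔卓德"])
    print(dic.name, dic.read_text(encoding="utf-8").splitlines())
```

The file name has to match the value of the `ext_dict` entry in IKAnalyzer.cfg.xml.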

Restart the ES server

Tokenize the name 弗雷尔卓德 with IK again; the result is shown below

Send a GET request through Postman to check the tokenization result

GET http://localhost:9200/_analyze

body

{
	"text":"弗雷尔卓德",
	"analyzer":"ik_max_word"
}

response

{
    "tokens": [
        {
            "token": "弗雷尔卓德",
            "start_offset": 0,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 0
        }
    ]
}
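
Why one extra dictionary entry changes the result can be illustrated with a toy dictionary-based segmenter. This is a simplified sketch using greedy forward maximum matching; it is not IK's actual algorithm, and the dictionaries here are tiny assumed examples.

```python
def segment(text, dictionary):
    """Greedy forward maximum matching; unknown characters become single tokens."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


base_dictionary = {"测试", "单词"}  # assumed toy dictionary
extended_dictionary = base_dictionary | {"弗雷尔卓德"}  # plus the custom.dic entry

print(segment("测试单词", base_dictionary))
print(segment("弗雷尔卓德", base_dictionary))
print(segment("弗雷尔卓德", extended_dictionary))
```

With the base dictionary the name falls apart into five single characters; once the word is in the dictionary it comes back as one term, mirroring the two responses shown in this section.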

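# Custom Analyzers

In Elasticsearch you can create a custom analyzer by combining character filters, a tokenizer, and token filters in a configuration that suits your particular data

An analyzer is a wrapper that combines three functions into a single package; the three functions are executed in sequence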

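# Character Filters

Character filters tidy up a string before it is tokenized

For example, text in HTML format contains tags such as <p> or <div> that you do not want indexed. The html_strip character filter removes all HTML tags and converts HTML entities, such as turning &Aacute; into the corresponding Unicode character Á

An analyzer may have zero or more character filters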
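
A rough Python approximation of the two kinds of character filter mentioned in these notes, HTML stripping and character mapping. The function names are hypothetical, and the tag-removal regular expression is a simplification rather than a faithful copy of what html_strip does.

```python
import html
import re


def html_strip_like(text):
    """Drop tags, then decode HTML entities such as &Aacute;."""
    without_tags = re.sub(r"<[^>]+>", "", text)
    return html.unescape(without_tags)


def mapping_filter(text, mappings):
    """Replace each key with its value, like a simple mapping character filter."""
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text


raw = "<p>&Aacute;gua &amp; sal</p>"
print(mapping_filter(html_strip_like(raw), {"&": "and"}))  # Água and sal
```

Because character filters run in order, the mapping filter sees the `&` only after the entity `&amp;` has been decoded.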

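# Tokenizers

An analyzer must have exactly one tokenizer

The tokenizer breaks the string up into individual terms, or tokens. The standard tokenizer, which is used in the standard analyzer, breaks a string up into individual terms on word boundaries and removes most punctuation. Other tokenizers behave differently, for example:

- The keyword tokenizer outputs exactly the string it received, without any tokenization
- The whitespace tokenizer splits text on whitespace only
- The pattern tokenizer splits text on a matching regular expression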
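
The three tokenizers listed above are simple enough to sketch in a few lines of Python. The function names are made up for the example, and the default pattern `\W+` is used here on the assumption that splitting on non-word characters is the behavior wanted.

```python
import re


def keyword_tokenizer(text):
    """Emit the whole input as a single token."""
    return [text]


def whitespace_tokenizer(text):
    """Split on runs of whitespace only."""
    return text.split()


def pattern_tokenizer(text, pattern=r"\W+"):
    """Split wherever the regular expression matches; drop empty pieces."""
    return [piece for piece in re.split(pattern, text) if piece]


sample = "set_trans(5) is semi-transparent"
for tokenize in (keyword_tokenizer, whitespace_tokenizer, pattern_tokenizer):
    print(tokenize.__name__, tokenize(sample))
```

None of these functions lowercase or remove anything; in an analyzer that work is left to the token filters that run afterwards.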

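# Token Filters

After tokenization, the resulting token stream passes through the specified token filters in the specified order. Token filters can change, add, or remove tokens. Elasticsearch has many kinds of token filters, for example:

- The lowercase and stop token filters
- Stemming token filters reduce words to their stems
- The asciifolding filter removes diacritics, converting a word such as très into tres
- The ngram and edge_ngram token filters produce tokens suitable for partial matching or autocomplete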
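
Several of these token filters can be imitated with short Python functions that take and return a list of tokens. This is an illustrative sketch only: the names are hypothetical, the folding function covers just characters that decompose into a base letter plus combining marks, and stemming is left out.

```python
import unicodedata


def lowercase(tokens):
    """Change tokens: lowercase each one."""
    return [t.lower() for t in tokens]


def stop(tokens, stopwords=frozenset({"a", "and", "the"})):
    """Remove tokens: drop stopwords."""
    return [t for t in tokens if t not in stopwords]


def ascii_fold(tokens):
    """Decompose accented characters and drop the combining marks."""
    folded = []
    for t in tokens:
        decomposed = unicodedata.normalize("NFKD", t)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded


def edge_ngrams(tokens, min_gram=1, max_gram=3):
    """Add tokens: emit the leading prefixes of each token, useful for autocomplete."""
    return [t[:n] for t in tokens for n in range(min_gram, min(max_gram, len(t)) + 1)]


print(stop(lowercase(["The", "Quick", "Fox"])))  # ['quick', 'fox']
print(ascii_fold(["très"]))                      # ['tres']
print(edge_ngrams(["quick"]))                    # ['q', 'qu', 'qui']
```

Since every filter has the same list-in, list-out shape, they can be chained in whatever order the analyzer definition specifies.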

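# Example: Creating a Custom Analyzer

Character filters: strip HTML and convert & to and

Token filters: lowercase the terms and remove the stopwords the and a

Send a PUT request through Postman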

PUT http://localhost:9200/my_index

body

{
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type": "mapping", 
                    "mappings": [
                        "&=> and "
                    ]
                }
            }, 
            "filter": {
                "my_stopwords": {
                    "type": "stop", 
                    "stopwords": [
                        "the", 
                        "a"
                    ]
                }
            }, 
            "analyzer": {
                "my_analyzer": {
                    "type": "custom", 
                    "char_filter": [
                        "html_strip", 
                        "&_to_and"
                    ], 
                    "tokenizer": "standard", 
                    "filter": [
                        "lowercase", 
                        "my_stopwords"
                    ]
                }
            }
        }
    }
}

response

{
    "acknowledged": true,
    "shards_acknowledged": true,
    "index": "my_index"
}

# Example: Using the Custom Analyzer

After the index has been created, use the analyze API to test the new analyzer

Send a GET request through Postman

GET http://localhost:9200/my_index/_analyze

body

{
    "text":"The quick & brown fox",
    "analyzer": "my_analyzer"
}

response

{
    "tokens": [
        {
            "token": "quick",
            "start_offset": 4,
            "end_offset": 9,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "and",
            "start_offset": 10,
            "end_offset": 11,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "brown",
            "start_offset": 12,
            "end_offset": 17,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "fox",
            "start_offset": 18,
            "end_offset": 21,
            "type": "<ALPHANUM>",
            "position": 4
        }
    ]
}
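
Two details of this response are worth noting: the first term has position 1 rather than 0, and the term and carries the offsets of the original & character. The sketch below emulates both with a shortcut: instead of rewriting the string as a real character filter would, it matches & as its own token and renames it. HTML stripping is left out, and the function name is hypothetical.

```python
import re

STOPWORDS = {"the", "a"}


def my_analyzer_like(text):
    """Emulate my_analyzer on plain text, keeping offsets and positions."""
    tokens = []
    # Shortcut: match "&" as its own token instead of rewriting the string,
    # so the offsets still point into the original text.
    for position, match in enumerate(re.finditer(r"\w+|&", text)):
        term = "and" if match.group() == "&" else match.group().lower()
        if term in STOPWORDS:
            continue  # the stopword is dropped, but its position stays used up
        tokens.append({
            "token": term,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


for tok in my_analyzer_like("The quick & brown fox"):
    print(tok)
```

The removed stopword The still occupies position 0, which is why quick is reported at position 1.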

Last Updated: 2022/02/05, 15:58:51
