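Elasticsearch Mapping Types

Full-Text Search

Published: 2025-03-16

Mapping Types

The core Elasticsearch field types are string types, numeric types, the date type, the boolean type, the binary type, and range types.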

Category	Subcategory	Types
Core types	String	text, keyword
	Numeric	long, integer, short, byte, double, float, half_float, scaled_float
	Date	date
	Boolean	boolean
	Binary	binary
	Range	integer_range, float_range, long_range, double_range, date_range

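String types

text

Use text for fields that need full-text search, such as email bodies, product descriptions, and news content.
Once a field is mapped as text, its content is analyzed: before the inverted index is built, an analyzer splits the string into individual terms.

text fields are not used for sorting and are rarely used for aggregations.
The analyzer for a text field is set in the mapping with the analyzer parameter; if none is given, the standard analyzer is used.
The analyzed (default), not_analyzed, and no values described below are settings of the index option on the legacy string type (Elasticsearch 2.x and earlier). Since 5.0, text plays the role of an analyzed string, keyword plays the role of a not_analyzed string, and index accepts only true or false, with false corresponding to no.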

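analyzed

By default, index is set to analyzed, which has the following effect: the analyzer lowercases all characters and splits the string into words.
Use this option when a search should match on individual whole words. For example, users who search for "elasticsearch" expect to see "Principles and Practice of Elasticsearch" in the results.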

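not_analyzed

Setting index to not_analyzed has the opposite effect: analysis is skipped and the whole string is indexed as a single term.
Use this option for exact matching, such as searching tags. You probably want "big data" to appear in the results of a search for "big data", but not in the results of a search for "data". Most term-counting aggregations need this as well: to find the most frequent tags, "big data" should be counted as one term rather than as "big" and "data" separately.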

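no

With index set to no, indexing is skipped and no terms are produced, so the field cannot be searched.
When a field never needs to be searched, this option saves storage space and shortens indexing and search time.
For example, you might store comments on an event. Storing and displaying the comments is valuable, but you may not need to search them. In that case, turning off indexing for the field makes indexing faster and saves storage space.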
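
To make the three index settings concrete, here is a minimal Python sketch of which terms each one would put into the inverted index. The analyze function is a rough stand-in for a real analyzer (lowercase, then split on anything that is not a letter or digit), and the function names are illustrative assumptions, not Elasticsearch APIs.

```python
import re


def analyze(text):
    """Rough stand-in for an analyzer: lowercase, then split into words."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


def index_terms(value, index_option):
    """Return the terms that would be indexed for a value under each setting."""
    if index_option == "analyzed":
        return analyze(value)   # one term per word
    if index_option == "not_analyzed":
        return [value]          # the whole string as a single term
    if index_option == "no":
        return []               # nothing indexed, so nothing to search
    raise ValueError(f"unknown index option: {index_option}")


title_terms = index_terms("Principles and Practice of Elasticsearch", "analyzed")
tag_terms = index_terms("big data", "not_analyzed")
comment_terms = index_terms("Great event!", "no")

print(title_terms)
print(tag_terms)
print(comment_terms)
```

A search for "elasticsearch" finds the title because that word is one of its terms, while a search for "data" does not find the tag because its only term is the full string "big data".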

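keyword

The keyword type is meant for structured fields such as email addresses, hostnames, status codes, and tags, and is typically used for filtering, sorting, and aggregations.
A keyword field can only be found by searching for its exact value.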

Numeric types

Mapping example:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "number_of_bytes": {
          "type": "integer"
        },
        "time_in_seconds": {
          "type": "float"
        },
        "price": {
          "type": "scaled_float",
          "scaling_factor": 100
        }
      }
    }
  }
}

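scaled_float stores a floating-point number as a long by multiplying it by a scaling factor. For example, if prices only need to be accurate to the cent, a price of 57.34 with a scaling factor of 100 is stored as 5734. All APIs still treat price as a floating-point value; internally Elasticsearch stores an integer, because integers compress better than floating-point numbers and so take less storage space.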
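
The arithmetic behind scaled_float can be sketched in a few lines of Python. This is only an illustration of the idea (multiply by the factor, round to an integer, divide on the way out); the function names are made up for the example, and Python's round() rounds exact halves to the nearest even number, which may not match Elasticsearch on ties.

```python
def encode_scaled_float(value, scaling_factor):
    """Multiply by the scaling factor and round to the nearest integer."""
    return round(value * scaling_factor)


def decode_scaled_float(stored, scaling_factor):
    """Turn the stored integer back into a floating-point value."""
    return stored / scaling_factor


stored_price = encode_scaled_float(57.34, 100)
print(stored_price, decode_scaled_float(stored_price, 100))

# Digits beyond the scaling factor's precision are lost when rounding.
rounded_price = decode_scaled_float(encode_scaled_float(57.349, 100), 100)
print(rounded_price)
```

The second call shows the trade-off: with a scaling factor of 100, anything finer than one cent is rounded away.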

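Date type

JSON has no date type, so a date in Elasticsearch can take one of the following forms:

A string containing a formatted date, such as "2015-01-01" or "2015/01/01 12:10:30".
A long number representing milliseconds-since-the-epoch (the epoch is a fixed point in time: 1970-01-01 00:00:00 UTC).
An integer representing seconds-since-the-epoch.

Internally, Elasticsearch converts dates to UTC (Coordinated Universal Time) and stores them as a long number representing milliseconds-since-the-epoch, because numbers are faster to store and process than strings. Date formats can be customized; if no format is specified, the default is "strict_date_optional_time||epoch_millis". Values in other forms, such as "2015/01/01 12:10:30" or seconds-since-the-epoch, need a matching format (for example epoch_second) in the mapping.

Mapping example:

PUT my_index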
{"mappings": {"my_type": {"properties": {"dt": {"type": "date"}}}}}

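Indexing example:

PUT my_index/my_type/1
{"dt": "2019-11-14"}
PUT my_index/my_type/2
{"dt": "2019-11-14T13:16:30Z"}
PUT my_index/my_type/3
{"dt": 1573664468000}

With the default format, the dates in all three documents can be parsed; internally each is stored as a long number of milliseconds.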
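
As a rough illustration of what "stored as milliseconds-since-the-epoch" means, the following Python sketch normalizes the three input forms to one long value. It uses only the standard library and is not Elasticsearch's own parser; to_epoch_millis is a hypothetical helper, and treating strings without an offset as UTC is an assumption of the sketch.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(value):
    """Normalize an ISO date string or an epoch-millis number to epoch milliseconds."""
    if isinstance(value, int):
        return value  # already milliseconds-since-the-epoch
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assume UTC when no offset is given
    return (parsed - EPOCH) // timedelta(milliseconds=1)


for raw in ("2019-11-14", "2019-11-14T13:16:30Z", 1573664468000):
    print(raw, "->", to_epoch_millis(raw))
```

Whatever the input form, the value that ends up being compared and sorted is a single integer, which is why range queries and sorting on dates are cheap.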

Boolean type

Accepted values are true, false, "true", and "false".

Mapping example:

PUT my_index
{ "mappings": {"my_type": {"properties": {"is_published": {"type": "boolean" } }} }}

Indexing example:

PUT my_index/my_type/1
{ "is_published": true}
PUT my_index/my_type/2
{"is_published": "true"}
PUT my_index/my_type/3
{"is_published": false}

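binary type

The binary type accepts a binary value as a Base64-encoded string. By default the field is not stored and is not searchable.

Mapping example:

PUT my_index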
{
  "mappings": {
    "my_type": {
      "properties": {"name": {"type": "text"}, "blob": {"type": "binary"} }
    }
  }
}

Indexing example:

PUT my_index/my_type/1
{ "name": "Some binary blob", "blob": "U29tZSBiaW5hcnkgYmxvYg=="}
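
A binary field expects its value as a Base64 string, so the client has to encode the raw bytes before building the document. The sketch below shows one way to do that with Python's standard base64 module; to_blob and from_blob are illustrative helper names, not part of any Elasticsearch client.

```python
import base64


def to_blob(raw: bytes) -> str:
    """Encode raw bytes as the Base64 string a binary field expects."""
    return base64.b64encode(raw).decode("ascii")


def from_blob(blob: str) -> bytes:
    """Decode a Base64 string read back from the document source."""
    return base64.b64decode(blob)


document = {"name": "Some binary blob", "blob": to_blob(b"Some binary blob")}
print(document["blob"])
print(from_blob(document["blob"]))
```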

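range types

Use cases for range types include time pickers and age-range pickers in web forms. The supported range types and their value ranges are listed below.

Type	Range
integer_range	-2^31 to 2^31-1
float_range	32-bit IEEE 754
long_range	-2^63 to 2^63-1
double_range	64-bit IEEE 754
date_range	64-bit integer, in milliseconds

Mapping example: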

PUT range_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "expected_attendees": {
          "type": "integer_range"
        },
        "time_frame": {
          "type": "date_range",
          "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
        }
      }
    }
  }
}

Index a document whose expected_attendees is 10 to 20 and whose time_frame runs from 2019-10-31 12:00:00 to 2019-11-01:

PUT range_index/my_type/1
{
  "expected_attendees": {
    "gte": 10,
    "lte": 20
  },
  "time_frame": {
    "gte": "2019-10-31 12:00:00",
    "lte": "2019-11-01"
  }
}
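
A range field stores bounds rather than a single value, and a common question to ask of it is whether a given value falls inside the stored range. The Python sketch below illustrates that check for the gt, gte, lt, and lte bounds used in the document above. It is a simplified model for illustration, not how Elasticsearch evaluates queries, and range_contains is a hypothetical helper.

```python
import operator

# Bound keywords used in range field values, mapped to the comparison they imply.
BOUND_CHECKS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def range_contains(field_range, value):
    """True when the value satisfies every bound present in the range."""
    return all(BOUND_CHECKS[bound](value, limit) for bound, limit in field_range.items())


expected_attendees = {"gte": 10, "lte": 20}
print(range_contains(expected_attendees, 12))
print(range_contains(expected_attendees, 21))
```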

Composite types

Category	Subcategory	Types
Composite types	Array	array
	Object	object
	Nested	nested

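Array type

Elasticsearch has no dedicated array type. By default any field can hold zero or more values, but all values in an array must be of the same type. For example:

String array: ["one", "two"]
Integer array: [1, 3]
Nested array: [1, [2, 3]], which is equivalent to [1, 2, 3]
Object array: [{"city": "amsterdam", "country": "nederland"}, {"city": "brussels", "country": "belgium"}]

When data is added dynamically, the type of the first value in an array determines the type of the whole array. Arrays of mixed types are not supported, for example [1, "abc"]. An array may contain null values, and an empty array [] is treated as a missing field.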
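
The array rules above (nested arrays are flattened, values must share one type, null is allowed) can be modeled with a short Python sketch. It is a simplification for illustration: real type checks also involve coercion between compatible types, which this sketch does not model, and both function names are made up for the example.

```python
def flatten(values):
    """Flatten nested lists, so [1, [2, 3]] becomes [1, 2, 3]."""
    flat = []
    for value in values:
        if isinstance(value, list):
            flat.extend(flatten(value))
        else:
            flat.append(value)
    return flat


def is_homogeneous(values):
    """True when all non-null values in the (flattened) array share one type."""
    kinds = {type(value) for value in flatten(values) if value is not None}
    return len(kinds) <= 1


print(flatten([1, [2, 3]]))
print(is_homogeneous([1, "abc"]))
print(is_homogeneous([1, None, 3]))
```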

No configuration is needed to use arrays in a document; they are supported out of the box. For example, the following command indexes a document that contains arrays:

PUT my_index/my_type/1
{
  "message": "some arrays in this document...",
  "tags": ["elasticsearch", "wow"],
  "lists": [
    {
      "name": "prog_list",
      "description": "programming list"
    },
    {
      "name": "cool_list",
      "description": "cool stuff list"
    }
  ]
}

To search the name field under lists:

GET my_index/_search
{
  "query": {
    "match": {
      "lists.name": "cool_list"
    }
  }
}

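object type

JSON is hierarchical by nature: a document can contain inner objects, which can themselves contain inner objects. Consider the following example:

{
  "region": "US",
  "manager": {
    "age": 30,
    "name": {
      "first": "John",
      "last": "Smith"
    }
  }
}

The document above is a JSON object as a whole; it contains a manager object, which in turn contains an inner object called name. Once written to Elasticsearch, the document is indexed as a flat list of key-value pairs, like this:

{
  "region": "US",
  "manager.age": 30,
  "manager.name.first": "John",
  "manager.name.last": "Smith"
}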
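
The flattening of inner objects into dotted field names can be reproduced with a small recursive function. This Python sketch only mirrors the key-value shape shown above; it says nothing about how Lucene stores the fields, and flatten_object is a hypothetical helper.

```python
def flatten_object(obj, prefix=""):
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into inner objects, extending the dotted path.
            flat.update(flatten_object(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


document = {
    "region": "US",
    "manager": {"age": 30, "name": {"first": "John", "last": "Smith"}},
}
print(flatten_object(document))
```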

The explicit mapping for the document structure above is:

{
  "mappings": {
    "my_type": {
      "properties": {
        "region": {
          "type": "keyword"
        },
        "manager": {
          "properties": {
            "age": {
              "type": "integer"
            },
            "name": {
              "properties": {
                "first": {
                  "type": "text"
                },
                "last": {
                  "type": "text"
                }
              }
            }
          }
        }
      }
    }
  }
}

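nested type

The nested type is a specialized version of the object type that allows arrays of objects to be indexed and queried independently of each other. Lucene has no concept of inner objects, so Elasticsearch flattens object hierarchies into a simple list of field names and values.

This flattening can cause problems with the object type. For example, suppose document my_index/my_type/1 has the following structure:

PUT my_index/my_type/1
{
  "group": "fans",
  "user": [
    {
      "first": "John",
      "last": "Smith"
    },
    {
      "first": "Alice",
      "last": "White"
    }
  ]
}

The user field is dynamically added as an object field, and the document is converted to the following flat form:

{
  "group": "fans",
  "user.first": ["alice", "john"],
  "user.last": ["smith", "white"]
}

After flattening, user.first and user.last become multi-value fields, and the association between alice and white is lost. The following search therefore matches the document above:

GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "user.first": "Alice"
          }
        },
        {
          "match": {
            "user.last": "Smith"
          }
        }
      ]
    }
  }
}
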
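It should not match. To index arrays of objects without this problem, use the nested type instead of the object type; nested keeps each object in the array independent. The nested type indexes each object in the array as a separate hidden document, so each nested object can be searched on its own. The mapping below declares the user field as nested:

PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "user": {
          "type": "nested"
        }
      }
    }
  }
}

Run the query above again and the document is no longer matched.

Indexing a document that contains 100 nested objects actually indexes 101 documents, because each nested object is indexed as a separate document. To guard against mappings that define too many nested fields, the number of nested fields that can be defined per index is limited to 50 (the index.mapping.nested_fields.limit setting).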
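
The difference between the object and nested behavior can be simulated in plain Python. The sketch below is a toy model under simple assumptions (terms are lowercased words, and a search needs both terms to be present); it is not Elasticsearch code, and all names are illustrative.

```python
def terms(text):
    """Very rough analysis: lowercase and split on whitespace."""
    return set(text.lower().split())


users = [
    {"first": "John", "last": "Smith"},
    {"first": "Alice", "last": "White"},
]

# object mapping: all objects collapse into one set of terms per field.
flattened = {
    "user.first": set().union(*(terms(u["first"]) for u in users)),
    "user.last": set().union(*(terms(u["last"]) for u in users)),
}

# nested mapping: each object keeps its own terms, like a hidden document.
hidden_docs = [
    {"user.first": terms(u["first"]), "user.last": terms(u["last"])}
    for u in users
]


def both_match(fields, first, last):
    return first.lower() in fields["user.first"] and last.lower() in fields["user.last"]


def object_search(first, last):
    return both_match(flattened, first, last)


def nested_search(first, last):
    return any(both_match(doc, first, last) for doc in hidden_docs)


print(object_search("Alice", "Smith"))
print(nested_search("Alice", "Smith"))
print(nested_search("Alice", "White"))
```

In the flattened model, "Alice" and "Smith" are both present somewhere in the document, so the search succeeds even though no single user has that name. In the nested model each user is checked separately, so only real first/last pairs match.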

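Special type: nested

By default, inner objects are mapped with the object data type.
The nested type is a specialized object type that allows arrays of objects to be indexed and queried independently of each other. Using the nested data type avoids the incorrect matches caused by flattening.

Flattening example

Incorrect example

Add a document:

PUT my_index/_doc/1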
{
  "group" : "fans",
  "user" : [ 
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}

Run the query:

GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "user.first": "Alice" }},
        { "match": { "user.last":  "Smith" }}
      ]
    }
  }
}

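Expected result: the query returns nothing.
Actual result: the document is returned, with both objects in user.
Conclusion: the result does not match the expectation (the business logic).
Solution: use the nested type.

Correct example

Recreate the index with user mapped as nested, then index the same document again (the type of an existing field cannot be changed in place):

PUT my_index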
{
  "mappings": {
    "_doc": {
      "properties": {
        "user": {
          "type": "nested" 
        }
      }
    }
  }
}

Run the query:

GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "user.first": "Alice" }},
        { "match": { "user.last":  "Smith" }}
      ]
    }
  }
}

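Expected result: the query returns nothing.
Actual result: no documents are returned.
Conclusion: the result matches the expectation (the business logic).

nested query syntax

To query nested objects, wrap the conditions in a nested query and set path to the nested field. The conditions are then evaluated against each nested object separately.

GET /my_index/_search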
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            {"match": {"user.first": "Alice"}},
            {"match": {"user.last" : "White"}} 
          ]
        }
      }
    }
  }
}

nested aggregation syntax

GET my_index/_search
{
  "query": {
    "match": { "group": "fans"    }
  },
  "aggs": {
    "fan": {
      "nested": {    "path": "user"  }
    }
  }
}

zrh

https://zrh-main.github.io/2025/03/16/Elastic%20Stach/ElasticSearch%20%E6%98%A0%E5%B0%84%E7%B1%BB%E5%9E%8B/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit zrh when reposting.
