Today there was a Discuss post on “Elasticsearch data type” that demonstrates one of the more confusing features in Elasticsearch. But if you are familiar with Elasticsearch, it makes an excellent puzzle — so follow along and test your knowledge.
First, add a document:
PUT users/_doc/1
{
  "user_id": 1
}
This index uses a dynamic mapping, so which data type does that default to for the user_id field?
Default Numeric Type
GET users/_mapping shows the answer:
{
  "users" : {
    "mappings" : {
      "properties" : {
        "user_id" : {
          "type" : "long"
        }
      }
    }
  }
}
So your user_id field is a long. Next, you try to add four more documents:
PUT users/_doc/2
{
  "user_id": 2
}

PUT users/_doc/3
{
  "user_id": "3"
}

PUT users/_doc/4
{
  "user_id": 4.5
}

PUT users/_doc/5
{
  "user_id": "5.1"
}
The document with ID 2 is the same as our first one, so that will work. But what happens if you try to store a string, a float, or even a stringified float value in a long field?
Handling Dirty Data
It still works. But why?
By default, Elasticsearch will coerce data to clean it up. Quoting from its documentation:
Coercion attempts to clean up dirty values to fit the datatype of a field. For instance:
- Strings will be coerced to numbers.
- Floating points will be truncated for integer values.
Especially for quoted numbers, this makes sense, since some systems err on the side of quoting too much rather than too little. Perl used to be one of the well-known offenders there, and coercing would helpfully clean this up.
Sounds reasonable. You want to verify this by retrieving the documents with GET users/_search, expecting the user_id values 1, 2, 3, 4, and 5, right? But this is the result you actually get — focus on the array in hits.hits:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "users",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "user_id" : 1
        }
      },
      {
        "_index" : "users",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "user_id" : 2
        }
      },
      {
        "_index" : "users",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "user_id" : "3"
        }
      },
      {
        "_index" : "users",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.0,
        "_source" : {
          "user_id" : 4.5
        }
      },
      {
        "_index" : "users",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 1.0,
        "_source" : {
          "user_id" : "5.1"
        }
      }
    ]
  }
}
Is this a bug? How could you store 4.5 in a long? If you recheck the mapping with GET users/_mapping, it is still returning "type": "long".
_source Is Only an Illusion
The final piece in this puzzle is that Elasticsearch never changes the _source. But the indexed user_id field is a long, as you would expect. You can verify this by running an aggregation on the field:
GET users/_search
{
  "size": 0,
  "aggs": {
    "my_sum": {
      "sum": {
        "field": "user_id"
      }
    }
  }
}
This gives the correct result of 1 + 2 + 3 + 4 + 5 = 15 (the string "3" was coerced to 3, while 4.5 and "5.1" were truncated to 4 and 5):
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : []
  },
  "aggregations" : {
    "my_sum" : {
      "value" : 15.0
    }
  }
}
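As one more sanity check, a plain term query also gives the truncation away. A small sketch, using nothing beyond Elasticsearch's regular query DSL:

GET users/_search
{
  "query": {
    "term": {
      "user_id": 4
    }
  }
}

This should match document 4, even though its _source still shows 4.5, because the indexed value really was truncated to 4.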
By the way, the aggregation's value now defaults to a floating-point representation, which you could change with the extra parameter "format": "0". That would add a "value_as_string" : "15" to the result.
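In case you want to see it, the tweaked request could look like this; the format parameter on the sum aggregation is a documented option, and everything else is unchanged from the query above:

GET users/_search
{
  "size": 0,
  "aggs": {
    "my_sum": {
      "sum": {
        "field": "user_id",
        "format": "0"
      }
    }
  }
}

But let's leave it at that.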
Conclusion
I hope you are less confused than before or at least enjoyed the puzzle. As a parting note, be aware that coerce might be removed in the future since it is a trappy feature — especially around truncating floating-point numbers 😄.
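If you would rather reject dirty values up front, you can already disable coercion per field through the coerce mapping parameter. A minimal sketch, assuming a fresh index (the name users_strict is made up; the parameter itself is documented):

PUT users_strict
{
  "mappings": {
    "properties": {
      "user_id": {
        "type": "long",
        "coerce": false
      }
    }
  }
}

With this mapping, indexing a value like "5.1" into user_id is rejected with an error instead of being truncated silently. There is also an index-level index.mapping.coerce setting if you want to change the default for every field at once.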