Queries, scans, indexes, pagination, and parallelism are common concepts in databases, but each database differs in the specifics. It is vital to understand those specifics in order to get the most out of a database. In Aerospike, queries and indexes play a key role in realizing its speed-at-scale objective. The goal of this post is to help developers better understand Aerospike's capabilities in these areas.
A query is a request for data that meets specific criteria. The criteria or conditions that the result must meet are called the query predicate.
In Aerospike, a query is processed using one of these indexes: the primary index, a set index, or a secondary index.
Primary and Set Indexes
Primary Index
In Aerospike, there is only one system-defined primary index per namespace, built on the digest of records. The digest is a RIPEMD-160 hash of the tuple (set, user-key), where a set, which is equivalent to a table, is an application-defined grouping of records in a namespace, which is equivalent to a database or schema, and the user key is an application-provided identifier that is unique within the set. The primary index is not optional, is created automatically, and cannot be removed, nor can the field on which it is defined be changed. While another index may be created on a bin, which is equivalent to a column, holding a primary key, it is considered a secondary index. For example, for a set whose records use employee-number as the user key and have an "ssn" bin for the social security number, an index created on the ssn bin is a secondary index.
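As a minimal sketch with the Java client, the digest is computed client-side when a Key is constructed; the namespace, set, and user key below are illustrative:

```java
import com.aerospike.client.Key;

public class DigestExample {
    public static void main(String[] args) {
        // Hypothetical namespace "test", set "employees", user key 101.
        // The 20-byte RIPEMD-160 digest of (set, user-key) is computed client-side
        // and is what the primary index is built on.
        Key key = new Key("test", "employees", 101);
        System.out.println("digest length = " + key.digest.length + " bytes");
    }
}
```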
Scan or Primary-Index Query
In Aerospike, the general way of processing data requests is a scan with a filter expression that captures the query predicate. For example, for the request "get records from the employees set where employee-number is in the range 100-200", a scan is performed with a filter expression that captures the predicate "employee-number is in the range 100-200". The primary index is used to scan the namespace, and therefore a scan is also called a primary-index query.
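Below is a hedged sketch of such a primary-index query using the Java client, assuming the employee number is also stored in an integer bin named emp_no (the bin, namespace, and set names are illustrative):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.exp.Exp;
import com.aerospike.client.policy.QueryPolicy;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

public class PrimaryIndexQueryExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("localhost", 3000);

        // No secondary-index filter is set, so this is a primary-index query (scan).
        Statement stmt = new Statement();
        stmt.setNamespace("test");
        stmt.setSetName("employees");

        // Filter expression capturing the predicate "employee number in range 100-200".
        QueryPolicy policy = new QueryPolicy();
        policy.filterExp = Exp.build(
            Exp.and(
                Exp.ge(Exp.intBin("emp_no"), Exp.val(100)),
                Exp.le(Exp.intBin("emp_no"), Exp.val(200))));

        RecordSet rs = client.query(policy, stmt);
        try {
            while (rs.next()) {
                System.out.println(rs.getRecord());
            }
        } finally {
            rs.close();
            client.close();
        }
    }
}
```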
Set Indexes
A set index can optionally be created for potential performance improvements when querying a set. In the previous example, the request can execute faster if a set index exists on the employees set. When a set index has been created, it is used for a set query instead of the primary index.
Order
A query using the primary index or a set index follows the internal, deterministic digest order. Because the digest is a hashed value, this order is not meaningful to the application. For example, while employee numbers have a recognizable order, records with an employee-number user key will be scanned in a seemingly random order. In general, any ordering must be implemented in the application, as query results are typically gathered from multiple partitions on multiple nodes.
Query Processing
A filter expression is a boolean expression over record metadata and/or record data, built from supported operators and methods. The expression is specified as part of the operation policy and is evaluated for every record; only records that pass the filter are processed further. If no filter expression is specified, all records in the set or namespace are processed.
A primary-index or set-index query can be performed in sync or async mode. In sync mode, the application thread is blocked until the results are returned, whereas in async mode, the application thread is not blocked and the results are returned in a callback.
Code examples of queries using a filter expression can be found here and here.
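For the async mode, a sketch using the Java client's NIO event loops is shown below; the namespace and set names are illustrative, and error handling is kept minimal:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.AerospikeException;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.async.EventLoops;
import com.aerospike.client.async.NioEventLoops;
import com.aerospike.client.listener.RecordSequenceListener;
import com.aerospike.client.policy.ClientPolicy;
import com.aerospike.client.policy.QueryPolicy;
import com.aerospike.client.query.Statement;

public class AsyncQueryExample {
    public static void main(String[] args) {
        // Event loops must be created before the client and passed in its policy.
        EventLoops eventLoops = new NioEventLoops(1);
        ClientPolicy clientPolicy = new ClientPolicy();
        clientPolicy.eventLoops = eventLoops;
        AerospikeClient client = new AerospikeClient(clientPolicy, "localhost", 3000);

        Statement stmt = new Statement();
        stmt.setNamespace("test");
        stmt.setSetName("employees");

        // Results arrive via callbacks; the application thread is not blocked.
        client.query(eventLoops.next(), new RecordSequenceListener() {
            public void onRecord(Key key, Record record) throws AerospikeException {
                System.out.println(record);
            }
            public void onSuccess() {
                System.out.println("query complete");
                client.close();
                eventLoops.close();
            }
            public void onFailure(AerospikeException ae) {
                ae.printStackTrace();
                client.close();
                eventLoops.close();
            }
        }, new QueryPolicy(), stmt);
    }
}
```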
Secondary Indexes
A secondary index can be optionally defined to speed up processing of queries.
Mapping Bin or CDT Values to Records
A secondary index is defined on a bin (column) or on an element at any level of a Collection Data Type (CDT, that is, a List or a Map), over its integer, string, or geospatial values. A secondary index maps a value to one or more records; it does not point to a location within a record (such as a specific bin or CDT element). When a secondary index is defined on a CDT, every CDT value of the indexed type maps to the record. So a secondary index on the List [1,2,3] in record R will have the mappings 1->R, 2->R, and 3->R.
A secondary index is created on a set (table) of records. In Aerospike Database 6.1+, a secondary index created with a null set (or no set parameter) encompasses all records in the namespace; in earlier versions, it spanned only the records that were created with a null set parameter. Consequently, in 6.1+ a secondary index cannot be scoped to only those records that are outside any set, and the best-practice recommendation is to always create a record in a (non-null) set.
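A hedged sketch of creating secondary indexes with the Java client follows; the namespace, set, bin, and index names are all illustrative:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.query.IndexCollectionType;
import com.aerospike.client.query.IndexType;
import com.aerospike.client.task.IndexTask;

public class CreateIndexExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("localhost", 3000);

        // String secondary index on the "ssn" bin of the "employees" set.
        IndexTask t1 = client.createIndex(null, "test", "employees",
                "idx_employees_ssn", "ssn", IndexType.STRING);
        t1.waitTillComplete();

        // Numeric index over the values of a List bin "scores":
        // each integer element of the list maps to its record.
        IndexTask t2 = client.createIndex(null, "test", "employees",
                "idx_employees_scores", "scores", IndexType.NUMERIC,
                IndexCollectionType.LIST);
        t2.waitTillComplete();

        client.close();
    }
}
```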
Indexed Value Types
It is important to note that an index is strongly typed, meaning it holds only values of a specific type: integer, string, or geospatial. A bin or a CDT element in Aerospike, however, is not strongly typed and can hold a value of any type. An index maps only the values of the index type in the bin or CDT element; other values are ignored. For example, an integer index on the List [1, 2, 3, "a", "b", "c"] will index only 1, 2, and 3, and ignore the string elements. A bin or CDT element can have multiple indexes defined on it to allow queries on different types of values. In this example, to retrieve records that have, say, the value "c" in the list using a secondary index, a string index must also be created on the same List.
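As an illustration of typed indexes on a CDT, the sketch below (with hypothetical namespace, set, bin, and index names) creates a string index over a list bin and retrieves records whose list contains the string "c":

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.policy.QueryPolicy;
import com.aerospike.client.query.Filter;
import com.aerospike.client.query.IndexCollectionType;
import com.aerospike.client.query.IndexType;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

public class TypedListIndexExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("localhost", 3000);

        // A string index over the list bin "items" (which may also hold integers);
        // only the string elements are indexed, other types are ignored.
        client.createIndex(null, "test", "demo", "idx_demo_items_str",
                "items", IndexType.STRING, IndexCollectionType.LIST)
              .waitTillComplete();

        // Secondary-index query: records whose "items" list contains the string "c".
        Statement stmt = new Statement();
        stmt.setNamespace("test");
        stmt.setSetName("demo");
        stmt.setFilter(Filter.contains("items", IndexCollectionType.LIST, "c"));

        RecordSet rs = client.query(new QueryPolicy(), stmt);
        try {
            while (rs.next()) {
                System.out.println(rs.getRecord());
            }
        } finally {
            rs.close();
            client.close();
        }
    }
}
```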
Indexing on Custom Values
In some cases, the values to index may not be available in one place, or may not be stored in a record or CDT element at all. For example, consider a specific object field "a" in an array of objects: [{"a": 1, "b": 11}, {"a": 2, "b": 22}, ...]. In such cases, these values can be copied, or computed and stored, into a bin or a CDT that can then be indexed. In this example, create and index a List bin a-values: [1, 2, ...]. The indexed bin or CDT must be kept in sync with changes to the underlying values: if the field "a" is updated in any object, that update must be reflected in the a-values list.
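One possible way to keep such a derived list in sync, sketched with the Java client under the assumption of hypothetical bins named objects and a_values, is to update both bins in a single operate() call:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Value;
import com.aerospike.client.cdt.CTX;
import com.aerospike.client.cdt.ListOperation;
import com.aerospike.client.cdt.MapOperation;
import com.aerospike.client.cdt.MapPolicy;
import com.aerospike.client.policy.WritePolicy;

public class SyncIndexedListExample {
    // Updates field "a" of the i-th object in the "objects" list bin and the
    // corresponding element of the indexed "a_values" list bin in one atomic call.
    static void updateA(AerospikeClient client, Key key, int i, long newA) {
        client.operate(new WritePolicy(), key,
            // objects: [{"a": 1, "b": 11}, {"a": 2, "b": 22}, ...]
            MapOperation.put(MapPolicy.Default, "objects",
                Value.get("a"), Value.get(newA), CTX.listIndex(i)),
            // a_values: [1, 2, ...] holds the indexed copies of field "a"
            ListOperation.set("a_values", i, Value.get(newA)));
    }

    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("localhost", 3000);
        updateA(client, new Key("test", "demo", "doc-1"), 1, 42);
        client.close();
    }
}
```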
Uniqueness and Order
A secondary index cannot be defined as unique or sorted. That is, the secondary index does not support the uniqueness constraint on the field, although it can be defined on a bin that holds unique values, such as the ssn bin in the earlier example. As explained above, it is up to the application to order query results. Also, composite indexes over multiple bins are currently not directly supported, but can be implemented as described earlier in the section Indexing on Custom Values.
Secondary-Index Query
A query using a secondary index is called a secondary-index query, to be distinguished from a primary-index query. A secondary-index query will fail if the supporting secondary index does not exist.
Query Processing
The secondary-index lookup identifies the records to process. A secondary-index query may also specify a filter expression, in which case the secondary-index predicate is processed first, the filter expression is then evaluated for the resulting records, and the matching records are processed further. For efficient processing, the most selective available index should be used for the secondary-index predicate, with the remaining conditions expressed as the filter expression. For example, to find all black Skoda cars in California, a secondary index on the manufacturer (not on the color) should be used, along with a filter expression for the black color.
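Below is a hedged sketch of the Skoda example with the Java client, assuming a string secondary index already exists on a hypothetical mfg bin in a cars set:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.exp.Exp;
import com.aerospike.client.policy.QueryPolicy;
import com.aerospike.client.query.Filter;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

public class CarQueryExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("localhost", 3000);

        // Secondary-index predicate on the more selective "mfg" bin
        // (assumes a string secondary index exists on "mfg").
        Statement stmt = new Statement();
        stmt.setNamespace("test");
        stmt.setSetName("cars");
        stmt.setFilter(Filter.equal("mfg", "Skoda"));

        // Remaining condition as a filter expression on the "color" bin.
        QueryPolicy policy = new QueryPolicy();
        policy.filterExp = Exp.build(
            Exp.eq(Exp.stringBin("color"), Exp.val("black")));

        RecordSet rs = client.query(policy, stmt);
        try {
            while (rs.next()) {
                System.out.println(rs.getRecord());
            }
        } finally {
            rs.close();
            client.close();
        }
    }
}
```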
Code examples of a scan using a filter expression can be found here.
Find additional details on CDT indexing in the blog post Query JSON Document Faster (and More) with CDT Indexing.
Pagination
The application can get the results in a stream of smaller chunks by using pagination. Pagination is supported with all types of queries and indexes.
The chunk size limit is specified in the max-records policy parameter. Note that a chunk may contain fewer records than the limit, because the limit is divided evenly across all server nodes while the matching data may be distributed unevenly with respect to the query predicate.
The same query handle is used to get subsequent chunks until all records are retrieved.
Check out a concrete pagination example and code here.
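As a sketch of the pagination loop described above, using the Java client (the names and chunk size are illustrative), the partition filter serves as the query handle across chunks:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.policy.QueryPolicy;
import com.aerospike.client.query.PartitionFilter;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

public class PaginationExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("localhost", 3000);

        Statement stmt = new Statement();
        stmt.setNamespace("test");
        stmt.setSetName("employees");
        stmt.setMaxRecords(1000);   // chunk size limit (max-records)

        // The same PartitionFilter is reused as the query handle for each chunk.
        PartitionFilter filter = PartitionFilter.all();
        QueryPolicy policy = new QueryPolicy();

        while (true) {
            RecordSet rs = client.queryPartitions(policy, stmt, filter);
            try {
                int count = 0;
                while (rs.next()) {
                    count++;   // process each record here
                }
                System.out.println("chunk: " + count + " records");
            } finally {
                rs.close();
            }
            if (filter.isDone()) {
                break;   // all records have been retrieved
            }
        }
        client.close();
    }
}
```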
Parallel Query Processing
Aerospike distributes a namespace's records across 4096 uniform partitions and allows separate queries over them for parallelism. Queries can be split into independent parallel sub-queries over one or more partitions, providing the parallelism needed to match the required throughput. Further, each partition can be subdivided into N sub-partitions by adding the modulo filter expression digest % N == i for 0 <= i < N. Note that the filter-expression evaluation for the sub-partitions is purely metadata based, the digest being record metadata. Since record metadata is held in memory, the evaluation requires no access to data on the SSD. A sub-partition reads only its own records, minimizing the SSD reads across the multiple sub-partitions and resulting in maximum parallel throughput.
Using this scheme, a large number of workers on a platform such as Spark (which supports up to 32K workers) can uniformly spread the data among workers for processing via an equal number of mutually exclusive and collectively exhaustive sub-streams using partition queries in combination with the modulo filter expression as described above. The appropriate data scale, throughput, and latency can be achieved by adjusting the cluster size as well as the number of attached SSD devices per node.
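The sketch below illustrates one such sub-query with the Java client, combining a single-partition query with the digest-modulo filter expression; the namespace, set, and sub-partition count are illustrative:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.exp.Exp;
import com.aerospike.client.policy.QueryPolicy;
import com.aerospike.client.query.PartitionFilter;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

public class ParallelSubQueryExample {
    // Reads one sub-partition: partition "part", sub-partition i of n,
    // selected with the filter expression digest % n == i.
    static void readSubPartition(AerospikeClient client, int part, int n, int i) {
        Statement stmt = new Statement();
        stmt.setNamespace("test");
        stmt.setSetName("employees");

        QueryPolicy policy = new QueryPolicy();
        policy.filterExp = Exp.build(
            Exp.eq(Exp.digestModulo(n), Exp.val(i)));

        // Query a single partition; the sub-partition is carved out by the expression.
        RecordSet rs = client.queryPartitions(policy, stmt,
                PartitionFilter.range(part, 1));
        try {
            while (rs.next()) {
                // process record
            }
        } finally {
            rs.close();
        }
    }

    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("localhost", 3000);
        int n = 4;
        for (int i = 0; i < n; i++) {
            // In practice, each (partition, sub-partition) pair goes to its own worker.
            readSubPartition(client, 0, n, i);
        }
        client.close();
    }
}
```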
Processing Using Indexes
In addition to retrieving records, one can perform additional operations on the selected records:
1. Project (retrieve) specific bins and computed expressions.
2. Further process the selected records with read or write operations.
The first is not conceptually different from retrieving entire records, and hence we will not discuss it further. The second is discussed below.
It is worth mentioning that processing multiple records using indexes is different from batch processing, where records are specified by their keys and not by a predicate. Please refer to the blog post Batch Operations in Aerospike to learn more.
Read and Write Operations
In processing operations using indexes, read and write operations cannot be mixed; either only read or only update operations can be specified for processing.
A record may match multiple times for a given condition when using a collection index type. There is no guarantee that a record will be de-duplicated for the same value. In such cases,
- The application must be prepared to handle duplication within results.
- Write operations must deal with any duplication appropriately (for example, make them idempotent or include logic to apply the operations only once).
Read Operations
Read operations are specified using:
- bin (transaction) operations for efficient access to complex data types such as HyperLogLog, GeoJSON, Blob, List, and Map, or
- a stream UDF for aggregate processing, as sketched below.
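For the stream-UDF path, a hedged Java sketch follows; the Lua module example_aggs and its function count_records are hypothetical and would have to be written and registered on the server first:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.policy.QueryPolicy;
import com.aerospike.client.query.ResultSet;
import com.aerospike.client.query.Statement;

public class AggregateQueryExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("localhost", 3000);

        Statement stmt = new Statement();
        stmt.setNamespace("test");
        stmt.setSetName("employees");
        // Hypothetical Lua stream UDF: module "example_aggs", function "count_records",
        // assumed to be registered on the server beforehand.
        stmt.setAggregateFunction("example_aggs", "count_records");

        ResultSet rs = client.queryAggregate(new QueryPolicy(), stmt);
        try {
            while (rs.next()) {
                System.out.println("aggregate result: " + rs.getObject());
            }
        } finally {
            rs.close();
            client.close();
        }
    }
}
```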
Background Updates
Records can also be updated in a “background mode” in conjunction with a query. Such background updates work differently from read operations: the entire operation is processed in the background. The application can only check the status of a background operation and cannot obtain granular results from it. Any record-specific status must be ascertained, and corrected if necessary, separately. Background updates are an efficient way to update a large number of records.
Note that, unlike read operations, where the application receives record-specific results, updates using indexes are not supported in “foreground” sync and async modes.
Update operations are specified using bin (transaction) operations or a record UDF.
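A hedged sketch of a background update with the Java client; the status_code and status bins (and the other names) are hypothetical:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Operation;
import com.aerospike.client.exp.Exp;
import com.aerospike.client.policy.WritePolicy;
import com.aerospike.client.query.Statement;
import com.aerospike.client.task.ExecuteTask;

public class BackgroundUpdateExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("localhost", 3000);

        // Select the records to update; here, a primary-index query with a
        // filter expression on a hypothetical integer bin "status_code".
        Statement stmt = new Statement();
        stmt.setNamespace("test");
        stmt.setSetName("employees");

        WritePolicy policy = new WritePolicy();
        policy.filterExp = Exp.build(
            Exp.eq(Exp.intBin("status_code"), Exp.val(0)));

        // The update runs entirely on the server in the background;
        // the client only gets a task handle to poll for completion.
        ExecuteTask task = client.execute(policy, stmt,
            Operation.put(new Bin("status", "inactive")));
        task.waitTillComplete();

        client.close();
    }
}
```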
Find additional code examples here.
Queries and indexes are important to realize speed at scale. This post describes key aspects of indexes and queries in Aerospike to help developers better understand these capabilities and utilize them effectively.