Neel Phadnis for Aerospike

Posted on Sep 13, 2022 • Originally published at developer.aerospike.com

Building Large-Scale Real-Time JSON Applications

#aerospike #json #document #realtime


“Real-time describes various operations or processes that respond to inputs reliably within a specified time interval (Wikipedia).”

Real-time data must be processed soon after it is generated; otherwise its value is diminished. Likewise, real-time applications must respond within a tight timeframe; otherwise the user experience and business results are impaired. It is critical for real-time applications to have reliably fast access to all data, real-time or otherwise.

The number of real-time interactions between people and devices continues to grow. Leveraging real-time data is still a competitive edge in some areas but its use is expected in others. Up-to-the-moment relevant information is expected to be applied in delivering the best possible customer experience or business decisions.

Much of today's data is generated, transferred, stored, and consumed in the JSON format, including real-time data such as feeds from IoT sensors and social networks, and prior data such as user profiles and product catalogs. JSON data is therefore ubiquitous and growing in use. The best possible real-time decisions, increasingly based on AI/ML algorithms, will be arrived at using continually updated massive data sets.

Overview

This article discusses the database perspective on building large-scale real-time JSON applications and touches upon the following key topics:

What to look for in a real-time data platform
How to organize JSON documents for speed at scale
The core JSON functionality required for ease of development

Database for Large-Scale JSON Applications

The key requirements in a database to build such applications are described below, along with how the Aerospike Database delivers them.

Reliably fast random access at scale

Reliably fast response time for read and write operations at any scale and any read-write workload mix is required to meet the real-time contract. Aerospike delivers it through:

Fast and uniform hash-based data distribution to all nodes for optimal resource utilization
Hybrid Memory Architecture (HMA) to store indexes and data in DRAM, SSD, and other devices to provide cost-effective fast storage capacity
Optimized processing of writes and garbage collection for predictable response
One-hop access to all data from the application
Smart Client that handles cluster transitions and data movements transparently
Primary and secondary indexes for fast access
Async and background processing modes for greater efficiency
Multi-op requests to perform many single-record operations atomically in one request
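The hash-based distribution in the first item above can be sketched in a few lines. Per the Aerospike architecture documentation, each key is hashed into a digest (RIPEMD-160) and a few bits of that digest select one of 4096 partitions, which are spread evenly over the nodes. The sketch below is only an illustration under stated assumptions: it substitutes SHA-256 because RIPEMD-160 is not in the Java standard library, and the class name, method name, and exact bit selection are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class PartitionSketch {
    // Partition count described in the Aerospike documentation.
    static final int PARTITIONS = 4096;

    // Maps (set, user key) to a partition id in [0, PARTITIONS).
    // Assumption: SHA-256 stands in for the real digest function.
    static int partitionId(String setName, String userKey) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(setName.getBytes(StandardCharsets.UTF_8));
            md.update(userKey.getBytes(StandardCharsets.UTF_8));
            byte[] digest = md.digest();
            // Take 12 bits of the digest (illustrative choice of bytes).
            int bits = ((digest[1] & 0xFF) << 8) | (digest[0] & 0xFF);
            return bits & (PARTITIONS - 1);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required by the JDK spec", e);
        }
    }

    public static void main(String[] args) {
        // Count how many of 100,000 synthetic keys land in each partition.
        int[] counts = new int[PARTITIONS];
        for (int i = 0; i < 100_000; i++) {
            counts[partitionId("events", "event-" + i)]++;
        }
        int min = Integer.MAX_VALUE;
        int max = 0;
        for (int c : counts) {
            min = Math.min(min, c);
            max = Math.max(max, c);
        }
        System.out.println("keys per partition: min=" + min + " max=" + max);
    }
}
```

Because the partition depends only on the key's hash, a client that knows the partition map can send each request straight to the owning node, which is what makes one-hop access possible.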

Fast ingest rate

The database must support fast ingestion speeds so that surges in real-time data feeds do not overwhelm the system or result in data loss.

In Aerospike Database 6.0+, batch operations for read, write, delete, and UDF operations are supported so that ingest can achieve the necessary high throughput.

Fast queries

The database must handle concurrent queries over large data efficiently. To this end, Aerospike provides various indexes and granular control over parallel processing of queries.

Convenient JSONPath based access

The JSONPath-based Document API offers a convenient way to access and modify specific elements within a document. Aerospike support for JSON documents in 6.0+ is discussed below.

Rich Document Functionality

JSON documents are stored in the database as a Collection Data Type (CDT). CDTs are essentially Map and List data types that offer rich functionality to JSON applications.
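To make the Map and List representation concrete, here is a minimal plain-Java sketch of a JSON document held as nested `Map` and `List` values, with a tiny path walker. It is not the Document API and not a JSONPath implementation; the `walk` helper, the simplified path syntax (`name[index].name`), and the sample document are all hypothetical.

```java
import java.util.List;
import java.util.Map;

final class CdtShapeSketch {
    // Walks a simplified path such as "events[1].id" through nested
    // Map (JSON object) and List (JSON array) values.
    static Object walk(Object node, String path) {
        for (String step : path.split("\\.")) {
            int bracket = step.indexOf('[');
            String name = bracket < 0 ? step : step.substring(0, bracket);
            if (!name.isEmpty()) {
                node = ((Map<?, ?>) node).get(name);   // descend into an object
            }
            while (bracket >= 0) {
                int close = step.indexOf(']', bracket);
                int index = Integer.parseInt(step.substring(bracket + 1, close));
                node = ((List<?>) node).get(index);    // descend into an array
                bracket = step.indexOf('[', close);
            }
        }
        return node;
    }

    public static void main(String[] args) {
        // { "sensor": "s-17", "events": [ {id, timestamp}, {id, timestamp} ] }
        Map<String, Object> doc = Map.<String, Object>of(
            "sensor", "s-17",
            "events", List.<Object>of(
                Map.<String, Object>of("id", "e1", "timestamp", 1000L),
                Map.<String, Object>of("id", "e2", "timestamp", 2000L)));

        System.out.println(walk(doc, "events[1].id")); // prints e2
    }
}
```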

Efficient storage and transfer

CDTs are stored and transferred efficiently in the MessagePack format.

Rich API

The API supports many common List and Map usages that involve complex processing. They are processed entirely on the server side to eliminate retrieval of data to the client side.

Well integrated into other performance features

CDTs are well integrated into various performance features including Expressions, batch requests, multi-op requests, and secondary indexes.

CDT operations can be used in Expressions that offer efficient server side execution.
Batch requests allow operations on multiple documents in one request.
A multi-op request allows many operations on one document to be performed in one request. For instance, in the same request, you can add items to a JSON array, sort it, and get its new size and its top N items.
CDT elements at any nested level can be indexed for fast and convenient access, described further below.
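The multi-op example in the list above (add items, sort, get the size, get the top N) can be pictured with a plain-Java stand-in. This sketch runs the four steps on an in-memory list purely to show the semantics; it does not call the Aerospike client, and the class, method, and `Result` type are hypothetical. With the real client the steps would be submitted together as list operations in a single request and executed on the server.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class MultiOpSketch {
    // Holds the values returned by the two read steps.
    static final class Result {
        final int size;
        final List<Long> topN;

        Result(int size, List<Long> topN) {
            this.size = size;
            this.topN = topN;
        }
    }

    // Applies the four steps to the list in order, mutating `scores`.
    static Result addSortSizeTopN(List<Long> scores, List<Long> newItems, int n) {
        scores.addAll(newItems);                 // 1. add items to the array
        Collections.sort(scores);                // 2. sort it (ascending)
        int size = scores.size();                // 3. get its new size
        int from = Math.max(0, size - n);
        List<Long> top = new ArrayList<>(scores.subList(from, size));
        Collections.reverse(top);                // 4. top N, highest first
        return new Result(size, top);
    }

    public static void main(String[] args) {
        List<Long> scores = new ArrayList<>(List.of(5L, 1L));
        Result r = addSortSizeTopN(scores, List.of(9L, 3L), 2);
        System.out.println("size=" + r.size + " top=" + r.topN); // size=4 top=[9, 5]
    }
}
```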

Synchronizing data with other systems

Aerospike offers control over replicating all or a subset of the data efficiently to other Aerospike clusters through Cross-Data-Center Replication (XDR). Edge-core synchronization is often necessary for collecting real-time data as well as delivering real-time user experience at the edge. Various connectors facilitate convenient and fast synchronization with other systems as described below.

Easy integration with real-time data streams

Aerospike provides streaming connectors to integrate with standard streaming platforms like Kafka, Pulsar, and JMS, and also allows CDC streams to be delivered to any HTTP endpoint.

Fast access from data processing and analytics platforms

The Aerospike Spark and Presto (Trino) connectors enable analytics, AI/ML, and other processing on the respective platforms.

Organizing for Scale and Speed

A critical part of building large-scale JSON applications is to ensure the JSON objects are organized efficiently in the database for optimal storage and access.

Documents may be organized in Aerospike in one or more dedicated sets, over one or more namespaces, to reflect ingest, access, and removal patterns. Multiple documents may be grouped and stored in one record, either in separate bins (columns) or as sub-documents in a container group document.

Record keys are constructed as a combination of the collection-id and the group-id to provide fast logical access as well as group-oriented enumeration of documents. For example, the ticker data for a stock can be organized in multiple records that have keys consisting of the stock symbol (collection-id) + date (group-id).

Multiple documents can be accessed using either a scan with a filter expression, a query on a secondary index, or both. A filter expression consists of values and properties of the elements in JSON, for example, an array larger than a certain size or a certain value being present in a sub-tree. A secondary index defined on a basic or collection type provides fast value-based queries as described below.
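The stock ticker keying described above (symbol as collection-id, date as group-id) might look like the following sketch. The `:` separator, the ISO date format, and the class and method names are assumptions made for illustration; the article does not prescribe a key format.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

final class RecordKeySketch {
    // Builds a user key such as "ACME:2022-09-13" from the
    // collection-id (stock symbol) and group-id (trading date).
    static String recordKey(String collectionId, LocalDate groupId) {
        return collectionId + ":" + groupId; // LocalDate prints as ISO-8601
    }

    // Enumerates the keys of all records for one symbol over an
    // inclusive date range, without scanning the set.
    static List<String> keysForRange(String collectionId, LocalDate from, LocalDate to) {
        List<String> keys = new ArrayList<>();
        for (LocalDate d = from; !d.isAfter(to); d = d.plusDays(1)) {
            keys.add(recordKey(collectionId, d));
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String> keys = keysForRange(
            "ACME", LocalDate.of(2022, 9, 12), LocalDate.of(2022, 9, 14));
        System.out.println(keys); // [ACME:2022-09-12, ACME:2022-09-13, ACME:2022-09-14]
    }
}
```

Because the keys are computable from the symbol and date alone, an application can fetch a known range of records directly by key (for example in a batch read) rather than querying for them.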

Example: Real-Time Events Data

Real-time event streams can be ingested and stored in Aerospike as JSON documents. To allow access by event-id as well as timestamp, they can be organized as follows.

Record key:(namespace, set, <event_id>)
JSON bin:
{ 
    id: <event-id>,
    timestamp: <ts>,
    … 
}

Accessing a document by event-id is a simple record lookup because the event-id is the record key. An exact match or range query on timestamp is possible by defining an integer index on it.
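Conceptually, an integer index on timestamp behaves like a sorted map from timestamp to the events that carry it. The sketch below uses a `TreeMap` as a stand-in to show why both exact-match and range lookups are cheap; it says nothing about how Aerospike implements its secondary indexes, and all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

final class TimestampIndexSketch {
    // Sorted map: timestamp -> ids of the events with that timestamp.
    private final NavigableMap<Long, List<String>> index = new TreeMap<>();

    // Called when an event document is written.
    void put(String eventId, long timestamp) {
        index.computeIfAbsent(timestamp, t -> new ArrayList<>()).add(eventId);
    }

    // Range query, inclusive on both ends; an exact match is range(ts, ts).
    List<String> range(long from, long to) {
        List<String> hits = new ArrayList<>();
        for (List<String> ids : index.subMap(from, true, to, true).values()) {
            hits.addAll(ids);
        }
        return hits;
    }

    public static void main(String[] args) {
        TimestampIndexSketch idx = new TimestampIndexSketch();
        idx.put("e1", 100L);
        idx.put("e2", 200L);
        idx.put("e3", 300L);
        System.out.println(idx.range(150L, 300L)); // [e2, e3]
        System.out.println(idx.range(200L, 200L)); // [e2]
    }
}
```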

For greater scalability, multiple event objects can be grouped in a single document:

Record key:(namespace, set, <group-id>)
JSON bin:
{
    events: [ 
        {
            id: <group-id, event-num>,
            timestamp: <ts>,
            … 
        }, {
        …
        }
    ]
}

The event id combines the group-id with an event-num that is unique within the group. The group-id, which identifies the record, can be a time period identifier such as the day, week, or month of the year covering all events in the record, or another logical identifier shared by all events in the record, such as the sensor-id. To access an event directly by its event-id, the group-id is extracted from the event-id, the record is accessed by group-id, and then a JSONPath query is issued to match the id field. An exact match or range query on timestamp can be performed by creating an integer index on the timestamp fields in the record.
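The lookup steps for grouped events can be sketched as follows. The event-id encoding `<group-id>:<event-num>` is an assumption (the article only says the id contains both parts), and an in-memory map stands in for the database: fetching by group-id corresponds to the record read, and the filter on `id` corresponds to the JSONPath match. All class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class GroupedEventsSketch {
    // Extracts the group-id from an event-id such as "sensor-7:42".
    static String groupIdOf(String eventId) {
        int sep = eventId.lastIndexOf(':');
        if (sep <= 0) {
            throw new IllegalArgumentException(
                "expected <group-id>:<event-num>, got " + eventId);
        }
        return eventId.substring(0, sep);
    }

    // Step 1: derive the group-id. Step 2: read that group's record.
    // Step 3: pick the event whose id field matches.
    static Optional<Map<String, Object>> findEvent(
            Map<String, List<Map<String, Object>>> recordsByGroupId, String eventId) {
        List<Map<String, Object>> events = recordsByGroupId.get(groupIdOf(eventId));
        if (events == null) {
            return Optional.empty();
        }
        return events.stream().filter(e -> eventId.equals(e.get("id"))).findFirst();
    }

    public static void main(String[] args) {
        Map<String, List<Map<String, Object>>> records = Map.of(
            "sensor-7", List.of(
                Map.<String, Object>of("id", "sensor-7:1", "timestamp", 1000L),
                Map.<String, Object>of("id", "sensor-7:2", "timestamp", 2000L)));

        System.out.println(findEvent(records, "sensor-7:2").isPresent()); // true
        System.out.println(findEvent(records, "sensor-7:9").isPresent()); // false
    }
}
```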

Review the blog posts Aerospike Time Series API and Data Modeling for Speed-At-Scale (Part 2) for further discussion on organizing JSON documents.

JSON Support in Aerospike

Aerospike announced support for JSON documents in Database 6.0. The Aerospike Document API provides CRUD operations on a JSON document at points indicated by JSONPath. Below are some snippets of document APIs.

More details on the Document API can be found in the GitHub repo, tutorial, and blog post.

Store a JSON file to database

// Initialize the DocumentClient from AerospikeClient
AerospikeClient aerospikeClient = new AerospikeClient(cPolicy, seedHost, port);
AerospikeDocumentClient documentClient = new AerospikeDocumentClient(aerospikeClient);

// Read the JSON document into a string (UTF-8).
String jsonString = FileUtils.readFileToString(new File(JsonFilePath), StandardCharsets.UTF_8);

// Convert JSON string to a JsonNode
JsonNode jsonNode = JsonConverters.convertStringToJsonNode(jsonString);

// Add the document to database
documentClient.put(recordKey, documentBinName, jsonNode);

Get document elements by JSONPath

// Read an element by path
Object element = documentClient.get(recordKey, documentBinName, "$.path.to.the.element");
Object arrayItem = documentClient.get(recordKey, documentBinName, "$.path.to.array[index]");

// Get instances of a field from array elements
Object fieldsInArray = documentClient.get(recordKey, documentBinName, "$.path.to.array[*].field");
// Get all instances of a field anywhere in the document (recursive descent)
Object fieldsInDocument = documentClient.get(recordKey, documentBinName, "$..field");

Query JSON documents

JSON documents can be indexed for fast queries. In Aerospike Database 6.1+, any JSON element may be indexed to support exact match or range queries.

client.createIndex(policy, namespace, set, indexName, documentBinName,
                indexType, collectionType, contextPath);

A query can be issued using different filters depending on the index type - either a basic type (string or integer) or a collection type (List, MapKeys, MapValues):

Filter filter = Filter.range(documentBinName, fromValue, toValue, contextPath);
Filter filter2 = Filter.contains(documentBinName, collectionType,
                value, contextPath);

In Aerospike Database 6.0+, parallel partition-grained secondary index queries are available to boost throughput in large-scale applications.
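One way to picture partition-grained parallelism is to divide the partition space into contiguous ranges and give each range to its own worker. The sketch below only computes the ranges; it assumes the 4096-partition layout described in the Aerospike documentation, and the class and method names are hypothetical. Each `{begin, count}` pair could then drive a separate query restricted to those partitions.

```java
import java.util.ArrayList;
import java.util.List;

final class PartitionRangeSketch {
    // Partition count described in the Aerospike documentation.
    static final int PARTITIONS = 4096;

    // Splits [0, PARTITIONS) into `workers` contiguous ranges whose sizes
    // differ by at most one. Each entry is {begin, count}.
    static List<int[]> split(int workers) {
        if (workers < 1 || workers > PARTITIONS) {
            throw new IllegalArgumentException("workers must be in 1.." + PARTITIONS);
        }
        List<int[]> ranges = new ArrayList<>();
        int base = PARTITIONS / workers;
        int extra = PARTITIONS % workers;   // first `extra` ranges get one more
        int begin = 0;
        for (int w = 0; w < workers; w++) {
            int count = base + (w < extra ? 1 : 0);
            ranges.add(new int[] {begin, count});
            begin += count;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (int[] r : split(3)) {
            System.out.println("begin=" + r[0] + " count=" + r[1]);
        }
        // begin=0 count=1366
        // begin=1366 count=1365
        // begin=2731 count=1365
    }
}
```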

Find more details on indexing JSON documents in the blog post Query JSON Documents Faster and code examples in the tutorial on CDT Indexing.

Find, run, and modify working examples, and also run your own code, in the code sandbox from your browser.

. . . .

Real-time large-scale JSON applications need reliably fast access to data, high ingest rates, powerful queries, rich document functionality, scalability with no practical limit, always-on operation, and integration with streaming and analytical platforms. They need all this at low cost. The Aerospike Real-time Data Platform provides all this functionality, making it a good choice for building such applications. The Collection Data Types (CDTs) in Aerospike provide powerful support for modeling, organizing, and querying a large JSON document store. Visit the tutorials and code sandbox on the Developer Hub to explore the capabilities of the platform, and play with the Document API and query capabilities for JSON.
