Jacob Cohen for Harper

Posted on Jun 2, 2020 • Edited on Jun 5, 2020 • Originally published at harperdb.io

SQL Queries to Complex JSON Objects

#sql #database #json #harperdb

How many times have you run into a situation where you wish you could do a SQL join without getting duplicate rows back? What if we could get a list "column" returned instead? HarperDB’s ARRAY() function enables just that. In this post we’re going to take a look at a basic example of people with addresses and phone numbers.

Use Cases for the ARRAY() Function

Most existing systems have trouble transforming relational data into hierarchical data. Typically large batch processes or ETL jobs exist to perform these data transformations. HarperDB can perform these transformations out-of-the-box with a single SQL query. This query effectively performs the job of an ORM without the need for bloated software. Don’t think this is possible? Keep reading.

How the ARRAY() Function Works

The HarperDB ARRAY() function, forthcoming in a future release, is an aggregate function, similar to COUNT, SUM, AVG. The difference is that while standard aggregate functions will return computation results, ARRAY() returns a list of data as a field. While this may not be intuitive to those, like myself, who have been using SQL for years, it does enable the developer to create complex JSON objects with a single query. Let’s take a look at an example use case…

Example Data

We’ll be working with People, Phone Numbers, and Addresses. Each Address and/or Phone Number links back to a single Person. We have 10 person records, each with one or more phone numbers and addresses for a total of 20 addresses and 24 phone numbers.

Array Example ERD

Connecting Person and Phone Number

Let’s say I want to get all of the phone numbers for a person with ID 1. That’s fairly simple, I just query the phone number table for that person. But what happens if I also want to get the person data? I have to execute two queries and connect the data in my application.

SELECT * FROM arr.person WHERE person_id = 1
SELECT * FROM arr.phone WHERE person_id = 1

Now what happens if I want to get all people and all of their phone numbers. While I’d like to do a simple join, I can’t, because I’d end up with duplicate person data.

SELECT * FROM arr.person LEFT JOIN arr.phone ON person.person_id = phone.person_id

So, again, I have to run two queries and aggregate the data together in my application.

In HarperDB, we have the ARRAY() aggregate function which allows us to return this data, with no duplicates, in a single query. Remember, because ARRAY() is an aggregate function that we need to have a GROUP BY clause specified. In this case, since we are selecting multiple person fields, we need to specify all of them in our GROUP BY clause. Since we included our hash, person_id, we will safely retrieve each person record.

SELECT
  person.person_id, 
  person.first_name,
  person.middle_name,
  person.last_name,
  person.saluation,
  person.dob, 
  ARRAY({
    type: addr.address_type,
    addressLine1: addr.address_line_1,
    addressLine2: addr.address_line_2,
    city: addr.city,
    state: addr.state,
    zip: addr.zip_code,
  }) as address 
FROM 
  arr.person AS person 
    LEFT JOIN arr.address AS addr 
      ON person.person_id = addr.person_id
GROUP BY 
  person.person_id,
  person.first_name,
  person.middle_name,
  person.last_name,
  person.saluation,
  person.dob

This returns a list of complex JSON objects where each Person object contains a list of Phone objects. For example, the complex object for person ID 1 would look like this:

{
  "person_id": 1,
  "first_name": "Doug",
  "middle_name": "James",
  "last_name": "Henley",
  "saluation": "Mr.",
  "dob": "8/15/57",
  "address": [
    {
      "type": "MAILING",
      "addressLine1": "94317 Roxbury Court",
      "addressLine2": "Apt 102",
      "city": "Tampa",
      "state": "FL",
      "zip": 33625
    },
    {
      "type": "MAILING",
      "addressLine1": "35 Elgar Court",
      "city": "Arvada",
      "state": "CO",
      "zip": 80005
    }
  ]
}

Connecting Person, Phone Number, and Address

Now that we’ve shown how to aggregate list data from a single table let’s take a look at how we can retrieve multiple lists within our complex JSON objects. Ordinarily, if I wanted to pull data for person, phone, and address, then I would need three SQL queries.

SELECT * FROM person WHERE person_id = 1
SELECT * FROM phone WHERE person_id = 1
SELECT * FROM address WHERE person_id = 1

Now, if I were to put all three of those tables into a JOIN statement, I would receive a lot of duplicate data across all three tables. Take a look, here, at what is returned by the below SQL statement.

SELECT * 
FROM person 
  LEFT JOIN phone 
    ON person.person_id = phone.person_id 
  LEFT JOIN address
    ON person.person_id = address.person_id

Moving back to HarperDB we can query with the ARRAY() function to help us out with this. However, because we are joining across multiple tables we may still see some duplicate data in the phone and address lists. This is the inherent nature of SQL JOINS. In order to solve this problem, HarperDB created the DISTINCT_ARRAY() wrapper function. This function can be placed around a standard ARRAY() function call to ensure a distinct (deduplicated) results set is returned. Now to create our complex Person object with lists of both Phone and Address we can write a SQL statement like this:

SELECT 
  person.person_id,
  person.first_name,
  person.middle_name,
  person.last_name,
  person.saluation,
  person.dob,
  DISTINCT_ARRAY(ARRAY({
    type: addr.address_type,
    addressLine1: addr.address_line_1,
    addressLine2: addr.address_line_2,
    city: addr.city,
    state: addr.state,
    zip: addr.zip_code,
  })) as address,
  DISTINCT_ARRAY(ARRAY({
    type: phone.phone_type,
    num: phone.number,
    primaryFlag: phone.primary_flag,
  })) as phone
FROM arr.person AS person 
  LEFT JOIN arr.address AS addr
    ON person.person_id = addr.person_id
  LEFT JOIN arr.phone AS phone
    ON person.person_id = phone.person_id
GROUP BY
  person.person_id,
  person.first_name,
  person.middle_name,
  person.last_name,
  person.saluation,
  person.dob

The complex object for Person ID 1 returned from the above query looks like this

{
  "person_id": 1,
  "first_name": "Doug",
  "middle_name": "James",
  "last_name": "Henley",
  "saluation": "Mr.",
  "dob": "8/15/57",
  "address": [
    {
      "type": "MAILING",
      "line1": "94317 Roxbury Court",
      "line2": "Apt 102",
      "city": "Tampa",
      "state": "FL",
      "zip": 33625
    },
    {
      "type": "MAILING",
      "line1": "35 Elgar Court",
      "city": "Arvada",
      "state": "CO",
      "zip": 80005
    }
  ],
  "phone": [
    {
      "type": "REFERENCE",
      "num": "926-647-6907",
      "primaryFlag": 1
    },
    {
      "type": "HOME",
      "num": "737-377-6038",
      "primaryFlag": 0
    }
  ]
}

With a single query in HarperDB we were able to transform SQL data into a complex JSON object that can be used in your modern application!

Sample Data

Here are links to CSVs of each table used in the above example. You can also view the data below.
person.csv
address.csv
phone.csv

Person Table

person_id	first_name	middle_name	last_name	saluation	dob
1	Doug	James	Henley	Mr.	8/15/57
2	Megan		Creech		6/29/66
3	Michael	Samuel	Lang	Mr.	9/18/68
4	Charles	Jay	Cohen	Mr.	1/12/76
5	Gabby	Sarah	Hughes		9/30/82
6	Emily	Alexandra	Wood	Mrs.	1/18/64
7	Samantha	Grace	Choi	Mrs.	5/25/64
8	Hana		Smith	Ms.	3/12/72
9	Kent	Richard	Garrett		9/24/79
10	Kara	Caitlin	May	Ms.	9/17/90

Address Table

address_id	person_id	address_type	address_line_1	address_line_2	city	state	zip_code
1	1	MAILING	94317 Roxbury Court	Apt 102	Tampa	FL	33625
2	2	BILLING	9 Mayer Plaza	#277	Washington	DC	20430
3	3	MAILING	99 Cascade Crossing		Hartford	CT	6152
4	4	MAILING	39094 Hoard Center	#418	Flushing	NY	11388
5	5	MAILING	6 Waubesa Point		Aurora	CO	80045
6	6	BILLING	94209 Kinsman Place	#135	Atlanta	GA	30311
7	7	MAILING	526 Barnett Hill		Waco	TX	76711
8	8	BILLING	9 Luster Trail	#348	Nashville	TN	37240
9	9	BILLING	33553 Talmadge Hill		Bakersfield	CA	93386
10	10	MAILING	21900 Rusk Drive	Apt 8	Harrisburg	PA	17121
11	1	MAILING	35 Elgar Court		Arvada	CO	80005
12	3	BILLING	9 Tennessee Street		Trenton	NJ	8619
13	6	MAILING	0 Old Gate Alley	Apt 439	Wilkes Barre	PA	18768
14	7	BILLING	3918 Messerschmidt Way	Apt 234	Oklahoma City	OK	73173
15	9	BILLING	41778 Stephen Circle		Salt Lake City	UT	84145
16	10	BILLING	50 Tony Terrace		Sioux Falls	SD	57198
17	2	MAILING	6 Hanson Trail		Nashville	TN	37240
18	3	BILLING	0 Darwin Terrace	#144	Montpelier	VT	5609
19	5	MAILING	59265 Dakota Center		Pittsburgh	PA	15279
20	9	MAILING	369 Badeau Road		Miami	FL	33283

Phone Table

phone_id	person_id	phone_type	number	primary_flag
1	1	REFERENCE	926-647-6907	1
2	2	REFERENCE	864-324-2292	1
3	3	WORK	540-908-1691	1
4	4	HOME	253-590-9734	1
5	5	CELL	302-785-7313	1
6	6	REFERENCE	670-198-4073	1
7	7	CELL	923-662-5491	1
8	8	REFERENCE	176-225-5902	1
9	9	WORK	228-536-6858	1
10	10	REFERENCE	175-549-9915	1
11	1	HOME	737-377-6038	0
12	2	WORK	603-492-5375	0
13	4	WORK	192-656-9676	0
14	5	REFERENCE	537-446-7971	0
15	7	WORK	627-936-7236	0
16	8	WORK	762-324-7571	0
17	10	WORK	521-906-6326	0
18	2	REFERENCE	390-785-1962	0
19	4	CELL	787-954-6675	0
20	7	WORK	168-382-4627	0
21	8	WORK	199-264-7443	0
22	2	REFERENCE	212-508-4836	0
23	4	WORK	493-724-1771	0
24	8	CELL	156-617-7276	0

Top comments (1)

Pacharapol Withayasakpunt • Aug 1 '20 • Edited

Thanks for the example, but I am concerned about one thing in SQL -- SQL keywords clashes with table / column name. How do I ensure that the column names are safe to use?

NVM -- Found the guide.

Another consideration is still, no prepared statement? How to prevent SQL injections? I am especially concerned about escaping backticks and [].