In our new 2.3.0 release, we added an often-requested NoSQL upsert operation to HarperDB*. This new hybrid operation will insert new records if they do not exist, or update them if they do.
This new feature can be used in two different ways via HarperDB’s API: as a simple NoSQL operation or as the action for a bulk load operation.
*This new operation is only available in HDB instances utilizing an LMDB data store. While the File System (FS) data store is still configurable and still supported in HDB, some new/more advanced features may not be implemented for FS moving forward.
NoSQL Upsert Operation
As noted above, HarperDB users can now utilize an upsert operation via our API, which will insert new records and/or update existing records.
A new record (to be inserted) is identified as a record that does not include a hash value, or whose hash value does not already exist on the table being upserted to.
An existing record (to be updated) is identified by a hash value that already exists on the table, and will be updated based on the attribute values included in that record’s upsert JSON - i.e. as with update, any attributes not included in the record’s JSON will NOT be updated.
Example NoSQL Upsert Operation
Example Request
{
  "operation": "upsert",
  "schema": "dev",
  "table": "dog",
  "records": [
    {
      "id": 1,
      "nickname": "Sammy"
    },
    {
      "name": "Harper",
      "nickname": "Good boy!",
      "breed": "Mutt",
      "age": 5,
      "weight_lbs": 155
    }
  ]
}
Example Response
{
  "message": "upserted 2 of 2 records",
  "upserted_hashes": [
    1,
    "6bca9762-ad06-40bd-8ac8-299c920d0aad"
  ]
}
In the above example:
- The existing record with hash value equal to 1 will have its nickname attribute updated to "Sammy", and all other attribute values for that record will remain untouched. Note: if there were no record with id equal to 1, a new record would be inserted with the provided nickname value.
- The new record will be inserted as written and with a system-generated hash value. If a new, unused hash value had been provided for this record, we would have used that hash value when inserting the new record.
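The request above is simply a JSON body posted to HarperDB’s operations API. Below is a minimal client-side sketch of sending it; the endpoint URL and Basic auth credentials are assumptions for a local instance and will differ in your environment.
// Minimal sketch of posting the upsert operation above to a HarperDB instance.
// The endpoint and credentials are assumptions for a local setup.
const HDB_URL = "http://localhost:9925";
const HDB_AUTH = "Basic " + Buffer.from("admin:password").toString("base64");

interface UpsertResponse {
  message: string;
  upserted_hashes: Array<string | number>;
}

async function upsertDogs(): Promise<UpsertResponse> {
  const res = await fetch(HDB_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: HDB_AUTH },
    body: JSON.stringify({
      operation: "upsert",
      schema: "dev",
      table: "dog",
      records: [
        // Existing hash value: only the listed attributes are updated.
        { id: 1, nickname: "Sammy" },
        // No hash value: inserted with a system-generated hash.
        { name: "Harper", nickname: "Good boy!", breed: "Mutt", age: 5, weight_lbs: 155 },
      ],
    }),
  });
  return (await res.json()) as UpsertResponse;
}

upsertDogs().then((r) => console.log(r.message, r.upserted_hashes));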
NoSQL Bulk Load Upsert
Similar to our NoSQL insert and update operations, upsert is also now available to specify as the action on a bulk load API operation. This tells the bulk load job to run an upsert operation on the large data set provided.
Bulk Load w/ Upsert Action
Request
{
  "operation": "csv_url_load",
  "action": "upsert",
  "schema": "dev",
  "table": "dogs",
  "csv_url": "https://s3.amazonaws.com/data/dogs.csv"
}
Response
{
  "message": "Starting job with id e047424c-5518-402f-9bd4-998535b65336"
}
Example Response from get_job operation for bulk load
[
  {
    "__createdtime__": 1607897781553,
    "__updatedtime__": 1607897784027,
    "created_datetime": 1607897781549,
    "end_datetime": 1607897784026,
    "id": "e047424c-5518-402f-9bd4-998535b65336",
    "job_body": null,
    "message": "successfully loaded 348 of 348 records",
    "start_datetime": 1607897781562,
    "status": "COMPLETE",
    "type": "csv_url_load",
    "user": "admin",
    "start_datetime_converted": "2020-12-13T22:16:21.562Z",
    "end_datetime_converted": "2020-12-13T22:16:24.026Z"
  }
]
In the above example:
- A csv_url_load bulk load operation is started using the linked data set. All records included in the linked data will be upserted into the table identified, using the logic described above.
- Hitting the get_job endpoint with the job id will provide you with an updated status of the bulk load job and, when complete, confirm the number of records upserted from the linked data set.
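Putting those two calls together, here is a rough sketch that starts the bulk load and then polls get_job until the job reports COMPLETE. The endpoint and credentials are assumptions for a local instance, and the job id is pulled from the "Starting job with id ..." message shown above.
// Sketch: start a csv_url_load with the upsert action, then poll get_job.
// The endpoint and credentials are assumptions for a local setup.
const HDB_URL = "http://localhost:9925";
const HDB_AUTH = "Basic " + Buffer.from("admin:password").toString("base64");

async function hdb(body: Record<string, unknown>): Promise<any> {
  const res = await fetch(HDB_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: HDB_AUTH },
    body: JSON.stringify(body),
  });
  return res.json();
}

async function bulkUpsert(csvUrl: string): Promise<void> {
  const start = await hdb({
    operation: "csv_url_load",
    action: "upsert",
    schema: "dev",
    table: "dogs",
    csv_url: csvUrl,
  });
  // The response message reads "Starting job with id <uuid>"; grab the last token.
  const jobId = start.message.split(" ").pop();

  // get_job returns an array with a single job record (see the example above).
  for (;;) {
    const [job] = await hdb({ operation: "get_job", id: jobId });
    if (job.status === "COMPLETE") {
      console.log(job.message); // e.g. "successfully loaded 348 of 348 records"
      return;
    }
    // Still running; wait a second and check again (error handling omitted for brevity).
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
}

bulkUpsert("https://s3.amazonaws.com/data/dogs.csv");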
A Note RE: Clustering
As with other database operations in HarperDB, like insert, update, csv_file_load, etc., an upsert operation to a table on a specific node will distribute to the other nodes subscribed to changes on that table.
A few things to keep in mind when thinking through how this will play out for your clustering architecture:
- In a scenario where you are upserting new records without hash values provided, the system-generated hashes will be included in the transaction payload that is shipped to connected nodes - i.e. the auto-generated hashes for the new records will be mirrored on connected nodes.
- In a clustered architecture, it is important to take a moment to consider the best NoSQL operation to use in each situation. While it may seem easy to just use upsert even when you only intend to insert or update records, that strategy could have unintended consequences for your data integrity. For example, in a scenario where you have provided the hash values for upsert records, the upsert transaction will do one of the following on any connected nodes:
  - If no matching hash value is found on the subscribing table, a new record will be inserted on the table, even if the operation on the publishing node was an update of the record.
  - If there is a hash value match on the subscribing table, the record will be updated, even if the operation on the publishing node was a record insert.
To be specific, in some scenarios using upsert could cause hash values for what you consider to be the same record to become out of sync across the cluster. In some cases this may not make a difference to the overall value or use of your data cluster (it could even be the preferred outcome!), but in others your data cluster may be affected negatively, so think through your use case carefully. Being explicit about the operation you want to transact will also make reviewing and understanding the transaction logs on your clustered nodes easier if an issue arises and a rollback/fix is needed.
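As a small illustration of that advice, one approach is to pick the operation name explicitly at each call site instead of defaulting to upsert everywhere. The helper below is hypothetical, not part of HarperDB’s API.
// Hypothetical helper: make the write intent explicit per call site so the
// transaction shipped to subscribed nodes matches what you actually meant.
type WriteIntent = "insert" | "update" | "upsert";

function buildWrite(
  intent: WriteIntent,
  schema: string,
  table: string,
  records: Record<string, unknown>[]
) {
  return { operation: intent, schema, table, records };
}

// Reach for "upsert" only when insert-or-update really is the intent.
const payload = buildWrite("update", "dev", "dog", [{ id: 1, nickname: "Sammy" }]);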
Happy upserting!
Do you have a new feature idea for HarperDB?
Our Feedback Board is a great place to vote and leave product suggestions, and you can always connect with our team in the community Slack Channel.
Top comments (3)
I have a new DB to look into! Thanks!
Really great feature, any plans to add unique fields (columns) to HarperDB yet?
The hash attribute enforces uniqueness. We have this feature request on our feedback board and plan to add it in the future; you can vote for it here: feedback.harperdb.io/suggestions/1...