THE ROAD TO AWS RE:INVENT 2018 – WEEKLY PREDICTIONS, PART 2: DATA 2.0

#aws #reinvent #cloud

Originally published here.

Last week I made the easy prediction that at re:Invent, AWS would announce more so-called ‘serverless’ capabilities. It’s no secret that they are all-in on moving from server management to service management. I guessed at a few specific possibilities – SFTP-as-a-Service, ‘serverless’ EC2, and a few others.

This week, I want to look at some of the other capabilities provided by AWS and make some predictions as to what announcements we might see. Why should any or all of this matter to you? If you’re in the business of processing, storing, and analyzing large sets of data, these updates may significantly impact the speed, efficiency, and cost at which you’re able to do so.

WEEK 2 PREDICTION: DATA 2.0

While AWS has a number of existing tools to manage data ingestion and processing (e.g. Data Pipeline, Glue, Kinesis), I think adding an orchestration framework optimized for every step of a robust data processing pipeline would let AWS’ analytics tools (Athena, QuickSight, etc.) really shine.

DATA-MAPPING-AS-A-SERVICE

I cut my teeth with data integration on platforms like WebMethods. While it may have had some drawbacks, it was, as a solution set, really excellent at:

Providing endpoints for data delivery
Identifying data by location, format, or other specific data elements
Routing the data to the right processors based on the above features
Mapping each data entry from one format to another
Delivering transformed data to the target location
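
To make that flow concrete, here is a minimal Python sketch of the identify, route, map, and deliver steps listed above. It is only an illustration under assumptions of my own: the field names in FIELD_MAP, the extension-based identification rule, and the in-memory list standing in for the target location are all hypothetical, and nothing here reflects how webMethods or any AWS service actually works.

```python
import csv
import io
import json
from typing import Callable, Dict, List

# Hypothetical field mapping: source column name -> target field name.
FIELD_MAP = {"cust_id": "customerId", "amt": "amount"}


def identify(filename: str) -> str:
    """Identify a payload by file extension (one possible identification rule)."""
    return filename.rsplit(".", 1)[-1].lower()


def parse_csv(payload: str) -> List[dict]:
    """Parse CSV text into one dict per row (values stay strings)."""
    return list(csv.DictReader(io.StringIO(payload)))


def parse_json(payload: str) -> List[dict]:
    """Parse a JSON array of objects."""
    return json.loads(payload)


# Routing table: identified format -> processor for that format.
ROUTES: Dict[str, Callable[[str], List[dict]]] = {
    "csv": parse_csv,
    "json": parse_json,
}


def map_record(record: dict) -> dict:
    """Rename known fields to the target format; unmapped fields are dropped."""
    return {
        target: record[source]
        for source, target in FIELD_MAP.items()
        if source in record
    }


def process(filename: str, payload: str, target: List[dict]) -> None:
    """Identify, route, map, and deliver one incoming file."""
    parser = ROUTES[identify(filename)]  # KeyError for an unknown format
    target.extend(map_record(record) for record in parser(payload))
```

In a managed service, the routing table and the field map would presumably be configuration rather than code, which is exactly the part I would like AWS to own.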

I can see the equivalent of a managed Apache NiFi solution, offered in much the same way as AWS’ Elasticsearch Service. The ability to route tasks to Lambda and/or Fargate for execution, support for Directed Acyclic Graph (DAG) modeling, and tight integration with S3 for writing out both intermediate and final data would be a game-changer for products that have to import and process data files – particularly from third parties.
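
As a rough illustration of what DAG modeling buys you, the sketch below uses Python's standard-library graphlib (Python 3.9+) to derive a valid execution order from task dependencies. The task names are hypothetical, and this is not NiFi's model or any AWS API, just the general technique.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each key is a task, each value is the set of
# tasks that must finish before that task can start.
PIPELINE = {
    "validate": {"ingest"},
    "transform": {"validate"},
    "profile": {"validate"},
    "write_intermediate": {"transform"},
    "write_final": {"write_intermediate", "profile"},
}


def execution_order(dag: dict) -> list:
    """Return one valid execution order.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    which is what makes the graph "acyclic" in practice.
    """
    return list(TopologicalSorter(dag).static_order())
```

Tasks with no dependency between them (here, transform and profile) could run in parallel, which is where handing each step to Lambda or Fargate would pay off.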

S3 LIFECYCLE ON READ TIME

One of my pet peeves with S3 lifecycle management is that moving an object from the Standard to the Infrequent Access storage class has nothing to do with how frequently the object is actually accessed; transitions are based on object age. While I would imagine that the underlying design of an object store makes tracking reads very difficult, last-read time would provide a much-needed metric for making storage decisions.
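
The rule I have in mind is simple enough to sketch in a few lines of Python. Note the big assumption: S3 does not expose a last-read timestamp on objects, so last_read is a hypothetical input (it might be derived from access logs), and the 30-day idle window is an arbitrary default of mine. Only the storage class names are real.

```python
from datetime import datetime, timedelta
from typing import Optional


def suggest_storage_class(
    last_read: Optional[datetime],
    now: datetime,
    idle_days: int = 30,  # hypothetical idle window, not an S3 setting
) -> str:
    """Suggest a storage class from last read time rather than object age.

    last_read is None when the object has never been read.
    """
    if last_read is None or now - last_read >= timedelta(days=idle_days):
        return "STANDARD_IA"
    return "STANDARD"
```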

DYNAMODB DEEP DOCUMENT MODE

DynamoDB is a great hybrid key-value and document store. I use it often for storing and retrieving small documents. However, the current limits on document size and scan patterns make using DynamoDB as a managed MongoDB-level solution a challenge. Providing more robust document-centric capabilities, while still supporting the scalability, replication, and global presence, would significantly “up the game” for DynamoDB. As a wish-list item, I would like DynamoDB to completely remove the pre-allocation of throughput for reads and writes. Let each request set an optional throttle, but charge me for what I actually use rather than what I might use. The current autoscaling is a significant improvement over nothing, but there is room to do better.
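
To show the gap between what I pay for and what I use, here is a small arithmetic sketch in Python. The per-second consumption samples and the function name are made up for illustration; this is not DynamoDB's billing formula, only the idea behind the complaint.

```python
from typing import Sequence, Tuple


def billing_gap(
    consumed_per_second: Sequence[int], provisioned_units: int
) -> Tuple[int, int, float]:
    """Compare pre-allocated capacity with capacity actually consumed.

    Returns (units paid for, units used, unused fraction). Demand above
    the provisioned level is capped, standing in for throttled requests.
    """
    paid = provisioned_units * len(consumed_per_second)
    used = sum(min(c, provisioned_units) for c in consumed_per_second)
    unused_fraction = (paid - used) / paid if paid else 0.0
    return paid, used, unused_fraction
```

For a spiky workload such as billing_gap([10, 0, 50, 0], 50), the table is sized for the peak while most of the paid-for capacity sits idle.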

RDS – POLYGLOT EDITION

For a while there, there was an interesting trend of combining multiple database paradigms into a single view – document + graph, etc. I think that AWS may try to dip their toe into this space. It would be interesting to see them link Elasticsearch, Aurora, and Neptune together behind the scenes into a solution that tries to combine the best of each storage paradigm. As with most all-in-one tools, I’m honestly not sure whether it would just do each of those things equally poorly. I often recommend a multi-storage solution to clients for their data, with each store optimized for a particular use case, so there may be something there.

S3 AUTO-CRAWLING AND METRICS

Imagine setting a flag on a data bucket so that whenever a data file drops there, it is automatically classified, indexed, and ready for Athena, Glue, or Hive querying. Having some high-level metrics on the data within (row count, average values, etc.) would be useful for other business decisions. Adding in some SageMaker algorithms for data variance (e.g. Random Cut Forest for discovering data outliers and/or trends) to fire off alerts would be incredible, too.
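
Here is a minimal Python sketch of the kind of per-file metrics I mean: row count, average, and a list of suspect values. To be clear about the substitution, it uses a simple z-score check as a stand-in for Random Cut Forest, which is a different and more capable algorithm. The function name, the CSV input, and the threshold are all my own assumptions.

```python
import csv
import io
import statistics


def column_metrics(csv_text: str, column: str, z_threshold: float = 2.5) -> dict:
    """Compute row count, mean, and z-score outliers for one numeric column.

    The z-score test is a simple stand-in for Random Cut Forest. On small
    samples a single value can never reach a large z-score, so the default
    threshold is kept fairly low.
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in rows]
    if not values:
        return {"row_count": 0, "mean": None, "outliers": []}
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    outliers = [
        v for v in values if stdev and abs(v - mean) / stdev > z_threshold
    ]
    return {"row_count": len(values), "mean": mean, "outliers": outliers}
```

An alerting hook would then fire whenever the outliers list is non-empty for a newly dropped file.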

WRAPPING IT UP

In closing this week, I think there will be a lot of different announcements around data processing as an AWS-centric framework. AWS has most of the parts in play already – having AWS manage wiring them together, so you only have to focus on the business value you are extracting from the data, would realize the promise of the cloud for data processing.

Going to be at re:Invent? Drop a comment below and let me know what you hope to see there or your thoughts on what’s next.

Top comments (2)

Thomas H Jones II • Nov 16 '18

"One of my pet peeves on the S3 lifecycle management is that moving from Standard to Infrequent Access storage class has nothing to do with the frequency of accessing the file. While I would imagine that the underlying capabilities of an object store makes it very difficult to actually do this, it would provide a much-needed metric to make storage decisions."

S3's lifecycle management generally leaves much to be desired. I mean, it's great that you could have a multi-stage lifecycle for data. But, the fact that your only choice for sub-30-day policies is to go straight to Glacier is kind of dreadful. S3 is potentially great as a repository for nearline/offline storage (i.e., backups) ...but it currently lacks the useful lifecycle capabilities you get used to in legacy products like NetBackup. And, even aside from the whole loss of POSIX attributes if you want to simply sync a filesystem to disk, performance of such is dreadful due to the whole common-key issue. Both the POSIX attributes and common-key problems are solvable, but it's painful to sort the programmatic logic out.

Overall, it has the feel of "you guys have been pestering us, here's something to shut you up for a while", but not really a fully-realized HSM.

Maybe what AWS will introduce is an actual HSM-style interface to S3 or a service-overlay?

Thomas H Jones II • Nov 19 '18

Also, I would hope that they're opting to flesh out the EFS offering. Things like:

More/better pre-selected performance-tiers. Would be great to have a shared filesystem that was useful for busy applications that didn't have large data-size requirements:
- The default performance tier has decent latency but throughput is dependent on how much you're storing. Sucks to have to store more data (especially dummy data) just to get better base performance.
- The "Max I/O Performance Mode" is better for throughput, but the penalty is increased latency.
An actual, built-in backup capability. Yeah, EFS itself is durable, but it doesn't really offer "oops" protection. EFS is currently like relying on RAID as your only data-protection method. While you can jury-rig backups, doing so will blow out your daily I/O credits.
An actual, built-in region-to-region replication capability. While EFS is great in (a supported) region, if a region manages to get knocked off the air (or you otherwise need to do an off-region migration of your services), your EFS-hosted data is offline or otherwise not easily available. While you can jury-rig region-to-region replication, as with backups, doing so will blow out your daily I/O credits.
Windows/CIFS interface would be great. Lack of CIFS support limits the ability to use EFS in Windows-based clustered deployments. I'd assume they'd be working towards this to enhance their WorkSpaces service, anyway.