Many Big Data analysis solutions are built on AWS services, and there are quite a lot of them. I work in a small developer team, and we had neither the time nor the experience to try them all before starting to build a solution for a Big Data problem we had at our company.
Instead of spending painful hours on each service, we decided to tackle the problem as quickly as possible. We began by deploying an architecture that involved only AWS Lambda. Knowing that there were other ways to do what we had done, we went further and experimented with Amazon Athena. We studied and worked with both for a few weeks, then deployed and tested them to find out which suited us best.
So I want to share my experience learning, developing, and using these two architectures: the one using only AWS Lambda and the one built around Amazon Athena.
This story focuses on the development process, with a big emphasis on the project itself. It doesn’t give every detail, but it explores the development as a whole. I also want to show you the differences between the two architectures and the insights we gained from each.
The big constraints… money and time
Our project needed to be in production in the shortest time possible, saving as much money as possible.
The project requirements were fairly straightforward:
A platform that analyzes logs from routers and then aggregates the information to determine whether a device counts as a visitor or a passer-by.
We didn’t want to pay for anything other than the data processing.
We wanted an easy-to-deploy, self-provisioned solution.
Let’s get down to business
The product required a large time investment in the following areas:
First, we had to research, implement, and weigh up which architecture was best for our problem. We also had to learn the technologies that we didn’t know.
AWS Lambda was very familiar to us, but Amazon Athena was fairly new, so we had to get our hands dirty and start experimenting with the tool.
Our team was experienced in developing serverless applications, so we knew the ins and outs of the Lambda / SNS / S3 services and of deploying them using CloudFront.
But this challenge was new. We had to analyze large amounts of router data, with lots of information about the devices connected to the routers, all on a strict execution schedule.
Face-to-face with the problem
This was the schema of tasks that our solution had to implement:
An external application uploads files to a preconfigured location every minute.
Our application checks this file location at 10-minute intervals and processes all the files currently there, one by one, merging all the information into one file.
After successfully processing the files, we had to obtain statistics on the passers-by and the visitors at the location where the router is installed.
In parallel, we wanted the information not only for each ten-minute interval but also aggregated over longer periods, such as 1 hour, 8 hours, 1 day, and so on.
First we used what we knew — logically
We had only a few certainties. One of them was that AWS Lambda works, because we had used it before.
We knew that if you use AWS Lambda for processing, you only pay for the actual processing time and not a cent for idle time. And if you use Amazon S3 for file storage, you pay for the size of the files and for the movement of data, which can also be an expensive part. With that in mind, we started planning.
The above diagram shows an approximation of how we integrated the AWS components to build our solution:
A CloudWatch scheduled event was configured to trigger the Lambda function at 10-minute intervals.
A Lambda function acts as a scheduler for all the different intervals. It sends an SNS notification when batch processing is needed.
Some folders in an S3 bucket were provisioned to store the raw and the processed information.
Some SNS topics were configured to publish processing notifications to them.
Lambda functions were given the necessary permissions to read the files from the S3 bucket, process them, and finally write the results back to S3.
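To make the scheduler idea concrete, here is a minimal sketch of how a function could decide which aggregation intervals are due on each 10-minute tick and publish one SNS message per interval. It is an illustration under assumptions, not the code from the project: the interval table, the message shape, and the names `due_intervals` and `schedule` are hypothetical. The `sns_client` argument is expected to behave like a boto3 SNS client, of which only the `publish(TopicArn=..., Message=...)` call is used.

```python
import json
from datetime import datetime

# Hypothetical aggregation periods, in minutes. The text mentions
# 10 minutes, 1 hour, 8 hours and 1 day.
INTERVALS_MIN = {"10m": 10, "1h": 60, "8h": 480, "1d": 1440}


def due_intervals(now: datetime) -> list:
    """Return the names of the intervals whose boundary falls exactly on `now`."""
    minutes_since_midnight = now.hour * 60 + now.minute
    return [
        name
        for name, size in INTERVALS_MIN.items()
        if minutes_since_midnight % size == 0
    ]


def schedule(now: datetime, sns_client, topic_arn: str) -> list:
    """Publish one SNS message per interval that is due at `now`."""
    published = []
    for name in due_intervals(now):
        # The message shape is an assumption made for this sketch.
        message = json.dumps({"interval": name, "end": now.isoformat()})
        sns_client.publish(TopicArn=topic_arn, Message=message)
        published.append(name)
    return published
```

Inside a Lambda handler this could be called with the current UTC time, a real `boto3.client("sns")`, and the ARN of the processing topic. Passing the client in as an argument keeps the scheduling logic testable without AWS access.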
At first, we were pretty happy with what we had built, but we knew we could do better. After taking another look at the solution, we also saw that it had some limitations. One of the biggest was the size of the files combined with the Lambda file storage restriction.
We knew we had a large amount of data, which meant a large number of Lambda invocations and, in turn, a large amount of billed execution time. As I said before, one of the biggest constraints for our project was saving as much money as possible, and we knew this was not exactly what was happening.
In addition, we had to manage quite a large architecture, which was no less important.
To improve this, we parallelized as much as we could and tuned our algorithms, but we suspected that it could be done better at a much lower cost.
So, while exploring AWS services, we stumbled upon the buzz around Amazon Athena.
Beginning to steer the wheel
Amazon Athena is a serverless, SQL-based query service for objects stored in S3. To use it, you simply define a table that points to your S3 data and fire SQL queries away! This is pretty painless to set up in a Lambda function.
But what was the difference if we still had to use Lambda as a means to process our data?
The difference lies in Athena’s pricing model: you are charged only for the amount of data scanned by each query and nothing more. Athena charges per TB of data scanned, with a minimum of 10 MB per query. Lambda’s pricing model, by contrast, charges for every 100 ms of computation.
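The two billing models can be compared with a little arithmetic. The sketch below is only an illustration: the 10 MB minimum and the 100 ms billing step come from the description above, while the two unit prices are assumptions that should be checked against the current AWS price list. The function names are hypothetical.

```python
import math

# Assumed unit prices, for illustration only. Check the current AWS price list.
ATHENA_USD_PER_TB = 5.0
LAMBDA_USD_PER_GB_SECOND = 0.0000166667

# Billing rules as described in the text above.
ATHENA_MIN_BYTES = 10 * 1024**2   # 10 MB minimum per query
LAMBDA_BILLING_STEP_MS = 100      # billed in 100 ms steps


def athena_query_cost(bytes_scanned: int) -> float:
    """Cost of one query: data scanned, with the 10 MB minimum applied."""
    billed_bytes = max(bytes_scanned, ATHENA_MIN_BYTES)
    return billed_bytes / 1024**4 * ATHENA_USD_PER_TB


def lambda_invocation_cost(duration_ms: float, memory_mb: int) -> float:
    """Cost of one invocation: duration rounded up to the billing step,
    multiplied by the configured memory in GB."""
    steps = math.ceil(duration_ms / LAMBDA_BILLING_STEP_MS)
    billed_seconds = steps * LAMBDA_BILLING_STEP_MS / 1000
    return billed_seconds * (memory_mb / 1024) * LAMBDA_USD_PER_GB_SECOND
```

The point to notice is what each function takes as input: the Athena cost depends only on bytes scanned, no matter how long the query runs, while the Lambda cost grows with running time and memory for every single invocation.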
We had a lot of information to process, and we had lots of Lambda functions, one for each of the files that we had to process. That meant a huge amount of accumulated Lambda processing time.
This is where we knew we had to make the most of Athena, as it doesn’t charge you for the time a query is running, only for the amount of data processed. This meant we now needed only one Lambda to run the queries instead of the many we needed previously, although it was not as simple as it sounds.
The boat had already sailed again — we knew the way
We started working on the scheme and this was the architecture we obtained:
This is a major change in the architecture we had before. We were able to see that we could use the benefits of Step Functions to make our solution easier to manage and provision. We improved the two fundamental aspects that we wanted — money and the provisioning of the solution.
Let’s take a look at the Step Functions workflow as well:
So let’s explain the scheme a little. The first thing to know is that if you use the SDK to connect to Athena, calls to the service are asynchronous. This means that when you submit a query from a Lambda function, you don’t receive the answer immediately. You have to ask Athena again later to find out whether the query has finished.
To handle this, we added some intermediate decision steps that wait a certain amount of time to let Athena finish processing. If Athena has not finished by then, the workflow waits the same amount of time and asks again.
Here we can see the first benefit of this model: our Lambda only has to send a message to Athena to start the query, and Athena does all the work. This is where the improvement lies: instead of many Lambdas processing the files, there is one that sends the request and goes to sleep.
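The submit-then-poll pattern can be sketched in three small functions, one per step of such a workflow. This is a hedged illustration rather than the project’s actual code: the function names and the `"done"` / `"error"` / `"wait"` labels are hypothetical, and the `athena` argument is assumed to be a boto3 Athena client, of which only `start_query_execution` and `get_query_execution` are used.

```python
def start_query(athena, sql: str, database: str, output_location: str) -> str:
    """Submit the query and return immediately with its execution id."""
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


def query_state(athena, query_execution_id: str) -> str:
    """Ask Athena for the current state of a previously submitted query."""
    response = athena.get_query_execution(QueryExecutionId=query_execution_id)
    return response["QueryExecution"]["Status"]["State"]


def next_step(state: str) -> str:
    """Decide what the workflow should do for a given Athena state."""
    if state == "SUCCEEDED":
        return "done"
    if state in ("FAILED", "CANCELLED"):
        return "error"
    # QUEUED or RUNNING: wait for a while, then check again.
    return "wait"
```

In a Step Functions workflow, `start_query` would run in one task, a wait state would pause for a fixed time, and a second task would call `query_state` so that a choice state can branch on the result of `next_step`. No Lambda stays running while Athena works.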
The other parts are no more sophisticated than before. As in the first architecture, the process begins with a parsing task that leaves the files ready for Athena to query. This can be done with crawlers, using AWS Glue to transform the data so that Athena can query it. An alternative, which we used to reduce costs, is to create the partitions via an Athena query.
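Creating a partition through a query comes down to sending Athena an `ALTER TABLE ... ADD PARTITION` statement. The helper below builds such a statement as a minimal sketch. The table name, the `dt` partition key, and the `dt=YYYY-MM-DD` folder layout are assumptions chosen for illustration, not details taken from the project.

```python
from datetime import date


def add_partition_sql(table: str, day: date, bucket: str, prefix: str) -> str:
    """Build a statement that registers one daily partition with Athena.

    Assumes a table partitioned by a string column named `dt` and a
    folder layout of s3://<bucket>/<prefix>/dt=<YYYY-MM-DD>/.
    """
    day_str = day.isoformat()
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt = '{day_str}') "
        f"LOCATION 's3://{bucket}/{prefix}/dt={day_str}/'"
    )
```

The resulting string can be submitted like any other Athena query. `IF NOT EXISTS` makes the statement safe to run again on every scheduled execution.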
After this is finished, the data analysis begins. This is where a Lambda function calls Athena and asks for the processed data. This is done for the different periods of time, adding only, as mentioned before, the waits and the logic for retries and errors.
And best of all, if you know basic SQL, you can write amazing queries.
As we started to learn and research, we realized that there were even more ways to improve performance.
So I want to share some of them with you:
Compression: log data usually compresses well, and compressed data means less data for Athena to scan.
Columnar Data Format: as suggested by AWS, you can convert the data to Parquet format, which massively reduces the amount of data each query has to scan.
Caching: you don’t want to rerun the same queries over and over, so you can begin to systematically store and categorise the results in an S3 data lake.
Running queries together: to make it cheaper still, you can begin to string multiple queries together to be run as one, then split the results apart in Lambda before sending them back to S3. Athena sets a maximum of 10 concurrent queries, which is another reason to combine several queries into one.
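One possible way to run queries together is to tag each sub-query with a label, join them with `UNION ALL`, and split the result rows by label afterwards. The sketch below illustrates that idea only. It is not necessarily how the project did it, the function names are hypothetical, and it assumes every sub-query returns the same number of columns with compatible types, which `UNION ALL` requires.

```python
from collections import defaultdict


def combine_queries(labelled_queries: dict) -> str:
    """Prefix each sub-query with a label column and join them with UNION ALL."""
    parts = [
        f"SELECT '{label}' AS query_label, t.* FROM ({sql}) t"
        for label, sql in labelled_queries.items()
    ]
    return "\nUNION ALL\n".join(parts)


def split_rows(rows) -> dict:
    """Group result rows by their first column, which holds the query label."""
    grouped = defaultdict(list)
    for label, *values in rows:
        grouped[label].append(values)
    return dict(grouped)
```

With this, the aggregations for several periods can travel in a single Athena request, and the Lambda that collects the result can call `split_rows` and write one output per label back to S3.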
Having come all this way, we decided to deploy the Amazon Athena solution to production. As you may have seen, throughout this whole process we found that working with Athena brought many benefits to light.
We think we arrived at a robust and scalable solution. Furthermore, an architecture that takes advantage of Athena is far more cost-effective.
Now that you have seen how two different architectures are implemented, I hope you can try them out for yourselves and comment on the architectures you use on a daily basis. We are a growing group with a desire to learn from every experience.
So if you have any questions about what we have done, I’d love to hear your questions and feedback in the comments below.
Thanks for reading! Be sure to give it a clap if you enjoyed it!