Processing data could be an intensive task, especially on the computing units as the read and write operations takes a lot of resources. Luckily, if you have the right tools, you can achieve it easily. But, is it worthy? Let’s find out.
In this article I will share my experience with you on how did I achieved that. It is really simple and complicated at the same time. Why? Because of how Lambda works and what you have to think of when you “Code” because that really makes a difference.
Why did I thought about it?
Few years ago, my manager told me to think about a processing architecture to process BIG volumes of records but not that heavy operations. Like 800k rows of data, with 16 columns, that the amount of work needs to be done over each row isn’t complicated. So, Event-Driven Architecture!!
I went through a lot of issues on how to deal with the limited resources in Lambda and how to deal with my records dropping because of timing out and OS errors. S3 was another story also to learn how to tune it for my use-case. A dear friend who works as senior consultant in AWS Bahrain helped me getting some tools in place to achieve this promising idea. It was one of the best experiences I ever got dealing with AWS resources.
Enough talking lets get some diagram in place..
Diagram looks scary? Trust me, it isn’t.
Let me break it down for you in some steps:
1- Initiate the process:
Because I’m adopting Serverless architecture, it means Event-Driven events that if something happened, things will act based on it and the results will trigger another actions, till the end of the process.
So, in our case is S3 Put request. When we upload the file into S3, it will put the file inside a bucket and when the file fully uploaded into it, Lambda will get triggered with S3 payload. Our first step just got finished. Whats next?
2- Clean data:
Because we got a csv file, some columns and rows could contain spaces, some special characters that MIGHT break your code. So, clean it up.
Cleaning these records will prepare it to be ready for insertion. But, since we have a lot of records and the function might fail, how to track what was added and what left over?
3- Add clean data into a queue:
We will add the cleaned records into a queue. The reason is to track what has been added and what did not got added. Basically, SQS will act as an organizer. It will send small batches into Lambda, Lambda will add it into DynamoDB, then it will return success message to SQS to remove it from the queue.
What will happen in case of failed record is SQS will retry 3 times, as per my configuration, to try insert operation. If these 3 tries failed, then it will move it into Dead Letter Queue (DLQ) which is another SQS queue that have the failed records. Then, you can debug why these records never make it into DynamoDB and can be processed again or even reject it.
Because we are trying to process massive chunks of data, we need some sort of Database that can handle the extreme load or records. DynamoDB solves this issue. There were a lot of experiments on how to handle the number of records and how to behave with the limited read/write throughput as write can handle 1kb data per unit. So, DynamoDB on-demand solve the issue.
As per AWS documentation, using on-demand DynamoDB throughput is the option when you cannot predict your workload. Because it will prepare the max throughput just in case its needed.
We moved the records from csv to DynamoDB, then what?
5- Stream records to SQS:
DynamoDB is really good event executer for Lambda. When you enable Stream, you can specify a Lambda function that react to your passed payloads from it. The good thing is you need to act based on type of record. We are dealing now with the newly added records. So, when we verify the tag, we add the record into another SQS.
The reason for this queue is the consistency of records delivery. You are adding the record once, capture it and add it into the queue so you can deal with it. If not, then you have to scan the table to get unprocessed records and process them. Why the hassle? Let the queue deal with it.
6- Process data:
We reached the latest stages of record lifecycle in this architecture. When it reaches the Process Queue, it pass the records in batches, process them, and then pass them to another queue. As I clarified earlier, for consistency purposes.
7- Update the processed record:
Finally, the record will be grabbed from the Finished queue and passed into the Lambda function that will update the record with the processed information. If records failed to be passed, DLQ will gather them for your further debugging and actions.
These points seems to be a straight-forward scenario, BUT, its not. Let me walk you through some problems and how it got solved.
1- Lambda Lambda Lambda:
Lambda is the key player here, we have limited time to execute the logic in your code. How can you insure the records were red from the file, cleaned, and added into the queue? Its hard but what you will need is speed. I code in Python and I used Multiprocessing library to speed things up. I used Multiprocessing Process function to use every single possible processing unit in Lambda. This action made my process to clean (in some tests) 558k transaction in 1:30 min! It was really fast. Again, it is not straight-forward scenario. Lambda can handle around 500 process when you allocate the max memory. Any other process will raise “OS Error 38: too many files open”. Why did I faced this issue? Because I joined all running processes and it was not closing finished processes. So, I run half the batch and loop over the running processes. If it finished, I force it to close. Problem solved…
2- Keep an eye on CloudWatch:
I made a big mistake by passing event variable into CloudWatch even when I was running the big batches. This resulted to write 6.6TB of data because of my many tests. The price of CloudWatch put log action is expensive so use it wisely.I made a big mistake by passing event variable into CloudWatch even when I was running the big batches. This resulted to write 6.6TB of data because of my many tests. The price of CloudWatch put log action is expensive so use it wisely.
3- DynamoDB on-demand is the keyword:
I started my configuration with making my DynamoDB provisioned, which is something for predicted workload. I made it 5 read and write throughput and that was an issue. I faced the issue that out of 558k records, only 1k was inserted into my table. I raised it into 100 throughput and still at least 60% out of the file get lost and not added! Then I reread the documentation and noticed my issue, DynamoDB on-demand is the solution for unpredicted load. I added all the 558k+ records within 5 min! Pretty FAST!
4- SQS can be tricky:
SQS is a great service and has a lot options and opportunities. But, you need to know what is the size of the batch you’re passing every time and what is the predicted time for your batch to finish. The reason for that is when you tell SQS to wait x seconds before making this batch available again, it might get processed multiple times. Know your code and your data, test test test and then configure it for heavy workload.
These points were my top concerns, S3 was interesting but not that complicated as I expected. But the main question, does it worth it?
Everything in this life depends on conditions, if you don’t want to manage instances or you want it with the minimal effort then yes, this scenario is valid for you. Keep in mind that debugging these use-cases can be tense because of how connected and how one mistake in one step can affect the coming steps.