Okay, so 5 to 6 million isn't that much.

Solution 1:

Split your CSV into 6 smaller ones so that they hold about 1 million rows each.
Create an Athena table over the CSV data in S3. In your code you can then query all of the data with SQL.
Create a Lambda that gets triggered when a CSV file is PUT into the bucket. Use an SQS queue to notify the Lambda, with batch size set to 1 and concurrency set to 1. If you're using NodeJS, look at the fast-csv package; you can then stream the CSV and read it in chunks.
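A minimal sketch of that Lambda in NodeJS (AWS SDK v2 style), assuming the SQS message body is a standard S3 PUT event notification; the row handling is left as a stub:

```js
const AWS = require('aws-sdk');
const csv = require('fast-csv');

const s3 = new AWS.S3();

exports.handler = async (event) => {
  // Batch size is 1, so there is exactly one SQS record wrapping the S3 event.
  const s3Event = JSON.parse(event.Records[0].body);
  const { bucket, object } = s3Event.Records[0].s3;

  const stream = s3
    .getObject({
      Bucket: bucket.name,
      Key: decodeURIComponent(object.key.replace(/\+/g, ' ')),
    })
    .createReadStream();

  // Stream the CSV row by row instead of loading the whole file into memory.
  await new Promise((resolve, reject) => {
    stream
      .pipe(csv.parse({ headers: true }))
      .on('data', (row) => {
        // Each row arrives as a plain object keyed by the CSV headers;
        // collect rows here and process them in chunks.
      })
      .on('error', reject)
      .on('end', resolve);
  });
};
```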
Query Athena to find all the existing data; this is more efficient than doing a full scan on DynamoDB. Stream these rows and compute a hash key from the fields that determine whether a row is unique. Store these hashes in memory (a Set works well for lookups).
Then read the new CSV data as a stream as well and compute the same hash for each row. If the hash is already present, there's no need to write; if not, insert the record.
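A rough sketch of the hashing step; the column names (email, accountId) are made up, so swap in whichever fields define uniqueness for your data:

```js
const crypto = require('crypto');

// Hash the fields that define uniqueness. 'email' and 'accountId' are
// illustrative column names -- use whichever fields make a row unique for you.
const rowHash = (row) =>
  crypto.createHash('md5').update(`${row.email}|${row.accountId}`).digest('hex');

// Build the lookup once from the Athena results.
const buildExistingSet = (athenaRows) => new Set(athenaRows.map(rowHash));

// Then, while streaming the new CSV, only rows whose hash is unseen get written.
const isNewRow = (row, existing) => !existing.has(rowHash(row));
```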
Use DynamoDB's BatchWrite API to write in batches of 25. Do this concurrently (in parallel) within the same function. You can easily hit 2k to 4k inserts per second if the table is set to on-demand pricing, even on a cold table. And as mentioned before, each file holds at most 1 million records, so the insert into DynamoDB takes about 1,000,000 / 4,000 = 250 seconds (the max Lambda run time is 900 seconds), so you're safe.
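A sketch of the parallel batch writing, assuming writeAll receives the already-deduped rows; the table name and the concurrency of 10 are illustrative, so tune them to your throughput:

```js
const AWS = require('aws-sdk');

const dynamo = new AWS.DynamoDB.DocumentClient();
const TABLE = 'my-table'; // illustrative table name

// Write one batch of up to 25 items, retrying any unprocessed items.
async function writeBatch(items) {
  let request = { [TABLE]: items.map((Item) => ({ PutRequest: { Item } })) };
  do {
    const res = await dynamo.batchWrite({ RequestItems: request }).promise();
    request = res.UnprocessedItems;
  } while (request && Object.keys(request).length > 0);
}

// Split rows into batches of 25 and keep several batches in flight at once.
async function writeAll(rows, concurrency = 10) {
  const batches = [];
  for (let i = 0; i < rows.length; i += 25) batches.push(rows.slice(i, i + 25));
  while (batches.length > 0) {
    await Promise.all(batches.splice(0, concurrency).map(writeBatch));
  }
}
```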
Finally, upload the 6 raw CSV files to S3 to kick everything off.
Solution 2:
Same principle as before, but instead of using Athena you could do a full table scan on your DynamoDB table. It's a bit more expensive and won't be as fast as Athena for reading all the data in the deduping step.
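A quick sketch of that full scan, paginating with the DocumentClient (the caller passes in the table name):

```js
const AWS = require('aws-sdk');

const dynamo = new AWS.DynamoDB.DocumentClient();

// Paginate through the whole table; each Scan page returns at most 1 MB.
async function scanAll(tableName) {
  const items = [];
  let ExclusiveStartKey;
  do {
    const page = await dynamo
      .scan({ TableName: tableName, ExclusiveStartKey })
      .promise();
    items.push(...page.Items);
    ExclusiveStartKey = page.LastEvaluatedKey;
  } while (ExclusiveStartKey);
  return items;
}
```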
Something along those lines.
Do you have a working example of this? I am currently trying the same thing.
I will DM you; just follow me so that we are connected. I don't have an example ready, but I quickly found stackoverflow.com/a/33755463 with a quick search. Just change the async.series to async.parallelLimit with a limit of 25, so it will do 25 × 25 = 625 concurrent writes.
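Roughly what that change to the Stack Overflow answer looks like; 'batches' is assumed to already be an array of up to-25-item PutRequest groups, and the table name is illustrative:

```js
const AWS = require('aws-sdk');
const async = require('async');

const dynamo = new AWS.DynamoDB.DocumentClient();
const TABLE = 'my-table'; // illustrative table name

// 'batches' is an array of arrays, each holding up to 25 PutRequest entries.
function writeBatches(batches, callback) {
  const tasks = batches.map((batch) => (done) =>
    dynamo.batchWrite({ RequestItems: { [TABLE]: batch } }, done)
  );
  // parallelLimit keeps 25 batch writes in flight: 25 * 25 = 625 items at once.
  async.parallelLimit(tasks, 25, callback);
}
```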
I wrote a post on how to do this for you/others -> rehanvdm.com/serverless/dynamodb-i...