Abstract
I'm currently working on a new (exciting 🤩) project. The project involves data storage💾 and analysis. One of the most important characteristics of this big-data project is data integrity.
Here are the hard facts:
- I'm using MongoDB as the data store
- ASP.NET Core as the web service framework
- .NET Core to collect and analyze the data
There are a few so-called collectors that harvest information from different sources. Those collectors are written in C# and send data to the so-called Raw.Api🌐. The Raw.Api handles the information and takes care of inserting it into the MongoDB collections. Since data integrity is crucial to this project, I divided the operations into three steps:
Transaction scopes
The first transaction scope is: a collector must never send incomplete information. As soon as the data is transferred to the API, all of it must be written into the database collections at once. Partial inserts are never allowed.
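To make that concrete, here is a minimal sketch of how the Raw.Api could enforce this with the MongoDB .NET driver. The class, database and collection names (`RawIngestService`, `rawdata`, `temporary`) are placeholders I made up, not final names:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public class RawIngestService
{
    private readonly IMongoClient _client;
    private readonly IMongoDatabase _db;

    public RawIngestService(IMongoClient client)
    {
        _client = client;
        _db = client.GetDatabase("rawdata"); // placeholder database name
    }

    public async Task IngestAsync(IReadOnlyCollection<BsonDocument> documents)
    {
        // Reject obviously incomplete payloads before touching the database.
        if (documents == null || documents.Count == 0)
            throw new ArgumentException("Collector sent an empty payload.");

        using var session = await _client.StartSessionAsync();
        session.StartTransaction();
        try
        {
            var temp = _db.GetCollection<BsonDocument>("temporary");

            // All documents of this transmission are written in one transaction:
            // either everything becomes visible, or nothing does.
            await temp.InsertManyAsync(session, documents);
            await session.CommitTransactionAsync();
        }
        catch
        {
            await session.AbortTransactionAsync();
            throw;
        }
    }
}
```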
The second transaction scope is: as soon as the data is dumped into a temporary collection, a background worker starts and writes the information to the final collection(s). This must again happen in a DB transaction; partial updates are not allowed. As far as I know, MongoDB supports multi-document ACID transactions across collections since 4.0 (and across shards since 4.2).
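A rough sketch of that background worker, again with the MongoDB .NET driver. The `batchId` field and the collection names are assumptions for illustration; `WithTransactionAsync` takes care of commit, abort and transient-error retries:

```csharp
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public class PromoteBatchWorker
{
    private readonly IMongoClient _client;

    public PromoteBatchWorker(IMongoClient client) => _client = client;

    public async Task PromoteAsync(string batchId)
    {
        var db = _client.GetDatabase("rawdata");
        var temp = db.GetCollection<BsonDocument>("temporary");
        var history = db.GetCollection<BsonDocument>("history");
        var filter = Builders<BsonDocument>.Filter.Eq("batchId", batchId);

        using var session = await _client.StartSessionAsync();

        // WithTransactionAsync commits on success and aborts on failure,
        // so a failure anywhere rolls back both collections.
        await session.WithTransactionAsync(async (s, ct) =>
        {
            var docs = await temp.Find(s, filter).ToListAsync(ct);
            if (docs.Count > 0)
                await history.InsertManyAsync(s, docs, cancellationToken: ct);
            await temp.DeleteManyAsync(s, filter, cancellationToken: ct);
            return true;
        });
    }
}
```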
The third transaction scope is: the analyzer must read all data at once and work on that snapshot. A partial refresh is not allowed, because the data might have been updated in the meantime, which would lead to inconsistent information.
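For the analyzer, a transaction with "snapshot" read concern should give exactly that: one consistent view of the data, unaffected by writes happening in parallel. A minimal sketch (collection and field names are again placeholders):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public class Analyzer
{
    private readonly IMongoClient _client;

    public Analyzer(IMongoClient client) => _client = client;

    public async Task<List<BsonDocument>> ReadConsistentSnapshotAsync(string dataScope)
    {
        var db = _client.GetDatabase("rawdata");
        var current = db.GetCollection<BsonDocument>("current");
        var filter = Builders<BsonDocument>.Filter.Eq("dataScope", dataScope);

        using var session = await _client.StartSessionAsync();
        var options = new TransactionOptions(
            readConcern: ReadConcern.Snapshot,
            writeConcern: WriteConcern.WMajority);

        // Everything read inside this transaction comes from one point-in-time
        // snapshot, so a concurrent write cannot produce a partial refresh.
        return await session.WithTransactionAsync(
            async (s, ct) => await current.Find(s, filter).ToListAsync(ct),
            options);
    }
}
```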
Handling the data⚙️
There are multiple collectors, and every collector sends information periodically, for example every 5 minutes. One transmission consists of approximately 40 MB. 40 MB × 12 transmissions per hour × 24 hours × 365 days ≈ 4 TB per year. That's too much data📚. Therefore every dataset has a specific Data Save Interval, for example daily or hourly.
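The Data Save Interval can be expressed as a bucket key derived from the collection timestamp. A small sketch of what I have in mind (the enum and the truncation rules are just an illustration, not a final design):

```csharp
using System;

// Two transmissions with the same bucket key replace each other;
// different keys are kept as separate history entries.
public enum DataSaveInterval { Hourly, Daily }

public static class IntervalKey
{
    public static DateTime BucketOf(DateTime collectedAtUtc, DataSaveInterval interval) =>
        interval switch
        {
            DataSaveInterval.Hourly => new DateTime(collectedAtUtc.Year, collectedAtUtc.Month,
                collectedAtUtc.Day, collectedAtUtc.Hour, 0, 0, DateTimeKind.Utc),
            DataSaveInterval.Daily => collectedAtUtc.Date,
            _ => throw new ArgumentOutOfRangeException(nameof(interval))
        };
}

// Example: 09:00 and 17:30 on the same day map to the same daily bucket,
// so the later transmission replaces the earlier one.
```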
Every new request replaces all the data from the previous request, as long as both requests fall within the same Data Save Interval.
Every document is stored with a "Data Scope" and a collection timestamp. A "Data Scope" is a set of data that can be uniquely identified, for example a domain (www.google.com). The first step of writing information from the temporary collection to the History collection is to delete every document with the same "Data Scope" and in the same "Data Save Interval". This guarantees that if the collector transmitted less information than before (because someone deleted data, which is an allowed scenario), the "old" information is removed. Then every document is inserted into the History collection. The same happens for a Current collection, so that access to the current information stays fast.
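Put together, the delete-then-insert step could look roughly like this, assuming each document carries `dataScope` and `intervalKey` fields (my naming) and that History and Current are updated inside the same transaction:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public class HistoryWriter
{
    private readonly IMongoClient _client;

    public HistoryWriter(IMongoClient client) => _client = client;

    public async Task ReplaceIntervalAsync(
        string dataScope, DateTime intervalKey, IReadOnlyList<BsonDocument> documents)
    {
        var db = _client.GetDatabase("rawdata");
        var history = db.GetCollection<BsonDocument>("history");
        var current = db.GetCollection<BsonDocument>("current");

        var sameBucket = Builders<BsonDocument>.Filter.Eq("dataScope", dataScope) &
                         Builders<BsonDocument>.Filter.Eq("intervalKey", intervalKey);
        var sameScope = Builders<BsonDocument>.Filter.Eq("dataScope", dataScope);

        using var session = await _client.StartSessionAsync();
        await session.WithTransactionAsync(async (s, ct) =>
        {
            // Delete everything from the same Data Scope + Data Save Interval,
            // so data deleted at the source disappears here as well.
            await history.DeleteManyAsync(s, sameBucket, cancellationToken: ct);
            await history.InsertManyAsync(s, documents, cancellationToken: ct);

            // The Current collection only ever holds the latest state of a scope.
            await current.DeleteManyAsync(s, sameScope, cancellationToken: ct);
            await current.InsertManyAsync(s, documents, cancellationToken: ct);
            return true;
        });
    }
}
```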
Analysis
The analysis is rule-based, and I'm not sure how to implement that yet. Is there a good tool or programming language to achieve that?
Any ideas or suggestions for this project? I'm happy to discuss 😜
Top comments (4)
It sounds like you're trying to shoehorn an entire data pipeline into MongoDB alone. Have you looked into using tools like Kafka and/or Spark to queue and process batches without inviting a ton of database churn?
You have a point there. I thought about using RabbitMQ, which is similar. That way the Raw.Api would publish an event containing the data that is to be written to the DB. But I'm a bit afraid that this might complicate the solution a lot?
Disclaimer: I'm not a big data expert! From my understanding, Apache Spark is an analytics engine that helps you process large datasets in "real time". Processing happens in memory, so it is fast. As your data volume grows, you can add more nodes to your Spark cluster to distribute the workload. Lucky for you, there is a connector for MongoDB and also a .NET SDK.
It's a complicated problem! You should be equally wary of simple solutions that try to do too much with too little.