I'm currently working on a new (exciting 🤩) project. The project covers data storage 💾 and analysis. One of the most important characteristics of this big data project is data integrity.
Here are the hard facts:
- I'm using MongoDB as data storage
- ASP.NET Core as the web service framework
- .NET Core to analyze/collect the data
There are a few so-called collectors that harvest information from different sources. Those collectors are written in C# and send data to the so-called Raw.Api 🌐. The Raw.Api handles the information and takes care of the inserts into the MongoDB collections. Since data integrity is crucial to this project, I divided the operations into three transaction scopes:
The first transaction scope is: a collector must never send incomplete information. As soon as the data is transferred to the API, all of it must be written into the database collections at once. Partial inserts are never allowed.
The second transaction scope is: as soon as the data is dumped into a temporary collection, a background worker starts and writes the information to the final collection(s). This must again happen in a DB transaction. Partial updates are not allowed. As far as I know, MongoDB has supported multi-document ACID transactions across multiple collections on replica sets since 4.0 and on sharded clusters since 4.2.
The third transaction scope is: the analyzer must read all data at once and work on the information. A partial refresh is not allowed because the data might have been updated in the meantime and this would lead to inconsistent information.
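For the first transaction scope, one possible way to let the Raw.Api detect an incomplete transfer before it writes anything is to wrap each batch in an envelope that carries a document count and a checksum. The Python sketch below is only an illustration under that assumption: the envelope fields (`count`, `sha256`, `documents`) and the function names are hypothetical and not part of the project described here.

```python
import hashlib
import json


def _digest(documents):
    # Canonical JSON (sorted keys) so sender and receiver hash identical bytes.
    payload = json.dumps(documents, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build_envelope(scope, documents):
    """Collector side: wrap a batch with its size and checksum (hypothetical format)."""
    return {
        "scope": scope,
        "count": len(documents),
        "sha256": _digest(documents),
        "documents": documents,
    }


def is_complete(envelope):
    """API side: accept the batch only if count and checksum both match."""
    documents = envelope.get("documents")
    if not isinstance(documents, list):
        return False
    if envelope.get("count") != len(documents):
        return False
    return envelope.get("sha256") == _digest(documents)
```

A batch that fails `is_complete` would be rejected as a whole, so nothing from it reaches the temporary collection.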
There are multiple collectors, and each one sends information at regular intervals, for example every 5 minutes. One transaction of data is approximately 40 MB. 40 MB × 12 transactions per hour × 24 hours × 365 days ≈ 4.2 TB per year for a single collector. That's too much data 📚. Therefore every dataset has a specific Data Save Interval, for example daily or hourly.
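The storage estimate can be checked with a few lines of Python. This is a rough sketch, assuming decimal units (1 TB = 1,000,000 MB), a single collector, and that every kept snapshot is still about 40 MB; the function names are made up for illustration.

```python
MB_PER_TB = 1_000_000  # decimal units, an assumption


def yearly_volume_tb(mb_per_transaction, transactions_per_hour, collectors=1):
    """Volume per year if every transaction is kept."""
    mb_per_year = mb_per_transaction * transactions_per_hour * 24 * 365 * collectors
    return mb_per_year / MB_PER_TB


def yearly_volume_with_interval_tb(mb_per_transaction, snapshots_per_day, collectors=1):
    """Volume per year if only one snapshot per Data Save Interval is kept."""
    mb_per_year = mb_per_transaction * snapshots_per_day * 365 * collectors
    return mb_per_year / MB_PER_TB


keep_everything = yearly_volume_tb(40, 60 // 5)        # one transaction every 5 minutes
hourly = yearly_volume_with_interval_tb(40, 24)        # hourly Data Save Interval
daily = yearly_volume_with_interval_tb(40, 1)          # daily Data Save Interval
```

Under these assumptions, keeping everything comes to roughly 4.2 TB per year, an hourly interval to roughly 0.35 TB, and a daily interval to roughly 0.015 TB.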
Every new request replaces all the data from the previous request as long as the request is within the same Data Save Interval.
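Whether two requests fall "within the same Data Save Interval" can be decided by truncating each collection time stamp to the start of its interval and comparing the results. The Python sketch below assumes UTC time stamps and only the two interval names mentioned above; `interval_start` is a hypothetical helper, not an existing API.

```python
from datetime import datetime, timezone


def interval_start(timestamp, interval):
    """Truncate a timezone-aware time stamp to the start of its Data Save Interval."""
    if timestamp.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    ts = timestamp.astimezone(timezone.utc)
    if interval == "hourly":
        return ts.replace(minute=0, second=0, microsecond=0)
    if interval == "daily":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unknown interval: {interval}")


def same_interval(a, b, interval):
    """True if the request at `b` should replace the data collected at `a`."""
    return interval_start(a, interval) == interval_start(b, interval)
```

With an hourly interval, requests at 10:05 and 10:55 share a bucket and the later one replaces the earlier one, while a request at 11:00 starts a new bucket.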
Every document is stored with a "Data Scope" and a collection date time stamp. A "Data Scope" is a set of data that can be uniquely identified, for example a domain (www.google.com). The first step of writing information from the temporary collection to the History collection is to delete every document in the same "Data Scope" and the same "Data Save Interval". This guarantees that the "old" information is removed if the collector transmitted less information than before (because someone deleted data, which is an allowed scenario). Then every document is inserted into the History collection. The same happens for a Current collection, so that access to the current information is faster.
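The delete-then-insert step can be sketched in memory to make the all-or-nothing behaviour explicit. This Python sketch uses plain lists in place of MongoDB collections and swaps in a staged copy only if every step succeeds; in the real system a database transaction would play that role. The field names `scope` and `bucket` are hypothetical, and treating the Current collection as "only the latest documents per scope" is my assumption.

```python
from copy import deepcopy


def replace_scope(store, scope, bucket, new_docs):
    """Return a new store with (scope, bucket) replaced by new_docs; `store` is not modified."""
    # Validate first so a bad batch never produces a half-applied result.
    for doc in new_docs:
        if doc.get("scope") != scope or doc.get("bucket") != bucket:
            raise ValueError("document does not belong to this scope/interval")

    staged = deepcopy(store)

    # History: drop everything in the same Data Scope and Data Save Interval, then insert.
    staged["history"] = [
        d for d in staged["history"]
        if not (d["scope"] == scope and d["bucket"] == bucket)
    ]
    staged["history"].extend(deepcopy(new_docs))

    # Current (assumption): holds only the latest documents for each scope.
    staged["current"] = [d for d in staged["current"] if d["scope"] != scope]
    staged["current"].extend(deepcopy(new_docs))

    return staged
```

The caller replaces its reference to the old store with the returned one, for example `store = replace_scope(store, scope, bucket, docs)`. If validation raises, the old store is still fully intact, which mirrors an aborted transaction.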
The analysis is rule-based, and I'm not yet sure how to implement that. Is there a great tool or programming language to achieve that?
Any ideas or suggestions for this project? I'm happy to discuss 😜