DEV Community

Discussion on: What are the most common tools for data pre-calculation and aggregation?

Adrian B.G. • Edited

I am familiar with the technologies, but I have not used them yet. Because your requirements are very vague, I will list the most popular Apache solutions (there are other alternatives).

Even Google, which created MapReduce, does not use it anymore; they built a new, more flexible framework that is now under the Apache umbrella (Beam): beam.apache.org/

So just a quick overview:

  • to move the data from your data lake to the processing units and back: Apache NiFi or Apache Airflow, perhaps with Kafka along the way, if needed
    These tools also allow data enrichment!

  • to process your data: Beam, Flink (they both support batch + streaming), or Spark (especially if you have any ML algorithms). If it is text-based, you may need something built on Lucene (Solr or Elasticsearch).

Managed solutions would be BigQuery/Bigtable, managed Spark, and more: cloud.google.com/products/big-data/
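
To make the "process your data" step a bit more concrete, here is a minimal sketch of what a pre-aggregation job boils down to: pick a key for each record, then combine values per key. This is plain Python, not Beam, Flink, or Spark code, and the event fields (`day`, `product`, `amount`), the sample values, and the `pre_aggregate` name are all made up for illustration. The engines above do the same kind of thing, just distributed and fault tolerant.

```python
from collections import defaultdict
from typing import Iterable

# Hypothetical raw events; field names and values are invented for this sketch.
EVENTS = [
    {"day": "2019-01-01", "product": "a", "amount": 10.0},
    {"day": "2019-01-01", "product": "a", "amount": 5.0},
    {"day": "2019-01-01", "product": "b", "amount": 7.5},
    {"day": "2019-01-02", "product": "a", "amount": 2.5},
]


def pre_aggregate(events: Iterable[dict]) -> dict:
    """Group events by (day, product) and keep a count and a sum per group."""
    totals = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for event in events:
        key = (event["day"], event["product"])  # "map" step: choose the key
        totals[key]["count"] += 1               # "reduce" step: combine per key
        totals[key]["sum"] += event["amount"]
    return dict(totals)


if __name__ == "__main__":
    # Print one pre-calculated row per (day, product) pair.
    for key, stats in sorted(pre_aggregate(EVENTS).items()):
        print(key, stats)
```

The resulting dictionary is the kind of pre-calculated table you would write back to storage so that queries read a few aggregated rows instead of scanning every raw event.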

Evaldas Buinauskas • Edited

Thanks!

The requirements are vague because I just didn't want to go into too much detail.

I haven't heard about Apache Beam yet, but this looks quite interesting. Will definitely look at it!