Evaldas Buinauskas

Posted on

What are the most common tools for data pre-calculation and aggregation?

The company I work at does data research and scraping; the data is later aggregated and published to our clients. We also denormalize data to provide faster lookups in web applications.

Until now, we used mechanisms within SQL Server to do these aggregations. But recently this has become a bottleneck: the processes take too long to execute and overlap with business hours.

What other tools does the market use to perform aggregations and pre-calculations outside of a relational database? (A rough sketch of the kind of job I mean follows the list below.) My discoveries so far include:

  • Apache Hadoop MapReduce
  • Apache Pig
  • Apache Spark
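For context, here is a hypothetical sketch of the kind of nightly rollup we would want to move out of SQL Server, written in PySpark as one candidate. The table, columns, JDBC URL, and output path are all made up:

```python
# Hypothetical sketch: offloading a nightly aggregation from SQL Server to Spark.
# The table/column names, JDBC URL, and output path are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nightly-preaggregation").getOrCreate()

# Read the source table over JDBC (the SQL Server driver jar must be on the classpath).
sales = (spark.read.format("jdbc")
         .option("url", "jdbc:sqlserver://db-host;databaseName=research")  # assumed
         .option("dbtable", "dbo.sales")                                   # assumed
         .load())

# Pre-compute the denormalized rollup outside the database.
rollup = (sales.groupBy("region", "product_id")
          .agg(F.sum("amount").alias("total_amount"),
               F.count("*").alias("order_count")))

# Persist the result where the web application can read it quickly.
rollup.write.mode("overwrite").parquet("/data/rollups/sales_by_region")
```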

Top comments (2)

Adrian B.G. • Edited

I am familiar with these technologies, but I have not used them all myself. Since your requirements are fairly vague, I will list the most popular Apache solutions (there are other alternatives).

Even Google (MapReduce's creator) no longer uses it; they built a newer, more flexible framework that now lives under the Apache umbrella as Beam: beam.apache.org/
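As a taste of the model, here is a minimal, hypothetical Beam pipeline in Python doing a per-key rollup. The input/output paths and CSV layout are assumptions; the appeal is that the same code runs in batch or streaming mode on several runners (Direct, Flink, Spark, Dataflow):

```python
# Hypothetical sketch of a pre-aggregation as a Beam pipeline.
# File paths and the CSV layout (region,amount) are made up.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (pipeline
     | "Read" >> beam.io.ReadFromText("/data/raw/sales.csv")          # assumed path
     | "Parse" >> beam.Map(lambda line: line.split(","))
     | "KeyByRegion" >> beam.Map(lambda row: (row[0], float(row[1])))
     | "SumPerRegion" >> beam.CombinePerKey(sum)
     | "Format" >> beam.MapTuple(lambda region, total: f"{region},{total}")
     | "Write" >> beam.io.WriteToText("/data/rollups/sales_by_region"))
```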

So, a quick overview:

  • to move the data from your data lake to the processing units and back: Apache NiFi or Apache Airflow, perhaps with Kafka along the way if needed. These tools also allow data enrichment! (A minimal DAG sketch follows this list.)

  • to process your data: Beam or Flink (both support batch + streaming), or Spark (especially if you have any ML algorithms). If your data is text-based, you may need something built on Lucene (Solr or Elasticsearch).
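For example, a minimal hypothetical Airflow DAG for this kind of nightly job might look like the following, assuming Airflow 2.x; the task bodies, names, and schedule are placeholders:

```python
# Hypothetical sketch: an Airflow DAG that extracts from the OLTP database,
# runs the pre-calculation, and publishes the result. All names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():    # pull raw rows out of the source database (placeholder)
    ...

def aggregate():  # run the pre-calculation, e.g. submit a Spark job (placeholder)
    ...

def publish():    # write the denormalized result for the web app (placeholder)
    ...

with DAG(
    dag_id="nightly_rollup",
    start_date=datetime(2018, 1, 1),
    schedule_interval="0 2 * * *",  # run at 02:00, outside business hours
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="aggregate", python_callable=aggregate)
    t3 = PythonOperator(task_id="publish", python_callable=publish)
    t1 >> t2 >> t3  # run the steps in order
```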

Managed solutions include BigQuery/Bigtable, managed Spark, and more: cloud.google.com/products/big-data/

Evaldas Buinauskas • Edited

Thanks!

The requirements are vague because I just didn't want to go into too much detail.

I hadn't heard of Apache Beam before, but it looks quite interesting. I will definitely look into it!