DEV Community

How would you approach a big data query (many TBs of data) with non-big-data solutions?

Periklis Gkolias ・1 min read

Discussion

Christopher McClellan

Like this.
adamdrake.com/command-line-tools-c...

Or possibly, with a language like Elixir or F# that has great support for streaming data.

load("mediumData.txt")
|> filter(somePredicate)
|> map(someTransform)
|> filter(otherPredicate)
|> reduce(aggregator)

The trick is to never resolve the stream until it's absolutely necessary. You filter away as much of the data as possible and process only the entities you need to. Hopefully the final aggregation fits into memory; if not, you spill to disk and aggregate in chunks (which is essentially what Hadoop does anyway).
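For anyone without Elixir or F# at hand, here is a minimal sketch of the same lazy pipeline using Python generators. The file layout (`country,amount` per line), the function names, and the query itself are assumptions made up for illustration; they are not from the linked article.

```python
from functools import reduce
from typing import Iterable, Iterator


def load(path: str) -> Iterator[str]:
    """Yield lines one at a time; the file is never read into memory whole."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def total_large_amounts(lines: Iterable[str], country: str, minimum: float) -> float:
    """Hypothetical query: sum the amounts >= minimum for one country.

    Assumes each line looks like "country,amount".
    """
    rows = (line.split(",") for line in lines)                    # map: parse
    matching = (row for row in rows if row[0] == country)         # filter early
    amounts = (float(row[1]) for row in matching)                 # map
    large = (amount for amount in amounts if amount >= minimum)   # filter
    # Nothing has been processed yet: reduce pulls one record at a time
    # through the whole chain, so memory use stays constant.
    return reduce(lambda acc, amount: acc + amount, large, 0.0)
```

Something like `total_large_amounts(load("mediumData.txt"), "GR", 5.0)` would then scan the file once and keep only the running total in memory.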

Barend Bootha

You'd still use big data ideas. Map over the data, reduce the dataset, and repeat. In the end you're after an aggregate from that dataset, right?

You'd need to segment your dataset into many smaller parts. Your MapReduce program can then be spawned many times to process many segments at once.
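A rough sketch of that segment-and-spawn idea on a single machine, using Python's standard `concurrent.futures`. The word-count job and the names `segment`, `map_segment`, and `word_count` are hypothetical, chosen only to keep the example small.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Iterable, Iterator, List


def segment(records: Iterable[str], size: int) -> Iterator[List[str]]:
    """Split a stream of records into lists of at most `size` items."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def map_segment(chunk: List[str]) -> Counter:
    """Map step (hypothetical job): count words within one segment."""
    counts: Counter = Counter()
    for record in chunk:
        counts.update(record.split())
    return counts


def word_count(records: Iterable[str], size: int = 1000, workers: int = 4) -> Counter:
    """Run the map step on many segments at once, then reduce the partials."""
    total: Counter = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Threads keep the sketch simple; ProcessPoolExecutor offers the same
        # map() interface and is the usual choice for CPU-bound mappers.
        for partial in pool.map(map_segment, segment(records, size)):
            total.update(partial)  # reduce step: merge partial counts
    return total
```

One caveat: `Executor.map` collects its input eagerly by default, so for a dataset that really is many TBs you would hand the workers file paths or byte ranges rather than in-memory chunks, and let each worker stream its own segment.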

The output of that pass might not be the final result, so you'd need to repeat the process, possibly with different logic in your MapReduce. Basically, you iterate until you have the final result.
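To make the "repeat with different logic" part concrete, here is a small sketch of two chained rounds. The job (a histogram of visits per user), the `run_round` helper, and the in-memory grouping are all assumptions for illustration; a real job over TBs would write each round's output to disk between rounds.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[object, object]


def run_round(
    records: Iterable[object],
    mapper: Callable[[object], Iterable[Pair]],
    reducer: Callable[[List[object]], object],
) -> List[Pair]:
    """One round: map records to (key, value) pairs, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [(key, reducer(values)) for key, values in groups.items()]


def visit_histogram(visits: Iterable[str]) -> dict:
    """Hypothetical two-round job: how many users made exactly N visits?"""
    # Round 1: one record per visit -> (user, number of visits).
    per_user = run_round(visits, lambda user: [(user, 1)], sum)
    # Round 2: different logic over round 1's output ->
    # (number of visits, number of users with that many visits).
    per_count = run_round(per_user, lambda pair: [(pair[1], 1)], sum)
    return dict(per_count)
```

Round 1's output is not the answer yet; it only becomes the answer after a second pass that groups by a different key.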

If you've ever dissected a query plan in MS SQL Server or another major SQL database, you will have noticed that a simple SELECT with a JOIN is actually made up of many tiny programs (operators such as scans, joins, and sorts) that together assemble the result. It's all hidden behind the higher-level Structured Query Language.

The same principle applies to your big data problem. However, you'd never be able to walk over the entire dataset in a single go: divide and conquer.