This post is about big data versus a conventional database.
I recently finished a hackathon (www.spaceapps.cl) and my project was simple: crawl information about the weather and process it visually. The system worked. The hackathon lasted two days, so everything was rushed.
I didn't win, but I am not mad. However, one of the judges lost her arms :-)
Technically, crawling a site can be illegal, so I can't share the library that I used, which is a shame because it works really well.
Here is part of the code:

->enterLevel('<A HREF="http://www.nws.noaa.gov/dm-cgi-bin/nsd_lookup.pl?station=','"',false,false)
->if()
    ->set('myid','@_value@')
    ->showmessage('@myid@')
    ->object('myrow','stationid','@_value@','add')
->else()
    ->showmessage('exit')
    ->break()
->endif()
->exitLevel()
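Since I can't share the library itself, here is a rough Python sketch of what (I assume) that snippet does: scan the page for links whose HREF starts with the NOAA station-lookup URL, extract the station id, and collect it. The sample HTML and the `myrow` list are illustrative stand-ins, not the real crawl.

```python
import re

# Links of interest start with this NOAA station-lookup URL (from the snippet above).
PREFIX = "http://www.nws.noaa.gov/dm-cgi-bin/nsd_lookup.pl?station="

# Stand-in HTML; the real code crawled the live page.
sample_html = '''
<A HREF="http://www.nws.noaa.gov/dm-cgi-bin/nsd_lookup.pl?station=SCEL">Santiago</A>
<A HREF="http://www.nws.noaa.gov/dm-cgi-bin/nsd_lookup.pl?station=KJFK">New York</A>
<A HREF="http://example.com/other">unrelated</A>
'''

myrow = []
for match in re.finditer(r'<A HREF="([^"]+)"', sample_html):
    href = match.group(1)
    if href.startswith(PREFIX):       # like ->enterLevel(...) / ->if()
        myid = href[len(PREFIX):]     # like ->set('myid','@_value@')
        print(myid)                   # like ->showmessage('@myid@')
        myrow.append(myid)            # like ->object('myrow','stationid',...)
    # non-matching links are skipped (the DSL prints 'exit' and breaks)

print(myrow)
```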
I love using databases. They are ideal for data analysis.
However, when I tried to insert a large amount of information into the database, it was impossible to do in a timely manner: it took around 3 hours to store 5% of the whole dataset (and the hackathon lasted 48 hours).
So, I decided to change strategy: the FILE SYSTEM. And, surprise: it did the job in 5 minutes.
Why? It's simple. Every time we insert a value into a database, the database does a lot of work: updating the index, adding entries to the redo log, reserving space in the tablespace, and finally inserting the value. Rinse and repeat a million times. Even if we don't use an index, the overhead is huge. The file system, by contrast, is simple: it stores the information as-is (and only once). The only bottleneck is the hard disk.
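You can see the effect with a minimal sketch. Here I use SQLite as a stand-in for the database from the hackathon (which I haven't named), committing row by row the way naive insert loops do, and compare it against appending the same rows to a plain file:

```python
import os
import sqlite3
import tempfile
import time

N = 1000
rows = [(i, f"station-{i}", 20.5) for i in range(N)]
workdir = tempfile.mkdtemp()

# Row-by-row inserts: every commit forces the database to update its
# internal structures (pages, journal) before returning.
db_path = os.path.join(workdir, "weather.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE obs (id INTEGER, station TEXT, temp REAL)")
t0 = time.perf_counter()
for row in rows:
    conn.execute("INSERT INTO obs VALUES (?, ?, ?)", row)
    conn.commit()  # one transaction per row, as naive loops do
db_time = time.perf_counter() - t0
conn.close()

# Plain file append: the data is written as-is, once.
txt_path = os.path.join(workdir, "weather.txt")
t0 = time.perf_counter()
with open(txt_path, "w") as f:
    for i, station, temp in rows:
        f.write(f"{i},{station},{temp}\n")
file_time = time.perf_counter() - t0

print(f"db: {db_time:.3f}s  file: {file_time:.3f}s")
```

Batching all the inserts into one transaction narrows the gap considerably, but under time pressure the file append is the simpler win.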
I could have done it with MongoDB, but for this job the file system is faster even than MongoDB. Also, MongoDB adds a new level of complexity.
Finally, I compressed all the raw information and stored only a consolidated summary in the database, and the system works decently.
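The final pipeline can be sketched like this, under assumed names and a made-up data format ("station,temp" lines): the bulk data is gzip-compressed on the file system, and only a small per-station summary goes into the database.

```python
import gzip
import os
import sqlite3
import statistics
import tempfile

workdir = tempfile.mkdtemp()

# Hypothetical raw crawl output: one "station,temp" line per observation.
raw = "\n".join(f"station-{i % 3},{20 + i % 5}" for i in range(30)) + "\n"
raw_path = os.path.join(workdir, "weather_raw.txt")
with open(raw_path, "w") as f:
    f.write(raw)

# 1) Compress the raw data; the file system keeps the bulk.
gz_path = raw_path + ".gz"
with open(raw_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
    dst.write(src.read())
os.remove(raw_path)

# 2) Store only a consolidated summary (average per station) in the DB.
per_station = {}
with gzip.open(gz_path, "rt") as f:
    for line in f:
        station, temp = line.strip().split(",")
        per_station.setdefault(station, []).append(float(temp))

conn = sqlite3.connect(os.path.join(workdir, "summary.db"))
conn.execute("CREATE TABLE summary (station TEXT, avg_temp REAL)")
for station, temps in sorted(per_station.items()):
    conn.execute("INSERT INTO summary VALUES (?, ?)",
                 (station, statistics.mean(temps)))
conn.commit()

rows = conn.execute("SELECT station, avg_temp FROM summary").fetchall()
print(rows)
```

The database now holds three small rows instead of thirty raw ones, so queries stay fast and the insert cost disappears.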