Hi everyone,
I need some advice. I need to load a huge amount of data (about 50 million rows) from Oracle to HBase.
At the moment there is a job written in Talend (an ETL tool) that reads the data from a CSV file and loads it into HBase:
Oracle -> CSV File -> Talend Job -> HBase Database
Could I get better load performance by connecting directly to the Oracle database?
Is reading from an Oracle table faster than reading from a CSV file?
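For context, here is a rough sketch of the direct path I have in mind, in plain Java with the JDBC and HBase client APIs. The connection string, table, and column names are placeholders, not my real schema:

```java
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class OracleToHBase {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (java.sql.Connection oracle = DriverManager.getConnection(
                 "jdbc:oracle:thin:@dbhost:1521/ORCL", "user", "pass");
             org.apache.hadoop.hbase.client.Connection hbase =
                 ConnectionFactory.createConnection(conf);
             BufferedMutator mutator =
                 hbase.getBufferedMutator(TableName.valueOf("my_table"))) {

            Statement stmt = oracle.createStatement();
            stmt.setFetchSize(10_000);  // stream rows instead of pulling all 50 M at once
            ResultSet rs = stmt.executeQuery("SELECT id, name, amount FROM source_table");

            while (rs.next()) {
                Put put = new Put(Bytes.toBytes(rs.getLong("id")));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"),
                              Bytes.toBytes(rs.getString("name")));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("amount"),
                              Bytes.toBytes(rs.getBigDecimal("amount").doubleValue()));
                mutator.mutate(put);  // buffered client-side, flushed to HBase in batches
            }
        }
    }
}
```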
Thanks,
Daniel
Top comments (2)
Most likely, since you are already connecting to the Oracle DB to generate the CSV file anyway. By reading directly from Oracle you skip the CSV generation and parsing step. That step not only takes time but is also error-prone: all schema information is lost and every value is converted to a String.
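To make the type point concrete, compare the two read paths below. This is just an illustrative sketch; the column name `CREATED_AT` and field layout are invented:

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;

class TypedRead {
    // CSV path: every field comes back as a String and must be re-parsed by hand.
    static Timestamp fromCsv(String csvLine) {
        String[] fields = csvLine.split(",");
        return Timestamp.valueOf(fields[2]);  // fragile: depends on the dump's date format
    }

    // JDBC path: the driver returns a typed value using Oracle's own column metadata.
    static Timestamp fromJdbc(ResultSet rs) throws SQLException {
        return rs.getTimestamp("CREATED_AT");  // no string parsing, no format ambiguity
    }
}
```

Every `Timestamp.valueOf`-style re-parse is a place where a bad row can blow up the whole load; the JDBC path removes that entire class of failures.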
It might depend on the load and frequency of these ETL jobs, and on the data format.
Dumping the table lets you decouple the extraction and insertion steps: extraction can be done by a serial job while insertion into the destination DB runs in parallel (see the sketch below). Granted, this can also be accomplished with an intermediate programming language, but ETL tools are normally well equipped to handle massive CSV files.
Depending on what the data looks like, having a data type conversion step in place may not matter.
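Here is a rough sketch of the decoupling idea: the serial dump has already produced a file, and several workers insert it into HBase in parallel. File name, table name, and chunking are invented for illustration (and for a real 50 M-row load you would stream the file per worker rather than read it all into memory):

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ParallelLoad {
    public static void main(String[] args) throws Exception {
        // Output of the serial extraction step (kept in memory only for brevity).
        List<String> lines = Files.readAllLines(Paths.get("dump.csv"));
        int threads = 8;
        int chunk = (lines.size() + threads - 1) / threads;
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        try (Connection hbase = ConnectionFactory.createConnection(HBaseConfiguration.create())) {
            for (int t = 0; t < threads; t++) {
                int from = Math.min(t * chunk, lines.size());
                int to = Math.min(from + chunk, lines.size());
                if (from == to) continue;
                List<String> slice = lines.subList(from, to);
                pool.submit(() -> {
                    // Each worker gets its own mutator; the Connection itself is thread-safe.
                    try (BufferedMutator m = hbase.getBufferedMutator(TableName.valueOf("my_table"))) {
                        for (String line : slice) {
                            String[] f = line.split(",");
                            Put put = new Put(Bytes.toBytes(f[0]));
                            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes(f[1]));
                            m.mutate(put);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }
}
```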