loading...

PostgreSQL or MySQL data import into Google BigQuery

pedrojmfidalgopt profile image pedrojmfidalgopt ・4 min read

Nowadays cloud platforms provide huge processing power, Amazon decided to make better use of their available machines once their processing capabilities were mostly directed to Christmas sales (being stopped the rest of the year...), Microsoft rents Office or Azure and Google lends processing CPU's and data processing services. Everything paid on a monthly basis. We have the new trends, permanent paid rents but depending on the business and considering what one can save in infrastructure and hand-labor it may be worthwhile. Proper mid/long term evaluation plans should be made when choosing the best solution for the actual business case study.

Either way all technologies and platforms conquer their niche market and they are here to stay. Traditional databases will continue to have their space as to access cloud platforms one needs very good internet connections and there are some places where that is not possible yet. And you will always have the problem of sensitive content that simply cannot be in cloud. Consider for instance government security information, insurance policies, wage or tax information among others.

Google BigQuery allows any small business to store and process huge amounts of data using a SQL querying approach, it is provided as a cloud-based big-data analytics web service. It allows data importing using the traditional csv file data type or even using the more recent json structured file type. Google provides a good tutorial explaining that and I even wrote one for MySql as well some time ago. Things can be accomplished using simple mouse operations or for the more command line purists, the bq load command is available to load data into BigQuery created tables. To move data from Postgres to BigQuery a similar approach can be used as it is only a matter of exporting data using a proper acceptable format. SQL or any tool capable of connecting to Postgres using an ODBC driver or other can be used to achieve that.

These approaches are interesting for unique or ad-hoc bulk data export and loading as they do not require much integration, can be executed manually and just when required. Just export data from the existing database using any export script and import it in BigQuery after proper data structures are created waiting for correctly formatted data.
But, free databases are getting more market, licensing costs make any CFO consider them when comparing to expensive per core solutions like Oracle or SQL Server.
If ones checks how databases ranking compare, PostgreSQL appears in fourth place and MySQL in second. They are free databases that present themselves as very high performance and scalable free solutions (it is impressive the tools MySQL database provides in its desktop workbench). PostgreSQL is also becoming a very used database capable of storing demanding transactional or data warehouse solutions. But in the end even if it can store large amounts of data, to process it one needs CPU power and that is where Google or AWS shine as you have thousands of CPUs available in the palm of your hand.

For small businesses it is not difficult to have huge amounts of data to analyze, any company working with web data knows that, facebook provides millions of logs linkedin has millions of users. Two years ago I developed a data migration process for a company in US that uses spiders to look for prices all over the web. They end up having millions of rows that must be merged into a single database so they can compare supply and selling prices and perform good management decisions. Running analysis over so much data became difficult considering the small SQL Server machine they used. When dealing with large data sets one should really consider platforms like Google BigQuery or Amazon Redshift as a possible data analysis engine options.

As a conclusion, if regular data migrations are required, ETL processes should be built to make our life easier and should not depend on permanent manual iterations (or at least should be minimized...). If possible we should be able to have near real time data migration that runs at scheduled time frames only informing us they have finished without errors.
It seems acceptable one should look for other data integration solutions between these free databases and Google BigQuery. If the reader searches in Google for these subjects several commercial solutions will be presented and all point to the same direction, Google desktop integration. Web based platforms that put it all inside proper well thought workbenches, they allow data mapping, data transformations, process scheduling. Some steps once developed at desktop side can now be migrated to cloud side. It is possible to define replication frequency so in each defined period data is migrated without any manual intervention. The great advantage of these platforms is you have it all in one single place, data sources and data destinations, accessed from a single web portal. And they do not limited to PostgreSQL or MySQL, there are a lot of connectors available.
If you do not believe me search for them in youtube as well, you will find a lot of tutorials available and you will be able to evaluate their real power, almost compared to some of the desktop ETL solutions available. They cost money but on the other hand you will stop having to worry about server maintenance or database backups and DBA’s wages ...

Posted on by:

pedrojmfidalgopt profile

pedrojmfidalgopt

@pedrojmfidalgopt

I am mostly a database developer that knows how to develop software as well ... VBA process automation is my middle name.

Discussion

markdown guide
 

How fast is bigquery. I imported a CSV file with around a lakh rows and ran a query both in mysql and bigquery. Couldn't find any difference in the time taken. Is it really worth the money?