How I built a data pipeline from an end user’s legacy software to a modern cloud infrastructure using Google Cloud Platform and very little money
My name is Ido, and I’ve been in the tech industry for quite a while. My latest project is aimed at bringing some modern cloud and data technology into the world of farming. I started with a simple enough goal: deliver the data from present-day (outdated) farm management software into the cloud, where we can conduct advanced analysis and help farmers better manage their farms.
I am on a very low budget, without outside investors or any spare money. And as I’m the only one working on this project, I need to keep any solution as simple as possible and beware of wasting my time on unnecessary optimisations and features.
To accommodate these requirements, I used Google Cloud Platform’s set of tools. I found that they have a good approach to data processing and analysis. Moreover, they really believe in managed services, which should save me a lot of time. They do have some shortcomings, which I’ll also go into detail about.
On to the real stuff…
High Level Description
This is what we set out to do in the project:
- Data files are collected on each farm’s local computer. The computers are connected to the internet and the files are stored locally in .csv format.
- Send these files to the cloud.
- Write these files’ contents into a cloud-based database, after (deep) parsing and cleansing.
- Visualise and display the aggregated data on a mobile-friendly web page.
- Make the data accessible for further analysis and whatever data science magic we may want to apply to it.
I’ll start with a diagram that describes the full flow the data makes from the farm to its end goal (be it dashboards or data science stuff). Following it, I’ll describe each stage and how it’s implemented on Google Cloud:
- Local computer to cloud storage: local .csv files are uploaded to Google Cloud Storage, a standard cloud object store. This is done once a day, and is fairly simple.
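A minimal sketch of that daily upload, using the google-cloud-storage Python client (the bucket name, folder layout and farm IDs here are assumptions for illustration, not the real ones):

```python
from pathlib import Path


def gcs_object_path(farm_id, filename):
    """Destination path in the bucket: one folder per farm (layout is hypothetical)."""
    return "raw/{}/{}".format(farm_id, filename)


def upload_daily_files(farm_id, local_dir, bucket_name):
    """Upload every local .csv file to Cloud Storage; meant to run once a day."""
    from google.cloud import storage  # pip install google-cloud-storage

    client = storage.Client()  # uses the machine's default GCP credentials
    bucket = client.bucket(bucket_name)
    for csv_file in Path(local_dir).glob("*.csv"):
        blob = bucket.blob(gcs_object_path(farm_id, csv_file.name))
        blob.upload_from_filename(str(csv_file))
```

On the farm computers this can be wired up to a daily scheduled task (cron or Windows Task Scheduler).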
- Pub/Sub job: a Pub/Sub topic receives a message for each new Cloud Storage object uploaded. This is very easy to configure with Google’s CLI tool: you can specify a path prefix to watch, as well as an HTTP endpoint that will be called each time a new file is created. And of course, Google Pub/Sub itself is managed and infinitely scalable.
I did have a couple of (self-inflicted) messes with it: on my first usage, I didn’t specify a path prefix, and my triggered HTTP endpoint copied the uploaded file to a new location. This, of course, triggered a new job, and so on, causing an infinite loop. Oops. Another thing is that you have to make sure to return status 200 OK to the Pub/Sub service, otherwise it will keep bombarding you with requests. It’s part of the ‘at-least-once delivery’ promise you get from the service.
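Both gotchas come down to a few lines of handler logic. A rough sketch (the `processed/` prefix is hypothetical; the envelope parsing follows how GCS notifications arrive as Pub/Sub push messages, with the bucket and object names in the message attributes):

```python
import json

# Files the endpoint copies land under this prefix (name is hypothetical);
# skipping them prevents the infinite copy -> notify -> copy loop.
PROCESSED_PREFIX = "processed/"


def uploaded_object_path(push_body):
    """Pull the new object's path out of a Pub/Sub push payload."""
    envelope = json.loads(push_body)
    return envelope["message"]["attributes"]["objectId"]


def should_process(object_path):
    """Only handle freshly uploaded files, not ones this pipeline wrote itself."""
    return not object_path.startswith(PROCESSED_PREFIX)
```

Whatever the handler decides, it should still answer with HTTP 200, or Pub/Sub will keep redelivering the same message.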
- App Engine: this is where I implemented the HTTP endpoint. It copies each new data file to another Cloud Storage bucket according to its type (a few types of .csv files can be uploaded) and source (which farm it came from). I used Python for the implementation.
I found App Engine to be an interesting solution for simple work. It’s easier to set up than real servers, but it has a lot of pitfalls that, if you’re not careful, can (and did) cost a lot of time. I’ll go over some quick points:
- There are two App Engine environments: Standard and Flexible. Basically, Standard is like a sandboxed version of the language runtime you’re using, and it’s supposed to be easier to deploy (it’s not really easier). The biggest difference I saw: the Flexible environment is much more expensive than Standard.
- When using the Standard environment to interact with other Google Cloud services while running your program locally, the development server substitutes ‘mock’ services on your local machine. When I worked with Google’s Datastore service (a managed cloud DB), I expected to see the changes from my local environment reflected in the cloud DB, but instead they were reflected in a local mock that was hard to find.
- Logging, versioning and a lot of other stuff work right out of the box, which is very convenient, I have to say.
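The copy-by-type-and-source logic in the endpoint boils down to deriving a destination path from the uploaded file’s name. A sketch, assuming (hypothetically) that uploads arrive as `raw/<farm_id>/<type>_<date>.csv`:

```python
def classify(object_path):
    """Split an uploaded object's path into (farm_id, file_type).
    The raw/<farm_id>/<type>_<date>.csv layout is an assumption for illustration."""
    _, farm_id, filename = object_path.split("/")
    file_type = filename.split("_")[0]
    return farm_id, file_type


def destination_path(object_path):
    """Where the endpoint would copy the file to, grouped by type and farm."""
    farm_id, file_type = classify(object_path)
    filename = object_path.rsplit("/", 1)[-1]
    return "processed/{}/{}/{}".format(file_type, farm_id, filename)
```

The copy itself is then a single call in the storage client (`bucket.copy_blob(blob, destination_bucket, new_name)`).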
- Data parse and cleanse: I did this with a new and cool tool called Google Dataprep (by Trifacta). It’s very useful and simple for parsing and cleaning data; I’ll go over it in the next section (Recommended Tools). It has a native write mechanism to Google BigQuery, which is next in the pipeline.
- Write to cloud DB: I used Google’s BigQuery as my cloud DB. It’s a really awesome tool and provides very fast data queries. See more in the Recommended Tools section.
- Visualise the data: I used Google’s Data Studio for connecting to BigQuery and visualising the data. It is a fairly new BI tool: very lean compared to classic BI tools (Tableau, QlikView etc.), but also very clean and intuitive. It’s still in beta, and that definitely showed in many minor bugs and annoying pitfalls. Being a simple tool is also a good thing: it is very easy to learn and get by with, and the results look good and clear. It also has native, direct connectors to BigQuery and integrates very well with it. The data is displayed fast and is very easy to configure.
- Research the data: I haven’t done much of this yet, but it’s a very small step to make. Google offers managed Jupyter notebooks that are directly connected to BigQuery, for researching the data and creating production machine learning models.
Recommended Tools
- Dataprep: “An intelligent cloud data service to visually explore, clean, and prepare data for analysis.” (from the product page). It is basically a UI-based parsing and cleansing tool, and it is very good at that. You can define parsing and cleaning rules using a very comprehensive UI. It saved me a ton of time on writing regexes, applying rules to different kinds of encodings, deleting empty rows, etc.
Some more advantages: it integrates well with Cloud Storage and BigQuery, it has a convenient job-scheduling mechanism, and it looks great. The worst thing about it (which could be an advantage for some): it runs the parsing and cleaning jobs created in the UI with a very big overhead. I think this happens because it’s aimed at much larger-scale data than what I’m using (I won’t go into any deeper details here), but the bottom line is that even the smallest parse job takes about 5 minutes. I considered moving away from it because of this, but the usability and integration with other services are so great that I stuck with it.
- BigQuery: a cloud-hosted (and managed) analytics database. It is relational and integrates really well with all of Google’s other services. Also, most BI and machine learning tools have good integrations for it, as it has become a really popular tool. It has a good web-based UI for querying and understanding the data. There’s no need to set up indexes (as a columnar analytics database, it doesn’t use them). I really enjoyed using it and found that it has mostly advantages. Most queries I’ve run returned results really quickly. However, in the odd case that they don’t, it’s a little hard to optimise, as a lot of the database details are abstracted away.
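To give a feel for what querying it looks like from Python, here is a sketch with the google-cloud-bigquery client (the project, dataset, table and column names are all made up for illustration):

```python
# Hypothetical table of per-farm measurements; only the SQL shape matters here.
DAILY_TOTALS_SQL = """
SELECT farm_id,
       DATE(measured_at) AS day,
       SUM(milk_kg) AS total_milk_kg
FROM `my-project.farm_data.milk_yields`
GROUP BY farm_id, day
ORDER BY day DESC
"""


def daily_totals():
    """Run the aggregate in BigQuery and return the rows as plain dicts."""
    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()
    return [dict(row) for row in client.query(DAILY_TOTALS_SQL).result()]
```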
The billing has been very reasonable, although my customer base is still relatively small (I work with tens of farms). The entire testing and setting up of the environment was within the free tier, so there was no need to commit to any payment before launching.
At the moment I’m supporting tens of farms at a monthly cost of $10–15. A big part of this cost could be ‘optimised away’, but as mentioned, I try to avoid premature optimisation.
Google Cloud Platform helped me a lot in getting the project up and running quickly and efficiently. Their managed-services approach is great for these purposes. All their services work very well together and offer a full end-to-end solution for web app development and data pipelines.
Nonetheless, it has its fair share of downsides. The documentation isn’t always comprehensive enough, and just googling for a solution doesn’t work that well, as a lot of the services are new and not that common yet. Moreover, issues in managed services are harder to solve, since a lot of the details are abstracted away.
I left all of the hard-core technical details out, but rest assured that I faced many pitfalls and nitpicks. Feel free to leave comments and questions, and I can dig deep into a lot of the tools I’ve used if you’re interested in anything specific.