Attention, ladies and gentlemen: a FREE data conference!
On the 16th of April, we are running the online Data Love Conference.
Here’s what’s in store for you at Data Love 2021:
- Jules Damji from Databricks will focus his talk on a data-centric and data-driven approach.
- Mofi Rahman from IBM will present how to set up an E2E ML platform with Kubeflow and DevOps practices.
- Pasha Finkelshtein from JetBrains will explain why we need (not yet) another API for Spark.
- Thanks to Roksolana Diachuk from Captify, we will dig into the Big Data engineer’s role and explore machine learning deployment concepts with Kubeflow.
- Oracle’s David Stokes will show us why Windowing Functions are the wind of change for Database Analytics.
Our next lovely speaker is Ruben Berenguel, Lead Data Engineer at Hybrid Theory.
Ruben Berenguel is a Big Data engineering consultant and an occasional contributor to Spark (especially PySpark). After earning a PhD in Mathematics, he moved into data engineering, where he works mostly with Scala, Python, and Go, designing and implementing big data pipelines in London and Barcelona.
How did you become interested in Data?
As a mathematician, being interested in data came somewhat naturally, as if it were bound to happen. During my postgraduate studies, I started working with early distributed computing approaches and became interested in language processing; both naturally led me to data engineering tools and practices.
What are you working on right now? What drew you to your company?
I'm working as the lead data engineer at Hybrid Theory, a programmatic advertising agency. Serendipity got me here: they needed someone with my skills immediately, and I happened to be available. They have vast amounts of data, so the decision was incredibly easy. I have now been with the company for 5 years.
What is your favorite project or a project that you’re particularly proud of?
There are many projects I am proud of at Hybrid Theory, and probably the best one to highlight is the one I will cover in my presentation, Keeping identity graphs in sync with Apache Spark. It is a very large-scale graph computation that lies at the core of all our external products.
Are your projects similar, do they have common focus points, or can they be completely different?
At work, most of my projects are related to high-volume data processing, with a focus on keeping costs low and performance high. Out of work, I like working on many different things, such as custom productivity tools (like this one for saving time when watching technical presentations) or creating generative artwork.
A lot of people are wondering about Data Engineers and Data Scientists, and the differences between them. What’s your favorite part about your role? What are you measured on? What do you expect when working in tandem with Data Engineers/Scientists?
The thing I love most is a good problem, and data engineering offers many of these. "How can I make this cheap, fast, and scalable?", "Why did this job run out of memory?", and so on. All of these questions come up more or less regularly and prove both hard and interesting. My main measure of performance is delivery: whether the data team is delivering what is expected of us. I'm lucky to have worked with excellent data scientists who not only excel at statistics and machine learning but are also excellent software engineers and write robust code. That is not the norm, and I'm very grateful for it.
What are the core skills that you think are important in your job, especially if you want to develop your Data Science/Engineering career?
Even though it is not the sexiest technology, a good knowledge of SQL is always necessary for data engineering. There are also many problems that can be solved almost instantaneously with SQL but are far more complex in other languages. Aside from this "base", most data engineers need to be proficient in at least one programming language (better if two, and better still if those are Python and Scala). There is also a trend towards DataOps, where knowledge of deployment, machine tuning, and infrastructure is appreciated as well; for these, you'd probably need to know a bit of Kubernetes, Docker, and Terraform, to name some of the tools commonly in use in the area.
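To illustrate the point about SQL, here is a minimal sketch (using Python's built-in sqlite3 and a made-up `purchases` table, not anything from the interview): finding each user's most recent row is a single window-function query, whereas in plain application code it would take an explicit group-and-sort pass.

```python
import sqlite3

# Hypothetical data: a few purchases per user with a timestamp.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (user_id TEXT, item TEXT, ts INTEGER);
    INSERT INTO purchases VALUES
        ('alice', 'book', 1), ('alice', 'pen', 3),
        ('bob', 'mug', 2), ('bob', 'lamp', 1);
""")

# One query: rank rows per user by recency, keep the newest.
rows = conn.execute("""
    SELECT user_id, item, ts
    FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY user_id ORDER BY ts DESC
        ) AS rn
        FROM purchases
    )
    WHERE rn = 1
    ORDER BY user_id
""").fetchall()

print(rows)  # [('alice', 'pen', 3), ('bob', 'mug', 2)]
```

Note that window functions require SQLite 3.25 or newer, which ships with recent Python releases.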
The industry demand for Data Engineers is constantly on the rise, and with it, more and more software engineers and recent graduates are trying to enter the field. Data Engineering is a discipline notorious for being framework-driven, and it is often hard for newcomers to decide which frameworks to learn. What technology are you most excited about? Please share your top 3 data engineering frameworks to learn.
Aside from Apache Spark, which is the one I use most often, I'm very interested in seeing where Ray and Dask lead. To round it out to three, Apache Flink is probably the framework you want for streaming: as nice as Spark Streaming might be if you are used to Spark, Flink's model and semantics feel much better.
‘I love working with data because…'
It's so much fun! There's always something new to learn in the area, some new tool to try, some new insight to discover.
What's Your Data Resolution for 2021?
To get proficient with a new framework (probably one of those I mentioned before).
We thank Ruben for his thoughtful answers!
At the conference, he is going to speak on the topic: Keeping identity graphs in sync with Apache Spark.
If you want to attend Ruben’s talk and discuss your questions “in person”, join us on the 16th of April!
The lineup of speakers is incredible, and the topics are diverse and suitable for any level. Expect interesting Q&A sessions in Spatial Chat and new career opportunities.
Data is all around you.