The year 2019, along with my previous work assignment, was very interesting, challenging, and educational. I happened to work across different technologies and domains.
Here I am jotting down the learnings from my last assignment. While I learnt many things, one of the most important lessons echoes a great man’s words: “All I know is, Nothing”.
- Importance of coding and logging standards for an organization, or at least a team. Defining those standards upfront made it quite evident to me how much effort is saved as the application grows.
- Importance of defining metrics and having a clear distinction between application and business metrics.
- Identifying standards for publishing events and consuming them.
- Identifying data storage formats. I worked mainly with Apache Avro and saw the advantages it brings when a schema evolves.
- Importance of encryption and tokenization standards for the data we deal with; privacy is no longer a wish list but a law. This also helped me understand the difference between the two terms.
- Importance of defining a schema/model for the data we deal with and how Avro enforces certain data quality checks in the pipeline.
- I also understood what schema evolution is and why it needs to be a planned move, with schema compatibility checks enforced before any change.
- Avro IDL makes it super easy to define/design schemas for the data we deal with.
- Challenges involved with Union and complex union types.
- Impacts of breaking schema changes on production systems and probable solutions to handle that.
- Defining our own custom logical types in Avro, for example for encryption and tokenization.
- I learnt a lot about how Kafka can fit into applications, especially event-based ones.
- Difference between System and Business events.
- Leveraging the Schema Registry to enforce checks while producing to and consuming from a topic.
- Kafka headers and how that can be leveraged in the pipeline.
- Producing data with multiple schemas versus single schema to a single topic.
- Importance of metadata (Data about data).
- Kafka Connect and use cases around it.
- A little bit about Kafka Streams.
- Learnt Scala (just enough for Spark), and built and executed jobs by leveraging Livy.
- Understood partitioning, re-partitioning, and data shuffles.
- Consuming messages from Kafka in batch mode.
- Learnt a few things about executors, executor cores, and memory management in Spark.
- Used the Spark History server and Zeppelin notebooks for Spark.
- Unit testing in Spark and its importance.
- Writing workflows as code and understanding how Airflow works.
- Creating custom Operators.
- How Docker can give us $0 infrastructure cost during development and unit testing.
- Kubernetes is still a partially known area for me. I learnt about accessing pods, managing secrets, and writing YAML manifests.
- Going from reading about design patterns to actually seeing them in use.
- Importance of Unit Testing and Code Reviews.
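To make the schema-evolution point above concrete, here is a toy sketch of Avro's resolution rule for added fields: when the reader's schema has a field the writer's data lacks, the field's declared default is used. This is illustrative only (a real pipeline would use an Avro library such as fastavro); the `Customer` schema and field names are made up.

```python
# Reader schema "v2" adds an email field; the default makes the change
# backward compatible with records written under the old "v1" schema.
READER_SCHEMA_V2 = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

def resolve(record: dict, reader_schema: dict) -> dict:
    """Fill in reader-schema defaults for fields missing from the record."""
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            # No value and no default: this is exactly the kind of
            # breaking change a compatibility check should reject.
            raise ValueError(f"no value and no default for field {name!r}")
    return out

v1_record = {"id": "42", "name": "Alice"}  # written with the old schema
print(resolve(v1_record, READER_SCHEMA_V2))
# {'id': '42', 'name': 'Alice', 'email': None}
```

This is also why removing a field, or adding one without a default, has to be a planned move: old readers or writers break immediately.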
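The tokenization idea can also be sketched in a few lines. This toy `TokenVault` only illustrates the concept behind a tokenized logical type: the sensitive value is swapped for an opaque token on write, and only a reader with vault access can swap it back. A real implementation would sit behind a secured vault service; the class and method names here are my own invention.

```python
import secrets

class TokenVault:
    """Toy vault: maps opaque tokens back to the original values."""

    def __init__(self):
        self._by_token = {}

    def tokenize(self, value: str) -> str:
        # The token reveals nothing about the original value.
        token = "tok_" + secrets.token_hex(8)
        self._by_token[token] = value
        return token

    def detokenize(self, token: str) -> str:
        return self._by_token[token]

vault = TokenVault()
token = vault.tokenize("4111-1111-1111-1111")
assert token != "4111-1111-1111-1111"                 # stored form is opaque
assert vault.detokenize(token) == "4111-1111-1111-1111"
```

The contrast with encryption is the useful part: encryption is reversible with a key by anyone holding the ciphertext, while a token is meaningless outside the vault that issued it.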
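On Kafka headers: their value in a pipeline is that a consumer can route or filter messages without deserializing the payload. The sketch below mimics the shape a Kafka client exposes (a value plus a list of key/bytes header tuples) without any broker involved; the header names are hypothetical.

```python
def header(headers, key, default=None):
    """Return the first header value for `key`, decoded to str."""
    for k, v in headers:
        if k == key:
            return v.decode("utf-8")
    return default

# (value, headers) pairs, roughly the shape a consumer sees.
messages = [
    (b'{"order_id": 1}', [("event-type", b"order-created"), ("schema-version", b"2")]),
    (b'{"order_id": 2}', [("event-type", b"order-cancelled"), ("schema-version", b"2")]),
]

# Filter on a header alone -- the JSON payloads are never parsed.
cancelled = [m for m, h in messages if header(h, "event-type") == "order-cancelled"]
print(cancelled)  # [b'{"order_id": 2}']
```

A schema-version header like the one above is also one pragmatic answer to the "multiple schemas on a single topic" question: consumers can pick the right deserializer per message.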
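The "workflow as code" idea behind Airflow boils down to: tasks plus dependencies form a DAG, and the scheduler runs tasks in dependency order. A stdlib-only sketch of that ordering (the task names are made up, and a real Airflow DAG would use operators rather than a plain dict):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# task -> set of upstream tasks it depends on
dag = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
    "notify": {"load"},
}

# static_order() yields tasks so that every task appears after
# everything it depends on -- the order a scheduler would run them in.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'validate', 'load', 'notify']
```

Because the workflow is ordinary code, it can be reviewed, versioned, and unit tested like everything else, which is much of Airflow's appeal.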
Below are some of the domains I have worked on. If one cannot traverse these domains for a specific customer, then we are hardly making use of the richness of the data.
- Customer – I was able to understand the different challenges a company will have in dealing with Customer data.
- Sales – All the different attributes related to a transaction, and why it's critical to have them available in the system in near real time.
- Preference – For people who work in marketing, customer preferences play a great role; with more laws around them, it's important to keep them updated and available to the marketing teams.
- Loyalty – The success of this program can only be measured when the company can leverage this data to its benefit.
This is a brain dump of my previous assignment, but it will also serve as a reminder of all the learnings as well as the unknowns. I also plan to write posts on some of the topics listed here, as I truly believe that “to teach is to learn twice”.