Software at Scale
Software at Scale 13 - Emma Tang: ex Data Infrastructure Lead, Stripe
Emma Tang was the Engineering Manager of the Data Infrastructure team at Stripe. She was also a Lead Software Engineer at Aggregate Knowledge, where she worked on the data platform.
We explore the technological and organizational challenges of maintaining big data platforms. We discuss when a company needs a “Big Data” system, the properties of a good system, what some of these systems look like today, the tools and frameworks that work well, hiring the right engineers, and unsolved problems in the field.
Highlights
0:30 - “Big Data” for software engineers: when does a company need a big data solution?
2:30 - The transition from a regular database to a big data solution, with Stripe as a motivating example
4:20 - Verification of processed output. Some of the tools involved: Amazon S3, Parquet, Kafka, and MongoDB.
9:00 - The cost of ensuring correctness in data processing. Using tools like Protobuf to enforce schemas.
13:30 - Data Governance as a trend
16:30 - Why should a company have a data platform organization?
21:30 - Hiring data infrastructure/platform engineers
24:00 - How does a data organization maintain quality? What metrics do they look at?
28:30 - Trends in some problem areas in this space.
33:30 - Emma’s interest in data infrastructure, and advice for those looking to get into the field.