Designing Data-Intensive Applications, Chapter 1
- I recently picked up Martin Kleppmann's Designing Data-Intensive Applications (DDIA), and over the next few days I'll be sharing some notes on its chapters
- Also reading it? Feel free to connect, let's talk about it!
Reliability
- Roughly, "continuing to work correctly, even when things go wrong"
- Things go wrong -> fault
- Systems that anticipate and can cope with faults are called fault-tolerant or resilient
- Fault =/= Failure
- Fault: one component deviating from spec
- Failure: the whole system stops providing the required service to the user
- ❌ Reduce fault probability to zero ❌
- ✅ Design fault-tolerance mechanisms that prevent faults from causing failures ✅ (see the sketch after this list)
- There are cases where prevention is better, though, as with security matters
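A minimal sketch of that idea (my own illustration, not code from the book): a call to a flaky component is retried, and a degraded fallback is returned if it keeps misbehaving, so the fault stays contained instead of becoming a user-visible failure. `fetch_profile` in the usage comment is a hypothetical service call.

```python
import time

def call_with_tolerance(component_call, retries=3, fallback=None, delay=0.1):
    """Retry a flaky component call so a transient fault in one
    component doesn't become a failure of the whole system."""
    for attempt in range(retries):
        try:
            return component_call()
        except Exception:
            time.sleep(delay * (2 ** attempt))  # simple exponential backoff
    return fallback  # degrade gracefully instead of failing outright

# Hypothetical usage: a profile service that occasionally misbehaves
# profile = call_with_tolerance(lambda: fetch_profile(user_id), fallback={"name": "unknown"})
```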
Hardware faults
- Crashed hard disks, faulty RAM, lack of maintenance, etc
- First response: adding redundancy to individual hardware components
- ⬆ applications' computing demands ⬆ rate of hardware faults
- And in common cloud services, flexibility and elasticity are prioritized over single-machine reliability
- Software fault tolerance techniques!
Software faults
- Harder to anticipate
- Correlated across nodes
- Which causes more system failures than uncorrelated hardware faults
- Examples include:
- Software bugs triggered by a particular bad input
- Runaway processes (infinite loops)
- Downtime in some service that the system depends on
- Cascading failures
- Prevention includes:
- Carefully thinking about interactions and assumptions about the system
- Automated/manual testing
- Process isolation
- Measuring, monitoring and analyzing system behavior in production
Human errors
- How do we make our systems reliable, in spite of unreliable humans?
- Designing systems in a way that minimizes opportunities for error. Eg: abstractions, APIs and admin dashboards (high-level interface)
- Non-production environments (so people can explore and experiment safely)
- Automated/manual testing to cover corner cases
- Easy rollback options
- Deploying code gradually + feature flags (trunk-based development); see the sketch after this list
- Detailed monitoring and observability. Eg: performance metrics and error rates
- Good management practices and training
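A minimal sketch of a percentage-based rollout flag (my own illustration, not from the book and not tied to any particular feature-flag library): only a fraction of users hit the new code path, which limits the blast radius of a human error and makes rollback a configuration change rather than a redeploy.

```python
import hashlib

def feature_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically place a user in a bucket from 0-99 and enable
    the feature only for the configured percentage of users."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percent

# Hypothetical usage: expose a risky change to 10% of users first;
# rolling back a mistake is just setting rollout_percent to 0
if feature_enabled("new_timeline", user_id="user-42", rollout_percent=10):
    pass  # new code path
else:
    pass  # old, known-good code path
```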
- Reliability matters for both big and small companies (potential loss of revenue and damage to reputation)
- Balance between reliability and costs
Scalability
- Ability to cope with increased load
- If the system grows in a particular way, what are our options for coping with the growth?
Load
- Actually, first we have to consider what load is
- Load parameters depend on the system architecture
- Requests per second to a web server
- Ratio of reads to writes in a database
- Number of simultaneously active users in a chat room
- Hit rate on a cache, etc
- Twitter's key load parameter: the distribution of followers per user (fan-out load)
- Querying (SELECT + JOIN) every followed user's tweets on each home timeline request VS maintaining a per-user timeline cache that is updated on write; see the sketch below
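A minimal sketch of the two approaches (my own illustration; `tweets`, `follows`, and `timeline_cache` are in-memory stand-ins for real tables and caches):

```python
from collections import defaultdict

tweets = []                         # (author_id, text) in posting order
follows = defaultdict(set)          # user_id -> set of user_ids they follow
timeline_cache = defaultdict(list)  # user_id -> precomputed home timeline

# Approach 1: compute the home timeline on read (cheap writes, expensive reads)
def home_timeline_on_read(user_id):
    return [t for t in tweets if t[0] in follows[user_id]]

# Approach 2: fan out on write (expensive writes, cheap reads)
def post_tweet(author_id, text, follower_ids):
    tweets.append((author_id, text))
    for follower_id in follower_ids:  # costly for users with millions of followers
        timeline_cache[follower_id].append((author_id, text))

def home_timeline_on_write(user_id):
    return timeline_cache[user_id]
```

The write cost of approach 2 is driven by how many followers each user has, which is why the distribution of followers per user is the key load parameter here.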
Performance
- Batch processing -> throughput (records processed/second or time to complete a job)
- Online systems -> response time (request -> response)
- We should think of response time as a distribution of values, not a single number
- Since the response time differs from request to request
- Random added latency, packet loss and retransmission, garbage collection pauses, mechanical vibrations, etc
- Average response time is NOT a good metric
- It does not tell how many users experienced that delay
- Use percentiles instead: sort the response times from fastest to slowest and take the median (half of the requests complete faster than that value, and half take longer)
- Also known as p50
- To know how bad the outliers are: p95, p99, p999
- Also known as tail latencies
- If p95 = 1.5 seconds, 5 out of 100 requests take 1.5 seconds or more
- Tail latencies are important, since they affect users' experience of the service
- 100ms increase in response time reduces sales by 1% (Amazon)
- 1 sec slowdown reduces customer satisfaction metrics by 16%
- These percentiles are often used when defining SLAs and SLOs
- "The service is considered up if it has a median response time of less than X ms and a p99 under 1 s"
- Monitoring response times:
- Ongoing basis
- Keep a rolling window of response times for requests in the last 10 minutes
- Every minute, calculate the median and percentiles over that window and plot them on a graph; see the sketch below
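A minimal sketch of that calculation (assuming a plain in-memory list stands in for the real metrics pipeline; production systems often use approximation algorithms such as forward decay, t-digest, or HdrHistogram instead):

```python
import math

def percentile(times, p):
    """Nearest-rank percentile: the value below which roughly p% of
    the measured response times fall."""
    if not times:
        return None
    ordered = sorted(times)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

# Rolling window: response times (ms) of requests from the last 10 minutes
window = [32, 45, 51, 60, 74, 85, 90, 120, 300, 1500]

for p in (50, 95, 99):
    print(f"p{p} = {percentile(window, p)} ms")
# p50 = 74 ms (half the requests finish in 74 ms or less), p95 = p99 = 1500 ms
```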
Coping with load
- Rethink the architecture on every order of magnitude load increase
- Scaling up (moving to a more powerful machine) vs scaling out (distributing load across machines)
- Elastic systems (automatically add computing resources as load increases); see the sketch at the end of this section
- Useful if increase of load is unpredictable
- Architecture of large systems is highly specific to the application
- There is no such thing as a one-size-fits-all architecture
- E.g., 100,000 requests/sec of 1 kB each vs 3 requests/min of 2 GB each
- An architecture that scales well will be built around the load parameters
- Iterate quickly on features VS scale to hypothetical future load
- However, scalable architectures are built from general-purpose building blocks, arranged in familiar patterns
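A minimal sketch of the decision an elastic system automates (the capacity numbers and headroom factor are made up for illustration, not any cloud provider's API):

```python
import math

def desired_instances(requests_per_sec: float,
                      capacity_per_instance: float = 1000.0,
                      min_instances: int = 2,
                      headroom: float = 1.3) -> int:
    """How many machines to run for the current load, with some headroom
    for spikes; an elastic system re-evaluates this as load changes."""
    needed = math.ceil(requests_per_sec * headroom / capacity_per_instance)
    return max(min_instances, needed)

print(desired_instances(1_500))   # -> 2
print(desired_instances(12_000))  # -> 16
```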
Maintainability
- Most of the cost of software is in ongoing maintenance, not in the initial development
- Fixing bugs, investigating failures, adapting to new platforms, modifying use cases, repaying technical debt, adding new features, etc
Operability
- Good software cannot run reliably with bad operations
- Operations squad responsibilities include:
- Monitoring system health and restoring it when it goes into a bad state
- Tracking down the cause of problems (e.g. failures or degraded performance)
- Keeping everything up to date
- Anticipating future problems and solving them
- Establishing good practices and development tools
- Performing complex management tasks (such as platform migrations)
- Maintaining system security
- Writing docs about the system
- Good operability of data systems includes:
- Good monitoring (visibility)
- Automation and integration with standard tools
- Good documentation of operations
- Good and predictable default behavior but with options to override it
- Self-healing
Simplicity
- Simple and expressive code instead of bloated and complex code
- Symptoms of complexity:
- Tight coupling of modules
- Tangled dependencies
- Inconsistent naming
- Hacks to solve performance issues, etc
- When the system is harder for developers to understand and reason about (hidden assumptions, unintended consequences, unexpected interactions), the risk of introducing new bugs is increased
- Making system simpler -> removing accidental complexity
- Through abstractions (façade)
Evolvability
- Constant flux: learn new facts, new use cases emerge, business priorities change, user request new features, architecture changes, etc
- Agile development
- The ease of modifying a system is linked with its simplicity
- Easy-to-understand systems are easier to modify than complex ones