Amazon Prime Video reduced the infrastructure cost of its video quality monitoring service by 90% after re-architecting it from a distributed microservices system into a monolithic application. The Video Quality Analysis (VQA) team identified two bottlenecks: the orchestration layer built on AWS Step Functions, and the expensive Tier-1 calls to the S3 bucket used as intermediate storage for video frames. The new approach eliminated the need for the S3 bucket by keeping data transfer in memory. When the service later exceeded the capacity of a single instance, the team cloned it multiple times.
Overview of Prime Video
Prime Video is a streaming platform owned by Amazon that offers a wide range of TV shows, movies, and original content to its subscribers. It is available in more than 240 countries and territories worldwide and can be accessed through various devices, including smartphones, tablets, smart TVs, and gaming consoles.
Prime Video ran into scaling and cost problems when it tried to monitor thousands of live streams for perceptual quality issues with its existing Video Quality Analysis (VQA) tool. Running the infrastructure at high scale was very expensive, and scaling bottlenecks prevented the team from monitoring thousands of streams. To address this, the team moved all components into a single process so that data transfer stays within process memory, which also simplified the orchestration logic and eliminated the need for an S3 bucket as intermediate storage for video frames.
- The Video Quality Analysis (VQA) team at Prime Video already owned a tool for audio/video quality inspection, but it was not intended to run at a high scale.
- The initial version of the service consisted of distributed components orchestrated by AWS Step Functions, which made it expensive to run.
- The main scaling bottleneck was the orchestration layer built on AWS Step Functions. It performed multiple state transitions for every second of the stream, so the service quickly reached account limits, and because Step Functions charges per state transition, the cost grew with every stream.
- Passing video frames (images) between components through the S3 bucket required a high number of Tier-1 calls, which was expensive.
- A distributed approach was not providing many benefits in Prime Video's specific use case, so they decided to pack all of the components into a single process and implement orchestration that controls components within a single instance, eliminating the need for an S3 bucket as the intermediate storage for video frames.
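The cost pressure described in the list above can be made concrete with a back-of-the-envelope calculation. The Python sketch below is only an illustration, not Prime Video's cost model: every number in it (stream count, transitions per second, request counts, and both prices) is a hypothetical placeholder, and the names `WorkloadAssumptions` and `hourly_cost` are invented for this example. It shows that both cost terms grow linearly with the number of streams, because each second of each stream triggers its own state transitions and S3 requests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadAssumptions:
    """Hypothetical inputs; none of these figures come from Prime Video."""
    streams: int                        # concurrent live streams monitored
    transitions_per_stream_second: int  # state transitions per second of stream
    s3_requests_per_stream_second: int  # requests for intermediate frames
    price_per_transition: float         # USD, placeholder rate
    price_per_s3_request: float         # USD, placeholder rate


def hourly_cost(w: WorkloadAssumptions) -> dict[str, float]:
    """Return the per-hour orchestration and storage-request cost in USD."""
    stream_seconds = w.streams * 3600
    transitions = stream_seconds * w.transitions_per_stream_second
    s3_requests = stream_seconds * w.s3_requests_per_stream_second
    return {
        "state_transitions": transitions * w.price_per_transition,
        "s3_requests": s3_requests * w.price_per_s3_request,
    }


# Placeholder workload: doubling `streams` doubles both cost terms.
example = WorkloadAssumptions(
    streams=1000,
    transitions_per_stream_second=5,
    s3_requests_per_stream_second=10,
    price_per_transition=0.000025,
    price_per_s3_request=0.000005,
)
print(hourly_cost(example))
```

Substitute the current AWS prices for your region and your own measured request rates before drawing conclusions from an estimate like this.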
How was the issue fixed?
- To address the bottlenecks, Prime Video decided to re-architect its infrastructure and move from a distributed microservices approach to a monolith application, packing all of the components into a single process.
- The new architecture allowed data transfer to happen in memory, eliminating the need for an S3 bucket as an intermediate storage for video frames.
- They implemented orchestration that controls components within a single instance to reduce costs and increase scaling capabilities.
- The number of detectors could only scale vertically because they all run within the same instance. Prime Video cloned the service multiple times, parametrizing each copy with a different subset of detectors, and implemented a lightweight orchestration layer to distribute the load.
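The single-process design described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Prime Video's implementation: `convert_media`, `analyze`, and the two toy detectors are hypothetical names, and a frame is modeled as a plain `bytes` object. The point is that frames are handed from the conversion step to the detectors as in-memory objects, so no intermediate storage and no external workflow engine are involved.

```python
from typing import Callable, Iterable, Iterator

Frame = bytes                       # stand-in for a decoded video frame
Detector = Callable[[Frame], bool]  # returns True when a defect is found


def convert_media(stream: Iterable[bytes]) -> Iterator[Frame]:
    """Hypothetical conversion step: yields frames one at a time."""
    for chunk in stream:
        yield chunk  # a real converter would decode the chunk here


def analyze(stream: Iterable[bytes],
            detectors: dict[str, Detector]) -> list[tuple[int, str]]:
    """Run every detector on every frame inside one process.

    Frames are passed by reference in memory; nothing is written to
    external storage between the conversion and detection steps.
    """
    defects = []
    for index, frame in enumerate(convert_media(stream)):
        for name, detect in detectors.items():
            if detect(frame):
                defects.append((index, name))
    return defects


# Toy detectors, invented for this example.
toy_detectors = {
    "black_frame": lambda f: len(f) > 0 and all(b == 0 for b in f),
    "empty_frame": lambda f: len(f) == 0,
}

print(analyze([b"\x01\x02", b"\x00\x00", b""], toy_detectors))
```

Here the orchestration is just the nested loop in `analyze`, which is what makes it cheap compared with one workflow state transition per step.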
Lessons learned
- Understand your scalability needs and costs before designing a distributed system: The initial version of the defect detection system was designed as a distributed system using serverless components, but it hit a hard scaling limit at around 5% of the expected load, and the cost of all the building blocks was too high to accept the solution at a large scale.
- Revisit the architecture to address cost and scaling bottlenecks: Prime Video rearchitected the infrastructure to eliminate the need for the S3 bucket as the intermediate storage for video frames and to implement orchestration that controls components within a single instance. This resulted in reducing costs and improving scaling capabilities.
- Use scalable solutions: After the rearchitecture, Prime Video deployed the service on Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic Container Service (Amazon ECS), which allowed for easy scaling when needed.
- Optimize data transfer between components: Passing video frames (images) between different components resulted in high costs, so Prime Video eliminated the need for an S3 bucket by having data transfer happen in memory.
- Regularly review the design of your system: Prime Video keeps improving its defect detection system by adding more detectors to the service. When that growth exceeded the capacity of a single instance, the team cloned the service multiple times.
- Consider both horizontal and vertical scaling when designing your system: In the initial design, Prime Video could scale several detectors horizontally, but in the new approach, the number of detectors only scales vertically because they all run within the same instance. To overcome this problem, Prime Video cloned the service multiple times, parametrizing each copy with a different subset of detectors, and implemented a lightweight orchestration layer to distribute customer requests.
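The cloning approach in the last point can be sketched as follows. This is a hedged Python illustration, not the team's code: `partition_detectors`, `ServiceCopy`, and `Orchestrator` are hypothetical names, the detector names are made up, and the round-robin split and fan-out-and-merge behaviour are assumptions about what a lightweight orchestration layer could look like.

```python
def partition_detectors(detectors: list[str], copies: int) -> list[list[str]]:
    """Split detector names round-robin into `copies` subsets."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    subsets: list[list[str]] = [[] for _ in range(copies)]
    for i, name in enumerate(detectors):
        subsets[i % copies].append(name)
    return subsets


class ServiceCopy:
    """One clone of the monolith, parametrized with a subset of detectors."""

    def __init__(self, detectors: list[str]) -> None:
        self.detectors = detectors

    def analyze(self, stream_id: str) -> dict[str, str]:
        # Placeholder result; a real copy would run its detectors here.
        return {name: f"checked {stream_id}" for name in self.detectors}


class Orchestrator:
    """Assumed behaviour: send each request to every copy, merge results."""

    def __init__(self, copies: list[ServiceCopy]) -> None:
        self.copies = copies

    def handle(self, stream_id: str) -> dict[str, str]:
        merged: dict[str, str] = {}
        for copy in self.copies:
            merged.update(copy.analyze(stream_id))
        return merged


# Hypothetical detector names, split across two copies.
names = ["block_corruption", "video_freeze", "av_sync", "black_frame", "silence"]
subsets = partition_detectors(names, copies=2)
orchestrator = Orchestrator([ServiceCopy(s) for s in subsets])
print(subsets)
print(sorted(orchestrator.handle("stream-1")))
```

Each copy still scales vertically on its own instance, while adding copies spreads the detector load across instances.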
Potential PoR Actions
- Design for scalability: Ensure that the architecture is designed to scale up to meet the expected load. A distributed approach may be a good choice, but it is important to make sure that the components are scalable.
- Monitor and optimize cost: Regularly monitor resource utilization and trim unused or oversized resources to keep costs under control. Tools such as AWS Cost Explorer can help with this.
- Optimize data transfer: Reducing the amount of data transferred between components can help to reduce costs and improve performance. Using in-memory transfer instead of transferring data between components through a network or S3 bucket can improve efficiency.
- Use the right tool for the job: Matching the tool to the workload avoids unnecessary costs and improves performance. For example, deploying on Amazon Elastic Compute Cloud (EC2) or Amazon Elastic Container Service (ECS) can be more cost-effective than using serverless components like AWS Lambda or AWS Step Functions for certain tasks.
- Implement fault tolerance: To ensure high availability and prevent downtime, implement fault-tolerant systems that can withstand component failures. This can be achieved by using load balancing, auto-scaling, and implementing backup and recovery mechanisms.
- Regularly review and update architecture: Identify areas where performance, cost, or scalability can be improved. This can involve revisiting the initial architecture design, as the Prime Video team did.
- Regularly review and optimize code: Use profiling tools to find performance bottlenecks, then optimize those paths to improve performance and reduce resource usage.
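As one way to act on the profiling advice above, the following Python sketch uses the standard library's `cProfile` and `pstats` modules to produce a short report for a single function call. It is a generic illustration unrelated to Prime Video's code: `slow_checksum` is a deliberately naive, made-up workload, and `profile_call` is a hypothetical helper.

```python
import cProfile
import io
import pstats


def slow_checksum(frames: list[bytes]) -> int:
    """Deliberately naive per-byte loop, used only as a profiling target."""
    total = 0
    for frame in frames:
        for byte in frame:
            total = (total + byte) % 65521
    return total


def profile_call(func, *args) -> tuple[object, str]:
    """Run func(*args) under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the five entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


checksum, report = profile_call(slow_checksum, [bytes(range(256))] * 100)
print(report)
```

The report lists call counts and cumulative time per function, which is usually enough to decide where optimization effort is worth spending.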