DEV Community

Building Elastic and Fully Managed Cloud-Native VectorDB Milvus Infrastructure on AWS


Milvus is an advanced open-source vector database designed to revolutionize AI and analytics applications. With its ability to handle high-performance similarity search on massive-scale vector datasets, Milvus has become an essential tool for businesses in today's data-driven world. From recommendation systems to image recognition and natural language processing, Milvus enables organizations to unlock the full potential of generative AI algorithms and extract valuable insights from complex data.

To fully leverage the power of Milvus, deploying it within a robust, scalable, and managed infrastructure is crucial. In this blog post, I will explore how you can build an elastic and fully managed cloud-native Milvus infrastructure on AWS, taking advantage of its scalability, reliability, and ease of management. By harnessing the capabilities of Milvus in combination with AWS services, businesses can supercharge their generative AI initiatives and achieve remarkable results in fields such as content generation, recommendation engines, and personalized user experiences.

Overview of Milvus Architecture

At the core of Milvus lies its shared-storage architecture, consisting of four essential layers: the access layer, coordinator service, worker node, and storage. This architectural design allows for scalability, as well as the disaggregation of storage and computing resources, resulting in a cost-efficient and flexible infrastructure. The independent scalability of each layer further enhances the system's agility and resilience, ensuring seamless disaster recovery.

The Milvus architecture distinctly separates compute resources from storage and includes dedicated storage nodes. A significant advantage of this design is that the compute cluster can scale down to zero while the data nodes stay active. The data itself remains accessible in Amazon S3, which makes resource management flexible and efficient without compromising data availability.
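
To make the scale-to-zero idea concrete, here is a minimal sketch of how the compute services could be scaled on ECS. It assumes each Milvus worker runs as its own ECS service; the cluster and service names are hypothetical. The function takes any client exposing `update_service` with the same keyword arguments as the boto3 ECS client, so it can be tried without AWS credentials.

```python
from typing import Iterable, List, Protocol


class EcsClient(Protocol):
    """The subset of the boto3 ECS client that this sketch relies on."""

    def update_service(self, *, cluster: str, service: str, desiredCount: int) -> dict:
        ...


def scale_services(
    ecs: EcsClient, cluster: str, services: Iterable[str], desired_count: int
) -> List[str]:
    """Set desiredCount on every listed service and return the updated names."""
    if desired_count < 0:
        raise ValueError("desired_count must be zero or positive")
    updated = []
    for name in services:
        # desiredCount=0 stops all tasks of the service; data in S3 is untouched.
        ecs.update_service(cluster=cluster, service=name, desiredCount=desired_count)
        updated.append(name)
    return updated


# Usage with real AWS (hypothetical cluster and service names):
#   import boto3
#   scale_services(boto3.client("ecs"), "milvus", ["milvus-querynode"], 0)
```

Passing the client in rather than creating it inside the function keeps the sketch easy to test and makes it obvious which API call does the work.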

Milvus architecture

The architecture diagram above is from the official Milvus documentation.

Existing Infrastructure on Kubernetes

The current deployment of Milvus runs on Kubernetes and uses etcd for distributed key-value storage, MinIO for object storage, and Pulsar for distributed messaging. While this setup is functional, the Milvus architecture is designed to be portable, so it can run on various infrastructures. To gain the benefits of an AWS-native solution, you can replace these components with the serverless and fully managed services available on AWS.

Serverless on AWS provides technologies for running code, managing data, and integrating applications without the need to manage servers. It offers automatic scaling, built-in high availability, and a pay-for-use billing model, increasing agility and optimizing costs. By leveraging serverless technologies, you can enhance the scalability and efficiency of the Milvus deployment on AWS.

AWS fully-managed services, on the other hand, are Amazon's cloud computing provisions where AWS handles the entire infrastructure and manages the required resources to deliver reliable services. This includes managing servers, storage, operating systems, databases, and other critical resources fundamental to the service infrastructure. By utilizing fully-managed services, you can ensure a robust and reliable Milvus deployment on AWS, reducing operational overhead and increasing the focus on utilizing Milvus's capabilities effectively.

By transitioning the existing Kubernetes deployment of Milvus to leverage serverless and fully-managed solutions on AWS, you can unlock the full potential of Milvus in terms of scalability, reliability, and ease of management. In the next sections, I will explore the proposed infrastructure using AWS services and its benefits in building an elastic and fully managed cloud-native Milvus infrastructure.

Proposed Infrastructure with AWS Services

To enhance the Milvus deployment on AWS, I propose replacing certain components with AWS services that offer scalability, reliability, and ease of management. These replacements include:

  • Amazon MSK (Managed Streaming for Apache Kafka): MSK replaces Pulsar for messaging and log management. It provides a fully managed Kafka service for robust messaging and log processing that integrates into your Milvus deployment. For future exploration, it is worth considering Amazon Kinesis, a fully managed streaming service that integrates closely with the AWS ecosystem.
  • Aurora Serverless: Aurora Serverless replaces etcd as the metadata storage and coordination system. It is a serverless database service that automatically scales to match workload demands, which keeps metadata management in your Milvus infrastructure efficient and scalable. Currently, MySQL is the only alternative metastore Milvus supports; for the future, it is also worth exploring Amazon DynamoDB, a highly scalable NoSQL database optimized for key-value workloads.
  • Application Load Balancer (ALB): ALB handles load balancing and routing of Milvus requests, ensuring high availability and efficient distribution of traffic to the various components. ALB's dynamic routing capabilities enable seamless traffic management within the Milvus infrastructure.
  • Amazon S3: Amazon S3 replaces MinIO for data persistence. It offers highly scalable, reliable, and cost-effective object storage. By leveraging Amazon S3, you can achieve seamless data persistence for your Milvus deployment, while benefiting from the scalability and durability of AWS's object storage service.
  • Amazon ECS: Milvus containers can be effortlessly deployed on AWS Fargate, a serverless compute engine specifically designed for containers. By utilizing ECS Fargate, you liberate yourself from the complexities of managing underlying infrastructure, enabling you to devote your attention to fine-tuning resource utilization and elevating the performance of your Milvus deployment. For future explorations, you can draw inspiration from the design considerations of Aurora Serverless for high throughput cloud-native vector databases. This involves separating storage and computation, ensuring that you only pay for computational power when it is actually needed, resulting in optimized cost efficiency and enhanced scalability.
  • AWS Cloud Map: Milvus distributed infrastructure requires effective service discovery mechanisms to enable efficient management and scaling of applications. With AWS Cloud Map, you can easily locate and communicate with the services you need, without the hassle of managing your own service registry.
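
As a rough illustration of the S3 and MSK replacements, the sketch below builds the configuration overrides that would point Milvus at those services. The key names follow the `minio` and `kafka` sections of `milvus.yaml` as I understand them, so please verify them against your Milvus version. The bucket name, region, and broker addresses are hypothetical, and the function name is mine.

```python
import json
from typing import Dict, List


def build_milvus_overrides(
    bucket: str, region: str, kafka_brokers: List[str], root_path: str = "milvus"
) -> Dict[str, dict]:
    """Return milvus.yaml-style overrides for S3 storage and MSK messaging."""
    if not kafka_brokers:
        raise ValueError("at least one Kafka broker is required")
    return {
        # Milvus talks to S3 through its MinIO-compatible storage settings.
        "minio": {
            "address": f"s3.{region}.amazonaws.com",  # regional S3 endpoint
            "port": 443,
            "useSSL": True,
            "bucketName": bucket,
            "rootPath": root_path,
            "useIAM": True,  # assumption: credentials come from the task's IAM role
            "cloudProvider": "aws",
        },
        # MSK bootstrap brokers, joined into one comma-separated string.
        "kafka": {"brokerList": ",".join(kafka_brokers)},
    }


if __name__ == "__main__":
    overrides = build_milvus_overrides(
        bucket="my-milvus-bucket",  # hypothetical bucket
        region="eu-west-1",
        kafka_brokers=["b-1.example.internal:9092", "b-2.example.internal:9092"],
    )
    # JSON is also valid YAML, so the output can be merged into a config file.
    print(json.dumps(overrides, indent=2))
```

Relying on an IAM role instead of static access keys is a design choice of this sketch, not something the Milvus documentation requires.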

By incorporating these AWS services and considering future possibilities, you can build an elastic and fully managed cloud-native Milvus infrastructure that maximizes scalability, reliability, and operational efficiency. In the next sections, I will delve into the architecture of this new infrastructure and explore its benefits in detail.

Architecture of the New Infrastructure

In the proposed infrastructure, AWS services seamlessly integrate into the Milvus deployment, enhancing scalability, manageability, and overall performance. MSK, Aurora Serverless, ALB, Amazon S3, and ECS Fargate play pivotal roles in ensuring a robust and elastic infrastructure.

Benefits of AWS Services for Milvus

The adoption of AWS services brings several key advantages to Milvus deployments:

  • Scalability: AWS services such as MSK, Aurora Serverless, and ECS Fargate enable effortless scaling of resources based on workload demands. This ensures efficient management of high-volume data, allowing your Milvus deployment to handle growing datasets with ease.
  • Managed Services: By leveraging managed services, you can significantly reduce operational overhead. AWS takes care of the underlying infrastructure, ensuring high availability and durability. This allows you to focus on leveraging Milvus's capabilities without the burden of managing the infrastructure yourself.
  • Reliability: AWS services provide a robust and reliable infrastructure, offering stability and performance for your Milvus deployment. With built-in redundancy and fault-tolerant designs, you can trust that your Milvus infrastructure will operate smoothly and reliably.
  • Cost Efficiency: AWS services offer cost-effective solutions for Milvus deployments. Services like Aurora Serverless and ECS Fargate enable you to pay only for the computational resources you actually use, optimizing cost efficiency. Additionally, Amazon S3 provides highly scalable and cost-effective object storage, eliminating the need for managing and provisioning your own storage infrastructure. By leveraging AWS services, you can achieve significant cost savings while maintaining the scalability and reliability required for your Milvus deployment.

By incorporating these benefits into your Milvus deployment, you can unleash the full potential of Milvus for high-performance similarity search on massive-scale vector datasets, while ensuring scalability, reliability, and cost efficiency.

Deployment Process on AWS and Challenges

I began by following the official instructions currently available in the documentation, which present two options for deploying Milvus: on Kubernetes, using Terraform and Ansible, or with Docker Compose (not recommended for production environments). Initially, I opted for Docker Compose and attempted to deploy it on Amazon ECS using the ecs-cli. However, I encountered several incompatibilities and, after many hours of effort, decided to abandon Docker Compose. Despite this setback, the experience was invaluable, as it greatly improved my understanding of both ecs-cli and Milvus' internal architecture.

Consequently, I decided to build the entire infrastructure from scratch. Given my previous experience, this approach seemed far simpler to manage. I began by deploying the Virtual Private Cloud (VPC), the ECS cluster, and then proceeded to install each of the Milvus components individually. During this process, Milvus introduced support for multiple coordinators in both active and standby modes, further complicating deployment, but in a more exciting way.
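
To give an idea of what installing each component individually looks like, here is a minimal sketch that builds one Fargate service request per Milvus component. The dictionary keys match the parameters of the boto3 ECS `create_service` call. Everything else is an assumption for illustration: the `milvus-<component>` naming, one task definition per component, the subnet and security group IDs, and the Cloud Map registry ARN. This is not my actual deployment code.

```python
from typing import Dict, List

# Component names as used by the `milvus run <component>` command.
MILVUS_COMPONENTS = (
    "rootcoord", "datacoord", "querycoord", "indexcoord",
    "proxy", "datanode", "querynode", "indexnode",
)


def fargate_service_request(
    component: str,
    cluster: str,
    subnets: List[str],
    security_groups: List[str],
    registry_arn: str,
    desired_count: int = 1,
) -> Dict[str, object]:
    """Build keyword arguments for ecs.create_service for one component."""
    if component not in MILVUS_COMPONENTS:
        raise ValueError(f"unknown Milvus component: {component}")
    return {
        "cluster": cluster,
        "serviceName": f"milvus-{component}",
        "taskDefinition": f"milvus-{component}",  # hypothetical task definition family
        "desiredCount": desired_count,
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "securityGroups": list(security_groups),
                "assignPublicIp": "DISABLED",  # private subnets behind the ALB
            }
        },
        # Registers the tasks in AWS Cloud Map so other components can find them.
        "serviceRegistries": [{"registryArn": registry_arn}],
    }


# Usage with real AWS (all identifiers hypothetical):
#   import boto3
#   ecs = boto3.client("ecs")
#   ecs.create_service(**fargate_service_request(
#       "proxy", "milvus", ["subnet-123"], ["sg-123"], "arn:aws:servicediscovery:..."))
```

Building the request as plain data first makes it easy to review all eight services side by side before anything is created.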

One of the most significant challenges I faced, and continue to face, is related to etcd. As you may know, etcd uses the Raft protocol, which lets a cluster of nodes maintain a replicated state machine. I managed to deploy a single etcd node in ECS, but to get Raft working, I had to implement several workarounds, such as assigning task names using tags. While not ideal, it was the only viable solution, particularly since ECS does not yet offer an equivalent of Kubernetes StatefulSets.
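
The reason stable identities matter becomes clearer when you look at how a static etcd cluster is bootstrapped: every member has to be started with its own name and the full list of peers. The sketch below generates those startup arguments. It is not the tag-based workaround described above. It assumes each task already has a stable DNS name (for example, one registered in AWS Cloud Map), and the member and host names are hypothetical. The flags and default ports (2379 for clients, 2380 for peers) are standard etcd ones.

```python
from typing import Dict, List


def etcd_member_args(
    name: str, members: Dict[str, str], token: str = "milvus-etcd"
) -> List[str]:
    """Build the etcd command line for one member of a static cluster.

    `members` maps each member name to its stable host name.
    """
    if name not in members:
        raise KeyError(f"{name} is not part of the cluster")
    host = members[name]
    # Every member must be given the same peer list, so sort it for stability.
    initial_cluster = ",".join(
        f"{member}=http://{peer_host}:2380"
        for member, peer_host in sorted(members.items())
    )
    return [
        "etcd",
        f"--name={name}",
        "--listen-peer-urls=http://0.0.0.0:2380",
        "--listen-client-urls=http://0.0.0.0:2379",
        f"--initial-advertise-peer-urls=http://{host}:2380",
        f"--advertise-client-urls=http://{host}:2379",
        f"--initial-cluster={initial_cluster}",
        "--initial-cluster-state=new",
        f"--initial-cluster-token={token}",
    ]


if __name__ == "__main__":
    cluster = {  # hypothetical member names and DNS names
        "etcd-0": "etcd-0.milvus.local",
        "etcd-1": "etcd-1.milvus.local",
        "etcd-2": "etcd-2.milvus.local",
    }
    print(" ".join(etcd_member_args("etcd-1", cluster)))
```

Because the name and advertised URLs differ per member, identical and interchangeable tasks are not enough, which is exactly the gap that a StatefulSet-like feature would fill.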

Currently, I have a functioning cluster, but its etcd lacks high availability. If you have any suggestions on how to improve the architecture, or if you're interested in collaborating on this project, your participation would be greatly appreciated.

Additionally, if you're willing, please consider helping to make the StatefulSets feature available on ECS by supporting this request: . 🙏


Deploying Milvus on AWS using managed services like MSK, Aurora Serverless, ALB, Amazon S3, and ECS Fargate offers significant benefits in terms of scalability, reliability, and ease of management. By adopting this infrastructure, businesses can unlock the full potential of Milvus for high-performance similarity search on massive-scale vector datasets. With AWS services, you can build an elastic and fully managed cloud-native Milvus infrastructure that can handle the most demanding AI and analytics workloads.
