Data has become a key asset in business decision-making, and the way we consume it is evolving rapidly. Organizations increasingly want to build data products and give users controlled access to their data rather than share it externally. This shift is driven by the need for data security, compliance, and cost efficiency. One of our customers faced exactly these challenges while building a data product, and we turned to Amazon EMR Studio and other AWS security and storage services to overcome them.
The customer had a massive amount of data spread across different sources, including S3, Snowflake, and on-premises Oracle tables. We wanted to use this data to create data products (in other words, to sell the solution as Data as a Product) and to give external users access in a controlled environment, because sharing data externally raises several challenges around data security, compliance, and governance. We also had to ensure that users (mainly data analysts and data scientists) could access the data quickly and efficiently. The idea was not to build a complete ML platform but to give external analysts access to data for analytics. Hence, the key requirement while building our data product was to provide data analysts with a secure and controlled environment in which they could explore and analyze data using popular programming languages like Python and R.
Core- Developers and analysts need an integrated development environment (IDE) for data analysis, exploration, and workflow debugging. During our search for a solution, we considered options ranging from a self-managed Spark cluster with Jupyter Notebook to a partially managed setup using Kubernetes and Spark. However, we ultimately chose the fully managed solution provided by EMR Studio with an EMR cluster.
While self-managed and partially managed solutions can offer more control over the infrastructure, they come with several challenges. Managing the infrastructure can be time-consuming and requires significant resources, which can take away from valuable time spent on data analysis. Additionally, managing security and compliance can be complex and expensive.
On the other hand, a fully managed solution like EMR Studio with an EMR cluster takes care of the infrastructure and provides a highly scalable, cost-effective, and secure platform for data processing and analysis. This allows data teams to focus on their work and gain insights from their data without worrying about infrastructure management.
Amazon EMR offers two deployment options that matter here, EMR on EC2 and EMR on EKS, and both let users run analytics workloads without much setup effort. On top of either one, you can use Amazon EMR Studio to build the analytics code that runs on Amazon EMR. EMR Studio is a web-based integrated development environment (IDE) with fully managed Jupyter notebooks that can be attached to an EMR cluster. Users log in to EMR Studio directly through a secure URL with their corporate or IdP credentials, using either AWS Single Sign-On (SSO, now AWS IAM Identity Center) or a compatible external identity provider (IdP).
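To make the IdP-based login a little more concrete, here is a minimal sketch that assembles the arguments one might pass to the EMR `CreateStudio` API (for example through boto3's `create_studio`) in IAM federation mode. The helper name and every ID, ARN, and URL below are placeholder assumptions for illustration, not values from our deployment.

```python
def build_studio_request(name, vpc_id, subnet_ids, service_role_arn,
                         workspace_sg_id, engine_sg_id, default_s3_location,
                         idp_auth_url, idp_relay_state_param="RelayState"):
    """Assemble keyword arguments for EMR's CreateStudio API (IAM federation mode)."""
    if not default_s3_location.startswith("s3://"):
        raise ValueError("default_s3_location must be an s3:// URI")
    return {
        "Name": name,
        "AuthMode": "IAM",  # "SSO" would be the choice with IAM Identity Center
        "VpcId": vpc_id,
        "SubnetIds": list(subnet_ids),
        "ServiceRole": service_role_arn,
        "WorkspaceSecurityGroupId": workspace_sg_id,
        "EngineSecurityGroupId": engine_sg_id,
        "DefaultS3Location": default_s3_location,  # where notebooks are backed up
        "IdpAuthUrl": idp_auth_url,                # IdP sign-in endpoint
        "IdpRelayStateParameterName": idp_relay_state_param,
    }


# Placeholder values only; none of these identifiers come from the article.
studio_request = build_studio_request(
    name="external-analytics-studio",
    vpc_id="vpc-0123456789abcdef0",
    subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222"],
    service_role_arn="arn:aws:iam::111122223333:role/ExampleStudioServiceRole",
    workspace_sg_id="sg-0aaa1111bbbb2222c",
    engine_sg_id="sg-0ddd3333eeee4444f",
    default_s3_location="s3://example-studio-bucket/workspaces/",
    idp_auth_url="https://idp.example.com/app/emr-studio/sso/saml",
)

# With boto3 installed and credentials configured, this could be submitted as:
#   boto3.client("emr").create_studio(**studio_request)
```

Keeping the request as plain data makes it easy to review or version-control before anything is created in the account.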
Authentication & Authorization- Integrating EMR Studio with an identity provider is a critical step in ensuring the security and governance of data products. It gives organizations greater control over user access and activity, helping them comply with regulatory requirements and reduce the risk of data breaches. EMR Studio integrates seamlessly with AWS SSO, but because our end users are external, we could not use AWS SSO. Fortunately, EMR Studio also works with external IdP solutions such as Okta and Ping through IAM federation. We first need to create a SAML app in the IdP using the EMR Studio URL. We can also pass session tags as SAML attributes (the PrincipalTag and TransitiveTagKeys attributes, which surface in IAM policies as aws:PrincipalTag/<key> and aws:TransitiveTagKeys) into the federated EMR Studio session.
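The session-tag idea can be sketched as follows. The first helper builds the SAML attribute names that AWS expects for session tags, and the second builds an IAM role trust policy that allows SAML federation with tagging. The helper names, the `team` tag, and the provider ARN are hypothetical; the IdP-side configuration (Okta, Ping, and so on) is not shown.

```python
import json

SAML_ATTR_PREFIX = "https://aws.amazon.com/SAML/Attributes/"


def saml_session_tag_attributes(tags, transitive_keys=()):
    """Map session tags to the SAML attribute names AWS reads during federation."""
    attrs = {f"{SAML_ATTR_PREFIX}PrincipalTag:{key}": value
             for key, value in tags.items()}
    if transitive_keys:
        attrs[f"{SAML_ATTR_PREFIX}TransitiveTagKeys"] = list(transitive_keys)
    return attrs


def saml_trust_policy(saml_provider_arn, required_tag_keys):
    """Trust policy letting the SAML provider assume the role and set session tags."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Federated": saml_provider_arn},
            "Action": ["sts:AssumeRoleWithSAML", "sts:TagSession"],
            "Condition": {
                "StringEquals": {"SAML:aud": "https://signin.aws.amazon.com/saml"},
                # Require each listed tag to be present in the assertion.
                "StringLike": {f"aws:RequestTag/{key}": "*"
                               for key in required_tag_keys},
            },
        }],
    }


# Hypothetical example: tag every session with the analyst's team.
attributes = saml_session_tag_attributes({"team": "external-analytics"}, ["team"])
policy = saml_trust_policy(
    "arn:aws:iam::111122223333:saml-provider/ExampleIdP", ["team"])
policy_json = json.dumps(policy, indent=2)
```

Permission policies attached to the role can then reference `aws:PrincipalTag/team` to scope what each external user may open.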
Security & Compliance- Firstly, we used the security features of EMR and EMR Studio to ensure that data was encrypted both at rest and in transit. We also implemented strict access controls, ensuring that only authorized users had access to the data. We achieved this by integrating EMR Studio with an identity provider that allowed us to manage user identities and access privileges in a centralized manner.
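One way to express encryption at rest and in transit for an EMR cluster is an EMR security configuration. The sketch below builds such a document as JSON. The KMS key ARN and the certificate bundle location are placeholders, and the choice of SSE-KMS with PEM certificates is an assumption for illustration rather than a description of our exact setup.

```python
import json


def build_security_configuration(kms_key_arn, tls_certs_s3_uri):
    """Build an EMR security configuration enabling both encryption modes."""
    return {
        "EncryptionConfiguration": {
            "EnableAtRestEncryption": True,
            "EnableInTransitEncryption": True,
            "AtRestEncryptionConfiguration": {
                # Data written to S3 through EMRFS
                "S3EncryptionConfiguration": {
                    "EncryptionMode": "SSE-KMS",
                    "AwsKmsKey": kms_key_arn,
                },
                # Data on the cluster's local volumes
                "LocalDiskEncryptionConfiguration": {
                    "EncryptionKeyProviderType": "AwsKms",
                    "AwsKmsKey": kms_key_arn,
                },
            },
            "InTransitEncryptionConfiguration": {
                "TLSCertificateConfiguration": {
                    "CertificateProviderType": "PEM",
                    "S3Object": tls_certs_s3_uri,  # zip of PEM certificates
                }
            },
        }
    }


# Placeholder ARN and S3 path, invented for this example.
security_config = build_security_configuration(
    "arn:aws:kms:us-east-1:111122223333:key/00000000-0000-0000-0000-000000000000",
    "s3://example-config-bucket/certs/emr-certs.zip",
)
security_config_json = json.dumps(security_config)

# With boto3 this JSON string could be registered via:
#   boto3.client("emr").create_security_configuration(
#       Name="example-encryption", SecurityConfiguration=security_config_json)
```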
Secondly, we ensured that our data product was compliant with the relevant regulatory requirements, such as GDPR, CCPA, and HIPAA. We followed best practices and guidelines for data governance and compliance, including keeping a robust audit trail of user activity within the data product.
Finally, we implemented robust monitoring and reporting mechanisms to ensure that we could detect and respond to any security threats or incidents promptly. We used AWS CloudTrail and CloudWatch to monitor user activity within the data product, enabling us to identify any suspicious activity or security threats.
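As a small illustration of the kind of check this monitoring enables, the sketch below scans CloudTrail-style event records for denied EMR API calls. It assumes the records have already been exported and parsed into dictionaries. The function name and the rule of treating any `errorCode` that starts with `AccessDenied` (or equals `UnauthorizedOperation`) as suspicious are assumptions for this example.

```python
def find_denied_calls(records, event_source="elasticmapreduce.amazonaws.com"):
    """Return (time, caller ARN, API name) for denied calls to one AWS service."""
    hits = []
    for record in records:
        if record.get("eventSource") != event_source:
            continue
        error_code = record.get("errorCode") or ""
        if error_code.startswith("AccessDenied") or error_code == "UnauthorizedOperation":
            caller = record.get("userIdentity", {}).get("arn", "unknown")
            hits.append((record.get("eventTime"), caller, record.get("eventName")))
    return hits


# Usage sketch: `records` would be the "Records" list from a CloudTrail log file,
# e.g. json.load(open_log_file)["Records"]; a non-empty result could feed an alert.
```

In practice a CloudWatch alarm or metric filter would usually do this continuously; the function only shows the matching logic.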
We also stored the Snowflake private key in AWS Secrets Manager.
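Reading that key back at runtime might look like the sketch below. It takes a Secrets Manager client as a parameter (for example `boto3.client("secretsmanager")`) and calls its `get_secret_value` method. The function name, the secret ID, and the `private_key` JSON field are hypothetical; the secret could equally be stored as raw PEM text, which the function also handles.

```python
import json


def load_snowflake_private_key(secrets_client, secret_id, field="private_key"):
    """Fetch the Snowflake private key from Secrets Manager.

    Accepts a secret stored either as a JSON object containing `field`
    or as the raw PEM text itself.
    """
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret_string = response["SecretString"]
    try:
        payload = json.loads(secret_string)
    except json.JSONDecodeError:
        return secret_string  # raw PEM text, not JSON
    if isinstance(payload, dict):
        return payload[field]
    return secret_string


# Usage sketch (requires boto3 and AWS credentials; the secret name is a placeholder):
#   import boto3
#   key_pem = load_snowflake_private_key(
#       boto3.client("secretsmanager"), "example/snowflake/private-key")
```

Passing the client in keeps the function easy to test and avoids hard-coding a region or credentials in notebook code.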
Initially, we designed the solution using EMR on EC2, with Spot Instances as task nodes, and applied the following practices:
- Use Spot Instances: Spot Instances are spare EC2 capacity offered at a reduced cost. Using Spot Instances for task nodes can help us reduce our overall costs while still providing the necessary compute resources for our data processing workloads.
- Right-size EC2 instances: We can analyze our workloads to determine the appropriate size of the EC2 instances required for processing the data. Choosing the right-sized instance can help us reduce costs by avoiding overprovisioning of resources.
- Use EMR autoscaling: EMR autoscaling allows us to automatically adjust the number of EC2 instances based on workload demands. This means that we can avoid overprovisioning and underutilization of resources, resulting in cost savings.
- Use lifecycle policies for data in S3: We can use S3 lifecycle policies to automatically transition infrequently accessed data to lower-cost storage classes such as S3 Glacier, which can help us reduce storage costs.
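Two of the practices above can be sketched as plain request data: an instance-group layout that keeps primary and core nodes On-Demand while running task nodes on Spot, and an S3 lifecycle rule that moves older objects to S3 Glacier. The function names, instance types, counts, prefix, and the 90-day threshold are illustrative assumptions, not our production settings.

```python
def emr_instance_groups(primary_type, core_type, task_type, core_count, task_count):
    """Instance groups for EMR on EC2 with Spot capacity used only for task nodes."""
    return [
        {"Name": "primary", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": primary_type, "InstanceCount": 1},
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": core_type, "InstanceCount": core_count},
        # Task nodes hold no HDFS data, so losing a Spot node costs only compute.
        {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
         "InstanceType": task_type, "InstanceCount": task_count},
    ]


def glacier_lifecycle(prefix, days):
    """S3 lifecycle configuration moving objects under `prefix` to Glacier."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [{
            "ID": f"archive-{prefix.strip('/').replace('/', '-')}",
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},
            "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
        }]
    }


# Example values are placeholders.
groups = emr_instance_groups("m5.xlarge", "m5.2xlarge", "m5.2xlarge", 2, 4)
lifecycle = glacier_lifecycle("raw/archive/", 90)

# With boto3, `groups` could be passed as Instances={"InstanceGroups": groups} to
# emr.run_job_flow, and `lifecycle` as LifecycleConfiguration to
# s3.put_bucket_lifecycle_configuration.
```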
However, we wanted to make the EMR cluster more cost-effective and quicker to access for multiple users, so we later explored EMR on EKS. Here are some features of EMR on EKS:
- Kubernetes-based environment: EMR on EKS is built on Kubernetes, which is an open-source container orchestration platform. This allows EMR on EKS to take advantage of the scalability, reliability, and flexibility of Kubernetes.
- Containerization: EMR on EKS uses containers to run big data processing jobs. This makes it easier to manage the environment and scale resources as needed.
- Cost: EMR on EKS may be more cost-effective than EMR on EC2 in certain cases, especially if you have a highly dynamic workload. With EMR on EKS, you only pay for the resources that you use, and you can scale up or down as needed.
- Managed vs. self-managed: With EMR on EC2, each cluster runs on its own EC2 instances that you have to size and keep running. With EMR on EKS, jobs run as pods on a shared Amazon EKS cluster whose control plane AWS manages, so multiple users can share capacity and you can focus on running your big data workloads.
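To give a feel for how work is submitted to EMR on EKS, the sketch below assembles a request for the `StartJobRun` API (boto3's `emr-containers` client, `start_job_run`). The helper name, virtual cluster ID, role ARN, script location, release label, and Spark settings are placeholder assumptions chosen for illustration.

```python
def build_job_run_request(name, virtual_cluster_id, execution_role_arn,
                          entry_point, release_label,
                          executor_instances=2, executor_memory="2G"):
    """Assemble a StartJobRun request for a Spark job on EMR on EKS."""
    if executor_instances < 1:
        raise ValueError("executor_instances must be at least 1")
    spark_params = " ".join([
        f"--conf spark.executor.instances={executor_instances}",
        f"--conf spark.executor.memory={executor_memory}",
    ])
    return {
        "name": name,
        "virtualClusterId": virtual_cluster_id,
        "executionRoleArn": execution_role_arn,
        "releaseLabel": release_label,
        "jobDriver": {
            "sparkSubmitJobDriver": {
                "entryPoint": entry_point,  # script or jar location
                "sparkSubmitParameters": spark_params,
            }
        },
    }


# Placeholder identifiers, invented for this example.
job_request = build_job_run_request(
    name="daily-aggregation",
    virtual_cluster_id="abcdefghijklmnop1234567890",
    execution_role_arn="arn:aws:iam::111122223333:role/ExampleJobExecutionRole",
    entry_point="s3://example-code-bucket/jobs/aggregate.py",
    release_label="emr-6.10.0-latest",
)

# With boto3: boto3.client("emr-containers").start_job_run(**job_request)
```

Because each job declares its own executor count and memory, several users can share one EKS cluster without a dedicated EMR cluster per user.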
Because we wanted to allow external users while preventing data leakage or local storage of data, we also looked at Amazon WorkSpaces and Amazon AppStream 2.0. Both work like a VDI solution in which we can use a separate IdP for AuthN and AuthZ. Both also offer settings to disable file transfer and clipboard copy to the host system.
With Amazon AppStream 2.0, you can deploy and manage applications on Amazon EC2 instances, which users then access through a web browser. You can configure policies to control access to applications and data, and to ensure that data is not copied or leaked. Additionally, AppStream 2.0 lets you monitor user activity and control access to resources through IAM roles and policies.
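The copy and download restrictions can be expressed as AppStream 2.0 stack user settings. The sketch below builds a `UserSettings` list that disables every data-egress action while optionally leaving paste into the session enabled. The helper name and the decision about which actions to block are assumptions for illustration.

```python
# Actions that could move data from the streaming session to the user's device.
EGRESS_ACTIONS = (
    "CLIPBOARD_COPY_TO_LOCAL_DEVICE",
    "FILE_DOWNLOAD",
    "PRINTING_TO_LOCAL_DEVICE",
)
# Actions that bring data from the user's device into the session.
INGRESS_ACTIONS = (
    "CLIPBOARD_COPY_FROM_LOCAL_DEVICE",
    "FILE_UPLOAD",
)


def locked_down_user_settings(allow_paste_into_session=False):
    """Build AppStream stack UserSettings that block data leaving the session."""
    settings = [{"Action": action, "Permission": "DISABLED"}
                for action in EGRESS_ACTIONS]
    for action in INGRESS_ACTIONS:
        allowed = (allow_paste_into_session
                   and action == "CLIPBOARD_COPY_FROM_LOCAL_DEVICE")
        settings.append({"Action": action,
                         "Permission": "ENABLED" if allowed else "DISABLED"})
    return settings


user_settings = locked_down_user_settings(allow_paste_into_session=True)

# With boto3 this list could be supplied as:
#   boto3.client("appstream").create_stack(
#       Name="example-analyst-stack", UserSettings=user_settings)
```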
With Amazon WorkSpaces, you can customize and secure the environment by configuring policies such as network settings, client settings, and access controls, so that data scientists only have access to the necessary data and applications. You can also monitor activity in the environment and control what can be copied or pasted from the WorkSpace to the local machine.
Our experience with EMR Studio taught us several valuable lessons. Firstly, having a controlled environment for data products is crucial for ensuring data security, compliance, and governance. Secondly, EMR Studio provides a user-friendly, collaborative, and fully managed solution for building and deploying big data applications. Thirdly, using EMR Studio saved us significant time and resources, as we did not have to manage our own infrastructure. Last but not least, we wanted to avoid any data leaks, data copying, or clipboard copying. That is when we realized we could use Amazon WorkSpaces, a fully managed virtual desktop infrastructure (VDI) solution, to give our data analysts and data scientists a secure and controlled environment for accessing EMR Studio.
Building a data product and providing external users with access to our data was a challenging task, given the risks and compliance requirements. However, by using AWS EMR Studio, we were able to overcome these challenges and build a secure and efficient solution. EMR Studio allowed us to build a collaborative and user-friendly environment, ensuring that our data was accessed only by authorized users. Overall, we found EMR Studio to be an excellent solution for building and deploying big data applications, and we would recommend it to any organization looking to leverage their data while maintaining data security and compliance.