
Securing ML Workloads in the Cloud and on the Edge

Machine learning vulnerabilities specific to the cloud and edge environments in which these workloads operate are not nearly as visible or as well understood as other areas of cybersecurity today. We can do better.


Abstract

The continued adoption of, and migration to, cloud infrastructure is a defining technology theme for enterprises around the world today. Cloud-based, containerized infrastructure brings a host of dynamic capabilities and scalability for businesses of all sizes.

The global cloud computing market size was valued at $368.97 billion in 2021 and is expected to expand at a compound annual growth rate of 15.7% from 2022 to 2030.¹ Similarly, applications of machine learning (ML) — a statistical learning application within the field of artificial intelligence — are finding their way into more and more products and services thanks to rapid cloud computing expansion, cost accessibility, and growing numbers of trained AI/ML data scientists and engineers.²

GCP, Azure, and AWS

Definitions and Scope

Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) are powerful suites of cloud computing services and the de facto solutions on the market today.

The primary aim of this essay is to review today’s essential security principles for protecting ML workloads, both in traditional cloud networks and in edge-based deployments.

An additional goal, and hopeful byproduct, is to make the technical topic of cloud security more approachable to the non-technical reader, and to the security practitioner beginning their journey into an exciting field at the intersection of cloud computing, ML workloads,³ and edge devices. For the purposes of this essay, I’m going to rely on Amazon’s definition of an “edge device”⁴ as:

[technology] with elements of geography and networking, that brings computing closer to the user. Edge takes place at or near the physical location of either the user or the source of the data.

Foundations

Regardless of cloud provider, ML workloads come with specific security requirements that set them apart from standard cloud security best practices. To illustrate these best practices in a practical and (hopefully) useful manner, we’ll look at AWS and its cloud-based ML platform, Amazon SageMaker, and also touch on a recently released confidential computing feature from GCP.

With respect to intellectual property, the data used in these ML workloads is extremely valuable; special care and attention should be taken, especially when a workload is placed on a system that hosts multiple tenants on the same server at the same time.

While Microsoft, Google, and Amazon maintain their own terminology and architecture specifics for building and deploying workloads in the cloud — the overarching principles for securing the cloud itself, and the ML workloads within it, are thematically similar.⁵

These platforms are also expanding big data support with ML that relies on access to hosted enterprise data; this is important to call out specifically because, unless you are working within a company where AI/ML products are core to the business itself, an ML-assisted or ML-driven data service may be your first exposure to an ML workload in the cloud. This type of service is often referred to as Platform as a Service (PaaS).⁶

NIST defines three service models⁷ which describe the different foundational categories of cloud services:

  • Software as a Service (SaaS) is a full application that’s managed and hosted by the provider. Consumers access it with a web browser, mobile app, or a lightweight client app.

  • Platform as a Service (PaaS) abstracts and provides development or application platforms, such as databases, application platforms (e.g. a place to run Python, PHP, or other code), file storage and collaboration, or even proprietary application processing (such as machine learning, big data processing, or direct Application Programming Interfaces (API) access to features of a full SaaS application). The key differentiator is that, with PaaS, you don’t manage the underlying servers, networks, or other infrastructure.

  • Infrastructure as a Service (IaaS) offers access to a resource pool of fundamental computing infrastructure, such as compute, network, or storage.

ML and other PaaS analysis services aren’t necessarily insecure, nor do they necessarily violate privacy and compliance commitments. However, it’s vital to understand their configuration and data access both internally and within the cloud provider, based on the type of data used. For example, if a PaaS-hosted ML workload runs on the provider’s infrastructure — where cloud provider employees could technically access it — does that create a compliance exposure if the ML data in use is sensitive PII or subject to additional controls?⁸

Interestingly, at their Cloud Security Summit last week, Google announced GCP product developments to partially address privacy concerns of cloud-hosted data while in use. Called Google Confidential Computing, the technology leverages a “confidential” virtual machine (VM) to encrypt data on GCP while in use. I’ve included a detailed product screenshot below from the live webinar, also recorded and available from Google here.

  • Features like this sort of confidential VM-based processing only further reinforce my belief that configuration proficiency on the customer’s part is paramount. Again, it’s very hard to protect something you don’t understand.

Google Confidential Compute Illustrations
Taken from Google’s Cloud Security Summit ’22 — Original air date: May 17, 2022 11:55am | Link to Summit Home Page | Link to Confidential Computing presentation pdf (pg 7–8)
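For readers who want to experiment with this, the sketch below shows roughly how a Confidential VM might be requested programmatically with the google-cloud-compute Python client. The project, zone, and image values are placeholders, and the exact field names should be verified against the current client library documentation — treat this as a sketch, not a definitive recipe.

```python
# Hedged sketch: requesting a GCP Confidential VM with the google-cloud-compute
# client. Project, zone, and resource names below are placeholders.
from google.cloud import compute_v1

def create_confidential_vm(project: str, zone: str, name: str) -> None:
    instance = compute_v1.Instance()
    instance.name = name
    # Confidential VMs require an AMD SEV-capable machine type (N2D family).
    instance.machine_type = f"zones/{zone}/machineTypes/n2d-standard-2"

    # The flag that asks GCP to keep the VM's memory encrypted while in use.
    instance.confidential_instance_config = compute_v1.ConfidentialInstanceConfig(
        enable_confidential_compute=True
    )
    # Confidential VMs cannot be live-migrated, so they must terminate on host maintenance.
    instance.scheduling = compute_v1.Scheduling(on_host_maintenance="TERMINATE")

    instance.disks = [
        compute_v1.AttachedDisk(
            boot=True,
            auto_delete=True,
            initialize_params=compute_v1.AttachedDiskInitializeParams(
                source_image="projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts"
            ),
        )
    ]
    instance.network_interfaces = [
        compute_v1.NetworkInterface(network="global/networks/default")
    ]

    client = compute_v1.InstancesClient()
    operation = client.insert(project=project, zone=zone, instance_resource=instance)
    operation.result()  # block until the VM has been created
```

The notable difference from a standard VM request is the single confidential_instance_config flag; the memory encryption itself is handled by the platform, which again puts the burden on the customer to actually turn the feature on and configure it correctly.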

Within most cloud security resources, there is mention of a “shared responsibility model.” Below is a visual representation of the model from CISA’s Cloud Security Technical Reference Architecture.⁹

CISA shared responsibility model
CISA’s Cloud Security Technical Reference Architecture — Shared Responsibility Model

Depending on the business perspective of the observer, this diagram can in one light be a compelling illustration of offloading on-prem server costs and shifting risk onto the cloud provider (moving from the green into the blue). In another light, it serves as an unyielding reminder that despite PaaS virtualization and even application management by the vendor (i.e. Vertex AI or SageMaker for ML workloads), the configuration is always your responsibility.

To take a slightly more “physical” mountaineering example: you could be in peak cardiovascular shape (storage and server capacity), have the exact gear for the climb (OS, middleware, applications), and even have your expedition planned during ideal weather conditions (networking and identity access management) with an experienced guide (data).

Everything seems to be in place for a successful summit, but if you don’t take the time to lace and tie your boots (configuration) — you’re not going to get very far.


Embrace Zero-Trust

Another major principle worthy of its own, distinct mention is zero trust. Google’s Cybersecurity Action Team described success with “zero trust” to CISOs in their cloud adoption guide as:

Reject the perimeter model and embrace a philosophy of zero trust. Zero trust means that, instead of trusting data and transactions after they have cleared your security perimeter, you verify every piece of data and every operation outside and inside your system.¹⁰
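As one small, concrete illustration of that philosophy on AWS (the role ARN, bucket name, and session policy below are hypothetical), a workload can exchange its identity for short-lived, narrowly scoped credentials on every operation, rather than trusting anything that already sits inside the perimeter:

```python
# A minimal sketch of "verify every operation" on AWS: request short-lived,
# narrowly scoped credentials per task instead of reusing broad, long-lived ones.
# The role ARN and bucket name are hypothetical placeholders.
import json
import boto3

sts = boto3.client("sts")

# A session policy that further restricts whatever the role itself allows,
# here down to read-only access on a single training-data prefix.
session_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": ["arn:aws:s3:::example-ml-training-data/curated/*"],
    }],
}

creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/ml-training-read-only",
    RoleSessionName="training-job-20240101",
    Policy=json.dumps(session_policy),
    DurationSeconds=900,  # credentials expire after 15 minutes
)["Credentials"]

# Use the scoped, expiring credentials for this one job only.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```

Credentials scoped this tightly are far less useful to an attacker who manages to intercept them, and their short lifetime limits the window for misuse.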

General Principles

With the importance of PaaS configuration and zero-trust responsibility firmly in place, there are a few additional cloud security principles that must be established before considering any ML workload. These principles serve to protect against both intentional adversarial manipulation, theft, or destruction, and the unintentional (and often unavoidable) problems caused by hardware or software bugs and vulnerabilities.¹¹

First, the data and computing must be kept private and all entities within the computing environment must be known to be authentic.

  • This is not as straightforward as encrypting the data and sending it into cloud storage — which should be done for any stored data as a matter of principle. ML workloads require computations on the data itself, so all states of data (storage, transit, and in-use) come into play.¹² This privacy is centered around configuration of the management plane, accurate identity and access management (IAM), and trusted execution environments (TEEs).
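As a minimal sketch of the at-rest and in-transit portions of that picture (the bucket name and key alias are hypothetical), training data can be written to S3 over TLS with server-side encryption under a customer-managed KMS key:

```python
# A minimal sketch: uploading training data to S3, encrypted at rest under a
# customer-managed KMS key. The bucket name and key alias are placeholders.
import boto3

# boto3 talks to the HTTPS endpoint by default, covering data in transit.
s3 = boto3.client("s3")

with open("train.parquet", "rb") as data:
    s3.put_object(
        Bucket="example-ml-training-data",
        Key="curated/train.parquet",
        Body=data,
        ServerSideEncryption="aws:kms",           # encrypt at rest with KMS
        SSEKMSKeyId="alias/example-ml-data-key",  # customer-managed key
    )
```

The in-use state is where TEEs and features like the confidential VMs described earlier come into play.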

Second, it should not be possible for an attacker to alter any of the cloud-hosted training data anywhere at any time.

  • Adversarial attacks on ML models have shown hundreds of different vectors for manipulating the models’ confidence factor (prediction) just by having access to the training data.¹³
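One simple, provider-agnostic mitigation is to make tampering detectable: record a cryptographic digest of each training artifact when it is curated, and verify it again immediately before every training run. The sketch below uses hypothetical file names and a simple JSON manifest.

```python
# A sketch of tamper detection for training data: record a SHA-256 digest for
# each curated file, then verify before training. File names and the manifest
# format are hypothetical.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(data_dir: Path, manifest: Path) -> None:
    # Run once, when the dataset is curated and approved.
    hashes = {p.name: sha256_of(p) for p in sorted(data_dir.glob("*.parquet"))}
    manifest.write_text(json.dumps(hashes, indent=2))

def verify_manifest(data_dir: Path, manifest: Path) -> None:
    # Run immediately before every training job.
    expected = json.loads(manifest.read_text())
    for name, digest in expected.items():
        if sha256_of(data_dir / name) != digest:
            raise RuntimeError(f"Training file {name} has been modified since curation")
```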

Third, it should not be possible for an attacker to interfere with the normal operation of the computing platform.

  • Perimeters are not well defined in modern IT systems compared to traditional networks and on-prem architectures; a software-defined networking (SDN) setup can therefore offer effective and flexible security isolation boundaries.
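As a rough sketch of such a software-defined boundary on AWS (the VPC ID is a placeholder), a security group can be created whose members may only talk to one another, giving the ML environment an isolation boundary that follows the workload rather than a physical perimeter:

```python
# A hedged sketch of a software-defined isolation boundary: a security group
# whose members may only communicate with each other. The VPC ID is a placeholder.
import boto3

ec2 = boto3.client("ec2")

sg = ec2.create_security_group(
    GroupName="ml-workload-isolated",
    Description="Only allows traffic between members of this group",
    VpcId="vpc-0123456789abcdef0",
)
sg_id = sg["GroupId"]

# Allow inbound traffic only when the source is the same security group;
# everything else is implicitly denied.
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "-1",
        "UserIdGroupPairs": [{"GroupId": sg_id}],
    }],
)
```

Pairing a group like this with private subnets and VPC endpoints (discussed in the AWS section below) keeps the workload’s traffic off the public internet entirely.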

Lastly, it’s important to recognize the simple fact that ML projects will spring up from business needs across the enterprise, and individual lines of business may set up their own ML environments for exploratory purposes.

However, a key difference between ML projects and other IT projects is that the ML workloads may need to use production data owned outside of a specific line of business in order to train a high-quality ML model output. The feasibility of this is directly correlated to the cloud infrastructure configuration. At a certain point, the need will no doubt arise to centralize ML workload management across business lines and standardize under a common operating model.

Amazon Web Services

Standardization is essential if non-ML teams are to quickly and safely provision new ML workloads from a well-defined template.

Amazon handles this through the concept of organizational units (OUs). OUs simplify permissions management and scope. The following diagram illustrates a general OU structure within AWS using Control Tower — the Amazon product name for governance-based permission control.¹⁴

AWS OU structure
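Control Tower drives most of this from the console, but the AWS Organizations structure underneath it can also be sketched with the API; the root, OU, and policy IDs below are hypothetical placeholders.

```python
# A sketch of the AWS Organizations calls underneath an OU structure like the
# one above. Root, OU, and policy IDs are hypothetical placeholders.
import boto3

org = boto3.client("organizations")

# Create an OU for ML workloads under the organization root.
ou = org.create_organizational_unit(
    ParentId="r-examp",   # the organization root ID
    Name="ML-Workloads",
)["OrganizationalUnit"]

# Attach an existing service control policy (e.g. one that denies disabling
# encryption or logging) to every account placed inside the OU.
org.attach_policy(
    PolicyId="p-exampleguardrail",
    TargetId=ou["Id"],
)
```

Every account later enrolled into that OU inherits the guardrail, which is exactly the kind of standardization that lets non-ML teams provision workloads safely.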

When it comes to getting up and running with an ML workload on AWS, it all starts with a virtual private cloud (VPC); this will be used to host Amazon SageMaker. While technically possible, it’s not a good idea to use a SageMaker notebook outside of a VPC. Another critical component is Amazon’s Key Management Service (KMS); this is used to ensure your data at rest is encrypted end to end — in both the data lake (where it lives) and the data science environment (where we’ll perform ML training and analysis).
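A minimal sketch of that setup through the API looks roughly like the following; the subnet, security group, role, and KMS key values are hypothetical placeholders. The notebook instance is pinned to a private subnet, denied direct internet access, and its storage volume is encrypted with your KMS key.

```python
# A sketch of launching a SageMaker notebook inside a VPC with KMS-encrypted
# storage. All ARNs and IDs below are hypothetical placeholders.
import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_notebook_instance(
    NotebookInstanceName="ml-secure-notebook",
    InstanceType="ml.t3.medium",
    RoleArn="arn:aws:iam::123456789012:role/DataScientistRole",
    SubnetId="subnet-0123456789abcdef0",        # private subnet in the VPC
    SecurityGroupIds=["sg-0123456789abcdef0"],  # the isolated security group
    KmsKeyId="arn:aws:kms:us-east-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab",
    DirectInternetAccess="Disabled",            # traffic must flow through the VPC
    RootAccess="Disabled",                      # notebook users do not get root
)
```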

When instantiating a VPC, Amazon provides a number of user roles by default that must be configured. This speaks to the configuration responsibility above. These roles can control access to your data in Amazon S3, control who can access SageMaker resources like Notebook servers, and even be applied as VPC endpoint policies to put explicit controls around the API endpoints you create in your data science environment.¹⁵
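As an illustration of that last point (the bucket, VPC, and route table IDs are placeholders), the S3 VPC endpoint for the data science environment can carry a policy that only permits access to the environment’s own buckets, so even a leaked credential inside the VPC cannot reach arbitrary external buckets through it:

```python
# A sketch of an S3 VPC endpoint whose policy only allows the data science
# environment's own bucket. All IDs and names are placeholders.
import json
import boto3

ec2 = boto3.client("ec2")

endpoint_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": "*",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": [
            "arn:aws:s3:::example-ml-training-data",
            "arn:aws:s3:::example-ml-training-data/*",
        ],
    }],
}

ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
    PolicyDocument=json.dumps(endpoint_policy),
)
```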

The Problem with “Zero Knowledge Systems”

In cybersecurity, the term “black box” is often used to describe the scenario where an attacker does not have access or insight into the inner workings of a system or model; they can potentially send inputs and only observe the corresponding outputs.

Simple blackbox diagram
https://en.wikipedia.org/wiki/Black_box

To shift perspective within the field of adversarial machine learning (AML), an alternative term seems to describe the situation more fittingly: a zero-knowledge system.

In a zero-knowledge-based risk examination, there is no tangible difference between an attacker who cannot access a model directly during an attack and a defender who, for whatever reason, can’t or won’t understand the model training environment they have. We know security by obscurity alone is not a defense.¹⁶

So what are the downstream implications of regarding deployed ML workloads as “mysterious unknowables” and proceeding to categorize business risk processes that contain these AI models with legacy cyber frameworks?

As we have seen above, the zero-knowledge problem can unfortunately be exacerbated by assisted cloud provider tooling that prioritizes quick production ramp-up and deployment over specifically tailored algorithms and fully tunable governance settings.

On one hand, it’s understandable, and generally a good thing, that bringing the benefits of ML to a greater audience requires simplification and a standardized product experience from the provider. On the other hand, a question:

What forms of bias, transfer learning, or undiscovered vulnerabilities are coming into the “auto” ML outputs when customers choose to use the “easy button” equivalent for provisioning and training ML workloads in the cloud?

Today, an acceptance of this zero-knowledge paradigm in ML applications seems to apply equally to attacker and defender, with very few business operators or risk and corporate governance officers understanding the fundamentals used in building, training, or calibrating ML models in the cloud. I would further argue that enterprises purchasing and deploying off-the-shelf AI solutions — also trained in cloud-based environments — into production business environments today have zero knowledge beyond their own business-process inputs and the model’s corresponding outputs or decisions.

Conclusion

It is hard, if not impossible, to protect something if you don’t understand how it functions. That understanding matters deeply.

And therein lies the problem: if the rapid expansion of enterprise ML is accepted as the newest business tool in the value stream with “zero knowledge” required, we as cybersecurity professionals are opening the door to AI-fueled cyberattacks that will cause problems we have no foreseeable way of understanding.

In the words of security researchers and engineers Clarence Chio and David Freeman, before putting such solutions into the line of fire, it is crucial to consider their weaknesses and understand how malleable they are under stress.¹⁷ Unfortunately, ML security vulnerabilities specific to the cloud and edge environments they operate in are often not nearly as visible or well understood as other areas of cybersecurity today.

Major sectors of society and government are already vulnerable today,¹⁸ and the affected industries and sectors most certainly demand attention.

In 2019, the Harvard Kennedy School’s Belfer Center identified the five sectors most immediately vulnerable to adversarial attack: content filters, the military, law enforcement, any traditionally “human-based” task being replaced by AI, and civil society.¹⁹

While many organizations are eager to capitalize on the “hype” of AI and advancements in ML, they have not adequately scrutinized the security of their own cloud environments, much less the ML tools that live within.²⁰

While the value proposition of AI remains higher than ever, the vulnerabilities associated with ML are not nearly as readily apparent. Combating adversarial example attacks in ML is a very difficult problem — one that will demand our very best efforts to mitigate.

In the face of that challenge, the least we can do as practitioners is to fully understand the environments where we create and train these workloads, and to configure those environments in the safest and most consistent manner. Anything less would be a disservice to the awe-inspiring era of AI advancement we are living through today.


Footnotes

[1] Grand View Research: Cloud Computing Market Size, Share & Trends Analysis Report By Service (IaaS, PaaS, SaaS), By Deployment (Public, Private, Hybrid), By Enterprise Size, By End Use (BFSI, IT & Telecom, Retail & Consumer Goods), By Region, And Segment Forecasts, 2022–2030. (Feb. 2022) Report ID: GVR-4–68038–210–5. https://www.grandviewresearch.com/industry-analysis/cloud-computing-industry

[2] AI is a broad term, generally referring to the ability of computer systems to execute tasks of varying complexity, specifically tasks historically performed by humans.

[3] The term “workload” is used to identify a set of components that together deliver business value. This definition is from Amazon’s AWS Well-Architected Framework (Dec. 2, 2021) pdf.

[4] AWS Whitepaper: Security at the Edge, Core Principles. Sep. 24, 2021. (pdf.)

[5] While relatively new compared to their core businesses, Microsoft, Amazon, and Google maintain a staggering volume of content, documentation, and online training for these products, which can serve as useful starting points for the interested reader.

[6] I’ve found in certain use cases IaaS and PaaS are used relatively interchangeably when in fact they represent different service configurations within cloud computing. At risk of oversimplification, I have chosen to focus on PaaS throughout the article for consistency when describing cloud ML platforms.

[7] Cloud Security Alliance (CSA) Security Guidance: For Critical Areas of Focus in Cloud Computing v4.0 (ePub), pg 11.

[8] Cloud Security Alliance (CSA) Security Guidance: For Critical Areas of Focus in Cloud Computing v4.0 (ePub), pg 148.

[9] Cybersecurity and Infrastructure Security Agency, Cloud Security Technical Reference Architecture. (Aug 2021) v1.0 pg 4.

[10] Google Cybersecurity Action Team (GCAT): https://services.google.com/fh/files/misc/ciso-guide-to-security-transformation.pdf

[11] The following three principles are summarized from writings by:

[12]Required processes to de-encrypt, compute, then re-encrypt can be assumed at least until solutions like homomorphic encryption become more widespread, cost effective, and available by default.

[13] J. Whorley, When Dogs Become Fish: Adversarial Examples in Machine Learning, (Sep. 24, 2021) Medium. https://medium.com/@its.jwho/adversarial-example-vulnerabilities-in-machine-learning-c74dddc67f26

[14] Nivas Durairaj, Dave Walker, Sofian Hamiti, and Stefan Natu. “Setting up secure, well-governed machine learning environments on AWS” (Jul. 01, 2021). https://aws.amazon.com/blogs/mt/setting-up-machine-learning-environments-aws/

[15] Online Course: Amazon SageMaker Workshop: Module — Using Secure Environments. https://sagemaker-workshop.com/security_for_users/security_overview.html

[16] NIST’s cyber resiliency framework, SP 800–160 Volume 2, recommends the use of security through obscurity as a complementary part of a resilient and secure computing environment, but stresses that secrecy alone is not sufficient and can often lead to other unintended negative consequences.

  • Ron Ross (NIST), Richard Graubart (MITRE), Deborah Bodeau (MITRE), and Rosalie McQuaid (MITRE) (21 March 2018). “SP 800–160 Vol. 2 (Final Draft), Systems Security Engineering: Cyber Resiliency Considerations for the Engineering of Trustworthy Secure Systems”. Csrc.nist.gov.

Additional advocacy for discussing security vulnerabilities (with an implied understanding of the vulnerability) in the open from Steven M. Bellovin and Randy Bush. “Hiding security vulnerabilities in algorithms, software, and/or hardware decreases the likelihood they will be repaired and increases the likelihood that they can and will be exploited by evil-doers. Discouraging or outlawing discussion of weaknesses and vulnerabilities is extremely dangerous and deleterious to the security of computer systems, the network, and its citizens.”

[17] Chio, C., Freeman, D. Machine Learning & Security: Protecting Systems with Data and Algorithms. (United States, O’Reilly Media Inc, February, 2018).

[18] AWS Whitepaper: Machine Learning Best Practices for Public Sector Organizations. Sep. 29, 2021. (pdf.)

[19] Marcus Comiter, Attacking Artificial Intelligence: AI’s Security Vulnerability and What Policymakers Can Do About It (Harvard Kennedy School: Belfer Center, Aug. 2019). https://www.belfercenter.org/sites/default/files/2019-08/AttackingAI/AttackingAI.pdf

[20] Ram Shankar Siva Kumar and Ann Johnson, “Cyberattacks against machine learning systems are more common than you think”, Microsoft blog (Oct. 22, 2020). https://www.microsoft.com/security/blog/2020/10/22/cyberattacks-against-machine-learning-systems-are-more-common-than-you-think/

  • Of particular note and concern within this blog post, Microsoft surveyed 28 businesses and found that 25 of them believe they do not have the right tools in place to secure their ML systems.
  • Additionally, colleagues of mine at NYU conducted a survey over the past year of 31 enterprises between 250 and 20,000 employees with only 10% of respondents performing a risk assessment on adversarial-based attacks for ML workloads deployed into their businesses. https://www.aisecurityframework.org/survey-results/
