Solving AWS Network Puzzles with Mathematics - Part 1

Introduction

As engineers working with AWS, many readers have likely run into network issues at some point: communication between EC2 Instances that should succeed never reaches its destination or, conversely, communication that should be blocked gets through. In such cases, it is easy to be at a loss as to which of the numerous, intricately related Security Groups to modify. In these moments, an expert's knowledge and experience are undeniably invaluable in solving the puzzle. However, are these the only tools at our disposal?

This article explores a solution that relies on established mathematical theory, a reasoning aid that works outside one's own head. Specifically, we aim to understand the research paper "Reachability Analysis for AWS-Based Networks", published by Backes et al. in 2019, and the subsequently implemented features, VPC Reachability Analyzer and VPC Network Access Analyzer.

However, this paper is highly specialised, and understanding it without prior knowledge can be challenging. This series of articles will be divided into three parts to provide a comprehensive explanation.

In this initial article, we will examine a concrete example of a network configuration on AWS and explore the functionality and positioning of VPC Reachability Analyzer.

Configuration of a Sample Network

Let's consider a compact, uncomplicated network that is nevertheless rich enough to demonstrate the capabilities of VPC Reachability Analyzer. The diagram below illustrates this network architecture.

Figure: Sample VPC network

The diagram shows a VPC containing two Subnets, X and Y. Subnet X hosts EC2 Instances A and B, and Subnet Y hosts Instance C. Security Groups intended to allow communication on port 22 are attached to the Instances.

Now, let's ask our readers an initial question: is SSH communication from Instance A to C possible? The answer is clearly yes. Security Group 1, attached to both Instances, allows all outbound traffic and accepts inbound traffic on port 22 from within the VPC. On the other hand, it should be equally clear from the diagram that SSH communication from Instance B to C is blocked: the misconfigured Security Group 2 limits outbound destinations to the loopback address 127.0.0.1/32. Moreover, because Instance C is assigned a public IP address, it can access the Internet, which potentially violates your security compliance.
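The reasoning above can be written down as a small model. The following is a minimal sketch, not how AWS evaluates Security Groups internally: it only checks the source's egress rules and the destination's ingress rules, and it ignores routing and network ACLs. The rule values are taken from the template in the Appendix, while the private IP addresses and the helper names (`Rule`, `SecurityGroup`, `can_connect`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    cidr: str
    protocol: str          # "tcp", "udp", or "-1" for all protocols
    from_port: int = 0
    to_port: int = 65535

    def matches(self, peer_ip: str, protocol: str, port: int) -> bool:
        # The peer address must fall inside the rule's CIDR block.
        if ip_address(peer_ip) not in ip_network(self.cidr):
            return False
        # "-1" means every protocol and every port.
        if self.protocol == "-1":
            return True
        return self.protocol == protocol and self.from_port <= port <= self.to_port


@dataclass(frozen=True)
class SecurityGroup:
    ingress: tuple
    egress: tuple


VPC_CIDR = "172.0.0.0/16"
SSH_FROM_VPC = Rule(VPC_CIDR, "tcp", 22, 22)

SG1 = SecurityGroup(ingress=(SSH_FROM_VPC,), egress=(Rule("0.0.0.0/0", "-1"),))
SG2 = SecurityGroup(ingress=(SSH_FROM_VPC,), egress=(Rule("127.0.0.1/32", "-1"),))

# Hypothetical private addresses; the real ones are assigned at launch.
INSTANCES = {
    "A": ("172.0.1.10", SG1),
    "B": ("172.0.1.20", SG2),
    "C": ("172.0.2.10", SG1),
}


def can_connect(src: str, dst: str, protocol: str = "tcp", port: int = 22) -> bool:
    """True if src's egress rules and dst's ingress rules both allow the traffic."""
    src_ip, src_sg = INSTANCES[src]
    dst_ip, dst_sg = INSTANCES[dst]
    egress_ok = any(r.matches(dst_ip, protocol, port) for r in src_sg.egress)
    ingress_ok = any(r.matches(src_ip, protocol, port) for r in dst_sg.ingress)
    return egress_ok and ingress_ok


print(can_connect("A", "C"))  # A to C over SSH
print(can_connect("B", "C"))  # B to C over SSH
```

Under these assumptions the first call reports that the connection is allowed and the second that it is not, because no egress rule of Security Group 2 covers Instance C's address.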

This might seem like a simple matter. However, is network connectivity always straightforward? Certainly not. In real-world production environments, many EC2 Instances and other computational resources are closely interconnected with Security Groups and Route Tables involving routing to multiple networks. In such complex production environments, determining which Instances can communicate with others and which cannot is a task full of complexity.

So, what are the potential consequences of this complexity? You carefully review the infrastructure, defined using CDK or Terraform, on GitHub. It looks perfect. You go ahead and deploy. At this point, you realise that your carefully deployed network isn't working as expected.

Dual Approaches to Troubleshooting

When faced with a problem, two main troubleshooting methods are available to us.

The first approach involves actively sending packets and then checking connectivity. This is a well-established technique that predates our move to the cloud. From the essential ping, traceroute, and telnet commands to the more versatile netcat utility, an experienced infrastructure engineer's toolbox is filled with various tools. Alternatively, if the issue is within a database, one may investigate the application's behaviour and logs to determine the root cause.
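As a rough illustration of the first approach, the following sketch does programmatically what one would otherwise do with netcat or telnet: it attempts a TCP connection and reports whether it succeeded. The function name `tcp_port_open` and the example address are assumptions made for this example.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) completes within timeout."""
    try:
        # create_connection performs the TCP handshake; the context manager
        # closes the socket again as soon as the handshake has succeeded.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts alike.
        return False


if __name__ == "__main__":
    # Hypothetical private address of the destination instance.
    print(tcp_port_open("172.0.2.10", 22))
```

Note what the result does not tell us: a `False` only says that the handshake did not complete, not which Security Group, Route Table, or other component was responsible.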

The second approach involves forgoing the sending of actual packets and carefully examining the configuration to identify the root cause. By logging into the AWS Management Console and navigating through numerous pages, combining the various pieces of information in one's mind, one may ultimately identify the incorrect configuration that caused the problem. This is an impressive accomplishment, a testament to the skill of an experienced engineer.

The two approaches described above are, in a way, opposites. One involves active trial and error, while the other casts the engineer as an armchair detective who deduces the cause. However, a key characteristic unites these two methods: they are both stressful. When many configurations are closely intertwined, a single misplaced entry in a Route Table can block communication. Few engineers would voluntarily take on the task of solving this puzzle.

Power of VPC Reachability Analyzer

Who, then, shall unravel this enigma if not you? The answer lies in VPC Reachability Analyzer. This powerful tool can automatically answer questions about network behaviour, such as those in the examples above, without requiring a human to trace the configuration.

The role of humans is limited to specifying the conditions for communication: for example, the source and destination EC2 Instances, the protocol, and the port number. At the cost of a small amount of time and a usage fee, the Analyzer returns a verdict on whether the communication is possible.
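For readers who prefer the API to the console, the following sketch shows roughly what "specifying the conditions" looks like with boto3. It assumes that `ec2` is a boto3 EC2 client (for example `boto3.client("ec2")`) with permission to use Reachability Analyzer; the function name `analyze_path`, the polling parameters, and the instance IDs in the usage note are hypothetical. Each analysis started this way is billed.

```python
import time


def analyze_path(ec2, source_id, destination_id, port=22, protocol="tcp",
                 poll_seconds=5.0, max_polls=60):
    """Create a path, start an analysis, and poll until it is no longer running.

    `ec2` is assumed to be a boto3 EC2 client. Returns the analysis
    description, which includes 'Status' and, on success, 'NetworkPathFound'.
    """
    # 1. Describe the communication to check: source, destination, protocol, port.
    path = ec2.create_network_insights_path(
        Source=source_id,
        Destination=destination_id,
        Protocol=protocol,
        DestinationPort=port,
    )["NetworkInsightsPath"]

    # 2. Ask the Analyzer to evaluate the path. No packets are sent.
    analysis = ec2.start_network_insights_analysis(
        NetworkInsightsPathId=path["NetworkInsightsPathId"]
    )["NetworkInsightsAnalysis"]

    # 3. The analysis runs asynchronously, so poll for its result.
    for _ in range(max_polls):
        if analysis["Status"] != "running":
            break
        time.sleep(poll_seconds)
        analysis = ec2.describe_network_insights_analyses(
            NetworkInsightsAnalysisIds=[analysis["NetworkInsightsAnalysisId"]]
        )["NetworkInsightsAnalyses"][0]

    return analysis
```

A call such as `analyze_path(ec2, "i-0123456789abcdef0", "i-0fedcba9876543210")` (both IDs are placeholders) would then return a description whose `NetworkPathFound` field carries the verdict.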

What is particularly interesting is that during this operation, the Analyzer does not send any packets. Instead, it examines the configuration items rather than the actual communication results. Like an experienced engineer, the Analyzer possesses "knowledge" about the network.

This leads to another significant advantage. If the Analyzer assessed reachability by sending packets, in a way similar to the traceroute command, the packets could not proceed beyond the first component whose configuration blocks them, and no information about the rest of the communication path could be obtained.

However, in reality, the Analyzer analyses the "meaning" of the configurations. As a result, it can provide a comprehensive answer, including which configurations are problematic and require fixing. This becomes clear when examining the actual results screen presented below.

Figure: Result shown by VPC Reachability Analyzer

Upon initial inspection of the results, one is immediately drawn to the red icon, accompanied by the concerning message: "None of the egress rules in the following security groups apply." This troubling finding reveals the root cause of the SSH connection's failure, originating from the previously discussed misconfiguration of the Security Group. Moreover, it is particularly significant that the Analyzer displays the entire path from the source to the destination, clearly identifying the point of misconfiguration. This clear visualisation offers valuable guidance on resolving the network connectivity problem.
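The difference between probing with packets and analysing configuration, as described in this section, can be illustrated with a toy model. This is only a sketch: the list of components on the path from Instance B to Instance C and their order are assumptions made for illustration, and the names `Hop`, `probe`, and `analyze` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Hop(NamedTuple):
    component: str
    allows: bool


# Assumed sequence of components between Instance B and Instance C.
PATH_B_TO_C = [
    Hop("Instance B network interface", True),
    Hop("Security Group 2 (egress)", False),   # only 127.0.0.1/32 is allowed
    Hop("Route Table of Subnet X", True),
    Hop("Security Group 1 (ingress)", True),
    Hop("Instance C network interface", True),
]


def probe(path: List[Hop]) -> List[str]:
    """Behave like a packet: travel hop by hop and stop at the first block."""
    visited = []
    for hop in path:
        visited.append(hop.component)
        if not hop.allows:
            break
    return visited


def analyze(path: List[Hop]) -> List[Tuple[str, str]]:
    """Behave like a configuration analysis: judge every hop on the path."""
    return [(hop.component, "ok" if hop.allows else "BLOCKED") for hop in path]


print(probe(PATH_B_TO_C))
for component, verdict in analyze(PATH_B_TO_C):
    print(f"{verdict:8} {component}")
```

The probe learns nothing about the components behind Security Group 2, whereas the analysis reports on the whole path and marks the blocking component.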

Syntax, Rules and Semantics

As previously explained, VPC Reachability Analyzer represents a verification paradigm that understands the inherent "meaning" of the configurations. When aiming to verify infrastructure as code, one can divide the verification levels into three tiers.

The first tier relates to syntactic verification, representing the most basic level. Imagine a scenario where you define your infrastructure using CloudFormation templates. In this context, syntactic verification is equivalent to examining the YAML file structure's compliance with the specified schema. As syntactic errors are directly related to the failure of the deployment itself, the necessity of this verification stands as an undeniable prerequisite. However, as previously noted, it is not a sufficient condition. The deployed resources operate according to your definitions rather than your intentions.

The second tier is rule-based verification. One of the most typical rules in the networking domain is the prohibition of access from 0.0.0.0/0. This rule, which comes pre-configured in AWS Trusted Advisor, guards against exposure to the Internet. However, it is simply an explicit formulation of the knowledge that "allowing 0.0.0.0/0 is equivalent to granting access from the Internet": each rule is judged in isolation, without regard to how it interacts with other configurations.
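A rule-based check of this kind is easy to sketch. The example below is not the Trusted Advisor implementation; it is a minimal illustration that scans Security Group definitions, given here as plain dictionaries mirroring the Appendix template with the VPC CIDR already resolved, for ingress rules open to 0.0.0.0/0. The function name and the `OpenToWorld` group are hypothetical.

```python
def open_ingress_findings(security_groups: dict) -> list:
    """Return one finding per ingress rule whose source is 0.0.0.0/0."""
    findings = []
    for name, group in security_groups.items():
        for rule in group.get("SecurityGroupIngress", []):
            if rule.get("CidrIp") == "0.0.0.0/0":
                findings.append(
                    f"{name}: ingress open to 0.0.0.0/0 "
                    f"(ports {rule.get('FromPort')}-{rule.get('ToPort')})"
                )
    return findings


SSH_FROM_VPC = {"CidrIp": "172.0.0.0/16", "IpProtocol": "tcp", "FromPort": 22, "ToPort": 22}

# The two groups of the sample network.
SAMPLE_GROUPS = {
    "SecurityGroup1": {
        "SecurityGroupIngress": [SSH_FROM_VPC],
        "SecurityGroupEgress": [{"CidrIp": "0.0.0.0/0", "IpProtocol": "-1"}],
    },
    "SecurityGroup2": {
        "SecurityGroupIngress": [SSH_FROM_VPC],
        "SecurityGroupEgress": [{"CidrIp": "127.0.0.1/32", "IpProtocol": "-1"}],
    },
}

# A hypothetical group, not part of the sample, that violates the rule.
HYPOTHETICAL_GROUPS = {
    "OpenToWorld": {
        "SecurityGroupIngress": [
            {"CidrIp": "0.0.0.0/0", "IpProtocol": "tcp", "FromPort": 22, "ToPort": 22}
        ],
    },
}

print(open_ingress_findings(SAMPLE_GROUPS))
print(open_ingress_findings(HYPOTHETICAL_GROUPS))
```

Note that the sample network passes this check without any findings, even though Security Group 2 is misconfigured: the rule says nothing about whether Instance B can actually reach Instance C.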

The third and highest tier of verification under consideration is semantic verification. At this level, by providing the configurations as input, one can perform a verification that considers the "meaning" of how these configurations will manifest their effects. Imagine a scenario involving numerous components, where the interaction of their respective rules ultimately determines the network's reachability. In semantic verification, the need for individual judgments is eliminated by proactively formalising the mutual influence of rules, enabling the verification to determine whether the desired outcome will be achieved as the final result.
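To make "the interaction of rules" concrete, the following sketch asks which Instances of the sample network can reach the Internet. It is a deliberately simplified model, not the Analyzer's algorithm: an Instance is considered able to reach the Internet only if its Security Group allows the egress, its Subnet routes the destination to the Internet Gateway, and it receives a public IP address. It assumes that Subnet X, which has no explicit Route Table association in the Appendix template, falls back to a main Route Table containing only the local route, and it ignores network ACLs. The destination address and all names are chosen for illustration.

```python
from ipaddress import ip_address, ip_network

# An address from a documentation range, standing in for "some Internet host".
INTERNET_HOST = "203.0.113.10"

SUBNETS = {
    "X": {"routes": {"172.0.0.0/16": "local"}, "public_ip_on_launch": False},
    "Y": {"routes": {"172.0.0.0/16": "local", "0.0.0.0/0": "igw"},
          "public_ip_on_launch": True},
}
SG_EGRESS = {"SG1": ["0.0.0.0/0"], "SG2": ["127.0.0.1/32"]}
INSTANCES = {"A": ("X", "SG1"), "B": ("X", "SG2"), "C": ("Y", "SG1")}


def route_target(routes: dict, dest_ip: str):
    """Pick the target of the most specific route that covers dest_ip."""
    matching = [ip_network(cidr) for cidr in routes
                if ip_address(dest_ip) in ip_network(cidr)]
    if not matching:
        return None
    best = max(matching, key=lambda net: net.prefixlen)
    return routes[str(best)]


def can_reach_internet(instance: str) -> bool:
    """Combine three independent configurations into one verdict."""
    subnet_name, sg_name = INSTANCES[instance]
    subnet = SUBNETS[subnet_name]
    sg_allows = any(ip_address(INTERNET_HOST) in ip_network(cidr)
                    for cidr in SG_EGRESS[sg_name])
    routed_to_igw = route_target(subnet["routes"], INTERNET_HOST) == "igw"
    return sg_allows and routed_to_igw and subnet["public_ip_on_launch"]


for name in INSTANCES:
    print(name, can_reach_internet(name))
```

In this model Instance A cannot reach the Internet although its Security Group allows egress to 0.0.0.0/0, because no route leads there. No single rule reveals this; only the combination does.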

For the readers who have read this article thus far, it will be clear that this semantic verification is the distinguishing feature of VPC Reachability Analyzer. It is a decisive advantage for us engineers who are constantly challenged by network puzzles.

Conclusion

In this article, we have presented the challenges of network troubleshooting on AWS and introduced the capabilities of VPC Reachability Analyzer as a solution. Moreover, we have defined three tiers of verification for codified infrastructure, positioning Reachability Analyzer as the third tier enabling semantic verification.

In the subsequent article, we shall address how VPC Reachability Analyzer performs semantic verification, or in other words, what mechanisms it employs to comprehend the "meaning" of configurations. Specifically, we will introduce the foundational technologies of SAT and SMT solvers and elucidate their operating principles.

Appendix

You can try out VPC Reachability Analyzer and VPC Network Access Analyzer using the CloudFormation template below.

Please note that I cannot be held responsible for any issues arising from the use of this template. In particular, be aware that it launches actual EC2 instances, albeit small ones, and that both Analyzers incur costs (especially Reachability Analyzer, which is relatively expensive at 0.1 USD per execution). Also note that the template's RegionMap only defines an AMI for ap-northeast-1, so it can be deployed only in that region as written.

---
Description: 'Demo for VPC Analyzers'
AWSTemplateFormatVersion: 2010-09-09

Mappings:
  RegionMap:
    ap-northeast-1:
      execution: ami-02892a4ea9bfa2192

Resources:
  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 172.0.0.0/16

  InternetGateway:
    Type: AWS::EC2::InternetGateway

  InternetGatewayAttachement:
    Type: AWS::EC2::VPCGatewayAttachment
    Properties:
      InternetGatewayId: !Ref InternetGateway
      VpcId: !Ref VPC

  SubnetX:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: 172.0.1.0/24

  SubnetY:
    Type: AWS::EC2::Subnet
    DependsOn: InternetGateway
    Properties:
      VpcId: !Ref VPC
      CidrBlock: 172.0.2.0/24
      MapPublicIpOnLaunch: true

  PublicRouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VPC

  PublicRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref PublicRouteTable
      SubnetId: !Ref SubnetY

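  PublicRoute:
    Type: AWS::EC2::Route
    # The gateway must be attached to the VPC before a route can target it.
    DependsOn: InternetGatewayAttachement
    Properties:
      RouteTableId: !Ref PublicRouteTable
      GatewayId: !Ref InternetGateway
      DestinationCidrBlock: 0.0.0.0/0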

  SecurityGroup1:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: 'Sample SG 1'
      VpcId: !Ref VPC
      SecurityGroupIngress:
        - CidrIp: !GetAtt VPC.CidrBlock
          IpProtocol: 'tcp'
          FromPort: 22
          ToPort: 22
      SecurityGroupEgress:
        - CidrIp: 0.0.0.0/0
          IpProtocol: '-1'

  SecurityGroup2:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: 'Sample SG 2'
      VpcId: !Ref VPC
      SecurityGroupIngress:
        - CidrIp: !GetAtt VPC.CidrBlock
          IpProtocol: 'tcp'
          FromPort: 22
          ToPort: 22
      SecurityGroupEgress:
        - CidrIp: 127.0.0.1/32
          IpProtocol: '-1'

  InstanceA:
    Type: AWS::EC2::Instance
    Properties:
      ImageId:
        Fn::FindInMap:
          - RegionMap
          - !Ref AWS::Region
          - execution
      InstanceType: 't3.nano'
      SubnetId: !Ref SubnetX
      SecurityGroupIds:
        - !Ref SecurityGroup1
      Tags:
        - Key: Name
          Value: InstanceA

  InstanceB:
    Type: AWS::EC2::Instance
    Properties:
      ImageId:
        Fn::FindInMap:
          - RegionMap
          - !Ref AWS::Region
          - execution
      InstanceType: 't3.nano'
      SubnetId: !Ref SubnetX
      SecurityGroupIds:
        - !Ref SecurityGroup2
      Tags:
        - Key: Name
          Value: InstanceB

  InstanceC:
    Type: AWS::EC2::Instance
    Properties:
      ImageId:
        Fn::FindInMap:
          - RegionMap
          - !Ref AWS::Region
          - execution
      InstanceType: 't3.nano'
      SubnetId: !Ref SubnetY
      SecurityGroupIds:
        - !Ref SecurityGroup1
      Tags:
        - Key: Name
          Value: InstanceC

  ReachablePath:
    Type: AWS::EC2::NetworkInsightsPath
    Properties:
      Source: !Ref InstanceA
      Destination: !Ref InstanceC
      DestinationPort: 22
      Protocol: tcp
      Tags:
        - Key: Name
          Value: 'Reachable Path'

  BlockedPath:
    Type: AWS::EC2::NetworkInsightsPath
    Properties:
      Source: !Ref InstanceB
      Destination: !Ref InstanceC
      DestinationPort: 22
      Protocol: tcp
      Tags:
        - Key: Name
          Value: 'Blocked Path'

  AccessToInternet:
    Type: AWS::EC2::NetworkInsightsAccessScope
    Properties:
      MatchPaths:
        - Destination:
            ResourceStatement:
              ResourceTypes:
                - AWS::EC2::InternetGateway
      Tags:
        - Key: Name
          Value: 'All Access To Internet'