In this post, I will show how to troubleshoot a Site-to-Site VPN connection between AWS and a remote site.
Our customer wants to continuously back up their data from Amazon Aurora MySQL to a local cloud provider over a private connection, so an IPsec Site-to-Site VPN connection is required.
On the AWS side, the Virtual Private Gateway (VGW) provides two tunnels to the Customer Gateway (CGW) for high availability. If a device fails within AWS, the VPN connection automatically fails over to the second tunnel so that the connection is not interrupted. The local cloud provider's border router, however, supports only a single tunnel. The picture below shows the scenario:
From time to time, AWS performs routine maintenance on the VPN connection, such as tunnel endpoint replacement, which might briefly disable the tunnel. Refer to this for more information. Because only one tunnel is in use here, this maintenance interrupts the data replication.
In my case, the VPN connection was configured with the default parameters. As a result, the tunnel went down and did not recover automatically, even after the endpoint replacement had finished. We can check the VPN connection status through the tunnel state metric (TunnelState) in CloudWatch.
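To look at the same metric from a script, a minimal sketch with boto3 could look like the following. It assumes the tunnel is identified by its outside IP address through the TunnelIpAddress dimension of the AWS/VPN namespace; the function names, the 24-hour window, and the example address are illustrative choices, not part of the original setup.

```python
from datetime import datetime, timedelta, timezone


def tunnel_state_query(tunnel_ip, hours=24, period=300):
    """Build the keyword arguments for CloudWatch get_metric_statistics."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/VPN",
        "MetricName": "TunnelState",
        # tunnel_ip is the outside IP address of the tunnel on the AWS side
        "Dimensions": [{"Name": "TunnelIpAddress", "Value": tunnel_ip}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period,
        "Statistics": ["Minimum"],
    }


def down_periods(datapoints):
    """Return the sorted timestamps of datapoints where the tunnel was not up."""
    return sorted(p["Timestamp"] for p in datapoints if p["Minimum"] < 1)


def fetch_down_periods(tunnel_ip):
    """Query CloudWatch; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without AWS access

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(**tunnel_state_query(tunnel_ip))
    return down_periods(response["Datapoints"])
```

Calling fetch_down_periods("203.0.113.10") (a placeholder address) would list the five-minute periods in the last day during which the tunnel was reported down.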
After investigating, I found that the issue was caused by the deletion of the IKE_SA (Internet Key Exchange Security Association) between the VGW and the CGW. IKE is the key exchange protocol used by IPsec: it authenticates the two VPN peers and negotiates the IPsec security associations (SAs) over a protected channel. So, if the IKE_SA is deleted and not re-established, the IPsec VPN connection goes down.
When AWS replaces the endpoint, the VPN connection is interrupted. The IKE_SA is kept for a limited time, defined by the Dead Peer Detection (DPD) timeout parameter on both the VGW and the CGW. Once the DPD timeout expires, the VGW or the CGW sends an IKE_SA deletion request to the other side, and the IKE_SA is deleted.
As soon as the endpoint replacement finishes, either the CGW or the VGW should initiate the IKE negotiation to bring the VPN tunnel back up. If both of them just keep waiting for the other, the tunnel stays down indefinitely. Unfortunately, that was my case.
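The deadlock can be summarised in a few lines of Python. This is only a simplified model of the behaviour described in this post, not of the real IKE state machines; the function name and the cgw_initiates flag are hypothetical, and the option values mirror the AWS tunnel options discussed in this post.

```python
def tunnel_recovers(dpd_timeout_action, startup_action, cgw_initiates):
    """Return True if someone renegotiates IKE after the IKE_SA was deleted.

    Simplified model (an assumption, not AWS's actual logic):
    - dpd_timeout_action "restart" makes AWS restart the IKE session
      when the DPD timeout occurs;
    - startup_action "start" makes AWS initiate the IKE negotiation
      instead of waiting for the CGW;
    - cgw_initiates tells whether the CGW starts negotiating on its own.
    """
    aws_initiates = dpd_timeout_action == "restart" or startup_action == "start"
    return aws_initiates or cgw_initiates


# Default AWS options with a CGW that only waits: nobody initiates.
print(tunnel_recovers("clear", "add", cgw_initiates=False))

# Same CGW, but AWS is told to restart the session and to initiate IKE.
print(tunnel_recovers("restart", "start", cgw_initiates=False))
```

With the default options and a passive CGW, neither side initiates, which matches the permanent outage I observed.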
Because I could not change the configuration of the CGW, which belongs to the local cloud provider, I changed the settings on the AWS side instead. There are three VPN tunnel options to consider:
DPD timeout: 30 seconds by default. We can increase this value to cover the endpoint replacement time. However, we do not know how long the replacement actually takes, and a longer timeout also delays the failover to the second tunnel.
DPD timeout action: The default value is "Clear", which ends the IKE session and clears the routes. I changed it to "Restart", which restarts the IKE session when the DPD timeout occurs.
Startup action: The default value is "Add", which requires the CGW to initiate the IKE negotiation to bring the tunnel up. In my case, it should be "Start", so that AWS proactively initiates the IKE negotiation to bring the tunnel up instead of waiting for the CGW.
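These three options can also be set from a script. The sketch below builds the request for the EC2 ModifyVpnTunnelOptions API and sends it with boto3; the helper names, the VPN connection ID, and the tunnel address are placeholders, and the DPD timeout is left at 30 seconds as an assumption rather than a recommendation.

```python
VALID_DPD_ACTIONS = {"clear", "none", "restart"}
VALID_STARTUP_ACTIONS = {"add", "start"}


def tunnel_options_request(vpn_connection_id, tunnel_outside_ip,
                           dpd_timeout_seconds=30,
                           dpd_timeout_action="restart",
                           startup_action="start"):
    """Build the keyword arguments for ec2.modify_vpn_tunnel_options."""
    if dpd_timeout_seconds < 30:
        raise ValueError("DPD timeout must be at least 30 seconds")
    if dpd_timeout_action not in VALID_DPD_ACTIONS:
        raise ValueError(f"unknown DPD timeout action: {dpd_timeout_action}")
    if startup_action not in VALID_STARTUP_ACTIONS:
        raise ValueError(f"unknown startup action: {startup_action}")
    return {
        "VpnConnectionId": vpn_connection_id,
        "VpnTunnelOutsideIpAddress": tunnel_outside_ip,
        "TunnelOptions": {
            "DPDTimeoutSeconds": dpd_timeout_seconds,
            "DPDTimeoutAction": dpd_timeout_action,
            "StartupAction": startup_action,
        },
    }


def apply_tunnel_options(request):
    """Send the request; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without AWS access

    ec2 = boto3.client("ec2")
    return ec2.modify_vpn_tunnel_options(**request)


# Placeholder identifiers; replace them with your own VPN connection and tunnel.
request = tunnel_options_request("vpn-0123456789abcdef0", "203.0.113.10")
print(request["TunnelOptions"])
```

Keeping the request builder separate from the API call makes it possible to review or log the exact options before applying them with apply_tunnel_options(request).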
For more information, you can refer to this.
In this post, I showed how to deal with an AWS Site-to-Site VPN connection issue. Hopefully, it will be helpful to you. If you have any questions, do not hesitate to leave a comment. Thank you for reading!