Reachability Analyzer is a fantastic little tool from AWS. However, it does have a number of gotchas that I encountered during a recent debugging exercise. I couldn't find any good resources at the time, so here's a collection of learnings from the experience.
The actual problem brought to me by one of my colleagues was that their Lambda function using SFTP was timing out when connecting to the server. We knew that the Lambda had some sort of web connectivity because HTTP/S requests were successfully being made. Our initial investigation brought us to Stack Overflow.
Although useful, this article was not able to solve our problem. Inspecting the Lambda's VPC configuration, we could see which Subnets it was attached to and which Security Groups it used. We knew from other services that the VPC was operational, so the likely culprits were the Security Groups and the Network Access Control Lists. Yet both indicated that everything should have been working correctly.
At this point I decided to give Reachability Analyzer a spin to see if it could illuminate the problem.
Using Reachability Analyzer
In cloud infrastructure it's pretty hard to use the usual network debugging techniques I've learnt from managing the office firewall. Whilst on your own hosts you can easily run tcpdump along with other logging tools to inspect state, you can't do this on managed infrastructure in the cloud.
Enter Reachability Analyzer, which can simulate requests in your VPC without needing access to any of the running infrastructure. The best part of this tool is that it also provides an explanation for each step (contrast this with comparing outcomes and logs across multiple different hosts and systems).
However, if you read the documentation you will see that Reachability Analyzer doesn't work directly with Lambda functions, so how can we debug this?
From poking around the AWS Console, we knew that Lambda functions could have Elastic Network Interfaces (ENIs), because ENIs belonging to other functions in the same VPC were visible there.
Yet none existed for our function!
A Brief Introduction to Lambda Networking
Luckily the documentation for VPC networking with Lambda is quite clear.
The key point from the documentation is that Lambda Functions will share ENIs where possible. This sharing is based on the combination of VPC Subnet and Security Groups applied to the function.
This explains why only some Lambda Functions are listed in the ENI descriptions: they happened to be the first Lambda Function that needed that particular Subnet-Security Group combination.
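Because the ENI is tied to the Subnet and Security Group combination rather than to the function, one way to find the interface your function is actually using is to search by that combination. Below is a minimal sketch, assuming `ec2_client` is something like `boto3.client("ec2")`; the helper names are my own, the IDs you pass in are yours to supply, and pagination is ignored for brevity.

```python
def lambda_eni_filters(subnet_id, security_group_ids):
    """Build EC2 filters for Lambda-managed ENIs in one subnet."""
    return [
        # Only interfaces that the Lambda service manages.
        {"Name": "interface-type", "Values": ["lambda"]},
        {"Name": "subnet-id", "Values": [subnet_id]},
        # Note: this filter matches ENIs that have ANY of these groups,
        # so the exact set is checked again in find_shared_lambda_enis.
        {"Name": "group-id", "Values": sorted(security_group_ids)},
    ]


def find_shared_lambda_enis(ec2_client, subnet_id, security_group_ids):
    """Return IDs of Lambda ENIs whose security groups match exactly.

    `ec2_client` is assumed to behave like boto3.client("ec2").
    Only the first page of results is inspected.
    """
    wanted = set(security_group_ids)
    response = ec2_client.describe_network_interfaces(
        Filters=lambda_eni_filters(subnet_id, security_group_ids)
    )
    return [
        eni["NetworkInterfaceId"]
        for eni in response["NetworkInterfaces"]
        if {group["GroupId"] for group in eni["Groups"]} == wanted
    ]
```

Any ENI this returns may carry another function's name in its description, but it should be the one your function shares.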
Armed with this knowledge, we can create a rough map of our network:
Reachability Analyzer Take #2
With the map in hand, it's time to find the source of our problems. My first attempt was to analyze the path from one of the Lambda ENIs to the Internet Gateway:
I'll be honest: this trace left me stumped for quite a while. We knew these ENIs had access to the internet when making HTTP requests, yet here there was no route?
Eventually I stumbled upon the (undocumented) answer when I attempted to break down the connection by first checking against the NAT Gateway.
Firstly, this told me that I wasn't going insane and that the networking configuration I had spent an hour looking at was indeed correct.
Secondly, it told me that Reachability Analyzer acts a little bit like a Layer 2 network, in that it does not understand the transformations that result from the use of a NAT.
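The hop-by-hop approach can also be scripted rather than clicked through in the console. This is a rough sketch, assuming `ec2_client` behaves like `boto3.client("ec2")`. The helper names and the resource IDs in the usage comment are hypothetical, and using the NAT Gateway's own network interface to represent the NAT hop is my assumption. Remember that every analysis started this way is billed.

```python
import time


def analyze_hop(ec2_client, source_id, destination_id, port,
                poll_seconds=2.0, max_polls=30):
    """Create a path for one hop, run an analysis, and return the result."""
    path = ec2_client.create_network_insights_path(
        Source=source_id,
        Destination=destination_id,
        Protocol="tcp",
        DestinationPort=port,
    )
    path_id = path["NetworkInsightsPath"]["NetworkInsightsPathId"]
    started = ec2_client.start_network_insights_analysis(
        NetworkInsightsPathId=path_id
    )
    analysis_id = started["NetworkInsightsAnalysis"]["NetworkInsightsAnalysisId"]
    # Poll until the analysis leaves the "running" state.
    for _ in range(max_polls):
        result = ec2_client.describe_network_insights_analyses(
            NetworkInsightsAnalysisIds=[analysis_id]
        )
        analysis = result["NetworkInsightsAnalyses"][0]
        if analysis["Status"] != "running":
            return analysis
        time.sleep(poll_seconds)
    raise TimeoutError(f"analysis {analysis_id} did not finish in time")


def analyze_hops(ec2_client, hops, port, poll_seconds=2.0):
    """Analyze (source, destination) hops in order, stopping at the first
    hop that is not reachable."""
    results = []
    for source_id, destination_id in hops:
        analysis = analyze_hop(ec2_client, source_id, destination_id, port,
                               poll_seconds=poll_seconds)
        reachable = bool(analysis.get("NetworkPathFound", False))
        results.append((source_id, destination_id, reachable))
        if not reachable:
            break
    return results


# Hypothetical usage, SFTP on port 22, split at the NAT:
# hops = [("eni-LAMBDA", "eni-NAT"), ("eni-NAT", "igw-EXAMPLE")]
# analyze_hops(boto3.client("ec2"), hops, port=22)
```

Stopping at the first unreachable hop keeps the number of paid analyses down, and the `Explanations` field of the returned analysis is where the reason for a failure shows up.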
We can now run the second hop of our request:
Problem identified! The Network ACL of our Public Subnet does not allow the request.
One rule later and we have success!
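To see why HTTP/S could work while SFTP was blocked, it helps to remember how a Network ACL is evaluated: rules are checked in ascending rule-number order, the first match wins, and anything unmatched is denied. The toy model below illustrates that. The rule numbers, CIDRs, and address are made up for illustration and are not our actual configuration, and it only models TCP ports in one direction.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class AclRule:
    number: int      # lower numbers are evaluated first
    action: str      # "allow" or "deny"
    cidr: str
    port_from: int
    port_to: int


def evaluate_acl(rules, remote_ip, port):
    """Return the action of the lowest-numbered matching rule.

    Falls back to "deny", mirroring the implicit catch-all rule at the
    end of every Network ACL.
    """
    address = ip_address(remote_ip)
    for rule in sorted(rules, key=lambda r: r.number):
        in_cidr = address in ip_network(rule.cidr)
        in_ports = rule.port_from <= port <= rule.port_to
        if in_cidr and in_ports:
            return rule.action
    return "deny"


# Hypothetical rules: web traffic allowed, nothing said about port 22.
before = [
    AclRule(100, "allow", "0.0.0.0/0", 80, 80),
    AclRule(110, "allow", "0.0.0.0/0", 443, 443),
]
# The "one rule later" fix, again with a made-up rule number.
after = before + [AclRule(120, "allow", "0.0.0.0/0", 22, 22)]

print(evaluate_acl(before, "203.0.113.10", 443))  # allow
print(evaluate_acl(before, "203.0.113.10", 22))   # deny
print(evaluate_acl(after, "203.0.113.10", 22))    # allow
```

Real Network ACLs are also stateless, so the return traffic needs to be allowed by its own rule in the opposite direction.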
I'm really impressed with Reachability Analyzer. It's a great tool that makes debugging cloud networking much simpler, and I don't think I would have been able to solve this issue without it. Just be aware that every simulation you run costs money.
Hopefully my adventure here will save you some time next time you have to debug Lambda Function networking.