Debugging AWS Lambda Networking with Reachability Analyzer

Reachability Analyzer is a fantastic little tool from AWS. However, it does have a number of gotchas that I encountered during a recent debugging exercise. I couldn't find any good resources at the time, so here's a collection of learnings from the experience.

The Problem

The problem one of my colleagues brought to me was that their Lambda function was timing out when connecting to an SFTP server. We knew that the Lambda had some sort of internet connectivity because its HTTP/S requests were succeeding. Our initial investigation brought us to Stack Overflow.

Error: Connecting to a SFTP server with ssh2-sftp-client in a AWS lambda function throws a time out
I’m trying to connect to a SFTP-Server and list the documents in the /ARCHIVE Folder. Credentials are stored in a .env file. When I run this on my local machine it works and lists the documents. a...

Although useful, this question was not able to solve our problem. Inspecting the function's VPC configuration, we could see which Subnets it lived in and which Security Groups it used. We knew from other services that the VPC was operational, so the likely culprits were the Security Groups and the Network Access Control Lists, yet both indicated that everything should have been working correctly.

Lambda VPC Configuration
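A quick way to confirm that the symptom is specific to one port is to probe it directly from inside the function. The following is a minimal sketch rather than what we ran at the time: the `handler` event shape and the default ports (443 for HTTPS, 22 as the usual SFTP port) are assumptions for illustration.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS failures alike.
        return False


def handler(event, context):
    # Hypothetical event shape: {"host": "sftp.example.com", "ports": [443, 22]}
    host = event["host"]
    ports = event.get("ports", [443, 22])
    return {port: can_connect(host, port) for port in ports}
```

If 443 connects and 22 does not, the function clearly has a route out, and something along the path is filtering by port.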

At this point I decided to give Reachability Analyzer a spin to see if it could illuminate the problem.

Using Reachability Analyzer

In cloud infrastructure it's pretty hard to use the usual network debugging techniques I've learnt from managing the office firewall. Whilst there you can easily run tcpdump on the relevant hosts, along with other logging tools, to inspect state, you can't do this on managed infrastructure in the cloud.

Enter Reachability Analyzer, which can simulate requests in your VPC without needing access to any of the running infrastructure. The best part of this tool is that it also provides explanations for each step (contrast this with comparing outcomes and logs across multiple different hosts and systems).

Getting started with VPC Reachability Analyzer - Amazon Virtual Private Cloud
Learn how to get started with Reachability Analyzer.

However, if you read those docs you will see that Reachability Analyzer doesn't directly work for Lambda functions - so how can we debug this?

From poking around the AWS Console, we knew that Lambda functions could have Elastic Network Interfaces (ENIs) because other functions in the same VPC could be identified from the console.

EC2 Network Interfaces created by Lambda

Yet none existed for our function!

A Brief Introduction to Lambda Networking

Luckily the documentation for VPC networking with Lambda is quite clear.

VPC networking for Lambda - AWS Lambda
Use VPC networking to securely connect Lambda functions to other AWS resources.

The key point from the documentation is that Lambda Functions will share ENIs where possible. This sharing is based on the VPC Subnet and Security Groups applied to the function.

This explains why only some Lambda Functions are listed in the ENI descriptions: they happened to be the first Lambda Function that needed that particular Subnet-Security Group combination.
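Because the ENI may carry another function's name, it is easier to look it up by Subnet and Security Groups than by description. Here is a rough sketch of that lookup, assuming `lambda_client` and `ec2_client` are boto3 clients created with `boto3.client("lambda")` and `boto3.client("ec2")`. The function names are my own, and the exact-match rule on Security Groups is my reading of the sharing behaviour described above.

```python
def shared_lambda_enis(interfaces, subnet_ids, security_group_ids):
    """Return IDs of Lambda-managed ENIs matching a function's VPC config.

    `interfaces` is a list of entries shaped like the "NetworkInterfaces"
    items returned by EC2 DescribeNetworkInterfaces.
    """
    wanted_groups = set(security_group_ids)
    matches = []
    for eni in interfaces:
        if eni.get("InterfaceType") != "lambda":
            continue  # not created by Lambda
        if eni["SubnetId"] not in subnet_ids:
            continue
        # Assumption: a shared ENI carries exactly the function's Security Groups.
        if {g["GroupId"] for g in eni.get("Groups", [])} != wanted_groups:
            continue
        matches.append(eni["NetworkInterfaceId"])
    return matches


def find_enis_for_function(function_name, lambda_client, ec2_client):
    """Look up a function's VPC config and return the ENI IDs that match it."""
    config = lambda_client.get_function_configuration(FunctionName=function_name)
    vpc = config.get("VpcConfig") or {}
    subnet_ids = vpc.get("SubnetIds", [])
    if not subnet_ids:
        return []  # function is not attached to a VPC
    pages = ec2_client.get_paginator("describe_network_interfaces").paginate(
        Filters=[
            {"Name": "interface-type", "Values": ["lambda"]},
            {"Name": "subnet-id", "Values": subnet_ids},
        ]
    )
    interfaces = [eni for page in pages for eni in page["NetworkInterfaces"]]
    return shared_lambda_enis(interfaces, subnet_ids, vpc.get("SecurityGroupIds", []))
```

The ENI IDs this returns are what you would feed into Reachability Analyzer as the source, since the function itself cannot be selected.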

Armed with this knowledge, we can create a rough map of our network:

Network Diagram

Reachability Analyzer Take #2

With the map in hand, it was time to find the source of our problems. My first attempt was to analyze the connection between one of the Lambda ENIs and the Internet Gateway:

Reachability Analysis: Lambda ENI -> Internet Gateway

I'll be honest: this trace left me stumped for quite a while. We knew these ENIs had access to the internet when making HTTP requests, yet here there was no route?

Eventually I stumbled upon the (undocumented) answer when I attempted to break down the connection by first checking against the NAT Gateway.

Reachability Analysis: Lambda ENI -> NAT Gateway ENI

Success!

Firstly, this told me that I wasn't going insane and that the networking configuration I had spent an hour looking at was indeed correct.

Secondly, it told me that Reachability Analyzer acts a little bit like a Layer 2 network, in that it does not understand the transformations that result from the use of a NAT.
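I did all of this by clicking through the console, but the hop-by-hop approach can also be scripted. Below is a sketch under a few assumptions: `ec2` is a boto3 EC2 client, the waypoint IDs are placeholders you would replace with your own, and port 22 is assumed because that is the usual SFTP port. The helper names are hypothetical.

```python
import time


def hops(waypoints):
    """Split an ordered list of resource IDs into (source, destination) pairs."""
    return list(zip(waypoints, waypoints[1:]))


def analyze_hop(ec2, source, destination, port=22, poll_seconds=5):
    """Create a path for one hop, run an analysis, and return the finished result."""
    path = ec2.create_network_insights_path(
        Source=source, Destination=destination, Protocol="tcp", DestinationPort=port
    )["NetworkInsightsPath"]
    # Note: every analysis started here is billed.
    analysis = ec2.start_network_insights_analysis(
        NetworkInsightsPathId=path["NetworkInsightsPathId"]
    )["NetworkInsightsAnalysis"]
    while analysis["Status"] == "running":
        time.sleep(poll_seconds)
        analysis = ec2.describe_network_insights_analyses(
            NetworkInsightsAnalysisIds=[analysis["NetworkInsightsAnalysisId"]]
        )["NetworkInsightsAnalyses"][0]
    return analysis


def first_blocked_hop(ec2, waypoints, port=22, poll_seconds=5):
    """Analyze each hop in order and stop at the first one with no path found."""
    for source, destination in hops(waypoints):
        analysis = analyze_hop(ec2, source, destination, port, poll_seconds)
        if not analysis.get("NetworkPathFound", False):
            return (source, destination), analysis.get("Explanations", [])
    return None, []


# Example call (placeholder IDs):
# first_blocked_hop(ec2, ["eni-lambda...", "eni-natgw...", "igw-..."])
```

Splitting the route at the NAT Gateway is the important part, because each analysis then only has to reason about a single, untranslated hop.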

We can now run the second hop of our request:

Reachability Analysis: NAT Gateway ENI -> Internet Gateway ENI Failure

Problem identified! The Network ACL of our Public Subnet does not allow the request.

Reachability Analysis: NAT Gateway ENI -> Internet Gateway ENI Success

One rule later and we have success!
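In hindsight, the reason I missed this when reading the ACL by eye is that Network ACL rules are evaluated in rule-number order and the first match wins. The sketch below mimics that evaluation for entries shaped like the `Entries` list returned by EC2 `DescribeNetworkAcls`. The sample rules and addresses are hypothetical, not our real configuration, and the function only handles IPv4 TCP/UDP-style port matching.

```python
import ipaddress


def nacl_allows(entries, destination_ip, port, protocol="6", egress=True):
    """Evaluate NACL entries: lowest rule number first, first match decides.

    `protocol` uses IANA protocol numbers as strings ("6" is TCP, "-1" is all).
    """
    address = ipaddress.ip_address(destination_ip)
    candidates = [e for e in entries if e["Egress"] == egress]
    for entry in sorted(candidates, key=lambda e: e["RuleNumber"]):
        if entry["Protocol"] not in ("-1", protocol):
            continue
        cidr = entry.get("CidrBlock")
        if cidr is None or address not in ipaddress.ip_network(cidr):
            continue  # IPv6-only entries are skipped in this sketch
        ports = entry.get("PortRange")
        if ports and not ports["From"] <= port <= ports["To"]:
            continue
        return entry["RuleAction"] == "allow"
    return False  # nothing matched; treat as denied


# Hypothetical outbound rules: HTTPS allowed, everything else denied.
rules = [
    {"RuleNumber": 100, "Protocol": "6", "RuleAction": "allow", "Egress": True,
     "CidrBlock": "0.0.0.0/0", "PortRange": {"From": 443, "To": 443}},
    {"RuleNumber": 32767, "Protocol": "-1", "RuleAction": "deny", "Egress": True,
     "CidrBlock": "0.0.0.0/0"},
]
https_ok = nacl_allows(rules, "203.0.113.10", 443)
sftp_ok = nacl_allows(rules, "203.0.113.10", 22)
```

With rules like these, HTTPS is allowed while port 22 falls through to the final deny, which matches the symptom we started with. Keep in mind that Network ACLs are stateless, so the return traffic needs a matching inbound rule too.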

Conclusion

I'm really impressed with Reachability Analyzer. It's a great tool that makes debugging cloud networking much simpler, and I don't think I would have been able to solve this issue without it. Just be aware that every analysis you run costs money.

Hopefully my adventure here will save you some time the next time you have to debug Lambda Function networking.