One of the classic bugs is the memory leak. Even garbage-collected languages like C# can leak memory, even if all the code is safe/managed. Sometimes such issues are easy to find and fix. Other times it's difficult, for example when the memory leak only happens in production, only sometimes, in a containerized .NET service running in Kubernetes. That difficult case is the one I will cover in this post.

The Scenario

Let's say Service A is running in Kubernetes, specifically Azure Kubernetes Service (AKS) as is often the case for .NET code. Service A is built with C#/.NET 8 and packaged as a container image. To reduce the attack surface the image is very limited: no curl to download external tools, no sudo apt install to add debugging tools, and no dotnet-dump installed to dump memory with.

Sometimes Service A behaves strangely, and its memory usage grows and grows. Like any good admins, we have set a limit on how much memory it's allowed to use, and after some time, maybe a week, it's killed by Kubernetes with exit code 137, reason OOMKilled.
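
If you want to confirm that it really was an OOM kill, kubectl describe pod shows the last termination state of the container. Trimmed output will look roughly like this (pod and namespace names are placeholders):


  ~$ kubectl describe pod POD_NAME -n MY_NAMESPACE
  ...
      Last State:     Terminated
        Reason:       OOMKilled
        Exit Code:    137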

What to do? Let's try a memory dump!

Prerequisites

kubectl and access

Most probably you already have kubectl installed and use it for other Kubernetes management tasks, but if not, you need to install it and configure it to access the Kubernetes cluster. If, as in this theoretical example, the cluster is an AKS cluster running in Azure, follow Microsoft's instructions instead, which cover both installing kubectl and getting credentials.
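
For an AKS cluster this typically boils down to two Azure CLI commands (resource group and cluster names below are placeholders):


  ~$ az aks install-cli
  ~$ az aks get-credentials --resource-group MY_RESOURCE_GROUP --name MY_CLUSTER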

A debug container

The other piece that needs setup is a container image with the debug tools needed. Either you can install them on the fly each time you need to dump something, or you can prepare an image with the tools in advance. Since we need the dotnet-dump tool, it's a good idea to start from Microsoft's dotnet/sdk image, which comes with the dotnet CLI pre-installed. As of writing this post .NET 8 is the latest released version, so let's use mcr.microsoft.com/dotnet/sdk:8.0. If you plan to troubleshoot more than once (which in my experience is likely), then it's best to prepare your own Dockerfile with everything pre-installed, not just the SDK.
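
A minimal sketch of such a Dockerfile could look like this (the exact tool list is just a suggestion):


  FROM mcr.microsoft.com/dotnet/sdk:8.0
  # Pre-install the diagnostics tools so nothing needs to be downloaded at debug time
  RUN dotnet tool install --global dotnet-dump && \
      dotnet tool install --global dotnet-counters && \
      dotnet tool install --global dotnet-trace
  # Global tools end up under the root user's home in this image
  ENV PATH="$PATH:/root/.dotnet/tools"

Build it, push it to a registry your cluster can pull from, and use it as the debug image in the kubectl debug command later.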

Locate the pod

First we need to find the pod with the memory leak. There are several ways; I prefer to list all the pods in the namespace and pick one manually. You can list the pods with this kubectl command:


  ~$ kubectl get pods -n MY_NAMESPACE
  NAME              READY   STATUS    RESTARTS   AGE
  service_a-24ztc   1/1     Running   0          10h
  service_a-gvz82   1/1     Running   0          10h

Very likely this will list a lot of pods (if you only have two, like in this example, you probably do not even need Kubernetes and would be better off with something simpler).

We can double-check that the pod indeed has a memory issue with kubectl top. At the same time we can get the container name, which we will need later:


  ~$ kubectl top pod service_a-24ztc --containers -n MY_NAMESPACE
  POD               NAME        CPU(cores)   MEMORY(bytes)
  service_a-24ztc   service_a   2m           999Mi

Yep, that looks like a lot of memory for Service A, which is a simple hello world service!

Launch debug container and dump

To troubleshoot we will connect to the pod and launch a debug container inside it, without restarting the original container. Once inside the debug container we install the debug tools (unless they were pre-installed in the image), and finally dump the memory of the dotnet process in the main container.

To do all this we will use the kubectl debug command, which with its parameters looks like this:


  ~$ kubectl debug -it -n MY_NAMESPACE POD_NAME --image=DEBUG_IMAGE_NAME --target=CONTAINER_NAME --profile=general

Filling in the fields from above it becomes:


  ~$ kubectl debug -it -n MY_NAMESPACE service_a-24ztc --image=mcr.microsoft.com/dotnet/sdk:8.0 --target=service_a --profile=general

If it worked you should be inside the debug container with an info message and a prompt:


  Targeting container "service_a". If you don't see processes from this container it may be because the container runtime doesn't support this feature.
  Defaulting debug container name to debugger-tdgrs.
  If you don't see a command prompt, try pressing enter.
  @service_a-24ztc:/$

Feel free to look around, but remember that this container runs together with the target container in the pod, and you do not want to crash it. The target container's file system will be mounted under /proc/[PID]/root, most likely /proc/1/root since the dotnet process usually runs as PID 1.
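
For example, a quick sanity check that we are attached to the right container is to peek at the target's file system through that proc mount (assuming the dotnet process is PID 1 and that the application lives under /app, as is typical for .NET images):


  @service_a-24ztc:/$ ls /proc/1/root/        # the target container's root file system
  @service_a-24ztc:/$ ls /proc/1/root/app     # the application files, typically under /app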

Once connected, the debugging tools need to be installed. First we need to set three environment variables:


  @service_a-24ztc:/$ export DOTNET_CLI_HOME="/tmp/DOTNET_CLI_HOME" && \
  export PATH="$PATH:/tmp/DOTNET_CLI_HOME/.dotnet/tools" && \
  export TMPDIR="/proc/1/root/tmp"

First we set the home folder for the dotnet CLI, then add the folder where the installed dump tool will end up to the PATH, and finally we point TMPDIR at the target container's tmp folder so that dotnet-dump can find the target process and connect to it. As described above, the target's file system is found under /proc/[PID]/root.

After the environment variables, it's time to install the dump tool:


  @service_a-24ztc:/$ dotnet tool install --global dotnet-dump

With the tool available, let's dump:


  @service_a-24ztc:/$ dotnet-dump collect --process-id=1 --type=Full -o /tmp/full.dmp

Here I used --type=Full to get as much data as possible. Since a full dump can be very big, it's often a good idea to experiment with --type=Mini first, which produces a much smaller dump that is easier to deal with. Note that process-id should be set to the PID of the target dotnet process, 1 in our case, and that the dump itself is written to /tmp in the target container, not in our debug container. That means that to access it from the debug container the path will be /proc/1/root/tmp.
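
If you are unsure which process ID to use, dotnet-dump can list the .NET processes it can see from the debug container; in this scenario it should print a single line for the dotnet process with PID 1:


  @service_a-24ztc:/$ dotnet-dump ps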

Compress and download dump

Now that the dump is created we can compress and download it. It's often a good idea to compress it first, since the dump itself is big but compresses well. To compress we can use gzip:


  @service_a-24ztc:/$ gzip /proc/1/root/tmp/full.dmp

If it succeeded, there will now be a file full.dmp.gz in the target container's tmp folder. Let's exit the debug container:


  @service_a-24ztc:/$ exit

To download the dump there's another kubectl command, cp, that allows copying files out of the container:


  ~$ kubectl cp -n MY_NAMESPACE service_a-24ztc:/tmp/full.dmp.gz full.dmp.gz

The full.dmp.gz file can then be unzipped with Explorer if you run Windows, or with gzip -d full.dmp.gz from the command line on Linux.

With the dump safely stored on our developer machine, diagnostics tools such as Visual Studio, WinDbg and LLDB can be used. How to use them efficiently is, however, too much to cover in this post. Especially WinDbg (or LLDB if you are on Linux) can be overwhelming the first few times!
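
If you prefer to stay on the command line, dotnet-dump itself can also open the dump, which is often enough to spot a leaking type. A minimal first look could be something like this (the SOS commands shown are just the usual starting points, and OBJECT_ADDRESS is a placeholder):


  ~$ dotnet-dump analyze full.dmp
  > dumpheap -stat            # object count and total size per type
  > gcroot OBJECT_ADDRESS     # what keeps a suspicious object alive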

Closing thoughts

Apart from using the debug container to make dumps, you can also prepare it with other useful tools, such as some of the other dotnet diagnostics tools (dotnet-counters, dotnet-gcdump, dotnet-trace, dotnet-stack), curl, top or (almost) any other tool you find useful.
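
As an example, dotnet-counters can give a live view of GC heap size and allocation rate before you reach for a full dump; it is installed and pointed at the target process the same way as dotnet-dump above:


  @service_a-24ztc:/$ dotnet tool install --global dotnet-counters
  @service_a-24ztc:/$ dotnet-counters monitor --process-id 1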

Happy dumping and debugging!

Victor 2024-11-05