One of the classic bugs is the memory leak. Even garbage collected languages like C# can leak memory, even if all the code is safe/managed. Sometimes such issues are easy to find and fix. Other times it's difficult, for example when the memory leak only happens in production, only sometimes, in a containerized .NET service running in Kubernetes. This difficult case is the one I will cover in this post.
Let's say Service A is running in Kubernetes, specifically Azure Kubernetes Service (AKS), as is often the case for .NET code. Service A is built with C#/.NET 8 and packaged as a container image. To reduce the attack surface the image is very limited: no curl to download external tools, no sudo apt install to add debugging tools, and no dotnet-dump installed to dump memory with.
Sometimes Service A behaves strangely, and its memory usage grows and grows. Like any good admin we have set a limit on how much memory it's allowed to use, and after some time, maybe a week, it's killed by Kubernetes with exit code 137, reason OOMKilled.
What to do? Let's try a memory dump!
Most likely you already have kubectl installed and use it for other Kubernetes management tasks, but if not you need to install it and configure it to access the Kubernetes cluster. If, as in this theoretical example, the Kubernetes cluster is an AKS cluster running in Azure, instead follow Microsoft's instructions to both install kubectl and get credentials.
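If you use the Azure CLI, that boils down to something like the following (a sketch; the resource group and cluster name are placeholders you need to replace):

~$ az aks install-cli
~$ az aks get-credentials --resource-group MY_RESOURCE_GROUP --name MY_CLUSTER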
The other piece that needs setup is a container image with the debug tools needed. Either you can install them on the fly each time you need to dump something, or you can prepare an image with the tools in advance. Since we need the dotnet-dump tool it's a good idea to start from the dotnet/sdk image by Microsoft, which comes with dotnet pre-installed. As of writing this post .NET 8 is the latest released version, so let's use mcr.microsoft.com/dotnet/sdk:8.0. If you plan to troubleshoot more than once (which in my experience is likely), then it's best to prepare your own Dockerfile with everything pre-installed, not just the SDK.
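As an illustration, a minimal Dockerfile for such a debug image could look roughly like this (an untested sketch; the exact tag and tool selection are just examples):

FROM mcr.microsoft.com/dotnet/sdk:8.0
# Pre-install the diagnostics tools so nothing has to be downloaded inside the cluster
RUN dotnet tool install --global dotnet-dump && \
    dotnet tool install --global dotnet-counters && \
    dotnet tool install --global dotnet-trace
# Global dotnet tools end up under /root/.dotnet/tools when running as root
ENV PATH="$PATH:/root/.dotnet/tools"

Build it, push it to a registry your cluster can pull from, and use it instead of the plain SDK image in the kubectl debug command further down.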
First we need to find the pod with the memory leak. There are several ways; I prefer to list all the pods within the namespace and manually select one. You can list the pods with this kubectl command:
~$ kubectl get pods -n MY_NAMESPACE
NAME              READY   STATUS    RESTARTS   AGE
service_a-24ztc   1/1     Running   0          10h
service_a-gvz82   1/1     Running   0          10h
Very likely this will list a lot of pods (if you only have two, like in this example, you probably do not even need Kubernetes and would be better off with something easier).
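With a long list it can help to filter it, for example:

~$ kubectl get pods -n MY_NAMESPACE | grep service_a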
We can double-check that the pod indeed has a memory issue with kubectl top. At the same time we can get the container name, which we will need later:
~$ kubectl top pod service_a-24ztc --containers -n MY_NAMESPACE
POD               NAME        CPU(cores)   MEMORY(bytes)
service_a-24ztc   service_a   2m           999Mi
Yep, that looks like a lot of memory for Service A, which is a simple hello world service!
To troubleshoot we will connect to the pod and launch a debug container inside the pod, without restarting the original container. Once inside the debug container we need to install the debug tools unless those were already pre-installed, and finally dump the memory of the dotnet process in the main container.
To do all this we will use the command kubectl debug, which with its parameters looks like this:
~$ kubectl debug -it -n MY_NAMESPACE POD_NAME --image=DEBUG_IMAGE_NAME --target=CONTAINER_NAME --profile=general
Filling in the fields from above it becomes:
~$ kubectl debug -it -n MY_NAMESPACE service_a-24ztc --image=mcr.microsoft.com/dotnet/sdk:8.0 --target=service_a --profile=general
If it worked, you should be inside the debug container with an info message and a prompt:
Targeting container "service_a". If you don't see processes from this container it may be because the container runtime doesn't support this feature.
Defaulting debug container name to debugger-tdgrs.
If you don't see a command prompt, try pressing enter.
@service_a-24ztc:/$
Feel free to look around, but remember that this container runs together with the target container in the pod, and you do not want to crash it. The target container's file system is accessible under /proc/[PID]/root, most likely /proc/1/root.
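For example, to take a quick peek at the target container's file system (assuming the dotnet process is PID 1, as in this example):

@service_a-24ztc:/$ ls /proc/1/root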
Once connected, the debugging tools need to be installed. First we need to set three environment variables:
@service_a-24ztc:/$ export DOTNET_CLI_HOME="/tmp/DOTNET_CLI_HOME" && \
export PATH="$PATH:/tmp/DOTNET_CLI_HOME/.dotnet/tools" && \
export TMPDIR="/proc/1/root/tmp"
First we set the home folder for the dotnet CLI, then we add the location where the installed dump tool will end up to the PATH, and finally we set TMPDIR to where the target container has its tmp folder, so that dotnet-dump is able to connect to it. As described above, this is under /proc/[PID]/root.
After the environment variables, it's time to install the dump tool:
@service_a-24ztc:/$ dotnet tool install --global dotnet-dump
With the tool available, let's dump:
@service_a-24ztc:/$ dotnet-dump collect --process-id=1 --type=Full -o /tmp/full.dmp
Here I used --type=Full to get as much data as possible. Since the full dump can be very big, it's often a good idea to experiment with --type=Mini first, which produces a much smaller dump that is easier to deal with. Note that process-id should be set to whatever process ID we found above, and that the dump itself is written to /tmp in the target container, not in our debug container. That means that to access it from the debug container the path will be /proc/1/root/tmp.
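A quick way to verify that the dump was written, and check how big it is, from the debug container:

@service_a-24ztc:/$ ls -lh /proc/1/root/tmp/full.dmp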
Now that the dump is created we can compress and download it. It's often a good idea to compress it first, since the dump itself is big but compresses well. To compress we can use gzip:
@service_a-24ztc:/$ gzip /proc/1/root/tmp/full.dmp
If it succeeded, there will now be a file full.dmp.gz in the target container's tmp folder. Let's exit the debug container:
@service_a-24ztc:/$ exit
To download the dump there's another kubectl command, cp, that allows copying files out of the container:
~$ kubectl cp -n MY_NAMESPACE service_a-24ztc:tmp/full.dmp.gz full.dmp.gz
The full.dmp.gz file can then be unzipped with Explorer in case you run Windows, or with gzip -d full.dmp.gz from the command line on Linux.
With the dump safely stored on our developer machine, diagnostic tools such as Visual Studio, WinDbg and LLDB can be used. How to use these efficiently is, however, too much to cover in this post. WinDbg in particular (or LLDB if you are on Linux) can be overwhelming the first few times!
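If you just want a quick first look before firing up those tools, the dump can also be opened with dotnet-dump itself on your developer machine (a sketch, assuming dotnet-dump is installed locally). The dumpheap -stat command prints a per-type summary of the managed heap, which is often enough to spot what is piling up:

~$ dotnet-dump analyze full.dmp
> dumpheap -stat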
Apart from using the debug container to make dumps, you can also prepare it with other useful tools, such as some of the other dotnet diagnostics tools (dotnet-counters, dotnet-gcdump, dotnet-trace, dotnet-stack), curl, top, or (almost) anything else you find handy.
Happy dumping and debugging!
Victor 2024-11-05