
Note: In this Kubernetes guide, we will dive deep into the OOMKilled error: what causes it, how to prevent it, how to debug and fix it, and much more.

How to Fix OOMKilled Error (Exit Code 137) in Kubernetes?

Kubernetes can host multiple complex applications, and many errors can occur along the way: CrashLoopBackOff, CreateContainerError, CreateContainerConfigError, and many more.

In this guide, we will talk about the OOMKilled error (exit code 137) in Kubernetes.

Before we get into how to fix it, let's briefly look at the kubelet and the OOMKilled error.

Kubelet & OOMKilled Error

The kubelet is the node agent that runs on every node in the cluster. It is responsible for making sure the containers scheduled onto its node are up and running at all times.

But when the kubelet is unable to keep those containers up and running, for example because the node lacks the required resources, it reports specific errors.

The kubelet also monitors the resource utilization of containers on its node, such as memory usage.

When the kubelet detects that a container's memory usage exceeds its defined memory limit (as specified in the pod's resource requests and limits), it recognizes that the container is at risk of causing an OOM event.

The kubelet reports this condition to the Kubernetes control plane so that the necessary action can be taken to resolve the issue.

In response to the OOM condition, the Kubernetes control plane may decide to terminate the container to free up memory resources and ensure the stability of the node.

What is an OOMKilled Kubernetes Error (Exit Code 137)?

OOMKilled is one such error, returned when there is an out-of-memory (OOM) condition on the node. Since there isn't enough memory for the application to run, the container is terminated.

The "OOMKilled" error with exit code 137 happens when a container is killed by Kubernetes due to running out of memory (OOM stands for "Out Of Memory").

This typically happens when a container exceeds its memory limit, and the Kubernetes OOM killer terminates the container to prevent it from consuming all the free memory available on the node.

The primary purpose is to protect system stability. Therefore, it's crucial to carefully manage resource allocation and set appropriate memory limits and requests for containers in Kubernetes to prevent OOM events and OOMKilled errors.

How Does OOMKiller Work?

This error is really a safety mechanism for the node. It terminates containers that try to exhaust the node's resources, making room for other containers and keeping the node's resources fairly distributed.

Here's the background process of how OOMKilled works:

1. Memory Exhaustion:

This error originates in the Linux kernel: when the operating system detects that it is running out of physical memory and swap space, it faces a critical resource shortage.

This can occur due to several factors, such as the total memory used by all running processes and the memory limits that have been set.

2. OOM Event Detection:

The Linux kernel monitors the system's memory usage, and when it detects that memory pressure is too high, it initiates an OOM event.

3. OOM Score Calculation:

The OOM score is the mechanism Linux uses to decide which processes should be terminated.

Once the OOM event is initiated, each process's OOM score is determined by factors such as how much memory it uses, its priority, and any configured score adjustments.
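
As a rough illustration, you can inspect the score the kernel has assigned to a process directly on the node; the PID below is a placeholder:

# OOM score the kernel currently assigns to PID 1234 (higher means more likely to be killed)
cat /proc/1234/oom_score

# Tunable adjustment, ranging from -1000 (never kill) to +1000 (kill first)
cat /proc/1234/oom_score_adj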

4. OOM Killer Activation:

The OOM killer is activated as a last resort to free up memory on the system and keep it from becoming completely unresponsive. It selects one or more processes to terminate based on their OOM scores.

5. Process Termination:

The OOM killer selects the process with the highest OOM score and sends it a SIGKILL signal, which terminates it immediately and cannot be caught or ignored.

This abrupt termination frees up memory occupied by the process. The kernel logs information about the OOM event, including which process was killed and the reason for the termination.

This information is typically recorded in system logs, making it available for system administrators to review.
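
If you have shell access to the node, searching the kernel log is a quick way to confirm an OOM kill; these are generic examples, and the exact wording of the messages varies by kernel version:

# Search the kernel ring buffer for OOM killer activity
dmesg | grep -i "out of memory"

# On systemd-based nodes, the kernel journal works as well
journalctl -k | grep -i oom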

6. Error Messages:

If a container consumes more memory than is allocated to it (resource limit), the container can be terminated by the OOM killer.

In the context of Kubernetes, this event is often reported as an "OOMKilled" error in the container's logs.

After the OOM killer terminates the selected process(es), memory is freed up, and the system can continue to function, avoiding a complete system crash.

Reasons Behind the OOMKilled Error

Let's look at the top reasons behind the OOMKilled error in Kubernetes.

1. Excessive Memory Consumption

This is the main reason for the OOMKilled error: a container or process tries to consume more memory than its configuration allows.

This can happen due to memory leaks, inefficient resource management, or the sudden demand for more memory than anticipated.

This means that when specifying resource requests and limits for a container, the engineer must choose values that reflect the application's real needs, so that Kubernetes can make the right scheduling decisions and keep the infrastructure healthy.

2. Resource Contention

Multiple processes may compete for limited memory resources on a single node. This is one of the most important aspects of configuring Kubernetes, which is why it is essential to plan the infrastructure carefully.

3. Memory Leaks

Memory leaks occur when a program or process allocates memory but doesn't release it when it's no longer needed, for example because objects stay referenced and can never be garbage collected.

Over the course of time, this can cause memory usage to grow until the system runs out of memory and triggers the OOM killer.

4. Fork Bombs

Fork bombs are malicious or unintentional scripts that rapidly create new processes, which in turn can consume a significant amount of memory.

If left unchecked, this can lead to memory exhaustion and OOM events. Similarly, some workloads or applications may experience sudden spikes in memory usage due to increased demand.

If the system doesn't have sufficient memory available to handle these spikes, it can trigger OOM events.

5. DDoS Attacks

In some cases, DDoS attacks can cause a sudden influx of requests that consume a large amount of memory, potentially triggering OOM events if not mitigated effectively.

6. Inadequate Monitoring and Resource Management

Lack of proper monitoring and resource management practices can make it challenging to detect and address memory-related issues proactively, increasing the likelihood of OOMKilled errors.

Which Container Triggered the OOMKilled Error?

So how can you identify which container triggered the OOMKilled error, and why?

1. View Kubernetes Events

Use the following command to view the Kubernetes events related to the pod.

kubectl describe pod <pod-name>

This gives you a full picture of the pod's configuration and the different events that occurred. Look for events that mention OOM conditions or container terminations.
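
In the describe output, the terminated container's last state is usually the clearest signal. It will look roughly like the snippet below (the timestamps are placeholders):

Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Mon, 01 Jan 2024 10:00:00 +0000
  Finished:     Mon, 01 Jan 2024 10:05:12 +0000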

2. Check Container Logs

You can also check the container logs to identify which container encountered the OOMKilled error.

Use the following command to view the container logs for a specific pod:

kubectl logs <pod-name> -c <container-name>
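
Keep in mind that if the container has already been restarted after the OOM kill, its current logs start fresh. The --previous flag shows the logs of the terminated instance instead:

kubectl logs <pod-name> -c <container-name> --previous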

3. View Pod's Status

Run the following command to get the status of the pod:

kubectl get pod <pod-name> -o json

In the output, look for the status.containerStatuses section. It will show the status of each container within the pod, including whether any container was terminated due to an OOM event.
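
If you prefer not to read the full JSON by hand, a jsonpath query can pull out the last termination reason for each container (use .state.terminated.reason instead if the container has not been restarted); this is a minimal sketch:

kubectl get pod <pod-name> -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{"\n"}{end}'

A container that was OOMKilled will show up here with the reason OOMKilled.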

In the Kubernetes events, you may see messages like "Killing container with ID" or "OOMKilled," which indicate that a container was terminated due to running out of memory.

The event description should also include the container name.

In the container logs, you should see log entries related to the OOMKilled event that might provide additional information about why the container encountered the error.

In the pod's status, you can check the status.containerStatuses section for each container's state and reasons for termination.

By inspecting the events, logs, and pod status, you should be able to determine which specific container within the pod triggered the OOMKilled error.

This information is valuable for diagnosing and addressing the root cause of the memory-related issue.

6 Ways to Fix the OOMKilled Error

1. Increase Memory Limits of Pods

When we create a component in Kubernetes, it is a good practice to specify the resource requests and limits needed for the application to run smoothly.

This helps you plan and manage infrastructure better. Increase the memory limit for the container in your pod's YAML configuration to ensure it has enough memory to run without hitting the OOM condition.

Let's say you have a small application that runs fine under 300Mi. You can specify the memory requests and limits as shown in the manifest below.

resources:
  limits:
    memory: 500Mi
  requests:
    memory: 300Mi    

This ensures that your application is assigned a sufficient amount of memory, even during peak times.

If you've set resource requests higher than necessary, consider reducing them to free up memory for other containers on the node.
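
For context, here is a minimal sketch of where that resources block sits in a full Pod spec; the pod name and image are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: memory-demo
spec:
  containers:
  - name: app
    image: nginx:latest
    resources:
      requests:
        memory: 300Mi
      limits:
        memory: 500Mi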

2. Optimize Application Code

Review and optimize your code to minimize memory consumption. Look for memory leaks, inefficient data structures, and unnecessary caching.

Use profiling tools to identify memory bottlenecks and address them.

3. HPA and VPA

Instead of allocating more memory to a single container, consider deploying multiple instances of your application (horizontal scaling).

Distributing the load across multiple pods can alleviate memory pressure.

If you've exhausted horizontal scaling options and your application genuinely requires more memory, scale vertically: increase the pod's memory requests and limits (manually or with the Vertical Pod Autoscaler) and, if needed, use nodes with higher memory capacity.
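
As an illustration, here is a minimal sketch of a HorizontalPodAutoscaler that scales on memory utilization, assuming the metrics-server is installed; the deployment name, replica counts, and threshold are placeholders:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70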

4. Implement Rate Limiting and Caching

If your application experiences traffic spikes, implement rate limiting to control the number of incoming requests, preventing excessive memory usage during high loads.

Implement caching mechanisms to store frequently accessed data in memory, reducing the need for repeated expensive computations or database queries.
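
Rate limiting can often be enforced at the ingress layer rather than in application code. For example, assuming you run the NGINX ingress controller, the annotation below caps requests per second per client IP; the host, service name, and limit value are placeholders:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "10"
spec:
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app
            port:
              number: 80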

5. Allocate More Node Memory

If your nodes are consistently running out of memory, consider adding more memory to your Kubernetes nodes. This can help accommodate the needs of your containers.

Also, limit the number of containers running on each node to reduce resource contention. This can be achieved by adjusting Kubernetes Pod placement policies and node selectors.
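
One way to cap how many pods land on each node is the kubelet's maxPods setting. This is a minimal sketch of a KubeletConfiguration fragment; how you apply it depends on how your nodes are provisioned:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Default is 110; lowering it reduces how many pods can be scheduled onto this node
maxPods: 50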

6. Implement Affinity, Taints, and nodeSelector Strategies

It is important to control which nodes your applications run on. By default, when these strategies aren't configured, the scheduler can place your pods on any node in the cluster.

This works fine in a small-scale application infrastructure. But when you consider using a complex infrastructure with multiple components of applications running, it is important to segregate your applications into different nodes.
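
As a simple sketch, the nodeSelector and tolerations below pin a memory-hungry workload onto dedicated nodes; the label key, taint key, and image are assumptions for the example:

apiVersion: v1
kind: Pod
metadata:
  name: memory-heavy-app
spec:
  # Only schedule onto nodes labeled for high-memory workloads
  nodeSelector:
    workload-type: high-memory
  # Tolerate the taint applied to those dedicated nodes
  tolerations:
  - key: dedicated
    operator: Equal
    value: high-memory
    effect: NoSchedule
  containers:
  - name: app
    image: nginx:latest
    resources:
      requests:
        memory: 1Gi
      limits:
        memory: 2Gi

The matching nodes would be labeled with kubectl label nodes <node-name> workload-type=high-memory and tainted with kubectl taint nodes <node-name> dedicated=high-memory:NoSchedule.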

Final Words: Fix OOMKilled Error

The specific solution to an OOMKilled error will depend on the unique characteristics of your application, its resource requirements, and the underlying infrastructure.

It's essential to monitor your system and continually fine-tune resource allocation and application code to maintain optimal performance and prevent OOM-related issues.

Frequently Asked Questions

1. Why is Kubernetes killing my pod?

Most commonly because a container exceeded the memory limit specified for the pod, or the node itself ran out of memory, which triggered the OOM killer.

2. How do I check memory leaks in Kubernetes?

Use monitoring and profiling tools like Prometheus and Grafana to analyze memory consumption patterns and identify abnormalities in your application containers.
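
A quick, low-effort check is to watch whether a pod's memory usage keeps climbing over time. Assuming the metrics-server is installed, kubectl top gives a point-in-time reading you can sample repeatedly:

kubectl top pod <pod-name> --containers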

3. What happens if a pod dies in Kubernetes?

The Kubernetes control plane automatically restarts the pod to maintain the desired state defined in the deployment or other controller.

4. What happens when a pod runs out of memory?

You get the OOMKilled error (exit code 137), which can be easily viewed in the pod's events and status.

5. What causes CPU throttling in Kubernetes?

When containers try to use more CPU than their resource limits allow, the kernel throttles their CPU time to enforce the configured constraints; unlike memory, exceeding a CPU limit does not kill the container.


Written by Priyansh Khodiyar

Priyansh is the founder of UnYAML and a software engineer with a passion for writing. He has solid experience writing about and working with DevOps tools and technologies, APMs, and Kubernetes APIs, and loves to share his knowledge with others.

