
Beyond Python Heap: Fixing Memory Retention with Jemalloc

Ankit Malik, Senior Software Engineer | Feb 6, 2026

Introduction

In the DeviceConnect product, we encountered a recurring production issue where application memory usage would spike sharply over time. Eventually, the service would hit memory limits and require manual restarts to recover. This behavior was not aligned with traffic patterns and clearly indicated a memory management problem.

Our initial assumption, like most backend teams, was straightforward:

This must be a memory leak in our application code.

However, as this investigation unfolded, we learned an important lesson—not all memory leaks originate at the code level. In this article, I will walk through how we diagnosed the issue, why conventional heap profiling was not sufficient, and how switching the memory allocator ultimately resolved the problem.


What Is Memray

To validate whether the leak originated in our codebase, we started with memory profiling.

After evaluating several Python memory profilers, we shortlisted Memray, a high-performance memory profiler developed by Bloomberg. Memray provides:

  • Precise tracking of Python heap allocations

  • Native (C/C++) allocation visibility

  • Low runtime overhead compared to traditional profilers

  • Clear differentiation between heap usage and overall process memory (RSS)
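
For a standalone script, typical Memray usage looks like the sketch below; the script name and output file are illustrative, not taken from DeviceConnect:

# Record allocations while running a script, then render an HTML flame graph
memray run --output app_profile.bin my_app.py
memray flamegraph app_profile.bin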


Practical Challenge: Gunicorn + Multiple Workers

Our service runs behind Gunicorn with multiple workers, which complicates profiling:

  • Memray cannot easily attach to all workers automatically

  • Each worker is a separate process with its own memory space


The most reliable approach was to:

  1. Identify the process IDs of individual Gunicorn workers

  2. Attach Memray manually to each worker

  3. Collect allocation reports independently

While this setup was non-trivial, it gave us accurate, worker-level visibility.
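
As a sketch of that workflow, the worker PIDs can be enumerated as children of the Gunicorn master process, generating one attach command per worker (the master PID placeholder and output file names are assumptions):

# Print one memray attach command per Gunicorn worker
for pid in $(pgrep -P <GUNICORN_MASTER_PID>); do
    echo "memray attach ${pid} --output worker_${pid}.bin"
done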


How to Install Memray
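
pip install memray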

Find Gunicorn Workers

ps -ef | grep gunicorn


Attach Memray to a Running Worker Process

memray attach <WORKER_PID> --output <OUTPUT_FILE>

Example:

memray attach 12345 --output worker_12345.bin
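
Once a capture file has been collected from a worker, it can be turned into a report for inspection; a quick sketch using the example file name above:

# Summarize the capture and generate an interactive HTML flame graph
memray summary worker_12345.bin
memray flamegraph worker_12345.bin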


Results from Memray

After running Memray under sustained load, the results were surprising.

  • Heap size remained largely stable

  • No unbounded growth in Python object allocations

  • Garbage collection was functioning as expected


However, when we correlated this with system-level metrics, we observed something critical:

  • Resident Set Size (RSS) was continuously increasing

  • Memory was not being returned to the OS

  • RSS growth did not correspond to heap growth

This behavior was clearly visible when we plotted both metrics together:

  • The heap size (orange line) remains flat

  • The RSS (blue line) steadily increases over time
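
To keep correlating the two views over time, the OS-side numbers can be sampled directly; a minimal sketch, reusing the worker PID placeholder from earlier:

# Sample the worker's resident set size every 30 seconds
while true; do
    grep VmRSS /proc/<WORKER_PID>/status
    sleep 30
done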


Conclusion from Profiling

At this point, we ruled out:

  • Python object leaks

  • Reference cycles

  • Application-level memory retention

The issue was below the Python runtime.


Understanding the Root Cause: glibc Memory Behavior

Python relies on the system C library (glibc) for memory allocation. While glibc is robust, it has well-known characteristics:

  • Memory arenas per thread/process

  • Fragmentation under long-running workloads

  • Memory often not returned to the OS, even after being freed


This is not technically a “leak,” but from an operational perspective, it behaves like one.

Given our workload pattern (multiple workers, long uptime, bursty allocations), the default allocator was retaining memory, and that retention looked exactly like a memory leak.
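
For context (this is not a route the investigation relied on), glibc does expose a few tunables of its own; a commonly tried first step is capping the number of malloc arenas via an environment variable:

# glibc-level mitigation attempt: limit the number of malloc arenas
export MALLOC_ARENA_MAX=2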


How to Install jemalloc on Ubuntu

jemalloc is a high-performance memory allocator designed to reduce fragmentation and improve memory reuse compared to the default glibc allocator. It sits between the application and the operating system, controlling how memory is requested from and returned to the OS, which makes it particularly effective for long-running, multi-threaded services. In Python applications, jemalloc is commonly adopted when heap usage remains stable but overall process memory (RSS) keeps growing, because, when properly configured, it returns unused memory to the operating system more predictably.


a. Install jemalloc

sudo apt update
sudo apt install -y libjemalloc2

b. Verify Installation

ldconfig -p | grep jemalloc

Expected output (example):

libjemalloc.so.2 (libc6,x86-64) => /lib/x86_64-linux-gnu/libjemalloc.so.2

c. Preload jemalloc (System-wide or per process)

export LD_PRELOAD=/lib/x86_64-linux-gnu/libjemalloc.so.2

d. (Optional) jemalloc Custom Configuration

export MALLOC_CONF="narenas:1,tcache:false,dirty_decay_ms:10000,muzzy_decay_ms:10000,background_thread:true"

e. Start Application (Example: Gunicorn)

gunicorn app:app --workers 12 --bind 0.0.0.0:<PORT>

f. Verify jemalloc Is Being Used

cat /proc/$(pgrep -f gunicorn | head -n1)/maps | grep jemalloc

Expected output contains:

libjemalloc.so.2
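
Putting steps c through e together, a single invocation might look like the following (the library path is the one verified above; the port is a placeholder):

LD_PRELOAD=/lib/x86_64-linux-gnu/libjemalloc.so.2 \
MALLOC_CONF="narenas:1,tcache:false,dirty_decay_ms:10000,muzzy_decay_ms:10000,background_thread:true" \
gunicorn app:app --workers 12 --bind 0.0.0.0:<PORT>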


Load Testing - Jemalloc with Default Settings

To address allocator-level behavior, we replaced the default allocator with jemalloc, a widely used alternative designed to reduce fragmentation.

We enabled jemalloc without any custom tuning and ran load tests.

Observations


  • Jemalloc was successfully loaded

  • Memory usage increased significantly

  • RSS reached nearly 25–30 GB

  • Memory was still not released effectively

In fact, this was worse than the default allocator, where memory usage peaked around 11–17 GB.

This confirmed an important point:


Simply switching allocators is not sufficient—allocator configuration matters.


Load Testing - Jemalloc with Custom Settings

Next, we tuned jemalloc specifically for our workload and ran the load test twice, so we could check whether memory grew again on the second run. As expected, memory remained stable in the second run as well.


Configuration Used

ENV MALLOC_CONF="narenas:1,tcache:false,dirty_decay_ms:10000,muzzy_decay_ms:10000,background_thread:true"

Why These Settings Matter


  • narenas:1
    Reduces arena fragmentation in low-concurrency, multi-process setups.

  • tcache:false
    Prevents per-thread caching from hoarding memory.

  • dirty_decay_ms / muzzy_decay_ms
    Forces freed memory pages to decay and return to the OS within a bounded time.

  • background_thread:true
    Allows jemalloc to reclaim memory asynchronously.


Results

The improvement was immediate and measurable:

  • RSS stabilized around 3 GB, even under load

  • Memory usage decreased after load completion

  • No manual restarts required

  • Memory graphs became predictable and flat

This confirmed that the “leak” was allocator-level retention, not application leakage.

Additional Experiments: Uvicorn vs Gunicorn

We also evaluated different server configurations:

  • Uvicorn + jemalloc (custom config)
    Stable memory behavior, higher baseline per worker, but predictable.

  • Uvicorn without jemalloc
    Memory spiked to ~17 GB and did not recover.

  • Uvicorn + jemalloc (default config)
    Worst behavior, with memory exceeding 30 GB.
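
The Uvicorn variants were launched the same way, only swapping the server command while keeping the same LD_PRELOAD and MALLOC_CONF environment (a sketch; worker count and port are illustrative):

uvicorn app:app --workers 4 --host 0.0.0.0 --port <PORT>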


These experiments reinforced the conclusion that allocator tuning was the decisive factor, not the application server itself.


Conclusion


This investigation reinforced several important engineering lessons:


  1. Not all memory leaks are heap leaks
    A stable heap combined with growing RSS usually points to allocator or OS behavior.

  2. Heap profilers alone are insufficient
    Tools like Memray are essential to separate Python-level issues from system-level ones.

  3. glibc memory retention can appear as leaks
    Especially in long-running, multi-process Python services.

  4. jemalloc works, but only when tuned correctly
    Default settings may worsen memory usage.

  5. Allocator choice is a production concern
    It should be tested, benchmarked, and documented like any other infrastructure dependency.


By combining Memray for diagnosis and jemalloc with tailored configuration for mitigation, we were able to permanently resolve the memory issue in DeviceConnect—without changing a single line of application code.

Tags: python, memory-leak, heap-profiling, jemalloc-tuning, backend-performance, troubleshooting, glibc, memray
