
Beyond Python Heap: Fixing Memory Retention with Jemalloc

Ankit Malik, Senior Software Engineer | Feb 6, 2026

Introduction

In the DeviceConnect product, we encountered a recurring production issue where application memory usage would spike sharply over time. Eventually, the service would hit memory limits and require manual restarts to recover. This behavior was not aligned with traffic patterns and clearly indicated a memory management problem.

Our initial assumption, like most backend teams, was straightforward:

This must be a memory leak in our application code.

However, as this investigation unfolded, we learned an important lesson—not all memory leaks originate at the code level. In this article, I will walk through how we diagnosed the issue, why conventional heap profiling was not sufficient, and how switching the memory allocator ultimately resolved the problem.


What Is Memray

To validate whether the leak originated in our codebase, we started with memory profiling.

After evaluating several Python memory profilers, we shortlisted Memray, a high-performance memory profiler developed by Bloomberg. Memray provides:

  • Precise tracking of Python heap allocations

  • Native (C/C++) allocation visibility

  • Low runtime overhead compared to traditional profilers

  • Clear differentiation between heap usage and overall process memory (RSS)
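
For a standalone script, typical Memray usage looks like the sketch below; the script name and output file are illustrative, not taken from DeviceConnect:

# Record allocations while running a script, then render an HTML flame graph
memray run --output app_profile.bin my_app.py
memray flamegraph app_profile.bin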


Practical Challenge: Gunicorn + Multiple Workers

Our service runs behind Gunicorn with multiple workers, which complicates profiling:

  • Memray cannot easily attach to all workers automatically

  • Each worker is a separate process with its own memory space


The most reliable approach was to:

  1. Identify the process IDs of individual Gunicorn workers

  2. Attach Memray manually to each worker

  3. Collect allocation reports independently

While this setup was non-trivial, it gave us accurate, worker-level visibility.
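
As a sketch of that workflow, the worker PIDs can be enumerated as children of the Gunicorn master process, generating one attach command per worker (the master PID placeholder and output file names are assumptions):

# Print one memray attach command per Gunicorn worker
for pid in $(pgrep -P <GUNICORN_MASTER_PID>); do
    echo "memray attach ${pid} --output worker_${pid}.bin"
done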


How to Install Memray
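
pip install memray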

Find Gunicorn Workers

ps -ef | grep gunicorn


Attach Memray to a Running Worker Process

memray attach <WORKER_PID> --output <OUTPUT_FILE>

Example:

memray attach 12345 --output worker_12345.bin
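
Once a capture file has been collected from a worker, it can be turned into a report for inspection; a quick sketch using the example file name above:

# Summarize the capture and generate an interactive HTML flame graph
memray summary worker_12345.bin
memray flamegraph worker_12345.bin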


Results from Memray

After running Memray under sustained load, the results were surprising.

  • Heap size remained largely stable

  • No unbounded growth in Python object allocations

  • Garbage collection was functioning as expected


However, when we correlated this with system-level metrics, we observed something critical:

  • Resident Set Size (RSS) was continuously increasing

  • Memory was not being returned to the OS

  • RSS growth did not correspond to heap growth

This behavior was clearly visible when we plotted both metrics together:

  • The heap size (orange line) remains flat

  • The RSS (blue line) steadily increases over time
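
To keep correlating the two views over time, the OS-side numbers can be sampled directly; a minimal sketch, reusing the worker PID placeholder from earlier:

# Sample the worker's resident set size every 30 seconds
while true; do
    grep VmRSS /proc/<WORKER_PID>/status
    sleep 30
done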


Conclusion from Profiling

At this point, we ruled out:

  • Python object leaks

  • Reference cycles

  • Application-level memory retention

The issue was below the Python runtime.


Understanding the Root Cause: glibc Memory Behavior

Python relies on the system C library (glibc) for memory allocation. While glibc is robust, it has well-known characteristics:

  • Memory arenas per thread/process

  • Fragmentation under long-running workloads

  • Memory often not returned to the OS, even after being freed


This is not technically a “leak,” but from an operational perspective, it behaves like one.

Given our workload pattern (multiple workers, long uptime, bursty allocations), the default allocator was retaining memory, and that retention looked exactly like a memory leak.
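
For context (this is not a route the investigation relied on), glibc does expose a few tunables of its own; a commonly tried first step is capping the number of malloc arenas via an environment variable:

# glibc-level mitigation attempt: limit the number of malloc arenas
export MALLOC_ARENA_MAX=2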


How to Install jemalloc on Ubuntu

jemalloc is a high-performance memory allocator designed to reduce fragmentation and improve memory reuse compared to the default glibc allocator. It sits between the application and the operating system, controlling how memory is requested from and returned to the OS, which makes it particularly effective for long-running, multi-threaded services. In Python applications, jemalloc is commonly adopted when heap usage remains stable but overall process memory (RSS) keeps growing, because, when properly configured, it returns unused memory to the operating system more predictably.


a. Install jemalloc

sudo apt update
sudo apt install -y libjemalloc2

b. Verify Installation

ldconfig -p | grep jemalloc

Expected output (example):

libjemalloc.so.2 (libc6,x86-64) => /lib/x86_64-linux-gnu/libjemalloc.so.2

c. Preload jemalloc (System-wide or per process)

export LD_PRELOAD=/lib/x86_64-linux-gnu/libjemalloc.so.2

d. (Optional) jemalloc Custom Configuration

export MALLOC_CONF="narenas:1,tcache:false,dirty_decay_ms:10000,muzzy_decay_ms:10000,background_thread:true"

e. Start Application (Example: Gunicorn)

gunicorn app:app --workers 12 --bind 0.0.0.0:<PORT>

f. Verify jemalloc Is Being Used

cat /proc/$(pgrep -f gunicorn | head -n1)/maps | grep jemalloc

Expected output contains:

libjemalloc.so.2
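
Putting steps c through e together, a single invocation might look like the following (the library path is the one verified above; the port is a placeholder):

LD_PRELOAD=/lib/x86_64-linux-gnu/libjemalloc.so.2 \
MALLOC_CONF="narenas:1,tcache:false,dirty_decay_ms:10000,muzzy_decay_ms:10000,background_thread:true" \
gunicorn app:app --workers 12 --bind 0.0.0.0:<PORT>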


Load Testing - Jemalloc with Default Settings

To address allocator-level behavior, we replaced the default allocator with jemalloc, a widely used alternative designed to reduce fragmentation.

We enabled jemalloc without any custom tuning and ran load tests.

Observations


  • Jemalloc was successfully loaded

  • Memory usage increased significantly

  • RSS reached nearly 25–30 GB

  • Memory was still not released effectively

In fact, this was worse than the default allocator, where memory usage peaked around 11–17 GB.

This confirmed an important point:


Simply switching allocators is not sufficient—allocator configuration matters.


Load Testing - Jemalloc with Custom Settings

Next, we tuned jemalloc specifically for our workload and ran the load test twice, so we could check whether memory grew again on the second run. As expected, memory remained stable in the second run as well.


Configuration Used

ENV MALLOC_CONF="narenas:1,tcache:false,dirty_decay_ms:10000,muzzy_decay_ms:10000,background_thread:true"

Why These Settings Matter


  • narenas:1
    Reduces arena fragmentation in low-concurrency, multi-process setups.

  • tcache:false
    Prevents per-thread caching from hoarding memory.

  • dirty_decay_ms / muzzy_decay_ms
    Forces freed memory pages to decay and return to the OS within a bounded time.

  • background_thread:true
    Allows jemalloc to reclaim memory asynchronously.


Results

The improvement was immediate and measurable:

  • RSS stabilized around 3 GB, even under load

  • Memory usage decreased after load completion

  • No manual restarts required

  • Memory graphs became predictable and flat

This confirmed that the “leak” was allocator-level retention, not application leakage.

Additional Experiments: Uvicorn vs Gunicorn

We also evaluated different server configurations:

  • Uvicorn + jemalloc (custom config)
    Stable memory behavior, higher baseline per worker, but predictable.

  • Uvicorn without jemalloc
    Memory spiked to ~17 GB and did not recover.

  • Uvicorn + jemalloc (default config)
    Worst behavior, with memory exceeding 30 GB.
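
The Uvicorn variants were launched the same way, only swapping the server command while keeping the same LD_PRELOAD and MALLOC_CONF environment (a sketch; worker count and port are illustrative):

uvicorn app:app --workers 4 --host 0.0.0.0 --port <PORT>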


These experiments reinforced the conclusion that allocator tuning was the decisive factor, not the application server itself.


Conclusion


This investigation reinforced several important engineering lessons:


  1. Not all memory leaks are heap leaks
    A stable heap combined with growing RSS usually points to allocator or OS behavior.

  2. Heap profilers alone are insufficient
    Tools like Memray are essential to separate Python-level issues from system-level ones.

  3. glibc memory retention can appear as leaks
    Especially in long-running, multi-process Python services.

  4. jemalloc works, but only when tuned correctly
    Default settings may worsen memory usage.

  5. Allocator choice is a production concern
    It should be tested, benchmarked, and documented like any other infrastructure dependency.


By combining Memray for diagnosis and jemalloc with tailored configuration for mitigation, we were able to permanently resolve the memory issue in DeviceConnect—without changing a single line of application code.

Tags: python, memory-leak, heap-profiling, jemalloc-tuning, backend-performance, troubleshooting, glibc, memray
