When Linux Runs Out of Memory
Inside the Allocator
The real work actually takes place inside the glibc memory allocator. The allocator hands out blocks to the application, carving them from the heap that comes (however infrequently) from the kernel.
The allocator is the manager, while the kernel is the worker. With this in mind, it's easy to understand that maximum efficiency comes from a good allocator, not from the kernel.
glibc uses an allocator named ptmalloc. Wolfram Gloger created it as a modified version of the original malloc library created by Doug Lea. The allocator manages the allocated blocks in terms of "chunks." A chunk holds the memory block you requested, but its size is not the size you asked for: the allocator adds an extra header inside the chunk alongside the user data.
The allocator uses two functions to get a chunk of memory from the kernel:
- brk() sets the end of the process's data segment.
- mmap() creates a new VMA and passes it to the allocator.
Of course, malloc() uses these functions only if there are no more free chunks in the current pool.
The decision on whether to use brk() or mmap() requires one simple check. If the request is equal to or larger than M_MMAP_THRESHOLD, the allocator uses mmap(). If it is smaller, the allocator calls brk(). By default, M_MMAP_THRESHOLD is 128KB, but you may freely change it by using mallopt().
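As a minimal sketch of the mallopt() call mentioned above, the helper below changes the threshold at run time. The name set_mmap_threshold is hypothetical and exists only for illustration; mallopt() and M_MMAP_THRESHOLD come from glibc's <malloc.h>, so this is not portable to other C libraries.

```c
#include <malloc.h>   /* mallopt(), M_MMAP_THRESHOLD (glibc-specific) */
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: ask glibc to serve requests of at least `bytes`
 * with mmap() instead of brk().
 * Returns 1 on success and 0 on failure, exactly as mallopt() does. */
int set_mmap_threshold(size_t bytes)
{
    return mallopt(M_MMAP_THRESHOLD, (int)bytes);
}
```

For example, after set_mmap_threshold(1024 * 1024), a malloc(512 * 1024) request should be carved from the brk() heap, while a 2MB request should still get its own mapping.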
In the OOM context, how ptmalloc frees memory blocks is interesting. Blocks allocated via mmap() get freed via a munmap() call, and thus become completely released. Freeing blocks allocated via brk() means marking them as free, but they remain under the allocator's control. It can reassign free chunks to satisfy another malloc() if the request's size is less than or equal to the chunk's size. The allocator can consolidate multiple free chunks, as long as they are adjacent. It may even split a free chunk into smaller chunks to satisfy smaller future requests.
This implies that a free chunk may go abandoned if the allocator cannot fit future requests within it. Failure to coalesce free chunks may also trigger faster OOM. This is usually an indication of moderate to bad memory fragmentation.
Recovery
Once an OOM situation occurs, now what? The kernel will certainly terminate a process. Why kill? This is the only way to stop further memory requests. The kernel cannot assume there is a sophisticated mechanism inside the process to stop further requests automatically, so it has no other choice but to kill it.
How does the kernel know exactly which process to kill? The answer lies inside mm/oom_kill.c of the Linux source code. This C code represents the so-called OOM killer of the Linux kernel. The function badness() gives a score to each existing process. The one with the highest score will be the victim. The criteria are:
- VM size. This is not the sum of all allocated pages, but the sum of the size of all VMAs owned by the process. The bigger the VM size, the higher the score.
- Related to #1, the VM size of the process's children is important too. The VM size is cumulative if a process has one or more children.
- Processes with a nice value greater than zero (niced processes) get more points.
- Superuser processes are important, by assumption; thus they have their scores reduced.
- Process runtime. The longer it runs, the lower the score.
- Processes that perform direct hardware access are more immune.
- The swapper (pid 0) and init (pid 1) processes, as well as any kernel threads, are immune: they are excluded from the list of potential victims.
The process with the biggest score "wins" the election and the OOM killer will kill it very soon.
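The election can be pictured with a toy model. The sketch below is not the kernel's badness() code: struct proc_info, toy_badness(), pick_victim(), and every weight in them are assumptions invented for illustration. It only mirrors the direction of each criterion in the list above (larger VM size raises the score; runtime, superuser status, and hardware access lower it).

```c
#include <stdbool.h>

/* Hypothetical process summary; NOT a kernel structure. */
struct proc_info {
    unsigned long total_vm_pages;    /* sum of the process's VMA sizes */
    unsigned long children_vm_pages; /* VM size of its children */
    unsigned long runtime_seconds;
    int nice;
    bool superuser;
    bool raw_hw_access;
    bool immune;                     /* swapper, init, kernel threads */
};

/* Toy score: all divisors and multipliers are arbitrary assumptions. */
unsigned long toy_badness(const struct proc_info *p)
{
    if (p->immune)
        return 0;                           /* never a candidate */

    unsigned long points = p->total_vm_pages + p->children_vm_pages;
    points /= 1 + p->runtime_seconds / 600; /* long runners score lower */
    if (p->nice > 0)                        /* assumption: "niced" means nice > 0 */
        points *= 2;
    if (p->superuser)
        points /= 4;
    if (p->raw_hw_access)
        points /= 4;
    return points;
}

/* Returns the index of the highest-scoring process, or -1 if none is eligible. */
int pick_victim(const struct proc_info *procs, int count)
{
    int victim = -1;
    unsigned long worst = 0;
    for (int i = 0; i < count; i++) {
        unsigned long score = toy_badness(&procs[i]);
        if (score > worst) {
            worst = score;
            victim = i;
        }
    }
    return victim;
}
```

Even in this toy form, a long-running browser with a large VM size can outscore a freshly started memory hog, which is the kind of surprising kill the heuristic sometimes produces.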
The heuristic isn't perfect, but usually it works well for most situations. Criteria #1 and #2 clearly show that it is the VMA size that matters, not the number of actual pages a process has. You might think that measuring VMA size will trigger a false alarm, but luckily it doesn't. The badness() call occurs inside the page allocation functions when there are few free pages left and page frame reclamation fails, so the VMA size closely matches the number of pages owned by the process.
Why not just count the actual number of pages? That would take more time and require the use of locks, thus making the procedure too expensive for a fast decision. Knowing that the OOM killer isn't perfect, you must be ready for a wrong kill.
The kernel uses the SIGTERM signal to inform the target process that it should stop.
How to Reduce OOM Risk
The basic rule for avoiding OOM is simple: don't allocate beyond the machine's current free space. However, many factors come into play, so there are further refinements to the strategy.
Reduce Fragmentation by Properly Ordering Allocation
There is no need to use any sophisticated allocator. You can reduce fragmentation by properly ordering memory allocation and deallocation. As holes easily happen, employ the LIFO strategy: the last one you allocate is the first you need to free.
For example, instead of doing:
void *a;
void *b;
void *c;
............
a = malloc(1024);
b = malloc(5678);
c = malloc(4096);
......................
free(b);
b = malloc(12345);
It's better to do:
a = malloc(1024);
c = malloc(4096);
b = malloc(5678);
......................
free(b);
b = malloc(12345);
This way, there won't be any hole between the a and c chunks. You can also consider realloc() to resize any existing malloc()-ed blocks.
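A minimal sketch of the realloc() route follows. The helper name resize_block is hypothetical; the point it illustrates is that realloc()'s result should go into a temporary first, so that a failed resize does not lose the original block.

```c
#include <stdlib.h>

/* Hypothetical helper: resize *block to new_size bytes.
 * On failure the original block is left untouched and still owned by
 * the caller. Returns 0 on success, -1 on failure. */
int resize_block(void **block, size_t new_size)
{
    if (new_size == 0)
        return -1;                 /* use free() to release a block instead */

    void *tmp = realloc(*block, new_size);
    if (tmp == NULL)
        return -1;                 /* *block is still valid here */

    *block = tmp;                  /* the block may have moved */
    return 0;
}
```

With this, the free(b); b = malloc(12345); pair in the example becomes resize_block(&b, 12345), which also preserves the old contents of b.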
Two example programs (fragmented1.c and fragmented2.c) demonstrate the effect of allocation rearrangement. Reports at the end of both programs give the number of bytes allocated by the system (kernel and glibc allocator) and the number of bytes actually used. For example, on kernel 2.6.11.1, with glibc 2.3.3-27 and executing without giving an explicit parameter, fragmented1 wasted 319858832 bytes (about 305MB) while fragmented2 wasted 2089200 bytes (about 2MB). That's about 153 times smaller!
You can do further experiments by passing various numbers as the program parameter. This parameter acts as the request size of the malloc() call.
Tweak Kernel's Overcommit Behavior
You can change the behavior of the Linux kernel through the /proc filesystem, as documented in Documentation/vm/overcommit-accounting in the Linux kernel's source code. You have three choices when tuning kernel overcommit, expressed as numbers in /proc/sys/vm/overcommit_memory:
- 0 means that the kernel will use predefined heuristics when deciding whether to allow such an overcommit. This is the default.
- 1 always overcommits. Perhaps you now realize the danger of this mode.
- 2 prevents overcommit from exceeding a certain watermark. The watermark is also tunable through /proc/sys/vm/overcommit_ratio. Within this mode, the total commit cannot exceed the swap space(s) size + overcommit_ratio percent * RAM size. By default, the overcommit ratio is 50.
The default mode usually works fine in most situations, but mode #2 offers better protection against overcommit. On the other hand, mode #2 requires you to predict carefully how much space all running applications need. You certainly don't want to see your application unable to get more memory chunks just because the limit is too strict. However, mode #2 is the best way to avoid having a program killed suddenly.
Suppose that you have 256MB of RAM and 256MB of swap and you want to limit overcommit to 384MB. That means 256MB + 50 percent * 256MB, so write 50 to /proc/sys/vm/overcommit_ratio.
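The arithmetic for mode #2 can be written down as a small function. This is only a sketch of the formula stated above (swap size + overcommit_ratio percent * RAM size); the function name commit_limit_kb is hypothetical.

```c
/* Hypothetical helper: commit limit in kB under overcommit mode 2,
 * computed as swap + overcommit_ratio percent of RAM. */
unsigned long long commit_limit_kb(unsigned long long ram_kb,
                                   unsigned long long swap_kb,
                                   unsigned int overcommit_ratio)
{
    return swap_kb + ram_kb * overcommit_ratio / 100;
}
```

For the example machine, commit_limit_kb(262144, 262144, 50) gives 393216 kB, which is the 384MB target. On 2.6 kernels the value the kernel actually uses should be visible as CommitLimit in /proc/meminfo.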
Check for NULL Pointer after Memory Allocation and Audit for Memory Leak
This is a simple rule, but it is sometimes omitted. A non-NULL result at least tells you that the allocator could extend the memory area, although there is no guarantee that the kernel will be able to allocate the needed pages later. If you get NULL, you usually need to bail out or delay the allocation for a moment, depending on your scenario. Together with the overcommit tunables, this gives you a decent tool to anticipate OOM, because malloc() will return NULL if it believes that it cannot acquire free pages later.
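One way to express "bail out or delay" in code is sketched below. The name malloc_with_retry, the attempt count, and the delay are all assumptions chosen for illustration; a real application would pick its own policy.

```c
#define _POSIX_C_SOURCE 200809L   /* for nanosleep() */
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: try malloc() up to `attempts` times, sleeping
 * `delay_ms` milliseconds between failed tries.
 * Returns the block, or NULL if every attempt failed. */
void *malloc_with_retry(size_t size, int attempts, long delay_ms)
{
    struct timespec delay = { delay_ms / 1000, (delay_ms % 1000) * 1000000L };

    for (int i = 0; i < attempts; i++) {
        void *block = malloc(size);
        if (block != NULL)
            return block;
        if (i + 1 < attempts)
            nanosleep(&delay, NULL);   /* give other processes time to free memory */
    }
    return NULL;                       /* caller must bail out gracefully */
}
```

The caller still has to check the result; the wrapper only postpones the decision to give up.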
Memory leaks are also a source of unnecessary memory consumption. A leaked memory block is one that the application no longer tracks, but that the kernel will not reclaim because, from the kernel's point of view, the task still has it under control. Valgrind is a nice tool for finding such occurrences in your code without the need to re-code.
Always Consult Memory Allocation Statistics
The Linux kernel provides /proc/meminfo as a way to find complete information about memory conditions. This /proc entry is also an information source for utilities such as top, free, and vmstat.
What you need to check is the free and reclaimable memory. The word "free" needs no further explanation, but what does "reclaimable" mean? It refers to buffers and page caches--the disk cache. They are reclaimable because, when memory is tight, the Linux kernel can simply flush them out back to the disk. These are file-backed pages. I've lightly edited this example of memory statistics:
$ cat /proc/meminfo
MemTotal: 255944 kB
MemFree: 3668 kB
Buffers: 13640 kB
Cached: 171788 kB
SwapCached: 0 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 255944 kB
LowFree: 3668 kB
SwapTotal: 909676 kB
SwapFree: 909676 kB
Based on the output above, the free virtual memory is MemFree + Buffers + Cached + SwapFree = 1098772 kB.
I failed to find any formalized C (glibc) function to find out free (including reclaimable) memory space. The closest I found is by using get_avphys_pages() or sysconf() (with the _SC_AVPHYS_PAGES parameter). They only report the amount of free memory, not the free + reclaimable amount.
That means that to get precise information, you must programmatically parse /proc/meminfo and calculate it yourself. If you're lazy, take the procps source package as a reference on how to do it. This package contains tools such as ps, top, and free. It is available under the GPL.
Experiments with Alternative Memory Allocators
Different allocators yield different ways to manage memory chunks and to shrink, expand, and create virtual memory areas. Hoard is one example. Emery Berger from the University of Massachusetts wrote it as a high performance memory allocator. Hoard seems to work best for multi-threaded applications; it introduces the concept of per-CPU heap.
Use 64-bit Platforms
Users who need larger user address spaces can consider using 64-bit platforms. The Linux kernel no longer uses the 3:1 VM split for these machines. In other words, user space becomes quite large. It can be a good match for machines with more than 4GB of RAM.
This has no connection to extended addressing schemes, such as Intel's Physical Address Extension (PAE), which allows a 32-bit Intel processor to address up to 64GB of RAM. This addressing deals with physical addresses, while in the virtual address context, the user space itself is still 3GB (assuming the 3:1 VM split). This extra memory is reachable, but not all mappable into the address space. Unmappable portions of RAM are unusable.
Consider Packed Types on Structures
Packed attributes can help to squeeze the size of structs, enums, and unions. This is a way to save more bytes, especially for arrays of structs. Here is a declaration example:
struct test
{
char a;
long b;
} __attribute__ ((packed));
The con for this action is that it makes certain field(s) unaligned and thus it costs more CPU cycles to access the field. "Aligned" here means the variable's address is a multiple of its data type's natural size. The net result is that, depending on the data access frequency, the runtime may get relatively slower. However, take into account page alignment and cache coherence.
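To see how many bytes packing actually saves, you can compare the two layouts with sizeof. This sketch assumes GCC (or a compatible compiler that accepts __attribute__ ((packed))); the struct names test_padded and test_packed and the helper are hypothetical.

```c
#include <stddef.h>

/* Same fields, default alignment: the compiler may insert padding after `a`. */
struct test_padded {
    char a;
    long b;
};

/* Same fields, packed: `b` starts immediately after `a`. */
struct test_packed {
    char a;
    long b;
} __attribute__ ((packed));

/* Hypothetical helper: bytes saved per array element by packing. */
size_t bytes_saved_per_element(void)
{
    return sizeof(struct test_padded) - sizeof(struct test_packed);
}
```

On a typical x86-64 Linux build the padded struct should occupy 16 bytes and the packed one 9, so an array of a million elements would shrink by roughly 7MB, at the price of unaligned access to b.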
Use ulimit() for User Processes
With ulimit -v, you can limit the address space a process can allocate with mmap(). When you reach the limit, all mmap(), and hence malloc(), calls will fail (malloc() returns NULL) and the kernel's OOM killer will never start. This is most useful in a multi-user environment where you cannot trust all of the users and want to avoid killing random processes.
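A process can also apply the same kind of limit to itself. The sketch below uses setrlimit() with RLIMIT_AS, which is the address-space resource limit; the helper name limit_address_space is hypothetical, and unlike a plain ulimit -v in the shell, it lowers only the soft limit.

```c
#include <sys/resource.h>

/* Hypothetical helper: lower the soft address-space limit of the calling
 * process to `bytes`. Returns 0 on success, -1 on failure. */
int limit_address_space(rlim_t bytes)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_AS, &lim) != 0)
        return -1;
    if (lim.rlim_max != RLIM_INFINITY && bytes > lim.rlim_max)
        return -1;                 /* soft limit cannot exceed the hard limit */

    lim.rlim_cur = bytes;          /* hard limit is left unchanged */
    return setrlimit(RLIMIT_AS, &lim);
}
```

After limit_address_space(50UL * 1024 * 1024), large malloc() requests should start returning NULL well before the machine itself runs short, so the application gets a chance to react instead of being killed.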
Acknowledgement
The author gives credit to several people for their assistance and help: Peter Zijlstra, Wolfram Gloger, and Rene Hermant. Mr. Gloger also contributed the ulimit() technique.
References
- "Dynamic Storage Allocation: A Survey and Critical Review," by Paul R. Wilson, Mark S. Johnstone, Michael Neely, and David Boles. Proceeding 1995 International Workshop of Memory Management.
- "Hoard: A Scalable Memory Allocator for Multithreaded Applications," by Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson
- "Once upon a free()" by Anonymous, Phrack Volume 0x0b, Issue 0x39, Phile #0x09 of 0x12.
- "Vudo: An Object Superstitiously Believed to Embody Magical Powers," by Michel "MaXX" Kaempf. Phrack Volume 0x0b, Issue 0x39, Phile #0x08 of 0x12.
- "Policy-Based Memory Allocation," by Andrei Alexandrescu and Emery Berger. C/C++ Users Journal.
- "Security of memory allocators for C and C++," by Yves Younan, Wouter Joosen, Frank Piessens, and Hans Van den Eynden. Report CW419, July 2005
- Lecture notes (CS360) about malloc(), by Jim Plank, Dept. of Computer Science, University of Tennessee.
- "Inside Memory Management: The Choices, Tradeoffs, and Implementations of Dynamic Allocation," by Jonathan Bartlett
- "The Malloc Maleficarum," by Phantasmal Phantasmagoria
- Understanding The Linux Kernel, 3rd edition, by Daniel P. Bovet and Marco Cesati. O'Reilly Media, Inc.
Mulyadi Santosa is a freelance writer who lives in Indonesia.
-
Try testing your facts before posting an article
2007-07-20 09:28:12 docbillnet
Neither ulimit -v or ulimit -m have any effect with the 2.6 Linux kernels. Processes can still use unlimited amounts of memory.
-
Re: Try testing your facts before posting an article
2007-07-20 22:59:22 mulyadi_santosa
Hi Bill...
Thanks for the criticism. As soon as you mentioned this, I re-checked the fact on my FC5 installation (2.6.15.x kernel). I re-ran loop-calloc.c, but first I changed my uid (using su) to a non-root user (let's say abc). I did:
ulimit -v 51200
to limit the available virtual memory to uid abc up to 50 MB. loop-calloc gives me these:
.....
Currently allocating 47 MB
Currently allocating 48 MB
So malloc stops when it allocates 48 MB of VM. The rest is, of course, consumed by the code and data sections of the program (and the loader, shared libs, etc). Note that this limit is per session: it applies immediately after you type it in a shell session. Once you exit from this session, it no longer holds.
Perhaps you get a different result?
regards,
Mulyadi
-
kernel memory
2007-06-28 07:25:56 campbellmc
Hi Mulyadi,
Excellent article - many thanks! Just curious: is it ever possible that low mem could get used up? I am guessing it would have to be bad coding in a driver or some other piece of kernel code, causing a memory leak or something, which is (hopefully) highly unlikely. I've heard that user-space processes also need some kernel memory, but I am guessing the kernel's memory manager would deny any requests that it could not fulfil, and the application would simply fail.
Thanks again,
Campbell
-
kernel memory
2007-07-04 09:45:19 mulyadi_santosa
hi..
big sorry, late reply. Uhm, "low mem"? I don't exactly understand what you refer to here. Maybe the lowmem memory zone a.k.a. ZONE_NORMAL?
But anyway, in general, memory could be used up (until the last drop). This is especially true when you do it in kernel mode. Nothing stops you in this case, because the (Linux) kernel always trusts itself. Of course, any sane kernel developer should catch this quirk in the first place before releasing any stable release.
And..about user processes which allocate kernel mode memory. Actually, implicitly you already do that every time. When you start a program, the kernel also allocates a small amount of memory to store its task descriptor. When you're doing a system call, usually some user memory content is copied to the kernel memory area before being processed further.
There is a more explicit example, assuming you know a bit about sound programming. IIRC if you prepare a PCM channel and ask for some amount of buffer, actually you are requesting kernel mode pages.
I hope it clarifies your doubts.
regards,
Mulyadi
-
kernel memory
2007-07-09 04:56:17 campbellmc
Hi Mulyadi,
Thanks for the reply.
Yes, by 'low mem' I meant ZONE_NORMAL or the kernel memory area, i.e., the first 1GB (or 896MB) if you have up to 4GB on a 32-bit system (I'm just learning this stuff, so apologies in advance if some of my questions are a bit daft).
When you say that the kernel 'always trusts itself', what do you mean exactly?
Is it the case that (theoretically) the kernel will never allocate more RAM than is available in ZONE_NORMAL? You mentioned that it is possible/normal for the glibc memory allocator to overcommit memory allocation for user-space processes. Thus an application could allocate 3GB but the machine only has 80MB memory available (phys RAM + SWAP), and it can get away with this because the application may not actually end up *using* the memory. Now: does the memory allocator or the kernel apply a stricter policy on ZONE_NORMAL? Could it ever over-allocate memory in ZONE_NORMAL? Assuming that all of the 896MB of memory was allocated, would new kernel-space processes then not be able to start (e.g., loading a module) OR userspace applications would also fail, since they need some kernel memory too. If the kernel does overcommit, I imagine it could crash the machine if it used up all the phys low RAM (896MB). Part of my interest is to see whether low memory conditions can cause the machine to actually crash, as opposed to just causing user-space applications to fail. Also, I wonder if swap can assist system (not application) stability. If only user-space processes can be swapped out, and a user-space process running out of memory will not cause the machine to crash, just the application, then from a system stability point of view, swap is unnecessary, but it would help application stability, since it can allow more memory pages to be file-backed. Only in the case of heavy swapping or very large overcommitment would relying on swap be an issue. As you point out, memory allocation and commitment levels can be tuned.
Cheers,
campbell
-
kernel memory
2007-07-13 04:52:51 mulyadi_santosa
OK, to answer your first question. "Kernel trusts itself" means the kernel won't do any complicated check when it asks for memory. For example, you ask for a 256 MB memory block (using kmalloc(), the kernel-space version of malloc()). Then the allocator will give it to you if there is such an amount of free pages. No allocation delay at all. Another example: you can allocate a big chunk and forget to free it later. There isn't any garbage collector in kernel land, so this chunk will still be marked as used until the end of the life of the kernel.
Now the second question, could the kernel over-allocate? In practice, no. What you see as overcommit action actually just exists in user space. Recall that the actual page allocation only happens in the page fault (be it a soft or hard one; "hard" means data must be read from backing storage). In kernel space, when you ask for RAM pages, you will either get them all at once or get nothing (in case of low free pages or heavily fragmented memory).
About the policy, I can't recall anything specific here. I just remember that in each zone (dma, normal and highmem), some % of free pages are reserved. No user mode allocation is allowed to drain these reserved pages, unless its effective user ID is root. Another policy that I could recall is the way the allocator prioritizes the zones. IIRC, first it tries to grab pages from the highmem zone, then normal. As the last resort, it will try the DMA zone.
About the importance of swap, this is kind of a subjective answer. Theoretically, you won't need swap if you own very big RAM, let's say 64GB RAM (it can be addressed in 32 bit using PAE mode). But that's rare. Nowadays, most PCs have 256 MB - 2GB RAM. Sure it's big, but the applications also grow bigger and consume more RAM. So, 2GB is likely eaten fast in certain workloads. If you don't own swap, once that 2GB is used, you're out of luck. No more allocation is possible. Swap acts as a life saver here, allowing you to allocate a bit more without being rejected. It also permits the kernel to swap out inactive pages, so RAM pages are freed up for more important jobs.
Does this clear your doubts?
regards,
Mulyadi
-
Parsing /proc/meminfo before 'malloc'ing
2007-03-22 21:20:36 Unna.KB
Sorry. I came across this article after nearly 4 months since it is written. I found this article very useful. I need a help to improve the memory performance of my embedded system.
I have a memory performance testcase scenario wherein I allocate two chunks of 32 MB , memset chunk1 to all 1s. Then, memcpy chunk1 to chunk2. In this process, i get my running process killed when it is performing memcpy at 27MB. So, as suggested in the article, i tried to parse the /proc/meminfo file for free usable memory before a malloc operation. Usable Memory = MemFree + Cached, as Buffers and SwapFree are not applicable in my case. MemTotal = 71MB.
Before allocation of chunk 1, it showed 60MB of Usable Memory. I allocated chunk 1 using malloc. When i again checked for usable memory for allocating chunk 2 it again showed 60 MB. Should not it alerted me by showing 28 MB, so that i would have averted from allocating the next chunk of 32 MB. Please advice me on how to avoid such memory allocations. Which other field i need to check in the /proc/meminfo.?
I cannot use strict overcommit in my system, as application has been on its final stage of development, tuning to strict overcommit leads to lot of NO MEMORY errors.
-
Re: Parsing /proc/meminfo before 'malloc'ing
2007-03-24 10:36:40 mulyadi_santosa
Hi Unna..
Sorry for this late reply. I am also confused why it can actually report there was still 60 MB of free memory after you do malloc()+memset() the first 32 MB block. The things I can suggest are:
1. Please make sure you get a valid memory report. There is a chance you are reading not-up-to-date information (kinda delayed).
2. Find out more about your OS. Is it Linux? BSD? else? Pay attention for things like how they actually do memory allocation. I also forgot to tell the reader one thing (more because I haven't done closer research about it), kernel actually reserves some amount of RAM for special purpose. I don't know the exact amount, so you probably hit this "unseen" area.
Feel free to reply on this thread... and anyone may CMIIW.
regards,
Mulyadi
-
Re: Parsing /proc/meminfo before 'malloc'ing
2008-11-17 22:35:39 dee_
Hi Mulyadi,
Thanks for this great article. It is really helpful.
I am facing one problem. I want that my application should be able to find two things:
1) available heap memory
2) largest block of heap memory that it can allocate.
Calculating available free pages and available user space will help me how?
Can you help me in this?
Regards,
Deepak
-
Re: Parsing /proc/meminfo before 'malloc'ing
2008-12-05 20:12:08 santosam
Hi Deepak
Sorry, I know of no built-in glibc function that is able to do what you ask. Seems like you have to parse /proc/the-pid/maps by yourself and from there determine the largest VMA block you can allocate.
Note that you could split the size you ask for into smaller parts, since in a real situation it's a bit hard to find a big contiguous virtual memory area.
Sorry if it doesn't help you a lot and thanks for reading my article. Glad you find it useful.
regards,
Mulyadi.
-
OOM Behaviour
2006-12-04 06:41:45 Teresa3455
First of all, thanks for the article. I found it very useful.
I compiled loop-calloc.c on my desktop machine. I ran it, and suddenly my firefox session was killed by OOM.
I don't understand why firefox was killed instead of the newly-created memory-eating process. Firefox was started much before loop-calloc was.
Is there any other way to adjust OOM behaviour other than placing "OOM_DISABLE" on /proc/<pid>/oom_adj ?
Thanks
Teresa
P.S : Good link -> http://linux-mm.org/OOM_Killer
-
OOM Behaviour
2006-12-05 22:53:09 mulyadi_santosa
Hi Teresa
It's good to hear that you found this article to be useful. Please don't hesitate to write further comments about it.
About FF (Firefox) getting killed instead of loop-calloc, I have a pretty good guess that FF had a bigger VM size than loop-calloc when the OOM killer was working in your case. Recall that running time is just one of the killer's criteria, so to predict which application gets killed, you need to carefully check all seven criteria I mentioned in the article.
I have tried to mimic your case, simply by doing tail -f /dev/zero while FF 1.0.4 was running. Instead of killing tail right away, FF was killed together with other applications such as KDE panel and KMail. top reported that FF consumed ~33MB of virtual memory while free+reclaimable RAM was about ~30 MB before loop-calloc was started.
My suggestion, instead of looking for ways to disable the OOM killer, is to use ulimit whenever you want to start a memory-hogging application. Start a shell (possibly via xterm or Konsole), run ulimit -v <some amount of memory>, and start the application after that. Assuming the application does proper checking after malloc() and strict overcommit is enabled, there is no need for the OOM killer to randomly kill applications, which in your case ended up killing innocent FF.
regards,
Mulyadi
-
Source needs some changes to compile
2006-11-30 18:26:14 cdonges
if ! (myblock) break;
should be
if (! myblock) break;
and
print("Currently allocating %d MB\n", ++count);
should be
printf("Currently allocating %d MB\n", ++count);
-
Source needs some changes to compile
2006-12-01 10:06:18 chromatic
Thanks; I updated the article.
-
Source needs some changes to compile
2006-12-01 03:50:13 mulyadi_santosa
Hi...
Thank you for the clarification. However, IIRC, both !(myblock) or (!myblock) should yield same result. I am away from dedicated Linux box right now, but I'll confirm it ASAP.
regards,
Mulyadi
-
Source needs some changes to compile
2007-06-21 16:33:31 deStilaDo
IAACP and no, it's wrong.
Even if it's accepted by any compiler, it shouldn't be.
The grammar of the C language states that parentheses are a mandatory part of the "if statement", formally known as "selection_statement":
selection_statement: in statement
IF '(' expression ')' statement
| IF '(' expression ')' statement ELSE statement
http://www.vendian.org/mncharity/ccode/grammar/html/degener_symb.txt.html#selection_statement







