I'm doing some rather long computations, which can easily span a few days. In the course of these computations, Mathematica will sometimes run out of memory. To work around this, I've ended up resorting to something along the lines of:
ParallelEvaluate[$KernelID]; (* Force the kernels to launch *)
kernels = Kernels[];
Do[
If[Mod[iteration, n] == 0,
CloseKernels[kernels];
LaunchKernels[kernels];
ClearSystemCache[]];
(* Complicated stuff here *)
Export[...], (* If a computation ends early I don't want to lose past results *)
{iteration, min, max}]
This is great and all, but over time the main kernel accumulates memory. Currently, my main kernel is eating up roughly 1.4 GB of RAM. Is there any way I can force Mathematica to clear out the memory it's using? I've tried littering Share and Clear throughout the many Modules I'm using in my code, but the memory still seems to build up over time.
I've tried also to make sure I have nothing big and complicated running outside of a Module, so that something doesn't stay in scope too long. But even with this I still have my memory issues.
Is there anything I can do about this? I'm always going to have a large amount of memory being used, since most of my calculations involve several large and dense matrices (usually 1200 x 1200, but it can be more), so I'm wary about using MemoryConstrained.
Update:
The problem was exactly what Alexey Popkov stated in his answer. If you use Module, memory will leak slowly over time. It happened to be exacerbated in this case because I had multiple Module[..] statements. The "main" Module was within a ParallelTable where 8 kernels were running at once. Tack on the (relatively) large number of iterations, and this was a breeding ground for lots of memory leaks due to the bug with Module.
Since you are using Module extensively, I think you may be interested in knowing this bug with non-deleting temporary Module variables.
Example (non-deleting unlinked temporary variables with their definitions):
In[1]:= $HistoryLength=0;
a[b_]:=Module[{c,d},d:=9;d/;b===1];
Length@Names[$Context<>"*"]
Out[3]= 6
In[4]:= lst=Table[a[1],{1000}];
Length@Names[$Context<>"*"]
Out[5]= 1007
In[6]:= lst=.
Length@Names[$Context<>"*"]
Out[7]= 1007
In[8]:= Definition@d$999
Out[8]= Attributes[d$999]={Temporary}
d$999:=9
Note that in the above code I set $HistoryLength = 0; to stress this buggy behavior of Module. If you do not do this, temporary variables can still be referenced from the history variables (In and Out), and for that reason they will survive, together with their definitions, in a broader set of cases (that is not a bug but a feature, as Leonid mentioned).
UPDATE: Just for the record. There is another old bug with non-deleting unreferenced Module variables after Part assignments to them in v.5.2 which is not completely fixed even in version 7.0.1:
In[1]:= $HistoryLength=0;$Version
Module[{L=Array[0&,10^7]},L[[#]]++&/@Range[100];];
Names["L$*"]
ByteCount@Symbol@#&/@Names["L$*"]
Out[1]= 7.0 for Microsoft Windows (32-bit) (February 18, 2009)
Out[3]= {L$111}
Out[4]= {40000084}
Have you tried evaluating $HistoryLength=0; in all subkernels as well as in the master kernel? History tracking is the most common cause of running out of memory.
Have you tried using the fast and efficient Put instead of the slow and memory-consuming Export?
It is not clear from your post where you evaluate ClearSystemCache[] - in the master kernel or in subkernels? It looks like you evaluate it in the master kernel only. Try to evaluate it in all subkernels too before each iteration.
The recent leak from Wikileaks has the CIA doing the following:
DO explicitly remove sensitive data (encryption keys, raw collection
data, shellcode, uploaded modules, etc) from memory as soon as the
data is no longer needed in plain-text form.
DO NOT RELY ON THE OPERATING SYSTEM TO DO THIS UPON TERMINATION OF
EXECUTION.
Me being a developer in the *nix world; I'm seeing this as merely changing the value of a variable (ensuring I do not pass by value; and instead by reference); so if it's a string that's 100 characters; writing 0's that's 101 characters. Is it really this simple? If not, why and what should be done instead?
Note: There are similar questions that ask this, but they are in the C# and Windows world. So, I do not consider this question a duplicate.
Me being a developer in the *nix world; I'm seeing this as merely
changing the value of a variable (ensuring I do not pass by value; and
instead by reference); so if it's a string that's 100 characters;
writing 0's that's 101 characters. Is it really this simple? If not,
why and what should be done instead?
It should be this simple. The devil is in the details.
memory allocation functions, such as realloc, are not guaranteed to leave memory alone (you should not rely on their doing it one way or the other - see also this question). If you allocate 1K of memory, then realloc it to 10K, your original K might still be there somewhere else, containing its sensitive payload. It might then be allocated by another insensitive variable or buffer, or not, and through the new variable, it might be possible to access a part or all of the old content, much as it happened with slack space on some filesystems.
manually zeroing memory (and, with most compilers, bzero and memset count as manual loops) might be blithely optimized out, especially if you're zeroing a local variable ("bug" - actually a feature, with workaround).
some functions might leave "traces" in local buffers or in memory they allocate and deallocate.
in some languages and frameworks, whole portions of data could end up being moved around (e.g. during so-called "garbage collection", as noticed by @gene). You may be able to tell the GC not to process your sensitive area or otherwise "pin" it to that effect, and if so, must do so. Otherwise, data might end up in multiple, partial copies.
information might have come through and left traces you're not aware of (trivial example: a password sent through the network might linger in the network library read buffer).
live memory might be swapped out to disk.
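The point about zeroing being optimized out can be made concrete with a minimal sketch. The helper name wipe_bytes is hypothetical (it is not from any library); it writes through a volatile-qualified pointer, which compilers in practice do not elide the way they may elide a plain memset on a buffer that is never read again. Where your platform offers explicit_bzero (glibc, the BSDs) or the optional C11 memset_s, those are probably the better choice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a library function): overwrite len bytes
   at buf with zeros. The stores go through a volatile-qualified
   pointer so that the compiler treats each one as an observable side
   effect instead of removing the loop as a dead store. */
void wipe_bytes(void *buf, size_t len)
{
    volatile unsigned char *p = (volatile unsigned char *)buf;

    assert(buf != NULL || len == 0);
    while (len--) {
        *p++ = 0;
    }
}
```

Typical use would be calling wipe_bytes(key, sizeof key) as soon as key is no longer needed, before the buffer is freed or goes out of scope. It only addresses the dead-store problem; copies made elsewhere (realloc, swap, library buffers) are untouched.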
Example of realloc doing its thing. Memory gets partly rewritten, and with some libraries this will only "work" if "a" is not the only allocated area (so you need to also declare c and allocate something immediately after a, so that a is not the last object and left free to grow):
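#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *a;
    char *b;
    a = malloc(1024);
    strcpy(a, "Hello");
    strcpy(a + 200, "world");
    printf("a at %08lu is %s...%s\n", (unsigned long) a, a, a + 200);
    b = realloc(a, 10240);
    strcpy(b, "Hey!");
    /* Deliberately reads through the stale pointer a (undefined behavior)
       to show what realloc left behind in the old block. */
    printf("a at %08lu is %s...%s, b at %08lu is %s\n", (unsigned long) a, a, a + 200, (unsigned long) b, b);
    return 0;
}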
Output:
a at 19828752 is Hello...world
a at 19828752 is 8????...world, b at 19830832 is Hey!
So the memory at address a was partly rewritten - "Hello" is lost, "world" is still there (as well as at b + 200).
So you need to handle reallocations of sensitive areas yourself; better yet, pre-allocate it all at program startup. Then, tell the OS that a sensitive area of memory must never be swapped to disk. Then you need to zero it in such a way that the compiler can't interfere. And you need to use a low-level enough language that you're sure doesn't do things by itself: a simple string concatenation could spawn two or three copies of the data - I'm fairly certain it happened in PHP 5.2.
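As a rough illustration of "handle reallocations yourself", here is a minimal sketch of a realloc replacement for sensitive buffers. The names secret_wipe and secret_realloc are hypothetical, and the caller has to pass the old size explicitly because standard C offers no portable way to query it. The idea is simply: allocate the new block, copy, wipe the old block, and only then free it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: zero len bytes through a volatile pointer so
   the stores are not removed as dead stores before free(). */
void secret_wipe(void *buf, size_t len)
{
    volatile unsigned char *p = (volatile unsigned char *)buf;
    while (len--) {
        *p++ = 0;
    }
}

/* Hypothetical replacement for realloc() on sensitive data.
   Unlike realloc(), it never lets the allocator move the block behind
   our back: the old contents are wiped before the old block is freed.
   On failure it returns NULL and leaves the old block untouched. */
void *secret_realloc(void *old, size_t old_len, size_t new_len)
{
    void *fresh;

    assert(new_len > 0);
    fresh = malloc(new_len);
    if (fresh == NULL) {
        return NULL;
    }
    if (old != NULL) {
        memcpy(fresh, old, old_len < new_len ? old_len : new_len);
        secret_wipe(old, old_len);
        free(old);
    }
    return fresh;
}
```

This trades speed for predictability: every resize copies the data, but no stale copy of the payload is handed back to the allocator. Pre-allocating once at startup, as suggested above, avoids the problem altogether.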
Ages ago I wrote myself a small library (there wasn't valgrind yet), inspired by Steve Maguire's Writing Solid Code. Apart from overriding the various memory and string functions, I ended up overwriting memory and then calculating the checksum of the overwritten buffer. This was not for security; I used it to track buffer overflows and underflows, double frees, use of freed memory, and that kind of thing.
And then you need to ensure your failsafes work - for example, what happens if the program aborts? Might it be possible to make it abort on purpose?
You need to implement defense in depth, and always look at ways to keep as little information around as possible - for example clearing the intermediate buffers during a calculation rather than waiting and freeing the whole lot in one fell swoop at the very end, or just when exiting the program; keeping hashes instead of passwords when at all possible; and so on.
Of course all this depends on how sensitive the information is and what the attack surface is likely to be (mandatory xkcd reference: here). Rebooting the PC with a memtest86 image could be a viable alternative. Think of a dual-boot computer with memtest86 set to test memory and power down the PC as default boot option. When you want to turn off the system... you reboot it instead. The PC will reboot, enter memtest86 by default, and before powering off for good, it'll start filling all available RAM with marching troops of zeros and ones. Good luck freeze-booting information from that.
Zeroing out secrets (passwords, keys, etc) immediately after you are done with them is fairly standard practice. The difficulty is in dealing with language and platform features that can get in your way.
For example, C++ compilers can optimize out calls to memset if they determine that the data is not read after the write. Or the operating system may have paged the memory out to disk, potentially leaving the data available that way.
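For the paging problem, POSIX systems provide mlock, which asks the kernel to keep a range of pages resident in RAM. The sketch below is a minimal illustration under that assumption; secret_alloc_locked and secret_free_locked are hypothetical names, and mlock may well fail (for example when RLIMIT_MEMLOCK is low), so the result is reported to the caller rather than assumed.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: allocate a page-aligned buffer of at least len
   bytes and try to lock it into RAM. *locked is set to 1 if mlock()
   succeeded and to 0 otherwise. Returns NULL on allocation failure. */
void *secret_alloc_locked(size_t len, int *locked)
{
    long page = sysconf(_SC_PAGESIZE);
    void *buf = NULL;
    size_t rounded;

    assert(locked != NULL);
    *locked = 0;
    if (page <= 0 || len == 0) {
        return NULL;
    }
    /* Round the request up to a whole number of pages. */
    rounded = (len + (size_t)page - 1) / (size_t)page * (size_t)page;
    if (posix_memalign(&buf, (size_t)page, rounded) != 0) {
        return NULL;
    }
    if (mlock(buf, rounded) == 0) {
        *locked = 1;
    }
    return buf;
}

/* Hypothetical counterpart: zero the buffer through a volatile
   pointer, unlock it if it was locked, then free it. */
void secret_free_locked(void *buf, size_t len, int locked)
{
    volatile unsigned char *p = buf;
    size_t i;

    if (buf == NULL) {
        return;
    }
    for (i = 0; i < len; i++) {
        p[i] = 0;
    }
    if (locked) {
        munlock(buf, len);
    }
    free(buf);
}
```

A caller that cannot tolerate swapping would treat locked == 0 as an error. Note that locking does not cover hibernation images or core dumps, which need separate handling.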
I have posted on this before and thought I had tracked it down to the NW extension; however, memory leakage still occurs in the latest version. I found this thread, which discusses a similar issue but attributes it to BehaviorSpace:
http://netlogo-users.18673.x6.nabble.com/Behaviorspace-Memory-Leak-td5003468.html
I have found the same symptoms. My model starts out at around 650 MB, but over each run the private working set memory rises, to the point where it hits the 1024 MB limit. I have sufficient memory to raise this, but in reality it will only delay the onset. I am using the table output, since previous discussions suggested it helps, and it does, but it only slows the rate of increase. Eventually the memory usage rises to a point where the PC starts to struggle. I am clearing all data between runs, so there should be no hangover. I noticed in the linked thread that they were going to run headless. I will try this, but I wondered if anyone else had noticed the issue? My other option is to break the BehaviorSpace simulation into a few batches so the issue never arises, but it would be nice to let the model run and walk away, as it takes around 2 hours to go through.
Some possible next steps:
1) Isolate the exact conditions under which the problem does or does not occur. Can you make it happen without involving the nw extension, or not? Does it still happen if you remove some of the code from your model? What if you keep removing code: when does the problem go away? What is the smallest code that still causes the problem? Almost any bug can be demonstrated with only a small amount of code, and finding that smallest demonstration is exactly what is needed in order to track down the cause and fix it.
2) Use standard memory profiling tools for the JVM to see what kind of objects are using the memory. This might provide some clues to possible causes.
In general, we are not receiving other bug reports from users along these lines. It's routine, and has been for many years now, for people to use BehaviorSpace (both headless and not) and do experiments that last for hours or even for days. So whatever it is you're experiencing almost certainly has a more specific cause (most likely in the nw extension) that could be isolated.
When compiling under Linux I use the flag -j16, as I have 16 cores. I am wondering whether it makes any sense to use something like -j32. This is really a question about the scheduling of processor time, and whether it is possible to give one particular process more weight than others this way (let's say I have two parallel compilations, each with -j16; what if one of them used -j32 instead?).
I think it does not make much sense, but I am not sure, as I do not know how the kernel handles such things.
I use a non-recursive build system based on GNU make and I was wondering how well it scales.
I ran benchmarks on a 6-core Intel CPU with hyper-threading. I measured compile times using -j1 to -j20. For each -j option make ran three times and the shortest time was recorded. Using -j9 results in the shortest compile time, 11% better than -j6.
In other words, hyper-threading does help a little, and a good rule for Intel processors with hyper-threading is number_of_cores * 1.5 (6 * 1.5 = 9 on this machine).
Chart data is here.
The rule of thumb is to use the number of processors + 1. Hyper-Threading counts, so a quad-core CPU with HT should use -j9.
Setting the value too high is counter-productive. If you do want to speed up compile times, consider ccache to cache compiled objects that do not change between compilations, and distcc to distribute the compilation across several machines.
We have a machine in our shop with the following characteristics:
256 core sparc solaris
~64gb RAM
Some of that memory used for a ram drive for /tmp
Back when it was originally setup, before other users discovered its existence, I ran some timing tests to see how far I could push it. The build in question is non-recursive, so all jobs are kicked off from a single make process. I also cloned my repo into /tmp to take advantage of the ram drive.
I saw improvements up to -j56. Beyond that my results flat lined much like Maxim's graph, until somewhere above (roughly) -j75 where performance began to degrade. Running multiple parallel builds I could push it beyond the apparent cap of -j56.
The primary make process is single-threaded; after running some tests I realized the ceiling I was hitting had to do with how many child processes the primary thread could service. This was further hampered by anything in the makefiles that either required extra time to parse (e.g., using = where := would have avoided unnecessary delayed evaluation, complex user-defined macros, etc.) or used things like $(shell).
These are the things I've been able to do to speed up builds that have a noticeable impact:
Use := wherever possible
If you assign to a variable once with :=, then later with +=, it'll continue to use immediate evaluation. However, ?= and +=, when a variable hasn't been assigned previously, will always delay evaluation.
Delayed evaluation doesn't seem like a big deal until you have a large enough build. If a variable (like CFLAGS) doesn't change after all the makefiles have been parsed, then you probably don't want to use delayed evaluation on it (and if you do, you probably already know enough about what I'm talking about anyway to ignore my advice).
If you create macros you execute with the $(call) facility, try to do as much of the evaluation ahead of time as possible
I once got it in my head to create macros of the form:
IFLINUX = $(strip $(if $(filter Linux,$(shell uname)),$(1),$(2)))
IFCLANG = $(strip $(if $(filter-out undefined,$(origin CLANG_BUILD)),$(1),$(2)))
...
# an example of how I might have made the worst use of it
CXXFLAGS = ${whatever flags} $(call IFCLANG,-fsanitize=undefined)
This build produces over 10,000 object files, about 8,000 of which are from C++ code. Had I used CXXFLAGS := (...), it would only need to immediately replace ${CXXFLAGS} in all of the compile steps with the already evaluated text. Instead it must re-evaluate the text of that variable once for each compile step.
An alternative implementation that can at least help mitigate some of the re-evaluation if you have no choice:
ifneq 'undefined' '$(origin CLANG_BUILD)'
IFCLANG = $(strip $(1))
else
IFCLANG = $(strip $(2))
endif
... though that only helps avoid the repeated $(origin) and $(if) calls; you'd still have to follow the advice about using := wherever possible.
Where possible, avoid using custom macros inside recipes
The reasoning should be pretty obvious here after the above; anything that requires a variable or macro to be repeatedly evaluated for every compile/link step will degrade your build speed. Every macro/variable evaluation occurs in the same thread as what kicks off new jobs, so any time spent parsing is time make delays kicking off another parallel job.
I put some recipes in custom macros whenever it promotes code re-use and/or improves readability, but I try to keep it to a minimum.
I'm hunting for some memory leaks in a long-running service (using F#) right now.
The only "strange" thing I've seen so far is the following:
I use a MailboxProcessor in a subsystem with an algebraic-datatype named QueueChannelCommands (more or less a bunch of Add/Get commands - some with AsyncReplyChannels attached)
When I profile the service (using ANTS Memory Profiler), I see arrays of the mentioned type (most having length 4, but growing), all empty (null), whose references seem to be held by Control.Mailbox.
I cannot see any reason in my code for this behaviour (it is the standard code you can find in every Mailbox example out there: just a loop with a let! = receive and a match, ending with a return! loop()).
Has anyone seen this kind of behaviour before or even knows how to handle this?
Or is this even a (known) bug?
Update: the growth of the arrays is really strange. It seems additional space is being appended without being used properly.
I am not an F# expert by any means, but maybe you can look at the first answer in this thread:
Does Async.StartChild have a memory leak?
The first reply mentions a tutorial for memory profiling on the following page:
http://moiraesoftware.com/blog/2011/12/11/fixing-a-hole/
But they mention this open source version of F#
https://github.com/fsharp/fsharp/blob/master/src/fsharp/FSharp.Core/control.fs
And I am not sure it is what you are looking for (about this open source version of F# in the last point), but maybe it can help you to find the source of the leak or prove that it is actually leaking memory.
Hope that helps somehow maybe ?
Tony
.NET has its own garbage collector, which works quite nicely.
The most common way to cause memory leaks in .NET is to register delegates (for example, event handlers) and never remove them when the subscribing object is disposed or finalized.
I have read several threads about memory issues in R and I can't seem to find a solution to my problem.
I am running a sort of LASSO regression on several subsets of a big dataset. For some subsets it works well, and for some bigger subsets it does not work, with errors of type "cannot allocate vector of size 1.6Gb". The error occurs at this line of the code:
example <- cv.glmnet(x=bigmatrix, y=price, nfolds=3)
It also depends on the number of variables that were included in "bigmatrix".
I tried R and R64 on a Mac and R on a PC, and recently moved to a faster virtual machine on Linux, thinking I would avoid any memory issues. It was better but still had some limits, even though memory.limit indicates "Inf".
Is there any way to make this work, or do I have to cut a few variables from the matrix or take a smaller subset of the data?
I have read that R looks for contiguous blocks of memory and that maybe I should pre-allocate the matrix. Any ideas?
Let me build slightly on what @richardh said. All of the data you load with R chews up RAM. So you load your main data and it uses some hunk of RAM. Then you subset the data so the subset is using a smaller hunk. Then the regression algo needs a hunk that is greater than your subset because it does some manipulations and gyrations. Sometimes I am able to better use RAM by doing the following:
1) save the initial dataset to disk using save()
2) take a subset of the data
3) rm() the initial dataset so it is no longer in memory
4) do analysis on the subset
5) save results from the analysis
6) totally dump all items in memory: rm(list=ls())
7) load the initial dataset from step 1 back into RAM using load()
8) loop steps 2-7 as needed
Be careful with step 6 and try not to shoot your eye out. That dumps EVERYTHING in R memory. If it's not been saved, it'll be gone. A more subtle approach would be to delete the big objects that you are sure you don't need and not do the rm(list=ls()).
If you still need more RAM, you might want to run your analysis in Amazon's cloud. Their High-Memory Quadruple Extra Large Instance has over 68GB of RAM. Sometimes when I run into memory constraints I find the easiest thing to do is just go to the cloud where I can be as sloppy with RAM as I want to be.
Jeremy Anglim has a good blog post that includes a few tips on memory management in R. In that blog post Jeremy links to this previous StackOverflow question which I found helpful.
I don't think this has to do with contiguous memory, but just that R by default works only in RAM (i.e., it does not spill data to disk). Farnsworth's guide to econometrics in R mentions the package filehash to enable writing to disk, but I don't have any experience with it.
Your best bet may be to work with smaller subsets, manage memory manually by removing variables you don't need with rm (i.e., run regression, store results, remove old matrix, load new matrix, repeat), and/or getting more RAM. HTH.
Try the bigmemory package. It is very easy to use. The idea is that the data are stored in a file on disk and you create an object in R as a reference to this file. I have tested this one and it works very well.
There are some alternatives as well, like "ff". See CRAN Task View: High-Performance and Parallel Computing with R for more information.