Why does something like this run extremely slowly in Haskell?
test = [x|a<-[1..100],b<-[1..100],c<-[1..100],d<-[1..100],let x = a]
print $ length test
There are only about 10^8 combinations to run through; it should be done in the blink of an eye, but it seems to run forever and nearly crashes.
Are you running this in ghci or in a compiled program? It makes a big difference.
If in ghci, then ghci will keep the computed value of test around in case you want to use it later. Normally this is a good idea, but not in this case, where test is a huge value that would be cheap to recompute anyway.

How huge? For starters, it's a list of 10^8 elements, and (on a 64-bit system) a list costs 24 bytes per element, so that's 2.4 GB already. Then there is the space usage of the values themselves. One might think the values are all taken from [1..100], so they should be shared and use a negligible amount of space in total. But the values in the list are really of the form x, which could depend on a, b, c and d, and length never examines the values in the list as it traverses it. So each element is going to be represented as a closure that refers to a, b, c and d, which takes at least 8*(4+1) = 40 more bytes, bringing us to a total of 6.4 GB.
That's rather a lot, and the garbage collector has to do quite a lot of copying when you allocate 6.4 GB of data, all of it permanently live. That's what takes so long, not actually computing the list or its length.
If you compile the program
test = [x|a<-[1..100],b<-[1..100],c<-[1..100],d<-[1..100],let x = a]
main = print $ length test
then test does not have to be kept live as its length is being computed, as clearly it is never going to be used again. So now the GC has almost no work to do, and the program runs in a couple seconds (reasonable for ~10^8 list node allocations and computations on Integer).
You're not just running a loop 10^8 times; you're creating a list with 10^8 elements. Since you're using length, Haskell has to actually evaluate the entire list to return its length. Each element in the list takes at least one word, which might be 32 bits or might be 64 bits. On the optimistic assumption that it's 32 bits (4 bytes), you've just allocated 400 MB (about 381.5 MiB) of memory. If it's 64 bits, then that's 800 MB (about 763 MiB) of memory you've just allocated. Depending on what else is going on on your system, you might have just hit the swap file / swap partition by allocating that much RAM in one chunk.
If there are other subtleties going on, I'm not aware of them, but memory usage is my first suspicion for why this is so slow.
Related
I want to create an empty list of lists:
import numpy as np

shape = (70000, 70000)
corr = np.empty(shape).tolist()
How can I know how much RAM I need to hold this list on a 64-bit Windows operating system?
This will create a list of lists of floats. The RAM goes partly to the float objects themselves and partly to the references to them. Each reference is 8 bytes; a CPython float object actually takes 24 bytes (an 8-byte value plus object overhead), but even counting only 8 bytes per float, that makes 70000 * 70000 * 8 * 2 bytes (approx 78 GB) as a rough lower bound.
Lists look like this in memory: a list object points to a contiguous array of references, one per element, each reference pointing to the element object.
The 70001 list objects themselves also have overhead (they maintain a pointer to their storage array, and their own length), but this will be negligible in comparison (probably ~4 MB).
Also note that Python lists overallocate space by an implementation-dependent factor, so consider these numbers a lower bound. Memory is over-allocated so that there are always some free slots available, which makes appends and inserts faster. In CPython, the allocated space grows by roughly 12.5% each time a list fills up.
(defn sum [numbers]
  (reduce + numbers))
(def numbers (into [] (range 0 100000000)))
(time (sum numbers))
The above is the code that was run.
Simply adding up a lot of numbers.
This line was executed in the repl multiple times:
(time (sum numbers))
Each time, it got almost all cores fully busy.
Looking at jvisualvm, I did not see many threads being created.
But this code used all 12 of the hyperthreads available on my 6-core laptop.
What was happening behind the scenes that made this possible?
Thanks to the comments.
It has to do with the size of the range.
On my laptop, when it's around 70 million numbers, all is fine.
When it gets to around 80 million, the heap size grows a lot, the time taken grows very significantly, and all cores get to work. And VisualVM shows more GC activity happening.
So the comments above are probably right, it has to do with GC.
I am looking for a way to directly read the content of a file into the provided uninitialized byte array.
Currently, I have code like the following:
use std::fs::File;
use std::io::Read;
use std::mem::MaybeUninit;

let mut buf: MaybeUninit<[u8; 4096]> = MaybeUninit::zeroed();
let mut f = File::open("some_file")?;
f.read(unsafe { buf.as_mut_ptr().as_mut().unwrap() })?;
The code does work, except that it unnecessarily initializes the byte array with 0. I would like to replace MaybeUninit::zeroed() with MaybeUninit::uninit(), but doing so will trigger undefined behavior according to the documentation of MaybeUninit. Is there a way to initialize an uninitialized memory region with the content of the file without first reading the data to somewhere else, by only using the standard library? Or do we need to go for the OS-specific API?
The previous shot at the answer is kept below for posterity. Let's deal with the actual elephant in the room:
Is there a way to initialize an uninitialized memory region with the content of the file without first reading the data to somewhere else, by only using the standard library? Or do we need to go for the OS-specific API?
There is: Read::read_to_end(&mut self, &mut Vec<u8>)
This function will drain your impl Read object, and depending on the underlying implementation will do one or more reads, extending the Vec provided as it goes and appending all bytes to it.
It then returns the number of bytes read. Note that the default implementation retries on ErrorKind::Interrupted internally; other I/O errors still need to be handled.
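For illustration, here is a minimal sketch of that approach (it reuses the some_file name from the question; everything else is just scaffolding to make it runnable):

use std::fs::File;
use std::io::Read;

fn main() -> std::io::Result<()> {
    let mut f = File::open("some_file")?;
    // No buffer to pre-initialize ourselves: read_to_end grows the Vec as it reads.
    let mut buf = Vec::new();
    let n = f.read_to_end(&mut buf)?;
    println!("read {} bytes", n);
    Ok(())
}

If you have a rough idea of the file size, Vec::with_capacity avoids repeated reallocation without initializing the bytes up front.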
You are trying to micro-optimize something based on heuristics you think are the case, when they are not.
The initialization of the array is done in one go, as low-level as it can get, with memset, all in one chunk. Both calloc and malloc+memset are highly optimized; calloc relies on a trick or two to make it even more performant. Somebody on Code Review pitted "highly optimized code" against a naive implementation and lost as a result.
The takeaway is that second-guessing the compiler is typically fraught with issues and, overall, not worth micro-optimizing for unless you can put some real numbers on the issues.
The second takeaway is one of memory logic. As I am sure you are aware, allocation of memory is dramatically faster in some cases depending on the position of the memory you are allocating and the size of the contiguous chunk you are allocating, due to how memory is laid out in atomic units (pages). This is a much more impactful factor, to the point that under the hood, the allocator will often align your memory request to an entire page to avoid having to fragment it, particularly as it gets into the L1/L2 caches.
If anything isn't clear, let me know and I'll generate some small benchmarks for you.
Finally, MaybeUninit is not at all the tool you want for the job in any case. The point of MaybeUninit isn't to skip a memset or two, since you will be performing those memsets yourself by having to guarantee (by contract due to assume_init) that those types are sane. There are cases for this, but they're rare.
In larger cases
There is an impact on performance of uninitialized vs. initialized memory, and we're going to show this by taking an absolutely perfect scenario: we're going to make ourselves a 64M buffer in memory and wrap it in a Cursor so we get a Read type. This Read type will have far, far lower latency than most I/O operations you will encounter in the wild, since it is almost guaranteed to reside entirely in L2 cache during the benchmark cycle (due to its size) or in L3 cache (because we're single-threaded). This should allow us to notice the performance loss from memsetting.
We're going to run three versions for each case (the code):
One where we define our buffer as [MaybeUninit::uninit().assume_init(); N], i.e. we're taking N chunks of MaybeUninit<u8>
One where our MaybeUninit is a contiguous N-element long chunk
One where we're just mapping straight into an initialized buffer
The results (on a core i9-9900HK laptop):
large reads/one uninit time: [1.6720 us 1.7314 us 1.7848 us]
large reads/small uninit elements
time: [2.1539 us 2.1597 us 2.1656 us]
large reads/safe time: [2.0627 us 2.0697 us 2.0771 us]
small reads/one uninit time: [4.5579 us 4.5722 us 4.5893 us]
small reads/small uninit elements
time: [5.1050 us 5.1219 us 5.1383 us]
small reads/safe time: [7.9654 us 7.9782 us 7.9889 us]
The results are as expected:
Allocating N MaybeUninit is slower than one huge chunk; this is completely expected and should not come as a surprise.
Small, iterative 4096-byte reads are slower than a huge, single, 128M read even when the buffer only contains 64M
There is a small performance loss in reading using initialized memory, of about 30%
Opening anything else on the laptop while testing causes a 50%+ increase in benchmarked time
The last point is particularly important, and it becomes even more important when dealing with real I/O as opposed to a buffer in memory. The more layers of cache you have to traverse, the more side-effects you get from other processes impacting your own processing. If you are reading a file, you will typically encounter:
The filesystem cache (may or may not be swapped)
L3 cache (if on the same core)
L2 cache
L1 cache
Depending on the level of the cache that produces a cache miss, you're more or less likely to have your performance gain from using uninitialized memory dwarfed by the performance loss in having a cache miss.
So, the (unexpected TL;DR):
Small, iterative reads are slower
There is a performance gain in using MaybeUninit but it is typically an order of magnitude less than any I/O opt
I'm trying to implement a simple programming language. I want to make the user of it not have to manage the memory, so I decided to implement a garbage collector. The simplest way I can think of after checking out some material is like this:
There are two kinds of heap zones. The first is for storing big objects (bigger than 85,000 bytes), the other is for small objects. In the following I use BZ for the first, SZ for the second.
The BZ uses the mark and sweep algorithm, because moving a big object is expensive. I don't compact, so there will be fragmentation.
The SZ uses generations with mark-compact. There are three generations: 0, 1, and 2. Allocation requests go directly to generation 0, and when generation 0 is full, I will do garbage collection on it and promote the survivors to generation 1. Generations 1 and 2 will also be collected when they are full.
When the virtual machine starts, it will allocate a big chunk of memory from the OS to be used as the heap zone in the virtual machine. The BZ and every generation in the SZ will occupy a fixed portion of that memory, and when an allocation request can't be satisfied, the virtual machine will report an OOM (out of memory) error. This has a problem: when the virtual machine starts, even if the program running on it needs only a little memory, it still claims a lot. A better way would be for the virtual machine to get a small amount of memory from the OS, and then, when the program needs more memory, get more from the OS. I am going to allocate a larger block of memory for generation 2 in the SZ and copy everything in generation 2 to the new memory zone, and do the same thing for the BZ.
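For reference, the size-based routing described above might look something like this minimal sketch (the 85,000-byte threshold is from the description; the type and function names are purely illustrative assumptions):

const BIG_OBJECT_THRESHOLD: usize = 85_000;

#[derive(Debug, PartialEq)]
enum Zone {
    Big,       // BZ: mark-and-sweep, objects are never moved
    SmallGen0, // SZ generation 0: all new small objects start here
}

// Route an allocation request to a zone purely by its size.
fn choose_zone(request_size: usize) -> Zone {
    if request_size > BIG_OBJECT_THRESHOLD {
        Zone::Big
    } else {
        Zone::SmallGen0
    }
}

fn main() {
    assert_eq!(choose_zone(100_000), Zone::Big);
    assert_eq!(choose_zone(64), Zone::SmallGen0);
}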
The other problem occurs when the BZ is full and the SZ is empty: it would be silly not to be able to satisfy a big-object allocation request even though there is in fact enough free heap space for it in the SZ. How should I deal with this problem?
I am trying to understand your methodology. Since you didn't describe your strategy completely, I am making some assumptions.
NOTE: The following is hypothetical analysis and may not be practical, so please skip this answer if you don't have time.
You are trying to use a generational GC with some changes; objects are classified into two types:
(1) big objects, BZ, and
(2) small objects, SZ.
The SZ performs generational GC with compaction (mark-compact).
From the above we know that SZG2 holds long-lived objects. I expect that GC in SZG2 is not as frequent as in SZG1 or SZG0: long-lived objects tend to keep living, so there is less garbage to collect, and the size of SZG2 grows as time passes, so a collection takes a long time traversing all elements. Frequent GC on SZG2 is therefore less productive (a long GC pause, and so a noticeable delay for the user) than on SZG1 or SZG0.
Similarly, the BZ may have a large memory requirement (as big objects occupy more space). So, to address your query:
"The other problem occurs when the BZ is full and SZ is empty, I would be silly not be able to satisfy a big object allocation request even though we in fact have enough free heap size for the big object in SZ. How to deal with this problem?"
Since you said that "when the program needs more memory the virtual machine will get more from the OS", I have a small idea. It may not be productive, or may not be possible to implement, and it depends completely on your implementation of the GCHeap structure.
Let your virtual machine lay out memory as follows.
As for feasibility, I borrowed the idea from the classic "memory segments of a program" layout, in which the heap and the stack grow towards each other; something similar is possible at a low level here.
As shown in figure A, the GCHeap structure has to be defined in such a way that SZG0 and the BZ expand towards each other. To implement the GCHeap structure of figures A and B, we need a proper convention for how the SZG[0-2] zones and the BZ grow.
So if you want to divide your application's heap into multiple heaps, you can pile figure A on top of figure B to decrease fragmentation (by fragmentation I mean the situation you described: the BZ is full and the SZ is empty, yet a big-object allocation request cannot be satisfied even though there is in fact enough free heap space for it in the SZ).
So the effective structure will be:
B
|
B
|
B
|
B
|
A
Now it depends completely on heuristics whether to split the GCHeap data structure into multiple GCHeap structures, like GCHeapA and GCHeapB, or to keep it as a single heap, based on your requirements.
If you don't want to have multiple heaps, then you can use figure A with a small correction: set the base address of SZG2 to the top of the heap.
The key reasoning behind figure A is as follows:
We know that SZG0 gets GC'ed frequently, so it tends to have more free space than SZG1 and SZG2, since dead objects are removed and surviving objects are moved up to SZG1 and SZG2. So if the BZ needs to allocate more, it can grow towards SZG0.
In figure A the base addresses of SZG1 and SZG2 are contiguous because SZG2 is more prone to out-of-memory errors: old-generation objects tend to live longer and GC'ing does not sweep much there (NOTE: this is just my assumption and opinion), so SZG2 is kept bounded at the outer end.
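To make the "SZG0 and BZ expand towards each other" idea concrete, here is a minimal sketch of a single shared region with two bump pointers (all names and the address arithmetic are illustrative assumptions, not part of any particular collector):

// Illustrative sketch only: one contiguous region shared by SZG0 (small objects,
// grows up from the bottom) and BZ (big objects, grows down from the top).
struct DualBumpRegion {
    szg0_next: usize, // next free byte for small objects (grows upward)
    bz_next: usize,   // lower bound of the big-object area (grows downward)
}

impl DualBumpRegion {
    fn new(base: usize, size: usize) -> Self {
        Self { szg0_next: base, bz_next: base + size }
    }

    // Small allocation: bump upward; None means an SZG0 collection is needed.
    fn alloc_small(&mut self, size: usize) -> Option<usize> {
        if self.bz_next - self.szg0_next >= size {
            let addr = self.szg0_next;
            self.szg0_next += size;
            Some(addr)
        } else {
            None
        }
    }

    // Big allocation (bigger than 85,000 bytes in the question): bump downward.
    fn alloc_big(&mut self, size: usize) -> Option<usize> {
        if self.bz_next - self.szg0_next >= size {
            self.bz_next -= size;
            Some(self.bz_next)
        } else {
            None
        }
    }
}

fn main() {
    // Hypothetical base address and a 1 MiB region, just to exercise the sketch.
    let mut region = DualBumpRegion::new(0x1_0000, 1 << 20);
    println!("{:?}", region.alloc_small(64));
    println!("{:?}", region.alloc_big(100_000));
}

The point is that the slack between the two bump pointers is shared, so a big-object request can use space that SZG0 is not currently occupying, which is exactly the fragmentation situation quoted above.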
I came across an interview question which asks: when searching for a value in an array using 2 parallel threads, which method would be more efficient?
(1) Each thread reads one half of the array (splitting it in half).
(2) The threads read the array at odd and even positions (one thread reads the odd indices and the other reads the even indices).
I don't understand why one would be more efficient than the other.
I'd appreciate it if someone could clarify this for me. Thanks in advance.
Splitting the array in half is almost certainly the way to go. It will almost never be slower, and may be substantially faster.
The reason is fairly simple: when you're reading data from memory, the processor will normally read an entire cache line at a time. The exact size varies between processors, but doesn't matter a whole lot (though, in case you care, something like 64 bytes would be in the ballpark) -- the point is that it reads a contiguous chunk of several bytes at a time.
That means that with the odd/even version, both processors running the two threads will have to read all the data. By splitting the data in half, each core reads only half the data. If your split doesn't happen to fall on a cache line boundary, each will read a little extra (what it needs, rounded up to the size of a cache line), but on average that adds only half a cache line to what each needs to read.
If the "processors" involved are really two cores on the same processor die, chances are that it won't make a whole lot of difference either way though. In this case, the bottleneck will normally be reading the data from main memory into the lowest-level processor cache. Even with only one thread, you'll (probably) be able to search through the data as fast as you can read it from memory, and adding more threads (no matter how you arrange their use of the data) isn't going to improve things much (if at all).
The difference is that in the half-split case, the memory is accessed linearly by each thread from left to right, searching indices 0 -> N/2 and N/2 -> N respectively, which maximizes cache usage, since memory is prefetched linearly ahead of the scan.
In the second case (even/odd), cache performance would be worse, not only because you would be fetching items you do not use (each cache line thread 0 pulls in contains elements 0, 1, 2, ..., but it uses only half of them), but also because of cache ping-pong effects (these occur with writes, which your example does not do).
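To make the half-split strategy concrete, here is a minimal sketch (the function name and example data are my own, purely for illustration): each thread scans one contiguous half, so each core pulls in only the cache lines it actually needs.

use std::thread;

// Each thread linearly scans its own contiguous half of the slice.
fn parallel_contains(data: &[i32], target: i32) -> bool {
    let (left, right) = data.split_at(data.len() / 2);
    thread::scope(|s| {
        let left_search = s.spawn(|| left.contains(&target));
        let right_found = right.contains(&target); // current thread takes the right half
        left_search.join().unwrap() || right_found
    })
}

fn main() {
    let data: Vec<i32> = (0..1_000_000).collect();
    println!("{}", parallel_contains(&data, 987_654));
}

With an odd/even split, both threads would end up touching every cache line of the array, which is exactly the extra memory traffic described in the answers above.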