I'm struggling to find a word for a solution that is not streaming but instead processes everything in phases or stages, which means keeping everything in memory. "Processing in bulk" isn't quite the right term either.
To give an example: we currently have a mechanism that works on a list of identifiers with around a million entries. Either the whole list is loaded into memory, processed, and the memory is then freed, or, in a streaming solution, the list is read line by line, each line is processed immediately, and the memory footprint is therefore far smaller than in the first approach.
So what is the word to describe the first algorithm/solution?
I’m facing a similar naming issue, and decided to settle on “buffered”.
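For illustration, a minimal Python sketch of the two approaches (the file name and the process() function are placeholders):

def process(identifier):
    ...  # whatever work is done per identifier

# "Buffered" (phased/in-memory) approach: the whole list lives in memory at once.
with open("identifiers.txt") as f:
    identifiers = f.read().splitlines()  # ~1M entries held in memory
for ident in identifiers:
    process(ident)

# Streaming approach: only one line is in memory at a time.
with open("identifiers.txt") as f:
    for line in f:
        process(line.rstrip("\n"))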
I have a general question about dask.compute() that's motivated by a memory buildup I've been experiencing with the function. I'm using dask.compute() and map_partitions() (I've tried dask.distributed and dask.multiprocessing, the latter both with pool=ThreadPool and with pool=multiprocessing.pool) to apply a function that performs a series of operations to chunks of a dask dataframe. The output of the function is a relatively small matrix, but the operations within the function involve really large intermediate matrices. Despite deleting these intermediates, I get a memory buildup over time that eventually causes my kernel to die. This makes me wonder whether dask allocates jobs based only on the expected size of the final output, and not on the large intermediate calculations within the function, so that too many jobs are sent at once and memory blows up. Is this possible? Thanks for any insight into what might be going wrong.
There are a number of similar issues around (e.g., https://github.com/dask/distributed/issues/1795 and elsewhere). As in that issue, you may want to run typical Python memory-monitoring tools on the function first, to see whether this is intrinsic behaviour.
Essentially, people have been experiencing memory build-up when creating and deleting a large number of pandas dataframes, and this appears to be a pandas problem unrelated to dask, or maybe an even deeper-level malloc issue. The usual advice applies: make very sure that you do not keep references to the large intermediates alive, and call gc.collect() within your code.
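As a rough illustration of that advice (the dataframe, column name and per-partition computation below are made up), dropping references to the large intermediates and collecting inside the mapped function looks something like this:

import gc

import dask.dataframe as dd
import numpy as np
import pandas as pd

def summarize(partition: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical per-partition work that builds a large intermediate matrix.
    x = partition["x"].to_numpy()
    big = np.outer(x, x)                 # the large intermediate
    result = pd.DataFrame({"trace": [np.trace(big)]})
    del big, x                           # drop references to the intermediate...
    gc.collect()                         # ...and ask the collector to reclaim it promptly
    return result

df = dd.from_pandas(pd.DataFrame({"x": np.random.rand(100_000)}), npartitions=50)
out = df.map_partitions(summarize).compute()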
I read this question but it didn't really help.
First and most important: time performance is the focus of the application I'm developing.
We have a client/server model (even distributed or cloud if we wish) and a data structure D hosted on the server. Each client request consists in:
Read something from D
Possibly write something to D
Possibly delete something from D
We can say that in this application the relative frequency of the operations is delete << write << read. In addition:
Read ops absolutely cannot wait: they must be processed immediately
Write and delete can wait some time, but sooner is better.
Given the description above, no locking mechanism is acceptable: locking would imply that read operations could wait, which is not an option (sorry to stress it so much, but it's really a crucial point).
Consistency is not necessary: if a write/delete operation has been performed and then a read operation doesn't see the write/delete effect it's not a big deal. It would be better, but it's not required.
The solution should be data-structure-independent, so it shouldn't matter if we write on a vector, list, map or Donald Trump's face.
The data structure could occupy a big amount of memory.
My solution so far:
We use two servers: the first server (called f) holds a copy Df, and the second server (called s) holds a copy Ds that is kept up to date.
f answers client requests using Df and forwards the write/delete operations to s. s then applies those write/delete operations to Ds sequentially.
At a certain point, all future client requests are redirected to s. At the same time, f copies s's updated Ds into its own Df.
Now the roles of f and s are swapped: s answers client requests using Ds, and f keeps an updated copy by applying the operations forwarded from s. The swapping process is repeated periodically.
Notice that I omitted on purpose A LOT of details for simplicity (for example, once the swap has been done, f has to finish all the pending client requests before applying the write/delete operations received from s in the meantime).
Why do we need two servers? Because the data structure is potentially too big to fit in a single server's memory.
Now, my question is: is there a similar approach in the literature? I came up with this protocol in 10 minutes; I find it strange that no (better) solution along these lines has already been proposed!
PS: I may have forgotten some application specs; don't hesitate to ask for any clarification!
The scheme that you have works; I don't see any particular problem with it. This is basically how many databases run their HA solution: they apply a log of writes to replicas. This model affords a great deal of flexibility in how the replicas are formed, accessed and maintained. Failovers are easy, too.
An alternative technique is to use persistent data structures. Each write returns a new and independent version of the data. All versions can be read in a stable and lock-free way. Versions can be kept or discarded at will, and versions share as much of the underlying state as possible.
Usually, trees underlie such persistent data structures because it is easy to update a small part of the tree and reuse most of the old tree.
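As a toy illustration of that path-copying idea (not tied to any particular library), a persistent binary search tree in Python might look like the sketch below: each insert returns a new root, old versions remain readable without any locking, and untouched subtrees are shared between versions.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Node:
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def insert(root: Optional[Node], key: int, value: str) -> Node:
    # Path copying: only the nodes along the search path are recreated;
    # everything else is shared with the previous version.
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, insert(root.right, key, value))
    return Node(key, value, root.left, root.right)  # overwrite an existing key

def lookup(root: Optional[Node], key: int) -> Optional[str]:
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None

v1 = insert(None, 10, "a")
v2 = insert(v1, 5, "b")   # new version; v1 is untouched and still readable
assert lookup(v1, 5) is None and lookup(v2, 5) == "b"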
A reason you might not have found a more sophisticated approach is that your problem is extremely general: You want this to work with any data structure at all and the data can be big.
SQL Server Hekaton uses quite a sophisticated data structure to achieve lock-free, readable, point-in-time snapshots of any database contents. It may be worth a look at how they do it (they released a paper describing every detail of the system). They also allow for ACID transactions, serializability and concurrent writes, all lock-free.
At the same time, f copies s's updated Ds into its own Df.
This copy will take a long time because the data is big. It will block readers. A better approach is to apply the log of writes to the writable copy before accepting new writes there. That way reads can be accepted continuously.
The switchover is also a short period during which reads might have slightly higher latency than normal.
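To sketch the log-of-writes suggestion (all names below are made up, and a plain dict stands in for the arbitrary structure D): the serving replica answers reads from its copy immediately, appends every mutation to a log, and the other replica catches up by replaying that log instead of receiving a full copy.

from collections import deque

class Replica:
    def __init__(self):
        self.data = {}            # stands in for the arbitrary structure D
        self.write_log = deque()  # mutations not yet replayed on the peer

    def read(self, key):          # reads never wait on the log
        return self.data.get(key)

    def write(self, key, value):
        self.data[key] = value
        self.write_log.append(("write", key, value))

    def delete(self, key):
        self.data.pop(key, None)
        self.write_log.append(("delete", key, None))

    def drain_log_into(self, other: "Replica"):
        # Run periodically (or at switchover) so `other` catches up
        # without copying the whole data structure.
        while self.write_log:
            op, key, value = self.write_log.popleft()
            if op == "write":
                other.data[key] = value
            else:
                other.data.pop(key, None)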
When developing a parser for C++ using ANTLR, we built a batch parsing test case in which a new parser is constructed to parse each C++ source file. Performance is acceptable at the start, about 15 seconds per file, but after parsing some 150+ files each file takes longer and longer to parse, until finally the JVM throws a "GC overhead limit exceeded" error.
Using JProfiler, we found many ATNConfig objects accumulating progressively after each file is parsed. Starting from about 70M, they steadily pile up to beyond 500M until the heap is nearly full and GC takes 100% of the CPU time. The biggest objects (those that retain the most objects in the heap) identified by JProfiler include a DFA[] and a PredictionContextCache.
One thing to note is that we used two threads to run the parsing tasks concurrently. Although the threads don't share any parser or parse-tree objects, we noticed that the parser generated by ANTLR uses some static fields, which may contribute to the memory issue in a multi-threaded setup. That is just a suspicion, though.
Does anyone have a clue about what causes this accumulation of ATNConfig objects? Is there already a solution?
The ATNConfig instances are used for the dynamic DFA construction at runtime. The number of instances required is a function of the grammar and input. There are a few solutions available:
Increase the amount of memory you provide to the application. Try -Xmx12g as a starting point to see if the problem is too little memory for the application.
Each ATNConfig belongs to a DFA instance which represents the DFA for a particular decision. If you know the decision(s) which contain the most ATNConfig instances, you can work to simplify those decisions in the grammar.
Periodically clear the cached DFA, by calling Recognizer.clearDFA(). Note that clearing the DFA too often will hurt performance (if possible, do not clear the DFA at all).
You can use the "optimized" fork of the ANTLR 4. This fork of the project is designed to reduce the memory footprint, which can tremendously help performance for complicated grammars at the expense of speed for certain simple grammars.
I've found myself recently using the SemaphoreSlim class to limit the work in progress of a parallelisable operation on a (large) streamed resource:
// The below code is an example of the structure of the code, there are some
// omissions around handling of tasks that do not run to completion that should be in production code
SemaphoreSlim semaphore = new SemaphoreSlim(Environment.ProcessorCount * someMagicNumber);
foreach (var result in StreamResults())
{
    semaphore.Wait();
    var task = DoWorkAsync(result).ContinueWith(t => semaphore.Release());
    ...
}
This is to avoid bringing too many results into memory at once and the program being unable to cope (generally evidenced by an OutOfMemoryException). Though the code works and is reasonably performant, it still feels ungainly, notably the someMagicNumber multiplier, which, although tuned via profiling, may not be as close to optimal as it could be and isn't resilient to changes in the implementation of DoWorkAsync.
In the same way that thread pooling can overcome the obstacle of scheduling many things for execution, I would like something that can overcome the obstacle of scheduling many things to be loaded into memory based on the resources that are available.
Since it is deterministically impossible to decide whether an OutOfMemoryException will occur, I appreciate that what I'm looking for may only be achievable via statistical means or even not at all, but I hope that I'm missing something.
Here I'd say that you're probably overthinking this problem. The consequences of overshooting are rather high (the program crashes), while the consequences of setting the limit too low are that the program might be slowed down. As long as you have some buffer beyond a minimum value, further increases to the buffer will generally have little to no effect, unless the processing time of that task in the pipe is extraordinarily volatile.
If your buffer is constantly filling up, it generally means that the task before it in the pipe executes quite a bit faster than the task that follows it, so even a fairly small buffer is likely to ensure the following task always has some work. The buffer size needed to get 90% of the benefits of a buffer is usually quite small (a few dozen items, maybe), whereas the size needed to trigger an OOM error is something like six or more orders of magnitude higher. As long as you're somewhere in between those two numbers (and that's a pretty big range to land in), you'll be just fine.
Just run your static tests, pick a static number, maybe add a few percent extra for "just in case" and you should be good. At most, I'd move some of the magic numbers to a config file so that they can be altered without a recompile in the event that the input data or the machine specs change radically.
I'm returning A LOT (500k+) of documents from a MongoDB collection in Node.js. It's not for display on a website, but rather for some number crunching. If I grab ALL of those documents, the system freezes. Is there a better way to grab them all?
I'm thinking pagination might work?
Edit: This is already outside the main node.js server event loop, so "the system freezes" does not mean "incoming requests are not being processed"
After learning more about your situation, I have some ideas:
Do as much as you can in a Map/Reduce function in Mongo; if you can throw less data at Node, that might be the solution.
Perhaps this much data is eating all the memory on your system. Your "freeze" could be V8 stopping everything to do a garbage collection (see this SO question). You could use the V8 flag --trace-gc to log GCs and test this hypothesis (thanks to another SO answer about V8 and garbage collection).
Pagination, like you suggested, may help. Perhaps even split your data further into worker queues (create one worker task with references to records 1-10, another with references to records 11-20, etc.), depending on your calculation.
Perhaps pre-process your data, i.e. somehow return much smaller data for each record, or skip the ORM for this particular calculation if you're using one now. Making sure each record contains only the data you need means less data to transfer and less memory for your app.
I would put your big fetch+process task on a worker queue, background process, or forking mechanism (there are a lot of different options here).
That way you do your calculations outside of your main event loop and keep that free to process other requests. While you should be doing your Mongo lookup in a callback, the calculations themselves may take up time, thus "freezing" node - you're not giving it a break to process other requests.
Since you don't need them all at the same time (that's what I've deduced from your asking about pagination), perhaps it's better to split those 500k records into smaller chunks to be processed on nextTick?
You could also use something like Kue to queue the chunks and process them later (so that not everything is processed at the same time).
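For what it's worth, a rough sketch of the chunked, cursor-based fetch suggested above, written with pymongo purely for illustration (the database, collection and per-document work are made up; the Node.js driver's cursors allow the same batch-by-batch iteration):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["mycollection"]

total = 0
# batch_size keeps the driver from pulling the whole result set into memory
# at once; iterate and aggregate instead of materializing 500k documents.
for doc in coll.find({}, batch_size=10_000):
    total += doc.get("value", 0)  # stand-in for the real number crunching

print(total)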