Does joblib's memoization support being called from multiple tasks? - scikit-learn

If a function that is being memoized is called in parallel from two jobs, what happens? Is one call's result saved and the other retrieved, or do both run without using each other's results? Or is this case not supported at all?
I couldn't find a reference to this in the documentation.

If a result has already been computed and saved (by the same process or by a concurrent process) it is reused.
If 2 concurrent processes compute the same result for the first time, the first process to complete saves the result to disk for later reuse; the second process uses its own computed result that first time and can reuse the cached result later.
Also the cache is preserved on the hard drive after a Python program ends so that it can be reused when the same script / program is restarted later.
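A minimal sketch of that behaviour, combining joblib.Memory with joblib.Parallel (the cache directory name and the toy function are arbitrary): whichever worker finishes first writes the result to disk, and any later call with the same arguments reads it back from the cache.

    from joblib import Memory, Parallel, delayed

    memory = Memory("./joblib_cache", verbose=0)  # on-disk cache, survives restarts

    @memory.cache
    def costly(x):
        # stand-in for an expensive computation
        return x ** 2

    # Both workers may compute costly(3) the first time this runs;
    # on subsequent runs the result is simply loaded from ./joblib_cache.
    results = Parallel(n_jobs=2)(delayed(costly)(3) for _ in range(2))
    print(results)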

Related

adding elements to a queue using multithreading

Consider you are running simulations and each simulation writes its results to an output.txt file. I want to run thousands of simulations using multithreading, but even when I use locking and unlocking I still get errors when multiple threads access the file at the same time.
To solve this, I am going to add the result texts to a queue that stores them. That is, each thread will add its result to this queue instead of writing it to output.txt, and at the end I'll take the stored texts from the queue and write them to output.txt.
My question is: when multiple threads add items to a queue like this, do you think an error might happen in the end, like a missing simulation result? I ask because when you increment a single value from multiple threads, the value will not be incremented as many times as you expect unless you are careful (e.g. in a multithreaded for loop that adds +1 to a previously declared int a 1000 times, a will not end up as 1000 unless you synchronize the updates).
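For what it's worth, Python's queue.Queue is thread-safe, so a minimal sketch of the approach described above would look like the following (the simulation body is a hypothetical placeholder):

    import queue
    import threading

    results = queue.Queue()  # thread-safe; put/get need no manual locking

    def run_simulation(sim_id):
        # hypothetical placeholder for the real simulation work
        results.put("simulation %d: done" % sim_id)

    threads = [threading.Thread(target=run_simulation, args=(i,)) for i in range(100)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # A single writer drains the queue after all threads have finished,
    # so only one thread ever touches output.txt.
    with open("output.txt", "w") as f:
        while not results.empty():
            f.write(results.get() + "\n")

Unlike the unsynchronized a += 1 example, Queue.put acquires an internal lock, so concurrent additions do not overwrite or lose one another.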

How to cache imported modules once off for multiple runs of a python script?

I have a Python script that I would like to run every minute using a cron job. The script imports some Python modules and config files each time it is run. The issue is that these imports carry a large overhead (1-2 minutes). (The modules and files are relatively small; the total size is only 15 MB, so they can easily fit in memory.)
Once everything is imported, the rest of the script runs relatively quickly (about 0.003 seconds; it's not computationally demanding).
Is it possible to cache all the imports, once, the very first time the script is run, so that all subsequent times the script is run there is no need to import the modules and files again?
No, you can't. You would have to use persistent storage, such as shelve, or an in-memory store such as an in-memory SQLite database, where you would store any expensive computations that should persist between sessions, and then simply read those results back from disk or memory on each run, depending on the storage you choose.
Also note that modules are in fact cached on import to improve load time, though on disk rather than in memory, as .pyc files under __pycache__. The import machinery itself is generally insignificant, so if your imports take that long it is not because of the import itself but because of the computations performed inside those modules, and those are what you should optimise.
The reason you can't do what you want is that, for data to stay in memory, the process must keep running. Memory belongs to the process running the script, and once that script finishes, the memory is freed. See here for additional details regarding your issue.
You can't just run a script and leave its computations sitting in memory until some later run: firstly, the system has no way of knowing when that next run will be (it might be 1 minute later, it might be 1 year later), and secondly, if that were possible, you would quickly run out of memory as different scripts from different applications across the OS (it's not just your program out there) filled memory with the results of their computations.
So you can either run your code in an indefinite loop with sleep (and keep the process active) or you can use a crontab and store your previous results somewhere.
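For the "store your previous results somewhere" option, here is a minimal sketch using shelve; the shelf file name and the setup function are made up for illustration:

    import shelve
    import time

    def expensive_setup():
        # hypothetical stand-in for the slow work currently done at import time
        time.sleep(2)
        return {"config": "loaded"}

    # The shelf lives on disk, so it survives between cron invocations.
    with shelve.open("script_cache") as db:
        if "setup" not in db:
            db["setup"] = expensive_setup()  # paid only on the first run
        data = db["setup"]                   # later runs read it back from disk

    print(data)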

Bi-Threaded processing in Matlab

I have a Large-Scale Gradient Descent optimization problem that I am running using Matlab. The code has got two parts:
A sequential update part that fires every iteration and updates the parameter vector.
A validation error computation part that fires every 10 iterations or so, using the parameter value at the end of the corresponding iteration in which it's fired.
The way I am running this now is to do (1) and (2) sequentially. But (2) takes a lot of time and it's not the core part of my routine - I made it just to check the progress and plot the error of my model. Is it possible in Matlab to run (2) in parallel with (1)? Please note that (1) cannot be run in parallel since it performs a sequential update, so a simple 'parfor' is not a solution, unless there is a really smart way of doing that.
I don't think Matlab has any way of multi-threading outside of the (rather restricted) Parallel Computing Toolbox. There is a workaround which may help you, though:
Open 2 sessions of Matlab, sessions A and B (or instances, or workspaces, whatever you call them).
Matlab session A:
Calculates 10 iterations of your sequential process (1)
Saves the result in a file (adequately and uniquely named)
Goes on to calculate the next 10 iterations (back to the top of this loop basically)
In parallel:
Matlab session B:
Checks periodically for the existence of the file written by session A (define a timer that does this at a time interval which makes sense for your process, a few seconds or a few minutes ...)
If the file exists => loads it, then does the validation computation (your process (2)) and displays/reports the results.
Note: this only works if process (1) doesn't need the result of process (2) to run its iterations, but if that were the case I don't know how you could parallelise it anyway.
If you have multiple cores on your machine this should run smoothly; if you have a single core, the 2 sessions will have to share it and you will see a performance impact.
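The handoff itself is language-agnostic; as a rough illustration of session B's polling loop, here is a sketch in Python (no MATLAB code appears on this page, and the file name, poll interval and validation routine are all placeholders):

    import os
    import time

    RESULT_FILE = "latest_params.mat"  # hypothetical file written by session A
    POLL_SECONDS = 30                  # pick an interval that suits your process

    def validate(path):
        # hypothetical stand-in for the validation-error computation (2)
        print("validating parameters from", path)

    while True:
        if os.path.exists(RESULT_FILE):
            validate(RESULT_FILE)
            os.remove(RESULT_FILE)     # consume the file so it isn't re-processed
        time.sleep(POLL_SECONDS)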

Synchronized calls across different computers

I have three Linux boxes, each running my program.
The program needs to call a certain callback at regular intervals, and each call must happen at the exact same time across the three boxes. I don't need any other synchronization except for the calls.
If it helps, the three boxes have their clocks synchronized by NTP (one of the boxes is the master).
Is there a way to accomplish this with good precision? Preferably non-Linux-specific. To keep things simple, the callback must be called every N ms even if a previous call hasn't completed yet.
How about sending a request to execute the function far enough ahead of time, including a timestamp at which the function should be executed? The receiving application would sleep/wait for the remaining time (some time is lost to latency), then execute the function at the precise timestamp you requested.
If the called function itself takes longer than your interval, you should probably consider using threads. If the function executes quickly but transferring the results takes longer, you should be able to get away with something like select() without additional threads.
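A sketch of the receiving side of that scheme, assuming the coordinator sends a plain UNIX timestamp (the two-second lead time here is arbitrary):

    import time

    def wait_and_fire(target_ts, callback):
        # Clocks are assumed to be NTP-synchronized, as in the question.
        remaining = target_ts - time.time()
        if remaining > 0:
            time.sleep(remaining)
        callback()

    # The coordinator picks a moment 2 seconds from now and sends the same
    # timestamp to every box; each box fires its callback at that instant.
    target = time.time() + 2.0
    wait_and_fire(target, lambda: print("callback fired at", time.time()))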

Good approaches for queuing simultaneous NodeJS processes

I am building a simple application to download a set of XML files and parse them into a database using the async module (https://npmjs.org/package/node-async) for flow control. The overall flow is as follows:
Download list of datasets from API (single Request call)
Download metadata for each dataset to get link to XML file (async.each)
Download XML for each dataset (async.parallel)
Parse XML for each dataset into JSON objects (async.parallel)
Save each JSON object to a database (async.each)
In effect, for each dataset there is a parent process (2) which sets off a series of asynchronous child processes (3, 4, 5). The challenge I am facing is that, because so many parent processes fire before all of the children of a particular process are complete, child processes seem to get queued up in the event loop, and it takes a long time for all of the child processes of a particular parent to resolve and allow garbage collection to clean everything up. The result is that even though the program doesn't appear to have any memory leaks, memory usage is still too high, ultimately crashing the program.
One solution which worked was to make some of the child processes synchronous so that they can be grouped together in the event loop. However, I have also seen an alternative solution discussed here: https://groups.google.com/forum/#!topic/nodejs/Xp4htMTfvYY, which pushes parent processes into a queue and only allows a certain number to be running at once. My question, then, is: does anyone know of a more robust module for handling this type of queueing, or any other viable alternative for handling this kind of flow control? I have been searching but so far no luck.
Thanks.
I decided to post this as an answer:
Don't launch all of the processes at once. Let the callback of one request launch the next one. The overall work is still asynchronous, but each request gets run in series. You can then pool up a certain number of the connections to be running simultaneously to maximize I/O throughput. Look at async.eachLimit and replace each of your async.each examples with it.
Your async.parallel calls may be causing issues as well.
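The question is about Node's async library, but the bounded-concurrency idea behind async.eachLimit is the same in any language. A rough Python analogue, with the worker function and the limit of 5 made up for illustration:

    from concurrent.futures import ThreadPoolExecutor
    import time

    def download_and_parse(dataset_id):
        # hypothetical stand-in for steps 2-5 of the pipeline above
        time.sleep(0.1)
        return "dataset %d stored" % dataset_id

    # At most 5 datasets are in flight at any moment, so the children of earlier
    # parents finish (and can be collected) before new parents start.
    with ThreadPoolExecutor(max_workers=5) as pool:
        for line in pool.map(download_and_parse, range(20)):
            print(line)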
