Is there a way to use fastText's word representation process in parallel? - nlp

I am new to fastText, a library for efficient learning of word representations and sentence classification. I am trying to generate word vectors for a huge data set, but in a single process it takes a very long time.
So let me put my questions clearly:
Are there any options I can use to speed up a single fastText process?
Is there any way to generate word vectors in parallel fastText processes?
Are there any other implementations or workarounds available that could solve the problem? I read that a Caffe2 implementation exists, but I am unable to find it.
Thanks

I understand your question to mean that you would like to distribute fastText and do parallel training.
As mentioned in Issue #144
... a future feature we might consider implementing. For now it's not on our list of priorities, but it might very well soon.
Apart from the Word2Vec Spark implementation also mentioned there, I am not aware of any other implementations.

The original FastText release by Facebook includes a command-line option thread, default 12, which controls the number of worker threads which will do parallel training (on a single machine). If you have more CPU cores, and haven't yet tried increasing it, try that.
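If you use the official fastText Python bindings rather than the command line, the same option is exposed as a thread parameter. A minimal sketch (the file names are placeholders):

    import fasttext

    # Equivalent to the CLI: ./fasttext skipgram -input data.txt -output model -thread 16
    # "data.txt" stands in for your pre-tokenized training corpus.
    model = fasttext.train_unsupervised("data.txt", model="skipgram", thread=16)
    model.save_model("vectors.bin")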
The gensim implementation (as gensim.models.fasttext.FastText) includes an initialization parameter, workers, which controls the number of worker threads. If you haven't yet tried increasing it, up to the number of cores, it may help. However, due to extra multithreading bottlenecks in its Python implementation, if you have a lot of cores (especially 16+), you might find maximum throughput with fewer workers than cores – often something in the 4-12 range. (You have to experiment & watch the achieved rates via logging to find the optimal value, and all cores won't be maxed.)
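As a minimal sketch of the gensim route (parameter names assume gensim 4.x, where size became vector_size; the toy corpus is only for illustration):

    from gensim.models import FastText

    # Any iterable of token lists works; a real corpus should be a streaming
    # iterator over a pre-tokenized file rather than an in-memory toy list.
    sentences = [["fast", "text", "word", "vectors"],
                 ["parallel", "training", "with", "worker", "threads"]]

    model = FastText(sentences=sentences, vector_size=50, min_count=1,
                     workers=8)  # experiment: 4-12 workers often beats "all cores"
    print(model.wv["parallel"])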
You'll only get significant multithreading in gensim if your installation is able to make use of its Cython-optimized routines. If you watch the logging when you install gensim via pip or similar, there should be a clear error if this fails. Or, if you are watching logs/output when loading/using gensim classes, there will usually be a warning if the slower non-optimized versions are being used.
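In gensim 3.x-era installs you can also check this programmatically; a quick sketch (FAST_VERSION is -1 when the slow pure-Python code paths are in use):

    from gensim.models.word2vec import FAST_VERSION

    # >= 0 means the Cython-optimized routines are active; -1 means the
    # slow pure-Python fallback (expect little benefit from extra workers).
    print(FAST_VERSION)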
Finally, in many gensim workflows the bottleneck is the corpus iterator or I/O rather than the parallelism itself. To minimize this slowdown:
Check how fast your corpus iterator can run over all examples on its own, separate from passing it to the gensim class (see the timing sketch after this list).
Avoid doing any database-selects or complicated/regex preprocessing/tokenization in the iterator – do it once, and save the easy-to-read-as-tokens resulting corpus somewhere.
If the corpus is coming from a network volume, test if streaming it from a local volume helps. If coming from a spinning HD, try an SSD.
If the corpus can be made to fit in RAM, perhaps on a special-purpose giant-RAM machine, try that.
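As a rough way to do the first check above, time a bare pass over the corpus with no model involved; a sketch with a hypothetical file-based iterator:

    import time

    # Hypothetical corpus iterator: one pre-tokenized text per line.
    class Corpus:
        def __init__(self, path):
            self.path = path

        def __iter__(self):
            with open(self.path, encoding="utf-8") as f:
                for line in f:
                    yield line.split()

    start = time.time()
    n_texts = sum(1 for _ in Corpus("corpus.txt"))
    print(f"Iterated {n_texts} texts in {time.time() - start:.1f}s (no training)")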

Related

How to properly combine multithreading and multiprocessing?

I'm using a Python program that reads and fits an observed curve (for context, the light spectrum of many stars), applying spectrum models so I can retrieve parameters such as the mean age of this group of stars and save them to a file.
I need to apply this program over many spectra while additionally calculating errors using Monte Carlo simulations, which means each fit or calculation has to be repeated ~100 times. In summary, my program takes a lot of computing power and time. Therefore, I'm trying to implement multiprocessing and multithreading in the code and optimize the mechanism to make it as fast as possible.
I don't know much, and that's why I'm here. I would like to know what common implementations could help my program. I read that multiprocessing is used mainly for CPU-bound tasks, so I was thinking of creating a pool of processes for the Monte Carlo simulations. Maybe you can give me some more tips.
Python threads cannot run CPU-bound code in parallel, thanks to the GIL (global interpreter lock).
You might be looking for a multiprocessing solution, since you won't gain much from async (you're doing a lot of computation, which is not I/O bound).
There’s the library multiprocessing, and you might consider task query mechanism such as RabbitMQ, if you have multiple servers, for scale.

Multi-thread usage in Dymola slows down solution speed

Does using the multi-core functionality in Dymola 2020x always speed up the solution? My observation is that with Advanced.ParallelizeCode=true for a model with ~23k degrees of freedom, compile time is comparable to a single thread, but solution time with the default solver is slower.
Any comments are appreciated!
Multi-core functionality of a single model does not always speed up execution.
There are a number of possible explanations:
There are so many dependencies that it isn't possible to parallelize at all. (Look at translation log - this is fairly clear).
It's only possible to parallelize a small part of the model. (Look at translation log - this takes more time).
The model uses many external functions (or FMUs), and by default Dymola treats them as critical sections. (See the release notes and manual on __Dymola_ThreadSafe and __Dymola_CriticalRegion.)
In versions before Dymola 2020x you might have to set the environment variable OMP_WAIT_POLICY=PASSIVE. (Shouldn't be needed in your version.)
Using decouple as described in https://www.claytex.com/tech-blog/decouple-blocks-and-model-parallelisation/ can help for the first two.
Note that an alternative to parallelization within the model is to parallelize a sweep of parameters (if that is your scenario). That is done automatically for parameter sweeps, and without any of these drawbacks.

CogComp-NLP: Is it multi-threadable?

Is it possible to run the CogComp-NLP pipeline on lots of corpora in a multi-threaded fashion? I don't see any mention of thread-safety in their readme, unfortunately. Thoughts on this issue are appreciated.
Speaking only to the Named Entity Recognition feature, it is thread safe; I have used it in a parallel workflow engine to process millions (90 million or so) of documents without problems. I can't speak authoritatively on the other capabilities in that system, and there are many. I would further characterize NER's multi-threading capabilities as "re-entrant", meaning you can reuse a single instance across multiple threads. The feature vectors tend to be large with these sorts of systems, so save yourself some memory footprint and share a single instance of the NER model across multiple threads.

Use SimPy to simulate Chord distributed system

I am doing some research on several distributed systems such as Chord, and I would like to be able to write algorithms and run simulations of the distributed system with just my desktop.
In the simulation, I need to be able to have each node execute independently and communicate with each other, while manually inducing elements such as lag, packet loss, random crashes etc. And then collect data to estimate the performance of the system.
After some searching, I find SimPy to be a good candidate for my purpose.
Would SimPy be a suitable library for this task?
If yes, what are some suggestions/caveats for implementing such a system?
I would say yes.
I used SimPy (version 2) for simulating arbitrary communication networks as part of my doctorate. You can see the code here:
https://github.com/IncidentNormal/CommNetSim
It is, however, a bit dense and not very well documented. Also it should really be translated to SimPy version 3, as 2 is no longer supported (and 3 fixes a bunch of limitations I found with 2).
Some concepts/ideas I found to be useful:
Work out what you want out of the simulation before you start implementing it; communication network simulations are incredibly sensitive to small design changes, as you are effectively trying to monitor/measure emergent behaviours from the system.
It's easy to start over-engineering the simulation; using native SimPy objects is almost always sufficient once you strip away the noise from your design.
Use Stores to simulate media for transferring packets/payloads. There is an example like this for simulating latency in the SimPy docs (a similar sketch follows after this list): https://simpy.readthedocs.io/en/latest/examples/latency.html
Events are tricky: they can only fire once per simulation step, so they are often a source of bugs, since behaviour is effectively lost if multiple things fire the same event in a step. For robustness, try not to use them to represent behaviour in communication networks (you rarely need something that low-level); as mentioned above, use Stores instead, as these act like queues by design.
Pay close attention to the probability distributions you use to generate randomness. Exponential (expovariate) distributions are usually closer to natural systems than uniform distributions, but sanity-check every distribution you use. Generated network traffic usually follows a Poisson distribution, for example, and data volume often follows a power-law (Pareto) distribution.
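A minimal sketch of the Store-as-medium idea, in the spirit of the latency example from the SimPy docs (SimPy 3+ API; the delays and message contents are made up):

    import simpy

    class Cable:
        """A point-to-point link: a packet put in is delivered `delay` later."""
        def __init__(self, env, delay):
            self.env = env
            self.delay = delay
            self.store = simpy.Store(env)

        def _deliver(self, packet):
            yield self.env.timeout(self.delay)
            self.store.put(packet)

        def put(self, packet):
            self.env.process(self._deliver(packet))

        def get(self):
            return self.store.get()

    def sender(env, cable):
        for i in range(3):
            yield env.timeout(5)
            cable.put(f"packet {i} sent at t={env.now}")

    def receiver(env, cable):
        while True:
            packet = yield cable.get()
            print(f"t={env.now}: received {packet}")

    env = simpy.Environment()
    cable = Cable(env, delay=10)
    env.process(sender(env, cable))
    env.process(receiver(env, cable))
    env.run(until=60)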

FMU co-simulation using openMP or pThread

Say I have a vehicle model: the chassis will be used as a master FMU, while its engine, transmission, tires, etc. come from third parties and I want to use them as slave FMUs. I want to parallelize the model this way: the master FMU is put on the main thread, and everything else is forked onto other threads.
I want to know if this simple idea is achievable by using FMUs exported from Dymola...
If possible, is it worthwhile doing it? I wonder if the parallel model is as efficient as a sequential one at the physics level. (I understand that a badly parallelized program is slower than a sequential one, but I just need to know if it is physically slower or faster.)
The latest Dymola has built-in OpenMP features; has anyone ever used them? What do they look like?
I found a paper about this: Master for Co-Simulation Using FMI http://www.ep.liu.se/ecp/063/014/ecp11063014.pdf
I think it can make perfect sense to launch several FMUs in parallel if they can do their job separately. What is difficult in co-simulation is understanding when the simulators must be synchronized (for instance to exchange information). These synchronizations should be minimal to increase efficiency, but frequent enough to avoid having to roll back the simulator states (when possible). Also, it has a chance of working when you have causal relations between your FMUs. If you have acausal relations, this is a different story...
technically, I would say:
for 1), you can always launch an FMU in a thread if you want, no problem with that (a minimal sketch follows below)
for 2), it mainly depends on the number and frequency of the synchronizations required between the different FMUs
for 3), I do not know, but I think you should distinguish between launching different FMUs in parallel and making one FMU parallel...
my two cents
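For point 1), here is a minimal sketch of launching each slave FMU on its own worker while the master stays on the main thread. run_fmu is a hypothetical stand-in for whatever API you use to simulate an FMU (e.g. one exported from Dymola), since the exact call depends on your tooling:

    from concurrent.futures import ThreadPoolExecutor

    def run_fmu(name, stop_time):
        """Hypothetical wrapper: load the named FMU and simulate it to stop_time.
        Replace the body with calls into your FMU master/runtime of choice."""
        print(f"simulating {name} until t={stop_time}")
        return f"{name}: done"

    slaves = ["engine.fmu", "transmission.fmu", "tires.fmu"]  # placeholder names

    with ThreadPoolExecutor(max_workers=len(slaves)) as pool:
        futures = [pool.submit(run_fmu, fmu, 10.0) for fmu in slaves]
        # The master FMU could be stepped here on the main thread while the
        # slaves run; the synchronization points are where the real difficulty lies.
        results = [f.result() for f in futures]

    print(results)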
