Multithreading in spaCy v3.0

I'm trying to use spaCy to train a named entity recogniser, and it seems to be using only one thread. I did some research, and it appears to be a Cython and GIL issue, with the number of threads configurable in the 'pipe' method. I also found this very useful article: https://explosion.ai/blog/multithreading-with-cython
The problem is that it was written for spaCy v2, and v3.0 completely changed the format, with config files and whatnot. I'm on Linux on a machine with 8 threads. How can I successfully multithread with spaCy v3.0?

Multithreading has not been supported since spaCy v2. The current alternative is distributed training with Ray, using the spacy-ray package: https://spacy.io/usage/training#parallel-training
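For applying a pipeline to many texts (as opposed to training one), process-based parallelism is a separate option: in spaCy v3, nlp.pipe accepts an n_process argument. A minimal sketch, assuming spaCy v3 is installed; the blank English pipeline, the sample texts, and the helper name count_tokens are illustrative and not from the question:

```python
import spacy


def count_tokens(texts, n_process=2):
    """Tokenize texts with a blank English pipeline and return token counts."""
    nlp = spacy.blank("en")  # tokenizer only, so no trained model is needed
    # n_process sets the number of worker processes (not threads) for nlp.pipe
    docs = nlp.pipe(texts, n_process=n_process, batch_size=50)
    return [len(doc) for doc in docs]


if __name__ == "__main__":
    # Hypothetical sample data, repeated to give the workers something to split
    sample = ["Apple is looking at buying a startup.", "San Francisco is foggy."] * 100
    print(sum(count_tokens(sample)))
```

This does not speed up training itself; it only spreads inference over several processes, which sidesteps the GIL at the cost of copying documents between processes.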

Related

Running Stable Diffusion on Graphcore IPUs

I have been looking for a version of Stable Diffusion that can run on IPUs. Currently (due to the high availability) I can only find CUDA-based ones.
Now I wonder whether there is a way to run CUDA-based scripts, trainers, learning code, etc. on an IPU, for example through a translation layer in between.
I doubt there is, and since I cannot find an IPU version, I bet I'll have to modify the scripts :(.
There is the HuggingFace optimum library which acts as the interoperability layer for transformers to run on IPUs. You can find Stable Diffusion there.
For other models that are not supported in the library, there's a guide on how you could modify your script to make it IPU-compatible here

'Doc2Vec' object has no attribute 'neg_labels' when trying to use pretrained model

I'm trying to use a pretrained Doc2Vec model for my semantic search project. I tried this one, https://github.com/jhlau/doc2vec (English Wikipedia DBOW), with the forked version of Gensim (0.12.4) and Python 2.7.
It works fine when I use most_similar, but when I try to use infer_vector I get this error:
AttributeError: 'Doc2Vec' object has no attribute 'neg_labels'
What can I do to make this work?
For reasons given in this other answer, I'd recommend against using a many-years-old custom fork of Gensim, and also find those particular pre-trained models a little fishy in their sizes to actually contain all the purported per-article vectors.
But also: that error resembles a very-old bug which only showed up if Gensim was not fully installed to have the necessary Cython-optimized routines for fast training/inference operations. (That caused some older, seldom-run code to be run that had a dependency on the missing neg_labels. Newer versions of Gensim have eliminated that slow code-path entirely.)
My comment on an old Gensim issue has more details, and a workaround that might help - but really, the much better thing to do for quality results & speedy code is to use a current Gensim, & train your own model.

What are Torch Scripts in PyTorch?

I've just found that PyTorch docs expose something that is called Torch Scripts. However, I do not know:
When should they be used?
How should they be used?
What are their benefits?
Torch Script is one of two modes of using the PyTorch just in time compiler, the other being tracing. The benefits are explained in the linked documentation:
Torch Script is a way to create serializable and optimizable models from PyTorch code. Any code written in Torch Script can be saved from your Python process and loaded in a process where there is no Python dependency.
The above quote is actually true of both scripting and tracing. So:
You gain the ability to serialize your models and later run them outside of Python, via LibTorch, a C++ native module. This allows you to embed your DL models in various production environments like mobile or IoT. There is an official guide on exporting models to C++ here.
PyTorch can compile your jit-able modules rather than running them as an interpreter, allowing for various optimizations and improving performance, both during training and inference. This is equally helpful for development and production.
Regarding Torch Script specifically, in comparison to tracing, it is a subset of Python, specified in detail here, which, when adhered to, can be compiled by PyTorch. It is more laborious to write Torch Script modules instead of tracing regular nn.Module subclasses, but it allows for some extra features over tracing, most notably flow control like if statements or for loops. Tracing treats such flow control as "constant" - in other words, if you have an if model.training clause in your module and trace it with training=True, it will always behave this way, even if you change the training variable to False later on.
To answer your first question: you need to use the JIT if you want to deploy your models outside Python; otherwise, you should use it if you want to gain some execution performance at the price of extra development effort (as not every model can be straightforwardly made JIT-compliant). In particular, you should use Torch Script if your code cannot be jitted with tracing alone because it relies on features such as if statements. For maximum ergonomics, you probably want to mix the two on a case-by-case basis.
Finally, for how they should be used, please refer to all the documentation and tutorial links.
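The difference between tracing and scripting described above can be sketched with a small function whose branch depends on its input. This is a minimal illustration, assuming a recent PyTorch install; the function name shift and the constants are made up for the example:

```python
import warnings

import torch


def shift(x):
    # Data-dependent branch: which side runs depends on the values in x
    if x.sum() > 0:
        y = x + 10
    else:
        y = x - 10
    return y


with warnings.catch_warnings():
    # Tracing warns that the tensor-to-bool conversion will be baked in
    warnings.simplefilter("ignore")
    # Only the branch taken for this example input (x + 10) is recorded
    traced = torch.jit.trace(shift, torch.ones(3))

# Scripting compiles the function source, so both branches are kept
scripted = torch.jit.script(shift)

negative = -torch.ones(3)
print(traced(negative))    # replays the branch recorded during tracing
print(scripted(negative))  # evaluates the condition for this input
```

With a negative input, the traced version still adds 10 because the condition was frozen at trace time, while the scripted version takes the else branch.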

Parallelizing python3 program with huge complex objects

Intro
I have a fairly complex Python program (more than 5,000 lines) written in Python 3.6. It parses a huge dataset of more than 5,000 files, processes them into an internal representation of the dataset, and then computes statistics. Since I have to test the model, I need to save the dataset representation, and I currently do so by serializing it with dill (the representation contains objects that pickle does not support). The serialized dataset, uncompressed, takes about 1 GB.
The problem
Now I would like to speed up the computation through parallelization. The ideal approach would be multithreading, but the GIL prevents that. The multiprocessing module (and multiprocess, its dill-compatible counterpart) uses serialization to share complex objects between processes, so even in the best case I have managed to devise, parallelization does not improve run time because of the huge size of the dataset.
The question
What is the best way to manage this situation?
I know about posh, but it seems to be x86-only; ray, but it uses serialization too; gilectomy (a version of Python without the GIL), but I could not get it to parallelize threads; and Jython, which has no GIL but is not compatible with Python 3.x.
I am open to any alternative, any language, however complex it may be, but I can't rewrite the code from scratch.
The best solution I found was to replace dill with a custom pickling procedure based on the standard pickle module. See here: Python 3.6 pickling custom procedure
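One way such a custom procedure can look is to teach the problematic classes how to pickle themselves with __getstate__ and __setstate__, so the standard pickle module can be used instead of dill. This is only a sketch under assumptions, not necessarily the approach in the linked answer; the Record class and its lambda attribute are hypothetical stand-ins for objects that plain pickle rejects:

```python
import pickle


class Record:
    """Hypothetical dataset item holding a lambda, which plain pickle rejects."""

    def __init__(self, name, factor):
        self.name = name
        self.factor = factor
        self.scale = lambda x: x * factor  # not picklable by the standard module

    def __getstate__(self):
        # Drop the unpicklable attribute; everything else is plain data
        state = self.__dict__.copy()
        del state["scale"]
        return state

    def __setstate__(self, state):
        # Restore the plain data, then rebuild the lambda from it
        self.__dict__.update(state)
        factor = self.factor
        self.scale = lambda x: x * factor


record = Record("sample", 3)
payload = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)
print(restored.name, restored.scale(5))
```

Whether this is faster than dill for a given dataset has to be measured; the sketch only shows how the unsupported attributes can be excluded and reconstructed.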

scikit-learn task management library

Update:
After some more searching, I think I am overusing scikit-learn. If I want production ML tooling, I should use something like Mahout, which is built on Hadoop. scikit-learn is more of a toy tool for experimenting with ideas.
I am new to scikit-learn. I am using it to train a model, and I want to experiment with different feature combinations and data pre-processing techniques. Each experiment takes a few hours (to minimize error, I run every experiment 10 times with different train-test splits), so I wrote a Python script that runs the experiments one by one automatically and sends me an email when each one is done.
It works well. Today I found another server that is available to run my experiments, so it seems reasonable to write a script that runs experiments in a distributed fashion. There are big data platforms like Hadoop, but as far as I can tell they are not meant for Python and scikit-learn (please point it out if my understanding of Hadoop is wrong).
Because scikit-learn is an "old" library, I think there should be existing libraries with the capabilities I want. Or am I heading in the wrong direction with scikit-learn?
I tried googling "scikit-learn task management", but nothing I wanted turned up. Other keywords to search for are also very welcome.
See "Experimentation frameworks" at http://scikit-learn.org/dev/related_projects.html
