Parallelizing python3 program with huge complex objects - multithreading

Intro
I have a quite complex Python program (say, more than 5,000 lines) written in Python 3.6. It parses a huge dataset of more than 5,000 files, processes them to build an internal representation of the dataset, and then computes statistics. Since I have to test the model, I need to save this representation, and for now I do it through serialization with dill (the representation contains objects that pickle does not support). Serializing the whole dataset, uncompressed, takes about 1 GB.
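For reference, the save/load step uses dill's drop-in pickle interface. A minimal sketch, where dataset is a hypothetical stand-in for the real in-memory representation:

    import dill

    dataset = {"stats": [1, 2, 3]}  # stand-in for the parsed representation

    # serialize the whole representation (uncompressed, ~1 GB in practice)
    with open("dataset.dill", "wb") as f:
        dill.dump(dataset, f)

    # later, restore it for testing
    with open("dataset.dill", "rb") as f:
        dataset = dill.load(f)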
The problem
Now, I would like to speed up computation through parallelization. The ideal approach would be multithreading, but the GIL forbids that. The multiprocessing module (and multiprocess, its dill-compatible counterpart) uses serialization to share complex objects between processes, so in the best scheme I could devise, parallelization gains me nothing in time performance because of the huge size of the dataset.
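To make the bottleneck concrete, here is a minimal sketch (with a hypothetical process_chunk worker standing in for the real work): multiprocessing pickles every argument sent to a worker and every result sent back, so with objects of this size the transfer cost swallows the parallel speedup.

    from multiprocessing import Pool

    def process_chunk(chunk):
        # placeholder for the real CPU-bound processing
        return sum(len(record) for record in chunk)

    if __name__ == "__main__":
        chunks = [["abc", "de"], ["fgh", "ij"]]  # stand-in for the real data
        with Pool() as pool:
            # every chunk is serialized to a worker process and every
            # result is serialized back; with ~1 GB of objects this
            # transfer dominates the runtime
            results = pool.map(process_chunk, chunks)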
The question
What is the best way to manage this situation?
I know about posh, but it seems to be x86-only; ray, but it uses serialization too; gilectomy (a version of Python without the GIL), but I was not able to make it parallelize threads; and Jython, which has no GIL but is not compatible with Python 3.x.
I am open to any alternative, in any language, however complex, but I can't rewrite the code from scratch.

The best solution I found was to replace dill with a custom pickling module based on the standard pickle. See here: Python 3.6 pickling custom procedure
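The core of that approach is teaching the standard pickle how to rebuild the unpicklable objects instead of serializing them directly, typically via __reduce__. A minimal sketch, with a hypothetical Node class holding an unpicklable file handle:

    import pickle

    class Node:
        def __init__(self, path):
            self.path = path
            self.handle = open(path, "a")  # open file handles are not picklable

        def __reduce__(self):
            # tell pickle to rebuild the object by calling Node(self.path)
            # on load, instead of trying to serialize the open handle
            return (Node, (self.path,))

    blob = pickle.dumps(Node("example.txt"))
    restored = pickle.loads(blob)  # a fresh Node with a fresh handle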

Related

Is it possible to vectorize a function in NodeJS the same way it can be done in Python with Pandas?

To be more specific, I am talking about performing operations over whole rows or columns or matrices instead of scalars, in a (very) efficient way (no need to iterate over the items of the object).
I'm pretty new to NodeJS and I'm coming from Python, so sorry if this is something obvious. Are there any libraries in NodeJS, equivalent to Pandas, that allow doing this?
Thanks
Javascript doesn't give direct access to all SIMD instructions in your computer. Those are the instructions that allow parallel computation on multiple elements of an array.
It offers some packages like math.js for clear expression of your algorithms, debugged code, and some optimization work. math.js expresses matrices as arrays-of-arrays, so it may or may not be the best way to go.
It has really good just-in-time compilation.
The compilation is friendly to loop unrolling.
If you absolutely positively need screamingly fast performance in the Javascript world, there's always WebAssembly: It offers some SIMD instructions. But it takes a lot of tooling.
An attempt to add SIMD to the Javascript standard has been abandoned in favor of WebAssembly.

How to incorporate custom functions into tf.data pipe-lining process for maximum efficiency

So tf.image, for example, already has some elementary image-processing methods implemented, which I'd assume are optimized. The question is: as I'm iterating through a large dataset of images, what is the recommended way of applying a more complex function to every image (in batches, of course; for example a patch-wise 2-D DCT) so that it fits as well as possible into the whole tf.data framework?
Thanks in advance.
P.S. Of course I could use the map method, but I'm asking beyond that: if I pass a function written in pure numpy to map, it wouldn't help as much.
The current best approach (short of writing custom ops in C++/CUDA) is probably to use https://www.tensorflow.org/api_docs/python/tf/contrib/eager/py_func. This allows you to write any TF eager code and use Python control flow statements. With this you should be able to do most of the things you can do with numpy. The added benefit is that you can use your GPU and the tensors you produce in tfe.py_func will be immediately usable in your regular TF code - no copies are needed.
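A minimal sketch of the wiring (TF 1.x; patchwise_op is a placeholder standing in for the real per-image function, such as a patch-wise 2-D DCT):

    import tensorflow as tf

    def patchwise_op(image):
        # any eager TF code, including Python control flow, goes here;
        # placeholder for the real per-image transform
        return image * 2.0

    def map_fn(image):
        # wrap the eager function so the graph-mode pipeline can call it
        return tf.contrib.eager.py_func(func=patchwise_op,
                                        inp=[image], Tout=tf.float32)

    images = tf.zeros([100, 32, 32, 1])  # stand-in dataset
    dataset = (tf.data.Dataset.from_tensor_slices(images)
               .map(map_fn, num_parallel_calls=4)
               .batch(32))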

Why devectorization in Julia is encouraged?

Seems like writing devectorized code is encouraged in Julia.
There is even a package that tries to do that for you.
My question is why?
First of all, from the user-experience standpoint, vectorized code is more concise (less code, hence less likelihood of bugs), clearer (hence easier to debug), and a more natural way of writing code (at least for someone coming from a scientific-computing background, whom Julia tries to cater to). Being able to write something like vector'vector or vector'Matrix*vector is very important, because it corresponds to the actual mathematical notation, and this is how scientific-computing people think of it in their heads (not as nested loops). And I hate the fact that this is not the best way to write it, and that reparsing it into loops would be faster.
At the moment it seems like there is a conflict between the goal of writing the code that is fast and the code that is concise/clear.
Secondly, what is the technical reason for this? OK, I understand that vectorized code creates extra temporaries, etc., but vectorized functions (for example, broadcast(), map(), etc.) have the potential to be multithreaded, and I think the benefit of multithreading can outweigh the overhead of temporaries and the other disadvantages of vectorized functions, making them faster than regular for loops.
Do current implementations of vectorized functions in Julia do implicit multithreading under the hood?
If not, is there work / plans to add implicit concurrency to vectorized functions and to make them faster than loops?
For easy reading I decided to turn my comment marathon above into an answer.
The core development statement behind Julia is "we are greedy". The core devs want it to do everything, and do it fast. In particular, note that the language is supposed to solve the "two-language problem", and at this stage, it looks like it will accomplish this by the time v1.0 hits.
In the context of your question, this means that everything you are asking about is either already a part of Julia, or planned for v1.0.
In particular, this means that if your programming problem lends itself to vectorized code, then write vectorized code. If it is more natural to use loops, use loops.
By the time v1.0 hits, most vectorized code should be as fast, or faster, than equivalent code in Matlab. In many cases, this development goal has already been achieved, since many vector/matrix operations in Julia are sent to the appropriate BLAS routines by the compiler.
Regarding multi-threading, native multi-threading is currently being implemented for Julia, and I believe an experimental set of routines is already available on the master branch. The relevant issue page is here. Implicit multithreading for some vector/matrix operations is already in theory available in Julia, since Julia calls BLAS. I'm not sure if it is switched on by default.
Be aware, though, that many vectorized operations will still (currently) be much faster in MATLAB, since MathWorks has been writing specialised multi-threaded C libraries for years and calling them under the hood. Once Julia has native multi-threading, I expect Julia to overtake MATLAB, since at that point the entire dev community can scour the standard Julia packages and upgrade them to take advantage of it wherever possible.
In contrast, the MATLAB language does not have native multi-threading, so you are relying on MathWorks to provide specialised multi-threaded routines in the form of underlying C libraries.
You can and should write vector'*matrix*vector (or perhaps dot(vector, matrix*vector) if you prefer a scalar output). For things like matrix multiplication, you're much better off using vectorized notation, because it calls the underlying BLAS libraries which are more heavily optimized than code produced by basically any language/compiler combination.
In other places, as you say, you can benefit from devectorization by avoiding temporary intermediates: for example, if x is a vector, the expression
y = exp(x).*x + 5
creates 3 temporary vectors: one for a = exp(x), one for b = a.*x and one for y = b + 5. In contrast,
y = [exp(z)*z+5 for z in x]
creates no temporary intermediates. Since loops and comprehensions in Julia are fast, there is no disadvantage to writing the devectorized version, and in fact it should perform slightly better (especially with performance annotations like @simd, where appropriate).
The arrival of threads may change things (making vectorized exp faster than a "naive" exp), but in general I'd say you should regard this as an "orthogonal" issue: Julia will likely make multithreading so easy to use that you yourself might write operations using multiple threads, and consequently the vectorized "library" routine still has no advantage over code you might write yourself. In other words, you might use multiple threads but still write devectorized code to avoid those temporaries.
In the longer term, a "sufficiently smart compiler" may avoid temporaries by automatically devectorizing some of these operations, but this is a much harder route, with potential traps for the unwary, than it may seem.
Your statement that "vectorized code is always more concise, and easier to understand" is, however, not true: many times while writing Matlab code, you have to go to extremes to come up with a vectorized way of writing what are actually simple operations when thought of in terms of loops. You can search the mailing lists for countless examples; one that I recall on SO is How to find connected components in a matrix using Julia.

Implementing Stackless Python

I really admire the functionality of Stackless Python, and I've been looking around for a way to emulate its syntax while still using the standard Python 3 interpreter. An article by Alex J. Champandard on a gamedev blog made it look as though the greenlet library could provide this functionality. I slightly modified his code, but the best makeshift tasklet wrapper I could come up with was a class holding a greenlet inside an attribute, like so:
import greenlet

_scheduled = []  # module-level run queue the wrapper registers itself in

class tasklet():
    def __init__(self, function=None, *variables):
        global _scheduled
        self.greenlet = greenlet.greenlet(function, None)
        self.functioncall = function  # redundant backup
        self.variables = variables
        _scheduled.append(self)
        self.blocked = False
The function then emulates Stackless' scheduling by passing the variables to the greenlet when calling its switch() method.
So far this appears to work, but I'd like to be able to call the tasklets in original Stackless syntax, e.g. tasklet(function)(*args), as opposed to the current syntax of tasklet(function,*args). I'm not sure where to look in the documentation to find out how to accomplish this. Is this even possible, or is it part of Stackless' changes to the interpreter?
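For what it's worth, the call syntax itself doesn't require Stackless' interpreter changes. A minimal sketch (assuming the same greenlet import and _scheduled list as above) defers the argument binding to a __call__ method that returns self:

    class tasklet():
        def __init__(self, function=None):
            self.greenlet = greenlet.greenlet(function, None)
            self.functioncall = function
            self.variables = ()
            self.blocked = False
            _scheduled.append(self)

        def __call__(self, *variables):
            # bind the arguments in a second step, so the object can be
            # created and called Stackless-style: tasklet(function)(*args)
            self.variables = variables
            return self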
According to this article from 2010-01-08 (with fixed links):
Stackless Python is an extended version of the Python language (and its CPython reference implementation). New features include lightweight coroutines (called tasklets), communication primitives using message passing (called channels), manual and/or automatic coroutine scheduling, not using the C stack for Python function calls, and serialization of coroutines (for reloading in another process). Stackless Python could not be implemented as a Python extension module – the core of the CPython compiler and interpreter had to be patched.

greenlet is an extension module to CPython providing coroutines and low-level (explicit) scheduling. The most important advantage of greenlet over Stackless Python is that greenlet could be implemented as a Python extension module, so the whole Python interpreter doesn't have to be recompiled in order to use greenlet. Disadvantages of greenlet include speed (Stackless Python can be 10%, 35% or 900% faster, depending on the workflow); possible memory leaks if coroutines have references to each other; and that the provided functionality is low-level (i.e. only manual coroutine scheduling, no message passing provided).

greenstackless, the Python module I've recently developed, provides most of the (high-level) Stackless Python API using greenlet, so it eliminates the disadvantage of greenlet that it is low-level. See the source code and some tests (the latter with tricky corner cases). Please note that although greenstackless is optimized a bit, it can be much slower than Stackless Python, and it also doesn't fix the memory leaks. Using greenstackless is thus not recommended in production environments; but it can be used as a temporary, drop-in replacement for Stackless Python if replacing the Python interpreter is not feasible.
Some other software that emulates Stackless using greenlet:
Concurrence: doesn't support stackless.main, tasklet.next, tasklet.prev, tasklet.insert, tasklet.remove, or stackless.schedule_remove, and doesn't send exceptions properly. (Because these features are missing, it doesn't pass the unit test above.)
PyPy: doesn't support stackless.main, tasklet.next, or tasklet.prev; doesn't pass the unit test above.

API compatibility between scala and python?

I have read a dozen pages of docs, and it seems that:
I can skip learning the Scala part
the API is completely implemented in Python (I don't need to learn Scala for anything)
the interactive mode works as completely and as quickly as the Scala shell, and troubleshooting is equally easy
Python modules like numpy can still be imported (no crippled Python environment)
Are there areas where PySpark falls short that would make this impossible?
In recent Spark releases (1.0+), we've implemented all of the missing PySpark features listed below. A few new features are still missing, such as Python bindings for GraphX, but the other APIs have achieved near parity (including an experimental Python API for Spark Streaming).
My earlier answers are reproduced below:
Original answer as of Spark 0.9
A lot has changed in the seven months since my original answer (reproduced at the bottom of this answer):
Spark 0.7.3 fixed the "forking JVMs with large heaps" issue.
Spark 0.8.1 added support for persist(), sample(), and sort().
The upcoming Spark 0.9 release adds partial support for custom Python -> Java serializers.
Spark 0.9 also adds Python bindings for MLlib (docs).
I've implemented tools to help keep the Java API up-to-date.
As of Spark 0.9, the main missing features in PySpark are:
zip() / zipPartitions.
Support for reading and writing non-text input formats, like Hadoop SequenceFile (there's an open pull request for this).
Support for running on YARN clusters.
Cygwin support (PySpark works fine under Windows PowerShell or cmd.exe, though).
Support for job cancellation.
Although we've made many performance improvements, there's still a performance gap between Spark's Scala and Python APIs. The Spark users mailing list has an open thread discussing its current performance.
If you discover any missing features in PySpark, please open a new ticket on our JIRA issue tracker.
Original answer as of Spark 0.7.2:
The Spark Python Programming Guide has a list of missing PySpark features. As of Spark 0.7.2, PySpark is currently missing support for sample(), sort(), and persistence at different StorageLevels. It's also missing a few convenience methods added to the Scala API.
The Java API was in sync with the Scala API when it was released, but a number of new RDD methods have been added since then and not all of them have been added to the Java wrapper classes. There's a discussion about how to keep the Java API up-to-date at https://groups.google.com/d/msg/spark-developers/TMGvtxYN9Mo/UeFpD17VeAIJ. In that thread, I suggested a technique for automatically finding missing features, so it's just a matter of someone taking the time to add them and submit a pull request.
Regarding performance, PySpark is going to be slower than Scala Spark. Part of the performance difference stems from a weird JVM issue when forking processes with large heaps, but there's an open pull request that should fix that. The other bottleneck comes from serialization: right now, PySpark doesn't require users to explicitly register serializers for their objects (we currently use binary cPickle plus some batching optimizations). In the past, I've looked into adding support for user-customizable serializers that would allow you to specify the types of your objects and thereby use specialized serializers that are faster; I hope to resume work on this at some point.
PySpark is implemented using a regular cPython interpreter, so libraries like numpy should work fine (this wouldn't be the case if PySpark was written in Jython).
It's pretty easy to get started with PySpark; simply downloading a pre-built Spark package and running the pyspark interpreter should be enough to test it out on your personal computer and will let you evaluate its interactive features. If you like to use IPython, you can use IPYTHON=1 ./pyspark in your shell to launch Pyspark with an IPython shell.
I'd like to add some points about why many people who have used both APIs recommend the Scala API. It's very difficult for me to do this without pointing out just general weaknesses in Python vs Scala and my own distaste of dynamically typed and interpreted languages for writing production quality code. So here are some reasons specific to the use case:
Performance will never be quite as good as Scala's: not by orders of magnitude, but by a noticeable fraction, partly because Python is interpreted. This gap may widen in the future as Java 8 and JIT technology become part of the JVM and Scala.
Spark is written in Scala, so debugging Spark applications, learning how Spark works, and learning how to use Spark is much easier in Scala because you can just quite easily CTRL + B into the source code and read the lower levels of Spark to suss out what is going on. I find this particularly useful for optimizing jobs and debugging more complicated applications.
Now, my final point may seem like just a Scala-vs-Python argument, but it's highly relevant to the specific use case, namely scale and parallel processing. Scala actually stands for Scalable Language, and many interpret this to mean it was specifically designed with scaling and easy multithreading in mind. It's not just about lambdas: it's the head-to-toe features of Scala that make it the perfect language for doing Big Data and parallel processing. I have some data-science friends who are used to Python and don't want to learn a new language, so they stick to their hammer. Python is a scripting language; it was not designed for this specific use case. It's an awesome tool, but the wrong one for this job. The result is obvious in the code: their code is often 2-5x longer than my Scala code, as Python lacks a lot of features, and they find it harder to optimize their code because they are further away from the underlying framework.
Let me put it this way, if someone knows both Scala and Python, then they will nearly always choose to use the Scala API. The only people IME that use Python are those that simply do not want to learn Scala.
