I've tried to understand exactly where the speedup comes from when I run Python libraries like pandas or scikit-learn on Spark, but I haven't found anything particularly informative. If I can get the same speedup without using PySpark DataFrames, can I just deploy code that uses pandas and expect it to perform roughly the same?
I suppose my question is:
If I have working pandas code should I translate it to PySpark for efficiency or not?
If you are asking whether you get any speedup by running arbitrary Python code on the driver node, the answer is no. The driver is a plain Python interpreter; it doesn't affect your code in any "magic" way.
If I have working pandas code should I translate it to PySpark for efficiency or not?
If you want the benefits of distributed computing, you have to rewrite your code using distributed primitives. However, it is not a free lunch:
Your problem might not distribute well.
Even if it does, the amount of data might not justify distribution.
In other words - if your code works just fine with Pandas or Scikit Learn, there is little chance you'll get anything from rewriting it to Spark.
I understand that when vectorization is involved, pyspark.sql.functions.pandas_udf will be faster than pyspark.sql.functions.udf.
But what if vectorization isn't involved, are the two supposed to be similar in performance? Is there any guideline for choosing between the two?
Pandas UDFs should be faster in most cases, primarily because data is encoded more efficiently between the Spark JVM and the Python process, so it's recommended to use Pandas UDFs as much as possible.
"Normal" UDFs can be used where Pandas UDFs cannot; for example, right now Pandas UDFs don't work with MapType, arrays of TimestampType, and nested StructType.
P.S. When using PySpark, it may also make sense to evaluate Koalas. In my own tests, Koalas was ~2 times faster than similar code that used Pandas UDFs, although carefully written PySpark code was still faster.
Is there a way to know how much time a code will take to finish? or an approximation
I am thinking of something like what you see when copying a file in Windows, where it says how much time is left, or when you download something and it tells you approximately how long it will take.
Is there a way to do this for a spark code? from something very simple like queries, to more complex code
Thanks
The Spark developers considered implementing this but decided against it because the completion time of stragglers is too uncertain to predict. See the discussion in this Spark issue: https://issues.apache.org/jira/browse/SPARK-5216
So you will not get that information from Spark. Instead you must implement your own estimation model.
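As a rough idea of what such an estimation model could look like, here is a minimal sketch that extrapolates linearly from the average time per finished task. The class name ProgressEstimator and its methods are hypothetical, not part of Spark, and feeding it task counts (for example from Spark's status tracker or the web UI) is left out. It assumes tasks take roughly equal time, which is exactly the assumption stragglers break, so treat the result as a ballpark figure.

```python
import time


class ProgressEstimator:
    """Naive ETA based on the average time per finished task (hypothetical helper)."""

    def __init__(self, total_tasks, clock=time.monotonic):
        if total_tasks <= 0:
            raise ValueError("total_tasks must be positive")
        self.total_tasks = total_tasks
        self.completed = 0
        self._clock = clock          # injectable so the estimator is easy to test
        self._start = clock()        # time at which the job was started

    def task_done(self, n=1):
        """Record that n more tasks have finished."""
        self.completed = min(self.total_tasks, self.completed + n)

    def eta_seconds(self):
        """Return the estimated seconds remaining, or None if nothing has finished yet."""
        if self.completed == 0:
            return None
        elapsed = self._clock() - self._start
        seconds_per_task = elapsed / self.completed
        return seconds_per_task * (self.total_tasks - self.completed)
```

You would call task_done() whenever your job reports progress and poll eta_seconds() to display the estimate. A more robust model would probably weight recent tasks more heavily or estimate per stage.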
> data2_tbl <- copy_to(sc, FB_tbl) #sc as spark connection
> idx <- tk_index(data2_tbl)
Warning message:
In tk_index.default(data2_tbl) :
`tk_index` is not designed to work with objects of class tbl_spark.
I have a couple of questions to the group:
Does sparklyr have support on time series like they have on the other ml_* algorithms?
We also tried and found the spark-ts package that supports time series in Spark.
I have not found good material on how to use it. Does anyone have documentation or experience with it?
Does sparklyr have support on time series like they have on the other ml_* algorithms?
It doesn't, because Spark doesn't. All ml_ or ft_ methods are just simple wrappers around corresponding Spark algorithms.
We also tried and found the spark-ts package that supports time series in Spark.
At the moment there is no actively developed, open source time series analysis tool for Spark. Neither spark-timeseries nor flint seems to be maintained anymore.
This partially reflects Spark's computing model, which is a poor fit for time series processing. Expressing sequential relationships in Spark is hard and usually expensive, and many time series analysis techniques are simply a bad fit for distributed processing because of their global dependencies.
Intro
I have a fairly complex Python program (more than 5,000 lines) written in Python 3.6. It parses a huge dataset of more than 5,000 files, builds an internal representation of the dataset, and then computes statistics. Since I have to test the model, I need to save the dataset representation, and currently I do so by serializing it with dill (the representation contains objects that pickle does not support). The serialized dataset, uncompressed, takes about 1 GB.
The problem
Now I would like to speed up the computation through parallelization. The ideal approach would be multithreading, but the GIL prevents that. The multiprocessing module (and multiprocess, which is dill-compatible) uses serialization to share complex objects between processes, so even in the best setup I managed to come up with, parallelization gives me no speedup because of the huge size of the dataset.
The question
What is the best way to manage this situation?
I know about posh, but it seems to be x86-only; ray, but it uses serialization too; gilectomy (a version of Python without the GIL), but I could not get it to parallelize threads; and Jython, which has no GIL but is not compatible with Python 3.x.
I am open to any alternative, any language, however complex it may be, but I can't rewrite the code from scratch.
The best solution I found is to replace dill with a custom pickling module based on standard pickle. See here: Python 3.6 pickling custom procedure
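To give an idea of what "custom pickling on top of standard pickle" can mean, here is a minimal sketch. It is not the code from the linked question. The Record class is a hypothetical stand-in for a dataset object that holds something standard pickle rejects (a lambda); the __getstate__ / __setstate__ hooks drop that attribute before pickling and rebuild it afterwards.

```python
import pickle


class Record:
    """Hypothetical dataset entry holding a lambda, which standard pickle cannot serialize."""

    def __init__(self, values, scale):
        self.values = values
        self.scale = scale
        self.transform = self._make_transform(scale)

    @staticmethod
    def _make_transform(scale):
        # The lambda is the attribute that makes plain pickle fail.
        return lambda x: x * scale

    def __getstate__(self):
        # Pickle everything except the lambda.
        state = self.__dict__.copy()
        del state["transform"]
        return state

    def __setstate__(self, state):
        # Restore the plain attributes, then rebuild the lambda from them.
        self.__dict__.update(state)
        self.transform = self._make_transform(self.scale)


if __name__ == "__main__":
    original = Record([1, 2, 3], scale=2)
    restored = pickle.loads(pickle.dumps(original))
    print(restored.values, restored.transform(5))
```

Whether this is faster than dill for a given dataset is something you would have to measure. The sketch only shows how objects can be kept compatible with the standard pickler.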
I have read a dozen pages of docs, and it seems that:
I can skip learning the scala part
the API is completely implemented in python (I don't need to learn scala for anything)
the interactive mode works as completely and as quickly as the scala shell and troubleshooting is equally easy
python modules like numpy will still be imported (no crippled python environment)
Are there fall-short areas that will make it impossible?
In recent Spark releases (1.0+), we've implemented all of the missing PySpark features listed below. A few new features are still missing, such as Python bindings for GraphX, but the other APIs have achieved near parity (including an experimental Python API for Spark Streaming).
My earlier answers are reproduced below:
Original answer as of Spark 0.9
A lot has changed in the seven months since my original answer (reproduced at the bottom of this answer):
Spark 0.7.3 fixed the "forking JVMs with large heaps" issue.
Spark 0.8.1 added support for persist(), sample(), and sort().
The upcoming Spark 0.9 release adds partial support for custom Python -> Java serializers.
Spark 0.9 also adds Python bindings for MLlib (docs).
I've implemented tools to help keep the Java API up-to-date.
As of Spark 0.9, the main missing features in PySpark are:
zip() / zipPartitions.
Support for reading and writing non-text input formats, like Hadoop SequenceFile (there's an open pull request for this).
Support for running on YARN clusters.
Cygwin support (PySpark works fine under Windows PowerShell or cmd.exe, though).
Support for job cancellation.
Although we've made many performance improvements, there's still a performance gap between Spark's Scala and Python APIs. The Spark users mailing list has an open thread discussing its current performance.
If you discover any missing features in PySpark, please open a new ticket on our JIRA issue tracker.
Original answer as of Spark 0.7.2:
The Spark Python Programming Guide has a list of missing PySpark features. As of Spark 0.7.2, PySpark is currently missing support for sample(), sort(), and persistence at different StorageLevels. It's also missing a few convenience methods added to the Scala API.
The Java API was in sync with the Scala API when it was released, but a number of new RDD methods have been added since then and not all of them have been added to the Java wrapper classes. There's a discussion about how to keep the Java API up-to-date at https://groups.google.com/d/msg/spark-developers/TMGvtxYN9Mo/UeFpD17VeAIJ. In that thread, I suggested a technique for automatically finding missing features, so it's just a matter of someone taking the time to add them and submit a pull request.
Regarding performance, PySpark is going to be slower than Scala Spark. Part of the performance difference stems from a weird JVM issue when forking processes with large heaps, but there's an open pull request that should fix that. The other bottleneck comes from serialization: right now, PySpark doesn't require users to explicitly register serializers for their objects (we currently use binary cPickle plus some batching optimizations). In the past, I've looked into adding support for user-customizable serializers that would allow you to specify the types of your objects and thereby use specialized serializers that are faster; I hope to resume work on this at some point.
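To illustrate why batching helps with pickle-based serialization in general, here is a small sketch. It is not PySpark's actual serializer. The function names and the batch size of 1024 are made up for the example, and it assumes pickle protocol 4. Every pickle.dumps call pays a fixed header and framing cost, so pickling many small rows one at a time produces more bytes (and more calls) than pickling them in batches.

```python
import pickle

PROTOCOL = 4  # assumption for this sketch; results differ between protocols


def pickled_size_per_item(items):
    """Total bytes when every item is pickled with its own dumps() call."""
    return sum(len(pickle.dumps(item, protocol=PROTOCOL)) for item in items)


def pickled_size_batched(items, batch_size):
    """Total bytes when items are pickled as lists of up to batch_size elements."""
    total = 0
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        total += len(pickle.dumps(batch, protocol=PROTOCOL))
    return total


if __name__ == "__main__":
    rows = [(i, float(i)) for i in range(10_000)]  # small synthetic rows
    print("per item:", pickled_size_per_item(rows))
    print("batched: ", pickled_size_batched(rows, batch_size=1024))
```

Byte counts are only a convenient proxy here. In practice the saving also comes from fewer calls and fewer round trips between processes.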
PySpark is implemented using a regular CPython interpreter, so libraries like numpy should work fine (this wouldn't be the case if PySpark were written in Jython).
It's pretty easy to get started with PySpark; simply downloading a pre-built Spark package and running the pyspark interpreter should be enough to test it out on your personal computer and will let you evaluate its interactive features. If you like to use IPython, you can run IPYTHON=1 ./pyspark in your shell to launch PySpark with an IPython shell.
I'd like to add some points about why many people who have used both APIs recommend the Scala API. It's very difficult for me to do this without pointing out just general weaknesses in Python vs Scala and my own distaste of dynamically typed and interpreted languages for writing production quality code. So here are some reasons specific to the use case:
Performance will never be quite as good as with Scala: not by orders of magnitude, but by fractions, partly because Python is interpreted. This gap may widen in the future as Java 8 and JIT technology become part of the JVM and Scala.
Spark is written in Scala, so debugging Spark applications, learning how Spark works, and learning how to use Spark is much easier in Scala because you can just quite easily CTRL + B into the source code and read the lower levels of Spark to suss out what is going on. I find this particularly useful for optimizing jobs and debugging more complicated applications.
Now my final point may seem like just a Scala vs Python argument, but it's highly relevant to the specific use case, namely scale and parallel processing. Scala actually stands for Scalable Language, and many interpret this to mean it was specifically designed with scaling and easy multithreading in mind. It's not just about lambdas; it's the head-to-toe features of Scala that make it the perfect language for Big Data and parallel processing. I have some Data Science friends who are used to Python and don't want to learn a new language, so they stick to their hammer. Python is a scripting language that was not designed for this specific use case. It's an awesome tool, but the wrong one for this job. The result is obvious in the code: theirs is often 2-5x longer than my Scala code, as Python lacks a lot of features. Furthermore, they find it harder to optimize their code because they are further away from the underlying framework.
Let me put it this way, if someone knows both Scala and Python, then they will nearly always choose to use the Scala API. The only people IME that use Python are those that simply do not want to learn Scala.