Spark computeSVD Alternative - apache-spark

Thanks in advance for any help on this. I am working on a project to do system log anomaly detection on some very large data sets (we aggregate ~100 GB of syslogs per day). The approach we have chosen requires a singular value decomposition (SVD) of a matrix of identifiers for each log message. As we progressed we found that Spark 2.2 provides a computeSVD function in the Python API (we are aware that it is already available in Scala and Java, but our target is Python), but we are running Spark 2.1.1 (Hortonworks HDP 2.6.2 distribution). I asked about upgrading our 2.1.1 version in place, but the 2.2 version has not been tested against HDP yet.
We toyed with the idea of using NumPy straight from Python for this, but we are afraid we'll break the distributed nature of Spark and possibly overload worker nodes by going outside of the Spark API. Are there any alternatives in the Spark 2.1.1 Python API for SVD? Any suggestions or pointers would be greatly appreciated. Thanks!
Another thought I forgot about in the initial posting: is there a way we can write our machine learning primarily in the Python API, but call the Scala function we need, return the result, and continue in Python? I don't know if that is a thing or not.

To bring this to a close, we ended up writing our own SVD function based on the example at:
Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?
There were some minor tweaks, and I will post them as soon as we have them finalized, but overall it was the same. That example was posted for Spark 1.5 and we are using Spark 2.1.1. As noted, Spark 2.2 contains a computeSVD() function; unfortunately, at the time of the original posting, the HDP distribution we were using did not support 2.2. Yesterday (11.1.2017), HDP 2.6.3 was announced with support for Spark 2.2. Once we upgrade, we'll convert the code to take advantage of the built-in computeSVD() function that Spark 2.2 provides. Thanks for all the help and for the pointers to the link above; they helped greatly!
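For readers wondering why a PCA-style eigen-decomposition can stand in for SVD: the singular values of a matrix A are the square roots of the eigenvalues of AᵀA, and AᵀA is a sum of per-row outer products, which is the part that distributes well. The sketch below is not the function described above. It is a single-machine, pure-Python illustration with hypothetical helper names (gram_matrix, top_singular_value) that recovers only the largest singular value by power iteration.

```python
import math


def gram_matrix(rows):
    """Return A^T A for a matrix given as a list of equal-length rows.

    A^T A is the sum of the outer products of the rows, so in a cluster
    each partition could compute a partial sum independently.
    """
    n = len(rows[0])
    gram = [[0.0] * n for _ in range(n)]
    for row in rows:
        for i in range(n):
            for j in range(n):
                gram[i][j] += row[i] * row[j]
    return gram


def top_singular_value(rows, iterations=200):
    """Estimate the largest singular value via power iteration on A^T A."""
    gram = gram_matrix(rows)
    n = len(gram)
    # Assumption: the start vector is not orthogonal to the top eigenvector.
    vec = [1.0] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            return 0.0  # zero matrix: every singular value is zero
        vec = [x / norm for x in product]
        # For a unit vector close to the top eigenvector, |G v| approximates
        # the largest eigenvalue of G = A^T A.
        eigenvalue = norm
    return math.sqrt(eigenvalue)
```

A full replacement for computeSVD would also need the remaining singular values and the U and V factors, which this sketch deliberately leaves out.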

Related

Does John Snow Labs’ NLP library built on top of Apache Spark support Java

John Snow Labs’ NLP library is built on top of Apache Spark and the Spark ML library.
All its examples are provided in Scala and Python. Does it support Java? If yes, where can I find the related guides? If not, is there any plan to support Java?
In general, Scala libraries only need a dedicated Java API if their API (not the implementation) exposes functionality with no Java equivalent. Unfortunately, standard Scala function types are an example, at least until Scala 2.12 and Java 8. E.g. Spark makes a lot of use of ClassTags and implicits, which makes it hard to use directly from Java.
But this library is based on Spark ML, which doesn't have a separate Java API, and from a quick look, doesn't seem to need one (at least for the new DataFrame-based API). You can see its examples in Java at https://spark.apache.org/docs/2.3.0/ml-pipeline.html.
So the NLP library just creates instances of Transformer, Pipeline and other Spark ML types, and the code for creating them is trivially translatable to Java. You just need to know that Array(...) corresponds to new T[] { ... } (where T is the type of arguments). From this it doesn't seem to need a Java API, even if it could benefit from giving examples in Java. Unfortunately, it doesn't appear to provide even a Scaladoc link so I could see whether there is something in the API which is problematic to use from Java.

Spark 1.6 vs spark 2.0 productivity

The Databricks team has talked a lot about why Spark 2.x is faster than 1.6.
But why can operating on DataFrames in Spark 2.x produce lower-level bytecode? Why was that impossible with the RDD API?
Also, why was it so important to introduce Tungsten only in 2.0? What was wrong with doing it in Spark 1.6?
Spark 2.0 improvements
For starters the first "Tungsten" optimizations have been introduced in Spark 1.4 and extended in 1.5 and 1.6.
Spark 2.0 introduces backward incompatible changes which wouldn't be acceptable in 1.x due to project management policy.
Structured data and a restricted language allow for simpler optimization rules. This is why linear algebra libraries and relational databases feature very aggressive optimizations, and your-arbitrary-code doesn't.
It is impossible with the RDD API for the same reason that your-favorite-compiler™ doesn't apply the same optimizations out of the box: it is close to impossible to do it right. (Have you noticed that code used with DataFrames has to be deterministic and has to contribute to the execution plan, otherwise it will be eliminated?)
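To make the point about restricted languages concrete, here is a toy sketch in Python. It is an analogy only, not Spark's Catalyst: the Col, Lit and Add classes and the fold_constants rule are hypothetical names invented for this example. An expression described as data can be inspected and rewritten by a simple rule, while the equivalent lambda is an opaque callable the engine can only run.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Col, Lit, Add]


def fold_constants(expr):
    """Rewrite rule: collapse Add(Lit, Lit) into a single Lit, bottom-up."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            return Lit(left.value + right.value)
        return Add(left, right)
    return expr


def evaluate(expr, row):
    """Interpret an expression tree against one row (a dict of column values)."""
    if isinstance(expr, Col):
        return row[expr.name]
    if isinstance(expr, Lit):
        return expr.value
    return evaluate(expr.left, row) + evaluate(expr.right, row)


# Declarative form: the structure is visible, so the rule above can simplify it.
declared = Add(Col("x"), Add(Lit(1), Lit(2)))
optimized = fold_constants(declared)

# Opaque form: it computes the same value, but an engine receiving this
# callable has no structure to analyze or rewrite.
opaque = lambda row: row["x"] + 1 + 2
```

Here optimized is the smaller tree Add(Col("x"), Lit(3)); nothing comparable can be derived from opaque without analyzing arbitrary code.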

Whole-Stage Code Generation in Spark 2.0

I heard about Whole-Stage Code Generation for optimizing SQL queries through p539-neumann.pdf and sparksql-sql-codegen-is-not-giving-any-improvemnt.
Unfortunately, no one gave an answer to the question above.
I am curious about the scenarios in which to use this feature of Spark 2.0, but I didn't find a proper use case after googling.
Can we use this feature whenever we are using SQL? If so, is there a proper use case to see it working?
When you are using Spark 2.0, code generation is enabled by default. This allows most DataFrame queries to take advantage of the performance improvements. There are some potential exceptions, such as Python UDFs, that may slow things down.
Code generation is one of the primary components of the Spark SQL engine's Catalyst Optimizer. In brief, the Catalyst Optimizer engine does the following:
(1) analyzing a logical plan to resolve references,
(2) logical plan optimization
(3) physical planning, and
(4) code generation
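As a rough intuition for the code generation step, the toy sketch below contrasts an operator-at-a-time pipeline with a single fused loop that is generated as source text and compiled at runtime. This is a Python analogy with hypothetical function names, not Spark's implementation; Spark generates JVM code for a whole stage rather than Python.

```python
def volcano_pipeline(rows, threshold, factor):
    """Operator-at-a-time style: each operator is a separate generator."""
    filtered = (r for r in rows if r > threshold)  # Filter operator
    projected = (r * factor for r in filtered)     # Project operator
    return sum(projected)                          # Aggregate operator


def generate_fused_function(threshold, factor):
    """Emit source for one loop fusing filter, project and sum, then compile it."""
    source = (
        "def fused(rows):\n"
        "    total = 0\n"
        "    for r in rows:\n"
        f"        if r > {threshold!r}:\n"
        f"            total += r * {factor!r}\n"
        "    return total\n"
    )
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["fused"], source


fused, generated_source = generate_fused_function(threshold=2, factor=10)
```

Both versions compute the same result; the fused one avoids handing each row from one operator to the next, which is the kind of overhead whole-stage code generation targets.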
Great references for all of this are the blog posts:
Deep Dive into Spark SQL’s Catalyst Optimizer
Apache Spark as a Compiler: Joining a Billion Rows per Second on a Laptop
HTH!

Evaluating Spark-Notebook

I am evaluating Spark notebooks and found three different products:
1. Hue 3.9, which comes with a Spark notebook (beta)
2. Apache Zeppelin
3. andypetrella/spark-notebook
Can you please help me understand the pros and cons of each product?
Thanks
Pani
I have only played with Hue and Jupyter.
Hue is kind of new but offers more than just a Spark notebook: it integrates with all the Hadoop components (Oozie, Solr, Impala, HBase, Pig...).
Jupyter is great if you want an advanced editor for PySpark. The Python editor is really good and it is very popular in the Python community.
Jupyter is a well-established project, whereas Spark Notebook is a great but individual effort (with a good, fairly recent explanation here from the author himself), and Zeppelin is incubating at Apache. On that consideration we have the modern version of "no one ever got fired for buying IBM" (until they did, haha), and Jupyter is the IBM in the room.
It may help to look over some of the docs on Cloudera, for example http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/ (note that Jupyter used to be called IPython Notebook).
If you could post more about your use case, and perhaps what research you have already done, it would help people answer your question. StackOverflow has specific requirements for good questions, with a big emphasis on trying something first and posting code. Your question may be a better fit for another StackExchange site.
If you look here you'll get more interesting information, like Zeppelin being more focused on running on top of Hadoop (and Tachyon? which I guess is a transparent layer) and Zeppelin provides a pluggable interface so you can develop with more languages.

API compatibility between scala and python?

I have read a dozen pages of docs, and it seems that:
I can skip learning the scala part
the API is completely implemented in python (I don't need to learn scala for anything)
the interactive mode works as completely and as quickly as the scala shell and troubleshooting is equally easy
python modules like numpy will still be imported (no crippled python environment)
Are there fall-short areas that will make it impossible?
In recent Spark releases (1.0+), we've implemented all of the missing PySpark features listed below. A few new features are still missing, such as Python bindings for GraphX, but the other APIs have achieved near parity (including an experimental Python API for Spark Streaming).
My earlier answers are reproduced below:
Original answer as of Spark 0.9
A lot has changed in the seven months since my original answer (reproduced at the bottom of this answer):
Spark 0.7.3 fixed the "forking JVMs with large heaps" issue.
Spark 0.8.1 added support for persist(), sample(), and sort().
The upcoming Spark 0.9 release adds partial support for custom Python -> Java serializers.
Spark 0.9 also adds Python bindings for MLlib (docs).
I've implemented tools to help keep the Java API up-to-date.
As of Spark 0.9, the main missing features in PySpark are:
zip() / zipPartitions.
Support for reading and writing non-text input formats, like Hadoop SequenceFile (there's an open pull request for this).
Support for running on YARN clusters.
Cygwin support (PySpark works fine under Windows PowerShell or cmd.exe, though).
Support for job cancellation.
Although we've made many performance improvements, there's still a performance gap between Spark's Scala and Python APIs. The Spark users mailing list has an open thread discussing its current performance.
If you discover any missing features in PySpark, please open a new ticket on our JIRA issue tracker.
Original answer as of Spark 0.7.2:
The Spark Python Programming Guide has a list of missing PySpark features. As of Spark 0.7.2, PySpark is currently missing support for sample(), sort(), and persistence at different StorageLevels. It's also missing a few convenience methods added to the Scala API.
The Java API was in sync with the Scala API when it was released, but a number of new RDD methods have been added since then and not all of them have been added to the Java wrapper classes. There's a discussion about how to keep the Java API up-to-date at https://groups.google.com/d/msg/spark-developers/TMGvtxYN9Mo/UeFpD17VeAIJ. In that thread, I suggested a technique for automatically finding missing features, so it's just a matter of someone taking the time to add them and submit a pull request.
Regarding performance, PySpark is going to be slower than Scala Spark. Part of the performance difference stems from a weird JVM issue when forking processes with large heaps, but there's an open pull request that should fix that. The other bottleneck comes from serialization: right now, PySpark doesn't require users to explicitly register serializers for their objects (we currently use binary cPickle plus some batching optimizations). In the past, I've looked into adding support for user-customizable serializers that would allow you to specify the types of your objects and thereby use specialized serializers that are faster; I hope to resume work on this at some point.
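To illustrate what "pickle plus batching" means in general terms, here is a minimal sketch. It is not PySpark's serializer code: the dump_batches and load_batches helpers are hypothetical, and it uses Python 3's pickle module, which replaced cPickle. The idea is to pay the per-call serialization overhead once per batch instead of once per object.

```python
import pickle


def dump_batches(objects, batch_size):
    """Serialize a list of objects in batches, one pickle payload per batch."""
    payloads = []
    for start in range(0, len(objects), batch_size):
        batch = objects[start:start + batch_size]
        payloads.append(pickle.dumps(batch, protocol=pickle.HIGHEST_PROTOCOL))
    return payloads


def load_batches(payloads):
    """Deserialize batch payloads and yield the original objects in order."""
    for payload in payloads:
        for obj in pickle.loads(payload):
            yield obj
```

Because pickle handles arbitrary Python objects, no serializer registration is needed, which matches the trade-off described above: convenience at the cost of a generic, slower format.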
PySpark is implemented using a regular CPython interpreter, so libraries like NumPy should work fine (this wouldn't be the case if PySpark were written in Jython).
It's pretty easy to get started with PySpark; simply downloading a pre-built Spark package and running the pyspark interpreter should be enough to test it out on your personal computer and will let you evaluate its interactive features. If you like to use IPython, you can use IPYTHON=1 ./pyspark in your shell to launch PySpark with an IPython shell.
I'd like to add some points about why many people who have used both APIs recommend the Scala API. It's very difficult for me to do this without pointing out just general weaknesses in Python vs Scala and my own distaste of dynamically typed and interpreted languages for writing production quality code. So here are some reasons specific to the use case:
Performance will never be quite as good as with Scala; the gap is not orders of magnitude but fractions, partly because Python is interpreted. This gap may widen in the future as Java 8 and JIT technology become part of the JVM and Scala.
Spark is written in Scala, so debugging Spark applications, learning how Spark works, and learning how to use Spark is much easier in Scala because you can just quite easily CTRL + B into the source code and read the lower levels of Spark to suss out what is going on. I find this particularly useful for optimizing jobs and debugging more complicated applications.
Now my final point may seem like just a Scala vs Python argument, but it's highly relevant to the specific use case, namely scale and parallel processing. Scala actually stands for Scalable Language, and many interpret this to mean it was specifically designed with scaling and easy multithreading in mind. It's not just about lambdas; it's the head-to-toe features of Scala that make it the perfect language for doing Big Data and parallel processing. I have some Data Science friends who are used to Python and don't want to learn a new language, so they stick to their hammer. Python is a scripting language that was not designed for this specific use case: it's an awesome tool, but the wrong one for this job. The result is obvious in the code: theirs is often 2-5x longer than my Scala code, as Python lacks a lot of features. Furthermore, they find it harder to optimize their code, as they are further away from the underlying framework.
Let me put it this way, if someone knows both Scala and Python, then they will nearly always choose to use the Scala API. The only people IME that use Python are those that simply do not want to learn Scala.

Resources