Spark structured streaming: java.lang.NoClassDefFoundError for GroupStateTimeout [duplicate] - apache-spark

This question already has answers here:
Resolving dependency problems in Apache Spark
(7 answers)
Closed 4 years ago.
I'm trying to use mapGroupsWithState in spark structured streaming as defined in https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.streaming.GroupState
However, I get the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/streaming/GroupStateTimeout
It seems like the GroupStateTimeout class definition was not found in the package.
I'm using the JAR for spark-sql_2.11 version 2.2.0 from:
https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.11/2.2.0
When I open up the JAR, there is no class definition of GroupStateTimeout. I'm not sure what I'm missing here, because mapGroupsWithState seems to be a pretty well documented feature. How can a class definition be missing from the package?

GroupStateTimeout is part of the spark-catalyst module. Please have a look here: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/streaming/GroupStateTimeout.java
So you need to add the spark-catalyst dependency https://mvnrepository.com/artifact/org.apache.spark/spark-catalyst_2.11/2.2.0 to your project. Hope this solves your problem.

Related

Why do classes that come from spark-libs take precedence over classes from the application jar in Spark?

Let's say we have a fat jar with a Spark application; in other words, it contains all of its dependencies. However, it is not able to override/hide the dependencies provided by spark-libs. That seems strange, because the classes in the jar look more near/internal compared to the "external" classes from spark-libs.
Could anyone explain to me how this works, and what the idea is behind such precedence?

Could you help explain what @ does in Python [duplicate]

This question already has answers here:
What does the "at" (#) symbol do in Python?
(14 answers)
how do decorated functions work in flask/python? (app.route)
(2 answers)
Closed 3 years ago.
I have tried a number of different Python tutorials and regularly come across the @ symbol being used, but I am still not 100% sure what is going on.
This document describes how properties are used:
https://www.programiz.com/python-programming/property
@property
@temperature.setter
This document shows blueprints with flask
http://flask.pocoo.org/docs/1.0/tutorial/views/
@bp.route('/register', methods=('GET', 'POST'))
This is used in pytest:
https://docs.pytest.org/en/latest/fixture.html
@pytest.fixture
@pytest.mark.usefixtures("cleandir", "anotherfixture")
This discussion mentions another usage:
How do I correctly setup and teardown my pytest class with tests?
@classmethod
If you could offer any help, especially a good tutorial, it would be really appreciated.
Thanks
Mark
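
In all of these examples the @ symbol introduces a decorator: `@something` placed above a `def` is shorthand for passing the function to `something` and rebinding the name to the result. A minimal sketch (the names `shout` and `greet` are purely illustrative):

```python
def shout(func):
    """A decorator: takes a function and returns a wrapped version of it."""
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper

@shout                      # equivalent to: greet = shout(greet)
def greet(name):
    return f"hello, {name}"

print(greet("world"))       # prints "HELLO, WORLD"
```

@property, @bp.route(...), @pytest.fixture, and @classmethod are all prebuilt decorators that wrap or register the function defined immediately below them.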

spark.createDataFrame() vs sqlContext.createDataFrame() [duplicate]

This question already has answers here:
Difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession?
(5 answers)
Closed 4 years ago.
Can someone explain to me the difference between spark.createDataFrame() and sqlContext.createDataFrame()? I have seen both used but do not understand the exact difference or when to use which.
I'm going to assume you are using Spark version 2 or later, because in the first method you are referring to a SparkSession, which is only available from version 2 onwards.
spark.createDataFrame(...) is the preferred way to create a DataFrame in Spark 2.x. Refer to the linked documentation to see its possible usages, as it is an overloaded method.
sqlContext.createDataFrame(...) (Spark 1.6) was the way to create a DataFrame in Spark 1.x. As you can read in the linked documentation, it is deprecated in Spark 2.x and only kept for backwards compatibility:
The entry point for working with structured data (rows and columns) in Spark 1.x.
As of Spark 2.0, this is replaced by SparkSession. However, we are keeping the class here for backward compatibility.
So, to answer your question: in Spark 2.x you can use both (although the second one is deprecated, so it is strongly recommended to use the first), and you can only use the second one if you are stuck on Spark 1.x.
Edit: SparkSession implementation (i.e. the source code) and SQLContext implementation.
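
As a small illustration of the two entry points in a Spark 2.x PySpark session (the column names and data here are made up):

```python
from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder.appName("createDataFrame-example").getOrCreate()
data = [("alice", 1), ("bob", 2)]

# Spark 2.x: create the DataFrame directly from the SparkSession
df1 = spark.createDataFrame(data, ["name", "id"])

# Spark 1.x style: go through SQLContext (kept only for backwards compatibility)
sqlContext = SQLContext(spark.sparkContext)
df2 = sqlContext.createDataFrame(data, ["name", "id"])

df1.show()
df2.show()
```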

Spark computeSVD Alternative

Thanks in advance for any help on this. I am working on a project to do some system log anomaly detection on some very large data sets (we aggregate ~100 GB per day of syslogs). The approach we have chosen requires a singular value decomposition of a matrix of identifiers for each log message. As we progressed we found that Spark 2.2 provides a computeSVD function (we are using the Python API; we are aware that this is available in Scala and Java, but our target is to use Python), but we are running Spark 2.1.1 (Hortonworks HDP 2.6.2 distribution). I asked about upgrading our 2.1.1 version in place, but the 2.2 version has not been tested against HDP yet.
We toyed with the idea of using NumPy directly from Python for this, but we are afraid we'll break the distributed nature of Spark and possibly overload worker nodes by going outside of the Spark API. Are there any alternatives in the Spark 2.1.1 Python API for SVD? Any suggestions or pointers would be greatly appreciated. Thanks!
Another thought I forgot about in the initial posting: is there a way we can write our machine learning code primarily in the Python API, but call the Scala function we need, return that result, and continue with Python? I don't know if that is a thing or not...
To bring this to a close, we ended up writing our own SVD function based on the example at:
Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?
There were some minor tweaks and I will post them as soon as we have them finalized, but overall it was the same. That answer was posted for Spark 1.5 and we are using Spark 2.1.1. However, as noted, Spark 2.2 contains a computeSVD() function; unfortunately, at the time of this posting, the HDP distribution we are using did not support 2.2. Yesterday (11.1.2017), HDP 2.6.3 was announced with support for Spark 2.2. Once we upgrade, we'll convert the code to take advantage of the built-in computeSVD() function that Spark 2.2 provides. Thanks for all the help and the pointers to the link above, they helped greatly!
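
The finalized code isn't included here, but the general shape of the approach in the linked answer is: aggregate the small Gram matrix A^T A across the cluster, do the eigen/SVD step with NumPy on the driver, and recover U in a distributed map. A rough PySpark sketch under those assumptions (the data and variable names are illustrative, not the production code):

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("svd-sketch").getOrCreate()
sc = spark.sparkContext

# Toy data: an RDD of dense row vectors of a tall, skinny matrix A.
rows = sc.parallelize([np.array([1.0, 2.0, 3.0]),
                       np.array([4.0, 5.0, 6.0]),
                       np.array([7.0, 8.0, 10.0])])

# A^T A is only n_features x n_features, so it is cheap to aggregate
# across the cluster and finish on the driver.
gram = rows.map(lambda r: np.outer(r, r)).reduce(lambda a, b: a + b)

# Eigendecomposition of A^T A gives the right singular vectors V and
# the singular values sigma = sqrt(eigenvalues).
eigvals, V = np.linalg.eigh(gram)
order = np.argsort(eigvals)[::-1]
sigma = np.sqrt(np.clip(eigvals[order], 0.0, None))
V = V[:, order]

# U can be recovered row by row in a distributed fashion: U = A V diag(1/sigma).
V_b, sigma_b = sc.broadcast(V), sc.broadcast(sigma)
U_rows = rows.map(lambda r: r.dot(V_b.value) / sigma_b.value)

print(sigma)
print(U_rows.collect())
```

This only scales when the number of columns is small enough that an n_features x n_features matrix fits comfortably on the driver, which is the usual assumption for this trick.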

how to create a global variable in spark that can be used across all nodes [duplicate]

This question already has an answer here:
How to define a global read\write variables in Spark
(1 answer)
Closed 5 years ago.
I need to use a variable across all nodes in a Spark YARN cluster.
Broadcast variables in Spark are immutable, so they are not useful in my case.
I need a similar approach that supports both read and write.
Regards,
Sorabh
You cannot. Spark is built on the principle of immutability; in fact, any distributed framework works by leveraging the concept of immutability.
Here is a similar question with a beautiful explanation:
How to define a global read\write variables in Spark
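
For reference, the closest built-in mechanisms are broadcast variables (read-only on the executors) and accumulators (tasks can only add to them, and only the driver can read the result). Neither is a true read/write global, but together they cover many cases. A minimal PySpark sketch (the names and data are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-state").getOrCreate()
sc = spark.sparkContext

# Broadcast: set once on the driver, read-only on every executor.
lookup = sc.broadcast({"a": 1, "b": 2})

# Accumulator: tasks can only add to it; only the driver can read the value.
bad_records = sc.accumulator(0)

def parse(key):
    if key not in lookup.value:
        bad_records.add(1)
        return 0
    return lookup.value[key]

result = sc.parallelize(["a", "b", "x", "a"]).map(parse).sum()
print(result)             # 4
print(bad_records.value)  # 1
```

For genuinely mutable shared state, the linked question points towards keeping that state outside Spark rather than inside the job.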
