Spark+cassandra java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.CassandraRDD

I am trying to read data from Cassandra 2.0.6 using Spark, with the DataStax drivers. While reading I got an error like "Loss was due to java.lang.ClassNotFoundException
java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.CassandraRDD". But I included spark-cassandra-connector_2.10 in my pom.xml, which contains the com.datastax.spark.connector.rdd.CassandraRDD class. Am I missing any other settings or environment variables?

You need to make sure that the connector is on the classpath for the executors, either via the -cp option or by bundling the jar into the Spark context (using SparkConf.setJars() or SparkContext.addJar()).
Edit for Modern Spark
In Spark 1.x and later it's usually recommended that you use the spark-submit command to place your dependencies on the executor classpath. See
http://spark.apache.org/docs/latest/submitting-applications.html
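As a rough illustration (the path, app name, and connector version below are placeholders, not from the question), the jar can also be shipped to the executors programmatically:

import org.apache.spark.{SparkConf, SparkContext}

// Ship the connector jar to every executor; the path and version here are placeholders.
val conf = new SparkConf()
  .setAppName("cassandra-read")
  .setJars(Seq("/path/to/spark-cassandra-connector_2.10.jar"))
val sc = new SparkContext(conf)

With spark-submit, the same effect comes from passing the connector via --jars, or resolving it with --packages, so it ends up on the executor classpath.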


Overriding Apache Spark dependency (spark-hive)

Tech stack:
Spark 2.4.4
Hive 2.3.3
HBase 1.4.8
sbt 1.5.8
What is the best practice for overriding a Spark dependency?
Suppose that the Spark app (cluster mode) already has the spark-hive (2.4.4) dependency, marked as provided.
I compiled and assembled a "custom" spark-hive jar that I want to use in the Spark app.
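For reference, the provided dependency described above would typically be declared in build.sbt like this (a sketch based on the versions listed, not the asker's actual build file):

libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.4.4" % "provided"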
There is not a lot of information about how you're running Spark, so it's hard to answer exactly.
But typically, you'll have Spark running on some kind of server or container or pod (in k8s).
If you're running on a server, go to $SPARK_HOME/jars. In there, you should find the spark-hive jar that you want to replace. Replace that one with your new one.
If running in a container/pod, do the same as above and rebuild your image from the directory with the replaced jar.
Hope this helps!
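If replacing the jar under $SPARK_HOME/jars isn't practical, one alternative (not part of the answer above; every path and name below is a placeholder) is to prepend the custom jar to the driver and executor classpaths at submit time:

# placeholders: main class, jar paths, and application jar
spark-submit \
  --class com.example.Main \
  --conf spark.driver.extraClassPath=/path/to/custom-spark-hive.jar \
  --conf spark.executor.extraClassPath=/path/to/custom-spark-hive.jar \
  your-app-assembly.jar

Entries in extraClassPath are prepended to the classpath, so classes there are found before the ones shipped with Spark; whether that is safe depends on how far the custom build diverges.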

Spark doesn't load all the dependencies in the uber jar

I have a requirement to connect to Azure Blob Storage from a Spark application to read data. The idea is to access the storage using Hadoop filesystem support (i.e., using the hadoop-azure and azure-storage dependencies, https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-azure/2.8.5).
We submit the job to Spark on a K8s cluster. The embedded Spark library doesn't come prepackaged with the required hadoop-azure jar, so I am building a fat jar with all the dependencies. The problem is that even though the library is part of the fat jar, Spark doesn't seem to load it, and hence I am getting the error "java.io.IOException: No FileSystem for scheme: wasbs".
The Spark version is 2.4.8 and the Hadoop version is 2.8.5. Is this behavior expected, that even though the dependency is part of the fat jar, Spark does not load it? How can I force Spark to load all the dependencies in the fat jar?
The same thing happened with another dependency, and I had to pass it manually using the --jars option. However, the --jars option is not feasible as the application grows.
I tried adding the fat jar itself to the executor extraClassPath; however, that causes a few other version conflicts.
Any information on this would be helpful.
Thanks & Regards,
Swathi Desai
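One common cause of "No FileSystem for scheme" errors with fat jars (an inference, not something stated in the post) is that the META-INF/services/org.apache.hadoop.fs.FileSystem entries contributed by hadoop-azure are dropped or overwritten when the jar is assembled, so Hadoop never learns which class handles wasbs. A minimal Scala sketch that registers the implementation explicitly (the class name is an assumption based on hadoop-azure 2.8.x; verify it for your version):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("wasbs-read").getOrCreate()
// Map the wasbs scheme to its filesystem implementation explicitly,
// instead of relying on the service files inside the fat jar.
spark.sparkContext.hadoopConfiguration.set(
  "fs.wasbs.impl",
  "org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure")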

Read data from Cassandra in spark-shell

I want to read data from a Cassandra node from my client node.
This is what I tried:
spark-shell --jars /my-dir/spark-cassandra-connector_2.11-2.3.2.jar

val df = spark.read.format("org.apache.spark.sql.cassandra")
  .option("keyspace", "my_keyspace")
  .option("table", "my_table")
  .option("spark.cassandra.connection.host", "Hostname of my Cassandra node")
  .option("spark.cassandra.connection.port", "9042")
  .option("spark.cassandra.auth.password", "mypassword")
  .option("spark.cassandra.auth.username", "myusername")
  .load()
I'm getting this error: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.cassandra.DefaultSource$
and
java.lang.NoClassDefFoundError: org/apache/commons/configuration/ConfigurationException.
Am I missing any properties? What is this error about, and how would I resolve it?
Spark version: 2.3.2, DSE version: 6.7.8
The Spark Cassandra Connector itself depends on a number of other libraries that could be missing here; this happens because you're providing only one jar, and not all of its required dependencies.
Basically, in your case you have the following choices:
If you're running this on a DSE node and the cluster has Analytics enabled, then you can use the built-in Spark. In this case, all jars and properties are already provided, and you only need to supply a username and password when starting the Spark shell via dse -u user -p password spark.
If you're using external Spark, then it's better to use the so-called BYOS (bring your own Spark) jar, a special build of the Spark Cassandra Connector with all dependencies bundled inside; you can download the jar from DataStax's Maven repo and use it with --jars.
You can still use the open source Spark Cassandra Connector, but in this case it's better to use --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2 so Spark will be able to fetch all dependencies automatically.
P.S. For the open source Spark Cassandra Connector I would recommend using version 2.5.1 or higher, even though it targets Spark 2.4.x (2.3.x may work). That version has improved support for DSE, plus a lot of new functionality not available in earlier versions. There is also a build that includes all required dependencies (the so-called assembly) that you can use with --jars if your machine doesn't have access to the internet.
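For example, the last two options would look something like this (the assembly file name is an assumption; pick versions that match your Scala and Spark builds):

spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2
spark-shell --jars spark-cassandra-connector-assembly_2.11-2.5.1.jar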

Spark executor log in IntelliJ IDEA

I'm running a Spark application which has Spark as a dependency in its pom. In IntelliJ IDEA, I can only see the driver-side log but no executor log. I found that in the run configuration I can add log files to be shown in the console, but I need to know where the log file is located... Please note it uses the Spark from the dependency libraries, not my local Spark environment...
Thanks,
Lionel
Have you set up executor logging? Take a look at this: http://shzhangji.com/blog/2015/05/31/spark-streaming-logging-configuration/
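As a rough sketch of that approach (the properties file, main class, and application jar below are placeholders), a separate log4j configuration can be shipped to the executors when submitting to a cluster:

# placeholders: properties file, main class, and application jar
spark-submit \
  --files /path/to/log4j-executor.properties \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-executor.properties" \
  --class com.example.Main \
  your-app.jar

Note that when the master is local[*], which is typical when running straight from IntelliJ with Spark only as a pom dependency, the driver and executors share one JVM, so executor log lines appear in the same console as the driver's.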

Spark 2.1 on MAPR 5.0

I am running Spark 2.1 on MapR 5.0.
I am getting the following exception while launching Spark in local mode.
My spark-defaults (important configuration):
spark.sql.hive.metastore.version 0.13.1
spark.sql.hive.metastore.jars
/opt/mapr/lib/maprfs-5.0.0-mapr.jar:
/opt/mapr/hadoop/hadoop-0.20.2/conf:
/opt/mapr/hadoop/hadoop-0.20.2/lib/protobuf-java-2.5.0.jar:
/opt/hadoopgpl/lib/hadoop-lzo.jar:
/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.7.0-mapr-1506.jar:
/opt/mapr/hadoop/hadoop-0.20.2/lib/commons-logging-1.1.1.jar:
/opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-auth-2.7.0-mapr-1506.jar:
/opt/mapr/lib/libprotodefs-5.0.0-mapr.jar:
/opt/mapr/lib/baseutils-5.0.0-mapr.jar:
/opt/mapr/hadoop/hadoop-0.20.2/lib/guava-13.0.1.jar:
/opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-common-2.7.0-mapr-1506.jar:
/opt/mapr/hadoop/hadoop-0.20.2/lib/commons-configuration-1.6.jar
spark.sql.hive.metastore.sharedPrefixes com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni,com.mapr.fs.shim
java.lang.LinkageError: loader (instance of org/apache/spark/sql/hive/client/IsolatedClientLoader$$anon$1): attempted duplicate class definition for name: "com/mapr/fs/jni/MapRConstants"
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
... 104 more
java.lang.IllegalArgumentException: Error while instantiating
'org.apache.spark.sql.hive.HiveSessionState':
Please help me with this.
I have experienced the same issue.
When running my Spark job using spark-submit everything worked but running it locally reproduced the same problem.
Digging out into MapR community yielded this post: https://community.mapr.com/thread/21262-spark-todf-returns-linkage-error-duplicate-class-definition#comments
In addition, you will notice that on your cluster, in the spark-defaults.conf file, there is the following config key: spark.sql.hive.metastore.sharedPrefixes.
So, setting this key (spark.sql.hive.metastore.sharedPrefixes) in my SparkConf fixed the problem.
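For example, using the prefixes from the spark-defaults above, the key can be set directly on the SparkConf (a sketch; adjust the list for your environment):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.sql.hive.metastore.sharedPrefixes",
    "com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc," +
    "com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni,com.mapr.fs.shim")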
Here is an explanation of this key:
a comma separated list of class prefixes that should be loaded using
the classloader that is shared between Spark SQL and a specific
version of Hive. An example of classes that should be shared is JDBC
drivers that are needed to talk to the metastore. Other classes that
need to be shared are those that interact with classes that are
already shared. For example, custom appenders that are used by log4j.
You can read more about it in here: https://spark.apache.org/docs/2.1.0/sql-programming-guide.html

Resources