Spark + Cassandra on EMR LinkageError

I have Spark 1.6 deployed on EMR 4.4.0, and I am connecting to DataStax Cassandra 2.2.5 deployed on EC2.
Saving data into Cassandra works using spark-cassandra-connector 1.4.2-s_2.10 (since it bundles Guava 14); however, reading data from Cassandra fails with the 1.4.2 version of the connector.
The recommended combination is to use 1.5.x, so I started using 1.5.0.
First I faced the Guava problem, and I fixed it using the userClassPathFirst solution:
spark-shell --conf spark.yarn.executor.memoryOverhead=2048
--packages datastax:spark-cassandra-connector:1.5.0-s_2.10
--conf spark.cassandra.connection.host=10.236.250.96
--conf spark.executor.extraClassPath=/home/hadoop/lib/guava-16.0.1.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*
--conf spark.driver.extraClassPath=/home/hadoop/lib/guava-16.0.1.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*
--conf spark.driver.userClassPathFirst=true
--conf spark.executor.userClassPathFirst=true
Now I get past the Guava 16 error, but since I am using userClassPathFirst I am facing another conflict, and I cannot find a way to resolve it.
Lost task 2.1 in stage 2.0 (TID 6, ip-10-187-78-197.ec2.internal): java.lang.LinkageError:
loader constraint violation: loader (instance of org/apache/spark/util/ChildFirstURLClassLoader) previously initiated loading for a different type with name "org/slf4j/Logger"
I am having the same trouble when I repeat the steps using Java code instead of spark-shell.
Is there any solution to get past this, or any other, cleaner way?
Thanks!

I got the same error when using the 'userClassPathFirst' flag.
Remove these two flags from the configuration and just use the 'extraClassPath' parameter.
Detailed answer here:
https://stackoverflow.com/a/40235289/3487888
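In other words, keep the spark-shell invocation from the question but drop the two userClassPathFirst flags, leaving only the extraClassPath entries (a sketch reusing the exact paths from the question):
spark-shell --conf spark.yarn.executor.memoryOverhead=2048 \
--packages datastax:spark-cassandra-connector:1.5.0-s_2.10 \
--conf spark.cassandra.connection.host=10.236.250.96 \
--conf spark.executor.extraClassPath=/home/hadoop/lib/guava-16.0.1.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/* \
--conf spark.driver.extraClassPath=/home/hadoop/lib/guava-16.0.1.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*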

Related

How to connect documentdb to a spark application in an emr instance

I'm getting an error while trying to configure Spark with MongoDB on my EMR instance. Below is the command:
spark-shell --conf "spark.mongodb.output.uri=mongodb://admin123:Vibhuti21!#docdb-2021-09-18-15-29-54.cluster-c4paykiwnh4d.us-east-1.docdb.amazonaws.com:27017/?replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false" --conf "spark.mongodb.output.collection=ecommerceCluster" --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.3
I'm a beginner in Spark & AWS. Can anyone please help?
DocumentDB requires a CA bundle to be installed on each node where your Spark executors will launch, so you first need to install the CA certs on each instance; AWS has a guide for this under the Java section, with two bash scripts that make things easier.
Once these certs are installed, your Spark command needs to reference the truststore and its password using the configuration parameters you can pass to Spark. Here is an example that I ran, and it worked fine.
spark-submit
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.3
--conf "spark.executor.extraJavaOptions=
-Djavax.net.ssl.trustStore=/tmp/certs/rds-truststore.jks
-Djavax.net.ssl.trustStorePassword=<yourpassword>" pytest.py
You can provide the same configuration options to spark-shell as well.
One thing I did find tricky was that the Mongo Spark connector doesn't appear to recognize the ssl_ca_certs parameter in the connection string, so I removed it to avoid warnings from Spark; the Spark executors reference the keystore from the configuration anyway.
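For reference, a minimal sketch of what a script like pytest.py could contain; the reading side uses the spark.mongodb.input.* settings, and the database and collection names below are hypothetical placeholders (as is the connection string, which should be your DocumentDB URI without the ssl_ca_certs parameter):
from pyspark.sql import SparkSession

# Hypothetical connection string; substitute your DocumentDB endpoint and credentials.
uri = "mongodb://<user>:<password>@<docdb-endpoint>:27017/?replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false"

spark = (SparkSession.builder
         .appName("docdb-read")
         .config("spark.mongodb.input.uri", uri)
         .config("spark.mongodb.input.database", "ecommerce")          # hypothetical database name
         .config("spark.mongodb.input.collection", "ecommerceCluster")
         .getOrCreate())

# Read the collection into a DataFrame via the Mongo Spark connector's DataFrame source.
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()
df.show(5)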

Read data from Cassandra in spark-shell

I want to read data from a Cassandra node on my client node.
This is what I tried:
spark-shell --jars /my-dir/spark-cassandra-connector_2.11-2.3.2.jar
val df = spark.read.format("org.apache.spark.sql.cassandra")
  .option("keyspace", "my_keyspace")
  .option("table", "my_table")
  .option("spark.cassandra.connection.host", "Hostname of my Cassandra node")
  .option("spark.cassandra.connection.port", "9042")
  .option("spark.cassandra.auth.password", "mypassword")
  .option("spark.cassandra.auth.username", "myusername")
  .load()
I'm getting this error: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.cassandra.DefaultSource$
and
java.lang.NoClassDefFoundError: org/apache/commons/configuration/ConfigurationException.
Am I missing any properties? What does this error mean, and how do I resolve it?
Spark version: 2.3.2, DSE version: 6.7.8
The Spark Cassandra Connector itself depends on a number of other libraries that could be missing here - this happens because you're providing only one jar, not all of the required dependencies.
Basically, in your case you have the following choices:
If you're running this on a DSE node, you can use the built-in Spark if the cluster has Analytics enabled - in this case all jars and properties are already provided, and you only need to supply the username and password when starting the Spark shell via dse -u user -p password spark
If you're using external Spark, it's better to use the so-called BYOS (bring your own spark) build - a special version of the Spark Cassandra Connector with all dependencies bundled inside; you can download the jar from DataStax's Maven repo and use it with --jars
You can still use the open source Spark Cassandra Connector, but in this case it's better to use --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2 so Spark will be able to fetch all dependencies automatically (see the example below).
P.S. In the case of the open source Spark Cassandra Connector I would recommend using version 2.5.1 or higher, although it requires Spark 2.4.x (2.3.x may work) - this version has improved support for DSE, plus a lot of new functionality not available in earlier versions. For that version there is also a build that includes all required dependencies (the so-called assembly) that you can use with --jars if your machine doesn't have access to the internet.
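For example, a minimal spark-shell launch pulling in the open source connector via --packages could look like this (a sketch; the host and credentials are placeholders):
spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2 \
--conf spark.cassandra.connection.host=<cassandra-host> \
--conf spark.cassandra.auth.username=myusername \
--conf spark.cassandra.auth.password=mypassword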

Pyspark and Cassandra Connection Error

I am stuck with a problem. When I write sample Cassandra connection code, importing the Cassandra connector gives an error.
I am starting the script as shown below (both of them gave an error):
./spark-submit --jars spark-cassandra-connector_2.11-1.6.0-M1.jar /home/beyhan/sparkCassandra.py
./spark-submit --jars spark-cassandra-connector_2.10-1.6.0.jar /home/beyhan/sparkCassandra.py
But I get the error below on the import:
import pyspark_cassandra
ImportError: No module named pyspark_cassandra
Which part did I get wrong?
Note: I have already installed the Cassandra database.
You are mixing up DataStax's Spark Cassandra Connector (the jar you add to spark-submit) and TargetHolding's PySpark Cassandra project (which provides the pyspark_cassandra module). The latter is deprecated, so you should probably use the Spark Cassandra Connector. Documentation for that package can be found here.
To use it, you can add the following flags to spark submit:
--conf spark.cassandra.connection.host=127.0.0.1 \
--packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3
Of course, use the IP address on which Cassandra is listening, and check which connector version you need to use: 2.0.0-M3 is the latest version and works with Spark 2.0 and most Cassandra versions. See the compatibility table in case you are using a different version of Spark. The 2.10 or 2.11 suffix is the Scala version your Spark build uses: with Spark 2.x it is 2.11 by default, before 2.x it was 2.10.
Then the nicest way to work with the connector is to use it to read data into DataFrames, which looks like this:
sqlContext.read\
.format("org.apache.spark.sql.cassandra")\
.options(table="kv", keyspace="test")\
.load().show()
See the PySpark with DataFrames documentation for more details
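Writing a DataFrame back to Cassandra works the same way (a sketch; it assumes you have assigned the result of such a read to df and that the target keyspace and table already exist):
df.write\
    .format("org.apache.spark.sql.cassandra")\
    .mode("append")\
    .options(table="kv", keyspace="test")\
    .save()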

spark kafka security kerberos

I am trying to use Kafka (0.9.1) in secure mode. I want to read the data with Spark, so I must pass the JAAS conf file to the JVM. I use this command to start my job:
/opt/spark/bin/spark-submit -v --master spark://master1:7077 \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.conf=kafka_client_jaas.conf" \
--files "./conf/kafka_client_jaas.conf,./conf/kafka.client.1.keytab" \
--class kafka.ConsumerSasl ./kafka.jar --topics test
I still get the same error:
Caused by: java.lang.IllegalArgumentException: You must pass java.security.auth.login.config in secure mode.
at org.apache.kafka.common.security.kerberos.Login.login(Login.java:289)
at org.apache.kafka.common.security.kerberos.Login.<init>(Login.java:104)
at org.apache.kafka.common.security.kerberos.LoginManager.<init>(LoginManager.java:44)
at org.apache.kafka.common.security.kerberos.LoginManager.acquireLoginManager(LoginManager.java:85)
at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:55)
I think Spark does not inject the parameter -Djava.security.auth.login.conf into the JVM!
The main cause of this issue is that you have used the wrong property name: it should be java.security.auth.login.config, not java.security.auth.login.conf. Moreover, if you are using a keytab file, make sure it is available on all executors via the --files argument of spark-submit. If you are using a Kerberos ticket, make sure to set KRB5CCNAME on all executors using the SPARK_YARN_USER_ENV property.
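For reference, a sketch of the command from the question with only the property name corrected:
/opt/spark/bin/spark-submit -v --master spark://master1:7077 \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=kafka_client_jaas.conf" \
--files "./conf/kafka_client_jaas.conf,./conf/kafka.client.1.keytab" \
--class kafka.ConsumerSasl ./kafka.jar --topics test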
If you are using an older version of Spark (1.6.x or earlier), there are some known issues where this integration will not work, and you will have to write a custom receiver.
For Spark 1.8 and later, you can see the configuration here.
In case you need to create a custom receiver, you can see this.

Connecting to cassandra using pyspark

I am a beginner learning to work with spark and cassandra.
I am trying to connect to cassandra using pyspark. I am running cassandra 2.1 and spark 1.3.
I have cloned this repo https://github.com/TargetHolding/pyspark-cassandra and followed instructions to get it working with spark shell as well as with spark-submit.
This is the command I am using:
./bin/spark-submit --packages pyspark-cassandra:1.3 --conf spark.cassandra.connection.host=127.0.0.1:9042 cassandra_test.py
and similarly with pyspark replacing spark-submit (without the script at the end).
I am getting this error:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Provided Maven Coordinates must be in the form 'groupId:artifactId:version'. The coordinate provided is: pyspark-cassandra:1.3
I have tried to look for this error and go through related questions, but I am not able to get the connector working.
Any help will be greatly appreciated.
Thanks in advance.
Haven't tried it, but the spark packages page is here: http://spark-packages.org/package/TargetHolding/pyspark-cassandra
Seems to suggest:
$SPARK_HOME/bin/spark-shell --packages TargetHolding:pyspark-cassandra:0.1.5
Note the TargetHolding: bit. That might be it.
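Putting that together with the command from the question, the submit would look something like this (a sketch; the connector version comes from the packages page, and the host is a placeholder for wherever Cassandra is listening):
./bin/spark-submit --packages TargetHolding:pyspark-cassandra:0.1.5 \
--conf spark.cassandra.connection.host=127.0.0.1 \
cassandra_test.py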
