Connecting to Cassandra using PySpark

I am a beginner learning to work with Spark and Cassandra.
I am trying to connect to Cassandra using PySpark. I am running Cassandra 2.1 and Spark 1.3.
I have cloned this repo https://github.com/TargetHolding/pyspark-cassandra and followed the instructions to get it working with the Spark shell as well as with spark-submit.
This is the command I am using: ./bin/spark-submit --packages pyspark-cassandra:1.3 --conf spark.cassandra.connection.host=127.0.0.1:9042 cassandra_test.py
and similarly with pyspark in place of spark-submit (without the script at the end).
I am getting this error:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Provided Maven Coordinates must be in the form 'groupId:artifactId:version'. The coordinate provided is: pyspark-cassandra:1.3
I have searched for this error and gone through related questions, but I am not able to get the connector working.
Any help will be greatly appreciated.
Thanks in advance.

I haven't tried it, but the Spark Packages page is here: http://spark-packages.org/package/TargetHolding/pyspark-cassandra
It seems to suggest:
$SPARK_HOME/bin/spark-shell --packages TargetHolding:pyspark-cassandra:0.1.5
Note the TargetHolding: prefix; the --packages coordinate must be in the groupId:artifactId:version form mentioned in the error message. That might be it.
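For reference, a minimal sketch of what cassandra_test.py might look like once the package resolves. It assumes the TargetHolding package (which provides the pyspark_cassandra module) is on the classpath, and the keyspace and table names are placeholders:
from pyspark import SparkConf
from pyspark_cassandra import CassandraSparkContext  # provided by the pyspark-cassandra package

# Point the connector at the local Cassandra node (native protocol, default port 9042).
conf = (SparkConf()
        .setAppName("cassandra_test")
        .set("spark.cassandra.connection.host", "127.0.0.1"))
sc = CassandraSparkContext(conf=conf)

# Read a Cassandra table as an RDD of rows and count it.
rows = sc.cassandraTable("my_keyspace", "my_table")
print(rows.count())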

Related

Installing Apache Spark Packages to run Locally

I am looking for a clear guide or steps to install Spark packages (specifically spark-avro) to run locally and use them correctly with the spark-submit command.
I've spent a lot of time reading many posts and guides, but I'm still not able to get spark-submit to use the locally deployed spark-avro package. Hence, if someone has already accomplished this with spark-avro or another package, please share your wisdom :)
All the existing documentation I found is a bit unclear.
Clear steps and examples would be much appreciated! P.S. I know Python/PySpark/SQL, but not much Java (yet)...
Michael
You can pass the Avro package details in the spark-submit command itself (make sure the Avro package and Spark versions are compatible):
spark-submit --packages org.apache.spark:spark-avro_<scala_version>:<spark_version>
For example:
spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0
You can pass the same --packages option to spark-shell as well to work with Avro files.
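Once the package is on the classpath, reading and writing Avro from PySpark is just a data source call. A minimal sketch, assuming the job was launched with the spark-avro 2.4.0 package as above; the file paths are placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-example").getOrCreate()

# Read an Avro file into a DataFrame, then write it back out in Avro format.
df = spark.read.format("avro").load("/tmp/input.avro")
df.show()
df.write.format("avro").mode("overwrite").save("/tmp/output_avro")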

How to connect DocumentDB to a Spark application in an EMR instance

I'm getting an error while trying to configure Spark with MongoDB (Amazon DocumentDB) in my EMR instance. Below is the command -
spark-shell --conf "spark.mongodb.output.uri=mongodb://admin123:Vibhuti21!#docdb-2021-09-18-15-29-54.cluster-c4paykiwnh4d.us-east-1.docdb.amazonaws.com:27017/?replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false" "spark.mongodb.output.collection="ecommerceCluster" --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.3
I'm a beginner in Spark & AWS. Can anyone please help?
DocumentDB requires a CA bundle to be installed on each node where your Spark executors will launch. So you first need to install the CA certificates on each instance; AWS has a guide for this (under the Java section) with two bash scripts that make things easier.
Once these certs are installed, your Spark command needs to reference the truststore and its password using configuration parameters you can pass to Spark. Here is an example that I ran, and it worked fine.
spark-submit \
  --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.3 \
  --conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=/tmp/certs/rds-truststore.jks -Djavax.net.ssl.trustStorePassword=<yourpassword>" \
  pytest.py
You can provide those same configuration options to spark-shell as well.
One thing I did find tricky was that the Mongo Spark connector doesn't appear to recognize the ssl_ca_certs parameter in the connection string, so I removed it to avoid warnings from Spark; the executors reference the truststore through the configuration anyway.
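For completeness, the same settings can also be supplied from a PySpark script instead of the command line. A minimal sketch, assuming the connector package and the CA truststore JVM options from the command above are in place; the endpoint, credentials, database, and collection names are placeholders:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("docdb-example")
         .config("spark.mongodb.output.uri",
                 "mongodb://<user>:<password>@<docdb-endpoint>:27017/mydb"
                 "?replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false")
         .config("spark.mongodb.output.collection", "ecommerceCluster")
         .getOrCreate())

# A tiny placeholder DataFrame to write out.
df = spark.createDataFrame([(1, "order-a"), (2, "order-b")], ["id", "item"])

# Write through the MongoDB Spark connector data source.
(df.write
   .format("com.mongodb.spark.sql.DefaultSource")
   .mode("append")
   .save())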

spark-submit on IBM Bluemix

I've just registered a free instance of "Analytics for Apache Spark" and followed this tutorial to use IBM's purpose-built spark-submit script to run an app from my local machine on the Bluemix cloud cluster. The issue is the following: I did everything described in the tutorial and launched this script:
./spark-submit.sh --vcap ./vcap.json --deploy-mode cluster \
  --master https://spark.eu-gb.bluemix.net \
  --files /home/vito/vinorosso2.csv \
  --conf spark.service.spark_version=2.2.0 \
  /home/vito/workspace_2/sbt-esempi/target/scala-2.11/isolationF3.jar \
  --class progettoSisDis.MasterNode
Everything proceeds fine (the dataset vinorosso2.csv and my fat JAR are correctly uploaded) until the terminal says "submission complete". At this point, when I go to the log file created by the script, there is this error message:
Submit job result: Invalid plan and spark version combination in HTTP request (ibm.SparkService.PayGoPersonal, 2.0.0)
Submission ID:
ERROR: Problem submitting job. Exit
So, wasn't it enough to register a free instance of Analytics for Apache Spark to submit a Spark job? I hope someone can help. By the way, if it helps, on my local machine I'm using Spark with IntelliJ IDEA (Scala). Bye
From https://console.bluemix.net/docs/services/AnalyticsforApacheSpark/index-gentopic2.html#using_spark-submit you need to be using Spark version 1.6.x or 2.0.x. Your submitted job is set to version 2.2.0. Try submitting with --conf spark.service.spark_version=2.0.0 (assuming your code will work with this version of Spark).

How to get access to HDFS files in Spark standalone cluster mode?

I am trying to access HDFS files from Spark. Everything works fine when I run Spark in local mode, i.e.
SparkSession.master("local")
and get access to HDFS files by
hdfs://localhost:9000/$FILE_PATH
But when I try to run Spark in standalone cluster mode, i.e.
SparkSession.master("spark://$SPARK_MASTER_HOST:7077")
this error is thrown:
java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.fun$1 of type org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1
So far I have only run
start-dfs.sh
in Hadoop and have not really configured anything in Spark. Do I need to run Spark with the YARN cluster manager instead, so that Spark and Hadoop use the same cluster manager and can access the HDFS files?
I have tried configuring yarn-site.xml in Hadoop following the Tutorialspoint guide https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm and specified HADOOP_CONF_DIR in spark-env.sh, but it does not seem to work and the same error is thrown. Am I missing some other configuration?
Thanks!
EDIT
The initial Hadoop version is 2.8.0 and the Spark version is 2.1.1 built for Hadoop 2.7. I tried downloading hadoop-2.7.4, but the same error still exists.
The question here suggests this is a Java lambda-serialization issue rather than a Spark/HDFS issue. I will try this approach and see if it solves the error.
Inspired by the post here, I solved the problem myself.
This map-reduce job depends on a Serializable class, so when running in Spark local mode the class can be found on the classpath and the job executes as expected.
When running in Spark standalone cluster mode, it is best to submit the application through spark-submit rather than running it from an IDE. Package everything into a JAR and spark-submit the JAR; it works like a charm!
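For comparison, a minimal PySpark sketch of the same pattern (the original question uses the Java API). The master URL, NameNode address, and file path are placeholders, and the script is meant to be packaged and launched with spark-submit rather than from an IDE:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hdfs-standalone-example")
         .master("spark://SPARK_MASTER_HOST:7077")
         .getOrCreate())

# Read a text file from HDFS and run a simple word count on the cluster.
lines = spark.sparkContext.textFile("hdfs://localhost:9000/user/me/input.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.take(10))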

Spark + Cassandra on EMR LinkageError

I have Spark 1.6 deployed on EMR 4.4.0.
I am connecting to DataStax Cassandra 2.2.5 deployed on EC2.
The connection works for saving data into Cassandra using spark-cassandra-connector 1.4.2-s_2.10 (since it bundles Guava 14). However, reading data from Cassandra fails with the 1.4.2 version of the connector.
The recommended combination is to use 1.5.x, so I started using 1.5.0.
First I faced the Guava problem, and I fixed it using the userClassPathFirst solution.
spark-shell --conf spark.yarn.executor.memoryOverhead=2048 \
  --packages datastax:spark-cassandra-connector:1.5.0-s_2.10 \
  --conf spark.cassandra.connection.host=10.236.250.96 \
  --conf spark.executor.extraClassPath=/home/hadoop/lib/guava-16.0.1.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/* \
  --conf spark.driver.extraClassPath=/home/hadoop/lib/guava-16.0.1.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/* \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true
Now I get past the Guava 16 error; however, since I am using userClassPathFirst I am facing another conflict, and I cannot find a way to resolve it.
Lost task 2.1 in stage 2.0 (TID 6, ip-10-187-78-197.ec2.internal): java.lang.LinkageError:
loader constraint violation: loader (instance of org/apache/spark/util/ChildFirstURLClassLoader) previously initiated loading for a different type with name "org/slf4j/Logger"
I have the same trouble when I repeat the steps using Java code instead of spark-shell.
Is there any solution to get past it, or any other, cleaner way?
Thanks!
I got the same error when using the userClassPathFirst flag.
Remove these two flags from the configuration and just use the extraClassPath parameter.
Detailed answer here:
https://stackoverflow.com/a/40235289/3487888
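Once the userClassPathFirst flags are dropped and only the extraClassPath settings are kept, reading a Cassandra table from PySpark goes through the connector's DataFrame source. A minimal sketch for Spark 1.6 with the datastax:spark-cassandra-connector package on the classpath; the keyspace and table names are placeholders:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .setAppName("cassandra-read")
        .set("spark.cassandra.connection.host", "10.236.250.96"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Load a Cassandra table as a DataFrame via the connector's data source.
df = (sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="my_table")
      .load())
df.show()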
