Spark-Cassandra-Connector does not work with spark-submit - apache-spark

I am using spark-cassandra-connector to connect to Cassandra from Spark.
I am able to connect successfully through Livy using the command below:
curl -X POST --data '{"file": "/my/path/test.py", "conf" : {"spark.jars.packages": "com.datastax.spark:spark-cassandra-connector_2.11:2.3.0", "spark.cassandra.connection.host":"myip"}}' -H "Content-Type: application/json" localhost:8998/batches
I am also able to connect interactively through the pyspark shell using the command below:
sudo pyspark --packages com.datastax.spark:spark-cassandra-connector_2.10:2.0.10 --conf spark.cassandra.connection.host=myip
However, I am not able to connect through spark-submit. Some of the commands I have tried are below.
spark-submit test.py --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2 --conf spark.cassandra.connection.host=myip
This one didn't work. I tried passing these parameters in the Python file used for spark-submit; it still didn't work:
conf = (SparkConf().setAppName("Spark-Cassandracube").set("spark.cassandra.connection.host", "myip").set("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.11:2.3.0"))
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
I also tried passing these parameters via a Jupyter notebook:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.0 --conf spark.cassandra.connection.host="myip" pyspark-shell'
All the threads that I have seen so far talk about using spark-cassandra-connector with spark-shell, but not much about spark-submit.
Versions used:
Livy : 0.5.0
Spark : 2.4.0
Cassandra : 3.11.4

Not tested, but the most probable cause is that you're specifying all the options:
--packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2 \
--conf spark.cassandra.connection.host=myip
after the name of your script (test.py). In this case, spark-submit treats them as arguments to the script itself, not as options for spark-submit. Try moving the script name after the options.
P.S. See the Spark documentation for more details.
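For example, a corrected invocation might look like the sketch below (same package version and host placeholder as in the question; again, not tested here):
spark-submit \
--packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2 \
--conf spark.cassandra.connection.host=myip \
test.py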

Related

Path of jars added to a Spark Job - spark-submit

I am using Spark 2.1 (BTW) on a YARN cluster.
I am trying to upload JARs to the YARN cluster and use them to replace the on-site (already in place) Spark JARs.
I am trying to do so through spark-submit.
The question Add jars to a Spark Job - spark-submit - and the related answers - are full of interesting points.
One helpful answer is the following one:
spark-submit --jars additional1.jar,additional2.jar \
--driver-class-path additional1.jar:additional2.jar \
--conf spark.executor.extraClassPath=additional1.jar:additional2.jar \
--class MyClass main-application.jar
So, I understand the following:
"--jars" is for uploading the JARs to each node.
"--driver-class-path" is for using the uploaded JARs on the driver.
"--conf spark.executor.extraClassPath" is for using the uploaded JARs on the executors.
While I control the file paths for "--jars" within a spark-submit command, what will be the file paths of the uploaded JARs to use in "--driver-class-path", for example?
The doc says: "JARs and files are copied to the working directory for each SparkContext on the executor nodes"
Fine, but for the following command, what should I put instead of XXX and YYY?
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--driver-class-path XXX:YYY \
--conf spark.executor.extraClassPath=XXX:YYY \
--class MyClass main-application.jar
When using spark-submit, how can I reference the "working directory for the SparkContext" to form the XXX and YYY file paths?
Thanks.
PS: I have tried:
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--driver-class-path some1.jar:some2.jar \
--conf spark.executor.extraClassPath=some1.jar:some2.jar \
--class MyClass main-application.jar
No success (unless I made a mistake).
And I have also tried:
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--driver-class-path ./some1.jar:./some2.jar \
--conf spark.executor.extraClassPath=./some1.jar:./some2.jar \
--class MyClass main-application.jar
No success either.
spark-submit by default uses client mode.
In client mode, you should not use --jars in conjunction with --driver-class-path.
--driver-class-path will overwrite the original classpath, instead of prepending to it as one might expect.
--jars will automatically add the extra JARs to the driver and executor classpaths, so you do not need to add their paths manually.
It seems that in cluster mode --driver-class-path is ignored.
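Under that reading, a client-mode submission that relies on --jars alone (reusing the example JARs from the question) might look like this sketch:
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--class MyClass main-application.jar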

df.show() prints an empty result while the data in HDFS is not empty

I have a PySpark application which is submitted to YARN with multiple nodes, and it reads Parquet from HDFS.
In my code, I have a dataframe which is read directly from HDFS:
df = self.spark.read.schema(self.schema).parquet("hdfs://path/to/file")
When I use df.show(n=2) directly after the above code, it outputs:
+---------+--------------+-------+----+
|aaaaaaaaa|bbbbbbbbbbbbbb|ccccccc|dddd|
+---------+--------------+-------+----+
+---------+--------------+-------+----+
But when I manually go to the HDFS path, the data is not empty.
What have I tried?
1- At first I thought that I might have given too few cores and too little memory to my executors and driver, so I doubled them and nothing changed.
2- Then I thought that the path might be wrong, so I gave it a wrong HDFS path and it threw an error that the path does not exist.
What am I assuming?
1- This may have something to do with the drivers and executors.
2- It may have something to do with YARN.
3- It may be about the configs provided when using spark-submit.
Current config:
spark-submit \
--master yarn \
--queue my_queue_name \
--deploy-mode cluster \
--jars some_jars \
--conf spark.yarn.dist.files some_files \
--conf spark.sql.catalogImplementation=in-memory \
--properties-file some_zip_file \
--py-files some_py_files \
main.py
What I am sure of:
The data is not empty. The same HDFS path is used in another project, which is working fine.
So the problem was with the JAR files I was providing.
The Hadoop version was 2.7.2; I changed it to 3.2.0 and it's working fine.
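In similar situations, a quick way to confirm which Hadoop client version the application is actually running with is to ask the JVM from PySpark. This is only a diagnostic sketch: it assumes an existing SparkSession named spark, and _jvm is an internal gateway rather than a stable public API.
# Print the Hadoop client version this Spark application was launched with.
# `spark` is assumed to be an existing SparkSession; `_jvm` is internal to PySpark.
hadoop_version = spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()
print("Hadoop client version:", hadoop_version)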

Spark read csv file submitted from --files

I'm submitting a Spark job to a remote Spark cluster on YARN and including a file via spark-submit --files. I want to read the submitted file as a dataframe, but I'm confused about how to do this without having to put the file in HDFS:
spark-submit \
--class com.Employee \
--master yarn \
--files /User/employee.csv \
--jars SomeJar.jar
val spark: SparkSession = ??? // create the Spark Session
val df = spark.read.csv("/User/employee.csv")
spark.sparkContext.addFile("file:///your local file path")
Add the file using addFile so that it is available on your worker nodes, since you want to read a local file in cluster mode.
You may need to make a slight change according to the Scala and Spark versions you are using.
employee.csv is in the working directory of the executor, so just read it as follows:
val df = spark.read.csv("employee.csv")

replace default application.conf file in spark-submit

My code works like this:
val config = ConfigFactory.load
It gets the key-value pairs from application.conf by default. Then I use -Dconfig.file= to point to another conf file.
It works fine with the command below:
dse -u cassandra -p cassandra spark-submit
--class packagename.classname --driver-java-options
-Dconfig.file=/home/userconfig.conf /home/user-jar-with-dependencies.jar
But now I need to split userconfig.conf into 2 files. I tried the command below; it doesn't work.
dse -u cassandra -p cassandra spark-submit
--class packagename.classname --driver-java-options
-Dconfig.file=/home/userconfig.conf,env.conf
/home/user-jar-with-dependencies.jar
By default Spark will look in spark-defaults.conf, but you can 1) specify another file using --properties-file, 2) pass individual key-value properties using --conf, or 3) set up the configuration programmatically in your code using the SparkConf object.
Does this help, or are you looking for the akka application.conf file?
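As a rough sketch of options 1) and 2) above (the properties file path and the key/value shown are just placeholders):
spark-submit --properties-file /path/to/my-spark.properties \
--conf spark.some.key=someValue \
--class packagename.classname /home/user-jar-with-dependencies.jar
Note that --properties-file supplies Spark configuration properties; it is separate from the Typesafe Config application.conf loaded by ConfigFactory.load.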

How to get SparkContext if Spark runs on Yarn?

We have a program based on Spark standalone, and in this program we use SparkContext and SQLContext to do lots of queries.
Now we want to deploy the system on Spark running on YARN. But when we modify spark.master to yarn-cluster, the application throws an exception saying this works with spark-submit only. When we switch to yarn-client, although it no longer throws exceptions, it doesn't work properly.
It seems that when running on YARN, we can no longer use SparkContext; instead we should use something like yarn.Client, but then we don't know how to change our code to achieve what we did before with SparkContext and SQLContext.
Is there a good way to solve this? Can we get a SparkContext from yarn.Client, or should we change our code to use the new interfaces of yarn.Client?
Thank you!
When you run on a cluster, you need to do a spark-submit like this:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
--master will be yarn
--deploy-mode will be cluster
In your application, if you have something like setMaster("local[*]"), you can remove it and build the code. When you do spark-submit with --master yarn, YARN will launch containers for you instead of the Spark standalone scheduler.
Your application code can look like this, without any setting for the master:
val conf = new SparkConf().setAppName("App Name")
val sc = new SparkContext(conf)
YARN deploy mode client is used when you want to launch the driver on the same machine the code is running from. On a cluster, the deploy mode should be cluster; this will make sure the driver is launched on one of the worker nodes by YARN.
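For example, a concrete submission for the scenario above might look like the sketch below (the class and JAR names are placeholders, not taken from the original question):
./bin/spark-submit \
--class com.example.MyApp \
--master yarn \
--deploy-mode cluster \
my-application.jar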

Resources