How to include BigQuery Connector inside Dataproc using Livy - apache-spark

I'm trying to run my application through Livy, which resides inside GCP Dataproc, but I'm getting this error: "Caused by: java.lang.ClassNotFoundException: bigquery.DefaultSource"
I'm able to run hadoop fs -ls gs://xxxx inside Dataproc, and I checked that Spark points to the right location to find gcs-connector.jar; that's OK too.
I included Livy in Dataproc using the initialization action (https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/livy/)
How can I include bigquery-connector in Livy's classpath?
Could you help me, please?
Thank you all!

It looks like your application depends on the BigQuery connector, not the GCS connector (bigquery.DefaultSource).
The GCS connector should always be included in the Hadoop classpath by default, but you will have to add the BigQuery connector jar to your application manually.
Assuming this is a Spark application, you can set the spark.jars property to pull in the BigQuery connector jar from GCS at runtime: spark.jars='gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar'
For more installation options, see https://github.com/GoogleCloudDataproc/spark-bigquery-connector/blob/master/README.md
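If you are starting sessions through Livy's REST API, the same property can be passed when the session is created. A minimal sketch, assuming the Livy server installed by the initialization action is listening on its default port 8998; the host name is a placeholder:

import json
import requests

# Hypothetical Livy endpoint on the Dataproc master; 8998 is Livy's default port.
livy_url = "http://your-dataproc-master:8998"

payload = {
    "kind": "spark",
    "conf": {
        # Same connector jar referenced above; it is pulled from GCS at runtime.
        "spark.jars": "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"
    },
}
resp = requests.post(livy_url + "/sessions",
                     data=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
print(resp.json())  # session id and state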

Related

How to integrate Spark, Livy and Nifi to execute spark code from jar

Is there a way we can execute Spark code (packaged in a jar) from NiFi, using Livy?
I can see that in NiFi, using ExecuteSparkInteractive, we can submit custom code that is run on the Spark cluster via Livy. What I want instead is to pass the name of the jar file and the main class in NiFi and have it connect to Spark via Livy.
I found the article below on this, but it seems an option like Session JARs is not available in a plain NiFi installation.
https://community.cloudera.com/t5/Community-Articles/Apache-Livy-Apache-NiFi-Apache-Spark-Executing-Scala-Classes/ta-p/247985
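For reference, Livy's batch endpoint does accept a jar path and a main class directly, so one option is to have NiFi issue that HTTP call itself (for example with an InvokeHTTP processor). A hedged sketch of the request; the host, jar location and class name are placeholders:

import json
import requests

# POST /batches takes the jar ("file"), the main class and optional args.
payload = {
    "file": "hdfs:///jobs/my-spark-job.jar",  # placeholder jar location
    "className": "com.example.MainClass",     # placeholder main class
    "args": ["arg1", "arg2"],
}
resp = requests.post("http://livy-host:8998/batches",
                     data=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
print(resp.json())  # batch id and state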

How to pass Spark job properties to DataProcSparkOperator in Airflow?

I am trying to execute a Spark jar on Dataproc using Airflow's DataProcSparkOperator. The jar is located on GCS; I am creating the Dataproc cluster on the fly and then executing this jar on the newly created cluster.
I am able to execute this with Airflow's DataProcSparkOperator with default settings, but I am not able to configure Spark job properties (e.g. --master, --deploy-mode, --driver-memory, etc.).
I didn't get any help from the Airflow documentation, and the things I tried didn't work out.
Help is appreciated.
To configure the Spark job through DataProcSparkOperator you need to use the dataproc_spark_properties parameter.
For example, you can set deployMode like this:
DataProcSparkOperator(
    dataproc_spark_properties={'spark.submit.deployMode': 'cluster'})
In this answer you can find more details.
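As a fuller, hedged sketch (the contrib-era import path is assumed, and the task id, cluster name and jar path are placeholders):

from airflow.contrib.operators.dataproc_operator import DataProcSparkOperator

submit_spark_job = DataProcSparkOperator(
    task_id='submit_spark_job',
    main_jar='gs://my-bucket/jars/my-spark-job.jar',  # placeholder GCS jar
    cluster_name='my-dataproc-cluster',               # placeholder cluster
    dataproc_spark_properties={
        'spark.submit.deployMode': 'cluster',
        'spark.driver.memory': '4g',
    },
    dag=dag,  # assumes a DAG object defined elsewhere
)

Flags such as --deploy-mode and --driver-memory map to their spark.* equivalents (spark.submit.deployMode, spark.driver.memory).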

Configure external jars with HDI Jupyter Spark (Scala) notebook

I have an external custom jar that I would like to use with Azure HDInsight Jupyter notebooks; the Jupyter notebooks in HDI use Spark Magic and Livy.
Within the first cell of the notebook, I'm trying to use the jars configuration:
%%configure -f
{"jars": ["wasb://$container$#$account#.blob.core.windows.net/folder/my-custom-jar.jar"]}
But the error message I receive is:
Starting Spark application
The code failed because of a fatal error:
Status 'shutting_down' not supported by session..
Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context. For instructions on how to assign resources see http://go.microsoft.com/fwlink/?LinkId=717038
b) Contact your cluster administrator to make sure the Spark magics library is configured correctly.
Current session configs: {u'jars': [u'wasb://$container$#$account#.blob.core.windows.net/folder/my-custom-jar.jar'], u'kind': 'spark'}
An error was encountered:
Status 'shutting_down' not supported by session.
I'm wondering if I'm just not understanding how Livy works in this case, since I was able to successfully include a Spark package (GraphFrames) on the same cluster:
%%configure -f
{ "conf": {"spark.jars.packages": "graphframes:graphframes:0.3.0-spark2.0-s_2.11" }}
Some additional references that may be handy (just in case I missed something):
Jupyter notebooks kernels with Apache Spark clusters in HDInsight
Livy Documentation
Submit Spark jobs remotely to an Apache Spark cluster on HDInsight using Livy
Oh, I was able to figure it out and forgot to update my question. This can work if you put the jar in the default storage account of your HDI cluster.
HTH!
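For example, a jar uploaded to the cluster's default container can then be referenced with a wasb:/// path; the folder and jar name in this sketch are placeholders:
%%configure -f
{"jars": ["wasb:///folder/my-custom-jar.jar"]}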
In case people come here for adding jars on EMR:
%%configure -f
{"name": "sparkTest", "conf": {"spark.jars": "s3://somebucket/artifacts/jars/spark-avro_2.11-2.4.4.jar"}}
Contrary to the documentation, using "jars" directly won't work here.

How to submit job(jar) to the Azure Spark cluster through commandline interface?

I am new to HDInsight Spark, and I am trying to run a use case to learn how things work in an Azure Spark cluster. This is what I have done so far:
Able to create an Azure Spark cluster.
Created a jar by following the steps described in the link: create a standalone Scala application to run on an HDInsight Spark cluster. I used the same Scala code as given in the link.
SSH into the head node.
Uploaded the jar to blob storage using the link: using the Azure CLI with Azure Storage.
Copied the zip to the machine with hadoop fs -copyToLocal.
I have checked that the jar gets uploaded to the headnode (machine).
I want to run that jar and get the results as stated in the link given in point 2 above.
What is the next step? How can I submit the Spark job and get results using the command-line interface?
For example, suppose you have created the jar submit.jar for your program. To submit it to your cluster along with a dependency, you can use the syntax below.
spark-submit --master yarn --deploy-mode cluster --packages "com.microsoft.azure:azure-eventhubs-spark_2.11:2.2.5" --class com.ex.abc.MainMethod "wasb://space-hdfs@yourblob.blob.core.windows.net/xx/xx/submit.jar" "param1.json" "param2"
Here --packages is used to include a dependency in your program; alternatively, you can use the --jars option followed by the jar path: --jars "path/to/dependency/abc.jar"
--class: the class containing your program's main method.
After that, specify the path to your program jar;
you can pass parameters after it if needed, as shown above.
A couple of options for submitting Spark jars:
1) If you are already on the headnode, you can submit the job with spark-submit.
See the Apache documentation on submitting applications.
2) An easier alternative is to submit the Spark jar via Livy after uploading the jar to wasb storage, as sketched below.
See the submit-via-Livy doc; you can skip step 5 if you do it this way.
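A rough sketch of that Livy call on HDInsight; the cluster name, cluster-login credentials, blob paths and class name are placeholders, and HDInsight exposes Livy at https://<clustername>.azurehdinsight.net/livy:

import json
import requests

# Submit the uploaded jar as a Livy batch; all names below are placeholders.
payload = {
    "file": "wasb://space-hdfs@yourblob.blob.core.windows.net/xx/xx/submit.jar",
    "className": "com.ex.abc.MainMethod",
    "args": ["param1.json", "param2"],
}
resp = requests.post(
    "https://yourcluster.azurehdinsight.net/livy/batches",
    auth=("admin", "cluster-login-password"),  # HDInsight cluster login
    data=json.dumps(payload),
    headers={"Content-Type": "application/json", "X-Requested-By": "admin"},
)
print(resp.json())  # batch id and state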

spark driver not found

I am trying to write a DataFrame to SQL Server using Spark, using the DataFrameWriter's write method.
Using DriverManager.getConnection I am able to get a connection to SQL Server and write, but when I use the jdbc method and pass the URI I get "No suitable driver found".
I have passed the jTDS jar via --jars in spark-shell.
Spark version: 1.4
The issue is that Spark is not finding the driver jar file. Download the jar, place it on all worker nodes of the Spark cluster at the same path, and add this path to SPARK_CLASSPATH in the spark-env.sh file,
as follows (substituting the path to your driver jar):
SPARK_CLASSPATH=/home/mysql-connector-java-5.1.6.jar
Hope it will help.
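Depending on the Spark version, it may also help to name the driver class explicitly in the jdbc connection properties. A rough PySpark sketch, with placeholder connection details, assuming df is an existing DataFrame and the jTDS jar is on the classpath as described above:

# All connection details below are placeholders.
url = "jdbc:jtds:sqlserver://sqlserver-host:1433/mydb"
properties = {
    "user": "myuser",
    "password": "mypassword",
    # Naming the driver class helps avoid "No suitable driver found".
    "driver": "net.sourceforge.jtds.jdbc.Driver",
}
df.write.jdbc(url=url, table="my_table", mode="append", properties=properties)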
