Spark-shell does not import specified jar file

I am a complete beginner to all of this, so pardon me if I'm missing some totally obvious step. I installed Spark 3.1.2 and Cassandra 3.11.11, and I'm trying to connect the two by following a guide I found, for which I built a fat jar. In that guide, when they run the spark-shell command with the jar file, the following line appears at the start:
INFO SparkContext: Added JAR file:/home/chbatey/dev/tmp/spark-cassandra-connector/spark-cassandra-connector-java/target/scala-2.10/spark-cassandra-connector-java-assembly-1.2.0-SNAPSHOT.jar at http://192.168.0.34:51235/jars/spark-
15/01/26 16:16:10 INFO SparkILoop: Created spark context..
I followed all of the steps properly, but no line like that shows up in my shell. To confirm that the jar hasn't been added, I tried the sample program from that website, and it throws an error:
java.lang.NoClassDefFoundError: com/datastax/spark/connector/util/Logging
What should I do? I'm using spark-cassandra-connector-3.1.0

You don't need to compile it yourself. Just follow the official documentation and use --packages to download all the dependencies automatically:
spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0
The error occurs because the connector jar on its own doesn't contain its dependencies; with --jars you would have to list every one of them yourself (the Java driver, etc.). So if you still want to use the --jars option, download the assembly version of the connector (link to jar) instead: it contains all the necessary dependencies.
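To check whether a particular jar actually contains the class named in a NoClassDefFoundError, you can look inside it directly, since a jar is a zip archive. Below is a minimal Python sketch, not part of the original answer. The function name jar_contains_class and the demo jar with com/example entries are hypothetical stand-ins for the real connector jar.

```python
import os
import tempfile
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the jar has a .class entry for class_name.

    Accepts dotted names (a.b.C) as well as the slash-separated form
    (a/b/C) that NoClassDefFoundError messages use.
    """
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


# Self-contained demo: build a tiny stand-in jar (hypothetical contents)
# and query it for one class that is present and one that is not.
with tempfile.TemporaryDirectory() as tmp:
    demo_jar = os.path.join(tmp, "demo.jar")
    with zipfile.ZipFile(demo_jar, "w") as jar:
        jar.writestr("com/example/Present.class", b"")
    has_present = jar_contains_class(demo_jar, "com.example.Present")
    has_missing = jar_contains_class(demo_jar, "com/example/Missing")

print(has_present, has_missing)
```

Calling the function with the jar you passed to --jars and the name com/datastax/spark/connector/util/Logging should tell you whether that class is bundled in it.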

Related

Spark on kubernetes with zeppelin

I am following this guide to run up a zeppelin container in a local kubernetes cluster set up using minikube.
https://zeppelin.apache.org/docs/0.9.0-SNAPSHOT/quickstart/kubernetes.html
I am able to set up zeppelin and run some sample code there. I have downloaded spark 2.4.5 & 2.4.0 source code and built it for kubernetes support with the following command:
./build/mvn -Pkubernetes -DskipTests clean package
Once spark is built I created a docker container as explained in the article:
bin/docker-image-tool.sh -m -t 2.4.X build
I configured zeppelin to use the spark image which was built with kubernetes support. The article above explains that the spark interpreter will auto configure spark on kubernetes to run in client mode and run the job.
But whenever I try to run any paragraph with Spark, I receive the following error:
Exception in thread "main" java.lang.IllegalArgumentException: basedir must be absolute: ?/.ivy2/local
I tried setting the spark configuration spark.jars.ivy in zeppelin to point to a temp directory but that does not work either.
I found a similar issue here:
basedir must be absolute: ?/.ivy2/local
But I can't seem to configure Spark to run with the spark.jars.ivy /tmp/.ivy config. I tried including a spark-defaults.conf when building Spark, but that does not seem to work either.
I'm quite stumped by this problem; any guidance on how to solve it would be appreciated.
Thanks!

I have also run into this problem. The workaround I used for setting spark.jars.ivy=/tmp/.ivy was to set it as an environment variable instead.
In your Spark interpreter settings, add the property SPARK_SUBMIT_OPTIONS and set its value to --conf spark.jars.ivy=/tmp/.ivy.
This passes the additional options to spark-submit, and your job should then run.
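The "?" at the start of the path in the error suggests that the home directory could not be resolved inside the container, so the default Ivy directory ends up as a relative path. Here is a small Python sketch of that reasoning, assuming the path is formed by appending "local" to the Ivy home. The helper ivy_local_dir is hypothetical and only mirrors the error message; it is not Spark's own code.

```python
import posixpath


def ivy_local_dir(ivy_home):
    """Append 'local' to the Ivy home, matching the path in the error."""
    return posixpath.join(ivy_home, "local")


# "?" stands in for an unresolved home directory (an assumption based on
# the error text); /tmp/.ivy is the value suggested in the workaround.
broken = ivy_local_dir("?/.ivy2")
fixed = ivy_local_dir("/tmp/.ivy")

print(broken, posixpath.isabs(broken))
print(fixed, posixpath.isabs(fixed))
```

The first path is relative, which is what the "basedir must be absolute" check rejects, while the second one is absolute.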

how to add third party library to spark running on local machine

I am listening to an Event Hubs stream. I followed an article on attaching the library to the cluster (Databricks), and my code runs fine there.
For debugging I am running the code on a local machine/cluster, but it fails because the library is missing. How can I add the library when running on a local machine?
I tried sparkcontext.addfile(fullpathtojar), but I still get the same error.

You can use spark-submit --packages.
Example: spark-submit --packages org.postgresql:postgresql:42.1.1
You need to find the Maven coordinates of the package you are using and check that it is compatible with your Spark version.
With a single jar file you'd use spark-submit --jars instead.
I used spark-submit --packages {package} and it works.
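If you launch the job from a script, the two options can be put together programmatically. This is a minimal Python sketch, assuming spark-submit is on your PATH. The helper spark_submit_command and the application name my_job.py are hypothetical. Both --packages and --jars take a comma-separated list.

```python
import shlex


def spark_submit_command(app, packages=(), jars=()):
    """Build the spark-submit argument list for the given dependencies."""
    argv = ["spark-submit"]
    if packages:
        # Maven coordinates, resolved and downloaded by Spark at launch.
        argv += ["--packages", ",".join(packages)]
    if jars:
        # Jar files that already exist on the local machine.
        argv += ["--jars", ",".join(jars)]
    argv.append(app)
    return argv


# my_job.py is a made-up application file used only for illustration.
cmd = spark_submit_command("my_job.py", packages=["org.postgresql:postgresql:42.1.1"])
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd) to actually start the job.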

Connecting to Teradata using Spark JDBC

I am trying to extract data from Teradata using Spark JDBC. I created a "lib" directory in the main parent directory, placed the external Teradata jars in it, and ran sbt package. In addition, I am passing the jar with the "--jars" option on my spark-shell command. However, when I run spark-shell, it does not seem to find the class:
Exception in thread "main" java.lang.ClassNotFoundException: com.teradata.hadoop.tool.TeradataImportTool
However, when I run "jar tvf" on the jar file, I see the class. Somehow Spark is unable to find the jar. Is there anything else I need to do so that Spark can find it? Please help.

This particular class, com.teradata.hadoop.tool.TeradataImportTool, is in teradata-hadoop-connector.jar.
You can try passing it when submitting the job, as in the example below:
--conf spark.driver.extraClassPath=<complete path of teradata-hadoop-connector.jar>
--conf spark.executor.extraClassPath=<complete path of teradata-hadoop-connector.jar>
OR
make the jar available to both the driver and the executors by adding both lines below to conf/spark-defaults.conf:
spark.driver.extraClassPath <complete path of teradata-hadoop-connector.jar>
spark.executor.extraClassPath <complete path of teradata-hadoop-connector.jar>
NOTE: As an alternative approach, you can build an uber jar (also known as a fat jar, i.e. a jar with its dependencies bundled) to avoid this kind of issue.
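The two forms differ slightly in syntax: on the command line, --conf expects key=value, while spark-defaults.conf separates key and value with whitespace. Here is a short Python sketch that produces both from one jar path. The helper extra_classpath_settings and the path /opt/jars/teradata-hadoop-connector.jar are hypothetical; substitute the real location of your jar.

```python
def extra_classpath_settings(jar_path):
    """Return (command-line args, spark-defaults.conf lines) for one jar."""
    keys = ("spark.driver.extraClassPath", "spark.executor.extraClassPath")
    cli_args = []
    conf_lines = []
    for key in keys:
        # Command line form: --conf key=value
        cli_args += ["--conf", f"{key}={jar_path}"]
        # spark-defaults.conf form: key and value separated by whitespace
        conf_lines.append(f"{key} {jar_path}")
    return cli_args, conf_lines


# Made-up jar location, used only for illustration.
cli_args, conf_lines = extra_classpath_settings("/opt/jars/teradata-hadoop-connector.jar")
print(" ".join(cli_args))
print("\n".join(conf_lines))
```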

Error: Unrecognized option: --packages

I'm porting an existing script from BigInsights to Spark on Bluemix. I'm trying to run the following against Spark on Bluemix:
./spark-submit.sh --vcap ./vcap.json --deploy-mode cluster \
--master https://x.x.x.x:8443 --jars ./truststore.jar \
--packages org.elasticsearch:elasticsearch-spark_2.10:2.3.0 \
./export_to_elasticsearch.py ...
However, I get the following error:
Error: Unrecognized option: --packages
How can I pass the --packages parameter?

Bluemix uses a customized Spark version with a customized spark-submit.sh script that supports only a subset of the original script's parameters. You can see all the configuration properties and parameters you can use in its documentation.
Additionally, you can download the Bluemix version of the script from this link, and there you can see that there is no --packages argument.
Therefore, the problem with your approach is that the Bluemix version of spark-submit does not accept the --packages parameter, probably for security reasons. As an alternative, you can download the jar for the package you want (and maybe a fat jar for its dependencies) and upload them using the --jars parameter. Note: To avoid having to upload the jar files each time you call spark-submit, you can pre-upload them using curl. The details of this procedure can be found at this link.
Adding to Daniel's post: when using the pre-upload method, you might want to upload your package to "${cluster_master_url}/tenant/data/libs", since the Spark service sets these four Spark properties to ./data/libs/*: "spark.driver.extraClassPath", "spark.driver.extraLibraryPath", "spark.executor.extraClassPath", and "spark.executor.extraLibraryPath".
Reference: https://console.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index-gentopic3.html#spark-submit_properties

Microsoft Azure HDInsight - "Not Valid JAR"

I have got the following prompt (see attachment below) when I run an example from the Implementing Big Data Analysis course.
"Not a Valid JAR"
The command:
C:\apps\dist\hadoop-2.6.0.2.2.7.1-0004>hadoop jar hadoop-examples.jar wordcount /example/data/gutenberg/davici.txt /example/results
Please advise how to resolve this issue.
Thanks

The examples file was renamed when YARN was added in Hadoop 2.x (HDInsight 3.x). If you do a dir listing at the command prompt, you will see that it's now called hadoop-mapreduce-examples.jar, so the following command should work:
hadoop jar hadoop-mapreduce-examples.jar wordcount /example/data/gutenberg/davinci.txt /example/results
(you also had a typo in davinci.txt)
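If you are not sure what the examples jar is called on a given install, you can search for it instead of guessing. This is a minimal Python sketch, assuming the jar name still starts with "hadoop" and contains "examples". The helper find_examples_jars is hypothetical, and the demo runs on a throwaway directory with made-up file names rather than on a real Hadoop directory.

```python
import glob
import os
import tempfile


def find_examples_jars(directory):
    """Return the names of Hadoop examples jars in directory, sorted."""
    pattern = os.path.join(directory, "hadoop*examples*.jar")
    return sorted(os.path.basename(path) for path in glob.glob(pattern))


# Demo on a temporary directory; point the function at your Hadoop
# install directory instead to see the real jar name.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("hadoop-mapreduce-examples.jar", "hadoop-common.jar"):
        open(os.path.join(tmp, name), "w").close()
    found = find_examples_jars(tmp)

print(found)
```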

Resources