spark-avro dependency not found - apache-spark

How do I resolve this spark-avro dependency issue?
I am using Spark on Kubernetes. In the entrypoint.sh file, the spark-submit command is used to run the driver.
CMD=(
"$SPARK_HOME/bin/spark-submit"
--conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
--deploy-mode client
--packages org.apache.spark:spark-avro_2.12:3.2.0
--conf spark.driver.extraJavaOptions="-Divy.cache.dir=/tmp -Divy.home=/tmp"
"$#"
)
Sometimes the dependency gets resolved, but many times it throws the following error:
:: org.apache.spark#spark-avro_2.12;3.2.0: not found
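One way to make the resolution less flaky (it currently depends on every driver pod reaching Maven Central through Ivy at startup) is to download the jar once at image build time and pass it with --jars instead of --packages. This is only a sketch; the /opt/spark/extra-jars path is an assumption, use wherever you copy the jar in your image:
CMD=(
"$SPARK_HOME/bin/spark-submit"
--conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
--deploy-mode client
--jars "/opt/spark/extra-jars/spark-avro_2.12-3.2.0.jar"
"$@"
)
Copying the jar into $SPARK_HOME/jars inside the image should work as well, since everything in that directory is already on the driver and executor classpath; then neither --jars nor --packages is needed.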

Related

spark-atlas-connector: "SparkCatalogEventProcessor-thread" class not found exception

After following the instructions for spark-atlas-connector, I am getting the below error while running simple code to create a table in Spark.
Spark 2.3.1
Atlas 1.0.0
The batch command is:
spark-submit --jars /home/user/spark-atlas-connector/spark-atlas-connector-assembly/target/spark-atlas-connector-assembly-0.1.0-SNAPSHOT.jar \
--conf spark.extraListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
--conf spark.sql.queryExecutionListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
--conf spark.sql.streaming.streamingQueryListeners=com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker \
--files /home/user/atlas-application.properties \
--master local \
/home/user/SparkAtlas/test.py
Exception in thread "SparkCatalogEventProcessor-thread" java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/catalog/ExternalCatalogWithListener
    at com.hortonworks.spark.atlas.sql.SparkCatalogEventProcessor.process(SparkCatalogEventProcessor.scala:36)
    at com.hortonworks.spark.atlas.sql.SparkCatalogEventProcessor.process(SparkCatalogEventProcessor.scala:28)
    at com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(AbstractEventProcessor.scala:72)
    at com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(AbstractEventProcessor.scala:71)
    at scala.Option.foreach(Option.scala:257)
    at com.hortonworks.spark.atlas.AbstractEventProcessor.eventProcess(AbstractEventProcessor.scala:71)
    at com.hortonworks.spark.atlas.AbstractEventProcessor$$anon$1.run(AbstractEventProcessor.scala:38)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
Thanks in advance.
This is a clear indication of a jar version mismatch.
For the latest Atlas version, 2.0.0, these are the dependencies:
<spark.version>2.4.0</spark.version>
<atlas.version>2.0.0</atlas.version>
<scala.version>2.11.12</scala.version>
For Atlas 1.0.0, see its pom.xml; these are the dependencies:
<spark.version>2.3.0</spark.version>
<atlas.version>1.0.0</atlas.version>
<scala.version>2.11.8</scala.version>
Try using the correct versions of the jars by checking the pom.xml mentioned in the link.
Note:
1) If you add one jar at a time by reading each error and downloading the missing jar, you will just hit the next roadblock elsewhere. I advise you to use the correct versions from the start.
2) Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark 2.3.1 uses Scala 2.11, so you will need a compatible Scala version (2.11.x). Check your Scala version, as you have not mentioned it in the question.
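If you are not sure which Scala version your Spark build uses, a quick way to check (the exact output depends on your installation) is:
$SPARK_HOME/bin/spark-submit --version
The version banner includes a line like "Using Scala version 2.11.x ...", and that is the suffix your connector jars need to match (a spark-atlas-connector built for Scala 2.11 against a Spark compiled for 2.11, and so on).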

Unresolved dependency in spark-streaming-kafka-0-8_2.12;2.4.4

I use Spark 2.4.4 and I get an unresolved dependency error when I add the following package to spark-submit: spark-streaming-kafka-0-8_2.12:2.4.4
My submit code:
./bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.12:2.4.4
I had the same issue with Spark 2.4.4. I think the Scala suffix in the package coordinate is the problem: spark-streaming-kafka-0-8 is only published for Scala 2.11, not 2.12. So, use the following:
--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.4
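For completeness, a full submit with the corrected coordinate might look like this (my_streaming_job.py is just a placeholder for your own application):
./bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.4 my_streaming_job.py
Keep --packages before the application file; anything that comes after the application jar or script is passed to the application as ordinary arguments.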

(Zeppelin + Livy) SparkUI.appUIAddress(), something must be wrong

I'm trying to configure livy with Zeppelin following this docs:
https://zeppelin.apache.org/docs/0.7.3/interpreter/livy.html
However when I run:
%livy.spark
sc.version
I got the following error:
java.lang.RuntimeException: No result can be extracted from 'java.lang.NoSuchMethodException: org.apache.spark.ui.SparkUI.appUIAddress()', something must be wrong
I use Zeppelin 0.7.3, Spark 2.2.1, and Livy 0.4.0.
Spark runs on YARN (Hadoop 2.9.0). This is a vanilla install; I don't use a distribution like Cloudera/HDP. All of this software runs on one server.
I can run the example org.apache.spark.examples.SparkPi in spark-shell with --master yarn without any problem, so I can confirm that Spark is running well on YARN.
Any help would be appreciated.
Thanks,
yusata.
This problem results from a method removal in Spark 2.2: appUIAddress no longer exists in Spark 2.2.
As you can see in this PR, https://github.com/apache/zeppelin/pull/2231, this issue has already been fixed.
Somehow you still encounter the problem, so I think either downgrading Spark or using a newer version of Zeppelin could solve it.

how to use graphframes inside SPARK on HDInsight cluster

I have set up a Spark cluster on HDInsight and am trying to use GraphFrames following this tutorial.
I have already used the custom scripts during cluster creation to enable GraphX on the Spark cluster, as described here.
When I run the notebook,
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.graphframes._
I get the following error:
<console>:45: error: object graphframes is not a member of package org
import org.graphframes._
^
I tried to install GraphFrames from the Spark terminal via Jupyter using the following command:
$SPARK_HOME/bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.5
but still I am unable to get it working. I am new to Spark and HDInsight, so can someone please point out what else I need to install on this cluster to get this working?
Today, this works in spark-shell, but doesn't work in Jupyter notebooks. So when you run this:
$SPARK_HOME/bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.5
it works (at least on a Spark 1.6 cluster version) in the context of that spark-shell session.
But in Jupyter there is currently no way to load packages. This feature is going to be added to the Jupyter notebooks in the clusters soon. In the meantime you can use spark-shell, spark-submit, etc.
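Once the package has resolved inside that spark-shell session, a minimal smoke test looks roughly like this (the vertex and edge data are made up purely for illustration):
import org.graphframes.GraphFrame

// GraphFrame expects an "id" column on the vertices and "src"/"dst" columns on the edges
val vertices = sqlContext.createDataFrame(Seq(
  ("a", "Alice"),
  ("b", "Bob")
)).toDF("id", "name")

val edges = sqlContext.createDataFrame(Seq(
  ("a", "b", "follows")
)).toDF("src", "dst", "relationship")

val g = GraphFrame(vertices, edges)
g.inDegrees.show()  // if this prints a result, graphframes is on the classpath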
Once you upload or import the GraphFrames library from the Maven repository, you need to restart your cluster so the library gets attached.
After that it works for me.

Failed to load class for data source: com.databricks.spark.csv

My build.sbt file has this:
scalaVersion := "2.10.3"
libraryDependencies += "com.databricks" % "spark-csv_2.10" % "1.1.0"
I am running Spark in standalone cluster mode and my SparkConf is SparkConf().setMaster("spark://ec2-[ip].compute-1.amazonaws.com:7077").setAppName("Simple Application") (I am not using the setJars method; I'm not sure whether I need it).
I package the jar using the command sbt package. The command I use to run the application is ./bin/spark-submit --master spark://ec2-[ip].compute-1.amazonaws.com:7077 --class "[classname]" target/scala-2.10/[jarname]_2.10-1.0.jar.
On running this, I get this error:
java.lang.RuntimeException: Failed to load class for data source:
com.databricks.spark.csv
What's the issue?
Use the dependencies accordingly. For example:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>1.6.1</version>
</dependency>
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-csv_2.10</artifactId>
    <version>1.4.0</version>
</dependency>
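For reference, once the spark-csv dependency really is on the classpath (via --packages, --jars, or a fat jar), the data source is used like this from Scala in Spark 1.x; the file path is only an example:
// sqlContext is the SQLContext you already create alongside your SparkConf
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // first line contains column names
  .option("inferSchema", "true")  // let spark-csv guess column types
  .load("data/people.csv")

df.printSchema()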
Include the option --packages com.databricks:spark-csv_2.10:1.2.0, but put it after --class and before the application jar (target/...).
Add the --jars option and download the jars below from a repository such as search.maven.org:
--jars commons-csv-1.1.jar,spark-csv-csv.jar,univocity-parsers-1.5.1.jar \
Use the --packages option as claudiaann1 suggested also works if you have internet access without proxy. If you need to go through proxy, it won't work.
Here is the example that worked:
spark-submit --jars file:/root/Downloads/jars/spark-csv_2.10-1.0.3.jar,file:/root/Downloads/jars/commons-csv-1.2.jar,file:/root/Downloads/jars/spark-sql_2.11-1.4.1.jar --class "SampleApp" --master local[2] target/scala-2.11/my-proj_2.11-1.0.jar
Use the below command; it's working:
spark-submit --class ur_class_name --master local[*] --packages com.databricks:spark-csv_2.10:1.4.0 project_path/target/scala-2.10/jar_name.jar
Have you tried using the --packages argument with spark-submit? I've run into this issue with spark not respecting the dependencies listed as libraryDependencies.
Try this:
./bin/spark-submit --master spark://ec2-[ip].compute-1.amazonaws.com:7077 \
--packages com.databricks:spark-csv_2.10:1.1.0 \
--class "[classname]" target/scala-2.10/[jarname]_2.10-1.0.jar
From the Spark Docs:
Users may also include any other dependencies by supplying a comma-delimited list of maven coordinates with --packages. All transitive dependencies will be handled when using this command.
https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
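As an alternative to passing the dependency at submit time: sbt package only bundles your own classes, not the libraryDependencies, which is why the data source class is missing at runtime. A fat jar built with the sbt-assembly plugin carries spark-csv inside the application jar. A rough sketch, assuming sbt 0.13.x and the usual plugin coordinates; adjust versions to your build:
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt: mark Spark itself as provided so the assembly stays small
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.6.1" % "provided",
  "com.databricks"   %% "spark-csv"  % "1.4.0"
)
Then sbt assembly produces target/scala-2.10/[name]-assembly-[version].jar, which can be submitted without --packages or --jars.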
