Azure Blob Storage Spark

I'm trying to connect Spark to Azure Blob Storage (wasbs).
I added the following jars to the Hadoop classpath:
com.microsoft.azure_azure-storage-7.0.0.jar
org.apache.hadoop_hadoop-annotations-3.1.2.jar
org.apache.hadoop_hadoop-auth-3.1.2.jar
org.apache.hadoop_hadoop-azure-3.1.2.jar
org.apache.hadoop_hadoop-common-3.1.2.jar
org.eclipse.jetty_jetty-http-9.3.24.v20180605.jar
org.eclipse.jetty_jetty-io-9.3.24.v20180605.jar
org.eclipse.jetty_jetty-security-9.3.24.v20180605.jar
org.eclipse.jetty_jetty-server-9.3.24.v20180605.jar
org.eclipse.jetty_jetty-servlet-9.3.24.v20180605.jar
org.eclipse.jetty_jetty-webapp-9.3.24.v20180605.jar
org.eclipse.jetty_jetty-xml-9.3.24.v20180605.jar
and I run spark-submit with:
spark-submit --class mainClass --jars jars/org.apache.hadoop_hadoop-azure-3.1.2.jar,jars/com.microsoft.azure_azure-storage-7.0.0.jar,jars/org.apache.hadoop_hadoop-common-3.1.2.jar myjar.jar
and I get the following exception:
Exception in thread "main" java.lang.NoClassDefFoundError: org/eclipse/jetty/util/ajax/JSON$Convertor
If I remove hadoop-common from the spark-submit --jars, I get:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities
and if I pass --jars jars/* to include all the jar files along with jetty-util, I get:
java.lang.ClassNotFoundException: my.package.MainClass
I saw similar posts that point to multiple versions of Jetty on the classpath, but I can't find any other versions anywhere.

For the first exception, you're missing jetty-util:
https://mvnrepository.com/artifact/org.eclipse.jetty/jetty-util/9.3.24.v20180605
You should also verify that hadoop classpath returns what you expect.
For the remaining exceptions, verify that you can run hadoop fs -ls wasb://path on each potential Spark executor.
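A minimal shell sketch of those checks, assuming the jars/ layout and versions from the question; the container and account names below are placeholders:

# 1. Add the matching jetty-util (same Jetty version as the other jars).
curl -L -o jars/org.eclipse.jetty_jetty-util-9.3.24.v20180605.jar \
  https://repo1.maven.org/maven2/org/eclipse/jetty/jetty-util/9.3.24.v20180605/jetty-util-9.3.24.v20180605.jar

# 2. Put the directory on the Hadoop classpath and inspect the expanded result.
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$(pwd)/jars/*"
hadoop classpath --glob | tr ':' '\n' | grep -i -e jetty -e azure

# 3. Confirm the storage driver works outside Spark before submitting.
hadoop fs -ls "wasbs://<container>@<account>.blob.core.windows.net/"

# 4. Pass the jars as an explicit comma-separated list. A shell glob like
#    --jars jars/* expands to space-separated paths, so only the first jar is
#    read as the --jars value and the next one is treated as the application
#    jar, which is why the main class could not be found.
spark-submit --class mainClass \
  --jars jars/org.apache.hadoop_hadoop-azure-3.1.2.jar,jars/com.microsoft.azure_azure-storage-7.0.0.jar,jars/org.apache.hadoop_hadoop-common-3.1.2.jar,jars/org.eclipse.jetty_jetty-util-9.3.24.v20180605.jar \
  myjar.jar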

Related

spark2-submit throwing error with multiple packages (--packages)

I'm trying to submit the following Spark2 job on a CDH 5.16 cluster, and it only takes the first parameter of the --packages option and throws an error for the second parameter:
spark2-submit --packages com.databricks:spark-xml_2.11:0.4.1, com.databricks:spark-csv_2.11:1.5.0 /path/to/python-script
Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR com.databricks:spark-csv_2.11:1.5.0 with URI com.databricks. Please specify a class through --class.
at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
at org.apache.spark.deploy.SparkSubmitArguments.loadEnvironmentArguments(SparkSubmitArguments.scala:224)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:116)
at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$1.<init>(SparkSubmit.scala:911)
at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:911)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:81)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I'm running this job on a CDH 5.16 cluster where Spark was installed with the Spark2 CSD.
Thanks in advance.
Don't put a space between the packages:
spark2-submit --packages com.databricks:spark-xml_2.11:0.4.1,com.databricks:spark-csv_2.11:1.5.0 /path/to/python-script

When running a Spark job in a Hadoop cluster I am getting java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

My Scala code that connects to an HBase database works perfectly in my local IDE, but when I run the same code in the Hadoop cluster I get "Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration".
Please help me with this.
Add all the HBase library jars to HADOOP_CLASSPATH:
export HBASE_HOME="YOUR_HBASE_HOME_PATH"
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$HBASE_HOME/lib/*"
You can append any external jar you need to HADOOP_CLASSPATH, so you don't have to set it explicitly in the spark-submit command. All dependent jars will be loaded and made available to your Spark application.
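A hedged sketch of that flow; the HBase home path, main class, and application jar names below are placeholders rather than values from the question:

export HBASE_HOME="/usr/lib/hbase"
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$HBASE_HOME/lib/*"

# Confirm the HBase jars are now visible before submitting.
hadoop classpath --glob | tr ':' '\n' | grep -i hbase | head

# No HBase entries in --jars; the classes are picked up from the classpath.
spark-submit --master yarn --class com.example.MyHBaseJob my-app.jar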

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration while running Spark HBase using the Java API

I am running a simple application to get data from HBase in Spark using the Java API.
I'm running a spark-submit command like:
bin/spark-submit --master spark://192.168.43.75:7077 --class com.scry.NLPAnnotationController --driver-class-path /usr/lib/hbase/hbase-0.98.22-hadoop2/conf:$SPARK_HOME/lib_managed/jars/*.jar:$HBASE_CLASSPATH/*.jar --jars $SPARK_HOME/lib_managed/jars/*.jar:$HBASE_CLASSPATH/*.jar /home/deepak/hbase.jar
It gives an error like:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
Please help to resolve this issue.
Thanks in advance,
Deepak
In the console where you are running spark-submit, first execute the command below and then run spark-submit:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HBASE_CLASSPATH
If this works, then add this entry to your hadoop-env.sh.
Hope it helps...
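A hedged sketch of that suggestion; it assumes HBASE_CLASSPATH is already defined by your environment (if it isn't, the hbase tool can print one), and the hadoop-env.sh location varies by installation:

# Populate HBASE_CLASSPATH if it is not already set.
export HBASE_CLASSPATH="${HBASE_CLASSPATH:-$(hbase classpath)}"

# Run in the same shell that will launch spark-submit.
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$HBASE_CLASSPATH"

# If that resolves the NoClassDefFoundError, add the same export line to
# hadoop-env.sh so every session picks it up.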

--jars from different locations causes different jdbc behavior

When I load a MySQL JDBC driver by first copying it to the driver node and then including it via --jars /path/to/jdbc/driver.jar, referencing that JDBC driver and loading data into a DataFrame succeeds.
$ pyspark --jars /path/to/jdbc/driver.jar
>>> rdd = sqlContext.read.jdbc(url="jdbc:mysql://someAWSDatabase.us-west-2.rds.amazonaws.com:3306?user=root&password=somepassword", table="spark.test", properties={"driver":"com.mysql.jdbc.Driver"})
But if I load the jar from the publicly available https-hosted version of that exact jar file, it fails.
$ pyspark --jars https://s3/path/to/jdbc/driver.jar
>>> rdd = sqlContext.read.jdbc(url="jdbc:mysql://someAWSDatabase.us-west-2.rds.amazonaws.com:3306?user=root&password=somepassword", table="spark.test", properties={"driver":"com.mysql.jdbc.Driver"})
py4j.protocol.Py4JJavaError: An error occurred while calling o37.jdbc.
: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
...
According to the docs, you can submit jars from various locations, from local paths to http/https URLs, etc. Why would this cause different behavior?
Update: I also tried running two spark-submit jobs, one with each variant of the path to the JDBC jar. The https jar submission threw the same error as above.
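As a hedged workaround sketch (it sidesteps rather than explains the difference in behavior), fetch the remote jar to a local path first and pass that local path to --jars, matching the variant that works:

# Download the driver jar, then load it exactly like the local case above.
curl -L -o /tmp/mysql-connector-java.jar https://s3/path/to/jdbc/driver.jar
pyspark --jars /tmp/mysql-connector-java.jar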

spark-submit classpath issue with --repositories --packages options

I'm running Spark in a standalone cluster where the Spark master, worker, and submit each run in their own Docker container.
When I spark-submit my Java app with the --repositories and --packages options, I can see that it successfully downloads the app's required dependencies. However, the stderr log in the Spark worker's web UI reports a java.lang.ClassNotFoundException: kafka.serializer.StringDecoder. This class is available in one of the dependencies downloaded by spark-submit, but it doesn't look like it's available on the worker classpath.
16/02/22 16:17:09 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoClassDefFoundError: kafka/serializer/StringDecoder
at com.my.spark.app.JavaDirectKafkaWordCount.main(JavaDirectKafkaWordCount.java:71)
... 6 more
Caused by: java.lang.ClassNotFoundException: kafka.serializer.StringDecoder
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
The spark-submit call:
${SPARK_HOME}/bin/spark-submit --deploy-mode cluster \
--master spark://spark-master:7077 \
--repositories https://oss.sonatype.org/content/groups/public/ \
--packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0,org.elasticsearch:elasticsearch-spark_2.10:2.2.0 \
--class com.my.spark.app.JavaDirectKafkaWordCount \
/app/spark-app.jar kafka-server:9092 mytopic
I was working with Spark 2.4.0 when I ran into this problem. I don't have a solution yet, just some observations based on experimentation and reading around for solutions. I'm noting them down here in case they help someone with their investigation. I'll update this answer if I find more information later.
The --repositories option is required only if some custom repository has to be referenced.
By default, the Maven Central repository is used if the --repositories option is not provided.
When the --packages option is specified, the submit operation looks for the packages and their dependencies in the ~/.ivy2/cache, ~/.ivy2/jars, and ~/.m2/repository directories.
If they are not found there, they are downloaded from Maven Central using Ivy and stored under the ~/.ivy2 directory.
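A short sketch of where to look when debugging that resolution step; these are the default Ivy and Maven locations named above:

ls ~/.ivy2/jars/          # jars that spark-submit typically retrieves here
ls ~/.ivy2/cache/         # Ivy's own cache layout (organisation/module/...)
ls ~/.m2/repository/      # local Maven repository, consulted if it exists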
In my case I had observed that:
spark-shell worked perfectly with the --packages option.
spark-submit would fail to do the same: it would download the dependencies correctly but fail to pass the jars on to the driver and worker nodes.
spark-submit worked with the --packages option if I ran the driver locally using --deploy-mode client instead of cluster.
That runs the driver locally in the shell where I issued the spark-submit command, while the workers run on the cluster with the appropriate dependency jars.
I found the following discussion useful, but I still have to nail down this problem:
https://github.com/databricks/spark-redshift/issues/244#issuecomment-347082455
Most people just use an uber jar to avoid running into this problem, and even to avoid conflicting jar versions where a different version of the same dependency jar is provided by the platform.
But I don't like that idea as anything beyond a stopgap arrangement and am still looking for a solution.
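A hedged workaround sketch, not part of the answer above: resolve the packages once (for example with a client-mode or spark-shell run, which worked), then pass the cached jars, including transitive dependencies such as the kafka client that provides kafka.serializer.StringDecoder, explicitly via --jars so cluster mode no longer depends on --packages propagation:

# Join every jar Ivy retrieved into the comma-separated list --jars expects.
DEP_JARS=$(ls ~/.ivy2/jars/*.jar | paste -sd, -)

${SPARK_HOME}/bin/spark-submit --deploy-mode cluster \
  --master spark://spark-master:7077 \
  --jars "$DEP_JARS" \
  --class com.my.spark.app.JavaDirectKafkaWordCount \
  /app/spark-app.jar kafka-server:9092 mytopic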
