Spark jar package dependency file - apache-spark

I want to do some IP-to-location computation on Spark. After exploring the net, I found IPLocator https://github.com/miraclesu/IPLocator.
The IP-to-location lookup needs a file which contains the mapping information.
After packaging the jar, I can run it with plain local Java; the package just works with IPLocator.jar and qqwry.dat in the same directory.
But I want to use this jar with Spark. I tried passing --jars IPLocator.jar qqwry.dat when starting spark-shell, but after launching, the functions still cannot read the file.
The file-reading code looks like:
QQWryFile.class.getClassLoader().getResource("qqwry.dat")
I also tried packaging the qqwry.dat file into the jar, but that did not work either.

You need to use --files and then SparkFiles.get inside your program.
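For example, a minimal sketch, assuming the data file is shipped with --files and that the lookup code can be pointed at an explicit path instead of a classpath resource:
// Start the shell with both the jar and the data file:
//   spark-shell --jars IPLocator.jar --files qqwry.dat
import org.apache.spark.SparkFiles

// Resolve the local path where Spark has placed qqwry.dat
// (works on the driver and inside executor tasks)
val qqwryPath: String = SparkFiles.get("qqwry.dat")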

Try using a comma delimiter, and check whether IPLocator.jar and qqwry.dat are distributed to the Spark staging folder (.sparkStaging/application_xxx):
--jars IPLocator.jar,qqwry.dat
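For instance, you can list the staging directory on HDFS to verify (application_xxx stands for your actual application id):
hdfs dfs -ls .sparkStaging/application_xxx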

Related

Running Spark2.3 on Kubernetes with remote dependency on S3

I am running spark-submit to run on Kubernetes (Spark 2.3). My problem is that the InitContainer does not download my jar file if it's specified as an s3a:// path, but it does work if I put my jar on an HTTP server and use http://. The Spark driver fails, of course, because it can't find my class (and the jar file in fact is not in the image).
I have tried two approaches:
specifying the s3a path to jar as the argument to spark-submit and
using --jars to specify the jar file's location on s3a, but both fail in the same way.
Edit: using local:///home/myuser/app.jar also does not work, with the same symptoms.
On a failed run (dependency on s3a), I logged into the container and found the directory /var/spark-data/spark-jars/ to be empty. The init-container logs don't indicate any type of error.
Questions:
What is the correct way to specify remote dependencies on S3A?
Is S3A not supported yet? Only http(s)?
Any suggestions on how to further debug the InitContainer to determine why the download doesn't happen?

Spark Streaming reading from local file gives NullPointerException

Using Spark 2.2.0 on OS X High Sierra. I'm running a Spark Streaming application to read a local file:
val lines = ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/sampleFile")
lines.print()
This gives me
org.apache.spark.streaming.dstream.FileInputDStream logWarning - Error finding new files
java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
The file exists, and I am able to read it using the SparkContext (sc) from spark-shell in the terminal. For some reason it does not work when going through the IntelliJ application and Spark Streaming. Any ideas appreciated!
Quoting the doc comments of textFileStream:
Create an input stream that monitors a Hadoop-compatible filesystem
for new files and reads them as text files (using key as LongWritable, value
as Text and input format as TextInputFormat). Files must be written to the
monitored directory by "moving" them from another location within the same
file system. File names starting with . are ignored.
@param directory HDFS directory to monitor for new file
So the method expects a directory path as its parameter.
I believe this should avoid that error:
ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/")
Spark Streaming will not read old files, so first run the spark-submit command and then create the local file in the specified directory. Make sure that in the spark-submit command you give only the directory name, not the file name. Below is a sample command. Here, I am passing the directory name to the application as its first parameter; you can specify this path in your Scala program as well.
spark-submit --class com.spark.streaming.streamingexample.HdfsWordCount --jars /home/cloudera/pramod/kafka_2.12-1.0.1/libs/kafka-clients-1.0.1.jar --master local[4] /home/cloudera/pramod/streamingexample-0.0.1-SNAPSHOT.jar /pramod/hdfswordcount.txt
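For reference, a minimal sketch of what such a streaming job could look like (the class name HdfsWordCount is taken from the command above; the body is an assumption, not the original code):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HdfsWordCount {
  def main(args: Array[String]): Unit = {
    // args(0) is the directory to monitor (a directory, not a single file)
    val conf = new SparkConf().setAppName("HdfsWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = ssc.textFileStream(args(0))
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}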

Apache Spark not recognising import from external Jar

I am trying to adapt some code from Apache Zeppelin for a personal project. The idea is to pass Scala source code to be executed in Spark. Everything works fine until I try to use an external jar. For this, I call
SparkConf#setJars(externalJars);
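For illustration, a minimal sketch of this wiring (the lsa.jar path is taken from the log line below; the app name is a placeholder):
import org.apache.spark.SparkConf

// setJars distributes the listed jars to the cluster so executors can fetch them;
// the driver/REPL classpath is configured separately.
val conf = new SparkConf()
  .setAppName("embedded-interpreter") // placeholder name
  .setJars(Seq("/Users/.../lsa.jar"))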
And I can see in the logs that my jar was added:
Added JAR file:/Users/.../lsa.jar at spark://192.168.0.16:60376/jars/lsa.jar with timestamp 1470532825125
And when I check the UI of Spark http://192.168.0.16:4040/environment/ I can see my jar was added with an entry under Classpath Entries:
spark://192.168.0.16:60376/jars/lsa.jar
But when I try to import a class from the JAR I get:
<console>:25: error: object cloudera is not a member of package com
import com.cloudera.datascience.lsa._
^
Does anyone have an idea of what I am missing?
Edit: I also tried to add the JAR via the spark-defaults.conf:
spark.driver.extraClassPath /Users/.../lsa.jar
but no luck.
I can see here that the doc says:
Instead, please set this through the --driver-class-path command line option or in your default properties file.
I don't know where to pass this option. Should I do it for the master only, or for each slave?
Thanks in advance

Error while running Zeppelin paragraphs in Spark on Linux cluster in Azure HdInsight

I have been following this tutorial in order to set up Zeppelin on a Spark cluster (version 1.5.2) in HDInsight, on Linux. Everything worked fine, and I managed to successfully connect to the Zeppelin notebook through the SSH tunnel. However, when I try to run any kind of paragraph, the first time I get the following error:
java.io.IOException: No FileSystem for scheme: wasb
After getting this error, if I try to rerun the paragraph, I get another error:
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
These errors occur regardless of the code I enter, even if there is no reference to HDFS. In other words, I get the "No FileSystem" error even for a trivial Scala expression, such as parallelize.
Is there a missing configuration step?
I am downloading the tarball from the script that you pointed to as I type. But what I am guessing is that your Zeppelin install and Spark install are not set up to work with wasb. In order to get Spark to work with wasb, you need to add some jars to the classpath. To do this, add something like the following to your spark-defaults.conf (the paths might be different in HDInsight; this is from HDP on IaaS):
spark.driver.extraClassPath /usr/hdp/2.3.0.0-2557/hadoop/lib/azure-storage-2.2.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/lib/microsoft-windowsazure-storage-sdk-0.6.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/hadoop-azure-2.7.1.2.3.0.0-2557.jar
spark.executor.extraClassPath /usr/hdp/2.3.0.0-2557/hadoop/lib/azure-storage-2.2.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/lib/microsoft-windowsazure-storage-sdk-0.6.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/hadoop-azure-2.7.1.2.3.0.0-2557.jar
Once you have Spark working with wasb, the next step is to make those same jars available on the Zeppelin classpath. A good way to test your setup is to make a notebook that prints your environment variables and classpath:
sys.env.foreach(println(_))
val cl = ClassLoader.getSystemClassLoader
cl.asInstanceOf[java.net.URLClassLoader].getURLs.foreach(println)
Also, looking at the install script, it is trying to pull the Zeppelin jar from wasb; you might want to point that config somewhere else while you try some of these changes out (zeppelin.sh):
export SPARK_YARN_JAR=wasb:///apps/zeppelin/zeppelin-spark-0.5.5-SNAPSHOT.jar
I hope this helps. If you are still having problems I have some other ideas, but I would start with these first.

spark-submit to cloudera cluster can not find any dependent jars

I am able to do a spark-submit to my Cloudera cluster. The job dies after a few minutes with exceptions complaining that it cannot find various classes. These are classes that are in the Spark dependency path. I keep adding the jars one at a time using the command-line argument --jars, and the YARN log keeps dumping out the next jar it can't find.
What setting allows the spark/yarn job to find all the dependent jars?
I already set the "spark.home" property to the correct path: /opt/cloudera/parcels/CDH/lib/spark
I found it! Remove
.set("spark.driver.host", "driver computer ip address")
from your driver code.
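In other words, don't hard-code the driver host in the SparkConf; let spark-submit and the cluster manager fill those values in. A minimal sketch (the app name is a placeholder):
import org.apache.spark.{SparkConf, SparkContext}

// No spark.driver.host (and no hard-coded master) here;
// spark-submit supplies them at launch time.
val conf = new SparkConf().setAppName("my-app") // placeholder name
val sc = new SparkContext(conf)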
