Spark Streaming reading from local file gives NullPointerException - apache-spark

Using Spark 2.2.0 on OS X High Sierra. I'm running a Spark Streaming application to read a local file:
val lines = ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/sampleFile")
lines.print()
This gives me
org.apache.spark.streaming.dstream.FileInputDStream logWarning - Error finding new files
java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
The file exists, and I am able to read it using SparkContext (sc) from spark-shell on the terminal. For some reason, running it from an IntelliJ application with Spark Streaming does not work. Any ideas appreciated!

Quoting the doc comments of textFileStream:
Create an input stream that monitors a Hadoop-compatible filesystem
for new files and reads them as text files (using key as LongWritable, value
as Text and input format as TextInputFormat). Files must be written to the
monitored directory by "moving" them from another location within the same
file system. File names starting with . are ignored.
@param directory HDFS directory to monitor for new file
So the method expects the path to a directory, not to a file, as its argument.
Passing the parent directory should therefore avoid that error:
ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/")
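For context, here is a minimal sketch of a complete streaming job built around that call. The object name, the local[2] master, and the 5-second batch interval are assumptions chosen for illustration; only the directory path comes from the question.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DirectoryStreamSketch {
  // The parent folder from the question, not sampleFile itself.
  val watchedDir = "file:///Users/userName/Documents/Notes/MoreNotes/"

  def main(args: Array[String]): Unit = {
    // Assumed settings: local[2] master and a 5-second batch interval.
    val conf = new SparkConf().setMaster("local[2]").setAppName("DirectoryStreamSketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // textFileStream monitors the directory for new files and reads them as text.
    val lines = ssc.textFileStream(watchedDir)
    lines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With this in place, nothing is printed until a new file shows up in the monitored directory.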

Spark Streaming does not read files that already exist when the job starts, so run the spark-submit command first and then create the local file in the specified directory. Make sure the spark-submit command passes only the directory name, not the file name. Below is a sample command in which the path to monitor is passed as the first application argument. You can also specify this path in your Scala program.
spark-submit --class com.spark.streaming.streamingexample.HdfsWordCount --jars /home/cloudera/pramod/kafka_2.12-1.0.1/libs/kafka-clients-1.0.1.jar --master local[4] /home/cloudera/pramod/streamingexample-0.0.1-SNAPSHOT.jar /pramod/hdfswordcount.txt
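The doc comment quoted earlier says files must be "moved" into the monitored directory from another location on the same file system. A minimal sketch of one way to do that on a local file system is shown here, using only java.nio; the object name, the dropFile helper, and the staging directory are hypothetical and not part of Spark.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}

object DropFileSketch {
  /** Writes `content` into a staging directory, then moves the finished file
    * into `watchedDir` in a single step, so the file appears there complete. */
  def dropFile(stagingDir: Path, watchedDir: Path, name: String, content: String): Path = {
    val staged = stagingDir.resolve(name)
    Files.write(staged, content.getBytes(StandardCharsets.UTF_8))
    // ATOMIC_MOVE requires both directories to be on the same file system.
    Files.move(staged, watchedDir.resolve(name), StandardCopyOption.ATOMIC_MOVE)
  }
}
```

Calling dropFile while the streaming job is running should make the file appear in the watched directory fully written, rather than while it is still being filled.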

Related

Spark external jars and files on hdfs

I have a spark job that I run using the spark-submit command.
The jar that I use is hosted on hdfs and I call it from there directly in the spark-submit query using its hdfs file path.
Following the same logic, I'm trying to do the same for the --jars option, the --files option, and the extraClassPath option (in spark.conf), but there seems to be a problem when these point to an HDFS file path.
My command looks like this:
spark-submit \
--class Main \
--jars 'hdfs://path/externalLib.jar' \
--files 'hdfs://path/log4j.xml' \
--properties-file './spark.conf' \
'hdfs://path/job_name.jar'
Not only does Spark raise an exception saying it cannot find the method when I call something from externalLib.jar, but from the start I also get these warning logs:
Source and destination file systems are the same. Not copying externalLib.jar
Source and destination file systems are the same. Not copying log4j.xml
It must come from the fact that I specify an HDFS path, because it works flawlessly when I refer to those jars on the local file system.
Maybe it isn't possible? What can I do?

SparkContext.addFile uploads the file to the driver node but not the workers

I tried to run sc.textFile("file:///.../myLocalFile.txt") on a cluster and got a java.io.FileNotFoundException on the workers.
So I googled and found sc.addFile / SparkFiles.get, which should upload the file to each worker.
So here is my code:
sc.addFile("file:///.../myLocalFile.txt")
val input = sc.textFile(SparkFiles.get("myLocalFile.txt"))
I see that the driver node uploads the file to a directory in /tmp, and then my workers throw the FileNotFoundException because:
I don't see any output saying that the workers have downloaded the file, as they should have.
They try to access the file using the driver's path. So I assume SparkFiles.get() is run on the driver node, not on the workers (which I confirmed by adding a println).
I tried the spark-submit --files option and I see exactly the same problem.
So what am I doing wrong? All I want is to use sc.textFile() on a cluster.
You need to copy the files to the workers at the same path as on the driver, or use HDFS, since it is available on all workers. The workers don't have these files; you can go to the folder and see for yourself. I would scp them.
sc.addFile is not meant for this purpose. If you want to read files through sc, you need to put your file on HDFS instead of using sc.addFile.
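The question's observation that SparkFiles.get() ran on the driver suggests why the path was wrong: the call was evaluated before any task started. As a minimal sketch of what addFile is suited for, the helper below resolves the path inside a task instead, so each executor looks up its own copy. The object and method names are hypothetical, and this is meant for small side files read inside tasks, not as a replacement for sc.textFile on HDFS.

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}
import scala.io.Source

object AddFileSketch {
  /** Ships `localPath` with addFile and counts its lines from inside a task. */
  def countLinesOnExecutor(sc: SparkContext, localPath: String, fileName: String): Int = {
    sc.addFile(localPath)
    sc.parallelize(Seq(1), numSlices = 1).map { _ =>
      // Evaluated inside the task, so the path refers to the executor's copy.
      val source = Source.fromFile(SparkFiles.get(fileName))
      try source.getLines().size finally source.close()
    }.first()
  }
}
```

For input data that should become an RDD of lines, the advice above still applies: put the file on HDFS or at the same path on every node.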

Spark jar package dependency file

I want to do some IP-to-location computation on Spark. After searching the net, I found IPLocator: https://github.com/miraclesu/IPLocator
The IP-to-location lookup needs a file that contains the mapping information.
After packaging the jar, I can run it with local Java; it works as long as IPLocator.jar and qqwry.dat are in the same directory.
But I want to use this jar from Spark. I tried --jars IPLocator.jar qqwry.dat when starting spark-shell, but after launching, the functions still cannot read the file.
The file-reading code looks like this:
QQWryFile.class.getClassLoader().getResource("qqwry.dat")
I also tried packaging the qqwry.dat file into the jar, and it did not work.
You need to use --files and then SparkFiles.get inside your program.
Try a comma delimiter, and check whether IPLocator.jar and qqwry.dat are distributed to the Spark staging folder (.sparkStaging/application_xxx):
--jars IPLocator.jar,qqwry.dat

Where is Spark writing SaveAsTextFile in cluster?

I'm a bit at a loss here (Spark newbie). I spun up an EC2 cluster and submitted a Spark job that saves a text file in the last step. The code reads
reduce_tuples.saveAsTextFile('september_2015')
and the working directory of the Python file I'm submitting is /root. I cannot find a directory called september_2015, and if I try to run the job again I get the error:
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://ec2-54-172-88-52.compute-1.amazonaws.com:9000/user/root/september_2015 already exists
The EC2 address is the master node that I'm ssh'ing into, but I don't have a folder /user/root.
It seems Spark is creating the september_2015 directory somewhere, but find doesn't find it. Where does Spark write the resulting directory? Why is it pointing me to a directory that doesn't exist in the master node's file system?
You're not saving it to the local file system; you're saving it to the HDFS cluster. Try eph*-hdfs/bin/hadoop fs -ls /, and you should see your file. See eph*-hdfs/bin/hadoop help for more commands, e.g. -copyToLocal.

How to avoid "Not a file" exceptions when reading from HDFS with spark

I copy a tree of files from S3 to HDFS with S3DistCP in an initial EMR step. hdfs dfs -ls -R hdfs:///data_dir shows the expected files, which look something like:
/data_dir/year=2015/
/data_dir/year=2015/month=01/
/data_dir/year=2015/month=01/day=01/
/data_dir/year=2015/month=01/day=01/data01.12345678
/data_dir/year=2015/month=01/day=01/data02.12345678
/data_dir/year=2015/month=01/day=01/data03.12345678
The 'directories' are listed as zero-byte files.
I then run a spark step which needs to read these files. The loading code is thus:
sqlctx.read.json('hdfs:///data_dir', schema=schema)
The job fails with a Java exception:
java.io.IOException: Not a file: hdfs://10.159.123.38:9000/data_dir/year=2015
I had (perhaps naively) assumed that Spark would recursively descend the 'dir tree' and load the data files. If I point it to S3, it loads the data successfully.
Am I misunderstanding HDFS? Can I tell Spark to ignore zero-byte files? Can I use S3DistCp to flatten the tree?
In the Hadoop configuration of the current Spark context, enable "recursive" reads for the Hadoop InputFormat before you obtain the SQL context:
val hadoopConf = sparkCtx.hadoopConfiguration
hadoopConf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
This resolves the "Not a file" error.
Next, to read multiple files, see:
Hadoop job taking input files from multiple directories
or union the list of files into a single DataFrame:
Read multiple files from a directory using Spark
Problem solved with:
spark-submit ...
--conf spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive=true \
--conf spark.hive.mapred.supports.subdirectories=true \
...
In Spark 2.1.0, the parameters must be set this way:
.set("spark.hive.mapred.supports.subdirectories","true")
.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true")
