SparkContext.addFile uploads the file to the driver node but not to the workers

I tried to run sc.textFile("file:///.../myLocalFile.txt") on a cluster and got a java.io.FileNotFoundException on the workers.
So I googled and found sc.addFile / SparkFiles.get, which are supposed to upload the file to each worker.
So here is my code:
sc.addFile("file:///.../myLocalFile.txt")
val input = sc.textFile(SparkFiles.get("myLocalFile.txt"))
I see that the driver node uploads the file to a directory in /tmp, and then my workers get the FileNotFoundException because:
I don't see any printout saying that the workers have downloaded the file, as they should have.
They try to access the file with the driver's path. So I assume SparkFiles.get() is run on the driver node, not on the workers (which I confirmed by adding a println).
I tried the spark-submit --files option and I see exactly the same problem.
So what am I doing wrong? All I want is to run sc.textFile() on a cluster.

You need to copy the file to the workers at the same path as on the driver, or use HDFS, since it is available on all workers. The workers don't have these files; you can log in to a worker and check the folder yourself. I would scp them over.

sc.addFile is not meant for this purpose. If you want to read a file through sc, you need to put it on HDFS instead of using sc.addFile.
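For what sc.addFile is meant for, here is a minimal PySpark sketch (the Scala API has the same addFile / SparkFiles.get pair). The point is that SparkFiles.get is called inside the function that runs on the executors, so it resolves to each node's own copy instead of the driver's /tmp path. The helper name count_lines_with_addfile and the single-partition RDD are only illustrative, and the sketch assumes an already-created SparkContext.

```python
import os


def count_lines_with_addfile(sc, local_path):
    """Ship local_path with sc.addFile and count its lines inside a task."""
    # Imported here so this module can be loaded on a machine without Spark.
    from pyspark import SparkFiles

    sc.addFile(local_path)
    name = os.path.basename(local_path)

    def read_partition(_):
        # This runs on an executor, so SparkFiles.get returns the path of
        # the copy that was downloaded to that executor's node.
        with open(SparkFiles.get(name)) as handle:
            yield sum(1 for _ in handle)

    # One partition is enough to show the lookup happening in a task.
    return sc.parallelize([0], 1).mapPartitions(read_partition).collect()[0]
```

Called as count_lines_with_addfile(sc, "file:///.../myLocalFile.txt"), this reads the shipped copy inside a task. sc.textFile, by contrast, needs a path that every worker can resolve on its own, which is why HDFS or an identical local path on all nodes is suggested.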

Related

Hadoop copyToLocalFile failing in Yarn cluster mode

I was trying to copy a file to local from HDFS using Hadoop's copyToLocalFile function from my Spark2 application.
val hadoopConf = new Configuration()
val hdfs = FileSystem.get(hadoopConf)
val src = new Path("/user/yxs7634/all.txt")
val dest = new Path("file:///home/yxs7634/all.txt")
hdfs.copyToLocalFile(src, dest)
The above code works fine when I submit my Spark application in YARN client mode, but it keeps failing with the exception below in YARN cluster mode.
18/10/03 12:18:40 ERROR yarn.ApplicationMaster: User class threw exception: java.io.FileNotFoundException: /home/yxs7634/all.txt (Permission denied)
In yarn-cluster mode the driver is also handled by yarn and the selected driver node may not be the one where you're submitting the job. Hence for this job to work in yarn-cluster mode I believe you need to place the local file in all the spark nodes in the cluster.
In yarn mode, the spark job is submitted through YARN.
The driver would be started on a different node.
To tackle this issue, you can store your file on a distributed file system like HDFS and then give its absolute path.
eg:
val src = new Path("hdfs://nameservicehost:8020/user/yxs7634/all.txt")
It looks like the Spark application runs under one user (for example "spark"), while the destination in the code is in the directory of another user, "yxs7634".
In cluster mode the user "spark" is not allowed to write to the "yxs7634" user's directory, so this exception occurs.
The Spark user needs additional permission to write to "/home/yxs7634".
In local mode it worked fine because Spark ran as the "yxs7634" user.
You have a permission-denied error: the user you are using to submit the job cannot access the file. The directory should grant at least read permission to "other", for example: -rw-rw-r--
Can you paste the permissions of the directory and the file? The command is
hdfs dfs -ls /your-directory/
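The path in the exception is on the local file system of whichever node runs the driver, so the same check can be made there. A small Python sketch, assuming a POSIX file system; the helper names are made up for illustration:

```python
import os
import stat


def permission_string(path):
    """Return the mode the way `ls -l` prints it, e.g. '-rw-rw-r--'."""
    return stat.filemode(os.stat(path).st_mode)


def others_can_read(path):
    """True if the 'other' class has the read bit set on path."""
    return bool(os.stat(path).st_mode & stat.S_IROTH)
```

Note that creating a new file such as /home/yxs7634/all.txt additionally needs write and execute permission on the directory for the user the driver runs as.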

No such file or directory in spark cluster mode

I am writing a Spark Streaming application using PySpark which basically processes the data.
In short, a packaging overview:
This application contains several modules and some config files which are not .py files (e.g. .yaml or .json).
I am packaging this entire application into a package.zip file and submitting this package.zip to Spark.
Now the problem is that when I issue the spark-submit command in YARN cluster mode, I get an IOError. Below is the stack trace:
Traceback (most recent call last):
  File "main/main.py", line 10, in <module>
    import logger.logger
  File "package.zip/logger/logger.py", line 36, in get_logger
IOError: [Errno 2] No such file or directory: 'logger/config.yaml'
Spark-Command :
spark-submit --master yarn-cluster --py-files package.zip main/main.py
But when I submit the job in yarn-client mode, the application works as expected.
My understanding:
When I submit the job in client mode, the Spark driver runs on the same machine where I issued the command, and the package is distributed across all nodes.
When I issue the command in cluster mode, both the Spark driver and the application master run on a single node (which is not the client that submitted the code), and the package is still distributed to all nodes in the cluster.
In both cases package.zip is available to all nodes, so why are only the .py files loaded while the non-.py files fail to load in cluster mode?
Can anyone please help me understand the situation and resolve the problem?
Update:
Observations:
In client mode the zipped package is unzipped in the path where the driver script is running.
Whereas in cluster mode the zip package is shared across all nodes but not unzipped.
Do I need to unzip the package on all nodes?
Is there any way to tell Spark to unzip the package on the worker nodes?
You can pass your extra files with the --files option.
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-submit.html
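Another angle, as a hedged sketch: the traceback shows logger.py being imported from inside package.zip, and opening the relative path 'logger/config.yaml' looks on the file system, where the unzipped file does not exist in cluster mode. Reading the file through the package loader works for zipped and unzipped packages alike. This assumes Python 3 and that logger is a regular package with an __init__.py inside the zip; load_packaged_text is a hypothetical helper name.

```python
import pkgutil


def load_packaged_text(package, resource):
    """Read a data file shipped inside a package.

    Works whether the package lives in a plain directory or inside a
    zip archive on sys.path, because the lookup goes through the
    package's import loader rather than through open().
    """
    data = pkgutil.get_data(package, resource)
    if data is None:
        raise FileNotFoundError(
            "{} not found in package {}".format(resource, package)
        )
    return data.decode("utf-8")


# Inside logger/logger.py this could replace the relative open(), e.g.:
#   text = load_packaged_text("logger", "config.yaml")
```

The returned text can then be handed to whatever YAML or JSON parser the application already uses.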

Spark Streaming reading from local file gives NullPointerException

Using Spark 2.2.0 on OS X High Sierra. I'm running a Spark Streaming application to read a local file:
val lines = ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/sampleFile")
lines.print()
This gives me
org.apache.spark.streaming.dstream.FileInputDStream logWarning - Error finding new files
java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
The file exists, and I am able to read it using SparkContext (sc) from spark-shell in the terminal. For some reason, going through the IntelliJ application and Spark Streaming is not working. Any ideas appreciated!
Quoting the doc comments of textFileStream:
Create an input stream that monitors a Hadoop-compatible filesystem
for new files and reads them as text files (using key as LongWritable, value
as Text and input format as TextInputFormat). Files must be written to the
monitored directory by "moving" them from another location within the same
file system. File names starting with . are ignored.
@param directory HDFS directory to monitor for new file
So, the method expects the path to a directory in the parameter.
So I believe this should avoid that error:
ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/")
Spark Streaming will not read old files, so first run the spark-submit command and then create the local file in the specified directory. Make sure that in the spark-submit command you give only the directory name and not the file name. Below is a sample command. Here, I am passing the directory name to the application as its first parameter. You can specify this path in your Scala program as well.
spark-submit --class com.spark.streaming.streamingexample.HdfsWordCount --jars /home/cloudera/pramod/kafka_2.12-1.0.1/libs/kafka-clients-1.0.1.jar --master local[4] /home/cloudera/pramod/streamingexample-0.0.1-SNAPSHOT.jar /pramod/hdfswordcount.txt
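The "moving" requirement from the doc comment can be sketched in Python as follows: write the file somewhere else on the same file system, then rename it into the monitored directory so the stream only ever sees complete files. The function name publish_file and the choice of a sibling staging directory are assumptions for illustration.

```python
import os
import tempfile


def publish_file(monitored_dir, file_name, text):
    """Write text to a staging file, then move it into monitored_dir."""
    # Stage next to the monitored directory so both paths are on the same
    # file system and the final rename is a single atomic step.
    parent = os.path.dirname(os.path.abspath(monitored_dir))
    staging_dir = tempfile.mkdtemp(dir=parent)
    staged = os.path.join(staging_dir, file_name)
    with open(staged, "w") as handle:
        handle.write(text)

    target = os.path.join(monitored_dir, file_name)
    os.replace(staged, target)  # the file appears in one step, fully written
    os.rmdir(staging_dir)       # staging directory is empty again
    return target
```

The streaming job has to be running already when publish_file is called, since files that were in the directory before the start are not picked up.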

How to configure Executor in Spark Local Mode

In short:
I want to configure my application to use lz4 compression instead of snappy. What I did is:
session = SparkSession.builder()
.master(SPARK_MASTER) //local[1]
.appName(SPARK_APP_NAME)
.config("spark.io.compression.codec", "org.apache.spark.io.LZ4CompressionCodec")
.getOrCreate();
but looking at the console output, it's still using snappy in the executor
org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
and
[Executor task launch worker-0] compress.CodecPool (CodecPool.java:getCompressor(153)) - Got brand-new compressor [.snappy]
According to this post, what I did here only configures the driver, not the executor. The solution in the post is to change the spark-defaults.conf file, but I'm running Spark in local mode and I don't have that file anywhere.
Some more detail:
I need to run the application in local mode (for unit testing). The tests work fine locally on my machine, but when I submit them to a build engine (RHEL5_64), I get the error
snappy-1.0.5-libsnappyjava.so: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.9' not found
I did some research and it seems the simplest fix is to use lz4 instead of snappy as the codec, so I tried the above solution.
I have been stuck on this issue for several hours; any help is appreciated, thank you.
"what I did here only configure the driver, but not the executor."
In local mode there is only one JVM, which hosts both the driver and the executor threads.
"the spark-defaults.conf file, but I'm running spark in local mode, I don't have that file anywhere."
The mode is not relevant here. Spark in local mode uses the same configuration files. If you go to the directory where you store the Spark binaries, you should see a conf directory:
spark-2.2.0-bin-hadoop2.7 $ ls
bin conf data examples jars LICENSE licenses NOTICE python R README.md RELEASE sbin yarn
In this directory there is a bunch of template files:
spark-2.2.0-bin-hadoop2.7 $ ls conf
docker.properties.template log4j.properties.template slaves.template spark-env.sh.template
fairscheduler.xml.template metrics.properties.template spark-defaults.conf.template
If you want to set a configuration option, copy spark-defaults.conf.template to spark-defaults.conf and edit it according to your requirements.
Posting my solution here. @user8371915 did answer the question, but it did not solve my problem, because in my case I can't modify the property files.
What I ended up doing was adding another configuration:
session = SparkSession.builder()
.master(SPARK_MASTER) //local[1]
.appName(SPARK_APP_NAME)
.config("spark.io.compression.codec", "org.apache.spark.io.LZ4CompressionCodec")
.config("spark.sql.parquet.compression.codec", "uncompressed")
.getOrCreate();

Where is Spark writing SaveAsTextFile in cluster?

I'm a bit at a loss here (Spark newbie). I spun up an EC2 cluster and submitted a Spark job which saves a text file in the last step. The code reads
reduce_tuples.saveAsTextFile('september_2015')
and the working directory of the Python file I'm submitting is /root. I cannot find the directory called september_2015, and if I try to run the job again I get the error:
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://ec2-54-172-88-52.compute-1.amazonaws.com:9000/user/root/september_2015 already exists
The EC2 address is the master node that I'm ssh'ing into, but I don't have a folder /user/root there.
It seems like Spark is creating the september_2015 directory somewhere, but find doesn't find it. Where does Spark write the resulting directory? Why is it pointing me to a directory that doesn't exist in the master node's filesystem?
You're not saving it to the local file system; you're saving it to the HDFS cluster. Try eph*-hdfs/bin/hadoop fs -ls /, then you should see your file. See eph*-hdfs/bin/hadoop help for more commands, e.g. -copyToLocal.
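The path in the error message follows from how a scheme-less path is resolved: it is taken relative to the default file system, and a relative path lands under the user's HDFS home directory, /user/<username>. A rough Python illustration of that rule (this is not Hadoop code; resolve_output_path is a made-up helper, and the default file system and user are passed in as plain arguments):

```python
import posixpath
from urllib.parse import urlparse


def resolve_output_path(path, default_fs, user):
    """Mimic how a save path is qualified against the default file system."""
    if urlparse(path).scheme:
        # Already fully qualified, e.g. file:///... or hdfs://...
        return path
    if not path.startswith("/"):
        # Relative paths are resolved against the HDFS home directory.
        path = posixpath.join("/user", user, path)
    return default_fs.rstrip("/") + path


print(resolve_output_path(
    "september_2015",
    "hdfs://ec2-54-172-88-52.compute-1.amazonaws.com:9000",
    "root",
))
```

To write to a node's local disk instead, an explicit file:// URI can be passed, keeping in mind that on a cluster each executor then writes its partitions to its own local disk.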
