Spark saveAsNewAPIHadoopFile works on local mode but not on Cluster mode - apache-spark

After upgrading to CDH5.4 and Spark Streaming 1.3, I'm encountering a strange issue where saveAsNewAPIHadoopFile no longer saves files to HDFS as it's supposed to. I can see the _temp directory being generated, but when the save completes, _temp is removed, leaving the directory empty except for a _SUCCESS file. My feeling is that the files are generated, but they cannot be moved out of the _temp directory before it is deleted.
This issue only happens when running on the Spark cluster (standalone mode). If I run the job with local Spark, the files are saved as expected.
Some help would be appreciated.

Are you running this on your laptop/desktop?
One way this can happen is if the path you use for your output is a relative path on NFS. In that case, Spark assumes relative paths refer to hdfs:// rather than file:// and cannot write the output to disk.
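If you suspect the path is the issue, one thing worth trying is passing a fully qualified URI so Spark does not have to guess the filesystem. A minimal Scala sketch (the paths and the pair RDD here are hypothetical, not from the original job):

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// Hypothetical pair RDD; sc is an existing SparkContext.
val pairRdd = sc.parallelize(Seq(("k1", "v1"), ("k2", "v2")))
  .map { case (k, v) => (new Text(k), new Text(v)) }

// Fully qualified hdfs:// output path, so it cannot be misread as a local/relative path.
pairRdd.saveAsNewAPIHadoopFile(
  "hdfs:///user/me/output",
  classOf[Text],
  classOf[Text],
  classOf[TextOutputFormat[Text, Text]])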

Related

Why is the Spark .sparkStaging folder under HDFS when running Spark on YARN from a local machine?

I am trying to figure out why my Spark .sparkStaging folder defaults to being under the /user/name/ folder on my local HDFS. I never set a working directory for Spark at all, so why and how does it end up on HDFS? Which configuration sets that default? I checked the Spark environment in the UI tab and the YARN configuration, and I can't see anything that sets it. Can someone give me a hint?
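For what it's worth, when submitting to YARN, Spark uploads the application JAR and any --jars/--files to a .sparkStaging directory on the cluster's default filesystem, and by default that lands under the submitting user's HDFS home directory, i.e. /user/name/. On newer releases (Spark 2.0+, if I remember correctly) the location can be overridden with spark.yarn.stagingDir; a minimal sketch with a hypothetical path:

import org.apache.spark.sql.SparkSession

// Assumes Spark 2.0+ where spark.yarn.stagingDir is available; the path is hypothetical.
val spark = SparkSession.builder()
  .appName("staging-dir-example")
  .config("spark.yarn.stagingDir", "hdfs:///tmp/spark-staging")
  .getOrCreate()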

Prevent Spark from copying JAR dependencies to `work/` folder for each executor node

Is there a way to prevent Spark from automatically copying the JAR files specified via --jars in the spark-submit command to the work/ folder for each executor node?
My spark-submit command specifies all the JAR dependencies for the job like so:
spark-submit \
--master <master> \
--jars local:/<jar1-path>,local:/<jar2-path>... \
<application-jar> \
<arguments>
These JAR paths live on a distributed filesystem that is available in the same location on all the cluster nodes.
Now, according to the documentation:
Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes. This can use up a significant amount of space over time and will need to be cleaned up.
The last sentence is absolutely true. My JAR dependencies need to include some multi-gigabyte model files, and when I deploy my Spark job over 100 nodes, you can imagine that having 100 copies of these files wastes huge amounts of disk space, not to mention the time it takes to copy them.
Is there a way to prevent Spark from copying the dependencies? I'm not sure I understand why it needs to copy them in the first place, given that the JARs are accessible from each cluster node via the same path. There should not be a need to keep distinct copies of each JAR in each node's working directory.
That same Spark documentation mentions that
local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
...which is exactly how I'm referencing the JARs in the spark-submit command.
So, can Spark be prevented from copying all JARS specified via local:/... to the working directory of each cluster node? If so, how? If not, is there a reason why this copying must happen?
Edit: clarified that copies are per-node (not per-executor)
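One workaround that is sometimes suggested (not from this thread, just a hedged sketch) is to skip --jars entirely and put the shared paths directly on the classpath via spark.executor.extraClassPath, since, as far as I know, classpath entries are referenced in place rather than copied into each executor's work/ directory. The paths below are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical shared-filesystem paths, visible at the same location on every node.
val sharedJars = "/shared/jars/jar1.jar:/shared/jars/jar2.jar"

val conf = new SparkConf()
  .setAppName("no-jar-copy-example")
  // Executors prepend these entries to their classpath instead of fetching copies.
  .set("spark.executor.extraClassPath", sharedJars)
val sc = new SparkContext(conf)

// Note: the driver-side equivalent, spark.driver.extraClassPath, generally has to be set
// at submit time (spark-defaults.conf or --conf) because the driver JVM is already running.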

Writing file to unix directory using spark cluster mode

I have a Spark application that currently runs in local mode and writes its output to a file in a local UNIX directory.
Now I want to run the same job in YARN cluster mode and still write into that UNIX folder.
Can I use the same saveAsTextFile(path)?
Yes, you can, but it is not best practice. Spark itself can run standalone or on top of a distributed file system; the reason a distributed file system is used is that the input data is huge and the expected output may be huge as well.
So, if you are completely sure that the output will fit into your local file system, go for it; otherwise, save it to the distributed file system and copy it to local storage using the command below.
bin/hadoop fs -copyToLocal /hdfs/source/path /localfs/destination/path
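A minimal sketch of that two-step approach (the paths are hypothetical): write the result to HDFS from the cluster job, then pull it down to the UNIX directory with the hadoop fs command above.

// rdd is the job's result; writing to an hdfs:// path works the same in yarn cluster mode.
rdd.saveAsTextFile("hdfs:///user/me/job-output")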

How to load some files into Spark nodes without duplication?

I have some text files on the master server that I want a Spark cluster to process for some statistics.
For example, I have 1.txt, 2.txt, and 3.txt on the master server in a directory such as /data/, and I want the Spark cluster to process all of them exactly once. If I use sc.textFile("/data/*.txt") to load the files, the other nodes in the cluster cannot find them on their local file systems. However, if I use sc.addFile and SparkFiles.get to fetch them on each node, the three text files are downloaded to every node and all of them get processed multiple times.
How can I solve this without HDFS? Thanks.
According to the official documentation, just copy the files to all nodes.
http://spark.apache.org/docs/1.2.1/programming-guide.html#external-datasets
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
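So, without HDFS, a simple (if manual) sketch is to copy the /data/ directory to the same path on every worker first (or expose it via a shared/NFS mount) and then read it once with a file:// URI; the path below is the one from the question:

// Assumes /data/1.txt, 2.txt and 3.txt already exist at the same path on every worker node.
val lines = sc.textFile("file:///data/*.txt")
println(lines.count())   // each record is read and processed exactly once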

Unable to save RDD's and DF's in a Spark Cluster

When running in single-node (non-cluster) mode, whenever I do rdd.saveAsTextFile("file://...") or df.write().csv("file://...") it creates a folder at that path with part-files and a file called _SUCCESS.
But when I use the same code in cluster mode, it doesn't work. It doesn't throw any errors, but no part-files are created in that folder. The folder and the _SUCCESS file are created, yet the actual part-file data is not.
I am not sure what exactly the problem is here. Any suggestions on how to solve this are greatly appreciated.
Since in cluster mode the tasks run on the worker machines, you should save the file to HDFS, S3, or some file server (such as FTP) when running in cluster mode.
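A minimal sketch of that suggestion (paths and bucket name are hypothetical): point the writers at a distributed store that every executor can reach, instead of a file:// path on each worker's local disk.

// Assumes an existing rdd and df; hdfs:// (or s3a://) paths are visible to all executors.
rdd.saveAsTextFile("hdfs:///user/me/rdd-output")
df.write.csv("hdfs:///user/me/df-output")            // DataFrameWriter.csv, Spark 2.x
// or, with S3 credentials and the s3a connector configured:
// df.write.csv("s3a://my-bucket/df-output")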
