Does Spark use the same cache location for storing tmp files in each executor?
e.g., if I have two tasks running in one executor and both create a file with the same name, will one give a "file exists" error?
I got the answer from another source:
It does use the same cache location, as per its spark.local.dir
property (and java.io.tmpdir for Java temp files).
If by "create a file" you mean adding a file (addFile), then you
can work around this by also setting spark.files.overwrite to true,
which only takes effect if the newly added file differs from the
existing one.
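For illustration, a minimal PySpark sketch of that addFile scenario; the file name and scratch path below are hypothetical.

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("addfile-overwrite-demo")
    .config("spark.files.overwrite", "true")          # allow re-adding a changed file
    .config("spark.local.dir", "/tmp/spark-scratch")  # executor scratch location
    .getOrCreate()
)

# Distributes the file to every executor's local scratch space.
spark.sparkContext.addFile("/path/to/data.txt")

# Inside tasks, resolve the executor-local copy by name:
local_copy = SparkFiles.get("data.txt")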
Related
I am new to Spark. I am observing that after a Spark program completes, there are directories getting created with names like:
spark-3505c49a-0402-41ce-8187-c82ea7527e15
blockmgr-37b4b5e6-a97c-4779-9658-a19c194b9a2c
Please help me understand these directories: why are they getting created?
These are temporary directories. The spark- prefixed directory is used for multiple purposes, one of which is the ShuffleManager, while the blockmgr- prefixed directories are used by the BlockManager.
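If you want to control where these temporary directories are created, they follow Spark's scratch-space setting. A minimal sketch, assuming a hypothetical path /data/spark-tmp that exists on every node (note that some cluster managers override this with SPARK_LOCAL_DIRS or their own settings):

from pyspark.sql import SparkSession

# Redirect Spark's scratch space (where the spark-* and blockmgr-*
# directories are created) to a known location.
spark = (
    SparkSession.builder
    .appName("local-dir-demo")
    .config("spark.local.dir", "/data/spark-tmp")
    .getOrCreate()
)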
My goal is to build a daily process that will overwrite all partitions under a specific path in S3 with new data from a DataFrame.
I do -
df.write.format(source).mode("overwrite").save(path)
(Also tried the dynamic overwrite option).
However, in some runs the old data is not deleted, meaning I see files from an old date together with new files under the same partition.
I suspect it has something to do with runs that broke in the middle due to memory issues and left some corrupted files that the next run did not delete, but I couldn't reproduce it yet.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic") - option will keep your existing partition and overwriting a single partition. if you want to overwrite all existing partitions and keep the current partition then unset the above configurations. ( i tested in spark version 2.4.4)
I'm trying to write about 30k-60k parquet files to S3 using Spark, and it's taking a massive amount of time (40+ minutes) due to the S3 rate limit.
I wonder if there is a best practice for doing such a thing. I heard that writing the data to HDFS and then copying it using s3-dist-cp may be faster, but I can't understand why. Won't the copy from HDFS take the same amount of time because of the S3 rate limit?
Thanks for your help
There is nothing wrong with this approach, and it works absolutely fine in most use cases, but there can be some challenges due to the way files are written in S3.
Two Important Concepts to Understand
S3 (Object Store) != POSIX File System: the Rename Operation
A file rename in a POSIX-based file system is a metadata-only operation: only the pointer changes and the file remains as is on the disk. For example, if I have a file abc.txt and rename it to xyz.txt, the rename is instantaneous and atomic, and xyz.txt's last-modified timestamp remains the same as abc.txt's last-modified timestamp.
Whereas in AWS S3 (an object store), a file rename is under the hood a copy followed by a delete: the source object is first copied to the destination and then the source is deleted. So "aws s3 mv" changes the last-modified timestamp of the destination file, unlike a POSIX file system. S3 is essentially a key-value store where the key is the file path and the value is the content of the file, and there is no operation that simply changes the key in place. The rename time therefore depends on the size of the file. If a directory is renamed (there is nothing called a directory in S3; for simplicity we can treat a recursive set of objects under a prefix as a directory), then it depends on the number of files inside the directory along with the size of each file. In a nutshell, rename is a very expensive operation in S3 compared to a normal file system.
S3 Consistency Model
S3 comes with two kinds of consistency: (a) read-after-write and (b) eventual consistency, which in some cases results in file-not-found exceptions: files that have been added but are not yet listed, or files that have been deleted but are still listed.
Deep explanation:
Spark leverages Hadoop's FileOutputCommitter implementations to write data. Writing data involves multiple steps: at a high level, staging the output files and then committing them, i.e. writing the final files. The rename step I mentioned earlier happens here, from the staging location to the final location. As you know, a Spark job is divided into multiple stages and sets of tasks, and due to the nature of distributed computing tasks are prone to failure, so there is also provision to re-launch the same task after a system failure or to speculatively execute slow-running tasks. That leads to the concepts of the task commit and job commit functions. There are two readily available algorithms for how job and task commits are done; neither algorithm is better than the other in general, it depends on where we are committing data.
mapreduce.fileoutputcommitter.algorithm.version=1
commitTask renames the data generated by a task from the task's temporary directory to the job's temporary directory.
When all the tasks are complete, commitJob renames all the data from the job's temporary directory to the final destination and, at the end, creates the _SUCCESS file.
Here the driver does the work of commitJob at the end, so for object stores like S3 this may take a long time because a large number of task temporary files are queued up for rename operations (it is not serial, though), and write performance is not optimized. It can work pretty well for HDFS, since a rename is not expensive and is just a metadata change. For AWS S3, each file rename during commitJob issues a large number of API calls to S3 and might cause issues such as unexpectedly failing API calls if the number of files is high. It might not, too: I have seen both outcomes for the same job run at two different times.
mapreduce.fileoutputcommitter.algorithm.version=2
commitTask moves the data generated by a task from the task's temporary directory directly to the final destination as soon as the task is complete.
commitJob basically just writes the _SUCCESS file and doesn't do much else.
At a high level this looks optimized, but it comes with limitations: speculative task execution must not be used, and if any task fails due to corrupt data we might end up with residual data in the final destination that needs a cleanup. So this algorithm doesn't give 100% data correctness and doesn't work for use cases where we need to append data to existing files. Even though it ensures optimized results, it comes with a risk. The reason for the good performance is basically the smaller number of rename operations compared to algorithm 1 (there are still renames). We might also encounter file-not-found exceptions, because commitTask writes the files to a temporary path and immediately renames them, so there is a slight chance of eventual-consistency issues.
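For reference, a hedged sketch of how the algorithm version is typically selected from a Spark application; the property is a Hadoop setting, so it is forwarded with the spark.hadoop. prefix (or via --conf on spark-submit).

from pyspark.sql import SparkSession

# Version 1 is the safer default; version 2 does fewer renames at commit time.
spark = (
    SparkSession.builder
    .appName("committer-algorithm-demo")
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)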
Best Practices
Here are a few I think we can use while writing Spark data-processing applications:
If you have an HDFS cluster available, then write the data from Spark to HDFS and copy it to S3 to persist it. s3-dist-cp can be used to copy the data from HDFS to S3 optimally, and this way we avoid all those rename operations. With AWS EMR running only for the duration of the compute and then being terminated afterwards to persist the results, this approach looks preferable (a sketch follows after these practices).
Try to avoid writing files and reading them again and again unless there are consumers for those files. Spark is well known for in-memory processing, and careful in-memory persistence/caching of data will help optimize the application's run time.
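A minimal sketch of the first practice, with hypothetical HDFS and S3 paths; the s3-dist-cp step runs on the EMR master node after the Spark write finishes.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-then-s3").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# 1. Write to HDFS first: the committer's renames are cheap metadata operations here.
df.write.mode("overwrite").parquet("hdfs:///tmp/output/")   # hypothetical path

# 2. Then copy the finished files to S3 in one pass, e.g. from the EMR master:
#    s3-dist-cp --src hdfs:///tmp/output/ --dest s3://my-bucket/output/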
I just ran this:
dbutils.fs.ls("dbfs:/FileStore/")
I see this result:
[FileInfo(path='dbfs:/FileStore/import-stage/', name='import-stage/', size=0),
FileInfo(path='dbfs:/FileStore/jars/', name='jars/', size=0),
FileInfo(path='dbfs:/FileStore/job-jars/', name='job-jars/', size=0),
FileInfo(path='dbfs:/FileStore/plots/', name='plots/', size=0),
FileInfo(path='dbfs:/FileStore/tables/', name='tables/', size=0)]
Shouldn't there be something in filestore? I have hundreds of GB of data in a lake. I am having all kinds of problems getting Databricks to find these files. When I use Azure Data Factory, everything works perfectly fine. It's starting to drive me crazy!
For instance, when I run this:
dbutils.fs.ls("/mnt/rawdata/2019/06/28/parent/")
I get this message:
java.io.FileNotFoundException: File/6199764716474501/mnt/rawdata/2019/06/28/parent does not exist.
I have tens of thousands of files in my lake! I can't understand why I can't get a listing of these files!!
In Azure Databricks, this is expected behaviour.
For files, it displays the actual file size.
For directories, it displays size=0.
Example: in dbfs:/FileStore/ I have three files and three folders. Checking the file sizes using the Databricks CLI:
dbfs ls -l dbfs:/FileStore/
When you check out the result using dbutils as follows:
dbutils.fs.ls("dbfs:/FileStore/")
Important point to remember when working with files larger than 2GB:
Local file I/O APIs support only files less than 2GB in size. If you use local file I/O APIs to read or write files larger than 2GB, you might see corrupted files. Instead, access files larger than 2GB using the DBFS CLI, dbutils.fs, or Spark APIs, or use the /dbfs/ml folder described in Local file APIs for deep learning.
If you write a file using the local file I/O APIs and then immediately try to access it using the DBFS CLI, dbutils.fs, or Spark APIs, you might encounter a FileNotFoundException, a file of size 0, or stale file contents. That is expected because the OS caches writes by default. To force those writes to be flushed to persistent storage (in our case DBFS), use the standard Unix system call sync.
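As an illustration of that last point, a small sketch (the path is hypothetical) of writing through the local /dbfs mount and flushing before other APIs read the file:

import os

# Write through the local-file view of DBFS (the /dbfs FUSE mount).
with open("/dbfs/FileStore/tmp/example.txt", "w") as f:
    f.write("hello")

# Flush OS-buffered writes to persistent storage (DBFS) before reading the
# file back with the DBFS CLI, dbutils.fs, or Spark APIs.
os.sync()

# e.g. dbutils.fs.head("dbfs:/FileStore/tmp/example.txt") should now see it.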
There are multiple ways to solve this issue. You may check out a similar SO thread answered by me.
Hope this helps.
I'd like to run a Spark job that outputs to a directory whose name contains the day on which the job started. Is there a way to share a single date object (joda.time, for example) across all Spark nodes, so that no matter which node outputs which part, they all output to the same directory structure?
If the question is
Is there a way to share a single date object (joda.time for example)
in all spark nodes
then naturally the answer is "broadcast the object"
If the real question is how to specify the output path, then you really do not need to broadcast the path. You can just call rdd.saveAsTextFile(path) and the function will automatically dump each partition into its own file (named part-00000 and so on). Of course, all worker nodes must have access to the location specified by the path variable, so in a real cluster it has to be HDFS, S3, NFS, or the like.
From documentation:
saveAsTextFile(path)
Write the elements of the dataset as a text file (or set of text
files) in a given directory in the local filesystem, HDFS or any other
Hadoop-supported file system. Spark will call toString on each element
to convert it to a line of text in the file.
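A minimal sketch of that, with a hypothetical output location that all workers can reach:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("save-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(["alpha", "beta", "gamma"], numSlices=3)

# Each of the 3 partitions is written as its own part-0000N file
# under the given directory; no broadcasting of the path is needed.
rdd.saveAsTextFile("hdfs:///tmp/words/")   # hypothetical, reachable by all workers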
Simply create the object in your driver program (as a val) and close over it where you need it. It should be copied over to the worker nodes to be used as you need.
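For the original question, a sketch of that approach in PySpark (datetime stands in for joda.time here, and the output root is hypothetical): the date is evaluated once in the driver and simply closed over.

import datetime

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-output").getOrCreate()
sc = spark.sparkContext

# Evaluated exactly once, in the driver, when the job starts.
run_day = datetime.date.today().isoformat()      # e.g. "2019-06-28"
output_dir = f"hdfs:///jobs/daily/{run_day}/"    # hypothetical root

rdd = sc.parallelize(range(100), numSlices=4)

# Every partition, no matter which worker writes it, lands under the same
# date-stamped directory, because the path string was built in the driver.
rdd.map(str).saveAsTextFile(output_dir)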