Change spark _temporary directory path - apache-spark

Is it possible to change the _temporary directory where Spark saves its temporary files before writing?
In particular, since I am writing single partitions of a table, I would like the temporary folder to be within the partition folder.
Is it possible?

There is no way to do this with the default FileOutputCommitter because of how it is implemented: FileOutputCommitter creates a ${mapred.output.dir}/_temporary subdirectory where the files are written and, after being committed, moved to ${mapred.output.dir}.
At the end of the job, the entire temporary folder is deleted. When two or more Spark jobs share the same output directory, they will inevitably delete each other's files.
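The failure mode described above can be reproduced without Spark. The following is a minimal local-filesystem sketch, not the committer's actual code: the helper names (write_task_file, commit_job, simulate_conflict) and the file names are hypothetical, and the only behaviour taken from the answer is that all jobs stage under the same _temporary folder and that committing deletes the whole folder.

```python
import shutil
import tempfile
from pathlib import Path


def write_task_file(output_dir, name, data):
    """Stage a file under the shared <output_dir>/_temporary directory."""
    staging = Path(output_dir) / "_temporary"
    staging.mkdir(parents=True, exist_ok=True)
    path = staging / name
    path.write_text(data)
    return path


def commit_job(output_dir, names):
    """Move this job's staged files into place, then drop the staging dir."""
    output_dir = Path(output_dir)
    staging = output_dir / "_temporary"
    for name in names:
        shutil.move(str(staging / name), str(output_dir / name))
    # As described for the default committer, cleanup removes everything
    # under _temporary, including files staged by other jobs.
    shutil.rmtree(staging)


def simulate_conflict():
    """Return (job A output exists, job B staged file still exists)."""
    with tempfile.TemporaryDirectory() as root:
        out = Path(root) / "table"
        write_task_file(out, "part-a", "job A")
        staged_b = write_task_file(out, "part-b", "job B")
        commit_job(out, ["part-a"])  # job A finishes first
        return (out / "part-a").exists(), staged_b.exists()


if __name__ == "__main__":
    print(simulate_conflict())
```

Job A's output ends up in place, while the file job B had staged is gone by the time job B tries to commit it. Giving each job its own staging directory name removes the overlap.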
Eventually, I downloaded org.apache.hadoop.mapred.FileOutputCommitter and org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter (you can name your copy YourFileOutputCommitter) and made some changes that allow the _temporary directory to be renamed.
In your driver, you'll have to add the following code:
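import org.apache.hadoop.mapred.JobConf

// YourFileOutputCommitter is your modified copy of FileOutputCommitter
val conf: JobConf = new JobConf(sc.hadoopConfiguration)
conf.setOutputCommitter(classOf[YourFileOutputCommitter])
// update temporary path for committer
YourFileOutputCommitter.tempPath = "_tempJob1"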
Note: it's better to use MultipleTextOutputFormat to rename the output files, because two jobs that write to the same location can overwrite each other's files.
Update
I've written a short post on our tech blog with more details:
https://www.outbrain.com/techblog/2020/03/how-you-can-set-many-spark-jobs-write-to-the-same-path/

Related

Change temporary path for individual job from spark code

I have multiple jobs that I want to execute in parallel, all appending daily data to the same path using dynamic partitioning.
The problem I am facing is the temporary path that Spark creates during job execution. Multiple jobs end up sharing the same temp folder and conflict with each other: one job can delete temp files, and another job then fails with an error saying that an expected temp file doesn't exist.
Can we change the temporary path for an individual job, or is there an alternative way to avoid the issue?
To change the temp location you can do this:
/opt/spark/bin/spark-shell --conf "spark.local.dir=/local/spark-temp"
spark.local.dir changes where all temp files are read from and written to. I would advise creating this location and opening up its permissions from the command line before running the first session with this argument.

Spark: How to overwrite data in partitions but not the root folder while saving to disk?

Regarding the following code:
spark.sql(sqlStatement).write.partitionBy("city", "dataset", "origin").mode(SaveMode.Overwrite).parquet(rootPath)
It deletes everything under the rootPath before writing data to it. If the code is changed to:
spark.sql(sqlStatement).write.partitionBy("city", "dataset", "origin").mode(SaveMode.Append).parquet(rootPath)
then it does not delete anything. What we want is a mode that will not delete the data under rootPath but delete the data under a city/dataset/origin before writing to it. How can this be done?
Try the basePath option. Partition discovery will then only look at children of '/city/dataset/origin'.
According to the documentation:
Spark SQL’s partition discovery has been changed to only discover partition directories that are children of the given path. (i.e. if path="/my/data/x=1" then x=1 will no longer be considered a partition but only children of x=1.) This behavior can be overridden by manually specifying the basePath that partitioning discovery should start with (SPARK-11678).
spark.sql(sqlStatement)
  .write.partitionBy("city", "dataset", "origin")
  .option("basePath", "/city/dataset/origin")
  .mode(SaveMode.Append).parquet(rootPath)
Let me know if this doesn't work and I'll remove my answer.
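Have a look at the spark.sql.sources.partitionOverwriteMode="dynamic" setting, which was introduced in Spark 2.3.0. With it, SaveMode.Overwrite replaces only the partitions that receive new data instead of clearing everything under rootPath.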
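As a rough PySpark sketch of that setting in use: the function name overwrite_touched_partitions is hypothetical, spark is assumed to be an existing SparkSession, df the DataFrame to write, and the partition columns are the ones from the question.

```python
def overwrite_touched_partitions(spark, df, root_path):
    """Overwrite only the partitions present in df, leaving the rest of root_path."""
    # Requires Spark 2.3.0 or later; the default value of this setting is "static".
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    (
        df.write
        .partitionBy("city", "dataset", "origin")
        .mode("overwrite")
        .parquet(root_path)
    )
```

It would be called as overwrite_touched_partitions(spark, spark.sql(sqlStatement), rootPath). Note that the setting is changed on the session, so it also applies to later writes in the same session unless it is set back.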

Junk Spark output file on S3 with dollar signs

I have a simple Spark job that reads a file from S3, takes five records, and writes them back to S3.
What I see is that there is always an additional object in S3, next to my output "directory", called output_$folder$.
What is it? How can I prevent Spark from creating it?
Here is some code to show what I am doing:
x = spark.sparkContext.textFile("s3n://.../0000_part_00")
five = x.take(5)
five = spark.sparkContext.parallelize(five)
five.repartition(1).saveAsTextFile("s3n://prod.casumo.stu/dimensions/output/")
After the job I have an S3 "directory" called output, which contains the results, and another S3 object called output_$folder$, and I don't know what it is.
Changing S3 paths in the application from s3:// to s3a:// seems to have done the trick for me. The $folder$ files are no longer getting created since I started using s3a://.
OK, it seems I found out what it is.
It is some kind of marker file, probably used to determine whether the S3 directory object exists.
How did I reach this conclusion?
First, I found this link, which shows the source of the org.apache.hadoop.fs.s3native.NativeS3FileSystem#mkdir method: http://apache-spark-user-list.1001560.n3.nabble.com/S3-Extra-folder-files-for-every-directory-node-td15078.html
Then I searched other source repositories to see whether I could find a different version of the method. I didn't.
Finally, I ran an experiment: I reran the same Spark job after removing the S3 output directory object but leaving the output_$folder$ file. The job failed, saying that the output directory already exists.
My conclusion: this is Hadoop's way of knowing whether a directory with a given name exists in S3, and I will have to live with that.
All of the above happens when I run the job from my local dev machine, i.e. my laptop. If I run the same job from an AWS Data Pipeline, output_$folder$ does not get created.
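If the markers only get in the way when listing results, they can be filtered out by their suffix. This is a small sketch over a plain list of object keys; the function name split_marker_keys and the example keys are made up for illustration, and it does not talk to S3.

```python
FOLDER_MARKER_SUFFIX = "_$folder$"


def split_marker_keys(keys):
    """Split object keys into (data keys, folder-marker keys)."""
    data, markers = [], []
    for key in keys:
        if key.endswith(FOLDER_MARKER_SUFFIX):
            markers.append(key)
        else:
            data.append(key)
    return data, markers


if __name__ == "__main__":
    listing = [
        "dimensions/output_$folder$",
        "dimensions/output/part-00000",
        "dimensions/output/_SUCCESS",
    ]
    print(split_marker_keys(listing))
```

Given the experiment above, where the leftover marker alone made a job fail, ignoring the markers when reading is probably safer than deleting them while jobs may still be running.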
s3n:// and s3a:// don't generate a marker object like <output>_$folder$.
If you are using Hadoop on AWS EMR, I found that moving from s3 to s3n is straightforward since they both use the same file system implementation, whereas s3a involves AWS credential-related code changes.
('fs.s3.impl', 'com.amazon.ws.emr.hadoop.fs.EmrFileSystem')
('fs.s3n.impl', 'com.amazon.ws.emr.hadoop.fs.EmrFileSystem')
('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')

Spark job keeps having output folder already exists exception

I am running a Spark job, and it keeps failing with "output folder already exists" exceptions. I did remove the output folder before the job. It looks like the folder is created during the job and this confuses other nodes/threads. It happens randomly, not every time.
rdd.write().format("parquet").mode(SaveMode.Overwrite).save("location");
This should solve the issue of file already exists.
If you are using a local filesystem path, then be aware that the folder gets created on all workers. So you probably have to delete it from all of them.

In hive how to insert data into a single file

This works:
INSERT OVERWRITE DIRECTORY 'wasb:///hiveblob/' SELECT * from table1;
but when we give a command like
INSERT OVERWRITE DIRECTORY 'wasb:///hiveblob/sample.csv' SELECT * from table1;
it fails with:
Failed with exception Unable to rename: wasb://incrementalhive-1#crmdbs.blob.core.windows.net/hive/scratch/hive_2015-06-08_10-01-03_930_4881174794406290153-1/-ext-10000 to: wasb:/hiveblob/sample.csv
So, is there any way to insert the data into a single file?
I don't think you can tell Hive to write to a specific file like wasb:///hiveblob/foo.csv directly.
What you can do is:
1. Tell Hive to merge the output files into one before you run the query. This way you can have as many reducers as you want and still get a single output file.
2. Run your query, e.g. INSERT OVERWRITE DIRECTORY ...
3. Use dfs -mv within Hive to rename the file to whatever you want.
This is probably less painful than using a separate hadoop fs -getmerge /your/src/folder /your/dest/folder/yourFileName as suggested by Ramzy.
How you instruct Hive to merge the files may differ depending on the execution engine you are using.
For example, if you use tez as the runtime engine in your hive queries, you can do this:
-- Set the tez execution engine
-- And instruct to merge the results
set hive.execution.engine=tez;
set hive.merge.tezfiles=true;
-- Your query goes here.
-- The results should end up in wasb:///hiveblob/000000_0 file.
INSERT OVERWRITE DIRECTORY 'wasb:///hiveblob/' SELECT * from table1;
-- Rename the output file into whatever you want
dfs -mv 'wasb:///hiveblob/000000_0' 'wasb:///hiveblob/foo.csv'
(The above worked for me with these versions: HDP 2.2, Tez 0.5.2, and Hive 0.14.0)
For MapReduce engine (which is the default), you can try these, although I haven't tried them myself:
-- Try this if you use MapReduce engine.
set hive.execution.engine=mr;
set hive.merge.mapredfiles=true;
You can coerce Hive to build one file by forcing the number of reducers to one. This will copy any fragmented files in one table and combine them in another location in HDFS. Of course, forcing one reducer forfeits the benefit of parallelism. If you plan on doing any transformation of the data, I recommend doing that first and then doing this as a last, separate phase.
To produce a single file using Hive you can try:
set hive.exec.dynamic.partition.mode=nostrict;
set hive.exec.compress.intermediate=false;
set hive.exec.compress.output=false;
set hive.exec.reducers.max=1;
create table if not exists db.table
stored as textfile as
select * from db.othertable;
db.othertable is the table that has multiple fragmented files. db.table will have a single text file containing the combined data.
By default you will get multiple output files, equal to the number of reducers, which Hive decides. However, you can configure the number of reducers. Performance can take a hit if you reduce the number of reducers, and execution will take longer. Alternatively, once the files are present, you can use getmerge to combine all of them into one file:
hadoop fs -getmerge /your/src/folder /your/dest/folder/yourFileName
The src folder contains all the files to be merged.
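Conceptually, the merge step just concatenates the part files. The following local-filesystem sketch illustrates that idea; it is not how hadoop fs implements it, and the function name merge_part_files and the choice to merge in file-name order are assumptions.

```python
import shutil
from pathlib import Path


def merge_part_files(src_dir, dest_file):
    """Concatenate every regular file in src_dir, in name order, into dest_file.

    dest_file should live outside src_dir. Returns the number of files merged.
    """
    parts = sorted(p for p in Path(src_dir).iterdir() if p.is_file())
    with open(dest_file, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return len(parts)
```

For example, merge_part_files("/your/src/folder", "/your/dest/folder/yourFileName") would combine local copies of 000000_0, 000001_0 and so on into one file. Header rows or compressed part files would need extra handling that this sketch leaves out.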
