How to save files in the same directory using saveAsNewAPIHadoopFile (Spark, Scala)

I am using Spark Streaming and want to save each batch to my local filesystem in Avro format. I use saveAsNewAPIHadoopFile to write the data as Avro, and this works, but each new batch overwrites the files written by the previous one. Is there a way to save the Avro files of every batch in a common directory? I tried setting some Hadoop job configuration properties to add a prefix to the file name, but none of them had any effect.
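import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyOutputFormat
import org.apache.hadoop.io.NullWritable

dstream.foreachRDD { rdd =>
  // Every batch is written to the same `path`
  rdd.saveAsNewAPIHadoopFile(
    path,
    classOf[AvroKey[T]],
    classOf[NullWritable],
    classOf[AvroKeyOutputFormat[T]],
    job.getConfiguration()
  )
}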

You can split the process into two steps:
Step 1: Write the Avro file to <temp-path> using saveAsNewAPIHadoopFile.
Step 2: Move the file from <temp-path> to <actual-target-path>.
This should work around the problem for now. I will share an update if I find a way to do this in one step instead of two.
Hope this helps.
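The two steps above can be sketched on a local filesystem as follows. This is a minimal Python sketch, not Spark API: publish_batch and its parameters are hypothetical names, and it assumes the batch has already been written to its own temporary directory (for example by saveAsNewAPIHadoopFile) as part-* files. Each part file is moved into the common directory under a batch-specific prefix, so later batches do not collide with earlier ones.

```python
import shutil
from pathlib import Path


def publish_batch(temp_dir, target_dir, batch_id):
    """Move the part files of one batch into a shared target directory.

    temp_dir   -- directory the batch was written to (step 1)
    target_dir -- common directory that collects all batches (step 2)
    batch_id   -- any value unique per batch, e.g. the batch time in ms
    """
    temp = Path(temp_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    moved = []
    # Only data files are moved; markers such as _SUCCESS are left behind.
    for part in sorted(temp.glob("part-*")):
        destination = target / f"batch-{batch_id}-{part.name}"
        shutil.move(str(part), str(destination))
        moved.append(destination)

    # The temporary directory is no longer needed once its parts are moved.
    shutil.rmtree(temp, ignore_errors=True)
    return moved
```

In a streaming job the batch time would be a natural choice for batch_id, since it is unique per batch.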

Related

Spark overwrite parquet files on aws s3 raise URISyntaxException: Relative path in absolute URI

I am using Spark to write and read Parquet files on AWS S3. My Parquet files are stored in
's3a://mybucket/file_name.parquet/company_name=company_name/record_day=2019-01-01 00:00:00'
partitioned by 'company_name' and 'record_day'.
I want to write a basic pipeline that updates my Parquet files on a regular basis by 'record_day'. To do this, I am going to use overwrite mode:
df.write.mode('overwrite').parquet("s3a://mybucket/file_name.parquet/company_name=company_name/record_day=2019-01-01 00:00:00")
But I get an unexpected error: 'java.net.URISyntaxException: Relative path in absolute URI: key=2019-01-01 00:00:00'.
I spent several hours searching for the cause but found no solution. As a test, I replaced 'overwrite' with 'append', and everything works fine. I also made a simple DataFrame, and overwrite mode works fine on it too. I know that I can solve my problem in a different way, by deleting and then rewriting that particular part, but I would like to understand what causes the error.
Spark 2.4.4 Hadoop 2.8.5
Appreciate any help.
I had the same error, and my solution was to remove the ':' from the date.
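One way to apply this is to format the timestamp without colons before using it as a partition value. A minimal Python sketch follows; partition_safe_day is a hypothetical helper, and the chosen output format is an assumption, since any colon-free format would do.

```python
from datetime import datetime


def partition_safe_day(value):
    """Reformat 'YYYY-MM-DD HH:MM:SS' so it contains no ':' and no spaces."""
    parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    # A compact ISO-like form is still sortable and safe in a path segment.
    return parsed.strftime("%Y-%m-%dT%H%M%S")
```

For the value in the question, partition_safe_day("2019-01-01 00:00:00") returns 2019-01-01T000000.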

Pyspark writing out to partitioned parquet using s3a issue

I have a PySpark script which reads an unpartitioned single Parquet file from S3, does some transformations, and writes the result back to another S3 bucket, partitioned by date.
I'm using s3a for both the read and the write. Reading the files and performing the transformations works without problems. However, when I try to write the partitioned output to S3 using s3a, it throws the following error:
WARN s3a.S3AFileSystem: Found file (with /): real file? should not
happen: folder1/output
org.apache.hadoop.fs.FileAlreadyExistsException: Can't make directory
for path 's3a://bucket1/folder1/output' since it is a file.
The part of the code I'm using to write is as follows; I'm trying to append to the existing directory, adding a new partition for the new date:
output_loc = "s3a://bucket1/folder1/output/"
finalDf.write.partitionBy("date", "advertiser_id") \
.mode("append") \
.parquet(output_loc)
I'm using Hadoop v3.0.0 and Spark 2.4.1.
Has anyone come across this issue when using s3a instead of s3n? By the way, it works fine on an older instance using s3n.
Thanks
There is an entry in your bucket at s3a://bucket1/folder1/output/ (with the trailing slash) whose size is greater than 0. S3A warns about it because an object with a trailing slash is treated as an empty-directory marker, which is at risk of deletion once you add files underneath it.
Look in the S3 bucket from the AWS console, see what is there, and delete it.
Try using output_loc without the trailing / to see if that helps (unlikely).
Add a follow-up on the outcome; if the delete doesn't fix things, a Hadoop JIRA may be worth filing.

HDFS and Spark: Best way to write a file and reuse it from another program

I have some results from a Spark application saved in HDFS as files called part-r-0000X (X = 0, 1, etc.). Because I want to join the whole content into a single file, I'm using the following command:
hdfs dfs -getmerge srcDir destLocalFile
This command is used in a bash script which empties the output directory (where the part-r-... files are saved) and, inside a loop, executes the getmerge command above.
The problem is that I need to use the resulting file in another Spark program, which needs the merged file as input in HDFS. So I save it locally and then upload it to HDFS.
Another option I have thought of is to write the file from the Spark program like this:
outputData.coalesce(1, false).saveAsTextFile(outPathHDFS)
But I've read that coalesce() doesn't help with performance.
Any other ideas or suggestions? Thanks!
My guess is that you want to merge all the files into a single one so that you can load them all at once into a Spark RDD.
Leave the files as parts (0, 1, ...) in HDFS.
Why not load them with wholeTextFiles, which does what you need?
wholeTextFiles(path, minPartitions=None, use_unicode=True)
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
If use_unicode is False, the strings will be kept as str (encoding as utf-8), which is faster and smaller than unicode. (Added in Spark 1.2)
For example, if you have the following files:
hdfs://a-hdfs-path/part-00000
hdfs://a-hdfs-path/part-00001
...
hdfs://a-hdfs-path/part-nnnnn
Do rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path"), then rdd contains:
(a-hdfs-path/part-00000, its content)
(a-hdfs-path/part-00001, its content)
...
(a-hdfs-path/part-nnnnn, its content)
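To make the record shape concrete, here is a local stand-in written in plain Python. It is not the Spark implementation: whole_text_files is a hypothetical helper that returns the same kind of (path, content) pairs for a local directory, one pair per file.

```python
from pathlib import Path


def whole_text_files(directory):
    """Return one (path, content) pair per file in a local directory."""
    pairs = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            # The whole file becomes a single record, not one record per line.
            pairs.append((str(path), path.read_text(encoding="utf-8")))
    return pairs
```

Joining the second element of each pair in path order gives the merged content the question asks for.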
Try Spark's bucketBy.
It is a nice feature available via df.write.saveAsTable(), but the resulting format can only be read by Spark. The data shows up in the Hive metastore but cannot be read by Hive or Impala.
The best solution that I've found so far was:
outputData.saveAsTextFile(outPath, classOf[org.apache.hadoop.io.compress.GzipCodec])
This saves outputData as compressed part-0000X.gz files under the outPath directory.
The other Spark app then reads those files with:
val inputData = sc.textFile(inDir + "part-00*", numPartition)
where inDir corresponds to outPath.

Recursively Read Files Spark wholeTextFiles

I have a directory in an azure data lake that has the following path:
'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib'
Within this directory there are a number of other directories (50) that have the format 20190404.
The directory 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/20180404' contains 100 or so xml files which I am working with.
I can create an rdd for each of the sub-folders which works fine, but ideally I want to pass only the top path, and have spark recursively find the files. I have read other SO posts and tried using a wildcard thus:
pathWild = 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/*'
rdd = sc.wholeTextFiles(pathWild)
rdd.count()
But it just freezes and does nothing at all, and it seems to kill the kernel completely. I am working in Jupyter on Spark 2.x and am new to Spark. Thanks!
Try this:
pathWild = 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/*/*'
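In glob terms, a single trailing * matches the dated directories, while */* matches the entries inside them. The sketch below shows this on a local filesystem with Python's glob module, whose * wildcard does not cross directory separators. The directory and file names are made up for the demonstration, and whether the longer pattern resolves the hang on the data lake depends on the filesystem connector, which this sketch does not exercise.

```python
import glob
import os
import tempfile

# Build a throwaway tree: mib/<date>/<file>.xml
root = tempfile.mkdtemp()
for day in ("20190403", "20190404"):
    os.makedirs(os.path.join(root, "mib", day))
    for name in ("a.xml", "b.xml"):
        with open(os.path.join(root, "mib", day, name), "w") as handle:
            handle.write("<doc/>")

one_level = glob.glob(os.path.join(root, "mib", "*"))
two_levels = glob.glob(os.path.join(root, "mib", "*", "*"))

# One wildcard stops at the dated directories ...
print(sorted(os.path.basename(p) for p in one_level))
# ... two wildcards reach the XML files inside them.
print(sorted(os.path.relpath(p, os.path.join(root, "mib")) for p in two_levels))
```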

How to overwrite the output directory in spark

I have a Spark Streaming application which produces a dataset every minute.
I need to save/overwrite the results of the processed data.
When I try to overwrite the dataset, org.apache.hadoop.mapred.FileAlreadyExistsException stops the execution.
I set the Spark property set("spark.files.overwrite","true"), but it had no effect.
How can I overwrite or pre-delete the files from Spark?
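UPDATE: I suggest using DataFrames, plus something like ... .write.mode(SaveMode.Overwrite) ....
A handy implicit class (the "pimp my library" pattern):
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SaveMode, SparkSession}

implicit class PimpedStringRDD(rdd: RDD[String]) {
  // Writes the RDD as text, replacing whatever already exists at path p
  def write(p: String)(implicit ss: SparkSession): Unit = {
    import ss.implicits._
    rdd.toDF().as[String].write.mode(SaveMode.Overwrite).text(p)
  }
}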
For older versions, try:
yourSparkConf.set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(yourSparkConf)
In 1.1.0 you can set configuration properties through the spark-submit script with the --conf flag.
WARNING (older versions): According to @piggybox, there is a bug in Spark where it only overwrites the files it needs in order to write its part- files; any other files are left in place.
Since df.save(path, source, mode) is deprecated (http://spark.apache.org/docs/1.5.0/api/scala/index.html#org.apache.spark.sql.DataFrame),
use df.write.format(source).mode("overwrite").save(path),
where df.write is a DataFrameWriter
and 'source' can be "com.databricks.spark.avro", "parquet", or "json".
From the pyspark.sql.DataFrame.save documentation (currently at 1.3.1), you can specify mode='overwrite' when saving a DataFrame:
myDataFrame.save(path='myPath', source='parquet', mode='overwrite')
I've verified that this even removes leftover partition files. So if you originally had, say, 10 partitions/files but then overwrote the folder with a DataFrame that only had 6 partitions, the resulting folder will have the 6 partitions/files.
See the Spark SQL documentation for more information about the mode options.
The documentation for the parameter spark.files.overwrite says: "Whether to overwrite files added through SparkContext.addFile() when the target file exists and its contents do not match those of the source." So it has no effect on the saveAsTextFile method.
You could do this before saving the file:
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://localhost:9000"), hadoopConf)
try { hdfs.delete(new org.apache.hadoop.fs.Path(filepath), true) } catch { case _ : Throwable => { } }
As explained here:
http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html
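The same delete-then-write idea, sketched in Python for a local output directory. write_fresh and write_fn are hypothetical names; on HDFS the deletion would go through the Hadoop FileSystem API as in the Scala snippet above. Unlike the catch-all in that snippet, this version only tolerates a missing directory and lets other errors surface.

```python
import shutil
from pathlib import Path


def write_fresh(output_dir, write_fn):
    """Delete output_dir if it exists, then let write_fn create it again."""
    try:
        shutil.rmtree(output_dir)
    except FileNotFoundError:
        # Nothing to delete on the first run.
        pass
    write_fn(Path(output_dir))
```

The callback receives the now-absent output path, which mirrors how a Spark save expects the target directory not to exist yet.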
df.write.mode('overwrite').parquet("/output/folder/path") works if you want to overwrite a Parquet file using Python. This is in Spark 1.6.2; the API may be different in later versions.
import org.apache.spark.{SparkConf, SparkContext}

val jobName = "WordCount"
// Disable output-spec validation so that an existing output directory does not fail the job
val conf = new SparkConf().setAppName(jobName).set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(conf)
This overloaded version of the save function works for me:
yourDF.save(outputPath, org.apache.spark.sql.SaveMode.valueOf("Overwrite"))
The example above overwrites an existing folder. SaveMode also accepts the following values (https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/SaveMode.html):
Append: Append mode means that when saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.
ErrorIfExists: ErrorIfExists mode means that when saving a DataFrame to a data source, if data already exists, an exception is expected to be thrown.
Ignore: Ignore mode means that when saving a DataFrame to a data source, if data already exists, the save operation is expected to not save the contents of the DataFrame and to not change the existing data.
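The modes differ only in what happens when the target already exists. The following Python sketch models that decision table. It illustrates the documented semantics and is not Spark code; plan_save and the returned labels are hypothetical names.

```python
def plan_save(mode, target_exists):
    """Return what a save would do for a given mode and target state."""
    mode = mode.lower()
    if mode not in ("append", "overwrite", "errorifexists", "ignore"):
        raise ValueError(f"unknown save mode: {mode}")
    if not target_exists:
        # Every mode simply writes when nothing exists yet.
        return "write"
    if mode == "append":
        return "add to existing data"
    if mode == "overwrite":
        return "replace existing data"
    if mode == "ignore":
        return "skip, leave existing data untouched"
    # errorifexists: the save fails instead of touching existing data.
    raise FileExistsError("target already exists")
```

Only the second argument changes the outcome: with target_exists set to False, all four modes behave identically.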
Spark: overwrite the output directory
By default, Spark does not overwrite the output directory on S3, HDFS, or any other file system; when you try to write DataFrame contents to an existing directory, Spark raises a runtime error. To overcome this, Spark provides the enumeration value org.apache.spark.sql.SaveMode.Overwrite, which overwrites the existing folder.
Pass Overwrite as the argument to the mode() function of the DataFrameWriter class, for example:
df.write.mode(SaveMode.Overwrite).csv("/tmp/out/foldername")
Alternatively, you can use the string "overwrite":
df.write.mode("overwrite").csv("/tmp/out/foldername")
Besides Overwrite, SaveMode also offers SaveMode.Append, SaveMode.ErrorIfExists, and SaveMode.Ignore.
For older versions of Spark, you can use the following to overwrite the output directory with the RDD contents.
sparkConf.set("spark.hadoop.validateOutputSpecs", "false")
val sparkContext = new SparkContext(sparkConf)
If you are willing to use your own custom output format, you can get the desired behaviour with RDDs as well.
Have a look at the following classes:
FileOutputFormat,
FileOutputCommitter
FileOutputFormat has a method named checkOutputSpecs, which checks whether the output directory already exists.
FileOutputCommitter has commitJob, which usually moves the data from the temporary directory to its final place.
I haven't been able to verify this yet (I will as soon as I have a few free minutes), but in theory: if I extend FileOutputFormat and override checkOutputSpecs so that it does not throw an exception when the directory already exists, and adjust the commitJob method of my custom output committer to perform whatever logic I want (e.g. overwrite some of the files, append others), then I may be able to achieve the desired behaviour with RDDs as well.
The output format is passed to saveAsNewAPIHadoopFile (saveAsTextFile ends up in the equivalent old-API method, saveAsHadoopFile, to actually save the files). The output committer is configured at the application level.
