Spark - Read and Write back to same S3 location - apache-spark

I am reading datasets dataset1 and dataset2 from S3 locations. I then transform them and write back to the same location that dataset2 was read from.
However, I get the error message below:
An error occurred while calling o118.save. No such file or directory 's3://<myPrefix>/part-00001-a123a120-7d11-581a-b9df-bc53076d57894-c000.snappy.parquet
If I try to write to a new S3 location instead, e.g. s3://dataset_new_path.../, then the code works fine.
my_df \
    .write.mode('overwrite') \
    .format('parquet') \
    .save(s3_target_location)
Note: I have tried using .cache() after reading in the dataframe, but I still get the same error.

The reason this causes a problem is that you are reading from and writing to the same path that you are trying to overwrite. This is a standard Spark issue and has nothing to do with AWS Glue.
Spark evaluates DataFrames lazily: transformations are only executed when an action is called. Spark builds a DAG to keep track of all the transformations that should be applied to the DataFrame.
When you read data from a location and then write back to that same location with overwrite, the write is the action that triggers execution. Because the mode is overwrite, Spark's execution plan first deletes the target path and then tries to read from it, but by then the path is empty; hence the error.
A possible workaround is to write to a temporary location first and then, using that as the source, overwrite the dataset2 location, as sketched below.
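A minimal sketch of that workaround in PySpark, reusing my_df and s3_target_location from the question (the staging path is just an illustrative placeholder):
# Stage the transformed data somewhere other than the source path.
temp_location = 's3://<myPrefix>/dataset2_staging/'
my_df.write.mode('overwrite').format('parquet').save(temp_location)
# Re-read the staged copy so the plan no longer references dataset2's files,
# then overwrite the original dataset2 location.
spark.read.parquet(temp_location) \
    .write.mode('overwrite') \
    .format('parquet') \
    .save(s3_target_location)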

Related

Spark Reading and Writing to same S3 Path Giving Unable to infer Schema Error

I need to perform an update insert (upsert) on old data with new data.
Pseudo Code:
from functools import reduce
from pyspark.sql import DataFrame

old_data = spark.read.parquet('s3://bucket/old_data/')
new_data = spark.read.parquet('s3://bucket/new_data/')
common_records = old_data.join(new_data, on=opk, how="inner")
non_match_records = old_data.join(new_data, on=opk, how="left_anti")
new_records = new_data.join(old_data, on=opk, how="left_anti")
dfs = [common_records, non_match_records, new_records]
final_data = reduce(DataFrame.unionAll, dfs)
final_data.cache()
final_data.write.parquet('s3://bucket/old_data/')
Error :
Even though I cached the data, it still looks for the old_data path. Is there any way to write directly to the old_data S3 path?
I have tried writing to a temp path, reading it back from there, and then writing to the main path, like below. It works, but the process takes a long time when the data runs into billions of records.
final_data.write.parquet('s3://bucket/temp/')
df = spark.read.parquet('s3://bucket/temp/')
df.write.parquet('s3://bucket/old_data/')
I want to avoid this temporary writing and reading step.
Thanks in advance :)
You need to perform an action to trigger dataframe caching. Thus you should modify the last lines of your code snippet as follows:
...
final_data = final_data.cache()
final_data.count()
final_data.write.parquet('s3://bucket/old_data/')
By performing a count action on your dataframe, you trigger the caching process and can then write to the same directory you read from.
However, I don't know whether this will improve the performance of your application, since the cache falls back to disk when the dataframe is too large to fit in memory. If your use case is updating Parquet files, I advise you to look at Delta Lake, which was created to solve exactly this problem.
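For reference, a rough sketch of how the same upsert could look with Delta Lake's MERGE, assuming the delta-spark package is available and s3://bucket/old_data/ has already been converted to a Delta table (opk is shown as a single join column for illustration):
from delta.tables import DeltaTable

old_table = DeltaTable.forPath(spark, 's3://bucket/old_data/')

# Update matching rows and insert new ones in place,
# avoiding the write-to-temp-and-copy-back round trip.
old_table.alias('old') \
    .merge(new_data.alias('new'), 'old.opk = new.opk') \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()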

Pyspark writing out to partitioned parquet using s3a issue

I have a pyspark script which reads an unpartitioned single parquet file from S3, does some transformations, and writes back to another S3 bucket, partitioned by date.
I'm using s3a to do the read and write. Reading in the files and performing the transformations works fine with no problem. However, when I try to write out to S3 using s3a with partitioning, it throws the following error:
WARN s3a.S3AFileSystem: Found file (with /): real file? should not
happen: folder1/output
org.apache.hadoop.fs.FileAlreadyExistsException: Can't make directory
for path 's3a://bucket1/folder1/output' since it is a file.
The part of the code I'm using to write is as follows; I'm trying to append a new partition for the new date to an existing directory:
output_loc = "s3a://bucket1/folder1/output/"
finalDf.write.partitionBy("date", "advertiser_id") \
.mode("append") \
.parquet(output_loc)
I'm using Hadoop v3.0.0 and Spark 2.4.1.
Has anyone come across this issue when using s3a instead of s3n? BTW, it works fine on an older instance using s3n.
Thanks
There's an entry in your bucket at s3a://bucket1/folder1/output/ with the trailing slash whose size is > 0. S3A is warning that it's unhappy, because that entry is treated as an empty-directory marker which is at risk of deletion once you add files underneath.
Look in the S3 bucket from the AWS console, see what is there, and delete it.
Try using output_loc without a trailing / to see if that helps (unlikely...).
Add a follow-up on the outcome; if the delete doesn't fix things, then a Hadoop JIRA may be worth filing.
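If you'd rather do the check from code than the console, a hedged boto3 sketch (bucket and key taken from the question; adjust as needed):
import boto3

s3 = boto3.client('s3')

# List what is actually stored under the "directory" key, trailing slash included.
resp = s3.list_objects_v2(Bucket='bucket1', Prefix='folder1/output/', MaxKeys=10)
for obj in resp.get('Contents', []):
    print(obj['Key'], obj['Size'])

# If an object named exactly 'folder1/output/' shows up with a non-zero size,
# deleting it should clear the FileAlreadyExistsException:
# s3.delete_object(Bucket='bucket1', Key='folder1/output/')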

Dataframe not able to write on S3

I am creating a dataframe from an existing Hive table. The table is partitioned on the date and site columns. Now, when I try to overwrite the data in this same table after some computation with the previous day's data, it loads successfully.
But when I then try to write the final dataframe to an S3 bucket, I get an error saying the file is not found. The file it mentions is the previous day's file, which has now been overwritten.
If I write the dataframe first and then overwrite the table, it runs fine.
What does writing to the S3 location have to do with the table's partition files?
Below is the error and code.
java.io.FileNotFoundException: No such file or directory: s3://bucket_1/DM/web_fact_tbl/local_dt=2018-05-10/site_name=ABC/part-00000-882a6e29-eb6a-477c-8b88-6fe853956674.c000
fact_tbl = spark.table('db.web_fact_tbl')
fact_lkp = fact_tbl.filter(fact_tbl['local_dt']=='2018-05-10')
fact_join = fact_lkp.alias('a').join(fact_tbl.alias('b'),(col('a.id') == col('b.id')),"inner").select('a.*')
fact_final = fact_join.union(fact_tbl)
fact_final.coalesce(2).createOrReplaceTempView('cwf')
spark.sql('INSERT OVERWRITE TABLE dm.web_fact_tbl PARTITION (local_dt, site_name) \
SELECT * FROM cwf')
fact_final.write.csv('s3://bucket_1/yahoo')
Before the last line, fact_final is just a "lazy" dataframe object that contains only definitions. It does not contain any data, but it does hold pointers to the exact files where the data is actually stored.
When you try to perform an actual operation (no matter whether it's writing to S3 or executing a query like fact_final.count()), you'll get the error above. It looks like partition local_dt=2018-05-10 no longer exists (the files/folders behind it are gone).
You can try to re-initialize the dataframe once again before the final write (it's another lazy operation; in your case all the work is done when you write it to S3).
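A hedged sketch of that re-initialisation, built from the code in the question and assuming the freshly overwritten dm.web_fact_tbl now holds exactly the rows you want to export:
# Rebuild the dataframe from the overwritten table before the S3 write,
# instead of reusing fact_final, whose plan still points at the old partition files.
refreshed = spark.table('dm.web_fact_tbl')
refreshed.coalesce(2).write.csv('s3://bucket_1/yahoo')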

Spark Save DataSet partitionby File Already Exists Error

I have a DataSet which was constructed in the following way:
Encoder<MyDomain> encoder= Encoders.bean(MyDomain.class);
Dataset<MyDomain> stdDS = sc.createDataset(filteredRecords.rdd(), encoder);
Dataset<Row> rowDataset = stdDS.withColumn("idHash", stdDS.col("id").substr(0, 5));
I am then trying to output the dataset by doing:
rowDataset.write().partitionBy("keep", "idHash").save("test.parquet");
When I only partition by "keep", everything works correctly; when I partition by both "keep" and "idHash", I get:
File already exists: file:/C:/dev/test.parquet/_temporary/0/_temporary/attempt_201701191219_0001_m_000000_0/keep=true/idHash=0a/part-r-00000-2c2e0494-f6a7-47d7-88e2-f49dffb608d1.snappy.parquet
How can I get my DataSet to output properly using multiple partitions? The folder is empty to start with. Also, this error happens when I run on my local machine; in production this data will be output to S3, so any solution needs to work against both a local filesystem and AWS S3.
Thanks,
Nathan
Try
rowDataset.repartition("keep", "idHash").write().partitionBy("keep", "idHash").save("test.parquet");

How to overwrite the output directory in spark

I have a spark streaming application which produces a dataset for every minute.
I need to save/overwrite the results of the processed data.
When I try to overwrite the dataset, an org.apache.hadoop.mapred.FileAlreadyExistsException stops the execution.
I set the Spark property set("spark.files.overwrite","true"), but no luck.
How can I overwrite or pre-delete the files from Spark?
UPDATE: Suggest using Dataframes, plus something like ... .write.mode(SaveMode.Overwrite) ....
Handy pimp:
implicit class PimpedStringRDD(rdd: RDD[String]) {
  def write(p: String)(implicit ss: SparkSession): Unit = {
    import ss.implicits._
    rdd.toDF().as[String].write.mode(SaveMode.Overwrite).text(p)
  }
}
For older versions, try:
yourSparkConf.set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(yourSparkConf)
In 1.1.0 you can set conf settings using the spark-submit script with the --conf flag.
WARNING (older versions): According to #piggybox there is a bug in Spark where it will only overwrite the files it needs to write its part- files; any other files will be left in place.
Since df.save(path, source, mode) is deprecated (http://spark.apache.org/docs/1.5.0/api/scala/index.html#org.apache.spark.sql.DataFrame),
use df.write.format(source).mode("overwrite").save(path),
where df.write is a DataFrameWriter.
'source' can be ("com.databricks.spark.avro" | "parquet" | "json")
From the pyspark.sql.DataFrame.save documentation (currently at 1.3.1), you can specify mode='overwrite' when saving a DataFrame:
myDataFrame.save(path='myPath', source='parquet', mode='overwrite')
I've verified that this will even remove leftover partition files. So if you had, say, 10 partitions/files originally, but then overwrote the folder with a DataFrame that only had 6 partitions, the resulting folder will have the 6 partitions/files.
See the Spark SQL documentation for more information about the mode options.
The documentation for the parameter spark.files.overwrite says this: "Whether to overwrite files added through SparkContext.addFile() when the target file exists and its contents do not match those of the source." So it has no effect on the saveAsTextFiles method.
You could do this before saving the file:
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://localhost:9000"), hadoopConf)
try { hdfs.delete(new org.apache.hadoop.fs.Path(filepath), true) } catch { case _ : Throwable => { } }
As explained here:
http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html
df.write.mode('overwrite').parquet("/output/folder/path") works if you want to overwrite a parquet file using Python. This is in Spark 1.6.2; the API may be different in later versions.
val jobName = "WordCount";
//overwrite the output directory in spark set("spark.hadoop.validateOutputSpecs", "false")
val conf = new
SparkConf().setAppName(jobName).set("spark.hadoop.validateOutputSpecs", "false");
val sc = new SparkContext(conf)
This overloaded version of the save function works for me:
yourDF.save(outputPath, org.apache.spark.sql.SaveMode.valueOf("Overwrite"))
The example above would overwrite an existing folder. The SaveMode can take these parameters as well (https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/SaveMode.html):
Append: Append mode means that when saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.
ErrorIfExists: ErrorIfExists mode means that when saving a DataFrame to a data source, if data already exists, an exception is expected to be thrown.
Ignore: Ignore mode means that when saving a DataFrame to a data source, if data already exists, the save operation is expected to not save the contents of the DataFrame and to not change the existing data.
Spark – Overwrite the output directory:
Spark by default doesn’t overwrite the output directory on S3, HDFS, and any other file systems, when you try to write the DataFrame contents to an existing directory, Spark returns runtime error hence. To overcome this Spark provides an enumeration org.apache.spark.sql.SaveMode.Overwrite to overwrite the existing folder.
We need to use this Overwrite as an argument to mode() function of the DataFrameWrite class, for example.
df. write.mode(SaveMode.Overwrite).csv("/tmp/out/foldername")
or you can use the overwrite string.
df.write.mode("overwrite").csv("/tmp/out/foldername")
Besides Overwrite, SaveMode also offers other modes such as SaveMode.Append, SaveMode.ErrorIfExists, and SaveMode.Ignore.
For older versions of Spark, you can use the following to overwrite the output directory with the RDD contents.
sparkConf.set("spark.hadoop.validateOutputSpecs", "false")
val sparkContext = new SparkContext(sparkConf)
If you are willing to use your own custom output format, you can get the desired behaviour with RDDs as well.
Have a look at the following classes:
FileOutputFormat,
FileOutputCommitter
In FileOutputFormat you have a method named checkOutputSpecs, which checks whether the output directory exists.
In FileOutputCommitter you have commitJob, which usually transfers data from the temporary directory to its final place.
I wasn't able to verify this yet (I will, as soon as I have a few free minutes), but theoretically: if I extend FileOutputFormat and override checkOutputSpecs with a method that doesn't throw an exception when the directory already exists, and adjust the commitJob method of my custom output committer to perform whichever logic I want (e.g. overwrite some of the files, append to others), then I may be able to achieve the desired behaviour with RDDs as well.
The output format is passed to saveAsNewAPIHadoopFile (which is the method saveAsTextFile also calls to actually save the files), and the output committer is configured at the application level.
