Dataframe not able to write on S3 - apache-spark

I am creating a dataframe from an existing Hive table. The table is partitioned on date and site columns. When I overwrite the data in this same table after some computation with the previous day's data, the load succeeds.
But when I then try to write the final dataframe to an S3 bucket, I get a file-not-found error. The file it mentions is the previous day's file, which has just been overwritten.
If I write the dataframe to S3 first and then overwrite the table, it runs fine.
Why does writing to the S3 location depend on the table's partition files?
Below is the error and code.
java.io.FileNotFoundException: No such file or directory: s3://bucket_1/DM/web_fact_tbl/local_dt=2018-05-10/site_name=ABC/part-00000-882a6e29-eb6a-477c-8b88-6fe853956674.c000
fact_tbl = spark.table('db.web_fact_tbl')
fact_lkp = fact_tbl.filter(fact_tbl['local_dt']=='2018-05-10')
fact_join = fact_lkp.alias('a').join(fact_tbl.alias('b'),(col('a.id') == col('b.id')),"inner").select('a.*')
fact_final = fact_join.union(fact_tbl)
fact_final.coalesce(2).createOrReplaceTempView('cwf')
spark.sql('INSERT OVERWRITE TABLE dm.web_fact_tbl PARTITION (local_dt, site_name) \
SELECT * FROM cwf')
fact_final.write.csv('s3://bucket_1/yahoo')

Before the last line, fact_final is just a "lazy" dataframe object that contains only the query definition. It does not hold any data, but it does keep pointers to the exact data files where the data is actually stored.
When you try to perform an actual operation (it does not matter whether it's writing to S3 or executing a query like fact_final.count()), you get the error above: the partition local_dt=2018-05-10 no longer exists, because the files/folders behind it were deleted by the overwrite.
You can try to re-initialize the dataframe once more before the final write (that is another lazy operation; in your case all the work happens when you write it to S3).
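A minimal, untested sketch of both options, reusing the table and path names from the question (the filter in option 2 is only an assumption about which partition should be written):

# Option 1: write the result to S3 first, then overwrite the Hive table
fact_final.write.csv('s3://bucket_1/yahoo')
spark.sql('INSERT OVERWRITE TABLE dm.web_fact_tbl PARTITION (local_dt, site_name) \
           SELECT * FROM cwf')

# Option 2: after the overwrite, re-read the table so the dataframe points at
# the new files, and write that copy to S3
refreshed = spark.table('dm.web_fact_tbl')
refreshed.filter(refreshed['local_dt'] == '2018-05-10').write.csv('s3://bucket_1/yahoo')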

Related

Databricks Delta Live Tables - Apply Changes from delta table

I am working with Databricks Delta Live Tables, but have some problems with upserting some tables upstream. I know the text below is quite long, but I tried to describe my problem as clearly as possible. Let me know if some parts are not clear.
I have the following tables and flow:
Landing_zone -> This is a folder in which JSON files are added that contain data of inserted or updated records.
Raw_table -> This is the data from the JSON files, but in table format. This table is in delta format. No transformations are done, except for transforming the JSON structure into a tabular structure (I did an explode and then created columns from the JSON keys).
Intermediate_table -> This is the raw_table, but with some extra columns (depending on other column values).
To go from my landing zone to the raw table I have the following Pyspark code:
cloudfile = {"cloudFiles.format":"JSON",
"cloudFiles.schemaLocation": sourceschemalocation,
"cloudFiles.inferColumnTypes": True}
#dlt.view('landing_view')
def inc_view():
df = (spark
.readStream
.format('cloudFiles')
.options(**cloudFilesOptions)
.load(filpath_to_landing)
<Some transformations to go from JSON to tabular (explode, ...)>
return df
dlt.create_target_table('raw_table',
table_properties = {'delta.enableChangeDataFeed': 'true'})
dlt.apply_changes(target='raw_table',
source='landing_view',
keys=['id'],
sequence_by='updated_at')
This code works as expected. I run it, add a changes.JSON file to the landing zone, rerun the pipeline, and the upserts are correctly applied to 'raw_table'.
(However, each time the pipeline runs, a new parquet file containing all the data is created in the delta folder. I would have expected that only a parquet file with the inserted and updated rows would be added, and that some information about the current version would be kept in the delta logs. Not sure if this is relevant for my problem. I have already changed the table_properties of 'raw_table' to enableChangeDataFeed = true, and the readStream for 'intermediate_table' has option('readChangeFeed', 'true').)
Then I have the following code to go from my 'raw_table' to my 'intermediate_table':
@dlt.table(name='V_raw_table', table_properties={'delta.enableChangeDataFeed': 'true'})
def raw_table():
    df = (spark.readStream
          .format('delta')
          .option('readChangeFeed', 'true')
          .table('LIVE.raw_table'))
    df = df.withColumn('ExtraCol', <Transformation>)
    return df
dlt.create_target_table('intermediate_table')

dlt.apply_changes(target='intermediate_table',
                  source='V_raw_table',
                  keys=['id'],
                  sequence_by='updated_at')
Unfortunately, when I run this, I get the error:
'Detected a data update (for example part-00000-7127bd29-6820-406c-a5a1-e76fc7126150-c000.snappy.parquet) in the source table at version 2. This is currently not supported. If you'd like to ignore updates, set the option 'ignoreChanges' to 'true'. If you would like the data update to be reflected, please restart this query with a fresh checkpoint directory.'
I looked into 'ignoreChanges', but I don't think this is what I want. I would expect the autoloader to be able to detect the changes in the delta table and pass them through the flow.
I am aware that readStream only works with appends, but that is exactly why I would expect that, after 'raw_table' is updated, a new parquet file containing only the inserts and updates would be added to the delta folder. That added parquet file would then be detected by the autoloader and could be used to apply the changes to 'intermediate_table'.
Am I doing this the wrong way? Or am I overlooking something? Thanks in advance!
As readStream only works with appends, any change in the source files will create issues downstream. The assumption that an update on "raw_table" will only insert a new parquet file is incorrect. With or without settings like "optimized writes", apply_changes can both add and remove files. You can find this information in your "raw_table/_delta_log/xxx.json" under "numTargetFilesAdded" and "numTargetFilesRemoved".
Basically, "Databricks recommends you use Auto Loader to ingest only immutable files".
Since you changed the settings to include .option('readChangeFeed', 'true'), you should start with a full refresh (there is a dropdown near the Start button). Doing this will resolve the 'Detected a data update xxx' error, and your code should then work for incremental updates.
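As a quick way to verify this, here is a minimal sketch (the log path and version number are hypothetical, not from your pipeline) that reads one commit entry of the table's _delta_log and prints those operation metrics:

import json

# hypothetical path to one commit entry of raw_table's delta log
log_file = "/dbfs/pipelines/raw_table/_delta_log/00000000000000000002.json"

with open(log_file) as f:
    for line in f:
        entry = json.loads(line)
        metrics = entry.get("commitInfo", {}).get("operationMetrics", {})
        if metrics:
            print("numTargetFilesAdded:  ", metrics.get("numTargetFilesAdded"))
            print("numTargetFilesRemoved:", metrics.get("numTargetFilesRemoved"))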

Can underlying parquet files be deleted without negatively impacting DeltaLake _delta_log

Using .vacuum() on a DeltaLake table is very slow (see Delta Lake (OSS) Table on EMR and S3 - Vacuum takes a long time with no jobs).
If I manually deleted the underlying parquet files, did not add a new json log file, and did not add a new .checkpoint.parquet file or change the _delta_log/_last_checkpoint file that points to it, what would the negative impacts to the DeltaLake table be, if any?
Obviously time-traveling, i.e. loading a previous version of the table that relied on the parquet files I removed, would not work. What I want to know is: would there be any issues reading, writing, or appending to the current version of the DeltaLake table?
What I am thinking of doing in pySpark:
### Assuming a working SparkSession as `spark`
from subprocess import check_output
import json
from pyspark.sql import functions as F

awscmd = "aws s3 cp s3://my_s3_bucket/delta/_delta_log/_last_checkpoint -"
last_checkpoint = str(json.loads(check_output(awscmd, shell=True).decode("utf-8")).get('version')).zfill(20)

s3_bucket_path = "s3a://my_s3_bucket/delta/"

df_chkpt_del = (
    spark.read.format("parquet")
    .load(f"{s3_bucket_path}/_delta_log/{last_checkpoint}.checkpoint.parquet")
    .where(F.col("remove").isNotNull())
    .select("remove.*")
    .withColumn("deletionTimestamp", F.from_unixtime(F.col("deletionTimestamp")/1000))
    .withColumn("delDateDiffDays", F.datediff(F.col("deletionTimestamp"), F.current_timestamp()))
    .where(F.col("delDateDiffDays") < -7)
)
There are a lot of options from here. One could be:
df_chkpt_del.select("path").toPandas().to_csv("files_to_delete.csv", index=False)
Here I could read files_to_delete.csv into a bash array and then use a simple bash for loop, passing each parquet file's s3 path to an aws s3 rm command to remove the files one by one.
This may be slower than vacuum(), but at least it will not consume cluster resources while it is working.
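For illustration, a minimal untested Python sketch of that deletion loop (the question describes it as a bash loop; this version uses subprocess instead, and assumes files_to_delete.csv has a single column named path holding paths relative to the table root):

import csv
import subprocess

table_root = "s3://my_s3_bucket/delta"   # same table root as above

with open("files_to_delete.csv") as f:
    for row in csv.DictReader(f):
        # remove one data file at a time with the aws cli
        subprocess.run(["aws", "s3", "rm", f"{table_root}/{row['path']}"], check=True)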
If I do this, will I also have to either:
write a new _delta_log/000000000000000#####.json file that correctly documents these changes?
write a new 000000000000000#####.checkpoint.parquet file that correctly documents these changes and change the _delta_log/_last_checkpoint file to point to that checkpoint.parquet file?
The second option would be easier.
However, if there will be no negative effects if I just remove the files and don't change anything in the _delta_log, then that would be the easiest.
TL;DR: answering this question:
If I manually deleted the underlying parquet files, did not add a new json log file, and did not add a new .checkpoint.parquet file or change the _delta_log/_last_checkpoint file that points to it, what would the negative impacts to the DeltaLake table be, if any?
Yes, this could potentially corrupt your delta table.
Let me briefly explain how Delta Lake reads a version using the _delta_log.
If you want to read version x, Delta Lake goes through the log entries of all versions up to x and builds a running sum of which parquet files to read. A summary of this process is saved as a .checkpoint after every 10th version to keep this running-sum computation efficient.
What do I mean by this running sum?
Assume:
version 1 log says: add file_1, file_2, file_3
version 2 log says: delete file_1, file_2, and add file_4
So when reading version 2, the combined instruction is:
add file_1, file_2, file_3 -> delete file_1, file_2, and add file_4
The resulting files to read are therefore file_3 and file_4.
What if you delete a parquet file from the file system?
Say at version 3 you delete file_4 from the file system directly. If you do not go through .vacuum, the delta log will not know that file_4 is gone; it will still try to read it and will fail.
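For illustration, a minimal sketch of that failure (the table path is illustrative): once a file that the log still references has been removed from storage, reading the current version fails.

# file_4 was deleted from S3 manually, but _delta_log still lists it
df = spark.read.format("delta").load("s3a://my_s3_bucket/delta/")
df.count()   # fails with a FileNotFoundException for the missing parquet file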

Convert CSV files from multiple directory into parquet in PySpark

I have CSV files under multiple paths (which do not share a common parent directory) in an s3 bucket. All the tables have the same partition keys.
The directory layout of the source s3 bucket:
table_name_1/partition_key_1=<pk_1>/partition_key_2=<pk_2>/file.csv
table_name_2/partition_key_1=<pk_1>/partition_key_2=<pk_2>/file.csv
...
I need to convert these csv files into parquet files and store them in another s3 bucket that has the same directory structure.
The directory layout of the destination s3 bucket:
table_name_1/partition_key_1=<pk_1>/partition_key_2=<pk_2>/file.parquet
table_name_2/partition_key_1=<pk_1>/partition_key_2=<pk_2>/file.parquet
...
My current solution iterates through the s3 bucket, finds each CSV file, converts it to parquet, and saves it to the other S3 path. I find this inefficient, because I loop and convert the files one by one.
I want to utilize the spark library to improve the efficiency.
Then, I tried:
spark.read.csv('s3n://bucket_name/table_name_1/').write.partitionBy('partition_key_1', 'partition_key_2').parquet('s3n://another_bucket/table_name_1')
This works well for each table, but to optimize it further I want to take the table name as a parameter, something like:
TABLE_NAMES = [table_name_1, table_name_2, ...]
spark.read.csv('s3n://bucket_name/{*TABLE_NAMES}/').write.partitionBy('partition_key_1', 'partition_key_2').parquet('s3n://another_bucket/{*TABLE_NAMES}')
Thanks
The mentioned question provides solutions for reading multiple files at once. The method spark.read.csv(...) accepts one or multiple paths, as shown here. For reading the files you can apply the same logic. However, when it comes to writing, Spark will merge all the given datasets/paths into one dataframe. It is therefore not possible to produce multiple dataframes from one single dataframe without applying custom logic first. To conclude, there is no method for writing the initial dataframe directly into multiple directories, i.e. df.write.csv(*TABLE_NAMES) does not exist.
The good news is that Spark provides a dedicated function, namely input_file_name(), which returns the file path of the current record. You can use it in combination with TABLE_NAMES to filter on the table name.
Here is one possible, untested PySpark solution:
from pyspark.sql.functions import input_file_name

TABLE_NAMES = ["table_name_1", "table_name_2", ...]

source_path = "s3n://bucket_name"
input_paths = [f"{source_path}/{t}" for t in TABLE_NAMES]

all_df = spark.read.csv(input_paths) \
              .withColumn("file_name", input_file_name()) \
              .cache()

dest_path = "s3n://another_bucket"

def write_table(table_name: str) -> None:
    all_df.where(all_df["file_name"].contains(table_name)) \
          .write \
          .partitionBy('partition_key_1', 'partition_key_2') \
          .parquet(f"{dest_path}/{table_name}")

for t in TABLE_NAMES:
    write_table(t)
Explanation:
We generate and store the input paths into input_paths. This will create paths such as: s3n://bucket_name/table1, s3n://bucket_name/table2 ... s3n://bucket_name/tableN.
Then we load all the paths into one dataframe, to which we add a new column called file_name that holds the path of each row. Notice that we also use cache here; this is important since we trigger multiple (len(TABLE_NAMES)) actions in the following code, and caching prevents us from loading the data source again and again.
Next we create write_table, which is responsible for saving the data for the given table. It filters by table name using all_df["file_name"].contains(table_name), which returns only the records whose file_name column contains the value of table_name. Finally, it saves the filtered data as you already did.
In the last step we call write_table for every item of TABLE_NAMES.
Related links
How to import multiple csv files in a single load?
Get HDFS file path in PySpark for files in sequence file format

Spark - Read and Write back to same S3 location

I am reading datasets dataset1 and dataset2 from S3 locations. I then transform them and write back to the same location that dataset2 was read from.
However, I get the error message below:
An error occurred while calling o118.save. No such file or directory 's3://<myPrefix>/part-00001-a123a120-7d11-581a-b9df-bc53076d57894-c000.snappy.parquet
If I try to write to a new S3 location, e.g. s3://dataset_new_path.../, then the code works fine.
my_df \
.write.mode('overwrite') \
.format('parquet') \
.save(s3_target_location)
Note: I have tried using .cache() after reading in the dataframe but still get the same error.
The reason this causes a problem is that you are reading from and writing to the same path that you are trying to overwrite. This is a standard Spark issue and has nothing to do with AWS Glue.
Spark uses lazy transformations on dataframes, and they are only triggered when an action is called. It builds a DAG to keep track of all the transformations that should be applied to the dataframe.
When you read data from a location and write back to the same location in overwrite mode, the write is the action on the dataframe. When Spark sees the overwrite, its execution plan deletes the path first, then tries to read that path, which is now empty; hence the error.
A possible workaround is to write to a temporary location first and then, using it as the source, overwrite the dataset2 location.
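A minimal sketch of that workaround, reusing the names from the question (the temporary path is illustrative):

# hypothetical temporary location, different from the path being overwritten
temp_location = 's3://<myPrefix>/_tmp_dataset2'

# 1) write the transformed data to the temporary location
my_df \
    .write.mode('overwrite') \
    .format('parquet') \
    .save(temp_location)

# 2) read it back and use that copy to overwrite the original dataset2 location
spark.read.parquet(temp_location) \
    .write.mode('overwrite') \
    .format('parquet') \
    .save(s3_target_location)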

Spark - Cannot Write To Parquet File when using mapWithInputSplit

I am trying to solve a simple log parsing problem: given N log files
logfile.20160601.txt
logfile.29169692.txt
...
I want to save a parquet file with the date key of each log.
To accomplish that, I found this way of getting the inputSplit path:
val data = sc.textFile("/logdirectory/*.*")

val logsWithFileName = data.mapPartitionsWithInputSplit { (inputSplit, iterator) =>
  val file = inputSplit.asInstanceOf[FileSplit]
  val logDateKey = getDatekeyFromPath(file)
  iterator.map { tpl => (logDateKey, tpl._2.toString) }
}

val logs = logsWithFileName.map(item => LogItem(item._1, item._2))
val df = logs.toDF
Now I try to save the dataframe
df.write.partitionBy("logdatekey", "hostname").save("C:/temp/results.parquet")
but I receive this message
Output directory file:/C:/temp/results.parquet already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/C:/temp/results.parquet already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1179)
Has anyone experienced this strange behavior? Could it be related to the use of input splits?
Many thanks in advance
Rob
Well, your error message says it all: you are trying to write to an output directory that already exists.
You just need to look at the available save operations:
Save operations can optionally take a SaveMode, that specifies how to handle existing data if present. It is important to realize that these save modes do not utilize any locking and are not atomic. Additionally, when performing an Overwrite, the data will be deleted before writing out the new data.
SaveMode.ErrorIfExists (default behavior) - when saving a DataFrame to a data source, if data already exists, an exception is expected to be thrown.
SaveMode.Append - when saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.
SaveMode.Overwrite - means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame.
SaveMode.Ignore - means that when saving a DataFrame to a data source, if data already exists, the save operation is expected to not save the contents of the DataFrame and to not change the existing data. This is similar to a CREATE TABLE IF NOT EXISTS in SQL.
So in your case, if you want to overwrite your existing data, you should do the following:
import org.apache.spark.sql.SaveMode
df.write.partitionBy("logdatekey", "hostname").mode(SaveMode.Overwrite).save("C:/temp/results.parquet")
