read only non-merged files in pyspark - apache-spark

I have N deltas in N folders (ex. /user/deltas/1/delta1.csv, /user/deltas/2/delta2csv,.../user/deltas/n/deltaN.csv)
all deltas have same columns, only information in columns is different.
i have a code for reading my csv files from folder "deltas"
dfTable = spark.read.format("csv").option("recursiveFileLookup","true")\
.option("header", "true).load("/home/user/deltas/")
and i gonna use deltaTable.merge to merge and update information from deltas and write updated information in table (main_table.csv)
For example tommorow i will have new delta with another updated information, and i will run my code again to refresh data in my main_table.csv .
How to avoid deltas that have already been used by deltaTable.merge earlier to the file main_table.csv ?
is it possible maybe to change file type after delta's run for example to parquet and thats how to avoid re-using deltas again? because im reading csv files, not parquet, or something like log files etc..

I think a time path filter might work well for your use case. If you are running your code daily (either manually or with a job), then you could use the modifiedAfter parameter to only load files that were modified after 1 day ago (or however often you are rerunning this code).
from datetime import datetime, timedelta
timestamp_last_run = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%dT-%H:%M:%S")
dfTable = spark.read.format("csv").option("recursiveFileLookup","true")\
.option("header", "true).load("/home/user/deltas/", modifiedAfter=timestamp_last_run)
## ...perform merge operation and save data in main_table.csv

Related

Add the creation date of a parquet file into a DataFrame

Currently I load multiple parquet file with this code :
df = spark.read.parquet("/mnt/dev/bronze/Voucher/*/*")
(Into the Voucher folder, there is one folder by date, and one parquet file inside it)
How can I add the creation date of each parquet file into my DataFrame ?
Thanks
EDIT 1:
Thanks rainingdistros, I wrote this:
import os
from datetime import datetime, timedelta
Path = "/dbfs/mnt/dev/bronze/Voucher/2022-09-23/"
fileFull = Path +'/'+'XXXXXX.parquet'
statinfo = os.stat(fileFull)
create_date = datetime.fromtimestamp(statinfo.st_ctime)
display(create_date)
Now I must find a way to loop through all the files and add a column in the DataFrame.
The information returned from os.stat might not be accurate unless the file is first operation on these files is your requirement (i.e., adding the additional column with creation time).
Each time the file is modified, both st_mtime and st_ctime will be updated to this modification time. The following are the images indicating the same:
When I modify this file, the changes can be observed in the information returned by os.stat.
So, if adding this column is the first operation that is going to be performed on these files, then you can use the following code to add this date as column to your files.
from pyspark.sql.functions import lit
import pandas as pd
path = "/dbfs/mnt/repro/2022-12-01"
fileinfo = os.listdir(path)
for file in fileinfo:
pdf = pd.read_csv(f"{path}/{file}")
pdf.display()
statinfo = os.stat("/dbfs/mnt/repro/2022-12-01/sample1.csv")
create_date = datetime.fromtimestamp(statinfo.st_ctime)
pdf['creation_date'] = [create_date.date()] * len(pdf)
pdf.to_csv(f"{path}/{file}", index=False)
These files would have this new column as shown below after running the code:
It might be better to take the value directly from folder in this case as the information is already available and all that needs to be done is to extract and add column to files in a similar manner as in the above code.
See if below steps help....
Refer to the link to get the list of files in DBFS - SO - Loop through Files in DBFS
Once you have the files, loop through them and for each file use the code you have written in your question.
Please note that dbutils has the mtime of a file in it. The os module provides way to identify the ctime i.e. the time of most recent metadata changes on Unix, - ideally should have been st_birthtime - but that does not seem to work in my trials...Hope it works for you...

Dealing with overwritten files in Databricks Autoloader

Main topic
I am facing a problem that I am struggling a lot to solve:
Ingest files that already have been captured by Autoloader but were
overwritten with new data.
Detailed problem description
I have a landing folder in a data lake where every day a new file is posted. You can check the image example below:
Each day an automation post a file with new data. This file is named with a suffix meaning the Year and Month of the current period of the posting.
This naming convention results in a file that is overwritten each day with the accumulated data extraction of the current month. The number of files in the folder only increases when the current month is closed and a new month starts.
To deal with that I have implemented the following PySpark code using the Autoloader feature from Databricks:
# Import functions
from pyspark.sql.functions import input_file_name, current_timestamp, col
# Define variables used in code below
checkpoint_directory = "abfss://gpdi-files#hgbsprodgbsflastorage01.dfs.core.windows.net/RAW/Test/_checkpoint/sapex_ap_posted"
data_source = f"abfss://gpdi-files#hgbsprodgbsflastorage01.dfs.core.windows.net/RAW/Test"
source_format = "csv"
table_name = "prod_gbs_gpdi.bronze_data.sapex_ap_posted"
# Configure Auto Loader to ingest csv data to a Delta table
query = (
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", source_format)
.option("cloudFiles.schemaLocation", checkpoint_directory)
.option("header", "true")
.option("delimiter", ";")
.option("skipRows", 7)
.option("modifiedAfter", "2022-10-15 11:34:00.000000 UTC-3") # To ingest files that have a modification timestamp after the provided timestamp.
.option("pathGlobFilter", "AP_SAPEX_KPI_001 - Posted Invoices in *.CSV") # A potential glob pattern to provide for choosing files.
.load(data_source)
.select(
"*",
current_timestamp().alias("_JOB_UPDATED_TIME"),
input_file_name().alias("_JOB_SOURCE_FILE"),
col("_metadata.file_modification_time").alias("_MODIFICATION_TIME")
)
.writeStream
.option("checkpointLocation", checkpoint_directory)
.option("mergeSchema", "true")
.trigger(availableNow=True)
.toTable(table_name)
)
This code allows me to capture each new file and ingest it into a Raw Table.
The problem is that it works fine ONLY when a new file arrives. But if the desired file is overwritten in the landing folder the Autoloader does nothing because it assumes the file has already been ingested, even though the modification time of the file has chaged.
Failed tentative
I tried to use the option modifiedAfter in the code. But it appears to only serve as a filter to prevent files with a Timestamp to be ingested if it has the property before the threshold mentioned in the timestamp string. It dows not reingest files that have Timestamps before the modifiedAfter threshold.
.option("modifiedAfter", "2022-10-15 14:10:00.000000 UTC-3")
Question
Does someone knows how to detect a file that was already ingested but has a different modified date and how to reprocess that to load in a table?
I have figured out a solution to this problem. In the Autoloader Options list in Databricks documentation is possible to see an option called cloudFiles.allowOverwrites. If you enable that in the streaming query then whenever a file is overwritten in the lake the query will ingest it into the target table. Please pay attention that this option will probably duplicate the data whenever a new file is overwritten. Therefore, downstream treatment will be necessary.

Databricks Delta Live Tables - Apply Changes from delta table

I am working with Databricks Delta Live Tables, but have some problems with upserting some tables upstream. I know it is quite a long text below, but I tried to describe my problem as clear as possible. Let me know if some parts are not clear.
I have the following tables and flow:
Landing_zone -> This is a folder in which JSON files are added that contain data of inserted or updated records.
Raw_table -> This is the data in the JSON files but in table format. This table is in delta format. No transformations are done, except from transforming the JSON structure into a tabular structure (I did an explode and then creating columns from the JSON keys).
Intermediate_table -> This is the raw_table, but with some extra columns (depending on other column values).
To go from my landing zone to the raw table I have the following Pyspark code:
cloudfile = {"cloudFiles.format":"JSON",
"cloudFiles.schemaLocation": sourceschemalocation,
"cloudFiles.inferColumnTypes": True}
#dlt.view('landing_view')
def inc_view():
df = (spark
.readStream
.format('cloudFiles')
.options(**cloudFilesOptions)
.load(filpath_to_landing)
<Some transformations to go from JSON to tabular (explode, ...)>
return df
dlt.create_target_table('raw_table',
table_properties = {'delta.enableChangeDataFeed': 'true'})
dlt.apply_changes(target='raw_table',
source='landing_view',
keys=['id'],
sequence_by='updated_at')
This code works as expected. I run it, add a changes.JSON file to the landing zone, rerun the pipeline and the upserts are correctly applied to the 'raw_table'
(However, each time a new parquet file with all the data is created in the delta folder, I would expect that only a parquet file with the inserted and updated rows was added? And that some information about the current version was kept in the delta logs? Not sure if this is relevant for my problem. I already changed the table_properties of the 'raw_table' to enableChangeDataFeed = true. The readStream for 'intermediate_table' then has option(readChangeFeed, 'true')).
Then I have the following code to go from my 'raw_table' to my 'intermediate_table':
#dlt.table(name='V_raw_table', table_properties={delta.enableChangeDataFeed': 'True'})
def raw_table():
df = (spark.readStream
.format('delta')
.option('readChangeFeed', 'true')
.table('LIVE.raw_table'))
df = df.withColumn('ExtraCol', <Transformation>)
return df
ezeg
dlt.create_target_table('intermediate_table')
dlt.apply_changes(target='intermediate_table',
source='V_raw_table',
keys=['id'],
sequence_by='updated_at')
Unfortunately, when I run this, I get the error:
'Detected a data update (for example part-00000-7127bd29-6820-406c-a5a1-e76fc7126150-c000.snappy.parquet) in the source table at version 2. This is currently not supported. If you'd like to ignore updates, set the option 'ignoreChanges' to 'true'. If you would like the data update to be reflected, please restart this query with a fresh checkpoint directory.'
I checked in the 'ignoreChanges', but don't think this is what I want. I would expect that the autoloader would be able to detect the changes in the delta table and pass them through the flow.
I am aware that readStream only works with append, but that is why I would expect that after the 'raw_table' is updated, a new parquet file would be added to the delta folder with only the inserts and updates. This added parquet file is then detected by autoloader and could be used to apply the changes to the 'intermediate_table'.
Am I doing this the wrong way? Or am I overlooking something? Thanks in advance!
As readStream only works with appends, any change in the the source file will create issues downstream. The assumption that an update on "raw_table" will only insert a new parquet file is incorrect. Based on the settings like "optimized writes" or even without it, apply_changes can add or remove files. You can find this information in your "raw_table/_delta_log/xxx.json" under "numTargetFilesAdded" and "numTargetFilesRemoved".
Basically, "Databricks recommends you use Auto Loader to ingest only immutable files".
When you changed the settings to include the option '.option('readChangeFeed', 'true')', you should start with a full refresh(there is dropdown near start). Doing this will resolve the error 'Detected data update xxx', and your code should work for the incremental update.

Can underlying parquet files be deleted without negatively impacting DeltaLake _delta_log

Using .vacuum() on a DeltaLake table is very slow (see Delta Lake (OSS) Table on EMR and S3 - Vacuum takes a long time with no jobs).
If I manually deleted the underlying parquet files and did not add a new json log file or add a new .checkpoint.parquet file and change the _delta_log/_last_checkpoint file that points to it; what would the negative impacts to the DeltaLake table be, if any?
Obviously time-traveling, i.e. loading a previous version of the table that relied on the parquet files I removed, would not work. What I want to know is, would there be any issues reading, writing, or appending to the current version of the DeltaLake table?
What I am thinking of doing in pySpark:
### Assuming a working SparkSession as `spark`
from subprocess import check_output
import json
from pyspark.sql import functions as F
awscmd = "aws s3 cp s3://my_s3_bucket/delta/_delta_log/_last_checkpoint -"
last_checkpoint = str(json.loads(check_output(awscmd, shell=True).decode("utf-8")).get('version')).zfill(20)
s3_bucket_path = "s3a://my_s3_bucket/delta/"
df_chkpt_del = (
spark.read.format("parquet")
.load(f"{s3_bucket_path}/_delta_log/{last_checkpoint}.checkpoint.parquet")
.where(F.col("remove").isNotNull())
.select("remove.*")
.withColumn("deletionTimestamp", F.from_unixtime(F.col("deletionTimestamp")/1000))
.withColumn("delDateDiffDays", F.datediff(F.col("deletionTimestamp"), F.current_timestamp()))
.where(F.col("delDateDiffDays") < -7 )
)
There are a lot of options from here. One could be:
df_chkpt_del.select("path").toPandas().to_csv("files_to_delete.csv", index=False)
Where I could read files_to_delete.csv into a bash array and then use a simple bash for loop passing each parquet file s3 path to an aws s3 rm command to remove the files one by one.
This may be slower than vacuum(), but at least it will not be consuming cluster resources while it is working.
If I do this, will I also have to either:
write a new _delta_log/000000000000000#####.json file that correctly documents these changes?
write a new 000000000000000#####.checkpoint.parquet file that correctly documents these changes and change the _delta_log/_last_checkpoint file to point to that checkpoint.parquet file?
The second option would be easier.
However, if there will be no negative effects if I just remove the files and don't change anything in the _delta_log, then that would be the easiest.
TLDR. Answering this question.
If I manually deleted the underlying parquet files and did not add a new json log file or add a new .checkpoint.parquet file and change the _delta_log/_last_checkpoint file that points to it; what would the negative impacts to the DeltaLake table be, if any?
Yes, this could potentially corrupt your delta table.
Let me briefly answers how delta-lake reads a version using _delta_log.
If you want to read version x then it will go to delta log of all versions from 1 to x-1 and will make a running sum of parquet files to read. Summary of this process is saved as a .checkpoint after every 10th version to make this process of running sum efficient.
What do I mean by this running sum?
Assume,
version 1 log says, add add file_1, file_2, file_3
version 2 log says, add delete file_1, file_2, and add file_4
So when reading version no 2, total instruction will be
add file_1, file_2, file_3 -> delete file_1, file_2, and add file_4
So, resultant files read will be file_3 and file_4.
What if you delete a parquet from a file system?
Say in version 3, you delete file_4 from file system. If you don't use .vacuum then delta log will not know that file_4 is not present, it will try to read it and will fail.

Convert CSV files from multiple directory into parquet in PySpark

I have CSV files from multiple paths that are not parent directories in s3 bucket. All the tables have the same partition keys.
the directory of the s3:
table_name_1/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.csv
table_name_2/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.csv
...
I need to convert these csv files into parquet files and store them in another s3 bucket that has the same directory structure.
the directory of another s3:
table_name_1/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.parquet
table_name_2/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.parquet
...
I have a solution is iterating through the s3 bucket and find the CSV file and convert it to parquet and save to the another S3 path. I find this way is not efficient, because i have a loop and did the conversion one file by one file.
I want to utilize the spark library to improve the efficiency.
Then, I tried:
spark.read.csv('s3n://bucket_name/table_name_1/').write.partitionBy('partition_key_1', 'partition_key_2').parquet('s3n://another_bucket/table_name_1')
This way works good for each table, but to optimize it more, I want to take the table_name as a parameter, something like:
TABLE_NAMES = [table_name_1, table_name_2, ...]
spark.read.csv('s3n://bucket_name/{*TABLE_NAMES}/').write.partitionBy('partition_key_1', 'partition_key_2').parquet('s3n://another_bucket/{*TABLE_NAMES}')
Thanks
The mentioned question provides solutions for reading multiple files at once. The method spark.read.csv(...) accepts one or multiple paths as shown here. For reading the files you can apply the same logic. Although, when it comes to writing, Spark will merge all the given dataset/paths into one Dataframe. Therefore it is not possible to generate from one single dataframe multiple dataframes without applying a custom logic first. So to conclude, there is not such a method for extracting the initial dataframe directly into multiple directories i.e df.write.csv(*TABLE_NAMES).
The good news is that Spark provides a dedicated function namely input_file_name() which returns the file path of the current record. You can use it in combination with TABLE_NAMES to filter on the table name.
Here it is one possible untested PySpark solution:
from pyspark.sql.functions import input_file_name
TABLE_NAMES = [table_name_1, table_name_2, ...]
source_path = "s3n://bucket_name/"
input_paths = [f"{source_path}/{t}" for t in TABLE_NAMES]
all_df = spark.read.csv(*input_paths) \
.withColumn("file_name", input_file_name()) \
.cache()
dest_path = "s3n://another_bucket/"
def write_table(table_name: string) -> None:
all_df.where(all_df["file_name"].contains(table_name))
.write
.partitionBy('partition_key_1','partition_key_2')
.parquet(f"{dest_path}/{table_name}")
for t in TABLE_NAMES:
write_table(t)
Explanation:
We generate and store the input paths into input_paths. This will create paths such as: s3n://bucket_name/table1, s3n://bucket_name/table2 ... s3n://bucket_name/tableN.
Then we load all the paths into one dataframe in which we add a new column called file_name, this will hold the path of each row. Notice that we also use cache here, this is important since we have multiple len(TABLE_NAMES) actions in the following code. Using cache will prevent us from loading the datasource again and again.
Next we create the write_table which is responsible for saving the data for the given table. The next step is to filter based on the table name using all_df["file_name"].contains(table_name), this will return only the records that contain the value of the table_name in the file_name column. Finally we save the filtered data as you already did.
In the last step we call write_table for every item of TABLE_NAMES.
Related links
How to import multiple csv files in a single load?
Get HDFS file path in PySpark for files in sequence file format

Resources