reading delta table specific file in folder - apache-spark

I am trying to read a specific file from a folder which contain multiple delta files,Please refer attached screenshot
Reason I am looking to read the delta file based on the schema version. The folder mentioned above contains files with different different schema structure.
code snippet for writing a file :
df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("/home/games/Documents/test_delta/")
Code for reading a delta file
import pyspark[![enter image description here][1]][1]
from delta import *
builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()
path_to_data = '/home/games/Documents/test_delta/_delta_log/00000000000000000001.json'
df = spark.read.format("delta").load(path_to_data)
df.show()
error :
org.apache.spark.sql.delta.DeltaAnalysisException: /home/games/Documents/test_delta/_delta_log/ is not a Delta table.

You should use:
df = spark.read.format("delta").option("versionAsOf", 0).load(path_to_data)
You can specify other versions instead of 0 depending upon how many times how have overwritten the data. You can also use timestamps. Please see delta quick-start for more info.
Also, the delta_log folder actually contains delta transaction log in json format, not the actual data. The data is present in parent folder (test_delta in your case). The files starting with part-0000 are the ones that contain the actual data. These are .parquet files. There are no files with .delta extensions.

Related

Dealing with overwritten files in Databricks Autoloader

Main topic
I am facing a problem that I am struggling a lot to solve:
Ingest files that already have been captured by Autoloader but were
overwritten with new data.
Detailed problem description
I have a landing folder in a data lake where every day a new file is posted. You can check the image example below:
Each day an automation post a file with new data. This file is named with a suffix meaning the Year and Month of the current period of the posting.
This naming convention results in a file that is overwritten each day with the accumulated data extraction of the current month. The number of files in the folder only increases when the current month is closed and a new month starts.
To deal with that I have implemented the following PySpark code using the Autoloader feature from Databricks:
# Import functions
from pyspark.sql.functions import input_file_name, current_timestamp, col
# Define variables used in code below
checkpoint_directory = "abfss://gpdi-files#hgbsprodgbsflastorage01.dfs.core.windows.net/RAW/Test/_checkpoint/sapex_ap_posted"
data_source = f"abfss://gpdi-files#hgbsprodgbsflastorage01.dfs.core.windows.net/RAW/Test"
source_format = "csv"
table_name = "prod_gbs_gpdi.bronze_data.sapex_ap_posted"
# Configure Auto Loader to ingest csv data to a Delta table
query = (
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", source_format)
.option("cloudFiles.schemaLocation", checkpoint_directory)
.option("header", "true")
.option("delimiter", ";")
.option("skipRows", 7)
.option("modifiedAfter", "2022-10-15 11:34:00.000000 UTC-3") # To ingest files that have a modification timestamp after the provided timestamp.
.option("pathGlobFilter", "AP_SAPEX_KPI_001 - Posted Invoices in *.CSV") # A potential glob pattern to provide for choosing files.
.load(data_source)
.select(
"*",
current_timestamp().alias("_JOB_UPDATED_TIME"),
input_file_name().alias("_JOB_SOURCE_FILE"),
col("_metadata.file_modification_time").alias("_MODIFICATION_TIME")
)
.writeStream
.option("checkpointLocation", checkpoint_directory)
.option("mergeSchema", "true")
.trigger(availableNow=True)
.toTable(table_name)
)
This code allows me to capture each new file and ingest it into a Raw Table.
The problem is that it works fine ONLY when a new file arrives. But if the desired file is overwritten in the landing folder the Autoloader does nothing because it assumes the file has already been ingested, even though the modification time of the file has chaged.
Failed tentative
I tried to use the option modifiedAfter in the code. But it appears to only serve as a filter to prevent files with a Timestamp to be ingested if it has the property before the threshold mentioned in the timestamp string. It dows not reingest files that have Timestamps before the modifiedAfter threshold.
.option("modifiedAfter", "2022-10-15 14:10:00.000000 UTC-3")
Question
Does someone knows how to detect a file that was already ingested but has a different modified date and how to reprocess that to load in a table?
I have figured out a solution to this problem. In the Autoloader Options list in Databricks documentation is possible to see an option called cloudFiles.allowOverwrites. If you enable that in the streaming query then whenever a file is overwritten in the lake the query will ingest it into the target table. Please pay attention that this option will probably duplicate the data whenever a new file is overwritten. Therefore, downstream treatment will be necessary.

Unable to read Databricks Delta / Parquet File with Delta Format

I am trying to read a delta / parquet in Databricks using the follow code in Databricks
df3 = spark.read.format("delta").load('/mnt/lake/CUR/CURATED/origination/company/opportunities_final/curorigination.presentation.parquet')
However, I'm getting the following error:
A partition path fragment should be the form like `part1=foo/part2=bar`. The partition path: curorigination.presentation.parquet
This seemed very straightforward, but not sure why I'm getting the error
Any thoughts?
The file structure looks like the following
The error shows that delta lake think you have wrong partition path naming.
If you have any partition column in your delta table, for example year month day, your path should look like /mnt/lake/CUR/CURATED/origination/company/opportunities_final/year=yyyy/month=mm/day=dd/curorigination.presentation.parquet and you just need to do df = spark.read.format("delta").load("/mnt/lake/CUR/CURATED/origination/company/opportunities_final").
If you just read it as parquet, you can just do df = spark.read.parquet("/mnt/lake/CUR/CURATED/origination/company/opportunities_final") because you don't need to read the absolute path of the parquet file.
The above error mainly happens because of incorrect path format curorigination.presentation.parquet. please check your delta location and also check whether delta file is created or not :
%fs ls /mnt/lake/CUR/CURATED/origination/company/opportunities_final/
I reproduced the same thing in my environment. First of all, I created a data frame with a parquet file.
df1 = spark.read.format("parquet").load("/FileStore/tables/")
display(df1)
After that I just converted the parquet file into delta format and saved the file into this location/mnt/lake/CUR/CURATED/origination/company/opportunities_final/demo_delta1 .
df1.coalesce(1).write.format('delta').mode("overwrite").save("/mnt/lake/CUR/CURATED/origination/company/opportunities_final/demo_delta1")
#Reading delta file
df3 = spark.read.format("delta").load("/mnt/lake/CUR/CURATED/origination/company/opportunities_final/demo_delta")
display(df3)

delta log (delta lake) - convert parquet to delta table issues

I'm using trying to convert parquet table to delta table in spark.
I'm able to convert the existing table into delta table using the below command,
import io.delta.tables._
val deltaTable = DeltaTable.convertToDelta(spark, "parquet.`path`")
This created delta log folder in the path along with
lastcheckpoint
00000000000000000000.checkpoint.parquet
00000000000000000000.json files
inside it, we are good till now.
Next I try to insert data into the table, without any errors it would insert.
But I dont any new .json file or new checkpoint parquet file (after 10 inserts) in the delta log folder.
Instead I see all the inserts in a new parquet file.
It should be merged into one right? and every-time there should be a new json created after every inserts
What am I missing here? Any specific configuration?
Spark - 3.2.1
delta - delta-core_2.12-1.2.1
also using --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

PySpark/DataBricks: How to read parquet files using 'file:///' and not 'dbfs'

I am trying to use petastorm in a different manner which requires that I tell it where my parquet files are stored through one of the following:
hdfs://some_hdfs_cluster/user/yevgeni/parquet8, or file:///tmp/mydataset, or s3://bucket/mydataset, or gs://bucket/mydataset. Since I am on DataBricks and given other constraints, my option is to use the file:/// option.
However, I am at a loss as to how specify the location of my parquet files. I continually get rejected saying that Path does not exist:
Here is what I am doing:
# save spark df to parquet
dbutils.fs.rm('dbfs:/mnt/team01/assembled_train.parquet', recurse=True)
assembled_train.write.parquet('dbfs:/mnt/team01/assembled_train')
# look at files
display(dbutils.fs.ls('mnt/team01/assembled_train/'))
# results
path name size
dbfs:/mnt/team01/assembled_train/_SUCCESS _SUCCESS 0
dbfs:/mnt/team01/assembled_train/_committed_2150262571233317067 _committed_2150262571233317067 856
dbfs:/mnt/team01/assembled_train/_started_2150262571233317067 _started_2150262571233317067 0
dbfs:/mnt/team01/assembled_train/part-00000-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035357-1-c000.snappy.parquet part-00000-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035357-1-c000.snappy.parquet 578991
dbfs:/mnt/team01/assembled_train/part-00001-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035358-1-c000.snappy.parquet part-00001-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035358-1-c000.snappy.parquet 579640
dbfs:/mnt/team01/assembled_train/part-00002-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035359-1-c000.snappy.parquet part-00002-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035359-1-c000.snappy.parquet 580675
dbfs:/mnt/team01/assembled_train/part-00003-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035360-1-c000.snappy.parquet part-00003-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035360-1-c000.snappy.parquet 579483
dbfs:/mnt/team01/assembled_train/part-00004-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035361-1-c000.snappy.parquet part-00004-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035361-1-c000.snappy.parquet 578807
dbfs:/mnt/team01/assembled_train/part-00005-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035362-1-c000.snappy.parquet part-00005-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035362-1-c000.snappy.parquet 580942
dbfs:/mnt/team01/assembled_train/part-00006-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035363-1-c000.snappy.parquet part-00006-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035363-1-c000.snappy.parquet 579202
dbfs:/mnt/team01/assembled_train/part-00007-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035364-1-c000.snappy.parquet part-00007-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035364-1-c000.snappy.parquet 579810
While testing with a basic dataframe load from the file structure, like so:
df1 = spark.read.option("header", "true").parquet('file:///mnt/team01/assembled_train/part-00000-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035357-1-c000.snappy.parquet')```
I get file does not exist.
You just need to specify the path as it is, no need for 'file:///':
df1 = spark.read.option("header", "true").parquet('/mnt/team01/assembled_train/part-00000-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035357-1-c000.snappy.parquet')
If this doesn't work, try the methods in https://docs.databricks.com/applications/machine-learning/load-data/petastorm.html#configure-cache-directory

Convert CSV files from multiple directory into parquet in PySpark

I have CSV files from multiple paths that are not parent directories in s3 bucket. All the tables have the same partition keys.
the directory of the s3:
table_name_1/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.csv
table_name_2/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.csv
...
I need to convert these csv files into parquet files and store them in another s3 bucket that has the same directory structure.
the directory of another s3:
table_name_1/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.parquet
table_name_2/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.parquet
...
I have a solution is iterating through the s3 bucket and find the CSV file and convert it to parquet and save to the another S3 path. I find this way is not efficient, because i have a loop and did the conversion one file by one file.
I want to utilize the spark library to improve the efficiency.
Then, I tried:
spark.read.csv('s3n://bucket_name/table_name_1/').write.partitionBy('partition_key_1', 'partition_key_2').parquet('s3n://another_bucket/table_name_1')
This way works good for each table, but to optimize it more, I want to take the table_name as a parameter, something like:
TABLE_NAMES = [table_name_1, table_name_2, ...]
spark.read.csv('s3n://bucket_name/{*TABLE_NAMES}/').write.partitionBy('partition_key_1', 'partition_key_2').parquet('s3n://another_bucket/{*TABLE_NAMES}')
Thanks
The mentioned question provides solutions for reading multiple files at once. The method spark.read.csv(...) accepts one or multiple paths as shown here. For reading the files you can apply the same logic. Although, when it comes to writing, Spark will merge all the given dataset/paths into one Dataframe. Therefore it is not possible to generate from one single dataframe multiple dataframes without applying a custom logic first. So to conclude, there is not such a method for extracting the initial dataframe directly into multiple directories i.e df.write.csv(*TABLE_NAMES).
The good news is that Spark provides a dedicated function namely input_file_name() which returns the file path of the current record. You can use it in combination with TABLE_NAMES to filter on the table name.
Here it is one possible untested PySpark solution:
from pyspark.sql.functions import input_file_name
TABLE_NAMES = [table_name_1, table_name_2, ...]
source_path = "s3n://bucket_name/"
input_paths = [f"{source_path}/{t}" for t in TABLE_NAMES]
all_df = spark.read.csv(*input_paths) \
.withColumn("file_name", input_file_name()) \
.cache()
dest_path = "s3n://another_bucket/"
def write_table(table_name: string) -> None:
all_df.where(all_df["file_name"].contains(table_name))
.write
.partitionBy('partition_key_1','partition_key_2')
.parquet(f"{dest_path}/{table_name}")
for t in TABLE_NAMES:
write_table(t)
Explanation:
We generate and store the input paths into input_paths. This will create paths such as: s3n://bucket_name/table1, s3n://bucket_name/table2 ... s3n://bucket_name/tableN.
Then we load all the paths into one dataframe in which we add a new column called file_name, this will hold the path of each row. Notice that we also use cache here, this is important since we have multiple len(TABLE_NAMES) actions in the following code. Using cache will prevent us from loading the datasource again and again.
Next we create the write_table which is responsible for saving the data for the given table. The next step is to filter based on the table name using all_df["file_name"].contains(table_name), this will return only the records that contain the value of the table_name in the file_name column. Finally we save the filtered data as you already did.
In the last step we call write_table for every item of TABLE_NAMES.
Related links
How to import multiple csv files in a single load?
Get HDFS file path in PySpark for files in sequence file format

Resources