delta log (delta lake) - convert parquet to delta table issues - apache-spark

I'm trying to convert a parquet table to a delta table in Spark.
I'm able to convert the existing table into a delta table using the command below,
import io.delta.tables._
val deltaTable = DeltaTable.convertToDelta(spark, "parquet.`path`")
This created a _delta_log folder in the path with
_last_checkpoint
00000000000000000000.checkpoint.parquet
00000000000000000000.json
inside it. We are good till now.
Next I try to insert data into the table; it inserts without any errors.
But I don't see any new .json file or new checkpoint parquet file (after 10 inserts) in the delta log folder.
Instead I see all the inserts in a new parquet file.
Shouldn't these be merged into one, and shouldn't a new .json commit be created after every insert?
What am I missing here? Any specific configuration?
Spark - 3.2.1
delta - delta-core_2.12-1.2.1
I'm also passing --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
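For reference, a minimal PySpark sketch of the append-and-check flow described above ("path" is the placeholder table path from the question, new_rows is a stand-in DataFrame); appends made through the Delta writer should each add a new commit JSON that you can confirm via the table history:
from delta.tables import DeltaTable

# Append through the Delta writer so the commit is recorded in _delta_log;
# writing plain parquet files into the folder would bypass the log entirely.
new_rows = spark.range(5).toDF("id")   # stand-in for the data being inserted
new_rows.write.format("delta").mode("append").save("path")

# Each append should add a new NNN...N.json commit; verify via the table history.
DeltaTable.forPath(spark, "path").history().select("version", "operation", "timestamp").show()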

Related

reading delta table specific file in folder

I am trying to read a specific file from a folder which contains multiple delta files. Please refer to the attached screenshot.
The reason is that I am looking to read the delta file based on the schema version; the folder mentioned above contains files with different schema structures.
Code snippet for writing a file:
df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("/home/games/Documents/test_delta/")
Code for reading a delta file
import pyspark
from delta import *
builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()
path_to_data = '/home/games/Documents/test_delta/_delta_log/00000000000000000001.json'
df = spark.read.format("delta").load(path_to_data)
df.show()
error :
org.apache.spark.sql.delta.DeltaAnalysisException: /home/games/Documents/test_delta/_delta_log/ is not a Delta table.
You should use:
df = spark.read.format("delta").option("versionAsOf", 0).load("/home/games/Documents/test_delta/")
Note that the path has to be the table root (test_delta), not a file under _delta_log. You can specify other versions instead of 0 depending upon how many times you have overwritten the data. You can also use timestamps. Please see the Delta quickstart for more info.
Also, the _delta_log folder actually contains the Delta transaction log in JSON format, not the actual data. The data is present in the parent folder (test_delta in your case). The files starting with part-0000 are the ones that contain the actual data; these are .parquet files. There are no files with a .delta extension.
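For completeness, a hedged sketch of reading earlier snapshots of the table root (the timestamp below is only a placeholder):
# Time travel reads go against the table root, not a file under _delta_log.
df_v0 = (spark.read.format("delta")
         .option("versionAsOf", 0)
         .load("/home/games/Documents/test_delta/"))
df_ts = (spark.read.format("delta")
         .option("timestampAsOf", "2024-01-01 00:00:00")   # placeholder timestamp
         .load("/home/games/Documents/test_delta/"))
df_v0.show()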

Spark: refresh Delta Table in S3

how can I run the refresh table command on a Delta Table in S3?
When I do
deltatable = DeltaTable.forPath(spark, "s3a://test-bucket/delta_table/")
spark.catalog.refreshTable(deltatable)
I am getting the error:
AttributeError: 'DeltaTable' object has no attribute '_get_object_id'
Does the refresh command only work for Hive tables?
Thanks!
OK, it's really an incorrect function: spark.catalog.refreshTable (doc) is used to refresh table metadata inside Spark, and it expects a table name as a string rather than a DeltaTable object, which is why you get the _get_object_id error. It has nothing to do with recovery of the Delta table.
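For illustration, a minimal sketch of what refreshTable actually expects (the table name below is hypothetical):
# refreshTable takes the catalog table name as a string, not a DeltaTable object.
spark.catalog.refreshTable("my_db.my_delta_table")   # hypothetical registered table name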
To fix this on Delta you need to do something different. Unfortunately I'm not 100% sure about the right way for the open-source Delta implementation; on Databricks we have the FSCK REPAIR TABLE SQL command for that. I would try the following (be careful, make a backup!):
If the removed files were in the recent version, then you may try to use the RESTORE command with spark.sql.files.ignoreMissingFiles set to true.
If the removed files were for a specific partition, then you can read the table (again with spark.sql.files.ignoreMissingFiles set to true), keep only the data for those partitions, and write the data using overwrite mode with the replaceWhere option (doc) that contains the condition.
Or you can read the whole Delta table (again with spark.sql.files.ignoreMissingFiles set to true) and write it back in overwrite mode, as in the sketch below; it will of course duplicate your data, but the old files will be removed by VACUUM.
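A minimal sketch of that last option, assuming the S3 path from the question:
# Skip the data files that no longer exist while reading the current snapshot.
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

path = "s3a://test-bucket/delta_table/"
df = spark.read.format("delta").load(path)

# Writing back in overwrite mode logically removes the old files from the log;
# they are physically deleted later by VACUUM.
df.write.format("delta").mode("overwrite").save(path)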

Unable to read Databricks Delta / Parquet File with Delta Format

I am trying to read a delta / parquet file in Databricks using the following code:
df3 = spark.read.format("delta").load('/mnt/lake/CUR/CURATED/origination/company/opportunities_final/curorigination.presentation.parquet')
However, I'm getting the following error:
A partition path fragment should be the form like `part1=foo/part2=bar`. The partition path: curorigination.presentation.parquet
This seemed very straightforward, but not sure why I'm getting the error
Any thoughts?
The file structure looks like the following
The error shows that Delta Lake thinks you have wrong partition path naming.
If you have any partition columns in your delta table, for example year, month, day, your path should look like /mnt/lake/CUR/CURATED/origination/company/opportunities_final/year=yyyy/month=mm/day=dd/curorigination.presentation.parquet, and you just need to do df = spark.read.format("delta").load("/mnt/lake/CUR/CURATED/origination/company/opportunities_final").
If you just read it as parquet, you can simply do df = spark.read.parquet("/mnt/lake/CUR/CURATED/origination/company/opportunities_final"), because you don't need to point to the absolute path of an individual parquet file.
The above error mainly happens because of the incorrect path format curorigination.presentation.parquet. Please check your delta location and also check whether the delta files were created or not:
%fs ls /mnt/lake/CUR/CURATED/origination/company/opportunities_final/
I reproduced the same thing in my environment. First of all, I created a data frame from a parquet file.
df1 = spark.read.format("parquet").load("/FileStore/tables/")
display(df1)
After that I just converted the parquet file into delta format and saved it to this location: /mnt/lake/CUR/CURATED/origination/company/opportunities_final/demo_delta1.
df1.coalesce(1).write.format('delta').mode("overwrite").save("/mnt/lake/CUR/CURATED/origination/company/opportunities_final/demo_delta1")
#Reading delta file
df3 = spark.read.format("delta").load("/mnt/lake/CUR/CURATED/origination/company/opportunities_final/demo_delta1")
display(df3)
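If in doubt whether a folder is actually the root of a Delta table (i.e. contains a _delta_log), a quick check before reading:
from delta.tables import DeltaTable

root = "/mnt/lake/CUR/CURATED/origination/company/opportunities_final/demo_delta1"
print(DeltaTable.isDeltaTable(spark, root))   # True only if `root` holds a _delta_log folder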

Databricks Delta Live Tables - Apply Changes from delta table

I am working with Databricks Delta Live Tables, but have some problems with upserting some tables upstream. I know it is quite a long text below, but I tried to describe my problem as clear as possible. Let me know if some parts are not clear.
I have the following tables and flow:
Landing_zone -> This is a folder in which JSON files are added that contain data of inserted or updated records.
Raw_table -> This is the data from the JSON files but in table format. This table is in delta format. No transformations are done, except for transforming the JSON structure into a tabular structure (I did an explode and then created columns from the JSON keys).
Intermediate_table -> This is the raw_table, but with some extra columns (depending on other column values).
To go from my landing zone to the raw table I have the following Pyspark code:
import dlt

cloudfile = {"cloudFiles.format": "JSON",
             "cloudFiles.schemaLocation": sourceschemalocation,
             "cloudFiles.inferColumnTypes": True}

@dlt.view(name='landing_view')
def inc_view():
    df = (spark
          .readStream
          .format('cloudFiles')
          .options(**cloudfile)
          .load(filpath_to_landing))
    # <Some transformations to go from JSON to tabular (explode, ...)>
    return df

dlt.create_target_table('raw_table',
                        table_properties={'delta.enableChangeDataFeed': 'true'})

dlt.apply_changes(target='raw_table',
                  source='landing_view',
                  keys=['id'],
                  sequence_by='updated_at')
This code works as expected. I run it, add a changes.JSON file to the landing zone, rerun the pipeline, and the upserts are correctly applied to the 'raw_table'.
(However, each time a new parquet file with all the data is created in the delta folder. I would expect that only a parquet file with the inserted and updated rows would be added, and that some information about the current version would be kept in the delta logs. Not sure if this is relevant for my problem. I already changed the table_properties of the 'raw_table' to enableChangeDataFeed = true. The readStream for 'intermediate_table' then has option(readChangeFeed, 'true').)
Then I have the following code to go from my 'raw_table' to my 'intermediate_table':
@dlt.table(name='V_raw_table', table_properties={'delta.enableChangeDataFeed': 'true'})
def raw_table():
    df = (spark.readStream
          .format('delta')
          .option('readChangeFeed', 'true')
          .table('LIVE.raw_table'))
    df = df.withColumn('ExtraCol', <Transformation>)
    return df

dlt.create_target_table('intermediate_table')

dlt.apply_changes(target='intermediate_table',
                  source='V_raw_table',
                  keys=['id'],
                  sequence_by='updated_at')
Unfortunately, when I run this, I get the error:
'Detected a data update (for example part-00000-7127bd29-6820-406c-a5a1-e76fc7126150-c000.snappy.parquet) in the source table at version 2. This is currently not supported. If you'd like to ignore updates, set the option 'ignoreChanges' to 'true'. If you would like the data update to be reflected, please restart this query with a fresh checkpoint directory.'
I checked the 'ignoreChanges' option, but I don't think this is what I want. I would expect that the autoloader would be able to detect the changes in the delta table and pass them through the flow.
I am aware that readStream only works with appends, but that is why I would expect that after the 'raw_table' is updated, a new parquet file with only the inserts and updates would be added to the delta folder. This added parquet file would then be detected by autoloader and could be used to apply the changes to the 'intermediate_table'.
Am I doing this the wrong way? Or am I overlooking something? Thanks in advance!
As readStream only works with appends, any change in the source files will create issues downstream. The assumption that an update on "raw_table" will only insert a new parquet file is incorrect: with settings like "optimized writes", or even without them, apply_changes can both add and remove files. You can find this information in your "raw_table/_delta_log/xxx.json" under "numTargetFilesAdded" and "numTargetFilesRemoved".
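As a hedged sketch (the commit file name and path are placeholders to adjust for your storage location), those metrics can be inspected like this:
# apply_changes runs a MERGE; its commitInfo entry carries the file-level metrics.
commit = spark.read.json("raw_table/_delta_log/00000000000000000002.json")
(commit.where("commitInfo IS NOT NULL")
       .select("commitInfo.operation",
               "commitInfo.operationMetrics.numTargetFilesAdded",
               "commitInfo.operationMetrics.numTargetFilesRemoved")
       .show(truncate=False))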
Basically, "Databricks recommends you use Auto Loader to ingest only immutable files".
When you changed the settings to include the option .option('readChangeFeed', 'true'), you should start with a full refresh (there is a dropdown near Start). Doing this will resolve the 'Detected a data update xxx' error, and your code should work for the incremental updates.

Loading Parquet File into a Hive Table stored as Parquet fails (values are null)

I am simply trying to create a table in Hive that is stored as a parquet file, then transform the csv file that holds the data into a parquet file, and then load it into the HDFS directory to insert the values. Below is the sequence I am following, but to no avail:
First I created a table in Hive:
CREATE external table if not EXISTS db1.managed_table55 (dummy string)
stored as parquet
location '/hadoop/db1/managed_table55';
Then I loaded a parquet file into the above HDFS location using this Spark code:
df=spark.read.csv("/user/use_this.csv", header='true')
df.write.save('/hadoop/db1/managed_table55/test.parquet', format="parquet")
It loads, but here is the output: all null values:
Here are the original values in the use_this.csv file that I transformed into a parquet file:
This is proof that the specified location created the table's folder (managed_table55) and the file (test.parquet):
Any ideas or suggestions as to why this keeps happening? I know there's probably a minor tweak, but I can't seem to identify it.
As you are writing the parquet file to the location /hadoop/db1/managed_table55/test.parquet, try creating the table at that same location and reading the data from the Hive table.
Create Hive Table:
hive> CREATE external table if not EXISTS db1.managed_table55 (dummy string)
stored as parquet
location '/hadoop/db1/managed_table55/test.parquet';
Pyspark:
df=spark.read.csv("/user/use_this.csv", header='true')
df.write.save('/hadoop/db1/managed_table55/test.parquet', format="parquet")
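Assuming the external table now points at the directory where the parquet file was written, a quick check from Spark (names taken from the question):
# Query the Hive table through Spark SQL to confirm whether the values come through.
spark.sql("SELECT * FROM db1.managed_table55 LIMIT 5").show()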
