Apache Iceberg Scheme Evolution using Spark

Apache Iceberg Scheme Evolution using Spark - apache-spark

Currently I am using Iceberg in my project, so I am having one doubt in that.
My Current Scenario:
I have loaded the data into my Iceberg table using spark data frame(this is my doing through spark job)
df.writeTo("catalog.mydb.test2").using("iceberg").create()
Now From source side I have added two colums and started the Job which is doing merge
df.createOrReplaceTempView("myview")
spark.sql("MERGE INTO catalog.mydb.test2 as t USING (SELECT * FROM myview) as s ON t.id = s.id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT ")
Doing both of these step I am expecting new columns to be added into the target table but it did not worked,
As I can see Iceberg Support full schema evolution.. What does it means..if it is not adding any columns dynamically to my target table.
Please help how can I achieve adding new columns into my target table dynmically.

You can enable this with merge-schema option, but we don't recommend it because, as #shay__ points out, it can sometimes cause unmanageable catastrophes.

Related

Databricks Delta Live Tables - Apply Changes from delta table

I am working with Databricks Delta Live Tables, but have some problems with upserting some tables upstream. I know it is quite a long text below, but I tried to describe my problem as clear as possible. Let me know if some parts are not clear.
I have the following tables and flow:
Landing_zone -> This is a folder in which JSON files are added that contain data of inserted or updated records.
Raw_table -> This is the data in the JSON files but in table format. This table is in delta format. No transformations are done, except from transforming the JSON structure into a tabular structure (I did an explode and then creating columns from the JSON keys).
Intermediate_table -> This is the raw_table, but with some extra columns (depending on other column values).
To go from my landing zone to the raw table I have the following Pyspark code:
cloudfile = {"cloudFiles.format":"JSON",
"cloudFiles.schemaLocation": sourceschemalocation,
"cloudFiles.inferColumnTypes": True}
#dlt.view('landing_view')
def inc_view():
df = (spark
.readStream
.format('cloudFiles')
.options(**cloudFilesOptions)
.load(filpath_to_landing)
<Some transformations to go from JSON to tabular (explode, ...)>
return df
dlt.create_target_table('raw_table',
table_properties = {'delta.enableChangeDataFeed': 'true'})
dlt.apply_changes(target='raw_table',
source='landing_view',
keys=['id'],
sequence_by='updated_at')
This code works as expected. I run it, add a changes.JSON file to the landing zone, rerun the pipeline and the upserts are correctly applied to the 'raw_table'
(However, each time a new parquet file with all the data is created in the delta folder, I would expect that only a parquet file with the inserted and updated rows was added? And that some information about the current version was kept in the delta logs? Not sure if this is relevant for my problem. I already changed the table_properties of the 'raw_table' to enableChangeDataFeed = true. The readStream for 'intermediate_table' then has option(readChangeFeed, 'true')).
Then I have the following code to go from my 'raw_table' to my 'intermediate_table':
#dlt.table(name='V_raw_table', table_properties={delta.enableChangeDataFeed': 'True'})
def raw_table():
df = (spark.readStream
.format('delta')
.option('readChangeFeed', 'true')
.table('LIVE.raw_table'))
df = df.withColumn('ExtraCol', <Transformation>)
return df
ezeg
dlt.create_target_table('intermediate_table')
dlt.apply_changes(target='intermediate_table',
source='V_raw_table',
keys=['id'],
sequence_by='updated_at')
Unfortunately, when I run this, I get the error:
'Detected a data update (for example part-00000-7127bd29-6820-406c-a5a1-e76fc7126150-c000.snappy.parquet) in the source table at version 2. This is currently not supported. If you'd like to ignore updates, set the option 'ignoreChanges' to 'true'. If you would like the data update to be reflected, please restart this query with a fresh checkpoint directory.'
I checked in the 'ignoreChanges', but don't think this is what I want. I would expect that the autoloader would be able to detect the changes in the delta table and pass them through the flow.
I am aware that readStream only works with append, but that is why I would expect that after the 'raw_table' is updated, a new parquet file would be added to the delta folder with only the inserts and updates. This added parquet file is then detected by autoloader and could be used to apply the changes to the 'intermediate_table'.
Am I doing this the wrong way? Or am I overlooking something? Thanks in advance!

As readStream only works with appends, any change in the the source file will create issues downstream. The assumption that an update on "raw_table" will only insert a new parquet file is incorrect. Based on the settings like "optimized writes" or even without it, apply_changes can add or remove files. You can find this information in your "raw_table/_delta_log/xxx.json" under "numTargetFilesAdded" and "numTargetFilesRemoved".
Basically, "Databricks recommends you use Auto Loader to ingest only immutable files".

When you changed the settings to include the option '.option('readChangeFeed', 'true')', you should start with a full refresh(there is dropdown near start). Doing this will resolve the error 'Detected data update xxx', and your code should work for the incremental update.

Write spark Dataframe to an exisitng Delta Table by providing TABLE NAME instead of TABLE PATH

I am trying to write spark dataframe into an existing delta table.
I do have multiple scenarios where I could save data into different tables as shown below.
SCENARIO-01:
I have an existing delta table and I have to write dataframe into that table with option mergeSchema since the schema may change for each load.
I am doing the same with below command by providing delta table path
finalDF01.write.format("delta").option("mergeSchema", "true").mode("append") \
.partitionBy("part01","part02").save(finalDF01DestFolderPath)
Just want to know whether this can be done by providing exisiting delta TABLE NAME instead of delta PATH.
This has been resolved by updating data write command as below.
finalDF01.write.format("delta").option("mergeSchema", "true").mode("append") \
.partitionBy("part01","part02").saveAsTable(finalDF01DestTableName)
Is this the correct way ?
SCENARIO 02:
I have to update the existing table if the record already exists and if not insert a new record.
For this I am currently doing as shown below.
spark.sql("SET spark.databricks.delta.schema.autoMerge.enabled = true")
DeltaTable.forPath(DestFolderPath)
.as("t")
.merge(
finalDataFrame.as("s"),
"t.id = s.id AND t.name= s.name")
.whenMatched().updateAll()
.whenNotMatched().insertAll()
.execute()
I tried with below script.
destMasterTable.as("t")
.merge(
vehMasterDf.as("s"),
"t.id = s.id")
.whenNotMatched().insertAll()
.execute()
but getting below error(even with alias instead of as).
error: value as is not a member of String
destMasterTable.as("t")
Here also I am using delta table path as destination, Is there any way so that we could provide delta TABLE NAME instead of TABLE PATH?
It will be good to provide TABLE NAME instead of TABLE PATH, In case if we chage the table path later will not affect the code.
I have not seen anywhere in databricks documentation providing table name along with mergeSchema and autoMerge.
Is it possible to do so?

To use existing data as a table instead of path you either were need to use saveAsTable from the beginning, or just register existing data in the Hive metastore using the SQL command CREATE TABLE USING, like this (syntax could be slightly different depending on if you're running on Databricks, or OSS Spark, and depending on the version of Spark):
CREATE TABLE IF NOT EXISTS my_table
USING delta
LOCATION 'path_to_existing_data'
after that, you can use saveAsTable.
For the second question - it looks like destMasterTable is just a String. To refer to existing table, you need to use function forName from the DeltaTable object (doc):
DeltaTable.forName(destMasterTable)
.as("t")
...

Spark Datasource Hudi table read using instant time

I'm reading Hudi table using Spark.read.format("hudi")
want to understand how is this option works hoodie.datasource.read.begin.instanttime
Will it similar to hudi's hoodie_commit_ts column available in parquets files?
I'm not able to get same count between an external table on top of the same hudi path using hoodie_commit_ts column and below approach.
Sample code is here
beginTime = '20201201194517'
incremental_read_options = {'hoodie.datasource.query.type': 'incremental', 'hoodie.datasource.read.begin.instanttime': beginTime}
Incremental_DF = spark.read.format("org.apache.hudi").
options(**incremental_read_options).
load()

Incremental queries and hoodie.datasource.read.begin.instanttime are based on _hoodie_commit_time data from the metadata table.
What you are accomplishing with this is an incremental read starting from the beginTime to the most recent data upsert. If you pass the exact time of a commit as beginTime, your query won't contain that commit. You would have to pass that (beginTime - 1) for that.
Also, you can use point in time queries, by passing the option hoodie.datasource.read.end.instanttime to restrain your query to a time point between beginTime and endTime (also exclusive).

Spark jdbc overwrite mode not working as expected

I would like to perform update and insert operation using spark
please find the image reference of existing table
Here i am updating id :101 location and inserttime and inserting 2 more records:
and writing to the target with mode overwrite
df.write.format("jdbc")
.option("url", "jdbc:mysql://localhost/test")
.option("driver","com.mysql.jdbc.Driver")
.option("dbtable","temptgtUpdate")
.option("user", "root")
.option("password", "root")
.option("truncate","true")
.mode("overwrite")
.save()
After executing the above command my data is corrupted which is inserted into db table
Data in the dataframe
Could you please let me know your observations and solutions

Spark JDBC writer supports following modes:
append: Append contents of this :class:DataFrame to existing data.
overwrite: Overwrite existing data.
ignore: Silently ignore this operation if data already exists.
error (default case): Throw an exception if data already exists
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Since you are using "overwrite" mode it recreate your table as per then column length, if you want your own table definition create table first and use "append" mode

i would like to perform update and insert operation using spark
There is no equivalent in to SQL UPDATE statement with Spark SQL. Nor is there an equivalent of the SQL DELETE WHERE statement with Spark SQL. Instead, you will have to delete the rows requiring update outside of Spark, then write the Spark dataframe containing the new and updated records to the table using append mode (in order to preserve the remaining existing rows in the table).

In case where you need to perform UPSERT / DELETE operations in your pyspark code, i suggest you to use pymysql libary, and execute your upsert/delete operations. Please check this post for more info, and code sample for reference : Error while using INSERT INTO table ON DUPLICATE KEY, using a for loop array
Please modify the code sample as per your needs.

I wouldn't recommend TRUNCATE, since it would actually drop the table, and create new table. While doing this, the table may lose column level attributes that were set earlier...so be careful while using TRUNCATE, and be sure, if it's ok for dropping the table/recreate the table.

Upsert logic is working fine when following below steps
df = (spark.read.format("csv").
load("file:///C:/Users/test/Desktop/temp1/temp1.csv", header=True,
delimiter=','))
and doing this
(df.write.format("jdbc").
option("url", "jdbc:mysql://localhost/test").
option("driver", "com.mysql.jdbc.Driver").
option("dbtable", "temptgtUpdate").
option("user", "root").
option("password", "root").
option("truncate", "true").
mode("overwrite").save())
Still, I am unable to understand the logic why its failing when i am writing using the data frame directly

Spark returns Empty DataFrame but Populated in Hive

I have a table in hive
db.table_name
When I run the following in hive I get results back
SELECT * FROM db.table_name;
When I run the following in a spark-shell
spark.read.table("db.table_name").show
It shows nothing. Similarly
sql("SELECT * FROM db.table_name").show
Also shows nothing. Selecting arbitrary columns out before the show also displays nothing. Performing a count states the table has 0 rows.
Running the same queries works against other tables in the same database.
Spark Version: 2.2.0.cloudera1
The table is created using
table.write.mode(SaveMode.Overwrite).saveAsTable("db.table_name")
And if I read the file using the parquet files directly it works.
spark.read.parquet(<path-to-files>).show
EDIT:
I'm currently using a workaround by describing the table and getting the location and using spark.read.parquet.

Have you refresh metadata table? Maybe you need to refresh table to access to new data.
spark.catalog.refreshTable("my_table")

I solved the problem by using
query_result.write.mode(SaveMode.Overwrite).format("hive").saveAsTable("table")
which stores the results in textfile.
There is probably some incompatibility with the Hive parquet.
I also found a Cloudera report about it (CDH Release Notes): they recommend creating the Hive table manually and then load data from a temporary table or by query.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Apache Iceberg Scheme Evolution using Spark - apache-spark

You can enable this with merge-schema option, but we don't recommend it because, as #shay__ points out, it can sometimes cause unmanageable catastrophes.

Related

Databricks Delta Live Tables - Apply Changes from delta table

Write spark Dataframe to an exisitng Delta Table by providing TABLE NAME instead of TABLE PATH

Spark Datasource Hudi table read using instant time

Spark jdbc overwrite mode not working as expected

Spark returns Empty DataFrame but Populated in Hive

Categories

Resources