How to checkpoint a delta table manually? - delta-lake

Delta Lake creates a checkpoint automatically every 10 versions. Is there any way to create a checkpoint manually?

import org.apache.spark.sql.delta.DeltaLog
DeltaLog.forTable(spark, dataPath).checkpoint()
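If the goal is only to checkpoint more often than the default of every 10 commits, another option is to lower the delta.checkpointInterval table property instead of triggering checkpoints by hand. A minimal PySpark sketch, assuming a path-based table (the path /delta/events is illustrative, not from the question):

# Lower the automatic checkpoint interval from the default of 10 commits.
# '/delta/events' is an illustrative path.
spark.sql("""
    ALTER TABLE delta.`/delta/events`
    SET TBLPROPERTIES ('delta.checkpointInterval' = '5')
""")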

Related

How to load the latest version of delta parquet using spark?

I have access to a repository where a team writes parquet files (without partitioning them) using Delta (i.e. there is a _delta_log in this repo). I have no access to the table itself, though. To create a dataframe from those parquet files, I am using the code below:
spark.read.format('delta').load(repo)
Executing this loads the entire dataframe, regardless of the delta log. How should I proceed to load the latest version of my data?
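For reference, the Delta reader resolves the current snapshot from the _delta_log when loading the path, and the standard time-travel options let you pin an older snapshot explicitly. A short PySpark sketch (the version number and timestamp are illustrative):

# Loading the path resolves the latest snapshot according to the _delta_log.
latest_df = spark.read.format("delta").load(repo)

# Time travel, if a specific older snapshot is needed (values are illustrative):
v5_df = spark.read.format("delta").option("versionAsOf", 5).load(repo)
ts_df = spark.read.format("delta").option("timestampAsOf", "2023-01-01").load(repo)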

VACUUM/OPTIMIZE Effect on Autoloader Checkpoints

I'm using Databricks Autoloader to incrementally stream from a Delta Lake table into a SQL database. If an OPTIMIZE or VACUUM statement is run against the Delta table, files are added or removed.
My question is, will the autoloader checkpoint discount these optimized files on the next stream? Or will my entire Delta table be streamed into SQL because autoloader doesn't recognize it's already processed the data?
As long as you specify the format of the readStream correctly, the streaming checkpoint will disregard the compacted files created by the OPTIMIZE command, and already-processed data will not be streamed again. In this case, the stream should be started as follows:
spark.readStream.format('delta')
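A minimal end-to-end sketch of such a stream, assuming a JDBC sink written via foreachBatch and placeholder paths and connection details (none of these names come from the question):

# Illustrative sketch: stream a Delta table and append each micro-batch to a SQL database.
def write_to_sql(batch_df, batch_id):
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:sqlserver://<host>;databaseName=<db>")  # placeholder connection
        .option("dbtable", "dbo.events")                             # placeholder target table
        .mode("append")
        .save())

(spark.readStream
    .format("delta")                                    # Delta streaming source, as in the answer
    .load("/delta/events")                              # illustrative source path
    .writeStream
    .foreachBatch(write_to_sql)
    .option("checkpointLocation", "/chk/delta_to_sql")  # remembers which table versions were consumed
    .start())

The checkpointLocation is what lets the stream remember which table versions it has already consumed, so files rewritten by OPTIMIZE are not replayed as new data.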

What is the spark.databricks.delta.snapshotPartitions configuration used for in Delta Lake?

I was going through Delta Lake and came across the configuration spark.databricks.delta.snapshotPartitions; however, I'm not quite sure what it is used for. I can't find it in the Delta Lake documentation either.
In the Delta Lake GitHub repository I found the code below, but I'm not sure how this property works:
val DELTA_SNAPSHOT_PARTITIONS =
  buildConf("snapshotPartitions")
    .internal()
    .doc("Number of partitions to use when building a Delta Lake snapshot.")
    .intConf
    .checkValue(n => n > 0, "Delta snapshot partition number must be positive.")
    .createOptional
Delta Lake uses Spark to process the transaction logs in the _delta_log directory. When Delta Lake loads the transaction logs, it replays them to build the current state of the table, which is called a Snapshot. There is a repartition operation in this step, and spark.databricks.delta.snapshotPartitions controls how many partitions that repartition uses. As your table metadata grows, you may need to increase this config so that each partition of the table metadata still fits into executor memory.
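For example, the setting can be raised when the Spark session is created; a short sketch (the value 200 is only an illustration):

from pyspark.sql import SparkSession

# Illustrative: use more partitions when replaying the _delta_log into a Snapshot,
# e.g. for tables whose metadata has grown large.
spark = (SparkSession.builder
    .appName("delta-snapshot-partitions")
    .config("spark.databricks.delta.snapshotPartitions", "200")
    .getOrCreate())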

How to list all delta tables in Databricks Azure?

I have saved a dataframe to my Delta Lake; below is the command:
df2.write.format("delta").mode("overwrite").partitionBy("updated_date").save("/delta/userdata/")
I can also load and view the Delta Lake /userdata:
dfres=spark.read.format("delta").load("/delta/userdata")
But here I have a doubt: when I move several parquet files from blob storage into the Delta Lake and create dataframes from them, how would someone else know which files I have moved so they can work with those Delta tables? Is there any command to list all the Delta tables in the Delta Lake in Databricks?
Break down the problem into:
Find the paths of all tables you want to check. Managed tables in the default location are stored at spark.conf.get("spark.sql.warehouse.dir") + s"/$tableName". If you have external tables, it is better to use catalog.listTables() followed by catalog.getTableMetadata(ident).location.getPath. Any other paths can be used directly.
Determine which paths belong to Delta tables using DeltaTable.isDeltaTable(path).
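Putting the two steps together, a minimal PySpark sketch (the use of the default database is illustrative, and /delta/userdata is the path from the question; the Scala catalog calls above work the same way):

from delta.tables import DeltaTable

# Illustrative: collect candidate locations, then keep only those that are Delta tables.
warehouse = spark.conf.get("spark.sql.warehouse.dir")
candidate_paths = [f"{warehouse}/{t.name}" for t in spark.catalog.listTables("default")]
candidate_paths.append("/delta/userdata")   # explicitly saved paths can be added directly

delta_paths = [p for p in candidate_paths if DeltaTable.isDeltaTable(spark, p)]
print(delta_paths)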
Hope this helps.

Can Hive read data from the Delta Lake file format?

I started going through the Delta Lake file format. Is Hive capable of reading data from this newly introduced format? If so, could you please let me know which SerDe you are using?
Hive support is available for the Delta Lake file format. The first step is to add the jars from https://github.com/delta-io/connectors to your Hive classpath, and then create a table using the following format:
CREATE EXTERNAL TABLE test.dl_attempts_stream
(
...
)
STORED BY 'io.delta.hive.DeltaStorageHandler'
LOCATION
The Delta format picks up the partitioning automatically, so there is no need to specify partitions when creating the table.
NOTE: If data is being inserted via a Spark job, please provide hive-site.xml and call enableHiveSupport in the Spark job, so the Delta Lake table can be created in Hive. A sketch of that Spark-side setup is shown below.
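A minimal PySpark sketch of that setup, assuming open-source Spark with the Delta extensions configured and an illustrative output path (hive-site.xml must still be on the job's classpath so it points at the same metastore as Hive):

from pyspark.sql import SparkSession

# Illustrative: a Spark job that writes Delta data to the location the Hive external table reads from.
spark = (SparkSession.builder
    .appName("write-delta-for-hive")
    .enableHiveSupport()  # pick up hive-site.xml / the shared metastore
    # The two configs below are needed on open-source Spark + Delta (not on Databricks):
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate())

df = spark.range(10)  # placeholder data
df.write.format("delta").mode("overwrite").save("/delta/dl_attempts_stream")  # illustrative path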