How to reduce the temporary table expiration time in BigQuery - apache-spark

We run Spark jobs which access BigQuery. During the read phase, data is pulled from a temporary table with the naming convention _sbc_*. By default, the table expiration is 24 hours, but for our use case a retention period of 1 hour is more than enough. I was wondering if there is any way we can bring the temporary table expiration down from 24 hours to 1 hour.
Below is how we instantiate the Spark config:
val sparkConf = new SparkConf
sparkConf.setAppName("test-app")
sparkConf.setMaster("local[*]")
sparkConf.set("viewsEnabled", "true")
sparkConf.set("parentProject", "<parentProject>")
sparkConf.set("materializationProject", "<materializationProject>")
sparkConf.set("materializationDataset", "<materializationDataset>")
sparkConf.set("credentials", "<credentials>")
Note: the temporary table is created in the project passed via the materializationProject parameter.
Spark version: spark-2.3.1

The spark-bigquery-connector doesn't provide any option to set the expiration time on the materialized views it creates during reading.
However, if you're using a specific materializationDataset for these jobs, you can directly define the default expiration time for that dataset in BigQuery. It will be applied to all tables and views created under the dataset.
bq update --default_table_expiration 3600 materializationProject:materializationDataset

As of 2023-01-13, there now appears to be an option called materializationExpirationTimeInMinutes that defines the temporary table expiration time. If not set, it defaults to 24 hours.
See the spark-bigquery-connector documentation for details.
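For illustration, a minimal sketch of passing that option on the read (shown here in PySpark; the project, dataset and table names are placeholders, and the option requires a connector version that supports it):
# Sketch: let the materialized temporary table expire after 60 minutes
# instead of the default 24 hours. Project, dataset and table names are placeholders.
df = (spark.read.format("bigquery")
      .option("viewsEnabled", "true")
      .option("materializationProject", "<materializationProject>")
      .option("materializationDataset", "<materializationDataset>")
      .option("materializationExpirationTimeInMinutes", "60")
      .load("<project>.<dataset>.<view>"))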

Related

Airflow dags - reporting of the runtime for tracking purposes

I am trying to find a way to capture DAG stats, i.e. run time (start time, end time), status, dag id, task id, etc. for various DAGs and their tasks in a separate table.
I found the default logs which go to Elasticsearch/Kibana, but there is no simple way to pull the required logs from there back into an S3 table.
Building a separate process to load those logs into S3 would duplicate data, and there would be too much data to scan and filter since tons of other system-related logs are generated as well.
Adding a function to each DAG would mean modifying every DAG.
What other possibilities are there to get this done efficiently, or is there any other Airflow built-in feature that can be used?
You can try using the Ad Hoc Query feature available in Apache Airflow.
This option is available at Data Profiling -> Ad Hoc Query; select airflow_db.
If you wish to get DAG statistics such as start_time, end_time etc., you can simply query in the format below:
select start_date,end_date from dag_run where dag_id = 'your_dag_name'
The above query returns start_time and end_time details of the DAG for all the DAG runs. If you wish to get details for a particular run then you can add another filter condition like below
select start_date,end_date from dag_run where dag_id = 'your_dag_name' and execution_date = '2021-01-01 09:12:59.0000' -- this is a sample time
You can get this execution_date from the tree or graph views. Other columns such as id, dag_id, execution_date, state, run_id and conf are available as well.
You can also refer to https://airflow.apache.org/docs/apache-airflow/1.10.1/profiling.html for more details.
You did not mention whether you need this information in real time or in batches.
Since you do not want to use the ES logs either, you can try Airflow metrics, if they suit your need.
Pulling this information from the database is not particularly efficient, but it is still an option if you are not looking for real-time data collection.
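If you do go the database route, a minimal sketch of pulling the same columns through Airflow's own ORM models rather than raw SQL (an assumption: the script runs where Airflow's configuration and metadata database are reachable, and your_dag_name is a placeholder):
# Sketch: query DAG run stats via Airflow's DagRun model.
from airflow.models import DagRun
from airflow.settings import Session

session = Session()
runs = (session.query(DagRun)
        .filter(DagRun.dag_id == "your_dag_name")
        .order_by(DagRun.execution_date)
        .all())
for r in runs:
    # same fields as the dag_run table in the metadata DB
    print(r.run_id, r.execution_date, r.start_date, r.end_date, r.state)
session.close()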

What to do to prevent Delta Lake checkpoints to be removed in Azure Databricks?

I noticed that I have only 2 checkpoint files in a Delta Lake folder. Every 10 commits, a new checkpoint is created and the oldest one is removed.
For instance, this morning I had 2 checkpoints: 340 and 350. I was able to time travel from 340 to 359.
Now, after a "write" action, I have 2 checkpoints: 350 and 360. I'm now able to time travel from 350 to 360.
What can remove the old checkpoints? How can I prevent that?
I'm using Azure Databricks 7.3 LTS ML.
The ability to perform time travel isn't directly related to checkpoints. A checkpoint is just an optimization that allows quick access to the metadata as a Parquet file without the need to scan individual transaction log files. This blog post describes the transaction log in more detail.
The commit history is retained for 30 days by default and can be customized as described in the documentation. Please note that VACUUM may remove deleted files that are still referenced in the commit log, because data files are retained only for 7 days by default. So it's better to check the corresponding settings.
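If you need a longer time-travel window, both retention settings can be raised as table properties. A sketch, with a hypothetical table path and example intervals (not values from the question):
# Sketch: extend both the transaction-log retention (time travel window) and the
# retention of data files removed by VACUUM beyond the 30-day / 7-day defaults.
spark.sql("""
  ALTER TABLE delta.`/mnt/sandbox/my_table`
  SET TBLPROPERTIES (
    'delta.logRetentionDuration' = 'interval 60 days',
    'delta.deletedFileRetentionDuration' = 'interval 60 days'
  )
""")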
If you perform the following test, you can see that you have history for more than 10 versions:
df = spark.range(10)
for i in range(20):
    df.write.mode("append").format("delta").save("/tmp/dtest")
    # uncomment if you want to see the content of the log after each operation
    # print(dbutils.fs.ls("/tmp/dtest/_delta_log/"))
then check the files in the log - you should see both checkpoint files and files for individual transactions:
%fs ls /tmp/dtest/_delta_log/
also check the history - you should have at least 20 versions:
%sql
describe history delta.`/tmp/dtest/`
and you should be able to go back to an early version:
%sql
SELECT * FROM delta.`/tmp/dtest/` VERSION AS OF 1
If you want to keep your checkpoints for X days, you can set delta.checkpointRetentionDuration to X days this way:
spark.sql("""
  ALTER TABLE delta.`path`
  SET TBLPROPERTIES (
    'delta.checkpointRetentionDuration' = 'X days'
  )
""")
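To confirm the property was applied, you can inspect the table detail afterwards (same placeholder path as above):
# The `properties` column of DESCRIBE DETAIL should now list delta.checkpointRetentionDuration.
spark.sql("DESCRIBE DETAIL delta.`path`").select("properties").show(truncate=False)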

Retaining Delta log transaction data of Delta Lake forever

I had a small confusion about the transaction log of Delta Lake. In the documentation it is mentioned that the default retention policy is 30 days and can be modified with the property delta.logRetentionDuration=interval-string.
But I don't understand when the actual log files are deleted from the _delta_log folder. Is it when we run some operation? Or maybe the VACUUM operation? However, it is mentioned that VACUUM only deletes data files and not logs. But will it delete logs older than the specified log retention duration?
Reference: https://docs.databricks.com/delta/delta-batch.html#data-retention
delta-io/delta PROTOCOL.md:
By default, the reference implementation creates a checkpoint every 10 commits.
There is an async process that runs on every 10th commit to the _delta_log folder. It will create a checkpoint file and will clean up the .crc and .json files that are older than delta.logRetentionDuration.
Checkpoints.scala has checkpoint > checkpointAndCleanupDeltaLog > doLogCleanup. MetadataCleanup.scala has doLogCleanup > cleanUpExpiredLogs.
The value of the option is an interval literal. There is no way to specify a literal infinity, and months and years are not allowed for this particular option (for a reason). However, nothing stops you from saying interval 1000000000 weeks - 19 million years is effectively infinite.
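For example, a sketch of setting such an effectively infinite retention on a hypothetical table named my_delta_table:
# 'infinite' cannot be expressed directly, so use an absurdly large interval instead.
spark.sql("""
  ALTER TABLE my_delta_table
  SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 1000000000 weeks')
""")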

Append only new aggregates based on groupby keys

I have to process some files which arrive daily. The information has a primary key (date, client_id, operation_id), so I created a stream which appends only new data into a Delta table:
operations\
.repartition('date')\
.writeStream\
.outputMode('append')\
.trigger(once=True)\
.option("checkpointLocation", "/mnt/sandbox/operations/_chk")\
.format('delta')\
.partitionBy('date')\
.start('/mnt/sandbox/operations')
This is working fine, but I need to summarize this information grouped by (date, client_id), so I created another stream from this operations table to a new table:
summarized= spark.readStream.format('delta').load('/mnt/sandbox/operations')
summarized= summarized.groupBy('client_id','date').agg(<a lot of aggs>)
summarized.repartition('date')\
.writeStream\
.outputMode('complete')\
.trigger(once=True)\
.option("checkpointLocation", "/mnt/sandbox/summarized/_chk")\
.format('delta')\
.partitionBy('date')\
.start('/mnt/sandbox/summarized')
This is working, but every time I get new data into the operations table, Spark recalculates summarized all over again. I tried to use append mode on the second stream, but it needs watermarks, and the date is DateType.
Is there a way to only calculate new aggregates based on the group keys and append them to summarized?
You need to use Spark Structured Streaming - Window Operations.
When you use windowed operations, the bucketing is done according to windowDuration and slideDuration. windowDuration tells you the length of the window, and slideDuration tells by how much time the window should slide.
If you group by using window() (see the docs), you get a resulting window column along with the other columns you group by, like client_id.
For example:
from pyspark.sql.functions import window

windowDuration = "10 minutes"
slideDuration = "5 minutes"
summarized = before_summary.groupBy(before_summary.client_id,
                                    window(before_summary.date, windowDuration, slideDuration)
                                    ).agg(<a lot of aggs>).orderBy('window')
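Tying this back to the original pipeline, a rough sketch of how the append-mode aggregation could look; it assumes the DateType column is cast to a timestamp so a watermark can be applied, and the single count() stands in for the real aggregations from the question:
from pyspark.sql.functions import col, window, count

summarized = (spark.readStream.format('delta').load('/mnt/sandbox/operations')
              .withColumn('event_ts', col('date').cast('timestamp'))  # watermark needs a timestamp column
              .withWatermark('event_ts', '1 day')
              .groupBy(col('client_id'), window(col('event_ts'), '1 day'))
              .agg(count('operation_id').alias('operations')))        # placeholder aggregation

(summarized.writeStream
 .outputMode('append')            # append now works because of the watermark
 .trigger(once=True)
 .option('checkpointLocation', '/mnt/sandbox/summarized/_chk')
 .format('delta')
 .start('/mnt/sandbox/summarized'))
Note that in append mode a window is only emitted once the watermark passes its end, so each daily aggregate shows up with a delay of up to the watermark interval.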

How to count the number of messages fetched from a Kafka topic in a day?

I am fetching data from Kafka topics and storing it in Delta Lake (Parquet) format. I wish to find the number of messages fetched on a particular day.
My thought process: I thought to read the directory where the data is stored in Parquet format using Spark and apply a count on the files with ".parquet" for a particular day. This returns a count, but I am not really sure if that's the correct way.
Is this way correct? Are there any other ways to count the number of messages fetched from a Kafka topic for a particular day (or duration)?
Messages we consume from a topic not only have a key and a value but also other information, like a timestamp,
which can be used to track the consumer flow.
Timestamp
The timestamp is set by either the broker or the producer, based on the topic configuration. If the topic's configured timestamp type is CREATE_TIME, the timestamp in the producer record will be used by the broker, whereas if the topic is configured with LOG_APPEND_TIME, the timestamp will be overwritten by the broker with the broker's local time while appending the record.
So if you keep the timestamp wherever you store the messages, you can easily track the per-day or per-hour message rate.
Another way is to use a Kafka dashboard like Confluent Control Center (licensed) or Grafana (free) or any other tool to track the message flow.
In our case, while consuming and storing or processing messages, we also route the meta details of each message to Elasticsearch, and we can visualize them through Kibana.
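As an illustration, with the Spark Kafka source the record timestamp is exposed as a timestamp column, so you can keep it when landing the data in Delta and count per day later (a sketch; the broker, topic and paths are placeholders):
from pyspark.sql.functions import to_date, count

# Keep the Kafka record timestamp alongside the payload when writing to Delta.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "my_topic")
       .load()
       .selectExpr("CAST(value AS STRING) AS value", "timestamp"))

(raw.writeStream
 .format("delta")
 .option("checkpointLocation", "/tmp/delta/messages/_chk")
 .start("/tmp/delta/messages"))

# Counting messages per day is then a simple batch aggregation on that column.
daily_counts = (spark.read.format("delta").load("/tmp/delta/messages")
                .groupBy(to_date("timestamp").alias("day"))
                .agg(count("*").alias("messages")))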
You can make use of the "time travel" capabilities that Delta Lake offers.
In your case you can do
// define location of delta table
val deltaPath = "file:///tmp/delta/table"
// travel back in time to the start and end of the day using the option 'timestampAsOf'
val countStart = spark.read.format("delta").option("timestampAsOf", "2021-04-19 00:00:00").load(deltaPath).count()
val countEnd = spark.read.format("delta").option("timestampAsOf", "2021-04-19 23:59:59").load(deltaPath).count()
// print out the number of messages stored in Delta Table within one day
println(countEnd - countStart)
See documentation on Query an older snapshot of a table (time travel).
Another way to retrieve this information, without counting the rows between two versions, is to use the Delta table history. There are several advantages to that: you don't read the whole dataset, and you can take updates & deletes into account as well, for example if you're doing a MERGE operation (that's not possible by comparing .count on different versions, because an update replaces the actual value, or a delete removes the row).
For example, for plain appends, the following code will count all rows written by normal append operations (for other things, like MERGE/UPDATE/DELETE, we may need to look at other metrics):
from delta.tables import *

df = DeltaTable.forName(spark, "ml_versioning.airbnb").history()\
    .filter("timestamp > 'begin_of_day' and timestamp < 'end_of_day'")\
    .selectExpr("cast(nvl(element_at(operationMetrics, 'numOutputRows'), '0') as long) as rows")\
    .groupBy().sum()
