I have a Spark Streaming query running in Databricks. While loading data from a Kafka topic into Delta Lake, the cell output displays "Compute snapshot for version : 3001". I have seen this message many times before, but this is the first time I'm seeing such an abnormally large number.
What exactly does this message mean? How should one interpret what's happening under the hood?
Also, does a high version number have any impact on the performance of the task?
From the question I've inferred that you are saving the data in the Delta Lake format, which by design has the concept of Time Travel: it lets you track changes in the underlying data by keeping so-called snapshots of the table. Every commit produces a new table version, and the message means Spark is computing the table's snapshot (its state as of that version), so a long-running streaming job that commits a micro-batch on every trigger will push that number up quickly:
more about Snapshots: https://books.japila.pl/delta-lake-internals/Snapshot/
more about Time Travel in Delta Lake: https://docs.delta.io/latest/delta-batch.html#query-an-older-snapshot-of-a-table-time-travel
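For illustration, a minimal sketch of inspecting the history and reading an older snapshot (the table path is a placeholder, and it assumes a Databricks notebook where spark is already defined):

    import io.delta.tables.DeltaTable

    // Each commit shows up here as a new version -- the same numbers that appear in
    // "Compute snapshot for version : N".
    DeltaTable.forPath(spark, "/mnt/delta/events").history().show(5)

    // Time travel: read the table as it looked at an earlier version.
    val snapshot = spark.read
      .format("delta")
      .option("versionAsOf", 3000)
      .load("/mnt/delta/events")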
Related
We have a use case where we receive a large volume of data (80 GB split across 300 files, arriving every 5 minutes) in ADLS Gen2, and we use the Spark connector to write from ADLS Gen2 to a Kusto table.
During the write stage, we noticed that multiple cores are used to batch the data but only one core is used to write to the Kusto table, i.e., 80 GB is written with a single core while the remaining cores sit idle.
This process takes a good 20-25 minutes, and we have a tight SLA of 10 minutes.
Azure Databricks (5 nodes, each with 28 GB RAM and 8 CPU cores)
Each file is ~260 MB uncompressed, in Parquet format. I have also seen a best-practices document saying file size should be between 100 MB and 1 GB uncompressed.
We use the writeStream API in Databricks to write the data.
What is the ideal approach to writing the data from ADLS to ADX in a distributed way using the Spark connector?
First, the most efficient flow from ADLS storage to ADX is Event Grid, because writing through the Spark connector means the data is translated into Spark's internal format and then into CSV, which is sent to ADX. From the conversation with you it was clear that you are using Spark to transform the data before ingestion; in that case the Spark connector is a good choice.
From version 3.1.0, the connector flow is split by default into three jobs (unless writeMode.Queued is used). The first translates the data into CSV, writes it to storage, and queues an ingestion for ADX; this is done in a distributed fashion. The second stage polls on these ingestions until all of them finish successfully, to ensure transactionality; this is done with a single core because the operation is really cheap (a call to table storage) and there is no need to hold more than one worker for it. The third stage seals the transaction (a metadata operation in ADX) and therefore also needs only one core.
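For reference, a rough sketch of a streaming write through the connector. The cluster, database, table, credentials, schema, and paths are placeholders, and the option keys below are recalled from the connector's examples, so please verify them against the azure-kusto-spark README for your connector version:

    // Rough sketch only -- option keys may differ between connector versions; check the
    // azure-kusto-spark documentation. Assumes a Databricks notebook where spark exists.
    val source = spark.readStream
      .format("parquet")
      .schema("id INT, payload STRING, ts TIMESTAMP")                    // placeholder schema
      .load("abfss://landing@youraccount.dfs.core.windows.net/incoming") // placeholder path

    source.writeStream
      .format("com.microsoft.kusto.spark.datasink.KustoSinkProvider")
      .option("kustoCluster", "https://yourcluster.kusto.windows.net")   // placeholder
      .option("kustoDatabase", "yourdb")                                 // placeholder
      .option("kustoTable", "yourtable")                                 // placeholder
      .option("kustoAadAppId", "<app-id>")                               // placeholder credentials
      .option("kustoAadAppSecret", "<app-secret>")
      .option("kustoAadAuthorityID", "<tenant-id>")
      .option("checkpointLocation", "/checkpoints/adx-ingest")
      .start()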
Hello Spark community,
Currently we have a batch pipeline in Azure Databricks that reads from a Delta table. The job runs once per night. We extract its data, save it in our own location as a Delta table again, and then we write to an Azure SQL Database table. Everything is partitioned on date like this: Date=2021-01-01, etc.
Things are about to change, since our source Delta table will soon be refreshed every 2-3 minutes and the requirement is to switch the ETL from nightly batch to streaming mode, while still using the same source and target tables as well as the same SQL database table.
Right now this streaming challenge raises several questions:
Our Delta source table is quite huge (30B+ rows). So far in our nightly batch we extracted only the changed date keys and used MERGE INTO to write the updates/inserts. Once we switch to streaming mode, is it possible to tell Spark to start streaming from a specific point in time, since we do not want to load the whole table again?
How do you write a stream to a target Delta table with MERGE INTO when a huge number of keys has to be checked on both sides? I suppose we can make use of foreachBatch and do it in a micro-batch manner, but how would each new micro-batch check for existing/non-existing keys in our target Delta table in a time-efficient way (30B+ rows in the current target table)? A rough sketch of what I mean follows below.
I'm not sure about this, but is it possible for a streaming Spark job to write directly to a SQL database (Azure), and would that become a bottleneck and therefore be inadvisable?
I am really looking forward to some good advice and design decisions, since this would be quite a big issue at the current data size. I appreciate every opinion here!
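For context, the kind of foreachBatch upsert I have in mind looks roughly like this (the paths, the starting timestamp, and the key columns are placeholders; it assumes a Databricks notebook where spark is already defined):

    import io.delta.tables.DeltaTable
    import org.apache.spark.sql.DataFrame

    val target = DeltaTable.forPath(spark, "/mnt/curated/target_table")   // placeholder path

    // Upsert one micro-batch into the target with MERGE; including the partition column
    // (Date) in the condition should let Delta prune files on the 30B-row target.
    def upsertBatch(batch: DataFrame, batchId: Long): Unit = {
      target.as("t")
        .merge(batch.as("s"), "t.Date = s.Date AND t.id = s.id")          // placeholder keys
        .whenMatched().updateAll()
        .whenNotMatched().insertAll()
        .execute()
    }

    spark.readStream
      .format("delta")
      .option("startingTimestamp", "2021-06-01")   // start from a point in time, not the full table
      .load("/mnt/raw/source_table")               // placeholder path
      .writeStream
      .foreachBatch(upsertBatch _)
      .option("checkpointLocation", "/mnt/checkpoints/target_merge")
      .start()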
What is the difference between append and overwrite when writing Parquet in Spark?
I'm processing a huge amount of data for, say, 10 days. At present I'm processing the daily logs into Parquet files using the "append" mode and partitioning the data by date. But the problem I'm facing is that the daily data is also very large and takes a lot of time, with high CPU usage as well, while processing on the EMR cluster. This is making my job very slow and expensive. So I'm looking for a way to further split the data and then merge it back at the day level.
Please see the Spark SaveMode docs:
https://spark.apache.org/docs/latest/api/java/index.html
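For illustration, a minimal sketch of the two modes (the output path and columns are placeholders):

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().appName("savemode-demo").getOrCreate()
    import spark.implicits._

    // Toy data; in the real job this would be the day's processed logs.
    val df = Seq(("2024-01-01", "event-a"), ("2024-01-02", "event-b")).toDF("date", "event")

    // append: adds new files under the target path, keeping whatever is already there.
    df.write.mode(SaveMode.Append).partitionBy("date").parquet("/tmp/logs_parquet")

    // overwrite: replaces the data at the target path (all partitions, unless dynamic
    // partition overwrite is enabled via spark.sql.sources.partitionOverwriteMode=dynamic).
    df.write.mode(SaveMode.Overwrite).partitionBy("date").parquet("/tmp/logs_parquet")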
I am interested in using Spark Structured Streaming for real-time data processing over data from, for example, the last ~24 hours of running; however, I am not able to find the correct solution for this problem.
Some useful information about the situation:
Data is constantly flowing in as input to Spark, so the stream is active 24/7
Spark performs some actions and then writes data to files (e.g., Parquet)
Watermarking is used to reduce state size
Someone wants to work on only the most recent data produced by Spark Structured Streaming (for example, all data from the last 24 hours) to get a quick view of what happened recently and for further, very specific analysis.
From what I understand, watermarking helps manage the state size so Spark does not hold the entire state. This is a good thing and solves one problem with running 24/7.
The other problem is the output data. Currently Spark only appends the data, which makes it grow bigger and bigger. Using the memory sink for testing creates memory problems. I didn't try the file sink because it creates one file per record (ugh), so there's a risk of exhausting all available inodes in the system extremely quickly; I could, however, create one file per window with the file sink.
So my question is:
Is it possible to force Spark Structured Streaming to delete output data after some amount of time, when it is no longer needed? I want to keep output data only from, for example, the last 24 hours. Is there any built-in solution, or do I need to do it on my own? If I had to do it on my own, wouldn't the checkpoint data and Spark metadata get corrupted?
Using watermarking will allow us to keep only the last 24 hours of state:
df.withWatermark("timestamp", "24 hours") // where "timestamp" is your event-time field
From the Spark documentation:
In Spark 2.1, we have introduced watermarking, which lets the engine automatically track the current event time in the data and attempt to clean up old state accordingly.
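A fuller sketch, assuming an event-time column named timestamp and using the built-in rate source purely for illustration (the output and checkpoint paths are placeholders):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.window

    val spark = SparkSession.builder().appName("watermark-demo").getOrCreate()
    import spark.implicits._

    // The rate source emits a "timestamp" column; in your job this would be your event-time field.
    val events = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

    // State older than the 24-hour watermark is dropped by the engine automatically.
    val hourlyCounts = events
      .withWatermark("timestamp", "24 hours")
      .groupBy(window($"timestamp", "1 hour"))
      .count()

    hourlyCounts.writeStream
      .format("parquet")
      .option("path", "/tmp/hourly_counts")
      .option("checkpointLocation", "/tmp/hourly_counts_ckpt")
      .outputMode("append")
      .start()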
I found the distinction between the different notions of time in the Apache Flink documentation under Event Time / Processing Time / Ingestion Time.
Event time is the time that each individual event occurred on its producing device.
And that's what datasets come with, so it is available in Spark Structured Streaming out of the box.
Processing time refers to the system time of the machine that is executing the respective operation.
Ingestion time is the time that events enter Flink.
Processing time and ingestion time are the two I'm concerned with. I think I know how to achieve processing time, but I am not sure about ingestion time (or perhaps I'm wrong and it's the opposite).
How do you achieve ingestion time in Spark Structured Streaming 2.2 and later?
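To make the question concrete, here is roughly what I mean (the Kafka broker and topic are placeholders, and spark is assumed to be the active session): current_timestamp() seems to give processing time, and I'm wondering whether the broker-assigned timestamp column of the Kafka source is the right analogue of Flink's ingestion time.

    import org.apache.spark.sql.functions.current_timestamp

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")    // placeholder
      .option("subscribe", "events")                        // placeholder topic
      .load()
      // Processing time: stamped when the micro-batch is actually processed.
      .withColumn("processing_time", current_timestamp())
      // Kafka's broker-assigned timestamp -- is this ingestion time?
      .withColumnRenamed("timestamp", "ingestion_time")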