Sub directories under checkpoint directory for spark structured streaming - apache-spark

Spark Structured Streaming creates four subdirectories under the checkpoint directory. What is each of them for?
/warehouse/test_topic/checkpointdir1/commits
/warehouse/test_topic/checkpointdir1/metadata
/warehouse/test_topic/checkpointdir1/offsets
/warehouse/test_topic/checkpointdir1/sources

From the StreamExecution class doc:
/**
* A write-ahead-log that records the offsets that are present in each batch. In order to ensure
* that a given batch will always consist of the same data, we write to this log *before* any
* processing is done. Thus, the Nth record in this log indicated data that is currently being
* processed and the N-1th entry indicates which offsets have been durably committed to the sink.
*/
val offsetLog = new OffsetSeqLog(sparkSession, checkpointFile("offsets"))
/**
* A log that records the batch ids that have completed. This is used to check if a batch was
* fully processed, and its output was committed to the sink, hence no need to process it again.
* This is used (for instance) during restart, to help identify which batch to run next.
*/
val commitLog = new CommitLog(sparkSession, checkpointFile("commits"))
The metadata log stores information related to the query, e.g. in KafkaSource it is used to write the starting offsets of the query (one offset per partition).

The sources folder contains the initial Kafka offset values for each partition.
For example, if your Kafka topic has 3 partitions 1, 2, 3 and the starting offset for each partition is 0, it will contain a value like {1:0, 2:0, 3:0}.
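For context, here is a minimal PySpark sketch (broker address and the console sink are placeholders, not from the question) of a query whose checkpointLocation produces exactly these four subdirectories:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()

# Kafka source; broker address and topic name are placeholders.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "test_topic")
      .load())

# Any sink works; setting checkpointLocation is what creates the
# offsets/, commits/, metadata/ and sources/ subdirectories.
query = (df.selectExpr("CAST(value AS STRING) AS value")
         .writeStream
         .format("console")
         .option("checkpointLocation", "/warehouse/test_topic/checkpointdir1")
         .start())

query.awaitTermination()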

Related

How to parallelly merge data into partitions of databricks delta table using PySpark/Spark streaming?

I have a PySpark streaming pipeline which reads data from a Kafka topic; the data goes through various transformations and finally gets merged into a Databricks Delta table.
In the beginning we were loading data into the delta table by using the merge function as given below.
This incoming dataframe inc_df had data for all partitions.
MERGE INTO main_db.main_delta_table main_dt USING inc_df df ON
main_dt.continent=df.continent AND main_dt.str_id=df.str_id AND
main_dt.rule_date=df.rule_date AND main_dt.rule_id=df.rule_id AND
main_dt.rule_read_start=df.rule_read_start AND
main_dt.company = df.company
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
We were executing the above query on table level.
I have given a very basic diagram of the process in the image below.
But my delta table is partitioned on continent and year.
For example, this is what my partitioned delta table looks like.
So I tried implementing the merge at partition level and tried to run the merge activity on multiple partitions in parallel.
i.e. I created separate pipelines with partition-level filters in the queries. The image can be seen below.
MERGE INTO main_db.main_delta_table main_dt USING inc_df df ON
main_dt.continent in ('AFRICA') AND main_dt.year in ('202301') AND
main_dt.continent=df.continent AND main_dt.str_id=df.str_id AND
main_dt.rule_date=df.rule_date AND main_dt.rule_id=df.rule_id AND
main_dt.rule_read_start=df.rule_read_start AND
main_dt.company = df.company
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
But I am seeing an error with concurrency.
- com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were added to partition [continent=AFRICA, year=2021] by a concurrent update. Please try the operation again.
I understand that the error is telling me that it cannot update files concurrently.
But I have a huge volume of data in production, and I don't want to perform the merge at table level, where there are almost 1 billion records, without proper filters.
Trial 2:
As an alternative approach, I saved my incremental dataframe in an S3 bucket (like a staging dir) and ended my streaming pipeline there.
Then I have a separate PySpark job that reads data from that S3 staging dir and performs the merge into my main delta table, once again at partition level (I have specified partitions in those jobs as filters).
But I am facing the same exception/error there as well.
Could anyone let me know how I can design and optimise my streaming pipeline to merge data into the delta table at partition level by having multiple jobs run in parallel (jobs running on individual partitions)?
Trial 3:
I also made another attempt with a different approach, as mentioned in the link and the ConcurrentAppendException section on that page.
base_delta = DeltaTable.forPath(spark, 's3://PATH_OF_BASE_DELTA_TABLE')
(base_delta.alias("main_dt").merge(
    source=final_incremental_df.alias("df"),
    condition="main_dt.continent=df.continent AND main_dt.str_id=df.str_id AND main_dt.rule_date=df.rule_date AND main_dt.rule_id=df.rule_id AND main_dt.rule_read_start=df.rule_read_start AND main_dt.company = df.company, continent='Africa'")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
and
base_delta = DeltaTable.forPath(spark, 's3://PATH_OF_BASE_DELTA_TABLE')
(base_delta.alias("main_dt").merge(
    source=final_incremental_df.alias("df"),
    condition="main_dt.continent=df.continent AND main_dt.str_id=df.str_id AND main_dt.rule_date=df.rule_date AND main_dt.rule_id=df.rule_id AND main_dt.rule_read_start=df.rule_read_start AND main_dt.company = df.company, continent='ASIA'")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
I ran the above merge operations in two separate pipelines.
But I am still facing the same issue.
In your trial 3, you need to change the merge condition.
Instead of
condition="main_dt.continent=df.continent AND [...]"
it should be
condition="main_dt.continent='Africa' AND [...]"
You should also delete the continent='Africa' from the end of the condition.
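Applied to the snippet from trial 3, the corrected call would look roughly like this (a sketch only; final_incremental_df and the S3 path are the question's placeholders):

from delta.tables import DeltaTable

base_delta = DeltaTable.forPath(spark, 's3://PATH_OF_BASE_DELTA_TABLE')

# The partition is pinned with a literal inside the condition itself,
# and the trailing ", continent='Africa'" clause is removed.
(base_delta.alias("main_dt").merge(
    source=final_incremental_df.alias("df"),
    condition="main_dt.continent='Africa' AND main_dt.str_id=df.str_id AND "
              "main_dt.rule_date=df.rule_date AND main_dt.rule_id=df.rule_id AND "
              "main_dt.rule_read_start=df.rule_read_start AND main_dt.company=df.company")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())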
Here is the documentation for reference.

How does spark (2.3 or newer) determine the number of tasks to read hive table files in a gs bucket or hdfs?

Input Data:
a hive table (T) with 35 files (~1.5GB each, SequenceFile)
files are in a gs bucket
default fs.gs.block.size=~128MB
all other parameters are default
Experiment 1:
create a dataproc with 2 workers (4 core per worker)
run select count(*) from T;
Experiment 1 Result:
~650 tasks created to read the hive table files
each task read ~85MB data
Experiment 2:
create a dataproc with 64 workers (4 core per worker)
run select count(*) from T;
Experiment 2 Result:
~24,480 tasks created to read the hive table files
each task read ~2.5MB data
(It seems to me that one task reading 2.5MB of data is not a good idea, as the time to open the file would probably be longer than the time to read 2.5MB.)
Q1: Any idea how spark determines the number of tasks to read hive table data files?
I repeated the same experiments by putting the same data in hdfs and I got similar results.
My understanding is that the number of tasks to read hive table files should be the same as the number of blocks in hdfs. Q2: Is that correct? Q3: Is that also correct when data is in gs bucket (instead of hdfs)?
Thanks in advance!
The number of tasks in one stage is equal to the number of partitions of the input data, which is in turn determined by the data size and the related configs (dfs.blocksize (HDFS), fs.gs.block.size (GCS), mapreduce.input.fileinputformat.split.minsize, mapreduce.input.fileinputformat.split.maxsize). For a complex query that involves multiple stages, the total number of tasks is the sum of the number of tasks of all stages.
There is no difference between HDFS and GCS, except they use different configs for block size, dfs.blocksize vs fs.gs.block.size.
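As an illustration, these split-related Hadoop settings can be passed to the input format through Spark's spark.hadoop.* config prefix. A sketch (the 256 MB value is an arbitrary example, not a recommendation):

from pyspark.sql import SparkSession

# Hadoop split settings passed via the spark.hadoop.* prefix;
# larger min/max split sizes mean fewer, bigger read tasks.
split_size = str(256 * 1024 * 1024)  # 256 MB, arbitrary example

spark = (SparkSession.builder
         .appName("split-size-demo")
         .config("spark.hadoop.mapreduce.input.fileinputformat.split.minsize", split_size)
         .config("spark.hadoop.mapreduce.input.fileinputformat.split.maxsize", split_size)
         .enableHiveSupport()
         .getOrCreate())

spark.sql("select count(*) from T").show()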
See the following related questions:
How are stages split into tasks in Spark?
How does Spark SQL decide the number of partitions it will use when loading data from a Hive table?

Location of WAL in Spark Structured Streaming

I have enabled WAL for my Structured Streaming Application. Where do I find the location of WAL logs?
I am able to see the WAL for my Spark Streaming process under the prefix receivedBlockMetadata, but I don't see any prefix created for Structured Streaming.
According to my understanding, that WAL only exists in Spark Streaming, not Structured Streaming.
Structured Streaming implements fault tolerance based on checkpoints, similar to Flink's global state. The checkpoint stores all the state, including Kafka offsets and other metadata. The location is specified in your code.
In Spark Structured Streaming there is no WAL entry for every message from the receiver.
There are only two logs with metadata for every batch: the offsets and commits logs.
You can find the implementation details in org.apache.spark.sql.execution.streaming.StreamExecution:
/**
* A write-ahead-log that records the offsets that are present in each batch. In order to ensure
* that a given batch will always consist of the same data, we write to this log *before* any
* processing is done. Thus, the Nth record in this log indicated data that is currently being
* processed and the N-1th entry indicates which offsets have been durably committed to the sink.
*/
val offsetLog = new OffsetSeqLog(sparkSession, checkpointFile("offsets"))
/**
* A log that records the batch ids that have completed. This is used to check if a batch was
* fully processed, and its output was committed to the sink, hence no need to process it again.
* This is used (for instance) during restart, to help identify which batch to run next.
*/
val commitLog = new CommitLog(sparkSession, checkpointFile("commits"))
Both of them are available in the checkpointLocation, in the offsets and commits folders.
In Structured Streaming, these logs contain only offset information.

Spark structured streaming from Kafka checkpoint and acknowledgement

In my spark structured streaming application, I am reading messages from Kafka, filtering them and then finally persisting to Cassandra. I am using spark 2.4.1. From the structured streaming documentation
Fault Tolerance Semantics
Delivering end-to-end exactly-once semantics was one of key goals behind the design of Structured Streaming. To achieve that, we have designed the Structured Streaming sources, the sinks and the execution engine to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing. Every streaming source is assumed to have offsets (similar to Kafka offsets, or Kinesis sequence numbers) to track the read position in the stream. The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure.
But I am not sure how Spark actually achieves this. In my case, if the Cassandra cluster is down, leading to failures in the write operation, will the checkpoint for Kafka not record those offsets?
Is the Kafka checkpoint offset based only on successful reads from Kafka, or is the entire operation, including the write, considered for each message?
Spark Structured Streaming is not committing offsets to Kafka as a "normal" Kafka consumer would do.
Spark is managing the offsets internally with a checkpointing mechanism.
Have a look at the first response to the following question, which gives a good explanation of how the state is managed with checkpoints and the commits log: How to get Kafka offsets for structured query for manual and reliable offset management?
Spark uses multiple log files to ensure fault tolerance.
The ones relevant to your query are the offset log and the commit log.
From the StreamExecution class doc:
/**
* A write-ahead-log that records the offsets that are present in each batch. In order to ensure
* that a given batch will always consist of the same data, we write to this log *before* any
* processing is done. Thus, the Nth record in this log indicated data that is currently being
* processed and the N-1th entry indicates which offsets have been durably committed to the sink.
*/
val offsetLog = new OffsetSeqLog(sparkSession, checkpointFile("offsets"))
/**
* A log that records the batch ids that have completed. This is used to check if a batch was
* fully processed, and its output was committed to the sink, hence no need to process it again.
* This is used (for instance) during restart, to help identify which batch to run next.
*/
val commitLog = new CommitLog(sparkSession, checkpointFile("commits"))
So when it reads from Kafka, it writes the offsets to the offsetLog, and only after processing the data and writing it to the sink (in your case Cassandra) does it write the offsets to the commitLog.
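To make that order concrete, here is a rough PySpark sketch of the kind of query described, assuming the Spark Cassandra Connector is on the classpath (keyspace, table, topic, broker names and paths are placeholders). The batch is only added to commits/ after the foreachBatch function returns successfully:

# Offsets for each micro-batch are written to offsets/ before processing;
# the batch id is appended to commits/ only after this function succeeds.
def write_to_cassandra(batch_df, batch_id):
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="my_keyspace", table="my_table")
     .mode("append")
     .save())

kafka_df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "input_topic")
            .load())

query = (kafka_df.selectExpr("CAST(value AS STRING) AS value")
         .filter("value IS NOT NULL")  # stand-in for the question's filtering step
         .writeStream
         .foreachBatch(write_to_cassandra)
         .option("checkpointLocation", "/path/to/checkpoint")
         .start())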

Spark structured streaming consistency across sinks

I'd like to understand better the consistency model of Spark 2.2 structured streaming in the following case :
one source (Kinesis)
2 queries from this source towards 2 different sinks: one file sink for archival purposes (S3), and another sink for processed data (DB or file, not yet decided)
I'd like to understand if there is any consistency guarantee across sinks, at least under certain circumstances:
Can one of the sinks be way ahead of the other? Or are they consuming data at the same speed from the source (since it's the same source)? Can they be synchronous?
If I (gracefully) stop the streaming application, will the data on the 2 sinks be consistent?
The reason is that I'd like to build a Kappa-like processing app, with the ability to suspend/shut down the streaming part when I want to reprocess some history, and, when I resume the streaming, avoid reprocessing something that has already been processed (because it is in the history) or missing something (e.g. data that has not been committed to the archive and is then skipped as already processed when the streaming resumes).
One important thing to keep in mind is that the 2 sinks will be used from 2 distinct queries, each reading independently from the source. So checkpointing is done per query.
Whenever you call start on a DataStreamWriter, a new query is created, and if you set checkpointLocation, each query will have its own checkpointing to track which offsets have been committed to the sink.
val input = spark.readStream....
val query1 = input.select('colA, 'colB)
.writeStream
.format("parquet")
.option("checkpointLocation", "path/to/checkpoint/dir1")
.start("/path1")
val query2 = input.select('colA, 'colB)
.writeStream
.format("csv")
.option("checkpointLocation", "path/to/checkpoint/dir2")
.start("/path2")
So each query is reading from the source and tracking offsets independently, which also means each query can be at different offsets of the input stream, and you can restart either or both without impacting the other.
UPDATE
I wanted to make another suggestion now that Databricks Delta is open sourced. A common pattern I've used is landing data from upstream sources directly into an append-only Delta table. Then, with Structured Streaming, you can efficiently subscribe to the table and process the new records incrementally. Delta's internal transaction log is more efficient than the S3 file listings required with the basic file source. This ensures you have a consistent source of data across multiple queries, pulling from S3 rather than from Kinesis.
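A minimal sketch of that pattern, assuming the Delta Lake libraries are available (all paths are placeholders): both downstream queries subscribe to the same append-only Delta table rather than to Kinesis directly.

# An append-only Delta "landing" table used as the common streaming source.
raw = (spark.readStream
       .format("delta")
       .load("/mnt/landing/events"))

# Archive query
archive = (raw.writeStream
           .format("parquet")
           .option("checkpointLocation", "/mnt/checkpoints/archive")
           .start("/mnt/archive/events"))

# Processing query
processed = (raw.select("colA", "colB")
             .writeStream
             .format("delta")
             .option("checkpointLocation", "/mnt/checkpoints/processed")
             .start("/mnt/processed/events"))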
What Silvio has written is absolutely correct.
Writing to 2 sinks will start two streaming queries running independently of each other (effectively 2 streaming applications reading the same data twice, processing it twice, and checkpointing on their own).
I just want to add that if you want both queries to stop/pause at the same time in case of a restart or failure of either query, there is the option of using the API awaitAnyTermination().
Instead of using:
query.start().awaitTermination()
use:
sparkSession.streams.awaitAnyTermination()
Adding excerpts from the API documentation:
/**
* Wait until any of the queries on the associated SQLContext has terminated since the
* creation of the context, or since `resetTerminated()` was called. If any query was terminated
* with an exception, then the exception will be thrown.
*
* If a query has terminated, then subsequent calls to `awaitAnyTermination()` will either
* return immediately (if the query was terminated by `query.stop()`),
* or throw the exception immediately (if the query was terminated with exception). Use
* `resetTerminated()` to clear past terminations and wait for new terminations.
*
* In the case where multiple queries have terminated since `resetTermination()` was called,
* if any query has terminated with exception, then `awaitAnyTermination()` will
* throw any of the exception. For correctly documenting exceptions across multiple queries,
* users need to stop all of them after any of them terminates with exception, and then check the
* `query.exception()` for each query.
*
* @throws StreamingQueryException if any query has terminated with an exception
*
* @since 2.0.0
*/
@throws[StreamingQueryException]
def awaitAnyTermination(): Unit = {
  awaitTerminationLock.synchronized {
    while (lastTerminatedQuery == null) {
      awaitTerminationLock.wait(10)
    }
    if (lastTerminatedQuery != null && lastTerminatedQuery.exception.nonEmpty) {
      throw lastTerminatedQuery.exception.get
    }
  }
}
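A minimal PySpark usage sketch of this pattern (df1, df2 and the paths are placeholders): block until any query terminates, then stop whichever queries are still active so both sinks halt together.

query1 = (df1.writeStream
          .format("parquet")
          .option("checkpointLocation", "path/to/checkpoint/dir1")
          .start("/path1"))
query2 = (df2.writeStream
          .format("csv")
          .option("checkpointLocation", "path/to/checkpoint/dir2")
          .start("/path2"))

try:
    # Returns (or raises) as soon as either query terminates.
    spark.streams.awaitAnyTermination()
finally:
    # Stop the surviving query so both stop at roughly the same point.
    for q in spark.streams.active:
        q.stop()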
