Is spark structured streaming exactly-once when partitioning by system time? - apache-spark

Let's say I have a Kafka topic without any duplicate messages.
If I consume this topic with Spark Structured Streaming, add a column with currentTime(), partition by this time column, and save the records to S3, is there a risk of creating duplicates in S3 after a failure?
Or is Spark smart enough to deliver these messages exactly once?

Related

Can Apache Spark repartition the data received from a single Kafka partition?

We have a Kafka topic with multiple partitions and the load is uneven/skewed in each partition. We can't change the partitioning strategy for Kafka.
I am looking for some ways to repartition the data so that the load is equally processed by all the nodes present within the Apache Spark cluster.
Is it possible to repartition the data across all the nodes received from a single Kafka partition?
If yes, is there a way to do it efficiently, given that we also have to maintain the state stores while aggregating the data?

Does Spark Structured Streaming maintain the order of Kafka messages?

I have a Spark Structured Streaming application that consumes messages from multiple Kafka topics and writes the results to another Kafka topic. To maintain the integrity of the data, it's imperative that the order of messages in source partitions is maintained. So if message A precedes message B in a partition, processed(A) should be written to the output topic before processed(B) (processed A and B will go to the same partition too as the same hash string is used).
Does Spark Structured Streaming guarantee this?

Spark structured streaming from Kafka checkpoint and acknowledgement

In my Spark Structured Streaming application, I am reading messages from Kafka, filtering them, and finally persisting them to Cassandra. I am using Spark 2.4.1. From the Structured Streaming documentation:
Fault Tolerance Semantics
Delivering end-to-end exactly-once semantics was one of key goals behind the design of Structured Streaming. To achieve that, we have designed the Structured Streaming sources, the sinks and the execution engine to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing. Every streaming source is assumed to have offsets (similar to Kafka offsets, or Kinesis sequence numbers) to track the read position in the stream. The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure.
But I am not sure how Spark actually achieves this. In my case, if the Cassandra cluster is down and the write operation fails, will the checkpoint for Kafka not record those offsets?
Is the Kafka checkpoint offset based only on successful reads from Kafka, or is the entire operation, including the write, considered for each message?
Spark Structured Streaming does not commit offsets to Kafka as a "normal" Kafka consumer would.
Spark manages the offsets internally with a checkpointing mechanism.
Have a look at the first response to the following question, which gives a good explanation of how the state is managed with checkpoints and the commit log: How to get Kafka offsets for structured query for manual and reliable offset management?
Spark uses multiple log files to ensure fault tolerance.
The ones relevant to your query are the offset log and the commit log.
From the StreamExecution class doc:
/**
 * A write-ahead-log that records the offsets that are present in each batch. In order to ensure
 * that a given batch will always consist of the same data, we write to this log *before* any
 * processing is done. Thus, the Nth record in this log indicated data that is currently being
 * processed and the N-1th entry indicates which offsets have been durably committed to the sink.
 */
val offsetLog = new OffsetSeqLog(sparkSession, checkpointFile("offsets"))

/**
 * A log that records the batch ids that have completed. This is used to check if a batch was
 * fully processed, and its output was committed to the sink, hence no need to process it again.
 * This is used (for instance) during restart, to help identify which batch to run next.
 */
val commitLog = new CommitLog(sparkSession, checkpointFile("commits"))
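So when Spark reads from Kafka, it first writes the offsets of the batch to the offsetLog. Only after processing the data and writing it to the sink (in your case Cassandra) does it record the batch id in the commitLog. If the write to Cassandra fails, the batch never appears in the commitLog, so on restart Spark re-runs that batch with the same offsets.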
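To make the interplay of the two logs concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is a simplified in-memory model, not Spark's implementation: `CheckpointModel`, `PlannedOffsets`, `runBatch` and `batchToReplay` are hypothetical names introduced only for illustration, and the real logs are files under the checkpoint directory.

```scala
import scala.collection.immutable.SortedMap

// Hypothetical stand-in for the offset range planned for one micro-batch.
final case class PlannedOffsets(from: Long, until: Long)

// Simplified in-memory model of the two logs; NOT Spark's real OffsetSeqLog/CommitLog.
final class CheckpointModel {
  // batchId -> offsets planned for that batch (written BEFORE processing)
  private var offsetLog = SortedMap.empty[Long, PlannedOffsets]
  // batchIds whose output was fully written to the sink (written AFTER the sink write)
  private var commitLog = Set.empty[Long]

  /** Runs one micro-batch; `writeToSink` may throw, e.g. when the sink is unreachable. */
  def runBatch(batchId: Long, offsets: PlannedOffsets)(writeToSink: PlannedOffsets => Unit): Unit = {
    offsetLog = offsetLog.updated(batchId, offsets) // 1. log the offsets before any processing
    writeToSink(offsets)                            // 2. process and write; a failure skips step 3
    commitLog = commitLog + batchId                 // 3. record the batch as completed
  }

  /** On restart: the latest planned batch, if it was never committed. */
  def batchToReplay: Option[(Long, PlannedOffsets)] =
    offsetLog.lastOption.filterNot { case (id, _) => commitLog.contains(id) }
}
```

In this model, if `writeToSink` throws for batch 1, `batchToReplay` returns batch 1 together with its original offsets, and it becomes empty once the batch has been re-run successfully. Because the re-run covers the same offsets, the sink sees the same data again, which is why the documentation quoted above asks for idempotent sinks.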

How to export data from hive to kafka

I need to export data from Hive to Kafka topics based on some events in another Kafka topic. I know I can read data from hive in Spark job using HQL and write it to Kafka from the Spark, but is there a better way?
This can be achieved using unstructured streaming (the DStream-based Spark Streaming API). The steps are:
1. Create a Spark Streaming job that connects to the required topic and fetches the data export information.
2. From the stream, do a collect and store your data export requirements in driver variables.
3. Create a DataFrame using the specified condition.
4. Write the DataFrame to the required topic using the Kafka sink (format("kafka")); KafkaUtils itself only reads from Kafka.
5. Choose a polling interval based on your data volume and Kafka write throughput.
Typically, you do this the other way around (Kafka to HDFS/Hive).
But you are welcome to try using the Kafka Connect JDBC plugin to read from a Hive table on a scheduled basis, which converts the rows into structured key-value Kafka messages.
Otherwise, I would re-evaluate other tools, because Hive is slow. Couchbase or Cassandra offer much better CDC features for ingestion into Kafka. Alternatively, rewrite the upstream applications that insert into Hive so that they write directly into Kafka, from which you can join with other topics, for example.

Kafka Partition+Spark Streaming Context

Scenario: I have 1 topic with 2 partitions that hold different data set collections, say A and B. I am aware that the DStream can consume messages at the partition level and at the topic level.
Query: Should we use two different streaming contexts, one for each partition, or a single streaming context for the entire topic and filter the partition-level data later? I am concerned about performance as the number of streaming contexts increases.
Quoting from the documentation.
Simplified Parallelism: No need to create multiple input Kafka streams
and union them. With directStream, Spark Streaming will create as many
RDD partitions as there are Kafka partitions to consume, which will
all read data from Kafka in parallel. So there is a one-to-one mapping
between Kafka and RDD partitions, which is easier to understand and
tune.
Therefore, if you are using the direct-stream-based Spark Streaming consumer, it should handle the parallelism for you.

Resources