I have a Spark job that reads millions of records from Cassandra, filters them according to business rules, and writes the result to a Kinesis stream. I can't find any examples or write-ups on how to invoke the KPL (Kinesis Producer Library) from Spark. Is that the correct approach? Do I have any other options?
You can create one KPL producer per partition and send that partition's messages through it. Keep the partitions small to avoid overloading the task/core nodes.
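A minimal sketch of the one-producer-per-partition pattern, under stated assumptions: `RecordSink` and `PartitionWriter` are hypothetical names introduced here for illustration, and `RecordSink` stands in for a wrapper around the KPL's `KinesisProducer`, which is not shown. The function takes the iterator that Spark hands to `foreachPartition`, creates one sink for it, sends every record, then flushes and closes.

```scala
// Hypothetical stand-in for a KPL producer; a real job would wrap
// com.amazonaws.services.kinesis.producer.KinesisProducer behind this trait.
trait RecordSink {
  def send(partitionKey: String, payload: Array[Byte]): Unit
  def flush(): Unit // block until buffered records have been delivered
  def close(): Unit // release threads and other resources
}

object PartitionWriter {
  /** Sends every record of one partition through a single sink instance
    * and returns the number of records sent. */
  def writePartition[T](records: Iterator[T], newSink: () => RecordSink)(
      keyOf: T => String,
      encode: T => Array[Byte]): Long =
    if (!records.hasNext) 0L // skip producer start-up for empty partitions
    else {
      val sink = newSink() // one producer per partition, created on the executor
      try {
        var sent = 0L
        records.foreach { r =>
          sink.send(keyOf(r), encode(r))
          sent += 1
        }
        sink.flush() // do not end the task with records still buffered
        sent
      } finally sink.close()
    }
}
```

In a job this would be called from inside `rdd.foreachPartition`, so the sink is constructed on the executor rather than serialized from the driver. Flushing before the task ends matters because a producer that buffers and sends asynchronously could otherwise drop whatever is still in its buffer.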
I have a Spark application that has to process multiple queries in parallel using a single Kafka topic as the source.
The behavior I noticed is that each query has its own consumer (in its own consumer group), so the same data is streamed to the application multiple times (please correct me if I'm wrong). That seems very inefficient; instead, I would like to have a single stream of data that Spark then processes in parallel.
What would be the recommended way to improve performance in the scenario above? Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka?
Any thoughts are welcome,
Thank you.
The behavior I noticed is that each query has its own consumer (in its own consumer group), so the same data is streamed to the application multiple times (please correct me if I'm wrong). That seems very inefficient; instead, I would like to have a single stream of data that Spark then processes in parallel.
tl;dr Not possible in the current design.
A single streaming query "starts" from a sink, and a streaming query can have only one sink (I'm repeating this to myself to remember it better, as it has caught me out several times with Spark Structured Streaming, Kafka Streams and, recently, ksqlDB).
Once you have a sink (output), the streaming query can be started (on its own daemon thread).
For exactly the reason you mentioned (queries must not split the data between them, which with the Kafka Consumer API means their group.id values have to differ), every streaming query creates a unique group ID (cf. this code and the comment in 3.3.0), so the same records can be transformed by different streaming queries:
// Each running query should use its own group id. Otherwise, the query may be only assigned
// partial data since Kafka will assign partitions to multiple consumers having the same group
// id. Hence, we should generate a unique id for each query.
val uniqueGroupId = KafkaSourceProvider.batchUniqueGroupId(sourceOptions)
And that makes sense IMHO.
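To make the comment above concrete, here is a toy model of how partitions are shared out inside a consumer group. It is only a sketch: `GroupAssignment` is a hypothetical name, and plain round-robin stands in for Kafka's real partition assignors. The point it illustrates is that members of the same group split the partitions between them, while a group with a single member receives all of them.

```scala
object GroupAssignment {
  /** Toy round-robin assignment within ONE consumer group:
    * every partition is handed to exactly one member of the group. */
  def assign(partitions: Seq[Int], members: Seq[String]): Map[String, Seq[Int]] =
    members.zipWithIndex.map { case (member, i) =>
      // member i takes every partition whose position is congruent to i
      member -> partitions.zipWithIndex.collect {
        case (p, j) if j % members.size == i => p
      }
    }.toMap
}
```

With partitions 0 to 3, two queries sharing one group ID would each be assigned only half of the topic (0 and 2 for the first, 1 and 3 for the second), whereas a query that is alone in its group is assigned all four. That is the "partial data" the Spark comment warns about.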
Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka?
Guess so.
You can separate your source data frame into different stages, yes.
val df = spark.readStream.format("kafka") ...
val strDf = df.select($"value".cast("string")) ... // needs import spark.implicits._
val df1 = strDf.filter(...) // in "parallel"
val df2 = strDf.filter(...) // in "parallel"
Within a single streaming query, only the first line should create Kafka consumer instance(s); the other stages do not, as they depend on the consumer records from the first stage.
I have a DataFrame that I want to output to Kafka. This can be done manually, with a foreach that uses a Kafka producer, or with the Kafka sink (if I start using Spark Structured Streaming).
I'd like to achieve exactly-once semantics in this whole process, so I want to be sure that I'll never have the same message committed twice.
If I use a Kafka producer, I can enable idempotency through Kafka properties. From what I've seen, this is implemented using sequence numbers and producer IDs, but I believe that with stage/task failures the Spark retry mechanism might still create duplicates in Kafka. For example, if a worker node fails, will the entire stage be retried with an entirely new producer pushing the messages, causing duplicates?
In the fault-tolerance table for the Kafka sink here, I can see that:
Kafka Sink supports at-least-once semantic, so the same output can be sinked more than once.
Is it possible to achieve exactly-once semantics with Spark + Kafka producers or the Kafka sink?
If it is possible, how?
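The Spark Kafka sink does not give you exactly-once semantics; it only guarantees at-least-once. Kafka's idempotent producer removes duplicates caused by retries within a single producer instance, but a retried Spark task starts a new producer, so duplicates are still possible. What you can do is make duplicates harmless: if your data has a unique key and is stored in a database, filesystem, etc., you can deduplicate on that key.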
For example, if you sink your data into HBase and use each message's unique key as the HBase row key, a message that arrives again with the same key simply overwrites the earlier copy.
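The overwrite-by-key idea can be sketched with an in-memory map standing in for the HBase table. `Message` and `IdempotentSink` are hypothetical names used only for illustration, and the map plays the role of a store addressed by row key. Delivering the same batch twice, as an at-least-once sink may do, leaves the store unchanged.

```scala
import scala.collection.mutable

final case class Message(key: String, value: String)

object IdempotentSink {
  /** Upsert by key: writing the same message again overwrites the earlier
    * copy instead of adding a second one. */
  def write(store: mutable.Map[String, String], batch: Seq[Message]): Unit =
    batch.foreach(m => store.update(m.key, m.value))
}
```

Because the write is an upsert, replaying a batch after a task retry is harmless: the end state depends only on the set of keys and their latest values, not on how many times each message was delivered.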
I hope this article will be helpful:
https://www.confluent.io/blog/apache-kafka-to-amazon-s3-exactly-once/
How do I compare a received record with the previous record of the same key in Spark Structured Streaming? Can this be done using groupByKey and mapGroupsWithState?
// Sample code from Spark: The Definitive Guide
.groupByKey(_.user)
.mapGroupsWithState(GroupStateTimeout.NoTimeout)(updateAcrossEvents)
One more question arises when we perform the operations above:
I don't think the sequence of records will be maintained. As records are received they are partitioned and stored across worker nodes, and when we apply groupByKey a shuffle happens that brings all records with the same key to the same worker node, but it doesn't preserve their order.
You can use mapGroupsWithState for this. You will have to save the previous record in the group state and compare it with the incoming record.
What do you use as your source? If the source is Kafka, you will have to partition the Kafka topic by the key you are using, because Kafka only preserves ordering within a partition.
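A minimal sketch of the state-update logic described above, written as a plain function so it can be read without a Spark cluster. `Event`, `Change` and `CompareWithPrevious` are hypothetical names, and the `timestamp` field is an assumption: the sketch sorts each batch by it because arrival order after the shuffle is not guaranteed. It does not handle records that arrive in a later batch with an older timestamp than the saved state.

```scala
final case class Event(user: String, timestamp: Long, value: Double)
final case class Change(user: String, previous: Double, current: Double)

object CompareWithPrevious {
  /** previous: the record saved in the group state (None the first time a key is seen).
    * incoming: the new records for this key in the current micro-batch.
    * Returns the detected changes and the record to save as the new state. */
  def update(previous: Option[Event],
             incoming: Iterator[Event]): (List[Change], Option[Event]) = {
    // Restore event order inside the batch before comparing neighbours.
    val ordered = incoming.toList.sortBy(_.timestamp)
    ordered.foldLeft((List.empty[Change], previous)) {
      case ((changes, Some(prev)), e) if e.value != prev.value =>
        (changes :+ Change(e.user, prev.value, e.value), Some(e))
      case ((changes, _), e) =>
        (changes, Some(e)) // first record for the key, or value unchanged
    }
  }
}
```

Inside the function passed to `mapGroupsWithState`, the first argument would come from `state.getOption` and the returned record would be stored with `state.update`, so the next micro-batch for the same key starts from it.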
Scenario: I have one topic with two partitions that hold different data set collections, say A and B. I am aware that a DStream can consume messages at the partition level and at the topic level.
Query: Should we use two different streaming contexts, one for each partition, or a single streaming context for the entire topic and filter the partition-level data later? I am concerned about performance as the number of streaming contexts increases.
Quoting from the documentation:
Simplified Parallelism: No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune.
Therefore, if you are using the direct-stream-based Spark Streaming consumer, it should handle the parallelism for you.
Currently, I am working on a use case that requires reading JSON messages from Kafka and processing them in Spark via Spark Streaming. We are expecting around 35 million records per day. With this kind of load, is it preferable to move the parsing logic (and some filtering logic based on JValue) into the Kafka consumer using a custom Kafka deserializer (implementing the org.apache.kafka.common.serialization.Deserializer interface)? Will this have any performance overhead?
Thank you.