How to split a Spark DataFrame while sending to Kafka? - apache-spark

I am using the following statement to write my DataFrame to Kafka:
dataFrame.write.format("kafka")
.options(options).save()
Unfortunately, the above code writes the huge DataFrame to Kafka as a single message, which is causing multiple issues. I want the DataFrame to be sent to Kafka split into multiple messages, but I also don't want to send a message per record. I want to send chunks that my Kafka server can accept without changing any configs on the server side. Please don't suggest splitting by row number, which I can do myself. Is there any built-in option in Spark to divide the data while sending it to Kafka?

Related

Is it possible to have a single kafka stream for multiple queries in structured streaming?

I have a spark application that has to process multiple queries in parallel using a single Kafka topic as the source.
The behavior I noticed is that each query has its own consumer (in its own consumer group), causing the same data to be streamed to the application multiple times (please correct me if I'm wrong), which seems very inefficient. Instead, I would like to have a single stream of data that would then be processed in parallel by Spark.
What would be the recommended way to improve performance in the scenario above? Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka?
Any thoughts are welcome,
Thank you.
tl;dr Not possible in the current design.
A single streaming query "starts" from a sink. There can be only one sink per streaming query (I keep repeating this to myself to remember it better, as it has caught me out multiple times with Spark Structured Streaming, Kafka Streams, and recently ksqlDB).
Once you have a sink (output), the streaming query can be started (on its own daemon thread).
For exactly the reason you mentioned (in order not to share data, the Kafka Consumer API requires group.id to be different), every streaming query creates a unique group ID (cf. this code and the comment in 3.3.0), so the same records can be transformed by different streaming queries:
// Each running query should use its own group id. Otherwise, the query may be only assigned
// partial data since Kafka will assign partitions to multiple consumers having the same group
// id. Hence, we should generate a unique id for each query.
val uniqueGroupId = KafkaSourceProvider.batchUniqueGroupId(sourceOptions)
And that makes sense IMHO.
Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka?
Guess so.
You can separate your source DataFrame into different stages, yes.
val df = spark.readStream.format("kafka") ...
val strDf = df.selectExpr("CAST(value AS STRING) AS value") ...
val df1 = strDf.filter(...) // in "parallel"
val df2 = strDf.filter(...) // in "parallel"
Only the first line creates Kafka consumer instance(s); the other stages don't, as they depend on the consumer records from the first stage.
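To actually run those two branches, each one needs its own sink and its own start() call, and hence (as described above) its own consumer group. A minimal sketch, using console sinks purely as placeholders:
// Placeholder sinks; in practice each query would have its own real sink
// and checkpoint location.
val query1 = df1.writeStream.format("console").start()
val query2 = df2.writeStream.format("console").start()

// Block until one of the queries terminates.
spark.streams.awaitAnyTermination()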

Real time metrics in spark structured streaming

I would like to use an external metrics system to monitor stream progress in Spark. For this I need to send notifications with metrics as soon as possible (the number of read, transformed, and written records).
StreamExecution uses ProgressReporter to send QueryProgressEvents with statistics (numInputRows, processedRowsPerSecond, etc.) to StreamingQueryListener. The problem is that this happens only when all data in the batch has been processed. However, I would like to get a notification with the number of input rows as soon as they are read from the source (before the transformation and write happen), and then the number of written records when the data is sent to the sink.
Is there a way to get such kind of metrics per batch in structured streaming in real time?
Metrics for Structured Streaming are not currently implemented out of the box anywhere besides the Databricks platform. The only way to get them with open-source Spark is to extend the StreamingQueryListener class and write your own.
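A minimal sketch of that approach, registered on the active SparkSession; the println is only a placeholder for whatever metrics sink you use:
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

class MetricsListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    // Forward per-batch statistics to your external metrics system here.
    println(s"batch=${p.batchId} inputRows=${p.numInputRows} rowsPerSec=${p.processedRowsPerSecond}")
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
}

// Register it on the SparkSession before starting the query:
// spark.streams.addListener(new MetricsListener)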

Use Spark to Write Kafka Messages Directly to a File

For a class project, I need a Spark Java program to listen as a Kafka consumer and write all of a Kafka topic's received messages to a file (e.g. "/user/zaydh/my_text_file.txt").
I am able to receive the messages as a JavaPairReceiverInputDStream object; I can also convert it to a JavaDStream<String> (this is from the Spark Kafka example).
However, I could not find good Java syntax to write this data to what is essentially a single log file. I tried using foreachRDD on the JavaDStream object, but I could not find a clean, parallel-safe way to sink it to a single log file.
I understand this approach is non-traditional or non-ideal, but it is a requirement. Any guidance is much appreciated.
When you think of a stream, you have to think of it as something that won't stop giving out data.
Hence, if Spark Streaming had a way to save all the incoming RDDs to a single file, it would keep growing to a huge size (and the stream isn't supposed to stop, remember? :))
But in this case you can make use of the saveAsTextFile utility of an RDD,
which will create many files in your output directory, depending on the batch interval specified while creating the streaming context: JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1))
You can then merge these file parts into one using something like how-to-merge-all-text-files-in-a-directory-into-one
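A minimal sketch of that idea (in Scala; the Java equivalent goes through JavaDStream.foreachRDD and JavaRDD.saveAsTextFile the same way), assuming messages is the DStream of Kafka message values and the output path is just an example:
messages.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    // One directory of part files per batch interval; merge them afterwards
    // as described in the link above.
    rdd.saveAsTextFile(s"/user/zaydh/my_text_file-${time.milliseconds}")
  }
}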

Spark Streaming keeps processing even when there is no data in Kafka

I'm using Spark Streaming to consume data from Kafka with a code snippet like:
stream.foreachRDD { rdd => rdd.foreachPartition { ... } }
I'm using foreachPartition because I need to create a connection to HBase, and I don't want to open/close a connection for each record.
But I found that when there is no data in Kafka, Spark Streaming still executes foreachRDD and foreachPartition.
This causes many HBase connections to be created even though no data was consumed. I really don't like this; how can I make Spark stop doing this when no data was consumed from Kafka?
Simply check whether there are items in the RDD. So your code could be:
stream.foreachRDD { rdd => if (!rdd.isEmpty()) rdd.foreachPartition { ... } }
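Putting the empty check together with the per-partition HBase connection, a minimal sketch (assuming stream is your Kafka DStream; the actual HBase writes are left out):
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.ConnectionFactory

stream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {  // skip empty micro-batches: no HBase connections opened
    rdd.foreachPartition { records =>
      val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
      try {
        records.foreach { record =>
          // write the record to HBase here
        }
      } finally {
        connection.close()  // one connection per partition, always closed
      }
    }
  }
}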

Writing same Spark Streaming Output to different destinations

I have a DStream and I want to write each element to a socket and to a Cassandra DB.
I found a solution that uses Apache Kafka and two consumers: one writes to the database and the other writes to the socket.
Is there a way to do that without using this workaround?
I use Java so please post code on this language.
You just need to apply two different actions to the RDD within the DStream: one to save to Cassandra and one to send the data to whatever other output.
Also, cache the RDD before these actions to improve performance.
(in pseudo code as I don't do Java)
dstream.foreachRDD { rdd =>
  rdd.cache()
  rdd.saveToCassandra(...)
  rdd.foreach(...) // or rdd.foreachPartition(...)
}
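A slightly more concrete sketch of the same idea (still Scala, mirroring the pseudo code; the element type, keyspace/table, and socket endpoint are made-up placeholders, and saveToCassandra comes from the spark-cassandra-connector):
import com.datastax.spark.connector._   // adds saveToCassandra to RDDs
import java.io.PrintWriter
import java.net.Socket

// dstream is assumed to be a DStream[Event] whose fields match the Cassandra columns
case class Event(id: String, value: String)

dstream.foreachRDD { rdd =>
  rdd.cache()                                    // reused by both actions below

  // Action 1: save to Cassandra
  rdd.saveToCassandra("my_keyspace", "my_table")

  // Action 2: send each partition over a socket, one connection per partition
  rdd.foreachPartition { events =>
    val socket = new Socket("localhost", 9999)
    val out = new PrintWriter(socket.getOutputStream, true)
    events.foreach(event => out.println(event))
    out.close()
    socket.close()
  }

  rdd.unpersist()
}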
