Spark structured streaming multiple applications joined by Kafka - apache-spark

I have two Spark structured streaming applications. The first makes some windowing aggregations and outputs to Kafka in "update" mode. The second reads this Kafka stream and does some further processing and again outputs in "update" mode.
Will the second application, reading the Kafka stream, see the most up-to-date data?

Related

How to limit the number of batches to run in Spark Structured Streaming foreachBatch?

I'm reading data from Kafka in batch fashion using readStream, then doing some transformations and writing the data using foreachBatch and writeStream.
I have a use case where I need to hold the job for some time, so I want to limit the job to x batches. Is this possible in Spark Structured Streaming? Specifically, Spark 2.4.8.
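One possible way to cap the number of micro-batches, sketched in Python under some assumptions: the function passed to foreachBatch runs in the driver process, so it can share a counter with the main thread, which stops the query once the limit is reached. BatchCounter and run_until_limit are hypothetical helpers, not part of Spark; only isActive, awaitTermination(timeout) and stop() are StreamingQuery calls. The limit is approximate, since a further batch may already be in flight when stop() is called.

```python
import threading


class BatchCounter:
    """Counts micro-batches handled by a foreachBatch function (hypothetical helper)."""

    def __init__(self, max_batches):
        self.max_batches = max_batches
        self._count = 0
        self._lock = threading.Lock()

    def wrap(self, process_batch):
        """Return a foreachBatch-compatible function that counts each finished batch."""
        def counted(batch_df, batch_id):
            process_batch(batch_df, batch_id)
            with self._lock:
                self._count += 1
        return counted

    @property
    def count(self):
        with self._lock:
            return self._count

    def limit_reached(self):
        return self.count >= self.max_batches


def run_until_limit(query, counter, poll_seconds=1.0):
    """Block until the limit is reached or the query ends, then stop the query."""
    while query.isActive and not counter.limit_reached():
        # Returns after poll_seconds if the query is still running.
        query.awaitTermination(poll_seconds)
    if query.isActive:
        query.stop()


# Intended usage (assumes `df` is a streaming DataFrame and `my_fn` your batch logic):
#   counter = BatchCounter(max_batches=10)
#   query = df.writeStream.foreachBatch(counter.wrap(my_fn)).start()
#   run_until_limit(query, counter)
```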

How to do Spark streaming on Hive tables using SQL in non-real time?

We have some data (millions) in Hive tables, which arrives every day. The next day, once the overnight ingestion is complete, different applications query us for data (using SQL).
We take this SQL and make a call on Spark:
spark.sqlContext.sql(statement) // hive-metastore integration is enabled
This is causing too much memory usage on the Spark driver. Can we use Spark Streaming (or Structured Streaming) to stream the results in a piped fashion, rather than collecting everything on the driver and then sending it to clients?
We don't want to send out the data as soon as it arrives (as in typical streaming apps), but want to stream the data to clients when they ask (PULL) for it.
IIUC:
Spark Streaming is mainly designed to process streaming data by converting it into batches that span milliseconds to seconds.
Have a look at streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) => ... }, which gives you a good way to write the processed streaming output to a sink one micro-batch at a time.
Nevertheless, Spark Structured Streaming doesn't have a standard JDBC source defined to read from.
Consider storing the underlying Hive files in a compressed, structured format and transferring them directly rather than selecting through spark.sql, if every client needs the same or similar data. Alternatively, partition them based on the WHERE condition of the spark.sql query and transfer only the needed files.
Source:
Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees.
ForeachBatch:
foreachBatch(...) allows you to specify a function that is executed on the output data of every micro-batch of a streaming query. Since Spark 2.4, this is supported in Scala, Java and Python. It takes two parameters: a DataFrame or Dataset that has the output data of a micro-batch and the unique ID of the micro-batch.
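As a minimal sketch of the two-parameter foreachBatch function described above, in Python: each micro-batch is written to its own directory derived from the batch ID, so a re-run of the same batch overwrites its earlier output instead of duplicating it. The helper names (batch_output_path, write_batch, start_query) and the /tmp paths are illustrative assumptions, not anything from the question.

```python
def batch_output_path(base_dir, batch_id):
    """Build a per-batch output directory from the micro-batch ID (hypothetical layout)."""
    return f"{base_dir}/batch_id={batch_id:010d}"


def write_batch(batch_df, batch_id, base_dir="/tmp/stream-out"):
    """foreachBatch function: receives the micro-batch DataFrame and its unique ID."""
    # Inside foreachBatch, batch_df is a regular (non-streaming) DataFrame,
    # so the normal batch writer API can be used.
    batch_df.write.mode("overwrite").parquet(batch_output_path(base_dir, batch_id))


def start_query(streaming_df, checkpoint_dir="/tmp/stream-checkpoint"):
    """Attach write_batch to a streaming DataFrame and start the query."""
    return (
        streaming_df.writeStream
        .foreachBatch(write_batch)
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

Keying the output path on the batch ID is one common way to keep the write idempotent, since foreachBatch may run the same batch again after a failure.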

Why is Spark Structured Streaming ideal for real-time operations?

I want to build a real-time application, but I don't know whether I should use Spark Streaming or Spark Structured Streaming.
I read online that Structured Streaming is ideal for real-time applications, but it is not clear why.
Can someone explain it?
Spark Streaming works on something we call a micro batch. ... Each batch represents an RDD. Structured Streaming works on the same architecture of polling the data after some duration, based on your trigger interval, but it has some distinctions from Spark Streaming that make it more inclined towards real streaming.
All developers need to know is that in Spark Streaming you will use RDDs, while in Spark Structured Streaming you get DataFrames and Datasets.
If you want very low-level (i.e. per-record) operations, go for RDDs (i.e. Spark Streaming); but if your application can be built on DataFrames and query them like SQL in real time, go for DataFrames (i.e. Spark Structured Streaming).
Ultimately, RDDs can be converted to DataFrames and vice versa.

Does Spark Structured Streaming maintain the order of Kafka messages?

I have a Spark Structured Streaming application that consumes messages from multiple Kafka topics and writes the results to another Kafka topic. To maintain the integrity of the data, it's imperative that the order of messages in source partitions is maintained. So if message A precedes message B in a partition, processed(A) should be written to the output topic before processed(B) (processed A and B will go to the same partition too as the same hash string is used).
Does Spark Structured Streaming guarantee this?

Connect Spark Streaming to Spark Batch automatically

I'm receiving streaming data from Kafka, which I'm reading as a DataFrame with Spark Structured Streaming.
The problem is that I need to perform multiple aggregations on the same column and non-time-based window operations on those results.
AFAIK that's still not possible in Spark Structured Streaming, so I want to start a Spark batch job triggered after some time.
How could I achieve that? Is there any way to start a Python script, as with spark-submit?
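One way this is sometimes done, offered as a minimal sketch rather than a recommendation: a plain Python script can launch the batch job by calling the spark-submit command through subprocess. It assumes spark-submit is on the PATH; the function names, the script name and the local[*] master are illustrative placeholders.

```python
import subprocess


def build_submit_command(script_path, app_args=(), master="local[*]",
                         submit_bin="spark-submit"):
    """Assemble the spark-submit command line as a list of arguments."""
    return [submit_bin, "--master", master, script_path, *app_args]


def run_batch_job(script_path, app_args=(), **kwargs):
    """Run the batch job and return its exit code (0 usually means success)."""
    cmd = build_submit_command(script_path, app_args, **kwargs)
    completed = subprocess.run(cmd, check=False)
    return completed.returncode


# Intended usage (batch_job.py is a hypothetical script name):
#   exit_code = run_batch_job("batch_job.py", ["2020-01-01"])
```

Calling run_batch_job from a scheduler, or after the streaming query has been stopped, would give the "triggered after some time" behaviour; which of those fits depends on the deployment.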
