What is the advantage of processing a data stream in batches over processing each data element as it arrives? Why is it not possible to process each data item as soon as it arrives?
DStream uses this concept in Spark Streaming.
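For illustration, a minimal Spark Streaming sketch (the host and port are placeholders) showing that incoming records are grouped into one RDD per batch interval rather than being handled one record at a time:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("batch-interval-demo").setMaster("local[2]")
    // Incoming records are buffered and one RDD is produced per 5-second interval.
    val ssc = new StreamingContext(conf, Seconds(5))

    // A DStream is a sequence of RDDs, one per batch interval.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.foreachRDD { (rdd, time) =>
      // Runs once per batch, over the whole micro-batch at once.
      println(s"Batch at $time contains ${rdd.count()} records")
    }

    ssc.start()
    ssc.awaitTermination()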
Related
We have data (millions of rows) in Hive tables that arrives every day. The next day, once the overnight ingestion is complete, different applications query us for the data (using SQL).
We take this SQL and run it on Spark:
spark.sqlContext.sql(statement) // hive-metastore integration is enabled
This is causing too much memory usage on the Spark driver. Can we use Spark Streaming (or Structured Streaming) to stream the results in a piped fashion, rather than collecting everything on the driver and then sending it to the clients?
We don't want to push the data out as soon as it arrives (as in typical streaming apps); we want to stream data to the clients when they ask (pull) for it.
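For context, a minimal sketch of the pattern described above (the table name and query are made-up placeholders); the collect() at the end is where the driver memory pressure comes from:

    import org.apache.spark.sql.SparkSession

    // hive-metastore integration is enabled
    val spark = SparkSession.builder()
      .appName("sql-service")
      .enableHiveSupport()
      .getOrCreate()

    // `statement` is whatever SQL a client application sends us
    val statement = "SELECT * FROM warehouse.events WHERE dt = '2024-01-01'"
    val df = spark.sqlContext.sql(statement)

    // Collecting millions of rows pulls everything onto the driver
    // before it can be sent to the client.
    val rows = df.collect()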
IIUC..
Spark Streaming is mainly designed to process streaming data by converting it into micro-batches with intervals of milliseconds to seconds.
You can look at streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) => ... }, which gives you a very good way for Spark to write the processed streaming output to a sink in a micro-batch manner.
Nevertheless, Spark Structured Streaming doesn't have a standard JDBC source defined to read from.
Work out an option to store the underlying Hive files directly in a compressed, structured format and transfer them as-is rather than selecting through spark.sql, if every client needs the same or similar data; or partition them based on the WHERE condition of the spark.sql query and transfer only the needed files.
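For example, a rough sketch of that partitioning idea (the table name, partition column and output path are assumptions), so that a client's WHERE condition maps to a subset of files that can be shipped as-is:

    // Write the Hive table out partitioned by the column clients usually filter on,
    // in a compressed columnar format; a query like WHERE dt = '2024-01-01' then
    // corresponds to the single directory dt=2024-01-01, which can be transferred directly.
    spark.table("warehouse.events")
      .write
      .mode("overwrite")
      .partitionBy("dt")
      .option("compression", "snappy")
      .parquet("/export/events")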
Source:
Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees.
ForeachBatch:
foreachBatch(...) allows you to specify a function that is executed on the output data of every micro-batch of a streaming query. Since Spark 2.4, this is supported in Scala, Java and Python. It takes two parameters: a DataFrame or Dataset that has the output data of a micro-batch and the unique ID of the micro-batch.
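A minimal sketch of that API (the source path, schema and output locations are assumptions for illustration):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder().appName("foreachBatch-demo").getOrCreate()

    // Example file-based streaming source; path and schema are placeholders.
    val streamingDF = spark.readStream
      .format("json")
      .schema("id LONG, value DOUBLE")
      .load("/data/incoming")

    streamingDF.writeStream
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        // batchDF holds the output of one micro-batch; write it with any batch sink.
        batchDF.write.mode("append").parquet(s"/data/out/batch=$batchId")
      }
      .option("checkpointLocation", "/chk/foreachBatch-demo")
      .start()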
How will the delay work in Structured Streaming jobs? Will it create a delay in processing the number of files coming in on the stream?
I want to use Spark Structured Streaming to filter an Event Hub stream, separating objects of different types (JSON objects with different schemas) and writing each object type to its own file store.
Am I right to assume that the only supported way to do this is to create a ForeachWriter, cache the micro-batch, and do the filtering and writing in the process method?
I don't want to create one read stream per filtered write stream, as that would severely hamper the egress capacity from the Event Hub.
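A minimal sketch of the cache-then-filter idea, here using foreachBatch rather than a ForeachWriter (the parsed `type` column and the output paths are assumptions about the payload):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    // parsedStream is assumed to be the Event Hub stream with the JSON body already
    // parsed into a `type` column plus the type-specific fields.
    parsedStream.writeStream
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        batchDF.persist() // read the micro-batch once, write it several times
        batchDF.filter(col("type") === "orders")
          .write.mode("append").json("/sink/orders")
        batchDF.filter(col("type") === "customers")
          .write.mode("append").json("/sink/customers")
        batchDF.unpersist()
      }
      .option("checkpointLocation", "/chk/split-by-type")
      .start()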
Current setting: a Spark Streaming job processes a Kafka topic of time-series data. New data from different sensors comes in about every second, and the batch interval is also 1 second. By means of updateStateByKey(), stateful data is computed as a new stream. As soon as this stateful data crosses a threshold, an event is generated on a Kafka topic. When the value later drops below the threshold, another event is fired on that topic.
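For reference, a rough sketch of that setup (the value types, the averaging logic and the threshold are simplified placeholders; the real job also writes the events back to Kafka):

    import org.apache.spark.streaming.dstream.DStream

    val threshold = 100.0

    // values: all readings for a sensor that arrived in the current 1-second batch;
    // state:  the previously computed value for that sensor, if any.
    def updateState(values: Seq[Double], state: Option[Double]): Option[Double] = {
      if (values.isEmpty) state
      else Some(values.sum / values.size)
    }

    // readings: DStream[(sensorId, value)] consumed from the Kafka topic.
    // (updateStateByKey also requires ssc.checkpoint(...) to be set.)
    val stateful: DStream[(String, Double)] = readings.updateStateByKey(updateState _)

    // Simplified: a real job would remember whether the previous value was already
    // above the threshold, so that an event fires only on an actual crossing.
    stateful.filter { case (_, v) => v > threshold }
      .foreachRDD(rdd => rdd.foreach { case (sensor, v) =>
        println(s"ALERT: $sensor crossed $threshold with $v")
      })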
So far, so good.
Problem: when applying a new algorithm to the data by re-consuming the Kafka topic, I would like this to go fast. But that means every batch contains (hundreds of) thousands of messages. Moving these into updateStateByKey() in one batch results in one computed value for that key on the resulting stream.
Of course that's unacceptable, as loads of data points are reduced to a single one. Alarm events that would be generated on a real-time stream will not appear on the recomputed stream. So comparing algorithms this way is totally useless.
Question: how can I avoid this, preferably without switching frameworks? It seems to me I'm looking for a true streaming (one event at a time) framework. On the other hand, Spark Streaming is new to me, so I'm definitely missing a lot there.
In Spark 1.6, a new API for interacting with state, mapWithState, was introduced. I believe it will solve your problem.
Have a look at it here.
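A rough sketch of what that could look like (types and the threshold logic are placeholder assumptions). Unlike updateStateByKey, the mapping function is invoked once per record rather than once per key per batch, so replaying a dense Kafka backlog still yields one output per input event:

    import org.apache.spark.streaming.{State, StateSpec}

    val threshold = 100.0

    // Called once per incoming (sensorId, value) record, not once per key per batch.
    val mappingFunc = (sensorId: String, value: Option[Double], state: State[Double]) => {
      val previous = state.getOption().getOrElse(0.0)
      val current  = value.getOrElse(previous)
      state.update(current)
      // One result per event, e.g. whether this single event crossed the threshold.
      (sensorId, previous <= threshold && current > threshold)
    }

    // readings: DStream[(sensorId, value)], as in the original job.
    val crossings = readings.mapWithState(StateSpec.function(mappingFunc))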
Spark Streaming is basically micro-batching. It means it runs periodically to handle the batch of data that arrived during that interval.
Some other stream-processing engines, like Storm, can be triggered by an event. It means that when some data arrives (an event occurs), it is handled immediately.
I wonder why Spark Streaming cannot do the same. For example, when some data arrives, Spark Streaming could turn it into an RDD immediately and run the computation on it. Is that hard? Does it not make sense? Why does it have to wait for a while (the interval) and then handle the data together?