Spark structured streaming window when no stream - apache-spark

I want to log to a database the number of records read from an incoming Spark Structured Streaming stream. I'm using foreachBatch to transform each incoming batch and write it to the desired location. I want to log 0 records read if there are no records in a particular hour, but foreachBatch does not execute when there is no incoming data. Can anyone help me with this? My code is below:
val incomingStream = spark.readStream.format("eventhubs")
  .options(customEventhubParameters.toMap).load()
val query = incomingStream.writeStream.foreachBatch {
  (batchDF: DataFrame, batchId: Long) =>
    writeStreamToDataLake(batchDF, batchId, partitionColumn,
      fileLocation, errorFilePath, eventHubName, configMeta)
}.option("checkpointLocation", fileLocation + checkpointFolder + "/" + eventHubName)
  .trigger(Trigger.ProcessingTime(triggerTime.toLong))
  .start().awaitTermination()

This is how it works: foreachBatch, and even extensions of StreamingQueryListener, are invoked only when there is something to process and the status of the stream therefore changes.
There is probably another way, but I would say "think outside the box": pre-populate the database with a count of 0 for each time frame, then aggregate (sum) when querying, and you will get the correct answer.
https://medium.com/@johankok/structured-streaming-in-a-flash-576cdb17bbee can give some insight, as can Spark: The Definitive Guide.
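The pre-populate-and-aggregate idea can be sketched without Spark at all. The snippet below is a minimal illustration in plain Python, assuming the batch log is available as (timestamp, record count) pairs; the function name hourly_counts_with_zeros and the sample data are made up for the example and are not part of the original job.

```python
from datetime import datetime, timedelta


def hourly_counts_with_zeros(logged, start, end):
    """Return one (hour, count) pair for every hour in [start, end).

    `logged` holds (timestamp, records_read) pairs, one per executed batch.
    Hours in which no batch ran are reported with a count of 0.
    """
    # Sum the logged counts per hour bucket.
    totals = {}
    for ts, records_read in logged:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        totals[hour] = totals.get(hour, 0) + records_read

    # Walk every hour of the range so that empty hours still show up.
    result = []
    hour = start.replace(minute=0, second=0, microsecond=0)
    while hour < end:
        result.append((hour, totals.get(hour, 0)))
        hour += timedelta(hours=1)
    return result


# Hypothetical batch log: two batches at 09:xx, none at 10:xx, one at 11:xx.
logged = [
    (datetime(2020, 1, 1, 9, 5), 120),
    (datetime(2020, 1, 1, 9, 40), 30),
    (datetime(2020, 1, 1, 11, 15), 7),
]
report = hourly_counts_with_zeros(
    logged, datetime(2020, 1, 1, 9), datetime(2020, 1, 1, 12)
)
for hour, count in report:
    print(hour.isoformat(), count)
```

The same effect can be had in the database itself by inserting a 0 row per hour ahead of time and summing per hour at query time; the streaming job then never needs to run for an empty hour.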

Related

Is it possible to have a single kafka stream for multiple queries in structured streaming?

I have a spark application that has to process multiple queries in parallel using a single Kafka topic as the source.
The behavior I noticed is that each query has its own consumer (which is in its own consumer group), causing the same data to be streamed to the application multiple times (please correct me if I'm wrong). This seems very inefficient; instead, I would like to have a single stream of data that would then be processed in parallel by Spark.
What would be the recommended way to improve performance in the scenario above? Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka?
Any thoughts are welcome,
Thank you.
"The behavior I noticed is that each query has its own consumer (which is in its own consumer group), causing the same data to be streamed to the application multiple times (please correct me if I'm wrong). This seems very inefficient; instead, I would like to have a single stream of data that would then be processed in parallel by Spark."
tl;dr Not possible in the current design.
A single streaming query "starts" from a sink. There can only be one sink in a streaming query (I'm repeating this to myself to remember it better, as I seem to have been caught out multiple times with Spark Structured Streaming, Kafka Streams and recently ksqlDB).
Once you have a sink (output), the streaming query can be started (on its own daemon thread).
For exactly the reasons you mentioned (so that queries do not split the data between them, which is why the Kafka Consumer API requires group.id to be different), every streaming query creates a unique group ID (cf. this code and the comment in 3.3.0) so the same records can be transformed by different streaming queries:
// Each running query should use its own group id. Otherwise, the query may be only assigned
// partial data since Kafka will assign partitions to multiple consumers having the same group
// id. Hence, we should generate a unique id for each query.
val uniqueGroupId = KafkaSourceProvider.batchUniqueGroupId(sourceOptions)
And that makes sense IMHO.
"Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka?"
Guess so.
You can separate your source data frame into different stages, yes.
val df = spark.readStream.format("kafka") ...
val strDf = df.select($"value".cast("string")) ...
val df1 = strDf.filter(...) // in "parallel"
val df2 = strDf.filter(...) // in "parallel"
Only the first line should be creating Kafka consumer instance(s), not the other stages, as they depend on the consumer records from the first stage.

In Pyspark Structured Streaming, how can I discard already generated output before writing to Kafka?

I am trying to do Structured Streaming (Spark 2.4.0) on Kafka source data where I am reading latest data and performing aggregations on a 10 minute window. I am using "update" mode while writing the data.
For example, the data schema is as below:
tx_id, cust_id, product, timestamp
My aim is to find customers who have bought more than 3 products in last 10 minutes. Let's say prod is the dataframe which is read from kafka, then windowed_df is defined as:
windowed_df_1 = prod.groupBy(window("timestamp", "10 minutes"), "cust_id").count()
windowed_df = windowed_df_1.filter(col("count")>=3)
Then I am joining this with a master dataframe from hive table "customer_master" to get cust_name:
final_df = windowed_df.join(customer_master, "cust_id")
And finally, write this dataframe to Kafka sink (or console for simplicity)
query = final_df.writeStream.outputMode("update").format("console").option("truncate",False).trigger(processingTime='2 minutes').start()
query.awaitTermination()
Now, when this code runs every 2 minutes, in the subsequent runs, I want to discard all those customers who were already part of my output. I don't want them in my output even if they buy any product again.
Can I write the stream output somewhere temporarily (maybe a Hive table) and do an "anti-join" on each execution?
This way I can also maintain a history in a Hive table.
I also read somewhere that we can write the output to a memory sink and then use df.write to save it in HDFS/Hive. But what if we terminate the job and re-run it? I suppose the in-memory table will be lost in that case.
Please help as I am new to Structured Streaming.
Update:
I also tried the code below to write the output to a Hive table as well as to the console (or a Kafka sink):
def write_to_hive(df, epoch_id):
    df.persist()
    df.write.format("hive").mode("append").saveAsTable("hive_tab_name")
    df.unpersist()  # release the cached micro-batch once it has been written
final_df.writeStream.outputMode("update").format("console").option("truncate", False).start()
final_df.writeStream.outputMode("update").foreachBatch(write_to_hive).start()
But this only performs the 1st action, i.e. write to Console.
If I write "foreachBatch" first, it saves to the Hive table but does not print to the console.
I want to write to 2 different sinks. Please help.

Spark Structured Streaming: join stream with data that should be read every micro batch

I have a stream from HDFS and I need to join it with my metadata, which is also in HDFS; both are Parquet.
My metadata is sometimes updated, and I need to join with the freshest, most recent version, which ideally means reading the metadata from HDFS on every micro batch of the stream.
I tried to test this, but unfortunately Spark reads the metadata once and then (supposedly) caches the files, even when I set spark.sql.parquet.cacheMetadata=false.
Is there a way to re-read the metadata on every micro batch? Is ForeachWriter not what I'm looking for?
Here are code examples:
spark.sql("SET spark.sql.streaming.schemaInference=true")
spark.sql("SET spark.sql.parquet.cacheMetadata=false")
val stream = spark.readStream.parquet("/tmp/streaming/")
val metadata = spark.read.parquet("/tmp/metadata/")
val joinedStream = stream.join(metadata, Seq("id"))
joinedStream.writeStream.option("checkpointLocation", "/tmp/streaming-test/checkpoint").format("console").start()
/tmp/metadata/ is updated by Spark in append mode.
As far as I understand, when the metadata is accessed through a JDBC source, Spark queries it on each micro batch.
As far as I found, there are two options:
1. Create a temp view and refresh it at an interval:
metadata.createOrReplaceTempView("metadata")
and trigger the refresh in a separate thread:
spark.catalog.refreshTable("metadata")
NOTE: in this case Spark re-reads the same path only; it does not work if you need to read metadata from different folders on HDFS, e.g. folders named by timestamp.
2. Restart the stream at an interval, as Tathagata Das suggested.
This approach is not suitable for me, since my metadata might be refreshed several times per hour.

Avoiding multiple streaming queries

I have a structured streaming query which sinks to Kafka. This query has a complex aggregation logic.
I would like to sink the output DF of this query to multiple Kafka topics each partitioned on a different ‘key’ column. I don't want to have multiple Kafka sinks for each of the different Kafka topics because that would mean running multiple streaming queries - one for each Kafka topic, especially since my aggregation logic is complex.
Questions:
Is there a way to output the results of a structured streaming query to multiple Kafka topics each with a different key column but without having to execute multiple streaming queries?
If not, would it be efficient to cascade the multiple queries such that the first query does the complex aggregation and writes output to Kafka and then the other queries just read the output of the first query and write their topics to Kafka thus avoiding doing the complex aggregation again?
Thanks in advance for any help.
So the answer was staring me in the face. It's documented as well; link below.
One can write to multiple Kafka topics from a single query. If your dataframe that you want to write has a column named "topic" (along with "key", and "value" columns), it will write the contents of a row to the topic in that row. This automatically works. So the only thing you need to figure out is how to generate the value of that column.
This is documented - https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#writing-data-to-kafka
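As a rough illustration of what generating that column can look like when each topic needs a different key, the sketch below fans one aggregated row out into one record per topic. It is plain Python rather than Spark code, and the topic names, column names and the to_kafka_records helper are all made up for the example; in a real query the same topic/key/value shape would be built with DataFrame column expressions.

```python
import json

# Hypothetical mapping: topic name -> column whose value becomes the Kafka key.
TOPIC_KEYS = {
    "by_customer": "cust_id",
    "by_product": "product",
}


def to_kafka_records(row):
    """Turn one aggregated row into one {topic, key, value} record per topic."""
    # The whole row is serialized once and reused as the value for every topic.
    value = json.dumps(row, sort_keys=True)
    return [
        {"topic": topic, "key": str(row[key_column]), "value": value}
        for topic, key_column in TOPIC_KEYS.items()
    ]


records = to_kafka_records({"cust_id": 42, "product": "book", "count": 3})
for record in records:
    print(record)
```

Each output record carries its own topic and key, which is the shape the Kafka sink expects when a "topic" column is present, so a single query can feed several topics.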
I am also looking for a solution to this problem, and in my case it's not necessarily a Kafka sink. I want to write some records of a dataframe to sink1 and some other records to sink2 (depending on some condition, without reading the same data twice in two streaming queries).
Currently this does not seem possible with the current implementation (the createSink() method in DataSource.scala supports only a single sink).
However, in Spark 2.4.0 there is a new API coming, foreachBatch(), which gives a handle to each micro-batch as a dataframe. That dataframe can be cached, written to different sinks or processed multiple times, and then uncached again.
Something like this:
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.cache()
  batchDF.write.format(...).save(...) // location 1
  batchDF.write.format(...).save(...) // location 2
  batchDF.unpersist() // DataFrame has no uncache(); unpersist() releases the cache
}
Right now this feature is available in the Databricks runtime:
https://docs.databricks.com/spark/latest/structured-streaming/foreach.html#reuse-existing-batch-data-sources-with-foreachbatch
EDIT 15/Nov/18 :
It is available now in Spark 2.4.0 ( https://issues.apache.org/jira/browse/SPARK-24565)
There is no way to have a single read and multiple writes in Structured Streaming out of the box. The only way is to implement a custom sink that writes to multiple topics.
Whenever you call dataset.writeStream().start(), Spark starts a new stream that reads from a source (readStream()) and writes to a sink (writeStream()).
Even if you try to cascade it, Spark will create two separate streams with one source and one sink each. In other words, it will read, process and write the data twice:
Dataset df = <aggregation>;
StreamingQuery sq1 = df.writeStream()...start();
StreamingQuery sq2 = df.writeStream()...start();
There is a way to cache the data that was read in Spark Streaming, but this option is not available for Structured Streaming yet.

Saving values from spark to Cassandra

I need to store the values from kafka->spark streaming->cassandra.
Right now I am receiving the values from kafka->spark, and I have a Spark job to save values into the Cassandra db. However, I'm facing a problem with the DStream datatype.
In the following snippet you can see how I'm trying to convert the DStream into a Python-friendly list object so that I can work with it, but it gives an error.
Input at the Kafka producer:
Byrne 24 San Diego robbyrne@email.com Rob
spark-job:
map1={'spark-kafka':1}
kafkaStream = KafkaUtils.createStream(stream, 'localhost:2181', "name", map1)
lines = kafkaStream.map(lambda x: x[1])
words = lines.flatMap(lambda line: line.split(" "))
words.pprint() # outputs -> Byrne 24 SanDiego robbyrne@email.com Rob
list = [word for word in words]
# gives an error -> TypeError: 'TransformedDStream' object is not iterable
This is how I'm saving values from spark->cassandra
rdd2 = sc.parallelize([{
    "lastname": "Byrne",
    "age": 24,
    "city": "SanDiego",
    "email": "robbyrne@email.com",
    "firstname": "Rob"}])
rdd2.saveToCassandra("keyspace2", "users")
What's the best way of converting the DStream object to a dictionary or what's the best way of doing what I'm trying to do here?
I just need the values received from kafka (in the form of DStream) to be saved in Cassandra.
Thanks and any help would be nice!
Versions:
Cassandra v2.1.12
Spark v1.4.1
Scala 2.10
Like everything 'sparky', I think a short explanation is due, since even if you are familiar with RDDs, DStreams are a higher-level concept:
A Discretized Stream (DStream) is a continuous sequence of RDDs of the same type, representing a continuous stream of data. In your case, DStreams are created from live Kafka data.
While a Spark Streaming program is running, each DStream periodically generates an RDD from live Kafka data.
Now, to iterate over the received RDDs, you need to use DStream#foreachRDD (as implied by its name, it serves a similar purpose to foreach, but this time it iterates over RDDs).
Once you have an RDD, you can invoke rdd.collect() or rdd.take() or any other standard RDD API.
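Once individual records are reachable this way, turning a line into the dictionary shape used in the question is ordinary Python. The sketch below is only an illustration: parse_user is a hypothetical helper, it contains no Spark calls, and the field order (lastname, age, city, email, firstname) is an assumption read off the sample input, with the city allowed to span several words.

```python
def parse_user(line):
    """Parse 'lastname age city... email firstname' into a dict.

    The city may contain spaces, so it is taken as everything between
    the first two tokens and the last two tokens.
    """
    tokens = line.split()
    if len(tokens) < 5:
        raise ValueError("expected at least 5 fields, got %d" % len(tokens))
    return {
        "lastname": tokens[0],
        "age": int(tokens[1]),
        "city": " ".join(tokens[2:-2]),
        "email": tokens[-2],
        "firstname": tokens[-1],
    }


user = parse_user("Byrne 24 San Diego robbyrne@email.com Rob")
print(user)
```

In the streaming job this function would be applied per record, for example with lines.map(parse_user), instead of splitting each line into separate words first.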
Now, as a closing note, to make things even more fun, Spark introduced a new receiver-less “direct” approach to ensure stronger end-to-end guarantees.
(KafkaUtils.createDirectStream which requires Spark 1.3+)
Instead of using receivers to receive data, this approach periodically queries Kafka for the latest offsets in each topic+partition, and accordingly defines the offset ranges to process in each batch. When the jobs to process the data are launched, Kafka’s simple consumer API is used to read the defined ranges of offsets from Kafka.
(which is a nice way to say you will have to "mess" with the offsets yourself)
See Direct Streams Approach for further details.
See here for a scala code example
According to the official doc of the spark-cassandra connector: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/8_streaming.md
import com.datastax.spark.connector.streaming._
val ssc = new StreamingContext(conf, Seconds(n))
val stream = ...
val wc = stream
  .map(...)
  .filter(...)
  .saveToCassandra("streaming_test", "words", SomeColumns("word", "count"))
ssc.start()
Actually, I found the answer in this tutorial http://katychuang.me/blog/2015-09-30-kafka_spark.html.
