Spark Streaming with side inputs using Java - apache-spark

I am looking for some sample Java code for Spark Streaming that uses side inputs.
The streaming input will be joined with some static input from a local or HDFS file.
I believe this is a standard use case; unfortunately, I could not find any guidance after extensive searching.
The code should show the following:
- How to create the session / context.
- How to read the streaming data (preferably from Kafka).
- How to read the static file (I need to use CSV format).
- How to join the streaming data with the static data (preferably using SQL).
I have tried searching for sample code and have also checked the Apache Spark programming guide.
However, I could not find any sample code covering a streaming + static input join.
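In the meantime, here is a minimal sketch of the kind of pipeline I have in mind. It is untested; the broker address, topic name, CSV path and join column "id" are placeholders, not taken from any official example.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
public class StreamStaticJoin {
  public static void main(String[] args) throws Exception {
    // 1. Create the session.
    SparkSession spark = SparkSession.builder()
        .appName("StreamStaticJoin")
        .getOrCreate();
    // 2. Read the streaming data from Kafka; key/value arrive as binary, so cast them.
    Dataset<Row> stream = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")   // placeholder broker
        .option("subscribe", "events")                         // placeholder topic
        .load()
        .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload");
    // 3. Read the static CSV file (local or HDFS path).
    Dataset<Row> lookup = spark.read()
        .option("header", "true")
        .csv("hdfs:///data/lookup.csv");                       // placeholder path
    // 4. Join the streaming data with the static data using SQL.
    stream.createOrReplaceTempView("stream_data");
    lookup.createOrReplaceTempView("static_data");
    Dataset<Row> joined = spark.sql(
        "SELECT s.id, s.payload, t.* " +
        "FROM stream_data s JOIN static_data t ON s.id = t.id");
    // 5. Start the query (console sink here purely for illustration).
    StreamingQuery query = joined.writeStream()
        .outputMode("append")
        .format("console")
        .start();
    query.awaitTermination();
  }
}
Stream-static inner joins like this run in append mode; caching the static DataFrame avoids re-reading the CSV on every micro-batch.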

Related

Is it possible to let Spark Structured Streaming (update mode) write to a DB?

I use Spark (3.0.0) Structured Streaming to read a topic from Kafka.
I've used joins and then mapGroupsWithState to get my stream data, so I have to use update mode, based on my understanding of the official Spark guide: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
The section of the official guide below says nothing about a DB sink, and the file sink does not support update mode either: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks
Currently I output it to the console, and I would like to store the data in files or a DB.
So my question is:
How can I write the stream data to a DB or file in my situation?
Do I have to write the data to Kafka and then use Kafka Connect to read it back into files / a DB?
P.S. I followed these articles to get the aggregated streaming query:
- https://stackoverflow.com/questions/62738727/how-to-deduplicate-and-keep-latest-based-on-timestamp-field-in-spark-structured
- https://databricks.com/blog/2017/10/17/arbitrary-stateful-processing-in-apache-sparks-structured-streaming.html
- I will also try the approach below one more time, using the Java API: https://stackoverflow.com/questions/50933606/spark-streaming-select-record-with-max-timestamp-for-each-id-in-dataframe-pysp
I was confused between OUTPUT and WRITE. I also wrongly assumed that DB and file sinks would be listed side by side in the output sinks section of the doc (which is why I could not find a DB sink in the output sinks section of the guide: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks).
I just realized that the output mode (append/update/complete) constrains what the streaming query emits, but it has nothing to do with how you WRITE to the sink. I also realized that writing to a DB can be achieved with the foreach sink (initially I understood it only as a place for extra transformations).
I found these articles/discussions useful:
https://www.waitingforcode.com/apache-spark-structured-streaming/output-modes-structured-streaming/read#what_is_the_difference_with_SaveMode
How to write streaming dataframe to PostgreSQL?
https://linuxize.com/post/how-to-list-databases-tables-in-postgreqsl/
Later on, I read the official guide again and confirmed that foreachBatch can also apply custom logic, etc., when writing to storage.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch
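To make that concrete, here is a minimal foreachBatch sketch of the DB write. The JDBC URL, table name, credentials and the name of the aggregated Dataset are my own placeholders, and the PostgreSQL driver has to be on the classpath.
import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.streaming.StreamingQuery;
public class JdbcSink {
  // "aggregated" is assumed to be the update-mode streaming Dataset produced by
  // the joins + mapGroupsWithState pipeline described above.
  public static StreamingQuery writeToPostgres(Dataset<Row> aggregated) throws Exception {
    return aggregated.writeStream()
        .outputMode("update")
        .foreachBatch((VoidFunction2<Dataset<Row>, Long>) (batch, batchId) -> {
            // Inside foreachBatch each micro-batch is a plain DataFrame,
            // so the ordinary JDBC batch writer can be used.
            batch.write()
                .format("jdbc")
                .option("url", "jdbc:postgresql://localhost:5432/mydb")  // placeholder
                .option("dbtable", "agg_results")                        // placeholder
                .option("user", "spark")                                 // placeholder
                .option("password", "secret")                            // placeholder
                .mode(SaveMode.Append)
                .save();
        })
        .start();
  }
}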

Custom sink for Spark Structured Streaming

For Spark Structured Streaming we have written a custom source reader to read data from our "custom source". To write it, I followed the example of the "Kafka custom source" code in the Spark codebase itself.
We also have a requirement to write our output to a custom location. For that, I saw that Spark provides "foreach" and "foreachBatch" (in Spark 2.3.0 onwards). Using either of these looks like a very simple implementation, and at first glance I feel most of my custom sink requirements can be met.
But when I looked at the Kafka code in Spark, I saw that Kafka uses "StreamSinkProvider" (https://jaceklaskowski.gitbooks.io/spark-structured-streaming/content/spark-sql-streaming-StreamSinkProvider.html) instead of "foreach" or "foreachBatch".
Now I want to find out the pros/cons of using either of these options (in terms of performance, flexibility, etc.). Is one better than the other? Does anyone have experience with either of these options and how they hold up in an actual use case?
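For context, the foreach-based option I am considering looks roughly like this; the local file path merely stands in for our custom location, so treat it as an illustration rather than a reference implementation.
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import org.apache.spark.sql.ForeachWriter;
import org.apache.spark.sql.Row;
public class CustomLocationWriter extends ForeachWriter<Row> {
  private transient PrintWriter out;
  @Override
  public boolean open(long partitionId, long epochId) {
    try {
      // One writer per partition/epoch; returning false skips this partition.
      out = new PrintWriter(new FileWriter("/tmp/custom-sink-" + partitionId + ".log", true));
      return true;
    } catch (IOException e) {
      return false;
    }
  }
  @Override
  public void process(Row row) {
    out.println(row.mkString(","));   // row-at-a-time write
  }
  @Override
  public void close(Throwable errorOrNull) {
    if (out != null) out.close();
  }
}
It would be attached with something like df.writeStream().foreach(new CustomLocationWriter()).start().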

How to save results of streaming query as PDF / XLSX (for report generation)?

Curious to know if we can generate PDF or XLSX files for reports using Spark Streaming / Spark Structured Streaming. As per the official documentation there is a File Sink, but are PDF and XLSX supported? If so, can we make use of it for report generation?
if we can generate PDF or XLSX files for reports using Spark Streaming / Spark Structured Streaming
If you want to generate PDF/XLSX files in a distributed streaming manner, you could indeed use Spark Structured Streaming.
As per the official documentation there is a File Sink, but are PDF and XLSX supported?
No. There is no direct support for the PDF/XLSX formats, so you'd have to write a custom data source yourself (with a streaming sink).
if so, can we make use of it for report generation?
I've never heard of such a data source before, but it's certainly possible to write one yourself.
Think of Spark as a general-purpose computation platform and whatever can be modelled (designed) as a distributed computation should certainly be doable using Spark machinery.
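As a rough illustration, one pragmatic sketch is to skip a full custom data source and build the workbook inside foreachBatch with a library such as Apache POI. The Dataset name, output mode and path below are assumptions, and collecting to the driver only makes sense for report-sized results.
import java.io.FileOutputStream;
import java.util.List;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.streaming.StreamingQuery;
public class XlsxReportSink {
  public static StreamingQuery start(Dataset<Row> report) throws Exception {
    return report.writeStream()
        .outputMode("complete")   // assumes "report" is a small aggregated result
        .foreachBatch((VoidFunction2<Dataset<Row>, Long>) (batch, batchId) -> {
            List<Row> rows = batch.collectAsList();            // driver-side; keep it small
            try (XSSFWorkbook wb = new XSSFWorkbook()) {
              org.apache.poi.ss.usermodel.Sheet sheet = wb.createSheet("report");
              for (int i = 0; i < rows.size(); i++) {
                org.apache.poi.ss.usermodel.Row out = sheet.createRow(i);
                for (int j = 0; j < rows.get(i).size(); j++) {
                  out.createCell(j).setCellValue(String.valueOf(rows.get(i).get(j)));
                }
              }
              // One workbook per micro-batch; the path is a placeholder.
              try (FileOutputStream fos = new FileOutputStream("/tmp/report-" + batchId + ".xlsx")) {
                wb.write(fos);
              }
            }
        })
        .start();
  }
}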

Zeppelin with Spark Structured Streaming Example

I am trying to visualize Spark structured streams in Zeppelin. I am able to achieve this using the memory sink (spark.apache), but it is not a reliable solution for high data volumes. What would be a better solution?
An example implementation or demo would be helpful.
Thanks,
Rilwan
Thanks for asking the question!! Having 2+ years of experience developing Spark monitoring tools, I think I will be able to resolve your doubt!!
There are two types of processing available when data comes to Spark as a stream.
Discretized Stream or DStream: In this mode, Spark provides you the data in RDD format and you have to write your own logic to handle the RDD.
Pros:
1. If you want to do some processing before saving the streaming data, an RDD is the better way to handle it compared to a DataFrame.
2. DStream provides a nice Streaming UI which graphically shows how much data has been processed. Check this link - https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html#monitoring-applications
Cons:
1. Handling raw RDDs is not so convenient and easy.
Structured Streaming: In this mode, Spark provides you the data in DataFrame format; you need to specify where to store/send the data.
Pros:
1. Spark Structured Streaming comes with predefined sources and sinks which are very common, and 95% of real-life scenarios can be handled by plugging these in. Check this link - https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Cons:
1. There is no Streaming UI available with Structured Streaming :( although you can get the metrics and create your own UI. Check this link - https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries
You can also store the metrics in a plain-text file, read the file in Zeppelin through spark.read.json, and plot your own graph.
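A sketch of that last idea, using a StreamingQueryListener to append one JSON line per micro-batch; the output path is a placeholder.
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import org.apache.spark.sql.streaming.StreamingQueryListener;
public class FileMetricsListener extends StreamingQueryListener {
  private static final String PATH = "/tmp/streaming-metrics.json";  // placeholder
  @Override public void onQueryStarted(QueryStartedEvent event) { }
  @Override
  public void onQueryProgress(QueryProgressEvent event) {
    // One JSON object per micro-batch; Zeppelin can later load the file with spark.read.json.
    try (PrintWriter out = new PrintWriter(new FileWriter(PATH, true))) {
      out.println(event.progress().json());
    } catch (IOException e) {
      // Metrics are best-effort; do not fail the query over them.
    }
  }
  @Override public void onQueryTerminated(QueryTerminatedEvent event) { }
}
Register it with spark.streams().addListener(new FileMetricsListener()); before starting the query, then read the file in Zeppelin and plot it.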

Spark Streaming : source HBase

Is it possible to set up a Spark Streaming job to keep track of an HBase table and read new/updated rows every batch? The blog here says that HDFS files come under supported sources. But it seems to use the following static API:
sc.newAPIHadoopRDD(..)
I can't find any documentation around this. Is it possible to stream from HBase using a Spark streaming context? Any help is appreciated.
Thanks!
The link provided does the following:
Read the streaming data, convert it into HBase Puts, and then add them to the HBase table. Up to this point it is streaming, which means your ingestion process is streaming.
The stats calculation part, I think, is batch - it uses newAPIHadoopRDD. This method treats the data-reading part like reading files; in this case the files come from HBase, which is the reason for the following input formats:
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
If you want to read the updates in HBase as a stream, then you need a handle on HBase's WAL (write-ahead log) at the back end and then perform your operations. HBase Indexer is a good place to start for reading any updates in HBase.
I have used hbase-indexer to read HBase updates at the back end and direct them to Solr as they arrive. Hope this helps.
