Is it possible to let Spark Structured Streaming (update mode) write to a DB? - apache-spark

I use Spark (3.0.0) Structured Streaming to read a topic from Kafka.
I've used joins and then mapGroupsWithState to get my stream data, so I have to use update mode, based on my understanding of the Spark official guide: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
The output-sinks section of the official guide says nothing about a DB sink, and it does not support writing to files in update mode either: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks
Currently I output to the console, and I would like to store the data in files or a DB.
So my question is:
How can I write the stream data to a DB or a file in my situation?
Do I have to write the data to Kafka and then use Kafka Connect to read it back into files/a DB?
p.s. I followed the articles below to get the aggregated streaming query:
- https://stackoverflow.com/questions/62738727/how-to-deduplicate-and-keep-latest-based-on-timestamp-field-in-spark-structured
- https://databricks.com/blog/2017/10/17/arbitrary-stateful-processing-in-apache-sparks-structured-streaming.html
- will also try the following one more time, using the Java API: https://stackoverflow.com/questions/50933606/spark-streaming-select-record-with-max-timestamp-for-each-id-in-dataframe-pysp

I got confused between OUTPUT and WRITE. I was also wrongly assuming that DB and file sinks were parallel terms in the OUTPUT SINKS section of the doc (which is why you don't see a DB sink in the OUTPUT SINKS section of the guide: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks).
I just realized that the OUTPUT mode (append/update/complete) only constrains what the streaming query emits; it has nothing to do with how you WRITE to the SINK. I also realized that writing to a DB can be achieved with the FOREACH SINK (initially I understood it only as a hook for extra transformations).
I found these articles/discussions useful:
- https://www.waitingforcode.com/apache-spark-structured-streaming/output-modes-structured-streaming/read#what_is_the_difference_with_SaveMode
- How to write streaming dataframe to PostgreSQL?
- https://linuxize.com/post/how-to-list-databases-tables-in-postgreqsl/
Later on, I read the official guide again and confirmed that foreachBatch can also run custom logic when WRITING to STORAGE:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch
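For what it's worth, here is a minimal sketch of the foreachBatch approach for the update-mode query, assuming `aggregated` is the streaming DataFrame produced by the pipeline above (call .toDF() on a typed Dataset first) and treating the JDBC URL, table, and credentials as placeholders:

import org.apache.spark.sql.DataFrame

// Inside foreachBatch each micro-batch is a plain DataFrame, so the ordinary
// batch JDBC writer can be used against the database.
def writeBatchToDb(batchDF: DataFrame, batchId: Long): Unit = {
  batchDF.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/mydb")  // placeholder URL
    .option("dbtable", "stream_results")                     // placeholder table
    .option("user", "spark")                                 // placeholder user
    .option("password", "secret")                            // placeholder password
    .mode("append")
    .save()
}

val query = aggregated.writeStream
  .outputMode("update")                                // foreachBatch works with update mode
  .option("checkpointLocation", "/tmp/db-sink-chk")    // placeholder checkpoint dir
  .foreachBatch(writeBatchToDb _)
  .start()

query.awaitTermination()

Note that a plain JDBC append inserts every updated row again as a new row; for real upserts you would run your own MERGE/UPSERT logic inside the foreachBatch function instead.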

Related

How does the Kafka sink support update mode in Structured Streaming?

I have read about the different output modes like:
Complete Mode - The entire updated Result Table will be written to the sink.
Append Mode - Only the new rows appended in the Result Table since the last trigger will be written to the external storage.
Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage.
At first I thought I understood the above explanations.
Then I came across this:
File sink supported modes: Append
Kafka sink supported modes: Append, Update, Complete
Wait!! What??!!
Why couldn't we just write out the entire result table to a file?
How can we update an already existing entry in Kafka? It's a stream; you can't just look for certain messages and change/update them.
This makes no sense at all.
Could you help me understand this? I just don't get how this works technically.
Spark writes one file per partition, often with one file per executor, and executors run in a distributed fashion. Files are local to each executor, so append is the only mode that really makes sense - you cannot fully replace individual files, especially not without losing data within the stream. That leaves you with either appending new files to the filesystem or inserting into existing files.
Kafka has no update functionality... The Kafka Integration Guide doesn't mention any of these modes, so it is unclear what you are referring to. You use write or writeStream, and it will always "append" the "complete" DataFrame batch(es) to the end of the Kafka topic. The way Kafka implements something like updates is with compacted topics, but that has nothing to do with Spark.
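For what it's worth, a minimal sketch of an update-mode query written to Kafka - the broker, topic, and checkpoint path are placeholders, and it assumes `aggregated` is a streaming DataFrame that has a `key` column; each micro-batch of changed rows is simply appended to the topic as new records:

import org.apache.spark.sql.functions.{col, struct, to_json}

// The Kafka sink only needs string/binary "key" and "value" columns; every
// trigger's changed rows are produced to the topic as ordinary new records.
val query = aggregated
  .select(col("key").cast("string").as("key"),
          to_json(struct(col("*"))).as("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder brokers
  .option("topic", "aggregates")                         // placeholder topic
  .option("checkpointLocation", "/tmp/kafka-sink-chk")   // placeholder checkpoint dir
  .outputMode("update")
  .start()

Downstream consumers therefore see multiple versions of the same key over time; "latest value wins" semantics come from a compacted topic on the Kafka side, not from Spark.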

How to export data from Hive to Kafka

I need to export data from Hive to Kafka topics based on some events in another Kafka topic. I know I can read data from Hive in a Spark job using HQL and write it to Kafka from Spark, but is there a better way?
This can be achieved using classic (non-structured) Spark Streaming. The steps are outlined below; a rough sketch of the Spark part follows the list.
1. Create a Spark Streaming job which connects to the required topic and fetches the required data-export information.
2. From the stream, do a collect and get your data-export requirements into driver variables.
3. Create a DataFrame using the specified condition.
4. Write the DataFrame into the required topic using KafkaUtils.
5. Provide a polling interval based on your data volume and Kafka write throughput.
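A rough sketch of steps 3-4 (building the DataFrame from Hive and writing it to Kafka), using the DataFrame Kafka writer instead of KafkaUtils; the database, table, HQL filter, broker, and topic below are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, to_json}

val spark = SparkSession.builder()
  .appName("hive-to-kafka")      // placeholder app name
  .enableHiveSupport()
  .getOrCreate()

// Step 3: build a DataFrame from Hive using the condition collected from the trigger event.
val exportDf = spark.sql("SELECT * FROM mydb.mytable WHERE event_date = '2020-01-01'")  // placeholder HQL

// Step 4: publish the rows to the target topic as JSON values (batch write to Kafka).
exportDf
  .select(to_json(struct(col("*"))).as("value"))
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder brokers
  .option("topic", "hive-export")                        // placeholder topic
  .save()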
Typically, you do this the other way around (Kafka to HDFS/Hive).
But you are welcome to try using the Kafka Connect JDBC plugin to read from a Hive table on a scheduled basis, which converts the rows into structured key-value Kafka messages.
Otherwise, I would re-evaluate other tools, because Hive is slow. Couchbase or Cassandra offer much better CDC features for ingestion into Kafka. Or rewrite the upstream applications that inserted into Hive in the first place to write immediately into Kafka, from which you can join with other topics, for example.

Storing Kafka Offsets in a File vs HBase

I am developing a Spark-Kafka streaming program where I need to capture the Kafka partition offsets in order to handle failure scenarios.
Most devs use HBase as the storage for offsets, but what would it be like if I used a file on HDFS or local disk to store the offsets, which is simple and easy?
I am trying to avoid using a NoSQL store for storing offsets.
What are the advantages and disadvantages of using a file over HBase for storing offsets?
Just use Kafka. Out of the box, Apache Kafka stores consumer offsets within Kafka itself.
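For example, with the spark-streaming-kafka-0-10 direct DStream API you can commit the processed offsets back to Kafka after each batch - a sketch, assuming `stream` is the direct stream created with KafkaUtils.createDirectStream:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// Hand each batch's offset ranges back to Kafka's own __consumer_offsets storage
// once the batch's output has been written, instead of using a file or HBase table.
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process / write out the rdd here ...
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}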
I too have a similar use case; I prefer HBase for the following reasons:
Easy retrieval: it stores data in sorted order of row key, which is helpful when the offsets belong to different data groups.
I had to capture start and end offsets for a group of data. Capturing the start is easy, but the end offset is tough to capture in streaming mode, and I didn't want to open a file, update only the end offset, and close it again. I also thought of S3, but S3 objects are immutable.
ZooKeeper can also be an option.
Hope it helps.

Spark Streaming: source HBase

Is it possible to set up a Spark Streaming job to keep track of an HBase table and read new/updated rows every batch? The blog here says that HDFS files come under supported sources, but they seem to be using the following static API:
sc.newAPIHadoopRDD(..)
I can't find any documentation around this. Is it possible to stream from HBase using the Spark streaming context? Any help is appreciated.
Thanks!
The link provided does the following:
It reads the streaming data, converts it into HBase Puts, and then adds them to an HBase table. Up to that point it's streaming, which means your ingestion process is streaming.
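A rough sketch of that ingestion step (record -> Put -> table), assuming `stream` is a DStream[(String, String)] of (rowKey, value) pairs and using placeholder table and column-family names:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

// Each streamed record becomes an HBase Put written through a per-partition connection.
stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("sensor_events"))  // placeholder table
    records.foreach { case (rowKey, value) =>
      val put = new Put(Bytes.toBytes(rowKey))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(value))  // placeholder cf/qualifier
      table.put(put)
    }
    table.close()
    connection.close()
  }
}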
The stats-calculation part, I think, is batch - it uses newAPIHadoopRDD. This method treats the data-reading part like reading files; in this case the "files" come from HBase, which is the reason for the following input formats:
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
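Here conf is presumably an HBase Configuration with the source table configured, along these lines (the table name is a placeholder):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table")  // placeholder table name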
If you want to read updates in HBase as a stream, then you need a handle on HBase's WAL (write-ahead log) at the back end and then perform your operations there. hbase-indexer is a good place to start for reading any updates in HBase.
I have used hbase-indexer to read HBase updates at the back end and direct them to Solr as they arrive. Hope this helps.

Cassandra Loading Options

I have deployed a 9-node DataStax cluster in Google Cloud. I am new to Cassandra and not sure how people generally push data to Cassandra.
My requirement is to read the data from flat files and RDBMS tables and load it into Cassandra, which is deployed in Google Cloud.
These are the options I see.
1. Use Spark and Kafka
2. SStables
3. Copy Command
4. Java Batch
5. Dataflow (Google product)
Are there any other options, and which one is best?
Thanks,
For flat files you have the two most effective options:
Use Spark - it will load data in parallel, but requires some coding.
Use DSBulk for batch loading of data from the command line. It supports loading from CSV and JSON and is very effective. The DataStax Academy blog just started a series of posts on DSBulk, and the first post will give you enough information to get started. Also, if you have big files, consider splitting them into smaller ones, as this allows DSBulk to perform a parallel load using all available threads.
For loading data from an RDBMS, it depends on what you want to do - load the data once, or keep updating it as it changes in the DB. For the first option you can use Spark with the JDBC source (it has some limitations too) and then save the data into DSE. For the second, you may need something like Debezium, which supports streaming change data from some databases into Kafka; from Kafka you can then use the DataStax Kafka Connector to push the data into DSE.
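A rough sketch of the Spark route for both cases, assuming the spark-cassandra-connector is on the classpath; the keyspace, table, file path, and JDBC details are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("load-to-cassandra").getOrCreate()

// Flat files: read the CSVs in parallel and append them into the target table.
spark.read
  .option("header", "true")
  .csv("gs://my-bucket/input/*.csv")                              // placeholder path
  .write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))     // placeholders
  .mode("append")
  .save()

// RDBMS (one-off load): read over JDBC, then write the same way.
spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/source_db")      // placeholder URL
  .option("dbtable", "source_table")                              // placeholder table
  .option("user", "user")                                         // placeholder user
  .option("password", "secret")                                   // placeholder password
  .load()
  .write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))     // placeholders
  .mode("append")
  .save()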
CQLSH's COPY command isn't as effective/flexible as DSBulk, so I wouldn't recommend using it.
And never use CQL batches for data loading until you understand how they work - they are very different from the RDBMS world, and used incorrectly they will actually make loading less effective than executing separate statements asynchronously. (DSBulk uses batches under the hood, but that's a different story.)
