Storing Kafka Offsets in a File vs HBase - apache-spark

I am developing a Spark-Kafka streaming program where I need to capture the Kafka partition offsets in order to handle failure scenarios.
Most developers use HBase as storage for offsets, but how would it be if I used a file on HDFS or local disk to store the offsets, which is simple and easy?
I am trying to avoid using a NoSQL store just for offsets.
What are the advantages and disadvantages of using a file over HBase for storing offsets?
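For reference, this is roughly what I have in mind for the file approach, using the Hadoop FileSystem API (the path and the one-line-per-partition format below are just placeholders):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    // placeholder path: one file per consumer group and topic
    val offsetPath = new Path("hdfs:///streaming/offsets/my-group/my-topic")

    // rewrite the whole file with the latest offsets after each processed batch
    def saveOffsets(offsets: Map[Int, Long]): Unit = {
      val out = fs.create(offsetPath, true) // true = overwrite the existing file
      offsets.foreach { case (partition, offset) => out.writeBytes(s"$partition:$offset\n") }
      out.close()
    }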

Just use Kafka. Out of the box, Apache Kafka stores consumer offsets within Kafka itself.
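With the spark-streaming-kafka-0-10 direct stream you can even commit the processed offsets back to Kafka yourself. A minimal sketch, assuming placeholder broker, topic, and group names:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._

    val ssc = new StreamingContext(new SparkConf().setAppName("offsets-in-kafka"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-consumer-group",
      "enable.auto.commit" -> (false: java.lang.Boolean) // commit manually after processing
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("my-topic"), kafkaParams))

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process rdd ...
      // commit this batch's offsets back to Kafka once processing has succeeded
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }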

I have a similar use case, and I prefer HBase for the following reasons:
Easy retrieval: HBase stores data sorted by rowkey, which is helpful when the offsets belong to different data groups.
I had to capture the start and end offset for a group of data. Capturing the start is easy, but the end offset is tough to capture in streaming mode, and I did not want to open a file, update only the end offset, and close it again for every batch. I considered S3 as well, but S3 objects are immutable.
Zookeeper can also be an option.
Hope it helps.
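If it helps, here is a rough sketch of the HBase side using the standard client API; the stream_offsets table, the o column family, and the group:topic:partition rowkey layout are my own assumptions (the table and family must already exist):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
    import org.apache.hadoop.hbase.util.Bytes

    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("stream_offsets"))

    // rowkey "group:topic:partition" keeps all offsets of one group/topic sorted together
    def saveOffset(group: String, topic: String, partition: Int, offset: Long): Unit = {
      val put = new Put(Bytes.toBytes(s"$group:$topic:$partition"))
      put.addColumn(Bytes.toBytes("o"), Bytes.toBytes("until"), Bytes.toBytes(offset))
      table.put(put)
    }

    // assumes the row already exists; returns the last stored offset
    def readOffset(group: String, topic: String, partition: Int): Long = {
      val result = table.get(new Get(Bytes.toBytes(s"$group:$topic:$partition")))
      Bytes.toLong(result.getValue(Bytes.toBytes("o"), Bytes.toBytes("until")))
    }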

Related

How does the Kafka sink support update mode in Structured Streaming?

I have read about the different output modes like:
Complete Mode - The entire updated Result Table will be written to the sink.
Append Mode - Only the new rows appended in the Result Table since the last trigger will be written to the external storage.
Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage
At first I thought I understood the above explanations.
Then I came across this:
File sink supported modes: Append
Kafka sink supported modes: Append, Update, Complete
Wait!! What??!!
Why couldn't we just write out the entire result table to a file?
How can we update an already existing entry in Kafka? It's a stream; you can't just look up certain messages and change/update them.
This makes no sense at all.
Could you help me understand this? I just don't get how this works technically.
Spark writes one file per partition, often with one file per executor. Executors run in a distributed fashion, and the files are local to each executor, so append is the only mode that makes sense: you cannot fully replace individual files, at least not without losing data within the stream. That leaves you with either appending new files to the filesystem or inserting into existing files.
Kafka has no update functionality as such. The Kafka Integration Guide doesn't mention any of these modes, so it is unclear what you are referring to; you use write or writeStream, and it will always "append" the "complete" dataframe batch(es) to the end of the Kafka topic. The way Kafka implements something like updates is with compacted topics, but that has nothing to do with Spark.
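To make the modes concrete, here is a sketch of writing a streaming aggregation to Kafka in update mode; each trigger produces only the changed groups as new records appended to the topic (broker address, topic names, and checkpoint path are placeholders, and spark is an existing SparkSession):

    val counts = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(value AS STRING) AS word")
      .groupBy("word")
      .count()

    // update mode: each trigger emits only the groups whose count changed; they are
    // simply appended to the topic as new records, Kafka never rewrites old ones
    counts.selectExpr("word AS key", "CAST(count AS STRING) AS value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "word-counts")
      .option("checkpointLocation", "/tmp/checkpoints/word-counts")
      .outputMode("update")
      .start()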

Specifying a checkpoint location when streaming data from Kafka topics with Structured Streaming

I have built a Spark Structured Streaming application which reads data from Kafka topics, and I have specified startingOffsets as latest. What happens if there is a failure on the Spark side: from which point/offset will the data continue to be read after restarting? And is it a good idea to specify a checkpoint in the write stream to make sure we read from the point where the application/Spark failed?
Please let me know.
I would advise you to set the offsets to earliest and configure a checkpointLocation (HDFS, MinIO, other). The setting kafka.group.id will not commit offsets back to Kafka (even in Spark 3+) unless you commit them manually using foreachBatch.
You can use checkpoints, yes, or you can set kafka.group.id (in Spark 3+, at least).
Otherwise, it may start back at the end of the topic.
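A minimal sketch of that setup (topic names and paths are placeholders); on restart, the offsets recorded in the checkpoint take precedence, and startingOffsets is only consulted the very first time the query runs:

    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my-topic")
      .option("startingOffsets", "earliest") // used only when no checkpoint exists yet
      .load()

    input.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/out")
      .option("checkpointLocation", "hdfs:///checkpoints/my-query") // processed offsets are tracked here
      .start()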

Why is there no JDBC Spark Streaming receiver?

I suggest it's a good idea to process a huge JDBC table by reading rows in batches and processing them with Spark Streaming. This approach doesn't require reading all rows into memory. I'm not thinking of monitoring the table for new rows, but just reading it once.
I was surprised that there is no JDBC Spark Streaming receiver implementation. Implementing a Receiver doesn't look difficult.
Could you describe why such a receiver doesn't exist (is this approach a bad idea?) or provide links to implementations?
I've found Stratio/datasource-receiver, but it reads all the data into a DataFrame before it is processed by Spark Streaming.
Thanks!
First of all, an actual streaming source would require a reliable mechanism for monitoring updates, which is simply not a part of the JDBC interface, nor is it a standardized (if present at all) feature of major RDBMSs, not to mention the other platforms that can be accessed through JDBC. That means streaming from a source like this typically requires log replication or similar facilities and is highly dependent on the specific system.
At the same time, what you describe
suggest it's a good idea to process a huge JDBC table by reading rows in batches and processing them with Spark Streaming. This approach doesn't require reading all rows into memory. I'm not thinking of monitoring the table for new rows, but just reading it once
is really not a use case for streaming. Streaming deals with infinite streams of data, while what you ask for is simply a scenario for partitioning, and such capabilities are already part of the standard JDBC connector (either by range or by predicate); see the sketch below.
Additionally, receiver-based solutions simply don't scale well and effectively model a sequential process. As a result their applications are fairly limited, and they would be even less appealing when the data is bounded (if you're going to read finite data sequentially on a single node, there is no value in adding Spark to the equation).
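For completeness, a sketch of the partitioned batch read mentioned above; the connection URL, table name, and bounds are placeholders:

    // range-based partitioning: partitionColumn is typically a numeric key column;
    // lowerBound/upperBound only control how the range is split, they don't filter rows
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb")
      .option("dbtable", "big_table")
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "16")
      .load()

    // predicate-based partitioning is also possible via spark.read.jdbc(url, table, predicates, connectionProperties)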
I don't think it is a bad idea, since in some cases you have constraints that are outside your control, e.g. legacy systems to which you cannot apply strategies such as CDC, but which you still have to consume as a source of streaming data.
On the other hand, the Spark Structured Streaming engine, in micro-batch mode, requires the definition of an offset that can be advanced, as you can see in this class. So, if your table has some column that can be used as an offset, you can definitely stream from it, although RDBMSs are not that "streaming-friendly" as far as I know.
I have developed Jdbc2s, which is a DataSource V1 streaming source for Spark. It's also deployed to Maven Central, if you need it. The coordinates are in the documentation.

Spark Streaming: HBase source

Is it possible to have a Spark Streaming job set up to keep track of an HBase table and read new/updated rows every batch? The blog here says that HDFS files come under supported sources. But they seem to be using the following static API:
sc.newAPIHadoopRDD(..)
I can't find any documentation around this. Is it possible to stream from HBase using the Spark streaming context? Any help is appreciated.
Thanks!
The link provided does the following:
Read the streaming data, convert it into HBase Puts and add them to the HBase table. Up to this point it is streaming, which means your ingestion process is streaming.
The stats calculation part, I think, is batch; it uses newAPIHadoopRDD. This method treats the data-reading part as reading files. In this case the "files" are from HBase, which is the reason for the following input formats:
// conf is expected to be an HBase configuration with TableInputFormat.INPUT_TABLE set to the source table
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
If you want to read updates from HBase as a stream, then you need a handle on HBase's WAL (write-ahead log) at the back end, and then perform your operations on that. HBase-indexer is a good place to start for reading any updates in HBase.
I have used hbase-indexer to read HBase updates at the back end and direct them to Solr as they arrive. Hope this helps.

What is the best way to store incoming streaming data?

What is a better choice for a long-term store (many writes, few reads) of data processed through Spark Streaming: Parquet, HBase or Cassandra? Or something else? What are the trade-offs?
In my experience we have used HBase as the datastore for Spark Streaming data (we also had the same scenario: many writes, few reads). Since we are using Hadoop, HBase has native integration with it, and it went well.
We used it to store a high rate of messages coming in from Solace.
HBase is well suited for range-based scans. Cassandra is known for availability and many other things...
However, I can also observe one general trend in many projects: they simply store raw data in HDFS (Parquet + Avro) in a partitioned structure through Spark Streaming with the Spark DataFrame API (SaveMode.Append), and they process the raw data with Spark.
Example of a partitioned structure in HDFS:
businessdate/environment/businesssubtype/messagetype etc....
In this case there is no need to go to HBase or any other data store; see the sketch below.
But one common issue with the above approach is that streaming produces lots of small and tiny files, so you need repartition(1), coalesce, or FileUtils.copyMerge to consolidate each partition into a single file that meets the block size requirements. Apart from that, the above approach is fine.
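A rough sketch of that pattern with foreachBatch; the output path and partition column names are illustrative only:

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // called once per micro-batch of the streaming query
    def writeRaw(batchDF: DataFrame, batchId: Long): Unit = {
      batchDF
        .coalesce(4) // fewer files per batch softens the small-files problem mentioned above
        .write
        .mode(SaveMode.Append)
        .partitionBy("businessdate", "environment", "businesssubtype", "messagetype")
        .parquet("hdfs:///data/raw/events")
    }

    // streamingDF.writeStream.foreachBatch(writeRaw _).option("checkpointLocation", "hdfs:///chk/raw").start()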
There is also the CAP theorem, on which the decision can be based:
Consistency (all nodes see the same data at the same time).
Availability (every request receives a response about whether it succeeded or failed).
Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures).
Cassandra supports AP.
HBase supports CP.
Look at the detailed analysis given here.
