Spark Streaming: source HBase

Is it possible to set up a spark-streaming job to keep track of an HBase table and read new/updated rows every batch? The blog here says that HDFS files come under the supported sources, but it seems to be using the following static API:
sc.newAPIHadoopRDD(..)
I can't find any documentation on this. Is it possible to stream from HBase using the Spark streaming context? Any help is appreciated.
Thanks!

The link provided does the following:
Read the streaming data, convert it into HBase Puts, and then add them to an HBase table. Up to this point it is streaming, which means your ingestion process is streaming.
The stats-calculation part, I think, is batch: it uses newAPIHadoopRDD. This method treats the data-reading part as files; in this case the files come from HBase, which is the reason for the following input formats:
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
If you want to read the updates to an HBase table as a stream, you need a handle on HBase's WAL (write-ahead log) at the back end and then perform your operations on that. hbase-indexer is a good place to start for reading any updates in HBase.
I have used hbase-indexer to read HBase updates at the back end and direct them to Solr as they arrive. Hope this helps.

Related

Is it possible to let Spark Structured Streaming (update mode) write to a DB?

I use Spark (3.0.0) Structured Streaming to read a topic from Kafka.
I've used joins and then mapGroupsWithState to get my stream data, so I have to use update mode, based on my understanding of the official Spark guide: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
The following section of the official guide says nothing about a DB sink, and it does not support writing to files in update mode either: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks
Currently I output to the console, and I would like to store the data in files or a DB.
So my question is:
How can I write the stream data to a DB or file in my situation?
Do I have to write the data to Kafka and then use Kafka Connect to read it back out to files/a DB?
P.S. I followed these articles to get the aggregated streaming query:
- https://stackoverflow.com/questions/62738727/how-to-deduplicate-and-keep-latest-based-on-timestamp-field-in-spark-structured
- https://databricks.com/blog/2017/10/17/arbitrary-stateful-processing-in-apache-sparks-structured-streaming.html
- I will also try the one below one more time, using the Java API:
(https://stackoverflow.com/questions/50933606/spark-streaming-select-record-with-max-timestamp-for-each-id-in-dataframe-pysp)
I got confused between OUTPUT and WRITE. I was also wrongly assuming that the DB and File sinks were parallel terms in the output-sinks section of the doc (and so was puzzled not to see a DB sink there: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks).
I just realized that the output mode (append/update/complete) is a constraint on the streaming query itself; it has nothing to do with how the results are WRITTEN to the SINK. I also realized that writing to a DB can be achieved with the FOREACH SINK (initially I understood it only as a place for extra transformation).
I found these articles/discussions useful:
https://www.waitingforcode.com/apache-spark-structured-streaming/output-modes-structured-streaming/read#what_is_the_difference_with_SaveMode
How to write streaming dataframe to PostgreSQL?
https://linuxize.com/post/how-to-list-databases-tables-in-postgreqsl/
Later on, I read the official guide again and confirmed that foreachBatch can also apply custom logic, etc., when WRITING to STORAGE.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch
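Here is a minimal sketch of the foreachBatch approach, assuming the streaming DataFrame from the mapGroupsWithState query is called aggDF; the PostgreSQL URL, credentials, table name and checkpoint path below are placeholders, not from the question:

import org.apache.spark.sql.{DataFrame, SaveMode}

val jdbcUrl = "jdbc:postgresql://localhost:5432/mydb"       // placeholder
val props = new java.util.Properties()
props.setProperty("user", "spark")                          // placeholder
props.setProperty("password", "secret")                     // placeholder

aggDF.writeStream
  .outputMode("update")                                      // output mode only constrains what rows each batch contains
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // each micro batch arrives as a plain DataFrame, so the normal batch JDBC writer applies
    batchDF.write
      .mode(SaveMode.Append)
      .jdbc(jdbcUrl, "agg_results", props)                   // placeholder table name
  }
  .option("checkpointLocation", "/tmp/agg-checkpoint")       // placeholder path
  .start()

The same pattern works for files (replace the jdbc call with batchDF.write.parquet(...)). In a real job you would typically upsert rather than append, since update mode re-emits changed keys.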

Spark Structured Streaming - Streaming data joined with static data which will be refreshed every 5 mins

For a Spark Structured Streaming job, one input comes from a Kafka topic while the second input is a file (which is refreshed every 5 minutes by a Python API). I need to join these two inputs and write to a Kafka topic.
The issue I am facing is that when the second input file is being refreshed while the Spark streaming job is reading it, I get the error below:
File file:/home/hduser/code/new/collect_ip1/part-00163-55e17a3c-f524-4dac-89a4-b9e12f1a79df-c000.csv does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by recreating the Dataset/DataFrame involved.
Any help will be appreciated.
Use HBase as your store for the static data. It is more work for sure, but it allows for concurrent updating.
Where I work, all Spark Streaming uses HBase for data lookup. It is far faster. What if you have 100M customers for a micro batch of 10k records? I know it was a lot of work initially.
See https://medium.com/@anchitsharma1994/hbase-lookup-in-spark-streaming-acafe28cb0dc
If you have a small static ref table, then a static join is fine, but you also have updating, which causes issues.
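A rough sketch of that lookup pattern, assuming a streaming DataFrame streamDF keyed by customerId and an HBase table "customers" with column family "cf" and qualifier "segment" (all of these names are invented for illustration):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

streamDF.writeStream
  .option("checkpointLocation", "/tmp/enrich-chk")           // placeholder
  .foreachBatch { (batchDF: org.apache.spark.sql.DataFrame, _: Long) =>
    val enriched = batchDF.rdd.mapPartitions { rows =>
      // one HBase connection per partition, so nothing non-serializable is shipped from the driver
      val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val table = conn.getTable(TableName.valueOf("customers"))
      val out = rows.map { row =>
        val key     = row.getAs[String]("customerId")
        val result  = table.get(new Get(Bytes.toBytes(key)))
        val segment = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("segment")))
        (key, segment)
      }.toList                                               // materialize before closing the connection
      table.close(); conn.close()
      out.iterator
    }
    batchDF.sparkSession.createDataFrame(enriched).toDF("customerId", "segment")
      .write.mode("append").parquet("/tmp/enriched/")        // placeholder sink
  }
  .start()

In a real job you would keep the connection in a lazily initialized per-executor singleton rather than opening one per partition per batch.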

Spark Structured Streaming: join stream with data that should be read every micro batch

I have a stream from HDFS and I need to join it with my metadata, which is also in HDFS, both as Parquet.
My metadata sometimes gets updated, and I need to join with the freshest, most recent version, which ideally means reading the metadata from HDFS on every stream micro batch.
I tried to test this, but unfortunately Spark reads the metadata only once and (supposedly) caches the files, even if I set spark.sql.parquet.cacheMetadata=false.
Is there a way to read it on every micro batch? ForeachWriter is not what I'm looking for.
Here's a code example:
spark.sql("SET spark.sql.streaming.schemaInference=true")
spark.sql("SET spark.sql.parquet.cacheMetadata=false")
val stream = spark.readStream.parquet("/tmp/streaming/")
val metadata = spark.read.parquet("/tmp/metadata/")
val joinedStream = stream.join(metadata, Seq("id"))
joinedStream.writeStream.option("checkpointLocation", "/tmp/streaming-test/checkpoint").format("console").start()
/tmp/metadata/ gets updated with Spark append mode.
As far as I understand, if the metadata were accessed through a JDBC source with Spark Structured Streaming, Spark would query it on each micro batch.
As far as I found, there are two options:
Create a temp view and refresh it on an interval (a sketch of this refresh loop follows the second option below):
metadata.createOrReplaceTempView("metadata")
and trigger the refresh in a separate thread:
spark.catalog.refreshTable("metadata")
NOTE: in this case Spark will only re-read the same path; it does not work if you need to read metadata from different folders on HDFS, e.g. paths with timestamps, etc.
Restart the stream on an interval, as Tathagata Das suggested.
This way is not suitable for me, since my metadata might be refreshed several times per hour.
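A minimal sketch of the first option, combining the two snippets above with the question's code; the five-minute interval and the scheduler are assumptions, not part of the original answer:

import java.util.concurrent.{Executors, TimeUnit}

val metadata = spark.read.parquet("/tmp/metadata/")
metadata.createOrReplaceTempView("metadata")

val scheduler = Executors.newSingleThreadScheduledExecutor()
scheduler.scheduleAtFixedRate(new Runnable {
  override def run(): Unit = {
    // drops the cached file listing so the next micro batch picks up files newly appended under the same path
    spark.catalog.refreshTable("metadata")
  }
}, 5, 5, TimeUnit.MINUTES)

val joinedStream = stream.join(spark.table("metadata"), Seq("id"))
joinedStream.writeStream.option("checkpointLocation", "/tmp/streaming-test/checkpoint").format("console").start()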

Spark Streaming to Hive, too many small files per partition

I have a Spark Streaming job with a batch interval of 2 minutes (configurable).
This job reads from a Kafka topic, creates a Dataset, applies a schema on top of it, and inserts these records into the Hive table.
The Spark job creates one file per batch interval in the Hive partition, like below:
dataset.coalesce(1).write().mode(SaveMode.Append).insertInto(targetEntityName);
Now, the data that comes in is not that big, and even if I increase the batch duration to 10 minutes or so, I might still end up with only 2-3 MB of data, which is way less than the block size.
This is the expected behaviour in Spark Streaming.
I am looking for efficient ways to do post-processing that merges all these small files and creates one big file.
If anyone's done it before, please share your ideas.
I would encourage you not to use Spark to stream data from Kafka to HDFS.
The Kafka Connect HDFS plugin by Confluent (or Apache Gobblin by LinkedIn) exists for this very purpose. Both offer Hive integration.
Find my comments about compaction of small files in this GitHub issue.
If you need to write Spark code to process Kafka data into a schema, then you can still do that, and write into another topic in (preferably) Avro format, which Hive can easily read without a predefined table schema.
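A rough sketch of that route, assuming Spark 3.x with the external spark-avro and spark-sql-kafka packages on the classpath, a processed DataFrame called processedDF, and a placeholder broker/topic (none of these names come from the answer):

import org.apache.spark.sql.avro.functions.to_avro
import org.apache.spark.sql.functions.struct

// illustrative Avro schema for the outgoing records
val avroSchema =
  """{"type":"record","name":"Event","fields":[
    |  {"name":"id","type":"string"},
    |  {"name":"value","type":"double"}
    |]}""".stripMargin

processedDF
  .select(to_avro(struct("id", "value"), avroSchema).alias("value"))   // the Kafka sink expects a binary "value" column
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")                    // placeholder
  .option("topic", "events-avro")                                      // placeholder
  .option("checkpointLocation", "/tmp/events-avro-chk")                // placeholder
  .start()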
I personally have written a "compaction" process that grabs a bunch of hourly Avro data partitions from a Hive table and converts them into a daily-partitioned Parquet table for analytics. It's been working great so far.
If you want to batch the records before they land on HDFS, that's where Kafka Connect or Apache NiFi (mentioned in the link) can help, given that you have enough memory to store records before they are flushed to HDFS.
I have exactly the same situation as you. I solved it as follows.
Let's assume that your newly arriving data is stored in a dataset: dataset1.
1- Partition the table with a good partition key; in my case I found that I could partition using a combination of keys to get around 100 MB per partition.
2- Save using Spark Core, not Spark SQL:
a- Load the whole partition into memory (as a dataset: dataset2) when you want to save.
b- Then apply the dataset union function: dataset3 = dataset1.union(dataset2).
c- Make sure the resulting dataset is partitioned as you wish, e.g. dataset3.repartition(1).
d- Save the resulting dataset in "Overwrite" mode to replace the existing files.
If you need more details about any step, please reach out.
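A rough sketch of steps a-d, assuming dataset1 holds the newly arrived micro batch and the partition lives at a placeholder path (path and staging scheme are examples, not from the answer):

import org.apache.spark.sql.{Dataset, Row, SaveMode}

val partitionPath = "/warehouse/events/event_date=2020-01-01"   // placeholder partition

// a) load the existing partition
val dataset2: Dataset[Row] = spark.read.parquet(partitionPath)

// b) union it with the newly arrived records
val dataset3 = dataset1.union(dataset2)

// c) + d) collapse to one file and overwrite; Spark refuses to overwrite a path it is
// currently reading from, so write to a staging path first and then swap the directories
val stagingPath = partitionPath + "_staging"
dataset3.repartition(1).write.mode(SaveMode.Overwrite).parquet(stagingPath)
// once the write succeeds, move the staging directory into place (e.g. with Hadoop FileSystem.rename)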

Read from HBase + Convert to DF + Run SQLs

Edit
My use case is a Spark Streaming app (Spark 2.1.1 + Kafka 0.10.2.1), wherein I read from Kafka and, for each message/trigger, need to pull data from HBase. After the pull, I need to run some SQL statements on the data (received from HBase).
Naturally, I intend to push the processing (reading from HBase & SQL execution) to the worker nodes to achieve parallelism.
So far, my attempts to convert the data from HBase to a DataFrame (so that I can run SQL statements) have failed. Another gent mentioned that it's not "allowed", since that part runs on the executors. However, running those pieces on the worker nodes is my conscious choice.
Is that sound thinking? If not, why not?
What's the recommendation on that, or on the overall idea?
Reading from HBase and running SQL for every streamed record seems like "too much happening in a streaming app".
Anyway, you can create a connection to HBase for every partition, fetch the records, and then compare. I'm not sure about the SQL part. If it's just another read for every streaming record, again handle it at the partition level in Spark.
But the above approach will be time-consuming; just make sure you finish everything before the next batch starts.
You also mentioned converting "HBase to DataFrame" and "parallel". Those seem to pull in opposite directions, because you start with a DataFrame (maybe by reading from HBase once) and then you parallelize. Hope I cleared some of your doubts.
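For the "read HBase once, convert to a DataFrame, then run SQL" direction, here is a rough sketch; the table name "orders", column family "cf" and qualifier "amount" are invented for illustration, and the HBase client/mapreduce jars are assumed to be on the classpath:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "orders")             // placeholder table name

val hBaseRDD = spark.sparkContext.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

import spark.implicits._
// extract plain values from each Result before any shuffle, then build a DataFrame
val ordersDF = hBaseRDD.map { case (_, result) =>
  val id     = Bytes.toString(result.getRow)
  val amount = Bytes.toDouble(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("amount")))
  (id, amount)
}.toDF("id", "amount")

ordersDF.createOrReplaceTempView("orders")
spark.sql("SELECT id, SUM(amount) AS total FROM orders GROUP BY id").show()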
