How to export data from Hive to Kafka - apache-spark

I need to export data from Hive to Kafka topics based on some events in another Kafka topic. I know I can read data from hive in Spark job using HQL and write it to Kafka from the Spark, but is there a better way?

This can be achieved using Spark Streaming (the unstructured, DStream-based API). The steps are outlined below, with a sketch after the list:
1. Create a Spark Streaming job that connects to the required (control) topic and fetches the data-export request information.
2. From the stream, do a collect and get the export requirement into driver variables.
3. Create a DataFrame using the specified condition.
4. Write the DataFrame to the required topic.
5. Choose a polling (batch) interval based on your data volume and Kafka write throughput.
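A minimal sketch of those steps, assuming a hypothetical control topic export-requests whose messages carry the filter to export, a Hive table db.events, and a target topic hive-export (all names are illustrative). The write here uses the DataFrame Kafka sink from spark-sql-kafka rather than KafkaUtils, which is the read-side API:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{struct, to_json}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object HiveToKafkaExport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-to-kafka-export")
      .enableHiveSupport()          // needed so spark.sql can see Hive tables
      .getOrCreate()

    // Polling interval: tune to your data volume and Kafka write throughput
    val ssc = new StreamingContext(spark.sparkContext, Seconds(60))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "hive-export-driver",
      "auto.offset.reset"  -> "latest"
    )

    val controlStream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("export-requests"), kafkaParams))

    controlStream.foreachRDD { rdd =>
      // Step 2: collect the (small) export requests onto the driver
      val requests = rdd.map(_.value()).collect()

      requests.foreach { whereClause =>
        // Step 3: build a DataFrame for the requested slice of the Hive table
        val df = spark.sql(s"SELECT * FROM db.events WHERE $whereClause")

        // Step 4: write it to the target topic (requires spark-sql-kafka-0-10)
        df.select(to_json(struct(df.columns.map(df(_)): _*)).alias("value"))
          .write
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("topic", "hive-export")
          .save()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}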

Typically, you do this the other way around (Kafka to HDFS/Hive).
But you are welcome to try using the Kafka Connect JDBC plugin to read from a Hive table on a scheduled basis, which converts the rows into structured key-value Kafka messages.
Otherwise, I would re-evaluate other tools because Hive is slow. Couchbase or Cassandra offer much better CDC features for ingestion into Kafka. Or rewrite the upstream applications that inserted into Hive in the first place so that they write immediately into Kafka, from which you can join with other topics, for example.

Related

How to do Spark Streaming on Hive tables using SQL in non-real time?

We have data (millions of rows) in Hive tables that arrives every day. The next day, once the overnight ingestion is complete, different applications query us for the data (using SQL).
We take this SQL and make a call on Spark:
spark.sqlContext.sql(statement) // hive-metastore integration is enabled
This causes too much memory usage on the Spark driver. Can we use Spark Streaming (or Structured Streaming) to stream the results in a piped fashion rather than collecting everything on the driver and then sending it to the clients?
We don't want to send out the data as soon as it arrives (as in typical streaming apps); we want to send streaming data to clients when they ask (PULL) for it.
IIUC..
Spark Streaming is mainly designed to process streaming data by converting it into micro-batches on the order of milliseconds to seconds.
You can look at streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) => ... }, which gives Spark a good way to write the processed streaming output to a sink in a micro-batch manner (a sketch follows the quoted docs below).
Nevertheless, Spark Structured Streaming does not have a standard JDBC source defined to read from.
Consider storing the underlying Hive files in a compressed, structured format and transferring them directly rather than selecting through spark.sql, if every client needs the same or similar data; or partition them based on the WHERE condition of the spark.sql query and transfer only the needed files.
Source:
Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees.
ForeachBatch:
foreachBatch(...) allows you to specify a function that is executed on the output data of every micro-batch of a streaming query. Since Spark 2.4, this is supported in Scala, Java and Python. It takes two parameters: a DataFrame or Dataset that has the output data of a micro-batch and the unique ID of the micro-batch.
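A minimal sketch of foreachBatch, assuming a hypothetical Kafka source topic requests and a JDBC sink; the connection details, table and topic names are illustrative. It shows how each micro-batch arrives as a plain DataFrame that any batch writer can handle:

import org.apache.spark.sql.{DataFrame, SparkSession}

object ForeachBatchExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("foreach-batch-example").getOrCreate()

    // Streaming source (requires the spark-sql-kafka-0-10 package)
    val streamingDF = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "requests")
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    // Defining the function as a value avoids the foreachBatch overload
    // ambiguity that inline lambdas can hit on Scala 2.12 / Spark 3.x
    val writeBatch: (DataFrame, Long) => Unit = (batchDF, batchId) => {
      // Each micro-batch is a plain DataFrame, so a batch-only sink such as
      // JDBC can be used even though streaming has no built-in JDBC sink
      batchDF.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://db:5432/results") // illustrative
        .option("dbtable", "exported_results")
        .option("user", "spark")
        .option("password", "secret")
        .mode("append")
        .save()
    }

    val query = streamingDF.writeStream
      .foreachBatch(writeBatch)
      .option("checkpointLocation", "/tmp/foreach-batch-checkpoint")
      .start()

    query.awaitTermination()
  }
}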

Need to load data from Hadoop to Druid after applying transformations. If I use Spark, can we load data from Spark RDD or dataframe to Druid directly?

I have data present in Hive tables. I want to apply a bunch of transformations before loading that data into Druid. There are ways, but I'm not sure about them.
1. Save that table after applying the transformations and then bulk load it through the Hadoop ingestion method. But I want to avoid the extra write on the server.
2. Use Tranquility. But it is for Spark Streaming and only for Scala and Java, not Python. Am I right about this?
Is there any other way I can achieve this?
You can achieve this by using the Druid Kafka integration.
I think you should read the data from the tables in Spark, apply the transformations, and then write it back to a Kafka stream, as sketched below.
Once you set up the Druid Kafka integration, it will read the data from Kafka and push it to the Druid datasource.
Here is the documentation about the Druid Kafka integration: https://druid.apache.org/docs/latest/tutorials/tutorial-kafka.html
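A minimal sketch of that flow, assuming a hypothetical Hive table db.page_views and a Kafka topic druid-ingest that a Druid Kafka ingestion supervisor is configured to consume; all table, column and topic names are illustrative:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, to_json, upper}

object HiveToDruidViaKafka {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-to-druid-via-kafka")
      .enableHiveSupport() // so spark.table can resolve Hive tables
      .getOrCreate()

    // Read from Hive and apply whatever transformations are needed
    val transformed = spark.table("db.page_views")
      .filter(col("event_date") === "2023-01-01")
      .withColumn("country", upper(col("country")))

    // Druid's Kafka indexing service can parse JSON rows, so serialize each
    // row as JSON into the record value (requires spark-sql-kafka-0-10)
    transformed
      .select(to_json(struct(transformed.columns.map(col): _*)).alias("value"))
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "druid-ingest")
      .save()
  }
}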
(Disclaimer: I am a contributor for rovio-ingest)
With rovio-ingest you can batch ingest a Hive table to Druid with Spark. This avoids the extra write.

Does the Spark Structured Streaming Kafka writer support writing data to a particular partition?

Does Spark Structured Streaming's Kafka writer support writing data to a particular partition? In the Spark Structured Streaming documentation, it is nowhere mentioned that writing data to a specific partition is not supported.
Also, I can't see an option to pass a "partition id" in the section
"Writing Data to Kafka"
If it is not supported, are there any future plans to support it, or reasons why it is not supported?
The keys determine which partition to write to; no, you can't hard-code a partition value within Spark's write methods.
Spark does allow you to configure kafka.partitioner.class, though, which lets you define the partition number based on the keys of the data (a sketch follows the quoted docs below).
Kafka’s own configurations can be set via DataStreamReader.option with kafka. prefix, e.g, stream.option("kafka.bootstrap.servers", "host:port"). For possible kafka parameters, see ... Kafka producer config docs for parameters related to writing data.
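A hedged sketch of that approach: a custom producer Partitioner (the class name and partitioning rule are illustrative) plus a streaming write that passes it through with the kafka. prefix. The partitioner class has to be on the executors' classpath:

package com.example

import java.util

import org.apache.kafka.clients.producer.Partitioner
import org.apache.kafka.common.Cluster
import org.apache.spark.sql.SparkSession

// Illustrative partitioner: routes records by the hash of the record key
class KeyRangePartitioner extends Partitioner {
  override def configure(configs: util.Map[String, _]): Unit = {}

  override def partition(topic: String, key: AnyRef, keyBytes: Array[Byte],
                         value: AnyRef, valueBytes: Array[Byte], cluster: Cluster): Int = {
    val numPartitions = cluster.partitionCountForTopic(topic)
    if (key == null) 0
    else (key.toString.hashCode & Integer.MAX_VALUE) % numPartitions
  }

  override def close(): Unit = {}
}

object PartitionedKafkaWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partitioned-kafka-write").getOrCreate()

    val query = spark.readStream
      .format("rate") // toy source, just to have a stream to write
      .option("rowsPerSecond", "10")
      .load()
      .selectExpr("CAST(value AS STRING) AS key", "CAST(timestamp AS STRING) AS value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      // Producer configs are passed through with the "kafka." prefix
      .option("kafka.partitioner.class", "com.example.KeyRangePartitioner")
      .option("topic", "events")
      .option("checkpointLocation", "/tmp/kafka-write-checkpoint")
      .start()

    query.awaitTermination()
  }
}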

Spark Streaming : source HBase

Is it possible to have a Spark Streaming job set up to keep track of an HBase table and read new/updated rows every batch? The blog here says that HDFS files come under supported sources. But they seem to be using the following static API:
sc.newAPIHadoopRDD(..)
I can't find any documentation around this. Is it possible to stream from HBase using a Spark streaming context? Any help is appreciated.
Thanks!
The link provided does the following:
It reads the streaming data, converts it into HBase Puts, and then adds them to the HBase table. Up to this point it's streaming, which means your ingestion process is streaming.
The stats-calculation part, I think, is batch; it uses newAPIHadoopRDD. This method treats the data-reading part like reading files. In this case the "files" are from HBase, which is the reason for the following input formats:
// conf is an HBaseConfiguration with TableInputFormat.INPUT_TABLE set to the table name
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
If you want to read the updates in HBase as a stream, then you need a handle on HBase's WAL (write-ahead logs) at the back end and perform your operations on that. hbase-indexer is a good place to start for reading any updates in HBase.
I have used hbase-indexer to read HBase updates at the back end and direct them to Solr as they arrive. Hope this helps.

Spark Streaming keeps processing even when there is no data in Kafka

I'm using Spark Streaming to consume data from Kafka with a code snippet like:
stream.foreachRDD { rdd => rdd.foreachPartition { ... } }
I'm using foreachPartition because I need to create a connection to HBase, and I don't want to open/close a connection for each record.
But I found that when there is no data in Kafka, Spark Streaming still executes foreachRDD and foreachPartition.
This caused many HBase connections to be created even though no data was consumed. I really don't like this. How can I make Spark stop doing this when no data is consumed from Kafka?
Simply check that there are items in the RDD before processing it. So your code could be (a fuller sketch follows below):
stream.foreachRDD { rdd => if (!rdd.isEmpty()) rdd.foreachPartition { ... } }
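A fuller sketch of that guard, assuming a hypothetical HBase table events with column family cf and a DStream of (rowKey, value) pairs; the table and column names are illustrative:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.dstream.DStream

object HBaseSink {
  def write(stream: DStream[(String, String)]): Unit = {
    stream.foreachRDD { rdd =>
      // Skip empty micro-batches so no HBase connections are opened needlessly
      if (!rdd.isEmpty()) {
        rdd.foreachPartition { partition =>
          // One connection per partition, not per record
          val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
          val table = connection.getTable(TableName.valueOf("events"))
          try {
            partition.foreach { case (rowKey, value) =>
              val put = new Put(Bytes.toBytes(rowKey))
              put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(value))
              table.put(put)
            }
          } finally {
            table.close()
            connection.close()
          }
        }
      }
    }
  }
}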
