I have a Spark Structured Streaming application that reads data from Kafka and writes to HBase and Kafka. During this process, after reading the data from Kafka, I would like to look up an HBase table. Can someone please suggest the options available, with sample code if you have any?
I have tried SHC (Spark HBase Connector), but using SHC to read the HBase table means iterating one data frame against another with a dynamic filter, which Spark doesn't seem to like.
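For context, here is a minimal sketch of the pipeline in question (assuming Spark 2.4+ for foreachBatch), with the HBase lookup done per micro-batch through the plain HBase client as a possible alternative to SHC. The broker address, topic, HBase table, column family and qualifier are all placeholder names, not a definitive implementation:

// Sketch: Kafka stream enriched with per-row HBase lookups inside foreachBatch.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.appName("kafka-hbase-lookup").getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
  .option("subscribe", "input-topic")                 // placeholder
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

stream.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.rdd.foreachPartition { rows =>
      // One HBase connection per partition, not per row.
      val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val table = conn.getTable(TableName.valueOf("lookup_table"))   // placeholder
      rows.foreach { row =>
        val result = table.get(new Get(Bytes.toBytes(row.getAs[String]("key"))))
        val looked = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
        // ... combine `looked` with the Kafka record here, then write to the
        // application's HBase/Kafka output sinks ...
      }
      table.close()
      conn.close()
    }
  }
  .start()
  .awaitTermination()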
Related
We have data (millions of rows) in Hive tables that arrives every day. The next day, once the overnight ingestion is complete, different applications query us for the data (using SQL).
We take this SQL and make a call on Spark:
spark.sqlContext.sql(statement) // hive-metastore integration is enabled
This causes too much memory usage on the Spark driver. Can we use Spark Streaming (or Structured Streaming) to stream the results in a piped fashion rather than collecting everything on the driver and then sending it to clients?
We don't want to send out the data as soon as it arrives (as in typical streaming apps), but to stream data to clients when they ask (pull) for it.
IIUC..
Spark Streaming is mainly designed to process streaming data by converting it into micro-batches on the order of milliseconds to seconds.
You can look at streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) => ... }, which gives you a very good way for Spark to write the processed streaming output to a sink in a micro-batch manner.
Nevertheless, Spark Structured Streaming doesn't have a standard JDBC source defined to read from.
Consider instead storing the underlying Hive files in a compressed, structured format and transferring them directly rather than selecting through spark.sql, if every client needs the same or similar data; or partition them based on the WHERE condition of the spark.sql query and transfer only the needed files.
Source:
Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees.
ForeachBatch:
foreachBatch(...) allows you to specify a function that is executed on the output data of every micro-batch of a streaming query. Since Spark 2.4, this is supported in Scala, Java and Python. It takes two parameters: a DataFrame or Dataset that has the output data of a micro-batch and the unique ID of the micro-batch.
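For what it's worth, here is a minimal, generic sketch of the foreachBatch pattern described above, writing each micro-batch with an ordinary batch writer (JDBC, which has no streaming sink). The rate source, JDBC URL and table name are placeholders I made up:

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.appName("foreachBatch-sketch").getOrCreate()

val streamingDF = spark.readStream
  .format("rate")                   // built-in test source; replace with the real source
  .option("rowsPerSecond", "10")
  .load()

streamingDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Any batch sink works here; JDBC is used because there is no streaming JDBC sink.
    // Driver class and credentials would be supplied via additional options.
    batchDF.write
      .mode("append")
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")   // placeholder
      .option("dbtable", "results")                      // placeholder
      .save()
  }
  .start()
  .awaitTermination()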
I have data in Hive tables. I want to apply a bunch of transformations before loading that data into Druid. There are a few ways to do this, but I'm not sure about them:
1. Save the table after applying the transformations and then bulk load it through the Hadoop ingestion method. But I want to avoid the extra write on the server.
2. Use Tranquility. But it is for Spark Streaming and only for Scala and Java, not Python. Am I right about this?
Is there any other way I can achieve this?
You can achieve this by using the Druid Kafka integration.
I think you should read the data from the tables in Spark, apply the transformations, and then write it back to a Kafka stream.
Once you set up the Druid Kafka integration, it will read the data from Kafka and push it to the Druid datasource.
Here is the documentation about the Druid Kafka integration: https://druid.apache.org/docs/latest/tutorials/tutorial-kafka.html
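A rough sketch of that read/transform/write-to-Kafka step; the Hive table, the transformation, the broker address and the topic are placeholders I made up:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, to_json}

val spark = SparkSession.builder
  .appName("hive-to-kafka-for-druid")
  .enableHiveSupport()
  .getOrCreate()

// Read from Hive and apply whatever transformations are needed.
val transformed = spark.sql("SELECT * FROM my_db.my_table")       // placeholder table
  .filter("event_date = current_date()")                          // placeholder transformation

// Serialize each row as JSON and write it to the topic Druid's Kafka ingestion reads.
transformed
  .select(to_json(struct(transformed.columns.map(col): _*)).alias("value"))
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")               // placeholder
  .option("topic", "druid-ingest")                                // placeholder
  .save()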
(Disclaimer: I am a contributor to rovio-ingest)
With rovio-ingest you can batch ingest a Hive table to Druid with Spark. This avoids the extra write.
I need to export data from Hive to Kafka topics based on some events in another Kafka topic. I know I can read data from Hive in a Spark job using HQL and write it to Kafka from Spark, but is there a better way?
This can be achieved with classic (non-structured) Spark Streaming, i.e. the DStream API. The steps are listed below, with a sketch after the list:
Create a Spark Streaming job which connects to the required topic and fetches the required data-export information.
From the stream, do a collect and get your data-export requirements into driver variables.
Create a data frame using the specified condition.
Write the data frame to the required topic using KafkaUtils.
Provide a polling interval based on your data volume and Kafka write throughput.
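A hedged sketch of those steps with the DStream API; the broker address, topic names and the Hive query built from the event are placeholders, the batch interval doubles as the polling interval, and the write uses the Kafka DataFrame sink rather than KafkaUtils (which only reads):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, to_json}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val spark = SparkSession.builder.appName("hive-export-on-event").enableHiveSupport().getOrCreate()
val ssc   = new StreamingContext(spark.sparkContext, Seconds(60))   // polling interval

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",                            // placeholder
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "hive-export")

KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("export-requests"), kafkaParams)
).foreachRDD { rdd =>
  // The export-request events are small, so collect them to the driver.
  rdd.map(_.value).collect().foreach { condition =>
    // Build the export data frame from Hive using the requested condition.
    val df = spark.sql(s"SELECT * FROM my_db.my_table WHERE $condition")   // placeholder
    // Write it to the output topic as JSON values.
    df.select(to_json(struct(df.columns.map(col): _*)).alias("value"))
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "export-output")                                    // placeholder
      .save()
  }
}

ssc.start()
ssc.awaitTermination()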
Typically, you do this the other way around (Kafka to HDFS/Hive).
But you are welcome to try using the Kafka Connect JDBC plugin to read from a Hive table on a scheduled basis, which converts the rows into structured key-value Kafka messages.
Otherwise, I would re-evaluate other tools, because Hive is slow. Couchbase or Cassandra offer much better CDC features for ingestion into Kafka. Or re-write the upstream applications that inserted into Hive in the first place so that they write immediately into Kafka, from which you can join with other topics, for example.
I am working on a Spark job which reads data from Hive and stores it in HBase for real-time access. The executor makes the connection with HBase; what is the right approach to insert the data into it? I have thought of the following two approaches.
Which one is more appropriate, or is there any other approach?
1. Write data directly from the Spark job to HBase.
2. Write data from Spark to HDFS and later move it to HBase.
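For reference, a minimal sketch of the first approach (direct writes from the Spark job with the HBase client, one connection per partition); the Hive table, HBase table, column family and qualifier are placeholders:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("hive-to-hbase").enableHiveSupport().getOrCreate()
val df    = spark.sql("SELECT rowkey, payload FROM my_db.my_table")   // placeholder

df.rdd.foreachPartition { rows =>
  // Open one HBase connection per partition on the executor.
  val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("my_hbase_table"))      // placeholder
  rows.foreach { row =>
    val put = new Put(Bytes.toBytes(row.getAs[String]("rowkey")))
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("payload"),
      Bytes.toBytes(row.getAs[String]("payload")))
    table.put(put)
  }
  table.close()
  conn.close()
}

The second approach typically means writing HFiles to HDFS and bulk-loading them, which shifts the write load off the RegionServers for very large batches.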
I am streaming data into Spark Structured Streaming 2.1.1 using Kafka, with a writeStream() that appends to Parquet. This works.
I can create a temporary table over the Parquet files using:
spark.read.parquet("/user/markteehan/interval24").registerTempTable("interval24")
However this is only visible in the same Spark session, and the read.parquet must be re-run to collect new data. Setting .queryName() for the writeStream doesn't create a table in the Hive metastore.
What is the best technique to run SQL dynamically on the parquet data?
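For context, a sketch of the setup described (Kafka to Parquet via writeStream, then re-reading the directory into a session-scoped temp view before querying); the broker, topic and checkpoint location are placeholders, and createOrReplaceTempView is the non-deprecated form of registerTempTable:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")                 // placeholder
  .option("subscribe", "interval24")                                // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("parquet")
  .option("path", "/user/markteehan/interval24")
  .option("checkpointLocation", "/user/markteehan/interval24-chk")  // placeholder
  .start()

// Each session (or refresh) must re-read the directory to pick up new files.
spark.read.parquet("/user/markteehan/interval24").createOrReplaceTempView("interval24")
spark.sql("SELECT count(*) FROM interval24").show()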