Does Spark Structured Streaming with Trigger.Once allow for a direct connection to Kafka and use of the MERGE statement? Or must the input data for this come from a Delta table?
This https://docs.databricks.com/_static/notebooks/merge-in-scd-type-2.html assumes tables as input. I cannot find an example of Kafka being used with Trigger.Once. OK, the weekend is coming and I will fire up this and that, but it is an interesting point that I would like to know about in advance.
Yes, it's possible to use Trigger.Once (or, better, the newer Trigger.AvailableNow) with Kafka, and then use foreachBatch to execute the MERGE.
The only thing that you need to take into account is that the data shouldn't expire from the Kafka topic (via its retention settings) between executions.
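A minimal sketch of that pattern (Scala), assuming a Delta target table named target, a topic named events, and a key column to match on; all names here are illustrative:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

val updates = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

updates.writeStream
  .trigger(Trigger.AvailableNow())              // or Trigger.Once() on older Spark versions
  .option("checkpointLocation", "/checkpoints/events_merge")
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    DeltaTable.forName("target").as("t")        // MERGE runs once per micro-batch
      .merge(batch.as("s"), "t.key = s.key")
      .whenMatched().updateAll()
      .whenNotMatched().insertAll()
      .execute()
  }
  .start()

The checkpoint tracks the Kafka offsets between runs, which is why the only real constraint is that the topic's retention outlasts the gap between executions.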
Related
I need to export data from Hive to Kafka topics based on some events in another Kafka topic. I know I can read data from Hive in a Spark job using HQL and write it to Kafka from Spark, but is there a better way?
This can be achieved using Spark Streaming (the DStream API). The steps are outlined below; a rough sketch follows the list.
Create a Spark Streaming job which connects to the required topic and fetches the required data-export information.
From the stream, do a collect and capture the data-export requirement in driver variables.
Create a DataFrame using the specified condition.
Write the DataFrame to the required topic using KafkaUtils.
Provide a polling interval based on your data volume and Kafka write throughput.
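A rough sketch of those steps (Scala, assuming a Hive-enabled SparkSession named spark; the topic names, Hive table, and broker address are placeholders, and the write uses the DataFrame Kafka sink rather than KafkaUtils):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val ssc = new StreamingContext(spark.sparkContext, Seconds(60))   // polling interval (step 5)

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "hive-export-driver")

// Steps 1-2: read the control topic and pull the export requests into driver-side variables
val control = KafkaUtils.createDirectStream[String, String](
  ssc, LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("export-requests"), kafkaParams))

control.foreachRDD { rdd =>
  val requests = rdd.map(_.value()).collect()     // control messages only, small enough for the driver
  requests.foreach { condition =>
    // Step 3: build a DataFrame from Hive using the condition carried by the event
    val df = spark.sql(s"SELECT * FROM mydb.mytable WHERE $condition")
    // Step 4: write the rows to the target topic as key/value messages
    df.selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "hive-export")
      .save()
  }
}

ssc.start()
ssc.awaitTermination()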
Typically, you do this the other way around (Kafka to HDFS/Hive).
But you are welcome to try using the Kafka Connect JDBC plugin to read from a Hive table on a scheduled basis, which converts the rows into structured key-value Kafka messages.
Otherwise, I would re-evaluate other tools because Hive is slow. Couchbase or Cassandra offer much better CDC features for ingestion into Kafka. Or rewrite the upstream applications that inserted into Hive in the first place so they write immediately into Kafka, from which you can join with other topics, for example.
With DStreams, from the official documentation:
Queue of RDDs as a Stream: For testing a Spark Streaming application with test data, one can also create a DStream based on a queue of RDDs, using streamingContext.queueStream(queueOfRDDs). Each RDD pushed into the queue will be treated as a batch of data in the DStream, and processed like a stream.
So, for Structured Streaming, can I or can I not use QueueStream as input?
I am not able to find anything in the Structured Streaming Guide for 2.3 or 2.4.
I do note MemoryStream. Is this the way to go? I think so, and if so, why would QueueStream not be an option anymore?
I have converted QueueStreams to MemoryStream as input and it works fine, but is that what is required?
My understanding is that for Structured Streaming you cannot use QueueStream, as it is a DStream construct.
Simulating streaming input with Structured Streaming does work with MemoryStream.
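For reference, a minimal sketch of the MemoryStream approach (Scala; note that MemoryStream lives in an internal package, org.apache.spark.sql.execution.streaming, and is mainly intended for tests):

import org.apache.spark.sql.execution.streaming.MemoryStream
import spark.implicits._

implicit val sqlCtx = spark.sqlContext        // MemoryStream needs an implicit SQLContext
val input = MemoryStream[String]

val query = input.toDF()
  .writeStream
  .format("memory")                           // in-memory sink, handy for assertions in tests
  .queryName("test_output")
  .outputMode("append")
  .start()

input.addData("a", "b", "c")                  // each addData call becomes a micro-batch
query.processAllAvailable()
spark.table("test_output").show()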
I am now using Kafka in Python.
I was wondering whether Spark's Kafka integration is needed, or whether we can just use Kafka through pykafka.
My concern is that Spark (PySpark) adds overhead to the process, and if we don't use any Spark functions, only Kafka streaming is required.
What are the drawbacks of using PySpark and Spark's Kafka integration?
It totally depends on the use case at hand, as mentioned in the comments. However, I went through the same situation a couple of months ago, so I will try to share what I learned and how I decided to move to Kafka Streams instead of Spark Streaming.
In my use case, we only used Spark to do real-time streaming from Kafka, and we did not do any sort of map-reduce, windowing, filtering, or aggregation.
Given the above case, I did the comparison based on three dimensions:
Technicality
DevOps
Cost
The image below shows the comparison table I put together to convince my team to migrate to Kafka Streams and drop Spark. Cost is not included in the image, as it depends entirely on your cluster size (head node and worker nodes).
Very important note:
Again, this depends on your case; I just tried to give you a pointer on how to do the comparison, but Spark itself has lots of benefits, which are not relevant to describe in this question.
Please forgive me if this question doesn't make sense, as I am just starting out with Spark and trying to understand it.
From what I've read, Spark is a good fit for doing real-time analytics on streaming data, which can then be pushed to a downstream sink such as HDFS/Hive/HBase, etc.
I have two questions about that. I am not clear on whether there is only one Spark streaming job running or multiple at any given time. Say I have different analytics I need to perform for each topic from Kafka, or for each source that is streaming into Kafka, and then push the results of those downstream.
Does Spark allow you to run multiple streaming jobs in parallel so you can keep aggregate analytics separate for each stream, or in this case each Kafka topic? If so, how is that done? Is there any documentation you could point me to?
Just to be clear, my use case is to stream from different sources, and each source could have potentially different analytics I need to perform as well as different data structure. I want to be able to have multiple Kafka topics and partitions. I understand each Kafka partition maps to a Spark partition, and it can be parallelized.
I am not sure how you run multiple Spark streaming jobs in parallel though, to be able to read from multiple Kafka topics, and tabulate separate analytics on those topics/streams.
If not Spark, is this something that's possible to do in Flink?
Second, how does one get started with Spark? It seems there is a company and/or distro to choose for each component: Confluent-Kafka, Databricks-Spark, Hadoop-HW/CDH/MAPR. Does one really need all of these, or what is the minimal and easiest way to get going with a big data pipeline while limiting the number of vendors? It seems like such a huge task to even start on a POC.
You have asked multiple questions so I'll address each one separately.
Does Spark allow you to run multiple streaming jobs in parallel?
Yes
Is there any documentation on Spark Streaming with Kafka?
https://spark.apache.org/docs/latest/streaming-kafka-integration.html
How does one get started?
a. Book: https://www.amazon.com/Learning-Spark-Lightning-Fast-Data-Analysis/dp/1449358624/
b. Easy way to run/learn Spark: https://community.cloud.databricks.com
I agree with Akbar and John that we can run multiple streams reading from different sources in parallel.
I would like to add that if you want to share data between streams, you can use the Spark SQL API. You can register your RDD as a SQL table and access the same table in all the streams. This is possible since all the streams share the same SparkContext.
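A small sketch of that pattern with Structured Streaming (Scala; the topic names, paths, and join column id are made up): each topic gets its own query, all running on the same SparkSession, and a reference table registered as a temp view is visible to every query.

// One SparkSession, several independent streaming queries
val dims = spark.read.parquet("/data/dims")            // shared reference data
dims.createOrReplaceTempView("dims")                   // registered once, usable by all queries

def startQuery(topic: String, outPath: String) =
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", topic)
    .load()
    .selectExpr("CAST(value AS STRING) AS id")
    .join(spark.table("dims"), "id")                   // the same shared table in every stream
    .writeStream
    .format("parquet")
    .option("path", outPath)
    .option("checkpointLocation", s"/checkpoints/$topic")
    .start()

val orders = startQuery("orders", "/out/orders")       // separate analytics per topic
val clicks = startQuery("clicks", "/out/clicks")
spark.streams.awaitAnyTermination()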
Is it possible to set up a Spark Streaming job that keeps track of an HBase table and reads new/updated rows every batch? The blog here says that HDFS files come under supported sources, but it seems to be using the following static API:
sc.newAPIHadoopRDD(..)
I can't find any documentation around this. Is it possible to stream from HBase using a Spark streaming context? Any help is appreciated.
Thanks!
The link provided does the following:
Read the streaming data, convert it into an HBase Put, and then add it to the HBase table. Up to that point it is streaming, which means your ingestion process is streaming. (A rough sketch of this put-based ingestion is shown after the snippet below.)
The stats calculation part, I think, is batch; it uses newAPIHadoopRDD. This method treats the data-reading part as reading files. In this case, the files come from HBase, which is the reason for the following input formats:
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
// conf is an HBaseConfiguration with TableInputFormat.INPUT_TABLE set to the source table
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
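For completeness, a rough sketch of the put-based ingestion side described above (Scala, DStream API; stream is assumed to be a Kafka direct stream of ConsumerRecords, and the table and column family names are made up):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // One HBase connection per partition, reused for all records in it
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("events"))
    records.foreach { rec =>
      val put = new Put(Bytes.toBytes(rec.key()))      // row key taken from the message key
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(rec.value()))
      table.put(put)
    }
    table.close()
    conn.close()
  }
}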
If you want to read the updates in HBase as a stream, then you need a handle on HBase's WAL (write-ahead log) at the back end and then perform your operations on it. HBase-indexer is a good place to start for reading any updates in HBase.
I have used hbase-indexer to read HBase updates at the back end and direct them to Solr as they arrive. Hope this helps.