I'm reading through this blog post:
http://blog.jaceklaskowski.pl/2015/07/20/real-time-data-processing-using-apache-kafka-and-spark-streaming.html
It discusses using Spark Streaming and Apache Kafka to do some near-real-time processing. I completely understand the article. It does show how I could use Spark Streaming to read messages from a topic. I would like to know if there is a Spark Streaming API that I can use to write messages into a Kafka topic?
My use case is pretty simple. I have a set of data that I can read from a given source at a constant interval (say every second). I do this using reactive streams. I would like to do some analytics on this data using Spark. I want to have fault tolerance, so Kafka comes into play. What I would essentially do is the following (please correct me if I'm wrong):
Using reactive streams, get the data from the external source at constant intervals
Pipe the result into a Kafka topic
Using Spark Streaming, create the streaming context for the consumer
Perform analytics on the consumed data
One other question though: is the Streaming API in Spark an implementation of the Reactive Streams specification? Does it have back-pressure handling (Spark Streaming v1.5)?
No, at the moment none of Spark Streaming's built-in receiver APIs is an implementation of the Reactive Streams specification. But there is an issue for that which you will want to follow.
That said, Spark Streaming 1.5 does have internal back-pressure-based dynamic throttling. There is some work to extend that beyond throttling in the pipeline. This throttling is compatible with the Kafka direct stream API.
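For example, here is a minimal sketch of turning that throttling on from PySpark (the configuration keys are real Spark 1.5+ settings; the application skeleton around them is just illustrative):

# Minimal sketch: enable the back-pressure-based rate control in a
# Spark Streaming app (settings are real; the app skeleton is assumed).
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = (SparkConf()
        .setAppName("BackpressureDemo")
        .set("spark.streaming.backpressure.enabled", "true")
        # optional hard cap per Kafka partition while the rate estimator warms up
        .set("spark.streaming.kafka.maxRatePerPartition", "1000"))

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, batchDuration=5)  # 5-second batches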
You can write to Kafka in a Spark Streaming application; here's one example.
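Not the linked example, but a rough sketch of the same idea in PySpark, assuming the kafka-python client is available on the executors and a hypothetical 'output_topic':

from kafka import KafkaProducer

def send_partition(records):
    # one producer per partition, so the (non-serializable) producer never leaves the executor
    producer = KafkaProducer(bootstrap_servers='192.X.X.X:9092')
    for record in records:
        producer.send('output_topic', value=str(record).encode('utf-8'))
    producer.flush()
    producer.close()

# 'processed' stands in for the DStream produced by your analytics step
processed.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))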
(Full disclosure: I'm one of the implementers of some of the back-pressure work)
If you have to write the result stream to another Kafka topic, let's say 'topic_x', you must first have columns named 'key' and 'value' in the result stream that you are trying to write to topic_x.
result_stream = result_stream.selectExpr('CAST(key AS STRING)', 'CAST(value AS STRING)')

kafkaOutput = result_stream \
    .writeStream \
    .format('kafka') \
    .option('kafka.bootstrap.servers', '192.X.X.X:9092') \
    .option('topic', 'topic_x') \
    .option('checkpointLocation', './resultCheckpoint') \
    .start()

kafkaOutput.awaitTermination()
For more details check the documentation at https://spark.apache.org/docs/2.4.1/structured-streaming-kafka-integration.html
I'm looking for a package, or a previous implementation, of using Redshift as the source for a Structured Streaming dataframe.
spark.readStream \
.format("io.github.spark_redshift_community.spark.redshift") \
.option('url', redshift_url) \
.option('forward_spark_s3_credentials', 'true') \
.load()
Using the format above, you get errors on the read, such as:
Data source io.github.spark_redshift_community.spark.redshift does not support streamed reading
This is the same error if you downgrade from Spark 3 and use: com.databricks.spark.redshift
Is there a known workaround, or a methodology/pattern I can use, to implement (in PySpark) Redshift as a readStream data source?
As the error says, this library does not support streaming reads/writes to/from Redshift.
The same can be confirmed from the project source at the link. The format does not extend or implement the micro-batch/continuous stream readers and writers.
There is no easy way to get true streaming out of this. You may explore the following avenues:
Explore third-party libs. Search for "JDBC streaming spark". Disclaimer: I have not used these and thus do not endorse them.
Create a micro-batching strategy on top of a custom checkpointing mechanism (a rough sketch follows after the note below).
Extended note: AFAIK, Spark's JDBC interfaces do not support Structured Streaming.
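As a rough sketch of that micro-batching idea (everything here is an assumption: the 'updated_at' column, the file-based checkpoint, the S3 tempdir, and the process() function are hypothetical; 'spark' is an existing SparkSession and 'redshift_url' is as in the question above):

import json, os, time

CHECKPOINT_FILE = 'redshift_offset.json'          # hypothetical checkpoint location

def load_offset():
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)['last_ts']
    return '1970-01-01 00:00:00'

def save_offset(ts):
    with open(CHECKPOINT_FILE, 'w') as f:
        json.dump({'last_ts': ts}, f)

while True:
    last_ts = load_offset()
    batch = (spark.read                            # plain batch read, not readStream
             .format("io.github.spark_redshift_community.spark.redshift")
             .option("url", redshift_url)
             .option("tempdir", "s3a://my-bucket/tmp/")   # S3 staging dir the connector needs
             .option("forward_spark_s3_credentials", "true")
             .option("query", f"SELECT * FROM my_table WHERE updated_at > '{last_ts}'")
             .load())
    if batch.head(1):                              # only act on non-empty batches
        process(batch)                             # your downstream logic
        new_ts = batch.agg({"updated_at": "max"}).collect()[0][0]
        save_offset(str(new_ts))
    time.sleep(60)                                 # poll interval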
I need to export data from Hive to Kafka topics based on some events in another Kafka topic. I know I can read data from Hive in a Spark job using HQL and write it to Kafka from Spark, but is there a better way?
This can be achieved using classic (non-Structured) Spark Streaming. The steps are mentioned below, with a rough sketch after them:
Create a Spark Streaming job which connects to the required topic and fetches the required data-export information.
From the stream, do a collect and get your data-export requirement into driver variables.
Create a data frame using the specified condition.
Write the data frame into the required topic using the 'kafka' DataFrame writer.
Provide a polling interval based on your data volume and Kafka write throughput.
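A hedged, batch-flavoured sketch of the steps above (broker address, topic names, the Hive table, and the 'id' column are all hypothetical; the original answer runs this inside a streaming job with a polling interval, which is collapsed here for brevity):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# 1-2. Read the control topic and collect the export requests to the driver
requests = (spark.read.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "export_requests")
            .load()
            .selectExpr("CAST(value AS STRING) AS request")
            .collect())

for row in requests:
    # 3. Build a data frame from Hive using the requested condition
    df = spark.sql(f"SELECT * FROM my_hive_table WHERE {row.request}")

    # 4. Write the data frame to the output topic as key/value pairs
    (df.selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
       .write.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("topic", "hive_export")
       .save())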
Typically, you do this the other way around (Kafka to HDFS/Hive).
But you are welcome to try using the Kafka Connect JDBC plugin to read from a Hive table on a scheduled basis, which converts the rows into structured key-value Kafka messages.
Otherwise, I would re-evaluate other tools, because Hive is slow. Couchbase or Cassandra offer much better CDC features for ingestion into Kafka. Or rewrite the upstream applications that inserted into Hive in the first place to write immediately into Kafka instead, from which you can join with other topics, for example.
With DStreams, from the official documentation:
Queue of RDDs as a Stream: For testing a Spark Streaming application with test data, one can also create a DStream based on a queue of RDDs, using streamingContext.queueStream(queueOfRDDs). Each RDD pushed into the queue will be treated as a batch of data in the DStream, and processed like a stream.
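In DStream terms that looks like this (a minimal PySpark sketch):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "QueueStreamDemo")
ssc = StreamingContext(sc, batchDuration=1)

# each RDD pushed into the queue becomes one micro-batch of the DStream
rdd_queue = [sc.parallelize(range(i * 10, (i + 1) * 10)) for i in range(3)]

ssc.queueStream(rdd_queue).map(lambda x: x * 2).pprint()

ssc.start()
ssc.awaitTerminationOrTimeout(5)
ssc.stop()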
So, for Structured Streaming, can I or can I not use QueueStream as input?
I am not able to find anything in the Structured Streaming Guide 2.3 or 2.4.
I do note MemoryStream. Is this the way to go? I think so, and if so, why would QueueStream no longer be an option?
I have converted QueueStreams to MemoryStream as input and it works fine, but is that what is required?
My understanding is that for Structured Streaming I cannot use QueueStream, as it is a DStream construct.
Simulating streaming input with Structured Streaming does work with MemoryStream.
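From PySpark, where (as far as I know) MemoryStream is not exposed, the built-in rate source is a hedged alternative for simulating streaming input in tests:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SimulatedInput").getOrCreate()

# the rate source emits rows with a 'timestamp' and an incrementing 'value'
simulated = (spark.readStream
             .format("rate")
             .option("rowsPerSecond", 5)
             .load())

query = (simulated.selectExpr("value * 2 AS doubled")
         .writeStream
         .format("console")
         .outputMode("append")
         .start())

query.awaitTermination(10)   # run briefly, as you would in a test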
Currently, I am working on a use case which requires reading JSON messages from Kafka and processing them in Spark via Spark Streaming. We are expecting around 35 million records per day. With this kind of load, is it preferable to move the parsing logic (and some filtering logic based on JValue) to Kafka using a custom Kafka deserializer (extending the org.apache.kafka.common.serialization.Deserializer class)? Will this have any performance overhead?
Thank you.
I have to design a Spark Streaming application with the below use case. I am looking for the best possible approach for this.
I have an application which pushes data into 1000+ different topics, each with a different purpose. Spark Streaming will receive data from each topic and, after processing, will write it back to a corresponding output topic.
Ex.
Input Type 1 Topic --> Spark Streaming --> Output Type 1 Topic
Input Type 2 Topic --> Spark Streaming --> Output Type 2 Topic
Input Type 3 Topic --> Spark Streaming --> Output Type 3 Topic
.
.
.
Input Type N Topic --> Spark Streaming --> Output Type N Topic and so on.
I need to answer the following questions.
Is it a good idea to launch 1000+ Spark Streaming applications on a per-topic basis? Or should I have one streaming application for all topics, since the processing logic is going to be the same?
If one streaming context, then how will I determine which RDD belongs to which Kafka topic, so that after processing I can write it back to its corresponding output topic?
The client may add/delete topics from Kafka; how do I handle this dynamically in Spark Streaming?
How do I restart the job automatically on failure?
Any other issues you guys see here?
Highly appreciate your response.
1000 different Spark applications will not be maintainable; imagine deploying or upgrading each one.
You will have to use the recommended "direct approach" instead of the receiver-based approach; otherwise your application is going to use more than 1000 cores just for receiving. If you don't have more than that, it will be able to receive data from your Kafka topics but not to process it. From the Spark Streaming doc:
Note that, if you want to receive multiple streams of data in parallel in your streaming application, you can create multiple input DStreams (discussed further in the Performance Tuning section). This will create multiple receivers which will simultaneously receive multiple data streams. But note that a Spark worker/executor is a long-running task, hence it occupies one of the cores allocated to the Spark Streaming application.
You can see in the Kafka integration doc (there is one for Kafka 0.8 and one for 0.10) how to find out which topic a message belongs to.
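For the Kafka 0.8 direct API in PySpark, the pattern from that guide looks roughly like this (broker address and topic names are assumed; 'ssc' is an existing StreamingContext):

from pyspark.streaming.kafka import KafkaUtils

stream = KafkaUtils.createDirectStream(
    ssc,
    ["input_type_1", "input_type_2"],              # one stream, many topics
    {"metadata.broker.list": "broker:9092"})

offset_ranges = []

def store_offset_ranges(rdd):
    global offset_ranges
    offset_ranges = rdd.offsetRanges()             # must be read off the Kafka RDD itself
    return rdd

def route_by_topic(rdd):
    for o in offset_ranges:
        print(o.topic, o.partition, o.fromOffset, o.untilOffset)
        # here you would filter the records and write them to the matching output topic

stream.transform(store_offset_ranges).foreachRDD(route_by_topic)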
If a client adds new topics or partitions, you will need to update your Spark Streaming topics configuration and redeploy the application. If you use Kafka 0.10 you can also use a regex for topic names, see Consumer Strategies. I've experienced reading from a deleted topic in Kafka 0.8, and there were no problems, but still verify ("trust, but verify").
See Spark Streaming's doc about fault tolerance, and also use the --supervise mode when submitting your application to your cluster; see the Deploying documentation for more information.
To achieve exactly-once semantics, I suggest this GitHub repo from Spark Streaming's main committer: https://github.com/koeninger/kafka-exactly-once
Bonus: a good, similar StackOverflow post: Spark: processing multiple kafka topic in parallel
Bonus 2: watch out for the soon-to-be-released Spark 2.2 and its Structured Streaming component