I have a scenario where different types of messages are streamed from a Kafka producer.
If I don't want to use a different topic per message type, how do I handle it on the Spark Structured Streaming consumer side?
i.e. I want to use only one topic for different types of messages, say Student records, Customer records, etc.
How do I identify which type of message has been received from the Kafka topic?
Please let me know how to handle this scenario on the Kafka consumer side.
Kafka topics don't inherently have "types of data". It's all bytes, so yes, you can serialize completely separate objects into the same topic, but consumers must then add logic to handle every type that can appear in the topic.
That being said, Structured Streaming is built on the idea of structured data with a schema, so it likely will not work if you have completely different types in the same topic without at least performing a filter first, based on some inner attribute that is present in all types.
Yes, you can do this by adding "some attribute" to the message itself when producing, which signifies a logical topic or operation, and then differentiating on the Spark side (e.g. with the Structured Streaming Kafka integration): check the message content for that attribute and process accordingly.
Partitioning, of course, is still what gives you ordering.
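A minimal sketch of that dispatch logic in plain Python, assuming JSON messages with a hypothetical "type" field (in Spark you would express the same filter with get_json_object or from_json):

```python
import json

def route(raw_value: bytes):
    """Dispatch a raw Kafka value based on an embedded 'type' attribute."""
    record = json.loads(raw_value)
    kind = record.get("type")
    if kind == "student":
        return ("students", record)   # hand off to the Student pipeline
    if kind == "customer":
        return ("customers", record)  # hand off to the Customer pipeline
    return ("unknown", record)        # dead-letter anything unexpected

pipeline, record = route(b'{"type": "student", "name": "Ada"}')
```

The key point is that the discriminating attribute must be present in every message type, or there is nothing stable to filter on.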
Related
What should be done in a Spark Structured Streaming job so that it can read a multi-event Kafka topic?
I am trying to read a topic that has multiple types of events, and every event may have a different schema. How does a streaming job determine the type of an event, or which schema to use for it?
Kafka DataFrames are always bytes; you use a UDF to deserialize the key/value columns.
For example, assuming the data is JSON, you first cast the bytes to a string, then you can use get_json_object to extract/filter specific fields.
If the data is in another format, you could use Kafka record headers added by the producer (you'll need to add those in your own producer code) to designate the event type of each record, then filter based on those and add logic for processing the different sub-dataframes. Or you could wrap the binary data in a more consistent schema such as the CloudEvents spec, which includes a type field and nested binary content that needs further deserialization.
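To illustrate the envelope idea in plain Python (the field names here are simplified stand-ins inspired by CloudEvents, not the exact spec):

```python
import base64
import json

def unwrap(raw: bytes):
    """Parse an envelope with a stable outer schema: a 'type' field plus
    base64-encoded binary content that still needs its own deserialization."""
    envelope = json.loads(raw)
    payload = base64.b64decode(envelope["data_base64"])
    return envelope["type"], payload

event = json.dumps({
    "type": "com.example.student.created",
    "data_base64": base64.b64encode(b"\x00\x01student-bytes").decode(),
}).encode()

event_type, payload = unwrap(event)
```

Because the outer schema is the same for every event, Structured Streaming can parse it uniformly and branch only for the inner payload.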
I am trying to read two Kafka topics using the Cassandra sink connector and insert into two Cassandra tables. How can I go about doing this?
This is my connector.properties file:
name=cassandra-sink-orders
connector.class=com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector
tasks.max=1
topics=topic1,topic2
connect.cassandra.kcql=INSERT INTO ks.table1 SELECT * FROM topic1;INSERT INTO ks.table2 SELECT * FROM topic2
connect.cassandra.contact.points=localhost
connect.cassandra.port=9042
connect.cassandra.key.space=ks
connect.cassandra.username=cassandra
connect.cassandra.password=cassandra
Am I doing everything right? Is this the best way of doing this or should I create two separate connectors?
There's one suggestion about your config: for parallel processing you want up to one task per topic-partition. So if each of your topics has one partition, set tasks.max to at least 2.
I don't see this documented in the Connect docs, which is a shame.
If you want to consume those two topics in one connector, that's fine and a correct setup. Whether it's the best way depends on whether those messages should be consumed by one or two consumers, so it depends on your business logic.
Anyway, consuming two topics via one consumer should work fine, since a consumer can subscribe to multiple topics. Did you try running this connector? Is it working?
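For comparison, splitting into two separate connectors would just mean two configs like the following (hypothetical name; the second config would mirror it with topic2/table2). The single-connector KCQL setup above achieves the same result with less to manage:

```
name=cassandra-sink-orders-topic1
connector.class=com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector
tasks.max=1
topics=topic1
connect.cassandra.kcql=INSERT INTO ks.table1 SELECT * FROM topic1
connect.cassandra.contact.points=localhost
connect.cassandra.port=9042
connect.cassandra.key.space=ks
connect.cassandra.username=cassandra
connect.cassandra.password=cassandra
```

Separate connectors can make sense if the two pipelines need independent lifecycles (pause, reconfigure, or scale one without touching the other).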
Is it possible to dynamically update topics list in spark-kafka consumer?
I have a Spark Streaming application which uses the spark-kafka consumer.
Say initially my spark-kafka consumer is listening to the topics ["test"], and after a while my topic list gets updated to ["test","testNew"]. Is there a way to update the spark-kafka consumer's topic list and have it consume data for the updated list, without stopping the Spark Streaming application or the StreamingContext?
Is it possible to dynamically update topics list in spark-kafka consumer
No. Both the receiver-based and receiver-less approaches are fixed once you initialize the Kafka stream using KafkaUtils. There is no way for you to pass in new topics as you go, since the DAG is fixed.
If you want to read dynamically, perhaps consider a batch job which is scheduled iteratively, reads the topic list on each run, and creates an RDD out of that.
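A rough shape of that iterative approach in plain Python (fetch_batch is a hypothetical stand-in for the actual Kafka read that produces the RDD):

```python
def run_iteration(topic_source, fetch_batch):
    """One scheduled run: re-resolve the topic list, then read a batch per topic.
    Because the list is re-read every run, new topics are picked up automatically."""
    return {topic: fetch_batch(topic) for topic in topic_source()}

# Simulated runs: the topic list grows between iterations.
current_topics = ["test"]
first = run_iteration(lambda: list(current_topics), lambda t: [f"{t}-record"])
current_topics.append("testNew")
second = run_iteration(lambda: list(current_topics), lambda t: [f"{t}-record"])
```

The trade-off versus a long-running stream is that you manage scheduling and offset bookkeeping yourself between runs.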
An additional solution would be to use a technology that gives you more flexibility over the consumption, such as Akka Streams.
As Yuval said, it isn't possible, but there might be a workaround if you know the structure/format of the data you are dealing with from Kafka.
For example,
If your streaming application is listening to the topics ["test","testNew"], and down the line you want to add a new topic named "test4", then as a workaround you can simply add a unique key to that topic's data and send it through one of the existing topics.
Design your streaming application to recognize/filter the data based on the key you added to that "test4" data.
You can use a thread-based approach:
1. Define a cache, using any data structure, which contains the list of topics.
2. Provide a way to add elements to this cache.
3. Have two classes, A and B, where B holds all the Spark-related logic.
4. Class A is a long-running job; from A you call B, and whenever there is a new topic you just spawn a new thread running B.
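A minimal Python sketch of those four steps, with plain threads standing in for the Spark work (all names here are hypothetical):

```python
import queue
import threading

class TopicCache:
    """Steps 1 and 2: a thread-safe store of topics plus a way to add to it."""
    def __init__(self):
        self._pending = queue.Queue()

    def add(self, topic):
        self._pending.put(topic)

    def next_topic(self, timeout=5.0):
        return self._pending.get(timeout=timeout)

def consume(topic, results):
    """Stands in for class B: the per-topic Spark logic."""
    results.append(f"consuming {topic}")

def supervisor(cache, results, expected_topics):
    """Stands in for class A: a long-running loop that spawns a new
    thread running B whenever a new topic appears in the cache."""
    threads = []
    for _ in range(expected_topics):
        topic = cache.next_topic()
        worker = threading.Thread(target=consume, args=(topic, results))
        worker.start()
        threads.append(worker)
    for worker in threads:
        worker.join()

cache = TopicCache()
results = []
cache.add("test")
cache.add("testNew")
supervisor(cache, results, expected_topics=2)
```

In a real deployment the supervisor loop would run indefinitely rather than for a fixed count, and each thread would own its own StreamingContext or job submission.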
I'd suggest trying ConsumerStrategies.SubscribePattern from the latest Spark-Kafka integration (0.10) API version.
That would look like:
KafkaUtils.createDirectStream(
mySparkStreamingContext,
PreferConsistent,
SubscribePattern("test.*".r.pattern, myKafkaParamsMap))
We have been using spark streaming with kafka for a while and until now we were using the createStream method from KafkaUtils.
We just started exploring the createDirectStream and like it for two reasons:
1) Better/easier "exactly once" semantics
2) Better correlation of kafka topic partition to rdd partitions
I did notice that createDirectStream is marked as experimental. The question I have is (sorry if this is not very specific):
Should we explore the createDirectStream method if exactly-once is very important to us? It would be awesome if you could share your experience with it. Are we running the risk of having to deal with other issues, such as reliability?
There is a great, extensive blog post by the creator of the direct approach (Cody) here.
In general, reading the Kafka delivery semantics section, the last part says:
So effectively Kafka guarantees at-least-once delivery by default and
allows the user to implement at most once delivery by disabling
retries on the producer and committing its offset prior to processing
a batch of messages. Exactly-once delivery requires co-operation with
the destination storage system but Kafka provides the offset which
makes implementing this straight-forward.
This basically means "we give you at least once out of the box, if you want exactly once, that's on you". Further, the blog post talks about the guarantee of "exactly once" semantics you get from Spark with both approaches (direct and receiver based, emphasis mine):
Second, understand that Spark does not guarantee exactly-once
semantics for output actions. When the Spark streaming guide talks
about exactly-once, it’s only referring to a given item in an RDD
being included in a calculated value once, in a purely functional
sense. Any side-effecting output operations (i.e. anything you do in
foreachRDD to save the result) may be repeated, because any stage of
the process might fail and be retried.
Also, this is what the Spark documentation says about receiver based processing:
The first approach (Receiver based) uses Kafka’s high level API to store consumed
offsets in Zookeeper. This is traditionally the way to consume data
from Kafka. While this approach (in combination with write ahead logs)
can ensure zero data loss (i.e. at-least once semantics), there is a
small chance some records may get consumed twice under some failures.
This basically means that if you're using the receiver-based stream with Spark, you may still have duplicated data in case the output transformation fails; it is at-least-once.
In my project I use the direct stream approach, where the delivery semantics depend on how you handle them. This means that if you want to ensure exactly-once semantics, you can store the offsets along with the data in a transaction-like fashion: if one fails, the other fails as well.
I recommend reading the blog post (link above) and the Delivery Semantics section of the Kafka documentation. To conclude, I definitely suggest you look into the direct stream approach.
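A toy illustration of that transaction-like pattern, using SQLite as a stand-in for the destination store (table and column names are made up for the example):

```python
import sqlite3

def process_exactly_once(conn, value, offset):
    """Write the result and its Kafka offset in one transaction: if either
    statement fails, both roll back, so a retried batch can't double-count."""
    with conn:  # the sqlite3 connection context manager commits or rolls back atomically
        conn.execute("INSERT INTO results(value) VALUES (?)", (value,))
        conn.execute("UPDATE offsets SET last_offset = ? WHERE id = 0", (offset,))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results(value TEXT)")
conn.execute("CREATE TABLE offsets(id INTEGER PRIMARY KEY, last_offset INTEGER)")
conn.execute("INSERT INTO offsets VALUES (0, -1)")

process_exactly_once(conn, "record-at-offset-0", 0)
```

On restart, the job reads last_offset back from the store and resumes from there, which is what makes reprocessing after a failure idempotent.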
I would like to write to Kafka from Spark streaming data.
I know that I can use KafkaUtils to read from Kafka, but KafkaUtils doesn't provide an API to write to Kafka.
I checked past questions and sample code.
Is the above sample code the simplest way to write to Kafka? If I adopt an approach like that sample, I must create many classes...
Do you know a simpler way, or a library that helps with writing to Kafka?
Have a look here:
Basically, this blog post summarises your possibilities, which are written up in different variations in the link you provided.
Looking at your task directly, we can make several assumptions:
Your output data is divided to several partitions, which may (and quite often will) reside on different machines
You want to send the messages to Kafka using standard Kafka Producer API
You don't want to pass data between machines before the actual sending to Kafka
Given those assumptions, your set of solutions is pretty limited: you either have to create a new Kafka producer for each partition and use it to send all the records of that partition, or you can wrap this logic in some sort of factory/sink, but the essential operation remains the same: you'll still request a producer object for each partition and use it to send that partition's records.
I suggest you continue with one of the examples in the provided link; the code is pretty short, and any library you find would most probably do the exact same thing behind the scenes.
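The producer-per-partition idea can be sketched in plain Python like this (FakeProducer is a stand-in for a real Kafka producer; the send_partition body is the kind of thing you'd run inside foreachPartition):

```python
class ProducerFactory:
    """Lazily create and cache one producer per partition, so all records
    of a partition are sent through the same producer instance."""
    def __init__(self, make_producer):
        self._make = make_producer  # stands in for constructing a real Kafka producer
        self._cache = {}

    def get(self, partition_id):
        if partition_id not in self._cache:
            self._cache[partition_id] = self._make()
        return self._cache[partition_id]

class FakeProducer:
    """Test double for a Kafka producer: records what would be sent."""
    def __init__(self):
        self.sent = []

    def send(self, topic, value):
        self.sent.append((topic, value))

def send_partition(partition_id, records, factory, topic="out"):
    """The per-partition work: request one producer, send every record through it."""
    producer = factory.get(partition_id)
    for record in records:
        producer.send(topic, record)

factory = ProducerFactory(FakeProducer)
send_partition(0, ["a", "b"], factory)
send_partition(0, ["c"], factory)  # reuses the cached producer for partition 0
```

Caching the producer matters because constructing one per record (or per batch) is expensive; any wrapper library you find is doing essentially this under the hood.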