I am trying to read 2 Kafka topics using the Cassandra sink connector and insert into 2 Cassandra tables. How can I go about doing this?
This is my connector.properties file:
name=cassandra-sink-orders
connector.class=com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector
tasks.max=1
topics=topic1,topic2
connect.cassandra.kcql=INSERT INTO ks.table1 SELECT * FROM topic1;INSERT INTO ks.table2 SELECT * FROM topic2
connect.cassandra.contact.points=localhost
connect.cassandra.port=9042
connect.cassandra.key.space=ks
connect.cassandra.username=cassandra
connect.cassandra.password=cassandra
Am I doing everything right? Is this the best way of doing this or should I create two separate connectors?
There's one issue with your config: you need one task per topic-partition. So if each of your topics has one partition, tasks.max needs to be set to at least 2.
I don't see this documented in Connect's docs, which is a shame.
If you want to consume those two topics with one consumer, that's fine and your setup is correct. The best way of doing this depends on whether those messages should be consumed by one consumer or two, so it comes down to your business logic.
Anyway, if you want to consume two topics via one consumer, that should work fine, since a consumer can subscribe to multiple topics. Did you try running this consumer? Is it working?
I'm trying to create a sort of consumer group, as it exists in Kafka, but for Cassandra. The goal is to have a request paginated, with each page handled by one instance of an app.
Is there any notion like Kafka's consumer group in Cassandra?
The TL;DR is that no, the consumer-group notion doesn't exist in Cassandra's clients. The burden of deciding which client processes what is entirely on the app developer.
You can use Cassandra's tokens to do selective paging.
Assuming 2 clients (easy example, sketched below):
Client 1 pages from -2^63 to 0
Client 2 pages from 1 to 2^63 - 1
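Here is that sketch, using the DataStax Java driver (3.x-style API) from Scala; the contact point, keyspace, table and partition-key names are placeholders:

import com.datastax.driver.core.Cluster

// placeholder contact point, keyspace, table and partition key
val cluster = Cluster.builder().addContactPoint("localhost").build()
val session = cluster.connect("ks")

// client 1 scans the first half of the token ring; client 2 would use 1 to 2^63 - 1
val firstHalf = session.execute(
  "SELECT * FROM table1 WHERE token(id) >= -9223372036854775808 AND token(id) <= 0")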
The above idea assumes you want to page through all the data in something similar to a batch process, which wouldn't be a good fit for Cassandra.
If you're after the latest N results, where the first half is sent to client 1 and the second half to client 2, you can use a logical bucket in your partition key.
If you're looking to scale the processing of a large number of Cassandra rows, you might consider a scalable execution platform like Flink or Storm. You'd be able to parallelize both the reading of the rows and the processing of the rows, although a parallelizable source (or spout in Storm) is not something you can get out of the box.
I have a scenario where different types of messages are streamed from a Kafka producer.
If I don't want to use a different topic per message type, how do I handle this on the Spark Structured Streaming consumer side?
I.e. I want to use only one topic for different types of messages, say Student records, Customer records, etc.
How do I identify which type of message has been received from the Kafka topic?
Please let me know how to handle this scenario on the Kafka consumer side.
Kafka topics don't inherently have "types of data". It's all bytes, so yes, you can serialize completely separate objects into the same topic, but consumers must then add logic to know all the possible types that will be added to the topic.
That being said, Structured Streaming is built on the idea of having structured data with a schema, so it likely will not work if you have completely different types in the same topic, without at least first performing a filter based on some inner attribute that is present in all types.
Yes, you can do this by adding some attribute to the message itself when producing, one that signifies a logical topic or operation, and differentiating on the Spark side, e.g. with the Structured Streaming Kafka integration: check the message content for that attribute and process accordingly.
Partitioning is, of course, still what you use for ordering.
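A minimal sketch of that idea, assuming JSON messages that all carry a recordType attribute (the broker address, topic name and field name below are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, get_json_object}

val spark = SparkSession.builder.appName("multi-type-consumer").getOrCreate()

// read the single mixed topic as raw strings
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "mixed-topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")

// route each logical type to its own stream based on the embedded attribute
val students  = raw.filter(get_json_object(col("value"), "$.recordType") === "Student")
val customers = raw.filter(get_json_object(col("value"), "$.recordType") === "Customer")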
I had a couple questions regarding the Cassandra connector written by Data Mountaineer. Any help is greatly appreciated as we're trying to figure out the best way to scale our architecture.
Do we have to create a connector config for each Cassandra table we want to update? For instance, let's say I have 1000 tables. Each table is dedicated to a different type of widget. Each widget has similar characteristics, but slightly different data. Do we need to create a connector for each table? If so, how is this managed and how does this scale?
In Cassandra, we often need to model column families based on the business need. We may have 3 tables representing user information: one by username, one by email, and one by last name. Would we need 3 connector configs and 3 separate sink tasks deployed to push data to each table?
I think both questions boil down to the same thing: can the sink handle multiple topics?
The sink can handle multiple tables in one sink, so one configuration. This is set in the KCQL statement, e.g. connect.cassandra.export.route.query=INSERT INTO orders SELECT * FROM orders-topic;INSERT INTO positions SELECT * FROM positions, but at present the tables need to be in the same Cassandra keyspace. This would route events from the orders-topic topic to a Cassandra table called orders, and events from the positions topic to a table called positions. You can also select specific columns and rename them, e.g. SELECT columnA AS columnB.
You may want more than one sink instance for separation of concerns, i.e. isolating the writes of one group of topics from other, unrelated topics.
You can scale via the number of tasks the connector is allowed to run; each task starts a writer for all the target tables.
We have a support channel of our own for more direct communication. https://datamountaineer.com/contact/
Is it possible to dynamically update the topic list in a spark-kafka consumer?
I have a Spark Streaming application which uses the spark-kafka consumer.
Say initially I have a spark-kafka consumer listening to the topics ["test"], and after a while my topic list gets updated to ["test","testNew"]. Is there a way to update the spark-kafka consumer's topic list and ask it to consume data for the updated list of topics without stopping the Spark Streaming application or the StreamingContext?
Is it possible to dynamically update the topic list in a spark-kafka consumer?
No. Both the receiver and receiverless approaches are fixed once you initialize the Kafka stream using KafkaUtils. There is no way for you to pass new topics as you go, since the DAG is fixed.
If you want to read dynamically, perhaps consider a batch job which is scheduled iteratively, reads the topics dynamically, and creates an RDD out of them.
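For instance, a sketch of that batch approach using KafkaUtils.createRDD; the topic partition, offsets and Kafka parameters below are placeholders, and in practice you would look up the current topics and offset ranges on each scheduled run:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkContext
import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies, OffsetRange}
import scala.collection.JavaConverters._

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "batch-reader")

// called from the scheduled job with whatever topics are current at that moment
def readBatch(sc: SparkContext, topics: Seq[String]): Unit = {
  // one OffsetRange per topic-partition; partition 0 and offsets 0-100 are placeholders
  val offsetRanges = topics.map(t => OffsetRange(t, 0, 0L, 100L)).toArray
  val rdd = KafkaUtils.createRDD[String, String](
    sc, kafkaParams.asJava, offsetRanges, LocationStrategies.PreferConsistent)
  rdd.map(_.value()).foreach(println)
}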
An additional solution would be to use a technology that gives you more flexibility over the consumption, such as Akka Streams.
As Yuval said, it isn't possible, but there might be a workaround if you know the structure/format of the data you are dealing with from Kafka.
For example:
If your streaming application is listening to the topics ["test","testNew"], and down the line you want to add a new topic named "test4", then as a workaround you can simply add a unique key to the data that would have gone to it and send that data to one of the existing topics.
Design your streaming application to recognize/filter the data based on the key you added to that "test4" data.
You can use a thread-based approach:
1. Define the cache, using any data structure, which contains the list of topics.
2. Provide a way to add elements to this cache.
3. Have two classes, A and B, where B has all the Spark-related logic.
4. Class A is a long-running job; from A you call B, and whenever there is a new topic you just spawn a new thread running B; a rough sketch follows.
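Here is that sketch (class and method names are made up; B would hold whatever Spark consumption logic you need):

import java.util.concurrent.{ConcurrentHashMap, Executors}

object B {
  // all the Spark/Kafka consumption logic for a single topic lives here
  def consume(topic: String): Unit = { /* create and start a stream for `topic` */ }
}

object A {
  // 1. the cache of topics already being consumed
  private val knownTopics = ConcurrentHashMap.newKeySet[String]()
  private val pool = Executors.newCachedThreadPool()

  // 2. call this whenever a new topic shows up
  def addTopic(topic: String): Unit =
    if (knownTopics.add(topic)) pool.submit(new Runnable {
      // 4. each new topic gets its own thread running B
      override def run(): Unit = B.consume(topic)
    })
}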
I'd suggest trying ConsumerStrategies.SubscribePattern from the latest Spark-Kafka integration (0.10) API version.
That would look like:
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.SubscribePattern
// subscribes to every topic matching the pattern, including topics created later
KafkaUtils.createDirectStream[String, String](
  mySparkStreamingContext,
  PreferConsistent,
  SubscribePattern[String, String]("test.*".r.pattern, myKafkaParamsMap))
I would like to write data from a Spark stream to Kafka.
I know that I can use KafkaUtils to read from Kafka.
But KafkaUtils doesn't provide an API to write to Kafka.
I checked a past question and its sample code.
Is the above sample code the simplest way to write to Kafka?
If I adopt the approach in that sample, I must create many classes...
Do you know a simpler way, or a library that helps with writing to Kafka?
Have a look here:
Basically, this blog post summarises your possibilities, which are written in different variations in the link you provided.
If we look at your task straightforwardly, we can make several assumptions:
Your output data is divided into several partitions, which may (and quite often will) reside on different machines
You want to send the messages to Kafka using standard Kafka Producer API
You don't want to pass data between machines before the actual sending to Kafka
Given those assumptions, your set of solutions is pretty limited: you either have to create a new Kafka producer for each partition and use it to send all the records of that partition, or you can wrap this logic in some sort of factory/sink, but the essential operation remains the same: you'll still request a producer object for each partition and use it to send that partition's records.
I'd suggest you continue with one of the examples in the provided link; the code is pretty short, and any library you find will most probably do the exact same thing behind the scenes.
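For illustration, a minimal sketch of that per-partition producer approach, assuming an RDD of strings (the broker address and output topic name are placeholders):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.rdd.RDD

def writeToKafka(rdd: RDD[String]): Unit =
  rdd.foreachPartition { partition =>
    // one producer per partition, created on the executor that holds the data
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    partition.foreach(msg => producer.send(new ProducerRecord[String, String]("output-topic", msg)))
    producer.close()
  }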