When is a Kafka connector preferred over a Spark streaming solution? - apache-spark

With Spark streaming, I can read Kafka messages and write data to different kinds of tables, for example HBase, Hive and Kudu. But this can also be done by using Kafka connectors for these tables. My question is, in which situations should I prefer connectors over the Spark streaming solution?
Also, how fault-tolerant is the Kafka connector solution? We know that with Spark streaming, we can use checkpoints and executors running on multiple nodes for fault-tolerant execution, but how is fault tolerance (if possible) achieved with Kafka connectors? By running the connector on multiple nodes?

So, generally, there should be no big difference in functionality when it comes to simply reading records from Kafka and sending them into other services.
Kafka Connect is probably easier when it comes to standard tasks since it offers various connectors out of the box, so it will quite probably reduce the need to write any code. So, if you just want to copy a bunch of records from Kafka to HDFS or Hive, it will probably be easier and faster to do with Kafka Connect.
That said, Spark Streaming clearly wins when you need to do things that are not standard. For example, if you want to perform aggregations or calculations over records and write the results to Hive, you should probably go with Spark Streaming from the beginning.
Generally, I have found doing non-standard things with Kafka Connect, such as splitting one message into multiple ones (assuming it was, for example, a JSON array), to be quite troublesome and to often require much more work than it would in Spark.
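To make the "split one message into many" case concrete, here is a minimal Python sketch of the one-to-many step itself, independent of any framework. It assumes the message value is UTF-8 encoded JSON; the function name split_json_array and the sample payload are made up for illustration.

```python
import json
from typing import List

def split_json_array(raw_value: bytes) -> List[bytes]:
    """Turn one message holding a JSON array into one payload per element."""
    parsed = json.loads(raw_value)
    if not isinstance(parsed, list):
        # Not an array: pass the message through unchanged.
        return [raw_value]
    return [json.dumps(item).encode("utf-8") for item in parsed]

# Hypothetical incoming message value.
incoming = b'[{"id": 1}, {"id": 2}, {"id": 3}]'
records = split_json_array(incoming)
```

In Spark this kind of function would typically be applied as a flat-map over the stream, which is why the split feels natural there.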
As for Kafka Connect fault tolerance: as described in the docs, this is achieved by running multiple distributed workers with the same group.id; the workers redistribute tasks and connectors if one of them fails.
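The idea of redistributing work when a worker fails can be illustrated with a toy model. This is only a sketch: it uses plain round-robin assignment and is not the algorithm Kafka Connect actually implements, and the task and worker names are invented.

```python
from typing import Dict, List

def assign_tasks(tasks: List[str], workers: List[str]) -> Dict[str, List[str]]:
    """Spread tasks over live workers round-robin (a toy stand-in for a rebalance)."""
    if not workers:
        raise ValueError("no live workers left to run tasks")
    assignment: Dict[str, List[str]] = {w: [] for w in workers}
    for i, task in enumerate(tasks):
        assignment[workers[i % len(workers)]].append(task)
    return assignment

tasks = ["hdfs-sink-0", "hdfs-sink-1", "hdfs-sink-2", "hdfs-sink-3"]
before = assign_tasks(tasks, ["worker-a", "worker-b", "worker-c"])
# worker-b fails: the same tasks are spread over the surviving workers.
after = assign_tasks(tasks, ["worker-a", "worker-c"])
```

The point is only that every task still has an owner after the failure; how tasks are moved in practice is up to the Connect rebalance protocol.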

in which situations should I prefer connectors over the Spark streaming solution?
"It Depends" :-)
Kafka Connect is part of Apache Kafka, and so has tighter integration with Apache Kafka in terms of security, delivery semantics, etc.
If you don't want to write any code, Kafka Connect is easier because it's just JSON to configure and run
If you're not using Spark already, Kafka Connect is arguably more straightforward to deploy (run the JVM, pass in the configuration)
As a framework, Kafka Connect is more transferable since the concepts are the same; you just plug in the appropriate connector for the technology that you want to integrate with each time
Kafka Connect handles all the tricky stuff for you like schemas, offsets, restarts, scaleout, etc etc etc
Kafka Connect supports Single Message Transforms (SMTs) for making changes to data as it passes through the pipeline (masking fields, dropping fields, changing data types, etc etc). For more advanced processing you would use something like Kafka Streams or ksqlDB.
If you are using Spark, and it's working just fine, then it's not necessarily prudent to rip it up to use Kafka Connect instead :)
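As an illustration of the "it's just JSON" point above, the sketch below builds the kind of payload that is submitted to a Kafka Connect worker's REST API (a POST to its /connectors endpoint). The connector name, topic, file path and field name are hypothetical; the two class names are the FileStreamSink connector and the MaskField transform that ship with Apache Kafka. Python is used here only to assemble and print the JSON.

```python
import json

# Hypothetical connector name, topic, file path and masked field.
connector = {
    "name": "orders-file-sink",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
        "tasks.max": "1",
        "topics": "orders",
        "file": "/tmp/orders.txt",
        # Single Message Transform: mask a sensitive field on the way through.
        "transforms": "mask",
        "transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
        "transforms.mask.fields": "credit_card",
    },
}

payload = json.dumps(connector, indent=2)
print(payload)
```

Swapping in a different target system mostly means changing connector.class and the connector-specific settings; the surrounding structure stays the same.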
Also how tolerant is the Kafka connector solution? … how is fault tolerance (if possible) achieved with Kafka connectors?
Kafka Connect can be run in distributed mode, in which you have one or more worker processes across nodes. If a worker fails, Kafka Connect rebalances the tasks across the remaining ones. If you add a worker in, Kafka Connect will rebalance to ensure workload distribution. This was drastically improved in Apache Kafka 2.3 (KIP-415)
Kafka Connect uses the Kafka consumer API and tracks offsets of records delivered to a target system in Kafka itself. If the task or worker fails you can be sure that it will restart from the correct point. Many connectors support exactly-once delivery too (e.g. HDFS, Elasticsearch, etc)
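Why tracked offsets let a restarted task resume from the correct point can be shown with a toy model. This is a sketch under stated assumptions, not Kafka Connect internals: an in-memory list stands in for a topic partition, a dict stands in for the target system, and the crash is simulated. Because the offset is committed only after the write, a crash can cause a record to be delivered twice but never skipped; writing by key makes the repeat harmless, which is one common way sinks achieve an exactly-once effect.

```python
from typing import Dict, List, Tuple

# In-memory stand-in for one topic partition: (key, value) per offset.
log: List[Tuple[str, str]] = [("k1", "a"), ("k2", "b"), ("k3", "c"), ("k4", "d")]

def run_task(start: int, sink: Dict[str, str], crash_at: int = -1) -> int:
    """Deliver records from `start` onward and return the last committed offset."""
    committed = start
    for offset in range(start, len(log)):
        key, value = log[offset]
        sink[key] = value          # idempotent upsert keyed by record key
        if offset == crash_at:
            return committed       # simulated crash: write done, commit lost
        committed = offset + 1     # commit the next offset to read
    return committed

sink: Dict[str, str] = {}
committed = run_task(0, sink, crash_at=2)        # crashes after writing offset 2
committed_after_restart = run_task(committed, sink)
```

After the restart the record at offset 2 is written a second time, but the target still ends up with exactly one entry per key.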
Disclaimer: I work for Confluent and am a big fan of Kafka Connect :-)

Related

How to build Aggregations on Apache Solr with Spark

I have a requirement to build aggregations on the data that we receive in our Apache Kafka cluster.
I am a little lost as to which technological path to follow.
It seems people see the following constellation as the standard approach: Apache Kafka <-> Apache Spark <-> Solr
Bitnami Data Platform
I can't find concrete examples of how this actually works, and I am also asking myself whether a solution of
Apache Kafka <-> Kafka Connect Solr <-> Solr
would also do the trick, because Solr supports aggregations as well.
Solr Aggregation
But I saw some code snippets that aggregate the data in Spark and write it to Solr under a special index.
Also, aggregation with Kafka <-> Kafka Connect Solr <-> Solr will probably only work for a single Kafka topic, so if I have to combine and aggregate data from two or more different topics, then Kafka, Spark, Solr is the way to go (or is this viable at all?).
As you can see, I am a little confused, so I would like to ask here how you approach this problem in your real-life solutions.
Thanks for your answers.
Spark can of course join multiple topics. So can Flink, or Kafka Streams/ksqlDB. Spark and Flink just happen to be able to also write their data to external systems, such as Solr, rather than exclusively back into a new Kafka topic. The "downside" is that you need to maintain a scheduler exclusively for those, as compared to running a cluster of standalone Kafka Connect or Kafka Streams JAR applications. If you're using Kubernetes, then that could be used for all of the above (maybe not Flink... haven't tried).
Kafka Connect can consume multiple topics and, depending on the connector configuration, might write to one or many Solr collections.

Kafka Spark Streaming ingestion for multiple topics

We are currently ingesting Kafka messages into HDFS using Spark Streaming. So far we spawn a whole Spark job for each topic.
Since messages are produced pretty rarely for some topics (average of 1 per day), we're thinking about organising the ingestion in pools.
The idea is to avoid creating a whole container (and related resources) for these "infrequent" topics. In fact, Spark Streaming accepts a list of topics as input, so we're thinking about using this feature in order to have a single job consuming all of them.
Do you think this is a good strategy? We also thought about batch ingestion, but we would like to keep real-time behavior, so we excluded this option. Do you have any tips or suggestions?
Does Spark Streaming handle multiple topics well as a source in case of failures, in terms of offset consistency etc.?
Thanks!
I think Spark should be able to handle multiple topics fine, as it has supported this for a long time. And yes, Kafka Connect is not a Confluent API. Confluent does provide connectors for its platform, but you can use them too. You can see that Apache Kafka also has documentation for the Connect API.
It is a little more difficult with the Apache version of Kafka, but you can use it.
https://kafka.apache.org/documentation/#connectapi
Also, if you're opting for multiple Kafka topics in a single Spark streaming job, you may need to think about how to avoid creating small files, as your message frequency seems very low.
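One way to think about the small-files concern is to buffer records per topic and only write a file once a buffer is large enough or old enough. The sketch below shows that decision logic in plain Python; the class name TopicBuffer, the thresholds and the topic names are all assumptions, and it deliberately leaves out the actual HDFS write and any offset handling.

```python
from typing import Dict, List

class TopicBuffer:
    """Accumulate records per topic and release a batch only when it is worthwhile."""

    def __init__(self, min_records: int, max_age_seconds: float) -> None:
        self.min_records = min_records
        self.max_age_seconds = max_age_seconds
        self.records: Dict[str, List[str]] = {}
        self.first_seen: Dict[str, float] = {}

    def add(self, topic: str, record: str, now: float) -> None:
        self.records.setdefault(topic, []).append(record)
        self.first_seen.setdefault(topic, now)

    def due(self, now: float) -> Dict[str, List[str]]:
        """Pop and return the topics whose buffer is big enough or old enough."""
        ready: Dict[str, List[str]] = {}
        for topic in list(self.records):
            big = len(self.records[topic]) >= self.min_records
            old = now - self.first_seen[topic] >= self.max_age_seconds
            if big or old:
                ready[topic] = self.records.pop(topic)
                del self.first_seen[topic]
        return ready

buf = TopicBuffer(min_records=3, max_age_seconds=3600.0)
buf.add("busy", "r1", now=0.0)
buf.add("busy", "r2", now=1.0)
buf.add("busy", "r3", now=2.0)
buf.add("rare", "x1", now=2.0)
first = buf.due(now=3.0)       # only "busy" has reached the size threshold
second = buf.due(now=4000.0)   # "rare" is released once it is old enough
```

The age limit is the trade-off knob: a lower value keeps ingestion closer to real time, a higher value produces fewer and larger files.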

Which framework should be used to aggregate and join the data of Kafka topics and store it in MySQL

I have data in two Kafka topics from MySQL using debezium-connector-mysql-plugin.
Now I want to aggregate this data at a daily level and store it in another MySQL table.
Please suggest an approach.
Thanks.
You've not really laid out your requirements, other than commenting that you don't want to use Confluent Platform (but not said why).
In general, with data in Kafka (regardless of where it comes from) you have different options for processing it:
Bespoke consumer (probably a bad idea, given the availability of stream processing frameworks)
KSQL (use SQL to do your joins etc) - part of Confluent Platform
Kafka Streams - a Java library for doing stream processing. Part of Apache Kafka.
Flink, Spark Streaming, Samza, Heron, etc etc etc
It's up to you which you use, and it's going to come down to factors such as
Existing technology in use (no point deploying a Spark cluster if you don't need to; conversely, if you already use Spark and have lots of developers trained on it then it could make sense to use it)
Language familiarity of developers - does it have to be a Java API, or is SQL more accessible
Capabilities of the framework/tool - do you need tight security integration, exactly-once processing, CEP, etc etc. Some of these will rule in or out the tool that you use.
Once you've joined and aggregated your data, a good pattern to follow is to write it back to Kafka (thus decoupling your design and enabling separation of responsibilities between the components) and from there write it to MySQL using Kafka Connect and the JDBC Sink. Kafka Connect is part of Apache Kafka.
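Whichever framework is chosen, the join-then-aggregate step has roughly the following shape. This is a framework-independent Python sketch over small in-memory lists, not a streaming implementation; the topics' contents and every field name (customer_id, country, day, amount) are hypothetical. The resulting rows are what would be written back to a Kafka topic for the JDBC Sink to pick up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical, already-deserialized events from the two topics.
customers = [
    {"customer_id": 1, "country": "DE"},
    {"customer_id": 2, "country": "FR"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "day": "2020-01-01", "amount": 20.0},
    {"order_id": 11, "customer_id": 2, "day": "2020-01-01", "amount": 5.0},
    {"order_id": 12, "customer_id": 1, "day": "2020-01-02", "amount": 7.5},
    {"order_id": 13, "customer_id": 1, "day": "2020-01-01", "amount": 2.5},
]

def daily_totals(orders: List[dict], customers: List[dict]) -> List[dict]:
    """Join orders to customers by key, then sum amounts per (day, country)."""
    country_by_customer = {c["customer_id"]: c["country"] for c in customers}
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for order in orders:
        country = country_by_customer.get(order["customer_id"])
        if country is None:
            continue  # inner join: skip orders without a matching customer
        totals[(order["day"], country)] += order["amount"]
    return [
        {"day": day, "country": country, "total_amount": total}
        for (day, country), total in sorted(totals.items())
    ]

rows = daily_totals(orders, customers)
```

In Kafka Streams, ksqlDB or Spark the same logic would be expressed as a keyed join followed by a windowed or grouped aggregation.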
One final consideration: if you're taking data from MySQL, processing it, and then writing it back into MySQL… do you even need Kafka? Is there an appropriate reason to be using it and not just doing this processing in MySQL itself?
Disclaimer: I work for Confluent.

Spark streaming + Kafka vs Just Kafka

Why and when one would choose to use Spark streaming with Kafka?
Suppose I have a system getting a thousand messages per second through Kafka. I need to apply some real-time analytics to these messages and store the result in a DB.
I have two options:
Create my own worker that reads messages from Kafka, runs the analytics algorithm, and stores the result in the DB. In a Docker era it is easy to scale this worker through my entire cluster with just a scale command. I just need to make sure I have at least as many partitions as workers, and then I have true concurrency.
Create a Spark cluster with Kafka streaming input. Let the Spark cluster do the analytics computations and then store the result.
Is there any case when the second option is a better choice? Sounds to me like it is just an extra overhead.
In a Docker era it is easy to scale this worker through my entire cluster
If you already have that infrastructure available, then great, use that. Bundle your Kafka libraries in some minimal container with health checks, and what not, and for the most part, that works fine. Adding a Kafka client dependency + a database dependency is all you really need, right?
If you're not using Spark, Flink, etc, you will need to handle Kafka errors, retries, offsets, and commits in your own code rather than letting the framework handle those for you.
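To give a feel for what "handling it in your own code" involves, here is a minimal Python sketch of a worker loop with retries and commit-after-success. It is not written against a real Kafka client: FakeConsumer is an in-memory stand-in, and the handler, message names and retry count are invented for the example.

```python
from typing import Callable, List, Tuple

class FakeConsumer:
    """In-memory stand-in for a consumer on one partition (not a real Kafka client)."""

    def __init__(self, messages: List[str]) -> None:
        self.messages = messages
        self.committed = 0  # next offset to read after a restart

    def poll(self) -> List[Tuple[int, str]]:
        return [(o, self.messages[o]) for o in range(self.committed, len(self.messages))]

    def commit(self, next_offset: int) -> None:
        self.committed = next_offset

def process_with_retries(handler: Callable[[str], None], message: str, max_attempts: int) -> bool:
    """Call the handler until it succeeds or the attempts run out."""
    for _ in range(max_attempts):
        try:
            handler(message)
            return True
        except RuntimeError:
            continue  # a real worker would back off before retrying
    return False

def run_worker(consumer: FakeConsumer, handler: Callable[[str], None], max_attempts: int = 3) -> None:
    for offset, message in consumer.poll():
        if not process_with_retries(handler, message, max_attempts):
            break  # stop without committing so the message is retried later
        consumer.commit(offset + 1)  # commit only after the write succeeded

stored: List[str] = []
failures = {"m2": 1}  # "m2" fails once, then succeeds

def flaky_store(message: str) -> None:
    if failures.get(message, 0) > 0:
        failures[message] -= 1
        raise RuntimeError("transient database error")
    stored.append(message)

consumer = FakeConsumer(["m1", "m2", "m3"])
run_worker(consumer, flaky_store)
```

Even this toy version has to decide when to commit, how often to retry, and what to do when retries are exhausted, which is the work a framework would otherwise take on.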
I'll add in here that if you want Kafka + Database interactions, check out the Kafka Connect API. There are existing solutions for JDBC, Mongo, Couchbase, Cassandra, etc. already.
If you need more complete processing power, I'd go for Kafka Streams rather than needing to separately maintain a Spark cluster, and so that's "just Kafka"
Create a Spark cluster
Let's assume you don't want to maintain that, or rather you aren't able to pick between YARN, Mesos, Kubernetes, or Standalone. And if you are running the first three, it might be worth looking at running Docker on those anyway.
You're exactly right that it is extra overhead, so I find it's all up to what you have available (for example, an existing Hadoop / YARN cluster with idle memory resources), or what you're willing to support internally (or pay for vendor services, e.g. Kafka & Databricks in some hosted solution).
Plus, Spark isn't running the latest Kafka client library (it was only in Spark 2.4.0 that it was updated to Kafka 2.0, I believe), so you'll need to determine whether that matters to you.
For actual streaming libraries, rather than Spark batches, Apache Beam or Flink would probably let you do the same types of workloads against Kafka
In general, in order to scale a producer / consumer, you need some form of resource scheduler. Installing Spark may not be difficult for some, but knowing how to use it efficiently and tune for appropriate resources can be

Streaming data from Kafka into Cassandra in real time

What's the best way to write data from Kafka into Cassandra? I would expect it to be a solved problem, but there doesn't seem to be a standard adapter.
A lot of people seem to be using Storm to read from Kafka and then write to Cassandra, but Storm seems like somewhat of an overkill for simple ETL operations.
We are heavily using Kafka and Cassandra through Storm
We rely on Storm because:
There are usually a lot of distributed (inter-node) processing steps before the result of the original message hits Cassandra (Storm bolt topologies)
We don't need to maintain the Kafka consumer state (offset) ourselves - the Storm-Kafka connector does it for us once all products of the original message are acked within Storm
Message processing is distributed across nodes with Storm natively
Otherwise, if it is a very simple case, you can just read messages from Kafka and write the result to Cassandra without the help of Storm.
A recent release of Kafka introduced the connector concept, which supports sources and sinks as first-class concepts in the design. With this, you do not need any streaming framework for moving data in/out of Kafka. Here is the Cassandra connector for Kafka that you can use: https://github.com/tuplejump/kafka-connect-cassandra
