Spring Batch Remote Partitioning with Kafka as middleware

I was looking into Spring Batch remote partitioning for loading data from RDBMS sources as well as from a multi-partition Kafka topic. My problem is that I cannot use RabbitMQ or JMS as the middleware between the master and worker nodes; I can only use Kafka as the channel between the master and the workers.
All the documentation I can find says that it supports JMS and AMQP.
Can anyone tell me how to use remote partitioning with Kafka as the middleware? A working example would be a great help.

spring-integration-kafka provides endpoints similar to those used for JMS and RabbitMQ, so it shouldn't be difficult to apply the concepts in that documentation to Kafka.
The latest spring-integration-kafka version is 3.3.1 (it is moving to the core spring-integration project in 5.4.0).
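To make the concept concrete, here is a rough, library-free sketch of what remote partitioning does: the master splits the work into partitions, sends one request per partition over a channel, and workers pick the requests up and reply. This is not Spring Batch or spring-integration-kafka code. The two queue.Queue objects are stand-ins for a Kafka request topic and a reply topic, and every name (make_partitions, master_send, worker_drain) is made up for illustration.

```python
import json
import queue


def make_partitions(min_id, max_id, grid_size):
    """Split the inclusive id range [min_id, max_id] into at most grid_size slices."""
    total = max_id - min_id + 1
    size = -(-total // grid_size)  # ceiling division
    partitions = []
    start = min_id
    while start <= max_id:
        end = min(start + size - 1, max_id)
        partitions.append({"partition": len(partitions), "min_id": start, "max_id": end})
        start = end + 1
    return partitions


def master_send(requests, partitions):
    """Master side: publish one serialized request per partition."""
    for p in partitions:
        requests.put(json.dumps(p))


def worker_drain(requests, replies, name):
    """Worker side: consume requests until the channel is empty and reply per partition."""
    while True:
        try:
            msg = json.loads(requests.get_nowait())
        except queue.Empty:
            return
        # Stand-in for the real step (e.g. reading rows min_id..max_id from the RDBMS).
        rows = msg["max_id"] - msg["min_id"] + 1
        replies.put(json.dumps({"partition": msg["partition"], "worker": name, "rows": rows}))


# Hypothetical in-memory channels standing in for Kafka topics.
requests, replies = queue.Queue(), queue.Queue()
partitions = make_partitions(1, 100, 4)
master_send(requests, partitions)
worker_drain(requests, replies, "worker-1")

results = []
while not replies.empty():
    results.append(json.loads(replies.get()))
total_rows = sum(r["rows"] for r in results)
```

In a real setup the workers would be separate processes consuming from the request topic, and the master would aggregate the replies (or poll the job repository) to decide when the partitioned step is complete.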

Related

When is a Kafka connector preferred over a Spark streaming solution?

With Spark Streaming, I can read Kafka messages and write data to different kinds of tables, for example HBase, Hive and Kudu. But this can also be done by using Kafka connectors for these tables. My question is: in which situations should I prefer connectors over the Spark Streaming solution?
Also, how fault tolerant is the Kafka connector solution? We know that with Spark Streaming we can use checkpoints and executors running on multiple nodes for fault-tolerant execution, but how is fault tolerance (if possible) achieved with Kafka connectors? By running the connector on multiple nodes?

So, generally, there should be no big difference in functionality when it comes to simply reading records from Kafka and sending them into other services.
Kafka Connect is probably easier for standard tasks since it offers various connectors out of the box, so it will quite probably reduce the need to write any code. So, if you just want to copy a bunch of records from Kafka to HDFS or Hive, it will probably be easier and faster to do with Kafka Connect.
With this in mind, Spark Streaming clearly takes over when you need to do things that are not standard, i.e. if you want to perform some aggregations or calculations over records and write them to Hive, then you should probably go for Spark Streaming from the beginning.
Generally, I found doing some non-standard things with Kafka Connect, for example splitting one message into multiple ones (assuming it was, say, a JSON array), to be quite troublesome; it often requires much more work than it would in Spark.
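For reference, the one-to-many split mentioned above is just a flat-map over the message value. A minimal sketch in plain Python, with no Kafka or Spark involved (split_array_message is a hypothetical helper name, and passing non-array messages through unchanged is an assumption):

```python
import json


def split_array_message(value):
    """Return one JSON string per element if the message is a JSON array."""
    parsed = json.loads(value)
    if not isinstance(parsed, list):
        return [value]  # not an array: pass the message through unchanged
    return [json.dumps(item) for item in parsed]


incoming = '[{"id": 1}, {"id": 2}, {"id": 3}]'
outgoing = split_array_message(incoming)
```

In a stream processing framework this function would be the body of a flatMap-style operator; the effort described above comes from fitting a one-to-many step like this into a connector pipeline.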
As for Kafka Connect fault tolerance, as described in the docs, this is achieved by running multiple distributed workers with the same group.id; the workers redistribute tasks and connectors if one of them fails.

in which situations I should prefer connectors over the Spark streaming solution.
"It Depends" :-)
Kafka Connect is part of Apache Kafka, and so has tighter integration with Apache Kafka in terms of security, delivery semantics, etc.
If you don't want to write any code, Kafka Connect is easier because it's just JSON to configure and run.
If you're not using Spark already, Kafka Connect is arguably more straightforward to deploy (run the JVM, pass in the configuration).
As a framework, Kafka Connect is more transferable since the concepts are the same; you just plug in the appropriate connector for the technology that you want to integrate with each time.
Kafka Connect handles all the tricky stuff for you, like schemas, offsets, restarts, scale-out, and so on.
Kafka Connect supports Single Message Transforms for making changes to data as it passes through the pipeline (masking fields, dropping fields, changing data types, etc.). For more advanced processing you would use something like Kafka Streams or ksqlDB.
If you are using Spark, and it's working just fine, then it's not necessarily prudent to rip it up to use Kafka Connect instead :)
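To illustrate what a per-record transform of that kind does (masking, dropping, and changing the type of fields), here is a small plain-Python sketch. It is not the Kafka Connect transform API, which is configured on the connector rather than written like this; the transform function and its mask, drop and cast parameters are invented for illustration.

```python
def transform(record, mask=(), drop=(), cast=None):
    """Return a copy of record with fields dropped, masked, or cast to another type."""
    cast = cast or {}
    out = {}
    for key, value in record.items():
        if key in drop:
            continue  # field is removed entirely
        if key in mask:
            value = "****"  # field is kept but its value is hidden
        elif key in cast:
            value = cast[key](value)  # field is converted, e.g. str -> int
        out[key] = value
    return out


record = {"user": "alice", "card": "4111111111111111", "amount": "42", "debug": True}
cleaned = transform(record, mask={"card"}, drop={"debug"}, cast={"amount": int})
```

Each record is handled on its own, with no state shared between records, which is why anything that needs joins or aggregations belongs in a stream processor instead.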
Also how tolerant is the Kafka connector solution? … how is fault tolerance (if possible) achieved with Kafka connectors?
Kafka Connect can be run in distributed mode, in which you have one or more worker processes across nodes. If a worker fails, Kafka Connect rebalances the tasks across the remaining ones. If you add a worker, Kafka Connect will rebalance to ensure workload distribution. This was drastically improved in Apache Kafka 2.3 (KIP-415).
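As a toy illustration of the rebalancing idea, the sketch below reassigns a fixed set of tasks round-robin over whichever workers are currently alive. It is deliberately naive: it recomputes the whole assignment each time and does not model the actual Kafka Connect rebalance protocol or the incremental approach from KIP-415. The task and worker names are made up.

```python
def assign(tasks, workers):
    """Spread tasks round-robin over the sorted list of live workers."""
    live = sorted(workers)
    if not live:
        raise ValueError("no live workers to assign tasks to")
    assignment = {w: [] for w in live}
    for i, task in enumerate(sorted(tasks)):
        assignment[live[i % len(live)]].append(task)
    return assignment


tasks = ["sink-0", "sink-1", "sink-2", "sink-3", "sink-4", "sink-5"]

# Three workers share the six tasks.
before = assign(tasks, {"worker-a", "worker-b", "worker-c"})

# worker-b fails: its tasks are picked up by the remaining workers.
after_failure = assign(tasks, {"worker-a", "worker-c"})
```

Every task is still assigned after the failure; only the number of tasks per worker changes.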
Kafka Connect uses the Kafka consumer API and tracks the offsets of records delivered to a target system in Kafka itself. If the task or worker fails, you can be sure that it will restart from the correct point. Many connectors support exactly-once delivery too (e.g. HDFS, Elasticsearch).
If you want to learn more about Kafka Connect see the docs here and my talk here. See a list of connectors here, and tutorial videos here.
Disclaimer: I work for Confluent and am a big fan of Kafka Connect :-)

Is the natural replacement for Spark (Direct) Streaming either Spark Structured Streaming or Kafka Streams?

Over the past few years we have developed quite a few Spark Streaming (Direct API) applications that read from or write to Kafka, IBM MQ, Hive, HBase, HDFS, and others on our Cloudera platform. Now that the Direct API of Spark Streaming (we currently have version 2.3.2) is deprecated and we have recently added the Confluent Platform (which comes with Kafka 2.2.0) to our project, we plan to migrate these applications.
What is the natural replacement for our Spark Streaming applications? Should we migrate to Spark Structured Streaming or rather to Kafka Streams?
I personally do not have any experience with either framework, but in my view Spark Structured Streaming seems to be the natural choice. Our code base is mainly written in Scala, which can also be used with the Structured API. Kafka Streams has a few limitations with Scala. Although we might lose some flexibility by leaving the low-level API of RDDs and moving to the higher-level DataFrames, we could build on our knowledge of Spark.
On the other hand, there is Kafka Streams, which is probably the best choice when it comes to processing data between Kafka topics, which is our main use case. And looking at all the Kafka connectors that come with Confluent, the other use cases can be served as well.

You currently have some Spark scheduler, therefore you can use Structured Streaming, which is binary compatible with the old Streaming API.
If you're using Mesos or k8s, then putting Kafka Streams apps in Docker and running those is easier to scale, monitor, and configure than Spark, IMO, since each app behaves like any other Docker container in those systems, so you can build one pattern around everything.
Kafka Streams... is probably the best choice when it comes to processing data between Kafka topics
True.
Kafka Streams has a few limitations with Scala.
I think you might want to keep reading that section:
The Kafka Streams DSL for Scala library is a wrapper over the existing Java APIs for Kafka Streams DSL that addresses the concerns raised.
Of course, you could always use Kotlin to interop better with the Java API.

Twitter data harvesting

For my project, I need to harvest data from Twitter.
I am currently facing two design choices:
What is the best software architecture? I read that Spark has Twitter support but I am not familiar with Scala. On the other hand, Apache Spark seems a good option, but then I'm not sure how to save data to a common sink.
I have some budget constraints. I surely need one server for the sink and the processing. However, for the data harvesting, I don't know whether several VMs/containers offer a better performance/cost ratio than a bunch of Raspberry Pis running Kafka producers.

Take a look at Confluent Platform and especially Kafka Connect [1].
There is a Twitter connector out of the box. All the Twitter data will be streamed to Kafka.
[1] https://www.confluent.io/blog/using-ksql-to-analyse-query-and-transform-data-in-kafka

Agree with @leshkin that Kafka Connect is the most natural fit. However, the Twitter connector (available on GitHub here) does not require Confluent Platform, simply Kafka Connect, which is a standard part of the Apache Kafka distribution. https://kafka.apache.org/documentation/#connect
If you choose, you can run Kafka Connect workers in distributed mode to divide the load across several VMs/containers/boxes, and these don't have to be the same boxes that run your Kafka brokers (they only need some relevant libs from Kafka, the libs for the connector, and Java of course).

How to stream from Kafka to Cassandra and increment counters

I have an Apache access log file and I want to store access counts (total/daily/hourly) for each page in a Cassandra table.
I am trying to do it by using Kafka Connect to stream from the log file to a Kafka topic. In order to increment metrics counters in Cassandra, can I use Kafka Connect again? Otherwise, which other tool should be used here, e.g. Kafka Streams, Spark, Flink, Kafka Connect?

You're talking about doing stream processing, which Kafka can do - either with Kafka's Streams API, or KSQL. KSQL runs on top of Kafka Streams, and gives you a very simple way to build the kind of aggregations that you're talking about.
Here's an example of doing aggregations of streams of data in KSQL:
SELECT PAGE_ID, COUNT(*) FROM PAGE_CLICKS WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY PAGE_ID;
See more at : https://www.confluent.io/blog/using-ksql-to-analyse-query-and-transform-data-in-kafka
You can take the output of KSQL which is actually just a Kafka topic, and stream that through Kafka Connect e.g. to Elasticsearch, Cassandra, and so on.
You mention other stream processing tools; they're valid too. The choice depends in part on existing skills and language preferences (e.g. Kafka Streams is a Java library, KSQL is … KSQL, Spark Streaming has Python as well as Java, etc.), but also on deployment preferences. Kafka Streams is just a Java library to deploy within your existing application. KSQL is deployable in a cluster, and so on.
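Whichever tool you pick, the underlying computation is the same hourly tumbling-window count per page that the KSQL query above expresses. A minimal plain-Python sketch of that logic, assuming each click is a (page_id, epoch_seconds) pair (the data layout and function names here are assumptions, not part of any of the tools mentioned):

```python
from collections import Counter

WINDOW_SECONDS = 3600  # one-hour tumbling windows


def window_start(ts):
    """Map an epoch timestamp to the start of the hour window that contains it."""
    return ts - (ts % WINDOW_SECONDS)


def count_by_page_and_hour(clicks):
    """Count clicks per (page_id, window start)."""
    counts = Counter()
    for page_id, ts in clicks:
        counts[(page_id, window_start(ts))] += 1
    return counts


# Timestamps are epoch seconds; small values keep the example readable.
clicks = [
    ("/home", 300),    # first hour
    ("/home", 3300),   # first hour
    ("/home", 3600),   # second hour starts exactly here
    ("/about", 1800),  # first hour
]
hourly = count_by_page_and_hour(clicks)
```

Each (page, hour) pair would then become one row to upsert into the Cassandra table; daily and total counts follow the same pattern with a larger window or no window at all.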

This can be easily done with Flink, either as a batch or streaming job, and either with or without Kafka (Flink can read from files and write to Cassandra). This sort of time-windowed aggregation is easily done with Flink's SQL API; see the examples here.

Streaming data from Kafka into Cassandra in real time

What's the best way to write data from Kafka into Cassandra? I would expect it to be a solved problem, but there doesn't seem to be a standard adapter.
A lot of people seem to be using Storm to read from Kafka and then write to Cassandra, but Storm seems like overkill for simple ETL operations.

We use Kafka and Cassandra heavily through Storm.
We rely on Storm because:
there are usually a lot of distributed (inter-node) processing steps before the result of the original message hits Cassandra (Storm bolt topologies)
we don't need to maintain the Kafka consumer state (offset) ourselves; the Storm-Kafka connector does it for us once all products of the original message are acked within Storm
message processing is distributed across nodes natively with Storm
Otherwise, if it is a very simple case, you can simply read messages from Kafka and write the results to Cassandra without the help of Storm.

A recent release of Kafka introduced the connector concept, which supports sources and sinks as first-class concepts in the design. With this, you do not need any streaming framework for moving data in and out of Kafka. Here is the Cassandra connector for Kafka that you can use: https://github.com/tuplejump/kafka-connect-cassandra
