Log processing using ELK stack - apache-spark

My requirement is as follows.
I have log files that I need to process, and I would like to enrich the log information with data that I have in a Postgres database.
Step 1. I plan to feed data from these two sources (log files and database) into Kafka topics using Logstash.
Step 2. I plan to use Kafka Streams to join the data from the different Kafka topics and push the result to Elasticsearch via API calls.
My doubt is about step 2.
Is Kafka Streams the way to go, or can I use Apache Spark, which I believe can be used for the same purpose?
Any help on this is appreciated.

Step 1. I plan to feed data from these two sources (log files and database) into Kafka topics using Logstash.
If you're already using Apache Kafka, then note that you can use Kafka Connect for integrating systems, including databases, into Kafka. For information on integrating databases, see this article.
Step 2. I plan to use Kafka Streams to join the data from the different Kafka topics and push the result to Elasticsearch via API calls.
My doubt is about step 2. Is Kafka Streams the way to go, or can I use Apache Spark, which I believe can be used for the same purpose? Any help on this is appreciated.
Yes, Kafka Streams is a good fit for this. It can enrich events as they flow through a topic, using data from other topics. These topics can be sourced from any system, including log files, databases, etc. Here is example code for such a join, and the documentation for it.
BTW, you might also want to check out KSQL. KSQL is built on Kafka Streams, so you get the same scalability and elasticity, but with a SQL abstraction that you can run directly (no coding needed). For an example of using KSQL to enrich streams of data, see this talk or this article.
(Disclosure: I work for Confluent, who lead the open-source KSQL project)
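To make the enrichment join described above more concrete, here is a minimal sketch in plain Python (not Kafka Streams code). It assumes the database rows arrive as a changelog of key/value pairs and that log events carry the same key; the sample data and the names build_lookup_table and enrich are hypothetical. Unlike a real stream-table join, it materializes the whole table first instead of updating it while events flow.

```python
import json

def build_lookup_table(changelog):
    """Keep only the latest value per key, as a table view of a changelog topic."""
    table = {}
    for key, value in changelog:
        if value is None:
            # A null value is treated as a delete (tombstone).
            table.pop(key, None)
        else:
            table[key] = value
    return table

def enrich(log_events, table):
    """Left-join each log event with the table row that shares its key."""
    for key, event in log_events:
        enriched = dict(event)
        enriched["user"] = table.get(key)  # None when there is no matching row
        yield key, enriched

# Hypothetical sample data: a changelog from the database and a stream of log events.
user_changelog = [
    ("u1", {"name": "Alice", "plan": "free"}),
    ("u2", {"name": "Bob", "plan": "pro"}),
    ("u1", {"name": "Alice", "plan": "pro"}),  # later update for u1
]
log_events = [
    ("u1", {"path": "/home", "status": 200}),
    ("u3", {"path": "/login", "status": 401}),
]

users = build_lookup_table(user_changelog)
enriched_events = list(enrich(log_events, users))
for key, doc in enriched_events:
    print(key, json.dumps(doc, sort_keys=True))
```

Each enriched document could then be written to an output topic and indexed into Elasticsearch from there.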

Related

Kafka log aggregation and processing

Hi, I am trying to use Kafka as a log aggregation and filtering layer that feeds into Splunk, for example.
The input side of Kafka will be Kafka S3 connectors and other connectors getting logs from S3 and Amazon Kinesis Data Streams. See this pic for reference:
What I want to know is whether Spark jobs are necessary for processing or filtering inside the Kafka data pipeline, or whether a simple Kafka Streams app is enough. If we have to repeat this design for several different logs, what would be an efficient way to implement it? I am looking for a solution that we can replicate across different log streams without major changes each time.
Thank you
Spark (or Flink) can essentially replace Kafka Streams and Kafka Connect for transforming topics and writing to S3.
If you want to write directly to Splunk, then there is a Kafka connector written explicitly for that, and you could use any Kafka client to consume and produce processed data before writing it downstream.

When is a Kafka connector preferred over a Spark streaming solution?

With Spark Streaming, I can read Kafka messages and write data to different kinds of tables, for example HBase, Hive and Kudu. But this can also be done by using Kafka connectors for these tables. My question is, in which situations should I prefer connectors over the Spark Streaming solution?
Also, how fault tolerant is the Kafka connector solution? We know that with Spark Streaming we can use checkpoints and executors running on multiple nodes for fault-tolerant execution, but how is fault tolerance (if possible) achieved with Kafka connectors? By running the connector on multiple nodes?
So, generally, there should be no big difference in functionality when it comes to simply reading records from Kafka and sending them into other services.
Kafka Connect is probably easier when it comes to standard tasks since it offers various connectors out-of-the-box, so it will quite probably reduce the need of writing any code. So, if you just want to copy a bunch of records from Kafka to HDFS or Hive then it will probably be easier and faster to do with Kafka connect.
That said, Spark Streaming clearly takes over when you need to do things that are not standard. For example, if you want to perform aggregations or calculations over records and write the results to Hive, then you should probably go for Spark Streaming from the beginning.
Generally, I have found that doing non-standard things with Kafka Connect, such as splitting one message into multiple ones (assuming it was, for example, a JSON array), is quite troublesome and often requires much more work than it would in Spark.
As for Kafka Connect fault tolerance: as described in the docs, it is achieved by running multiple distributed workers with the same group.id; the workers redistribute tasks and connectors if one of them fails.
in which situations I should prefer connectors over the Spark streaming solution.
"It Depends" :-)
Kafka Connect is part of Apache Kafka, and so has tighter integration with Apache Kafka in terms of security, delivery semantics, etc.
If you don't want to write any code, Kafka Connect is easier because it's just JSON to configure and run.
If you're not using Spark already, Kafka Connect is arguably more straightforward to deploy (run the JVM, pass in the configuration).
As a framework, Kafka Connect is more transferable since the concepts are the same; you just plug in the appropriate connector for the technology that you want to integrate with each time.
Kafka Connect handles all the tricky stuff for you, like schemas, offsets, restarts, scale-out, etc.
Kafka Connect supports Single Message Transform for making changes to data as it passes through the pipeline (masking fields, dropping fields, changing data types, etc etc). For more advanced processing you would use something like Kafka Streams or ksqlDB.
If you are using Spark, and it's working just fine, then it's not necessarily prudent to rip it up to use Kafka Connect instead :)
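As an illustration of the per-record changes mentioned in the list above (masking fields, dropping fields, changing data types), here is a minimal plain-Python sketch of a transform chain. It only mimics the idea of applying small, stateless transforms to one record at a time; it is not the Kafka Connect API, and the function names and sample record are hypothetical.

```python
def mask_field(record, field):
    """Replace a field's value with an empty/zero placeholder of the same kind."""
    out = dict(record)
    if field in out:
        out[field] = "" if isinstance(out[field], str) else 0
    return out

def drop_field(record, field):
    """Remove a field from the record."""
    return {k: v for k, v in record.items() if k != field}

def cast_field(record, field, to_type):
    """Convert a field's value to another type."""
    out = dict(record)
    if field in out:
        out[field] = to_type(out[field])
    return out

def apply_chain(record, transforms):
    """Apply each transform in order; every step sees one record and keeps no state."""
    for transform in transforms:
        record = transform(record)
    return record

# Hypothetical chain: mask the email, drop a debug field, cast status to an integer.
chain = [
    lambda r: mask_field(r, "email"),
    lambda r: drop_field(r, "debug"),
    lambda r: cast_field(r, "status", int),
]

record = {"email": "a@example.com", "debug": "x", "status": "200", "path": "/home"}
result = apply_chain(record, chain)
print(result)
```

Anything that needs state across records, such as joins or aggregations, falls outside this model, which is where a stream processor comes in.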
Also, how fault tolerant is the Kafka connector solution? … how is fault tolerance (if possible) achieved with Kafka connectors?
Kafka Connect can be run in distributed mode, in which you have one or more worker processes across nodes. If a worker fails, Kafka Connect rebalances the tasks across the remaining ones. If you add a worker in, Kafka Connect will rebalance to ensure workload distribution. This was drastically improved in Apache Kafka 2.3 (KIP-415)
Kafka Connect uses the Kafka consumer API and tracks offsets of records delivered to a target system in Kafka itself. If the task or worker fails you can be sure that it will restart from the correct point. Many connectors support exactly-once delivery too (e.g. HDFS, Elasticsearch, etc)
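The restart behaviour described above can be sketched in plain Python as a simulation (this is not Kafka Connect code). It assumes a sink task that commits its offset every two records and a target system that accepts keyed upserts; the names run_sink_task and CrashError and the sample log are hypothetical. A record delivered after the last commit is delivered again on restart, and the keyed upsert keeps the target free of duplicates.

```python
class CrashError(Exception):
    """Simulated task failure."""

def run_sink_task(log, committed, target, deliveries, crash_at=None, commit_every=2):
    """Deliver records starting at the committed offset, committing periodically."""
    offset = committed["offset"]
    while offset < len(log):
        if crash_at is not None and offset == crash_at:
            raise CrashError(f"task failed before delivering offset {offset}")
        key, value = log[offset]
        target[key] = value      # idempotent upsert keyed by the record key
        deliveries.append(key)
        offset += 1
        if offset % commit_every == 0:
            committed["offset"] = offset  # stands in for offsets stored in Kafka
    committed["offset"] = offset

log = [(f"doc-{i}", {"n": i}) for i in range(5)]
committed = {"offset": 0}
target = {}
deliveries = []

try:
    run_sink_task(log, committed, target, deliveries, crash_at=3)
except CrashError:
    pass
offset_after_crash = committed["offset"]

# Restart: the task resumes from the last committed offset, not from the start.
run_sink_task(log, committed, target, deliveries)

print("offset after crash:", offset_after_crash)
print("deliveries:", deliveries)
print("documents in target:", len(target))
```

In this run doc-2 is delivered twice, but the target still ends up with exactly five documents.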
If you want to learn more about Kafka Connect see the docs here and my talk here. See a list of connectors here, and tutorial videos here.
Disclaimer: I work for Confluent and am a big fan of Kafka Connect :-)

Which framework should be used to aggregate and join data from Kafka topics and store it in MySQL

I have data in two Kafka topics, sourced from MySQL using debezium-connector-mysql-plugin.
Now I want to aggregate this data at a daily level and store it in another MySQL table.
Please suggest an approach.
Thanks.
You've not really laid out your requirements, other than commenting that you don't want to use Confluent Platform (but not said why).
In general, with data in Kafka (regardless of where it comes from) you have different options for processing it:
Bespoke consumer (probably a bad idea, given the availability of stream processing frameworks)
KSQL (use SQL to do your joins etc) - part of Confluent Platform
Kafka Streams - a Java library for doing stream processing. Part of Apache Kafka.
Flink, Spark Streaming, Samza, Heron, etc etc etc
It's up to you which you use, and it's going to come down to factors such as
Existing technology in use (no point deploying a Spark cluster if you don't need to; conversely, if you already use Spark and have lots of developers trained on it then it could make sense to use it)
Language familiarity of developers - does it have to be a Java API, or is SQL more accessible
Capabilities of the framework/tool - do you need tight security integration, exactly-once processing, CEP, etc etc. Some of these will rule in or out the tool that you use.
Once you've joined and aggregated your data, a good pattern to follow is to write it back to Kafka (thus coupling your design more loosely and separating the responsibilities of the components) and from there write it to MySQL using Kafka Connect and the JDBC Sink. Kafka Connect is part of Apache Kafka.
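As a rough sketch of the last step, the Python snippet below builds the JSON payload that would typically be submitted to a Kafka Connect worker's REST API (POST to /connectors) to create a JDBC sink. It assumes the Confluent JDBC sink connector is installed on the worker. The connector name, topic, connection URL, credentials, and key field are hypothetical placeholders, and the chosen insert and key modes are one possible setup, not a recommendation.

```python
import json

# All names and connection details below are hypothetical placeholders.
connector = {
    "name": "daily-aggregates-mysql-sink",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "tasks.max": "1",
        "topics": "daily_aggregates",           # topic holding the aggregated results
        "connection.url": "jdbc:mysql://localhost:3306/reporting",
        "connection.user": "connect_user",
        "connection.password": "change-me",
        "insert.mode": "upsert",                # update existing rows instead of appending
        "pk.mode": "record_key",                # take the primary key from the record key
        "pk.fields": "aggregate_id",
        "auto.create": "true",                  # let the connector create the table
    },
}

payload = json.dumps(connector, indent=2)
print(payload)
```

Upsert mode suits daily aggregates because each new result for the same key overwrites the previous row rather than adding another one.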
One final consideration: if you're taking data from MySQL to process it and then write it back into MySQL… do you even need Kafka? Is there an appropriate reason to be using it and not just doing this processing in MySQL itself?
Disclaimer: I work for Confluent.

Twitter data harvesting

For my project, I need to harvest data from Twitter.
I am currently facing two design choices:
What is the best software architecture? I read that Spark has Twitter support but I am not familiar with Scala. On the other hand, Apache Spark seems a good option, but then I'm not sure how to save data to a common sink.
I have some budget constraints. I surely need one server to do the sink and the processing. However, for the data harvesting, I don't know whether several VMs/containers offer a better performance/cost ratio than a bunch of Raspberry Pis running Kafka producers.
Take a look at Confluent platform and especially Kafka Connect [1].
There is a Twitter connector out of the box. All the twitter data will be streamed to Kafka.
[1] https://www.confluent.io/blog/using-ksql-to-analyse-query-and-transform-data-in-kafka
Agree with @leshkin that Kafka Connect is the most natural fit. However, the Twitter connector (available on GitHub here) does not require Confluent Platform, simply Kafka Connect, which is a standard part of the Apache Kafka distribution. https://kafka.apache.org/documentation/#connect
If you choose, you can run Kafka Connect workers in distributed mode to divide the load across several VMs/containers/boxes, and these don't have to be the same boxes you run your Kafka brokers on (they only need the relevant libs from Kafka, the libs for the connector, and Java of course).

how to stream from kafka to cassandra and increment counters

I have an Apache access log file and I want to store access counts (total/daily/hourly) for each page in a Cassandra table.
I am trying to do this by using Kafka Connect to stream from the log file to a Kafka topic. In order to increment metrics counters in Cassandra, can I use Kafka Connect again? Otherwise, which other tool should be used here, e.g. Kafka Streams, Spark, Flink, Kafka Connect, etc.?
You're talking about doing stream processing, which Kafka can do - either with Kafka's Streams API, or KSQL. KSQL runs on top of Kafka Streams, and gives you a very simple way to build the kind of aggregations that you're talking about.
Here's an example of doing aggregations of streams of data in KSQL:
SELECT PAGE_ID, COUNT(*) FROM PAGE_CLICKS WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY PAGE_ID;
See more at : https://www.confluent.io/blog/using-ksql-to-analyse-query-and-transform-data-in-kafka
You can take the output of KSQL which is actually just a Kafka topic, and stream that through Kafka Connect e.g. to Elasticsearch, Cassandra, and so on.
You mention other stream processing tools; they're valid too. The choice depends in part on existing skills and language preferences (e.g. Kafka Streams is a Java library, KSQL is … KSQL, Spark Streaming has Python as well as Java, etc.), but also on deployment preferences. Kafka Streams is just a Java library to deploy within your existing application. KSQL is deployable in a cluster, and so on.
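To show what the hourly tumbling-window count in the KSQL example computes, here is a minimal plain-Python sketch. It assumes each click is a (page_id, timestamp) pair with timezone-aware UTC timestamps; the sample clicks and the function names are hypothetical. Each event falls into exactly one fixed, non-overlapping one-hour window, and counts are kept per page and window.

```python
from collections import Counter
from datetime import datetime, timezone

WINDOW_SECONDS = 3600  # one-hour tumbling windows

def window_start(ts):
    """Round a timestamp down to the start of its one-hour window."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % WINDOW_SECONDS, tz=timezone.utc)

def count_per_window(clicks):
    """Count clicks per (page_id, window start)."""
    counts = Counter()
    for page_id, ts in clicks:
        counts[(page_id, window_start(ts))] += 1
    return counts

# Hypothetical sample clicks.
clicks = [
    ("/home", datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc)),
    ("/home", datetime(2024, 1, 1, 10, 59, tzinfo=timezone.utc)),
    ("/home", datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc)),
    ("/about", datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc)),
]

counts = count_per_window(clicks)
for (page_id, start), n in sorted(counts.items()):
    print(page_id, start.isoformat(), n)
```

Daily or all-time totals follow the same pattern with a larger window size or no window at all; in a real pipeline the results would be emitted to a topic and written to Cassandra by a sink connector.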
This can be easily done with Flink, either as a batch or streaming job, and either with or without Kafka (Flink can read from files and write to Cassandra). This sort of time-windowed aggregation is easily done with Flink's SQL API; see the examples here.
