Akka stream vs Hive Stream - apache-spark

I am working on a requirement where we need to read messages from Kafka and save (sink) them to Hive. I can think of multiple implementations using different technologies:
Akka Streams - where the source is Kafka and the sink is Hive
Hive Streaming - using the Hive streaming API
Spark Streaming
NiFi - https://nifi.apache.org/
What would be the best way to handle a large set of Kafka messages and stream them into Hive?
Thanks
Arun

Best is of course a very vague concept, but I personally like NiFi as a data movement solution.
If you are looking for fast development and clear monitoring, then the intuitive GUI should prove very valuable.
If you find that you cannot get enough performance, or good enough latency, you might be able to improve with Spark Streaming, but often that should not be needed.
Full disclosure: I have not worked with Akka Streams, and I work for Cloudera, a driving force behind NiFi, Spark, and Hive.
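For reference, here is a minimal sketch of the Spark option, reading the topic with Structured Streaming's Kafka source and appending each micro-batch into Hive via foreachBatch. The broker address, topic, table name, and checkpoint path are placeholders, and the Hive table is assumed to already exist:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    object KafkaToHive {
      def main(args: Array[String]): Unit = {
        // Hive support is needed to write into managed Hive tables
        val spark = SparkSession.builder()
          .appName("kafka-to-hive")
          .enableHiveSupport()
          .getOrCreate()

        // Read the topic as an unbounded stream of (key, value) pairs
        val stream = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
          .option("subscribe", "events")                     // placeholder
          .load()
          .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

        // foreachBatch (Spark 2.4+) hands us a plain DataFrame per
        // micro-batch, which we can append to a pre-created Hive table
        val query = stream.writeStream
          .foreachBatch { (batch: DataFrame, batchId: Long) =>
            batch.write.mode("append").insertInto("mydb.events")
          }
          .option("checkpointLocation", "/tmp/chk/kafka-to-hive")
          .start()

        query.awaitTermination()
      }
    }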

Related

Spark structured streaming from JDBC source

Can someone let me know if it's possible to do Spark Structured Streaming from a JDBC source? E.g. SQL DB or any RDBMS.
I have looked at a few similar questions on SO, e.g.
Spark streaming jdbc read the stream as and when data comes - Data source jdbc does not support streamed reading
jdbc source and spark structured streaming
However, I would like to know whether it is officially supported by Apache Spark.
If there is any sample code, that would be helpful.
Thanks
No, there is no such built-in support in Spark Structured Streaming. The main reason is that most databases don't provide a unified interface for obtaining the changes.
It's possible to get changes from some databases using archive logs, write-ahead logs, etc., but it's database-specific. For many databases the popular choice is Debezium, which can read such logs and push the list of changes into Kafka (or something similar), from which they can be consumed by Spark.
I am on a project now architecting this using SharePlex CDC from Oracle, writing to Kafka, and then using Spark Structured Streaming with Kafka integration and MERGE on Delta format on HDFS.
That is, this is the way to do it if not using Debezium. You can use change logs for base tables or materialized views to feed CDC.
So direct JDBC is not possible.
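A rough sketch of the consumption side of such a pipeline, assuming the CDC tool lands change events in a Kafka topic as flat JSON and the target is a Delta table on HDFS. The topic name, schema, key column, and paths are invented for illustration:

    import io.delta.tables.DeltaTable
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("cdc-merge").getOrCreate()

    // Assumed shape of the flattened change events
    val changeSchema = StructType(Seq(
      StructField("id", LongType),
      StructField("name", StringType),
      StructField("op", StringType) // insert/update/delete flag from the CDC tool
    ))

    val changes = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
      .option("subscribe", "oracle.changes")             // placeholder
      .load()
      .select(from_json(col("value").cast("string"), changeSchema).as("c"))
      .select("c.*")

    // Upsert each micro-batch into the Delta table with MERGE
    changes.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        DeltaTable.forPath(spark, "hdfs:///data/target")
          .as("t")
          .merge(batch.as("s"), "t.id = s.id")
          .whenMatched().updateAll()
          .whenNotMatched().insertAll()
          .execute()
      }
      .option("checkpointLocation", "hdfs:///chk/cdc-merge")
      .start()
      .awaitTermination()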

Spark Streaming vs Structured Streaming

For the last few months I've been using Structured Streaming quite a lot to implement stream jobs (after using Kafka a lot). After reading the book Stream Processing with Apache Spark, I had this question: is there any point or use case where I would use Spark Streaming instead of Structured Streaming? Should I invest some time getting into it, or, since I'm already using Spark Structured Streaming, should I stick with it on the assumption that there is no benefit in the previous API?
Would appreciate any opinion/insight
Hi, sharing my personal experience.
Structured Streaming is the future for Spark-based streaming implementations. It provides a higher level of abstraction and other great features. However, there are a few restrictions.
I have had to switch to Spark Streaming on a few occasions due to the flexibility it offers. One recent example: we had to perform joins with static reference data, but outer joins are not supported in Structured Streaming. This can be accomplished with Spark Streaming.
With the newer Spark version 2.4, Structured Streaming is much improved, with support for the foreachBatch sink, which gives similar flexibility to that offered by Spark Streaming.
My personal thought is that having knowledge of Spark Streaming is helpful, and you might have to use it depending on your use case.
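To illustrate the flexibility mentioned above, here is a minimal sketch of using foreachBatch to join each micro-batch against static reference data with a join type the streaming planner would otherwise reject. Broker, topic, paths, and column names are all placeholders:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder().appName("foreachbatch-join").getOrCreate()

    // Static reference data, loaded once; path and columns are assumptions
    val reference = spark.read.parquet("hdfs:///ref/customers")

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING) AS custId", "CAST(value AS STRING) AS payload")

    // Inside foreachBatch the micro-batch is a plain DataFrame, so join
    // types that Structured Streaming restricts become available
    events.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        batch.join(reference, batch("custId") === reference("id"), "left_outer")
          .write.mode("append").parquet("hdfs:///out/enriched")
      }
      .option("checkpointLocation", "hdfs:///chk/enriched")
      .start()
      .awaitTermination()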

Are Spark Streaming, Structured Streaming and Kafka Streaming the same thing?

I have come across three popular streaming techniques: Spark Streaming, Structured Streaming, and Kafka Streaming.
I have gone through various sites but haven't found the answer: are these three the same thing or different?
If not the same, what is the basic difference?
I am not looking for an in-depth answer, just an answer to the above question (yes or no) and a little intro to each of them so that I can explore more. :)
Thanks in advance
Subrat
I guess you are referring to Kafka Streams when you say "Kafka Streaming".
Kafka Streams is a JVM library, part of Apache Kafka. It is a way of processing data in Kafka topics, providing an abstraction layer. Applications using the Kafka Streams library can be run anywhere (not just in the Kafka cluster; in fact, that is not recommended). They consume, process, and produce data to/from the Kafka cluster.
Spark Streaming is a part of Apache Spark, a distributed data processing library, and provides stream (as opposed to batch) processing. Spark initially provided batch computation only, so a specific layer, Spark Streaming, was provided for stream processing. Spark Streaming can be fed with Kafka data, but it can be connected to other sources as well.
Structured Streaming, within the realm of Apache Spark, is a different approach that came to overcome certain limitations of the previous approach that Spark Streaming was using. It was added to Spark from a certain version onwards (2.0 IIRC).
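To make the contrast concrete, here is a minimal Kafka Streams sketch: a plain JVM program that reads one topic, transforms values, and writes to another, with no cluster scheduler involved. Topic names and the broker address are placeholders:

    import java.util.Properties
    import org.apache.kafka.common.serialization.Serdes
    import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
    import org.apache.kafka.streams.kstream.KStream

    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-app")
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092") // placeholder
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

    // Build a trivial topology: input topic -> upper-case values -> output topic
    val builder = new StreamsBuilder()
    val input: KStream[String, String] = builder.stream("input")
    input.mapValues(value => value.toUpperCase).to("output")

    // Runs in this JVM; scale out by starting more instances with the same app id
    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
    sys.addShutdownHook(streams.close())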

Reading a Message from Kafka and writing to HDFS

I'm looking for the best way to read messages (a lot of messages, around 100B each day) from Kafka; after reading a message I need to manipulate the data and write it into HDFS.
If I need to do it with the best performance, what is the best way for me to read messages from Kafka and write files into HDFS?
Which programming language is best for that?
Do I need to consider using solutions like Spark for that?
You should use Spark Streaming for this (see here); it provides a simple correspondence between Kafka partitions and Spark partitions.
Or you can use Kafka Streams (see more). Kafka Streams is a client library for building applications and microservices where the input and output data are stored in Kafka clusters.
You can use Spark, Flink, NiFi, StreamSets... but Confluent provides Kafka Connect HDFS exactly for this purpose.
The Kafka Connect API is somewhat limited in transformations, so what most people do is write a Kafka Streams job to filter/enhance the data into a secondary topic, which is then written to HDFS.
Note: These options will write many files to HDFS (generally, one per Kafka topic partition)
Which programming language is best for that?
Each of the above uses Java. But you don't need to write any code yourself if using NiFi, StreamSets, or Kafka Connect.
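As a minimal sketch of the Spark route, Structured Streaming's Kafka source feeding the Parquet file sink on HDFS looks roughly like this; the broker, topic, and paths are placeholders, and note the many-small-files behavior mentioned above:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kafka-to-hdfs").getOrCreate()

    // One Spark input partition per Kafka partition
    val messages = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
      .option("subscribe", "events")                     // placeholder
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // The file sink writes one file per partition per micro-batch,
    // so plan for compaction downstream
    messages.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/events")
      .option("checkpointLocation", "hdfs:///chk/events")
      .start()
      .awaitTermination()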

Streaming data from Kafka into Cassandra in real time

What's the best way to write data from Kafka into Cassandra? I would expect it to be a solved problem, but there doesn't seem to be a standard adapter.
A lot of people seem to be using Storm to read from Kafka and then write to Cassandra, but Storm seems like overkill for simple ETL operations.
We are heavily using Kafka and Cassandra through Storm
We rely on Storm because:
there are usually a lot of distributed (inter-node) processing steps before the result of the original message hits Cassandra (Storm bolt topologies)
we don't need to maintain the consumer state of Kafka (offsets) ourselves - the Storm-Kafka connector does it for us once all products of the original message are acked within Storm
message processing is distributed across nodes natively by Storm
Otherwise, if it is a very simple case, you might effectively read messages from Kafka and write the result to Cassandra without the help of Storm, as sketched below.
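For that simple case, here is a sketch of a bare consumer loop using the Java Kafka client and the DataStax driver, no framework involved; the topic, keyspace, table, and column names are all assumptions:

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import com.datastax.oss.driver.api.core.CqlSession

    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092") // placeholder
    props.put("group.id", "cassandra-writer")
    props.put("enable.auto.commit", "false")       // we commit offsets ourselves
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("events")) // placeholder topic

    // Assumes a table: CREATE TABLE mykeyspace.events (id text PRIMARY KEY, payload text)
    val session = CqlSession.builder().withKeyspace("mykeyspace").build()
    val insert = session.prepare("INSERT INTO events (id, payload) VALUES (?, ?)")

    while (true) {
      // Poll a batch, write each record, then commit the offsets
      val records = consumer.poll(java.time.Duration.ofSeconds(1))
      records.forEach(r => session.execute(insert.bind(r.key(), r.value())))
      consumer.commitSync()
    }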
A recent release of Kafka came with the connector concept, to support sources and sinks as first-class concepts in the design. With this, you do not need any streaming framework for moving data in/out of Kafka. Here is the Cassandra connector for Kafka that you can use: https://github.com/tuplejump/kafka-connect-cassandra
