Spark Streamin improving data pipeline but maintiaining continuous delivery - apache-spark

I am developing a data pipeline that will consume data from Kafka, process it via spark streaming and ingest it into Cassandra.
The data pipeline I will go into production will definitely evolve after several months. But how to move from old to new data pipeline, but to maintain continuous delivery and avoid any data loss?
Thank you

The exact solution will depend on the specific requirements of your application. In general, Kafka will serve you as buffer. Messages going into Kafka are preserved following the topic expiration time.
In Spark streaming, you need to track the consumed offsets either automatically through snapshots, or manually (we do the later as that provides more recovery options).
Then you can stop, deploy a new version and restart your pipeline from where you previously left. In this model, messages are processed with at-least-once semantics and zero data loss.

Related

Kubernetes Vs Spark Vs Spark on kubernetes

So I have a use case where I will stream about 1000 records per minute from kafka. I just need to dump these records in raw form in a no sql db or something like a data lake for that matter
I ran this through two approaches
Approach 1
——————————
Create kafka consumers in java and run them as three different containers in kubernetes. Since all the containers are in the same kafka consumer group, they would all contribute towards reading from same kafka topic and dump data into data lake. This works pretty quick for the volume of work load I have
Approach 2
——————————-
I then created a spark cluster and the same java logic to read from kafka and dump data in data lake
Observations
———————————-
Performance of kubernetes if not bad was equal to that of a spark job running in clustered mode.
So my question is, what is the real use case for using spark over kubernetes the way I am using it or even spark on kubernetes?
Is spark only going to rise and shine much much heavier work loads let’s say something of the order of 50,000 records per minute or cases where some real time processing needs to be done on the data before dumping it to the sink?
Spark has more cost associated to it so I need to make sure I use it only if it would scale better than kuberbetes solution
If your case is only to archive/snapshot/dump records I would recommend you to look into the Kafka Connect.
If you need to process the records you stream, eg. aggregate or join streams, then Spark comes into the game. Also for this case you may look into the Kafka Streams.
Each of these frameworks have its own tradeoffs and performance overheads, but in any case you save much development efforts using the tools made for that rather than developing your own consumers. Also these frameworks already support most of the failures handling, scaling, and configurable semantics. Also they have enough config options to tune the behaviour to most of the cases you can imagine. Just choose the available integration and you're good to go! And of course beware the open source bugs ;) .
Hope it helps.
Running kafka inside Kubernetes is only recommended when you have a lot of expertise doing it, as Kubernetes doesn't know it's hosting Spark, and Spark doesn't know its running inside Kubernetes you will need to double check for every feature you decide to run.
For your workload, I'd recommend sticking with Kubernetes. The elasticity, performance, monitoring tools and scheduling features plus the huge community support adds well on the long run.
Spark is a open source, scalable, massively parallel, in-memory execution engine for analytics applications so it will really spark when your load become more processing demand. It simply doesn't have much room to rise and shine if you are only dumping data, so keep It simple.

Avoiding re-processing of data during Spark Structured Streaming application updates

I am using Structured Streaming with Spark 2.2. We are using Kafka as our source and are using checkpoints for failure recovery and e2e exactly once guarantees. I would like to get some more information on how to handle updates to the application when there is a change in stateful operations and/or output schema.
As some of the sources suggest I can start the updated application parallel to the old application until it catches up with the old application in terms of data, and then kill the old one. But then the new application will have to re-read/re-process all the data in Kafka which could take a long time.
I want to avoid this re-processing of the data in the newly deployed updated application.
One way I can think of is for the application to keep writing the offsets into something in addition to the checkpoint directory, for example in zookeeper/hdfs. And then, on an update of the application, I command Kafka readstream() to start reading from the offsets stored in this new location (zookeeper/hdfs) - since the updated application can't read from the checkpoint directory which is now deemed incompatible.
So a couple of questions:
Is the above-stated solution a valid solution?
If yes, How can I automate the detection of whether the application is being restarted because of a failure/maintenance or because of code changes to stateful operations and/or output schema?
Any guidance, example or information source is appreciated.

Spark streaming + Kafka vs Just Kafka

Why and when one would choose to use Spark streaming with Kafka?
Suppose I have a system getting thousand messages per seconds through Kafka. I need to apply some real time analytics on these messages and store the result in a DB.
I have two options:
Create my own worker that reads messages from Kafka, run the analytics algorithm and store the result in DB. In a Docker era it is easy to scale this worker through my entire cluster with just scale command. I just need to make sure I have an equal or grater number of partitions than my workers and all is good and I have a true concurrency.
Create a Spark cluster with Kafka streaming input. Let the Spark cluster to do the analytics computations and then store the result.
Is there any case when the second option is a better choice? Sounds to me like it is just an extra overhead.
In a Docker era it is easy to scale this worker through my entire cluster
If you already have that infrastructure available, then great, use that. Bundle your Kafka libraries in some minimal container with health checks, and what not, and for the most part, that works fine. Adding a Kafka client dependency + a database dependency is all you really need, right?
If you're not using Spark, Flink, etc, you will need to handle Kafka errors, retries, offset and commit handling more closely to your code rather than letting the framework handle those for you.
I'll add in here that if you want Kafka + Database interactions, check out the Kafka Connect API. There's existing solutions for JDBC, Mongo, Couchbase, Cassandra, etc. already.
If you need more complete processing power, I'd go for Kafka Streams rather than needing to separately maintain a Spark cluster, and so that's "just Kafka"
Create a Spark cluster
Let's assume you don't want to maintain that, or rather you aren't able to pick between YARN, Mesos, Kubernetes, or Standalone. And if you are running the first three, it might be worth looking at running Docker on those anyway.
You're exactly right that it is extra overhead, so I find it's all up to what you have available (for example, an existing Hadoop / YARN cluster with idle memory resources), or what you're willing to support internally (or pay for vendor services, e g. Kafka & Databricks in some hosted solution).
Plus, Spark isn't running the latest Kafka client library (up until 2.4.0 updated to Kafka 2.0, I believe), so you'll need to determine if that's a selling point.
For actual streaming libraries, rather than Spark batches, Apache Beam or Flink would probably let you do the same types of workloads against Kafka
In general, in order to scale a producer / consumer, you need some form of resource scheduler. Installing Spark may not be difficult for some, but knowing how to use it efficiently and tune for appropriate resources can be

Why do we need kafka to feed data to apache spark

I am reading about spark and its real-time stream processing.I am confused that If spark can itself read stream from source such as twitter or file, then Why do we need kafka to feed data to spark? It would be great if someone explains me what advantage we get if we use spark with kafka. Thank you.
Kafka offers a decoupling and buffering of your input stream.
Take Twitter data for example, afaik you connect to the twitter api and get a constant stream of tweets that match criteria you specified. If you now shut down your Spark jobs for an hour do to some mainentance on your servers or roll out a new version, then you will miss tweets from that hour.
Now imagine you put Kafka in front of your Spark jobs and have a very simple ingest thread that does nothing but connect to the api and write tweets to Kafka, where the Spark jobs retrieve them from. Since Kafka persists everything to disc, you can shut down your processing jobs, perform maintenance and when they are restarted, they will retrieve all data from the time they were offline.
Also, if you change your processing jobs in a significant way and want to reprocess data from the last week, you can easily do that if you have Kafka in your chain (provided you set your retention time high enough) - you'd simply roll out your new jobs and change the offsets in Kafka so that your jobs reread old data and once that is done your data store is up to date with your new processing model.
There is a good article on the general principle written by Jay Kreps, one of the people behind Kafka, give that a read if you want to know more.
Kafka decouples everything,Consumer-Producer need not to know about each other.
Kafka provides pub-sub model based on topic.
From multiple sources you can write data(messages) to any topic in kafka, and consumer(spark or anything) can consume data based on topic.
Multiple consumer can consume data from same topic as kafka stores data for period of time.
But at the end, it's depends ion your use-case if you really need a broker.

Memsql Spark-Kafka Transform Failure

We have a Spark Cluster running under Memsql, We have different Pipelines running, The ETL setup is as below.
Extract:- Spark read Messages from Kafka Cluster (Using Memsql Kafka-Zookeeper)
Transform:- We have a custom jar deployed for this step
Load:- Data from Transform stage is Loaded in Columnstore
I have below doubts:
What Happens to the Message polled from Kafka, if the Job fails in Transform stage
- Does Memsql takes care of loading that Message again
- Or, the data is Lost
If the data gets Lost, how can I solve this Problem, is there any configuration changes which needs to done for this?
As it stands, at least once semantics are not available in MemSQL Ops. It is on the roadmap and will be present in one of the future releases of Ops.
If you haven't yet, you should check out MemSQL 5.5 Pipelines.
http://blog.memsql.com/pipelines/
This one isn't based on spark, (and transforms are done a bit differently so you might have to rewrite your code), but we have native kafka streams now.
The way we get exactly once with the native version is simple; store the offsets in the database same atomic transaction as the actual data. If something fails and the transaction isn't committed, the offsets won't be committed, so we'll naturally and automatically retry that partition-offset-range.

Resources