Kafka using Docker for production clusters - linux

We need to build a Kafka production cluster with 3-5 nodes.
We have the following options:
Kafka in Docker containers (the Kafka cluster includes ZooKeeper and Schema Registry on each node)
Kafka cluster without Docker (the Kafka cluster includes ZooKeeper and Schema Registry on each node)
Since we are talking about a production cluster, we need good performance: we have heavy reads and writes to disk (disk size is 10 TB), so good I/O performance matters, etc.
So does Kafka in Docker meet the requirements for production clusters?
More info: https://www.infoq.com/articles/apache-kafka-best-practices-to-optimize-your-deployment/

It can be done, sure. I have no personal experience with it, but if you don't already have experience managing other stateful containers, I'd suggest avoiding it.
As far as "getting started" with Kafka in containers, Kubernetes is the most documented way, and Strimzi (free, optional commercial support by Lightbend) or Confluent Operator (commercial support by Confluent) can make this easy when using Kubernetes or Openshift. Or DC/OS offers a Kafka service over Mesos/Marathon. If you don't already have any of these services, then I think it's apparent that you should favor not using containers.
Bare metal or virtualized deployments would be much easier to maintain than hand-deployed containerized ones, from what I have experienced. Particularly for logging, metric gathering, and statically assigned Kafka listener mappings over the network. Confluent provides Ansible scripts for doing deployments to such environments
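To make the listener-mapping concern concrete, here is a minimal, hedged sketch (the bootstrap hostname is hypothetical) that asks the brokers what addresses they advertise, using the confluent-kafka Python client. Whatever each broker returns must be resolvable and reachable from every client, which is exactly the part that gets fiddly once brokers live inside containers:

```python
from confluent_kafka.admin import AdminClient

# Hypothetical bootstrap address; requires the confluent-kafka Python package.
admin = AdminClient({"bootstrap.servers": "kafka-1.example.com:9092"})
metadata = admin.list_topics(timeout=10)

# The host:port each broker advertises must be resolvable and reachable from
# every client; with containers this is governed by advertised.listeners.
for broker in metadata.brokers.values():
    print(f"broker {broker.id} advertises {broker.host}:{broker.port}")
```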
That isn't to say there aren't companies that have been successful at it, or at least have tried; IBM, Red Hat, and Shopify immediately pop up in my searches, for example.
Here are a few talks about things to consider when running Kafka in containers:
https://www.confluent.io/kafka-summit-london18/kafka-in-containers-in-docker-in-kubernetes-in-the-cloud
https://kafka-summit.org/sessions/running-kafka-kubernetes-practical-guide/

Related

Flink vs Spark deployment modes on a multi-node cluster

In Spark, these are the three cluster (not local) deployment options that I am familiar with:
Standalone
Mesos
YARN
There might be more cluster deployment options, but I am concerned with these three. All three of the above support client and cluster modes of deployment. In client mode, the driver program runs on the edge machine itself; in cluster mode, the driver is launched on one of the worker nodes inside the cluster.
Now, on the Flink side, I only have experience with a single-node setup, which I learned from a tutorial that did not really elaborate on the ecosystem and was focused more on code than on also providing the big picture. I was therefore looking at the deployment options in Flink to understand this. The documentation talks about all three options: Standalone, Mesos, and YARN, but it is not clear from the docs whether Flink supports what we in Spark's jargon would call client mode, cluster mode, both, or some other mode.
The idea is to replace a Spark cluster with a Flink one, and I want to understand the steps while I carry them out. The steps are available in the docs, but the rationale behind them is either implicit (too implicit for me to follow) or just not there.
An explanation by Apache Flink experts/contributors would help.
There was recently a discussion about this topic on the Flink mailing list:
(Topic name: [DISCUSS] Semantic and implementation of per-job mode)
https://lists.apache.org/thread.html/6c688a73b281d38670a74f05d63f2858f59da1f37bc7211640de7ca8#%3Cuser.flink.apache.org%3E
Currently, all job submission from the Flink CLI works like client mode in Spark.
An opt-in option for something similar to cluster mode will probably be available in the future (as the mailing list seems to indicate), especially given the rapidly increasing number of Flink deployments in Kubernetes clusters.

While creating an Azure HDInsight cluster for Starburst Presto, can I create a Spark cluster?

While creating infrastructure for big data, I wanted to use Azure HDInsight with a Presto installation. Azure HDInsight comes in different flavors, such as Hadoop and Spark. The documentation recommends using a Hadoop cluster, but I want to use the Spark one.
Is it possible to use a Spark cluster with Starburst's Presto distribution?
It looks like you want to use both Presto and Spark at the same time.
If you run them on a single cluster, you would need to configure them appropriately to make sure the JVMs of the different processes can co-exist. This is possible, but hard to do in practice (you need to know how the JVM allocates memory beyond the -Xmx setting), so it's definitely not recommended.
I can imagine that in some on-premises installations, where provisioning new hardware is hard, you might want to colocate services on one cluster. In the cloud, however, it's much more convenient to provision two separate clusters, each sized appropriately for your particular needs and workload. For example, you could have one cluster with Presto for interactive analytics, dashboarding, and ad-hoc queries, and another with Spark for your machine learning or ETL workloads.
Please refer to the Starburst Presto on Azure documentation for detailed configuration instructions.

How to set up Spark with a multi-node Cassandra cluster?

First of all, I am not using DSE Cassandra. I am building this on my own and using Microsoft Azure to host the servers.
I have a 2-node Cassandra cluster. I've managed to set up Spark on a single node, but I couldn't find any online resources about setting it up on a multi-node cluster.
This is not a duplicate of "how to setup spark Cassandra multi node cluster?".
To set it up on a single node, I've followed this tutorial "Setup Spark with Cassandra Connector".
You have two high-level tasks here:
set up Spark (single node or cluster);
set up Cassandra (single node or cluster);
These tasks are different and not related (unless we are talking about data locality).
How to set up Spark in a cluster is described in the Architecture overview.
Generally there are two types: standalone, where you set up Spark on the hosts directly, or using a task scheduler (YARN, Mesos); choose based on your requirements.
Since you built everything yourself, I suppose you will use the standalone installation. The difference from a single-node setup is network communication. By default Spark runs on localhost; in a cluster it more commonly uses FQDNs, so you should configure them in /etc/hosts and hostname -f, or try IP addresses.
Take a look at this page, which lists all the ports necessary for node communication. All of these ports should be open and reachable between nodes.
Note that by default Spark uses TorrentBroadcastFactory with random ports.
For Cassandra, see these docs: 1, 2, tutorials 3, etc.
You will likely need 4. You could also run Cassandra on Mesos using Docker containers.
P.S. If data locality is your concern, you will have to come up with something of your own, because neither Mesos nor YARN handles scheduling Spark jobs for partitioned data closer to the Cassandra partitions.
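Once both clusters are up and the ports are open, wiring Spark to Cassandra is usually done through the spark-cassandra-connector. Here is a minimal, hedged PySpark sketch (the hostnames, keyspace, table, and pinned ports are hypothetical, and the connector version must match your Spark/Scala build):

```python
from pyspark.sql import SparkSession

# Hypothetical hostnames, keyspace and table; adjust the connector version
# to match your Spark and Scala versions.
spark = (
    SparkSession.builder
    .appName("spark-cassandra-example")
    .master("spark://spark-master.example.com:7077")            # standalone master (FQDN, see above)
    .config("spark.jars.packages",
            "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1")
    .config("spark.cassandra.connection.host",
            "cassandra-1.example.com,cassandra-2.example.com")  # Cassandra contact points
    .config("spark.driver.port", "7078")                        # pin ports if a firewall
    .config("spark.blockManager.port", "7079")                  # blocks random ports
    .getOrCreate()
)

df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="my_table")
      .load())
df.show()
```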

How to create a Spark or TensorFlow cluster based on containers with Mesos or Kubernetes?

After reading the discussions about the differences between Mesos and Kubernetes, and kubernetes-vs.-mesos-vs.-swarm, I am still confused about how to create a Spark and TensorFlow cluster with Docker containers on some bare metal hosts plus an AWS-like private cloud (OpenNebula).
Currently, I am able to build a static TensorFlow cluster with Docker containers manually distributed to different hosts. I only run a standalone Spark on a bare metal host. How to manually set up a Mesos cluster for containers can be found here.
Since my resources are limited, I would like to find a way to deploy Docker containers onto the current mixed infrastructure to build either a TensorFlow or a Spark cluster, so that I can do data analysis with either TensorFlow or Spark on the same resources.
Is it possible to quickly create/run/undeploy a Spark or TensorFlow cluster with Docker containers on a mixed infrastructure with Mesos or Kubernetes? How can I do that?
Any comments and hints are welcome.
Given that you have limited resources, I suggest you have a look at the Spark Helm chart, which gives you:
1 x Spark Master with port 8080 exposed on an external LoadBalancer
3 x Spark Workers with HorizontalPodAutoscaler to scale to max 10 pods when CPU hits 50% of 100m
1 x Zeppelin with port 8080 exposed on an external LoadBalancer
If this configuration doesn't work, you can build your own Docker images and deploy those; take a look at this blog series. There is work underway to make Spark more Kubernetes-friendly. This issue also gives some insight.
I have not looked into TensorFlow, but I suggest you look at this blog.
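As a quick sanity check after the chart is deployed, you can read the standalone master's JSON status endpoint, which it serves on the same port as its web UI (8080 above). A hedged sketch, with a hypothetical LoadBalancer address:

```python
import requests

# Hypothetical LoadBalancer address for the Spark master web UI exposed by the chart.
MASTER_UI = "http://spark-master.example.com:8080"

# The standalone master serves its status as JSON at /json/.
status = requests.get(f"{MASTER_UI}/json/", timeout=10).json()
print("master state:", status.get("status"))
for worker in status.get("workers", []):
    print(worker["id"], worker["state"], worker["cores"], "cores")
```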

Which cluster type should I choose for Spark?

I am new to Apache Spark, and I just learned that Spark supports three types of cluster:
Standalone - meaning Spark will manage its own cluster
YARN - using Hadoop's YARN resource manager
Mesos - Apache's dedicated resource manager project
I think I should try Standalone first. In the future, I need to build a large cluster (hundreds of instances).
Which cluster type should I choose?
Spark Standalone Manager: A simple cluster manager included with Spark that makes it easy to set up a cluster. By default, each application uses all the available nodes in the cluster.
A few benefits of YARN over Standalone & Mesos:
YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN.
You can take advantage of all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads.
The Spark standalone mode requires each application to run an executor on every node in the cluster, whereas with YARN, you choose the number of executors to use.
YARN directly handles rack and machine locality in your requests, which is convenient.
The resource request model is, oddly, backwards in Mesos. In YARN, you (the framework) request containers with a given specification and give locality preferences. In Mesos, you get resource "offers" and choose to accept or reject them based on your own scheduling policy. The Mesos model is arguably more flexible, but seemingly more work for the person implementing the framework.
If you have a big Hadoop cluster already in place, YARN is the better choice.
The Standalone manager requires the user to configure each of the nodes with a shared secret. Mesos' default authentication module, Cyrus SASL, can be replaced with a custom module. YARN has security for authentication, service-level authorization, authentication for web consoles, and data confidentiality. Hadoop authentication uses Kerberos to verify that each user and service is authenticated by Kerberos.
High availability is offered by all three cluster managers but Hadoop YARN doesn’t need to run a separate ZooKeeper Failover Controller.
Useful links:
spark documentation page
agildata article
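To make the comparison above concrete from the application's side, here is a hedged PySpark sketch (hypothetical hostnames; the YARN case assumes HADOOP_CONF_DIR/YARN_CONF_DIR point at your cluster configuration). Switching managers is mostly a matter of the master setting, and with YARN you can choose the number of executors, as mentioned above:

```python
from pyspark.sql import SparkSession

# Standalone: point the application at the standalone master (hypothetical host).
spark_standalone = (
    SparkSession.builder
    .appName("standalone-example")
    .master("spark://spark-master.example.com:7077")
    .getOrCreate()
)
spark_standalone.stop()

# YARN: assumes HADOOP_CONF_DIR/YARN_CONF_DIR are set so Spark can find the ResourceManager.
spark_yarn = (
    SparkSession.builder
    .appName("yarn-example")
    .master("yarn")
    .config("spark.executor.instances", "4")   # with YARN you choose how many executors
    .getOrCreate()
)
```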
I think the people best placed to answer that are those who work on Spark. So, from Learning Spark:
Start with a standalone cluster if this is a new deployment. Standalone mode is the easiest to set up and will provide almost all the same features as the other cluster managers if you are only running Spark.
If you would like to run Spark alongside other applications, or to use richer resource scheduling capabilities (e.g. queues), both YARN and Mesos provide these features. Of these, YARN will likely be preinstalled in many Hadoop distributions.
One advantage of Mesos over both YARN and standalone mode is its fine-grained sharing option, which lets interactive applications such as the Spark shell scale down their CPU allocation between commands. This makes it attractive in environments where multiple users are running interactive shells.
In all cases, it is best to run Spark on the same nodes as HDFS for fast access to storage. You can install Mesos or the standalone cluster manager on the same nodes manually, or most Hadoop distributions already install YARN and HDFS together.
Standalone is pretty clear: as others mentioned, it should be used only when you have a Spark-only workload.
Between YARN and Mesos, one thing to consider is that, unlike MapReduce, a Spark job grabs executors and holds them for the entire lifetime of the job, whereas in MapReduce a job can acquire and release mappers and reducers over its lifetime.
If you have long-running Spark jobs that, during their lifetime, don't fully utilize all the resources they were given at the beginning, you may want to share those resources with other applications, and you can only do that via Mesos or Spark dynamic allocation: https://spark.apache.org/docs/2.0.2/job-scheduling.html#scheduling-across-applications
So with YARN, the only way to have dynamic allocation for Spark is by using the dynamic allocation that Spark itself provides; YARN won't interfere with that, while Mesos will. Again, this whole point only matters if you have a long-running Spark application that you would like to scale up and down dynamically.
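For reference, here is a hedged sketch of what Spark's own dynamic allocation looks like on YARN (the app name is hypothetical, and the external shuffle service must be enabled on the cluster's NodeManagers for this to work):

```python
from pyspark.sql import SparkSession

# Hedged sketch: Spark dynamic allocation on YARN. Requires the external
# shuffle service to be running on the NodeManagers.
spark = (
    SparkSession.builder
    .appName("dynamic-allocation-example")
    .master("yarn")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .getOrCreate()
)
```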
In this case and similar dilemmas in data engineering, there are many side questions to be answered before choosing one distribution method over another.
For example, if you are not running your processing engine on more than 3 nodes, the problem is usually not big enough for the performance-tuning margin between YARN and Spark Standalone (in my experience) to settle the decision, because you will usually try to keep your pipeline simple, especially when your services are not cloud-managed and bugs and failures happen often.
I choose Standalone for relatively small or uncomplicated pipelines, but if I have a Hadoop cluster already in place, I prefer to take advantage of all the extra configuration options that Hadoop (YARN) can give me.
Mesos has a more sophisticated scheduling design, allowing applications like Spark to negotiate with it. It's better suited to the diversity of applications today. I found this article really insightful:
https://www.oreilly.com/ideas/a-tale-of-two-clusters-mesos-and-yarn
"... YARN is optimized for scheduling Hadoop jobs, which are historically (and still typically) batch jobs with long run times. This means that YARN was not designed for long-running services, nor for short-lived interactive queries (like small and fast Spark jobs), and while it’s possible to have it schedule other kinds of workloads, this is not an ideal model. The resource demands, execution model, and architectural demands of MapReduce are very different from those of long-running services, such as web servers or SOA applications, or real-time workloads like those of Spark or Storm..."

Resources