Kubernetes cluster architecture - Cassandra

Does it make sense to create a separate Kubernetes cluster for my Cassandra instances and one cluster for the application layer? Is the DB cluster accessible from the service cluster when both are in the same region and zone?
Or is it better to have one cluster with different node pools - one pool for the service layer and one for the DB nodes?
Thanks

This is largely a judgment call about how you want to design your overall architecture. Here are some things to consider:
Same cluster:
Pros
Workloads don't need to cross to a different pod CIDR to reach their data.
You can make better use of the same set of servers.
This is one of the main reasons people use container orchestrators and containers: they let you run multiple different types of workloads on the same set of resources.
Cons
If the cluster running Cassandra has an issue, you risk losing your data, or losing access to it temporarily while you restore from backups (longer downtime).
If you want to strictly isolate the database from the application layer for security reasons, it is harder to do.
Different clusters:
Pros
'Safer' if one of your clusters goes down.
More separation in terms of security for your data at rest.
Cons
Resources may not be optimally utilized, leaving some CPU, memory, etc. idle.
More infrastructure management.
Different node pools:
Pros
Separation of data at rest.
Traffic still goes over the same pod CIDR, so there is no cross-cluster hop (see the sketch after this list for pinning Pods to a dedicated pool).
Cons
More management of the different node pools.
Resources may not be optimally utilized.
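For the node-pool option, the usual mechanism is a nodeSelector, optionally combined with a taint on the DB pool so that only Cassandra Pods can land there. A minimal sketch, written as a TypeScript object that mirrors the YAML manifest; the pool name "db-pool", the "dedicated=cassandra" taint and the image tag are illustrative assumptions:

```typescript
// Pod spec fragment that pins Cassandra onto a dedicated GKE node pool.
const cassandraPodSpec = {
  // GKE labels every node with the name of its node pool.
  nodeSelector: {
    "cloud.google.com/gke-nodepool": "db-pool",
  },
  // Only Pods that tolerate the pool's taint can be scheduled onto it,
  // assuming the DB pool was created with a matching NoSchedule taint.
  tolerations: [
    { key: "dedicated", operator: "Equal", value: "cassandra", effect: "NoSchedule" },
  ],
  containers: [
    {
      name: "cassandra",
      image: "cassandra:4.1",
      ports: [{ containerPort: 9042, name: "cql" }],
    },
  ],
};
```

The service-layer Pods need no toleration, so they stay off the DB pool automatically while still sharing the same cluster network and control plane.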

Related

Does clustering in Node.js and auto-scaling a web application with Kubernetes serve the same purpose?

Node.js provides the Cluster module to scale an application across CPU cores for better performance, and Kubernetes appears to do the same thing.
I'm confused about whether both serve the same purpose. My assumption is that the Cluster module can spawn at most 8 processes (if there are 4 CPU cores with 2 threads each), while Kubernetes has no such limitation.
Kubernetes and the Node.js Cluster module operate at different levels.
Kubernetes is in charge of orchestrating containers (amongst many other things). From its perspective, there are resources to be allocated, and deployments that require or use a specific amount of resources.
The Node.js Cluster module behaves as a load-balancer that forks N times and spreads the requests between the various processes it owns, all within the limits defined by its environment (CPU, RAM, Network, etc).
In practice, Kubernetes can spawn additional Node.js containers (scaling horizontally), while the Cluster module can only grow within its own machine (scaling vertically).
From a performance perspective the two approaches can be roughly similar (you can use the same number of cores either way), but scaling vertically on a single machine loses the high-availability aspect that Kubernetes provides. If you instead deploy several Node.js containers on different machines, you are much more tolerant of the day one of them goes down.
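For reference, this is roughly what the Cluster module's fork-per-core pattern looks like; a minimal sketch in TypeScript, where the port number is arbitrary:

```typescript
import cluster from "node:cluster";
import { createServer } from "node:http";
import { cpus } from "node:os";

if (cluster.isPrimary) {
  // The primary process only supervises: fork one worker per CPU thread.
  for (let i = 0; i < cpus().length; i++) {
    cluster.fork();
  }
  // Restart workers that exit; this is the extent of the "self-healing"
  // you get on a single machine.
  cluster.on("exit", (worker) => {
    console.log(`worker ${worker.process.pid} exited, forking a new one`);
    cluster.fork();
  });
} else {
  // Workers share one listening port; the primary distributes incoming connections.
  createServer((_req, res) => {
    res.end(`handled by pid ${process.pid}\n`);
  }).listen(8080);
}
```

Kubernetes does the analogous thing one level up: instead of forking processes on one machine, it schedules additional Pods, potentially on other machines.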

Is it possible to isolate spark cluster nodes for each individual application

We have a Spark cluster comprising 16 nodes. Is it possible to limit nodes 1 & 2 for application 'A'; nodes 3, 4, 5 for application 'B'; nodes 10, 11, 12, 15 for application 'C'; and so on?
From the documentation, I understand that we can set some properties to control Spark executor cores, the number of executors to launch, memory, etc. But I am curious to know whether I can achieve the above use case.
The one obvious way to do this is to configure three separate clusters with the desired topology; otherwise you're out of luck, as Spark has no provision for it.
That is because it is usually a bad idea, and it goes against the design principles of Spark and of clustering in general. Why? If you pin application A to specific hosts but it sits idle while application B is running at 100%, you have two idle hosts that could be working for B, so you are wasting costly computing resources. Usually what you want is to assign a certain amount of resources per application and let the system decide how to allocate them (scheduling in plain Spark is fairly basic, but under YARN or Mesos you can be more sophisticated).
Another reason it's a bad idea is that you don't want rules that name a specific host or set of hosts. What if you assign nodes 1 & 2 to application A and they both go down? Besides using your resources inefficiently, tying your app to specific hosts also makes it hard to make it resilient to failure by rescheduling it on other hosts.
There are other ways to get something similar, though: if you run Spark under YARN or Mesos, you can define queues or quotas and limit the amount of resources each application can use at a given time.
In general, it depends on why you want to statically allocate resources to applications. If it's for resource management, you should instead look at schedulers and queues. If it's for security, you should run multiple clusters, keeping in mind that you'll lose some performance.

Selecting a node size for a GKE kubernetes cluster

We are debating the best node size for our production GKE cluster.
Is it better to have more, smaller nodes or fewer, larger nodes in general?
e.g. we are choosing between the following two options
3 x n1-standard-2 (7.5GB 2vCPU)
2 x n1-standard-4 (15GB 4vCPU)
We run on these nodes:
Elastic search cluster
Redis cluster
PHP API microservice
Node API microservice
3 x separate Node / React websites
Two things to consider in my opinion:
Replication:
Services like Elasticsearch or Redis Cluster/Sentinel can only provide reliable redundancy if enough Pods run the service on distinct nodes: if you have 2 nodes and 5 Elasticsearch Pods, chances are 3 Pods will land on one node and 2 on the other, so your effective replication is 2. If two replica Pods happen to be on the same node and it goes down, you lose the whole index (an anti-affinity rule that spreads replicas is sketched below).
[EDIT]: if you use persistent block storage (best for durability, but more complex to set up since each node needs its own volume, which makes scaling trickier), you would not 'lose the whole index'; that caveat applies when you rely on local storage.
For this reason, more nodes is better.
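One way to keep two replicas of the same service off a single node is a podAntiAffinity rule; a rough sketch, again as a TypeScript object mirroring the YAML, where the app: elasticsearch label is an assumption:

```typescript
// Asks the scheduler never to place two Pods labelled app=elasticsearch on the
// same node (topologyKey = hostname). With only 2 nodes, at most 2 such Pods
// can then be scheduled, which is exactly why more nodes helps here.
const spreadAcrossNodes = {
  affinity: {
    podAntiAffinity: {
      requiredDuringSchedulingIgnoredDuringExecution: [
        {
          labelSelector: { matchLabels: { app: "elasticsearch" } },
          topologyKey: "kubernetes.io/hostname",
        },
      ],
    },
  },
};
```

Using preferredDuringSchedulingIgnoredDuringExecution instead makes this a soft preference, so extra replicas still schedule but may stack up on the same node.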
Performance:
Obviously, you need enough resources. Smaller nodes have fewer resources, so a Pod that starts receiving a lot of traffic will hit its limits sooner and is more likely to be evicted.
Elasticsearch is quite a memory hog; you'll have to work out whether running all of these Pods requires bigger nodes.
In the end, as your needs grow, you will probably want a mix of nodes of different capacities. In GKE, nodes carry labels for their machine type and node pool, which you can combine with CPU and memory requests and limits, and you can also add your own labels to ensure certain Pods end up on certain types of nodes (see the sketch below).
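As a rough illustration of that last point, here are resource requests/limits combined with a selector on the built-in instance-type label (a custom label would work the same way); the numbers and image tag are placeholders, not sizing advice:

```typescript
// Pod spec fragment: reserve memory/CPU for Elasticsearch and steer it onto
// the larger machine type via the well-known instance-type node label.
const elasticsearchPod = {
  nodeSelector: {
    "node.kubernetes.io/instance-type": "n1-standard-4",
  },
  containers: [
    {
      name: "elasticsearch",
      image: "elasticsearch:8.13.4",
      resources: {
        requests: { cpu: "1", memory: "4Gi" },  // what the scheduler reserves
        limits: { cpu: "2", memory: "6Gi" },    // hard cap before throttling/OOM
      },
    },
  ],
};
```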

What's the best way to mirror a live Cassandra cluster for analytics tasks?

Assuming a live cluster with several DCs, what's the best way to set up some nodes dedicated to analytic queries?
Analytics nodes will be hosted in a separate (routed) network and must not write any data back to the production nodes. They must also not be counted toward any consistency level (CL); this especially applies to EACH_QUORUM, which will be used for some writes. Analytics nodes may be offline at any time.
All solutions I've looked into seem to have their own drawbacks.
1) Take snapshots on production and transfer to independent analytics cluster
Significant update delay
IO intensive either on network or disk (e.g. rsync)
Lots of duplicate data due to different replication factors (3:1 prod. vs analytics)
Mismatch in SSTable row ranges and cluster topology on the analytics cluster may require using sstableloader
2) Use write survey mode to establish read-only nodes
Not 100% sure how to set up multiple survey nodes so they cover the whole ring
Queries can only be executed against each node locally, as they cannot be part of a coordinated execution
3) Add regular DC dedicated for analytics
EACH_QUORUM will fail if the analytics DC is not available
Queries on production should not be served from analytics
Would require a way to prevent analytics users from executing queries or updates on production
Any other options or existing tools that could be used?

Minimizing Hazelcast chatter and overhead

Our app needs to run both on a large number of machines and on a single standalone machine. It has three distinct clusters, each performing a mostly isolated function. Cluster A is the main one, and clusters B & C are independent, but they both need access to a map in A to know where to route requests. Access needs to be super-fast.
Which setup should I choose?
1) Each cluster has its own Hazelcast instance. Clusters B & C are also lite members of the A instance.
2) Each cluster has its own Hazelcast instance. Clusters B & C use a Hazelcast client to talk to A.
3) One giant instance for all clusters.
I'm concerned about chatter and overhead as the clusters get larger, to potentially hundreds of machines. Which setup is most scalable?
Also, is there a writeup anywhere which details the messages that Hazelcast passes around? I'd like to know exactly what happens when a key gets added or removed, for example.
Try to avoid the lite-member setup (1), as clusters with lite members are harder to maintain.
If all these machines/nodes are on the same local network and the number of nodes is around 50, you can go with (3), all in one cluster. Otherwise I would go with (2), since clients scale really well and are very lightweight.
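For option (2), the client side is small; a minimal sketch assuming the hazelcast-nodejs-client package, where the cluster name, member address and map name are made-up examples:

```typescript
import { Client } from "hazelcast-client";

async function lookupRoute(key: string): Promise<string | null> {
  // Connect as a client of cluster A: clients are not cluster members, so they
  // do not add to member-to-member chatter or partition ownership.
  const client = await Client.newHazelcastClient({
    clusterName: "cluster-a",                         // assumed cluster name
    network: { clusterMembers: ["10.0.0.10:5701"] },  // assumed member address
  });
  try {
    // Read the routing map owned by cluster A; a Near Cache can be enabled in
    // the client config if B and C need repeated low-latency reads.
    const routes = await client.getMap<string, string>("routing");
    return await routes.get(key);
  } finally {
    await client.shutdown();
  }
}
```

In practice you would keep the client connected rather than reconnect per lookup; the point is that B and C stay outside A's membership ring, so A's heartbeat and partition traffic do not grow as B and C scale.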
