How do I determine the number of Node Types, Number of nodes and VM size in Service Fabric cluster for a relatively simple but high throughput API? - azure

I have an Asp.Net core 2.0 Wen API that has a relatively simple logic (simple select on a SQL Azure DB, return about 1000-2000 records. No joins, aggregates, functions etc.). I have only 1 GET API. which is called from an angular SPA. Both are deployed in service fabric as as stateless services, hosted in Kestrel as self hosting exes.
considering the number of users and how often they refresh, I've determined there will be around 15000 requests per minute. in other words 250 req/sec.
I'm trying to understand the different settings when creating my Service Fabric cluster.
I want to know:
How many Node Types? (I've determined as Front-End, and Back-End)
How many nodes per node type?
What is the VM size I need to select?
I have ready the azure documentation on cluster capacity planning. while I understand the concepts, I don't have a frame of reference to determine the actual values i need to provide to the above questions.

In most places where you read about the planning of a cluster they will suggest that this subject is part science and part art, because there is no easy answer to this question. It's hard to answer it because it depends a lot on the complexity of your application, without knowing the internals on how it works we can only guess a solution.
Based on your questions the best guidance I can give you is, Measure first, Measure again, Measure... Plan later. Your application might be memory intensive, network intensive, CPU, Disk and son on, the only way to find the best configuration is when you understand it.
To understand your application before you make any decision on SF structure, you could simply deploy a simple cluster with multiple node types containing one node of each VM size and measure your application behavior on each of them, and then you would add more nodes and span multiple instances of your service on these nodes and see which configuration is a best fit for each service.
1.How many Node Types?
I like to map node types as 1:1 to roles on your application, but is not a law, it will depend how much resource each service will consume, if the service consume enough resource to make a single VM(node) busy (Memory, CPU, Disk, IO), this is a good candidate to have it's own node type, in other cases there are services that are light-weight that would be a waste of resources provisioning an entire VM(node) just for it, an example is scheduled jobs, backups, and so on, In this case you could provision a set of machines that could be shared for these services, one important thing you have to keep in mind when you share a node-type with multiple service is that they will compete for resources(memory, CPU, network, disk) and the performance measures you took for each service in isolation might not be the same anymore, so they would require more resources, the option is test them together.
Another point is the number of replicas, having a single instance of your service is not reliable, so you would have to create replicas of it(the right number I describe on next answer), in this case you end up with a service load split in to multiple nodes, making this node-type under utilized, is where you would consider joining services on same node-type.
2.How many nodes per node type?
As stated before, it will depend on your service resource consumption, but a very basic rule is a minimum of 3 per node type.
Why 3?
Because 3 is the lowest number where you could have a rolling update and guarantee a quorum of 51% of nodes\service\instances running.
1 Node: If you have a service running 1 instance in a node-type of 1 node, when you deploy a new version of your service, you would have to bring down this instance before the new comes up, so you would not have any instance to serve the load while upgrading.
2 Nodes: Similar to 1 node, but in this case you keep only 1 node running, in case of failure, you wouldn't have a failover to handle the load until the new instance come up, it will worse if you are running a stateful service, because you will have only one copy of your data during the upgrade and in case of failure you might loose data.
3 Nodes: During a update you still have 2 nodes available, when the one being updated get back, the next one is put down and you still have 2 nodes running, in case of failure of one node, the other node can support the load until a new node is deployed.
3 nodes does not mean the your cluster will be highly reliable, it means the chances of failure and data loss will be lower, you might be unlucky a loose 2 nodes at same time. As suggested in the docs, in production is better to always keep the number of nodes as 5 or more, and plan to have a quorum of 51% nodes\services available. In this case I would recommend 5, 7 or 9 nodes in cases you really need higher uptime 99.9999...%
3.What is the VM size I need to select?
As said before, only measurements will give this answer.
Observations:
These recommendations does not take into account the planning for primary node types, it is recommended to have at least 5 nodes on primary Node Types, it is where SF system services are placed, they are responsible to manage the
cluster, so they must be highly reliable, otherwise you risk losing control of your cluster. If you plan to share these nodes with your application services, keep in mind that your services might impact them, so you have to always monitor them to check for any impact it might cause.

Related

How to determine infra needs for a spark cluster

I am looking for some suggestions or resources on how to size-up servers for a spark cluster. We have enterprise requirements that force us to use on-prem servers only so I can't try the task on a public cloud (and even if I used fake data for PoC I would still need to buy the physical hardware later). The org also doesn't have a shared distributed compute env that I could use/I wasn't able to get good internal guidance on what to buy. I'd like to have some idea on what we need before I talk to a vendor who would try to up-sell me.
Our workload
We currently have a data preparation task that is very parallel. We implement it in python/pandas/sklearn + multiprocessing package on a set of servers with 40 skylake cores/80 threads and ~500GB RAM. We're able to complete the task in about 5 days by manually running this task over 3 servers (each one working on a separate part of the dataset). The tasks are CPU bounded (100% utilization on all threads) and usually the memory usage is low-ish (in the 100-200 GB range). Everything is scalable to a few thousand parallel processes, and some subtasks are even more paralellizable. A single chunk of data is in 10-60GB range (different keys can have very different sizes, a single chunk of data has multiple things that can be done to it in parallel). All of this parallelism is currently very manual and clearly should be done using a real distributed approach. Ideally we would like to complete this task in under 12 hours.
Potential of using existing servers
The servers we use for this processing workload are often used on individual basis. They each have dual V100 and do (single node, multigpu) GPU accelerated training for a big portion of their workload. They are operated bare metal/no vm. We don't want to lose this ability to use the servers on individual basis.
Looking for typical spark requirements they also have the issue of (1) only 1GB ethernet connection/switches between them (2) their SSDs are configured into a giant 11TB RAID 10 and we probably don't want to change how the file system looks like when the servers are used on individual basis.
Is there a software solution that could transform our servers into a cluster and back on demand or do we need to reformat everything into some underlying hadoop cluster (or something else)?
Potential of buying new servers
With the target of completing the workload in 12 hours, how do we go about selecting the correct number of nodes/node size?
For compute nodes
How do we choose number of nodes
CPU/RAM/storage?
Networking between nodes (our DC provides 1GB switches but we can buy custom)?
Other considerations?
For storage nodes
Are they the same as compute nodes?
If not how do we choose what is appropriate (our raw dataset is actually small, <1TB)
We extensively utilize a NAS as a shared storage between the servers, are there special consideration on how this needs to work with a cluster?
I'd also like to understand how I can scale up/down these numbers while still being able to viably complete the parallel processing workload. This way I can get a range of quotes => generate a budget proposal for 2021 => buy servers ~Q1.

Is implementing elastic search service on same server as node server with auto scaling is a good idea?

Trying to deploy a project on t3 large server with auto scaling.
I have my elastic search service deployed on same system as node and react projects.(Not using AWS elastic search)
Will it be facing issues in future and i need to segregate elastic search service to some other server?
It's always nice to have a separate dedicated server for running the Elasticsearch server but as you are using AWS some of the things which you can do to minimize the issues:
Elasticsearch is a stateful application contrast to your node and react app unless you are storing the state there as well which is not a good idea and due to stateless nature of the applications, autoscaling is very useful as you can on-demand based on the CPU, memory or other metrics scale up or down the instances.
But in case of Elasticsearch or other stateful applications, it becomes tricky as when you scale up or down the instance, shards get relocated if they are not reachable within a threshold which can lead to unbalanced Elasticsearech cluster.
Now in order to minimize these issues:
Make sure you can storing Elasticsearch indices on the network-attached disk so that there is no data loss when autoscaling brings a new instance and new instance again should use earlier network attaches EBS(where your data is stored).
Make sure you don't create a new Elasticsearch process when you scale up or down the instances according to your autoscaling policy and the Elasticsearch process should be fixed and scale up/down with some manual intervention.
If you have to scale up the Elasticsearch cluster then make sure you disable shard allocation to avoid the issues mentioned earlier.
These are some known issues which you might face and there could be even more based on your configuration and while writing the answer itself I felt, it so easy to just have a dedicated instance for Elasticsearch to avoid these weird issues.
I would add to other answers following:
Elasticsearch performs best if it has enough RAM to keep indexes in entirety in RAM. If the Elasticsearch is competing with Node/Application for RAM it will affect it's performance.
From maintenance/performance perspective you should consider having at least 3-node cluster. Even if that means you have smaller machines. If AWS is upgrading infrastructure and you have 1 machine, when than 0.05% unavailability hits your search is down. If you need to do maintenance on the node or do upgrades having multiple machines will help with availability.
Depending on your use of Elasticsearch and how often you update/delete items in the indexes, and how fast your indexes will grow, adding more machines/nodes to the cluster will help with growth.
There are probably many more things to consider, but that totally depends on your application, budget, SLAs etc.

SLURM Highly Availability Head Node

According to https://slurm.schedmd.com/quickstart_admin.html#HA high availability of SLURM is achieved by deploying a second BackupController which takes over when the primary fails and retrieves the current state from a shared file system (probably NFS).
In my opinion this has a number of drawbacks. E.g. this limits the total number of server to two and the second server is probably barely used.
Is this the only way to get a highly available head node with SLURM?
What I would like to do is a classic 3-tiered setup: A load balancer in the first tier which spreads all requests evenly across the nodes in the seconds tier. This requires the head node(s) to be stateless. The third tier is the database tier where all information is stored or read. I don't know anything about the internals of SLURM and I'm not sure if this is even remotely possible.
In the current design, the controller internal state is in-memory, and Slurm saves it to a set of files in the directory pointed to by the StateSaveLocation configuration parameter regularly. Only one instance of slurmctld can write to that directory at a time.
One problem with storing the state in the database would be a terrible latency in resource allocation with a lot of synchronisations needed, because optimal resource allocation can only be done with full information. The infrastructure needed to support the same level of throughput as Slurm can handle now with in-memory state would be very costly compared with the current solution implying only bitwise operations on arrays in memory.
Is this the only way to get a highly available head node with SLURM?
You can also have a single MasterController managed with Corosync. But indeed Slurm only has active/passive options available for HA.
In my opinion this has a number of drawbacks. E.g. this limits the
total number of server to two and the second server is probably barely
used.
The load on the controller is often very reasonable with respect to the current processing power, and the resource allocation problem cannot be trivially parallelised (or made stateless). Often, the backup controller is co-located on a machine running another service. For instance, on small deployments, one machine runs the Slurm primary controller, and other services (NFS, LDAP, etc.), etc. while another is the user login node, that also acts as a secondary Slurm controller.

Node.js scaling out on Kubernetes

I built an app on node.js using Docker and I'm not sure how to scale it on a Kubernetes cluster so that I take the most out of my cluster hardware.
From a performance perspective which of the following is better:
clusterize my node app and run as many containers as needed
or
just run as many containers as needed without clustering ?
When I say clustering I mean this https://nodejs.org/api/cluster.html
My app is a simple CRUD Api backed by mongoDB. We estimate that it will have 1000 concurrent users. Our cluster has 3 nodes.
The NodeJS cluster mechanism is useful to allow NodeJS to more effectively use greater than a single core, so depending on your code it may benefit you, but it's highly dependent on your code and the various dependencies and how well they work (or not) with clustering.
As a general practice, if you can break your containers down into nicely parallelized efforts that can be run as pods within kubernetes, then I'd recommend the following as a process to see what works for you:
set up a single pod with your code in it, and run a load test against it. Use the data that Kubernetes has from cAdvisor to characterize how much resources (cpu & memory) your pod likes to have.
set a resource limit for cpu and memory based on what you see above.
run a load test to validate what your single pod handles in terms of scale
And from there, you have a baseline where you can use Kubernetes to scale this horizontally to validate the 1000 user concurrent baseline you want to achieve.
There's a good talk on this process from the 2017 Kubecon called Load Testing Kubernetes: How to optimize your cluster resource allocation in production
Once you have a baseline, you can run a prototype out leveraging the clustering in your code, and then compare against the non-clustered version. If you do this, I'd double-check that any limits you set are > 1 core for CPU, or you'll be self-limiting outside of the NodeJS runtime to get access to multiple cores, which would defeat the purpose of using clustering.
Depending on what you're doing in your code, there may be significant re-work needed to enable clustering, as it wants to leverage its own worker concept, and it's not clear what frameworks you're using and if they'll fit reasonably into that structure.

Dynamic Service Creation to Distribute Load

Background
The problem we're facing is that we are doing video encoding and want to distribute the load to multiple nodes in the cluster.
We would like to constrain the number of video encoding jobs on a particular node to some maximum value. We would also like to have small video encoding jobs sent to a certain grouping of nodes in the cluster, and long video encoding jobs sent to another grouping of nodes in the cluster.
The idea behind this is to help maintain fairness amongst clients by partitioning the large jobs into a separate pool of nodes. This helps ensure that the small video encoding jobs are not blocked / throttled by a single tenant running a long encoding job.
Using Service Fabric
We plan on using an ASF service for the video encoding. With this in mind we had an idea of dynamically creating a service for each job that comes in. Placement constraints could then be used to determine which pool of nodes a job would run in. Custom metrics based on memory usage, CPU usage ... could be used to limit the number of active jobs on a node.
With this method the node distributing the jobs would have to poll whether a new service could currently be created that satisfies the placement constraints and metrics.
Questions
What happens when a service can't be placed on a node? (Using CreateServiceAsync I assume?)
Will this polling be prohibitively expensive?
Our video encoding executable is packaged along with the service which is approximately 80MB. Will this make the spinning up of a new service take a long time? (Minutes vs seconds)
As an alternative to this we could use a reliable queue based system, where the large jobs pool pulls from one queue and the small jobs pool pulls from another queue. This seems like the simpler way, but I want to explore all options to make sure I'm not missing out on some of the features of Service Fabric. Is there another better way you would suggest?
I have no experience with placement constraints and dynamic services, so I can't speak to that.
The polling of the perf counters isn't terribly expensive, that being said it's not a free operation. A one second poll interval shouldn't cause any huge perf impact while still providing a decent degree of resolution.
The service packages get copied to each node at deployment time rather than when services get spun up, so it'll make the deployment a bit slower but not affect service creation.
You're going to want to put the job data in reliable collections any way you structure it, but the question is how. One idea I just had that might be worth considering is making the job processing service a partitioned service and base your partitioning strategy based off encoding job size and/or tenant so that large jobs from the same tenant get stuck in the same queue, and smaller jobs for others go elsewhere.
As an aside, one thing I've dealt with in the past is SF remoting limits the size of the messages sent and throws if its too big, so if your video files are being passed from service to service you're going to want to consider a paging strategy for inter service communication.

Resources