How to determine infra needs for a spark cluster


I am looking for suggestions or resources on how to size servers for a Spark cluster. We have enterprise requirements that force us to use on-prem servers only, so I can't try the task on a public cloud (and even if I used fake data for a PoC, I would still need to buy the physical hardware later). The org also doesn't have a shared distributed compute environment that I could use, and I wasn't able to get good internal guidance on what to buy. I'd like to have some idea of what we need before I talk to a vendor who would try to up-sell me.
Our workload
We currently have a data preparation task that is very parallel. We implement it in python/pandas/sklearn + the multiprocessing package on a set of servers with 40 Skylake cores/80 threads and ~500GB RAM. We're able to complete the task in about 5 days by manually running it over 3 servers (each one working on a separate part of the dataset). The tasks are CPU bound (100% utilization on all threads) and memory usage is usually low-ish (in the 100-200 GB range). Everything is scalable to a few thousand parallel processes, and some subtasks are even more parallelizable. A single chunk of data is in the 10-60GB range (different keys can have very different sizes, and a single chunk of data has multiple things that can be done to it in parallel). All of this parallelism is currently very manual and clearly should be done using a real distributed approach. Ideally we would like to complete this task in under 12 hours.
Potential of using existing servers
The servers we use for this processing workload are often used on an individual basis. They each have dual V100s and do (single node, multi-GPU) GPU-accelerated training for a big portion of their workload. They are operated bare metal/no VM. We don't want to lose this ability to use the servers on an individual basis.
Looking at typical Spark requirements, they also have the issues that (1) there is only a 1 Gb Ethernet connection/switches between them and (2) their SSDs are configured into a giant 11TB RAID 10, and we probably don't want to change how the file system looks when the servers are used on an individual basis.
Is there a software solution that could transform our servers into a cluster and back on demand or do we need to reformat everything into some underlying hadoop cluster (or something else)?
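To put the network concern in numbers, here is a rough sketch of how long it takes to move one chunk between servers. It assumes the existing link is gigabit Ethernet (1 Gbit/s), uses the 10-60GB chunk sizes mentioned above, and compares against a hypothetical 10 Gbit/s upgrade. The function name and the efficiency parameter are illustrative only, and real throughput will be lower because of protocol overhead.

```python
def transfer_seconds(size_gb: float, link_gbit_per_s: float,
                     efficiency: float = 1.0) -> float:
    """Seconds needed to move size_gb gigabytes over a link.

    link_gbit_per_s is the nominal link speed in gigabits per second.
    efficiency (0..1] is an assumed fraction of the nominal speed that
    is actually achieved; 1.0 means an ideal link.
    """
    if link_gbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be > 0 and efficiency in (0, 1]")
    gigabits = size_gb * 8  # 1 byte = 8 bits
    return gigabits / (link_gbit_per_s * efficiency)


if __name__ == "__main__":
    for size_gb in (10, 60):        # chunk sizes from the question
        for link in (1, 10):        # 10 Gbit/s is a hypothetical upgrade
            secs = transfer_seconds(size_gb, link)
            print(f"{size_gb} GB over {link} Gbit/s: {secs:.0f} s")
```

Since each chunk takes hours of CPU time to process, a few minutes of transfer per chunk may be acceptable; this would only become a bottleneck if the job shuffles data between nodes repeatedly.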
Potential of buying new servers
With the target of completing the workload in 12 hours, how do we go about selecting the correct number of nodes/node size?
For compute nodes
How do we choose number of nodes
CPU/RAM/storage?
Networking between nodes (our DC provides 1 Gb switches but we can buy custom)?
Other considerations?
For storage nodes
Are they the same as compute nodes?
If not how do we choose what is appropriate (our raw dataset is actually small, <1TB)
We extensively utilize a NAS as a shared storage between the servers, are there special consideration on how this needs to work with a cluster?
I'd also like to understand how I can scale up/down these numbers while still being able to viably complete the parallel processing workload. This way I can get a range of quotes => generate a budget proposal for 2021 => buy servers ~Q1.
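As a starting point for that range, here is a back-of-envelope sketch based only on the numbers above (3 servers, 80 threads each, about 5 days today, 12 hours as the target). It assumes new nodes are equivalent to the existing ones and that the work scales with node count; the parallel efficiency values are hypothetical placeholders for Spark overhead, stragglers and skewed chunk sizes, not measurements.

```python
import math


def nodes_needed(current_nodes: int, current_hours: float,
                 target_hours: float, parallel_efficiency: float = 1.0) -> int:
    """Estimate how many equivalent nodes finish the job in target_hours.

    parallel_efficiency (0..1] is an assumed fraction of ideal linear
    scaling; 1.0 means perfect scaling.
    """
    if not 0 < parallel_efficiency <= 1:
        raise ValueError("parallel_efficiency must be in (0, 1]")
    node_hours = current_nodes * current_hours  # total work in node-hours
    return math.ceil(node_hours / (target_hours * parallel_efficiency))


CURRENT_NODES = 3
CURRENT_HOURS = 5 * 24      # about 5 days
THREADS_PER_NODE = 80
TARGET_HOURS = 12

if __name__ == "__main__":
    for eff in (1.0, 0.75, 0.5):   # hypothetical efficiencies
        n = nodes_needed(CURRENT_NODES, CURRENT_HOURS, TARGET_HOURS, eff)
        print(f"efficiency {eff:.2f}: {n} nodes, {n * THREADS_PER_NODE} threads")
```

With perfect scaling this gives 30 nodes (2400 threads), which is consistent with the workload being scalable to a few thousand parallel processes. Relaxing the deadline to 24 hours halves the node count, which is one way to generate a range of quotes.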

Related

Choosing the right EC2 instance for three NodeJS Applications

I'm running three MEAN stack programmes. Each application receives over 10,000 monthly users. Could you please assist me in finding an EC2 instance for my apps?
I've been using a "t3.large" instance with two vCPUs and eight gigabytes of RAM, but it costs $62 to $64 per month.
I need help deciding which EC2 instance to use for three Nodejs applications.

First check CloudWatch metrics for the current instances. Is CPU and memory usage consistent over time? Analysing the metrics could help you to decide whether you should select a smaller/bigger instance or not.
One way to avoid unnecessary costs is to use Auto Scaling groups and load balancers. With proper settings, you could always have the right amount of computing power for your applications.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html
https://docs.aws.amazon.com/autoscaling/ec2/userguide/auto-scaling-groups.html

It depends on your applications: do they need more compute power, more memory, or more storage? Choosing a server is similar to installing an app on a system. Check what the basic requirements are and then choose the server.
If you have 10k+ monthly customers, think about using an ALB so that traffic gets distributed evenly. Try caching to serve some content if possible. Use the unlimited burst mode of t3 instances if the CPU keeps hitting 100%. Also, try to optimize the code so that fewer resources are consumed. Once you are comfortable with your EC2 choice, try to purchase Savings Plans or RIs to reduce cost.
Also, monitor the servers and traffic using the CloudWatch agent, Internet Monitor and similar features.

SLURM High Availability Head Node

According to https://slurm.schedmd.com/quickstart_admin.html#HA high availability of SLURM is achieved by deploying a second BackupController which takes over when the primary fails and retrieves the current state from a shared file system (probably NFS).
In my opinion this has a number of drawbacks. E.g. this limits the total number of servers to two, and the second server is probably barely used.
Is this the only way to get a highly available head node with SLURM?
What I would like to do is a classic 3-tiered setup: a load balancer in the first tier, which spreads all requests evenly across the nodes in the second tier. This requires the head node(s) to be stateless. The third tier is the database tier, where all information is stored or read. I don't know anything about the internals of SLURM and I'm not sure if this is even remotely possible.

In the current design, the controller internal state is in-memory, and Slurm saves it to a set of files in the directory pointed to by the StateSaveLocation configuration parameter regularly. Only one instance of slurmctld can write to that directory at a time.
One problem with storing the state in the database would be a terrible latency in resource allocation with a lot of synchronisations needed, because optimal resource allocation can only be done with full information. The infrastructure needed to support the same level of throughput as Slurm can handle now with in-memory state would be very costly compared with the current solution implying only bitwise operations on arrays in memory.
"Is this the only way to get a highly available head node with SLURM?"
You can also have a single MasterController managed with Corosync. But indeed Slurm only has active/passive options available for HA.
"In my opinion this has a number of drawbacks. E.g. this limits the total number of servers to two and the second server is probably barely used."
The load on the controller is often very reasonable with respect to the current processing power, and the resource allocation problem cannot be trivially parallelised (or made stateless). Often, the backup controller is co-located on a machine running another service. For instance, on small deployments, one machine runs the Slurm primary controller and other services (NFS, LDAP, etc.), while another is the user login node, which also acts as a secondary Slurm controller.

How do I determine the number of Node Types, Number of nodes and VM size in Service Fabric cluster for a relatively simple but high throughput API?

I have an ASP.NET Core 2.0 Web API that has relatively simple logic (a simple select on a SQL Azure DB that returns about 1000-2000 records; no joins, aggregates, functions etc.). I have only 1 GET API, which is called from an Angular SPA. Both are deployed in Service Fabric as stateless services, hosted in Kestrel as self-hosting exes.
Considering the number of users and how often they refresh, I've determined there will be around 15000 requests per minute, in other words 250 req/sec.
I'm trying to understand the different settings when creating my Service Fabric cluster.
I want to know:
How many Node Types? (I've determined as Front-End, and Back-End)
How many nodes per node type?
What is the VM size I need to select?
I have read the Azure documentation on cluster capacity planning. While I understand the concepts, I don't have a frame of reference to determine the actual values I need to provide for the above questions.

In most places where you read about the planning of a cluster they will suggest that this subject is part science and part art, because there is no easy answer to this question. It's hard to answer it because it depends a lot on the complexity of your application, without knowing the internals on how it works we can only guess a solution.
Based on your questions, the best guidance I can give you is: measure first, measure again, measure... plan later. Your application might be memory intensive, network intensive, CPU intensive, disk intensive and so on; the only way to find the best configuration is to understand it.
To understand your application before you make any decision on SF structure, you could simply deploy a simple cluster with multiple node types containing one node of each VM size and measure your application behavior on each of them, and then you would add more nodes and span multiple instances of your service on these nodes and see which configuration is a best fit for each service.
1. How many Node Types?
I like to map node types 1:1 to the roles in your application, but that is not a law; it depends on how many resources each service consumes. If a service consumes enough resources (memory, CPU, disk, IO) to keep a single VM (node) busy, it is a good candidate for its own node type. Other services are so lightweight that provisioning an entire VM (node) just for them would be a waste of resources; examples are scheduled jobs, backups, and so on. In that case you could provision a set of machines shared by these services. One important thing to keep in mind when multiple services share a node type is that they will compete for resources (memory, CPU, network, disk), so the performance measurements you took for each service in isolation might no longer hold and the services may require more resources; the option is to test them together.
Another point is the number of replicas. Having a single instance of your service is not reliable, so you would have to create replicas of it (I describe the right number in the next answer). You then end up with the service's load split across multiple nodes, leaving this node type underutilized; this is where you would consider joining services on the same node type.
2. How many nodes per node type?
As stated before, it will depend on your service resource consumption, but a very basic rule is a minimum of 3 per node type.
Why 3?
Because 3 is the lowest number where you can do a rolling update and still guarantee a quorum of 51% of nodes/services/instances running.
1 Node: If you have a service running 1 instance in a node type of 1 node, then when you deploy a new version of your service you have to bring down this instance before the new one comes up, so you have no instance to serve the load while upgrading.
2 Nodes: Similar to 1 node, but in this case you keep only 1 node running during the upgrade. If it fails, you have no failover to handle the load until the new instance comes up. It gets worse if you are running a stateful service, because you have only one copy of your data during the upgrade, and in case of failure you might lose data.
3 Nodes: During an update you still have 2 nodes available. When the one being updated comes back, the next one is taken down and you still have 2 nodes running. If one node fails, the other node can support the load until a new node is deployed.
3 nodes does not mean that your cluster will be highly reliable; it means the chances of failure and data loss will be lower. You might be unlucky and lose 2 nodes at the same time. As suggested in the docs, in production it is better to always keep the number of nodes at 5 or more, and plan to have a quorum of 51% of nodes/services available. I would recommend 5, 7 or 9 nodes in cases where you really need higher uptime (99.9999...%).
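The node counts above can be checked with a small sketch. It models the "51%" rule as a strict majority of nodes, which is my reading of the answer rather than a Service Fabric API, and the function names are made up for illustration.

```python
def majority(n: int) -> int:
    """Smallest number of nodes that is more than half of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many nodes can be down while a majority is still running."""
    return n - majority(n)


def survives_update_plus_failure(n: int) -> bool:
    """True if a majority remains with one node down for a rolling
    update and one additional node failing at the same time."""
    return tolerated_failures(n) >= 2


if __name__ == "__main__":
    for n in (1, 2, 3, 5, 7, 9):
        print(f"{n} nodes: quorum={majority(n)}, "
              f"can lose {tolerated_failures(n)}, "
              f"update + failure ok: {survives_update_plus_failure(n)}")
```

Under this model 3 is the smallest count that tolerates one node being down, and 5 is the smallest that tolerates an unplanned failure during a rolling update, which lines up with the recommendation of 5 or more for production.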
3. What is the VM size I need to select?
As said before, only measurements will give this answer.
Observations:
These recommendations do not take into account the planning for primary node types. It is recommended to have at least 5 nodes on primary node types; that is where the SF system services are placed. They are responsible for managing the cluster, so they must be highly reliable, otherwise you risk losing control of your cluster. If you plan to share these nodes with your application services, keep in mind that your services might impact them, so you have to always monitor them to check for any impact they might cause.

Azure VM pricing - Is it better to have 80 single core machines or 10 8-core machines?

I am limited by a piece of software that utilizes a single core per instance of the program run. It will run off an SQL server work queue and deposit results to the server. So the more instances I have running the faster the overall project is done. I have played with Azure VMs a bit and can speed up the process in two ways.
1) I can run the app on a single core VM, clone that VM and run it on as many as I feel necessary to speed up the job sufficiently.
OR
2) I can run the app 8 times on an 8-core VM, ...again clone that VM and run it on as many as I feel necessary to speed up the job sufficiently.
I have noticed in testing that the speed-up is roughly the same for adding 8 single-core VMs and 1 8-core VM. Assuming this is true, would it be better price-wise to have single-core machines?
The pricing is a bit of a mystery, whether it is real cpu usage time, or what. It is a bit easier using the 1 8-core approach as spinning up machines and taking them down takes time, but I guess that could be automated.
It does seem from some pricing pages that the multiple single core VM approach would cost less?
Side question: could I use some PowerShell scripts to just keep adding VMs of a certain image and running the app, and then start shutting them down once I get close to finishing? After generating the VMs, would there be some way to kick off the app without having to remote in to each one and run it?

I would argue that all else being equal, and this code truly being CPU-bound and not benefitting from any memory sharing that running multiple processes on the same machine would provide, you should opt for the single core machines rather than multi-core machines.
Reasons:
Isolate fault domains
Scaling out rather than up is better to do when possible because it naturally isolates faults. If one of your small nodes crashes, that only affects one process. If a large node crashes, multiple processes go down.
Load balancing
Windows Azure, like any multi-tenant system, is a shared resource. This means you will likely be competing for CPU cycles with other workloads. Having small VMs gives you a better chance of having them distributed across physical servers in the datacenter that have the best resource situation at the time the machines are provisioned (you would want to make sure to stop and deallocate the VMs before starting them again to allow the Azure fabric placement algorithms to select the best hosts). If you used large VMs, it would be less likely to find a suitable host with optimal contention to accommodate many virtual cores.
Virtual processor scheduling
It's not widely understood how scheduling a virtual CPU is different than scheduling a physical one, but it is something worth reading up on. The main thing to remember is that hypervisors like VMware ESXi and Hyper-V (which runs Azure) schedule multiple virtual cores together rather than separately. So if you have an 8-core VM, the physical host must have 8 physical cores free simultaneously before it can allow the virtual CPU to run. The more virtual cores, the more unlikely the host will have sufficient physical cores at any given time (even if 7 physical cores are free, the VM cannot run). This can result in a paradoxical effect of causing the VM to perform worse as more virtual CPU cores are added to it. http://www.perfdynamics.com/Classes/Materials/BradyVirtual.pdf
In short, a single vCPU machine is more likely to get a share of the physical processor than an 8 vCPU machine, all else equal.
And I agree that the pricing is basically the same, except for a little more storage cost to store many small VMs versus fewer large ones. But storage in Azure is far less expensive than the compute, so likely doesn't tip any economic scale.
Hope that helps.

Billing
According to Windows Azure Virtual Machines Pricing Details, Virtual Machines are charged by the minute (of wall clock time). Prices are listed as hourly rates (60 minutes) and are billed based on total number of minutes when the VMs run for a partial hour.
In July 2013, 1 Small VM (1 virtual core) costs $0.09/hr; 8 Small VMs (8 virtual cores) cost $0.72/hr; 1 Extra Large VM (8 virtual cores) cost $0.72/hr (same as 8 Small VMs).
VM Sizes and Performance
The VMs sizes differ not only in number of cores and RAM, but also on network I/O performance, ranging from 100 Mbps for Small to 800 Mbps for Extra Large.
Extra Small VMs are rather limited in CPU and I/O power and are inadequate for workloads such as you described.
For single-threaded, I/O bound applications such as described in the question, an Extra Large VM could have an edge because of faster response times for each request.
It's also advisable to benchmark workloads running 2, 4 or more processes per core. For instance, 2 or 4 processes in a Small VM and 16, 32 or more processes in an Extra Large VM, to find the adequate balance between CPU and I/O loads (provided you don't use more RAM than is available).
Auto-scaling
Auto-scaling of Virtual Machines is built directly into Windows Azure. It can be based either on CPU load or on the length of Windows Azure Queues.
Another alternative is to use specialized tools or services to monitor load across the servers and run PowerShell scripts to add or remove virtual machines as needed.
Auto-run
You can use the Windows Scheduler to automatically run tasks when Windows starts.

The pricing is "Uptime of the machine in hours * rate of the VM size/hour * number of instances"
e.g. You have a 8 Core VM (Extra Large) running for a month (30 Days)
(30 * 24) * 0.72$ * 1= 518.4$
for 8 single cores it will be
(30 * 24) * 0.09 * 8 = 518.4$
So I doubt if there will be any price difference. One advantage of using smaller machines and "scaling out" is when you have more granular control over scalability. An Extra-large machine will eat more idle dollars than 2-3 small machines.
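The same arithmetic as a small sketch, using the July 2013 rates quoted in this thread ($0.09/hr for Small, $0.72/hr for Extra Large). The "idle dollars" scenario at the end, where only 2 of 8 workers still have work for 10 hours, is a made-up illustration and not a measurement.

```python
import math

SMALL_RATE = 0.09        # $/hr, 1 core, rate quoted in this thread
EXTRA_LARGE_RATE = 0.72  # $/hr, 8 cores, rate quoted in this thread


def vm_cost(hours: float, hourly_rate: float, instances: int = 1) -> float:
    """Uptime in hours * hourly rate of the VM size * number of instances."""
    return hours * hourly_rate * instances


month_hours = 30 * 24
one_xl = vm_cost(month_hours, EXTRA_LARGE_RATE)
eight_small = vm_cost(month_hours, SMALL_RATE, 8)

# Hypothetical tail of a job: only 2 workers still have work for 10 hours.
# The Extra Large VM must stay up as a whole; 6 of the Small VMs can stop.
tail_xl = vm_cost(10, EXTRA_LARGE_RATE)
tail_small = vm_cost(10, SMALL_RATE, 2)

if __name__ == "__main__":
    print(f"1 Extra Large for a month: ${one_xl:.2f}")
    print(f"8 Small for a month:       ${eight_small:.2f}")
    print(f"10h tail, 1 Extra Large:   ${tail_xl:.2f}")
    print(f"10h tail, 2 Small:         ${tail_small:.2f}")
    print("same monthly cost:", math.isclose(one_xl, eight_small))
```

At full utilization the two options cost the same; the difference only shows up when part of the capacity can be shut down.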
Yes you can definitely script this. Assuming they are IaaS machines you could add the script to windows startup, if on PaaS you could use the "Startup Task".

Azure compute instances

On Azure I can get 3 extra small instances for the price of 1 small. I'm not worried about my site not scaling.
Are there any other reasons I should not go for 3 extra small instead of 1 small?
See: Azure pricing calculator.

An Extra Small instance is limited to approx. 5Mbps bandwidth on the NIC (vs. approx. 100Mbps per core with Small, Medium, Large, and XL), and has less than 1GB of RAM. So, let's say you're running something that's very storage-intensive. You could run into bottlenecks accessing SQL Azure or Windows Azure storage.
With RAM: If you're running 3rd-party apps, such as MongoDB, you'll likely run into memory issues.
From a scalability standpoint, you're right that you can spread the load across 2 or 3 Extra Small instances, and you'll have a good SLA. Just need to make sure your memory and bandwidth are good enough for your performance targets.
For more details on exact specs for each instance size, including NIC bandwidth, see this MSDN article.

Look at the fine print - the I/O performance is supposed to be much better with the small instance compared to the x-small instance. I am not sure if this is due to a technology related bottleneck or a business decision, but that's the way it is.
Also I'm guessing the OS takes a bit of RAM in each of the instances, so in 3 X-small instances it takes it up three times instead of just once in a small instance. That would reduce the resources that are actually available for your application needs.

While 3 xtra-small instances may theoretically equal or even be better "on paper" than one small instance, do remember that xtra-small instances do not have dedicated cores and their raw computing resources are shared with other tenants. I tried these xtra-small instances in an attempt to save money for a tiny-load website, and I must say that there were outages or times of horrible performance that I found unacceptable.
In short: I would not use xtra-small instances for any sort of production environment
