apache-spark deployment: standalone vs. multiple VMs

I have one machine on which to deploy Spark, Hadoop, and Tachyon.
Are Spark operations reading from HDFS/Tachyon going to be faster on a single node with all of the cores/RAM, or on a number of VM nodes that evenly divide the resources?
RAM is < 200 GB.
Performance and Scalability of Broadcast in Spark is quite old, but it suggests that the increased network traffic could be a strong negative in the single-node vs. VMs question.

It's probably better to have multiple worker instances: while there is an increase in network overhead, JVM performance with a really large heap isn't great.
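As a rough illustration of the "several modest JVMs" idea, here is a minimal sketch for a standalone cluster that asks for several small executors instead of one executor with an enormous heap. The master URL and the memory/core figures are placeholders for a ~200 GB machine, not tuned recommendations.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: cap each executor's heap and cores so the standalone master launches
// several modest JVMs rather than one JVM with a near-200 GB heap.
object SmallHeapExecutors {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("small-heap-executors-sketch")
      .setMaster("spark://master-host:7077")   // hypothetical master URL
      .set("spark.executor.memory", "24g")     // keep each executor heap modest
      .set("spark.executor.cores", "4")        // cores per executor
      .set("spark.cores.max", "32")            // total cores for this application

    val sc = new SparkContext(conf)
    // With 4 cores per executor and a 32-core cap, the master can launch up to
    // 8 executors of 24 GB each instead of one huge-heap JVM.
    println(sc.getConf.toDebugString)
    sc.stop()
  }
}
```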

Related

Frequent Compaction of OpsCenter.rollup_state on all the nodes consuming CPU cycles

I am using DataStax Cassandra 4.8.16, with a cluster of 8 DCs and 5 nodes in each DC, running on VMs. For the last couple of weeks we have observed the performance issues below:
1) Increased drop count on the VMs.
2) LOCAL_QUORUM not achieved for some write operations.
3) Frequent compaction of OpsCenter.rollup_state and system.hints visible in OpsCenter.
I would appreciate any help finding the root cause of this.
The presence of dropped mutations means that the cluster is heavily overloaded. It could be an increase in the main load which, combined with the load from OpsCenter, overloads the system. You need to look at statistics on the number of requests, latencies, etc., per node and per table, to see where the increase happened. Please also check the I/O statistics on the machines (for example, with iostat): queue sizes, read/write latencies, etc.
It's also recommended to use a dedicated cluster to store OpsCenter metrics; it can be smaller and doesn't require an additional DSE license. As the OpsCenter documentation puts it:
Important: In production environments, DataStax strongly recommends storing data in a separate DataStax Enterprise cluster.
Regarding VMs: it's usually not a recommended setup, but that heavily depends on the underlying hardware, i.e. the number of CPUs, the RAM, and the disk system.

What is the proper setup for spark with cassandra

After using and playing around with the Spark connector, I want to utilize it in the most efficient way for our batch processes.
Is the proper approach to set up a Spark worker on the same host a Cassandra node runs on? Does the Spark connector ensure data locality?
I am a bit concerned that a memory-intensive Spark worker will bring the entire machine down, and then I will lose a Cassandra node. So I'm unsure whether I should place the workers on the Cassandra nodes or keep them separate (which means no data locality). What is the common way, and why?
This depends on your particular use case. Some things to be aware of:
1) CPU sharing: while memory will not be shared between Spark and Cassandra (the heaps are separate), there is nothing stopping Spark executors from stealing time on C* CPU cores. This can lead to load and slowdowns in C* if the Spark process is very CPU-intensive. If it isn't, then this isn't much of a problem.
2) Network speed: if your network is very fast, there is much less value in locality than if you are on a slower network.
So you have to ask yourself whether you want a simpler setup (everything in one place) or a more complicated but more isolated setup.
For instance, DataStax (the company I work for) ships Spark colocated with Cassandra by default, but we also offer the option of running it separately. Most of our users colocate, possibly because of this default; those who don't usually do so for easier scaling.
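To make the locality point concrete, here is a minimal sketch using the DataStax spark-cassandra-connector; the keyspace/table names and contact host are placeholders for illustration. When an executor runs on the same host as the Cassandra node owning a token range, the connector reports that host as the preferred location for the corresponding Spark partition, so tasks are scheduled next to the data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ // adds cassandraTable() to SparkContext

object LocalitySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-locality-sketch")
      // Placeholder contact point; with colocated workers the connector can
      // schedule each partition on the node that owns its token range.
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // "test_ks" / "users" are hypothetical names used only for this sketch.
    val users = sc.cassandraTable("test_ks", "users")

    // preferredLocations() exposes the locality hints the connector supplies
    // to the scheduler for each partition.
    users.partitions.take(3).foreach { p =>
      println(s"partition ${p.index} prefers ${users.preferredLocations(p)}")
    }

    println(s"row count: ${users.count()}")
    sc.stop()
  }
}
```

If the preferred host is not running an executor (i.e. Spark is deployed on separate machines), the task simply runs elsewhere and the rows travel over the network, which is where the network-speed trade-off above comes in.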

How to do performance tuning in production cluster for spark job?

Let's assume we have a Spark job for which we do all the performance tuning and get it running in a development environment with a limited configuration (1 node, 32 GB RAM, 500 GB hard disk).
Obviously our production cluster is going to be much larger. How can the tuning parameters measured in the development environment be useful on the production cluster? Is it advisable to tune jobs directly on the production cluster?
How is this done in real projects?
Shameless plug (I am the author): try Sparklens, https://github.com/qubole/sparklens. Most of the time the real question is not whether the application is slow, but whether it will scale. And for most applications the answer is: only up to a limit.
The structure of a Spark application puts important constraints on its scalability. The number of tasks in a stage, the dependencies between stages, skew, and the amount of work done on the driver side are the main constraints.
One of the best features of Sparklens is that it simulates and tells you how your Spark application will perform with different executor counts. That looks perfect for your problem.
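For reference, here is a hedged sketch of enabling Sparklens from application code via Spark's extra-listener mechanism. The listener class name is taken from the Sparklens README as I recall it; verify it, and the package coordinates for the jar itself, against the project page before relying on it.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: register the Sparklens listener so that, when the
// application ends, it reports (among other things) how the job would have
// performed with different executor counts.
object SparklensSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("sparklens-sketch")
      .set("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener")

    val sc = new SparkContext(conf)
    // ... the job you want to analyse goes here; a trivial stand-in follows.
    sc.parallelize(1 to 1000000).map(_ * 2).count()
    sc.stop() // the Sparklens report is emitted when the application finishes
  }
}
```

The Sparklens jar also has to be on the driver's classpath (for example via spark-submit's --jars or --packages options); the exact coordinates are listed in the project README.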

Cassandra CPU performance

I deployed a Cassandra 2.2 ring composed of 4 nodes in the cloud, each with 8 vCPUs and 8 GB of RAM. I am now running some tests with the cassandra-stress and YCSB tools to measure its performance. I am mainly interested in read requests with a small amount of write requests (95%/5%).
Running the experiments, I noticed that even with a high number of threads (or clients), the CPU (and disk) does not saturate but stays at around 60% utilisation.
I am trying to figure out where the bottleneck in my system is. From the hardware point of view everything seems fine to me.
[dstat output screenshot]
I also looked into the Cassandra configuration file to see if there are tuning parameters that would increase system throughput. I increased the value of the concurrent_reads/concurrent_writes parameters, but it didn't improve performance.
The log file also does not contain any warnings.
What could be limiting my system?
Thanks
You might want to consider running cassandra-stress from outside the cluster, and from multiple instances, as described in
Usage of the Cassandra tool cassandra-stress

Increasing the number of VMs decreases Cassandra throughput. What can be the reason?

I am using the YCSB benchmarking tool to benchmark a Cassandra cluster.
I am varying the number of virtual machines in the cluster.
I am using 1 physical host with 1, 2, 3, or 4 virtual machines for benchmarking (as shown in the attached figure).
The generated workload is the same every time: Workload C, 10,000,00 operations, 10,000 records.
Each VM has 2 GB RAM and a 20 GB drive.
Cassandra: 1 seed node, endpoint_snitch: GossipingPropertyFileSnitch
Keyspace YCSB: replication factor 3.
The problem is that when I increase the number of virtual machines in the cluster, the throughput decreases. What can be the reason?
In principle, increasing compute resources (i.e. virtual machines) should make the cluster perform better, but the opposite is happening, as shown in the attached figure. Could you explain the probable reason for this? I am writing my thesis on this topic but I am unable to figure out the reason; I would be grateful for any help.
[Figure: throughput observed while varying the number of VMs in the Cassandra cluster]
You are very likely hitting a disk I/O bottleneck. Especially with non-SSD drives this is completely expected. Unless you have a dedicated disk/CPU per VM, the competition for resources will cause contention like this. Also, 2 GB per VM is not enough to do any kind of performance benchmark with Cassandra, since the minimum recommended JVM heap size is 8 GB.
Cassandra is great at horizontal scaling (nearly linear), but that doesn't mean that simply adding VMs to one physical host will increase throughput: a single VM on the physical host has less contention for resources (disk, CPU, memory, network) than 4, so it's likely one VM will perform better than 4.
If you WERE increasing resources, you SHOULD see it perform better, but you're not; you're simply adding contention to existing resources. If you want to scale Cassandra, you need to test it with additional physical resources: more physical machines, not more VMs on the same machine.
Finally, as Chris Lohfink mentions, your VMs are too small for meaningful tests: an 8 GB JVM heap is recommended, with another 8 GB of OS page cache to support reads, so running Cassandra with less than 16 GB of RAM is typically non-ideal in production.
You're trying to test a jet engine (a distributed database designed for hundreds or thousands of physical nodes) with gas-station-level equipment: your benchmark hardware isn't representative of a real production environment, so your benchmark results aren't meaningful.
