Does Spark Allow Running On Multiple Nodes? - apache-spark

I am using Cassandra with 3 nodes (RF 3), each with 20 cores, a 1 TB SSD, and 64 GB of RAM. When I connect to Cassandra from Spark and read data, only the CPUs of the node I run the job on are busy: all of its cores sit at 80-90%, while CPU usage on the other nodes does not change. Does Spark allow running on multiple nodes? If so, how can I set that up?
I am passing the 3 node IPs as the connection host, as shown below, but it does not work.
hosts = {"spark.cassandra.connection.host": "node1_ip,node2_ip,node3_ip",
         "table": "ex_table", "keyspace": "ex_keyspace"}
data_frame = sqlContext.read.format("org.apache.spark.sql.cassandra") \
    .options(**hosts).load()
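No answer is recorded for this question, so the following is only a sketch of one common cause. `spark.cassandra.connection.host` lists contact points for the Cassandra connection; it does not decide where Spark tasks run. That depends on the master URL the Spark session is created with, and in `local[*]` mode all work stays on one machine. The master URL `spark://node1_ip:7077` and the helper names below are assumptions for illustration, and a standalone Spark master with a worker on each node is assumed to be running already.

```python
def cassandra_read_options(hosts, keyspace, table):
    """Build the options dict passed to the Cassandra data source."""
    return {
        "spark.cassandra.connection.host": ",".join(hosts),
        "keyspace": keyspace,
        "table": table,
    }


def read_table(master_url, hosts, keyspace, table):
    """Read a Cassandra table through a session bound to a cluster master.

    Assumes pyspark and the spark-cassandra-connector package are available.
    """
    from pyspark.sql import SparkSession  # imported lazily; needs pyspark installed

    spark = (
        SparkSession.builder
        .master(master_url)  # e.g. "spark://node1_ip:7077" (hypothetical), not "local[*]"
        .appName("cassandra-read")
        .getOrCreate()
    )
    options = cassandra_read_options(hosts, keyspace, table)
    return spark.read.format("org.apache.spark.sql.cassandra").options(**options).load()


# Building the options needs no cluster, so this part runs anywhere.
options = cassandra_read_options(
    ["node1_ip", "node2_ip", "node3_ip"], "ex_keyspace", "ex_table"
)
print(options["spark.cassandra.connection.host"])
```

If the job is already submitted to a cluster master, the Executors tab of the Spark web UI is a quick way to check whether executors were actually started on all three nodes.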

Related

Cassandra Cluster - Production (VMware)

I intend to create a Cassandra cluster with 10 nodes (16 vCPUs + 32 GB of RAM each).
For this cluster, I intend to use high-end storage (SSD only) with 320k IOPS. The nodes will be spread over 10 machines running VMware 6.7. Are there any contraindications in this case, even though it is a very high-performance architecture for any type of application or database?
The server side looks fine, but to get good performance from Cassandra you also need to consider other things, such as the network, the OS, and data modelling.
You can take a look at the DataStax recommendations here:
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/config/configRecommendedSettings.html

Get number of available executors

I'm spinning up an EMR 5.4.0 cluster with Spark installed. I have a job whose performance really degrades if it's scheduled on executors that aren't available (e.g. on a cluster with 2 m3.xlarge core nodes there are about 16 executors available).
Is there any way for my app to discover this number? I can discover the hosts by doing this:
sc.range(1,100,1,100).pipe("hostname").distinct().count(), but I'm hoping there's a better way of getting an understanding of the cluster that Spark is running on.
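One possible alternative, sketched here under assumptions, is to ask Spark's monitoring REST API, which exposes an executor list at `/api/v1/applications/[app-id]/executors`. The helper names and the sample payload below are made up for illustration; only the endpoint path and the `id`, `isActive`, and `totalCores` fields are taken from the API.

```python
import json
from urllib.request import urlopen


def count_executors(executor_summaries):
    """Count active executors, excluding the driver's own entry."""
    return sum(
        1
        for e in executor_summaries
        if e.get("id") != "driver" and e.get("isActive", True)
    )


def total_task_slots(executor_summaries):
    """Sum the cores of all active non-driver executors."""
    return sum(
        e.get("totalCores", 0)
        for e in executor_summaries
        if e.get("id") != "driver" and e.get("isActive", True)
    )


def fetch_executor_summaries(ui_url, app_id):
    """Fetch the executor list from the Spark monitoring REST API."""
    with urlopen(f"{ui_url}/api/v1/applications/{app_id}/executors") as resp:
        return json.load(resp)


# Hand-written sample shaped like the API response, for illustration only.
sample = json.loads("""
[
  {"id": "driver", "hostPort": "10.0.0.1:40000", "isActive": true, "totalCores": 0},
  {"id": "1", "hostPort": "10.0.0.2:40001", "isActive": true, "totalCores": 4},
  {"id": "2", "hostPort": "10.0.0.3:40002", "isActive": true, "totalCores": 4}
]
""")
print(count_executors(sample), total_task_slots(sample))
```

In a running PySpark application, something like `fetch_executor_summaries(sc.uiWebUrl, sc.applicationId)` may supply the real data, although on YARN the UI can sit behind a proxy, so the URL might need adjusting.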

Why does my Spark only use two computers in the cluster?

I'm using Spark 1.3.1 in standalone mode on my cluster, which has 7 machines. 2 of the machines are powerful and have 64 cores and 1024 GB of memory, while the others have 40 cores and 256 GB of memory. One of the powerful machines is set to be the master, and the others are set to be the slaves. Each of the slave machines runs 4 workers.
When I run my driver program on one of the powerful machines, I see that it takes cores only from the two powerful machines. Below is a part of the web UI of my Spark master.
My configuration of this Spark driver program is as follows:
spark.scheduling.mode=FAIR
spark.default.parallelism=32
spark.cores.max=512
spark.executor.memory=256g
spark.logConf=true
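Why does Spark do this? Is this a good thing or a bad thing? Thanks!
Consider lowering your executor memory from the 256 GB you have defined. In standalone mode an executor can only start on a worker that can offer that much memory, and only the workers on the 1024 GB machines are likely to qualify.
For the future, consider assigning around 75% of the available memory.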
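The arithmetic behind that advice can be sketched as follows. It assumes each machine's memory is split evenly across its 4 workers, which the question does not state, and it reads "75% of available memory" as 75% of one worker's share; both are assumptions, and the function names are invented for illustration.

```python
def worker_memory_gb(machine_memory_gb, workers_per_machine):
    """Memory each worker can offer, assuming an even split (an assumption)."""
    return machine_memory_gb / workers_per_machine


def executor_fits(executor_memory_gb, machine_memory_gb, workers_per_machine):
    """True if one executor of the requested size fits on a single worker."""
    return executor_memory_gb <= worker_memory_gb(machine_memory_gb, workers_per_machine)


EXECUTOR_GB = 256  # spark.executor.memory=256g from the question

# Machine sizes from the question: 1024 GB and 256 GB, 4 workers each.
for name, mem in [("powerful", 1024), ("regular", 256)]:
    print(name, worker_memory_gb(mem, 4), executor_fits(EXECUTOR_GB, mem, 4))

# Applying the 75% rule of thumb to one worker on a 256 GB machine.
suggested_gb = 0.75 * worker_memory_gb(256, 4)
print(suggested_gb)
```

Under these assumptions a 256 GB executor fits only on workers of the 1024 GB machines, which would match the behaviour described, while a setting near 48 GB would fit on every worker.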

Cassandra cluster on budget

I am learning Cassandra and want to run a cloud based cluster. I don't care much about speed.
What I want to really test is the replication and recovery features.
I would be running tests like
taking nodes offline every once in a while
kill -9 cassandra
powering off server
manually corrupting sstables/commitlog (not sure if this is recoverable)
I am thinking of going for a 4 node cluster.
Each node will have the following config:
2 GB RAM
10 GB SSD
2 CPUs (Virtual)
Two nodes will be in a European data center and the other two will be in a North American data center.
I know 8 GB is the recommended minimum for Cassandra, but that config would be quite expensive.
If it helps, I can run one more VM on a dedicated box. This VM can have 16 GB of RAM and 8 virtual CPUs. I could also run 4 VMs with 4 GB of RAM each on this box. But I guess having 4 separate VMs in different data centers would make a more realistic setup and bring to the fore any issues that may arise from network problems, latencies, etc.
Is it okay to run Cassandra on machines with this config? Please share your thoughts.
Many people run multiple instances of Cassandra on modern laptops using ccm ( https://github.com/pcmanus/ccm ). If you just want to get an idea of what it does (create a 3 node cluster, add data, add a 4th node, create a snapshot, remove a node, add it back, restore the snapshot, etc.), using ccm on a PC may be 'good enough'.
Otherwise, you can certainly run with less than 1 GB of RAM, but it's not always fun. There have been some clusters on tiny hardware ( http://www.datastax.com/dev/blog/32-node-raspberry-pi-cassandra-cluster ). Depending on your budget, building a cluster on Raspberry Pis may be as cost effective as your 2 VM cluster.

Hardware requirements for Presto

I suspect the answer is "it depends", but is there any general guidance about what kind of hardware to plan to use for Presto?
Since Presto uses a coordinator and a set of workers, and workers run with the data, I imagine the main issues will be having sufficient RAM for the coordinator, sufficient network bandwidth for partial results sent from workers to the coordinator, etc.
If you can supply some general thoughts on how to size for this appropriately, I'd love to hear them.
Most people are running Trino (formerly PrestoSQL) on the Hadoop nodes they already have. At Facebook we typically run Presto on a few nodes within the Hadoop cluster to spread out the network load.
Generally, I'd go with the industry standard ratios for a new cluster: 2 cores and 2-4 gig of memory for each disk, with 10 gigabit networking if you can afford it. After you have a few machines (4+), benchmark using your queries on your data. It should be obvious if you need to adjust the ratios.
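As a rough illustration of that ratio, the sketch below turns a disk count into a core count and a memory range. The function name and the 12-disk example are hypothetical; only the per-disk ratios (2 cores, 2-4 GB of memory) come from the answer.

```python
def size_node(disks, cores_per_disk=2, mem_gb_per_disk=(2, 4)):
    """Return (cores, min_memory_gb, max_memory_gb) for a node with `disks` disks."""
    low, high = mem_gb_per_disk
    return disks * cores_per_disk, disks * low, disks * high


# Hypothetical node with 12 disks.
cores, mem_low, mem_high = size_node(12)
print(cores, mem_low, mem_high)
```

This is only a starting point; as the answer notes, benchmarking with your own queries and data should show whether the ratios need adjusting.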
In terms of sizing the hardware for a cluster from scratch some things to consider:
Total data size will determine the number of disks you will need. HDFS has a large overhead so you will need lots of disks.
The ratio of CPU speed to disks depends on the ratio between hot data (the data you are working with) and cold data (archive data). If you are just starting your data warehouse, you will need lots of CPUs since all the data will be new and hot. On the other hand, most physical disks can only deliver data so fast, so at some point more CPUs don't help.
The ratio of CPU speed to memory depends on the size of aggregations and joins you want to perform and the amount of (hot) data you want to cache. Currently, Presto requires the final aggregation results and the hash table for a join to fit in memory on a single machine (we're actively working on removing these restrictions). If you have larger amounts of memory, the OS will cache disk pages which will significantly improve the performance of queries.
In 2013 at Facebook we ran our Presto processes as follows:
We ran our JVMs with a 16 GB heap to leave most memory available for OS buffers
On the machines where we ran Presto, we didn't run MapReduce tasks.
Most of the Presto machines had 16 real cores and used processor affinity (eventually cgroups) to limit Presto to 12 cores (so the Hadoop data node process and other things could run easily).
Most of the servers were on 10 gigabit networks, but we did have one large old crufty cluster using 1 gigabit (which worked fine).
We used the same configuration for the coordinator and the workers.
In recent times, we ran the following:
The machines had 256 GB of memory and we ran a 200 GB Java heap
Most of the machines had 24-32 real cores and Presto was allocated all cores.
The machines had only minimal local storage for logs, with all table data remote (in a proprietary distributed file system).
Most servers had a 25 gigabit network connection to a fabric network.
The coordinators and workers had similar configurations.
