Should slave nodes be launched/started separately on Amazon EMR server? - apache-spark

I have just launched an Amazon Elastic MapReduce (EMR) cluster after running into java.lang.OutOfMemoryError: Java heap space while fetching 120 million rows from a database in PySpark. The cluster has 1 master and 2 slave nodes, each with 4 cores and 8 GB of RAM.
I am trying to load a massive dataset from a MySQL database (approx. 120M rows). The query loads fine, but when I call df.show() or try to perform other operations on the Spark DataFrame, I get errors like:
org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
Task 0 in stage 0.0 failed 1 times; aborting job
java.lang.OutOfMemoryError: GC overhead limit exceeded
My questions are:
When I SSH into the Amazon EMR server and run htop, I see that 5 GB out of 8 GB is already in use. Why is this?
On the Amazon EMR portal, I can see that the master and slave servers are running. I'm not sure whether the slave servers are being used or whether it's just the master doing all the work. Do I have to separately launch or "start" the 2 slave nodes, or does Spark do that automatically? If I have to, how do I do this?

If you are running Spark in local mode (local[*]) from the master, it will only use the master node.
How are you submitting the Spark job?
Submit the job to YARN in cluster or client mode so that it can use the resources of the whole cluster.
Read more on YARN cluster vs client
The master node runs all the other services, such as Hive, MySQL, etc. Those services may be taking the 5 GB of RAM if you aren't running Spark in local mode.
In the YARN UI (http://<master-public-dns>:8088) you can check in more detail which other containers are running.
You can check where your Spark driver and executors are running in the Spark UI at http://<master-public-dns>:18080.
Select your job and go to the Executors section; there you will find the machine IP of each executor.
Enable Ganglia in EMR, or use the CloudWatch EC2 metrics, to check the utilization of each machine.
Spark doesn't start or terminate nodes.
If you want to scale your cluster depending on job load, apply an auto scaling policy to the CORE or TASK instance group.
But you need at least 1 CORE node running at all times.
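As a rough illustration of the advice above about submitting through YARN rather than local mode, the following sketch assembles a spark-submit command line in Python. The script name load_mysql.py and the executor count and sizes are hypothetical placeholders, not values taken from the question; --master, --deploy-mode, --num-executors, --executor-cores and --executor-memory are standard spark-submit options.

```python
def build_spark_submit(script, deploy_mode="cluster", num_executors=2,
                       executor_cores=2, executor_memory="4g"):
    """Return a spark-submit argv list that targets YARN instead of local[*]."""
    if deploy_mode not in ("cluster", "client"):
        raise ValueError("deploy_mode must be 'cluster' or 'client'")
    return [
        "spark-submit",
        "--master", "yarn",               # let YARN place executors on the worker nodes
        "--deploy-mode", deploy_mode,     # where the driver runs
        "--num-executors", str(num_executors),      # hypothetical sizing
        "--executor-cores", str(executor_cores),    # hypothetical sizing
        "--executor-memory", executor_memory,       # hypothetical sizing
        script,
    ]


# "load_mysql.py" is a made-up script name used only for illustration.
cmd = build_spark_submit("load_mysql.py")
print(" ".join(cmd))
```

If the script itself hard-codes a local master (for example .master("local[*]") on the session builder), that setting would likely need to be removed for the --master option to take effect.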

Related

Monitor Spark with Prometheus when Spark clusters are spun up only when needed

We run Spark on Kubernetes, and we spin up a Spark driver and executors for a lot of our tasks (not Spark tasks). After a task is finished we spin the cluster (on Kubernetes) down and spin up another one when needed (there could be a lot running simultaneously).
The problem I have is that I can't monitor it with Prometheus, because I do not have a driver that is always "alive" from which I can pull information about the executors.
Is there a solution for that kind of architecture?

How to specify core instance node when submitting Spark step to AWS EMR cluster

I am running multiple instances for my EMR cluster on AWS.
I have 2 CORE nodes and 1 MASTER node.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.add_job_flow_steps
I'm using PySpark to submit the job, but I don't see anything about specifying which CORE node to run it on.
I thought this was done automatically (round-robin style, perhaps?).
Is there a way to achieve this?
You always submit your step to the master, not to the core nodes. The master then distributes the work to the cluster's workers (Spark executors on the core or task nodes).
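To make the point above concrete, here is a minimal sketch of a step definition for the boto3 add_job_flow_steps call linked in the question. Note that nothing in it names a core node: the step goes to the cluster, and placement of executors is left to YARN. The step name, S3 path, region and cluster ID are hypothetical; command-runner.jar is the launcher EMR provides for running commands such as spark-submit as a step.

```python
def spark_step(name, script_s3_uri, deploy_mode="cluster"):
    """Build one entry for the Steps list of EMR's add_job_flow_steps."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", deploy_mode, script_s3_uri],
        },
    }


def submit(cluster_id, step, region="us-east-1"):
    """Send the step to the cluster. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so spark_step() works without boto3 installed

    emr = boto3.client("emr", region_name=region)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"]


# Hypothetical name and S3 path, for illustration only.
step = spark_step("my-pyspark-job", "s3://my-bucket/jobs/job.py")
```

Calling submit("j-XXXXXXXXXXXXX", step) with a real cluster ID in place of the placeholder would queue the step.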

How does Spark choose nodes to run executors? (Spark on YARN)

We use Spark in YARN mode, with a cluster of 120 nodes.
Yesterday one Spark job created 200 executors: 11 executors landed on node1,
10 on node2, and the rest were distributed evenly across the other nodes.
Since there were so many executors on node1 and node2, the job ran slowly.
How does Spark select the nodes on which to run executors?
Is it decided by the YARN ResourceManager?
Since you mentioned Spark on YARN:
YARN chooses the executor nodes for a Spark job based on the availability of cluster resources. Please look into YARN's queue system and dynamic allocation. The best documentation I know of is https://blog.cloudera.com/blog/2016/01/untangling-apache-hadoop-yarn-part-3/
The cluster manager allocates resources across all the applications running on the cluster.
I think the issue is a poorly optimized configuration. You need to configure Spark to use dynamic allocation. In that case Spark adjusts the number of executors it holds according to the workload.
You can find all the information about Spark resource allocation and how to configure it here: http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
Do all 120 nodes have identical capacity?
Moreover, the jobs will be submitted to a suitable NodeManager based on the health and resource availability of that NodeManager.
To optimise a Spark job, you can use dynamic resource allocation, where you do not need to define the number of executors required to run the job. By default the application starts with the configured minimum CPU and memory. Later it acquires more resources from the cluster to execute tasks. It releases resources to the cluster manager once the job has completed, or when executors have been idle for the configured idle timeout. It acquires resources from the cluster again once work resumes.
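The dynamic allocation behaviour described above is driven by a handful of Spark properties. The sketch below collects them in a dictionary and renders them as --conf arguments for spark-submit. The property names are standard Spark settings; the minimum, maximum and timeout values are illustrative assumptions, and enabling the external shuffle service is the traditional requirement on YARN, which may differ on newer Spark versions.

```python
def dynamic_allocation_conf(min_executors=1, max_executors=50, idle_timeout="60s"):
    """Return Spark properties that enable dynamic allocation (values are examples)."""
    return {
        "spark.dynamicAllocation.enabled": "true",
        # Traditionally required on YARN so shuffle files outlive removed executors.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Executors idle for longer than this are released back to the cluster.
        "spark.dynamicAllocation.executorIdleTimeout": idle_timeout,
    }


def to_submit_args(conf):
    """Render a property dict as a flat list of --conf key=value arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# max_executors=120 is a hypothetical cap, chosen here to match the node count.
conf_args = to_submit_args(dynamic_allocation_conf(max_executors=120))
print(" ".join(conf_args))
```

Dynamic allocation changes how many executors an application holds; which nodes they land on is still decided by YARN.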

How to update spark configuration after resizing worker nodes in Cloud Dataproc

I have a Dataproc Spark cluster. Initially, the master and the 2 worker nodes were of type n1-standard-4 (4 vCPUs, 15.0 GB memory); then I resized all of them to n1-highmem-8 (8 vCPUs, 52 GB memory) via the web console.
I noticed that the two worker nodes are not being fully used. In particular, there are only 2 executors on the first worker node and 1 executor on the second worker node, with
spark.executor.cores 2
spark.executor.memory 4655m
in /usr/lib/spark/conf/spark-defaults.conf. I thought that with spark.dynamicAllocation.enabled set to true, the number of executors would be increased automatically.
Also, the information on the Dataproc page of the web console doesn't get updated automatically either. It seems that Dataproc still thinks that all nodes are n1-standard-4.
My questions are:
Why are there more executors on the first worker node than on the second?
Why are more executors not added to each node?
Ideally, I want the whole cluster to be fully utilized. If the Spark configuration needs to be updated, how do I do that?
As you've found, a cluster's configuration is set when the cluster is first created and does not adjust to manual resizing.
To answer your questions:
The Spark ApplicationMaster takes a container in YARN on a worker node, usually the first worker if only a single Spark application is running.
When a cluster is started, Dataproc attempts to fit two YARN containers per machine.
The YARN NodeManager configuration on each machine determines how much of the machine's resources should be dedicated to YARN. This can be changed on each VM under /etc/hadoop/conf/yarn-site.xml, followed by a sudo service hadoop-yarn-nodemanager restart. Once machines are advertising more resources to the ResourceManager, Spark can start more containers. After adding more resources to YARN, you may want to modify the size of containers requested by Spark by modifying spark.executor.memory and spark.executor.cores.
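The relationship between what YARN advertises per node and the executor size requested by Spark can be sketched as simple arithmetic. The function below is only an estimate under stated assumptions: it adds Spark's default memory overhead (the larger of 384 MB or 10% of executor memory), ignores YARN's rounding to its minimum allocation size and the container taken by the ApplicationMaster, and assumes the scheduler accounts for both memory and vcores (a scheduler that considers memory only would not apply the core limit). The per-node YARN memory figures in the example are hypothetical, not values read from a real Dataproc cluster.

```python
def executors_per_node(yarn_memory_mb, yarn_vcores, executor_memory_mb, executor_cores):
    """Estimate how many Spark executors fit on one YARN NodeManager."""
    # Default overhead: max(384 MB, 10% of executor memory).
    overhead_mb = max(384, executor_memory_mb // 10)
    container_mb = executor_memory_mb + overhead_mb
    by_memory = yarn_memory_mb // container_mb
    by_cores = yarn_vcores // executor_cores
    return min(by_memory, by_cores)


# spark.executor.memory 4655m and spark.executor.cores 2 come from the question.
# 12288 MB / 4 vcores and 40960 MB / 8 vcores are assumed NodeManager settings.
before_resize = executors_per_node(12288, 4, 4655, 2)
after_yarn_update = executors_per_node(40960, 8, 4655, 2)
print(before_resize, after_yarn_update)
```

Under these assumptions, the executor count per node only grows once the NodeManager advertises the larger machine's resources, which matches the explanation above.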
Instead of resizing cluster nodes and manually editing configuration files afterwards, consider starting a new cluster with the new machine sizes and copying any data from your old cluster to the new one. In general, the simplest way to move data is to use Hadoop's built-in distcp utility. An example usage would be something along the lines of:
$ hadoop distcp hdfs:///some_directory hdfs://other-cluster-m:8020/
Or if you can use Cloud Storage:
$ hadoop distcp hdfs:///some_directory gs://<your_bucket>/some_directory
Alternatively, consider always storing data in Cloud Storage and treating each cluster as an ephemeral resource that can be torn down and recreated at any time. In general, any time you would save data to HDFS, you can also save it as:
gs://<your_bucket>/path/to/file
Saving to GCS has the nice benefit of allowing you to delete your cluster (and data in HDFS, on persistent disks) when not in use.

Is it possible to run multiple Spark applications on a mesos cluster?

I have a Mesos cluster with 1 master and 3 slaves (with 2 cores and 4 GB RAM each) that has a Spark application already up and running. I wanted to run another application on the same cluster, as the CPU and memory utilization isn't high. However, when I try to run the new application, I get the error:
16/02/25 13:40:18 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
I guess the new process is not getting any CPU as the old one occupies all 6.
I have tried enabling dynamic allocation, making the Spark app fine-grained, and assigning numerous combinations of executor cores and number of executors. What am I missing here? Is it possible to run a Mesos cluster with multiple Spark frameworks at all?
You can try setting spark.cores.max to limit the number of CPUs used by each Spark driver, which will free up some resources.
Docs: https://spark.apache.org/docs/latest/configuration.html#scheduling
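The effect of the cap suggested above can be sketched with simple arithmetic. According to the Spark configuration docs, when spark.cores.max is not set on Mesos an application may take all available cores, which is why the second application gets nothing. The function below is a simplified model, not Mesos's actual offer logic; the cap of 4 is a hypothetical value.

```python
def cores_left(total_cores, caps):
    """Cores remaining after each running app takes its share.

    A cap of None models an app without spark.cores.max, which may take
    every core that is still available.
    """
    remaining = total_cores
    for cap in caps:
        granted = remaining if cap is None else min(cap, remaining)
        remaining -= granted
    return remaining


total = 3 * 2  # 3 slaves with 2 cores each, as in the question

uncapped = cores_left(total, [None])  # first app has no spark.cores.max
capped = cores_left(total, [4])       # first app started with spark.cores.max=4 (hypothetical)
print(uncapped, capped)
```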
