Is AWS EMR suited for an HA Spark direct streaming application?

I am trying to run an Apache Spark direct streaming application on AWS EMR.
The application receives data from and sends data to AWS Kinesis and needs to be running the whole time.
Of course, if a core node is killed, it stops. But it should self-heal when the core node is replaced.
Now I noticed: when I kill one of the core nodes (simulating a problem), it is replaced by AWS EMR. But the application stops working (no output is sent to Kinesis anymore) and it does not continue working unless I restart it.
What I get in the logs is:
ERROR YarnClusterScheduler: Lost executor 1 on ip-10-1-10-100.eu-central-1.compute.internal: Slave lost
Which is expected. But then I get:
20/11/02 13:15:32 WARN TaskSetManager: Lost task 193.1 in stage 373.0 (TID 37583, ip-10-1-10-225.eu-central-1.compute.internal, executor 2): FetchFailed(null, shuffleId=186, mapIndex=-1, mapId=-1, reduceId=193, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 186
These are just warnings, yet the application no longer produces any output.
So I stop the application and start it again. Now it produces output again.
My question: Is AWS EMR suited for a self-healing application like the one I need, or am I using the wrong tool?
If it is, how do I get my Spark application to continue after a core node is replaced?

It's recommended to use On-Demand instances for the CORE instance group,
and at the same time use the TASK instance group to leverage Spot instances.
Have a look at:
Amazon Emr - What is the need of Task nodes when we have Core nodes?
AWS DOC on Master, Core, and Task Nodes
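For illustration, here is a minimal boto3 sketch of launching such a cluster with the CORE group on On-Demand capacity and the TASK group on Spot; the instance types, counts, and roles are placeholders, not values taken from the question:

import boto3

emr = boto3.client("emr", region_name="eu-central-1")

# Hypothetical cluster layout: On-Demand MASTER and CORE, Spot TASK nodes.
response = emr.run_job_flow(
    Name="spark-streaming-cluster",
    ReleaseLabel="emr-6.6.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            # Keep CORE on On-Demand so the nodes holding HDFS blocks and
            # shuffle output are not reclaimed underneath the streaming job.
            {"InstanceRole": "CORE", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            # Put the interruptible capacity in the TASK group instead.
            {"InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])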

Related

Spark Thrift Server SQL not scheduled to run for a long time

I am running Spark Thrift Server on EMR (6.6) with managed scaling enabled.
From time to time we have SQL queries that are stuck for a long time (45 minutes) until a new request comes to the server and releases them.
When that happens, we see that there is one executor on a task node that EMR asks to kill.
What could be the reason for such behavior? How could it be avoided?
It turned out that AWS has a feature that prevents Spark from sending tasks to executors that run on a DECOMMISSIONING node.
In our case we have min-executors = 1 and the last executor was on a DECOMMISSIONING node, so Spark does not send it any tasks, but it also does not ask for new resources because it still has that executor.
So that seems to be an EMR bug.
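As a hedged illustration (not a fix confirmed by AWS), these are the standard Spark dynamic-allocation settings involved; raising the executor floor above 1 is one possible workaround for being left with a single executor on a DECOMMISSIONING node:

# Hypothetical EMR configuration entry (spark-defaults classification) for the
# Thrift Server's dynamic allocation; the property names are standard Spark settings.
spark_defaults = {
    "Classification": "spark-defaults",
    "Properties": {
        "spark.dynamicAllocation.enabled": "true",
        # Keeping more than one executor alive is a possible workaround,
        # not a confirmed fix for the behavior described above.
        "spark.dynamicAllocation.minExecutors": "2",
        "spark.dynamicAllocation.maxExecutors": "20",
    },
}
# This dict can be passed in the Configurations list when creating the cluster,
# e.g. Configurations=[spark_defaults], or entered in the EMR console.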

Should slave nodes be launched/started separately on Amazon EMR server?

I have just launched an Amazon Elastic MapReduce server after trying java.lang.OutofMemorySpace:Java heap space while fetching 120 million rows from database in pyspark, where I have 1 master and 2 slave nodes running, each having 4 cores and 8G RAM.
I am trying to load a massive dataset from MySQL database (containing approx. 120M rows). The query loads fine but when I do a df.show() operation or when I try to perform operations on the spark dataframe I am getting errors like -
org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
Task 0 in stage 0.0 failed 1 times; aborting job
java.lang.OutOfMemoryError: GC overhead limit exceeded
My questions are -
When I SSH into the Amazon EMR server and do htop, I see that 5GB out of 8GB is already in use. Why is this?
On the Amazon EMR portal, I can see that the master and slave servers are running. I'm not sure if the slave servers are being used or if it's just the master doing all the work. Do I have to separately launch or "start" the 2 slave nodes, or does Spark do that automatically? If yes, how do I do this?
If you are running Spark in local mode (local[*]) from the master, then it will only use the master node.
How are you submitting the Spark job?
Use YARN cluster or client mode when submitting the Spark job to use resources efficiently (see the sketch below).
Read more on YARN cluster vs client
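For example, here is a hedged sketch of submitting a PySpark script to YARN in cluster mode as an EMR step, so the driver and executors run in YARN containers rather than only on the master; the cluster ID, S3 path, and resource sizes are placeholders:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical cluster ID and script location; replace with your own.
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[
        {
            "Name": "load-mysql-dataset",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--master", "yarn",
                    "--deploy-mode", "cluster",  # run the driver in a YARN container, not on the master shell
                    "--num-executors", "2",
                    "--executor-memory", "4g",
                    "s3://my-bucket/jobs/load_data.py",
                ],
            },
        }
    ],
)
print(response["StepIds"])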
The master node runs all the other services like Hive, MySQL, etc. Those services may be
taking 5GB of RAM if you aren't using standalone mode.
In the YARN UI (http://<master-public-dns>:8088) you can check in more detail what other containers are running.
You can check where your Spark driver and executors are running
in the Spark UI (http://<master-public-dns>:18080).
Select your job and go to the Executors section; there you will find the machine IP of each executor.
Enable Ganglia in EMR, or go to the CloudWatch EC2 metrics, to check each machine's utilization.
Spark doesn't start or terminate nodes.
If you want to scale your cluster depending on job load, apply an autoscaling policy to the CORE or TASK instance group (see the sketch below).
But you need at least 1 CORE node always running.
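A minimal, hypothetical sketch of attaching such a policy to an instance group with boto3; the cluster ID, instance group ID, and thresholds are placeholders:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical IDs; look them up with list_clusters / list_instance_groups.
emr.put_auto_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",
    InstanceGroupId="ig-XXXXXXXXXXXX",
    AutoScalingPolicy={
        "Constraints": {"MinCapacity": 1, "MaxCapacity": 10},
        "Rules": [
            {
                "Name": "scale-out-on-low-yarn-memory",
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": 1,   # add one node at a time
                        "CoolDown": 300,
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        "ComparisonOperator": "LESS_THAN",
                        "EvaluationPeriods": 1,
                        "MetricName": "YARNMemoryAvailablePercentage",
                        "Namespace": "AWS/ElasticMapReduce",
                        "Period": 300,
                        "Statistic": "AVERAGE",
                        "Threshold": 15.0,
                        "Unit": "PERCENT",
                    }
                },
            }
        ],
    },
)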

How to know if a machine in a Spark cluster 'participates' in a job

I wanted to know when it is safe to remove a machine from a cluster.
My assumption is that it could be safe to remove a machine if the machine does not have any containers, and it does not store any useful data.
By the APIs at https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html, we can do
GET http://<rm http address:port>/ws/v1/cluster/nodes
to get the information of each node like
<node>
<rack>/default-rack</rack>
<state>RUNNING</state>
<id>host1.domain.com:54158</id>
<nodeHostName>host1.domain.com</nodeHostName>
<nodeHTTPAddress>host1.domain.com:8042</nodeHTTPAddress>
<lastHealthUpdate>1476995346399</lastHealthUpdate>
<version>3.0.0-SNAPSHOT</version>
<healthReport></healthReport>
<numContainers>0</numContainers>
<usedMemoryMB>0</usedMemoryMB>
<availMemoryMB>8192</availMemoryMB>
<usedVirtualCores>0</usedVirtualCores>
<availableVirtualCores>8</availableVirtualCores>
<resourceUtilization>
<nodePhysicalMemoryMB>1027</nodePhysicalMemoryMB>
<nodeVirtualMemoryMB>1027</nodeVirtualMemoryMB>
<nodeCPUUsage>0.006664445623755455</nodeCPUUsage>
<aggregatedContainersPhysicalMemoryMB>0</aggregatedContainersPhysicalMemoryMB>
<aggregatedContainersVirtualMemoryMB>0</aggregatedContainersVirtualMemoryMB>
<containersCPUUsage>0.0</containersCPUUsage>
</resourceUtilization>
</node>
If numContainers is 0, I assume it does not run containers. However, can it still store data on disk that other downstream tasks can read?
I could not find out whether Spark lets us know this. I assume that if a machine still stores some data useful for the running job, the machine may maintain a heartbeat with the Spark driver or some central controller? Can we check this by scanning TCP or UDP connections?
Is there any other way to check whether a machine in a Spark cluster participates in a job?
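As an illustrative sketch (the ResourceManager address is a placeholder), you can poll the same REST endpoint and filter on numContainers to see which NodeManagers currently host no containers; note that this alone does not tell you whether data on a node's disks is still needed by later stages:

import requests

# Hypothetical ResourceManager address; on EMR this is the master node, port 8088.
RM_URL = "http://<rm-http-address>:8088/ws/v1/cluster/nodes"

# The endpoint returns JSON when asked for it (the XML shown above is the default).
nodes = requests.get(RM_URL, headers={"Accept": "application/json"}).json()

for node in nodes["nodes"]["node"]:
    host = node["nodeHostName"]
    state = node["state"]
    containers = node["numContainers"]
    # A node with zero containers is running no YARN work right now.
    print(f"{host}: state={state}, containers={containers}")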
I am not sure whether you just want to know if a node is running any task (if that's what you mean by 'participate') or whether you want to know if it is safe to remove a node from the Spark cluster.
I will try to explain the latter point.
Spark has the ability to recover from failure, which also applies to any node being removed from the cluster.
The node removed can be an executor or an application master.
If an application master is removed, the entire job fails. But if you are using YARN as the resource manager, the job is retried and YARN provides a new application master. The number of retries is configured in:
yarn.resourcemanager.am.max-attempts
By default, this value is 2.
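For Spark on YARN specifically, the application can also cap its own attempts with the standard spark.yarn.maxAppAttempts setting (it cannot exceed the cluster-wide yarn.resourcemanager.am.max-attempts); a minimal PySpark sketch:

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Cap application-master retries at 3. This is normally supplied at submit
# time (spark-submit --conf spark.yarn.maxAppAttempts=3), since the attempt
# count is fixed when YARN accepts the application.
conf = SparkConf().set("spark.yarn.maxAppAttempts", "3")
spark = SparkSession.builder.appName("am-retry-example").config(conf=conf).getOrCreate()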
If a node on which a task is running is removed, the resource manager (YARN) will stop getting heartbeats from that node. The application master will know it is supposed to reschedule the failed tasks, as it will no longer receive progress status from the previous node. It will then request resources from the resource manager and reschedule the tasks.
As far as data on these nodes is concerned, you need to understand how the tasks and their output are handled. Every node has its own local storage to store the output of the tasks running on them. After the tasks are run successfully, the OutputCommitter will move the output from local storage to the shared storage (HDFS) of the job from where the data is picked for the next step of the job.
When a task fails (may be because the node that runs this job failed or was removed), the task is rerun on another available node.
In fact, the application master will also rerun the tasks that ran successfully on this node, as their output stored on the node's local storage will no longer be available.

Avoid CPU pegging on Spark Standalone

I have a daily pipeline running on Spark Standalone 2.1. It is deployed on and runs on AWS EC2 and uses S3 for its persistence layer. For the most part, the pipeline runs without a hitch, but occasionally the job hangs on a single worker node during a reduceByKey operation. When I log into the worker, I notice that the CPU (as seen via top) is pegged at 100%. My remedy so far is to reboot the worker node so that Spark re-assigns the task, and the job proceeds fine from there.
I would like to be able to mitigate this issue. I gather that I can prevent CPU pegging by switching to YARN as my cluster manager, but I wonder whether I could configure Spark Standalone to prevent CPU pegging, maybe by limiting the number of cores that get assigned to the Spark job? Any suggestions would be greatly appreciated.
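One way to cap how many cores a Standalone application takes is the standard spark.cores.max setting (total cores per application) together with spark.executor.cores; whether that actually prevents the pegging described above is not established here, so treat this as a hedged sketch with a placeholder master URL:

from pyspark.sql import SparkSession

# Limit this application to at most 8 cores cluster-wide, 2 per executor,
# leaving the remaining cores on each worker free for other work.
spark = (
    SparkSession.builder
    .master("spark://<standalone-master>:7077")  # placeholder master URL
    .appName("daily-pipeline")
    .config("spark.cores.max", "8")
    .config("spark.executor.cores", "2")
    .getOrCreate()
)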

Why does an EMR Spark job fail when a task group node is lost?

I am using AWS emr-5.0.0 to run a small cluster that consists of the following nodes:
1 Master - AWS on demand instance
1 CORE - AWS on demand instance
2 TASK - AWS SPOT instance
All of them are x3.xlarge machines.
I run a Python Spark application with two stages.
The problem is that when I manually terminate one of the TASK instances (or it gets terminated due to a spot price change), the entire Spark job fails.
I would expect that Spark would just continue by running the lost tasks on the remaining nodes. Please explain why that does not happen.
Below is the log (the master IP is 172-31-1-0, the core instance IP is 172-31-1-173, and the IP of the lost node is 172-31-3-81).
log file (stderr and stdout from spark-submit)
