Abnormally high CPU consumption in PySpark

We have a moderately large PySpark program that we run on a Mesos cluster.
We run the program with spark.executor.cores=8 and spark.cores.max=24. Each Mesos node has 12 vCPUs, so only one executor is started on each node.
The program runs flawlessly, with correct results.
However, each executor consumes far more CPU than its 8 allotted cores: the CPU load frequently reaches 25 or more.
With htop, we see that 8 Python processes are started, as expected. However, each Python process spawns several threads, so each process can go up to 300% CPU.
This behavior is annoying in a shared cluster deployment.
Can someone explain this behavior?
What are these 3 additional threads that PySpark starts?
Additional information:
The functions we use in our Spark operations are not multithreaded.
We see the same behavior in local mode, outside of Mesos.
We use Spark 2.1.1 and Python 3.5.
Nothing else runs on the Mesos nodes, except the usual base services.
In our test platform, the Mesos nodes are actually OpenStack VMs.
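
One way to investigate the extra threads is to count the threads of each Python worker directly instead of reading them off htop. The following is a minimal sketch, assuming a Linux node where /proc/<pid>/status is readable. The helper names parse_thread_count and thread_count are illustrative and not part of PySpark, and the sample values are made up.

```python
import os


def parse_thread_count(status_text):
    """Return the value of the 'Threads:' field from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no 'Threads:' field found")


def thread_count(pid):
    """Read the thread count of a live process (Linux only)."""
    with open("/proc/{}/status".format(pid)) as f:
        return parse_thread_count(f.read())


# Hand-written status excerpt with hypothetical values.
sample = "Name:\tpython3\nPid:\t4242\nThreads:\t4\n"
sample_threads = parse_thread_count(sample)

# On Linux, report the thread count of the current process.
if os.path.exists("/proc/self/status"):
    print(thread_count(os.getpid()))
```

Calling thread_count(os.getpid()) from inside a function passed to a Spark operation, or running it against the PIDs shown in htop, reports how many threads each worker holds at that moment, which may help narrow down where the extra threads come from.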

Related

emr spark master node runs out of memory in yarn cluster mode

I am new to EMR and I am running an EMR cluster with 1 master node (32 GB) and 5 core nodes (16 GB). I launch 11 apps. The apps have to be separate in case one of them fails (all of them are streaming apps). I should mention that I also have Elasticsearch running on the cluster.
After some time, the master node runs out of memory and stops responding, and some apps start to fail. In the process overview I found many smaller Hadoop processes that each occupy 1-1.3 GB of RAM. I guess these are the driver processes of each app. I tried to reduce the driver memory to 512 MB via "spark.driver.memory", but the processes are still at 1.3 GB after relaunching the apps. Is this because of YARN?
Elasticsearch itself only allocates about 6.5 GB of the master node's RAM.

I had to specify the driver memory in the spark-submit command, like this:
spark-submit --driver-memory 500M
Setting it inside the Python file is too late when the driver runs in client mode, because the driver's memory has already been allocated by the time that code executes.
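
If the apps are launched from a script, the same idea can be sketched in Python by putting the memory setting on the command line rather than in the application code. This is only an illustration: build_submit_command is a made-up helper and streaming_app.py is a hypothetical script name.

```python
import shlex


def build_submit_command(script, driver_memory):
    """Build a spark-submit argv that sets driver memory at launch time."""
    return ["spark-submit", "--driver-memory", driver_memory, script]


# "streaming_app.py" is a hypothetical script name.
cmd = build_submit_command("streaming_app.py", "500M")
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True), so the driver is started with the requested memory before any Python code in the app runs.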

Avoid CPU pegging on Spark Standalone

I have a daily pipeline running on Spark Standalone 2.1. It is deployed and runs on AWS EC2 and uses S3 as its persistence layer. For the most part, the pipeline runs without a hitch, but occasionally the job hangs on a single worker node during a reduceByKey operation. When I log into the worker, I notice that the CPU (as seen via top) is pegged at 100%. My remedy so far is to reboot the worker node so that Spark reassigns the task, and the job proceeds fine from there.
I would like to mitigate this issue. I gather that I can prevent CPU pegging by switching to YARN as my cluster manager, but I wonder whether I could configure Spark Standalone to prevent it, perhaps by limiting the number of cores assigned to the Spark job? Any suggestions would be greatly appreciated.

What is the minimum hardware infrastructure required for Spark to run in standalone cluster mode?

I am running Spark in standalone cluster mode on my local computer. This is the hardware information for my computer:
Intel Core i5
Number of Processors: 1
Total Number of Cores: 2
Memory: 4 GB.
I am trying to run a Spark program from Eclipse on a Spark standalone cluster. This is part of my code:
String logFile = "/Users/BigDinosaur/Downloads/spark-2.0.1-bin-hadoop2.7 2/README.md";
SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("spark://BigDinosaur.local:7077");
After running the program in Eclipse, I get the following warning message:
Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
This is a screenshot of my web UI: [screenshot not preserved]
After going through other people's answers to similar problems, it seems that a hardware resource mismatch is the root cause.
I would like more information on the following:
What is the minimum hardware infrastructure required for a Spark standalone cluster to run an application?

It started running after I ran the following command:
./start-slave.sh spark://localhost:7077 --cores 1 --memory 1g
I gave the worker 1 core and 1 GB of memory.

As far as I know, Spark allocates memory from whatever memory is available when the Spark job starts.
You may want to try explicitly providing the cores and executor memory when starting the job.
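
The reasoning behind both answers can be sketched as a simple fit check: the warning tends to appear when no registered worker has enough cores and memory for what the application requests. The function below is a deliberately simplified model and not Spark's actual scheduler. worker_can_host is a made-up name, and the executor requests are hypothetical values chosen for illustration.

```python
def worker_can_host(worker_cores, worker_mem_mb, executor_cores, executor_mem_mb):
    """Simplified check: does one worker have room for one executor?"""
    return executor_cores <= worker_cores and executor_mem_mb <= worker_mem_mb


# Worker started with --cores 1 --memory 1g, as in the command above.
worker = {"worker_cores": 1, "worker_mem_mb": 1024}

# Hypothetical executor requests, for illustration only.
fits_small = worker_can_host(executor_cores=1, executor_mem_mb=512, **worker)
fits_large = worker_can_host(executor_cores=1, executor_mem_mb=2048, **worker)
print(fits_small, fits_large)
```

Under this model, a request that exceeds what the worker advertises never gets scheduled, which is why stating cores and executor memory explicitly on both sides makes a mismatch easier to spot.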

Spark on Mesos is much slower than local

I'm running a Spark Streaming process with Mesos on a host with 16 CPUs and 64 GB of RAM.
When I'm running it using Mesos as a cluster manager (by setting --master mesos://leader.mesos:5050) it's running much slower than when it is run in local mode (--master local[4]).
I can't find the reason for that and I have no clue. One of the things I've noticed is that there is one specific task that is taking significantly more time on Mesos than in Local.
The weird thing (maybe that should be the question's title) is that the task itself takes 6s while its stage (it has only one stage) takes less than a second. See the attached pictures (Mesos (1) and (2)). How come? Isn't a job equal to the sum of its parts?
[Screenshots not preserved: Local; Mesos (1) and (2)]
Another note: I did manage to run this exact same Spark Streaming process on another Mesos cluster, and it runs in a sensible amount of time, pretty much like in the local mode described above. The only difference that I can think of is that this cluster has more than one host, and that Spark is running with 2 executors rather than 1. (I couldn't find a way to run more than 1 executor on the same host on Mesos.) Could this be the reason?
Any clues would be much appreciated.

Spark can run over Mesos in two modes: coarse-grained (default) and fine-grained (see documentation).
In coarse-grained mode Spark launches exactly one executor on each machine it was assigned to by Mesos. Inside this task Spark launches other mini-tasks. It has the benefit of lower startup overhead (in your case you don't want to change this mode).
Could you be more specific about your streaming job? Is it CPU, disk, or network bound? You can easily compare performance if you run some of the Spark examples.
If your task is CPU intensive you might consider setting spark.mesos.extra.cores. By default Spark tries to acquire all cores that are being offered by Mesos. So, if there's no other task running on that cluster it shouldn't be a problem.

Spark Standalone Mode multiple shell sessions (applications)

In Spark 1.0.0 Standalone mode with multiple worker nodes, I'm trying to run a Spark shell from two different computers (same Linux user).
In the documentation, it says "By default, applications submitted to the standalone mode cluster will run in FIFO (first-in-first-out) order, and each application will try to use all available nodes."
The number of cores per worker is set to 4 with 8 being available (via SPARK_JAVA_OPTS="-Dspark.cores.max=4"). Memory is also limited such that enough should be available for both.
However, when looking at the Spark Master WebUI, the shell application that was started later will always remain in state "WAITING" until the first one is exited. The number of cores assigned to it is 0, and the memory per node is 10G (same as the one that is already running).
Is there a way to have both shells running at the same time without using Mesos?

Before a shell can start processing on a Spark standalone cluster, there have to be sufficient cores and memory available. You must specify for each Spark shell the number of cores you want, or it will use them all. If you run the first shell with 5 cores and an executor memory of 10G (the amount of memory you allocated for the executors), and the second shell with 2 cores and 10G of memory, the second one will still not start, because the first shell is using both executors and all of the memory on both. If you specify 5G of executor memory for each Spark shell, they can run concurrently.
Essentially you want to have multiple jobs running on a standalone cluster. Unfortunately, it is really not designed to handle this case well. If you want to do that, you should use either Mesos or YARN.
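
The memory argument above can be checked with a small model. This is a rough sketch, assuming two workers with 10G each and one executor per shell on every worker. It looks at memory only (cores would need the same kind of check) and is not how the standalone master is actually implemented. shells_that_start is a made-up name.

```python
def shells_that_start(worker_mem_gb, num_workers, requests_gb):
    """Count the shells that get an executor on every worker.

    Simplified model: each shell asks for one executor per worker, and an
    executor needs its full memory request to be free on that worker.
    """
    free = [worker_mem_gb] * num_workers
    started = 0
    for request in requests_gb:
        if all(f >= request for f in free):
            free = [f - request for f in free]
            started += 1
    return started


# Both shells ask for 10G executors: only the first one fits.
with_10g = shells_that_start(10, 2, [10, 10])
# Both shells ask for 5G executors: both fit side by side.
with_5g = shells_that_start(10, 2, [5, 5])
print(with_10g, with_5g)
```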

One workaround to this is to restrict the number of cores per spark shell using total-executor-cores. For example to restrict it to 16 cores, launch it like this:
bin/spark-shell --total-executor-cores 16 --master spark://$MASTER:7077
In this case each shell will use only 16 cores, so you can have two shells running on your 32-core cluster. They can then run simultaneously but never use more than 16 cores each :(
This solution is far from ideal, I know. You depend on users to restrict themselves, to shut down their shells, and resources are wasted when a user is not running code. I have created a request to fix this on JIRA, which you can vote for.
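
The per-shell cap in this workaround is just an even split of the cluster's cores. A minimal sketch of that arithmetic follows. cores_per_shell and shell_command are made-up helpers, and spark://master-host:7077 is a placeholder master URL.

```python
def cores_per_shell(cluster_cores, num_shells):
    """Split the cluster's cores evenly; leftover cores stay unassigned."""
    if num_shells < 1:
        raise ValueError("num_shells must be at least 1")
    return cluster_cores // num_shells


def shell_command(master_url, total_cores):
    """Build the spark-shell argv with a cap on total executor cores."""
    return ["bin/spark-shell", "--total-executor-cores", str(total_cores),
            "--master", master_url]


cap = cores_per_shell(32, 2)
# "spark://master-host:7077" is a placeholder master URL.
print(" ".join(shell_command("spark://master-host:7077", cap)))
```

With more users, the same split simply yields a smaller cap per shell, which makes the waste described above more visible whenever a shell sits idle.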

The application ends when your shell dies. So, you cannot run two spark-shells concurrently on two laptops. What you can do is launch one spark-shell, launch the other, and have the second start when the first one dies.
Unlike spark-shell, spark-submit does terminate once the computation is over. So you can spark-submit one app, launch a spark-shell, and have the shell take over the moment the application is done.
Or you can run two apps sequentially (one after the other) with two spark-submit launches.
