Spark - only one partition is processed at each node

I see in my Spark job that usually (but not always) only one partition is being processed on each node. What could be the possible reasons? How can I debug it?

You should check the executor resource configuration:
spark.executor.memory
spark.executor.cores
These configs control how many executors can run concurrently on each node, and therefore how many partitions are processed concurrently (each executor works on one partition per core at a time).
For example, if your nodes have 8 cores and 32 GB of memory each and your Spark application is configured with:
spark.executor.memory=25g
spark.executor.cores=3
only one executor will be able to run concurrently on each node; to run 2 executors concurrently, the node would need at least 50 GB of memory.
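For instance, a minimal sketch (assuming the 8-core / 32 GB nodes above and ignoring spark.executor.memoryOverhead and OS reservations) of a sizing that lets two executors run concurrently on each node:

from pyspark.sql import SparkSession

# Hypothetical sizing for 8-core / 32 GB nodes: two executors per node,
# each with 4 cores and 14g of heap, leaving headroom for overhead and the OS.
spark = (
    SparkSession.builder
    .appName("two-executors-per-node")       # illustrative name
    .config("spark.executor.memory", "14g")  # 2 x 14g fits within 32 GB
    .config("spark.executor.cores", "4")     # 2 x 4 cores uses all 8 cores
    .getOrCreate()
)

With this sizing, up to 8 partitions per node (2 executors x 4 cores) can be processed at the same time.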

Related

Can we have same Spark Config for all jobs/applications?

I am trying to understand the Spark config. I see that the number of executors, executor cores and executor memory is calculated based on the cluster, e.g.:
Cluster Config:
10 Nodes
16 cores per Node
64GB RAM per Node
The recommended config is 29 executors, 18 GB memory each and 5 cores each!
However, would this config be the same for all the jobs/applications that run on the cluster? What would happen if more than one job/app is running at the same time? Also, would this config stay the same regardless of the data I am processing, whether it is 1 GB or 100 GB, or would the config change based on the data as well? If so, how is it calculated?
Reference for the recommended config: https://spoddutur.github.io/spark-notes/distribution_of_executors_cores_and_memory_for_spark_application.html
The default configuration in Spark applies to all jobs; you can set it in spark-defaults.conf.
In the case of YARN, jobs are automatically queued if enough resources are not available.
You can set the number of executor cores and other configuration at spark-submit time to override the defaults. You can also look at dynamic allocation to avoid doing this yourself, though it is not guaranteed to work as efficiently as setting the configuration yourself.
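As an illustration, overriding the defaults per application from PySpark might look like the sketch below (the executor sizing values are just examples; on YARN, dynamic allocation also typically requires the external shuffle service to be enabled):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("per-app-overrides")                         # illustrative name
    .config("spark.executor.cores", "5")                  # overrides spark-defaults.conf
    .config("spark.executor.memory", "18g")
    .config("spark.dynamicAllocation.enabled", "true")    # let Spark scale the executor count
    .config("spark.dynamicAllocation.maxExecutors", "29")
    .config("spark.shuffle.service.enabled", "true")      # usually needed for dynamic allocation on YARN
    .getOrCreate()
)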

Increase the Spark workers cores

I have installed Spark on a master and 2 workers. The original core count per worker is 8. When I start the master, the workers work properly without any problem, but in the Spark GUI each worker has only 2 cores assigned.
How can I increase the number of cores so that each worker runs with all 8 cores?
The setting which controls cores per executor is spark.executor.cores. See the docs. It can be set either via a spark-submit command-line argument or in spark-defaults.conf. The file is usually located in /etc/spark/conf (ymmv). You can search for the conf file with find / -type f -name spark-defaults.conf
spark.executor.cores 8
However, the setting does not guarantee that each executor will always get all the available cores; this depends on your workload.
If you schedule tasks on a DataFrame or RDD, Spark will run a parallel task for each partition of the DataFrame. A task is scheduled to an executor (a separate JVM), and the executor can run multiple tasks in parallel in JVM threads, one per core.
Also, an executor will not necessarily run on a separate worker. If there is enough memory, 2 executors can share a worker node.
In order to use all the cores the setup in your case could look as follows:
given you have 10 GB of memory on each node:
spark.default.parallelism 14
spark.executor.instances 2
spark.executor.cores 7
spark.executor.memory 9g
Setting memory to 9g will make sure each executor is assigned to a separate node. Each executor will have 7 cores available, and each DataFrame operation will be scheduled as 14 concurrent tasks, distributed 7 per executor. You can also repartition a DataFrame instead of setting default.parallelism. One core and 1 GB of memory are left for the operating system.
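For example, the repartition alternative mentioned above could look like this minimal sketch (the DataFrame here is a toy spark.range, just for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-example").getOrCreate()

df = spark.range(1000000)           # toy DataFrame for illustration
df = df.repartition(14)             # 14 partitions -> up to 14 concurrent tasks (2 executors x 7 cores)
print(df.rdd.getNumPartitions())    # prints 14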

Spark Local vs Cluster

What is the Spark cluster equivalent of standalone's local[N]? I mean, which parameter in cluster mode takes the value we set as N in local[N]?
In local[N], N is the maximum number of cores that can be used on the node at any point in time.
In cluster mode you can set --executor-cores N.
It means that each executor can run a maximum of N tasks at the same time.
In cluster mode, one executor will run on one worker node, which means that one executor will take all the cores on the worker node. That could result in under-utilization of resources. Keep in mind that the driver will also take up one worker node.
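A small sketch to illustrate the correspondence (the core counts, app name and spark-submit invocation are just examples):

from pyspark.sql import SparkSession

# Local mode: local[4] means up to 4 tasks run concurrently inside one JVM.
spark = (
    SparkSession.builder
    .master("local[4]")
    .appName("local-vs-cluster")   # illustrative name
    .getOrCreate()
)

# The rough cluster-mode counterpart is per-executor parallelism, e.g.
#   spark-submit --master yarn --executor-cores 4 --num-executors 2 app.py
# which lets each executor run up to 4 tasks concurrently (8 across the cluster).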

How to release seemingly inactive executors from a long-running PySpark framework?

Here's my problem. Let's say I have a long-running PySpark framework. It has thousands of tasks that can all be executed in parallel. I get allocated 1,000 cores at the beginning on many different hosts. Each task needs one core. Then, when those finish, the host holds onto one core and has no active tasks. Since there are a large number of hosts, what can happen is that a larger and larger percentage of my cores are allocated to executors that don't have any active tasks. So I can have 1000 cores allocated, but only 100 active tasks. The other 900 cores are allocated to executors that have no active tasks. How can I improve this? Is there a way to shut down executors that aren't doing anything? I am currently using PySpark 1.2, so it'd be great for the functionality to be in that version, but would be happy to hear about solutions (or better solutions) in newer versions. Thanks!
If you do not specify the number of executors that Spark should use, Spark allocates executors as long as Spark has at least 1 task pending in its queue. You can set an upper limit to the number of executors that Spark can dynamically allocate by using this parameter: spark.dynamicAllocation.maxExecutors.
In other words, when launching Spark, use:
pyspark --master yarn-client --conf spark.dynamicAllocation.maxExecutors=1000
instead of
pyspark --master yarn-client --num-executors=1000
By default, Spark will release executors after 60 seconds of inactivity.
Note: if you .persist() your Spark DataFrames, make sure to .unpersist() them, otherwise Spark will not release the executors.
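A minimal sketch of the persist/unpersist pattern (the paths and the column name are hypothetical; on Spark 1.2 you would apply the same pattern to an RDD, since DataFrames arrived later):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-unpersist").getOrCreate()

df = spark.read.parquet("/path/to/input")   # hypothetical input path
df.persist()                                # cache only while the data is actively reused

counts = df.groupBy("key").count()          # "key" is a hypothetical column
counts.write.mode("overwrite").parquet("/path/to/output")  # hypothetical output path

df.unpersist()                              # release cached blocks so idle executors can be reclaimed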

Spark Streaming Job Keeps growing memory

I am running Spark v1.6.1 on a single machine in standalone mode, with 64 GB RAM and 16 cores.
I have created five worker instances to get five executors, since in standalone mode there cannot be more than one executor per worker node.
Configuration:
SPARK_WORKER_INSTANCES 5
SPARK_WORKER_CORES 1
SPARK_MASTER_OPTS "-Dspark.deploy.defaultCores=5"
All other configurations are left at their defaults in spark-env.sh.
I am running a Spark Streaming direct Kafka job at an interval of 1 minute, which takes data from Kafka and, after some aggregation, writes the data to Mongo.
Problems:
When I start the master and slave, it starts one master process and five worker processes, each consuming only about 212 MB of RAM. When I submit the job, it also creates 5 executor processes and 1 job process, and memory usage grows to 8 GB in total and keeps growing slowly over time, even when there is no data to process.
We are unpersisting the cached RDDs at the end and have also set spark.cleaner.ttl to 600, but memory is still growing.
One more thing: I have seen that SPARK-1706 is merged, so why am I unable to create multiple executors within a worker? Also, in spark-env.sh, any executor-related setting is documented as applying under YARN mode only.
Any help would be greatly appreciated,
Thanks
