First off, apologies if I use confusing or incorrect terminology; I am still learning.
I am trying to set up configuration for a Slurm-enabled adaptive cluster.
The supercomputer and its Slurm configuration are documented here. Here is some of the most relevant information extracted from that website:
Partition Name: compute
Max Nodes per Job: 512
Max Job Runtime: 8 hours
Max resources used simultaneously: no limit
Shared Node Usage: no
Default Memory per CPU: 1920 MB
Max Memory per CPU: 8000 MB
compute
This partition consists of 2659 AMD EPYC 7763 Milan compute nodes and is intended for running parallel scientific applications. The compute nodes allocated for a job are used exclusively and cannot be shared with other jobs. Some information about the compute node:
# of CPU Cores: 64
# of Threads: 128
Here is some output from scontrol show partition:
PartitionName=compute
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=512 MaxTime=08:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=l[10000-10058,10061-10062,10064-10065,10067-10068,10070-10083,10090-10095,10100-10158,10160-10183,10190-10195,10200-10258,10260-10283,10290-10295,10300-10357,10359-10383,10390-10395,10400-10483,10490-10495,10500-10583,10590-10595,10600-10683,10690-10695,10700-10783,10790-10795,20000-20059,20061-20062,20064-20065,20067-20068,20070-20083,20090-20095,20100-20183,20190-20195,20200-20223,20225-20283,20290-20295,20300-20383,20390-20395,20400-20483,20490-20495,20500-20583,20590-20595,20600-20683,20690-20695,30000-30059,30061-30062,30064-30083,30090-30095,30100-30183,30190-30195,30200-30230,30232-30283,30290-30295,30300-30383,30390-30395,30400-30483,30490-30495,30500-30583,30590-30595,30600-30683,30690-30695,30700-30760,30762-30783,30790-30795,40000-40026,40028-40029,40031-40032,40034-40035,40037-40038,40040-40083,40090-40095,40101-40102,40104-40105,40107-40108,40110-40111,40113-40183,40190-40195,40200-40283,40287-40295,40300-40359,40400-40483,40490-40495,40500-40583,40587-40595,40600-40683,40687-40695,50200-50259,50269-50271,50300-50359,50369-50371]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=711168 TotalNodes=2778 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerCPU=960 MaxMemPerCPU=3840
Here is what I have so far:
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    name='dask-cluster',
    processes=32,                        # worker processes per Slurm job
    cores=64,                            # CPU cores per Slurm job (one full compute node)
    memory=f"{8000 * 64 * 0.90} MB",     # 8000 MB per CPU * 64 CPUs, with a 0.90 safety factor
    project="ab0995",
    queue="compute",
    interface='ib0',
    walltime='08:00:00',
    asynchronous=False,
    # job_extra=["--ntasks-per-node=50",],
)
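For reference, this is roughly how I then plan to start workers and connect a client (just a sketch based on the dask_jobqueue/distributed docs; the number of jobs below is a placeholder):

from dask.distributed import Client

cluster.scale(jobs=2)   # placeholder: submit 2 Slurm jobs shaped as configured above
client = Client(cluster)
print(client)           # should report the workers as they come up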
Some things to mention:
In the first table above, “nodes” refers to compute server nodes, not Dask nodes (which I think should probably rather be called Dask workers? If someone could clear up that terminology for me, I would be grateful). Since I have 64 CPU cores and 8000 MB of allowed memory per CPU, I thought it would be sensible to set the memory to 8000 * 64 with a “reduction” factor of 0.90, just to be on the safe side.
I have 64 CPUs, which I believe should translate to 64 “cores” in the SLURMCluster. I want each Python process to have 2 CPUs, so 32 processes in total. That might be optimised down to 4 CPUs per process, but I have no idea how to get a feeling for sensible settings here.
I set the walltime of each dask-cluster job to the maximum allowed, as I would rather block with one Slurm job than have to wait for new ones. This might leave the node idle at times, but it is probably still more effective than waiting in the Slurm batch queue.
If I now print the job script as configured above, I get:
print(cluster.job_script())
#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -p compute
#SBATCH -A ab0995
#SBATCH -n 1
#SBATCH --cpus-per-task=64
#SBATCH --mem=430G
#SBATCH -t 08:00:00
/work/ab0995/AWIsoft/miniconda/NextGEMS/.conda/bin/python -m distributed.cli.dask_worker tcp://136.172.120.121:36449 --nthreads 2 --nprocs 32 --memory-limit 13.41GiB --name dummy-name --nanny --death-timeout 60 --interface ib0 --protocol tcp://
So, questions:
By my mental math, 8000 * 64 * 0.9 = 460.8 GB, not 430G. What is happening here?
I don’t really understand the nthreads, nprocs, and memory-limit values that the dask_worker ends up with…?
Can someone give me a good distinction between Dask nodes, Dask workers, and compute nodes as seen by Slurm?
I am in a similar situation to yours, trying to understand how dask-distributed works with Slurm.
The dask-distributed docs report that Slurm uses KB or GB, but it actually means KiB or GiB, so Dask converts your value to GiB: 460800 MB is about 429.2 GiB, which gets rounded up to the 430G you see in the --mem line.
What I've found is that nprocs = processes, nthreads = cores / processes, and memory-limit = allocated-memory / processes. The job then launches a dask_worker with nprocs processes of nthreads each (these are the workers of your SLURMCluster).
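To make that concrete with your numbers (just redoing the arithmetic in plain Python; I'm assuming Dask parses "MB" as 10^6 bytes and reports Slurm's --mem and the worker memory-limit in GiB):

memory_bytes = 8000 * 64 * 0.90 * 10**6   # the "460800 MB" from your SLURMCluster call
processes = 32
cores = 64

print(memory_bytes / 2**30)               # ~429.2 -> rounded up to --mem=430G
print(cores // processes)                 # 2      -> --nthreads 2
print(memory_bytes / processes / 2**30)   # ~13.41 -> --memory-limit 13.41GiB per worker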
This is not clear to me either, so I don't have a good answer. I think that since Slurm nodes have several CPUs, the scheduler of your SLURMCluster manages Dask workers based on the allocated CPUs. (I didn't find anything about “Dask nodes” in the docs, though.)
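If it helps, you can also ask the scheduler itself how Slurm jobs map to Dask workers. Something like this (untested sketch; the exact field names may differ between versions) should list one entry per worker process:

from dask.distributed import Client

client = Client(cluster)                       # `cluster` is your SLURMCluster
info = client.scheduler_info()
print(len(info["workers"]))                    # with processes=32, expect 32 workers per Slurm job
for addr, worker in info["workers"].items():
    print(addr, worker.get("nthreads"), worker.get("memory_limit"))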
I have a Spark job which I am trying to execute on EMR. It is giving me the below error:
java.lang.OutOfMemoryError: Java heap space
-XX:OnOutOfMemoryError="kill -9 %p"
Executing /bin/sh -c "kill -9 22611"...
I have tried it even with 10 core instances of type m5.12xlarge, but I still hit the same issue. My code itself works fine: I have tested it via AWS Glue, where it succeeded with G.1X workers and 20 DPUs (it takes around 3 hours to complete the job). Any recommendation on how to choose the EMR instance type?
So changing the instance types alone won't always help; we need to play with the Spark configuration as well. I followed something like what is mentioned here, and the job succeeded on EMR.
I have seen various threads on this issue but the solutions given are not working in my case.
The environment is with pyspark 2.1.0 , Java 7 and has enough memory and Cores.
I am running a spark-submit job which deals with JSON files. The job runs perfectly fine with file sizes < 200 MB, but with anything larger it fails with Container exited with a non-zero exit code 143. I then checked the YARN logs, and the error there is java.lang.OutOfMemoryError: Requested array size exceeds VM limit.
Since the JSON files are not in a format that can be read directly with spark.read.json(), the first step in the application is to read the JSON as a text file into an RDD, apply map and flatMap to convert it into the required format, and then use spark.read.json(rdd) to create the DataFrame for further processing. The code is below:
def read_json(self, spark_session, filepath):
    # Read the raw text, insert a separator between back-to-back JSON objects,
    # split them apart, then let Spark parse the individual JSON strings.
    raw_rdd = spark_session.sparkContext.textFile(filepath)
    raw_rdd_add_sep = raw_rdd.map(lambda x: x.replace('}{', '}}{{'))
    raw_rdd_split = raw_rdd_add_sep.flatMap(lambda x: x.split('}{'))
    required_df = spark_session.read.json(raw_rdd_split)
    return required_df
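To illustrate what the map and flatMap are doing, here is the same string manipulation on a tiny made-up example (plain Python, no Spark):

line = '{"a": 1}{"b": 2}{"c": 3}'         # several JSON objects concatenated on one line
with_sep = line.replace('}{', '}}{{')     # '{"a": 1}}{{"b": 2}}{{"c": 3}'
parts = with_sep.split('}{')              # ['{"a": 1}', '{"b": 2}', '{"c": 3}']
print(parts)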
I have tried increasing the memory overhead for the executor and driver using the options spark.driver.memoryOverhead and spark.executor.memoryOverhead, which didn't help.
I have also enabled the off-heap option spark.memory.offHeap.enabled and set spark.memory.offHeap.size.
I have tried setting the JVM memory option with spark.driver.extraJavaOptions=-Xms10g.
None of the above options worked in this scenario. Some of the JSON files are nearly 1 GB, and we need to process ~200 files a day.
Can someone help resolve this issue, please?
Regarding "Container exited with a non-zero exit code 143": it is probably caused by a memory problem.
You need to check in the Spark UI whether the settings you set are actually taking effect.
By the way, the ratio of spark.executor.memory to spark.executor.memoryOverhead should be about 4:1.
I don't know why you changed the JVM setting directly with spark.driver.extraJavaOptions=-Xms10g; I recommend using --driver-memory 10g instead, e.g. spark-submit --driver-memory 10G. (I recall that driver-memory sometimes only takes effect when passed to spark-submit.)
From my perspective, you just need to adjust these four settings to match your machine's resources (see the sketch after this list):
spark.driver.memoryOverhead
spark.executor.memoryOverhead
spark.driver.memory
spark.executor.memory
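For example (the values below are placeholders only, not a recipe; size them to your instances and keep executor memory to overhead at roughly 4:1), either as --conf flags to spark-submit or on the session builder:

from pyspark.sql import SparkSession

# Placeholder values for illustration only; tune them to your node type.
# As noted above, driver memory is often more reliable when passed on the
# spark-submit command line (--driver-memory) than set inside the application.
spark = (
    SparkSession.builder
    .appName("json-processing")
    .config("spark.executor.memory", "16g")
    .config("spark.executor.memoryOverhead", "4g")   # roughly 4:1 vs executor memory
    .config("spark.driver.memory", "10g")
    .config("spark.driver.memoryOverhead", "2g")
    .getOrCreate()
)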
When reading 161,000 elements from HBase (462 MB based on HDFS file size), Spark spends at least 6 seconds reading them.
HBase is configured to use a block cache. During the test (there is no other process running at that moment), the block cache has a size of 470.1 MB (752.0 MB free).
All the elements are in the block cache.
The executor is running in a YARN container (yarn mode) with 1408 MB of memory.
Everything is running on a single node (including the master), an Amazon m4.large instance.
There are no other rows in the table, and a range scan is performed.
RDD initialized like this
Executor Logs (it took 8 seconds in debug logging level)
The job is executed via Spark JobServer
Even a simple count on the RDD (no other operation) takes 5 seconds
I don't know what I can do based on the figures below. Where does the executor spend its time? How can I identify the bottleneck?
Thank you very much,
Sébastien.
I've searched and not found much information about Hadoop DataNode processes dying due to "GC overhead limit exceeded", so I thought I'd post a question.
We are running a test where we need to confirm that our Hadoop cluster can handle having ~3 million files stored on it (currently a 4-node cluster). We are using a 64-bit JVM and we've allocated 8 GB to the NameNode. However, as my test program writes more files to DFS, the DataNodes start dying off with this error:
Exception in thread "DataNode: [/var/hadoop/data/hadoop/data]" java.lang.OutOfMemoryError: GC overhead limit exceeded
I saw some posts about some options (parallel GC?) that I guess can be set in hadoop-env.sh, but I'm not too sure of the syntax and I'm kind of a newbie, so I didn't quite grok how it's done.
Thanks for any help here!
Try to increase the memory for the DataNode by using this (a Hadoop restart is required for it to take effect):
export HADOOP_DATANODE_OPTS="-Xmx10g"
This will set the heap to 10 GB... you can increase it as per your need.
You can also add this at the start of the $HADOOP_CONF_DIR/hadoop-env.sh file.
If you are running a MapReduce job from the command line, you can increase the heap using the parameters -D 'mapreduce.map.java.opts=-Xmx1024m' and/or -D 'mapreduce.reduce.java.opts=-Xmx1024m'. Example:
hadoop --config /etc/hadoop/conf jar /usr/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar --conf /etc/hbase/conf/hbase-site.xml -D 'mapreduce.map.java.opts=-Xmx1024m' --hbase-indexer-file $HOME/morphline-hbase-mapper.xml --zk-host 127.0.0.1/solr --collection hbase-collection1 --go-live --log4j /home/cloudera/morphlines/log4j.properties
Note that some Cloudera documentation still uses the old parameters mapred.child.java.opts, mapred.map.child.java.opts and mapred.reduce.child.java.opts. These parameters no longer work for Hadoop 2 (see What is the relation between 'mapreduce.map.memory.mb' and 'mapred.map.child.java.opts' in Apache Hadoop YARN?).
This post solved the issue for me.
So, the key is to "prepend that environment variable" (first time I've seen this Linux command syntax :) ):
HADOOP_CLIENT_OPTS="-Xmx10g" hadoop jar "your.jar" "source.dir" "target.dir"
The GC overhead limit error indicates that your (tiny) heap is full.
This is what often happens in MapReduce operations when you process a lot of data. Try this:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m -XX:-UseGCOverheadLimit</value>
</property>
Also, try these following things:
Use combiners; the reducers shouldn't get any lists longer than a small multiple of the number of maps.
At the same time, you can generate a heap dump on OOME (e.g. with -XX:+HeapDumpOnOutOfMemoryError) and analyze it with YourKit or a similar tool.