See the amount of memory a PBS job is currently using

I know I can see how much memory a PBS job has requested using qstat, but is there a way to view how much memory the job is currently using?
Thanks!

qstat -f <jobid> should show you up-to-date information on the memory usage of your job.
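As a minimal sketch (the job ID 123456 is illustrative), on PBS/Torque the resources_used.* fields in the full output report what the job is consuming right now, so you can filter for them directly:

qstat -f 123456 | grep resources_used

Typical fields include resources_used.mem, resources_used.vmem, resources_used.cput and resources_used.walltime; the exact names can vary between PBS flavours.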

Related

Databricks Spark: java.lang.OutOfMemoryError: GC overhead limit exceeded

I am executing a Spark job on a Databricks cluster. I trigger the job via an Azure Data Factory pipeline that runs at a 15-minute interval; after three or four successful executions it fails with the exception "java.lang.OutOfMemoryError: GC overhead limit exceeded".
Though there are many answers to this question, in most of those cases the job never runs at all, whereas in my case it fails only after several previous runs have succeeded.
My data size is less than 20 MB.
My cluster configuration is:
So my question is: what changes should I make in the server configuration? If the issue comes from my code, why does it succeed most of the time? Please advise and suggest a solution.
This is most probably related to the executor memory being a bit low. I'm not sure what the current setting is, and if it's the default, what the default value is in this particular Databricks distribution. Even though the job usually passes, there would be a lot of GC activity because of the low memory, so it would keep failing once in a while. Under the Spark configuration, please set spark.executor.memory along with the other parameters related to the number of executors and cores per executor. With spark-submit the config would be provided as spark-submit --conf spark.executor.memory=1g
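As a hedged example of that spark-submit form (the memory, core and instance values are illustrative, not tuned recommendations, and your_job.py is a hypothetical script name):

spark-submit --conf spark.executor.memory=4g --conf spark.executor.cores=2 --conf spark.executor.instances=4 your_job.py

On Databricks the same keys can usually be set in the cluster's Spark config instead of on the command line.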
You may try increasing the memory of the driver node.
Sometimes the garbage collector does not release all the loaded objects in the driver's memory.
What you can try is to force the GC to do that, by executing the following:
spark.catalog.clearCache()
for (id, rdd) in spark.sparkContext._jsc.getPersistentRDDs().items():
    rdd.unpersist()
    print("Unpersisted {} rdd".format(id))

Change CPU count for RUNNING Slurm Jobs

I have a SLURM cluster and a RUNNING job where I have requested 60 threads by
#SBATCH --cpus-per-task=60
(I am sharing threads on a node using cgroups)
I now want to reduce the amount of threads to 30.
$ scontrol update jobid=274332 NumCPUs=30
Job is no longer pending execution for job 274332
The job has still 60 threads allocated.
$ scontrol show job 274332
JobState=RUNNING Reason=None Dependency=(null)
NumNodes=1 NumCPUs=60 NumTasks=1 CPUs/Task=60 ReqB:S:C:T=0:0:*:*
What would be the correct way to accomplish this?
Thanks!
In the current version of Slurm, scontrol only allows reducing the number of nodes allocated to a running job, not the number of CPUs (or the memory).
The FAQ says:
Use the scontrol command to change a job's size either by specifying a new node count (NumNodes=) for the job or identify the specific nodes (NodeList=) that you want the job to retain.
(Emphasis mine)
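For example (job ID taken from the question, the node name is hypothetical), if the job spanned several nodes, shrinking it would use either of the two forms the FAQ mentions; this is a sketch of node-level resizing only, not a way to change CPUs per task:

$ scontrol update JobId=274332 NumNodes=1
$ scontrol update JobId=274332 NodeList=node001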

Profiling spark executor memory

I have been wanting to find a good way to profile a Spark application's executors when it is run from a Jupyter notebook interface. I basically want to see details like heap memory usage, young and perm gen memory usage, etc. over time for a particular executor (at least the ones that fail).
I see many solutions out there but nothing that seems mature and easy to install/use.
Are there any good tools that let me do this easily?

SLURM add nodes to suspended job

Is there a possibility to add nodes (cores) to a suspended job?
As an example:
scontrol update jobid= then something to accomplish the task
Thank you in advance.
Regards,
Wahi
Changing the size of a suspended job is not possible, according to the Slurm documentation; only pending and running jobs can be resized.
Job(s) changing size must not be in a suspended state...
Hope it helps!
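As a minimal sketch of what that restriction implies (the job ID 12345 and the script expand.sh are hypothetical): bring the job out of the suspended state first, then use the job-expansion mechanism described in the Slurm FAQ, where a second job is submitted with an expand dependency on the first:

$ scontrol resume 12345
$ sbatch --dependency=expand:12345 --nodes=2 expand.sh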

"GC Overhead limit exceeded" on Hadoop .20 datanode

I've searched and haven't found much information related to Hadoop datanode processes dying due to "GC overhead limit exceeded", so I thought I'd post a question.
We are running a test where we need to confirm our Hadoop cluster can handle having ~3 million files stored on it (currently a 4-node cluster). We are using a 64-bit JVM and we've allocated 8 GB to the namenode. However, as my test program writes more files to DFS, the datanodes start dying off with this error:
Exception in thread "DataNode: [/var/hadoop/data/hadoop/data]" java.lang.OutOfMemoryError: GC overhead limit exceeded
I saw some posts about options (parallel GC?) that I guess can be set in hadoop-env.sh, but I'm not too sure of the syntax and I'm kind of a newbie, so I didn't quite grok how it's done.
Thanks for any help here!
Try to increase the memory for the datanode by using this (a Hadoop restart is required for it to take effect):
export HADOOP_DATANODE_OPTS="-Xmx10g"
This will set the heap to 10 GB; you can increase it as per your need.
You can also paste this near the top of the $HADOOP_CONF_DIR/hadoop-env.sh file.
If you are running a map reduce job from command line, you can increase the heap using the parameter -D 'mapreduce.map.java.opts=-Xmx1024m' and/or -D 'mapreduce.reduce.java.opts=-Xmx1024m'. Example:
hadoop --config /etc/hadoop/conf jar /usr/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar --conf /etc/hbase/conf/hbase-site.xml -D 'mapreduce.map.java.opts=-Xmx1024m' --hbase-indexer-file $HOME/morphline-hbase-mapper.xml --zk-host 127.0.0.1/solr --collection hbase-collection1 --go-live --log4j /home/cloudera/morphlines/log4j.properties
Note that in some Cloudera documentation, they still use the old parameters mapred.child.java.opts, mapred.map.child.java.opts and mapred.reduce.child.java.opts. These parameters don't work anymore for Hadoop 2 (see What is the relation between 'mapreduce.map.memory.mb' and 'mapred.map.child.java.opts' in Apache Hadoop YARN?).
This post solved the issue for me.
So, the key is to "Prepend that environment variable" (the first time I've seen this Linux command syntax :) )
HADOOP_CLIENT_OPTS="-Xmx10g" hadoop jar "your.jar" "source.dir" "target.dir"
"GC overhead limit exceeded" indicates that your (tiny) heap is full.
This is what often happens in MapReduce operations when you process a lot of data. Try this:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m -XX:-UseGCOverheadLimit</value>
</property>
Also, try the following:
Use combiners; the reducers shouldn't get any lists longer than a small multiple of the number of maps.
At the same time, you can generate a heap dump on the OOME and analyze it with YourKit or a similar tool.
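As a hedged sketch of that heap-dump suggestion (the dump path and the 10g heap size are illustrative, not values from the answers above), the standard HotSpot flags can be appended to the datanode options in hadoop-env.sh:

export HADOOP_DATANODE_OPTS="-Xmx10g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/hadoop/heapdumps"

-XX:+HeapDumpOnOutOfMemoryError writes an .hprof file when the OutOfMemoryError is thrown, which can then be opened in YourKit, Eclipse MAT or a similar analyzer.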
