Problems reading slurm configuration file with Singularity - slurm

I'm trying to run an application in Singularity across nodes (864 MPI tasks) on an HPC system, namely the S4 machine at the University of Wisconsin's Space Science and Engineering Center (SSEC).
I'm using what Singularity describes as the hybrid model, meaning that I'm using the native (system) MPI but I also have MPI installed in the container. The MPI versions are compatible: I'm using Intel MPI 17.0.6 outside the container and Intel MPI 17.0.1 inside the container. The code in the container is compiled with the Intel 17.0.1 compilers (C++, C, and Fortran).
So here's the problem. When I first ran the code, it complained about not finding the slurm configuration file:
fv3jedi_var.x: error: s_p_parse_file: unable to status file /etc/slurm-llnl/slurm.conf: No such file or directory, retrying in 1sec up to 60sec
So I found the system slurm.conf file in /etc/slurm and mounted this directory in the container as /etc/slurm-llnl. It now finds the configuration file but it does not understand the site-specific configuration:
fv3jedi_var.x: error: "ALL" is not a valid option for "EnforcePartLimits"
fv3jedi_var.x: error: Parsing error at unrecognized key: Features
fv3jedi_var.x: error: Parse error in file /etc/slurm-llnl/slurm.conf line 225: " Features=ivy"
fv3jedi_var.x: error: Parsing error at unrecognized key: Features
fv3jedi_var.x: error: Parse error in file /etc/slurm-llnl/slurm.conf line 226: " Features=ivy"
fv3jedi_var.x: error: Parsing error at unrecognized key: Features
[...]
So, I'm stuck. I'm guessing that this might be a PMI issue? I currently have Slurm's libpmi.so installed in the container, and that's what I'm specifying with the I_MPI_PMI_LIBRARY variable. But I wonder if the native (system) PMI library (I know it is PMI, as opposed to PMI2 or PMIx) is somehow configured to properly process the system slurm.conf file. I have tried to use the native PMI library by mounting (binding) the appropriate directory into the container and changing my I_MPI_PMI_LIBRARY variable. But the native PMI library is in the same directory as glibc, and when I mount that directory there is a conflict between the glibc libraries inside and outside the container:
/bin/sh: relocation error: /usr/lib64/libc.so.6: symbol _dl_starting_up, version GLIBC_PRIVATE not defined in file ld-linux-x86-64.so.2 with link time reference
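For reference, the single-file bind I would like to get working looks roughly like this (the host path /usr/lib64/libpmi.so.0 and the in-container mount point are illustrative, not the actual S4 layout); binding only the library file would keep the container's own glibc and loader:

# Bind only the PMI library itself, not all of /usr/lib64
export SINGULARITY_BINDPATH="$SINGULARITY_BINDPATH,/usr/lib64/libpmi.so.0:/opt/hostpmi/libpmi.so.0"
# Point Intel MPI at the bound host library inside the container
export I_MPI_PMI_LIBRARY=/opt/hostpmi/libpmi.so.0

Though even this could fail at load time if the host libpmi.so needs newer GLIBC symbols than the container provides.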
Any ideas on how to proceed? My slurm batch script is below. Thanks!
#!/usr/bin/bash
# --mem-per-cpu=8192M
#SBATCH --job-name=bm_con14
#SBATCH --partition=ivy
#SBATCH --ntasks=864
#SBATCH --cpus-per-task=1
#SBATCH --time=2:00:00
#SBATCH --mail-user=miesch@ucar.edu
source /etc/bashrc
module purge
module load license_intel
module load intel/17.0.6
ulimit -s unlimited
cd /data/users/mmiesch/runs/con-benchmark/con
JEDICON=/data/users/mmiesch
JEDIBUILD=/data/users/mmiesch/jedi/fv3-bundle/build-con
JEDIBIN=/data/users/mmiesch/jedi/fv3-bundle/build-con/bin
export SINGULARITY_BINDPATH="$JEDIBUILD,/etc/slurm:/etc/slurm-llnl"
srun --ntasks=864 --cpu_bind=cores --distribution=block:block --verbose singularity exec --home=$PWD $JEDICON/jedi-intel17-impi-hpc-dev.sif ${JEDIBIN}/fv3jedi_var.x Config/3dvar_bump.yaml
exit 0

Related

ERROR: eslint-bridge Node.js process is unresponsive. This is most likely caused by process running out of memory

This error originated from an Azure DevOps pipeline while running an analysis with SonarQube.
I tried to apply the steps below, but no luck.
NPM Task
npm install -g increase-memory-limit
Pipeline Variable
SONAR_SCANNER_OPTS = -Xmx4096m
Error Log
INFO: Sensor TypeScript analysis [javascript]
INFO: Using TypeScript at: 'D:\a\45\s\Source\RockyBrands.MODocuments\Presentation Tier\RockyBrands.MODocuments.Web.UI\node_modules'
INFO: Found 2 tsconfig.json file(s): [D:\a\45\s\Source\RockyBrands.MODocuments\Presentation Tier\RockyBrands.MODocuments.Web.UI\obj\Dev\Package\PackageTmp\tsconfig.json, D:\a\45\s\Source\RockyBrands.MODocuments\Presentation Tier\RockyBrands.MODocuments.Web.UI\tsconfig.json]
INFO: 23 source files to be analyzed
INFO: Analyzing 23 files using tsconfig: D:\a\45\s\Source\RockyBrands.MODocuments\Presentation Tier\RockyBrands.MODocuments.Web.UI\tsconfig.json
INFO: 0/23 files analyzed, current file: Source/RockyBrands.MODocuments/Presentation Tier/RockyBrands.MODocuments.Web.UI/app/Home/Payments/batch-details/batch.details.component.ts
INFO: 0/23 files analyzed, current file: Source/RockyBrands.MODocuments/Presentation Tier/RockyBrands.MODocuments.Web.UI/app/Home/Payments/batch-details/batch.details.component.ts
INFO: 0/23 files analyzed, current file: Source/RockyBrands.MODocuments/Presentation Tier/RockyBrands.MODocuments.Web.UI/app/Home/Payments/batch-details/batch.details.component.ts
INFO: 0/23 files analyzed, current file: Source/RockyBrands.MODocuments/Presentation Tier/RockyBrands.MODocuments.Web.UI/app/Home/Payments/batch-details/batch.details.component.ts
INFO: 0/23 files analyzed, current file: Source/RockyBrands.MODocuments/Presentation Tier/RockyBrands.MODocuments.Web.UI/app/Home/Payments/batch-details/batch.details.component.ts
##[error]ERROR: eslint-bridge Node.js process is unresponsive. This is most likely caused by process running out of memory. Consider setting sonar.javascript.node.maxspace to higher value (e.g. 4096).
ERROR: eslint-bridge Node.js process is unresponsive. This is most likely caused by process running out of memory. Consider setting sonar.javascript.node.maxspace to higher value (e.g. 4096).
##[error]ERROR: Failure during analysis, Node.js command to start eslint-bridge was: {NODE_PATH=D:\a\45\s\Source\RockyBrands.MODocuments\Presentation Tier\RockyBrands.MODocuments.Web.UI\node_modules} node D:\a\45\.sonarqube\out\.sonar\.sonartmp\eslint-bridge-bundle\package\bin\server 49855
java.lang.IllegalStateException: eslint-bridge is unresponsive
at org.sonar.plugins.javascript.eslint.EslintBridgeServerImpl.request(EslintBridgeServerImpl.java:202)
at org.sonar.plugins.javascript.eslint.EslintBridgeServerImpl.analyzeTypeScript(EslintBridgeServerImpl.java:186)
ERROR: eslint-bridge Node.js process is unresponsive. This is most likely caused by process running out of memory
You could try adding sonar.javascript.node.maxspace=8192 to your sonar-project.properties.
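A minimal sketch of the property in sonar-project.properties (8192 is a starting point; tune it to the agent's available RAM):

# sonar-project.properties
# Raise the Node.js heap available to the eslint-bridge analysis process (MB)
sonar.javascript.node.maxspace=8192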
Also, if you are using a private agent, consider the RAM of the machine the agent runs on. You can also try reducing the number of files covered by the tsconfig.json files.
And if you are using a hosted agent, you could try increasing Node's memory on the command line (e.g. node --max-old-space-size=16384 ./node_modules/@angular/cli/bin/ng build --prod). See this similar thread: https://github.com/TrilonIO/aspnetcore-angular-universal/issues/736#issuecomment-517374967. According to codehippie1's comment, the same change is needed in the .csproj and package.json.

Neo4j refused to connect

Characteristics:
Linux
Neo4j version 3.2.1
Access on remote
Installation
I installed Neo4j and gave the folder chmod 777.
I'm running it remotely on my machine and I have already enabled non-local access.
Running neo4j start I get this message:
Active database: graph.db
Directories in use:
home: /home/cloudera/Muna/apps/neo4j
config: /home/cloudera/Muna/apps/neo4j/conf
logs: /home/cloudera/Muna/apps/neo4j/logs
plugins: /home/cloudera/Muna/apps/neo4j/plugins
import: /home/cloudera/Muna/apps/neo4j/import
data: /home/cloudera/Muna/apps/neo4j/data
certificates: /home/cloudera/Muna/apps/neo4j/certificates
run: /home/cloudera/Muna/apps/neo4j/run
Starting Neo4j.
WARNING: Max 1024 open files allowed, minimum of 40000 recommended. See the Neo4j manual.
Started neo4j (pid 9469). It is available at http://0.0.0.0:7474/
There may be a short delay until the server is ready.
See /home/cloudera/Muna/apps/neo4j/logs/neo4j.log for current status.
and it is not connecting in the browser.
Running neo4j console gives:
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 409600000 bytes for AllocateHeap
# An error report file with more information is saved as:
# /home/cloudera/hs_err_pid18598.log
Where could the problem be coming from?
Firstly, you should set the maximum open files to 40000, which is the recommended value; then you will not get the WARNING. See: http://neo4j.com/docs/1.6.2/configuration-linux-notes.html
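For example, raise the limit for the shell that starts Neo4j, or persistently via /etc/security/limits.conf (the user name 'neo4j' is illustrative; use whichever account runs the server):

# For the current shell, before starting Neo4j:
ulimit -n 40000

# Or persistently, in /etc/security/limits.conf:
neo4j  soft  nofile  40000
neo4j  hard  nofile  40000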
Secondly, 'failed to allocate memory' means that the Java virtual machine cannot allocate the amount of memory it is configured to start with.
It can be a misconfiguration, or you physically do not have enough memory.
Please read the memory sizing guidelines here:
https://neo4j.com/docs/operations-manual/current/performance/
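On Neo4j 3.x the heap is configured in conf/neo4j.conf; a sketch with deliberately small values for a memory-constrained VM (tune these to what the machine actually has free):

# conf/neo4j.conf -- illustrative settings for a small machine
dbms.memory.heap.initial_size=256m
dbms.memory.heap.max_size=256m
# Page cache used for the store files
dbms.memory.pagecache.size=128m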

Gradle sync failed: Unable to start the daemon process

I have just begun to use Android Studio and I can't get Gradle to sync with my application. Here is what it shows:

7:46:20 PM Gradle sync started
7:46:35 PM Gradle sync failed: Unable to start the daemon process.
This problem might be caused by incorrect configuration of the daemon.
For example, an unrecognized jvm option is used.
Please refer to the user guide chapter on the daemon at https://docs.gradle.org/2.10/userguide/gradle_daemon.html
Please read the following process output to find out more:
-----------------------
Error occurred during initialization of VM
Could not reserve enough space for object heap
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
Consult IDE log for more details (Help | Show Log)
The JVM version is 1.7.0_79 and the Studio version is 2.1.1.
Error occurred during initialization of VM Could not reserve enough space for object heap Error: Could not create the Java Virtual Machine.
There's no space available in RAM. To fix it, go to /android-studio-dir/bin and edit studio.vmoptions and studio64.vmoptions to increase -Xmx and reserve more memory for Java. Note that the number of active processes may influence that.
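For example, the heap lines in studio64.vmoptions might end up like this (the values are illustrative; choose them based on the RAM actually available):

-Xms256m
-Xmx1024m

A larger -Xmx gives the IDE more headroom, but it only helps if the machine has free RAM to back it.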
Alternatively, the /tmp location may be full. I found this fix elsewhere.
Use the df command:
df
You should see an output with a line like this:
tmpfs 102400 102312 88 100% /tmp
To increase the size of the /tmp filesystem:
sudo mount -o remount,size=2G /tmp
Done! Now it should work.

Openstack TripleO undercloud installation "could not find class ::ironic::drivers::deploy"

My host is:
cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)
The host setup was done as described here: http://docs.openstack.org/developer/tripleo-docs/environments/environments.html#virtual-environment up to the "Continue with Undercloud ..." step
The result:
sudo virsh list --all
Id Name State
----------------------------------------------------
3 baremetalbrbm_0 running
4 instack running
- baremetalbrbm_1 shut off
The undercloud setup was done as described here: http://docs.openstack.org/developer/tripleo-docs/installation/installation.html
The installation was attempted on the instack VM. The SSL setup was done as well.
Running
openstack undercloud install
fails with
+ puppet apply --detailed-exitcodes /etc/puppet/manifests/puppet-stack-config.pp
Notice: Scope(Class[Tripleo::Firewall::Post]): At this stage, all network traffic is blocked.
Warning: Scope(Class[Swift]): swift_hash_suffix has been deprecated and should be replaced with swift_hash_path_suffix, this will be removed
Warning: Scope(Class[Nova::Keystone::Auth]): Note that service_name parameter default value will be changed to "Compute Service" (according future release. In case you use different value, please update your manifests accordingly.
Warning: Scope(Class[Nova::Keystone::Auth]): Note that service_name_v3 parameter default value will be changed to "Compute Service v3" (acco in a future release. In case you use different value, please update your manifests accordingly.
Warning: Scope(Class[Glance::Api]): The known_stores parameter is deprecated, use stores instead
Warning: Scope(Class[Glance::Api]): default_store not provided, it will be automatically set to glance.store.filesystem.Store
Warning: Scope(Class[Nova::Api]): In N cycle, enabled_apis will have to be an array of APIs to enable.
Warning: Scope(Class[Neutron::Server]): identity_uri, auth_tenant, auth_user, auth_password, auth_region configuration options are deprecateted options
Warning: Scope(Class[Neutron::Agents::Dhcp]): The dhcp_domain parameter is deprecated and will be removed in future releases
Warning: Scope(Class[Heat]): Default value for rabbit_heartbeat_timeout_threshold parameter is different from OpenStack project defaults
Warning: Scope(Class[Heat]): "admin_user", "admin_password", "admin_tenant_name" configuration options are deprecated in favor of auth_plugi
Warning: Scope(Class[Nova::Network::Neutron]): neutron_auth_plugin parameter is deprecated and will be removed in a future release, use neut
Error: Could not find class ::ironic::drivers::deploy for instack on node instack
Error: Could not find class ::ironic::drivers::deploy for instack on node instack
+ rc=1
+ set -e
+ echo 'puppet apply exited with exit code 1'
puppet apply exited with exit code 1
+ '[' 1 '!=' 2 -a 1 '!=' 0 ']'
+ exit 1
[2016-05-19 15:32:29,361] (os-refresh-config) [ERROR] during configure phase. [Command '['dib-run-parts', '/usr/libexec/os-refresh-config/cot status 1]
[2016-05-19 15:32:29,362] (os-refresh-config) [ERROR] Aborting...
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/instack_undercloud/undercloud.py", line 987, in install
    _run_orc(instack_env)
  File "/usr/lib/python2.7/site-packages/instack_undercloud/undercloud.py", line 866, in _run_orc
    _run_live_command(args, instack_env, 'os-refresh-config')
  File "/usr/lib/python2.7/site-packages/instack_undercloud/undercloud.py", line 444, in _run_live_command
    raise RuntimeError('%s failed. See log for details.' % name)
RuntimeError: os-refresh-config failed. See log for details.
Command 'instack-install-undercloud' returned non-zero exit status 1
I tried to install the Ironic API as described here: http://docs.openstack.org/developer/ironic/deploy/install-guide.html, although to my understanding this should not be necessary, since the undercloud was not installed on a bare-metal machine.
Same result.
Some hours of Puppet reading later, I went into the /etc/puppet/modules/ironic/manifests/drivers folder and found, to no surprise, that the deploy class was not there. Perhaps it should not have been needed? I copied it from https://github.com/openstack/puppet-ironic/blob/master/manifests/drivers/deploy.pp and that seems to have got past the error originally reported. Fingers crossed.
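In shell terms, the workaround was roughly the following (the raw URL is just the raw view of the GitHub link above):

# Fetch the missing Puppet class into the local ironic module
curl -o /etc/puppet/modules/ironic/manifests/drivers/deploy.pp \
  https://raw.githubusercontent.com/openstack/puppet-ironic/master/manifests/drivers/deploy.pp
# Then re-run the undercloud installation
openstack undercloud install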

Spark - UbuntuVM - insufficient memory for the Java Runtime Environment

I'm trying to install Spark 1.5.1 on an Ubuntu 14.04 VM. After un-tarring the file, I changed to the extracted directory and executed the command "./bin/pyspark", which should fire up the pyspark shell. But I got an error message as follows:
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000c5550000, 715849728, 0) failed;
error='Cannot allocate memory' (errno=12)
There is insufficient memory for the Java Runtime Environment to continue.
Native memory allocation (malloc) failed to allocate 715849728 bytes for committing reserved memory.
An error report file with more information is saved as:
/home/datascience/spark-1.5.1-bin-hadoop2.6/hs_err_pid2750.log
Could anyone please give me some directions to sort out the problem?
You need to set spark.driver.memory in the conf/spark-defaults.conf file to a value that fits your machine. For example:
usr1@host:~/spark-1.6.1$ cp conf/spark-defaults.conf.template conf/spark-defaults.conf
nano conf/spark-defaults.conf
spark.driver.memory 512m
For more information, refer to the official documentation: http://spark.apache.org/docs/latest/configuration.html
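Alternatively, the same limit can be passed on the command line when launching the shell (a sketch; 512m is just a conservative starting point):

# Start the pyspark shell with a smaller driver heap
./bin/pyspark --driver-memory 512m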
Pretty much what it says: the JVM asked for about 0.7 GB (715849728 bytes) and could not get it. So give the VM more RAM, or reduce the memory Spark requests as described above.
