Get available memory inside a SLURM step

I'm trying to write a script that automatically adapts its requirements to whatever environment it is running in.
I already get the number of available CPUs by reading the SLURM_CPUS_PER_TASK environment variable. If it does not exist, I assume it is an interactive execution and default the value to 1.
Now I need the available memory, but this is not so straightforward. There are SLURM_MEM_PER_CPU and SLURM_MEM_PER_NODE. If I'm not wrong, these variables are not always present, and there is the special case of asking for zero memory. But I need the real number, because I'm running a Java application and I have to pass something specific to the -Xmx parameter.
Is there an easy way to get that information? Or do I have to test for the availability of each variable and query SLURM/the system for the total memory available in the zero-memory case?

If you request memory (--mem) in your submission script, these environment variables should be set.
Otherwise, you can try scontrol show config,
or parse /etc/slurm/slurm.conf for the MaxMemPerNode of the partition (PartitionName) you are running in.
ref: https://slurm.schedmd.com/sbatch.html
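A minimal sketch of that fallback logic in Bash, assuming the Slurm values are in megabytes, that --mem=0 leaves SLURM_MEM_PER_NODE unset or zero, and that /proc/meminfo provides MemAvailable (reasonably recent Linux kernels); myapp.jar is a placeholder:
# Work out the memory (in MB) available to this step, falling back in stages.
if [ -n "${SLURM_MEM_PER_NODE:-}" ] && [ "$SLURM_MEM_PER_NODE" -gt 0 ]; then
    mem_mb=$SLURM_MEM_PER_NODE
elif [ -n "${SLURM_MEM_PER_CPU:-}" ]; then
    mem_mb=$(( SLURM_MEM_PER_CPU * ${SLURM_CPUS_PER_TASK:-1} ))
else
    # Interactive run or --mem=0: fall back to MemAvailable from /proc/meminfo (kB -> MB).
    mem_mb=$(( $(awk '/^MemAvailable:/ {print $2}' /proc/meminfo) / 1024 ))
fi

# Leave ~10% headroom for the JVM's own overhead.
java -Xmx"$(( mem_mb * 9 / 10 ))m" -jar myapp.jar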

Related

Trigger an interrupt when the value of a memory location is modified in FreeBSD/Linux

Is it possible to generate an interrupt when the value of a variable or memory location gets modified, in a FreeBSD or Linux environment, using a C program?
In a C application there is a dynamically allocated array which is used and modified from multiple locations. The application is pretty large and complex, and it is difficult to trace all the places where the array is used or modified.
The problem is that under some condition/flow the array[2] element becomes 0, which is not expected in this application. I can't run the application under gdb to debug the issue (because of some constraints). The only way to debug it is to modify the source code and run the binary where the issue happens.
Is it possible to generate an interrupt when the array[2] element is modified, and print the backtrace to find out which part of the codebase modified it?
Thanks!
You want a data breakpoint, also called a watchpoint; GDB provides the following commands:
watch for writes
rwatch for reads
awatch for both
You can give GDB a specific condition as well, so the following expression (or something similar) should work:
watch array[2] if array[2] == 0
You must set the watchpoint in the scope of the variable; the easiest way is to set a breakpoint on the line after the allocation, then set the watchpoint once that breakpoint triggers and resume execution.
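If running an interactive GDB session is not an option, the same idea can be scripted from the shell with GDB's batch mode. This is only a sketch: the binary name ./app and the function alloc_array (assumed to be where the array is allocated, with array visible in scope) are placeholders.
# Non-interactive GDB run; stop after the allocation, arm the watchpoint, then
# print a backtrace when it fires (or when the program exits).
gdb -batch \
    -ex 'break alloc_array' \
    -ex 'run' \
    -ex 'watch array[2] if array[2] == 0' \
    -ex 'continue' \
    -ex 'backtrace' \
    ./app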
OTOH, implementing such a debugging facility within the application itself is rather complex and hardware-specific (if hardware support isn't available, software watchpoints require implementing an entire debugger), so I would recommend using liblldb (which is Apache-2.0 licensed, IIRC), as it provides an lldb::SBWatchpoint class you can leverage. The Python API is documented: https://lldb.llvm.org/python_api/lldb.SBWatchpoint.html.
The C++ API is similar, but there's a lot of boilerplate to write that I don't see documented anywhere, so the API is private; you'd have to look at LLDB's own source code.

Slurm: by default assign a certain number of GPUs

If I do not specify any --gres=gpu:1 option, the process will use up all GPUs on the compute node.
We only use Slurm for GPU sharing, so we would like every process to be assigned one GPU automatically. Is it possible to make --gres=gpu:1 the default for srun?
You can set a default for --gres by setting the SBATCH_GRES environment variable for all users, for instance in /etc/profile.d on the login node. Simply create a file there with the following content:
export SBATCH_GRES=gpu:1
Note that the documentation says
Note that environment variables will override any options set in a batch script
so people who want to use more than one GPU, or no GPU at all, will need to override this default with the command-line option; they won't be able to override it with a #SBATCH --gres line in their submission script.
Another option would be to set CUDA_VISIBLE_DEVICES to an empty string for all users by default. Then, in jobs that request GPUs, the variable will be set by Slurm according to the request, while jobs that do not request a GPU will not 'see' the GPUs.
If users are likely to game the system (CUDA_VISIBLE_DEVICES can be overwritten by the user), then you will have to set up cgroups, as sketched below.
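A minimal sketch of the cgroup-based enforcement, assuming a fairly typical setup; the device paths in gres.conf are illustrative and depend on the node's hardware:
# slurm.conf: enable GPU GRES and delegate task containment to cgroups
GresTypes=gpu
TaskPlugin=task/cgroup

# cgroup.conf: only devices granted via --gres are accessible inside the job
ConstrainDevices=yes

# gres.conf: declare the GPU device files (paths are examples)
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1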

How can I configure SLURM at the user level (e.g. with something like a ".slurmrc")?

Is there something like a .slurmrc for SLURM that would allow each user to set their own defaults for parameters that they would normally specify on the command line?
For example, I run 95% of my jobs on what I'll call our HighMem partition. Since my routine jobs can easily go over the default of 1GB, I almost always request 10GB of RAM. To make the best use of my time, I would like to put the partition and RAM requests in a configuration file so that I don't have to type them in all the time. So, instead of typing the following:
sbatch --partition=HighMem --mem=10G script.sh
I could just type this:
sbatch script.sh
I tried searching for multiple variations on "SLURM user-level configuration" and it seemed that all SLURM-related hits dealt with slurm.conf (a global-level configuration file).
I even tried creating slurm.conf and .slurmrc in my home directory, just in case that worked, but they didn't have any effect on the partition used.
update 1
Yes, I thought about scontrol, but the only configuration file it deals with is global, and most of its parameters aren't even relevant to a normal user.
update 2
My supervisor pointed me to the SLURM Perl API. The last time I looked at it, it seemed too complicated, but this time, after looking at the code in https://github.com/SchedMD/slurm/blob/master/contribs/perlapi/libslurm/perl/t/06-complete.t, it seems it wouldn't be too hard to create a script that behaves like sbatch, reads in a default configuration file, and sets the desired parameters. However, I haven't had any success setting 'std_out' to a file name that actually gets written to.
If your example is representative, defining an alias
alias sbatch='sbatch --partition=HighMem --mem=10G'
could be the easiest way. Alternatively, a Bash function could also be used:
sbatch() {
    command sbatch --partition=HighMem --mem=10G "$@"
}
Put either of these in your .bash_profile to make them persistent.
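If you want something closer to a per-user .slurmrc, the function can read its defaults from a file. This is only a sketch: the file name ~/.sbatch_defaults and its format (one long-form option per line, e.g. --partition=HighMem) are made up for illustration.
sbatch() {
    local defaults=()
    if [ -f "$HOME/.sbatch_defaults" ]; then
        # Collect default options, skipping blank lines and comments.
        while IFS= read -r line; do
            case "$line" in ''|\#*) continue ;; esac
            defaults+=("$line")
        done < "$HOME/.sbatch_defaults"
    fi
    command sbatch "${defaults[@]}" "$@"
}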

Secure Erase of a Bash Environmental Variable

Suppose I have, in a Bash shell script, an environment variable that holds a sensitive value (e.g. a password). How can I securely overwrite the memory that holds this variable's value before exiting my script?
If possible, the technique used to do so should not depend on the particular implementation of Bash I'm using. I'd like to find a standards-respecting/canonical way to do this that works on all correct Bash implementations.
Please note that the following are not in the scope of the question:
1. How the sensitive value is placed into the environment variable
2. How the sensitive value stored in the environment variable is passed to the program that consumes it
Update (July 10, 2017) to address the comment by rici
rici, thank you for your comment, copied here:
"Exiting the script is really the only way to reliably delete an
environment variable from the script's resident memory. Why do you
feel the string is less safe after the script terminates than while it
is running?"
My intent here is to follow good practice and actively scrub all cryptographically-sensitive values from memory as soon as I am through using them.
I do not know if Bash actively scrubs the memory used by a script when that script exits. I suspect that it does not. If it indeed does not, the sensitive cryptographic value will remain resident in memory and is subject to capture by an adversary.
In C/C++, one can easily scrub a value's memory location. I am trying to find out if this is possible in Bash. It may be that Bash is simply not the right tool for security-sensitive applications.
First off, we need to distinguish between environment variables and shell variables. Environment variables exist for the lifetime of the process and cannot be overwritten. Not only that, but on many systems they are trivially visible to other processes. For example Linux provides the /proc filesystem which allows for lots of introspection of running processes, including observing their environment variables.
Here's an example of a Bash script that attempts to overwrite an environment variable. Notice that although the value within the script changes, the process' environment is not changed:
$ SECRET=mysecret bash -c \
'strings /proc/$$/environ | grep SECRET
SECRET=overwritten
echo "SECRET=$SECRET"
strings /proc/$$/environ | grep SECRET'
SECRET=mysecret
SECRET=overwritten
SECRET=mysecret
So it is never safe to store secrets in environment variables unless you control all access to the machine.
Holding a secret in a (non-environment) shell variable is much more secure, as an attacker would need to be able to access the memory of the process, which is generally something only the kernel can do. And while you're correct that minimizing the time you hold onto such secrets is a good practice, it's not generally worth jumping through lots of hoops for. It's far more important to secure your system and execution environment, because a motivated attacker who has sufficient access can observe a secret even if it only lives in memory for a brief time. Holding a secret in memory for longer than strictly necessary is only a marginal increase in risk, whereas running a privileged program on an insecure system already means all bets are off.
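A practical compromise is to keep the secret in a plain, non-exported shell variable and drop it as soon as you are done. A minimal sketch follows; some_command and its --password-stdin flag are placeholders, and note that unset removes the variable from the shell but does not guarantee the underlying memory is zeroed.
#!/bin/bash
# Read the secret without echoing it and without exporting it.
IFS= read -rs -p 'Password: ' secret
echo

# Pass it on stdin rather than via argv or the environment, so it does not
# show up in /proc/<pid>/cmdline or /proc/<pid>/environ of the child.
printf '%s' "$secret" | some_command --password-stdin

# Drop the shell variable as soon as it is no longer needed.
unset -v secret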

The time command lets you profile runtime. Is there a command that lets you profile max and average memory consumption?

What tools can I use, besides continually checking top, to profile a Linux binary?
The benchmarks game (formerly the language shootouts) seems to do this, and it looks like they're using custom-written Python scripts:
http://benchmarksgame.alioth.debian.org/
I was wondering if there's a more out-of-the-box way of doing this.
Is there a command that lets you profile max and average memory consumption?
Yes, use the standalone time tool instead of the Bash builtin time. To invoke it, specify its full path, usually /usr/bin/time. Its output includes at least the maximum resident set size, i.e. the peak memory consumption. See the sample output in this question, for example: Shell execution: time vs. /usr/bin/time.
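For example, with GNU time (the -v and -f flags are GNU-specific; the BSD/macOS time uses -l instead), the relevant lines look roughly like this; ./my_binary and the numbers are placeholders:
$ /usr/bin/time -v ./my_binary
        ...
        Maximum resident set size (kbytes): 123456
        Average resident set size (kbytes): 0
        ...
$ /usr/bin/time -f 'max RSS: %M kB' ./my_binary
max RSS: 123456 kB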
