Error about TMPDIR space when using the parallel program - Linux

I am using GNU Parallel with freebayes to call variants:
TMPDIR=/path/tmp freebayes-parallel <(fasta_generate_regions.py genome.fasta.fai 100000) 36 \
-f genome.fasta -L $bam_list.txt > freebayes.vcf
But it gives this error:
parallel: Error: Output is incomplete. Cannot append to buffer file tmp. Is the disk full?
parallel: Error: Change $TMPDIR with --tmpdir or use --compress.
I found some other posts suggesting to change the temporary directory to one where space is available, but I am not sure how to do that. Thank you for your help!
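For reference, a sketch of what that change might look like; /scratch/tmp is an assumed example of a directory on a filesystem with plenty of free space, and bam_list.txt stands for the BAM list file:
# Confirm the target directory has enough room, then point TMPDIR at it.
mkdir -p /scratch/tmp
df -h /scratch/tmp
TMPDIR=/scratch/tmp freebayes-parallel <(fasta_generate_regions.py genome.fasta.fai 100000) 36 \
    -f genome.fasta -L bam_list.txt > freebayes.vcf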

Related

GNU Parallel Memory Leak when Running Python Script

I have a Python script with a function that temporarily downloads files from a bucket, transforms them into an ndarray, and finishes by saving it (final size ~10 GB) to another bucket.
I need to run this script ~200 times, so I created a shell file, run_reshape.sh, to parallelize the runs; it follows this layout:
#!/bin/sh
python3 reshape.py 'group_1'
python3 reshape.py 'group_2'
...
I have been trying to parallelize these runs using GNU Parallel in the following way:
parallel --jobs 6 --tmpdir scratch/tmp --cleanup < run_reshape.sh
After 2-3 successful runs of the .py script on different cores, I get the following error from GNU Parallel:
parallel: Error: Output is incomplete. Cannot append to buffer file in $TMPDIR. Is the disk full?
parallel: Error: Change $TMPDIR with --tmpdir or use --compress.
I'm not sure how the disk could be full. When I check free -m after parallel throws the error, I have >120GB of available space on disk.
I have checked both .parallel/tmp/ and scratch/tmp/. scratch/tmp/ is empty and .parallel/tmp/ has a 6-byte file in it. Also, all variables within the Python script are located inside a function that is called without its own variable assignment. As an extra precaution, I also delete them and call gc.collect() at the end of reshape.py.
Any help with this is greatly appreciated!
Extra Info
In case it's helpful, here is the basic outline of reshape.py:
import gc
import sys

import numpy as np
from PIL import Image
# gcs_file_system and file_io (the cloud-storage handles) are assumed to be
# set up earlier in the real script.

# Define reshape function
def reshape_images(arg):
    x_len = 1000
    new_shape = np.empty((x_len, 2048, 2048), dtype=np.float16)
    new_shape[:] = np.nan
    for n in range(x_len):
        with gcs_file_system.open(arg + str([n]) + '.jpg') as file:
            im = Image.open(file)
            np_im = np.array(im, dtype=np.float16)
            new_shape[n] = np_im
            del im
            del np_im
    save_string = f'{arg}.npy'
    np.save(file_io.FileIO(save_string, 'w'), new_shape)
    del new_shape

# Run reshape function
reshape_images(sys.argv[1])

# Clear memory of namespace variables
gc.collect()
I'm not sure how the disk could be full. When I check free -m after parallel throws the error, I have >120GB of available space on disk.
You need to do df scratch/tmp before GNU Parallel stops.
GNU Parallel opens temporary files in --tmpdir, removes them immediately, but keeps them open. This way no files need to be cleaned up if GNU Parallel is killed.
You will most likely discover a situation where:
scratch/tmp is full
there are no files in scratch/tmp
But as soon as GNU Parallel ends, the space will be free.
So if you only look at df after GNU Parallel has finished, you will not be looking at the time when the disk is full.
In other words: What you see is a 100% normal behaviour when scratch/tmp is too small.
Try setting --tmpdir to a dir with more available space.
Or try:
seq 100000000 | parallel -uj1 -N0 df scratch/tmp
while running your jobs and see the disk fill up.
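For example (a sketch; /big/scratch is an assumed path with ample free space), the original invocation could be pointed there, optionally with --compress to shrink the buffer files:
parallel --jobs 6 --tmpdir /big/scratch/tmp --compress --cleanup < run_reshape.sh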

How do I implement "file -s <file>" on Linux in pure Go?

Intent:
Does Go have the functionality (package or otherwise) to perform a special file stat on Linux akin to the command file -s <path>?
Example:
[root@localhost ~]# file /proc/uptime
/proc/uptime: empty
[root@localhost ~]# file -s /proc/uptime
/proc/uptime: ASCII text
Use Case:
I have a fileglob of files in /proc/* that I need to very quickly detect if they are truly empty instead of appearing to be empty.
Using The os Package:
Code:
result, _ := os.Stat("/proc/uptime")
fmt.Println("Name:", result.Name(), " Size:", result.Size(), " Mode:", int(result.Mode()))
fmt.Printf("%q", result)
Result:
Name: uptime Size: 0 Mode: 292
&{"uptime" '\x00' 'Ĥ' {%!q(int64=63606896088) %!q(int32=413685520) %!q(*time.Location=&{ [] [] 0 0 <nil>})} {'\x03' %!q(uint64=4026532071) '\x01' '脤' '\x00' '\x00' '\x00' '\x00' '\x00' 'Ѐ' '\x00' {%!q(int64=1471299288) %!q(int64=413685520)} {%!q(int64=1471299288) %!q(int64=413685520)} {%!q(int64=1471299288) %!q(int64=413685520)} ['\x00' '\x00' '\x00']}}
Obvious Workaround:
There is the obvious workaround below, but it's a little over the top to have to shell out to bash just to get file stats.
output, _ := exec.Command("bash", "-c", "file -s /proc/uptime").Output()
//parse output etc...
EDIT/MY PRACTICAL USE CASE:
Quickly determining which files are zero size without needing to read each one of them first.
file -s /cgroup/memory/lsf/<cluster>/*/tasks | <clean up commands> | uniq -c
6 /cgroup/memory/lsf/<cluster>/<jobid>/tasks: ASCII text
805 /cgroup/memory/lsf/<cluster>/<jobid>/tasks: empty
So in this case, I know that only those 6 jobs are running and the rest (805) have terminated. Reading the file works like this:
# cat /cgroup/memory/lsf/<cluster>/<jobid>/tasks
#
or
# cat /cgroup/memory/lsf/<cluster>/<jobid>/tasks
12352
53455
...
I'm afraid you might be confusing matters here: file is special precisely in that it "knows" a set of heuristics to carry out its task.
To my knowledge, Go does not have anything like this in its standard library, and I've not come across a 3rd-party package implementing file-like functionality (though I invite you to search by relevant keywords on http://godoc.org).
On the other hand, Go provides full access to the syscall interface of the underlying OS so when it comes to querying the OS in a way file does it, there's nothing you could not do in plain Go.
So I suggest you just fetch the source code of file, learn what it does in the mode turned on by the "-s" command-line option, and implement that in your Go code.
We'll try to help you with specific problems doing that, should you have any.
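As a sketch of that first step (github.com/file/file is, to my knowledge, the project's read-only mirror; the grep is only a rough starting point, not a precise recipe):
# Grab the sources of file(1) and look for where the -s ("special files") flag is handled.
git clone https://github.com/file/file.git
grep -rn "special" file/src/ | head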
Update
Looks like I've managed to grasp what the OP is struggling with. A simple check:
$ stat -c %s /proc/$$/status && wc -c < $_
0
849
That is, the stat call on a file under /proc shows it has no content, but actually reading from that file returns content.
OK, so the solution is simple: instead of calling os.Stat() while traversing the subtree of the filesystem, merely attempt to read a single byte from the file, like in:
var buf [1]byte
f, err := os.Open(fname)
if err != nil {
	// Do something, or maybe ignore.
	// A non-existing file is OK to ignore
	// (the POSIX error code will be ENOENT)
	// because after `path/filepath.Walk()` fetched an entry for
	// this file from its directory, the file might well have gone.
} else {
	_, err = f.Read(buf[:])
	if err != nil {
		if err == io.EOF {
			// OK, we failed to read 1 byte, so the file is empty.
		}
		// Otherwise, deal with the error.
	}
	f.Close()
}
You might try to be more clever and first obtain the stat information (using a call to os.Stat()) to see if the file is a regular file, so as not to attempt reading from sockets etc.
I have a fileglob of files in /proc/* that I need to very quickly
detect if they are truly empty instead of appearing to be empty.
They are truly empty in some sense (e.g. they occupy no space on the file system). If you want to check whether any data can be read from them, try reading from them - that's what file -s does:
-s, --special-files
Normally, file only attempts to read and
determine the type of argument files which stat(2) reports are
ordinary files. This prevents problems, because reading special files
may have peculiar consequences. Specifying the -s option causes file
to also read argument files which are block or character special
files. This is useful for determining the filesystem types of the
data in raw disk partitions, which are block special files. This
option also causes file to disregard the file size as reported by
stat(2) since on some systems it reports a zero size for raw disk
partitions.

Executing unzip command programmatically

I have created a shell script containing the single statement unzip -o $1. When I run it from a terminal and pass a .zip file as the parameter, it works fine and takes about 5 seconds to create the unzipped folder. Now I am trying to do the same thing in Scala, and my code is as below:
object ZipExt extends App {
  val process = Runtime.getRuntime.exec(Array[String]("/home/administrator/test.sh", "/home/administrator/MyZipFile_0.8.6.3.zip"))
  process.waitFor
  println("done")
}
Now whenever I try to execute ZipExt, it gets stuck in process.waitFor forever and the print statement is never reached. I have tried this code both locally and on a server. I have tried other possibilities too, like creating a local variable inside the shell script, including exit statements in the shell script, and unzipping .zip files other than mine; sometimes the print statement even executes, but no unzipped folder is created. So I am pretty sure there is something wrong with executing the unzip command programmatically, or there is some other way to unzip a zipped file programmatically. I have been stuck on this problem for about two days, so please help.
The information you have given us appears to be insufficient to reproduce the problem:
% mkdir 34088099
% cd 34088099
% mkdir junk
% touch junk/a junk/b junk/c
% zip -r junk.zip junk
updating: junk/ (stored 0%)
adding: junk/a (stored 0%)
adding: junk/b (stored 0%)
adding: junk/c (stored 0%)
% rm -r junk
% echo 'unzip -o $1' > test.sh
% chmod +x test.sh
% scala
Welcome to Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66).
Type in expressions to have them evaluated.
Type :help for more information.
scala> val process = Runtime.getRuntime.exec(Array[String]("./test.sh", "junk.zip"))
process: Process = java.lang.UNIXProcess@35432107
scala> process.waitFor
res0: Int = 0
scala> :quit
% ls junk
a b c
I would suggest trying this same reproduction on your own machine. If it succeeds for you too, then start systematically reducing the differences between the succeeding case and the failing case, a step at a time. This will help narrow down what the possible causes are.

Compressing the core files during core generation

Is there a way to compress core files during core dump generation?
If storage space on the system is limited, is there a way to conserve it by compressing core dumps immediately as they are generated?
Ideally the method would work on older versions of Linux such as 2.6.x.
The Linux kernel /proc/sys/kernel/core_pattern file will do what you want: http://www.mjmwired.net/kernel/Documentation/sysctl/kernel.txt#191
Set the filename to something like |/bin/gzip -1 > /var/crash/core-%t-%p-%u.gz and your core files should be saved compressed for you.
For an embedded Linux system, the following change works perfectly to generate compressed core files, in two steps:
step 1: create a script
touch /bin/gen_compress_core.sh
chmod +x /bin/gen_compress_core.sh
cat > /bin/gen_compress_core.sh
#!/bin/sh
exec /bin/gzip -f - >"/var/core/core-$1.$2.gz"
(finish the cat with Ctrl+D)
step 2: update the core pattern file
cat > /proc/sys/kernel/core_pattern
|/bin/gen_compress_core.sh %e %p
(finish the cat with Ctrl+D)
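A quick way to verify the setup, assuming /var/core exists and is writable; the sleep process below is just a throwaway victim:
ulimit -c unlimited                  # make sure core files are not size-limited
cat /proc/sys/kernel/core_pattern    # confirm the pipe pattern is in place
sleep 60 &                           # start a disposable process
kill -SEGV $!                        # crash it to trigger a core dump
ls -lh /var/core/                    # a compressed core-sleep.<pid>.gz should appear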
As suggested by the other answer, the Linux kernel /proc/sys/kernel/core_pattern file is a good place to start: http://www.mjmwired.net/kernel/Documentation/sysctl/kernel.txt#141
As the documentation says, you can specify the special character "|", which tells the kernel to pipe the core file to a script. As suggested, you could use |/bin/gzip -1 > /var/crash/core-%t-%p-%u.gz as the name; however, it doesn't seem to work for me. I expect the reason is that on my system the kernel doesn't treat the > character as output redirection; rather, it probably passes it as a parameter to gzip.
In order to avoid this problem, as others have suggested, you can create a helper script in some location; I am using /home/<username>/crashes/core.sh. Create it using the following command, replacing <username> with your user. Alternatively, you can obviously change the entire path.
echo -e '#!/bin/bash\nexec /bin/gzip -f - >"/home/<username>/crashes/core-$1-$2-$3-$4-$5.gz"' > ~/crashes/core.sh
Now this script takes its input parameters and concatenates them into the name of the compressed core file. The full path must be specified inside ~/crashes/core.sh, and the location of the script itself can also be changed. Now let's tell the kernel to use our executable, with parameters, when generating the file:
sudo sysctl -w kernel.core_pattern="|/home/<username>/crashes/core.sh %e %p %h %t"
Again, <username> should be replaced (or the entire path changed to match the location and name of the core.sh script). The next step is to crash some program; let's create an example crashing .cpp file:
int main() {
    int *a = nullptr;
    int b = *a;
}
After compiling and running there are two possibilities; either we will see:
Segmentation fault (core dumped)
Or
Segmentation fault
In case we see the latter, there are a few possible reasons:
ulimit is not set; ulimit -c shows the current size limit for core files
apport or your distro's core dump collector is not running; this should be investigated further
there is an error in the script we wrote; I suggest first trying a basic dump path to rule out the other causes. The command below should create /tmp/core.dump:
sudo sysctl -w kernel.core_pattern="/tmp/core.dump"
I know there is already an answer for this question; however, it wasn't obvious to me why it wasn't working "out of the box", so I wanted to summarize my findings. I hope it helps someone.

Synchronize shell script execution

A modified version of a shell script converts an audio file from FLAC to MP3 format. The computer has a quad-core CPU. The script is run using:
./flac2mp3.sh $(find flac -type f)
This converts the FLAC files in the flac directory (no spaces in file names) to MP3 files in the mp3 directory (at the same level as flac). If the destination MP3 file already exists, the script skips the file.
The problem is that sometimes two instances of the script check for the existence of the same MP3 file at nearly the same time, resulting in mangled MP3 files.
How would you run the script multiple times (i.e., once per core), without having to specify a different file set on each command-line, and without overwriting work?
Update - Minimal Race Condition
The script uses the following locking mechanism:
# Convert FLAC to MP3 using tags from flac file.
#
if [ ! -e $FLAC.lock ]; then
    touch $FLAC.lock
    flac -dc "$FLAC" | lame${lame_opts} \
        --tt "$TITLE" \
        --tn "$TRACKNUMBER" \
        --tg "$GENRE" \
        --ty "$DATE" \
        --ta "$ARTIST" \
        --tl "$ALBUM" \
        --add-id3v2 \
        - "$MP3"
    rm $FLAC.lock
fi
However, this still leaves a race condition: the existence check and the touch are two separate steps, so two processes can both pass the test before either one creates the lock file.
The "lockfile" command provides what you're trying to do for shell scripts without the race condition. The command was written by the procmail folks specifically for this sort of purpose and is available on most BSD/Linux systems (as procmail is available for most environments).
Your test becomes something like this:
lockfile -r 3 $FLAC.lock
if test $? -eq 0 ; then
    flac -dc "$FLAC" | lame${lame_opts} \
        --tt "$TITLE" \
        --tn "$TRACKNUMBER" \
        --tg "$GENRE" \
        --ty "$DATE" \
        --ta "$ARTIST" \
        --tl "$ALBUM" \
        --add-id3v2 \
        - "$MP3"
fi
rm -f $FLAC.lock
Alternatively, you could make lockfile keep retrying indefinitely so you don't need to test the return code, and instead test for the output file to determine whether to run flac.
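A minimal sketch of that variant, reusing the $FLAC, $MP3 and ${lame_opts} variables from above; here -r -1 is meant to make lockfile retry forever:
lockfile -r -1 "$FLAC.lock"
if [ ! -e "$MP3" ]; then
    # nobody has produced the MP3 yet, so this process does the encoding
    flac -dc "$FLAC" | lame${lame_opts} - "$MP3"
fi
rm -f "$FLAC.lock"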
If you don't have lockfile and cannot install it (in any of its versions; there are several implementations), a robust and portable atomic mutex is mkdir.
If the directory you attempt to create already exists, mkdir will fail, so you can check for that; when creation succeeds, you have a guarantee that no other cooperating process is in the critical section at the same time as your code.
if mkdir "$FLAC.lockdir"; then
    # you now have the exclusive lock
    : critical section
    : code goes here
    rmdir "$FLAC.lockdir"
else
    : # nothing, skip this file
    # or maybe sleep 1 and loop back and try again
fi
For completeness, also look at flock if you are on a set of platforms where it is reliably available and you need a more performant alternative to lockfile.
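A rough flock(1) equivalent, again reusing $FLAC, $MP3 and ${lame_opts} from above:
(
    flock -n 9 || exit 1                         # give up immediately if another process holds the lock
    flac -dc "$FLAC" | lame${lame_opts} - "$MP3"
) 9> "$FLAC.lock"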
You could implement locking of the FLAC files it's working on. Something like:
if (not flac locked)
    lock flac
    do work
else
    continue to next flac
Send output to a temporary file with a unique name, then rename the file to the desired name.
flac -dc "$FLAC" | lame${lame_opts} \
    --tt "$TITLE" \
    --tn "$TRACKNUMBER" \
    --tg "$GENRE" \
    --ty "$DATE" \
    --ta "$ARTIST" \
    --tl "$ALBUM" \
    --add-id3v2 \
    - "$MP3.$$"
mv "$MP3.$$" "$MP3"
If a race condition leaks through your file locking system every once in a while, the final output will still be the result of one process.
To lock a file process you can create a file with the same name with a .lock extension.
Before starting the encoding, check for the existence of the .lock file, and optionally make sure the date of the lock file isn't too old (in case the process died). If it does not exist, create it before the encoding starts, and remove it after the encoding is complete.
You can also flock the file, but this only really works in C, where you call flock(), write to the file, then close and unlock it. For a shell script, you are probably calling another utility to do the writing of the file.
How about writing a Makefile?
ALL_FLAC = $(wildcard *.flac)
ALL_MP3 = $(patsubst %.flac,%.mp3,$(ALL_FLAC))
all: $(ALL_MP3)
%.mp3: %.flac
	$(FLAC) ...
Then do
$ make -j4 all
In bash it's possible to set the noclobber option to avoid overwriting files:
help set | egrep 'noclobber|-C'
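A minimal sketch of using noclobber as a lock, assuming the same $FLAC, $MP3 and ${lame_opts} variables as the other answers; with noclobber the redirection fails atomically if the lock file already exists:
if ( set -o noclobber; echo "$$" > "$FLAC.lock" ) 2>/dev/null; then
    flac -dc "$FLAC" | lame${lame_opts} - "$MP3"
    rm -f "$FLAC.lock"
fi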
Use a tool like FLOM (Free LOck Manager) and simply serialize your command as below:
flom -- flac ....
