GNU Parallel Memory Leak when Running Python Script - python-3.x

I have a Python script with a function that temporarily downloads files from a bucket, transforms them into an ndarray, and concludes by saving it (final size ~10 GB) to another bucket.
I need to run this script ~200 times, so I created an sh file, run_reshape.sh, that lists the runs to parallelize in the following layout:
#!/bin/sh
python3 reshape.py 'group_1'
python3 reshape.py 'group_2'
...
I have been trying to parallelize these runs using GNU Parallel in the following way:
parallel --jobs 6 --tmpdir scratch/tmp --cleanup < run_reshape.sh
After 2-3 successful runs of the .py script on different cores, I get the following error from GNU Parallel:
parallel: Error: Output is incomplete. Cannot append to buffer file in $TMPDIR. Is the disk full?
parallel: Error: Change $TMPDIR with --tmpdir or use --compress.
I'm not sure how the disk could be full. When I check free -m after parallel throws the error, I have >120GB of available space on disk.
I have checked both .parallel/tmp/ and scratch/tmp/: scratch/tmp/ is empty, and .parallel/tmp/ has a single 6-byte file in it. Also, all variables within the Python script are local to a function that is called without assigning its return value. As an extra precaution, I delete them and call gc.collect() at the end of reshape.py.
Any help with this is greatly appreciated!
Extra Info
In case it's helpful, here is the basic outline of reshape.py:
import gc
import sys

import numpy as np
from PIL import Image
# Assumed elsewhere in the script (not shown in this outline):
# gcs_file_system is a gcsfs.GCSFileSystem() instance and file_io is
# tensorflow.python.lib.io.file_io, since both are used below.

# Define reshape function
def reshape_images(arg):
    x_len = 1000
    new_shape = np.empty((x_len, 2048, 2048), dtype=np.float16)
    new_shape[:] = np.nan
    for n in range(x_len):
        with gcs_file_system.open(arg + str([n]) + '.jpg') as file:
            im = Image.open(file)
            np_im = np.array(im, dtype=np.float16)  # dtype object, not the string 'np.float16'
            new_shape[n] = np_im
            del im
            del np_im
    save_string = f'{arg}.npy'
    np.save(file_io.FileIO(f'{save_string}', 'w'), new_shape)
    del new_shape

# Run reshape function
reshape_images(sys.argv[1])

# Clear memory of namespace variables
gc.collect()

I'm not sure how the disk could be full. When I check free -m after parallel throws the error, I have >120GB of available space on disk.
You need to do df scratch/tmp before GNU Parallel stops.
GNU Parallel opens temporary files in --tmpdir, removes them immediately, but keeps them open. This avoids having to clean up the files if GNU Parallel is killed.
You will most likely discover a situation where:
scratch/tmp is full
there are no files in scratch/tmp
But as soon as GNU Parallel ends, the space will be free.
So if you only look at df after GNU Parallel has finished, you will not be looking at the time when the disk is full.
In other words: What you see is a 100% normal behaviour when scratch/tmp is too small.
Try setting --tmpdir to a dir with more available space.
Or try:
seq 100000000 | parallel -uj1 -N0 df scratch/tmp
while running your jobs and see the disk fill up.
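To see why the disk can be full while scratch/tmp looks empty, here is a minimal sketch (assuming Linux, Python 3.3+, and an arbitrary 100 MB buffer size) of the same trick GNU Parallel uses: the file is unlinked immediately but keeps occupying disk space until the open handle is closed.
import os
import shutil
import tempfile

# Create a file in the current directory, write ~100 MB into it, then
# unlink it while keeping the handle open -- just like GNU Parallel's
# buffer files in --tmpdir.
f = tempfile.NamedTemporaryFile(dir='.', delete=False)
f.write(b'\0' * (100 * 1024 * 1024))
f.flush()
os.unlink(f.name)  # the file no longer shows up in ls

# The space is still allocated, so df reports ~100 MB less free space.
print('free bytes while handle is open:', shutil.disk_usage('.').free)

f.close()  # only now is the space released back to the filesystem
print('free bytes after close:', shutil.disk_usage('.').free)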

Related

Error about TMP_DIR space using parallel program

I am using parallel for freebayes to call variants using:
TMPDIR=/path/tmp freebayes-parallel <(fasta_generate_regions.py genome.fasta.fai 100000) 36 \
-f genome.fasta -L $bam_list.txt > freebayes.vcf
But it gives the error:
parallel: Error: Output is incomplete. Cannot append to buffer file tmp. Is the disk full?
parallel: Error: Change $TMPDIR with --tmpdir or use --compress.
I found some other posts suggesting to change the tmp directory to one where space is available. I am not sure how to do that. Thank you for your help!

Use more than one core in bash

I have a Linux tool that (greatly simplifying) cuts out the sequences specified in an IlluminaSeq file. I have 32 files to grind. One file is processed in about 5 hours. I have a CentOS server with 128 cores.
I've found a few solutions, but each one only uses one core. The last one seems to fire 32 nohups, but it still pushes the whole thing through one core.
My question is: does anyone have any idea how to use the server's potential? Basically every file can be processed independently; there are no relations between them.
This is the current version of the script, and I don't know why it only uses one core. I wrote it with the help of advice here on Stack Overflow and found on the Internet:
#!/bin/bash
FILES=/home/daw/raw/*
count=0
for f in $FILES
do
    base=${f##*/}
    echo "process $f file..."
    nohup /home/daw/scythe/scythe -a /home/daw/scythe/illumina_adapters.fa -o "OUT$base" $f &
    (( count ++ ))
    if (( count = 31 )); then
        wait
        count=0
    fi
done
To explain: FILES is a list of files from the raw folder.
The "core" line is the nohup call: the first path is the path to the tool, the -a path is the path to the file with the adapter patterns to cut, -o saves the output under the same file name as the input with OUT prepended, and the last parameter is the input file to be processed.
Here is the tool's readme:
https://github.com/vsbuffalo/scythe
Does anybody know how to handle this?
P.S. I also tried moving nohup before count, but it still uses one core. I have no limitations on the server.
IMHO, the most likely solution is GNU Parallel, so you can run up to, say, 64 jobs in parallel with something like this:
parallel -j 64 /home/daw/scythe/scythe -a /home/daw/scythe/illumina_adapters.fa -o OUT{.} {} ::: /home/daw/raw/*
This has the benefit that jobs are not batched: it keeps 64 running at all times, starting a new one as each job finishes. That is better than waiting potentially 4.9 hours for all 32 of your jobs to finish before starting the last one, which takes a further 5 hours after that. Note that I arbitrarily chose 64 jobs here; if you don't specify otherwise, GNU Parallel will run 1 job per CPU core you have.
Useful additional parameters are:
parallel --bar ... gives a progress bar
parallel --dry-run ... does a dry run so you can see what it would do without actually doing anything
If you have multiple servers available, you can add them in a list and GNU Parallel will distribute the jobs amongst them too:
parallel -S server1,server2,server3 ...

Subprocess.Popen vs .call: What is the correct way to call a C-executable from shell script using python where all 6 jobs can run in parallel

Using subprocess.Popen is producing incomplete results, whereas subprocess.call is giving correct output.
This is related to a regression script which has 6 jobs, and each job performs the same task but on different input files. I'm running everything in parallel using subprocess.Popen.
The task is performed using a shell script that calls a bunch of C-compiled executables whose job is to generate some text reports and then convert the text report info into jpg images.
Sample of the shell script (runit is the file name) calling the C-compiled executables:
#!/bin/csh -f
#file name : runit
#C - Executable 1
clean_spgs
#C - Executable 2
scrub_spgs_all file1
scrub_spgs_all file2
#C - Executable 3
scrub_pick file1 1000
scrub_pick file2 1000
While using subprocess.Popen, both scrub_spgs_all and scrub_pick try to run in parallel, causing the script to generate incomplete results, i.e. the output text files don't contain complete information and some of the output text reports are missing.
The subprocess.Popen call is:
resrun_proc = subprocess.Popen("./"+runrescompare, shell=True, cwd=rescompare_dir, stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
where runrescompare is a shell script and has
#!/bin/csh
#some other text
./runit
Whereas using subprocess.call generates all the output text files and jpg images correctly, but I can't run all 6 jobs in parallel.
resrun_proc = subprocess.call("./"+runrescompare, shell=True, cwd=rescompare_dir, stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
What is the correct way to call a C-executable from a shell script using Python subprocess calls where all 6 jobs can run in parallel (using Python 3.5.1)?
Thanks.
You tried to simulate multiprocessing with subprocess.Popen(), which does not work like you want: the output is blocked after a while unless you consume it, for instance with communicate() (but that call is blocking) or by reading the output yourself, and with 6 concurrent handles in a loop you are bound to get deadlocks.
The best way is to run the subprocess.call lines in separate threads.
There are several ways to do it. A small, simple example with locking:
import threading, time

lock = threading.Lock()

def func1(a, b, c):
    lock.acquire()
    print(a, b, c)
    lock.release()
    time.sleep(10)

tl = []

t = threading.Thread(target=func1, args=[1, 2, 3])
t.start()
tl.append(t)

t = threading.Thread(target=func1, args=[4, 5, 6])
t.start()
tl.append(t)

# wait for all threads to complete (if you want to wait, else
# you can skip this loop)
for t in tl:
    t.join()
I took the time to create an example more suited to your needs: 2 threads executing a command and getting the output, then printing it within a lock to avoid a mix-up. I have used the check_output method for this. I'm using Windows, and I list the C and D drives in parallel.
import threading, time, subprocess

lock = threading.Lock()

def func1(runrescompare, rescompare_dir):
    resrun_proc = subprocess.check_output(runrescompare, shell=True, cwd=rescompare_dir, stderr=subprocess.PIPE, universal_newlines=True)
    lock.acquire()
    print(resrun_proc)
    lock.release()

tl = []

t = threading.Thread(target=func1, args=["ls", "C:/"])
t.start()
tl.append(t)

t = threading.Thread(target=func1, args=["ls", "D:/"])
t.start()
tl.append(t)

# wait for all threads to complete (if you want to wait, else
# you can skip this loop)
for t in tl:
    t.join()
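For the original six regression jobs, a slightly more compact variant is a thread pool. This is only a sketch, assuming Python 3.2+ (where concurrent.futures is available), that each job lives in its own directory, and that runrescompare is the script name from the question; the directory names are hypothetical:
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical list of the 6 job directories; adjust to your setup.
job_dirs = ["job1", "job2", "job3", "job4", "job5", "job6"]

def run_job(job_dir):
    # check_output waits for the job to finish and drains its stdout,
    # so no pipe fills up and blocks the child process.
    return subprocess.check_output(
        "./runrescompare", shell=True, cwd=job_dir,
        stderr=subprocess.STDOUT, universal_newlines=True)

with ThreadPoolExecutor(max_workers=6) as pool:
    for job_dir, output in zip(job_dirs, pool.map(run_job, job_dirs)):
        print(job_dir, "finished:")
        print(output)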

How do I implement "file -s <file>" on Linux in pure Go?

Intent:
Does Go have the functionality (package or otherwise) to perform a special file stat on Linux akin to the command file -s <path>
Example:
[root@localhost ~]# file /proc/uptime
/proc/uptime: empty
[root@localhost ~]# file -s /proc/uptime
/proc/uptime: ASCII text
Use Case:
I have a fileglob of files in /proc/* that I need to very quickly detect if they are truly empty instead of appearing to be empty.
Using The os Package:
Code:
result,_ := os.Stat("/proc/uptime")
fmt.Println("Name:",result.Name()," Size:",result.Size()," Mode:",int(result.Mode()))
fmt.Printf("%q",result)
Result:
Name: uptime Size: 0 Mode: 292
&{"uptime" '\x00' 'Ĥ' {%!q(int64=63606896088) %!q(int32=413685520) %!q(*time.Location=&{ [] [] 0 0 <nil>})} {'\x03' %!q(uint64=4026532071) '\x01' '脤' '\x00' '\x00' '\x00' '\x00' '\x00' 'Ѐ' '\x00' {%!q(int64=1471299288) %!q(int64=413685520)} {%!q(int64=1471299288) %!q(int64=413685520)} {%!q(int64=1471299288) %!q(int64=413685520)} ['\x00' '\x00' '\x00']}}
Obvious Workaround:
There is the obvious workaround shown below, but it's a little over the top to need to shell out to bash in order to get file stats.
output, _ := exec.Command("bash", "-c", "file -s /proc/uptime").Output()
//parse output etc...
EDIT/MY PRACTICAL USE CASE:
Quickly determining which files are zero size without needing to read each one of them first.
file -s /cgroup/memory/lsf/<cluster>/*/tasks | <clean up commands> | uniq -c
6 /cgroup/memory/lsf/<cluster>/<jobid>/tasks: ASCII text
805 /cgroup/memory/lsf/<cluster>/<jobid>/tasks: empty
So in this case, I know that only those 6 jobs are running and the rest (805) have terminated. Reading the file works like this:
# cat /cgroup/memory/lsf/<cluster>/<jobid>/tasks
#
or
# cat /cgroup/memory/lsf/<cluster>/<jobid>/tasks
12352
53455
...
I'm afraid you might be confusing matters here: file is special precisely in that it "knows" a set of heuristics to carry out its task.
To my knowledge, Go does not have anything like this in its standard library, and I've not come across a 3rd-party package implementing file-like functionality (though I invite you to search for relevant keywords on http://godoc.org).
On the other hand, Go provides full access to the syscall interface of the underlying OS, so when it comes to querying the OS the way file does, there's nothing you could not do in plain Go.
So I suggest you just fetch the source code of file, learn what it does in the mode turned on by the "-s" command-line option, and implement that in your Go code.
We'll try to help you with specific problems doing that, should you have any.
Update
Looks like I've managed to grasp what the OP is struggling with. A simple check:
$ stat -c %s /proc/$$/status && wc -c < $_
0
849
That is, the stat call on a file under /proc shows it has no contents, but actually reading from that file returns its contents.
OK, so the solution is simple: instead of doing a call to os.Stat() while traversing the subtree of the filesystem, merely attempt to read a single byte from the file, like in:
var buf [1]byte
f, err := os.Open(fname)
if err != nil {
    // do something, or maybe ignore.
    // A not existing file is OK to ignore
    // (the POSIX error code will be ENOENT)
    // because after the `path/filepath.Walk()` fetched an entry for
    // this file from its directory, the file might well have gone.
}
_, err = f.Read(buf[:])
if err != nil {
    if err == io.EOF {
        // OK, we failed to read 1 byte, so the file is empty.
    }
    // Otherwise, deal with the error
}
f.Close()
You might try to be more clever and first obtain the stat information (using a call to os.Stat()) to see whether the file is a regular file, so as to not attempt reading from sockets etc.
I have a fileglob of files in /proc/* that I need to very quickly
detect if they are truly empty instead of appearing to be empty.
They are truly empty in some sense (e.g. they occupy no space on the filesystem). If you want to check whether any data can be read from them, try reading from them; that's what file -s does:
-s, --special-files
Normally, file only attempts to read and
determine the type of argument files which stat(2) reports are
ordinary files. This prevents problems, because reading special files
may have peculiar consequences. Specifying the -s option causes file
to also read argument files which are block or character special
files. This is useful for determining the filesystem types of the
data in raw disk partitions, which are block special files. This
option also causes file to disregard the file size as reported by
stat(2) since on some systems it reports a zero size for raw disk
partitions.

Compressing the core files during core generation

Is there a way to compress core files during core dump generation?
If storage space is limited on the system, is there a way to conserve it when core dumps are needed, by compressing them immediately as they are generated?
Ideally the method would work on older versions of Linux such as 2.6.x.
The Linux kernel /proc/sys/kernel/core_pattern file will do what you want: http://www.mjmwired.net/kernel/Documentation/sysctl/kernel.txt#191
Set the filename to something like |/bin/gzip -1 > /var/crash/core-%t-%p-%u.gz and your core files should be saved compressed for you.
For embedded Linux systems, the following change works perfectly to generate compressed core files in 2 steps.
step 1: create a script
touch /bin/gen_compress_core.sh
chmod +x /bin/gen_compress_core.sh
cat > /bin/gen_compress_core.sh
#!/bin/sh
exec /bin/gzip -f - >"/var/core/core-$1.$2.gz"
(finish with ctrl+d)
step 2: update the core pattern file
cat > /proc/sys/kernel/core_pattern
|/bin/gen_compress_core.sh %e %p
(finish with ctrl+d)
As suggested by the other answer, the Linux kernel /proc/sys/kernel/core_pattern file is a good place to start: http://www.mjmwired.net/kernel/Documentation/sysctl/kernel.txt#141
As the documentation says, you can specify the special character "|", which tells the kernel to pipe the core file to a script. As suggested, you could use |/bin/gzip -1 > /var/crash/core-%t-%p-%u.gz as the name; however, it doesn't seem to work for me. I expect the reason is that the kernel doesn't treat the > character as output redirection, but rather passes it as a parameter to gzip.
To avoid this problem, as others suggested, you can put your script in some location. I am using /home/<username>/crashes/core.sh; create it using the following command, replacing <username> with your user. Alternatively you can obviously change the entire path.
echo -e '#!/bin/bash\nexec /bin/gzip -f - >"/home/<username>/crashes/core-$1-$2-$3-$4-$5.gz"' > ~/crashes/core.sh
Now this script will take 5 input parameters, concatenate them, and add them to the core path. The full paths must be specified in ~/crashes/core.sh, and the location of this script can also be changed. Now let's tell the kernel to use our executable with its parameters when generating the core file:
sudo sysctl -w kernel.core_pattern="|/home/<username>/crashes/core.sh %e %p %h %t"
Again, <username> should be replaced (or the entire path changed to match the location and name of the core.sh script). The next step is to crash some program; let's create an example crashing cpp file:
int main (){
int * a = nullptr;
int b = *a;
}
After compiling and running there are 2 options, either we will see:
Segmentation fault (core dumped)
Or
Segmentation fault
In case we see the latter, there are a few possible reasons:
ulimit is not set; ulimit -c should show the size limit for cores
apport or your distro's core dump collector is not running; this should be investigated further
there is an error in the script we wrote; I suggest first checking with a basic dump path to rule out the other causes. The following should create /tmp/core.dump:
sudo sysctl -w kernel.core_pattern="/tmp/core.dump"
I know there is already an answer to this question; however, it wasn't obvious to me why it wasn't working "out of the box", so I wanted to summarize my findings. I hope it helps someone.
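As an alternative to the shell handler above, the same piping approach can be sketched in Python. This is only an illustration; the path /usr/local/bin/compress_core.py, the output directory /var/crash, and the %e %p %t parameter order are all assumptions, not something taken from the answers above:
#!/usr/bin/env python3
# Hypothetical core_pattern handler: the kernel pipes the core dump to
# stdin, and the executable name, pid and timestamp arrive as arguments,
# e.g. after:
#   sysctl -w kernel.core_pattern="|/usr/local/bin/compress_core.py %e %p %t"
import gzip
import shutil
import sys

exe, pid, timestamp = sys.argv[1:4]
out_path = "/var/crash/core-{}-{}-{}.gz".format(exe, pid, timestamp)

with gzip.open(out_path, "wb") as out:
    # sys.stdin.buffer is the raw byte stream of the core dump
    shutil.copyfileobj(sys.stdin.buffer, out)
As with the shell version, remember to make the script executable (chmod +x) before pointing core_pattern at it.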
