Unzip then process in Linux

So I have a script which unzips a file:
#!/bin/bash -e
# will unzip the data without removing the zipped version
gzip -dc "$1" > "RawData/unzipped/$(basename "$1" .gz)"
I then want to execute code on that unzipped file, so later in the same script I have:
# will run FastQC on the argument passed
fastqc "RawData/unzipped/$(basename "$1" .gz)" --outdir=fastReports/
but the second command never seems to execute. (Note: both commands are in the same script, so I assumed the first would complete before the second ran.)
Zipped:
14624_1#10_1.fastq.gz 14624_1#12_2.fastq.gz 14624_1#4_1.fastq.gz 14624_1#7_1.fastq.gz
14624_1#10_2.fastq.gz 14624_1#1_2.fastq.gz 14624_1#4_2.fastq.gz 14624_1#7_2.fastq.gz
14624_1#11_1.fastq.gz 14624_1#2_1.fastq.gz 14624_1#5_1.fastq.gz 14624_1#8_1.fastq.gz
14624_1#11_2.fastq.gz 14624_1#2_2.fastq.gz 14624_1#5_2.fastq.gz 14624_1#8_2.fastq.gz
14624_1#1_1.fastq.gz 14624_1#3_1.fastq.gz 14624_1#6_1.fastq.gz 14624_1#9_1.fastq.gz
14624_1#12_1.fastq.gz 14624_1#3_2.fastq.gz 14624_1#6_2.fastq.gz 14624_1#9_2.fastq.gz
Extracted:
14624_1#10_1.fastq 14624_1#12_1.fastq 14624_1#3_1.fastq 14624_1#5_2.fastq 14624_1#8_1.fastq
14624_1#10_2.fastq 14624_1#12_2.fastq 14624_1#3_2.fastq 14624_1#6_1.fastq 14624_1#8_2.fastq
14624_1#11_1.fastq 14624_1#1_2.fastq 14624_1#4_1.fastq 14624_1#6_2.fastq 14624_1#9_1.fastq
14624_1#11_2.fastq 14624_1#2_1.fastq 14624_1#4_2.fastq 14624_1#7_1.fastq 14624_1#9_2.fastq
14624_1#1_1.fastq 14624_1#2_2.fastq 14624_1#5_1.fastq 14624_1#7_2.fastq

You might just use zcat and process the file on the fly:
fastqc <(zcat path/to/file.gz)
By the way, the <() syntax is a process substitution.
If you need both the unzipped file and the processed result, you can use tee:
fastqc <(zcat path/to/file.gz | tee file)
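Putting your two steps together, a minimal sketch of the whole script (assuming the same RawData/unzipped/ and fastReports/ layout as in your question) could be:
#!/bin/bash -e
# keep the unzipped copy via tee while fastqc reads the same stream
unzipped="RawData/unzipped/$(basename "$1" .gz)"
fastqc <(zcat "$1" | tee "$unzipped") --outdir=fastReports/
One caveat: FastQC derives report names from its input path, so with process substitution the report may be named after the /dev/fd descriptor rather than the original file.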

Related

Find patterns and rename multiple files

I have a list of machine names and hostnames
For example:
# cat /tmp/machine_list.txt
[one]apple machine #1 myserver1
[two]apple machine #2 myserver2
[three]apple machine #3 myserver3
Each server has its own directory, and each directory contains a tar file and a file with the hostname written in it.
# ls /tmp/sos1/*
sosreport1.tar.gz
hostname_map.txt
# cat /tmp/sos1/hostname_map.txt
myserver1
# ls /tmp/sos2/*
sosreport2.tar.gz
hostname_map.txt
# cat /tmp/sos2/hostname_map.txt
myserver2
# ls /tmp/sos3/*
sosreport3.tar.gz
hostname_map.txt
# cat /tmp/sos3/hostname_map.txt
myserver3
Is it possible to rename each sosreport*.tar.gz by matching the hostname_map.txt in its directory against /tmp/machine_list.txt, like below?
# ls /tmp/sos1/*
[one]apple_machine_#1_myserver1_sosreport1.tar.gz
# ls /tmp/sos2/*
[two]apple_machine_#2_myserver2_sosreport2.tar.gz
# ls /tmp/sos3/*
[three]apple_machine_#3_myserver3_sosreport3.tar.gz
Renaming a single file by hand is easy enough, but what about doing them all at once?
Something like this?
srvname () {
    awk -v srv="$(cat "$1")" -F '\t' '$2==srv { print $1; exit }' machine_list.txt
}
for dir in /tmp/sos*/; do
    server=$(srvname "$dir"/hostname_map.txt)
    mv "$dir"/sosreport*.tar.gz "$dir/$server.tar.gz"
done
Demo: https://ideone.com/TS5VyQ
The function assumes your mapping file is tab-delimited. If you want underscores instead of spaces in the server names, change the mapping file.
This should be portable to POSIX sh; the cat could be replaced with a Bash redirection, but I feel that it's not worth giving up portability for such a small change.
If this were my project, I'd probably make the function into a self-contained reusable script (with the input file replaced with a here document in the script itself) since there will probably be more situations where you need to perform the same mapping.
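For instance, a sketch of that standalone script (the name srvname.sh and the embedded mapping are illustrative):
#!/bin/sh
# srvname.sh (hypothetical): print the machine label for the hostname
# in the given map file. The two columns in the here document are
# separated by a single tab, matching the -F '\t' field separator.
srvname () {
    awk -v srv="$1" -F '\t' '$2==srv { print $1; exit }' <<'EOF'
[one]apple machine #1	myserver1
[two]apple machine #2	myserver2
[three]apple machine #3	myserver3
EOF
}
srvname "$(cat "${1:?usage: srvname.sh hostname_map.txt}")"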

Why is the command in /proc/XXX/cmdline truncated but not the arguments

I have a small bash script
#!/bin/bash
echo $(cat /proc/$PPID/cmdline | strings -1)
I call this script from a perl script which is run through nginx.
my $output_string = `/tmp/my_bash_script.sh`;
print $output_string;
When I load this in a browser, the result is something like:
/mnt/my_working_d -d /etc/my_httpd -f /etc/my_httpd/conf/httpd.conf
The location of the perl script is indeed somewhere in /mnt/my_working_directory/...., but why is this truncated, and is there anything I can do to log the whole command? I don't think the cmdline limit of 4k characters (?) that seems to be hardcoded in the kernel applies here.
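For what it's worth, rendering the NUL-separated file directly gives the same truncated output, so it is not an artifact of strings:
tr '\0' ' ' < /proc/$PPID/cmdline   # replace the NUL separators with spaces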

stdout all at once instead of line by line

I wrote a script that gathers load and memory information for a list of about 20 servers by ssh'ing to each one. Since waiting for the script to finish is inefficient, I set up a crontab that writes the script's output to a file, so all I need to do is cat that file whenever I want the information. However, when I cat the file while the cron job is still running, I get incomplete information, because the script's output is written to the file line by line instead of all at once at termination. What needs to be done to make this work?
My crontab:
* * * * * (date;~/bin/RUP_ssh) &> ~/bin/RUP.out
My bash script (RUP_ssh):
for comp in `cat ~/bin/servers`; do
    ssh $comp ~/bin/ca
done
You can buffer the output to a temporary file and then print it all at once, like this:
outputbuffer=`mktemp`            # create a new temporary file, usually in /tmp/
trap "rm '$outputbuffer'" EXIT   # remove the temporary file if we exit early
for comp in `cat ~/bin/servers`; do
    ssh $comp ~/bin/ca >> "$outputbuffer"   # gather info into the buffer file
done
cat "$outputbuffer"              # print the buffer to stdout
# rm "$outputbuffer"             # not needed; the EXIT trap handles cleanup
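Alternatively, since the goal is that ~/bin/RUP.out is always complete when you cat it, here is a sketch of a crontab entry that writes to a temporary file and renames it into place only after the run finishes (mv within one filesystem is atomic, so readers see either the old or the new file, never a partial one):
* * * * * (date; ~/bin/RUP_ssh) > ~/bin/RUP.tmp 2>&1 && mv ~/bin/RUP.tmp ~/bin/RUP.out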
Assuming there is a string that identifies which host the mem/load data came from, you can update your text file as each result comes in. Assuming each data block is one line long, you could use:
for comp in `cat ~/bin/servers`; do
    output=$(ssh $comp ~/bin/ca)
    # remove old mem/load data for $comp from RUP.out; this assumes
    # the string "$comp" appears in the output from ca, and nowhere else
    sed -i "/$comp/d" RUP.out
    echo "$output" >> RUP.out
done
This can be adapted depending on the output of ca. There is plenty of help on sed across the net.

Can I write a script on the command line which iterates through all files in a dir?

I would usually write a script for the following, but this time I only want to use it once, so I would like to write it directly on the command line.
The script processes all files in a dir.
for FILE in *.tif           # grab all the tif files
do
    NEWFILE=test/${FILE}    # create the new file name
    gdal_translate -a_srs EPSG:25832 $FILE $NEWFILE
done
Sorry... I forgot to mention that I did try
for FILE in *.tif do NEWFILE = test_${FILE} gdal_translate -outsize 50% 50% %FILE %NEWFILE done
...but it freezes with a > on the next line, as though it is waiting for something else.
There is basically no difference between an interactive command and a script. If you want to put your commands on one line, separate them with semicolons instead of line breaks.
for f in *.tif; do gdal_translate -a_srs EPSG:25832 "$f" "test/$f"; done
The secondary prompt is displayed by the shell if your command was not yet complete, such as if you are in the middle of a quoted string or a compound command, or if the previous line ended in a backslash.
You need semicolons between your script lines. Try
for FILE in *.tif; do NEWFILE=test/${FILE}; gdal_translate -a_srs EPSG:25832 "$FILE" "$NEWFILE"; done

Synchronize shell script execution

A modified version of a shell script converts an audio file from FLAC to MP3 format. The computer has a quad-core CPU. The script is run using:
./flac2mp3.sh $(find flac -type f)
This converts the FLAC files in the flac directory (no spaces in file names) to MP3 files in the mp3 directory (at the same level as flac). If the destination MP3 file already exists, the script skips the file.
The problem is that sometimes two instances of the script check for the existence of the same MP3 file at nearly the same time, resulting in mangled MP3 files.
How would you run the script multiple times (i.e., once per core), without having to specify a different file set on each command-line, and without overwriting work?
Update - Minimal Race Condition
The script uses the following locking mechanism:
# Convert FLAC to MP3 using tags from the FLAC file.
#
if [ ! -e "$FLAC.lock" ]; then
    touch "$FLAC.lock"
    flac -dc "$FLAC" | lame${lame_opts} \
        --tt "$TITLE" \
        --tn "$TRACKNUMBER" \
        --tg "$GENRE" \
        --ty "$DATE" \
        --ta "$ARTIST" \
        --tl "$ALBUM" \
        --add-id3v2 \
        - "$MP3"
    rm "$FLAC.lock"
fi
However, this still leaves a race condition.
The "lockfile" command provides what you're trying to do for shell scripts without the race condition. The command was written by the procmail folks specifically for this sort of purpose and is available on most BSD/Linux systems (as procmail is available for most environments).
Your test becomes something like this:
lockfile -r 3 "$FLAC.lock"
if test $? -eq 0 ; then
    flac -dc "$FLAC" | lame${lame_opts} \
        --tt "$TITLE" \
        --tn "$TRACKNUMBER" \
        --tg "$GENRE" \
        --ty "$DATE" \
        --ta "$ARTIST" \
        --tl "$ALBUM" \
        --add-id3v2 \
        - "$MP3"
fi
rm -f "$FLAC.lock"
Alternatively, you could make lockfile keep retrying indefinitely so you don't need to test the return code, and instead test for the output file to determine whether to run flac.
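That variant might look something like this (lame flags elided as "..."):
lockfile -r -1 "$FLAC.lock"   # -r -1 retries forever, so the lock call always succeeds
if [ ! -e "$MP3" ]; then      # another instance may already have produced the MP3
    flac -dc "$FLAC" | lame${lame_opts} ... - "$MP3"
fi
rm -f "$FLAC.lock"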
If you don't have lockfile and cannot install it (in any of its versions; there are several implementations), a robust and portable atomic mutex is mkdir.
If the directory you attempt to create already exists, mkdir will fail, so you can check for that; when creation succeeds, you have a guarantee that no other cooperating process is in the critical section at the same time as your code.
if mkdir "$FLAC.lockdir"; then
    # you now have the exclusive lock
    : critical section
    : code goes here
    rmdir "$FLAC.lockdir"
else
    : # do nothing: skip this file,
      # or maybe sleep 1 and loop back and try again
fi
For completeness, maybe also look for flock if you are on a set of platforms where that is reliably made available and need a performant alternative to lockfile.
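With flock(1) from util-linux, a sketch of the same per-file lock, held on file descriptor 9 for the duration of the subshell, might be (lame flags elided):
(
    flock -n 9 || exit 1    # -n: give up immediately if the lock is already held
    [ -e "$MP3" ] && exit 0 # another worker already converted this file
    flac -dc "$FLAC" | lame${lame_opts} ... - "$MP3"
) 9> "$FLAC.lock"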
You could implement locking of the FLAC file the script is working on. Something like:
if (not flac locked)
    lock flac
    do work
else
    continue to next flac
Send output to a temporary file with a unique name, then rename the file to the desired name.
flac -dc "$FLAC" | lame${lame_opts} \
    --tt "$TITLE" \
    --tn "$TRACKNUMBER" \
    --tg "$GENRE" \
    --ty "$DATE" \
    --ta "$ARTIST" \
    --tl "$ALBUM" \
    --add-id3v2 \
    - "$MP3.$$"
mv "$MP3.$$" "$MP3"
If a race condition leaks through your file locking system every once in a while, the final output will still be the result of one process.
To lock a file, you can create a lock file with the same name plus a .lock extension.
Before starting the encoding, check for the existence of the .lock file, and optionally make sure its date isn't too old (in case the process dies). If it does not exist, create it before the encoding starts, and remove it after the encoding completes.
You can also flock the file, but that only really works in C, where you call flock(), write to the file, then close and unlock it. For a shell script, you are probably calling another utility to do the writing of the file.
How about writing a Makefile?
ALL_FLAC = $(wildcard *.flac)
ALL_MP3 = $(patsubst %.flac,%.mp3,$(ALL_FLAC))
all: $(ALL_MP3)
%.mp3: %.flac
	$(FLAC) ...
(Remember that the recipe line must start with a tab.)
Then do
$ make -j4 all
In bash you can set the noclobber option to avoid overwriting files:
help set | egrep 'noclobber|-C'
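For example, a sketch that uses noclobber to turn lock-file creation into an atomic test-and-set (the redirection fails if $FLAC.lock already exists; lame flags elided):
if ( set -o noclobber; : > "$FLAC.lock" ) 2>/dev/null; then
    flac -dc "$FLAC" | lame${lame_opts} ... - "$MP3"
    rm -f "$FLAC.lock"
fi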
Use a tool like FLOM (Free LOck Manager) and simply serialize your command as below:
flom -- flac ....
