Looping through a set of files in Linux based on file path

I have a directory with lots of compressed data files. There are two file types, als.sumstats.lmm.chr and als.sumstats.meta.chr; after chr there is a number from 1 to 22. I want to loop through only the als.sumstats.meta.chr files. However, my code is not working. I keep getting gzip: /ALSsummaryGWAS/Summary_Statistics_GWAS_2016/als.sumstats.meta.chr*.txt.gz: no such file or directory, which suggests that my loop is not finding the files. Can someone help? This is what I have right now.
#!/bin/bash
FILES=/ALSsummaryGWAS/Summary_Statistics_GWAS_2016/als.sumstats.meta.chr*.txt.gz
for f in $FILES;
do
echo "$FILES"
echo "extracting columns 2,1,3,9"
gunzip -c $f | awk '{print $2, $1, $3, $14+$15}' >> ALSGWAS.txt
done

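In your script snippet, the wildcard '*' pattern is stored in the $FILES variable as a literal string; the shell only expands it later, when $FILES is used unquoted in the for loop. If nothing matches at that point, the pattern is passed on unchanged, which is why gzip reports the literal chr*.txt.gz name. So first check that the path is correct (it starts with /, so it is an absolute path).
If you want the list of matching files to be produced explicitly, you can use eval like this: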
FILES="ls -1 /ALSsummaryGWAS/Summary_Statistics_GWAS_2016/als.sumstats.meta.chr*.txt.gz"
for f in $(eval $FILES);
do
echo "$FILES"
echo "processing $f"
echo "extracting columns 2,1,3,9"
gunzip -c $f | awk '{print $2, $1, $3, $14+$15}' >> ALSGWAS.txt
done
But eval is not a recommended way to do such operations (eval is dangerous), so you can try it like this:
FILES=$(ls -1 /ALSsummaryGWAS/Summary_Statistics_GWAS_2016/als.sumstats.meta.chr*.txt.gz)
for f in $FILES;
do
echo "$FILES"
echo "processing $f"
echo "extracting columns 2,1,3,9"
gunzip -c $f | awk '{print $2, $1, $3, $14+$15}' >> ALSGWAS.txt
done
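A glob can also be iterated directly in the for loop, which avoids both eval and parsing the output of ls. The following is only a minimal sketch, assuming bash: the temporary directory, the tiny three-column sample files and the selected columns are made up for illustration and stand in for the real data and the real awk expression.

```shell
#!/bin/bash
# Hypothetical sample data standing in for the real summary statistics.
dir=$(mktemp -d)
for chr in 1 2; do
    printf 'a%s b%s c%s\n' "$chr" "$chr" "$chr" | gzip > "$dir/als.sumstats.meta.chr$chr.txt.gz"
    printf 'x y z\n' | gzip > "$dir/als.sumstats.lmm.chr$chr.txt.gz"
done

out="$dir/ALSGWAS.txt"
: > "$out"    # start with an empty output file

# The glob is expanded here, in the for statement, one word per file.
for f in "$dir"/als.sumstats.meta.chr*.txt.gz; do
    # If nothing matched, the pattern itself is left in $f: stop with a message.
    [ -e "$f" ] || { echo "no files match: $f" >&2; break; }
    echo "processing $f"
    gunzip -c "$f" | awk '{ print $2, $1, $3 }' >> "$out"
done
```

Note that glob results are sorted lexically, so with the real data chr10 would normally be processed before chr2.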

Related

Difficulty to create .txt file from loop in bash

I've this data :
cat >data1.txt <<'EOF'
2020-01-27-06-00;/dev/hd1;100;/
2020-01-27-12-00;/dev/hd1;100;/
2020-01-27-18-00;/dev/hd1;100;/
2020-01-27-06-00;/dev/hd2;200;/usr
2020-01-27-12-00;/dev/hd2;200;/usr
2020-01-27-18-00;/dev/hd2;200;/usr
EOF
cat >data2.txt <<'EOF'
2020-02-27-06-00;/dev/hd1;120;/
2020-02-27-12-00;/dev/hd1;120;/
2020-02-27-18-00;/dev/hd1;120;/
2020-02-27-06-00;/dev/hd2;230;/usr
2020-02-27-12-00;/dev/hd2;230;/usr
2020-02-27-18-00;/dev/hd2;230;/usr
EOF
cat >data3.txt <<'EOF'
2020-03-27-06-00;/dev/hd1;130;/
2020-03-27-12-00;/dev/hd1;130;/
2020-03-27-18-00;/dev/hd1;130;/
2020-03-27-06-00;/dev/hd2;240;/usr
2020-03-27-12-00;/dev/hd2;240;/usr
2020-03-27-18-00;/dev/hd2;240;/usr
EOF
I would like to create a .txt file for each filesystem (hd1.txt, hd2.txt, hd3.txt, hd4.txt and so on) and put in each file the sum of the values for that filesystem from each dataX.txt. It is hard for me to explain in English what I want, so here is an example of the desired result.
Expected content for the output file hd1.txt:
2020-01;/dev/hd1;300;/
2020-02;/dev/hd1;360;/
2020-03;/dev/hd1;390;/
Expected content for the file hd2.txt:
2020-01;/dev/hd2;600;/usr
2020-02;/dev/hd2;690;/usr
2020-03;/dev/hd2;720;/usr
The implementation I've currently tried:
for i in $(cat *.txt | awk -F';' '{print $2}' | cut -d '/' -f3| uniq)
do
cat *.txt | grep -w $i | awk -F';' -v date="$(cat *.txt | awk -F';' '{print $1}' | cut -d'-' -f-2 | uniq )" '{sum+=$3} END {print date";"$2";"sum}' >> $i
done
But it doesn't work...
Can you show me how to do that?
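Because the format is so regular, you can split the input on several separators at once and parse it easily in awk. The - is listed first in the bracket expression so that it is taken literally rather than as a range, and data*.txt is used as input so that the generated hdX.txt files are not read back in on a second run:
awk -v FS='[-;/]' '
# $1-$2 is year-month, $8 the device name, $9 the value, $11 the mount point
{ key = $1 "-" $2 ";" $8 }
key != prev {
    # a new month/device group starts: flush the previous one
    if (length(output)) {
        print output >> fileoutput
    }
    prev = key
    sum = 0
}
{
    sum += $9
    output = sprintf("%s-%s;/%s/%s;%d;/%s", $1, $2, $7, $8, sum, $11)
    fileoutput = $8 ".txt"
}
END {
    if (length(output)) {
        print output >> fileoutput
    }
}
' data*.txt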
Tested on repl generates:
+ cat hd1.txt
2020-01;/dev/hd1;300;/
2020-02;/dev/hd1;360;/
2020-03;/dev/hd1;390;/
+ cat hd2.txt
2020-01;/dev/hd2;600;/usr
2020-02;/dev/hd2;690;/usr
2020-03;/dev/hd2;720;/usr
Alternatively, you could set -v FS=';' and use split() on the first and second columns to extract the year and month and the hdX name.
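The split() variant mentioned above could look roughly like this. It is a sketch under assumptions rather than a tested replacement: the temporary directory, the shortened sample files and the outdir variable are illustrative only, and the sums are collected in arrays so the input does not have to be grouped.

```shell
#!/bin/bash
# Hypothetical, shortened sample data in the same format as the question.
dir=$(mktemp -d)
printf '%s\n' '2020-01-27-06-00;/dev/hd1;100;/' \
              '2020-01-27-12-00;/dev/hd1;100;/' \
              '2020-01-27-06-00;/dev/hd2;200;/usr' > "$dir/data1.txt"
printf '%s\n' '2020-02-27-06-00;/dev/hd1;120;/' \
              '2020-02-27-06-00;/dev/hd2;230;/usr' > "$dir/data2.txt"

awk -F';' -v outdir="$dir" '
{
    split($1, d, "-")              # d[1] = year, d[2] = month
    n = split($2, p, "/")          # p[n] = last path component, e.g. hd1
    key = d[1] "-" d[2] ";" $2     # one group per month and device
    if (!(key in sum)) order[++count] = key   # remember first-seen order
    sum[key] += $3
    mount[key] = $4
    out[key] = outdir "/" p[n] ".txt"
}
END {
    for (i = 1; i <= count; i++) {
        k = order[i]
        print k ";" sum[k] ";" mount[k] >> (out[k])
    }
}' "$dir"/data*.txt

cat "$dir/hd1.txt" "$dir/hd2.txt"
```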
If you seek a bash solution, I suggest you invert the loops - first iterate over files, then over identifiers in second column.
for file in *.txt; do
prev=
output=
while IFS=';' read -r date dev num path; do
hd=$(basename "$dev")
if [[ "$hd" != "${prev:-}" ]]; then
if ((${#output})); then
printf "%s\n" "$output" >> "$fileoutput"
fi
sum=0
prev="$hd"
fi
sum=$((sum + num))
output=$(
printf "%s;%s;%d;%s" \
"$(cut -d'-' -f1-2 <<<"$date")" \
"$dev" "$sum" "$path"
)
fileoutput="${hd}.txt"
done < "$file"
printf "%s\n" "$output" >> "$fileoutput"
done
You could also almost translate awk to bash 1:1 by doing IFS='-;/' in while read loop.

Printing awk output in same line after grep

I have a very crude script, getinfo.sh, that collects information from all files named FILENAME1 and FILENAME2 in all subfolders, together with the path of the subfolder. The awk command should pick only the nth line from FILENAME2 when the script is called as "getinfo.sh n". I want all the information printed on one line!
The problem is that if I use print instead of printf, the script works but the info is written on a new line. If I use printf, I can see the last bit of the awk output at the command prompt after the script is done, but it is not pasted after the grep output on the same line. The complete line will be pretty long, but that is intentional. Could you tell me what I am doing wrong?
#!/bin/bash
IFS=$'\n'
while read -r fname ;
do
pushd $(dirname "${fname}") > /dev/null
printf '%q' "${PWD##*/}"
grep 'Search_term ' FILENAME1 | tail -1
awk '{ if(NR==n) printf "%s",$0 }' n=$1 $2 FILENAME2
popd > /dev/null
done < <(find . -type f -name 'FILENAME1')
I would also be happy to grep the nth line if this is easier?
SOLUTION:
#!/bin/bash
IFS=$'\n'
while read -r fname ;
do
pushd $(dirname "${fname}") > /dev/null
{
printf '%q' "${PWD##*/}"
grep 'Search_term' FILENAME1 | tail -1
} | tr -d '\n'
if [ "$1" -eq "$1" ] 2>/dev/null
then
awk '{ if(NR==n) printf "%s",$0 }' n="$1" FILENAME2
fi
printf "\n"
popd > /dev/null
done < <(find . -type f -name 'FILENAME1')
You made it clearer in the comments.
I want the output of printf '%q' "${PWD##*/}" and grep 'Search_term ' FILENAME1 | tail -1 and awk '{ if(NR==n) printf "%s",$0 }' n=$1 $2 FILENAME2 to be printed in one line
So first, we have three commands, that each print a single line of output. As the commands do not matter, let's wrap them in functions to simplify the answer:
cmd1() { printf '%q\n' "${PWD##*/}"; }
cmd2() { grep .... ; }
cmd3() { awk ....; }
To print them without newlines between them, we can:
Use a command substitution, which removes trailing newlines. With some printf:
printf "%s%s%s\n" "$(cmd1)" "$(cmd2)" "$(cmd3)"
or some echo:
echo "$(cmd1) $(cmd2) $(cmd3)"
or append to a variable:
str="$(cmd1)"
str+=" $(cmd2)"
str+=" $(cmd3)"
printf "%s\n" "$str"
and so on.
We can remove newlines from the stream, using tr -d '\n':
{
cmd1
cmd2
cmd3
} | tr -d '\n'
echo # newlines were removed, so add one to the end.
or we can also remove the newlines only from the first n-1 commands, but I think this is less readable:
{
cmd1
cmd2
} | tr -d '\n'
cmd3 # the trailing newline will be added by cmd3
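The options above can be tried out with stand-in commands. In this sketch cmd1, cmd2 and cmd3 are hypothetical placeholders that each print one fixed line; in the real script they would be the printf, grep and awk commands.

```shell
#!/bin/bash
# Hypothetical stand-ins for the three real commands; each prints one line.
cmd1() { printf '%s\n' 'subdir'; }
cmd2() { printf '%s\n' 'Search_term 42'; }
cmd3() { printf '%s\n' 'third line of FILENAME2'; }

# Variant 1: command substitution strips the trailing newline of each command.
joined=$(printf '%s %s %s' "$(cmd1)" "$(cmd2)" "$(cmd3)")
printf '%s\n' "$joined"

# Variant 2: turn every newline of the combined stream into a space.
streamed=$( { cmd1; cmd2; cmd3; } | tr '\n' ' ' )
printf '%s\n' "${streamed% }"    # drop the space left by the last newline
```

Both variants print the three pieces on a single line; the first keeps control over the separator, the second works on the whole stream at once.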
If I do not pass a number the awk command should be omitted.
I see that your awk command expands both $1 and $2, but only $1 is passed to awk, as the variable assignment n=$1. I don't know what $2 is. You can write ifs on the value of $#, the number of arguments:
if (($# == 2)); then
awk '{ if(NR==n) printf "%s",$0 }' n="$1" "$2" FILENAME2
fi
and similar for each case you want to handle. Remember about proper quoting.
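One way to make the line number optional is to validate it before calling awk. The helper name print_nth_line and the sample file below are assumptions for illustration; they are not part of the original script.

```shell
#!/bin/bash
# print_nth_line is a hypothetical helper: it prints line N of FILE without
# a trailing newline, and prints nothing if N is not a positive integer.
print_nth_line() {
    local n=$1 file=$2
    [[ $n =~ ^[1-9][0-9]*$ ]] || return 0
    awk -v n="$n" 'NR == n { printf "%s", $0; exit }' "$file"
}

sample=$(mktemp)
printf '%s\n' 'first' 'second' 'third' > "$sample"

with_arg=$(print_nth_line 2 "$sample")
without_arg=$(print_nth_line '' "$sample")
echo "with argument: '$with_arg', without: '$without_arg'"
```

In the script itself the call would be something like print_nth_line "$1" FILENAME2, so a missing or non-numeric argument simply skips the awk step.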
Your command shows the unused parameter $2, I deleted that one.
You can add a newline at the end of the awk using the END block, but you also want an extra newline when you call your script without a line number. echo will do.
#!/bin/bash
IFS=$'\n'
while read -r fname ;
do
pushd $(dirname "${fname}") > /dev/null
# Add result of grep in same printf statement
printf '%s %s' "${PWD##*/}" "$(grep 'Search_term ' FILENAME1 | tail -1)"
if (( $# == 1 )); then
# pass $1 to awk as the variable n, the line number to print
awk -v n="$1" '{ if(NR==n) printf "%s ",$0 }' FILENAME2
fi
# Add line-ending
echo
popd > /dev/null
done < <(find . -type f -name 'FILENAME1')

Hex compare in bash scripting

I am facing an issue when I read the third word (a hex string) of each line in a text file and compare it with a hex number. Can someone please help me with it?
#!/bin/bash
A=$1
cat $A | while read a; do
a1=$(echo \""$a"\" | awk '{ print $3 }')
#echo $a > cut -d " " -f 3
echo $a1
(("$a1" == 0x10F7))
echo $?
done
But when I use below, the comparison happens correctly,
a1= 0xADCAFE
(( "$a1" == 0x10F7 ))
echo $?
Then why does it fail when I read the value like this,
a1=$(echo \""$a"\" | awk '{ print $3 }')
or
a1=$(echo $a | awk '{ print $3 }')
echo $a prints the intended hex value, but the comparison does not work.
Regards,
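Running Awk inside a while read loop is an antipattern. Just do the loop in Awk; it's good at that. If the third field holds the value in decimal, compare it against 4343, which is 0x10F7 in decimal: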
awk '$3 == 4343' "$1"
If you want to compare against a string whose value is "0x10F7" then it's
awk '$3 == "0x10F7"' "$1"
If you want to match either, case insensitively etc, a regex is a good way to do that.
awk '$3 ~ /^(0x10[Ff]7|4343)$/' "$1"
Notice how the $1 in double quotes is handled by the shell, and gets replaced by a (properly quoted!) copy of the script's first command-line argument before Awk runs, while the Awk script in single quotes has its own namespace, so $3 is an Awk variable which refers to the third field in the current input line.
Either way, avoid the useless use of cat and always always always quote variables which contain file names with double quotes.
That's literal double quotes. You seem to have tried both a dangerous bare $a and a doubly double-quoted "\"$a\"" where the simple "$a" would be what you actually want.
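If the comparison has to stay in bash, the third field can be read directly with read instead of echo and awk. This is only a sketch: count_matches is a hypothetical helper, the sample file is made up, and the carriage-return stripping assumes the input may have DOS line endings.

```shell
#!/bin/bash
# count_matches is a hypothetical helper: it counts the lines of FILE whose
# third word is a hex number equal to 0x10F7.
count_matches() {
    local file=$1 third n=0
    while read -r _ _ third _; do
        third=${third%$'\r'}                           # drop a trailing CR, if any
        [[ $third =~ ^0[xX][0-9A-Fa-f]+$ ]] || continue   # only accept plain hex
        if (( third == 0x10F7 )); then                 # bash arithmetic understands 0x
            n=$((n + 1))
        fi
    done < "$file"
    printf '%s\n' "$n"
}

sample=$(mktemp)
printf 'a b 0x10F7\r\nc d 0x10f7\ne f 0xADCAFE\n' > "$sample"
matches=$(count_matches "$sample")
echo "matching lines: $matches"
```

Checking the value against a regex before the arithmetic test keeps unexpected input out of the (( )) expression.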
Thank you all for your responses. My script is now working fine. I was trying to match two files, and the script below does the job:
#!/bin/bash
A=$1
B=$2
dos2unix -f "$A"
dos2unix -f "$B"
rm search_match.txt search_data_match.txt search_nomatch.txt search_data_nomatch.txt
while read line;do
search_word=$(echo $line | awk '{ print $1 }')
grep "$search_word" $B >> temp_file.txt
while read var;do
file1_hex=$(echo $line | awk '{ print $2 }')
file2_hex=$(echo $var | awk '{ print $3 }')
(("$file1_hex" == "$file2_hex"))
zero=$(echo $?)
if [ "$zero" -eq 0 ] ; then
echo $line >> search_match.txt
echo $var >> search_data_match.txt
else
echo $line >> search_nomatch.txt
echo $var >> search_data_nomatch.txt
fi
done < "temp_file.txt"
rm temp_file.txt
done < "$A"

splitting folder names with awk in a directory

There are some directories in the working directory with this template
cas-2-32
sat-4-64
...
I want to loop over the directory names and grab the second and third parts of each name. I wrote this script; the body shows what I want to do, but the awk command seems to be wrong.
#!/bin/bash
for file in `ls`; do
if [ -d $file ]; then
arg2=`awk -F "-" '{print $2}' $file`
echo $arg2
arg3=`awk -F "-" '{print $3}' $file`
echo $arg3
fi
done
but it says
awk: cmd. line:1: fatal: cannot open file `cas-2-32' for reading (Invalid argument)
awk expects a file name as input. Since cas-2-32 etc. are directories, awk cannot read them and fails.
Feed the directory names to awk using echo:
#!/bin/bash
for file in `ls`; do
if [ -d $file ]; then
arg2=$(echo $file | awk -F "-" '{print $2}')
echo $arg2
arg3=$(echo $file | awk -F "-" '{print $3}')
echo $arg3
fi
done
Simple command: ls | awk -F- '{ print $2" "$3 }'
If you want each value on its own line, use "\n" instead of the space in awk's print.
When executed like this
awk -F "-" '{print $2}' $file
awk treats $file's value as the file to be parsed, instead of parsing $file's value itself.
The minimal fix is to use a here-string which can feed the value of a variable into stdin of a command:
awk -F "-" '{print $2}' <<< "$file"
By the way, you don't need ls if you merely want a list of files in current directory, use * instead, i.e.
for file in *; do
One way:
#!/bin/bash
for file in *; do
if [ -d "$file" ]; then
tmp="${file#*-}"
arg2="${tmp%-*}"
arg3="${tmp#*-}"
echo "$arg2"
echo "$arg3"
fi
done
The other:
#!/bin/bash
IFS="-"
for file in *; do
# quote $file here: with IFS set to -, an unquoted $file would be split into several words
if [ -d "$file" ]; then
set -- $file
arg2="$2"
arg3="$3"
echo "$arg2"
echo "$arg3"
fi
done
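A further variant limits the changed IFS to a single read, so the rest of the script is unaffected. This is a sketch assuming bash; the temporary directory and the two sample directories are created only for the demonstration.

```shell
#!/bin/bash
# Hypothetical sample directories following the name-num-num template.
workdir=$(mktemp -d)
mkdir "$workdir/cas-2-32" "$workdir/sat-4-64"

result=""
for dir in "$workdir"/*/; do       # the trailing slash matches directories only
    name=${dir%/}                  # strip the trailing slash
    name=${name##*/}               # strip the leading path
    # IFS is changed for this one read only; the first part is discarded.
    IFS=- read -r _ arg2 arg3 <<< "$name"
    echo "$arg2 $arg3"
    result+="$arg2 $arg3"$'\n'
done
```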

Bash loop to compare files

I'm obviously missing something simple, and I know the problem is that it's creating a blank output, which is why it can't compare. However, if someone could shed some light on this it would be great, as I haven't been able to isolate it.
Ultimately, I'm trying to compare the md5sum from a list stored in a txt file, to that stored on the server. If errors, I need it to report that. Here's the output:
root#vps [~/testinggrounds]# cat md5.txt | while read a b; do
> md5sum "$b" | read c d
> if [ "$a" != "$c" ] ; then
> echo "md5 of file $b does not match"
> fi
> done
md5 of file file1 does not match
md5 of file file2 does not match
root#vps [~/testinggrounds]# md5sum file*
2a53da1a6fbfc0bafdd96b0a2ea29515 file1
bcb35cddc47f3df844ff26e9e2167c96 file2
root#vps [~/testinggrounds]# cat md5.txt
2a53da1a6fbfc0bafdd96b0a2ea29515 file1
bcb35cddc47f3df844ff26e9e2167c96 file2
Not directly answering your question, but md5sum(1):
-c, --check
read MD5 sums from the FILEs and check them
Like:
$ ls
1.txt 2.txt md5.txt
$ cat md5.txt
d3b07384d113edec49eaa6238ad5ff00 1.txt
c157a79031e1c40f85931829bc5fc552 2.txt
$ md5sum -c md5.txt
1.txt: OK
2.txt: OK
The problem that you are having is that your inner read is executed in a subshell. In bash, each command of a pipeline runs in a subshell by default. Once the subshell exits, the variables $c and $d are gone. You can use process substitution to avoid the subshell:
while read -r -u3 sum filename; do
read -r cursum _ < <(md5sum "$filename")
if [[ $sum != $cursum ]]; then
printf 'md5 of file %s does not match\n' "$filename"
fi
done 3<md5.txt
The redirection 3<md5.txt causes the file to be opened as file descriptor 3. The -u 3 option to read causes it to read from that file descriptor. The inner read still reads from stdin.
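The difference between the two forms can be seen in isolation. This sketch assumes bash with default settings (in particular, without shopt -s lastpipe); the variable names and the sample string are made up.

```shell
#!/bin/bash
unset piped direct

# Here read is part of a pipeline and runs in a subshell,
# so $piped is gone again once the pipeline has finished.
echo 'abc123 file1' | read -r piped _

# Here read runs in the current shell; only echo runs in a separate process.
read -r direct _ < <(echo 'abc123 file1')

echo "piped='${piped:-}' direct='$direct'"
```

Only the second variable survives, which is why the loop above reads the md5sum output through process substitution.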
I'm not going to argue. I simply try to avoid double read from inside loops.
#! /bin/bash
cat md5.txt | while read sum file
do
prev_sum=$(md5sum "$file" | awk '{print $1}')
if [ "$sum" != "$prev_sum" ]
then
echo "md5 of file $file does not match"
else
echo "$file is fine"
fi
done
