AWK doesn't use the first value of a column - linux

First of all, thank you for your help. I have a problem with awk and a while read loop. I have a file with two columns, each containing 8 values. My script is supposed to take the second column, download the 8 files it lists, and decompress them. The problem is that my script doesn't download the first value of the column.
This is my script
#!/bin/bash
cat $1 | while read line
do
echo "Downloading fasta files from NCBI..."
awk '{print $2}' | wget -i- 2>> log
gzip -d *.gz
done
This is the file I am using
Salmonella_enterica_subsp_enterica_Typhi https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/717/755/GCF_003717755.1_ASM371775v1/GCF_003717755.1_ASM371775v1_translated_cds.faa.gz
Salmonella_enterica_subsp_enterica_Paratyphi_A https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/818/115/GCF_000818115.1_ASM81811v1/GCF_000818115.1_ASM81811v1_translated_cds.faa.gz
Salmonella_enterica_subsp_enterica_Paratyphi_B https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/705/GCF_000018705.1_ASM1870v1/GCF_000018705.1_ASM1870v1_translated_cds.faa.gz
Salmonella_enterica_subsp_enterica_Infantis https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/011/182/555/GCA_011182555.2_ASM1118255v2/GCA_011182555.2_ASM1118255v2_translated_cds.faa.gz
Salmonella_enterica_subsp_enterica_Typhimurium_LT2 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_translated_cds.faa.gz
Salmonella_enterica_subsp_diarizonae https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/324/755/GCF_003324755.1_ASM332475v1/GCF_003324755.1_ASM332475v1_translated_cds.faa.gz
Salmonella_enterica_subsp_arizonae https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/635/675/GCA_900635675.1_31885_G02/GCA_900635675.1_31885_G02_translated_cds.faa.gz
Salmonella_bongori https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/006/113/225/GCF_006113225.1_ASM611322v2/GCF_006113225.1_ASM611322v2_translated_cds.faa.gz

The problem is not the download. Check the output of
#!/bin/bash
cat "$1" | while read line
do
awk '{print $2}'
done
This also prints only 7 of the 8 urls. When entering the loop, the read reads the first line into the variable line. However, you never use that variable, so the line is lost. Then awk reads the remaining 7 lines from stdin in one go. The loop only runs once.
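To make the effect visible without any downloads, here is a minimal sketch of the same loop shape. The function name demo_lost_line and the three sample lines are made up for illustration; only the read/awk interaction is taken from the script above.

```bash
#!/bin/bash
# Same loop shape as the question's script, with labels added so the
# consumer of each line is visible. Reads from standard input.
demo_lost_line() {
  local count=0 line
  while read -r line; do
    count=$((count + 1))
    echo "read consumed: $line"
    # awk has no file argument, so it drains the rest of the loop's stdin.
    awk '{print "awk saw: " $2}'
  done
  echo "loop iterations: $count"
}

printf '%s\n' 'name1 url1' 'name2 url2' 'name3 url3' | demo_lost_line
```

With this input, read takes the name1 line, awk prints url2 and url3, and the loop body runs exactly once.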
You probably wanted to write
#!/bin/bash
cat "$1" | while read -r line
do
echo "Downloading fasta files from NCBI..."
echo "$line" | awk '{print $2}' | wget -i- 2>> log
gzip -d *.gz
done
But there is an easier and safer way:
awk '{print $2}' "$1" | wget -i- 2>> log
gzip -d *.gz

Since the command cut is made to select a column, why not simply issue:
#!/bin/bash
for url in $(cut -f2 "$1")
do
wget "$url" 2>> log
done
gzip -d *.gz
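One thing to check before relying on cut: its default field delimiter is a TAB. If the file is separated by single spaces, as it appears in the question, the delimiter has to be given explicitly. A small sketch, using a made-up line in the same "name URL" layout (the example.org URL is a placeholder, not one of the real entries):

```bash
#!/bin/bash
# Hypothetical line: name and URL separated by a single space.
line='Salmonella_bongori https://example.org/genome.faa.gz'

# Default delimiter is TAB. The line has no TAB, so cut prints it unchanged.
default_result=$(printf '%s\n' "$line" | cut -f2)
echo "default: $default_result"

# With an explicit space delimiter only the URL is printed.
space_result=$(printf '%s\n' "$line" | cut -d' ' -f2)
echo "space:   $space_result"
```

If the columns are separated by a variable number of spaces, awk '{print $2}' is the more forgiving choice, because cut treats every single delimiter character as a field boundary.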

Related

bash script: calculate sum size of files

I'm working on Linux and need to calculate the sum size of some files in a directory.
I've written a bash script named cal.sh as below:
#!/bin/bash
while IFS='' read -r line || [[ -n "$line" ]]; do
echo $line
done<`ls -l | grep opencv | awk '{print $5}'`
However, when I executed this script ./cal.sh, I got an error:
./cal.sh: line 6: `ls -l | grep opencv | awk '{print $5}'`: ambiguous redirect
And if I execute it with sh cal.sh, it seems to work but I will get some weird message at the end of output:
25
31
385758: File name too long
Why does sh cal.sh seem to work? Where does File name too long come from?
Alternatively, you can do:
du -cb *opencv* | awk 'END{print $1}'
The -b option reports each file's size in bytes, and -c prints the grand total.
Ultimately, as other answers will point out, it's not a good idea to parse the output of ls because it may vary between systems. But it's worth knowing why the script doesn't work.
The ambiguous redirect error is because you need quotes around your ls command i.e.:
while IFS='' read -r line || [[ -n "$line" ]]; do
echo $line
done < "`ls -l | grep opencv | awk '{print $5}'`"
But this still doesn't do what you want. The "<" operator is expecting a filename, which is being defined here as the output of the ls command. But you don't want to read a file, you want to read the output of ls. For that you can use the "<<<" operator, also known as a "here string" i.e.:
while IFS='' read -r line || [[ -n "$line" ]]; do
echo $line
done <<< "`ls -l | grep opencv | awk '{print $5}'`"
This works as expected, but has some drawbacks. When using a "here string" the command must first execute in full, then store the output of said command in a temporary variable. This can be a problem if the command takes long to execute or has a large output.
IMHO the best and most standard method of iterating over a command's output line by line is the following:
ls -l | grep opencv | awk '{print $5}' | while read -r line; do
echo "line: $line"
done
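One caveat with the pipe form, in bash with default settings: the while loop runs in a subshell, so variables set inside it are gone once the loop ends. If you want to accumulate a sum in the loop, process substitution keeps the loop in the current shell. A minimal sketch, where the sizes function is a hypothetical stand-in for the ls/awk pipeline:

```bash
#!/bin/bash
# Hypothetical size list standing in for the output of the ls/awk pipeline.
sizes() { printf '%s\n' 25 31 44; }

total_pipe=0
sizes | while read -r n; do
  total_pipe=$((total_pipe + n))        # updates a copy inside the subshell
done
echo "after pipe: $total_pipe"

total_procsub=0
while read -r n; do
  total_procsub=$((total_procsub + n))  # runs in the current shell
done < <(sizes)
echo "after process substitution: $total_procsub"
```

The first total stays at 0 after the loop, while the second holds the sum. Process substitution is a bash feature and is not available in plain sh.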
I would recommend against using that pipeline to get the sizes of the files you want - in general parsing ls is something that you should avoid. Instead, you can just use *opencv* to get the files and stat to print the size:
stat -c %s *opencv*
The format specifier %s prints the size of each file in bytes.
You can pipe this to awk to get the sum:
stat -c %s *opencv* | awk '{ sum += $0 } END { if (sum) print sum }'
The if is there to ensure that no input => no output.
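To check the pipeline on files of known size, here is a small sketch. It assumes GNU stat (the -c option), and the directory and file names are made up for the test.

```bash
#!/bin/bash
# Build a throwaway directory with files of known size (names are made up).
tmpdir=$(mktemp -d)
printf 'abc'   > "$tmpdir/opencv_a.txt"   # 3 bytes
printf 'hello' > "$tmpdir/opencv_b.txt"   # 5 bytes
printf 'zz'    > "$tmpdir/other.txt"      # name does not match *opencv*

# Same stat/awk pipeline as above, run on the two matching files.
total=$(stat -c %s "$tmpdir"/*opencv* | awk '{ sum += $0 } END { if (sum) print sum }')
echo "total bytes: $total"

rm -rf "$tmpdir"
```

Only the two opencv files are counted, so the total is 3 + 5 bytes.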

Reformatting name / content pairs from grep in a bash script

I'm attempting to create a bash script that will grep a single file for two separate pieces of data, and print them to stdout.
So far this is what I have:
#!/bin/sh
cd /my/filePath/to/directory
APP=`grep -r --include "inputs.conf" "\[" | grep -oP '^[^\/]+'`
INPUT=`grep -r --include "inputs.conf" "\[" | grep -oP '\[[^\]]+'`
for i in $APP
do
{cd /opt/splunk/etc/deployment-apps
INPUT=`grep -r --include "inputs.conf" "\[" | grep -oP '\[[^\]]+'`
echo -n "$i | $INPUT"}
done
echo "";
exit
Which gives me an output printing the entire output of the first command (which is about 200 lines), then a |, then the other results from the second command. I was thinking I could create an array to do this, however I'm still learning bash.
This is an output example from the command without piping to grep:
TA-XA6x-Server/local/inputs.conf:[perfmon://Processor]
There are 200+ of these in a single execution, and I was looking to have the format be printed as something like this
app="TA-XA6x-Server/local/inputs.conf:" | input="[perfmon://Processor]"
There are essentially two pieces of information I'm attempting to stitch together:
the file path to the file
the contents of the file itself (the input)
Here is an example of the file path:
/opt/splunk/etc/deployment-apps/TA-XA6x-Server/local/inputs.conf
and this is an example of the inputs.conf file contents:
[perfmon://TCPv4]
The easy, mostly-working-ish approach is something like this:
#!/bin/bash
while IFS=: read -r name content; do
printf 'app="%s" | input="%s"\n' "$name" "$content"
done < <(grep -r --include "inputs.conf" "\[")
If you need to work reliably with all possible filenames (including names with colons or newlines) and have GNU grep available, consider the --null argument to grep and adjusting the read usage appropriately:
#!/bin/bash
while IFS= read -r -d '' name && IFS= read -r content; do
printf 'app="%s" | input="%s"\n' "$name" "$content"
done < <(grep -r --null --include "inputs.conf" "\[")
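To see why the NUL-delimited variant is more robust, here is a sketch that feeds the same read loop with simulated grep --null output. The fake_grep and format_matches functions and the path containing a colon are hypothetical, made up so the example runs without a Splunk directory tree.

```bash
#!/bin/bash
# Simulate `grep --null` output: file name, NUL byte, matching line, newline.
# The first file name deliberately contains a colon (hypothetical path).
fake_grep() {
  printf 'odd:name/local/inputs.conf\0[perfmon://Processor]\n'
  printf 'TA-XA6x-Server/local/inputs.conf\0[perfmon://TCPv4]\n'
}

# Same loop as above: read up to the NUL for the name, then up to the
# newline for the matched line.
format_matches() {
  local name content
  while IFS= read -r -d '' name && IFS= read -r content; do
    printf 'app="%s" | input="%s"\n' "$name" "$content"
  done
}

fake_grep | format_matches
```

With IFS=: splitting, the first record would be cut at the colon inside the file name. Here the name arrives intact because only the NUL byte separates it from the content.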

Reading words from an input file and grepping the lines containing the words from another file

I have a file containing list of 4000 words (A.txt). Now I want to grep lines from another file (sentence_per_line.txt) containing those 4000 words mentioned in the file A.txt.
The shell script I wrote for the above problem is
#!/bin/bash
file="A.txt"
while IFS= read -r line
do
# display $line or do something with $line
printf '%s\n' "$line"
grep $line sentence_per_line.txt >> output.txt
# tried printing the grep command to check its working or not
result=$(grep "$line" sentence_per_line.txt >> output.txt)
echo "$result"
done <"$file"
And A.txt looks like this
applicable
available
White
Black
..
The code doesn't work, and it doesn't show any error either.
Grep has this built in:
grep -f A.txt sentence_per_line.txt > output.txt
Remarks to your code:
Looping over a file to execute grep/sed/awk on each line is typically an antipattern, see this Q&A.
If your $line parameter contains more than one word, you have to quote it (doesn't hurt anyway), or grep tries to look for the first word in a file named after the second word:
grep "$line" sentence_per_line.txt >> output.txt
If you write output in a loop, don't redirect within the loop, do it outside:
while read -r line; do
grep "$line" sentence_per_line.txt
done < "$file" > output.txt
but remember, it's usually not a good idea in the first place.
If you'd like to write to a file and at the same time see what you're writing, you can use tee:
grep "$line" sentence_per_line.txt | tee output.txt
writes to output.txt and stdout.
If A.txt contains words which you want to match only if the complete word matches, i.e., pattern should not match longerpattern, you can use grep -wf – the -w matches only complete words.
If the words in A.txt aren't regular expressions, but fixed strings, you can use grep -Ff – the -F option looks for fixed strings and is faster. The letter f has to come last in the option group because it takes the file name as its argument. These two can be combined: grep -wFf
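The difference between these options is easiest to see on a tiny data set. The sketch below uses made-up file contents (the pattern a.b is there only to show regex versus fixed-string matching) and counts the matching lines for each variant.

```bash
#!/bin/bash
# Throwaway files with made-up contents.
tmpdir=$(mktemp -d)
words="$tmpdir/A.txt"
text="$tmpdir/sentence_per_line.txt"
printf '%s\n' 'available' 'a.b' > "$words"
printf '%s\n' 'rooms are available' 'currently unavailable' \
              'value a.b here' 'value axb here' > "$text"

plain_count=$(grep -c -f "$words" "$text")   # regexes, substring matches
word_count=$(grep -c -wf "$words" "$text")   # regexes, whole words only
fixed_count=$(grep -c -Ff "$words" "$text")  # literal strings, substrings
both_count=$(grep -c -wFf "$words" "$text")  # literal strings, whole words

echo "plain=$plain_count word=$word_count fixed=$fixed_count both=$both_count"
rm -rf "$tmpdir"
```

Without any option all four lines match, since "available" is found inside "unavailable" and the dot in a.b matches any character. -w drops the "unavailable" line, -F drops the "axb" line, and the combination keeps only the two exact whole-word hits.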

bash: grep in loop does not grep

I have (probably an obvious/stupid) problem:
I want to loop over a list of paths, cut them and use the strings to grep in log files.
While every step works fine on its own and 'processed manually' results in hits - grep does not find anything when in the loop?
for FILE in `awk -F "/" '{print $13}' /tmp/files_not_visible.uniq`; do
echo -e "\n\n$FILE\n";
grep "$FILE" /var/log/PATH/FILENAME-2015.12.*;
done
I also tried to do a while loop as reverse exercise, but fails with the same non-result
while read FILE; do
echo $FILE;
echo $FILE | awk -F "/" '{print $13}' | grep -f - /var/log/PATH/FILENAME-2015.12.* ;
done < /tmp/files_not_visible.uniq
So, I guess there is some systematic issue, how I handle the search string with grep?
Found it: the list of files contained an invisible character at the end of each line! Probably the user who sent me the list created it on some other OS! And, of course, I copied only the visible characters when testing by hand!
Fixed the loop by cutting the last character of each line with
sed -e 's/.$//'
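If the invisible character was a carriage return from Windows line endings (an assumption, the post does not say which character it was), it is safer to remove exactly that character than to cut whatever comes last on the line. A minimal sketch with a made-up list entry and log line:

```bash
#!/bin/bash
# Hypothetical list entry with a Windows line ending (CR before the newline).
entry=$'server42.log\r'
log_line='2015-12-01 opened server42.log for reading'

# The raw entry finds nothing, because the log line contains no CR.
raw_hits=$(printf '%s\n' "$log_line" | grep -c -F -- "$entry" || true)
echo "hits with CR: $raw_hits"

# Strip one trailing CR, and only if it is actually there.
clean=${entry%$'\r'}
clean_hits=$(printf '%s\n' "$log_line" | grep -c -F -- "$clean" || true)
echo "hits without CR: $clean_hits"
```

For a whole file, tr -d '\r' applied to the list does the same job and leaves lines without a carriage return untouched, unlike a rule that always removes the last character.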

Extract strings in a text file using grep

I have file.txt with names one per line as shown below:
ABCB8
ABCC12
ABCC3
ABCC4
AHR
ALDH4A1
ALDH5A1
....
I want to grep each of these from an input.txt file.
Manually I do this one at a time as
grep "ABCB8" input.txt > output.txt
Could someone help me automatically grep all the strings in file.txt from input.txt and write the results to output.txt?
You can use the -f flag as described in Bash, Linux, Need to remove lines from one file based on matching content from another file
grep -o -f file.txt input.txt > output.txt
Flag
-f FILE, --file=FILE:
Obtain patterns from FILE, one per line. The empty file
contains zero patterns, and therefore matches nothing. (-f is
specified by POSIX.)
-o, --only-matching:
Print only the matched (non-empty) parts of a matching line, with
each such part on a separate output line.
for line in `cat text.txt`; do grep $line input.txt >> output.txt; done
Contents of text.txt:
ABCB8
ABCC12
ABCC3
ABCC4
AHR
ALDH4A1
ALDH5A1
Edit:
A safer solution with while read:
cat text.txt | while read line; do grep "$line" input.txt >> output.txt; done
Edit 2:
Sample text.txt:
ABCB8
ABCB8XY
ABCC12
Sample input.txt:
You were hired to do a job; we expect you to do it.
You were hired because ABCB8 you kick ass;
we expect you to kick ass.
ABCB8XY You were hired because you can commit to a rational deadline and meet it;
ABCC12 we'll expect you to do that too.
You're not someone who needs a middle manager tracking your mouse clicks
If you don't care about the order of lines, the quick workaround would be to pipe the solution through a sort | uniq:
cat text.txt | while read line; do grep "$line" input.txt >> output.txt; done; cat output.txt | sort | uniq > output2.txt
The result is then in output2.txt.
Edit 3:
cat text.txt | while read line; do grep "\<${line}\>" input.txt >> output.txt; done
Is that fine?
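For comparison, the loop and the sort | uniq step can probably be replaced by a single grep call that reads all names at once and matches them as whole, literal words. This is a sketch on the sample data from Edit 2, written to a temporary directory. It assumes the names are plain strings rather than regular expressions.

```bash
#!/bin/bash
# Recreate the sample files from Edit 2 in a throwaway directory.
tmpdir=$(mktemp -d)
printf '%s\n' ABCB8 ABCB8XY ABCC12 > "$tmpdir/text.txt"
cat > "$tmpdir/input.txt" <<'EOF'
You were hired to do a job; we expect you to do it.
You were hired because ABCB8 you kick ass;
we expect you to kick ass.
ABCB8XY You were hired because you can commit to a rational deadline and meet it;
ABCC12 we'll expect you to do that too.
You're not someone who needs a middle manager tracking your mouse clicks
EOF

# -w: whole words only, -F: literal strings, -f: read patterns from a file.
grep -wFf "$tmpdir/text.txt" "$tmpdir/input.txt" > "$tmpdir/output.txt"
match_count=$(grep -c '' "$tmpdir/output.txt")
cat "$tmpdir/output.txt"

rm -rf "$tmpdir"
```

Each matching line is printed once and in the order of input.txt, because grep scans the input a single time instead of once per name.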
