Extract strings in a text file using grep

I have file.txt with names one per line as shown below:
ABCB8
ABCC12
ABCC3
ABCC4
AHR
ALDH4A1
ALDH5A1
....
I want to grep each of these from an input.txt file.
Manually, I do this one at a time:
grep "ABCB8" input.txt > output.txt
Could someone help me automatically grep all the strings in file.txt from input.txt and write the results to output.txt?

You can use the -f flag, as described in "Bash, Linux, Need to remove lines from one file based on matching content from another file":
grep -o -f file.txt input.txt > output.txt
Flags:
-f FILE, --file=FILE:
Obtain patterns from FILE, one per line. The empty file
contains zero patterns, and therefore matches nothing. (-f is
specified by POSIX.)
-o, --only-matching:
Print only the matched (non-empty) parts of a matching line, with
each such part on a separate output line.
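To see the difference between the two, here is a small sketch; patterns.txt and data.txt are made-up file names used only for illustration:
printf 'ABCB8\nABCC12\n' > patterns.txt
printf 'gene ABCB8 is listed\nno match on this line\nABCC12 appears twice: ABCC12\n' > data.txt
grep -f patterns.txt data.txt       # prints the two whole matching lines
grep -o -f patterns.txt data.txt    # prints only ABCB8, ABCC12, ABCC12, one match per line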

for line in `cat text.txt`; do grep $line input.txt >> output.txt; done
Contents of text.txt:
ABCB8
ABCC12
ABCC3
ABCC4
AHR
ALDH4A1
ALDH5A1
Edit:
A safer solution with while read:
cat text.txt | while read line; do grep "$line" input.txt >> output.txt; done
Edit 2:
Sample text.txt:
ABCB8
ABCB8XY
ABCC12
Sample input.txt:
You were hired to do a job; we expect you to do it.
You were hired because ABCB8 you kick ass;
we expect you to kick ass.
ABCB8XY You were hired because you can commit to a rational deadline and meet it;
ABCC12 we'll expect you to do that too.
You're not someone who needs a middle manager tracking your mouse clicks
If you don't care about the order of lines, a quick workaround is to pipe the solution through sort | uniq:
cat text.txt | while read line; do grep "$line" input.txt >> output.txt; done; cat output.txt | sort | uniq > output2.txt
The result is then in output2.txt.
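As an aside, sort -u produces the same unique, sorted output as sort | uniq, so the cleanup step can be shortened to:
sort -u output.txt > output2.txt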
Edit 3:
cat text.txt | while read line; do grep "\<${line}\>" input.txt >> output.txt; done
Is that fine?
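With the sample files from Edit 2, the word-boundary pattern keeps ABCB8 from matching the ABCB8XY line; a rough sketch of the expected behaviour (GNU grep's \< and \> mark word boundaries):
grep "\<ABCB8\>" input.txt
# matches: You were hired because ABCB8 you kick ass;
# but not: ABCB8XY You were hired because you can commit to a rational deadline and meet it;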

Related

Linux merging files

I'm doing a Linux online course but I'm stuck on a question; you can find it below.
You will get three files called a.bf, b.bf and c.bf. Merge the contents of these three files and write it to a new file called abc.bf. Respect the order: abc.bf must contain the contents of a.bf first, followed by those of b.bf, followed by those of c.bf.
Example
Suppose the given files have the following contents:
a.bf contains +++.
b.bf contains [][][][].
c.bf contains <><><>.
The file abc.bf should then have
+++[][][][]<><><>
as its content.
I know how to merge the 3 files, but when I use cat my output is:
+++
[][][][]
<><><>
When I use paste, my output is "+++ 'a lot of spaces' [][][][] 'a lot of spaces' <><><>".
The output I need is +++[][][][]<><><>; I don't want the spaces between the contents. Can someone help me?
What you want to do is delete the newline characters.
With tr:
cat {a,b,c}.bf | tr --delete '\n' > abc.bf
With echo & sed:
echo $(cat {a,b,c}.bf) | sed -E 's/ //g' > abc.bf
With xargs & sed:
<{a,b,c}.bf xargs | sed -E 's/ //g' > abc.bf
Note that sed is only used to remove the spaces.
With cat & sed:
cat {a,b,c}.bf | sed -z 's/\n//g' > abc.bf
echo -n "$(cat a.bf)$(cat b.bf)$(cat c.bf)" > abc.bf
echo -n does not output a trailing newline, and the command substitutions already strip each file's trailing newline.
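A quick way to check any of these variants against the example from the question, recreating the three files just for the test:
printf '+++\n'      > a.bf
printf '[][][][]\n' > b.bf
printf '<><><>\n'   > c.bf
cat {a,b,c}.bf | tr --delete '\n' > abc.bf
cat abc.bf    # prints +++[][][][]<><><>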

AWK doesn't use the first value of a column

First of all, thank you for your help. I have a problem with awk and the while read loop. I have a file with two columns, each containing 8 values. My script is supposed to select the second column, download 8 different files, and decompress them. The problem is that my script doesn't download the first value of the column.
This is my script
#!/bin/bash
cat $1 | while read line
do
echo "Downloading fasta files from NCBI..."
awk '{print $2}' | wget -i- 2>> log
gzip -d *.gz
done
This is the file I am using
Salmonella_enterica_subsp_enterica_Typhi https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/717/755/GCF_003717755.1_ASM371775v1/GCF_003717755.1_ASM371775v1_translated_cds.faa.gz
Salmonella_enterica_subsp_enterica_Paratyphi_A https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/818/115/GCF_000818115.1_ASM81811v1/GCF_000818115.1_ASM81811v1_translated_cds.faa.gz
Salmonella_enterica_subsp_enterica_Paratyphi_B https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/705/GCF_000018705.1_ASM1870v1/GCF_000018705.1_ASM1870v1_translated_cds.faa.gz
Salmonella_enterica_subsp_enterica_Infantis https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/011/182/555/GCA_011182555.2_ASM1118255v2/GCA_011182555.2_ASM1118255v2_translated_cds.faa.gz
Salmonella_enterica_subsp_enterica_Typhimurium_LT2 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_translated_cds.faa.gz
Salmonella_enterica_subsp_diarizonae https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/324/755/GCF_003324755.1_ASM332475v1/GCF_003324755.1_ASM332475v1_translated_cds.faa.gz
Salmonella_enterica_subsp_arizonae https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/635/675/GCA_900635675.1_31885_G02/GCA_900635675.1_31885_G02_translated_cds.faa.gz
Salmonella_bongori https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/006/113/225/GCF_006113225.1_ASM611322v2/GCF_006113225.1_ASM611322v2_translated_cds.faa.gz
The problem is not the download. Check the output of
#!/bin/bash
cat "$1" | while read line
do
awk '{print $2}'
done
This also prints only 7 of the 8 URLs. When entering the loop, read reads the first line into the variable line. However, you never use that variable, so that line is lost. Then awk reads the remaining 7 lines from stdin in one go, and the loop only runs once.
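The effect is easy to reproduce with a tiny example; demo.txt is a made-up three-line file used only for the demonstration:
printf 'one\ntwo\nthree\n' > demo.txt
while read -r line
do
    awk '{print "awk saw: " $0}'   # awk reads the rest of demo.txt from stdin
done < demo.txt
# prints "awk saw: two" and "awk saw: three"; "one" was consumed by read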
You probably wanted to write
#!/bin/bash
cat "$1" | while read -r line
do
echo "Downloading fasta files from NCBI..."
echo "$line" | awk '{print $2}' | wget -i- 2>> log
gzip -d *.gz
done
But there is an easier and safer way:
awk '{print $2}' "$1" | wget -i- 2>> log
gzip -d *.gz
Since the command cut is made to select a column, why not simply issue:
#!/bin/bash
for url in $(cut -f2 "$1")
do
wget "$url" >> log
done
gzip -d *.gz
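One caveat: cut's default field delimiter is a tab character, so this only works if the name and the URL are tab-separated. If they are separated by a single space instead, the delimiter has to be given explicitly; a hedged variant under that assumption:
#!/bin/bash
for url in $(cut -d' ' -f2 "$1")
do
    wget "$url" 2>> log
done
gzip -d *.gz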

Reading words from an input file and grepping the lines containing the words from another file

I have a file containing a list of 4000 words (A.txt). Now I want to grep the lines from another file (sentence_per_line.txt) that contain any of those 4000 words from A.txt.
The shell script I wrote for the above problem is
#!/bin/bash
file="A.txt"
while IFS= read -r line
do
# display $line or do something with $line
printf '%s\n' "$line"
grep $line sentence_per_line.txt >> output.txt
# tried printing the grep command to check its working or not
result=$(grep "$line" sentence_per_line.txt >> output.txt)
echo "$result"
done <"$file"
And A.txt looks like this
applicable
available
White
Black
..
The code is neither working nor showing any error.
Grep has this built in:
grep -f A.txt sentence_per_line.txt > output.txt
Remarks on your code:
Looping over a file to execute grep/sed/awk on each line is typically an antipattern, see this Q&A.
If your $line parameter contains more than one word, you have to quote it (doesn't hurt anyway), or grep tries to look for the first word in a file named after the second word:
grep "$line" sentence_per_line.txt >> output.txt
If you write output in a loop, don't redirect within the loop, do it outside:
while read -r line; do
grep "$line" sentence_per_line.txt
done < "$file" > output.txt
but remember, it's usually not a good idea in the first place.
If you'd like to write to a file and at the same time see what you're writing, you can use tee:
grep "$line" sentence_per_line.txt | tee output.txt
writes to output.txt and stdout.
If A.txt contains words which you want to match only if the complete word matches, i.e., pattern should not match longerpattern, you can use grep -wf – the -w matches only complete words.
If the words in A.txt aren't regular expressions, but fixed strings, you can use grep -Ff – the -F option looks for fixed strings and is faster. These two can be combined: grep -wFf
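Putting those pieces together for this question – word matches, fixed strings, patterns read from A.txt – gives:
grep -wFf A.txt sentence_per_line.txt > output.txt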

Concatenation of huge number of selective files from a directory in Shell

I have more than 50000 files in a directory such as file1.txt, file2.txt, ....., file50000.txt. I would like to concatenate some files whose file numbers are listed in the following text file (need.txt).
need.txt
1
4
35
45
71
.
.
.
I tried the following. Though it works, I am looking for a simpler and shorter way.
n1=1
n2=$(wc -l < need.txt)
while [ $n1 -le $n2 ]
do
f1=$(awk 'NR=="$n1" {print $1}' need.txt)
cat file$f1.txt >> out.txt
(( n1++ ))
done
This might also work for you:
sed 's/.*/file&.txt/' < need.txt | xargs cat > out.txt
Something like this should work for you:
sed -e 's/.*/file&.txt/' need.txt | xargs cat > out.txt
It uses sed to translate each line into the appropriate file name and then hands the filenames to xargs to hand them to cat.
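With the sample need.txt shown above, the sed step on its own produces exactly the file names that xargs then hands to cat:
sed 's/.*/file&.txt/' need.txt
# file1.txt
# file4.txt
# file35.txt
# file45.txt
# file71.txt
# ...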
Using awk it could be done this way:
awk 'NR==FNR{ARGV[ARGC]="file"$1".txt"; ARGC++; next} {print}' need.txt > out.txt
Which adds each file to the ARGV array of files to process and then prints every line it sees.
It is possible to do it without any sed or awk command, directly using bash built-ins and cat (of course).
for i in $(cat need.txt); do cat file${i}.txt >> out.txt; done
And, as you wanted, it is quite simple.

How to find words from one file in another file?

In one text file, I have 150 words. I have another text file, which has about 100,000 lines.
How can I check for each of the words belonging to the first file whether it is in the second or not?
I thought about using grep, but I could not find out how to use it to read each of the words in the original text.
Is there any way to do this using awk? Or another solution?
I tried with this shell script, but it matches almost every line:
#!/usr/bin/env sh
cat words.txt | while read line; do
if grep -F "$FILENAME" text.txt
then
echo "Se encontró $line"
fi
done
Another way I found is:
fgrep -w -o -f "words.txt" "text.txt"
You can use grep -f:
grep -Ff "first-file" "second-file"
OR else to match full words:
grep -w -Ff "first-file" "second-file"
UPDATE: As per the comments:
awk 'FNR==NR{a[$1]; next} ($1 in a){delete a[$1]; print $1}' file1 file2
Use grep like this:
grep -f firstfile secondfile
SECOND OPTION
Thanks to Ed Morton for pointing out that the words in the file "reserved" are treated as patterns. If that is an issue (it may or may not be), the OP can use something like this, which doesn't use patterns:
File "reserved"
cat
dog
fox
and file "text"
The cat jumped over the lazy
fox but didn't land on the
moon at all.
However it did land on the dog!!!
The awk script is like this:
awk 'BEGIN{i=0}FNR==NR{res[i++]=$1;next}{for(j=0;j<i;j++)if(index($0,res[j]))print $0}' reserved text
with output:
The cat jumped over the lazy
fox but didn't land on the
However it did land on the dog!!!
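One thing to keep in mind with this script: index() is a plain substring test, so a reserved word like cat would also match a line containing catalogue. If whole-word matching matters, the -w option shown above is the safer choice. A tiny illustration of the caveat:
echo "The catalogue arrived" | awk 'BEGIN{res="cat"} index($0,res) {print "matched:", $0}'
# prints "matched: The catalogue arrived" even though the word cat does not appear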
THIRD OPTION
Alternatively, it can be done quite simply, but more slowly in bash:
while read -r r; do grep "$r" secondfile; done < firstfile
