How to find words from one file in another file? - linux

In one text file, I have 150 words. I have another text file, which has about 100,000 lines.
How can I check for each of the words belonging to the first file whether it is in the second or not?
I thought about using grep, but I could not find out how to use it to read each of the words in the original text.
Is there any way to do this using awk? Or another solution?
I tried with this shell script, but it matches almost every line:
#!/usr/bin/env sh
cat words.txt | while read line; do
if grep -F "$FILENAME" text.txt
then
echo "Se encontró $line"
fi
done
Another way I found is:
fgrep -w -o -f "words.txt" "text.txt"

You can use grep -f:
grep -Ff "first-file" "second-file"
OR else to match full words:
grep -w -Ff "first-file" "second-file"
UPDATE: As per the comments:
awk 'FNR==NR{a[$1]; next} ($1 in a){delete a[$1]; print $1}' file1 file2

Use grep like this:
grep -f firstfile secondfile
SECOND OPTION
Thank you to Ed Morton for pointing out that the words in the file "reserved" are treated as patterns. If that is an issue - it may or may not be - the OP can maybe use something like this which doesn't use patterns:
File "reserved"
cat
dog
fox
and file "text"
The cat jumped over the lazy
fox but didn't land on the
moon at all.
However it did land on the dog!!!
Awk script is like this:
awk 'BEGIN{i=0}FNR==NR{res[i++]=$1;next}{for(j=0;j<i;j++)if(index($0,res[j]))print $0}' reserved text
with output:
The cat jumped over the lazy
fox but didn't land on the
However it did land on the dog!!!
THIRD OPTION
Alternatively, it can be done quite simply, but more slowly in bash:
while read r; do grep $r secondfile; done < firstfile

Related

how to show the third line of multiple files

I have a simple question. I am trying to check the 3rd line of multiple files in a folder, so I used this:
head -n 3 MiseqData/result2012/12* | tail -n 1
but this doesn't work obviously, because it only shows the third line of the last file. But I actually want to have last line of every file in the result2012 folder.
Does anyone know how to do that?
Also sorry just another questions, is it also possible to show which file the particular third line belongs to?
like before the third line is shown, is it also possible to show the filename of each of the third line extracted from?
because if I used head or tail command, the filename is also shown.
thank you
With Awk, the variable FNR is the number of the "record" (line, by default) in the current file, so you can simply compare it to 3 to print the third line of each input file:
awk 'FNR == 3' MiseqData/result2012/12*
A more optimized version for long files would skip to the next file on match, since you know there's only that one line where the condition is true:
awk 'FNR == 3 { print; nextfile }' MiseqData/result2012/12*
However, not all Awks support nextfile (but it is also not exclusive to GNU Awk).
A more portable variant using your head and tail solution would be a loop in the shell:
for f in MiseqData/result2012/12*; do head -n 3 "$f" | tail -n 1; done
Or with sed (without GNU extensions, i.e., the -s argument):
for f in MiseqData/result2012/12*; do sed '3q;d' "$f"; done
edit: As for the additional question of how to print the name of each file, you need to explicitly print it for each file yourself, e.g.,
awk 'FNR == 3 { print FILENAME ": " $0; nextfile }' MiseqData/result2012/12*
for f in MiseqData/result2012/12*; do
echo -n `basename "$f"`': '
head -n 3 "$f" | tail -n 1
done
for f in MiseqData/result2012/12*; do
echo -n "$f: "
sed '3q;d' "$f"
done
With GNU sed:
sed -s -n '3p' MiseqData/result2012/12*
or shorter
sed -s '3!d' MiseqData/result2012/12*
From man sed:
-s: consider files as separate rather than as a single continuous long stream.
You can do this:
awk 'FNR==3' MiseqData/result2012/12*
If you like the file name as well:
awk 'FNR==3 {print FILENAME,$0}' MiseqData/result2012/12*
This might work for you (GNU sed & parallel):
parallel -k sed -n '3p\;3q' {} ::: file1 file2 file3
Parallel applies the sed command to each file and returns the results in order.
N.B. All files will only be read upto the 3rd line.
Also,you may be tempted (as I was) to use:
sed -ns '3p;3q' file1 file2 file3
but this will only return the first file.
Hi bro I am answering this question as we know FNR is used to check no of lines so we can run this command to get 3rd line of every file.
awk 'FNR==3' MiseqData/result2012/12*

Search A and replace B in A|B in shell scripting/SED/AWK

I have a text file with layout as:
tableName1|counterVariable1
tableName2|counterVariable2
I want to replace the counterVariable1 with some other variable say counterVariableNew.
How can I accomplish this?
I have tried various SED/AWK approaches, closest one is mentioned below:
cat $fileName | grep -w $tableName | sed -i 's/$tableName\|counterVariable/$tableName\|counterVariableNew'
But all the 3 commands are not merging properly, please help!
Your script is an example of [ useless use of cat ]. But the key point here is to escape the pipe delimiter which has a special meaning(it stands for OR) when used with awk FS. So below script should do
# cat 42000479
tableName1|counterVariable1
tableName2|counterVariable2
tableName3|counterVariable2
# awk -F\| '$1=="tableName2"{$2="counterVariableNew"}1' 42000479
tableName1|counterVariable1
tableName2 counterVariableNew
tableName3|counterVariable2
An alternate way of doing the same stuff is below
# awk -v FS='|' '$1=="tableName2"{$2="counterVariableNew"}1' 42000479
Stuff inside the single quote will not be expanded.
awk -F'|' -v OFS='|' '/tableName1/ {$2="counterVariableNew"}1' file
tableName1|counterVariableNew
tableName2|counterVariable2
This will search for A (tableName1) and replace B (counterVariable1) to counterVariableNew.
Or by using sed :
sed -r '/tableName1/ s/(^.*\|)(.*)/\1counterVariableNew/g' file
tableName1|counterVariableNew
tableName2|counterVariable2
For word bounded search: Enclose the pattern inside \< and \> .
sed -r '/\<tableName1\>/ s/(^.*\|)(.*)/\1counterVariableNew/g' file
awk -F'|' -v OFS='|' '/\<tableName1\>/ {$2="counterVariableNew"}1' file

grep a large list against a large file

I am currently trying to grep a large list of ids (~5000) against an even larger csv file (3.000.000 lines).
I want all the csv lines, that contain an id from the id file.
My naive approach was:
cat the_ids.txt | while read line
do
cat huge.csv | grep $line >> output_file
done
But this takes forever!
Are there more efficient approaches to this problem?
Try
grep -f the_ids.txt huge.csv
Additionally, since your patterns seem to be fixed strings, supplying the -F option might speed up grep.
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by
newlines, any of which is to be matched. (-F is specified by
POSIX.)
Use grep -f for this:
grep -f the_ids.txt huge.csv > output_file
From man grep:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains zero
patterns, and therefore matches nothing. (-f is specified by POSIX.)
If you provide some sample input maybe we can even improve the grep condition a little more.
Test
$ cat ids
11
23
55
$ cat huge.csv
hello this is 11 but
nothing else here
and here 23
bye
$ grep -f ids huge.csv
hello this is 11 but
and here 23
grep -f filter.txt data.txt gets unruly when filter.txt is larger than a couple of thousands of lines and hence isn't the best choice for such a situation. Even while using grep -f, we need to keep a few things in mind:
use -x option if there is a need to match the entire line in the second file
use -F if the first file has strings, not patterns
use -w to prevent partial matches while not using the -x option
This post has a great discussion on this topic (grep -f on large files):
Fastest way to find lines of a file from another larger file in Bash
And this post talks about grep -vf:
grep -vf too slow with large files
In summary, the best way to handle grep -f on large files is:
Matching entire line:
awk 'FNR==NR {hash[$0]; next} $0 in hash' filter.txt data.txt > matching.txt
Matching a particular field in the second file (using ',' delimiter and field 2 in this example):
awk -F, 'FNR==NR {hash[$1]; next} $2 in hash' filter.txt data.txt > matching.txt
and for grep -vf:
Matching entire line:
awk 'FNR==NR {hash[$0]; next} !($0 in hash)' filter.txt data.txt > not_matching.txt
Matching a particular field in the second file (using ',' delimiter and field 2 in this example):
awk -F, 'FNR==NR {hash[$0]; next} !($2 in hash)' filter.txt data.txt > not_matching.txt
You may get a significant search speedup with ugrep to match the strings in the_ids.txt in your large huge.csv file:
ugrep -F -f the_ids.txt huge.csv
This works with GNU grep too, but I expect ugrep to run several times faster.

Extract strings in a text file using grep

I have file.txt with names one per line as shown below:
ABCB8
ABCC12
ABCC3
ABCC4
AHR
ALDH4A1
ALDH5A1
....
I want to grep each of these from an input.txt file.
Manually i do this one at a time as
grep "ABCB8" input.txt > output.txt
Could someone help to automatically grep all the strings in file.txt from input.txt and write it to output.txt.
You can use the -f flag as described in Bash, Linux, Need to remove lines from one file based on matching content from another file
grep -o -f file.txt input.txt > output.txt
Flag
-f FILE, --file=FILE:
Obtain patterns from FILE, one per line. The empty file
contains zero patterns, and therefore matches nothing. (-f is
specified by POSIX.)
-o, --only-matching:
Print only the matched (non-empty) parts of a matching line, with
each such part on a separate output line.
for line in `cat text.txt`; do grep $line input.txt >> output.txt; done
Contents of text.txt:
ABCB8
ABCC12
ABCC3
ABCC4
AHR
ALDH4A1
ALDH5A1
Edit:
A safer solution with while read:
cat text.txt | while read line; do grep "$line" input.txt >> output.txt; done
Edit 2:
Sample text.txt:
ABCB8
ABCB8XY
ABCC12
Sample input.txt:
You were hired to do a job; we expect you to do it.
You were hired because ABCB8 you kick ass;
we expect you to kick ass.
ABCB8XY You were hired because you can commit to a rational deadline and meet it;
ABCC12 we'll expect you to do that too.
You're not someone who needs a middle manager tracking your mouse clicks
If You don't care about the order of lines, the quick workaround would be to pipe the solution through a sort | uniq:
cat text.txt | while read line; do grep "$line" input.txt >> output.txt; done; cat output.txt | sort | uniq > output2.txt
The result is then in output.txt.
Edit 3:
cat text.txt | while read line; do grep "\<${line}\>" input.txt >> output.txt; done
Is that fine?

Bash Sorting STDIN

I want to write a bash script that sorts the input by rules in different files. The first rule is to write all chars or strings in file1. The second rule is to write all numbers in file2. The third rule is to write all alphanumerical strings in file3. All specials chars must be ignored. Because I am not familiar with bash I don t know how to realize this.
Could someone help me?
Thanks,
Haniball
Thanks for the answers,
I wrote this script,
#!/bin/bash
inp=0 echo "Which filename for strings?"
read strg
touch $strg
echo "Which filename for nums?"
read nums
touch $nums
echo "Which filename for alphanumerics?"
read alphanums
touch $alphanums
while [ "$inp" != "quit" ]
do
echo "Input: "
read inp
echo $inp | grep -o '\<[a-zA-Z]+>' > $strg
echo $inp | grep -o '\<[0-9]>' > $nums
echo $inp | grep -o -E '\<[0-9]{2,}>' > $nums
done
After I ran it, it only writes string in the stringfile.
Greetings, Haniball
Sure can help. See here:
How To Ask Questions The Smart Way
Help Vampires: A Spotter’s Guide
cool site about the bash is here: http://wiki.bash-hackers.org/doku.php
for sorting try man sort
for pattern matching try man grep
other useful tools: man sed man awk man strings man tee
And it is always correct tag your homework as "homework" ;)
You can try something like:
<input_file strings -1 -a | tee chars_and_strings.txt |\
grep "^[A-Za-z0-9][A-Za-z0-9]*$" | tee alphanum.txt |\
grep "^[0-9][0-9]*$" > numonly.txt
The above is only for USA - no international (read unicode) chars, where things coming a little bit more complicated.
grep is sufficient (your question is a bit vague. If I got something wrong, let me know...)
Using the following input file:
this is a string containing words,
single digits as in 1 and 2 as well
as whole numbers 42 1066
all chars or strings
$ grep -o '\<[a-zA-Z]\+\>' sorting_input
this
is
a
string
containing
words
single
digits
as
in
and
as
well
all single digit numbers
$ grep -o '\<[0-9]\>' sorting_input
1
2
all multiple digit numbers
$ grep -o -E '\<[0-9]{2,}\>' sorting_input
42
1066
Redirect the output to a file, i.e. grep ... > file1
Bash really isn't the best language for this kind of task. While possible, ild highly recommend the use of perl, python, or tcl for this.
That said, you can write all of stdin from input to a temporary file with shell redirection. Then, use a command like grep to output matches to another file. It might look something like this.
#!/bin/bash
cat > temp
grep pattern1 > file1
grep pattern2 > file2
grep pattern3 > file3
rm -f temp
Then run it like this:
cat file_to_process | ./script.sh
I'll leave the specifics of the pattern matching to you.

Resources