Identify which keywords can be found in which files - linux

The Problem
Suppose I have a text file containing a list of words, one word per line. Let's take the following as an example and call it my_dictionary_file.txt:
my_dictionary_file.txt
Bill
Henry
Martha
Sally
Alex
Paul
In my current directory, I have several files which contain the above names. The problem is that I do not know which files contain which names. This is what I'd like to find out: a sort of matching game. In other words, I want to match each name in my_dictionary_file.txt to the file in which it appears.
As an example, let's say that the files in my working directory look like the following:
file1.txt
There is a man called Bill. He is tall.
file2.txt
There is a girl called Martha. She is small.
file3.txt
Henry and Sally are a couple.
file4.txt
Alex and Paul are two bachelors.
What I've tried
First. Using the fgrep command with the -o and -f options,
$ fgrep -of my_dictionary_file.txt file1.txt
Bill
I can identify that the name Bill can be found in file1.txt.
Second. Using the fgrep command with the -r -l and -f options,
$ fgrep -rlf my_dictionary_file.txt .
./my_dictionary_file.txt
./file1.txt
./file4.txt
./file3.txt
./file2.txt
I can search through all of the files in the current directory to find out which files contain at least one of the names in my_dictionary_file.txt.
The sought-after solution
The solution that I am looking for would be along the lines of combining both of the two attempts above. To be more explicit, I'd like to know that:
Bill belongs to file1.txt
Martha belongs to file2.txt
Henry and Sally belong to file3.txt
Alex and Paul belong to file4.txt
Any suggestions or pointers towards commands other than fgrep would be greatly appreciated!
Note
The actual problem that I am trying to solve is a scaled up version of this simplified example. I'm hoping to base my answer on responses to this question, so bear in mind that in reality the dictionary file contains hundreds of names and that there are a hundred or more files in the current directory.
Typing
$ fgrep -of my_dictionary_file.txt file1.txt
Bill
$ fgrep -of my_dictionary_file.txt file2.txt
Martha
$ fgrep -of my_dictionary_file.txt file3.txt
Henry
Sally
$ fgrep -of my_dictionary_file.txt file4.txt
Alex
Paul
does, of course, get me the results, but I'm looking for an efficient method to collect the results for me - perhaps, pipe the results to a single .txt file.

If you fgrep all the files at once with the -o option, fgrep should print both the file name and the text that matched:
$ fgrep -of dict.txt file*.txt
file1.txt:Bill
file2.txt:Martha
file3.txt:Henry
file3.txt:Sally
file4.txt:Alex
file4.txt:Paul
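Since the asker also wanted the results collected in one place, here is a minimal sketch (with made-up sample files mirroring the question) that redirects all matches to a single file and then groups the names by file with awk. grep -F is the modern spelling of fgrep; the file names are illustrative.

```shell
# Made-up sample files mirroring the question:
printf 'Bill\nMartha\n' > my_dictionary_file.txt
printf 'There is a man called Bill. He is tall.\n' > file1.txt
printf 'There is a girl called Martha. She is small.\n' > file2.txt

# -F = fixed strings (fgrep), -o = print only the matched text; with several
# input files, grep prefixes each match with its file name.
grep -oFf my_dictionary_file.txt file*.txt > matches.txt

# Collapse "file:name" pairs into one "file: name name ..." line per file.
awk -F: '{names[$1] = names[$1] " " $2}
         END {for (f in names) print f ":" names[f]}' matches.txt | sort > grouped.txt
```

matches.txt then holds one file:name pair per line, and grouped.txt holds one line per file listing every name found in it.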

Related

How to make a strict match with awk

I am querying one file against another. The files look like the following:
File1:
Angela S Darvill| text text text text
Helen Stanley| text text text text
Carol Haigh S|text text text text .....
File2:
Carol Haigh
Helen Stanley
Angela Darvill
This command:
awk 'NR==FNR{_[$1];next} ($1 in _)' File2.txt File1.txt
returns lines that overlap, but the match is not strict. With a strict match, only Helen Stanley should be returned.
How do you restrict awk to a strict match?
With your shown samples, please try the following. You were on the right track; you need to do two things. First, use the whole line ($0) as the index into array a while reading File2.txt. Second, set the field separator to | before awk starts reading File1.txt:
awk -F'|' 'NR==FNR{a[$0];next} $1 in a' File2.txt File1.txt
The command above doesn't work for me (I'm on a Mac; I don't know whether that matters), but
awk 'NR==FNR{_[$0];next} ($1 in _)' File2.txt FS="|" File1.txt
worked well.
You can also use grep to match from File2.txt as a list of regexes to make an exact match.
You can use sed to prepare the matches. Here is an example:
sed -E 's/[ \t]*$//; s/^(.*)$/^\1|/' File2.txt
^Carol Haigh|
^Helen Stanley|
^Angela Darvill|
...
Then use process substitution to pass that sed output as the -f argument to grep:
grep -f <(sed -E 's/[ \t]*$//; s/^(.*)$/^\1|/' File2.txt) File1.txt
Helen Stanley| text text text text
Since your example File2.txt has trailing spaces, the sed has s/[ \t]*$//; as the first substitution. If your actual file does not have those trailing spaces, you can do:
grep -f <(sed -E 's/.*/^&|/' File2.txt) File1.txt
Ed Morton brings up a good point that grep will still interpret RE meta-characters in File2.txt. You can use the flag -F so only literal strings are used:
grep -F -f <(sed -E 's/.*/&|/' File2.txt) File1.txt

How to find, using awk, the name of the file whose content contains a string such as "abcd", among multiple files in a directory

the "bookrecords" directory has multiple files
bookrecords
1.txt
2.txt
3.txt .....
file 2.txt has the content
2.txt
author: abcd
title: efg
year: 1980
How can I get the file name 2.txt by using author: abcd as the keyword, with an awk command?
I have tried the grep command, but I want to use awk:
SearchRecord()
{
    name="abcd"
    total=$(grep -cwi "$name" "$bookrecords")
    record=$(grep -wi "$name" "$bookrecords")
    menu
}
with awk
$ awk -v search_string="$name" '$0~search_string{print FILENAME; exit}' bookrecords/*
however, I think grep is better if you're not structurally searching
$ grep -lF "$name" bookrecords/*
If you have a deep directory hierarchy below bookrecords, you could do a
grep -Elsrx '[[:space:]]*author:[[:space:]]*abcd[[:space:]]*' bookrecords
If you only want to look at *.txt files inside bookrecords, do a
grep -Elx '[[:space:]]*author:[[:space:]]*abcd[[:space:]]*' bookrecords/*.txt
The -l prints only the file names in the output, and -x makes the regular expression match the whole line. Hence, a line containing myauthor:abcde would not be selected. Adapt the regexp to your taste, and be careful when replacing abcd with a real author name (it should not contain unescaped regexp characters, such as a period).
If you want to start the line with at least one white space, and have at least one space after the colon:
grep -Elsrx '[[:space:]]+author:[[:space:]]+abcd[[:space:]]*' bookrecords
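A quick way to see the whole-line behaviour of -x, using made-up sample files for illustration:

```shell
# Two sample record files: one with an exact "author:" line, one with a
# superficially similar line that should NOT match.
mkdir -p bookrecords
printf 'author: abcd\n'   > bookrecords/2.txt
printf 'myauthor:abcde\n' > bookrecords/5.txt

# -x anchors the pattern to the whole line, so only 2.txt is reported;
# the partial match in 5.txt is rejected.
grep -Elx '[[:space:]]*author:[[:space:]]*abcd[[:space:]]*' bookrecords/*.txt
```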

Append a file in the middle of another file in bash

I need to append a file in a specific location of another file.
I got the line number so, my file is:
file1.txt:
I
am
Cookie
While the second one is
file2.txt:
a
black
dog
named
So, after the solution, file1.txt should be like
I
am
a
black
dog
named
Cookie
The solution should be compatible with the presence of characters like " and / in both files.
Any tool is ok as long as it's native (I mean, no new software installation).
Another option, apart from what RavinderSingh13 suggested, is to use sed:
To add the text of file2.txt into file1.txt after a specific line:
sed -i '2 r file2.txt' file1.txt
Output:
I
am
a
black
dog
named
Cookie
Further to add the file after a matched pattern:
sed -i '/^YourPattern/ r file2.txt' file1.txt
Could you please try the following and let me know if it helps you.
awk 'FNR==3{system("cat file2.txt")} 1' file1.txt
Output will be as follows.
I
am
a
black
dog
named
Cookie
Explanation: while reading file1.txt, we check whether the line number (FNR) is 3; if so, we call awk's system() function, which lets us run shell commands, to cat file2.txt at that point. The trailing 1 then prints every line of file1.txt. Together, these splice the contents of file2.txt in between the lines of file1.txt.
How about
head -n 2 file1 && cat file2 && tail -n 1 file1
You can count the number of lines to decide head and tail parameters in file1 using
wc -l file1
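The head/tail idea above can be generalised so the split line $n comes from wc -l or anywhere else. A minimal sketch, with the question's file contents recreated for illustration:

```shell
# Recreate the question's files:
printf 'I\nam\nCookie\n'        > file1.txt
printf 'a\nblack\ndog\nnamed\n' > file2.txt

n=2   # insert file2.txt after line 2 of file1.txt

# head takes the first $n lines, tail -n +K resumes from line K onward;
# the group's combined output is written to merged.txt.
{ head -n "$n" file1.txt; cat file2.txt; tail -n +"$((n + 1))" file1.txt; } > merged.txt
```

merged.txt then reads I, am, a, black, dog, named, Cookie, one word per line.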

How to make grep to stop searching in each file after N lines?

It's best to describe the use by a hypothetical example:
Searching for some useful header info in a big collection of email storage (each email in a separate file). e.g. doing stats of top mail client apps used.
Normally with grep you can specify -m to stop at the first match, but what if an email does not contain X-Mailer, or whatever it is we are looking for in a header? Then grep will scan the whole email. Since most headers are under 50 lines, performance could be increased by telling grep to search only the first 50 lines of any file. I could not find a way to do that.
I don't know if it would be faster but you could do this with awk:
awk '/match me/{print; nextfile} FNR>50{nextfile}' *.mail
will print, for each file, the first line matching match me if it appears within that file's first 50 lines. (With plain exit, awk would stop after the first file; nextfile, available in GNU awk and most modern awks, skips to the next input file instead. If you wanted to print the filename as well, grep style, change print; to print FILENAME ":" $0;)
awk doesn't have any equivalent to grep's -r flag, but if you need to recursively scan directories, you can use find with -exec:
find /base/dir -iname '*.mail' \
-exec awk '/match me/{print FILENAME ":" $0; nextfile} FNR>50{nextfile}' {} +
You could solve this problem by piping head -n50 through grep but that would undoubtedly be slower since you'd have to start two new processes (one head and one grep) for each file. You could do it with just one head and one grep but then you'd lose the ability to stop matching a file as soon as you find the magic line, and it would be awkward to label the lines with the filename.
you can do something like this
head -n 50 <mailfile> | grep <your keyword>
Try this command:
for i in *
do
head -n 50 "$i" | grep -H --label="$i" pattern
done
output:
1.txt: aaaaaaaa pattern aaaaaaaa
2.txt: bbbb pattern bbbbb
ls *.txt | xargs head -n <N> | grep 'your_string'

Comparing two text files with each other

If I had two text files, for example:
file1.txt
apple
orange
pear
banana
file2.txt
banana
pear
How would I remove every line that appears in file2.txt from file1.txt?
So file1.txt would be left with:
apple
orange
grep -v -F -f file2.txt file1.txt
-v selects only the lines of file1.txt that do not match; -f takes the patterns from a file, in this case file2.txt; and -F makes grep interpret each pattern as a fixed string (one per line) rather than a regular expression.
grep ships with OS X and Linux. On Windows you'll have to install it, for example via Cygwin.
combine file1 not file2
On Debian and derivatives, combine can be found in the moreutils package.
If the files are huge (they must also be sorted), comm may be preferable to the more general grep solution proposed by Ivan, since it operates line by line and thus does not need to load the whole of file2.txt into memory (or search it for each line).
comm -23 file1-sorted.txt file2-sorted.txt
The -23 suppresses column 2 (lines only in file2) and column 3 (lines common to both files), so only the lines unique to file1 are printed and no postprocessing is needed.
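If the inputs aren't already sorted, process substitution (available in bash and zsh) can sort them on the fly. A sketch using the question's sample data:

```shell
# Recreate the question's files:
printf 'apple\norange\npear\nbanana\n' > file1.txt
printf 'banana\npear\n' > file2.txt

# Sort both inputs on the fly; -23 keeps only the lines unique to file1.txt.
comm -23 <(sort file1.txt) <(sort file2.txt) > only-in-file1.txt
```

only-in-file1.txt then contains apple and orange (in sorted order).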
