Check if all lines from one file are present somewhere in another file - linux

I used file1 as a source of data for file2 and now I need to make sure that every single line of text from file1 occurs somewhere in file2 (and find out which lines are missing, if any). It's probably important to note that while file1 has conveniently one search term per line, the terms can occur anywhere in the file2 including in the middle of a word. Also would help if the matching was case insensitive - doesn't matter if the text in file2 is even in all caps as long as it's there.
The lines in file1 include spaces and all sorts of other special characters like --.

if grep -Fqvf file2 file1; then
echo $"There are lines in file1 that don’t occur in file2."
fi
Grep options mean:
-F, --fixed-strings PATTERN is a set of newline-separated fixed strings
-f, --file=FILE obtain PATTERN from FILE
-v, --invert-match select non-matching lines
-q, --quiet, --silent suppress all normal output

You can try
awk -f a.awk file1 file2
where a.awk is
BEGIN { IGNORECASE=1 }
NR==FNR {
a[$0]++
next
}
{
for (i in a)
if (index($0,i))
delete a[i]
}
END {
for (i in a)
print i
}

The highest voted answer to this post -- grep -Fqvf file2 file1 -- is not quite correct; there are a number of issues with it, all stemming from a single major issue: namely, that the direction of comparison is flipped. We are using every line in file2 to search file1 to make sure that all lines in file1 are covered. This fits with how grep works and is elegant, but it doesn't actually solve the problem. I discovered this while comparing two package lists -- one the output of pacman -Qqe, the other a list I'd made compiling those packages into different groupings to simplify setting up a new computer. I wanted to make sure that I hadn't missed any packages in my groupings.
The first problem is major -- if file2 contains a single empty line, the output will always be false (ie, it will not identify that there are missing lines). This is because the empty line in file2 will match every line of file1. So with the following files, we do not correctly identify that zsh is missing from file2:
file1     file2
acpi      acpi
...       ...
r         r
...       ...
yaourt    yaourt
zsh
<EOF>     <EOF>
$ grep -Fvf file2 file1
[ no output ]
Ok, so we can just strip empty lines, right?
$ grep -Fv "$(grep -ve '^$' file2)" file1
zsh
Great! But now we get to another problem. Let's say we remove yaourt from file2. We'd expect the output to now be
yaourt
zsh
But here's what we actually get
$ grep -Fv "$(grep -ve '^$' file2)" file1
zsh
Why is that? Well, it's the same reason that an empty line causes problems. In this case, the line r in file2 is matching yaourt in file1. Removing empty lines only fixed the most egregious case of this more general problem.
Apart from the false negatives here, there are also false positives from not handling the case OP called out --
It's probably important to note that while file1 has conveniently one search term per line, the terms can occur anywhere in the file2 including in the middle of a word.
So this would mean that if ohmyzsh is in file2, that would be a match for zsh in file1. But that would not happen, since we are searching file1 for ohmyzsh, and obviously the line zsh does not contain it: zsh is only a substring of ohmyzsh. This last example illustrates why searching file1 with the lines of file2 categorically will not work. But if we search file2 with the lines of file1, we will get all the matches in file2, but not know if we have a match for every line of file1. The number of matches doesn't help, since we could have multiple matches for, say, sh (zsh, bash, fish, ...) but no matches for acpi.
This is all a very long way of saying that this isn't a problem that can be solved with O(1) greps. You'd need to use a loop. With a loop, the problem is trivial.
readarray -t terms < file1 # bash
# zsh: terms=("${(@f)$(< file1)}")
for term in "${terms[@]}"; do # I know `do` "should" be on a separate line; bite me
# -e keeps a term that starts with "-" from being parsed as an option
grep -Fq -e "$term" file2 ||
{ echo "$term does not appear in file2"; break; }
done
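The loop above stops at the first missing term. Since the question also asks which lines are missing and wants case-insensitive matching, here is a minimal sketch of a variant that reports every missing term. The function name `missing_terms` is made up for illustration, and skipping empty lines in file1 is an assumption (an empty fixed string would match every line of file2).

```shell
# missing_terms TERMS_FILE HAYSTACK_FILE   (hypothetical helper name)
# Prints every non-empty line of TERMS_FILE that occurs nowhere in HAYSTACK_FILE.
missing_terms() {
    # "|| [ -n ... ]" also handles a last line without a trailing newline
    while IFS= read -r term || [ -n "$term" ]; do
        # Assumption: skip empty terms, since they would match everything
        [ -n "$term" ] || continue
        # -F fixed string, -i ignore case, -q quiet, -e protects terms starting with "-"
        grep -Fiq -e "$term" "$2" || printf '%s\n' "$term"
    done < "$1"
}

# Usage: missing_terms file1 file2
```

This still runs one grep per term, so it is linear in the number of lines in file1, but it searches in the right direction: each line of file1 is looked up in file2.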

Related

How can I find the number of 8 letter words that do not contain the letter "e", using the grep command?

I want to find the number of 8 letter words that do not contain the letter "e" in a number of text files (*.txt). In the process I ran into two issues: my lack of understanding in quantifiers and how to exclude characters.
I'm quite new to the Unix terminal, but this is what I have tried:
cat *.txt | grep -Eo "\w+" | grep -i ".*[^e].*"
I need to include the cat command because it otherwise includes the names of the text files in the pipe. The second pipe is to have all the words in a list, and it works, but the last pipe was meant to find all the words that do not have the letter "e" in them, but doesn't seem to work. (I thought ".*" for no or any number of any character, followed by a character that is not an "e", and followed by another ".*" for no or any number of any character.)
cat *.txt | grep -Eo "\w+" | grep -wi "[a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z]"
This command works to find the words that contain 8 characters, but it is quite ineffective, because I have to repeat "[a-z]" 8 times. I thought it could also be "[a-z]{8}", but that doesn't seem to work.
cat *.txt | grep -Eo "\w+" | grep -wi "[a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z]" | grep -i ".*[^e].*"
So finally, this would be my best guess, however, the third pipe is ineffective and the last pipe doesn't work.
You may use this grep:
grep -hEiwo '[a-df-z]{8}' *.txt
Here:
[a-df-z]{8}: Matches exactly 8 letters, none of which is e
-h: Don't print filename in output
-i: Ignore case search
-o: Print matches only
-w: Match complete words
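The question asks for the number of such words, while the command above prints the words themselves. Because -o puts each match on its own line, piping into wc -l gives the count. A small sketch, with `count_e_free_words` as a made-up wrapper name:

```shell
# count_e_free_words FILE...   (hypothetical helper name)
# Counts 8-letter words that contain no "e", ignoring case.
count_e_free_words() {
    # -o prints one match per line, so the line count is the word count
    grep -hEiwo '[a-df-z]{8}' "$@" | wc -l
}

# Usage: count_e_free_words *.txt
```

Note that this counts every occurrence, so a word that appears twice is counted twice.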
If you are OK with GNU awk and want to print only the exact words, allowing for multiple matches per line, you could try the following.
awk -v IGNORECASE="1" '{for(i=1;i<=NF;i++){if($i~/^[a-df-z]{8}$/){print $i}}}' *.txt
OR without the use of IGNORECASE one could try:
awk '{for(i=1;i<=NF;i++){if(tolower($i)~/^[a-df-z]{8}$/){print $i}}}' *.txt
NOTE: This assumes that you want only fields consisting of exactly 8 letters. 8-letter words followed by a punctuation mark will be excluded.
Here is a crazy thought with GNU awk:
awk 'BEGIN{FPAT="\\<\\w{8}\\>"}{c+=NF}END{print c}' file
Or if you want to make it work only on a select set of characters:
awk 'BEGIN{FPAT="\\<[a-df-z]{8}\\>"}{c+=NF}END{print c}' file
What this does is, it defines the fields, to be a set of 8 characters (\w as a word-constituent or [a-df-z] as a selected set) which is enclosed by word-boundaries (\< and \>). This is done with FPAT (note the Gory details about escaping).
Sometimes you might also have words that contain diacritics, so the character set has to be expanded. Then this might be the best solution:
awk 'BEGIN{FPAT="\\<\\w{8}\\>"}{for(i=1;i<=NF;++i) if($i !~ /e/) c++}END{print c}' file

Excluding lines from a .csv based on pattern in another .csv

I want to compare values from 2 .csv files on Linux, excluding lines from the first file when its first column (which is always an IP) matches any of the IPs from the second file.
Any way of doing that via the command line (via grep, for example) would be OK by me.
File1.csv is:
10.177.33.157,IP,Element1
10.177.33.158,IP,Element2
10.175.34.129,IP,Element3
10.175.34.130,IP,Element4
10.175.34.131,IP,Element5
File2.csv:
10.177.33.157 < Exists on the first file
10.177.33.158 < Exists on the first file
10.175.34.129 < Exists on the first file
80.10.2.42 < Does not exist on the first file
80.10.3.194 < Does not exist on the first file
Output file desired:
10.175.34.130,IP,Element4
10.175.34.131,IP,Element5
Simply with awk:
awk -F',' 'NR==FNR{ a[$1]; next }!($1 in a)' file2.csv file1.csv
The output:
10.175.34.130,IP,Element4
10.175.34.131,IP,Element5
Use -f option from grep to compare files. -v to invert match. -F for fixed-strings. man grep goes a long way.
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains
zero patterns, and therefore matches nothing. (-f is specified by POSIX.)
-v, --invert-match
Invert the sense of matching, to select non-matching lines. (-v is specified by POSIX.)
-F, --fixed-strings, --fixed-regexp
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified by POSIX,
--fixed-regexp is an obsoleted alias, please do not use it in new scripts.)
Result:
$ grep -vFf f2.csv f1.csv
10.175.34.130,IP,Element4
10.175.34.131,IP,Element5
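One caveat with grep -vFf here: fixed strings match anywhere in the line, so an IP such as 10.177.33.15 in the second file would also exclude the 10.177.33.157 line. A rough sketch of one way around that is to turn each IP into a pattern anchored to the first column. It assumes the second file holds exactly one bare IP per line and a grep that reads patterns from standard input with -f - (GNU grep does); `exclude_by_ip` is a made-up name.

```shell
# exclude_by_ip IP_FILE CSV_FILE   (hypothetical helper name)
# Prints the CSV lines whose first column is not listed in IP_FILE.
exclude_by_ip() {
    # Escape the dots, then anchor each IP between line start and the first comma,
    # e.g. 10.177.33.157 becomes ^10\.177\.33\.157,
    sed 's/\./\\./g; s/.*/^&,/' "$1" | grep -v -f - "$2"
}

# Usage: exclude_by_ip file2.csv file1.csv
```

The awk answer above already compares the first field exactly, so it does not need this extra step.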

Read one file to search another file and print out missing lines

I am following the example in this post finding contents of one file into another file in unix shell script but want to print out differently.
Basically file "a.txt", with the following lines:
alpha
0891234
beta
Now, the file "b.txt", with the lines:
Alpha
0808080
0891234
gamma
I would like the output of the command to be:
alpha
beta
The first one is "incorrect case" and second one is "missing from b.txt". The 0808080 doesn't matter and it can be there.
This is different from using grep -f "a.txt" "b.txt" and print out 0891234 only.
Is there an elegant way to do this?
Thanks.
Use grep with following options:
grep -Fvf b.txt a.txt
The key is to use -v:
-v, --invert-match
Invert the sense of matching, to select non-matching lines.
When reading patterns from a file, I recommend using the -F option unless you explicitly want the patterns to be treated as regular expressions.
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings (instead of regular expressions), separated by newlines, any of which
is to be matched.
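With -F alone, a line of a.txt counts as found as soon as any line of b.txt appears inside it as a substring. If whole lines should be compared, as the example in the question suggests, adding -x restricts matches to entire lines. A small sketch, with `lines_missing_from` as a made-up wrapper name:

```shell
# lines_missing_from B_FILE A_FILE   (hypothetical helper name)
# Prints the lines of A_FILE that do not appear as a complete line in B_FILE.
lines_missing_from() {
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
    grep -Fxvf "$1" "$2"
}

# Usage: lines_missing_from b.txt a.txt
```

For instance, a line gamma-ray in a.txt would be reported as missing with -x, whereas without it the gamma line in b.txt would hide it.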

Delete lines from a file matching first 2 fields from a second file in shell script

Suppose I have setA.txt:
a|b|0.1
c|d|0.2
b|a|0.3
and I also have setB.txt:
c|d|200
a|b|100
Now I want to delete from setA.txt lines that have the same first 2 fields with setB.txt, so the output should be:
b|a|0.3
I tried:
comm -23 <(sort setA.txt) <(sort setB.txt)
But the equality is defined for whole line, so it won't work. How can I do this?
$ awk -F\| 'FNR==NR{seen[$1,$2]=1;next;} !seen[$1,$2]' setB.txt setA.txt
b|a|0.3
This reads through setB.txt just once, extracts the needed information from it, and then reads through setA.txt while deciding which lines to print.
How it works
-F\|
This sets the field separator to a vertical bar, |.
FNR==NR{seen[$1,$2]=1;next;}
FNR is the number of lines read so far from the current file and NR is the total number of lines read. Thus, when FNR==NR, we are reading the first file, setB.txt. If so, set the value of associative array seen to true, 1, for the key consisting of fields one and two. Lastly, skip the rest of the commands and start over on the next line.
!seen[$1,$2]
If we get to this command, we are working on the second file, setA.txt. Since ! means negation, the condition is true if seen[$1,$2] is false which means that this combination of fields one and two was not in setB.txt. If so, then the default action is performed which is to print the line.
This should work:
sed -n 's#\(^[^|]*|[^|]*\)|.*#/^\1/d#p' setB.txt |sed -f- setA.txt
How this works:
sed -n 's#\(^[^|]*|[^|]*\)|.*#/^\1/d#p'
generates an output:
/^c|d/d
/^a|b/d
which is then used as a sed script for the next sed after the pipe and outputs:
b|a|0.3
(IFS=$'|'; cat setA.txt | while read x y z; do grep -q -P "\Q$x|$y|\E" setB.txt || echo "$x|$y|$z"; done; )
explanation: grep -q means only test whether grep can find the regexp, without printing anything; -P means use Perl syntax, so that the | is matched literally thanks to the \Q..\E construct.
IFS=$'|' makes bash use | instead of whitespace (SPC, TAB, etc.) as the token separator.

How to do something like grep -B to select only one line?

Everything is in the title. Basically, let's say I have this pattern
some text lalala
another line
much funny wow grep
I grep funny and I want my output to be "lalala"
Thank you
One possible answer is to use either ed or ex to do this (it is trivial in them):
ed - yourfile <<< 'g/funny/.-2p'
(Or replace ed with ex. You might have red, the restricted editor, too; it can't modify files.) This looks for the pattern /funny/ globally, and whenever it is found, prints the line 2 before the matching line (that's the .-2p part). Or, if you want the most recent line containing 'lalala' before the line matching 'funny':
ed - yourfile <<< 'g/funny/?lalala?p'
The only problem is if you're trying to process standard input rather than a file; then you have to save the standard input to a file and process that file, which spoils the concurrency.
You can't do negative offsets in sed (though GNU sed allows you to do positive offsets, so you could use sed -n '/lalala/,+2p' file to get the 'lalala' to 'funny' lines (which isn't quite what you want) based on finding 'lalala', but you cannot find the 'lalala' lines based on finding 'funny'). Standard sed does not allow offsets at all.
If you need to print just the IP address found on a line 8 lines before the pattern-matching line, you need a slightly more involved ed script, but it is still doable:
ed - yourfile <<< 'g/funny/.-8s/.* //p'
This uses the same basic mechanism to find the right line, then runs a substitute command to remove everything up to the last space on the line and print the modified version. Since there isn't a w command, it doesn't actually modify the file.
Since grep -B always prints the full block of lines before the match, you'll have to pipe the output into something like grep or Awk.
grep -B 2 "funny" file|awk 'NR==1{print $NF; exit}'
You could also just use Awk.
awk -v s="funny" '/[[:space:]]lalala$/{n=NR+2; o=$NF}NR==n && $0~s{print o}' file
For the specific example of an IP address 8 lines before the match as mentioned in your comment:
awk -v s="funny" '
/[[:space:]][0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$/ {
n=NR+8
ip=$NF
}
NR==n && $0~s {
print ip
}' file
These Awk solutions first find the output field you might want, then print the output only if the word you want exists in the nth following line.
Here's an attempt at a slightly generalized Awk solution. It maintains a circular queue of the last q lines and prints the line at the head of the queue when it sees a match.
#!/bin/sh
: ${q=8}
e=$1
shift
awk -v q="$q" -v e="$e" '{ m[(NR%q)+1] = $0 }
$0 ~ e { print m[((NR+1)%q)+1] }' "${@--}"
Adapting to a different default (I set it to 8) or proper option handling (currently, you'd run it like q=3 ./qgrep regex file) as well as remembering (and hence printing) the entire line should be easy enough.
(I also didn't bother to make it work correctly if you see a match in the first q-1 lines. It will just print an empty line then.)
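For comparison, here is a minimal sketch of the same circular-buffer idea where the offset is given explicitly and a match that occurs too early in the input prints nothing instead of an empty line. The function name `line_before` and its argument order are made up for illustration. Note that awk processes backslash escapes in values passed with -v, so a regex containing backslashes may need them doubled.

```shell
# line_before N REGEX [FILE...]   (hypothetical helper name)
# For each line matching REGEX, prints the line N lines before it.
line_before() {
    n=$1 re=$2
    shift 2
    awk -v n="$n" -v re="$re" '
        # Before the current line is stored, slot NR % n still holds line NR - n
        $0 ~ re && NR > n { print buf[NR % n] }
        # Store the current line, overwriting the oldest entry
        { buf[NR % n] = $0 }
    ' "$@"
}

# Usage: line_before 2 funny file
```

With the three-line example from the question, line_before 2 funny prints the "some text lalala" line; extracting just the last field could be done by printing a saved $NF instead of the whole line.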
