Excluding lines from a .csv based on pattern in another .csv - linux

I want to compare values from two .csv files on Linux, excluding lines from the first file when their first column (which is always an IP) matches any of the IPs in the second file.
Any way of doing that from the command line (with grep, for example) would be OK by me.
File1.csv is:
10.177.33.157,IP,Element1
10.177.33.158,IP,Element2
10.175.34.129,IP,Element3
10.175.34.130,IP,Element4
10.175.34.131,IP,Element5
File2.csv:
10.177.33.157 < Exists on the first file
10.177.33.158 < Exists on the first file
10.175.34.129 < Exists on the first file
80.10.2.42 < Does not exist on the first file
80.10.3.194 < Does not exist on the first file
Output file desired:
10.175.34.130,IP,Element4
10.175.34.131,IP,Element5

Simply with awk:
awk -F',' 'NR==FNR{ a[$1]; next }!($1 in a)' file2.csv file1.csv
NR==FNR is true only while reading the first argument (file2.csv), so its first column is stored as keys of array a; lines of file1.csv are then printed only when their first field is not in a. The output:
10.175.34.130,IP,Element4
10.175.34.131,IP,Element5

Use grep's -f option to read the patterns from a file, -v to invert the match, and -F for fixed strings. man grep goes a long way:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains
zero patterns, and therefore matches nothing. (-f is specified by POSIX.)
-v, --invert-match
Invert the sense of matching, to select non-matching lines. (-v is specified by POSIX.)
-F, --fixed-strings, --fixed-regexp
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified by POSIX,
--fixed-regexp is an obsoleted alias, please do not use it in new scripts.)
Result:
$ grep -vFf f2.csv f1.csv
10.175.34.130,IP,Element4
10.175.34.131,IP,Element5
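One caveat worth noting: -F patterns match anywhere on the line, so a shorter IP in f2.csv (say 10.177.33.15) would also knock out the line for 10.177.33.157. A hedged workaround, assuming bash or zsh for the process substitution, is to append the comma delimiter to each pattern so it can only match a complete first column:
$ grep -vFf <(sed 's/$/,/' f2.csv) f1.csv
10.175.34.130,IP,Element4
10.175.34.131,IP,Element5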

Related

Test if each line in a file contains one of multiple strings in another file

I have a text file (we'll call it keywords.txt) that contains a number of strings that are separated by newlines (though this isn't set in stone; I can separate them with spaces, commas or whatever is most appropriate). I also have a number of other text files (which I will collectively call input.txt).
What I want to do is iterate through each line in input.txt and test whether that line contains one of the keywords. After that, depending on what input file I'm working on at the time, I would need to either copy matching lines in input.txt into output.txt and ignore non-matching lines, or copy non-matching lines and ignore matching ones.
I searched for a solution but, though I found ways to do parts of what I'm trying to do, I haven't found a way to do everything I'm asking for here. While I could try and combine the various solutions I found, my main concern is that I would end up wondering if what I coded would be the best way of doing this.
This is a snippet of what I currently have in keywords.txt:
google
adword
chromebook.com
cobrasearch.com
feedburner.com
doubleclick
foofle.com
froogle.com
gmail
keyhole.com
madewithcode.com
Here is an example of what can be found in one of my input.txt files:
&expandable_ad_
&forceadv=
&gerf=*&guro=
&gIncludeExternalAds=
&googleadword=
&img2_adv=
&jumpstartadformat=
&largead=
&maxads=
&pltype=adhost^
In this snippet, &googleadword= is the only line that would match the filter. Depending on the scenario, output.txt should end up with either only the matching line or every line that doesn't match the keywords.
1. Assuming the content of keywords.txt is separated by newlines:
google
adword
chromebook.com
...
The following will work:
# Use keywords.txt as your pattern & copy matching lines in input.txt to output.txt
grep -Ff keywords.txt input.txt > output.txt
# Use keywords.txt as your pattern & copy non-matching lines in input.txt to output.txt
grep -vFf keywords.txt input.txt > output.txt
2. Assuming the content of keywords.txt is separated by vertical bars:
google|adword|chromebook.com|...
The following will work:
# Use keywords.txt as your pattern & copy matching lines in input.txt to output.txt
grep -Ef keywords.txt input.txt > output.txt
# Use keywords.txt as your pattern & copy non-matching lines in input.txt to output.txt
grep -vEf keywords.txt input.txt > output.txt
3. Assuming the content of keywords.txt is separated by commas:
google,adword,chromebook.com,...
There are many ways of achieving the same result, but a simple one is to use tr to replace all commas with vertical bars and then let grep interpret the result as an extended regular expression.
# Use keywords.txt as your pattern & copy matching lines in input.txt to output.txt
grep -E "$(tr ',' '|' < keywords.txt)" input.txt > output.txt
# Use keywords.txt as your pattern & copy non-matching lines in input.txt to output.txt
grep -vE "$(tr ',' '|' < keywords.txt)" input.txt > output.txt
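Since several of the keywords contain dots (chromebook.com and friends), the -E variants above treat those dots as regex metacharacters. A hedged alternative, assuming bash or zsh for the process substitution, converts the commas to newlines so the patterns can stay fixed strings:
grep -Ff <(tr ',' '\n' < keywords.txt) input.txt > output.txt
grep -vFf <(tr ',' '\n' < keywords.txt) input.txt > output.txt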
Grep Options
-v, --invert-match
Selected lines are those not matching any of the specified patterns.
-F, --fixed-strings
Interpret each data-matching pattern as a list of fixed strings,
separated by newlines, instead of as a regular expression.
-E, --extended-regexp
Interpret pattern as an extended regular expression
(i.e. force grep to behave as egrep).
-f file, --file=file
Read one or more newline separated patterns from file.
Empty pattern lines match every input line.
Newlines are not considered part of a pattern.
If file is empty, nothing is matched.
Read more in man grep and man tr.

Print list of filenames except for the ones specified in a text file

I have a folder test with
file1.txt
file2.wav
file3.txt
file4.py
I have a dont_include.txt file outside the test folder with contents
file1.txt
file4.py
How to print all the filenames in the test folder except for the ones listed in dont_include.txt
Desired output:
file2.wav
file3.txt
You can use grep:
ls test | grep -xvFf dont_include.txt -
-f means the list of patterns is taken from file, dont_include.txt in this case
-F means the patterns are interpreted as literal strings rather than regular expressions, so a.txt won't match abtxt
-x matches only whole lines, i.e. other_file.txt won't be matched by file.txt
-v means we want to print the non-matching lines
- (optional) means the list we're filtering is the standard input, i.e. the output of ls in this case.
It doesn't work for files with newlines in their names (but storing them in a file one per line is already wrong anyway).
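A rough alternative sketch using comm (assuming both sides are sorted the same way and, again, no newlines in filenames) does the same whole-line filtering:
comm -23 <(ls test | sort) <(sort dont_include.txt)
comm -23 prints only the lines that appear in the first input and not in the second.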

How to Grep the complete sequences containing a specific motif in a fasta file?

How to Grep the complete sequences containing a specific motif in a fasta file or txt file with one linux command and write them into another file? Also, I want to include the lines beginning with a ">" before these target sequences.
Example:I have a fasta file of 10000 sequences.
$ cat file.fa
>name1
AEDIA
>name2
ALKME
>name3
AAIII
I want to grep sequences containing KME, so I should get:
>name2
ALKME
Below is the approach I am currently using, based on the answers I got. Maybe others will find it helpful. Thanks to Pierre Lindenbaum, Philipp Bayer, cpad0112 and batMan.
First, preprocess the fasta file to get each sequence onto one line (which is very important):
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < file.fa > file1.fa
Then get rid of the first empty line:
tail -n +2 file1.fa > file2.fa
Finally, extract the target sequences containing the substring, together with their names, and save them to another file:
LC_ALL=C grep -B 1 KME file2.fa > result.txt
Note: Take KME as the target substring as an example
If you have multiline fasta files, first linearize with awk, then use another awk to filter the sequences containing the motif. Using grep would be dangerous if a sequence name contains the motif.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa |\
awk -F '\t' '{if(index($2,"KME")!=0) printf("%s\n%s\n",$1,$2);}'
grep -B1 KME file > output_file
-B1 : prints 1 line before the match as well
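One small note on the -B1 approach: with several non-adjacent matches, GNU grep inserts a -- separator line between context groups, which would end up in the output file. Assuming GNU grep, the separator can be suppressed:
grep -B1 --no-group-separator KME file > output_file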

Check if all lines from one file are present somewhere in another file

I used file1 as a source of data for file2 and now I need to make sure that every single line of text from file1 occurs somewhere in file2 (and find out which lines are missing, if any). It's probably important to note that while file1 has conveniently one search term per line, the terms can occur anywhere in the file2 including in the middle of a word. Also would help if the matching was case insensitive - doesn't matter if the text in file2 is even in all caps as long as it's there.
The lines in file1 include spaces and all sorts of other special characters like --.
if grep -Fqvf file2 file1; then
    echo $"There are lines in file1 that don't occur in file2."
fi
Grep options mean:
-F, --fixed-strings PATTERN is a set of newline-separated fixed strings
-f, --file=FILE obtain PATTERN from FILE
-v, --invert-match select non-matching lines
-q, --quiet, --silent suppress all normal output
You can try
awk -f a.awk file1 file2
where a.awk is
BEGIN { IGNORECASE=1 }      # gawk extension (other awks ignore this variable)
NR==FNR {                   # first file: remember each line as a key
    a[$0]++
    next
}
{                           # second file: drop every key found in this line
    for (i in a)
        if (index($0, i))
            delete a[i]
}
END {                       # anything left was never seen in file2
    for (i in a)
        print i
}
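Note that IGNORECASE is a gawk extension; a strictly POSIX awk silently ignores it and matches case-sensitively. A hedged portable variant lowercases both sides instead:
awk 'NR==FNR { a[tolower($0)]++; next }
     { line = tolower($0); for (i in a) if (index(line, i)) delete a[i] }
     END { for (i in a) print i }' file1 file2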
The highest voted answer to this post -- grep -Fqvf file2 file1 -- is not quite correct; there are a number of issues with it, all stemming from a single major issue: namely, that the direction of comparison is flipped. We are using every line in file2 to search file1 to make sure that all lines in file1 are covered. This fits with how grep works and is elegant, but it doesn't actually solve the problem. I discovered this while comparing two package lists -- one the output of pacman -Qqe, the other a list I'd made compiling those packages into different groupings to simplify setting up a new computer. I wanted to make sure that I hadn't missed any packages in my groupings.
The first problem is major -- if file2 contains a single empty line, the output will always be false (ie, it will not identify that there are missing lines). This is because the empty line in file2 will match every line of file1. So with the following files, we do not correctly identify that zsh is missing from file2:
file1 file2
acpi acpi
... ...
r r
... ...
yaourt yaourt
zsh
<EOF> <EOF>
$ grep -Fvf file2 file1
[ no output ]
Ok, so we can just strip empty lines, right?
$ grep -Fv "$(grep -ve '^&' file2)" file1
zsh
Great! But now we get to another problem. Let's say we remove yaourt from file2. We'd expect the output to now be
yaourt
zsh
But here's what we actually get
$ grep -Fv "$(grep -ve '^&' file2)" file1
zsh
Why is that? Well, it's the same reason that an empty line causes problems. In this case, the line r in file2 is matching yaourt in file1. Removing empty lines only fixed the most egregious case of this more general problem.
Apart from the false negatives here, there are also false positives from not handling the case OP called out --
It's probably important to note that while file1 has conveniently one search term per line, the terms can occur anywhere in the file2 including in the middle of a word.
So this would mean that if ohmyzsh is in file2, that would be a match for zsh in file1. But that would not happen, since we are searching file1 for ohmyzsh, and obviously, zsh doesn't match, given it is a substring of ohmyzsh. This last example illustrates why searching file1 with the lines of file2 categorically will not work. But if we search file2 with the lines of file1, we will get all the matches in file2, but not know if we have a match for every line of file1. The number of matches doesn't help, since we could have multiple matches for, say, sh (zsh, bash, fish, ...) but no matches for acpi.
This is all a very long way of saying that this isn't a problem that can be solved with O(1) greps. You'd need to use a loop. With a loop, the problem is trivial.
readarray -t terms < file1 # bash
# zsh: terms=("${(@f)$(< file1)}")
for term in "${terms[@]}"; do # I know `do` "should" be on a separate line; bite me
    grep -Fq "$term" file2 ||
        { echo "$term does not appear in file2" && break; }
done
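Since the question also asked for case-insensitive matching, the same loop should work with grep's -i flag added (a small tweak, not tested against the original data):
grep -Fqi "$term" file2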

How to get the content from one file that is contained in another file

How do I get the rows from one file whose key is contained in another file?
For example, I have
// first
foo
bar
// second
foo;1;3;p1
bar;1;3;p2
foobar;1;3;p2
These files are big: the first contains ~500,000 records and the second ~15-20 million.
I need to get this result
// note: there is no "p1" or "p2" in the output
foo;1;3
bar;1;3
This looks like it wants the join command, possibly with sorting. But with millions of records, it's time to think seriously about a real DBMS.
join -t\; -o 0,2.2,2.3 <(sort -t\; -k 1,1 first) <(sort -t\; -k 1,1 second)
(This requires bash or zsh for the <(command) syntax; portably, you would need to sort into temporary files or keep the input files sorted.)
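With the sample data above, this should print the two matching rows, following the sorted key order:
bar;1;3
foo;1;3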
grep -f:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file
contains zero patterns, and therefore matches nothing. (-f is
specified by POSIX.)
cut -d\; -f1-3:
-d, --delimiter=DELIM
use DELIM instead of TAB for field delimiter
-f, --fields=LIST
select only these fields; also print any line that contains no
delimiter character, unless the -s option is specified
Putting it together: grep -f pattern_file data_file | cut -d\; -f1-3 .
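One caveat: a plain grep -f matches anywhere in the line, so the pattern foo would also pull in foobar;1;3;p2, which the desired output excludes. A hedged alternative in the spirit of the awk answer at the top of this page, keyed on the first ;-separated field only:
awk -F';' -v OFS=';' 'NR==FNR{ a[$1]; next } $1 in a { print $1, $2, $3 }' first second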
