Test if each line in a file contains one of multiple strings in another file - linux

I have a text file (we'll call it keywords.txt) that contains a number of strings that are separated by newlines (though this isn't set in stone; I can separate them with spaces, commas or whatever is most appropriate). I also have a number of other text files (which I will collectively call input.txt).
What I want to do is iterate through each line in input.txt and test whether that line contains one of the keywords. After that, depending on which input file I'm working on at the time, I need to either copy matching lines from input.txt into output.txt and ignore non-matching lines, or copy non-matching lines and ignore matching ones.
I searched for a solution and found ways to do parts of this, but not a way to do everything I'm asking for here. I could try to combine the various solutions I found, but my main concern is that I'd end up wondering whether what I coded was the best way of doing it.
This is a snippet of what I currently have in keywords.txt:
google
adword
chromebook.com
cobrasearch.com
feedburner.com
doubleclick
foofle.com
froogle.com
gmail
keyhole.com
madewithcode.com
Here is an example of what can be found in one of my input.txt files:
&expandable_ad_
&forceadv=
&gerf=*&guro=
&gIncludeExternalAds=
&googleadword=
&img2_adv=
&jumpstartadformat=
&largead=
&maxads=
&pltype=adhost^
In this snippet, &googleadword= is the only line that would match the filter. Depending on the scenario, output.txt should contain either only that matching line or every line that doesn't match the keywords.

1. Assuming the content of keywords.txt is separated by newlines:
google
adword
chromebook.com
...
The following will work:
# Use keywords.txt as your pattern & copy matching lines in input.txt to output.txt
grep -Ff keywords.txt input.txt > output.txt
# Use keywords.txt as your pattern & copy non-matching lines in input.txt to output.txt
grep -vFf keywords.txt input.txt > output.txt
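As a quick sanity check, using the sample keywords.txt and input.txt from the question, the first command should print only the line that contains a keyword (adword appears inside &googleadword=):
$ grep -Ff keywords.txt input.txt
&googleadword=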
2. Assuming the content of keywords.txt is separated by vertical bars:
google|adword|chromebook.com|...
The following will work:
# Use keywords.txt as your pattern & copy matching lines in input.txt to output.txt
grep -Ef keywords.txt input.txt > output.txt
# Use keywords.txt as your pattern & copy non-matching lines in input.txt to output.txt
grep -vEf keywords.txt input.txt > output.txt
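If your keywords start out one per line, the bar-separated form can be generated rather than typed by hand, e.g. with paste (a sketch; keywords_bar.txt is just a made-up name for the result):
paste -sd'|' keywords.txt > keywords_bar.txt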
3. Assuming the content of keywords.txt is separated by commas:
google,adword,chromebook.com,...
There are many ways of achieving the same, but a simple way would be to use tr to replace all commas with vertical bars and then interpret the pattern with grep's extended regular expression.
# Use keywords.txt as your pattern & copy matching lines in input.txt to output.txt
grep -E "$(tr ',' '|' < keywords.txt)" input.txt > output.txt
# Use keywords.txt as your pattern & copy non-matching lines in input.txt to output.txt
grep -vE "$(tr ',' '|' < keywords.txt)" input.txt > output.txt
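Quoting the command substitution keeps the joined pattern as a single argument to grep. With the comma-separated example above, the first command effectively expands to:
grep -E 'google|adword|chromebook.com|...' input.txt > output.txt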
Grep Options
-v, --invert-match
Selected lines are those not matching any of the specified patterns.
-F, --fixed-strings
Interpret each data-matching pattern as a list of fixed strings,
separated by newlines, instead of as a regular expression.
-E, --extended-regexp
Interpret pattern as an extended regular expression
(i.e. force grep to behave as egrep).
-f file, --file=file
Read one or more newline separated patterns from file.
Empty pattern lines match every input line.
Newlines are not considered part of a pattern.
If file is empty, nothing is matched.
Read more about grep
Read more about tr

Related

Excluding lines from a .csv based on pattern in another .csv

I want to compare values from 2 .csv files on Linux, excluding lines from the first file when its first column (which is always an IP) matches any of the IPs from the second file.
Any way of doing that via the command line (grep, for example) would be OK by me.
File1.csv is:
10.177.33.157,IP,Element1
10.177.33.158,IP,Element2
10.175.34.129,IP,Element3
10.175.34.130,IP,Element4
10.175.34.131,IP,Element5
File2.csv:
10.177.33.157 < Exists on the first file
10.177.33.158 < Exists on the first file
10.175.34.129 < Exists on the first file
80.10.2.42 < Does not exist on the first file
80.10.3.194 < Does not exist on the first file
Output file desired:
10.175.34.130,IP,Element4
10.175.34.131,IP,Element5
Simply with awk:
awk -F',' 'NR==FNR{ a[$1]; next }!($1 in a)' file2.csv file1.csv
The output:
10.175.34.130,IP,Element4
10.175.34.131,IP,Element5
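The NR==FNR idiom spelled out with comments (same logic, just spread over lines for readability):
awk -F',' '
  NR==FNR { a[$1]; next }   # first file listed (file2.csv): store each IP as an array key
  !($1 in a)                # second file (file1.csv): print lines whose first column was not stored
' file2.csv file1.csv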
Use the -f option of grep to compare the files, -v to invert the match, and -F for fixed strings. man grep goes a long way.
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains
zero patterns, and therefore matches nothing. (-f is specified by POSIX.)
-v, --invert-match
Invert the sense of matching, to select non-matching lines. (-v is specified by POSIX.)
-F, --fixed-strings, --fixed-regexp
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified by POSIX,
--fixed-regexp is an obsoleted alias, please do not use it in new scripts.)
Result:
$ grep -vFf f2.csv f1.csv
10.175.34.130,IP,Element4
10.175.34.131,IP,Element5
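Note that -F matches the patterns anywhere on the line, so an IP in f2.csv that happens to be a prefix of a longer IP in f1.csv would also remove that line. If that can occur in your data, adding -w (match whole words only) guards against it, or use the awk approach above, which compares the first field exactly:
$ grep -vwFf f2.csv f1.csv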

How to Grep the complete sequences containing a specific motif in a fasta file?

How do I grep the complete sequences containing a specific motif in a fasta or txt file with one Linux command and write them into another file? I also want to include the lines beginning with a ">" before these target sequences.
Example: I have a fasta file of 10000 sequences.
$ cat file.fa
>name1
AEDIA
>name2
ALKME
>name3
AAIII
I want to grep sequences containing KME, so I should get:
>name2
ALKME
Below is the approach I am currently using, based on the answers I got. Maybe others will find it helpful too. Thanks to Pierre Lindenbaum, Philipp Bayer, cpad0112 and batMan.
Preprocessing the fasta file first and get each sequence into one line (which is very important)
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < file.fa > file1.fa
Get rid of the first empty line
tail -n +2 file1.fa > file2.fa
Extract the target sequences containing the substring including their names and save it into another file
LC_ALL=C grep -B 1 KME file2.fa > result.txt
Note: Take KME as the target substring as an example
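The three steps above can also be chained into a single pipeline so that no intermediate files are needed (same commands and assumptions, with KME as the example motif):
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < file.fa | tail -n +2 | LC_ALL=C grep -B 1 KME > result.txt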
If you have multiline fasta files, first linearize with awk, then use another awk to filter for the sequences containing the motif. Using grep would be dangerous if a sequence name contains a short motif.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa |\
awk -F '\t' '{if(index($2,"KME")!=0) printf("%s\n%s\n",$1,$2);}'
grep -B1 KME file > output_file
-B1 : prints 1 line before the match as well

Filter out only matched values from a text file in each line

I have a file "test.txt" with the lines below and also lot bunch of extra stuff after the "version"
soainfra_metrics{metric_group="sca_composite",partition="test",is_active="true",state="on",is_default="true",composite="test123"} map:stats version:1.0
soainfra_metrics{metric_group="sca_composite",partition="gello",is_active="true",state="on",is_default="true",composite="test234"} map:stats version:1.8
soainfra_metrics{metric_group="sca_composite",partition="bolo",is_active="true",state="on",is_default="true",composite="3415"} map:stats version:3.1
soainfra_metrics{metric_group="sca_composite",partition="solo",is_active="true",state="on",is_default="true",composite="hji"} map:stats version:1.1
I tried:
egrep -r 'partition|is_active|state|is_default|composite' test.txt
It's displaying every line, but I only need the specific fields shown below, ignoring the rest of the data on each line.
In a nutshell, I want to display only these fields from each line, not the rest:
partition="test",is_active="true",state="on",is_default="true",composite="test123"
partition="gello",is_active="true",state="on",is_default="true",composite="test234"
partition="bolo",is_active="true",state="on",is_default="true",composite="3415"
partition="solo",is_active="true",state="on",is_default="true",composite="hji"
If your version of grep supports Perl-style regular expressions, then I'd use this:
grep -oP '.*?,\K[^}]+' file
It removes everything up to the first comma (\K kills any previous output) and prints everything up to the }.
Alternatively, using awk:
awk -F'}' '{ sub(/[^,]+,/, ""); print $1 }' file
This sets the field separator to } so the part you're interested in is the first field. It then uses sub to remove the part up to the first comma.
For completeness, you could also use sed:
sed 's/[^,]*,\([^}]*\).*/\1/' file
This captures the part after the first , up to the } and replaces the content of the line with it.
After the grep to pick out the lines you want, use sed to edit the lines:
sed 's/.*\(partition[^}]*\)} map.*/\1/'
This means: "whenever you see anything .*, followed by partition and
any number of non-}, then } map and anything else, grab the part
from partition up to but not including the brace \(...\) as group 1.
The replacement text is just group 1 \1.
Use a pipe | to connect the output of egrep to the input of sed:
egrep ... | sed ...
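For this particular file, that could look like the following (reusing the pattern from your egrep attempt):
egrep 'partition|is_active|state|is_default|composite' test.txt | sed 's/.*\(partition[^}]*\)} map.*/\1/'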
As far as I understood, your file might have more lines you don't want to see, so I would use:
sed -n 's/.*\(partition.*\)}.*/\1/p' file
We use -n together with the p flag to show only the lines where a substitution was made. The substitution just captures the part of the line you need and replaces the whole line with it.
This might work for you (GNU sed):
sed -r 's/(partition|is_active|state|is_default|composite)="[^"]*"/\n&\n/g;s/[^\n]*\n([^\n]*)\n[^\n]*/\1,/g;s/,$//' file
Treat the problem as if it were a "decomposed club sandwich". Identify the fillings, remove the bread and tidy up.

Quickest way to remove 70+ strings from a file?

I have 70+ strings I need to find and delete in a file. I need to remove the entire line in the file that the string appears in.
I know I can use sed -i '/string to remove/d' fileA.txt to remove them one at a time. However, considering I have 70+, it will take some time doing it this way.
Is there a way I can put these 70+ strings in a file and have sed go through them one by one? Or if I create a file containing the strings, is there a way to compare the two files so it removes any line from fileA that contains one of the strings?
You could use grep:
grep -vf file_with_words.txt file.txt
where file_with_words.txt would be the file containing the list of words, each word being on a different line and file.txt is the file that you want to remove the lines from.
If your list of words contains regex metacharacters, then tell grep to consider those as fixed strings (if that is what you want):
grep -F -vf file_with_words.txt file.txt
Using sed, you'd need to say:
sed '/word1\|word2\|word3/d' file.txt
or
sed -E '/word1|word2|word3/d' file.txt
You could use command substitution to construct the pattern too:
sed -E "/$(paste -sd'|' file_with_words.txt)/d" file.txt
but grep is clearly the tool to use in this case.
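(To illustrate the command substitution: if file_with_words.txt contained just word1, word2 and word3 on separate lines, paste -sd'|' joins them into word1|word2|word3, so the command that actually runs is:
sed -E "/word1|word2|word3/d" file.txt)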
If you want to do the job in bash, here's how:
search=fileA.txt
queries=queries.txt
while read -r query
do
sed -i '' "/$query/d" "$search"
done < "$queries"
where queries.txt looks like
I
want
to
delete
these
lines
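Note that sed -i '' is the BSD/macOS form of in-place editing; with GNU sed on Linux, drop the empty suffix argument:
sed -i "/$query/d" "$search"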

Replace whitespace with a comma in a text file in Linux

I need to edit a few text files (an output from sar) and convert them into CSV files.
I need to change every whitespace character (it may be a tab between the numbers in the output) to a comma, using sed or awk (an easy shell script in Linux).
Can anyone help me? None of the commands I tried changed the file at all; I also tried gsub.
tr ' ' ',' <input >output
Substitutes each space with a comma. If you need to, you can add the -s flag (squeeze repeats), which replaces each input sequence of a repeated character listed in SET1 with a single occurrence of that character.
Using squeeze repeats first to collapse runs of tabs before substituting:
tr -s '\t' <input | tr '\t' ',' >output
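For sar-style output where fields are separated by runs of spaces, the translation and the squeeze can be combined in a single pass (-s here squeezes the repeated commas produced by the translation):
tr -s ' ' ',' <input >output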
Try something like:
sed -E 's/[[:space:]]+/,/g' orig.txt > modified.txt
The character class [[:space:]] will match all whitespace (spaces, tabs, etc.); note that the class name has to sit inside a bracket expression, and -E is needed so that + means "one or more". If you just want to replace a single character, e.g. just the space, use that only.
EDIT: Actually [[:space:]] includes carriage return, so this may not do what you want. The following will replace tabs and spaces:
sed -E 's/[[:blank:]]+/,/g' orig.txt > modified.txt
as will (with GNU sed, which understands \t inside brackets)
sed -E 's/[\t ]+/,/g' orig.txt > modified.txt
In all of this, you need to be careful that the items in your file that are separated by whitespace don't contain their own whitespace that you want to keep, e.g. two words.
Without looking at your input file, this is only a guess:
awk '{$1=$1}1' OFS=","
Redirect to another file and rename as needed.
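What this does: assigning $1=$1 forces awk to rebuild the record using the output field separator, and the lone 1 prints the rebuilt line. Spelled out (a sketch; the file names are placeholders):
awk 'BEGIN{OFS=","} {$1=$1; print}' input.txt > output.csv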
What about something like this :
cat texte.txt | sed -e 's/\s/,/g' > texte-new.txt
(Yes, with some useless catting and piping; I could also use < to read from the file directly. I used cat first to output the content of the file, and only afterwards added sed to my command line.)
EDIT: as #ghostdog74 pointed out in a comment, there's definitely no need for the cat/pipe; you can give the name of the file to sed directly:
sed -e 's/\s/,/g' texte.txt > texte-new.txt
If "texte.txt" is this way :
$ cat texte.txt
this is a text
in which I want to replace
spaces by commas
You'll get a "texte-new.txt" that'll look like this :
$ cat texte-new.txt
this,is,a,text
in,which,I,want,to,replace
spaces,by,commas
I wouldn't just replace the old file with the new one (this could be done with sed -i, if I remember correctly, and as #ghostdog74 said, that would also allow creating a backup on the fly). Keeping the original might be wise, as a safety measure (even if it means renaming it to something like "texte-backup.txt").
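For the record, the in-place-with-backup variant mentioned above would be (GNU sed; it keeps the original as texte.txt.bak):
sed -i.bak -e 's/\s/,/g' texte.txt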
This command should work:
sed "s/\s/,/g" < infile.txt > outfile.txt
Note that you have to redirect the output to a new file. The input file is not changed in place.
sed can do this:
sed 's/[\t ]/,/g' input.file
That will send the output to the console,
sed -i 's/[\t ]/,/g' input.file
will edit the file in-place
Here's a Perl script which will edit the files in-place:
perl -i.bak -lpe 's/\s+/,/g' files*
Consecutive whitespace is converted to a single comma.
Each input file is moved to .bak
These command-line options are used:
-i.bak edit in-place and make .bak copies
-p loop around every line of the input file, automatically print the line
-l removes newlines before processing, and adds them back in afterwards
-e execute the perl code
If you want to replace an arbitrary sequence of blank characters (tab, space) with one comma, use the following:
sed -E 's/[\t ]+/,/g' input_file > output_file
or
sed -r 's/[[:blank:]]+/,/g' input_file > output_file
If some of your input lines include leading space characters which are redundant and don't need to be converted to commas, then first you need to get rid of them, and then convert the remaining blank characters to commas. For such case, use the following:
sed -E 's/^ +//' input_file | sed -E 's/[\t ]+/,/g' > output_file
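The pipe can also be avoided by giving sed both expressions in one invocation (same assumptions as above):
sed -E -e 's/^ +//' -e 's/[\t ]+/,/g' input_file > output_file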
This worked for me.
sed -e 's/\s\+/,/g' input.txt >> output.csv
