I am trying to delete lines in a text file which contain any word in a list. For example:
File 1:
xxx yyy, zzz,
aaa bbb, sss,
ccc fff, zzz,
rrr www, qasd,
File 2:
xxx
zzz
rrr
The target is to delete the lines in file1 which contain any word in file2.
So the output should be:
aaa bbb, sss,
I know how to use sed with a single word, like sed '/zzz/d' to delete lines containing zzz. But how does it work with multiple words, or with words stored in a file?
You can do this easily with grep:
$ grep -Fwvf file2 file1
aaa bbb, sss,
Options:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains zero patterns, and therefore
matches nothing. (-f is specified by POSIX.)
-v, --invert-match
Invert the sense of matching, to select non-matching lines. (-v is specified by POSIX.)
-w, --word-regexp
Select only those lines containing matches that form whole words. The test is that the
matching substring must either be at the beginning of the line, or preceded by a non-word
constituent character. Similarly, it must be either at the end of the line or followed by a
non-word constituent character. Word-constituent characters are letters, digits, and the
underscore.
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be
matched. (-F is specified by POSIX.)
To store the changes back to file1:
$ grep -Fwvf file2 file1 > tmp && mv tmp file1
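For reference, the whole example can be reproduced in a scratch directory (a minimal sketch; the file names and contents are taken from the question):

```shell
# Recreate the question's sample files in a scratch directory
cd "$(mktemp -d)"
printf '%s\n' 'xxx yyy, zzz,' 'aaa bbb, sss,' 'ccc fff, zzz,' 'rrr www, qasd,' > file1
printf '%s\n' xxx zzz rrr > file2
# Only the line containing none of file2's words survives
grep -Fwvf file2 file1
```

This prints just `aaa bbb, sss,`: the other three lines each contain xxx, zzz, or rrr as a whole word.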
Try this:
grep -vFwf file2 file1
Related
I've got a file.txt from which I want to extract the lines containing the exact words listed in check.txt.
# file.txt
CA1C 2637 green
CA1C-S1 2561 green
CA1C-S2 2371 green
# check.txt
CA1C
I tried
grep -wFf check.txt file.txt
but I'm not getting the desired output; all three lines were printed.
Instead, I'd like to get only the first line,
CA1C 2637 green
I searched and found this post to be relevant; it's easy when matching only a single word. But how can I improve my command so that grep obtains the patterns from the check.txt file and prints only the lines matching a whole word?
A lot of thanks!
The man page for grep says the following about the -w switch:
-w, --word-regexp
Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line, or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore.
In your case, every line starts with "CA1C" followed by a non-word-constituent character (a space in the first line, a hyphen in the others), so each line meets both conditions and all three are selected.
I would do this with a loop, reading lines manually from check.txt:
while read -r line; do grep "^$line " file.txt; done < check.txt
CA1C 2637 green
This loop reads the lines from check.txt, and searches for each one at the start of a line in file.txt, with a following space.
There may be a better way to do this, but I couldn't get -f to actually consider whitespace at the end of a line of the input file.
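An awk alternative, as a sketch, avoids the shell loop entirely: load the check.txt entries, then print the file.txt lines whose text begins with an entry followed by a space (index returns 1 when the substring starts at the first character). Note that a line consisting of the entry alone, with no trailing space, would not match this sketch.

```shell
# Recreate the question's sample files in a scratch directory
cd "$(mktemp -d)"
printf '%s\n' 'CA1C 2637 green' 'CA1C-S1 2561 green' 'CA1C-S2 2371 green' > file.txt
printf '%s\n' 'CA1C' > check.txt
# Print lines of file.txt that start with a check.txt entry plus a space
awk 'NR==FNR { pat[$0]; next }
     { for (p in pat) if (index($0, p " ") == 1) { print; break } }' check.txt file.txt
```

Only `CA1C 2637 green` is printed; the hyphenated lines never contain "CA1C " as a leading substring.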
I have two commands in my script, as follows:
awk -F'"(,")?' '
NR==FNR { r[$2] = $3; next }
{ for (n in r) gsub(n, r[n]) } 1' file2.csv file1.csv > xyzabc.csv
and
grep -v -f file3.txt xyzabc.csv > output.csv
so basically these commands compare files to produce a desired output.
My question: when comparing, I want the comparison to be done in lowercase and without spaces, but the whitespace removal and lowercase conversion should be temporary, i.e. the original text should be printed in the output file.
for example:
file1:          file2.csv:
I AM A MAN      I am a man
I Like DoGs     i like DOGS
I like cats     I like cats
So with the commands mentioned above, these strings are not treated as equal.
I am trying to use tr 'A-Z' 'a-z' and tr -d '[:space:]' to do the job, but I am struggling with the syntax.
Also, after the comparison is done, I want to print the text exactly as it appears in file2.csv, so the conversion to lowercase and removal of whitespace has to be temporary.
Thanks
edit:
I apologize for not being very clear with my samples.
so file1 contains the following data:
file1.csv:
I am a man
I like dogs
I am a doctor
I like cats
I drink coffee
and file2.csv contains the following data:
file2.csv:
I am a man,man
I like dogs,dogs
I drink coffee,I drink tea
I am using my awk command on these two files. What it does is check whether the sentences in the first column of file2.csv are present in file1.csv, replace them with the contents of the second column of file2.csv, and place the output in a different file.
When doing the search, I want it to be case-insensitive and to ignore spaces, as file2.csv may contain multiple spaces between words or differ in case.
Also, after the output is produced, I do not want the contents of file1 and file2 to be altered.
As for the grep command, it is a simple find-and-delete: it looks for the same strings in both files and deletes them. I want this comparison, too, to ignore whitespace, as there may be multiple spaces between words in either file.
awk '
function asKey(str, tmp) {
tmp = tolower(str)
gsub(/[[:blank:]]+/, " ", tmp)
return tmp
}
NR==FNR {f2[asKey($0)]; next}
asKey($0) in f2
' file2 file1
Given your sample files above, this returns all lines in file1.
I notice mawk does not collapse whitespace with that regex. You might want to replace /[[:blank:]]+/ with /[ \t\r]+/ in that case.
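To see this on the three-line samples, here is a self-contained run of the same script (the sample strings are taken from the question; mawk users should swap in the /[ \t\r]+/ variant as noted):

```shell
# Recreate the three-line samples in a scratch directory
cd "$(mktemp -d)"
printf '%s\n' 'I AM A MAN' 'I Like DoGs' 'I like cats' > file1
printf '%s\n' 'I am a man' 'i like DOGS' 'I like cats' > file2
awk '
function asKey(str, tmp) {
    tmp = tolower(str)                # fold case
    gsub(/[[:blank:]]+/, " ", tmp)    # collapse runs of blanks to one space
    return tmp
}
NR==FNR {f2[asKey($0)]; next}
asKey($0) in f2
' file2 file1
```

All three file1 lines are printed, unchanged, because each normalizes to the same key as a file2 line; the normalization lives only in the lookup, never in the output.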
I have a text file (we'll call it keywords.txt) that contains a number of strings that are separated by newlines (though this isn't set in stone; I can separate them with spaces, commas or whatever is most appropriate). I also have a number of other text files (which I will collectively call input.txt).
What I want to do is iterate through each line in input.txt and test whether that line contains one of the keywords. After that, depending on which input file I'm working on at the time, I need to either copy the matching lines in input.txt into output.txt and ignore the non-matching lines, or copy the non-matching lines and ignore the matching ones.
I searched for a solution but, though I found ways to do parts of what I'm trying to do, I haven't found a way to do everything I'm asking for here. While I could try and combine the various solutions I found, my main concern is that I would end up wondering if what I coded would be the best way of doing this.
This is a snippet of what I currently have in keywords.txt:
google
adword
chromebook.com
cobrasearch.com
feedburner.com
doubleclick
foofle.com
froogle.com
gmail
keyhole.com
madewithcode.com
Here is an example of what can be found in one of my input.txt files:
&expandable_ad_
&forceadv=
&gerf=*&guro=
&gIncludeExternalAds=
&googleadword=
&img2_adv=
&jumpstartadformat=
&largead=
&maxads=
&pltype=adhost^
In this snippet, &googleadword= is the only line that would match the filter and there are scenarios in my case where output.txt will either have only the matching line inserted or every line that doesn't match the keywords.
1. Assuming the content of keywords.txt is separated by newlines:
google
adword
chromebook.com
...
The following will work:
# Use keywords.txt as your pattern & copy matching lines in input.txt to output.txt
grep -Ff keywords.txt input.txt > output.txt
# Use keywords.txt as your pattern & copy non-matching lines in input.txt to output.txt
grep -vFf keywords.txt input.txt > output.txt
2. Assuming the content of keywords.txt is separated by vertical bars:
google|adword|chromebook.com|...
The following will work:
# Use keywords.txt as your pattern & copy matching lines in input.txt to output.txt
grep -Ef keywords.txt input.txt > output.txt
# Use keywords.txt as your pattern & copy non-matching lines in input.txt to output.txt
grep -vEf keywords.txt input.txt > output.txt
3. Assuming the content of keywords.txt is separated by commas:
google,adword,chromebook.com,...
There are many ways of achieving the same, but a simple way would be to use tr to replace all commas with vertical bars and then interpret the pattern with grep's extended regular expression.
# Use keywords.txt as your pattern & copy matching lines in input.txt to output.txt
grep -E "$(tr ',' '|' < keywords.txt)" input.txt > output.txt
# Use keywords.txt as your pattern & copy non-matching lines in input.txt to output.txt
grep -vE "$(tr ',' '|' < keywords.txt)" input.txt > output.txt
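An alternative sketch for the comma-separated case sidesteps regular expressions entirely: convert the commas to newlines and feed the result to grep as fixed-string patterns on standard input (with GNU grep, -f - reads patterns from stdin). This way keyword characters such as '.' are taken literally.

```shell
# Hypothetical sample data in a scratch directory
cd "$(mktemp -d)"
printf 'google,adword,chromebook.com\n' > keywords.txt
printf '%s\n' '&forceadv=' '&googleadword=' '&largead=' > input.txt
# Commas become newlines; -F matches each keyword as a literal substring
tr ',' '\n' < keywords.txt | grep -Ff - input.txt > output.txt
cat output.txt
```

With this data only `&googleadword=` is written to output.txt, since it is the only line containing one of the keywords as a substring.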
Grep Options
-v, --invert-match
Selected lines are those not matching any of the specified patterns.
-F, --fixed-strings
Interpret each data-matching pattern as a list of fixed strings,
separated by newlines, instead of as a regular expression.
-E, --extended-regexp
Interpret pattern as an extended regular expression
(i.e. force grep to behave as egrep).
-f file, --file=file
Read one or more newline separated patterns from file.
Empty pattern lines match every input line.
Newlines are not considered part of a pattern.
If file is empty, nothing is matched.
Read more about grep
Read more about tr
I want to compare values from 2 .csv files on Linux, excluding lines from the first file when its first column (which is always an IP) matches any of the IPs from the second file.
Any way of doing that via the command line (grep, for example) would be OK by me.
File1.csv is:
10.177.33.157,IP,Element1
10.177.33.158,IP,Element2
10.175.34.129,IP,Element3
10.175.34.130,IP,Element4
10.175.34.131,IP,Element5
File2.csv:
10.177.33.157 < Exists on the first file
10.177.33.158 < Exists on the first file
10.175.34.129 < Exists on the first file
80.10.2.42 < Does not exist on the first file
80.10.3.194 < Does not exist on the first file
Output file desired:
10.175.34.130,IP,Element4
10.175.34.131,IP,Element5
Simply with awk:
awk -F',' 'NR==FNR{ a[$1]; next }!($1 in a)' file2.csv file1.csv
The output:
10.175.34.130,IP,Element4
10.175.34.131,IP,Element5
Use -f option from grep to compare files. -v to invert match. -F for fixed-strings. man grep goes a long way.
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains
zero patterns, and therefore matches nothing. (-f is specified by POSIX.)
-v, --invert-match
Invert the sense of matching, to select non-matching lines. (-v is specified by POSIX.)
-F, --fixed-strings, --fixed-regexp
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified by POSIX,
--fixed-regexp is an obsoleted alias, please do not use it in new scripts.)
Result:
$ grep -vFf f2.csv f1.csv
10.175.34.130,IP,Element4
10.175.34.131,IP,Element5
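One caveat with the plain -F approach: the patterns match as substrings anywhere on the line, so an IP that is a prefix of another would knock out both lines. The sketch below illustrates this with a hypothetical pair of IPs and shows one fix, anchoring each pattern at the line start with the trailing comma (the awk answer above avoids the problem by comparing the first field exactly):

```shell
# Hypothetical data: one IP is a prefix of the other
cd "$(mktemp -d)"
printf '%s\n' '10.175.34.130,IP,Element4' '10.175.34.13,IP,Element9' > f1.csv
printf '%s\n' '10.175.34.13' > f2.csv
# Substring matching drops BOTH lines, so this prints nothing
grep -vFf f2.csv f1.csv || true
# Anchor at line start and require the trailing comma
# (dots are left unescaped here for brevity)
sed 's/^/^/; s/$/,/' f2.csv | grep -vf - f1.csv
```

The anchored version keeps the `10.175.34.130,IP,Element4` line, excluding only the exact first-column match.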
I have a text file that contains some words, one word per line, and I accept it as a command-line argument. For example, the text file a.txt is like:
about
catb
west
eastren
What I want to do is find the words that are not in the dictionary; if a word is a dictionary word, delete it from the text file.
I use the following commands:
word=$1
grep "$1$" /usr/share/dict/linux.words -q
for word in $(<a.txt)
do
if [ $word -eq 0 ]
then
sed '/$word/d'
fi
done
Nothing happened.
grep alone is enough from what I understand
$ grep -xvFf /usr/share/dict/linux.words a.txt
catb
eastren
catb and eastren are words not found in /usr/share/dict/linux.words. The options used are
-x, --line-regexp
Select only those matches that exactly match the whole line. For a regular expression pattern, this is
like parenthesizing the pattern and then surrounding it with ^ and $.
-v, --invert-match
Invert the sense of matching, to select non-matching lines.
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings (instead of regular expressions), separated by newlines,
any of which is to be matched.
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. If this option is used multiple times or is combined with the
-e (--regexp) option, search for all patterns given. The empty file contains zero patterns, and
therefore matches nothing.
An alternative would be to use a spell checker like hunspell, so you can apply it to any text, not only a pre-formatted file with only one word per line. You can also specify several dictionaries, and it shows only words which are in none of them.
For example, after copy/pasting the content of this page into test.txt
lang1=en_US
lang2=fr_FR
hunspell -d $lang1,$lang2 -l test.txt | sort -u
produces a list of 46 words:
'
Apps
Arqade
catb
'cba'
Cersei
ceving
Drupal
eastren
...
WordPress
Worldbuilding
xvFf
xz
Yodeya