I have the following situation:
source.txt
ID1:email1@domain1.com
ID2:email2@domain2.com
ID3:email3@domain3.com
...
IDs are numeric strings, e.g. 1234, 23412, 897... (one or more digits).
exclude.txt
emailX@domainX.com
emailY@domainY.com
emailZ@domainZ.com
...
i.e. only emails, no IDs.
I want to remove all lines from source.txt which contain emails listed in exclude.txt, preserving the ID:email pairs for the lines which are not removed.
How can I do that with Linux command-line tools (or a simple bash script, if needed)?
You can do it easily with awk:
awk -F":" 'NR==FNR{a[$1];next}(!($2 in a))' exclude.txt source.txt
Alternative with grep:
grep -v -F -f exclude.txt source.txt
Use grep with care, since grep does regex matching by default. You might also need to add the -w option (word matching) to avoid partial matches.
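Combining those flags, a safer form might look like this (a sketch; the output file name is just a placeholder):
grep -v -w -F -f exclude.txt source.txt > filtered.txt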
Related
I've got a list of strings like this in a .txt file:
asdafdgdhjhgk.de/dsafdfdfgfdggfgg - Abgelaufen seit 26.11.2076 14:08 (seit 12345 Tagen)
Now I want to cut the date out of the strings like: 26.11.2076
All this has to happen in a shell script, so I thought cut or sed would be a good idea, but I didn't find an answer on the internet.
You can use GNU grep with extended regex support via the -E, --extended-regexp flag.
$ grep -Eo "[[:digit:]]{2}\.[[:digit:]]{2}\.[[:digit:]]{4}" <<< "asdafdgdhjhgk.de/dsafdfdfgfdggfgg - Abgelaufen seit 26.11.2076 14:08 (seit 12345 Tagen)"
26.11.2076
(or) if you want to run it on a file with multiple such strings, do
$ grep -Eo "[[:digit:]]{2}\.[[:digit:]]{2}\.[[:digit:]]{4}" input-file
If the structure of the logs/lines is similar from the start up to the date, then the following could be used:
awk '{print $5}' input
Or
grep -oP '([3][0-1]|[1-2][0-9]|[0][1-9])\.([0][0-9]|[1][0-2])\.[0-9]{4}' input
Note: this may break for month of feb.
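For reference, the awk field-based approach above, run against the question's sample line (assuming whitespace-separated fields in exactly that layout), prints just the date:
$ awk '{print $5}' <<< "asdafdgdhjhgk.de/dsafdfdfgfdggfgg - Abgelaufen seit 26.11.2076 14:08 (seit 12345 Tagen)"
26.11.2076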
When it comes to text parsing, I almost always prefer Perl.
Multiple comma-separated matches per line:
perl -ne '@_=/((?:\d\d\.){2}\d{4})/g and print join(",", @_), "\n"' file
Multiple matches per line joined into a single column:
perl -ne 'while (/((?:\d\d\.){2}\d{4})/g) {print "$&\n";}' file
The first match only:
perl -ne '/((?:\d\d\.){2}\d{4})/ and print "$1\n"' file
If the dates are followed by time, add (?: \d\d:\d\d) to the regular expressions, e.g.
/((?:\d\d\.){2}\d{4})(?: \d\d:\d\d)/
This will make the matches stricter. Note, (?:) is a non-capturing group.
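For instance, fed the question's sample line via a bash here-string, the stricter pattern still captures just the date (the time only anchors the match):
$ perl -ne '/((?:\d\d\.){2}\d{4})(?: \d\d:\d\d)/ and print "$1\n"' <<< "asdafdgdhjhgk.de/dsafdfdfgfdggfgg - Abgelaufen seit 26.11.2076 14:08 (seit 12345 Tagen)"
26.11.2076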
I also like grep's -P option that enables Perl-compatible regular expressions:
grep -o -P '(?:\d\d\.){2}\d{4}' file
But some implementations may not support it:
This is highly experimental and grep -P may warn of unimplemented features.
(the man page for grep).
I know that you can use regex in grep and use patterns from a file to search another file. But, can you combine these two options?
For example, from the file where the patterns come from (with the -f option to use patterns from a file), I only want to use the first column to search the second file.
I tried this:
grep -E '^(*)\b' -f file_1 file_2 > file_3
To grep the first column from file_1 with the * wildcard, but it is not working. Any ideas?
Grep doesn't use wildcards for patterns; it uses regular expressions, so (*) makes little sense.
If you want to extract the first column from a file, use cut -f1 or awk '{print $1}' (or sed or perl or whatever to extract it), then redirect to grep using the special - (i.e. standard input) as the pattern file:
cut -f1 file_1 | grep -f- file_2 > file_3
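Note that cut -f1 assumes tab-separated columns; if file_1 is whitespace-separated, an awk-based variant of the same idea might be (adding -F if the first column holds fixed strings rather than patterns):
awk '{print $1}' file_1 | grep -F -f - file_2 > file_3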
I have a text file which looks like this:
haha1,haha2,haha3,haha4
test1,test2,test3,test4,[offline],test5
letter1,letter2,letter3,letter4
output1,output2,[offline],output3,output4
check1,[core],check2
num1,num2,num3,num4
I need to exclude all lines that contain "[ ]" and write the remaining lines (those without "[ ]") to another file.
I'm currently using this command:
grep ",[" loaded.txt | wc -l > newloaded.txt
But it's giving me an error:
grep: Invalid regular expression
Use grep -F to treat the search pattern as a fixed string. You could also replace wc -l with grep -c.
grep -cF ",[" loaded.txt > newloaded.txt
If you're curious, [ is a special character. If you don't use -F then you'll need to escape it with a backslash.
grep -c ",\[" loaded.txt > newloaded.txt
By the way, I'm not sure why you're using wc -l in the first place; from your problem description, it sounds like grep -v might be more appropriate. -v inverts grep's normal output, printing the lines that don't match.
grep -vF ",[" loaded.txt > newloaded.txt
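Applied to the sample input above (redirection dropped to show the result), the inverted match keeps only the bracket-free lines:
$ grep -vF ",[" loaded.txt
haha1,haha2,haha3,haha4
letter1,letter2,letter3,letter4
num1,num2,num3,num4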
An alternative method to Grep
It's unclear whether you want to remove lines that contain either bracket [], or only the ones where the brackets specifically surround characters. Regardless of which method you intend to use, sed can easily remove lines that fit a well-defined pattern:
To delete only lines where brackets surround characters ([...]):
sed '/\[.*\]/d' loaded.txt > newloaded.txt
Another approach might be to remove any line that contained either bracket:
sed '/\[/d;/\]/d' loaded.txt > newloaded.txt
(e.g. lines containing either [ or ] would be deleted)
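A more compact equivalent of that two-command form, using a single bracket expression that matches either bracket (just a sketch of the same idea):
sed '/[][]/d' loaded.txt > newloaded.txt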
Your grep command doesn't seem to be excluding anything. Also, why are you using wc? I thought you wanted the lines, not their count.
So if you just want the lines, as you say, that don't have [], then this should work:
grep -v "\[" loaded.txt > new.txt
You can also use awk for this:
awk -F\[ 'NF==1' file > newfile
cat newfile
haha1,haha2,haha3,haha4
letter1,letter2,letter3,letter4
num1,num2,num3,num4
Or this:
awk '!/\[/' file
I am currently trying to grep a large list of ids (~5000) against an even larger csv file (3,000,000 lines).
I want all the csv lines that contain an id from the id file.
My naive approach was:
cat the_ids.txt | while read line
do
cat huge.csv | grep $line >> output_file
done
But this takes forever!
Are there more efficient approaches to this problem?
Try
grep -f the_ids.txt huge.csv
Additionally, since your patterns seem to be fixed strings, supplying the -F option might speed up grep.
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by
newlines, any of which is to be matched. (-F is specified by
POSIX.)
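Combining the two suggestions from this answer (the output file name is taken from the question's loop):
grep -F -f the_ids.txt huge.csv > output_file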
Use grep -f for this:
grep -f the_ids.txt huge.csv > output_file
From man grep:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains zero
patterns, and therefore matches nothing. (-f is specified by POSIX.)
If you provide some sample input maybe we can even improve the grep condition a little more.
Test
$ cat ids
11
23
55
$ cat huge.csv
hello this is 11 but
nothing else here
and here 23
bye
$ grep -f ids huge.csv
hello this is 11 but
and here 23
grep -f filter.txt data.txt gets unruly when filter.txt is larger than a couple of thousand lines and hence isn't the best choice for such a situation. Even while using grep -f, we need to keep a few things in mind:
use -x option if there is a need to match the entire line in the second file
use -F if the first file has strings, not patterns
use -w to prevent partial matches while not using the -x option
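Combined, those flags give a fixed-string, whole-line filter, something like:
grep -xFf filter.txt data.txt > matching.txt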
This post has a great discussion on this topic (grep -f on large files):
Fastest way to find lines of a file from another larger file in Bash
And this post talks about grep -vf:
grep -vf too slow with large files
In summary, the best way to handle grep -f on large files is:
Matching entire line:
awk 'FNR==NR {hash[$0]; next} $0 in hash' filter.txt data.txt > matching.txt
Matching a particular field in the second file (using ',' delimiter and field 2 in this example):
awk -F, 'FNR==NR {hash[$1]; next} $2 in hash' filter.txt data.txt > matching.txt
and for grep -vf:
Matching entire line:
awk 'FNR==NR {hash[$0]; next} !($0 in hash)' filter.txt data.txt > not_matching.txt
Matching a particular field in the second file (using ',' delimiter and field 2 in this example):
awk -F, 'FNR==NR {hash[$0]; next} !($2 in hash)' filter.txt data.txt > not_matching.txt
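Mapped back onto this question's files (assuming the ID is the first comma-separated field of huge.csv), the matching form would look like:
awk -F, 'FNR==NR {hash[$0]; next} $1 in hash' the_ids.txt huge.csv > output_file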
You may get a significant search speedup with ugrep to match the strings in the_ids.txt in your large huge.csv file:
ugrep -F -f the_ids.txt huge.csv
This works with GNU grep too, but I expect ugrep to run several times faster.
I'm using the following command to extract distinct urls that contain .com extension and may contain .us or whatever country extension.
grep '\.com' source.txt -m 700 | uniq | sed -e 's/www.//' > dest.txt
The problem is that it extracts urls in the same domain, which I don't want. Ex:
abc.yahoo.com
efg.yahoo.com
I only need yahoo.com. How can I, using grep or any other command, extract distinct domain names only?
Maybe something like this?
egrep -io '[a-z0-9\-]+\.[a-z]{2,3}(\.[a-z]{2})?' source.txt
Have you tried using awk instead of sed, specifying "." as the delimiter and printing only the last two fields?
awk -F "." '{ print $(NF-1)"."$NF }'
Perhaps something like this should help:
egrep -o '[^.]*\.com' file