How to remove lines contained in file 1 from file 2 if in file 2 they are prefixed? - linux

I have the following situation:
source.txt
ID1:email1@domain1.com
ID2:email2@domain2.com
ID3:email3@domain3.com
...
IDs are numeric strings, e.g. 1234, 23412, 897... (one or more digits).
exclude.txt
emailX@domainX.com
emailY@domainY.com
emailZ@domainZ.com
...
i.e. only emails, no IDs.
I want to remove all lines from source.txt which contain emails listed in exclude.txt, preserving the ID:email pairs for the lines which are not removed.
How can I do that with linux command line tools (or simple bash script if needed)?

You can do it easily with awk:
awk -F":" 'NR==FNR{a[$1];next}(!($2 in a))' exclude.txt source.txt
Alternative with grep:
grep -v -F -f exclude.txt source.txt
Use grep with care: without -F it does regex matching, and even with -F it matches substrings anywhere in the line, so you may also need to add the -w (word match) option.
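A minimal sketch with made-up sample data (the IDs and addresses below are hypothetical) showing both approaches side by side:
$ cat source.txt
1234:alice@example.com
23412:bob@example.org
897:carol@example.net
$ cat exclude.txt
bob@example.org
$ awk -F":" 'NR==FNR{a[$1];next} !($2 in a)' exclude.txt source.txt   # drop lines whose email field is in exclude.txt
1234:alice@example.com
897:carol@example.net
$ grep -vFf exclude.txt source.txt                                    # same result with fixed-string, inverted grep
1234:alice@example.com
897:carol@example.net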

How to cut the date out of a string in Shell?

I have a list of strings like this in a .txt file:
asdafdgdhjhgk.de/dsafdfdfgfdggfgg - Abgelaufen seit 26.11.2076 14:08 (seit 12345 Tagen)
Now I want to cut the date out of the strings like: 26.11.2076
All this has to happen in a shell script, so I thought cut or sed would be a good idea, but I didn't find an answer on the internet.
You can use GNU grep with extended regex support via the -E, --extended-regexp flag.
$ grep -Eo "[[:digit:]]{2}\.[[:digit:]]{2}\.[[:digit:]]{4}" <<< "asdafdgdhjhgk.de/dsafdfdfgfdggfgg - Abgelaufen seit 26.11.2076 14:08 (seit 12345 Tagen)"
26.11.2076
(or) if you want to run it on a file with multiple such strings, do
$ grep -Eo "[[:digit:]]{2}\.[[:digit:]]{2}\.[[:digit:]]{4}" input-file
If the structure of the log lines is the same from the start up to the date, then the following could be used:
awk '{print $5}' input
Or
grep -oP '([3][0-1]|[1-2][0-9]|[0][1-9])\.([0][0-9]|[1][0-2])\.[0-9]{4}' input
Note: this may still accept invalid dates, e.g. 30.02 (February has at most 29 days).
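For the sample line above, the field-based extraction would look like this (assuming whitespace-separated fields with the date always in the fifth one):
$ awk '{print $5}' <<< "asdafdgdhjhgk.de/dsafdfdfgfdggfgg - Abgelaufen seit 26.11.2076 14:08 (seit 12345 Tagen)"
26.11.2076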
When it comes to text parsing, I almost always prefer Perl.
Multiple comma-separated matches per line:
perl -ne '@_ = /((?:\d\d\.){2}\d{4})/g and print join(",", @_), "\n"' file
Multiple matches per line joined into a single column:
perl -ne 'while (/((?:\d\d\.){2}\d{4})/g) {print "$&\n";}' file
The first matches:
perl -ne '/((?:\d\d\.){2}\d{4})/ and print "$1\n"' file
If the dates are followed by time, add (?: \d\d:\d\d) to the regular expressions, e.g.
/((?:\d\d\.){2}\d{4})(?: \d\d:\d\d)/
This will make the matches stricter. Note, (?:) is a non-capturing group.
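For example, the "first match per line" one-liner with the time requirement added would become (a sketch; the time must follow the date but is not printed):
perl -ne '/((?:\d\d\.){2}\d{4})(?: \d\d:\d\d)/ and print "$1\n"' file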
I also like grep's -P option that enables Perl-compatible regular expressions:
grep -o -P '(?:\d\d\.){2}\d{4}' file
But some implementations may not support it:
This is highly experimental and grep -P may warn of unimplemented features.
(the man page for grep).

Use regex in grep while using two files

I know that you can use regex in grep and use patterns from a file to search another file. But, can you combine these two options?
For example, from the file the patterns come from (read with the -f option to use patterns from a file), I only want to use the first column to search the second file.
I tried this:
grep -E '^(*)\b' -f file_1 file_2 > file_3
To grep the first column from file_1 with the * wildcard, but it is not working. Any ideas?
Grep doesn't use wildcards for patterns; it uses regular expressions, so (*) makes little sense.
If you want to extract the first column from a file, use cut -f1 or awk '{print $1}' (or sed or perl or whatever), then pipe the result to grep, using the special - (i.e. standard input) as the pattern file:
cut -f1 file_1 | grep -f- file_2 > file_3
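Note that cut assumes tab-separated columns by default; if file_1 is whitespace-separated, awk may be the safer extractor, and adding -F keeps the extracted values from being treated as regular expressions (a sketch, assuming the first column holds plain strings):
awk '{print $1}' file_1 | grep -F -f - file_2 > file_3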

grep: Invalid regular expression

I have a text file which looks like this:
haha1,haha2,haha3,haha4
test1,test2,test3,test4,[offline],test5
letter1,letter2,letter3,letter4
output1,output2,[offline],output3,output4
check1,[core],check2
num1,num2,num3,num4
I need to exclude all lines that contain "[ ]" and output the remaining lines to another file.
I'm currently using this command:
grep ",[" loaded.txt | wc -l > newloaded.txt
But it's giving me an error:
grep: Invalid regular expression
Use grep -F to treat the search pattern as a fixed string. You could also replace wc -l with grep -c.
grep -cF ",[" loaded.txt > newloaded.txt
If you're curious, [ is a special character. If you don't use -F then you'll need to escape it with a backslash.
grep -c ",\[" loaded.txt > newloaded.txt
By the way, I'm not sure why you're using wc -l anyway. From your problem description, it sounds like grep -v might be more appropriate: -v inverts grep's normal output, printing lines that don't match.
grep -vF ",[" loaded.txt > newloaded.txt
An alternative to grep
It's unclear whether you want to remove lines that contain either bracket ([ or ]), or only those where the brackets specifically surround characters. Regardless of which you intend, sed can easily remove lines that fit a definite pattern:
To delete only lines that contain both brackets surrounding characters ([...]):
sed '/\[.*\]/d' loaded.txt > newloaded.txt
Another approach might be to remove any line that contains either bracket:
sed '/\[/d;/\]/d' loaded.txt > newloaded.txt
(i.e. lines containing either [ or ] would be deleted)
Your grep command doesn't seem to be excluding anything. Also, why are you using wc? I thought you wanted the lines, not their count.
So if you just want the lines, as you say, that don't have [], then this should work:
grep -v "\[" loaded.txt > new.txt
You can also use awk for this:
awk -F\[ 'NF==1' file > newfile
cat newfile
haha1,haha2,haha3,haha4
letter1,letter2,letter3,letter4
num1,num2,num3,num4
Or this:
awk '!/\[/' file
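On this particular input all of these variants produce the same result; a quick way to convince yourself is to diff their outputs using process substitution:
$ diff <(grep -vF '[' loaded.txt) <(awk '!/\[/' loaded.txt) && echo identical
identical
$ diff <(grep -vF '[' loaded.txt) <(sed '/\[/d;/\]/d' loaded.txt) && echo identical
identical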

grep a large list against a large file

I am currently trying to grep a large list of ids (~5000) against an even larger csv file (3.000.000 lines).
I want all the csv lines, that contain an id from the id file.
My naive approach was:
cat the_ids.txt | while read line
do
cat huge.csv | grep $line >> output_file
done
But this takes forever!
Are there more efficient approaches to this problem?
Try
grep -f the_ids.txt huge.csv
Additionally, since your patterns seem to be fixed strings, supplying the -F option might speed up grep.
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by
newlines, any of which is to be matched. (-F is specified by
POSIX.)
Use grep -f for this:
grep -f the_ids.txt huge.csv > output_file
From man grep:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains zero
patterns, and therefore matches nothing. (-f is specified by POSIX.)
If you provide some sample input maybe we can even improve the grep condition a little more.
Test
$ cat ids
11
23
55
$ cat huge.csv
hello this is 11 but
nothing else here
and here 23
bye
$ grep -f ids huge.csv
hello this is 11 but
and here 23
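Building on that test data, a hypothetical extra line shows why -w can matter when the patterns are numeric IDs:
$ echo 'value 211 here' >> huge.csv
$ grep -f ids huge.csv        # "11" also matches inside "211"
hello this is 11 but
and here 23
value 211 here
$ grep -wf ids huge.csv       # -w requires whole-word matches
hello this is 11 but
and here 23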
grep -f filter.txt data.txt gets unwieldy when filter.txt is larger than a couple of thousand lines and hence isn't the best choice for such a situation. Even when using grep -f, we need to keep a few things in mind (a combined sketch follows this list):
use -x option if there is a need to match the entire line in the second file
use -F if the first file has strings, not patterns
use -w to prevent partial matches while not using the -x option
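Combining those recommendations for fixed strings with whole-line matching would look like this (for the original question's numeric IDs, -F plus -w is the more likely combination, since -x would require an entire CSV line to equal an ID):
grep -Fxf filter.txt data.txt > matching.txt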
This post has a great discussion on this topic (grep -f on large files):
Fastest way to find lines of a file from another larger file in Bash
And this post talks about grep -vf:
grep -vf too slow with large files
In summary, the best way to handle grep -f on large files is:
Matching entire line:
awk 'FNR==NR {hash[$0]; next} $0 in hash' filter.txt data.txt > matching.txt
Matching a particular field in the second file (using ',' delimiter and field 2 in this example):
awk -F, 'FNR==NR {hash[$1]; next} $2 in hash' filter.txt data.txt > matching.txt
and for grep -vf:
Matching entire line:
awk 'FNR==NR {hash[$0]; next} !($0 in hash)' filter.txt data.txt > not_matching.txt
Matching a particular field in the second file (using ',' delimiter and field 2 in this example):
awk -F, 'FNR==NR {hash[$1]; next} !($2 in hash)' filter.txt data.txt > not_matching.txt
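A tiny made-up example of the field-matching variant (matching on field 2 of a CSV):
$ cat filter.txt
alice
bob
$ cat data.txt
1,alice,admin
2,carol,user
3,bob,user
$ awk -F, 'FNR==NR {hash[$1]; next} $2 in hash' filter.txt data.txt
1,alice,admin
3,bob,user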
You may get a significant search speedup with ugrep to match the strings in the_ids.txt in your large huge.csv file:
ugrep -F -f the_ids.txt huge.csv
This works with GNU grep too, but I expect ugrep to run several times faster.

How to extract distinct part of a string from a file in linux

I'm using the following command to extract distinct URLs that contain the .com extension and may also contain .us or whatever country extension.
grep '\.com' source.txt -m 700 | uniq | sed -e 's/www.//' > dest.txt
The problem is that it extracts URLs in the same domain, which I don't want. Example:
abc.yahoo.com
efg.yahoo.com
I only need yahoo.com. How can I, using grep or any other command, extract only distinct domain names?
Maybe something like this?
egrep -io '[a-z0-9\-]+\.[a-z]{2,3}(\.[a-z]{2})?' source.txt
Have you tried using awk instead of sed, specifying "." as the delimiter and printing out only the last two fields?
awk -F "." '{ print $(NF-1)"."$NF }'
Perhaps something like this should help:
egrep -o '[^.]*\.com' file
