Using grep cmd to filter by first letter, #, and "." - linux

I have a file (testdata.txt) with many email addresses and random text.
Using the grep command:
I want to make sure they are email addresses and not text, so I want to filter them out so that only lines with "#" are included.
I also want to filter them out so that only email addresses that start with the letter A-M or a-m are shown and have a period separating the first name and last name.
Eg. john.doe#gmail.com
However, johndoe#gmail.com would be included.
Lastly, I want to get the count of all the email addresses that follow these rules.
So far I've only been able to make sure they are email addresses by doing
grep -c "#" testdata.txt
Using the grep cmd I also want to check how many email addresses have a government domain ("gov").
I wanted to do a check that it has a # sign in the line and that it also contains gov. However, I don't get the answer I want when I do any of the following.
grep -c "#\|gov" testdata.txt I get the number of lines that have a #, not the number of lines that have both # and gov
grep -c "#/|gov" testdata.txt I get 0
grep -c "#|gov" testdata.txt I get 0

Going bottom-up with your questions.
You are using grep in its basic regular expressions mode. In this mode \| means OR, | means the literal symbol |, and /| means the two literal symbols /|.
If you were looking for emails in the .gov domain, you would probably be looking for a sequence starting with # and followed by symbols that are permitted in an Internet domain name and the symbols .gov, or .GOV, or .Gov.
Borrowing from another post on this site you would end up with something like
grep -c "#[A-Za-z0-9][A-Za-z0-9.-]*\.\(gov\|Gov\|GOV\)"
skipping another 5 possible spellings for the top level domain, e.g. GoV.
However I would use the -i switch that means ignore case to simplify the expression
grep -ci "#[a-z0-9][a-z0-9.-]*\.gov"
Now you were not very clear regarding the use of dots separating parts of the name:
I also want to filter them out so that only email addresses that start with the letter A-M or a-m are shown and have a period separating the first name and last name. Eg. john.doe#gmail.com However, johndoe#gmail.com would be included.
So I will not touch this part.
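Finally, you could use a range expression to filter the addresses that start with the letters A-M. The leading group ties [a-m] to the first character of the address; without it, the expression could also start matching in the middle of a name such as zack#nasa.gov.
grep -ci "\(^\|[^a-z0-9._%+-]\)[a-m][a-z0-9._%+-]*#[a-z0-9][a-z0-9.-]*\.gov"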
Please note that this is not an implementation of the Internet Message Format RFC 5322 address specification but only an approximation used mainly for didactic purpose. Never leave not fully compliant implementations in production code.
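To try these counts on a small scale, here is a sketch with made-up sample data (the addresses and the variable names are assumptions for illustration, not taken from the question). The second pattern adds a leading group so that the A-M test applies to the first character of the address; this relies on GNU grep treating ^ as an anchor after \( in basic mode.

```shell
# Hypothetical sample data: three .gov addresses, one other address,
# and one line of plain text that merely contains "#" and "gov".
sample=$(printf '%s\n' \
  'john.doe#agency.gov' \
  'Mary.Major#Example.GOV' \
  'zack.young#agency.gov' \
  'nina.ross#gmail.com' \
  'random text with a # sign about the gov')

# Count the lines holding an address in a .gov domain, ignoring case.
gov_count=$(printf '%s\n' "$sample" |
  grep -ci '#[a-z0-9][a-z0-9.-]*\.gov')

# Same, but only addresses whose first character is in A-M.
am_gov_count=$(printf '%s\n' "$sample" |
  grep -ci '\(^\|[^a-z0-9._%+-]\)[a-m][a-z0-9._%+-]*#[a-z0-9][a-z0-9.-]*\.gov')

echo "gov: $gov_count, A-M gov: $am_gov_count"   # gov: 3, A-M gov: 2
```

The plain-text line is not counted because the # in it is not followed by a domain name.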

Related

How to pass multiple variables in grep

I have a JSON file that is downloaded using curl. It has some information about a Confluence page. I want to extract only 3 parts of that downloaded information: the page id, status and title.
I have written a bash script for this, and my problem is that I am not sure how to pass multiple variables to the grep command
id=id #hardcoded
status=status #hardcoded
echo Enter title you are looking for: #taking input from user here
read title_name
echo
echo
echo Here are details
curl -u username:password -sX GET "http://X.X.X.X:8090/rest/api/content?type=page&start=0&limit=200" | python -mjson.tool | grep -Eai "$title_name"|$id|$status"
Aside from a typo (you have an unbalanced quote; please always check the syntax for correctness before posting something), the basic idea of your approach would work, in that
grep -Eai "$title_name|$id|$status"
would select those lines which contain the content of one of the variables title_name, id or status.
However, it is a pretty fragile solution. I don't know what the actual content of those variables can be, but for instance, if title_name were set to X.Z, it would also match lines containing the string XYZ, since the dot matches any character. Similarly, if title_name contained, say, a lone [ or (, grep would complain about an unmatched bracket or parenthesis.
If you want the strings to be matched literally and not be taken as regular expressions, it is better to write those patterns into a file (one pattern per line) and use
grep -F -f patternfile
for searching. Of course, since you are using bash, you can also use process substitution if you prefer not using an explicit temporary file.
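A minimal sketch of the pattern-file approach, using a temporary file created with mktemp. The variable values and the three input lines are made up for illustration; in the real script the input would come from the curl pipeline.

```shell
# Hypothetical values standing in for the variables of the script.
title_name='X.Z'
id='id'
status='status'

# Write one literal pattern per line to a temporary file.
patternfile=$(mktemp)
printf '%s\n' "$title_name" "$id" "$status" > "$patternfile"

# -F treats every pattern as a fixed string, so the dot in X.Z is literal
# and the line containing XYZ is not selected.
matches=$(printf '%s\n' '"title": "X.Z"' '"title": "XYZ"' '"id": "123"' |
  grep -F -i -f "$patternfile")
rm -f "$patternfile"

printf '%s\n' "$matches"
```

With bash, the temporary file can be replaced by process substitution as mentioned above.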

How can I search for two different patterns in two consecutive lines in a file using SED and print next 4 lines after pattern match?

I am using SED and looking for printing the line matched by pattern and next 4 lines after the pattern match.
Below is the summary of my issue.
"myfile.txt" content has:
As specified in doc.
risk involved in astra.
I am not a schizophrenic;and neither am I.;
Be polite to every idiot you meet.;He could be your boss tomorrow.;
I called the hospital;but the line was dead.;
Yes, I’ve lost to my computer at chess.;But it turned out to be no match for me at kickboxing.;
The urologist is about to leave his office and says:; "Ok, let's piss off now.";
What's the best place to hide a body?;Page two of Google.;
You know you’re old;when your friends start having kids on purpose.;
You won’t find anything more poisonous;than a harmonious;and friendly group of females.;
Two state clerks meet in the corridor.;One asks the other,;"Couldn't sleep either?";
Why do women put on make-up and perfume?;Because they are ugly and they smell.;
Bruce Lee’s all-time favorite drink?;Wataaaaaaaah!;
Daddy what is a transvestite?;-Ask Mommy, he knows.;
That moment when you have eye contact while eating a banana.;
I'm using below command.
sed -n -e '/You/h' -e '/Two/{x;G;p}' myfile.txt
Output by my command:
You won’t find anything more poisonous;than a harmonious;and friendly group of females.;
Two state clerks meet in the corridor.;One asks the other,;"Couldn't sleep either?";
Desired output:
You won’t find anything more poisonous;than a harmonious;and friendly group of females.;
Two state clerks meet in the corridor.;One asks the other,;"Couldn't sleep either?";
Why do women put on make-up and perfume?;Because they are ugly and they smell.;
Bruce Lee’s all-time favorite drink?;Wataaaaaaaah!;
Daddy what is a transvestite?;-Ask Mommy, he knows.;
That moment when you have eye contact while eating a banana.;
With GNU sed:
sed -n '/You/h;{/Two/{x;G;};//,+4p}' myfile.txt
Output:
You won’t find anything more poisonous;than a harmonious;and friendly group of females.;
Two state clerks meet in the corridor.;One asks the other,;"Couldn't sleep either?";
Why do women put on make-up and perfume?;Because they are ugly and they smell.;
Bruce Lee’s all-time favorite drink?;Wataaaaaaaah!;
Daddy what is a transvestite?;-Ask Mommy, he knows.;
That moment when you have eye contact while eating a banana.;
Explanation:
/You/h: copies the matching line into the hold space. As there is only one hold space, h will store the last line matching You (i.e. You won’t...)
/Two/{x: when Two is found, x exchanges the pattern space with the hold space. At this point:
into pattern space: You won’t find anything more poisonous;than a harmonious;and friendly group of females.;
into hold space: Two state clerks meet in the corridor.;One asks the other,;"Couldn't sleep either?";
G: appends a newline to the pattern space and copies the hold space after the newline
//,+4p is an address range starting from // (an empty regular expression reuses the last regular expression applied, here /Two/, which now matches the pattern space holding the two lines) and extending over the next 4 lines (+4). The lines in the address range are output with p
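The hold-space steps can be watched in isolation on a tiny input. This is only a sketch: the five sample lines are invented, and it covers the h, x and G part without the GNU-specific //,+4 range.

```shell
# Hypothetical input: two lines matching You, then one matching Two.
sample=$(printf '%s\n' 'You first' 'filler' 'You second' 'Two here' 'after 1')

# h remembers the latest You line; at the Two line, x swaps it into the
# pattern space and G appends the Two line after it.
result=$(printf '%s\n' "$sample" |
  sed -n -e '/You/h' -e '/Two/{x;G;p;}')

printf '%s\n' "$result"
```

Only the last You line before Two is printed, followed by the Two line, because each h overwrites the previous content of the hold space.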
Maybe this helps you:
sed -n -e '/You/h' -e '/Two/{N;N;N;N;x;G;p}' myfile.txt
Example:
user#host:/tmp$ sed -n -e '/You/h' -e '/Two/{N;N;N;N;x;G;p}' myfile.txt
You won’t find anything more poisonous;than a harmonious;and friendly group of females.;
Two state clerks meet in the corridor.;One asks the other,;"Couldn't sleep either?";
Why do women put on make-up and perfume?;Because they are ugly and they smell.;
Bruce Lee’s all-time favorite drink?;Wataaaaaaaah!;
Daddy what is a transvestite?;-Ask Mommy, he knows.;
That moment when you have eye contact while eating a banana.;
This might work for you (GNU sed):
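sed -r 'N;/You.*\n.*Two/{:a;$!{N;s/\n/&/5;Ta};p;d};D' file
Read two lines into the pattern space; if they match, keep appending lines until four further lines have been added (if possible), then print. The count is 5 because six lines are joined by five newlines. Otherwise, delete the first line and repeat.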

Retrieve substring with grep

I've got a question concerning grep.
I have some address data in an asc file as simple text. The first 30 characters are for the name. If the name is shorter than the 30 characters whitespaces fill it up to ensure its length is 30. At position 31 is a whitespace to separate the name from the next data which is the address. After the address is also a whitespace and some other data. My plan is to retrieve the address, which starts at index 32 and continues to index 50. I mostly got only nothing or the data beginning at the start of the line. I tried several methods such as
grep -iE '^.{30}' '.{8}$' myfile.asc
or
grep -o -P '^.{31,34}' myfile.asc
I can't search for a certain pattern since every set of data is different except the whitespaces which separate the data. Is it possible to retrieve my substring like that without relying on other methods through a pipe? I prefer to use grep since performance is an issue.
Why don't you use cut instead of grep if you're dealing with fixed positions?
cut -c 32-50 myfile.asc
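As a quick check of the column arithmetic, here is a sketch that builds one fixed-width record with printf and extracts characters 32 to 50. The name, address and trailing field are made-up values.

```shell
# Hypothetical record: name padded to 30 characters, one separating space,
# address padded to 19 characters, one space, then other data.
record=$(printf '%-30s %-19s %s' 'John Doe' '12 Main Street' 'other-data')

# Characters 32 to 50 are the address field, padding included.
address=$(printf '%s\n' "$record" | cut -c 32-50)

printf '[%s]\n' "$address"
```

The brackets in the output make the trailing padding of the address field visible.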

Grep filtering of the dictionary

I'm having a hard time getting a grasp of using grep for a class I am in, and was hoping someone could help guide me in this assignment. The assignment is as follows.
Using grep, print all 5 letter lower case words from the linux dictionary that have a single letter duplicated one time (aabbe or ababe are not valid because both a and b are in the word twice). Next to that, print the duplicated letter followed by the non-duplicated letters in alphabetically ascending order.
The teacher noted that we will need to use several (6) grep statements (piping the results to the next grep) and a sed statement (String Editor) to reformat the final set of words, then pipe them into a read loop where you tear apart the three non-dup letters and sort them.
Sample Output:
aback a bck
abaft a bft
abase a bes
abash a bhs
abask a bks
abate a bet
I haven't figured out how to do more than printing 5 character words,
grep "^.....$" /usr/share/dict/words |
Didn't check it thoroughly, but this might work
tr '[:upper:]' '[:lower:]' | egrep -x '[a-z]{5}' | sed -r 's/^(.*)(.)(.*)\2(.*)$/\2 \1\3\4/' | grep " " | egrep -v "(.).*\1"
But do it your own way, because someone might see it here.
All in one sed
sed -n '
# keep only 5 letter words
/^[a-zA-Z]\{5\}$/ {
# lowercase the letters
y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
# skip words without a repeated letter
/\(.\).*\1/ !b
# skip words with a letter occurring three or more times
/\(.\).*\1.*\1/ b
# skip words with two different repeated letters (aabb, abab, abba)
/\(.\).*\1.*\(.\).*\2/ b
/\(.\).*\(.\).*\1.*\2/ b
/\(.\).*\(.\).*\2.*\1/ b
# extract the pair and the singles
s/\(.*\)\(.\)\(.*\)\2\(.*\)/a & \2:\1\3\4/
# sort the singles: rotate the alphabet 26 times, moving each a to the end
:sort
s/:\([^a]*\)a\(.*\)$/:\1\2a/
y/abcdefghijklmnopqrstuvwxyz/zabcdefghijklmnopqrstuvwxy/
/^a/ !b sort
# clean and print
s/..//
s/:/ /p
}' YourFile
This is POSIX sed, so use --posix with GNU sed.
The first bit, obviously, is to use grep to get it down to just the words that have a single duplication in. I will give you some clues on how to do that.
The key is to use backreferences, which allow you to specify that something that matched a previous expression should appear again. So if you write
grep -E "^(.)...\1...\1$"
then you'll get all the words that have the starting letter reappearing in fifth and ninth positions. The point of the brackets is to allow you to refer later to whatever matched the thing in brackets; you do that with a \1 (to match the thing in the first lot of brackets).
You want to say that there should be a duplicate anywhere in the word, which is slightly more complicated, but not much. You want a character in brackets, then any number of characters, then the repeated character (with no ^ or $ specified).
That will also include ones where there are two or more duplicates, so the next stage is to filter them out. You can do that by a grep -v invocation. Once you've got your list of 5-character words that have at least one duplicate, pipe them through a grep -v call that strips out anything with two (or more) duplicates in. That'll have a (.), and another (.), and a \1, and a \2, and these might appear in several different orders.
You'll also need to strip out anything that has a (.) and a \1 and another \1, since that will have a letter with three occurrences.
That should be enough to get you started, at any rate.
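Put together, the hints above could be sketched as the following pipeline. The word list is a small made-up sample rather than /usr/share/dict/words, and back-references with -E are assumed to be available (they are in GNU grep).

```shell
# Hypothetical sample instead of the real dictionary.
words=$(printf '%s\n' aback abaft llama added mamma hello world Abase)

singles=$(printf '%s\n' "$words" |
  grep -E '^[a-z]{5}$' |              # 5 lower case letters only
  grep -E '(.).*\1' |                 # at least one repeated letter
  grep -Ev '(.).*\1.*\1' |            # no letter three or more times
  grep -Ev '(.).*\1.*(.).*\2' |       # no two pairs in aabb order
  grep -Ev '(.).*(.).*\1.*\2' |       # no two pairs in abab order
  grep -Ev '(.).*(.).*\2.*\1')        # no two pairs in abba order

printf '%s\n' "$singles"
```

From this sample, llama is dropped for having two pairs, added and mamma for having a letter three times, world for having no repeat, and Abase for its capital letter.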
Your next step should be to find the 5-letter words containing a duplicate letter. To do that, you will need to use back-references. Example:
grep '[a-z]*\([a-z]\)[a-z]*\1[a-z]*'
The \1 picks up the contents of the first parenthesized group and expects to match that group again. In this case, it matches a single letter. See: http://www.thegeekstuff.com/2011/01/advanced-regular-expressions-in-grep-command-with-10-examples--part-ii/ for more description of this capability.
You will next need to filter out those cases that have either a letter repeated 3 times or a word with 2 letters repeated. You will need to use the same sort of back-reference trick, but you can use grep -v to filter the results.
sed can be used for the final display. Grep will merely allow you to construct the correct lines to consider.
Note that the dictionary contains capital letters and non-letter characters, plus accented characters used in Southern Europe, say "è".
If you want to distinguish "A" and "a", that happens automatically; on the other hand, if "A" and "a" count as the same letter, you must use the -i option in ALL grep invocations to instruct grep to ignore case.
Next, you always want to pass the -E option, to avoid the so-called backslashitis gravis in the regexps that you pass to grep.
Further, if you want to exclude the lines matching a regexp from the output, the correct option is -v.
Eventually, if you want to specify many different regexes to a single grep invocation, this is the way (just an example btw)
grep -E -i -v -e 'regexp_1' -e 'regexp_2' ... -e 'regexp_n'
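For instance, with two made-up expressions and four made-up words, the combination of options behaves like this (a sketch only, not part of the assignment):

```shell
# A line is dropped if it matches ANY of the -e expressions, ignoring case.
kept=$(printf '%s\n' alpha BETA gamma delta |
  grep -E -i -v -e '^b' -e 'mm')

printf '%s\n' "$kept"
```

BETA is dropped by the first expression thanks to -i, gamma by the second, so alpha and delta remain.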
The preliminaries are behind us; let's move on, using the answer from chiastic-security as a reference to understand the proceedings.
There are only these possibilities to find a duplicate in a 5 character string
(.)\1
(.).\1
(.)..\1
(.)...\1
grep -E -i -e 'regexp_1' ...
Now you have all the doubles, but this doesn't exclude triples etc., which are identified by the following patterns (Edit: added a couple of additional patterns matching triples)
(.)\1\1
(.).\1\1
(.)\1.\1
(.)..\1\1
(.).\1.\1
(.)\1..\1
(.)\1\1\1
(.).\1\1\1
(.)\1\1\1\1
you want to exclude these patterns, so grep -E -i -v -e 'regexp_1' ...
At this point, you have a list of words with at least one pair of the same character and no triples etc., and you want to drop the double doubles. These are the regexes that match double doubles
(.)\1(.)\2
(.).\1(.)\2
(.)\1.(.)\2
(.)\1(.).\2
(.)(.)\1\2
(.)(.)\2\1
(.).(.)\1\2
(.).(.)\2\1
(.)(.).\1\2
(.)(.).\2\1
(.)(.)\1.\2
(.)(.)\2.\1
and you want to exclude the lines with these patterns, so it's grep -E -i -v ...
A final hint: to play with my answer, copy a few hundred lines of the dictionary into your working directory, head -n 3000 /usr/share/dict/words | tail -n 300 > ./300words, so that you can really understand what you're doing without being overwhelmed by the volume of the output.
And yes, this is not a complete answer, but it is maybe too much, isn't it?

how to find all strings matching a pattern in linux

Can one suggest a grep command for finding a string in a list of files by providing a part of the string?
I have a list of files in a directory which contains email addresses. I want to extract all email addresses ending with a particular domain name. for example, i want to get a list of all emails ending with "#google.com" in a file.
Directory contains N number of files. The data in each file is a single line seperated with a comma. I have tried so many options with grep command and none worked.
Thanks,
You can try something like:
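grep -E -o "\b[a-zA-Z0-9._-]+#google\.com\b" *.files
Basically, include in the character class [a-zA-Z0-9._-] the characters that are acceptable in an email address. Keep the hyphen last in the class; written as .-_ it would be read as a range from . to _ instead of three separate characters.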
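A small sketch of how -o behaves on a single comma-separated line, as described in the question. The addresses are made up, and \b is assumed to be supported (it is in GNU grep).

```shell
# Hypothetical single-line, comma-separated input.
line='a.b#google.com,x#yahoo.com,c-d_e#google.com'

# -o prints each match on its own line instead of the whole input line.
emails=$(printf '%s\n' "$line" |
  grep -E -o '\b[a-zA-Z0-9._-]+#google\.com\b')

printf '%s\n' "$emails"
```

Each matching address comes out on a separate line, and the yahoo.com address is skipped.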
