How can I search for two different patterns in two consecutive lines in a file using SED and print next 4 lines after pattern match? - linux

I am using sed and want to print the line matched by a pattern and the next 4 lines after the pattern match.
Below is a summary of my issue.
"myfile.txt" content has:
As specified in doc.
risk involved in astra.
I am not a schizophrenic;and neither am I.;
Be polite to every idiot you meet.;He could be your boss tomorrow.;
I called the hospital;but the line was dead.;
Yes, I’ve lost to my computer at chess.;But it turned out to be no match for me at kickboxing.;
The urologist is about to leave his office and says:; "Ok, let's piss off now.";
What's the best place to hide a body?;Page two of Google.;
You know you’re old;when your friends start having kids on purpose.;
You won’t find anything more poisonous;than a harmonious;and friendly group of females.;
Two state clerks meet in the corridor.;One asks the other,;"Couldn't sleep either?";
Why do women put on make-up and perfume?;Because they are ugly and they smell.;
Bruce Lee’s all-time favorite drink?;Wataaaaaaaah!;
Daddy what is a transvestite?;-Ask Mommy, he knows.;
That moment when you have eye contact while eating a banana.;
I'm using the command below.
sed -n -e '/You/h' -e '/Two/{x;G;p}' myfile.txt
Output by my command:
You won’t find anything more poisonous;than a harmonious;and friendly group of females.;
Two state clerks meet in the corridor.;One asks the other,;"Couldn't sleep either?";
Desired output:
You won’t find anything more poisonous;than a harmonious;and friendly group of females.;
Two state clerks meet in the corridor.;One asks the other,;"Couldn't sleep either?";
Why do women put on make-up and perfume?;Because they are ugly and they smell.;
Bruce Lee’s all-time favorite drink?;Wataaaaaaaah!;
Daddy what is a transvestite?;-Ask Mommy, he knows.;
That moment when you have eye contact while eating a banana.;

With GNU sed:
sed -n '/You/h;{/Two/{x;G;};//,+4p}' myfile.txt
Output:
You won’t find anything more poisonous;than a harmonious;and friendly group of females.;
Two state clerks meet in the corridor.;One asks the other,;"Couldn't sleep either?";
Why do women put on make-up and perfume?;Because they are ugly and they smell.;
Bruce Lee’s all-time favorite drink?;Wataaaaaaaah!;
Daddy what is a transvestite?;-Ask Mommy, he knows.;
That moment when you have eye contact while eating a banana.;
Explanation:
/You/h: copies the matching line into the hold space. As there is only one hold space, h stores the last line matching You (i.e. You won’t...)
/Two/{x: when Two is found, x exchanges the pattern space with the hold space. At this point:
in the pattern space: You won’t find anything more poisonous;than a harmonious;and friendly group of females.;
in the hold space: Two state clerks meet in the corridor.;One asks the other,;"Couldn't sleep either?";
G: appends a newline to the pattern space, then copies the hold space after that newline
//,+4p: an address range starting at // (the empty regex reuses the last regular expression, i.e. it matches on the line where Two matched and the exchange happened), extending over the next 4 lines (+4, a GNU extension). The whole range is printed with p.
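For comparison, the same "two patterns on consecutive lines, then four more lines" logic can be expressed without hold-space juggling. A minimal awk sketch, only tested mentally against the sample file above:
awk '
  p ~ /You/ && /Two/ { print p; print; n = 4; next }   # a You line immediately followed by a Two line
  n > 0              { print; n-- }                     # print the next 4 lines after the match
                     { p = $0 }                         # remember the previous line
' myfile.txt
On myfile.txt this should produce the same six lines as the desired output.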

Maybe this helps you:
sed -n -e '/You/h' -e '/Two/{N;N;N;N;x;G;p}' myfile.txt
Here each N appends the next input line to the pattern space, so after the four N commands the pattern space holds the Two line plus the following 4 lines; x and G then put the saved You line in front before p prints everything.
Example:
user#host:/tmp$ sed -n -e '/You/h' -e '/Two/{N;N;N;N;x;G;p}' myfile.txt
You won’t find anything more poisonous;than a harmonious;and friendly group of females.;
Two state clerks meet in the corridor.;One asks the other,;"Couldn't sleep either?";
Why do women put on make-up and perfume?;Because they are ugly and they smell.;
Bruce Lee’s all-time favorite drink?;Wataaaaaaaah!;
Daddy what is a transvestite?;-Ask Mommy, he knows.;
That moment when you have eye contact while eating a banana.;

This might work for you (GNU sed):
sed -r 'N;/You.*\n.*Two/{:a;$!{N;s/\n/&/5;Ta};p;d};D' file
Read two lines into the pattern space; if the two patterns match on consecutive lines, keep appending input lines until four further lines have been gathered (or the end of the file is reached), print the lot, and delete it. Otherwise, delete the first line and repeat.

Related

How would I use grep to find all the words in a file that start with a-f

Here is the code to look for all the words in a file starting with a. What about words that start with a-f?
grep -E '\ba' data.txt
Since the question may be quite ambiguous, I will try to add all the possible solutions I can find for it. Let's start with a sample file, in my case containing the following words so I can explain myself better:
ascii
affair
a-f-f-i-l-i-a-t-e
barcode
break
charset
character
delete
draft
execute
example
force
failure
grip
group
halt
held
inode
instance
joke
jekyll
That said, if you want to grep all the words starting with the letters that go from a to f (that is, every word starting with a, b, c, d, e or f), you would use:
grep -E "^[a-f]" data.txt
And the output you would see is this one:
ascii
affair
barcode
break
charset
character
delete
draft
execute
example
force
failure
Now, if for some reason you want to filter the words starting literally with the string a-f, then you would use this:
grep -E "^a-f" data.txt
And you would see that the output would look like this:
a-f-f-i-l-i-a-t-e
In this case the key is not so much grep knowledge as regex knowledge.
Hope you find it useful.
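If you would rather keep the word-boundary style of the original grep -E '\ba' (useful when a line can contain several words, not just one), the same character class works there too. A small sketch, assuming GNU grep for \b and -o:
grep -E '\b[a-f]' data.txt
grep -oE '\b[a-f][[:alnum:]-]*' data.txt
The first prints every line containing a word that starts with a letter from a to f; the second prints just those words, one per match.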

Script to extract strings between two strings in linux

I am trying to write a little script that will let me "org-capture" articles from my rss-reader (newsboat). So my scenario is this: I will pipe the article to a script; however, the article gets piped in one line, like this:
Title: ABC boss quits over Australian political interference claims Author: Date: Thu, 27 Sep 2018 09:39:16 +0200 Link: https://www.bbc.co.uk/news/world-australia-45661871 The broadcaster's chair quits amid allegations the government leaned on him to dismiss two journalists.
So what I need to do is to consistently store the link and the title in a variable and then call a command with these variables (emacsclient org-protocol:/ ...)
So basically I need this:
TITLE="ABC boss quits over Australian political interference claims"
URL="https://www.bbc.co.uk/news/world-australia-45661871"
I considered using awk or sed, but they work best for separate lines. So, I thought maybe split the single line at 'Title:', 'Author:', 'Date:' and 'Link:' and then extract with awk/sed.
I found similar use cases and questions here, but not quite the same. I want a pretty minimal script without necessarily using python.
Am I on the right track?
Thanks for helping out.
With GNU awk for the 3rd arg to match():
$ cat tst.awk
match($0,/^Title:\s*(.*)\s+Author:\s*(.*)\s+Date:\s*(.*)\s+Link:\s*(\S+)\s+(.*)/,a) {
printf "TITLE=\"%s\"\n", a[1]
printf "URL=\"%s\"\n", a[4]
}
$ awk -f tst.awk file
TITLE="ABC boss quits over Australian political interference claims"
URL="https://www.bbc.co.uk/news/world-australia-45661871"
I showed how to save all the other fields too so you can also do anything else you need to with your input.
This might work for you (GNU sed):
sed -r 's/^Title: (.*) Author:.* Link: (\S+).*/TITLE="\1"\nURL="\2"/' file
Use pattern matching to extract the required fields. The first may contain spaces, so match up to the key Author:. The second is a string of non-space characters following the key Link:.
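A hedged sketch of how either answer could be wired into the shell variables the question asks for. It assumes the article arrives as a single line on standard input and that the field order is always Title ... Author ... Link, as in the example; the variable names are only illustrative:
#!/bin/sh
# read the one-line article from stdin (e.g. piped from newsboat)
article=$(cat)
TITLE=$(printf '%s\n' "$article" | sed -E 's/^Title: (.*) Author:.*$/\1/')
URL=$(printf '%s\n' "$article" | sed -E 's/.* Link: ([^ ]+).*$/\1/')
printf 'TITLE="%s"\nURL="%s"\n' "$TITLE" "$URL"
# the emacsclient org-protocol call from the question would then use "$TITLE" and "$URL"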

Using grep cmd to filter by first letter, #, and "."

I have a file (testdata.txt) with many email addresses and random text.
Using the grep command:
I want to make sure they are email addresses and not text, so I want to filter them out so that only lines with "#" are included.
I also want to filter them out so that only email addresses that start with the letter A-M or a-m are shown and have a period separating the first name and last name.
Eg. john.doe#gmail.com
However, johndoe#gmail.com would be included.
Lastly, I want to get the count of all the email addresses that follow these rules.
So far I've only been able to make sure they are email addresses by doing:
grep -c "#" testdata.txt
Using the grep cmd I also want to check how many email addresses have a government domain ("gov").
I wanted to do a check that it has a # sign in the line and that it also contains gov. However, I don't get the answer I want when I do any of the following.
grep -c "#\|gov" testdata.txt I get the amount of lines that have a # not # and gov
grep -c "#/|gov" testdata.txt I get 0
grep -c "#|gov" testdata.txt I get 0
Going bottom-up with your questions.
You are using grep in its basic regular expressions (BRE) mode. In this mode \| means OR, | means the literal symbol |, and /| means the two literal symbols /|.
If you were looking for emails in the .gov domain, you would probably be looking for a sequence starting with # and followed by characters that are permitted in an Internet domain name, ending with .gov, .GOV, or .Gov.
Borrowing from another post on this site you would end up with something like
grep -c "#[A-Za-z0-9][A-Za-z0-9.-]*\.\(gov\|Gov\|GOV\)"
skipping another 5 possible spellings for the top level domain, e.g. GoV.
However, I would use the -i switch (ignore case) to simplify the expression:
grep -ci "#[a-z0-9][a-z0-9.-]*\.gov"
Now you were not very clear regarding the use of dots separating parts of the name:
I also want to filter them out so that only email addresses that start with the letter A-M or a-m are shown and have a period separating the first name and last name. Eg. john.doe#gmail.com However, johndoe#gmail.com would be included.
So I will not touch this part.
Finally, you could use a range expression to filter the addresses that start with the letters A-M:
grep -ci "[a-m][a-z0-9._%+-]*#[a-z0-9][a-z0-9.-]*\.gov"
Please note that this is not an implementation of the Internet Message Format (RFC 5322) address specification, but only an approximation used mainly for didactic purposes. Never leave such not-fully-compliant implementations in production code.
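As a worked example against the file from the question (still writing # where an email address would normally have an at-sign): the first command is the line count the question already computes, and the second is only an assumption about what "a period separating the first name and last name" might mean, namely a literal dot somewhere before the #:
grep -c "#" testdata.txt
grep -ci "[a-m][a-z0-9_%+-]*\.[a-z0-9._%+-]*#" testdata.txt
The second counts addresses whose local part starts with a-m (or A-M, thanks to -i) and contains at least one dot before the #; combine it with the .gov pattern above if both conditions are needed at once.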

Grep filtering of the dictionary

I'm having a hard time getting a grasp of using grep for a class I am in, and I was hoping someone could help guide me in this assignment. The assignment is as follows.
Using grep, print all 5-letter lower-case words from the Linux dictionary that have a single letter duplicated one time (aabbe or ababe not valid because both a and b are in the word twice). Next to that, print the duplicated letter followed by the non-duplicated letters in alphabetically ascending order.
The Teacher noted that we will need to use several (6) grep statements (piping the results to the next grep) and a sed statement (String Editor) to reformat the final set of words, then pipe them into a read loop where you tear apart the three non-dup letters and sort them.
Sample Output:
aback a bck
abaft a bft
abase a bes
abash a bhs
abask a bks
abate a bet
I haven't figured out how to do more than printing 5-character words:
grep "^.....$" /usr/share/dict/words |
Didn't check it thoroughly, but this might work
tr '[:upper:]' '[:lower:]' < /usr/share/dict/words | egrep -x '[a-z]{5}' | sed -r 's/^(.*)(.)(.*)\2(.*)$/\2 \1\3\4/' | grep " " | egrep -v "(.).*\1"
But do it your own way, because someone might see it here.
All in one sed
sed -n '
# filter 5 letter word
/^[a-zA-Z]\{5\}$/ {
# lower letters
y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
# filter non single double letter
/\(.\).*\1/ !b
/\(.\).*\(.\).*\1.*\1/ b
/\(.\).*\(.\).*\1.*\2/ b
/\(.\).*\(.\).*\2.*\1/ b
# extract peer and single
s/\(.\)*\(.\)\(.*\)\2\(.*\)/a & \2:\1\3\4/
# sort singles
:sort
s/:\([^a]*\)a\(.*\)$/:\1\2a/
y/abcdefghijklmnopqrstuvwxyz/zabcdefghijklmnopqrstuvwxy/
/^a/ !b sort
# clean and print
s/..//
s/:/ /p
}' YourFile
POSIX sed, so use --posix with GNU sed.
The first bit, obviously, is to use grep to get it down to just the words that have a single duplicated letter. I will give you some clues on how to do that.
The key is to use backreferences, which allow you to specify that something that matched a previous expression should appear again. So if you write
grep -E "^(.)...\1...\1$"
then you'll get all the words that have the starting letter reappearing in fifth and ninth positions. The point of the brackets is to allow you to refer later to whatever matched the thing in brackets; you do that with a \1 (to match the thing in the first lot of brackets).
You want to say that there should be a duplicate anywhere in the word, which is slightly more complicated, but not much. You want a character in brackets, then any number of characters, then the repeated character (with no ^ or $ specified).
That will also include ones where there are two or more duplicates, so the next stage is to filter them out. You can do that by a grep -v invocation. Once you've got your list of 5-character words that have at least one duplicate, pipe them through a grep -v call that strips out anything with two (or more) duplicates in. That'll have a (.), and another (.), and a \1, and a \2, and these might appear in several different orders.
You'll also need to strip out anything that has a (.) and a \1 and another \1, since that will have a letter with three occurrences.
That should be enough to get you started, at any rate.
Your next step should be to find the 5-letter words containing a duplicate letter. To do that, you will need to use back-references. Example:
grep "[a-z]*\([a-z]\)[a-z]*\$1[a-z]*"
The \1 back-reference picks up the contents of the first parenthesized group and expects to match that group again. In this case, it matches a single letter. See: http://www.thegeekstuff.com/2011/01/advanced-regular-expressions-in-grep-command-with-10-examples--part-ii/ for more description of this capability.
You will next need to filter out those cases that have either a letter repeated 3 times or a word with 2 letters repeated. You will need to use the same sort of back-reference trick, but you can use grep -v to filter the results.
sed can be used for the final display. Grep will merely allow you to construct the correct lines to consider.
Note that the dictionary contains capital letters and also non-letter characters, plus accented characters used in Southern Europe, say "è".
If you want to distinguish "A" from "a", that is automatic; on the other hand, if "A" and "a" are to be treated as the same letter, you must use the -i option in ALL grep invocations, to instruct grep to ignore case.
Next, you always want to pass the -E option, to avoid the so-called backslashitis gravis in the regexps that you pass to grep.
Further, if you want to exclude the lines matching a regexp from the output, the correct option is -v.
Finally, if you want to specify many different regexes in a single grep invocation, this is the way (just an example, btw):
grep -E -i -v -e 'regexp_1' -e 'regexp_2' ... -e 'regexp_n'
The preliminaries are behind us; now let's move forward. Use the answer from chiastic-security as a reference to understand the proceedings.
There are only these possibilities for finding a duplicate in a 5-character string:
(.)\1
(.).\1
(.)..\1
(.)...\1
grep -E -i -e 'regexp_1' ...
Now you have all the doubles, but this doesn't exclude triples etc., which are identified by the following patterns (edit: added a couple of additional matching triple patterns):
(.)\1\1
(.).\1\1
(.)\1.\1
(.)..\1\1
(.).\1.\1
(.)\1\1\1
(.).\1\1\1
(.)\1\1\1\1
You want to exclude these patterns, so grep -E -i -v -e 'regexp_1' ...
At this point you have a list of words with at least one pair of the same character and no triples etc., and you want to drop double doubles; these are the regexes that match double doubles:
(.)(.)\1\2
(.)(.)\2\1
(.).(.)\1\2
(.).(.)\2\1
(.)(.).\1\2
(.)(.).\2\1
(.)(.)\1.\2
(.)(.)\2.\1
and you want to exclude the lines with these patterns, so it's grep -E -i -v ...
A final hint: to play with my answer, copy a few hundred lines of the dictionary into your working directory, head -n 3000 /usr/share/dict/words | tail -n 300 > ./300words, so that you can really understand what you're doing without being overwhelmed by the volume of the output.
And yes, this is not a complete answer, but it is maybe too much, isn't it?
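To tie the hints from the last two answers together, here is one hedged sketch of a complete pipeline in the shape the assignment describes (six greps, one sed, one read loop). The patterns are assembled from the explanations above and have not been checked against a full dictionary, so verify the output yourself:
grep -E  '^[a-z]{5}$' /usr/share/dict/words |        # exactly five lowercase letters
grep -E  '(.).*\1' |                                 # at least one duplicated letter
grep -Ev '(.).*\1.*\1' |                             # drop letters appearing three or more times
grep -Ev '(.).*\1.*(.).*\2' |                        # drop two duplicated letters: xx...yy
grep -Ev '(.).*(.).*\1.*\2' |                        # drop two duplicated letters: xy...xy
grep -Ev '(.).*(.).*\2.*\1' |                        # drop two duplicated letters: xy...yx
sed -E 's/^((.*)(.)(.*)\3(.*))$/\1 \3 \2\4\5/' |     # word, duplicated letter, leftover letters
while read -r word dup rest; do                      # sort the three leftover letters
    printf '%s %s %s\n' "$word" "$dup" "$(printf %s "$rest" | fold -w1 | sort | tr -d '\n')"
done
On a typical dictionary this should print lines like aback a bck, matching the sample output.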

Grep (a.txt - English word list, b.txt - one string per line) Q: is a string from b.txt built only from words or not?

I have a list of English words (one per line, around 100,000) in a.txt, and b.txt contains strings (around 50,000 lines, one string per line; a string can be pure words, word+something, or garbage). I would like to know which strings from b.txt contain English words only (without any additional characters).
Can I do this with grep?
Example:
a.txt:
apple
pie
b.txt:
applepie
applebs
bspie
bsabcbs
Output:
c.txt:
applepie
Since your question is underspecified, maybe this answer can help as a shot in the dark to clarify your question:
c='cat b.txt'
while IFS='' read -e line
do
c="$c | grep '$line'"
done < a.txt
eval "$c" > c.txt
But this would also match a line like this is my apple on a pie. I don't know if that's what you want.
This is another try:
re=''
while IFS='' read -e line
do
re="$re${re:+|}$line"
done < a.txt
grep -E "^($re)*$" b.txt > c.txt
This will let pass only the lines which have nothing but a concatenation of these words. But it will also let pass things like 'appleapplepieapplepiepieapple'. Again, I don't know if this is what you want.
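For instance, with the example files from the question the loop builds re up to apple|pie, so the final command is effectively:
grep -E "^(apple|pie)*$" b.txt > c.txt
which outputs applepie, and would also accept a string such as appleapplepie if it were present.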
Given your latest explanation in the question I would propose another approach (because building such a list out of 100000+ words is not going to work).
A working approach for this amount of words could be to remove all recognized words from the text and see which lines get emptied in the process. This can easily be done iteratively without exploding the memory usage or other resources. It will take time, though.
cp b.txt inprogress.txt
while IFS='' read -e line
do
sed -i "s/$line//g" inprogress.txt
done < a.txt
for lineNumber in $(grep -n '^$' inprogress.txt | sed 's/://')
do
sed -n "${lineNumber}p" b.txt
done
rm inprogress.txt
But this still would not really solve your issue: consider having the words to and potato in your list; if removing to happens first, that leaves pota in your text file, and pota is not a word that would then be removed.
You could address that issue by sorting your word file by word length (longest words first), but that would still be problematic in some cases of compound words, e.g. redart (being red + art): dart would be removed first, so re would remain, and if re is not in your word list you would not recognize the word.
Actually, your problem is one of logical programming and natural language processing and probably does not fit to SO. You should have a look at the language Prolog which is designed around such problems as yours.
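The segmentation problem described above is the classic "word break" problem. If stepping outside grep and sed is acceptable, a hedged awk sketch (plain POSIX awk) can check each line of b.txt against the whole word list without the ordering pitfalls of iterative removal; treat it as a starting point rather than a tuned solution:
awk '
  NR == FNR { dict[$0]; next }                 # first file: load the word list into memory
  {
    n = length($0)
    split("", ok); ok[0] = 1                   # ok[i] set: the prefix of length i is a concatenation of words
    for (i = 1; i <= n; i++)
      for (j = 0; j < i; j++)
        if ((j in ok) && (substr($0, j + 1, i - j) in dict)) { ok[i] = 1; break }
    if (n in ok) print
  }
' a.txt b.txt > c.txt
With the four-line example b.txt this prints only applepie; runtime per line is roughly quadratic in the line length, which stays practical even with 100,000 dictionary words since the word list is only hashed once.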
I will post this as an answer as well since I feel this is the correct answer to your specific question.
Your requirement is to find non-English words in a file (b.txt) based on a word list (a.txt) which contains a list of English words. Based on the example in your question, said word list does not contain compound words (e.g. applepie), but you would still like to match the file against compound words built from words in your word list (e.g. apple and pie).
There are two problems you are facing:
Not every permutation of words in a.txt will be a valid English compound word, so based on this alone your problem is already impossible to solve.
If you, nonetheless, were to attempt building a list of compound words yourself by compiling a list of all possible permutations, you could not easily do this because of the size of your word list (and the resulting memory problems). You would most probably have to store your words in a more complex data structure, e.g. a tree, and build permutations on the fly by traversing the tree, which is not doable in shell scripting.
Because of these points and your actual question being "can this be done with grep?" the answer is no, this is not possible.
