My goal is to extract the paragraphs of a text that contain a specific keyword: not just the lines that contain the keyword, but the whole paragraph. The rule imposed on my text files is that every paragraph starts with a certain pattern (e.g. Pa0), which appears in the text only at the start of a paragraph. Each paragraph ends with a newline character.
For example, imagine I have the following text:
Pa0
This is the first paragraph bla bla bla
This is another line in the same paragraph bla bla
This is a third line bla bla
Pa0
This is the second paragraph bla bla bla
Second line bla bla My keyword is here!
bla bla bla
bla
Pa0
Hey, third paragraph bla bla bla!
bla bla
Pa0
keyword keyword
keyword
Another line! bla
I want to extract only those paragraphs that contain the word "keyword". The desired output is:
Pa0
This is the second paragraph bla bla bla
Second line bla bla My keyword is here!
bla bla bla
bla
Pa0
keyword keyword
keyword
Another line! bla
I can grep for the keyword and use the -A, -B or -C options to get a fixed number of lines before and/or after the matching line, but that is not enough, since the beginning and end of the text block depend on the delimiters "Pa0" and "\n".
Any suggestion for grep or another tool (e.g. awk, sed, perl) would be helpful.
It is simple with awk:
awk '/keyword/' RS="\n\n" ORS="\n\n" input.txt
Explanation:
Usually awk operates on a per-line basis, because the default value of the record separator RS is \n (a single newline). By changing RS to two newlines in sequence (an empty line) we can easily operate on a per-paragraph basis. Note that a multi-character RS is an extension supported by gawk and mawk; in POSIX awk the portable way to get paragraph mode is RS="".
/keyword/ is a condition, a regex. Since there is no action after the condition, awk will simply print the unchanged record (the paragraph) if it contains keyword.
Setting the output record separator ORS to \n\n will separate the paragraphs of output with an empty line, just like in the input.
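If the paragraphs are delimited by the Pa0 marker rather than by blank lines, the same paragraph-at-a-time idea can be written without relying on RS. The following is only a minimal sketch in POSIX awk wrapped in a shell function; the name paragraphs_with is hypothetical, and it assumes that Pa0 occurs only at the start of the line that opens a paragraph, as the question states.

```shell
# paragraphs_with is a hypothetical helper name, not part of any tool.
# usage: paragraphs_with KEYWORD [FILE...]   (reads stdin if no FILE is given)
paragraphs_with() {
    kw=$1
    shift
    awk -v kw="$kw" '
        # Print the buffered paragraph only if the keyword was seen in it.
        function emit() { if (hit) printf "%s", para; para = ""; hit = 0 }
        /^Pa0/        { emit() }               # a marker line closes the previous paragraph
        index($0, kw) { hit = 1 }              # plain substring match, not a regex
                      { para = para $0 "\n" }  # buffer every line of the current paragraph
        END           { emit() }               # the last paragraph has no following marker
    ' "$@"
}
```

For example, paragraphs_with keyword input.txt should print the second and fourth paragraphs of the sample text. The keyword is matched as a plain substring with index(), so regex metacharacters in it are not treated specially.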
If text.txt contains the text you want to search, then:
$ sed -e '/./{H;$!d;}' -e 'x;/keyword/!d;' text.txt
Pa0
This is the second paragraph bla bla bla
Second line bla bla My keyword is here!
bla bla bla
bla
Pa0
keyword keyword
keyword
Another line! bla
hope this will help
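To make the one-liner above easier to follow, here is the same script spread over several lines with comments. It is only a sketch of how the hold space is being used; keyword_paragraphs is a hypothetical wrapper name, and the script assumes that paragraphs are separated by blank lines.

```shell
# keyword_paragraphs is a hypothetical wrapper name.
# usage: keyword_paragraphs [FILE...]   (reads stdin if no FILE is given)
keyword_paragraphs() {
    sed -e '
# Non-blank line: append it to the hold space, then (unless it is
# the last line of input) delete it and start the next cycle.
/./{
H
$!d
}
# Blank line or last line: swap the two buffers, so the pattern space
# now holds the accumulated paragraph.
x
# Drop the paragraph unless it contains the keyword.
/keyword/!d
' "$@"
}
```

Because H always inserts a newline before the appended line, each printed paragraph is preceded by an empty line.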
sed -n '/Pa0/,/^$/p' filename
cat filename | sed -n '/Pa0/,/^$/p'
-n, suppress automatic printing of pattern space
p, print the current pattern space (a sed command, not an option)
/Pa0/, paragraph starting with Pa0 pattern
/^$/, paragraph ending with a blank line
^, start of line
$, end of line
Reference: http://www.cyberciti.biz/faq/sed-display-text/
For example I have the following text:
black brown cat bla bla_____
black brown cat bla bla__
black brown cat bla bla_______
black brown cat bla bla___
black brown cat bla bla_____
black brown cat bla bla
Each line ends with a varying number of underscores (it could be any other character).
I want to delete the underscores from the end of each line using the f command: delete with x until the pattern is no longer found, then go to the next line.
The expected result is the lines without the underscored symbol:
black brown cat bla bla
black brown cat bla bla
black brown cat bla bla
black brown cat bla bla
black brown cat bla bla
How can I do it?
The easiest would be:
:%s/_*$//
f/F is handy, but not for this use case: it jumps to the next occurrence of a character x in the line, but doesn't check whether there are several consecutive x or whether the character after x is [^x].
Speaking only to your example, if you must use an f/F kind of command, you can do $vTax. Here I hardcoded a because your lines always end in a_____, but I expect your real text won't look like that.
With a macro recording…
First, record your macro:
qq
f_
D
q
Then, play it back on lines in [range]:
:[range]norm @q
Alternatively:
:[range]norm f_D
With a substitution…
:[range]s/_\+$
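Outside Vim, the same substitution can be tried from the shell with sed, which is a convenient way to check the pattern against a sample before running it on a buffer. This is just a sketch; strip_trailing_underscores is a hypothetical helper name.

```shell
# strip_trailing_underscores is a hypothetical helper name.
# usage: strip_trailing_underscores [FILE...]   (reads stdin if no FILE is given)
strip_trailing_underscores() {
    # _*$ matches the (possibly empty) run of underscores at the end of the
    # line, so underscores in the middle of a line are left alone.
    sed 's/_*$//' "$@"
}
```

For instance, a line such as a_b__ would come out as a_b, because only the trailing run is anchored by $.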
Original file:
bla bla test
blabla
start
test
blabla
start
test
The result should be:
bla bla test
blabla
start
edit
blabla
start
edit
So after the first occurrence of "start", every "test" should be replaced with "edit".
You can use this sed:
sed '/start/,$s/test/edit/g' file
bla bla test
blabla
start
edit
blabla
start
edit
Explanation:
/start/,$ # select lines from the first occurrence of start to the end of the file
s/test/edit/g # in the found text replace test by edit globally
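The address range can be checked on a small input where "test" also appears before and on the "start" line. This is a sketch only; replace_after_start is a hypothetical helper name.

```shell
# replace_after_start is a hypothetical helper name.
# usage: replace_after_start [FILE...]   (reads stdin if no FILE is given)
replace_after_start() {
    # The range /start/,$ begins ON the first line matching "start",
    # so a "test" on that same line is replaced as well; lines before
    # it are outside the range and stay untouched.
    sed '/start/,$ s/test/edit/g' "$@"
}
```

If the line containing "start" itself must not be modified, the range would need to be adjusted; the command above follows the answer as given.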
'bla bla bla', 'bla bla bla'
---------------^------------ (cursor position)
To delete the second 'bla bla bla' I use
da'
but this also deletes the leading space. Is there a way to not include the leading space in the deletion?
(I'm trying to create a macro to replace quoted strings with a function call, ie replace eg
'bla bla bla', 'woot'
with
yada('bla_bla_bla'), yada('woot') )
In a macro you can also use an Ex command, like this:
:s/'.\{-}'/yada(&)/g
This will only apply on '...', the rest (space, comma etc) won't be touched.
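The role of & (the whole matched text) can be seen with an equivalent sed command run from the shell. This is only a sketch: wrap_quoted is a hypothetical helper name, and because POSIX regular expressions have no non-greedy \{-}, the pattern uses [^']* instead, which assumes the quoted strings contain no escaped quotes.

```shell
# wrap_quoted is a hypothetical helper name.
# usage: wrap_quoted [FILE...]   (reads stdin if no FILE is given)
wrap_quoted() {
    # Match a quote, any run of non-quote characters, and the closing quote;
    # & in the replacement stands for everything that was matched.
    sed "s/'[^']*'/yada(&)/g" "$@"
}
```

Like the Vim command, this wraps each quoted string and leaves the commas and spaces between them as they are; it does not change the contents of the strings.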
You can use vi'i'<operator> to operate on the quotes and their content. This would make your macro look something like that:
vi'i'cyada(<C-r>")
This is my first question on this website (glad I found out about this community).
I am trying to replace a specific pattern in a file (multiple lines) that looks something like this:
Bla bla bla bla |SMTH AWESOME INSIDE >>> LOL| bla bla bla | let's do it again >>> AWESOME |
Into a format that looks like this
Bla bla bla bla ( LOL | SMTH AWESOME INSIDE ) bla bla bla ( AWESOME | let's do it again )
I tried doing this with code that parses the line word by word: when it finds the "|" character it starts building a string for the first part, then after it finds ">>>" it starts building the second string until it reaches the closing "|". It didn't work.
Afterwards I also tried AWK, but since I am new to Linux I failed as well:
awk -F 'BEGIN { FS=OFS="|" } { sub(/.*<<</,"", $2); }1' $1 }'
and then parsing the output with sed (removing the ) and ( characters from both strings), but that didn't work either.
Thank you for reading.
It looks like this is just a simple substitution within each line so all you need is sed:
$ sed 's/| *\([^|]*\) >>> \([^|]*\) *|/( \2 | \1 )/g' file
Bla bla bla bla ( LOL | SMTH AWESOME INSIDE ) bla bla bla ( AWESOME | let's do it again )
You can do the same in GNU awk with gensub() or other awks with match() and substr().
With extended regexp in sed:
sed -r 's/\|([^|]+)[[:space:]]*>>>[[:space:]]*([^|]+)\|/( \2 | \1 )/g' File
Logic:
We look for a pattern which starts with | followed by a sequence of non-| characters followed by >>> followed by a sequence of non-| characters again. See the groupings done with ( and ). Then we substitute these patterns according to our need. ( \2 | \1 ) is the replacement pattern where \1 and \2 are the first and second groupings respectively.
With basic regexp in sed:
sed 's/|\([^|]*\)[[:space:]]*>>>[[:space:]]*\([^|]*\)|/( \2 | \1 )/g' File
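Both commands above leave any blanks that sit next to the delimiters inside the captured groups. A possible variant that trims them is sketched below. It makes some assumptions that are not in the original answer: swap_fields is a hypothetical helper name, -E is used in place of -r (GNU and BSD sed both accept it), and the text before >>> is assumed not to contain a > character.

```shell
# swap_fields is a hypothetical helper name.
# usage: swap_fields [FILE...]   (reads stdin if no FILE is given)
swap_fields() {
    # Group 1: text after the opening |, ending in a non-blank character.
    # Group 2: text after >>>, ending in a non-blank character.
    # Blanks around both groups are matched outside the parentheses,
    # so they are dropped from the replacement.
    sed -E 's/\| *([^|>]*[^|> ]) *>>> *([^|]*[^| ]) *\|/( \2 | \1 )/g' "$@"
}
```

Each group needs at least one non-blank character, so an empty field such as | >>> x| is left unchanged by this sketch.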
Perl's regular expressions have a "non-greedy" matching feature that awk's do not:
perl -pe '
s/ \| # the first delimiter
(.*?) # capture up to ...
>>> # the middle delimiter
(.*?) # capture up to ...
\| # the last delimiter
/($2 | $1)/gx
' file
Bla bla bla bla ( LOL | SMTH AWESOME INSIDE ) bla bla bla ( AWESOME | let's do it again )
Let's try with awk:
awk 'NR%2{ printf("%s", $0) } NR%2==0{ printf("( %s %s",$NF,RS); gsub(/>>>.*$/,")"); printf("%s",$0) }' RS='|' file
Bla bla bla bla ( LOL | SMTH AWESOME INSIDE ) bla bla bla ( AWESOME | let's do it again )
RS defines | as the record separator. When the input record number (NR) is odd (NR%2 returns 1), the record lies outside the delimiters and is printed unchanged. When NR is even (NR%2==0), the record is the text between two | characters: print an opening parenthesis followed by the last field and the record separator (printf("( %s %s",$NF,RS)), then replace >>>.*$ with a closing parenthesis and print what is left of the record (gsub(/>>>.*$/,")"); printf("%s",$0)). Note that $NF is only the last whitespace-separated word, so this assumes the text after >>> is a single word.
I have text with the following structure:
bla bla more bla bla
$
PART / 4402000LLINK 4401001
NAME ADHESIVE 8.0 mm LLINK Property
8. 8. 2
END_PART
$
some other bla bla
But the line containing PART could be also:
PART / 4402000 LLINK 4401001
or:
PART / 4402000 LLINK 4401001
So strictly speaking the LLINK could occupy the columns from 16 to 23.
Now I would like to delete all lines between two pattern lines, including the pattern lines themselves. The first pattern is a line containing both PART and LLINK. The second pattern is a line containing END_PART. So at the end I will have this:
bla bla more bla bla
$
$
some other bla bla
I am using CentOS with:
> echo $SHELL
/bin/tcsh
so I can use e.g. sed or awk in tcsh. Could you help? Thank you.
This sed command can be used (note that -i edits the file in place):
sed -i -r '/PART.*LLINK/,/END_PART/d' file
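Since -i overwrites the file, it may be worth previewing the result first. The sketch below runs the same range deletion without -i, so the result goes to standard output. drop_part_blocks is a hypothetical helper name, and the function syntax is POSIX sh; under tcsh the sed command inside it can be typed directly.

```shell
# drop_part_blocks is a hypothetical helper name.
# usage: drop_part_blocks [FILE...]   (reads stdin if no FILE is given)
drop_part_blocks() {
    # The range starts at a line where PART is followed by LLINK and ends at
    # the next line containing END_PART; d deletes every line in the range,
    # both boundary lines included.
    sed '/PART.*LLINK/,/END_PART/d' "$@"
}
```

A line that contains LLINK but not PART, such as the NAME line in the sample, does not start a new range; it is removed only because it falls inside one.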
With awk, this seems to work:
$ cat file
bla PART bla more bla bla
$
PART / 4402000LLINK 4401001
NAME ADHESIVE 8.0 mm LLINK Property
PART
8. 8. 2
END_PART
$
some other bla bla
AAAAA
PART / 4402120 LLINK 4401001
NAME ADHESIVE 8.0 mm LLINK Property
PART
AAAAAAAAAA
8. 8. 2
END_PART
$AAAA
AAAAAsome other bla bla
awk '{if (($0!~/PART/ || $0!~/LLINK/) && stop == 0) {print} else {if ($0~/END_PART/) {stop=0} else {stop=1}}}' file
bla PART bla more bla bla
$
$
some other bla bla
AAAAA
$AAAA
AAAAAsome other bla bla
Hope this helps !
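The same logic can also be written with awk pattern rules and a single flag, which some may find easier to read than the nested if/else. This is a sketch under the same assumptions as the answer above (a block opens on a line containing both PART and LLINK and closes on a line containing END_PART); drop_part_blocks_awk is a hypothetical helper name.

```shell
# drop_part_blocks_awk is a hypothetical helper name.
# usage: drop_part_blocks_awk [FILE...]   (reads stdin if no FILE is given)
drop_part_blocks_awk() {
    awk '
        /PART/ && /LLINK/ { skip = 1 }   # opening line: start skipping
        !skip                            # print lines that are outside a block
        /END_PART/        { skip = 0 }   # closing line: resume with the next line
    ' "$@"
}
```

The order of the rules matters: the flag is set before the print test, so the opening line is dropped, and it is cleared after the print test, so the END_PART line is dropped too.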