Trying to replace a pattern with another one - linux

This is my first question on this website (glad I found out about this community).
I am trying to replace a specific pattern in a multi-line file, where each line looks something like this:
Bla bla bla bla |SMTH AWESOME INSIDE >>> LOL| bla bla bla | let's do it again >>> AWESOME |
into a format that looks like this:
Bla bla bla bla ( LOL | SMTH AWESOME INSIDE ) bla bla bla ( AWESOME | let's do it again )
I first tried code that walks the line word by word: when it finds the "|" character it starts building a string that holds the first part, then after it finds ">>>" it starts building the second string until it reaches the closing "|". That did not work.
I then tried awk, but since I am new to Linux that failed as well:
awk -F 'BEGIN { FS=OFS="|" } { sub(/.*<<</,"", $2); }1' $1 }'
The plan was then to parse the output with sed, removing the ) and ( characters from both strings, but that did not work either.
Thank you for reading.

It looks like this is just a simple substitution within each line so all you need is sed:
$ sed 's/| *\([^|]*\) >>> \([^|]*\) *|/( \2 | \1 )/g' file
Bla bla bla bla ( LOL | SMTH AWESOME INSIDE ) bla bla bla ( AWESOME | let's do it again )
You can do the same in GNU awk with gensub() or other awks with match() and substr().
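For awks without gensub(), the match()/substr() route mentioned above could look roughly like the sketch below. The function name swap_pairs is made up for illustration, and the code assumes each block has the form |left >>> right| with a space on each side of >>>, as in the question's sample.

```shell
# Hypothetical helper: rewrite every |left >>> right| block as ( right | left ).
swap_pairs() {
  awk '{
    out = ""
    # Repeatedly find the next |left >>> right| block in what remains of the line.
    while (match($0, /\|[^|]* >>> [^|]*\|/)) {
      block = substr($0, RSTART + 1, RLENGTH - 2)   # text between the two pipes
      sep   = index(block, " >>> ")
      left  = substr(block, 1, sep - 1)
      right = substr(block, sep + 5)
      gsub(/^ +| +$/, "", left)                      # trim surrounding spaces
      gsub(/^ +| +$/, "", right)
      out = out substr($0, 1, RSTART - 1) "( " right " | " left " )"
      $0  = substr($0, RSTART + RLENGTH)             # continue after this block
    }
    print out $0
  }' "$@"
}
```

Called as swap_pairs file, it reads the file and writes the converted lines to standard output. Trimming both parts means the spacing inside the parentheses is uniform regardless of how the input was padded.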

With extended regexp in sed:
sed -r 's/\|([^|]+)[[:space:]]*>>>[[:space:]]*([^|]+)\|/( \2 | \1 )/g' File
Logic:
We look for a pattern which starts with | followed by a sequence of non-| characters followed by >>> followed by a sequence of non-| characters again. See the groupings done with ( and ). Then we substitute these patterns according to our need. ( \2 | \1 ) is the replacement pattern where \1 and \2 are the first and second groupings respectively.
With basic regexp in sed:
sed 's/|\([^|]*\)[[:space:]]*>>>[[:space:]]*\([^|]*\)|/( \2 | \1 )/g' File

Perl's regular expressions have a "non-greedy" matching feature that awk's do not:
perl -pe '
s/ \| # the first delimiter
(.*?) # capture up to ...
>>> # the middle delimiter
(.*?) # capture up to ...
\| # the last delimiter
/($2 | $1)/gx
' file
Bla bla bla bla ( LOL | SMTH AWESOME INSIDE ) bla bla bla ( AWESOME | let's do it again )

Let's try with awk:
awk 'NR%2{ printf("%s", $0) } NR%2==0{ printf("( %s %s",$NF,RS); gsub(/>>>.*$/,")"); printf("%s",$0) }' RS='|' file
Bla bla bla bla ( LOL | SMTH AWESOME INSIDE ) bla bla bla ( AWESOME | let's do it again )
RS='|' defines | as the record separator. When the input record number (NR) is odd (NR%2 returns 1), the record is printed unchanged. When NR is even (NR%2==0), the script prints an opening parenthesis, the last field of the record and the record separator (printf("( %s %s",$NF,RS)), then replaces >>>.*$ with a closing parenthesis and prints the rest of the record (gsub(/>>>.*$/,")"); printf("%s",$0)).

Related

Using grep to output a string

When I run a command, its output is formatted as a chart:
+---------+------------------------------------------------------+
| Key | Value |
+---------+------------------------------------------------------+
| Address | longstringofcharacters |
+---------+------------------------------------------------------+
| Name | word1-word2-word3 |
+---------+------------------------------------------------------+
I can grep Name to get the line that contains the word Name.
What do I grep to output just the string of word1-word2-word3 only?
I've tried grep '*-*-*' but that doesn't work.
With GNU grep, you can use a PCRE regex based solution like
grep -oP '^\h*\|\h*Name\h*\|\h*\K\S+' file
-o - outputs matches only
-P - enables the PCRE regex engine
^ - start of line
\h*\|\h* - a | char enclosed with optional horizontal whitespaces
Name - a word Name
\h*\|\h* - a | char enclosed with optional horizontal whitespaces
\K - match reset operator that discards text matched so far
\S+ - one or more non-whitespace chars.
With GNU awk:
awk -F'[|[:space:]]+' '$2 == "Name"{print $3}' file
Set the field separator to a [|[:space:]]+ regex that matches one or more | chars or whitespaces, check if Field 2 equals Name and grab Field 3.
With any awk (if you need to extract a string like nonwhitespaces(-nonwhitespaces)+):
awk 'match($0, /[^ -]+(-[^ -]+)+/) { print substr($0, RSTART, RLENGTH) }' file
A simple solution using awk is:
awk '/Name/{ print $4 }'
The /Name/ section is how awk "greps". The { print $4 } bit says print the fourth whitespace-delimited word (the | characters count as words here).
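If the value might contain spaces, splitting on | alone and trimming each cell is another option. The sketch below is only an illustration: get_value is a hypothetical name, and it assumes the two-column "| Key | Value |" layout shown in the question.

```shell
# Hypothetical helper: print the Value cell of the row whose Key cell equals $1.
get_value() {
  key=$1; shift                 # remaining arguments are input files (stdin if none)
  awk -F'|' -v key="$key" '
    {
      k = $2; v = $3
      gsub(/^[ \t]+|[ \t]+$/, "", k)   # trim the key cell
      gsub(/^[ \t]+|[ \t]+$/, "", v)   # trim the value cell
    }
    k == key { print v }               # border rows have no | and never match
  ' "$@"
}
```

Usage would be something like: some_command | get_value Name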

Get rid of unwanted lines from file

In the example below, ^[ is the escape character used to colour terminal output (to type it, press Ctrl+V followed by Ctrl+[).
1) My file:
-------- just to mark start of file ----------
^[[1;31mbla bla bla^[[0m
^[[0;36mTREE;01;^[[0m
^[[1;31m^[[0m
^[[1;31m^[[1;31mapple tree:^[[0m^[[0m
^[[1;31m4 apples^M^M^[[0m
^[[1;31m6 leafs^M^[[0m
^[[0;36mTREE;02;^[[0m
^[[0;36mTREE;03;^[[0m
withered
^[[0;36mTREE;04;^[[0m
^[[0;36mTREE;05;^[[0m
^[[0;36mTREE;06;^[[0m
^[[0;36mTREE;07;^[[0m
^[[1;31m^[[0m
^[[1;31m^[[1;31mcherry tree:^[[0m^[[0m
^[[1;31mbig branches^M^M^[[0m
^[[1;31mtchick roots^M^[[0m
^[[0;36mTREE;08;^[[0m
^[[0;36mMy tree ^[[0m I have tree house on it^[[0;31m:-)^[[0m
^[[0;36mTREE;09;^[[0m
-------- just to mark end of file ----------
2) I want to get rid of all "empty labels", that is, all labels that have no comments under them.
So the result I want to achieve is:
-------- just to mark start of results ----------
^[[1;31mbla bla bla^[[0m
^[[0;36mTREE;01;^[[0m
^[[1;31m^[[0m
^[[1;31m^[[1;31mapple tree:^[[0m^[[0m
^[[1;31m4 apples^M^M^[[0m
^[[1;31m6 leafs^M^[[0m
^[[0;36mTREE;03;^[[0m
withered
^[[0;36mTREE;07;^[[0m
^[[1;31m^[[0m
^[[1;31m^[[1;31mcherry tree:^[[0m^[[0m
^[[1;31mbig branches^M^M^[[0m
^[[1;31mtchick roots^M^[[0m
^[[0;36mTREE;08;^[[0m
^[[0;36mMy tree ^[[0m I have tree house on it^[[0;31m:-)^[[0m
-------- just to mark end of results ----------
3) I do:
pcregrep -M 'TREE.*\n(\n|\s)+(?=.*TREE|\z)' my_file
and it works as I expect: it outputs only the labels with no comments
-------- just to mark start of results ----------
^[[0;36mTREE;02;^[[0m
^[[0;36mTREE;04;^[[0m
^[[0;36mTREE;05;^[[0m
^[[0;36mTREE;06;^[[0m
^[[0;36mTREE;09;^[[0m
-------- just to mark end of results ----------
4) But the command:
pcregrep -Mv 'TREE.*\n(\n|\s)+(?=.*TREE|\z)' my_file
produces "weird results" that I do not understand.
*) How can I get the result I want?
Any tool will do: pcregrep, ag, ack, sed, awk, ...
The simplest and probably the stupidest solution that I have come up with:
[steelrat#archlinux ~]$ awk '/TREE/ {f=$0;p=1} !/^ *$/&&!/TREE/ {if (p==1) {print f; p=0} print $0}' my_file
-------- just to mark start of results ----------
^[[1;31mbla bla bla^[[0m
^[[0;36mTREE;01;^[[0m
^[[1;31m^[[0m
^[[1;31m^[[1;31mapple tree:^[[0m^[[0m
^[[1;31m4 apples^M^M^[[0m
^[[1;31m6 leafs^M^[[0m
^[[0;36mTREE;03;^[[0m
withered
^[[0;36mTREE;07;^[[0m
^[[1;31m^[[0m
^[[1;31m^[[1;31mcherry tree:^[[0m^[[0m
^[[1;31mbig branches^M^M^[[0m
^[[1;31mtchick roots^M^[[0m
^[[0;36mTREE;08;^[[0m
^[[0;36mMy tree ^[[0m I have tree house on it^[[0;31m:-)^[[0m
-------- just to mark end of results ----------
If you need to keep the blank lines (getting rid of the blank lines that belong to empty sections requires some extra work):
$ awk '/^ *$/ {print $0} /TREE/ {f=$0;p=1} !/^ *$/&&!/TREE/ {if (p==1) {print f; p=0} print $0}' my_file
-------- just to mark start of results ----------
^[[1;31mbla bla bla^[[0m
^[[0;36mTREE;01;^[[0m
^[[1;31m^[[0m
^[[1;31m^[[1;31mapple tree:^[[0m^[[0m
^[[1;31m4 apples^M^M^[[0m
^[[1;31m6 leafs^M^[[0m
^[[0;36mTREE;03;^[[0m
withered
^[[0;36mTREE;07;^[[0m
^[[1;31m^[[0m
^[[1;31m^[[1;31mcherry tree:^[[0m^[[0m
^[[1;31mbig branches^M^M^[[0m
^[[1;31mtchick roots^M^[[0m
^[[0;36mTREE;08;^[[0m
^[[0;36mMy tree ^[[0m I have tree house on it^[[0;31m:-)^[[0m
-------- just to mark end of results ----------
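The idea behind the first one-liner above is to hold back each label until a comment line shows up under it. Spelled out with comments it might look like the sketch below; drop_empty_labels is a made-up name, and "label" is assumed to mean any line containing TREE, as in the question.

```shell
# Hypothetical helper: keep a TREE label only if at least one comment follows it.
drop_empty_labels() {
  awk '
    /TREE/     { label = $0; pending = 1; next }  # remember the newest label, do not print yet
    /^[ \t]*$/ { next }                           # blank lines are not comments
    {
      if (pending) { print label; pending = 0 }   # first comment under a label: emit the label
      print                                       # then the comment itself
    }
  ' "$@"
}
```

A label that is immediately followed by another label is simply overwritten, so it never reaches the output.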
Well I did it.
(1) sed 's/^M//g;
(2) s/$/#VAV#/' my_file | \
(3) paste -sd "" | \
(4) sed 's/^[\[0;36mTREE[[:print:]]\+^[\[0m\(\(#VAV#\)\|\([[:blank:]]\)\|\(^[\[0;36mTREE[[:print:]]\+^[\[0m\)\)*\(\(^[\[0;36mTREE[[:print:]]\+^[\[0m\)\|$\)/\6/g;
(5) s/#VAV#/\n/g'
(1) Get rid of the ^M (carriage return) characters, since they get in the way.
(2) Append a deliberately unusual marker string to the end of each line.
(3) Concatenate all lines into one string.
(4) Do the actual regular expression substitution.
(5) Turn the marker from step (2) back into a newline.

Grep whole paragraphs of a text containing a specific keyword

My goal is to extract the paragraphs of a text that contain a specific keyword. Not just the lines that contain the keyword, but the whole paragraph. The rule imposed on my text files is that every paragraph starts with a certain pattern (e.g. Pa0) which is used throughout the text only in the start of the paragraph. Each paragraph ends with a new line character.
For example, imagine I have the following text:
Pa0
This is the first paragraph bla bla bla
This is another line in the same paragraph bla bla
This is a third line bla bla
Pa0
This is the second paragraph bla bla bla
Second line bla bla My keyword is here!
bla bla bla
bla
Pa0
Hey, third paragraph bla bla bla!
bla bla
Pa0
keyword keyword
keyword
Another line! bla
My goal is to extract these paragraphs that contain the word "keyword". For example:
Pa0
This is the second paragraph bla bla bla
Second line bla bla My keyword is here!
bla bla bla
bla
Pa0
keyword keyword
keyword
Another line! bla
I can use e.g. grep for the keyword and -A, -B or -C option to get a constant number of lines before and/or after the line where the keyword is located but this does not seem enough since the beginning and end of the text block depends on the delimiters "Pa0" and "\n".
Any suggestion for grep or another tool (e.g. awk, sed, perl) would be helpful.
It is simple with awk:
awk '/keyword/' RS="\n\n" ORS="\n\n" input.txt
Explanation:
Usually awk operates on a per line basis, because the default value of the record separator RS is \n (a single new line). By changing the RS to two new lines in sequence (an empty line) we can easily operate on a paragraph basis.
/keyword/ is a condition, a regex. Since there is no action after the condition awk will simply print the unchanged record (the paragraph) if it contains keyword.
Setting the output record separator ORS to \n\n will separate the paragraphs of output with an empty line, just like in the input.
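Note that a multi-character RS such as "\n\n" is an extension (gawk and mawk support it); POSIX awk only guarantees the single-character form and the special RS="" paragraph mode. If the paragraphs are not separated by empty lines, a sketch that relies only on the Pa0 marker from the question could look like this. grep_paragraphs is a hypothetical name, and the keyword is treated as a regular expression.

```shell
# Hypothetical helper: print every Pa0-delimited paragraph that matches keyword $1.
grep_paragraphs() {
  kw=$1; shift                  # remaining arguments are input files (stdin if none)
  awk -v kw="$kw" '
    function emit() {           # print the collected paragraph if it matches
      if (buf != "" && buf ~ kw) printf "%s", buf
      buf = ""
    }
    /^Pa0/ { emit() }           # a new paragraph starts: decide on the previous one
           { buf = buf $0 "\n" }  # collect the current line
    END    { emit() }           # do not forget the last paragraph
  ' "$@"
}
```

Usage: grep_paragraphs keyword input.txt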
If text.txt contains the text you want, then:
$ sed -e '/./{H;$!d;}' -e 'x;/keyword/!d;' text.txt
Pa0
This is the second paragraph bla bla bla
Second line bla bla My keyword is here!
bla bla bla
bla
Pa0
keyword keyword
keyword
Another line! bla
Hope this will help:
sed -n '/Pa0/,/^$/p' filename
cat filename | sed -n '/Pa0/,/^$/p'
-n, suppress automatic printing of pattern space
p, the sed command that prints the current pattern space
/Pa0/, paragraph starting with Pa0 pattern
/^$/, paragraph ending with a blank line
^, start of line
$, end of line
Reference: http://www.cyberciti.biz/faq/sed-display-text/

How to delete all lines between the pattern lines

I have text with the following structure:
bla bla more bla bla
$
PART / 4402000LLINK 4401001
NAME ADHESIVE 8.0 mm LLINK Property
8. 8. 2
END_PART
$
some other bla bla
But the line containing PART could be also:
PART / 4402000 LLINK 4401001
or:
PART / 4402000 LLINK 4401001
So strictly speaking the LLINK could occupy the columns from 16 to 23.
Now I would like to delete all lines between two pattern lines, including the pattern lines themselves. The first pattern is a line containing both PART and LLINK. The second pattern is a line containing END_PART. So at the end I will have this:
bla bla more bla bla
$
$
some other bla bla
I am using CentOS with:
> echo $SHELL
/bin/tcsh
so I can use, for example, sed or awk in tcsh. Could you help? Thank you.
This sed command can be used.
sed -i -r '/PART.*LLINK/,/END_PART/d' file
With awk, this seems to work :
$ cat file
bla PART bla more bla bla
$
PART / 4402000LLINK 4401001
NAME ADHESIVE 8.0 mm LLINK Property
PART
8. 8. 2
END_PART
$
some other bla bla
AAAAA
PART / 4402120 LLINK 4401001
NAME ADHESIVE 8.0 mm LLINK Property
PART
AAAAAAAAAA
8. 8. 2
END_PART
$AAAA
AAAAAsome other bla bla
awk '{if (($0!~/PART/ || $0!~/LLINK/) && stop == 0) {print} else {if ($0~/END_PART/) {stop=0} else {stop=1}}}' file
bla PART bla more bla bla
$
$
some other bla bla
AAAAA
$AAAA
AAAAAsome other bla bla
Hope this helps !
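The same flag logic can be written with one rule per step, which may be easier to follow. This is only a sketch: strip_llink_parts is a made-up name, and it assumes the opening line is any line containing both PART and LLINK, as stated in the question.

```shell
# Hypothetical helper: drop every block from a PART...LLINK line through END_PART.
strip_llink_parts() {
  awk '
    !skip && /PART/ && /LLINK/ { skip = 1 }   # opening line: contains both words
    !skip                      { print }      # outside a block: pass the line through
    skip && /END_PART/         { skip = 0 }   # closing line is dropped, then printing resumes
  ' "$@"
}
```

Unlike sed -i, this writes to standard output, so redirect to a new file and move it over the original if in-place editing is needed.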

Enumerate substitutions with sed or awk

Given the plain text file with lines
bli foo bla
abc
dfg
bli foo bla
hik
lmn
what sed or awk magic transforms it to
bli foo_01 bla
abc
dfg
bli foo_02 bla
hik
lmn
so that every occurrence of 'foo' is replaced by 'foo_[occurrence number]'.
awk '!/foo/||sub(/foo/,"&_"++_)' infile
Use gawk, nawk or /usr/xpg4/bin/awk on Solaris.
This probably isn't what you require, but it might give some ideas in the right direction.
Administrator#snadbox3 ~
$ cd c:/tmp
Administrator#snadbox3 /cygdrive/c/tmp
$ cat <<-eof >foo.txt
> foo
> abc
> dfg
> foo
> hik
> lmn
> eof
Administrator#snadbox3 /cygdrive/c/tmp
$ awk '/^foo$/{++fooCount; print($0 "_" fooCount);} /^ /{print}' foo.txt
foo_1
abc
dfg
foo_2
hik
lmn
EDIT:
I'm a day late and a penny short, again ;-(
EDIT2:
Character encoding is another thing to look out for... Java source code isn't necessarily in the system's default encoding... it is quite often UTF-8 encoded, to allow for any embedded "higher order entities" ;-) Many *nix utilities still aren't charset-aware.
This is another way to express radoulov's answer
awk '/foo/ {sub(/foo/, "&_" sprintf("%02d",++c))} 1' infile
You should take care that you don't match "foobar" while looking for "foo":
gawk '/\<foo\>/ {sub(/\<foo\>/, "&_" sprintf("%02d",++c))} 1'
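The sub() calls above number only the first 'foo' on each line. If a line can contain several occurrences, a match() loop is one way to number them all. The sketch below uses a hypothetical function name, number_foo, and matches plain foo without the word-boundary check, so it would also number "foobar".

```shell
# Hypothetical helper: append _NN (zero-padded running count) to every foo.
number_foo() {
  awk '{
    out = ""
    while (match($0, /foo/)) {
      # copy everything up to and including this foo, then add the counter
      out = out substr($0, 1, RSTART + RLENGTH - 1) sprintf("_%02d", ++n)
      $0  = substr($0, RSTART + RLENGTH)   # continue with the rest of the line
    }
    print out $0
  }' "$@"
}
```

Usage: number_foo infile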

Resources