sed split single line file and process resulting lines

I have an XML feed on a single line, so to extract the data I need I can do something like this:
sed -r 's:<([^>]+)>([^<]+)</\1>:&\n: g' feed | sed -nr '
/<item>/, $ s:.*<(title|link|description)>([^<]+)</\1>.*:\2: p'
since I can't find a way to make the first sed call process its result as separate lines.
Any advice?
My goal is to get all the data I need in a single sed call.

sed -rn -e 's|>[[:space:]]*<|>\n<|g
/^<title>/ { bx }
/^<description>/ { b x }
/^<link>/ { bx }
D
:x
s|<([^>]*)>([^\n]*)</\1>|\1=\2|;
P
D' rss.xml
New answer to the new question. Now with branches, and outputting all three chunks of information.

sed -rn -e 's|>[[:space:]]*<|>\n<|g # Insert newlines before each element
/^[^<]/ D # If not starting with <, delete until 1st \n and restart
/^<[^t]/ D # If not starting with <t, ""
/^<t[^i]/ D # If not starting with <ti, ""
/^<ti[^t]/ D
/^<tit[^l]/ D
/^<titl[^e]/ D
/^<title[^>]/ D # If not starting with <title>, delete until 1st \n and restart
s|^<title>|| # Delete <title>
s|</title>[^\n]*|| # Delete </title> and everything after it until the newline
P # Print everything up to the first newline
D' rss.xml # Delete everything up to the first newline and restart
By "restart" I mean go back to the top of the sed script and pretend we just read whatever is left.
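The "print, delete, restart" cycle described above can be seen in isolation with a much smaller input. The following is only a sketch, assuming GNU sed (for \n in the replacement); the function name split_fields and the comma-separated sample are made up for illustration.

```shell
# Split one comma-separated line into one field per line using P and D.
split_fields() {
  sed -n 's/,/\n/g
P
D'
}

# s turns "a,b,c" into "a\nb\nc"; P prints "a"; D deletes "a\n" and restarts
# the script on "b\nc" without reading new input, and so on until nothing is left.
printf 'a,b,c\n' | split_fields
```

Each restart reruns the s command too, which is harmless here because no commas are left after the first pass.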
I learned a lot about sed writing this. However, there is zero question that you really should be doing this in perl (or awk if you are old school).
In perl, this would be perl -pe 's%.*?<title>(.*?)</title>(?:.*?(?=<title>)|.*)%$1\n%g' rss.xml
Which is basically taking advantage of the minimal match (.*? is non-greedy, it will match the fewest number of characters possible). The positive lookahead thing at the end is just so that I could do it in one s expression while still deleting everything at the end. There is more than one way…
If you needed multiple tags out of this xml file, it probably is still possible, but would probably involve branching and the like.
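For the awk route mentioned above, one possible sketch for several tags at once is to treat < as the record separator and > as the field separator, so each opening tag and its text become one record. This is an assumption-laden illustration, not a real XML parser: it assumes the element text contains no literal < or > and that the tags carry no attributes. The function name and sample feed are hypothetical.

```shell
# Print the text of every <title>, <link> and <description> element.
# Records are split on "<", so "title>Some text" is one record whose
# first field (split on ">") is the tag name and second field is the text.
extract_with_awk() {
  awk -v RS='<' -F'>' '
    $1 == "title" || $1 == "link" || $1 == "description" { print $2 }
  '
}

printf '%s\n' '<rss><item><title>A</title><link>http://example.com/a</link></item></rss>' |
  extract_with_awk
```

Closing tags such as /title never match because of the leading slash, so only the opening-tag records are printed.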

What about this:
sed -nr 's|>[[:space:]]*<|>\n<|g
h
/^<(title|link|description)>/ {
s:<([^>]+)>([^<]+)</\1>:\2:
P
}
g
D
' feed
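To see what this approach produces, here is a small run against a made-up one-line feed. It is a sketch assuming GNU sed (-r, \n in the replacement), with the P command on its own line; the function name extract_fields and the sample data are illustrative only.

```shell
# Extract title, link and description text in a single sed call.
extract_fields() {
  sed -nr 's|>[[:space:]]*<|>\n<|g
h
/^<(title|link|description)>/ {
  s:<([^>]+)>([^<]+)</\1>:\2:
  P
}
g
D'
}

# h saves the split text, the block prints the stripped first line,
# g restores the saved text and D drops its first line and restarts.
printf '%s\n' '<rss><channel><item><title>First post</title><link>http://example.com/1</link><description>Hello</description></item></channel></rss>' |
  extract_fields
```

Unlike the original two-stage pipeline, this does not restrict the output to elements after the first item tag.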

Related

How can I insert a new line after each character in shell script?

Assuming I have the following string:
abcdefghi
Which command can I use so that the outcome is:
a
b
c
d
e
f
g
h
i
I just started coding so I hope someone can help me.

There is a tool called fold which inserts linebreaks, and you can tell it to add one after every character:
$ fold -w 1 <<< 'abcdefghi'
a
b
c
d
e
f
g
h
i
<<< is used to indicate a here string. If your shell doesn't support that, you can pipe to fold instead:
echo 'abcdefghi' | fold -w 1

You can use sed, although it will add an extra newline after the last letter so you get a blank line at the end:
$ sed 's/./&\
/g' <<<abcdefghi
a
b
c
d
e
f
g
h
i
$
s/old/new/ is the sed "substitute" command. On the old side, the pattern . matches any character at all. On the new side, the symbol & means "whatever the old pattern matched" - we include what we match in the replacement so we are adding things, not removing them.
We want to follow each matched character with a newline, but the newline will terminate the sed command and result in a syntax error unless we put a backslash in front of it.
So we are replacing any character at all (.) with that same character (&) followed by a newline (\ + newline). The g on the end means to replace every occurrence, not just the first one on each line.
The demonstration uses a here-string, which is part of most modern shells but not all; you could also do it with echo abcdefghi | sed '...'.
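If the trailing blank line is unwanted, one option is a second s command that removes the newline added after the last character. This is a sketch rather than part of the original answer; the function name split_chars is made up, and it pipes from printf so it does not depend on here-strings.

```shell
# Put each character of the argument on its own line, without a blank
# line at the end.
split_chars() {
  printf '%s\n' "$1" | sed 's/./&\
/g
s/\n$//'
}

split_chars abcdefghi
```

The second command matches the newline sitting at the very end of the pattern space and deletes it; sed then adds its usual single newline when it prints.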

grep -o . <<< "abcdefghi"

Programming rev in sed

I'm trying to write a utility that reverses lines of input. The following just prints the lines as they are though:
#!/bin/sed -f
#insert newline at the beginning
s/^/\n/
#while the newline hasnt moved to the end of pattern space, rotate
: loop
/\n$/{!s/\(.*\)\(.$\)/\2\1/;!b loop}
#delete the newline
s/\n//
Any ideas on what's wrong?

/\n$/{!s/\(.*\)\(.$\)/\2\1/;!b loop}
The ! normally goes right after an address or range, not in front of a command.
The !b (which I read as "otherwise, goto") should probably be a t (branch if a substitution occurred).
The $ anchor should not be inside the last group but just after it.
So this line becomes:
/\n$/ !{s/\(.*\)\(.\)$/\2\1/;t loop}
Even so, this code ends up doing nothing: it adds a newline at the start and moves it to the end by rotating the last character to the front each time, so it does not reverse anything.
sed 'G
:loop
s/\(.\)\(\n.*\)/\2\1/
t loop
s/.//' YourFile
should do the trick instead.
@TobySpeight improved the code further by removing the need for a first group (code adapted).

Solution 1
$ echo -e '123\n456\n789' |sed -nr '/\n/!G;s/(.)(.*\n)/&\2\1/;/^\n/!D;s/\n//p'
321
654
987
The core ideas:
We need a loop to process each line, and fortunately the D command can simulate one.
We also need to loop within ONE line, which is awkward because sed handles one line per cycle; the s and D commands together can simulate that inner loop.
How do we avoid an infinite loop? We need a flag character to mark the end of each line, and \n is the perfect choice.
The D command deletes everything up to the first \n in the pattern space each time,
and then forces sed to jump back to its first command, which is effectively a loop! So we also need to put a throwaway placeholder in front of the final string for D to delete, and we can simply use the current line's content up to and including the \n.
Explanation:
/\n/!G: if the pattern space already contains \n, we are in the middle of the loop for the current line; otherwise, use G to append \n plus the (empty) hold space to the pattern space. sed strips the trailing \n of every line before putting it into the pattern space, so after G the pattern space holds the original line followed by \n.
s/(.)(.*\n)/&\2\1/;: keep the matched text (everything up to and including \n) as the placeholder, then append the same text minus its first character (\2), followed by that first character (\1).
/^\n/!D;s/\n//p: if the pattern space starts with \n, the line is fully reversed, so s/\n//p deletes the flag character \n and prints the result; otherwise D deletes the placeholder and jumps back to the first command to handle the next character.
To summarize, the contents of the pattern space during the loop are as follows:
123\n [(1)(23\n)] =s=> 123\n23\n1 [(123\n)(23\n)(1)] =D=> 23\n1
23\n1 [(2)(3\n)1] =s=> 23\n3\n21 [(23\n)(3\n)(2)1] =D=> 3\n21
3\n21 [(3)(\n)21] =s=> 3\n\n321 [(3\n)(\n)(3)21] =D=> \n321
\n321 [()(\n)321] =s=> \n321 =!D=> \n321 =s-p=> 321
There are some derived solutions:
Solution 2
The placeholder can be any other string ending with \n:
$ echo -e '123\n456\n789' |sed -nr '/\n/!G;s/(.)(.*\n)/USELESS\n\2\1/;/^\n/!D;s/\n//p'
321
654
987
Solution 3
Use a direct loop instead of obscure D command
$ echo -e '123\n456\n789' |sed -nr '/\n/!G;s/(.)(.*\n)/&\2\1/;Tend;D;:end;s/\n//p'
321
654
987
Solution 4
Use . to match the leading \n
$ echo -e '123\n456\n789' |sed -nr '/\n/!G;s/(.)(.*\n)/&\2\1/;/^\n/!D;s/.//p'
321
654
987
Solution 5
$ echo -e '123\n456\n789' |sed -nr ':loop;/\n/!G;s/(.)(.*\n)/\2\1/;tloop;s/.//p'
321
654
987
This solution is much easier to understand; the contents of the pattern space are as follows:
123\n [(1)(23\n)] =s=> 23\n1 [(23\n)(1)]
23\n1 [(2)(3\n)1] =s=> 3\n21 [(3\n)(2)1]
3\n21 [(3)(\n)21] =s=> \n321 [(\n)(3)21]
\n321 [()(\n)321] =s=> \n321 =s=> 321
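Traces like the one above do not have to be worked out by hand: the l command prints the pattern space unambiguously, showing embedded newlines as \n and marking the end with $. The following sketch reuses the substitution from Solution 5 with an l added at the top of the loop; it assumes GNU sed, and the function name trace_rev is made up.

```shell
# Print the pattern space before every rotation step of the reversal loop.
trace_rev() {
  sed -n 'G
:loop
l
s/\(.\)\(.*\n\)/\2\1/
t loop'
}

printf '123\n' | trace_rev
# prints:
# 123\n$
# 23\n1$
# 3\n21$
# \n321$
```

Only the l output appears because -n suppresses the automatic print; the final s/.//p step is left out so that the leading \n stays visible in the last line of the trace.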

The problem is you are using the wrong tool for the job and trying to understand/use constructs that became obsolete in the mid-1970s when awk was invented.
$ cat file
tsuj
esu
na
etaorporppa
loot
$ awk -v FS= '{rev=""; for (i=1; i<=NF; i++) rev = $i rev; print rev}' file
just
use
an
approproate
tool

Paste corresponding characters from multiple lines together

I'm writing a Linux command that pastes corresponding characters from multiple lines together. For example: I want to change these lines
A---
-B--
---C
--D-
to this:
A----B-----D--C-
So far, I've made this:
cat sanger.a sanger.c sanger.g sanger.t | cut -c 1
This does the trick for only the first column, but it has to work for all the columns.
Is there anyone who can help?
EDIT: This is a better example. I want this:
SUGAR
HONEY
CANDY
to become
SHC UOA GNN AED RYY (without spaces)

Awk way for updated spec
awk -vFS= '{for(i=1;i<=NF;i++)a[i]=a[i]$i}
END{for(i=1;i<=NF;i++)printf "%s",a[i];print ""}' file
Output
A----B-----D--C-
SHCUOAGNNAEDRYY
P.s for a large file this will use lots of memory
A terrible way not using awk; you also need to know the number of fields beforehand.
for i in {1..4};do cut -c $i test | tr -d "\n" ; done;echo

Here's a solution without awk or sed, assuming the file is named f:
paste -s -d "" <(for i in $(seq 1 $(wc -L < f)); do cut -c $i f; done)
wc -L is a GNUism which returns the length of the longest line in the input file, which might not work depending on your version/locale. You could instead find the longest line by doing something like:
awk '{if (length > x) {x = length}} END {print x}' f
Then using this value in the seq command instead of the above command substitution.
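Putting the two pieces together, a version that avoids wc -L, seq and process substitution might look like the sketch below. It is one possible arrangement rather than the answer's exact command: it swaps paste -s for tr -d '\n', and the function name paste_columns and the temporary sample file are made up for illustration. Like the original, it counts with cut -c, so it is only meant for single-byte characters.

```shell
# Concatenate column 1 of every line, then column 2, and so on.
paste_columns() {
  f=$1
  # Length of the longest line, computed with awk instead of wc -L.
  max=$(awk '{ if (length($0) > x) x = length($0) } END { print x + 0 }' "$f")
  i=1
  while [ "$i" -le "$max" ]; do
    cut -c "$i" "$f"      # one character per input line
    i=$((i + 1))
  done | tr -d '\n'       # join everything into a single line
  echo
}

sample=$(mktemp)
printf 'SUGAR\nHONEY\nCANDY\n' > "$sample"
paste_columns "$sample"   # prints SHCUOAGNNAEDRYY
rm -f "$sample"
```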

All right, time for some sed insanity! :D
Disclaimer: If this is for something serious, use something less brittle than this. awk comes to mind. Unless you feel confident enough in your sed abilities to maintain this lunacy.
cat file1 file2 etc | sed -n '1h; 1!H; $ { :loop; g; s/$/\n/; s/\([^\n]\)[^\n]*\n/\1/g; p; g; s/^.//; s/\n./\n/g; h; /[^\n]/ b loop }' | tr -d '\n'; echo
This comes in three parts: Say you have a file foo.txt
12345
67890
abcde
fghij
then
cat foo.txt | sed -n '1h; 1!H; $ { :loop; g; s/$/\n/; s/\([^\n]\)[^\n]*\n/\1/g; p; g; s/^.//; s/\n./\n/g; h; /[^\n]/ b loop }'
produces
16af
27bg
38ch
49di
50ej
After that, tr -d '\n' deletes the newlines, and ;echo adds one at the end.
The heart of this madness is the sed code, which is
1h
1!H
$ {
:loop
g
s/$/\n/
s/\([^\n]\)[^\n]*\n/\1/g
p
g
s/^.//
s/\n./\n/g
h
/[^\n]/ b loop
}
This first follows the basic pattern
1h # if this is the first line, put it in the hold buffer
1!H # if it is not the first line, append it to the hold buffer
$ { # if this is the last line,
do stuff # do stuff. The whole input is in the hold buffer here.
}
which assembles all input in the hold buffer before working on it. Once the whole input is in the hold buffer, this happens:
:loop
g # copy the hold buffer to the pattern space
s/$/\n/ # put a newline at the end
s/\([^\n]\)[^\n]*\n/\1/g # replace every line with only its first character
p # print that
g # get the hold buffer again
s/^.// # remove the first character from the first line
s/\n./\n/g # remove the first character from all other lines
h # put that back in the hold buffer
/[^\n]/ b loop # if there's something left other than newlines, loop
And there you have it. I might just have summoned Cthulhu.

Can you explain this sed one-liner?

The following one liner prints out the content of the file in reverse
$ sed -n '1!G;h;$p' test.txt
How is it possible when sed reads the file line by line? Can you explain the meaning of
n flag
1!
G
h
and $p
in this command?

This will do the same job as tac, i.e. reverse the order of lines.
Rewriting the sed script to pseudocode, it means:
$line_number = 1;
foreach ($input in $input_lines) {
// current input line is in $input
if ($line_number != 1) // 1!
$input = $input + '\n' + $hold; // G
$hold = $input; // h
$line_number++
}
print $input; // $p
As you can see, the sed language is very expressive :-) The 1! and $ are so-called addresses, which set conditions on when a command is run. 1! means not on the first line; $ means on the last line. Sed has one auxiliary memory register, called the hold space.
For more information type info sed on linux console (this is the best documentation).
-n disables the default print $input command in the loop itself.
The terms pattern space and hold space are equivalents of the variables $input and $hold (respectively) in this example.
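The pseudocode can also be made runnable. The sketch below is a rough awk rendition of it, with the variables input and hold standing in for the pattern space and hold space, placed next to the sed one-liner for comparison. The function names are made up, and the awk version is only an illustration of the logic, not how sed is implemented.

```shell
# awk rendition of the pseudocode: build the reversed text line by line.
reverse_lines_awk() {
  awk '
    {
      input = $0                             # current line, like the pattern space
      if (NR != 1) input = input "\n" hold   # 1!G
      hold = input                           # h
    }
    END { if (NR > 0) print input }          # $p
  '
}

# The original one-liner, for comparison.
reverse_lines_sed() { sed -n '1!G;h;$p'; }

printf 'a\nb\nc\n' | reverse_lines_awk
printf 'a\nb\nc\n' | reverse_lines_sed
```

Both calls print c, b, a on separate lines.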

n flag -> Disable auto-printing.
1! -> Any line except the first one.
G -> Append a newline and content of 'hold space' to 'pattern space'
h -> Replace content of 'hold space' with content of 'pattern space'
$ -> Last line.
p -> print
So, it means: Reverse the content of your file, as I understand it.
EDIT to add some explanation (thanks to potong, see his comment for the original one):
Addresses, like 1 and $ are bound to next commands, grouped using {...} or single without them. So in this case 1! applies to G and $ to p, whereas h is not attached to an address and applies to all addresses. That is $!G and $!{G} are the same.
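A quick way to convince yourself that an address binds to the next command or to a whole {...} group is to run both spellings on the same input. This is a minimal sketch with made-up function names; it uses 2! rather than the addresses from the question so the effect is easy to see.

```shell
# Print every line except the second, with and without braces.
skip_second_plain()  { sed -n '2!p'; }
skip_second_braced() { sed -n '2!{p;}'; }

printf 'a\nb\nc\n' | skip_second_plain
printf 'a\nb\nc\n' | skip_second_braced
```

Each call prints a and c.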

How can I swap two lines using sed?

Does anyone know how to replace line a with line b and line b with line a in a text file using the sed editor?
I can see how to replace a line in the pattern space with a line that is in the hold space (i.e., /^Paco/x or /^Paco/g), but what if I want to take the line starting with Paco and replace it with the line starting with Vinh, and also take the line starting with Vinh and replace it with the line starting with Paco?
Let's assume for starters that there is one line with Paco and one line with Vinh, and that the line Paco occurs before the line Vinh. Then we can move to the general case.

#!/bin/sed -f
/^Paco/ {
:notdone
N
s/^\(Paco[^\n]*\)\(\n\([^\n]*\n\)*\)\(Vinh[^\n]*\)$/\4\2\1/
t
bnotdone
}
After matching /^Paco/ we read into the pattern buffer until s// succeeds (or EOF: the pattern buffer will be printed unchanged). Then we start over searching for /^Paco/.
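As a usage sketch, the same script can be wrapped in a shell function and fed a few made-up lines. This assumes GNU sed, since [^\n] inside a bracket expression is treated as "not a newline" there; the function name and the sample names other than Paco and Vinh are hypothetical.

```shell
# Swap the line starting with Paco and the later line starting with Vinh,
# leaving any lines between them in place.
swap_paco_vinh() {
  sed '/^Paco/ {
:notdone
N
s/^\(Paco[^\n]*\)\(\n\([^\n]*\n\)*\)\(Vinh[^\n]*\)$/\4\2\1/
t
b notdone
}'
}

printf '%s\n' 'Anna 1' 'Paco 2' 'Lena 3' 'Vinh 4' 'Zoe 5' | swap_paco_vinh
```

The output is Anna 1, Vinh 4, Lena 3, Paco 2, Zoe 5, one per line: N keeps appending lines until the last one starts with Vinh, at which point the substitution swaps the two ends and t jumps to the end of the script.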

cat input | tr '\n' 'ç' | sed 's/\(ç__firstline__\)\(ç__secondline__\)/\2\1/g' | tr 'ç' '\n' > output
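Replace __firstline__ and __secondline__ with your desired regexps. Be sure to substitute any instances of . in your regexp with [^ç]. If your text actually has ç in it, substitute with something else that your text doesn't have. Note that some tr implementations (GNU tr among them) work on bytes, so in a UTF-8 locale a single-byte placeholder is a safer choice than ç.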

try this awk script.
s1="$1"
s2="$2"
awk -vs1="$s1" -vs2="$s2" '
{ a[++d]=$0 }
$0~s1{ h=$0;ind=d}
$0~s2{
a[ind]=$0
for(i=1;i<d;i++ ){ print a[i]}
print h
delete a;d=0;
}
END{ for(i=1;i<=d;i++ ){ print a[i] } }' file
output
$ cat file
1
2
3
4
5
$ bash test.sh 2 3
1
3
2
4
5
$ bash test.sh 1 4
4
2
3
1
5
Use sed (or not at all) for only simple substitution. Anything more complicated, use a programming language

A simple example from the GNU sed texinfo doc:
Note that on implementations other than GNU `sed' this script might
easily overflow internal buffers.
#!/usr/bin/sed -nf
# reverse all lines of input, i.e. first line became last, ...
# from the second line, the buffer (which contains all previous lines)
# is *appended* to current line, so, the order will be reversed
1! G
# on the last line we're done -- print everything
$ p
# store everything on the buffer again
h
