'N' and 'D' not working as expected with sed - linux

sed 'N; D' testfile
testfile contains:
this is the first line
this is the second line
this is the third line
this is the fourth line
I am using RHEL 6 and the output comes as:
this is the fourth line
As per my understanding, N just pulls in the next line into the pattern space and D deletes just the first line of the pattern space. Therefore, the output should have been:
this is the second line
this is the fourth line
Can someone please explain why the output is coming as mentioned above?

According to the documentation:
D
If pattern space contains no newline, start a normal new cycle as if the d command was issued. Otherwise, delete text in the pattern space up to the first newline, and restart cycle with the resultant pattern space, without reading a new line of input.
(Emphasis mine.)
It sounds like this would restart your sed program from the beginning, reading and deleting lines until it runs out of input, at which point only the last line is left in the buffer.
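To see that restart behaviour in isolation, here is a minimal sketch that rebuilds the question's four-line file in a temporary location (the variable names are made up for illustration). It assumes GNU sed, which prints the pattern space when N runs out of input; other seds may print nothing at that point.

```shell
# Recreate the four-line test file from the question in a temporary file.
testfile=$(mktemp)
printf '%s\n' \
  'this is the first line' \
  'this is the second line' \
  'this is the third line' \
  'this is the fourth line' > "$testfile"

# N appends the next line; D drops the first line and restarts the script
# WITHOUT reading new input, so N runs again on whatever is left.
# Only when N finds no more input does GNU sed print the pattern space.
nd_out=$(sed 'N; D' "$testfile")
printf '%s\n' "$nd_out"

rm -f "$testfile"
```

Each D leaves exactly one line in the pattern space, so by the time the input is exhausted only the last line remains to be printed.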

As already shown, D jumps back to the beginning of the program. You can, however, use the following to print the even lines:
sed -n 'n;p'
and to print odds:
sed 'n;d'
In GNU sed you can also use:
sed '0~2!d' # Even
sed '1~2!d' # Odd
An alternative can be something like:
N;s/^[^\n]*\n//
which will read the next line into the pattern space and then substitute the first line away.
One might ask why D behaves this way. One reason is to make things like this possible, working with multiple lines in the pattern space:
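As a quick check of these alternatives, the sketch below runs them on a made-up four-line input (one, two, three, four). It assumes GNU sed, since [^\n] inside a bracket expression is not portable.

```shell
# Hypothetical four-line sample input.
lines=$(printf '%s\n' one two three four)

# -n 'n;p': read the next line without printing the current one, then print it.
even_np=$(printf '%s\n' "$lines" | sed -n 'n;p')

# 'n;d': auto-print the current line, read the next one, delete it.
odd_nd=$(printf '%s\n' "$lines" | sed 'n;d')

# 'N;s/^[^\n]*\n//': join two lines, then strip the first of the pair.
even_ns=$(printf '%s\n' "$lines" | sed 'N;s/^[^\n]*\n//')

printf 'even (n;p):\n%s\n' "$even_np"
printf 'odd (n;d):\n%s\n' "$odd_nd"
printf 'even (N;s):\n%s\n' "$even_ns"
```

With an even number of input lines the first and third commands give the same result; with an odd number, GNU sed's N prints the unpaired last line, so they can differ.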
$!N;/\npattern$/d;P;D
The above will delete lines matching pattern as well as the line before.
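A small sketch of that P;D idiom on made-up input (keep1, before, pattern, keep2), assuming GNU sed. Because the regex is anchored as \npattern$, the second line of the window must be exactly "pattern" here.

```shell
# $!N   : append the next line (except on the last line), giving a two-line window
# /../d : if the second line of the window is "pattern", drop both lines
# P     : otherwise print the first line of the window
# D     : remove that first line and restart with the remainder
pd_out=$(printf '%s\n' keep1 before pattern keep2 |
  sed '$!N;/\npattern$/d;P;D')
printf '%s\n' "$pd_out"
```

The restart performed by D is what keeps the window sliding one line at a time instead of consuming the input in fixed pairs.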

Related

Simple way to remove multi-line string using sed

Using sed, is there a way to remove multiple lines from a text file based on some starting and ending expressions?
I have known markers in the file and want to remove everything between (markers inclusive). I have seen some really complicated solutions and I would like to do this without resorting to micro commands.
My file looks something like this:
cat /tmp/foobar.txt
this is line 1
this is line 3
tomcat.util.scan.StandardJarScanFilter.jarsToSkip=\
annotations-api.jar,\
ant-junit*.jar,\
ant-launcher.jar,\
ant.jar,\
asm-*.jar,\
aspectj*.jar,\
bootstrap.jar,\
catalina-ant.jar,\
catalina-ha.jar,\
catalina-ssi.jar,\
catalina-storeconfig.jar
the end leave me
and me
I want to remove everything starting at tomcat.util all the way to the last .jar
tl;dr
I think this is the simplest way, and there is no need for the assembly-like micro commands
sed '/^tomcat\.util.*$/,/^.*[^\]$/d' /tmp/foobar.txt
which produces
this is line 1
this is line 3
the end leave me
and me
If you want to remove the lines from the file itself rather than write the result to stdout, use the in-place flag (-i):
sed -i '/^tomcat\.util.*$/,/^.*[^\]$/d' /tmp/foobar.txt
So... how does this work?
sed commands, like vi commands, operate on an address. Normally we don't specify an address, which simply applies the command to every line of the file. For example, to replace "the" with "that" in a file we'd normally do
sed -i 's/the/that/g' /tmp/foobar.txt
ie applying the substitute or s command to all lines in the file.
In this case you want to delete some lines so we can use the delete or d command. But we need to tell it where to delete. So we need to give it an address.
The format of a sed command is
[addr][!]command[options]
(see the docs)
If no address is specified then the command is applied to all lines; if the ! is specified then it is applied to all lines that don't match the address. So far so good.
The trick here is that addr can be a single address or a range of addresses. An address can be a line number or a regex pattern. You use a , between two addresses to specify a range.
so to delete line 5 to 8 inclusive you could do
sed -i '5,8d' /tmp/foobar.txt
In this case, rather than knowing the line numbers, we know some "markers" and can use regexes instead. The first marker, a line starting with tomcat.util, is found by the regex
/^tomcat\.util.*$/
The second marker is a bit more tricky but if we look we can see that the final line to remove is the first one that does not end with a \, so we can match a line that consists of "anything but does not end with \"
/^.*[^\]$/
On its own, the second regex would match a whole bunch of lines. In a range, however, the second "address" is only the first line after the start of the range that matches the regex.
Putting that all together, we want to delete (d) all lines in the range from the address that is found by the regex matching a line starting with tomcat.util and ending with a line that does not end in \ ie
sed '/^tomcat\.util.*$/,/^.*[^\]$/d' /tmp/foobar.txt
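To try the range address without touching a real file, here is a minimal sketch using a shortened, made-up sample written to a temporary file. The file contents and variable names are assumptions for illustration only.

```shell
# Shortened, hypothetical version of the properties file.
# The quoted heredoc delimiter keeps the trailing backslashes literal.
sample=$(mktemp)
cat > "$sample" <<'EOF'
keep 1
tomcat.util.scan.jarsToSkip=\
a.jar,\
b.jar
keep 2
EOF

# Range: from the line starting with tomcat.util to the first later line
# that does not end in a backslash; d deletes every line in the range.
range_out=$(sed '/^tomcat\.util.*$/,/^.*[^\]$/d' "$sample")
printf '%s\n' "$range_out"

rm -f "$sample"
```

Note that the end regex is only tested on lines after the start line, which is why the first "keep" line does not terminate anything.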
hope that helps ;-)
Cheers
Karl
Awk is generally more useful than sed for anything spanning lines. Using any awk in any shell on every Unix box:
$ awk '!/\.jar/{f=0} /tomcat\.util/{f=1} !f' file
this is line 1
this is line 3
the end leave me
and me
This might work for you (GNU sed):
sed -n '/tomcat\.util/{:a;n;/\.jar/ba};p' file
Turn off implicit printing using the -n option.
Match on a line containing tomcat.util.
Keep fetching lines for as long as they contain .jar, stopping at the first line that does not.
Print all other lines.
Alternative:
sed -E '/tomcat\.util/{:a;$!N;/\.jar(,\\)?$/s/\n//;ta;D}' file
Gather up lines beginning tomcat.util and ending either .jar,\ or .jar, removing newlines until the end-of-file or a mis-match and then delete the collection.

Get the first of many similar strings

I have strings of the form
A-XXX
A-YYY
B-NNN
A-ZZZ
B-MMM
C-DDD
However, I want to get the first occurrence of every string before the hyphen. So the solution here would be:
A-XXX
B-NNN
C-DDD
How can I do this with bash tools? I tried uniq, but I can't set the "similarity-pattern" there.
Will this suffice?
cat uwe
A-XXX
A-YYY
B-NNN
A-ZZZ
B-MMM
C-DDD
$ awk -F'-' '!a[$1]{print $0;a[$1]++}' uwe
A-XXX
B-NNN
C-DDD
EDIT:
One can actually shorten that to the slightly more cryptic:
$ awk -F'-' '!a[$1]++' uwe
A-XXX
B-NNN
C-DDD
-F'-' tells awk that - is the field separator. !a[$1]++ is a condition with no action, so awk applies the default action (print) whenever it is true. The condition is true only while a[$1] is still zero, that is, the first time a key is seen; the post-increment then records that the key has been seen.
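To make the counter visible, the following sketch prints the value of a[$1] before each increment, next to the shortened one-liner. The seen_before label is made up for illustration.

```shell
# Same sample data as in the question.
input=$(printf '%s\n' A-XXX A-YYY B-NNN A-ZZZ B-MMM C-DDD)

# Show the per-key counter as it was BEFORE the post-increment on each line.
trace=$(printf '%s\n' "$input" |
  awk -F'-' '{ printf "%s seen_before=%d\n", $0, a[$1]++ }')
printf '%s\n' "$trace"

# The one-liner prints exactly the lines whose counter was still 0.
first_only=$(printf '%s\n' "$input" | awk -F'-' '!a[$1]++')
printf '%s\n' "$first_only"
```

Lines reported with seen_before=0 are the ones that !a[$1]++ lets through.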
This might work for you (GNU sed):
sed -n '1!G;/^\([^-]*-\).*\n\1/!P;h' file
The general idea is to compare the current line with all previous lines and by using pattern matching, only print the current line if there is no match on a previous key.
The first line will always be printed. From the second line onwards, the previous line(s) are appended to the current line using the G command, and the first (i.e. current) line is printed using the P command only if there is no key match, which is what the /^\([^-]*-\).*\n\1/! address tests. The current line and the appended line(s) are then stored in the hold space, using the h command, ready for the next line.
N.B. The key is defined by the characters from the start of a line, up to and including the character -. Thus the regexp ^[^-]*- matches such a key. Also note that the key is collected as a group \(...\) and later referenced as \1; this allows a string of characters to be referred to at a later point in the same regexp. In this case the key at the start of the current line is matched against any such key in the previous lines.

using sed script to combine part of current line with part of next line every other line

So what I want to do is combine the first part of one line with the first part of the next, separated by a colon, for every other line.
The input data is below and I am struggling to make it work.
This is what I want it to look like (WANT THIS):
Albania:Armenia
Angola:Antarctica
Argentina:American Samoa
This is the input:
Albania,EU,http://en.wikipedia.org/wiki/Albania
Armenia,AS,http://en.wikipedia.org/wiki/Armenia
Angola,AF,http://en.wikipedia.org/wiki/Angola
Antarctica,AN,http://en.wikipedia.org/wiki/Antarctica
Argentina,SA,http://en.wikipedia.org/wiki/Argentina
American Samoa,OC,http://en.wikipedia.org/wiki/American_Samoa
Austria,EU,http://en.wikipedia.org/wiki/Austria
Australia,OC,http://en.wikipedia.org/wiki/Australia
Aruba,NA,http://en.wikipedia.org/wiki/Aruba
Azerbaijan,AS,http://en.wikipedia.org/wiki/Azerbaijan
Bosnia and Herzegovina,EU,http://en.wikipedia.org/wiki/Bosnia_and_Herzegovina
Barbados,NA,http://en.wikipedia.org/wiki/Barbados
Bangladesh,AS,http://en.wikipedia.org/wiki/Bangladesh
Belgium,EU,http://en.wikipedia.org/wiki/Belgium
Burkina Faso,AF,http://en.wikipedia.org/wiki/Burkina_Faso
Bulgaria,EU,http://en.wikipedia.org/wiki/Bulgaria
Bahrain,AS,http://en.wikipedia.org/wiki/Bahrain
Burundi,AF,http://en.wikipedia.org/wiki/Burundi
Benin,AF,http://en.wikipedia.org/wiki/Benin
Saint Barthelemy,NA,http://en.wikipedia.org/wiki/Saint_Barthelemy
What I have come up with so far is this, using N to get the next line, and it partially works. The "junk" of the first name is gone but the "junk" of the 2nd is still there. (This is a sed script and I must have a sed script that runs with all the other sed scripts so no awk or anything)
s/,..,.+//
{N
s/\n/:/
p
}
My attempt produces this output:
Azerbaijan:Bosnia and Herzegovina,EU,http://en.wikipedia.org/wiki/Bosnia_and_Herzegovina
Barbados:Bangladesh,AS,http://en.wikipedia.org/wiki/Bangladesh
Belgium:Burkina Faso,AF,http://en.wikipedia.org/wiki/Burkina_Faso
Bulgaria:Bahrain,AS,http://en.wikipedia.org/wiki/Bahrain
Burundi:Benin,AF,http://en.wikipedia.org/wiki/Benin
s/,.*//;N;s/\n/:/;s/,.*//
Remove everything after comma, append next line, replace newline with colon, remove everything after comma.
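Since the question asks for a sed script file, here is a sketch of the same four commands written to a temporary script and run with -f on the first four input lines. The temporary file and variable names are assumptions for illustration.

```shell
# Write the commands to a (temporary) sed script, one per line.
script=$(mktemp)
cat > "$script" <<'EOF'
s/,.*//
N
s/\n/:/
s/,.*//
EOF

# First four lines of the input from the question.
pairs=$(printf '%s\n' \
  'Albania,EU,http://en.wikipedia.org/wiki/Albania' \
  'Armenia,AS,http://en.wikipedia.org/wiki/Armenia' \
  'Angola,AF,http://en.wikipedia.org/wiki/Angola' \
  'Antarctica,AN,http://en.wikipedia.org/wiki/Antarctica' |
  sed -f "$script")
printf '%s\n' "$pairs"

rm -f "$script"
```

The second s/,.*// is the step the original attempt was missing: after N the pattern space holds the trimmed first name plus the whole second line, so the trailing fields have to be removed again.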

Ignore spaces, tabs and new line in SED

I tried to replace a string in a file that contains tabs and line breaks.
the command in the shell file looked something like this:
FILE="/Somewhere"
STRING_OLD="line 1[ \t\r\n]*line 2"
sed -i 's/'"$STRING_OLD"'/'"$STRING_NEW"'/' $FILE
If I manually remove the line breaks and the tabs and leave only the spaces, then the replacement succeeds. But if I leave the line breaks in, sed is unable to locate $STRING_OLD and cannot replace it with the new string.
thanks in advance
Kobi
sed reads lines one at a time, and usually lines are also processed one at a time, as they are read. However, sed does have facilities for reading additional lines and operating on the combined result. There are several ways that could be applied to your problem, such as:
FILE="/Somewhere"
STRING_OLD="line 1[ \t\r\n]*line 2"
sed -n "1h;2,\$H;\${g;s/$STRING_OLD/$STRING_NEW/g;p}" "$FILE"
That does more or less what you describe doing manually: it concatenates all the lines of the file (but keeps newlines), and then performs the substitution on the overall buffer, all at once. That does assume, however, either that the file is short (POSIX does not require it to work if the overall file length exceeds 8192 bytes) or that you are using a sed that does not have buffer-size limitations, such as GNU sed. Since you tagged Linux, I'm supposing that GNU sed can be assumed.
In detail:
the -n option turns off line echoing, because we save everything up and print the modified text in one chunk at the end.
there are multiple sed commands, separated by semicolons, and with literal $ characters escaped (for the shell):
1h: when processing the first line of input, replace the "hold space" with the contents of the pattern space (i.e. the first line, excluding newline)
2,\$H: when processing any line from the second through the last, append a newline to the hold space, then the contents of the pattern space
\${g;s/$STRING_OLD/$STRING_NEW/g;p}: when processing the last line, perform this group of commands: copy the hold space into the pattern space; perform the substitution, globally; print the resulting contents of the pattern space.
That's one of the simpler approaches, but if you need to accommodate seds that are not as capable as GNU's with regard to buffer capacity then there are other ways to go about it. Those start to get ugly, though.
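A self-contained sketch of the slurp-then-substitute approach, assuming GNU sed (for \t, \r and \n inside the bracket expression). The temporary file, its contents and the value REPLACED for STRING_NEW are made up for illustration.

```shell
# Hypothetical input: "line 1" and "line 2" are separated by a newline,
# a tab and a space.
slurp_file=$(mktemp)
printf 'before\nline 1\n\t line 2\nafter\n' > "$slurp_file"

STRING_OLD="line 1[ \t\r\n]*line 2"
STRING_NEW="REPLACED"   # assumed replacement text

# 1h    : first line starts the hold space
# 2,$H  : every later line is appended to it
# ${..} : on the last line, pull everything back and substitute once
slurp_out=$(sed -n "1h;2,\$H;\${g;s/$STRING_OLD/$STRING_NEW/g;p}" "$slurp_file")
printf '%s\n' "$slurp_out"

rm -f "$slurp_file"
```

As with any pattern interpolated into a sed script, this only behaves if neither variable contains an unescaped / or other characters that are special to sed.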

The Concept of 'Hold space' and 'Pattern space' in sed

I'm confused by the two concepts in sed: hold space and pattern space. Can someone help explain them?
Here's a snippet of the manual:
h H Copy/append pattern space to hold space.
g G Copy/append hold space to pattern space.
n N Read/append the next line of input into the pattern space.
These six commands really confuse me.
When sed reads a file line by line, the line that has been currently read is inserted into the pattern buffer (pattern space). Pattern buffer is like the temporary buffer, the scratchpad where the current information is stored. When you tell sed to print, it prints the pattern buffer.
Hold buffer / hold space is like a long-term storage, such that you can catch something, store it and reuse it later when sed is processing another line. You do not directly process the hold space, instead, you need to copy it or append to the pattern space if you want to do something with it. For example, the print command p prints the pattern space only. Likewise, s operates on the pattern space.
Here is an example:
sed -n '1!G;h;$p'
(the -n option suppresses automatic printing of lines)
There are three commands here: 1!G, h and $p. 1!G has an address, 1 (first line), but the ! means that the command will be executed everywhere but on the first line. $p on the other hand will only be executed on the last line. So what happens is this:
1. The first line is read and inserted automatically into the pattern space.
2. On the first line, the first command is not executed; h copies the first line into the hold space.
3. Now the second line replaces whatever was in the pattern space.
4. On the second line, first we execute G, appending the contents of the hold buffer to the pattern buffer, separating it by a newline. The pattern space now contains the second line, a newline, and the first line.
5. Then the h command copies the concatenated contents of the pattern buffer into the hold space, which now holds lines two and one in reversed order.
6. We proceed to line number three: go to point (3) above.
Finally, after the last line has been read and the hold space (containing all the previous lines in reverse order) has been appended to the pattern space, the pattern space is printed with p. As you have guessed, the above does exactly what the tac command does: it prints the file in reverse.
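As a further, smaller illustration of the same h and G mechanics, the sketch below swaps each pair of adjacent lines. It is not part of the original answer; it assumes GNU sed and an even number of input lines.

```shell
# h : copy the current line into the hold space
# n : read the next line into the pattern space (nothing printed because of -n)
# G : append the held (previous) line after the current one
# p : print the pair, now in swapped order
swap_out=$(printf '%s\n' a b c d | sed -n 'h;n;G;p')
printf '%s\n' "$swap_out"
```

The hold space is the only way the first line of each pair survives the n command, which is the "long-term storage" role described above.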
@Ed Morton: I disagree with you here. I found sed very useful and simple (once you grok the concept of the pattern and hold buffers) for coming up with an elegant way to do multiline grepping.
For example, let's take a text file that has hostnames and some information about each host, with lots of junk in between that I don't care about.
Host: foo1
some junk, doesnt matter
some junk, doesnt matter
Info: about foo1 that I really care about!!
some junk, doesnt matter
some junk, doesnt matter
Info: a second line about foo1 that I really care about!!
some junk, doesnt matter
some junk, doesnt matter
Host: foo2
some junk, doesnt matter
Info: about foo2 that I really care about!!
some junk, doesnt matter
some junk, doesnt matter
To me, an awk script to just get the lines with the hostname and the corresponding info line would take a bit more than what I'm able to do with sed:
sed -n '/Host:/{h}; /Info/{x;p;x;p;}' myfile.txt
output looks like:
Host: foo1
Info: about foo1 that I really care about!!
Host: foo1
Info: a second line about foo1 that I really care about!!
Host: foo2
Info: about foo2 that I really care about!!
(Note that Host: foo1 appears twice in the output.)
Explanation:
-n disables output unless explicitly printed
first match, finds and puts the Host: line into hold buffer (h)
second match, finds the next Info: line, but first exchanges (x) current line in pattern buffer with hold buffer, and prints (p) the Host: line, then re-exchanges (x) and prints (p) the Info: line.
Yes, this is a simplistic example, but I suspect this is a common issue that was quickly dealt with by a simple sed one-liner. For much more complex tasks, such as ones in which you cannot rely on a given, predictable sequence, awk may be better suited.
Although @January's answer and the example are nice, the explanation was not enough for me. I had to search and learn a lot until I managed to understand how exactly sed -n '1!G;h;$p' works. So I'd like to elaborate on the command for someone like me.
First of all, let's see what the command does.
$ echo {a..d} | tr ' ' '\n' # Prints from 'a' to 'd' in each line
a
b
c
d
$ echo {a..d} | tr ' ' '\n' | sed -n '1!G;h;$p'
d
c
b
a
It reverses the input like tac command does.
sed reads line-by-line, so let's see what happens to the pattern space and the hold space at each line. As the h command copies the contents of the pattern space to the hold space, both spaces hold the same text after each line.
Read line   Pattern Space / Hold Space   Command executed
----------------------------------------------------------
a           a$                           h
b           b\na$                        1!G;h
c           c\nb\na$                     1!G;h
d           d\nc\nb\na$                  1!G;h;$p
At the last line, $p prints d\nc\nb\na$ which is formatted to
d
c
b
a
If you want to see the pattern space for each line, you can add an l command.
$ echo {a..d} | tr ' ' '\n' | sed -n '1!G;h;l;$p'
a$
b\na$
c\nb\na$
d\nc\nb\na$
d
c
b
a
I found it very helpful to watch the video tutorial "Understanding how sed works", as the presenter shows step by step how each space is used. The hold space is covered in the 4th tutorial, but I recommend watching all the videos if you are not familiar with sed.
The GNU sed documentation and Bruce Barnett's sed tutorial are also very good references.

Resources