Ignore spaces, tabs and new line in SED - linux

I tried to replace a string in a file that contains tabs and line breaks.
The command in the shell script looked something like this:
FILE="/Somewhere"
STRING_OLD="line 1[ \t\r\n]*line 2"
sed -i 's/'"$STRING_OLD"'/'"$STRING_NEW"'/' $FILE
If I manually remove the line breaks and tabs and leave only spaces, the replacement succeeds. But if I leave the line breaks in, sed cannot find $STRING_OLD and so does not replace it with the new string.
thanks in advance
Kobi

sed reads lines one at a time, and usually lines are also processed one at a time, as they are read. However, sed does have facilities for reading additional lines and operating on the combined result. There are several ways that could be applied to your problem, such as:
FILE="/Somewhere"
STRING_OLD="line 1[ \t\r\n]*line 2"
sed -n "1h;2,\$H;\${g;s/$STRING_OLD/$STRING_NEW/g;p}" "$FILE"
That does more or less what you describe doing manually: it concatenates all the lines of the file (but keeps newlines), and then performs the substitution on the overall buffer, all at once. That does assume, however, either that the file is short (POSIX does not require it to work if the overall file length exceeds 8192 bytes) or that you are using a sed that does not have buffer-size limitations, such as GNU sed. Since you tagged Linux, I'm supposing that GNU sed can be assumed.
In detail:
the -n option turns off line echoing, because we save everything up and print the modified text in one chunk at the end.
there are multiple sed commands, separated by semicolons, and with literal $ characters escaped (for the shell):
1h: when processing the first line of input, replace the "hold space" with the contents of the pattern space (i.e. the first line, excluding newline)
2,\$H: when processing any line from the second through the last, append a newline to the hold space, then the contents of the pattern space
\${g;s/$STRING_OLD/$STRING_NEW/g;p}: when processing the last line, perform this group of commands: copy the hold space into the pattern space; perform the substitution, globally; print the resulting contents of the pattern space.
That's one of the simpler approaches, but if you need to accommodate seds that are not as capable as GNU's with regard to buffer capacity then there are other ways to go about it. Those start to get ugly, though.
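As a minimal sketch of the hold-space approach above, assuming GNU sed: the temporary file, its sample contents and the replacement string "joined" are made up for illustration and are not part of the question.

```shell
# Hypothetical sample data: "line 1" and "line 2" separated by a tab and a newline.
tmp=$(mktemp)
printf 'before\nline 1\t\nline 2\nafter\n' > "$tmp"

STRING_OLD='line 1[ \t\r\n]*line 2'
STRING_NEW='joined'

# 1h     : first line goes into the hold space
# 2,$H   : every later line is appended to the hold space
# ${...} : on the last line, copy everything back, substitute, print
result=$(sed -n "1h;2,\$H;\${g;s/$STRING_OLD/$STRING_NEW/g;p}" "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

With GNU sed the three lines printed should be before, joined and after. Adding -i to edit the file in place is a separate choice and is left out of the sketch.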

Related

Simple way to remove multi-line string using sed

Using sed, is there a way to remove multiple lines from a text file based on some starting and ending expressions?
I have known markers in the file and want to remove everything between (markers inclusive). I have seen some really complicated solutions and I would like to do this without resorting to micro commands.
My file looks something like this:
cat /tmp/foobar.txt
this is line 1
this is line 3
tomcat.util.scan.StandardJarScanFilter.jarsToSkip=\
annotations-api.jar,\
ant-junit*.jar,\
ant-launcher.jar,\
ant.jar,\
asm-*.jar,\
aspectj*.jar,\
bootstrap.jar,\
catalina-ant.jar,\
catalina-ha.jar,\
catalina-ssi.jar,\
catalina-storeconfig.jar
the end leave me
and me
I want to remove everything starting at tomcat.util all the way to the last .jar
tldr;
I think this is the simplest way, and there is no need for the assembly-like micro commands
sed '/^tomcat\.util.*$/,/^.*[^\]$/d' /tmp/foobar.txt
which produces
this is line 1
this is line 3
the end leave me
and me
If you want to remove the lines from the file itself rather than write the result to stdout, use the in-place flag (-i):
sed -i '/^tomcat\.util.*$/,/^.*[^\]$/d' /tmp/foobar.txt
So... how does this work?
sed commands, like vi commands, operate on an address. Normally we don't specify an address, and the command simply applies to all lines of the file. For example, when replacing "the" with "that" in a file we'd normally do
sed -i 's/the/that/g' /tmp/foobar.txt
ie applying the substitute or s command to all lines in the file.
In this case you want to delete some lines so we can use the delete or d command. But we need to tell it where to delete. So we need to give it an address.
The format of a sed command is
[addr][!]command[options]
(see the docs )
If no address is specified then the command is applied to all lines. If the ! is specified then it is applied to all lines that don't match the address. So far so good.
The trick here is that addr can be a single address or a range of addresses. The address can be a line number or a regex pattern. You use a , between two addresses to specify a range.
so to delete line 5 to 8 inclusive you could do
sed -i '5,8d' /tmp/foobar.txt
in this case rather than knowing the line number we know some "markers" and we can use Regex instead, so the first marker, a line starting with tomcat.util is found by the regex
/^tomcat\.util.*$/
The second marker is a bit more tricky but if we look we can see that the final line to remove is the first one that does not end with a \, so we can match a line that consists of "anything but does not end with \"
/^.*[^\]$/
On its own, the second regex would match many lines in the file. Inside a range, though, sed only starts looking for it after the first address has matched, so the range ends at the first following line that matches the second regex.
Putting that all together, we want to delete (d) all lines in the range from the address that is found by the regex matching a line starting with tomcat.util and ending with a line that does not end in \ ie
sed '/^tomcat\.util.*$/,/^.*[^\]$/d' /tmp/foobar.txt
hope that helps ;-)
Cheers
Karl
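To see the range address in action, here is a small sketch on a shortened copy of the sample file. The temporary file and the line numbers 3 to 5 are specific to this made-up copy, not to the original /tmp/foobar.txt.

```shell
# Recreate a shortened version of the sample file in a temporary location.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
this is line 1
this is line 3
tomcat.util.scan.StandardJarScanFilter.jarsToSkip=\
annotations-api.jar,\
catalina-storeconfig.jar
the end leave me
and me
EOF

# Regex range: from the tomcat.util line to the first later line
# that does not end in a backslash.
by_regex=$(sed '/^tomcat\.util.*$/,/^.*[^\]$/d' "$tmp")

# Line-number range: in this shortened copy the block occupies lines 3 to 5.
by_number=$(sed '3,5d' "$tmp")

printf '%s\n' "$by_regex"
rm -f "$tmp"
```

Both commands should leave the same four lines, which shows that the regex pair is just another way of naming the start and end of the range.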
Awk is generally more useful than sed for anything spanning lines. Using any awk in any shell on every Unix box:
$ awk '!/\.jar/{f=0} /tomcat\.util/{f=1} !f' file
this is line 1
this is line 3
the end leave me
and me
This might work for you (GNU sed):
sed -n '/tomcat\.util/{:a;n;/\.jar/ba};p' file
Turn off implicit printing using the -n option.
Match on a line containing tomcat.util.
Keep fetching lines for as long as the fetched line contains .jar.
Print all other lines.
Alternative:
sed -E '/tomcat\.util/{:a;$!N;/\.jar(,\\)?$/s/\n//;ta;D}' file
Gather up lines beginning tomcat.util and ending either .jar,\ or .jar, removing newlines until the end-of-file or a mis-match and then delete the collection.

How to echo/print actual file contents on a unix system

I would like to see the actual file contents without it being formatted to print. For example, to show:
\n0.032,170\n0.034,290
Instead of:
0.032,170
0.34,290
Is there a command to echo the file's actual data in bash? I've tried using head, cat, more, etc. but all those seem to echo the "print-formatted" text. For example:
$ cat example.csv
0.032,170
0.34,290
How can I print the actual characters within the file?
This reads as if you misunderstand what the "actual characters in the file" are. You will not find the characters \ and n in that file, only a line feed, which is a single specific character. So utilities like cat do output exactly the characters in the file.
Putting it the other way around: if you really had those two characters literally in the file, then a utility like cat would actually output them. I just checked that, just to be sure.
You can easily check that yourself if you open the file using a hexeditor. There you will see the character 0A (decimal 10) which is a line feed character. You will not see the pair of the two characters \ and n somewhere in that file.
Many programming languages and also shell environments use escape sequences like \n in string definitions and identify those as control characters which would not be typable otherwise. So maybe that is where your impression comes from that your files should contain those two characters.
To display newlines as \n, you might try:
awk 1 ORS='\\n' input-file
This is not the "actual characters in the file", as \n is merely a conventional method of displaying a newline, but this does seem to be what you want.
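If no hex editor is at hand, the standard od utility gives the same kind of view. The sketch below assumes a temporary file holding the two sample rows; the file is created only for the demonstration.

```shell
# Two CSV rows separated by real line feed characters (byte 0x0A).
sample=$(mktemp)
printf '0.032,170\n0.34,290\n' > "$sample"

# od -c prints every byte; each line feed is displayed as the escape \n.
od -c "$sample"

# Byte count: 9 + 1 + 8 + 1 = 19, so each newline occupies a single byte.
bytes=$(wc -c < "$sample" | tr -d ' ')
printf '%s bytes\n' "$bytes"

# The awk one-liner from above: print records with a literal \n as separator.
escaped=$(awk 1 ORS='\\n' "$sample")
printf '%s\n' "$escaped"
rm -f "$sample"
```

The byte count is the useful check here: if the file really contained a backslash followed by n, each line break would cost two bytes instead of one.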

Why does a part of this variable get replaced when combining it with a string?

I have the following Bash script which loops through the lines of a file:
INFO_FILE=playlist-info-test.txt
line_count=$(wc -l $INFO_FILE | awk '{print $1}')
for ((i=1; i<=$line_count; i++))
do
    current_line=$(sed "${i}q;d" $INFO_FILE)
    CURRENT_PLAYLIST_ORIG="$current_line"
    input_file="$CURRENT_PLAYLIST_ORIG.mp3"
    echo $input_file
done
This is a sample of the playlist-info-test.txt file:
Playlist 1
Playlist2
Playlist 3
The output of the script should be as follows:
Playlist 1.mp3
Playlist2.mp3
Playlist 3.mp3
However, I am getting the following output:
.mp3list 1
.mp3list2
.mp3list 3
I have spent a few hours on this and can't understand why the ".mp3" part is being moved to the front of the string. I initially thought it was because of the space in the lines of the input file, but removing the space doesn't make a difference. I also tried using a while loop with read line and the input file redirected into it, but that does not make any difference either.
I copied the playlist-info-test.txt contents and the script, and get the output you expected. Most likely there are non-printable characters in your playlist-info-test.txt or script which are messing up the processing. Check the binary contents of both files using for example xxd -g 1 and look for non-newline (0a) non-printing characters.
Did the file come from Windows? DOS and Windows end their lines with carriage return (hex 0d, sometimes represented as \r) followed by linefeed (hex 0a, sometimes represented as \n). Unix just uses linefeed, and so tends to treat the carriage return as part of the content of the line. In your case, it winds up at the end of the current_line variable, so input_file winds up something like "Playlist 1\r.mp3". When you print this to the terminal, the carriage return makes it go back to the beginning of the line (that's what carriage return means), so it prints as:
Playlist 1
.mp3
...with the ".mp3" printed over the "Play" part, rather than on the next line like I have it above.
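The effect can be reproduced without a Windows file. This is a sketch that builds the suspected string by hand; the variable cr is introduced only for the demonstration.

```shell
# Hypothetical reproduction: a line read from a CRLF file keeps its trailing \r.
cr=$(printf '\r')
current_line="Playlist 1$cr"
input_file="$current_line.mp3"

# On a terminal this displays ".mp3list 1", because \r moves the cursor to column 0.
printf '%s\n' "$input_file"

# od -c makes the hidden byte visible as \r.
printf '%s\n' "$input_file" | od -c

# Length is 15, not 14: "Playlist 1" (10) + \r (1) + ".mp3" (4).
length=${#input_file}
printf '%s\n' "$length"
```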
Solution: either fix the file (there's a fairly standard dos2unix program that does precisely this), or change your script to strip carriage returns as it reads the file. Actually, I'd recommend a rewrite anyway, since your current use of sed to pick out lines is rather weird and inefficient. In a shell script, the standard way to read through a file line-by-line is to use a loop like while read -r current_line; do [commands here]; done <"$INFO_FILE". There's a possible problem that if any commands inside the loop read from standard input, they'll wind up inhaling part of that file; you can fix that by passing the file over file descriptor 3 rather than standard input. With that fix and a trick to trim carriage returns, here's what it looks like:
INFO_FILE=playlist-info-test.txt
while IFS=$' \t\n\r' read -r current_line <&3; do
    CURRENT_PLAYLIST_ORIG="$current_line"
    input_file="$CURRENT_PLAYLIST_ORIG.mp3"
    echo "$input_file"
done 3<"$INFO_FILE"
(The carriage return trim is done by read -- it always auto-trims leading and trailing whitespace, and setting IFS to $' \t\n\r' tells it to treat spaces, tabs, linefeeds, and carriage returns as whitespace. And since that assignment is a prefix to the read command, it applies only to that one command and you don't have to set IFS back to normal afterward.)
A couple of other recommendations while I'm here: double-quote all variable references (as I did with echo "$input_file" above), and avoid all-caps variable names (there are a bunch with special meanings, and if you accidentally use one of those it can have weird effects). Oh, and try passing your scripts to shellcheck.net -- it's good at spotting common mistakes.
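As an alternative sketch (not the IFS technique used above), the carriage return can also be removed explicitly with a parameter expansion, which leaves leading and trailing spaces in the line untouched. The file contents and the names info_file, cr and output are assumptions made for the example.

```shell
# Hypothetical input file with DOS (CRLF) line endings.
info_file=$(mktemp)
printf 'Playlist 1\r\nPlaylist2\r\nPlaylist 3\r\n' > "$info_file"

cr=$(printf '\r')
output=""
while IFS= read -r current_line <&3; do
    # Remove one trailing carriage return, if present.
    current_line=${current_line%"$cr"}
    output="$output$current_line.mp3
"
done 3<"$info_file"

printf '%s' "$output"
rm -f "$info_file"
```

Which variant to prefer depends on whether surrounding whitespace in the playlist names matters; the IFS version trims it, this one does not.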

'N' and 'D' not working as expected with sed

sed 'N; D' testfile
testfile contains:
this is the first line
this is the second line
this is the third line
this is the fourth line
I am using RHEL 6 and the output comes as:
this is the fourth line
As per my understanding, N just pulls in the next line into the pattern space and D deletes just the first line of the pattern space. Therefore, the output should have been:
this is the second line
this is the fourth line
Can someone please explain why the output is coming as mentioned above?
According to the documentation:
D
If pattern space contains no newline, start a normal new cycle as if the d command was issued. Otherwise, delete text in the pattern space up to the first newline, and restart cycle with the resultant pattern space, without reading a new line of input.
(Emphasis mine.)
It sounds like this would restart your sed program from the beginning, reading and deleting lines until it runs out of input, at which point only the last line is left in the buffer.
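A short sketch of that behaviour, assuming GNU sed (other implementations may print nothing when N runs out of input): the four sample lines are fed in through a pipe instead of testfile.

```shell
# Four-line sample matching the question.
lines=$(printf 'this is the first line\nthis is the second line\nthis is the third line\nthis is the fourth line')

# N appends the next line; D drops the first line and restarts the script
# without reading new input, so the pattern space slides forward one line at a time.
nd_out=$(printf '%s\n' "$lines" | sed 'N;D')

# For contrast: n prints the current line and loads the next one,
# then d deletes it and starts a fresh cycle with new input.
nd_contrast=$(printf '%s\n' "$lines" | sed 'n;d')

printf 'N;D gives: %s\n' "$nd_out"
printf 'n;d gives:\n%s\n' "$nd_contrast"
```

With GNU sed, N;D leaves only the fourth line, because the pattern space is printed once when N finds no further input. The n;d variant keeps the first and third lines.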
As already shown, using D restarts the program from the beginning. You can, however, use the following to print even lines:
sed -n 'n;p'
and to print odds:
sed 'n;d'
In GNU sed you can also use:
sed '0~2!d' # Even
sed '1~2!d' # Odd
An alternative can be something like:
N;s/^[^\n]*\n//
which will read the next line into the pattern space and then remove the first line, including its newline, with a substitution.
One might ask why this is the behavior. One reason is to make things like this possible, working with multiple lines in the pattern space:
$!N;/\npattern$/d;P;D
The above will delete lines matching pattern as well as the line before.
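A small sketch of that sliding two-line window, assuming GNU sed and a made-up four-line input in which the word pattern stands on a line of its own:

```shell
# Hypothetical input: the line "pattern" and the line before it ("b") should disappear.
out=$(printf 'a\nb\npattern\nc\n' | sed '$!N;/\npattern$/d;P;D')
printf '%s\n' "$out"
```

Step by step: the window holds a and b, so P prints a and D drops it. The window then holds b and pattern, which matches, so d discards both. Finally c is read as the last line and printed. The expected result is the two lines a and c.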

How to convert two characters to one using sed

I need to change two characters (\t\n) for only one (\t).
All lines ending in Tab will join with the next line.
I used this command:
sed -i 's/\t\n/\t/g' file.txt
but it doesn't do anything.
This might work for you (GNU sed):
sed '1h;1!H;$!d;x;s/\t\n/\t/g' file
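Sed is line based: it uses \n to delimit the lines it reads and strips that newline before placing each line in the pattern space, so s/\t\n/\t/ never sees a newline when lines are processed one at a time. The above solution gathers up the entire file into the hold space (a spare register), swaps it into the pattern space after the last line, and then does the global substitution, returning the desired result.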
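As a quick check of the command above, here is a sketch on made-up input, assuming GNU sed (which understands \t in the regex). Lines a and c end in a tab and should be joined with the line that follows them.

```shell
# Hypothetical input: "a<TAB>" and "c<TAB>" are followed by "b" and "d".
joined=$(printf 'a\t\nb\nc\t\nd\n' | sed '1h;1!H;$!d;x;s/\t\n/\t/g')
printf '%s\n' "$joined"
```

The result should be two lines, a<TAB>b and c<TAB>d. Running the original s/\t\n/\t/g without the hold-space commands on the same input leaves it unchanged.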

Resources