sed returning different result on different platforms

Hi, using the following command on an x86 machine (using /bin/sh) returns: <port>3</port>
test="port 3"
echo $test | sed -r 's/\s*port\s*([0-9]+)\s*/<port>\1<\/port>/'
but running the same command in the sh shell of an ARM-based network switch returns the string port 3.
How can I get the same result on the switch as I got on my x86 machine? To me it seems like the digit is not being captured by [0-9].

\s is a GNU sed extension to the standard sed behavior. GNU sed is the implementation on desktop/server Linux systems. Most embedded Linux systems run BusyBox, a suite of utilities with a markedly smaller footprint and fewer features.
A standard way of specifying “any space character” is the [:space:] character class. It is supported by BusyBox (at least, by most BusyBox installations; most BusyBox features can be stripped off for an even lower footprint).
BusyBox also doesn't support the -r option, so you need to use a basic regular expression. In a BRE, \(…\) marks groups, and there is no + operator, only *.
echo "$test" | sed 's/[[:space:]]*port[[:space:]]*\([0-9][0-9]*\)[[:space:]]*/<port>\1<\/port>/'
Note that since you didn't put any quotes around $test, the shell performed word splitting and wildcard expansion on the value of the variable. That is, the value of the variable was treated as a whitespace-separated list of file names which were then joined by a single space. So if you leave out the quotes, you don't have to worry about different kinds of whitespace, and you can write echo $test | sed 's/ *port *\([0-9][0-9]*\) */<port>\1<\/port>/'. However, if $test had been port *, the result would have depended on what files exist in the current directory.
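To make the quoted, portable form easier to try out, here is a minimal sketch that wraps it in a function. The name wrap_port is made up for this illustration; the sed expression is the basic-regular-expression one given above.

```shell
# wrap_port is a hypothetical helper name used only for this illustration.
wrap_port() {
    # Quote "$1" so the shell passes the value through unchanged,
    # then match any whitespace with the POSIX class [[:space:]].
    printf '%s\n' "$1" |
        sed 's/[[:space:]]*port[[:space:]]*\([0-9][0-9]*\)[[:space:]]*/<port>\1<\/port>/'
}

wrap_port "port 3"                    # prints <port>3</port>
wrap_port "$(printf 'port\t42')"      # tab instead of space, prints <port>42</port>
```

Because the value is quoted, a tab between port and the number survives and is matched by [[:space:]] rather than being collapsed by the shell.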

Not all seds support reg-expression short-hand like \s. A more portable version is
test="port 3"
echo "$test" | sed -r 's/[ ]*port[ ]*([0-9]+)[ ]*/<port>\1<\/port>/'
If you really need to check for tab chars as well, just add them to the char class (in all 3 places) that, in my example just contain space chars, i.e. the [ ] bit.
output
<port>3</port>
I hope this helps.

Related

How to extract string in shell script

I have file names like Tarun_Verma_25_02_2016_10_00_10.csv. How can I extract the string like 25_02_2016_10_00_10 from it in shell script?
It is not known how many numeric parts there will be after "firstName"_"lastName".
A one-line solution would be preferred.

with sed
$ echo Tarun_Verma_25_02_2016_10_00_10.csv | sed -r 's/[^0-9]*([0-9][^.]*)\..*/\1/'
25_02_2016_10_00_10
extract everything between the first digit and dot.
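If the -r option is not available (it is not part of POSIX), the same idea can probably be written as a basic regular expression with escaped parentheses. A sketch, with numeric_stem as an invented function name:

```shell
# numeric_stem is a hypothetical name; it reads one file name per line on stdin.
numeric_stem() {
    # BRE form: \( \) marks the group, no -r needed.
    sed 's/[^0-9]*\([0-9][^.]*\)\..*/\1/'
}

echo "Tarun_Verma_25_02_2016_10_00_10.csv" | numeric_stem   # prints 25_02_2016_10_00_10
```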

If you want some control over which parts you pick out (assuming the format is always like <firstname>_<lastname>_<day>_<month>_<year>_<hour>_<minute>_<second>.csv) awk would be pretty handy
echo "Tarun_Verma_25_02_2016_10_00_10.csv" | awk -F"[_.]" 'BEGIN{OFS="_"}{print $3,$4,$5,$6,$7,$8}'
Here awk splits by both underscore and period, sets the Output Field Separator to an underscore, and then prints the parts of the file name that you are interested in.
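Since the question says the number of numeric parts is not fixed, a loop over the fields may be more robust than listing $3 through $8. The sketch below assumes the first two fields are always the names and the last field is always the extension; join_numeric_fields is an invented name.

```shell
# join_numeric_fields is a hypothetical name; it reads file names on stdin.
join_numeric_fields() {
    awk -F'[_.]' '{
        out = $3                       # first part after the two name fields
        for (i = 4; i < NF; i++)       # stop before the extension in $NF
            out = out "_" $i
        print out
    }'
}

echo "Tarun_Verma_25_02_2016_10_00_10.csv" | join_numeric_fields   # prints 25_02_2016_10_00_10
```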

ksh93 supports the syntax bash calls extglobs out-of-the-box. Thus, in ksh93, you can do the following:
f='Tarun_Verma_25_02_2016_10_00_10.csv'
f=${f##+([![:digit:]])} # trim everything before the first digit
f=${f%%+([![:digit:]])} # trim everything after the last digit
echo "$f"
To do the same in bash, you'll want to run the following command first
shopt -s extglob
Since this uses shell-native string manipulation, it runs much more quickly than invoking an external command (sed, awk, etc) when processing only a single line of input. (When using ksh93 rather than bash, it's quite speedy even for large inputs).
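For shells without extglob support (plain POSIX sh, for example), a similar trim can be sketched with two nested parameter expansions. This is an alternative technique, not the one shown above; digit_span, lead and trail are names invented here, and since POSIX sh has no local, the variables are global.

```shell
# digit_span is a hypothetical helper; it uses only POSIX parameter expansion.
digit_span() {
    f=$1
    lead=${f%%[0-9]*}      # everything before the first digit
    f=${f#"$lead"}         # drop that prefix (quoted, so taken literally)
    trail=${f##*[0-9]}     # everything after the last digit
    f=${f%"$trail"}        # drop that suffix
    printf '%s\n' "$f"
}

digit_span 'Tarun_Verma_25_02_2016_10_00_10.csv'   # prints 25_02_2016_10_00_10
```

If the name contains no digit at all, this prints an empty line.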

convert this linux statement into a statement which is supported by windows command prompt

This is my statement, which works in a Unix environment:
"cat document.xml | grep \'<w:t\' | sed \'s/<[^<]*>//g\' | grep -v \'^[[:space:]]*$\'"
But I want to execute that statement in the Windows command prompt.
How do I do that, and which commands are similar to cat, grep and sed?
Please tell me the exact code for Windows that is equivalent to the above command.

The double quotes around the pipeline in your question are a syntax error, and the backslashed single quotes should apparently really not have backslashes, but I assume it's just an artefact of a slightly imprecise presentation.
Here's what the code does.
cat document.xml |
This is a useless use of cat but its purpose is to feed the contents of this file into the pipeline.
grep '<w:t' |
This looks for lines containing the literal string <w:t (probably the start of a tag in the XML format in the file). The single quotes quote the string so that it is not interpreted by the shell (otherwise the < would be interpreted as a redirection operator); they are consumed by the shell, and not passed through to grep.
sed 's/<[^<]*>//g' |
This replaces every pair of open/close brokets with an empty string. The regular expression [^<]* matches zero or more occurrences of a character which can be anything except <. If the XML is well-formed, these should always occur in pairs, and so we effectively remove all XML tags.
grep -v '^[[:space:]]*$'
This removes any line which is empty or consists entirely of whitespace.
Because sed is a superset of grep, the program could easily be rephrased as a single sed script. Perhaps the easiest solution for your immediate problem would be to obtain a copy of sed for your platform.
sed -e '/<w:t/!d' -e 's/<[^<]*>//g' -e '/[^[:space:]]/!d' document.xml
I understand quoting rules on Windows may be different; try with double quotes instead of single, or put the script in a file and use sed -f file document.xml where file contains the script itself, like this:
/<w:t/!d
s/<[^<]*>//g
/[^[:space:]]/!d
This is a rather crude way to extract the CDATA from an XML document, anyway; perhaps some XML processor would be the proper way forward. E.g. xmlstarlet appears to be available for Windows. It works even if the XML input doesn't have the beginning and ending <w:t> tags on the same line, with nothing else on it. (In fact, parsing XML with line-oriented tools is a massive antipattern.)
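To see what the three sed commands do, here is a small sketch run against made-up input. The sample lines are invented for illustration and are not taken from a real document.xml; extract_text is a hypothetical name.

```shell
# extract_text is a hypothetical wrapper around the single-sed version; reads stdin.
extract_text() {
    # 1) drop lines without <w:t   2) strip tags   3) drop whitespace-only lines
    sed -e '/<w:t/!d' -e 's/<[^<]*>//g' -e '/[^[:space:]]/!d'
}

# Invented sample input: two text runs, one unrelated tag, one blank run.
printf '%s\n' \
    '<w:p><w:t>Hello</w:t></w:p>' \
    '<w:sectPr/>' \
    '<w:p><w:t>   </w:t></w:p>' \
    '<w:p><w:t>World</w:t></w:p>' |
    extract_text
# prints:
# Hello
# World
```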

Maybe try PowerShell?
It has been included since Windows 8, I think;
it is certainly included in Windows 10.
I've just tested a "cat" command and it works.
"grep" doesn't, but it may be possible to adapt it along these lines:
PowerShell equivalent to grep -f
and
https://communary.wordpress.com/2014/11/10/grep-the-powershell-way/

The equivalent of grep on windows would be findstr and the equivalent of cat would be type.

sed help: matching and replacing a literal "\n" (not the newline)

I have a file which contains several instances of \n.
I would like to replace them with actual newlines, but sed doesn't recognize the \n.
I tried
sed -r -e 's/\n/\n/'
sed -r -e 's/\\n/\n/'
sed -r -e 's/[\n]/\n/'
and many other ways of escaping it.
Is sed able to recognize a literal \n? If so, how?
Is there another program that can read the file, interpreting the \n's as real newlines?

Can you please try this
sed -i 's/\\n/\n/g' input_filename

What exactly works depends on your sed implementation. This is poorly specified in POSIX so you see all kinds of behaviors.
The -r option is also not part of the POSIX standard; but your script doesn't use any of the -r features, so let's just take it out. (For what it's worth, it changes the regex dialect supported in the match expression from POSIX "basic" to "extended" regular expressions; some sed variants have an -E option which does the same thing. In brief, the "extended" dialect lets you write grouping parentheses and repetition braces without backslashes, and adds operators such as + and ?.)
On BSD platforms (including MacOS), you will generally want to backslash the literal newline, like this:
sed 's/\\n/\
/g' file
On some other systems, like Linux (also depending on the precise sed version installed -- some distros use GNU sed, others favor something more traditional, still others let you choose) you might be able to use a literal \n in the replacement string to represent an actual newline character; but again, this is nonstandard and thus not portable.
If you need a properly portable solution, probably go with Awk or (gasp) Perl.
perl -pe 's/\\n/\n/g' file
In case you don't have access to the manuals, the /g flag says to replace every occurrence on a line; the default behavior of the s/// command is to only replace the first match on every line.
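The answer mentions Awk but only shows Perl, so here is a sketch of what an Awk version might look like. unescape_newlines is an invented name; the function reads stdin.

```shell
# unescape_newlines is a hypothetical name; reads text on stdin.
unescape_newlines() {
    # /\\n/ matches a backslash followed by the letter n;
    # "\n" in the replacement string is a real newline.
    awk '{ gsub(/\\n/, "\n"); print }'
}

# printf with %s leaves the two characters \ and n untouched in the input.
printf '%s\n' 'first\nsecond' | unescape_newlines
# prints:
# first
# second
```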

awk seems to handle this fine:
echo "test \n more data" | awk '{sub(/\\n/,"**")}1'
test ** more data
Here you need to escape the \ using \\

$ echo "\n" | sed -e 's/[\\][n]/hello/'

sed works one line at a time, and the trailing newline is removed when the line is read into the buffer, so a single line never contains a \n. You should use N, n or H, h to fill the buffer with more than one line; then \n appears inside it. Be careful: ^ and $ then no longer mean start and end of line but start and end of the string/buffer, because of the \n inside.
\n is recognized in the search pattern, not in the replacement. There are two ways to use it (sample):
sed s/\(\n\)bla/\1blabla\1/
sed s/\nbla/\
blabla\
/
The first uses a \n that is already in the buffer as a back reference (shorter code in the replacement);
the second uses a real newline.
So basically
sed "N
$ s/\(\n\)/\1/g
"
works (but is a bit useless). I imagine that s/\(\n\)\n/\1/g is more like what you want.
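A small sketch of the point about N: once a second line has been appended to the buffer, \n in the search pattern has something to match. join_pairs is an invented name, and the input here deliberately has an even number of lines.

```shell
# join_pairs is a hypothetical name; it joins each pair of input lines with a space.
join_pairs() {
    # N appends the next line to the buffer, separated by an embedded newline;
    # the s command then replaces that embedded newline.
    sed -e 'N' -e 's/\n/ /'
}

printf 'a\nb\nc\nd\n' | join_pairs
# prints:
# a b
# c d
```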

OSX sed: how to use the escape character in the second field of a `s` operation?

On OSX:
bash-3.2$ echo "abc" | sed 's/b/\x1b[31mz\x1b[m/'
ax1b[31mzx1b[mc
Whereas on Linux:
$ echo "abc" | sed 's/b/\x1b[31mz\x1b[m/'
azc
and the z correctly shows up red.
Is this a limitation of bash 3.2? My Linux test here runs bash 4.1.2.
The weird thing is on my linux environment at work the bash is version below 3.2, and it works there too.
Also, this might be related but is probably not:
bash-3.2$ echo "abc" | sed 's/b/^[[31mz^[[m/'
31mz$'m/'azc
Again, specific to BSD sed. It's pretty puzzling: Seems like something is causing the shell or sed to echo some mangled portion of the command to the terminal somehow? It is always preceding the correct output of the command, however. Where's that dollar sign coming from?
(don't be confused by colors in my commands (which come after the cyan unicode character that looks like a less bent > which is my prompt), I use syntax highlighting with zsh)

OS X's version of sed doesn't do the escape substitutions you're asking for. You can get around this by using $'...' to have bash do the substitution before handing the string to sed:
$ echo "abc" | sed 's/b/\x1b[31mz\x1b[m/'
ax1b[31mzx1b[mc
$ echo "abc" | sed $'s/b/\x1b[31mz\x1b[m/'
azc
(You'll have to trust me that the "z" is red in the second one.) Note, however, that in some cases you may then have to double-escape things you want sed itself to interpret as escapes.
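If the script also has to run in a shell without $'...' quoting, one possible alternative is to let printf produce the escape byte and splice it into a double-quoted sed script. This is a sketch of that idea, not something from the answer above; colorize_b and esc are invented names.

```shell
# colorize_b is a hypothetical name; it reads text on stdin.
colorize_b() {
    # printf turns the octal escape \033 into a raw ESC byte,
    # so sed itself never has to interpret an escape sequence.
    esc=$(printf '\033')
    sed "s/b/${esc}[31mz${esc}[m/"
}

echo "abc" | colorize_b    # "a", a red "z", then "c" on a colour terminal
```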

Oh. right so the shell version does not affect this. No idea why I thought that.
The culprit is just that BSD sed doesn't do translation, so the solution is just the Ctrl+V approach of using the raw escape byte in the sed command string.

Using ? with sed

I just want to get the number of a file that may or may not be gzip'd. However, it appears that a regular expression in sed does not support a ?. Here's what I tried:
echo 'file_1.gz'|sed -n 's/.*_\(.*\)\(\.gz\)?/\1/p'
and nothing was returned. Then I added a ? to the string being analyzed:
echo 'file_1.gz?'|sed -n 's/.*_\(.*\)\(\.gz\)?/\1/p'
and got:
1
So, it looks like the ? used in most regex's is not supported in sed, right? Well then, I would just like sed to give a 1 for file_1 and file_1.gz. What's the best way to do that in a bash script if execution time is critical?

The equivalent to x? is \(x\|\).
However, many versions of sed support an option to enable "extended regular expressions" which includes ?. In GNU sed the flag is -r. Note that this also changes unescaped parens to do grouping. eg:
echo 'file_1.gz'|sed -n -r 's/.*_(.*)(\.gz)?/\1/p'
Actually, there's another bug in your regex which is that the greedy .* in the parens is going to swallow up the ".gz" if there is one. sed doesn't have a non-greedy equivalent to * as far as I know, but you can use | to work around this. | in sed (and many other regex implementations) will use the leftmost match that works, so you can do something like this:
echo 'file_1.gz'|sed -r 's/(.*_(.*)\.gz)|(.*_(.*))/\2\4/'
This tries to match with .gz, and only tries without it if that doesn't work. Only one of group 2 or 4 will actually exist (since they are on opposite sides of the same |) so we just concatenate them to get the value we want.
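Another way to sidestep the greediness problem, sketched here as an alternative rather than as the method above, is to use two substitutions: first remove an optional trailing .gz, then remove everything up to the last underscore. file_number is an invented name.

```shell
# file_number is a hypothetical helper; it takes the file name as its argument.
file_number() {
    # First strip a trailing .gz if present, then everything through the last "_".
    printf '%s\n' "$1" | sed -e 's/\.gz$//' -e 's/.*_//'
}

file_number file_1       # prints 1
file_number file_1.gz    # prints 1
```

Both expressions are basic regular expressions, so no -r or -E flag is involved.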

If you're looking for an answer to the specific example given in the question, or why it uses the ? incorrectly (regardless of syntax), see the answer by Laurence Gonsalves.
If you're looking instead for the answer to the general question of why ? doesn't exhibit its special meaning in sed as you might expect:
By default, sed uses the "POSIX basic regular expressions syntax", so the question mark must be escaped as \? to apply its special meaning, otherwise it matches a literal question mark. As an alternative, you can use the -r or --regexp-extended option to use the "extended regular expression syntax", which reverses the meaning of escaped and non-escaped special characters, including ?.
In the words of the GNU sed documentation (view by running 'info sed' on Linux):
The only difference between basic and extended regular expressions is in
the behavior of a few characters: '?', '+', parentheses, and braces
('{}'). While basic regular expressions require these to be escaped if
you want them to behave as special characters, when using extended
regular expressions you must escape them if you want them to match a
literal character.
and the option is explained:
-r
--regexp-extended
Use extended regular expressions rather than basic regular
expressions. Extended regexps are those that `egrep' accepts;
they can be clearer because they usually have less backslashes,
but are a GNU extension and hence scripts that use them are not
portable.
Update
Newer versions of GNU sed now say this:
-E
-r
--regexp-extended
Use extended regular expressions rather than basic regular
expressions. Extended regexps are those that 'egrep' accepts; they
can be clearer because they usually have fewer backslashes.
Historically this was a GNU extension, but the '-E' extension has
since been added to the POSIX standard
(http://austingroupbugs.net/view.php?id=528), so use '-E' for
portability. GNU sed has accepted '-E' as an undocumented option
for years, and *BSD seds have accepted '-E' for years as well, but
scripts that use '-E' might not port to other older systems.
So, if you need to preserve compatibility with ancient GNU sed, stick with -r. But if you prefer better cross-platform portability on more modern systems (e.g. Linux+Mac support), go with -E (but note that there are still some quirks and differences between GNU sed and BSD sed, so you'll have to make sure your scripts are portable in any case).
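For scripts that have to run on both kinds of systems, one option is to probe at run time which flag the local sed accepts. A minimal sketch, assuming a sed that supports at least one of the two flags; ere_flag is an invented variable name.

```shell
# Probe for -E first and fall back to -r; ere_flag is a hypothetical name.
if printf 'a\n' | sed -E 's/(a)+/\1/' >/dev/null 2>&1; then
    ere_flag=-E
else
    ere_flag=-r
fi

# Use whichever flag was accepted for an extended-regex substitution.
echo 'file_1.gz' | sed -n "$ere_flag" 's/.*_([0-9]+).*/\1/p'   # prints 1
```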

echo 'file_1.gz'|sed -n 's/.*_\(.*\)\?\(\.gz\)/\1/p'
Works. You have to put the return in the right spot, and you have to escape it.

A function that should return a number that follows the '_' in a filename, regardless of file extension:
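realname () {
# Strip any leading directory from the first argument.
local n=${1##*/}
# Strip the last extension, if there is one.
local rn="${n%.*}"
# Pass the name to sed on stdin rather than as a file name argument.
printf '%s\n' "${rn:-$n}" | sed 's/^.*_//'
}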

You should use awk which is superior to sed when it comes to field grabbing/parsing:
$ awk -F'[._]' '{print $2}' <<<"file_1"
1
$ awk -F'[._]' '{print $2}' <<<"file_1.gz"
1
Alternatively you can just use Bash's parameter expansion like so:
var=file_1.gz;
temp=${var#*_};
file=${temp%.*}
echo $file
Note: works when var=file_1 as well

Part of the solution lies in escaping the question mark or using the -r option.
sed 's/.*_\([^.]*\)\(\.\?[^.]\+\)\?$/\1/'
or
sed -r 's/.*_([^.]*)(\.?[^.]+)?$/\1/'
will work for:
file_1.gz
file_12.txt
file_123
resulting in:
1
12
123

I just realized that I could do something very easy:
echo 'file_1.gz'|sed -n 's/.*_\([0-9]*\).*/\1/p'
Notice the [0-9]* instead of a .*. @Laurence Gonsalves's answer made me realize the greediness of my previous post.
