Test for empty string with X"" [duplicate]

Test for empty string with X"" [duplicate] - string

This question already has answers here:
Why do shell script comparisons often use x$VAR = xyes?
(7 answers)
Closed 9 years ago.
I know I can test for an empty string in Bash with -z like so:
if [[ -z $myvar ]]; then do_stuff; fi
but I see a lot of code written like:
if [[ X"" = X"$myvar" ]]; then do_stuff; fi
Is that method more portable? Is it just historical cruft from before the days of -z? Is it for POSIX shells (even though I've seen it used in scripts targeting bash)? Ready for my history/portability lesson.
The same question was asked on Server Fault as How to determine if a bash variable is empty? but no one offered an explanation as to why you see code with the X"" stuff.

Fundamentally, because in times now long past, the behaviour of test was more complex and not uniformly defined across different systems (so portable code had to be written carefully to avoid non-portable constructs).
In particular, before test was a shell built-in, it was a separate executable (and note that MacOS X still has /bin/test and /bin/[ as executables). When that was the case, writing:
if [ -z $variable ]
when $variable was empty would invoke the test program via its alias [ with 3 arguments:
argv[0] = "["
argv[1] = "-z"
argv[2] = "]"
because the variable was empty so there was nothing to expand. So, the safe way of writing the code was:
if [ -z "$variable" ]
This works reliably, passing 4 arguments to the test executable. Granted, the test program has been a built-in to most shells for decades, but old equipment dies hard, and so do good practices learned even longer ago.
The other problem resolved by the X prefix was what happened if variables include leading dashes, or contain equals or other comparators. Consider (a not desparately good example):
x="-z"
if [ $x -eq 0 ]
Is that an empty string test with a stray (erroneous) argument, or a numeric equality test with a non-numeric first argument? Different systems provided different answers before POSIX standardized the behaviour, circa 1990. So, the safe way of dealing with this was:
if [ "X$x" = "X0" ]
or (less usually, in my experience, but completely equivalently):
if [ X"$x" = X"0" ]
It was all the edge cases like this, tied up with the possibility that the test was a separate executable, that means that portable shell code still uses double quotes more copiously than the modern shells actually require, and the X-prefix notation was used to ensure that things could not get misinterpreted.

Ah, this is one of my preferred questions and answers, because I came up with the answer just thinking about it. The X is set just in case the string starts with a -, that can be taken as a flag for the test. Putting an X before just removes that case, and the comparison can still hold.
I also like this because this kind of trick is almost an inheritance from the past, oldest times of computing, and you encounter it when you try to read some of the most portable shell scripts out there (autoconf, configure, etc.)

Related

Why append an extra character in `test`/`[` string comparison in POSIX sh? [duplicate]

This question already has answers here:
Why do shell script comparisons often use x$VAR = xyes?
(7 answers)
Closed 2 years ago.
After reading Debian's /usr/bin/startx, there was something I found peculiar:
mcookie=`/usr/bin/mcookie`
if test x"$mcookie" = x; then
echo "Couldn't create cookie"
exit 1
fi
I don't understand why the extra x is necessary - wouldn't it be equivalent to write the following?
mcookie=`/usr/bin/mcookie`
if test "$mcookie" = ""; then
echo "Couldn't create cookie"
exit 1
fi
I initially thought that very early versions of sh might have printed some error if the mkcookie variable happened to be unset (this wasn't the only instance; these types of comparisons were scattered liberally throughout the script). But then it didn't make much sense after further thought because the variable was being quoted and the shell would expand it to an empty string.
I perused the Dash, --posix section of the Bash man page and checked POSIX itself. It said nothing about it under test: s1 = s2 True if the strings s1 and s2 are identical; otherwise, false. I assume the people that wrote the script knew what they were doing - so can someone shed some light on this?
Thankies,
Edwin

Here's an excerpt from the Wooledge Bash Pitfall #4:
You may have seen code like this:
[ x"$foo" = xbar ] # Ok, but usually unnecessary.
The x"$foo" hack is required for code that must run on very ancient shells which lack [[, and have a more primitive [, which gets confused if $foo begins with a -. On said older systems, [ still doesn't care whether the token on the right hand side of the = begins with a -. It just uses it literally. It's just the left-hand side that needs extra caution.
Note that shells that require this workaround are not POSIX-conforming. Even the Heirloom Bourne shell doesn't require this (probably the non-POSIX Bourne shell clone that's still most widely in use as a system shell). Such extreme portability is rarely a requirement and makes your code less readable (and uglier).
If you find it in code from this century, it's just cargo culting that keeps being reinforced by people who don't quote.
/usr/bin/startx in particular has used this idiom since at least XFree86 2.1 in Feb 1993, and any updates have likely just matched style since.

What is the proper way of filtering type errors in Bash scripts?

Some years ago, while we were learning Bash script, our teacher taught us that the first thing we should do was filtering the "type" of the arguments given and throw an error in case they were different from expected.
(e. g., if our script takes 2 arguments to sum them, we should first check if they were indeed numbers)
The method he told us to use was sed command.
In the example above, we would take all the characters from the arguments that are different from numbers, and if there's any, we would consider it as a non numerical value, throwing an error.
Since this method looks pretty rudimentary to me, I would like to ask:
Is there any elegant way of achieving this?
Is this process actually done when building scripts?

sed is tedious overkill when the shell has this functionality built in.
case $1 in *[!0-9]*) echo "$0: not a number: $1" >&2; exit 127;; esac
(If you want to allow for negative numbers or non-integers, obviously adapt the glob pattern.)
The shell itself doesn't have any types at all -- everything is a string.

In bash, how do I execute the contents of a variable, VERBATIM, as though they were a command line?

I need to construct a complex command that includes quoted arguments. As it happens, they are arguments to grep, so I'll use that as my example and deeply simplify the command to just enough to demonstrate the error.
Let's start with a working example:
> COMMAND='/usr/bin/grep _'
> echo $COMMAND
/usr/bin/grep _
> $COMMAND
foo <- I type this, and grep filters it out.
foo_ <- I type this, and.....
foo_ <- ... it matches, so grep emits it.
"foo" is not echoed back because it lacks an underscore, "foo_" has one, so it's returned. Let's get to a demonstration of the problem:
> COMMAND='/usr/bin/grep "_ _"'
> echo -E $COMMAND
/usr/bin/grep "_ _"
> /usr/bin/grep "_ _" <- The exact same command line
foo <- fails to match
foo_ _ <- matches, so it gets echoed back
foo_ _
> $COMMAND <- But that command doesn't work from a variable
grep: _": No such file or directory
In other words, when this command is invoked through a variable name, bash is taking the space between underscores as an argument delimiter - despite the quotes.
Normally, I'd fix this with backslashes:
> COMMAND='/usr/bin/grep "_\ _"'
> $COMMAND
grep: trailing backslash (\)
Okay, maybe I need another layer of escaping the backslash:
> COMMAND='/usr/bin/grep "_\\ _"'
12:32 (master) /Users/ronbarry> $COMMAND
grep: _": No such file or directory
And now we're back to square one - the command line is still being broken up at the space. I can, of course, verify all of this with some debugging, which establishes that the backslashes are surviving, unescaped, and grep is being called with multiple arguments:
> set -x
> $COMMAND
+ /usr/bin/grep '"_\\' '_"' <- grep is being called with two args
I have a solution to the problem that takes advantage of arrays, but packing commands this way (in my full implementation, which I'll spare you) is unfamiliar to most people who'd read my code. To oversimplify the creation of an array-based command:
> declare -a COMMAND=('/usr/bin/grep' '-i' 'a b')
12:44 (master) /Users/ronbarry> ${COMMAND[*]}
foo <- Same old, same old
fooa B <- ...
fooa B <- Matches because of case-insensitive (-i) grep.
Finally we get to the question. Why does bash break up quoted arguments in strings when interpreting them as commands and why doesn't there seem to be a string-y way to get it to work? If I have a command packed in a string variable, it violates the Principle of Least Surprise to have that string interpreted differently than the string itself would be. If someone can point me at some docs that cover all of this, and will set me at peace with why I have to resort to the infinitely uglier mechanism of building up arrays with all of my commands, I'd very much appreciate it.

Disclaimer: After writing the following, I almost decided that the question should be closed for encouraging opinion-based responses. This is an opinion-based response. Proceed at your own risk.
Why does bash break up quoted arguments in strings when interpreting them as commands
Because that's what it does. A more interesting question might be "Why does bash break up strings at all?", to which the only possible answer would be "it seemed like a good idea at the time".
Or, to put it another way: In the beginning, nobody thought of putting spaces into filenames. When you only had a few letters for a filename, you didn't waste any of them on spaces. So it seemed reasonable to represent a list of words as just a space-separated list of words, and that was the basis on which shell languages were developed. So the default behaviour of bash, like that of all unix-y shells, is to consider a string with whitespace in it to be a whitespace-separated list of words.
But, of course, that leads to all sorts of headaches, because strings are not structured data. Sometimes a filename does have whitespace in its name. And not all utility arguments are filenames, either. Sometimes you want to give an argument to a utility which is, for example, a sentence. Without that complication, shells were able to avoid making you type quotes, unlike "real" programming languages where strings need to be quoted. But once you decide that sometimes a space in a string is just another character, you need to have some kind of quoting system. So then the syntax of shells added several quoting forms, each with slightly different semantics. The most common is double-quoting, which marks the contents as a single word but still allows variable expansion.
It remains the case that shell quotes, like quotes in any other language, are simply syntactic constructs. They are not part of the string, and the fact that a particular character in a string was marked with a quote (or, equivalently, a backslash) is not retained as part of the string -- again, just like any other programming language. Strings are not really lists of words; they are just treated that way by default.
All of that is not very satisfactory. The nature of shell programming is that you really want a data structure which is a list of "words" -- or, better, a list of strings. And, eventually, shells got around to doing that. Unfortunately, by then there wasn't much syntactic space left in shell languages; it was considered important that the new features not change the behaviour of existing shell scripts. As far as I know, the current shell syntax for arrays was created by David Korn in 1988 (or earlier); eventually, bash also implemented arrays with basically the same syntax.
One of the curiosities in the syntax is that there are three ways of specifying that an entire array should be substituted:
${array[*]} or ${array[#]}: concatenate all the array elements together separated with the first character in $IFS, and then consider the result to be a whitespace-separated list of words.
"${array[*]}": concatenate all the array elements together separated with the first character in $IFS, and then consider the result to be a single word.
"${array[#]}": each array element is inserted as a separate word.
Of these, the first one is essentially useless; the second one is occasionally useful, and the third -- and most difficult to type -- is the one you almost always want.
In the above brief discussion, I left out any consideration of glob characters and filename expansion, and a number of other shell idiosyncrasies. So don't take it as a complete tutorial, by any means.
why doesn't there seem to be a string-y way to get it to work?
You can always use eval. Unfortunately. If you really really want to get bash to interpret a string as though it were a bash program rather than a string, and if you are prepared to open your script up to all manner of injection attacks, then the shell will happily give you enough rope. Personally, I would never allow a script which used eval to pass code review so I'm not going to expand on its use here. But it's documented.
If I have a command packed in a string variable, it violates the Principle of Least Surprise to have that string interpreted differently than the string itself would be.
Surprise is really in the eye of the beholder. There are probably lots of programmers who think that a newline character really occupies two bytes, and are Surprised when it turns out that in C, '\n'[0] is not a backslash. But I think most of us would be Surprised if it were. (I've tried to answer SO questions based on this misunderstanding, and it is not easy.)
Bash strings, regardless of anything else, are strings. They are not bash programs. Having them suddenly interpreted as bash programs would, in my opinion, not only be surprising but dangerous. At least if you use eval, there is a big red flag for the code reviewer.

Shellshock vulnerability, CVE-2014-7169 [duplicate]

With a bash release which has been patched for shellshock
$ bash --version
GNU bash, version 3.2.52(1)-release (x86_64-apple-darwin12)
Copyright (C) 2007 Free Software Foundation, Inc.
$ env x='() { :;}; echo vulnerable' bash -c "echo this is a test"
bash: warning: x: ignoring function definition attempt
bash: error importing function definition for `x'
this is a test
another similar exploit still works and has been assigned CVE-2014-7169
$ env X='() { (a)=>\' bash -c "echo date"; cat echo
bash: X: line 1: syntax error near unexpected token `='
bash: X: line 1: `'
bash: error importing function definition for `X'
Thu Sep 25 12:47:22 EDT 2014
$ ls echo
echo
Looking for a breakdown of this as well.

The bug
CVE-2014-7169 is a bug in bash's parser. Bash's parser uses a variable eol_ungetc_lookahead to ungetc characters across lines. That variable wasn't being properly reset from the reset_parser function, which is called e.g. on some syntax errors. Using that bug, it's possible to inject a character into the start of the next bash input line.
So the test code forces a syntax error, using either (a)= or function a a, adds the redirection character to prepend to the next line >, and adds a line continuation \, which leads to either version of the test code:
() { (a)=>\
() { function a a>\
When bash is executed, it processes variables from the environment, finds that variable X is a exported function, and evaluates it to import the function. But the evaluation fails with a parse error, leaving the > character in the eol_ungetc_lookahead variable. Then, when parsing the command argument echo date, it prepends the > character, leading to >echo date, which runs date redirected to a file named echo.
Its relation to the previous bug
The above bug is obviously very different to the original shellshock bug. There are actually several problems:
Bash evaluates completely a variable that looks like an exported function (starts with the four characters () {). CVE-2014-6271.
Under some conditions, it is possible to inject a character into an ungetc variable, that will be prepended to the next input line. CVE-2014-7169.
Bash allows every environment variable to be treated like an exported function, so long as it starts with the four characters () {. CVE-2014-6271, CVE-2014-7169, all the other CVEs where a bug is triggered in bash's parser.
There is a limited stack for here-doc redirection, and there is no check for overflow. CVE-2014-7186, which leads to memory corruption, and can probably be leveraged for arbitrary code execution.
There is a limited stack for nested control structures (select/for/while), with checks for overflow. That stack is still corrupted. CVE-2014-7187.
The fixes
The first patch restricts bash to evaluating a single function definition in each variable that looks like a exported function.
The second patch properly resets eol_ungetc_lookahead on reset_parser.
The third patch changes how functions are exported: now they are exported in variables named BASH_FUNC_functionname%%.
Attack surface
The big problem here has been that every environment variable could be used as a vector for attack. Typically, attackers cannot control arbitrary environment variables, otherwise there are already other known attacks (think of LD_PRELOAD, PATH, IFS, ...).
sudo is not affected because it strips exported bash functions from the environment, as mentioned by Gilles on security.SE.
ssh is affected. Typical sshd installations only allow a limited set of environment variables to be exported as configured in AcceptEnv in sshd_config, e.g: LANG and LC_*. Even with this aggressive whitelisting approach, in shellshock any variable could be an attack vector.
Not only was every environment variable a potential attack vector, they exposed a >6000 lines parser.
Lessons relearned
system, popen, and others are potentially dangerous. Not only should you take care with their arguments: even when the arguments are fixed at compile-time, the environment is a potential attack vector. Use fork()/execve(), preferably with a clean environment (but at least restrict the environment to white-listed variables, preferably with their values sanity-checked). Remember that a good quality system does what it is supposed to do, while a secure system does what it is supposed to do and nothing more. Invoking a full-blown shell makes doing nothing more a little bit harder.
Complexity is the enemy of security. These days you can easily find people recommending simpler shells. Most shells are free from shellshock by not supporting exported functions at all. Conversely, bash has received lots of security features over the years (you need to invoke it with -p to avoid it dropping privileges on startup, it sanitizes IFS, ...), so don't assume I'm advocating switching shells, this is more of a general advice.
Some excerpts from David Wheeler's ancient "Secure Programming for Linux and UNIX HOWTO" chapter on environment variables are still worth rereading.
§5.2.3 ¶1:
For secure setuid/setgid programs, the short list of environment
variables needed as input (if any) should be carefully extracted. Then
the entire environment should be erased, followed by resetting a small
set of necessary environment variables to safe values. There really
isn't a better way if you make any calls to subordinate programs;
there's no practical method of listing ``all the dangerous values''.
§5.2.3 ¶6:
If you really need user-supplied values, check the values first (to
ensure that the values match a pattern for legal values and that they
are within some reasonable maximum length).

Waving my hands a lot, I suspect the new exploit does the following:
The backslash helps bypass the original patch, so that the string is still evaluated.
The > combines with echo as an output redirection for the bash shell
With echo being consumed by the evaluation to define the function, the only part of the -c argument left to execute is date, whose output goes to a file name echo instead of standard output.
That's the best I can come up with short of reading the bash source, but I suspect the backslash facilitates some sort of buffer overflow that allows the environment string and the argument to -c to be merged.

Bloated echo command

Look at the following implementations of the "echo" command:
http://bxr.su/o/bin/echo/echo.c (OpenBSD)
http://bxr.su/d/bin/echo/echo.c (DragonFly)
http://bxr.su/n/bin/echo/echo.c (NetBSD)
http://bxr.su/f/bin/echo/echo.c (FreeBSD)
http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/echo.c (GNU)
As you go down the list, I'm sure you'll notice the increasing bloat in each implementation.
What is the point of a 272 line echo program?

In their article 'Program Design in the UNIX Environment', Pike & Kernighan discuss how the cat program accreted control arguments. Somewhere, though not that article, there was a comment about 'cat came back from Berkeley waving flags'. This is a similar issue to the problem with echo developing options. (I found a reference to the relevant article in the BSD (Mac OS X) man page for cat: Rob Pike, "UNIX Style, or cat -v Considered Harmful", USENIX Summer Conference Proceedings, 1983. See also http://quotes.cat-v.org/programming/)
In their book 'The UNIX Programming Environment', Kernighan & Pike (yes, those two again) quote Doug McIlroy on the subject of what 'echo' should do with no arguments (circa 1984):
Another question of philosophy is what echo should do if given no arguments - specifically, should it print a blank line or nothing at all. All the current echo implementations we know print a blank line, but past versions didn't, and there were great debates on the subject. Doug McIlroy imparted the right feelings of mysticism in his discussion on the topic:
The UNIX and the Echo
There dwelt in the land of New Jersey the UNIX, a fair maid whom savants travelled far to admire. Dazzled by her purity, all sought to espouse her, one for her virginal grace, another for her polished civility, yet another for her agility in performing exacting tasks seldom accomplished even in much richer lands. So large of heart and accommodating of nature was she that the UNIX adopted all but the most insufferably rich of her suitors. Soon, many offspring grew and prospered and spread to the ends of the earth.
Nature herself smiled and answered to the UNIX more eagerly than to other mortal beings. Humbler folk, who knew little of more courtly manners, delighted in her echo, so precise and crystal clear they scarce believed she could be answered by the same rocks and woods that so garbled their own shouts into the wilderness. And the compliant UNIX obliged with perfect echoes of whatever she was asked.
When one impatient swain asked the UNIX, 'Echo nothing', the UNIX obligingly opened her mouth, echoed nothing, and closed it again.
'Whatever do you mean,' the youth demanded, 'opening your mouth like that? Henceforth never open your mouth when you are supposed to echo nothing!' And the UNIX obliged.
'But I want a perfect performance, even when you echo nothing,' pleaded a sensitive youth, 'and no perfect echoes can come from a closed mouth.' Not wishing to offend either one, the UNIX agreed to say different nothings for the impatient youth and the insensitive youth. She called the sensitive nothing '\n'.
Yet now when she said '\n', she was not really saying nothing so she had to open her mouth twice, once to say '\n' and once to say nothing, and so she did not please the sensitive youth, who said forthwith, 'The \n sounds like a perfect nothing to me, but the second one ruins it. I want you to take back one of them.' So the UNIX, who could not abide offending, agreed to undo some echoes, and called that '\c'. Now the sensitive youth could hear a perfect echo of nothing by asking for '\n' and '\c' together. But they say that he died of a surfeit of notation before he ever heard one.
The Korn shell introduced (or, at least, included) a printf command that was based on the C language printf() function, and that uses a format string to control how the material should appear. It is a better tool for complicated formatting than echo. But, because of the history outlined in the quote, echo doesn't just echo any more; it interprets what it is given to echo.
And interpreting the command line arguments to echo indubitably requires more code than not interpreting them. A basic echo command is:
#include <stdio.h>
int main(int argc, char **argv)
{
const char *pad = "";
while (*++argv)
{
fputs(pad, stdout);
fputs(*argv, stdout);
pad = " ";
}
fputc('\n', stdout);
return 0;
}
There are other ways to achieve that. But the more complex versions of echo have to scrutinize their arguments before printing anything - and that takes more code. And different systems have decided that they want to do different amounts of interpretation of their arguments, leading to different amounts of code.

You will notice there is not really that much bloat growth.
Most of the lines of code are comments.
Most of the lines of code that are not comments, are usage documentation, so when somebody goes 'echo --help' it will do something.
Code outside of the above appears largely to be handling the arguments echo can take, as well as the "special" expansion for symbols such as \n and \t to be their equivalent characters instead of echoing them literally.
Also, most of the time, you're not even running the echo command, most of the time 'echo' invokes a shell built-in. At least on my machine, you have to type /bin/echo --help to get all the advanced functionality/documentation out of it, because echo --help merely echo's --help
For a good example, run this in your shell.
echo '\e[31mhello\e[0m'
Then run this:
echo -e '\e[31mhello\e[0m'
And you will note vastly different results.
The former will just emit the input as-is, but the latter will print hello coloured red.
Another example, using the 'hex' code:
$echo '\x64\x65\x66'
\x64\x65\x66
$echo -e '\x64\x65\x66'
def
Vastly different behaviour. The openbsd implementation cannot to this =).

I'm not sure if I like the first implementation: it has one option too many!
If a -n option was required, then also adding a -- option to stop option processing would be helpful. That way if you are writing a shell script that reads and prints user input, you don't get inconsistent behavior if the user types -n.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string