Avoid using an array for wildcard expansion in bash - linux

I wrote the following code:
join() {
    IFS="$1"
    shift
    echo "$*"
}
FILES=(/tmp/*)
SEPARATED_FILES=$(join , ${FILES[*]})
echo $SEPARATED_FILES
And it prints the comma-separated list of files in /tmp just fine. But I would like to refactor it to eliminate the temporary global variable FILES, which is an array. I tried the following:
SEPARATED_FILES=$(join , ${(/tmp/*)[*]})
echo $SEPARATED_FILES
But it prints the following error:
line 8: ${(/tmp/*)[*]}: bad substitution

Yes! You can avoid it by passing the glob directly as an argument to the function. Note that the glob is expanded by the shell before it is passed to the function, so give the IFS you want to set as the first argument and the glob expression as the second:
join , /tmp/*
The glob is expanded to file names before the function is called, so the call above is effectively:
join , /tmp/file1 /tmp/file2 /tmp/file3
A noteworthy addition to the above is to set the nullglob option before calling the function: when the glob matches nothing, it then expands to nothing instead of being passed along as a literal, un-expanded string.
shopt -s nullglob
join , /tmp/*
and in command substitution syntax:
fileList=$(shopt -s nullglob; join , /tmp/*)
A couple of takeaways from your good effort (see the sketch after this list):
Always apply shell quoting to variables and arrays unless you have a reason not to. Doing so preserves the literal value of the contents and prevents word splitting.
Use lower-case names for user-defined variables, functions, and arrays.
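Putting both takeaways together, a minimal sketch of the refactored script (the variable name separated_files is mine, not from the original post):
#!/usr/bin/env bash
join() {
    local IFS="$1"   # making IFS local keeps the change from leaking to the caller
    shift
    echo "$*"
}
separated_files=$(shopt -s nullglob; join , /tmp/*)
echo "$separated_files"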


How to add suffix and prefix to $@?
If I do $PREFIX/$@/$SUFFIX, I get the prefix only on the first parameter and the suffix only on the last.
I would use shell parameter expansion for this:
$ set -- one two three
$ echo "$@"
one two three
$ set -- "${@/#/pre}" && set -- "${@/%/post}"
$ echo "$@"
preonepost pretwopost prethreepost
Notes (a sketch with non-empty patterns follows these notes):
The # matches at the beginning of each parameter
The % matches at the end
Double quotes around ${@} treat each element as a separate word, so the replacement happens for every positional parameter
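In the examples above the pattern is empty, so the replacement is simply prepended or appended. A hedged sketch of the same anchors with a non-empty pattern (hypothetical .c filenames):
$ set -- foo.c bar.c
$ echo "${@/#foo/baz}"   # '#' anchor: replace a leading foo
baz.c bar.c
$ echo "${@/%.c/.o}"     # '%' anchor: replace a trailing .c
foo.o bar.o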
Let's create some parameters for test purposes:
$ set -- one two three
$ echo "$@"
one two three
Now, let's use bash to add prefixes and suffixes:
$ IFS=$'\n' a=($(printf "pre/%s/post\n" "$@"))
$ set -- "${a[@]}"
$ echo "$@"
pre/one/post pre/two/post pre/three/post
Limitations: (a) since this uses newline-separated strings, it won't work if your $@ contains newlines itself. In that case, there may be another choice for IFS that would suffice. (b) This is subject to globbing. If either of these is an issue, see the more general solution below.
On the other hand, if the positional parameters do not contain whitespace, then no change to IFS is needed.
Also, if IFS is changed, one may want to save it beforehand and restore it afterward.
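A hedged sketch of that save-and-restore dance (the unset case needs separate handling, since an unset IFS and an empty IFS behave differently):
# Save the current IFS, remembering whether it was set at all.
if [ "${IFS+set}" ]; then old_ifs=$IFS; else unset old_ifs; fi
IFS=$'\n' a=($(printf "pre/%s/post\n" "$@"))   # note: this IFS change persists
# Restore the previous state.
if [ "${old_ifs+set}" ]; then IFS=$old_ifs; else unset IFS; fi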
More general solution
If we don't want to make any assumptions about whitespace, we can modify "$@" with a loop:
$ a=(); for p in "$@"; do a+=("pre/$p/post"); done
$ set -- "${a[@]}"
$ echo "$@"
pre/one/post pre/two/post pre/three/post
Note: This is essentially a slightly more detailed version of sjam's answer.
John1024's answer is helpful, but:
requires a subshell (which involves a child process)
can result in unwanted globbing applied to the array elements.
Fortunately, Bash parameter expansion can be applied to arrays too, which avoids these issues:
set -- 'one' 'two' # sample input array, which will be reflected in $@
# Copy $@ to new array ${a[@]}, adding a prefix to each element.
# `/#` replaces the string that follows, up to the next `/`,
# at the *start* of each element.
# In the absence of a string, the replacement string following
# the second `/` is unconditionally placed *before* each element.
a=( "${@/#/PREFIX}" )
# Add a suffix to each element of the resulting array ${a[@]}.
# `/%` replaces the string that follows, up to the next `/`,
# at the *end* of each element.
# In the absence of a string, the replacement string following
# the second `/` is unconditionally placed *after* each element.
a=( "${a[@]/%/SUFFIX}" )
# Print the resulting array.
declare -p a
This yields:
declare -a a='([0]="PREFIXoneSUFFIX" [1]="PREFIXtwoSUFFIX")'
Note that double-quoting the array references is crucial to protect their elements from potential word-splitting and globbing (filename expansion) - both of which are instances of shell expansions.
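A quick hedged demonstration of what the quoting prevents (the sample values are mine):
set -- 'two words' '*'    # one element with a space, one that is a glob character
a=( ${@/#/PREFIX} )       # UNQUOTED: word splitting and globbing apply
declare -p a              # 'two words' is split in two; 'PREFIX*' may match files
a=( "${@/#/PREFIX}" )     # QUOTED: each element is preserved intact
declare -p a              # => [0]="PREFIXtwo words" [1]="PREFIX*"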

IFS and command substitution

I am writing a shell script to read input csv files and run a java program accordingly.
#!/usr/bin/ksh
CSV_FILE=${1}
myScript="/usr/bin/java -version"
while read row
do
    $myScript
    IFS=$"|"
    for column in $row
    do
        $myScript
    done
done < $CSV_FILE
csv file:
a|b|c
Interestingly, the $myScript outside the for loop works, but the $myScript inside the for loop fails with "/usr/bin/java -version: not found [No such file or directory]". I have come to know that it is because I am setting IFS. If I comment out the IFS assignment and change the csv file to
a b c
It works! I imagine the shell uses the default IFS to separate the command /usr/bin/java from its -version argument. Since I changed IFS, it takes the entire string as a single command name - or that is what I think is happening.
But this is my requirement: I have a csv file with a custom delimiter, and the command has arguments in it, separated by spaces. How can I do this correctly?
IFS indicates how to split the values of variables in unquoted substitutions. It applies to both $row and $myScript.
If you want to use IFS to do the splitting, which is convenient in plain sh, then you need to either switch the value of IFS back and forth, or arrange for both splits to need the same value. In this particular case, you can easily arrange for the same value, by defining myScript as myScript="/usr/bin/java|-version". Alternatively, you can change the value of IFS just in time. In both cases, note that an unquoted substitution doesn't just split the value using IFS; it also interprets each part as a wildcard pattern and replaces it by the list of matching file names, if there are any. This means that if your CSV file contains a line like
foo|*|bar
then the row won't be foo, *, bar but foo, each file name in the current directory, bar. To process the data as written, you need to turn globbing off with set -f. Also remember that read reads continuation lines when a line ends with a backslash, and strips leading and trailing IFS characters. Use IFS= read -r to turn off these two behaviors.
myScript="/usr/bin/java -version"
set -f
while IFS= read -r row
do
    $myScript
    IFS='|'
    for column in $row
    do
        IFS=' '
        $myScript
    done
done <"$CSV_FILE"
However, there are better ways that avoid IFS-splitting altogether. Don't store a command in a space-separated string: it fails in complex cases, such as commands that need an argument containing a space. There are three robust ways to store a command:
Store the command in a function. This is the most natural approach. Running a command is code; you define code in a function. You can refer to the function's arguments collectively as "$@".
myScript () {
    /usr/bin/java -version "$@"
}
…
myScript extra_argument_1 extra_argument_2
Store an executable command name and its arguments in an array.
myScript=(/usr/bin/java -version)
…
"${myScript[#]}" extra_argument_1 extra_argument_2
Store a shell command, i.e. something that is meant to be parsed by the shell. To evaluate the shell code in a string, use eval. Be sure to quote the argument, like any other variable expansion, to avoid premature wildcard expansion. This approach is more complex since it requires careful quoting. It's only really useful when you have to store the command in a string, for example because it comes in as a parameter to your script. Note that you can't sensibly pass extra arguments this way.
myScript='/usr/bin/java -version'
…
eval "$myScript"
Also, since you're using ksh and not plain sh, you don't need to use IFS to split the input line. Use read -A instead, to split directly into an array.
#!/usr/bin/ksh
CSV_FILE=${1}
myScript=(/usr/bin/java -version)
while IFS='|' read -r -A columns
do
    "${myScript[@]}"
    for column in "${columns[@]}"
    do
        "${myScript[@]}"
    done
done <"$CSV_FILE"
The simplest solution is to avoid changing IFS and do the splitting with read -d <delimiter>, like this:
#!/usr/bin/ksh
CSV_FILE=${1}
myScript="/usr/bin/java -version"
while read -A -d '|' columns
do
    $myScript
    for column in "${columns[@]}"
    do
        echo next is "$column"
        $myScript
    done
done < $CSV_FILE
IFS tells the shell which characters separate "words", that is, the different components of a command. So when you remove the space character from IFS and run foo bar, the shell sees a single word "foo bar" rather than "foo" and "bar".
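A hedged sketch of that effect (the command string is mine):
cmd="ls -l"   # a command stored in a space-separated string
IFS='|'       # space is no longer a separator for expansions
$cmd          # fails: the shell looks for a program literally named "ls -l"
IFS=' '       # put space back
$cmd          # works: the expansion splits into "ls" and "-l"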
The IFS assignment should be placed after while, just before read:
#!/usr/bin/ksh
CSV_FILE=${1}
myScript="/usr/bin/java -version"
while IFS="|" read row
do
    $myScript
    for column in $row
    do
        $myScript
    done
done < $CSV_FILE

Newlines not quoted properly in ls -Q

Using ls -Q with --quoting-style=shell, newlines in file names (yes, I know...) are turned into ?. Is this a bug? Is there a way to get the file names in a format that's 100% compatible with a shell (sh or bash if possible)?
Example (bash):
$ touch a$'\n'b
$ for s in literal shell shell-always c c-maybe escape locale clocale ; do
ls -Q a?b --quoting-style=$s
done
a?b
'a?b'
'a?b'
"a\nb"
"a\nb"
a\nb
‘a\nb’
‘a\nb’
coreutils 8.25 has the new 'shell-escape' quoting style, and in fact enables it by default, so that the output from ls is always usable and safe to copy and paste back into other commands.
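For example, a hedged sketch of what shell-escape output looks like (format from memory; worth verifying on your system):
$ touch a$'\n'b
$ ls --quoting-style=shell-escape
'a'$'\n''b'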
Maybe not quite what you are looking for, but the "escape" style seems to work well with the upcoming ${...@E} parameter expansion in bash 4.4.
$ touch $'a\nb' $'c\nd'
$ ls -Q --quoting-style=escape ??? | while IFS= read -r fname; do echo "==${fname@E}=="; done
==a
b==
==c
d==
Here is the relevant part of the man page:
${parameter@operator}
Parameter transformation. The expansion is either a transforma-
tion of the value of parameter or information about parameter
itself, depending on the value of operator. Each operator is a
single letter:
Q The expansion is a string that is the value of parameter
quoted in a format that can be reused as input.
E The expansion is a string that is the value of parameter
with backslash escape sequences expanded as with the
$'...' quoting mechanism.
P The expansion is a string that is the result of expanding
the value of parameter as if it were a prompt string (see
PROMPTING below).
A The expansion is a string in the form of an assignment
statement or declare command that, if evaluated, will
recreate parameter with its attributes and value.
a The expansion is a string consisting of flag values rep-
resenting parameter's attributes.
If parameter is @ or *, the operation is applied to each posi-
tional parameter in turn, and the expansion is the resultant
list. If parameter is an array variable subscripted with @ or
*, the operation is applied to each member of the array in
turn, and the expansion is the resultant list.
The result of the expansion is subject to word splitting and
pathname expansion as described below.
From a bit of experimentation, it looks like --quoting-style=escape is compatible with being wrapped in $'...', with two exceptions:
it escapes spaces by prepending a backslash; but $'...' doesn't discard backslashes before spaces.
it doesn't escape single-quotes.
So you could perhaps write something like this (in Bash):
function ls-quote-shell () {
ls -Q --quoting-style=escape "$@" \
| while IFS= read -r filename ; do
filename="${filename//'\ '/ }" # unescape spaces
filename="${filename//"'"/\'}" # escape single-quotes
printf "$'%s'\n" "$filename"
done
}
To test this, I've created a directory with a bunch of filenames with weird characters; and
eval ls -l $(ls-quote-shell)
worked as intended... though I won't make any firm guarantees about it.
Alternatively, here's a version that uses printf to process the escapes followed by printf %q to re-escape in a shell-friendly manner:
function ls-quote-shell () {
ls -Q --quoting-style=escape "$@" \
| while IFS= read -r escaped_filename ; do
escaped_filename="${escaped_filename//'\ '/ }" # unescape spaces
escaped_filename="${escaped_filename//'%'/%%}" # escape percent signs
# note: need to save in variable, rather than using command
# substitution, because command substitution strips trailing newlines:
printf -v filename "$escaped_filename"
printf '%q\n' "$filename"
done
}
but if it turns out that there's some case that the first version doesn't handle correctly, then the second version will most likely have the same issue. (FWIW, eval ls -l $(ls-quote-shell) worked as intended with both versions.)

Using a glob expression passed as a bash script argument

TL;DR:
Why isn't invoking ./myscript foo* when myscript has var=$1 the same as invoking ./myscript with var=foo* hardcoded?
Longer form
I've come across a weird issue in a bash script I'm writing. I am sure there is a simple explanation, but I can't figure it out.
I am trying to pass a command line argument to be assigned as a variable in the script.
I want the script to allow 2 command line arguments as follows:
$ bash my_bash_script.bash args1 args2
In my script, I assigned variables like this:
ARGS1=$1
ARGS2=$2
Args 1 is a string descriptor to add to the output file.
Args 2 is a group of directories: "dir1, dir2, dir3", which I am passing as dir*
When I assign dir* to ARGS2 inside the script it works fine, but when I pass dir* as the second command line argument, ARGS2 ends up containing only dir1, the first result of the wildcard expansion.
I assume this has something to do with how the shell handles wildcards (even when passed as args), but I don't really understand it.
Any help would be appreciated.
Environment / Usage
I have a group of directories:
dir_1_y_map, dir_1_x_map, dir_2_y_map, dir_2_x_map,
... dir_10_y_map, dir_10_x_map...
Inside these directories I am trying to access a file with extension ".status" via *.status, and ".report.txt" via *report.txt.
I want to pass dir_*_map as the second argument to the script and store it in the variable ARGS2, then use it to search within each of the directories for the ".status" and ".report" files.
The issue is that passing dir_*_map from the command line doesn't give the list of directories, but rather just the first item in the list. If I assign the variable ARGS2=dir_*_map within the script, it works as I intend.
Workaround: Quoting
It turns out that passing the second argument in quotes, as "dir_*_map", allows the wildcard expansion to work as I intended:
#!/usr/bin/env bash
ARGS1=$1
ARGS2=$2
touch $ARGS1".extension"
for i in /$ARGS2/*.status
do
    grep -e "string" $i >> $ARGS1".extension"
done
Here is an example invocation of the script:
sh ~/path/to/script descriptor "dir_*_map"
I don't fully understand when/why some arguments must be passed in quotes, but I assume it has to do with the wildcard expansion in the for loop.
Addressing the "why"
Assignments, as in var=foo*, don't expand globs -- that is, when you run var=foo*, the literal string foo* is put into the variable var, not the list of files matching foo*.
By contrast, unquoted use of foo* on a command line expands the glob, replacing it with a list of individual names, each of which is passed as a separate argument.
Thus, running ./yourscript foo* doesn't pass foo* as $1 unless no files matching that glob expression exist; instead, it becomes something like ./yourscript foo01 foo02 foo03, with each argument in a different spot on the command line.
The reason running ./yourscript "foo*" functions as a workaround is that the unquoted expansion inside the script allows the glob to be expanded at that later time. However, this is bad practice: glob expansion happens concurrently with string-splitting (meaning that relying on this behavior removes your ability to pass filenames containing characters found in IFS, typically whitespace), and it also means that you can't pass literal filenames when they could also be interpreted as globs (if you have a file named [1] and a file named 1, passing [1] would always be replaced with 1).
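A hedged sketch of the difference, assuming a hypothetical yourscript and files foo01 through foo03:
$ ls
foo01  foo02  foo03
$ cat yourscript
#!/bin/bash
printf 'argument: %s\n' "$@"
$ ./yourscript foo*     # the calling shell expands the glob first
argument: foo01
argument: foo02
argument: foo03
$ ./yourscript "foo*"   # the literal pattern arrives as $1
argument: foo*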
Idiomatic Usage
The idiomatic way to build this would be to shift away the first argument, and then iterate over subsequent ones, like so:
#!/bin/bash
out_base=$1; shift
shopt -s nullglob # avoid generating an error if a directory has no .status files
for dir; do # iterate over directories passed in $2, $3, etc
    for file in "$dir"/*.status; do # iterate over files ending in .status within those
        grep -e "string" "$file" # match a single file
    done
done >"${out_base}.extension"
If you have many .status files in a single directory, all this can be made more efficient by using find to invoke grep with as many arguments as possible, rather than calling grep individually on a per-file basis:
#!/bin/bash
out_base=$1; shift
find "$@" -maxdepth 1 -type f -name '*.status' \
    -exec grep -h -e "string" -- /dev/null '{}' + \
    >"${out_base}.extension"
Both scripts above expect the globs passed not to be quoted on the invoking shell. Thus, usage is of the form:
# being unquoted, this expands the glob into a series of separate arguments
your_script descriptor dir_*_map
This is considerably better practice than passing globs to your script (which is then required to expand them to retrieve the actual files to use); it works correctly with filenames containing whitespace (which the other practice doesn't), and with files whose names are themselves glob expressions.
Some other points of note:
Always put double quotes around expansions! Failing to do so results in the additional steps of string-splitting and glob expansion (in that order) being applied. If you want globbing, as in the case of "$dir"/*.status, then end the quotes before the glob expression starts.
for dir; do is precisely equivalent to for dir in "$@"; do, which iterates over arguments. Don't make the mistake of using for dir in $*; do or for dir in $@; do instead! These latter invocations combine each element of the list with the first character of IFS (which, by default, contains the space, the tab and the newline in that order), then split the resulting string on any IFS characters found within, then expand each component of the resulting list as a glob.
Passing /dev/null as an argument to grep is a safety measure: it ensures that you don't get different behavior between the single-argument and multi-argument cases (as an example, grep defaults to printing filenames within output only when passed multiple arguments), and ensures that grep can't hang trying to read from stdin if it's passed no additional filenames at all (which find won't do here, but xargs can). A short demonstration follows this list.
Using lower-case names for your own variables (as opposed to system- and shell-provided variables, which have all-uppercase names) is in accordance with POSIX-specified convention; see fourth paragraph of the POSIX specification regarding environment variables, keeping in mind that environment variables and shell variables share a namespace.
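A hedged demonstration of the filename-printing difference (the file name and contents are mine):
$ grep match one.txt             # one file operand: no filename prefix
match-me
$ grep match one.txt /dev/null   # two file operands: output gains a prefix
one.txt:match-me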

basename command confusion

Given the following command:
$(basename "/this-directory-does-not-exist/*.txt" ".txt")
it outputs not only txt files but other files as well. On the other hand, if I change ".txt" to something like "gobble de gook", it returns:
*.txt
I'm confused with regard to why it returns the other extension types.
Your problem doesn't stem from basename, but from inadvertent use of the shell's pathname expansion (globbing) feature due to lack of quoting:
If you use the result of your command substitution ($(...)) unquoted:
$ echo $(basename "/this-directory-does-not-exist/*.txt" ".txt")
you effectively execute the following:
$ echo * # unquoted '*' expands to all files and folders in the current dir
because basename "/this-directory-does-not-exist/*.txt" ".txt" returns literal * (it strips the extension from the filename *.txt;
the reason the filename pattern *.txt didn't expand to an actual filename is that, by default, the shell leaves globbing patterns that don't match anything unmodified).
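A hedged illustration of that default behavior:
$ echo /this-directory-does-not-exist/*.txt   # no match: the pattern stays literal
/this-directory-does-not-exist/*.txt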
If you double-quote the command substitution, the problem goes away:
$ echo "$(basename "/this-directory-does-not-exist/*.txt" ".txt")" # -> *
However, even with this problem resolved, your basename command will only work correctly if the glob expands to exactly one matching file, because the syntax form you're using supports only one filename argument.
GNU basename and BSD basename support the non-POSIX -s option, which allows for multiple file operands from which to strip the extension (note that the glob is left unquoted here, so the shell expands it to the individual matching paths):
basename -s .txt /some-dir/*.txt
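For example, a hedged sketch assuming hypothetical files a.txt and b.txt in /some-dir:
$ basename -s .txt /some-dir/a.txt /some-dir/b.txt
a
b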
Assuming you use bash, you can put it all together robustly as follows:
#!/usr/bin/env bash
names=()        # initialize result array
files=( *.txt ) # perform globbing and capture matching paths in an array
# Since the shell by default returns a pattern as-is if there are no matches,
# we test the first array item for existence; if it refers to an existing
# file or dir., we know that at least 1 match was found.
if [[ -e ${files[0]} ]]; then
    # Apply the `basename` command with suffix-stripping to all matches
    # and read the results robustly into an array.
    # Note that just `names=( $(basename ...) )` would NOT work robustly.
    readarray -t names < <(basename -s '.txt' "${files[@]}")
    # Note: `readarray` requires Bash 4; in Bash 3.x, use the following:
    # IFS=$'\n' read -r -d '' -a names < <(basename -s '.txt' "${files[@]}")
fi
# "${names[@]}" now contains an array of suffix-stripped basenames,
# or is empty, if no files matched.
printf '%s\n' "${names[@]}" # print names line by line
Note: The -e test comes with a tiny caveat: if there are matches and the first match is a broken symlink, the test will mistakenly conclude that there are no matches.
A more robust option is to use shopt -s nullglob to make the shell expand non-matching globs to the empty string. Note, however, that this is a shell-global option, and it is good practice to return it to its previous value afterward, which makes that approach more cumbersome.
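A hedged sketch of that save-and-restore dance (Bash-specific):
restore_nullglob=$(shopt -p nullglob)   # capture the current setting as a runnable command
shopt -s nullglob
files=( *.txt )                         # a non-matching glob now expands to nothing
eval "$restore_nullglob"                # restore the previous nullglob state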
Try putting quotes around the whole thing. What you are seeing is globbing: your command substitution yields *, which is then expanded to all files in the current directory. That expansion does not happen inside single or double quotes.
