bash tar error doesn't create tar.gz - linux

I have the following bash script:
#DIR is something like /home/foo/foobar/test/ (usually without whitespace, but it may also contain whitespace)
DIR="$( cd "$( dirname "$0" )" && pwd )"
#backup_name is read from a file
backup_name=FOOBAR
date=`date +%Y%m%d_%H%M_%S`
#subdirs is also read from the same file
subdirs=etc/ sbin/ bin/
filename="$DIR/Backup_$backup_name"_"$date.tar.gz"
cd /
echo "filename: $filename"
echo "subdirs $subdirs"
cmd='tar czvf "'$filename'" '$subdirs
echo "cmd tar: $cmd"
$cmd
But I get following output:
filename: /home/foo/foobar/test/Backup_FOOBAR_20120322_1529_35.tar.gz
subdirs: etc/ sbin/ bin/
cmd tar: tar cfvz "/home/foo/foobar/test/Backup_FOOBAR_20120322_1529_35.tar.gz" etc/ sbin/ bin/
etc/
# ... list of files in etc
# but no files from sbin or bin directory
tar: "/home/foo/foobar/test/Backup_FOOBAR_20120322_1529_35.tar.gz": Cannot open: No such file or directory
tar: Error is not recoverable: exiting now
However, when I copy the echo output of the tar command, make a cd to / and paste it into the bash shell it is working:
tar cfvz "/home/foo/foobar/test/Backup_FOOBAR_20120322_1529_35.tar.gz" etc/ sbin/ bin/
etc/
Every variable is defined and there is no trailing newline
I also tried $cmd with backticks
the two variables: backup_name and subdirs are read from a file (I did not include the reading process in the code)
edit: I just copied my script to a dir with no whitespace and changed the line:
cmd='tar czvf "'$filename'" '$subdirs
#to
cmd="tar czvf $filename $subdirs"
and it's working now, but when I do the same in a directory whose path contains whitespace, I still get the same error.
edit2: reading from file (the file is read before anything else happens)
config="config.txt"
local line
while read line
do
    #points to next free element and declares it
    config_lines[${#config_lines[@]}]=$line
done <$config
backup_name=${config_lines[0]}
subdirs=${config_lines[1]}
What is wrong with my bash script?

Short answer: see BashFAQ #050: I'm trying to put a command in a variable, but the complex cases always fail!.
Long answer: embedding quotes in a variable doesn't do anything useful, because when you use it (i.e. $cmd), bash parses quotes before replacing variables; by the time the quotes are there, it's too late for them to do any good. You do, however, have several options:
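As a quick demonstration (a sketch you can paste into a shell): word splitting still happens on the variable's value, but the embedded quotes stay literal characters:
$ args='"a b"'
$ printf '<%s> ' $args; echo
<"a> <b">
Two arguments, each carrying a stray quote character, instead of one argument a b.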
Don't bother with putting the command in a variable in the first place, just use it directly:
echo "filename: $filename"
echo "subdirs $subdirs"
tar czvf "$filename" $subdirs
If you really need to put it in a variable first, use an array rather than a plain text variable (and ideally, do the same with the subdirs list):
subdirs=(etc/ sbin/ bin/)
...
echo "filename: $filename"
echo "subdirs ${subdirs[*]}"
cmd=(tar czvf "$filename" "${subdirs[@]}")
printf "cmd tar:"
printf " %q" "${cmd[@]}" # Have to do some trickery to get it printed right
printf "\n"
"${cmd[@]}"

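As a quick illustration of why the array form above survives spaces (a sketch with a hypothetical path):
filename="/tmp/dir with spaces/backup.tar.gz"
cmd=(tar czvf "$filename" etc/)
"${cmd[@]}"   # tar receives the whole path as a single argument
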
Instead of mucking about with messy quoting issues you could get the results you want a different way and, perhaps, save some time. How about something like this?
#!/usr/bin/env bash
# abusing set -v for fun and profit
tar_output=/tmp/$$.tarout
tar_command=/tmp/$$.tarcmd
tmp_script=/tmp/$$.script
dir="$(cd "$(dirname "$0")"; pwd)"
cat>"${tmp_script}"<<-'END'
datestamp=$(date +%Y%m%d_%H%M_%S)
subdirs=(etc sbin bin)
backup_name=FOOBAR
filename="$1/Backup_${backup_name}_${datestamp}.tar.gz"
printf 'tar cmd: '
set -v
tar czvf "$filename" "${subdirs[@]}" 2>"$2"
set +v
END
bash "${tmp_script}" "$dir" "${tar_output}" 2>"${tar_command}"
head -n 1 "${tar_command}" | sed -e 's/2>"\$2"$//'
cat "${tar_output}"
rm -f "${tmp_script}" "${tar_command}" "${tar_output}"
I apologize for nothing, but note that in the real world you'd want to create proper temp files (e.g. with mktemp).

If you execute the string $cmd, it won't work if $filename contains spaces.
You have to let bash create the arguments,
like this:
tar czvf "${filename}" $subdirs
You don't even need to put any '\' in $filename.

OK, your original script did not work because the shell parses quotes before it expands variables, so the quotes that come out of $cmd are literal characters and the filename is wrong: tar thinks that it's supposed to write to a file in the current directory whose name literally contains slashes and double quotes, i.e. "/home/foo/foobar/test/Backup_FOOBAR_20120322_1529_35.tar.gz" with the quote characters included!
tar cfz /this/file/does/nopt/exist .
tar: /this/file/does/nopt/exist: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now
See the difference? There are no double quotes around the file name/path in tar's error message.
It worked when you copied and pasted the line because then the double quotes were interpreted by the shell.
Witness:
ls -l /tmp/screen-exchange
-rw-rw-rw- 1 aqn users 0 Mar 21 07:29 /tmp/screen-exchange
cmd='ls -l "'/tmp/screen-exchange'"'
$cmd
/bin/ls: "/tmp/screen-exchange": No such file or directory
eval $cmd
-rw-rw-rw- 1 aqn users 0 Mar 21 07:29 /tmp/screen-exchange
Of course, using eval won't guard against filenames with whitespaces in them. To guard against that, your tar command needs to be like so:
date>'file name with spaces'
file='file name with spaces' # this is the equivalent of your $filename
cmd='ls -l "$file"'
$cmd
ls: "$file": No such file or directory
eval $cmd
-rw-r--r-- 1 andyn SPICE\Domain Users 1083 Mar 22 15:28 file name with spaces
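Applied to the original tar command, that deferred-expansion pattern would look like this (a sketch; the single quotes keep "$filename" unexpanded until eval re-parses the string):
cmd='tar czvf "$filename" $subdirs'
eval "$cmd"   # "$filename" stays one word; $subdirs still word-splits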

I would suggest you separate $cmd from $filename and $subdirs. I think the error comes from joining these strings together. Also, combining multiple variables into one variable without proper quoting will cause errors.
This should work for you:
cmd="tar -zcvf"
subdirs="etc/ sbin/ bin/"
filename="${DIR}/Backup_${backup_name}_${date}.tar.gz"
$cmd "$filename" $subdirs

#DIR is something like /home/foo/foobar/test/ (usually without whitespace, but it may also contain whitespace)
DIR="$( cd "$( dirname "$0" )" && pwd )"
backup_name=FOOBAR
date=`date +%Y%m%d_%H%M_%S`
subdirs="etc/ sbin/ bin/"
filename="$DIR/Backup_$backup_name"_"$date.tar.gz"
cd /
echo "filename: $filename"
echo "subdirs $subdirs"
cmd="tar zcvf $filename $subdirs"
echo "cmd tar: $cmd"
$cmd


How do I get the path of the directory in which a Bash script is located, inside that script?
I want to use a Bash script as a launcher for another application. I want to change the working directory to the one where the Bash script is located, so I can operate on the files in that directory, like so:
$ ./application
#!/usr/bin/env bash
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
is a useful one-liner which will give you the full directory name of the script no matter where it is being called from.
It will work as long as the last component of the path used to find the script is not a symlink (directory links are OK). If you also want to resolve any links to the script itself, you need a multi-line solution:
#!/usr/bin/env bash
SOURCE=${BASH_SOURCE[0]}
while [ -L "$SOURCE" ]; do # resolve $SOURCE until the file is no longer a symlink
  DIR=$( cd -P "$( dirname "$SOURCE" )" >/dev/null 2>&1 && pwd )
  SOURCE=$(readlink "$SOURCE")
  [[ $SOURCE != /* ]] && SOURCE=$DIR/$SOURCE # if $SOURCE was a relative symlink, we need to resolve it relative to the path where the symlink file was located
done
DIR=$( cd -P "$( dirname "$SOURCE" )" >/dev/null 2>&1 && pwd )
This last one will work with any combination of aliases, source, bash -c, symlinks, etc.
Beware: if you cd to a different directory before running this snippet, the result may be incorrect!
Also, watch out for $CDPATH gotchas, and stderr output side effects if the user has smartly overridden cd to redirect output to stderr instead (including escape sequences, such as when calling update_terminal_cwd >&2 on Mac). Adding >/dev/null 2>&1 at the end of your cd command will take care of both possibilities.
To understand how it works, try running this more verbose form:
#!/usr/bin/env bash
SOURCE=${BASH_SOURCE[0]}
while [ -L "$SOURCE" ]; do # resolve $SOURCE until the file is no longer a symlink
  TARGET=$(readlink "$SOURCE")
  if [[ $TARGET == /* ]]; then
    echo "SOURCE '$SOURCE' is an absolute symlink to '$TARGET'"
    SOURCE=$TARGET
  else
    DIR=$( dirname "$SOURCE" )
    echo "SOURCE '$SOURCE' is a relative symlink to '$TARGET' (relative to '$DIR')"
    SOURCE=$DIR/$TARGET # if $SOURCE was a relative symlink, we need to resolve it relative to the path where the symlink file was located
  fi
done
echo "SOURCE is '$SOURCE'"
RDIR=$( dirname "$SOURCE" )
DIR=$( cd -P "$( dirname "$SOURCE" )" >/dev/null 2>&1 && pwd )
if [ "$DIR" != "$RDIR" ]; then
  echo "DIR '$RDIR' resolves to '$DIR'"
fi
echo "DIR is '$DIR'"
And it will print something like:
SOURCE './scriptdir.sh' is a relative symlink to 'sym2/scriptdir.sh' (relative to '.')
SOURCE is './sym2/scriptdir.sh'
DIR './sym2' resolves to '/home/ubuntu/dotfiles/fo fo/real/real1/real2'
DIR is '/home/ubuntu/dotfiles/fo fo/real/real1/real2'
Use dirname "$0":
#!/usr/bin/env bash
echo "The script you are running has basename $( basename -- "$0"; ), dirname $( dirname -- "$0"; )";
echo "The present working directory is $( pwd; )";
Using pwd alone will not work if you are not running the script from the directory it is contained in.
[matt@server1 ~]$ pwd
/home/matt
[matt@server1 ~]$ ./test2.sh
The script you are running has basename test2.sh, dirname .
The present working directory is /home/matt
[matt@server1 ~]$ cd /tmp
[matt@server1 tmp]$ ~/test2.sh
The script you are running has basename test2.sh, dirname /home/matt
The present working directory is /tmp
The dirname command is the most basic, simply parsing the path up to the filename off of the $0 (script name) variable:
dirname -- "$0";
But, as matt b pointed out, the path returned is different depending on how the script is called. pwd doesn't do the job because that only tells you what the current directory is, not what directory the script resides in. Additionally, if a symbolic link to a script is executed, you're going to get a (probably relative) path to where the link resides, not the actual script.
Some others have mentioned the readlink command, but at its simplest, you can use:
dirname -- "$( readlink -f -- "$0"; )";
readlink will resolve the script path to an absolute path from the root of the filesystem. So, any paths containing single or double dots, tildes and/or symbolic links will be resolved to a full path.
Here's a script demonstrating each of these, whatdir.sh:
#!/usr/bin/env bash
echo "pwd: `pwd`"
echo "\$0: $0"
echo "basename: `basename -- "$0"`"
echo "dirname: `dirname -- "$0"`"
echo "dirname/readlink: $( dirname -- "$( readlink -f -- "$0"; )"; )"
Running this script in my home dir, using a relative path:
>>>$ ./whatdir.sh
pwd: /Users/phatblat
$0: ./whatdir.sh
basename: whatdir.sh
dirname: .
dirname/readlink: /Users/phatblat
Again, but using the full path to the script:
>>>$ /Users/phatblat/whatdir.sh
pwd: /Users/phatblat
$0: /Users/phatblat/whatdir.sh
basename: whatdir.sh
dirname: /Users/phatblat
dirname/readlink: /Users/phatblat
Now changing directories:
>>>$ cd /tmp
>>>$ ~/whatdir.sh
pwd: /tmp
$0: /Users/phatblat/whatdir.sh
basename: whatdir.sh
dirname: /Users/phatblat
dirname/readlink: /Users/phatblat
And finally using a symbolic link to execute the script:
>>>$ ln -s ~/whatdir.sh whatdirlink.sh
>>>$ ./whatdirlink.sh
pwd: /tmp
$0: ./whatdirlink.sh
basename: whatdirlink.sh
dirname: .
dirname/readlink: /Users/phatblat
There is however one case where this doesn't work, when the script is sourced (instead of executed) in bash:
>>>$ cd /tmp
>>>$ . ~/whatdir.sh
pwd: /tmp
$0: bash
basename: bash
dirname: .
dirname/readlink: /tmp
pushd . > '/dev/null';
SCRIPT_PATH="${BASH_SOURCE[0]:-$0}";
while [ -h "$SCRIPT_PATH" ];
do
cd "$( dirname -- "$SCRIPT_PATH"; )";
SCRIPT_PATH="$( readlink -f -- "$SCRIPT_PATH"; )";
done
cd "$( dirname -- "$SCRIPT_PATH"; )" > '/dev/null';
SCRIPT_PATH="$( pwd; )";
popd > '/dev/null';
It works for all versions, including
when called via multiple depth soft link,
when the file it
when script called by command "source" aka . (dot) operator.
when arg $0 is modified from caller.
"./script"
"/full/path/to/script"
"/some/path/../../another/path/script"
"./some/folder/script"
Alternatively, if the Bash script itself is a relative symlink you want to follow it and return the full path of the linked-to script:
pushd . > '/dev/null';
SCRIPT_PATH="${BASH_SOURCE[0]:-$0}";
while [ -h "$SCRIPT_PATH" ];
do
cd "$( dirname -- "$SCRIPT_PATH"; )";
SCRIPT_PATH="$( readlink -f -- "$SCRIPT_PATH"; )";
done
cd "$( dirname -- "$SCRIPT_PATH"; )" > '/dev/null';
SCRIPT_PATH="$( pwd; )";
popd > '/dev/null';
SCRIPT_PATH is given in full path, no matter how it is called.
Just make sure you locate this at start of the script.
You can use $BASH_SOURCE:
#!/usr/bin/env bash
scriptdir="$( dirname -- "$BASH_SOURCE"; )";
Note that you need to use #!/bin/bash and not #!/bin/sh since it's a Bash extension.
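A quick way to see the difference (a sketch; /tmp/demo.sh is a hypothetical path):
#!/usr/bin/env bash
echo "\$0          = $0"
echo "BASH_SOURCE = ${BASH_SOURCE[0]}"
# bash /tmp/demo.sh  ->  both print /tmp/demo.sh
# . /tmp/demo.sh     ->  $0 is typically "bash", BASH_SOURCE is still /tmp/demo.sh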
Here is an easy-to-remember script:
DIR="$( dirname -- "${BASH_SOURCE[0]}"; )"; # Get the directory name
DIR="$( realpath -e -- "$DIR"; )"; # Resolve its full path if need be
Short answer:
"`dirname -- "$0";`"
or (preferably):
"$( dirname -- "$0"; )"
This should do it:
DIR="$(dirname "$(realpath "$0")")"
This works with symlinks and spaces in path.
Please see the man pages for dirname and realpath.
Please add a comment on how to support macOS. I'm sorry I can't verify it.
pwd can be used to find the current working directory, and dirname to find the directory of a particular file (command that was run, is $0, so dirname $0 should give you the directory of the current script).
However, dirname gives precisely the directory portion of the filename, which more likely than not is going to be relative to the current working directory. If your script needs to change directory for some reason, then the output from dirname becomes meaningless.
I suggest the following:
#!/usr/bin/env bash
reldir="$( dirname -- "$0"; )";
cd "$reldir";
directory="$( pwd; )";
echo "Directory is ${directory}";
This way, you get an absolute, rather than a relative directory.
Since the script will be run in a separate Bash instance, there isn't any need to restore the working directory afterwards, but if you do want to change back in your script for some reason, you can easily assign the value of pwd to a variable before you change directory, for future use.
Although just
cd "$( dirname -- "$0"; )";
solves the specific scenario in the question, I find having the absolute path more useful generally.
SCRIPT_DIR=$( cd ${0%/*} && pwd -P )
I don't think this is as easy as others have made it out to be. pwd doesn't work, as the current directory is not necessarily the directory with the script. $0 doesn't always have the information either. Consider the following three ways to invoke a script:
./script
/usr/bin/script
script
In the first and third ways $0 doesn't have the full path information. In the second and third, pwd does not work. The only way to get the directory in the third way would be to run through the path and find the file with the correct match. Basically the code would have to redo what the OS does.
One way to do what you are asking would be to just hardcode the data in the /usr/share directory, and reference it by its full path. Data shouldn't be in the /usr/bin directory anyway, so this is probably the thing to do.
This gets the current working directory on Mac OS X v10.6.6 (Snow Leopard):
DIR=$(cd "$(dirname "$0")"; pwd)
$(dirname "$(readlink -f "$BASH_SOURCE")")
This is Linux specific, but you could use:
SELF=$(readlink /proc/$$/fd/255)
Here is a POSIX compliant one-liner:
SCRIPT_PATH=`dirname "$0"`; SCRIPT_PATH=`eval "cd \"$SCRIPT_PATH\" && pwd"`
# test
echo $SCRIPT_PATH
The shortest and most elegant way to do this is:
#!/bin/bash
DIRECTORY=$(cd `dirname $0` && pwd)
echo $DIRECTORY
This would work on all platforms and is super clean.
More details can be found in "Which directory is that bash script in?".
Summary:
FULL_PATH_TO_SCRIPT="$(realpath "${BASH_SOURCE[-1]}")"
# OR, if you do NOT need it to work for **sourced** scripts too:
# FULL_PATH_TO_SCRIPT="$(realpath "$0")"
# OR, depending on which path you want, in case of nested `source` calls
# FULL_PATH_TO_SCRIPT="$(realpath "${BASH_SOURCE[0]}")"
# OR, add `-s` to NOT expand symlinks in the path:
# FULL_PATH_TO_SCRIPT="$(realpath -s "${BASH_SOURCE[-1]}")"
SCRIPT_DIRECTORY="$(dirname "$FULL_PATH_TO_SCRIPT")"
SCRIPT_FILENAME="$(basename "$FULL_PATH_TO_SCRIPT")"
Details:
How to obtain the full file path, full directory, and base filename of any script being run OR sourced...
...even when the called script is called from within another bash function or script, or when nested sourcing is being used!
For many cases, all you need to acquire is the full path to the script you just called. This can be easily accomplished using realpath. Note that realpath is part of GNU coreutils. If you don't have it already installed (it comes default on Ubuntu), you can install it with sudo apt update && sudo apt install coreutils.
get_script_path.sh (for the latest version of this script, see get_script_path.sh in my eRCaGuy_hello_world repo):
#!/bin/bash
# A. Obtain the full path, and expand (walk down) symbolic links
# A.1. `"$0"` works only if the file is **run**, but NOT if it is **sourced**.
# FULL_PATH_TO_SCRIPT="$(realpath "$0")"
# A.2. `"${BASH_SOURCE[-1]}"` works whether the file is sourced OR run, and even
# if the script is called from within another bash function!
# NB: if `"${BASH_SOURCE[-1]}"` doesn't give you quite what you want, use
# `"${BASH_SOURCE[0]}"` instead in order to get the first element from the array.
FULL_PATH_TO_SCRIPT="$(realpath "${BASH_SOURCE[-1]}")"
# B.1. `"$0"` works only if the file is **run**, but NOT if it is **sourced**.
# FULL_PATH_TO_SCRIPT_KEEP_SYMLINKS="$(realpath -s "$0")"
# B.2. `"${BASH_SOURCE[-1]}"` works whether the file is sourced OR run, and even
# if the script is called from within another bash function!
# NB: if `"${BASH_SOURCE[-1]}"` doesn't give you quite what you want, use
# `"${BASH_SOURCE[0]}"` instead in order to get the first element from the array.
FULL_PATH_TO_SCRIPT_KEEP_SYMLINKS="$(realpath -s "${BASH_SOURCE[-1]}")"
# You can then also get the full path to the directory, and the base
# filename, like this:
SCRIPT_DIRECTORY="$(dirname "$FULL_PATH_TO_SCRIPT")"
SCRIPT_FILENAME="$(basename "$FULL_PATH_TO_SCRIPT")"
# Now print it all out
echo "FULL_PATH_TO_SCRIPT = \"$FULL_PATH_TO_SCRIPT\""
echo "SCRIPT_DIRECTORY = \"$SCRIPT_DIRECTORY\""
echo "SCRIPT_FILENAME = \"$SCRIPT_FILENAME\""
IMPORTANT note on nested source calls: if "${BASH_SOURCE[-1]}" above doesn't give you quite what you want, try using "${BASH_SOURCE[0]}" instead. The first (0) index gives you the first entry in the array, and the last (-1) index gives you the last entry in the array. Depending on what it is you're after, you may actually want the first entry. I discovered this to be the case when I sourced ~/.bashrc with . ~/.bashrc, which sourced ~/.bash_aliases with . ~/.bash_aliases, and I wanted the realpath (with expanded symlinks) to the ~/.bash_aliases file, NOT to the ~/.bashrc file. Since these are nested source calls, using "${BASH_SOURCE[0]}" gave me what I wanted: the expanded path to ~/.bash_aliases! Using "${BASH_SOURCE[-1]}", however, gave me what I did not want: the expanded path to ~/.bashrc.
Example command and output:
Running the script:
~/GS/dev/eRCaGuy_hello_world/bash$ ./get_script_path.sh
FULL_PATH_TO_SCRIPT = "/home/gabriel/GS/dev/eRCaGuy_hello_world/bash/get_script_path.sh"
SCRIPT_DIRECTORY = "/home/gabriel/GS/dev/eRCaGuy_hello_world/bash"
SCRIPT_FILENAME = "get_script_path.sh"
Sourcing the script with . get_script_path.sh or source get_script_path.sh (the result is the exact same as above because I used "${BASH_SOURCE[-1]}" in the script instead of "$0"):
~/GS/dev/eRCaGuy_hello_world/bash$ . get_script_path.sh
FULL_PATH_TO_SCRIPT = "/home/gabriel/GS/dev/eRCaGuy_hello_world/bash/get_script_path.sh"
SCRIPT_DIRECTORY = "/home/gabriel/GS/dev/eRCaGuy_hello_world/bash"
SCRIPT_FILENAME = "get_script_path.sh"
If you use "$0" in the script instead of "${BASH_SOURCE[-1]}", you'll get the same output as above when running the script, but this undesired output instead when sourcing the script:
~/GS/dev/eRCaGuy_hello_world/bash$ . get_script_path.sh
FULL_PATH_TO_SCRIPT = "/bin/bash"
SCRIPT_DIRECTORY = "/bin"
SCRIPT_FILENAME = "bash"
And, apparently if you use "$BASH_SOURCE" instead of "${BASH_SOURCE[-1]}", it will not work if the script is called from within another bash function. So, using "${BASH_SOURCE[-1]}" is therefore the best way to do it, as it solves both of these problems! See the references below.
Difference between realpath and realpath -s:
Note that realpath also successfully walks down symbolic links to determine and point to their targets rather than pointing to the symbolic link. If you do NOT want this behavior (sometimes I don't), then add -s to the realpath command above, making that line look like this instead:
# Obtain the full path, but do NOT expand (walk down) symbolic links; in
# other words: **keep** the symlinks as part of the path!
FULL_PATH_TO_SCRIPT="$(realpath -s "${BASH_SOURCE[-1]}")"
This way, symbolic links are NOT expanded. Rather, they are left as-is, as symbolic links in the full path.
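For example, with a hypothetical symlink ~/logs pointing at /var/log:
$ ln -s /var/log ~/logs
$ realpath ~/logs        # expands the symlink
/var/log
$ realpath -s ~/logs     # keeps the symlink in the path
/home/user/logs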
The code above is now part of my eRCaGuy_hello_world repo in this file here: bash/get_script_path.sh. Reference and run this file for full examples both with and withOUT symlinks in the paths. See the bottom of the file for example output in both cases.
References:
How to retrieve absolute path given relative
taught me about the BASH_SOURCE variable: Unix & Linux: determining path to sourced shell script
taught me that BASH_SOURCE is actually an array, and we want the last element from it for it to work as expected inside a function (hence why I used "${BASH_SOURCE[-1]}" in my code here): Unix & Linux: determining path to sourced shell script
man bash --> search for BASH_SOURCE:
BASH_SOURCE
An array variable whose members are the source filenames where the corresponding shell function names in the FUNCNAME array variable are defined. The shell function ${FUNCNAME[$i]} is defined in the file ${BASH_SOURCE[$i]} and called from ${BASH_SOURCE[$i+1]}.
See also:
[my answer] Unix & Linux: determining path to sourced shell script
#!/bin/sh
PRG="$0"
# need this for relative symlinks
while [ -h "$PRG" ] ; do
    PRG=`readlink "$PRG"`
done
scriptdir=`dirname "$PRG"`
Here is the simple, correct way:
actual_path=$(readlink -f "${BASH_SOURCE[0]}")
script_dir=$(dirname "$actual_path")
Explanation:
${BASH_SOURCE[0]} - the full path to the script. The value of this will be correct even when the script is being sourced, e.g. source <(echo 'echo $0') prints bash, while replacing it with ${BASH_SOURCE[0]} will print the full path of the script. (Of course, this assumes you're OK taking a dependency on Bash.)
readlink -f - Recursively resolves any symlinks in the specified path. This is a GNU extension, and not available on (for example) BSD systems. If you're running a Mac, you can use Homebrew to install GNU coreutils and supplant this with greadlink -f.
And of course dirname gets the parent directory of the path.
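If the script has to run on both Linux and macOS, one approach is to probe for the GNU tool first (a sketch; greadlink is the name Homebrew's coreutils installs):
if command -v greadlink >/dev/null 2>&1; then
    actual_path=$(greadlink -f "${BASH_SOURCE[0]}")
else
    actual_path=$(readlink -f "${BASH_SOURCE[0]}")
fi
script_dir=$(dirname "$actual_path")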
I tried all of these and none worked. One was very close, but it had a tiny bug that broke it badly; they forgot to wrap the path in quotation marks.
Also, a lot of people assume you're running the script from a shell, and forget that when a script is launched some other way, the working directory can default to your home directory.
Try this directory on for size:
/var/No one/Thought/About Spaces Being/In a Directory/Name/And Here's your file.text
This gets it right regardless how or where you run it:
#!/bin/bash
echo "pwd: `pwd`"
echo "\$0: $0"
echo "basename: `basename "$0"`"
echo "dirname: `dirname "$0"`"
So to make it actually useful, here's how to change to the directory of the running script:
cd "`dirname "$0"`"
This is a slight revision to the solution e-satis and 3bcdnlklvc04a pointed out in their answer:
SCRIPT_DIR=''
pushd "$(dirname "$(readlink -f "$BASH_SOURCE")")" > /dev/null && {
    SCRIPT_DIR="$PWD"
    popd > /dev/null
}
This should still work in all the cases they listed.
This will prevent popd after a failed pushd. Thanks to konsolebox.
Try using:
real=$(realpath "$(dirname "$0")")
I would use something like this:
# Retrieve the full pathname of the called script
scriptPath=$(which "$0")
# Check whether the path is a link or not
if [ -L "$scriptPath" ]; then
    # It is a link; retrieve the target path and get its directory name
    sourceDir=$(dirname "$(readlink -f "$scriptPath")")
else
    # Otherwise just get the directory name of the script path
    sourceDir=$(dirname "$scriptPath")
fi
For systems having GNU coreutils readlink (for example, Linux):
$(readlink -f "$(dirname "$0")")
There's no need to use BASH_SOURCE when $0 contains the script filename.
$_ is worth mentioning as an alternative to $0. If you're running a script from Bash, the accepted answer can be shortened to:
DIR="$( dirname "$_" )"
Note that this has to be the first statement in your script.
These are short ways to get script information:
Folders and files:
Script: "/tmp/src dir/test.sh"
Calling folder: "/tmp/src dir/other"
Using these commands:
echo Script-Dir : `dirname "$(realpath "$0")"`
echo Script-Dir : $( cd ${0%/*} && pwd -P )
echo Script-Dir : $(dirname "$(readlink -f "$0")")
echo
echo Script-Name : `basename "$(realpath "$0")"`
echo Script-Name : `basename "$0"`
echo
echo Script-Dir-Relative : `dirname "$BASH_SOURCE"`
echo Script-Dir-Relative : `dirname "$0"`
echo
echo Calling-Dir : `pwd`
And I got this output:
Script-Dir : /tmp/src dir
Script-Dir : /tmp/src dir
Script-Dir : /tmp/src dir
Script-Name : test.sh
Script-Name : test.sh
Script-Dir-Relative : ..
Script-Dir-Relative : ..
Calling-Dir : /tmp/src dir/other
Also see: https://pastebin.com/J8KjxrPF
This works in Bash 3.2:
path="$( dirname "$( which "$0" )" )"
Suppose you have a ~/bin directory in your $PATH, with a script A inside it. A sources the script ~/bin/lib/B. You know where the included script is relative to the original one (the lib subdirectory), but not where it is relative to the user's current directory.
This is solved by the following (inside A):
source "$( dirname "$( which "$0" )" )/lib/B"
It doesn't matter where the user is or how he/she calls the script. This will always work.
I've compared many of the answers given, and came up with some more compact solutions. These seem to handle all of the crazy edge cases that arise from your favorite combination of:
Absolute paths or relative paths
File and directory soft links
Invocation as script, bash script, bash -c script, source script, or . script
Spaces, tabs, newlines, Unicode, etc. in directories and/or filename
Filenames beginning with a hyphen
If you're running from Linux, it seems that using the proc handle is the best solution to locate the fully resolved source of the currently running script (in an interactive session, the link points to the respective /dev/pts/X):
resolved="$(readlink /proc/$$/fd/255 && echo X)" && resolved="${resolved%$'\nX'}"
This has a small bit of ugliness to it, but the fix is compact and easy to understand. We aren't using bash primitives only, but I'm okay with that because readlink simplifies the task considerably. The echo X adds an X to the end of the variable string so that any trailing whitespace in the filename doesn't get eaten, and the parameter substitution ${VAR%X} at the end of the line gets rid of the X. Because readlink adds a newline of its own (which would normally be eaten in the command substitution if not for our previous trickery), we have to get rid of that, too. This is most easily accomplished using the $'' quoting scheme, which lets us use escape sequences such as \n to represent newlines (this is also how you can easily make deviously named directories and files).
The above should cover your needs for locating the currently running script on Linux, but if you don't have the proc filesystem at your disposal, or if you're trying to locate the fully resolved path of some other file, then maybe you'll find the below code helpful. It's only a slight modification from the above one-liner. If you're playing around with strange directory/filenames, checking the output with both ls and readlink is informative, as ls will output "simplified" paths, substituting ? for things like newlines.
absolute_path=$(readlink -e -- "${BASH_SOURCE[0]}" && echo x) && absolute_path=${absolute_path%?x}
dir=$(dirname -- "$absolute_path" && echo x) && dir=${dir%?x}
file=$(basename -- "$absolute_path" && echo x) && file=${file%?x}
ls -l -- "$dir/$file"
printf '$absolute_path: "%s"\n' "$absolute_path"
I believe I've got this one. I'm late to the party, but I think some will appreciate it being here if they come across this thread. The comments should explain:
#!/bin/sh # dash bash ksh # !zsh (issues). G. Nixon, 12/2013. Public domain.
## 'linkread' or 'fullpath' or (you choose) is a little tool to recursively
## dereference symbolic links (ala 'readlink') until the originating file
## is found. This is effectively the same function provided in stdlib.h as
## 'realpath' and on the command line in GNU 'readlink -f'.
## Neither of these tools, however, is particularly accessible on the many
## systems that do not have the GNU implementation of readlink, nor ship
## with a system compiler (not to mention the requisite knowledge of C).
## This script is written with portability and (to the extent possible, speed)
## in mind, hence the use of printf for echo and case statements where they
## can be substituted for test, though I've had to scale back a bit on that.
## It is (to the best of my knowledge) written in standard POSIX shell, and
## has been tested with bash-as-bin-sh, dash, and ksh93. zsh seems to have
## issues with it, though I'm not sure why; so probably best to avoid for now.
## Particularly useful (in fact, the reason I wrote this) is the fact that
## it can be used within a shell script to find the path of the script itself.
## (I am sure the shell knows this already; but most likely for the sake of
## security it is not made readily available. The implementation of "$0"
## specifies that $0 must be the location of the **last** symbolic link in
## a chain, or wherever it resides in the path.) This can be used for some
## ...interesting things, like self-duplicating and self-modifying scripts.
## Currently supported are three errors: whether the file specified exists
## (ala ENOENT), whether its target exists/is accessible; and the special
## case of when a symbolic link references itself "foo -> foo": a common error
## for beginners, since 'ln' does not produce an error if the order of link
## and target are reversed on the command line. (See POSIX signal ELOOP.)
## It would probably be rather simple to write to use this as a basis for
## a pure shell implementation of the 'symlinks' util included with Linux.
## As an aside, the amount of code below **completely** belies the amount of
## effort it took to get this right -- but I guess that's coding for you.
##===-------------------------------------------------------------------===##
for argv; do :; done # Last parameter on command line, for options parsing.
## Error messages. Use functions so that we can sub in when the error occurs.
recurses(){ printf "Self-referential:\n\t$argv ->\n\t$argv\n" ;}
dangling(){ printf "Broken symlink:\n\t$argv ->\n\t"$(readlink "$argv")"\n" ;}
errnoent(){ printf "No such file: "$@"\n" ;} # Borrow a horrible signal name.
# Probably best not to install as 'pathfull', if you can avoid it.
pathfull(){ cd "$(dirname "$@")"; link="$(readlink "$(basename "$@")")"
 ## 'test' and 'ls' report different status for bad symlinks, so we use this.
 if [ ! -e "$@" ]; then if $(ls -d "$@" 2>/dev/null) 2>/dev/null; then
  errnoent 1>&2; exit 1; elif [ ! -e "$@" -a "$link" = "$@" ]; then
  recurses 1>&2; exit 1; elif [ ! -e "$@" ] && [ ! -z "$link" ]; then
  dangling 1>&2; exit 1; fi
 fi
 ## Not a link, but there might be one in the path, so 'cd' and 'pwd'.
 if [ -z "$link" ]; then if [ "$(dirname "$@" | cut -c1)" = '/' ]; then
  printf "$@\n"; exit 0; else printf "$(pwd)/$(basename "$@")\n"; fi; exit 0
 fi
 ## Walk the symlinks back to the origin. Calls itself recursively as needed.
 while [ "$link" ]; do
  cd "$(dirname "$link")"; newlink="$(readlink "$(basename "$link")")"
  case "$newlink" in
   "$link") dangling 1>&2 && exit 1 ;;
   '') printf "$(pwd)/$(basename "$link")\n"; exit 0 ;;
   *) link="$newlink" && pathfull "$link" ;;
  esac
 done
 printf "$(pwd)/$(basename "$newlink")\n"
}
## Demo. Install somewhere deep in the filesystem, then symlink somewhere
## else, symlink again (maybe with a different name) elsewhere, and link
## back into the directory you started in (or something.) The absolute path
## of the script will always be reported in the usage, along with "$0".
if [ -z "$argv" ]; then scriptname="$(pathfull "$0")"
 # Yay ANSI l33t codes! Fancy.
 printf "\n\033[3mfrom/as: \033[4m$0\033[0m\n\n\033[1mUSAGE:\033[0m "
 printf "\033[4m$scriptname\033[24m [ link | file | dir ]\n\n "
 printf "Recursive readlink for the authoritative file, symlink after "
 printf "symlink.\n\n\n \033[4m$scriptname\033[24m\n\n "
 printf " From within an invocation of a script, locate the script's "
 printf "own file\n (no matter where it has been linked or "
 printf "from where it is being called).\n\n"
else pathfull "$@"
fi
Try the following cross-compatible solution:
CWD="$(cd -P -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd -P)"
This is useful since commands such as realpath or readlink might not be available (depending on the operating system).
Note: In Bash, it's recommended to use ${BASH_SOURCE[0]} instead of $0, otherwise path can break when sourcing the file (source/.).
Alternatively you can try the following function in Bash:
realpath () {
    [[ $1 = /* ]] && echo "$1" || echo "$PWD/${1#./}"
}
This function takes one argument. If the argument is already an absolute path, it prints it as-is; otherwise it prints $PWD followed by the filename argument (with any leading ./ prefix removed).
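Example usage of that fallback function (with hypothetical files):
$ cd /tmp
$ realpath ./notes.txt
/tmp/notes.txt
$ realpath /etc/hosts
/etc/hosts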
Related:
How can I set the current working directory to the directory of the script in Bash?
Bash script absolute path with OS X
Reliable way for a Bash script to get the full path to itself

execute command from the terminal vs from a script - path with wildcard - prevent glob expansion

I am experiencing a strange problem and I think it has something to do with file/directory globbing.
script
echo "tar -zcvf $file $base/$target $exclude_args"
cd $base && tar -zcvf $file $base/$target $exclude_args
output
tar -zcvf www_2017-04-24.tar.gz /var/www --exclude '/var/www/bak/*/*' --exclude '/var/www/test'
When running the script, the exclude paths are ignored (every directory is gzipped)
When running the output directly from PuTTY, the directories under /var/www/bak/*/* are excluded from the gzip
update
parse_exclude_paths (){
    # escape forward slashes to keep the paths from expanding
    args=$(echo "$exclude" | sed 's,/,\\\/,g')
    args=$(printf " --exclude '%s'" $args)
    # strip the escapes again
    echo "$args" | sed 's,\\\/,/,g'
}
exclude="/var/www/bak/*/* /var/www/test"
exclude_args=''
if [ ! -z "$exclude" ]; then
    exclude_args="$(parse_exclude_paths "$exclude")"
fi
update 2
If the command is sent via SSH there is no problems and the exclude paths are excluded from the gzip
ssh root@$host 'cd '"$base"' && tar -zcvf $file '"$base/$target $exclude_args"
I snooped your question history and saw that you're familiar with PHP. Here's the equivalent problem in PHP:
function foo($arg1, $arg2) {
    echo "You passed $arg1 and $arg2\n";
}
$var='"one", "two"';
echo "Running: foo($var);\n";
foo($var);
The echo prints Running: foo("one", "two"); and that command works just fine if you copy-paste it!
Why does foo($var); instead write PHP Warning: Missing argument 2 for foo()?
The answer is of course that literal quotes in your variables don't matter for how the function is called. This is the same in both PHP and shell.
The solution in both PHP and Bash is to use an array:
#!/bin/bash
file="www_2017-04-24.tar.gz"
base="/var"
target="www"
exclude_args=( --exclude '/var/www/bak/*/*' )
cd "$base" && tar -zcvf "$file" "$base/$target" "${exclude_args[@]}"
sh is more primitive and doesn't support arbitrary arrays, but we can reuse the positional parameters to the same effect:
#!/bin/sh
file="www_2017-04-24.tar.gz"
base="/var"
target="www"
set -- --exclude '/var/www/bak/*/*' # Now assigned to $1, $2, etc.
cd "$base" && tar -zcvf "$file" "$base/$target" "$@"
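To convince yourself that each word assigned with set -- stays a single argument, a quick check (a sketch):
set -- --exclude '/var/www/bak/*/*'
printf '<%s>\n' "$@"
# <--exclude>
# </var/www/bak/*/*>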
Another option is to use eval to re-interpret a string as a shell command. This means that anyone who can influence your variables can take over your system, but that may be ok if all the variables come from users with equivalent privileges:
eval "tar -zcvf $file $base/$target $exclude_args"

Spaces in directory name Bash

I'm new to bash and I'm working on a script that traverses a tar.gz archive and, in each file, changes a specified string to another string. The script's args: name of archive, searched string, target word.
My problem is that when the archive name contains a space (e.g. I run the script with the args: change_strings.sh "/tmp/tmp.m7xYn5EQ2y/work/data txt" a A) I get the following errors:
on the line if [ ! -f $filename ] ; then: [: data: binary operator expected, and from dirname: extra operand `txt'.
Here is my code:
#!/bin/bash
filename="${1##*/}"
VAR="$1"
DIR=$(dirname ${VAR})
cd "$DIR"
if [ ! -f $filename ] ; then
    echo "no such archive" >&2
    exit 1
fi
if ! tar tf $filename &> /dev/null; then
    echo "this is not .tar.gz archive" >&2
    exit 1
fi
dir=`mktemp -dt 'test.XXXXXX'`
tar -xf $filename -C $dir #extract archive to dir
cd $dir #go to argument directory
FILES=$dir"/*"
for f in $FILES
do
    sed -i "s/$2/$3/g" "$f"
done
tar -czf $filename * #create tar gz archive with files in current directory
mv -f $filename $cdir"/"$filename #move archive
rm -r $dir #remove tmp directory
The proper way to handle this is to surround your variables with double quotes.
var="/foo/bar baz"
CMD $var # equivalent to: CMD /foo/bar baz
The above code will execute CMD with two arguments, /foo/bar and baz.
CMD "$var"
This will execute CMD on "/foo/bar baz". It is a best practice to always surround your variables with double quotes in most places.
Welcome to stackoverflow!
For the convenience of current and future readers, here's a small, self contained example showing the problem:
filename="my file.txt"
if [ ! -f $filename ]
then
echo "file does not exist"
fi
Here's the output we get:
$ bash file
file: line 2: [: my: binary operator expected
And here's the output we expected to get:
file does not exist
Why are they not the same?
Here's what shellcheck has to say about it:
$ shellcheck file
In file line 2:
if [ ! -f $filename ]
^-- SC2086: Double quote to prevent globbing and word splitting.
and indeed, if we double quote it, we get the expected output:
$ cat file
filename="my file.txt"
if [ ! -f "$filename" ]
then
    echo "file does not exist"
fi
$ bash file
file does not exist
You should be double quoting all your variables.
However, you have to take care with $FILES because it contains a glob/wildcards that you want to expand along with potential spaces that you don't want to wordsplit on. The easiest way is to just not put it in a variable and instead write it out:
for f in "$dir"/*
do
...
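If the extracted directory might be empty, you may also want nullglob so the loop doesn't iterate over the literal pattern (a sketch, not part of the answer above):
shopt -s nullglob          # "$dir"/* expands to nothing when the directory is empty
for f in "$dir"/*
do
    sed -i "s/$2/$3/g" "$f"
done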

linux zip and exclude dir via bash/shell script

I am trying to write a bash/shell script to zip up a specific folder and ignore certain sub-dirs in that folder.
This is the folder I am trying to zip "sync_test5":
My bash script generates an ignore list and calls the zip command like this:
#!/bin/bash
SYNC_WEB_ROOT_BASE_DIR="/home/www-data/public_html"
SYNC_WEB_ROOT_BACKUP_DIR="sync_test5"
SYNC_WEB_ROOT_IGNORE_DIR="dir_to_ignore dir2_to_ignore"
ignorelist=""
if [ "$SYNC_WEB_ROOT_IGNORE_DIR" != "" ];
then
    for ignoredir in $SYNC_WEB_ROOT_IGNORE_DIR
    do
        ignorelist="$ignorelist $SYNC_WEB_ROOT_BACKUP_DIR/$ignoredir/**\*"
    done
fi
FILE="$SYNC_BACKUP_DIR/$DATETIMENOW.website.zip"
cd $SYNC_WEB_ROOT_BASE_DIR;
zip -r $FILE $SYNC_WEB_ROOT_BACKUP_DIR -x $ignorelist >/dev/null
echo "Done"
Now this script runs without error, however it is not ignoring/excluding the dirs I've specified.
So, I had the shell script output the command it tried to run, which was:
zip -r 12-08-2014_072810.website.zip sync_test5 -x sync_test5/dir_to_ignore/**\* sync_test5/dir2_to_ignore/**\*
Now if I run the above command directly in PuTTY, it works.
So, why doesn't the exclude in my shell script work as intended? The command being executed is identical (in the shell script and in PuTTY directly).
Because backslash escapes in a variable are not evaluated after word splitting.
If you have a='123\4', echo $a would give
123\4
But if you do it directly like echo 123\4, you'd get
1234
Clearly the arguments you pass with the variable and without the variables are different.
You probably just meant to not quote your argument with backslash:
ignorelist="$ignorelist $SYNC_WEB_ROOT_BACKUP_DIR/$ignoredir/***"
Btw, what actual works is a non-evaluated glob pattern:
zip -r 12-08-2014_072810.website.zip sync_test5 -x 'sync_test5/dir_to_ignore/***' 'sync_test5/dir2_to_ignore/***'
You can verify this with
echo zip -r 12-08-2014_072810.website.zip sync_test5 -x sync_test5/dir_to_ignore/**\* sync_test5/dir2_to_ignore/**\*
And this is my suggestion:
#!/bin/bash
SYNC_WEB_ROOT_BASE_DIR="/home/www-data/public_html"
SYNC_WEB_ROOT_BACKUP_DIR="sync_test5"
SYNC_WEB_ROOT_IGNORE_DIR=("dir_to_ignore" "dir2_to_ignore")
IGNORE_LIST=()
if [[ -n $SYNC_WEB_ROOT_IGNORE_DIR ]]; then
    for IGNORE_DIR in "${SYNC_WEB_ROOT_IGNORE_DIR[@]}"; do
        IGNORE_LIST+=("$SYNC_WEB_ROOT_BACKUP_DIR/$IGNORE_DIR/***") ## "$SYNC_WEB_ROOT_BACKUP_DIR/$IGNORE_DIR/*" perhaps is enough?
    done
fi
FILE="$SYNC_BACKUP_DIR/$DATETIMENOW.website.zip" ## Where is $SYNC_BACKUP_DIR set?
cd "$SYNC_WEB_ROOT_BASE_DIR";
zip -r "$FILE" "$SYNC_WEB_ROOT_BACKUP_DIR" -x "${IGNORE_LIST[@]}" >/dev/null
echo "Done"
This is what I ended up with:
#!/bin/bash
# This script zips a directory, excluding specified files, types and subdirectories.
# while zipping the directory it excludes hidden directories and certain file types
[[ "`/usr/bin/tty`" == "not a tty" ]] && . ~/.bash_profile
DIRECTORY=$(cd `dirname $0` && pwd)
if [[ -z $1 ]]; then
    echo "Usage: managed_directory_compressor /your-directory/ zip-file-name"
else
    DIRECTORY_TO_COMPRESS=${1%/}
    ZIPPED_FILE="$2.zip"
    COMPRESS_IGNORE_FILE=("\.git" "*.zip" "*.csv" "*.json" "gulpfile.js" "*.rb" "*.bak" "*.swp" "*.back" "*.merge" "*.txt" "*.sh" "bower_components" "node_modules")
    COMPRESS_IGNORE_DIR=("bower_components" "node_modules")
    IGNORE_LIST=("*/\.*" "\.*" "\/\.*")
    if [[ -n $COMPRESS_IGNORE_FILE ]]; then
        for IGNORE_FILES in "${COMPRESS_IGNORE_FILE[@]}"; do
            IGNORE_LIST+=("$DIRECTORY_TO_COMPRESS/$IGNORE_FILES/*")
        done
        for IGNORE_DIR in "${COMPRESS_IGNORE_DIR[@]}"; do
            IGNORE_LIST+=("$DIRECTORY_TO_COMPRESS/$IGNORE_DIR/")
        done
    fi
    zip -r "$ZIPPED_FILE" "$DIRECTORY_TO_COMPRESS" -x "${IGNORE_LIST[@]}" # >/dev/null
    # echo zip -r "$ZIPPED_FILE" "$DIRECTORY_TO_COMPRESS" -x "${IGNORE_LIST[@]}" # >/dev/null
    echo "$DIRECTORY_TO_COMPRESS" "compressed as" "$ZIPPED_FILE".
fi
After a few trial and error, I have managed to fix this problem by changing this line:
ignorelist="$ignorelist $SYNC_WEB_ROOT_BACKUP_DIR/$ignoredir/**\*"
to:
ignorelist="$ignorelist $SYNC_WEB_ROOT_BACKUP_DIR/$ignoredir/***"
Not sure why this worked, but it does :)

grep from tar.gz without extracting [faster one]

I am trying to grep a pattern in dozens of .tar.gz files, but it's very slow. I am using:
tar -ztf file.tar.gz | while read FILENAME
do
    if tar -zxf file.tar.gz "$FILENAME" -O | grep "string" > /dev/null
    then
        echo "$FILENAME contains string"
    fi
done
If you have zgrep you can use
zgrep -a string file.tar.gz
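This works because a compressed tar is a single gzip stream, and -a makes grep treat the binary tar headers as text. Since zgrep passes its options through to grep, variations like -l work too (a sketch):
zgrep -la 'string' *.tar.gz   # list which archives match instead of printing lines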
You can use the --to-command option to pipe files to an arbitrary script. Using this you can process the archive in a single pass (and without a temporary file). See also this question, and the manual.
Armed with the above information, you could try something like:
$ tar xf file.tar.gz --to-command "awk '/bar/ { print ENVIRON[\"TAR_FILENAME\"]; exit }'"
bfe2/.bferc
bfe2/CHANGELOG
bfe2/README.bferc
I know this question is 4 years old, but I have a couple different options:
Option 1: Using tar --to-command grep
The following line will look in example.tgz for PATTERN. This is similar to @Jester's example, but I couldn't get his pattern matching to work.
tar xzf example.tgz --to-command 'grep --label="$TAR_FILENAME" -H PATTERN ; true'
Option 2: Using tar -tzf
The second option is using tar -tzf to list the files, then go through them with grep. You can create a function to use it over and over:
targrep () {
    for i in $(tar -tzf "$1"); do
        results=$(tar -Oxzf "$1" "$i" | grep --label="$i" -H "$2")
        echo "$results"
    done
}
Usage:
targrep example.tar.gz "pattern"
Both the below options work well.
$ zgrep -ai 'CDF_FEED' FeedService.log.1.05-31-2019-150003.tar.gz | more
2019-05-30 19:20:14.568 ERROR 281 --- [http-nio-8007-exec-360] DrupalFeedService : CDF_FEED_SERVICE::CLASSIFICATION_ERROR:408: Classification failed even after maximum retries for url : abcd.html
$ zcat FeedService.log.1.05-31-2019-150003.tar.gz | grep -ai 'CDF_FEED'
2019-05-30 19:20:14.568 ERROR 281 --- [http-nio-8007-exec-360] DrupalFeedService : CDF_FEED_SERVICE::CLASSIFICATION_ERROR:408: Classification failed even after maximum retries for url : abcd.html
If this is really slow, I suspect you're dealing with a large archive file. It's going to uncompress it once to extract the file list, and then uncompress it N times--where N is the number of files in the archive--for the grep. In addition to all the uncompressing, it's going to have to scan a fair bit into the archive each time to extract each file. One of tar's biggest drawbacks is that there is no table of contents at the beginning. There's no efficient way to get information about all the files in the archive and only read that portion of the file. It essentially has to read all of the file up to the thing you're extracting every time; it can't just jump to a filename's location right away.
The easiest thing you can do to speed this up would be to uncompress the file first (gunzip file.tar.gz) and then work on the .tar file. That might help enough by itself. It's still going to loop through the entire archive N times, though.
If you really want this to be efficient, your only option is to completely extract everything in the archive before processing it. Since your problem is speed, I suspect this is a giant file that you don't want to extract first, but if you can, this will speed things up a lot:
tar zxf file.tar.gz
for f in hopefullySomeSubdir/*; do
    grep -l "string" "$f"
done
Note that grep -l prints the name of any matching file, quits after the first match, and is silent if there's no match. That alone will speed up the grepping portion of your command, so even if you don't have the space to extract the entire archive, grep -l will help. If the files are huge, it will help a lot.
For starters, you could start more than one process:
tar -ztf file.tar.gz | while read FILENAME
do
    (if tar -zxf file.tar.gz "$FILENAME" -O | grep -l "string"
    then
        echo "$FILENAME contains string"
    fi) &
done
The ( ... ) & creates a new detached (read: the parent shell does not wait for the child)
process.
After that, you should optimize the extracting of your archive. The read is no problem,
as the OS should have cached the file access already. However, tar needs to unpack
the archive every time the loop runs, which can be slow. Unpacking the archive once
and iterating over the result may help here:
tempPath=$(mktemp -d)
tar -zxf file.tar.gz -C "$tempPath" &&
find "$tempPath" -type f | while read FILENAME
do
    (if grep -l "string" "$FILENAME"
    then
        echo "$FILENAME contains string"
    fi) &
done && rm -r "$tempPath"
find is used here, to get a list of files in the target directory of tar, which we're iterating over, for each file searching for a string.
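A more compact variant of the same extract-once idea (a sketch; grep -r recurses by itself, so the find loop isn't strictly needed):
tmp=$(mktemp -d)
tar -zxf file.tar.gz -C "$tmp" &&
grep -rl "string" "$tmp"      # print the names of matching extracted files
rm -r "$tmp"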
Edit: Use grep -l to speed up things, as Jim pointed out. From man grep:
-l, --files-with-matches
Suppress normal output; instead print the name of each input file from which output would
normally have been printed. The scanning will stop on the first match. (-l is specified
by POSIX.)
I am trying to grep a pattern in dozens of .tar.gz files, but it's very slow
tar -ztf file.tar.gz | while read FILENAME
do
    if tar -zxf file.tar.gz "$FILENAME" -O | grep "string" > /dev/null
    then
        echo "$FILENAME contains string"
    fi
done
That's actually very easy with ugrep option -z:
-z, --decompress
Decompress files to search, when compressed. Archives (.cpio,
.pax, .tar, and .zip) and compressed archives (e.g. .taz, .tgz,
.tpz, .tbz, .tbz2, .tb2, .tz2, .tlz, and .txz) are searched and
matching pathnames of files in archives are output in braces. If
-g, -O, -M, or -t is specified, searches files within archives
whose name matches globs, matches file name extensions, matches
file signature magic bytes, or matches file types, respectively.
Supported compression formats: gzip (.gz), compress (.Z), zip,
bzip2 (requires suffix .bz, .bz2, .bzip2, .tbz, .tbz2, .tb2, .tz2),
lzma and xz (requires suffix .lzma, .tlz, .xz, .txz).
Which requires just one command to search file.tar.gz as follows:
ugrep -z "string" file.tar.gz
This greps each of the archived files to display matches. Archived filenames are shown in braces to distinguish them from ordinary filenames. For example:
$ ugrep -z "Hello" archive.tgz
{Hello.bat}:echo "Hello World!"
Binary file archive.tgz{Hello.class} matches
{Hello.java}:public class Hello // prints a Hello World! greeting
{Hello.java}: { System.out.println("Hello World!");
{Hello.pdf}:(Hello)
{Hello.sh}:echo "Hello World!"
{Hello.txt}:Hello
If you just want the file names, use option -l (--files-with-matches) and customize the filename output with option --format="%z%~" to get rid of the braces:
$ ugrep -z Hello -l --format="%z%~" archive.tgz
Hello.bat
Hello.class
Hello.java
Hello.pdf
Hello.sh
Hello.txt
All of the code above was really helpful, but none of it quite answered my own need: grep all *.tar.gz files in the current directory for a pattern specified as an argument to a reusable script, and output:
The name of both the archive file and the extracted file
The line number where the pattern was found
The contents of the matching line
It's what I was really hoping that zgrep could do for me and it just can't.
Here's my solution:
pattern=$1
for f in *.tar.gz; do
    echo "$f:"
    tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true";
done
You can also replace the tar line with the following if you'd like to test that all variables are expanding properly with a basic echo statement:
tar -xzf "$f" --to-command 'echo "f:`basename $TAR_FILENAME` s:'"$pattern\""
Let me explain what's going on. Hopefully, the for loop and the echo of the archive filename in question is obvious.
tar -xzf: x extract, z filter through gzip, f based on the following archive file...
"$f": The archive file provided by the for loop (such as what you'd get by doing an ls) in double-quotes to allow the variable to expand and ensure that the script is not broken by any file names with spaces, etc.
--to-command: Pass the output of the tar command to another command rather than actually extracting files to the filesystem. Everything after this specifies what the command is (grep) and what arguments we're passing to that command.
Let's break that part down by itself, since it's the "secret sauce" here.
'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
First, we use a single-quote to start this chunk so that the executed sub-command (basename $TAR_FILENAME) is not immediately expanded/resolved. More on that in a moment.
grep: The command to be run on the (not actually) extracted files
--label=: The label to prepend to the results, the value of which is enclosed in double-quotes because we do want the grep command to resolve the $TAR_FILENAME environment variable passed in by the tar command.
basename $TAR_FILENAME: Runs as a command (surrounded by backticks) and removes directory path and outputs only the name of the file
-Hin: H Display filename (provided by the label), i Case insensitive search, n Display line number of match
Then we "end" the first part of the command string with a single quote and start up the next part with a double quote so that the $pattern, passed in as the first argument, can be resolved.
Realizing which quotes I needed to use where was the part that tripped me up the longest. Hopefully, this all makes sense to you and helps someone else out. Also, I hope I can find this in a year when I need it again (and I've forgotten about the script I made for it already!)
And it's been a couple of weeks since I wrote the above and it's still super useful... but it wasn't quite good enough, as files have piled up and searching for things has gotten messier. I needed a way to limit what I looked at by the date of the file (only looking at more recent files). So here's that code. Hopefully it's fairly self-explanatory.
if [ -z "$1" ]; then
    echo "Look within all tar.gz files for a string pattern, optionally only in recent files"
    echo "Usage: targrep <string to search for> [start date]"
fi
pattern=$1
startdatein=$2
startdate=$(date -d "$startdatein" +%s)
for f in *.tar.gz; do
    filedate=$(date -r "$f" +%s)
    if [[ -z "$startdatein" ]] || [[ $filedate -ge $startdate ]]; then
        echo "$f:"
        tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
    fi
done
And I can't stop tweaking this thing. I added an argument to filter by the name of the output files in the tar file. Wildcards work, too.
Usage:
targrep.sh [-d <start date>] [-f <filename to include>] <string to search for>
Example:
targrep.sh -d "1/1/2019" -f "*vehicle_models.csv" ford
while getopts "d:f:" opt; do
    case $opt in
        d) startdatein=$OPTARG;;
        f) targetfile=$OPTARG;;
    esac
done
shift "$((OPTIND-1))" # Discard options and bring forward remaining arguments
pattern=$1
echo "Searching for: $pattern"
if [[ -n $targetfile ]]; then
    echo "in filenames: $targetfile"
fi
startdate=$(date -d "$startdatein" +%s)
for f in *.tar.gz; do
    filedate=$(date -r "$f" +%s)
    if [[ -z "$startdatein" ]] || [[ $filedate -ge $startdate ]]; then
        echo "$f:"
        if [[ -z "$targetfile" ]]; then
            tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
        else
            tar -xzf "$f" --no-anchored "$targetfile" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
        fi
    fi
done
zgrep works fine for me, but only if all the files inside the archive are plain text.
Nothing seems to work if the tgz file contains gzip files.
You can mount the TAR archive with ratarmount and then simply search for the pattern in the mounted view:
pip install --user ratarmount
ratarmount large-archive.tar mountpoint
grep -r '<pattern>' mountpoint/
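When you're done searching, unmount the view again (the benchmark code further down does the same):
fusermount -u mountpoint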
This is much faster than iterating over each file and piping it to grep separately, especially for compressed TARs. Here are benchmark results in seconds for a 55 MiB uncompressed and 42 MiB compressed TAR archive containing 40 files:
Compression   Ratarmount      Bash loop over tar -O
none          0.31 +- 0.01     0.55 +- 0.02
gzip          1.1  +- 0.1     13.5  +- 0.1
bzip2         1.2  +- 0.1     97.8  +- 0.2
Of course, these results are highly dependent on the archive size and how many files the archive contains. These test examples are pretty small because I didn't want to wait too long. But, they already exemplify the problem well enough. The more files there are, the longer it takes for tar -O to jump to the correct file. And for compressed archives, it will be quadratically slower the larger the archive size is because everything before the requested file has to be decompressed and each file is requested separately. Both of these problems are solved by ratarmount.
This is the code for benchmarking:
function checkFilesWithRatarmount()
{
    local pattern=$1
    local archive=$2
    ratarmount "$archive" "$archive.mountpoint"
    'grep' -r -l "$pattern" "$archive.mountpoint/"
}

function checkEachFileViaStdOut()
{
    local pattern=$1
    local archive=$2
    tar --list --file "$archive" | while read -r file; do
        if tar -x --file "$archive" -O -- "$file" | grep -q "$pattern"; then
            echo "Found pattern in: $file"
        fi
    done
}

function createSampleTar()
{
    for i in $( seq 40 ); do
        head -c $(( 1024 * 1024 )) /dev/urandom | base64 > $i.dat
    done
    tar -czf "$1" [0-9]*.dat
}

createSampleTar myarchive.tar.gz
time checkEachFileViaStdOut ABCD myarchive.tar.gz
time checkFilesWithRatarmount ABCD myarchive.tar.gz
sleep 0.5s
fusermount -u myarchive.tar.gz.mountpoint
In my case the tarballs have a lot of tiny files and I want to know what archived file inside the tarball matches. zgrep is fast (less than one second) but doesn't provide the info I want, and tar --to-command grep is much, much slower (many minutes)[1].
So I went the other direction and had zgrep tell me the byte offsets of the matches in the tarball and put that together with the list of offsets in the tarball of all archived files to find the matching archived files.
#!/bin/bash
set -e
set -o pipefail
function tar_offsets() {
    # Get the byte offsets of all the files in a given tarball
    # based on https://stackoverflow.com/a/49865044/60422
    [ $# -eq 1 ]
    tar -tvf "$1" -R | awk '
    BEGIN{
        getline;
        f=$8;
        s=$5;
    }
    {
        offset = int($2) * 512 - and((s+511), compl(512)+1)
        print offset,s,f;
        f=$8;
        s=$5;
    }'
}

function tar_byte_offsets_to_files() {
    [ $# -eq 1 ]
    # Convert the search results of a tarball with byte offsets
    # to search results with archived file name and offset, using
    # the provided tar_offsets output (single pass, suitable for
    # process substitution)
    offsets_file="$1"
    prev_offset=0
    prev_offset_filename=""
    IFS=' ' read -r last_offset last_len last_offset_filename < "$offsets_file"
    while IFS=':' read -r search_result_offset match_text
    do
        while [ $last_offset -lt $search_result_offset ]; do
            prev_offset=$last_offset
            prev_offset_filename="$last_offset_filename"
            IFS=' ' read -r last_offset last_len last_offset_filename < "$offsets_file"
            # offsets increasing safeguard
            [ $prev_offset -le $last_offset ]
        done
        # now last offset is the first file strictly after search result offset so prev offset is
        # the one at or before it, and must be the one it is in
        result_file_offset=$(( $search_result_offset - $prev_offset ))
        echo "$prev_offset_filename:$result_file_offset:$match_text"
    done
}
# Putting it together e.g.
zgrep -a --byte-offset "your search here" some.tgz | tar_byte_offsets_to_files <(tar_offsets some.tgz)
[1] I'm running this in Git for Windows' minimal MSYS2-fork unixy environment, so it's possible that the launch overhead of grep is much, much higher than on any kind of real Unix machine and would make `tar --to-command grep` good enough there; benchmark solutions for your own needs and platform situation before selecting.
