How to grep for a pattern in the files in tar archive without filling up disk space

I have a very big tar archive (about 5GB).
I want to grep for a pattern in all files in the archive (and also print the name of each file that contains the pattern), but I do not want to fill up my disk by extracting the archive.
Is there any way I can do that?
I tried these, but they do not give me the names of the files that contain the pattern, just the matching lines:
tar -O -xf test.tar.gz | grep 'this'
tar -xf test.tar.gz --to-command='grep awesome'
Also where is this feature of tar documented? tar xf test.tar $FILE

Seems like nobody posted this simple solution that processes the archive only once:
tar xzf archive.tgz --to-command \
'grep --label="$TAR_FILENAME" -H PATTERN ; true'
Here tar passes the name of each file in a variable (see the docs) and it is used by grep to print it with each match. Also true is added so that tar doesn't complain about failing to extract files that don't match.
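As a rough sketch of how this could be wrapped up for reuse: the names tar_grep and TARGREP_PATTERN below are made up for illustration, and GNU tar (for --to-command and TAR_FILENAME) and GNU grep (for --label) are assumed. The pattern is handed to the per-file command through the environment, so it does not have to be spliced into the quoted command string.

```shell
# tar_grep PATTERN ARCHIVE
# Hypothetical helper (not part of tar or grep). Assumes GNU tar and GNU grep.
tar_grep() {
    # Pass the pattern through the environment; tar's child shell inherits it.
    # GNU tar detects the compression itself when reading from a file.
    TARGREP_PATTERN="$1" tar -xf "$2" \
        --to-command='grep --label="$TAR_FILENAME" -H -e "$TARGREP_PATTERN"; true'
    # "true" keeps tar from complaining when grep finds no match in a member.
}
```

Usage would look like: tar_grep 'this' test.tar.gz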

Here's my take on this:
while read filename; do tar -xOf file.tar "$filename" | grep 'pattern' | sed "s|^|$filename:|"; done < <(tar -tf file.tar | grep -v '/$')
Broken out for explanation:
while read filename; do -- it's a loop...
tar -xOf file.tar "$filename" -- this extracts each file...
| grep 'pattern' -- here's where you put your pattern...
| sed "s|^|$filename:|"; -- prepend the filename, so this looks like grep. Salt to taste.
done < <(tar -tf file.tar | grep -v '/$') -- end the loop, and get the list of files to feed to your while read.
One proviso: this breaks if you have pipe characters (|) in your filenames.
Hmm. In fact, this makes a nice little bash function, which you can append to your .bashrc file:
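targrep() {
    local taropt="" pattern="$1" archive filename
    if [[ ! -f "$2" ]]; then
        echo "Usage: targrep pattern file ..." >&2
        return 1
    fi
    shift  # drop the pattern; the remaining arguments are archives
    for archive in "$@"; do
        if [[ ! -f "$archive" ]]; then
            echo "targrep: $archive: No such file" >&2
            continue
        fi
        case "$archive" in
            *.tar.gz) taropt="-z" ;;
            *) taropt="" ;;
        esac
        while IFS= read -r filename; do
            # extract only the current member to stdout, then label its matches
            tar $taropt -xOf "$archive" "$filename" \
                | grep -- "$pattern" \
                | sed "s|^|$filename:|"
        done < <(tar $taropt -tf "$archive" | grep -v '/$')
    done
}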
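The proviso about pipe characters comes from sed using | as its delimiter. One possible way around it, sketched here with a made-up function name (prefix_grep), is to let awk add the prefix instead, since the file name is then passed as data rather than as part of a sed expression. Note that awk -v still interprets backslash escapes, so names containing backslashes would need extra care.

```shell
# prefix_grep PATTERN ARCHIVE
# Hypothetical helper: like the loop above, but prefixes matches with awk.
prefix_grep() {
    local pattern="$1" archive="$2" filename
    # List the members, skip directory entries, and process one member at a time.
    tar -tf "$archive" | grep -v '/$' | while IFS= read -r filename; do
        tar -xOf "$archive" "$filename" \
            | grep -e "$pattern" \
            | awk -v name="$filename" '{ print name ":" $0 }'
    done
}
```

Like the loop it is based on, this re-reads the archive once per member, so it trades speed for not writing anything to disk.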

Here's a bash function that may work for you. Add the following to your ~/.bashrc
targrep () {
    # read member names line by line so that names containing spaces survive
    tar -tzf "$1" | grep -v '/$' | while IFS= read -r i; do
        tar -Oxzf "$1" "$i" | grep --label="$i" -H "$2"
    done
}
Usage:
targrep archive.tar.gz "pattern"

It's incredibly hacky, but you could abuse tar's -v option to process and delete each file as it is extracted.
grep_and_delete() {
if [ -n "$1" -a -f "$1" ]; then
grep -H 'this' -- "$1" </dev/null
rm -f -- "$1" </dev/null
fi
}
mkdir tmp; cd tmp
# the archive is assumed to sit in the directory we just left
tar -xvzf ../test.tar.gz | (
prev=''
while IFS= read -r pathname; do
grep_and_delete "$prev"
prev="$pathname"
done
grep_and_delete "$prev"
)

tar -tf test.tar.gz | grep -v '/$'| \
xargs -n 1 -I _ \
sh -c 'tar -xOf test.tar.gz _|grep -q <YOUR SEARCH PATTERN> && echo _'

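Try:
tar tvf name_of_file | grep --regexp="pattern"
The t option lists the contents of the tar file without extracting them, v makes the listing verbose, and f names the archive to read. Nothing is written to disk, but note that this matches the pattern against the listing (file names and metadata), not against the contents of the files.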

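This may help (both commands search the decompressed archive stream as a whole, so they do not report which member file matched):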
zcat log.tar.gz | grep -a -i "string"
zgrep -i "string" log.tar.gz
http://www.commandlinefu.com/commands/view/9261/grep-compressed-log-files-without-extracting

Related

Avoid collision, if copying files

I was trying to copy all files of a certain filetype from all subfolders to one place. Unfortunately, this might cause collisions, if two files have the same name from two different subfolders.
I was using
find ./ -name '*.jpg' -exec mv -u '{}' . \;
How can I adjust this to automatically rename files (e.g. append "_1") to avoid collisions?
Or better: check beforehand whether the files are the same (e.g. same size). If yes, ignore (overwriting would be fine, too). If not, rename to avoid the collision.
Suggestions would be appreciated. Thanks!

You could check before moving each individual file. Here I've used cksum to compare, which returns both the filesize and a simple checksum.
find ./ -name '*.jpg' -print0 |
while IFS= read -d '' -r path; do
file=$(basename "$path")
if [[ -e $file ]]; then
if [[ $(cksum "$file" | awk '{print $1 $2}') = $(cksum "$path" | awk '{print $1 $2}') ]]; then
continue
fi
read -n 1 -p "File '$file' would be overwritten by '$path', continue? (y/N) " -r prompt </dev/tty
if [[ $prompt != [Yy] ]]; then
continue
fi
fi
mv -f -v "$path" "$file"
done
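The question also asked for automatic renaming instead of a prompt. A minimal sketch of that idea follows; move_unique is a made-up helper name, and comparing with cmp -s (byte-for-byte) is a choice made here rather than something from the answer above. A file whose content is already present at the destination is left where it is; otherwise "_1", "_2", ... is appended to the name until a free one is found.

```shell
# move_unique SRC DESTDIR
# Hypothetical helper: move SRC into DESTDIR without clobbering different files.
move_unique() {
    local src="$1" destdir="$2" name stem ext target n
    name="${src##*/}"          # file name without the leading directories
    stem="${name%.*}"          # name without its last extension
    ext=""
    case "$name" in
        *.*) ext=".${name##*.}" ;;
    esac
    target="$destdir/$name"
    n=1
    while [ -e "$target" ]; do
        # Same content already there: nothing to do, leave the source alone.
        if cmp -s -- "$src" "$target"; then
            return 0
        fi
        target="$destdir/${stem}_$n$ext"
        n=$((n + 1))
    done
    mv -- "$src" "$target"
}
```

It could be driven by the same find loop as above, calling move_unique "$path" . for each path.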

how to get the filename from the mentioned list

I am new to Linux and scripting, so bear with me. I want a script that gets the files from a Linux directory. Here is what I tried for getting the filenames:
for NAME in $(ls -1 *.wav /some/path | cut -d "/" -f3 | cut -d "-" -f1-5)
If a filename contains -in or -out, the files should be mixed with sox -m and then moved to another directory; any other file should just be moved.
For reference, the filenames look like this:
1030-04-06-2015-1433414216.wav
1030-04-06-2015-1433414318.wav
1030-04-06-2015-1433414440.wav
1043-21-05-2015-1432207256.wav
1043-21-05-2015-1432207457.wav
1046-20-05-2015-1432137944.wav
1046-20-05-2015-1432138015.wav
1046-20-05-2015-1432138704.wav
1431709157.93900.0-in.wav
1431709157.93900.0-out.wav
1431709157.93900.1-in.wav
1431709157.93900.1-out.wav
1431710008.94059.0-in.wav
1431710008.94059.0-out.wav
1431710008.94059.1-in.wav
1431710008.94059.1-out.wav
1431710008.94059.1.wav
1431710008.94059.2.wav
1431713190.94698.2-in.wav
1431713190.94698.2-out.wav
1431713190.94698.2.wav
1431721107.96010.0-in.wav
1431721107.96010.0-out.wav
1431721107.96010.1.wav

This should work:
#!/bin/bash
regex='(.*)(-out|-in)(\.wav)$'
for file in *.wav;do
# $file is the full path to the file
if [[ $file =~ $regex ]];then
#filename in this block contains -in or -out
filename="${file##*\/}"
inorout="${BASH_REMATCH[2]}"
extension="${BASH_REMATCH[3]}"
filenamewithoutextension="${filename%.*}"
filenamewithoutioroutorextension="${filenamewithoutextension/%$inorout/}"
filenamewithoutiorout="${filenamewithoutioroutorextension}$extension"
echo "$filename" "$inorout" "$extension" "$filenamewithoutextension" "$filenamewithoutiorout" "$filenamewithoutioroutorextension"
#do something here sox-m mv or whatever
else
#filename in this block doesn't contain -in or -out
echo "do something else"
fi
done
Explanation:
"${file##*\/}" is the string left after cutting */ from $file, i.e. the basename (filename).
"${BASH_REMATCH[2]}" is the second captured group from the pattern matching done by [[ $file =~ $regex ]], i.e. -in or -out.
"${BASH_REMATCH[3]}" is the third captured group, i.e. .wav.
"${filename%.*}" is the string left after cutting .* from $filename, i.e. the filename without .wav.
Resources you should check on:
Bash Parameter Expansion
Pattern Matching
Bash Variables
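For the sox part itself, one possible shape is sketched below. It only prints the commands it would run. The function name plan_wav_moves and the "-mixed.wav" output name are assumptions made for this example (the question does not say what the mixed file should be called); the suffix is there so the result cannot collide with an existing file such as 1431710008.94059.1.wav.

```shell
# plan_wav_moves DIR DEST
# Hypothetical helper: print one command per file or per -in/-out pair.
plan_wav_moves() {
    local dir="$1" dest="$2" file base
    for file in "$dir"/*.wav; do
        [ -e "$file" ] || continue      # the glob matched nothing
        case "$file" in
            *-in.wav)
                base="${file%-in.wav}"
                # Mix only when the matching -out file is present.
                if [ -e "$base-out.wav" ]; then
                    echo "sox -m $file $base-out.wav $dest/${base##*/}-mixed.wav"
                fi
                ;;
            *-out.wav)
                ;;                      # handled together with its -in partner
            *)
                echo "mv $file $dest/"
                ;;
        esac
    done
}
```

An -out file without an -in partner (or the reverse) is silently skipped in this sketch, which may or may not be what is wanted.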

Here is one way to do it.
>cat test.sh
#!/bin/bash
destination="./DEST"
# Loop over every file
for file in "$@" ; do
# If the name contains "-in" or "-out", then sox
if [[ "${file}" =~ "-in" ]] || [[ "${file}" =~ "-out" ]] ; then
printf 'sox -m %s; ' "${file}"
fi
echo "mv ${file} ${destination}"
done
When run, I get the following output.
>../test.sh *
mv 1030-04-06-2015-1433414216.wav ./DEST
mv 1030-04-06-2015-1433414318.wav ./DEST
mv 1030-04-06-2015-1433414440.wav ./DEST
mv 1043-21-05-2015-1432207256.wav ./DEST
mv 1043-21-05-2015-1432207457.wav ./DEST
mv 1046-20-05-2015-1432137944.wav ./DEST
mv 1046-20-05-2015-1432138015.wav ./DEST
mv 1046-20-05-2015-1432138704.wav ./DEST
sox -m 1431709157.93900.0-in.wav; mv 1431709157.93900.0-in.wav ./DEST
sox -m 1431709157.93900.0-out.wav; mv 1431709157.93900.0-out.wav ./DEST
sox -m 1431709157.93900.1-in.wav; mv 1431709157.93900.1-in.wav ./DEST
sox -m 1431709157.93900.1-out.wav; mv 1431709157.93900.1-out.wav ./DEST
sox -m 1431710008.94059.0-in.wav; mv 1431710008.94059.0-in.wav ./DEST
sox -m 1431710008.94059.0-out.wav; mv 1431710008.94059.0-out.wav ./DEST
sox -m 1431710008.94059.1-in.wav; mv 1431710008.94059.1-in.wav ./DEST
sox -m 1431710008.94059.1-out.wav; mv 1431710008.94059.1-out.wav ./DEST
mv 1431710008.94059.1.wav ./DEST
mv 1431710008.94059.2.wav ./DEST
sox -m 1431713190.94698.2-in.wav; mv 1431713190.94698.2-in.wav ./DEST
sox -m 1431713190.94698.2-out.wav; mv 1431713190.94698.2-out.wav ./DEST
mv 1431713190.94698.2.wav ./DEST
sox -m 1431721107.96010.0-in.wav; mv 1431721107.96010.0-in.wav ./DEST
sox -m 1431721107.96010.0-out.wav; mv 1431721107.96010.0-out.wav ./DEST
mv 1431721107.96010.1.wav ./DEST
mv DEST ./DEST
If you want to execute the commands, simply cut and paste these lines into a shell, or modify the bash script to execute these commands rather than print them.

xargs copy if file exists

I got a string with filenames I want to copy. However, only some of these files exist. My current script looks like this:
echo $x | xargs -n 1 test -f {} && cp --target-directory=../folder/ --parents
However, I always get a test: {}: binary operator expected error.
How can I do that?

You need to supply the -I {} option to xargs for it to substitute {} with the filename (the older -i spelling is deprecated in GNU xargs).
However, you seem to expect xargs to feed into the cp, which it does not do. Maybe try something like
echo "$x" |
xargs -I {} sh -c 'test -f {} && cp --target-directory=../folder/ --parents {}'
(Notice also the use of double quotes with echo. There are very few situations where you want a bare unquoted variable interpolation.)
To pass in many files at once, you can use a for loop in the sh -c:
echo "$x" |
xargs sh -c 'for f; do
test -f "$f" || continue
echo "$f"
done' _ |
xargs cp --parents --target-directory="../folder/"
The _ argument is there because the first argument after the sh -c script is used to populate $0, not "$@".
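The filtering half of that pipeline can be tried on its own. The sketch below wraps it in a function with a made-up name (existing_only); it assumes the names arrive whitespace-separated on standard input and contain no spaces or quotes themselves, as in the question.

```shell
# existing_only: hypothetical filter that reads file names on stdin and
# prints only those that exist as regular files, one per line.
existing_only() {
    # "_" fills $0 of the inner shell, so the names land in "$@",
    # which is what a bare "for f" iterates over.
    xargs sh -c 'for f; do
        if [ -f "$f" ]; then
            printf "%s\n" "$f"
        fi
    done' _
}
```

Its output can then be fed to the copy step, for example: echo "$x" | existing_only | xargs cp --parents --target-directory=../folder/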

xargs can only run a simple command. The && part gets interpreted by the shell which is not what you want. Just create a temporary script with the commands you want to run:
cat > script.sh
test -f "$1" && cp "$1" --target-directory=../folder/ --parents
Control-D
chmod u+x ./script.sh
echo $x | xargs -n1 ./script.sh
Also note that {} is not needed with -n1 because the parameter is used as the last word on a line.

mkdir command for a list of filenames paths

I have a txt file with content like this:
/home/username/Desktop/folder/folder3333/IMAGw488.jpg
/home/username/Desktop/folder/folder3333/IMAG04f88.jpg
/home/username/Desktop/folder/folder3333/IMAGe0488.jpg
/home/username/Desktop/folder/folder3333/IMAG0r88.jpg
/home/username/Desktop/folder/folder3333/
/home/username/Desktop/folder/
/home/username/Desktop/folder/IMAG0488.jpg
/home/username/Desktop/folder/fff/fff/feqw/123.jpg
/home/username/Desktop/folder/fffa/asd.png
....
These are file paths, but also folder paths.
The problem I want to solve is to create all folders that don't exist.
I want to call the mkdir command for every folder that does not exist.
How can I do this in an easy way?
Thanks

This can be done in native bash syntax without any calls to external binaries:
while IFS= read -r line; do mkdir -p "${line%/*}"; done < infile
Or perhaps with just a single call to mkdir if you have bash 4.x
mapfile -t arr < infile; mkdir -p "${arr[@]%/*}"
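The "${line%/*}" expansion strips everything from the last slash onward, so a file path yields its parent directory and a path ending in / yields the directory itself. A slightly more defensive sketch of the loop is shown below; the function name make_dirs_from_list is made up, and skipping lines that contain no slash is a choice made here, since for those the expansion would return the file name unchanged.

```shell
# make_dirs_from_list LISTFILE
# Hypothetical helper: create the directory part of every path in LISTFILE.
make_dirs_from_list() {
    local list="$1" line dir
    # "|| [ -n "$line" ]" also handles a last line without a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            */*)
                dir="${line%/*}"        # drop the last path component
                if [ -n "$dir" ]; then
                    mkdir -p -- "$dir"
                fi
                ;;
        esac
    done < "$list"
}
```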

How about...
for p in $(xargs < somefile.txt);
do
mkdir -p "$(dirname "${p}")"
done

xargs -n 1 dirname <somefile.txt | xargs mkdir -p

It can be done without loop also (provided input file not huge):
mkdir -p $(perl -pe 's#/(?!.*/).*$##' file.txt)

If you have file "file1" with filenames you could try this oneliner:
cat file1 |xargs -I {} dirname "{}"| sort -u | xargs -I{} mkdir -p "{}"
Use of:
xargs -I{} mkdir -p "{}"
ensures that even path names with spaces will be created

Using a perl one-liner and File::Path qw(make_path):
perl -MFile::Path=make_path -lne 'make_path $_' dirlist.txt

delete file other than particular extension file format

I have a lot of different types of files in one folder. I need to delete all of them except the PDF files.
I managed to display only the PDF files, but I need to delete everything other than the PDF files:
ls -1 | xargs file | grep 'PDF document,' | sed 's/:.*//'

You could do the following - I've used echo rm instead of rm for safety:
for i in *
do
[ x"$(file --mime-type -b "$i")" != xapplication/pdf ] && echo rm "$i"
done
The --mime-type -b options to file make the output of file easier to deal with in a script.

$ ls
aa.txt a.pdf bb.cpp b.pdf
$ ls | grep -v '\.pdf$' | xargs rm -rf
$ ls
a.pdf b.pdf
:) !

ls |xargs file|awk -F":" '!($2~/PDF document/){print $1}'|xargs rm -rf

Try inverting the grep match:
ls -1 | xargs file | grep -v 'PDF document,' | sed 's/:.*//'

It's rare in my experience to encounter PDF files which don't have a .pdf extension. You don't state why "file" is necessary in the example, but I'd write this as:
# find . -type f -not -name '*.pdf' -delete
Note that this will recurse into subdirectories; use "-maxdepth 1" to limit to the current directory only.
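Before deleting anything, it may be worth previewing what would go. The sketch below (list_non_pdf is a made-up name; -maxdepth is a find extension found in GNU and BSD find) only prints the regular files in one directory whose names do not end in .pdf.

```shell
# list_non_pdf DIR
# Hypothetical dry run: print the non-PDF regular files directly inside DIR.
list_non_pdf() {
    find "$1" -maxdepth 1 -type f ! -name '*.pdf' -print | sort
}
```

Once the list looks right, replacing -print with -delete (and dropping the sort) performs the removal.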
