Linux: Unzip an archive containing files with the same name

I was sent a zip file containing 40 files with the same name.
I wanted to extract each of these files to a separate folder OR extract each file with a different name (file1, file2, etc.).
Is there a way to do this automatically with standard Linux tools? A check of man unzip revealed nothing that could help me. zipsplit also does not seem to allow arbitrary splitting of zip files (I was trying to split the zip into 40 archives, each containing one file).
At the moment I am (r)enaming my files individually. This is not so much of a problem with a 40-file archive, but it is obviously unscalable.
Anyone have a nice, simple way of doing this? More curious than anything else.
Thanks.

Assuming that no such tool currently exists, it should be quite easy to write one in Python. Python has a zipfile module that should be sufficient.
Something like this (maybe, untested):
#!/usr/bin/env python
import os
import sys
import zipfile

count = 0
z = zipfile.ZipFile(sys.argv[1], "r")
for info in z.infolist():
    # Extract each entry into its own numbered directory
    directory = str(count)
    os.makedirs(directory)
    z.extract(info, directory)
    count += 1
z.close()
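For usage, a minimal sketch, assuming the script is saved under a hypothetical name like unzip_each.py:
python unzip_each.py file.zip
ls   # 0/ 1/ 2/ ... one numbered directory per archive entry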

I know this is a couple of years old, but the answers above did not solve my particular problem, so I thought I should go ahead and post a solution that worked for me.
Without scripting, you can just use command-line input to interact with the unzip tool's text interface. That is, when you type this at the command line:
unzip file.zip
and it contains files of the same name, it will prompt you with:
replace sameName.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename:
If you wanted to do this by hand, you would type "r", and then at the next prompt:
new name:
you would just type the new file name.
To automate this, simply create a text file with the responses to these prompts and use it as the input to unzip, as follows:
r
sameName_1.txt
r
sameName_2.txt
...
That file is generated pretty easily using your favorite scripting language. Save it as unzip_input.txt and then use it as input to unzip like this:
unzip < unzip_input.txt
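For instance, a minimal bash sketch that generates the response file, assuming 40 entries all named sameName.txt (the first entry extracts without prompting, so only the remaining 39 trigger the rename prompt):
# One "r"/new-name pair per rename prompt
for i in $(seq 1 39); do
    printf 'r\nsameName_%d.txt\n' "$i"
done > unzip_input.txt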
For me, this was less of a headache than trying to get the Perl or Python extraction modules working the way I needed. Hope this helps someone...

Here is a Linux shell-script version.
In this case 834733991_T_ONTIME.csv is the name of the file that is the same inside every zip file, and the .csv after "$count" simply has to be swapped for the file type you want.
#!/bin/bash
count=0
for a in *.zip
do
    # Extract quietly, then rename the duplicate file to a numbered copy
    unzip -q "$a"
    mv 834733991_T_ONTIME.csv "$count".csv
    count=$((count+1))
done

This thread is old, but there is still room for improvement. Personally, I prefer the following one-liner in bash:
unzipd ()
{
    unzip -d "${1%.*}" "$1"
}
A nice, clean, and simple way to remove the extension and use the archive's base name as the extraction directory.
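A usage sketch, assuming a directory of archives named file1.zip, file2.zip, and so on:
# Each archive is extracted into a directory named after its base name
for a in *.zip; do unzipd "$a"; done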

Using unzip -B file.zip did the trick for me. It creates a backup file suffixed with ~<number> in case the file already exists.
For example:
$ rm *.xml
$ unzip -B bogus.zip
Archive: bogus.zip
inflating: foo.xml
inflating: foo.xml
inflating: foo.xml
inflating: foo.xml
inflating: foo.xml
$ ls -l
-rw-rw-r-- 1 user user 1161 Dec 20 20:03 bogus.zip
-rw-rw-r-- 1 user user 1501 Dec 16 14:34 foo.xml
-rw-rw-r-- 1 user user 1520 Dec 16 14:45 foo.xml~
-rw-rw-r-- 1 user user 1501 Dec 16 14:47 foo.xml~1
-rw-rw-r-- 1 user user 1520 Dec 16 14:53 foo.xml~2
-rw-rw-r-- 1 user user 1520 Dec 16 14:54 foo.xml~3
Note: the -B option does not show up in unzip --help, but is mentioned in the man pages: https://manpages.org/unzip#options

Move and copy text file in bash script

I referred to all the previous responses to this question on Stack Overflow and tried out the following, but unfortunately I am still encountering an issue.
I have a text file named Rels_obs inside my directory /home/manuela/PycharmProjects/knowledgegraphidentification/data. As the script runs, it downloads a kgis.tar.gz archive and extracts it in the following manner.
#!/bin/bash

readonly DATA_URL='https://linqs-data.soe.ucsc.edu/public/psl-examples-data/kgi.tar.gz'
readonly DATA_FILE='kgis.tar.gz'
readonly DATA_DIR='kgi'

function main() {
    trap exit SIGINT

    check_requirements
    fetch_file "${DATA_URL}" "${DATA_FILE}" 'data'
    extract_tar "${DATA_FILE}" "${DATA_DIR}" 'data'
}
The extraction results in two directories within the data directory found at /home/manuela/PycharmProjects/knowledgegraphidentification/data:
eval directory: /home/manuela/PycharmProjects/knowledgegraphidentification/data/kgi/eval
learn directory: /home/manuela/PycharmProjects/knowledgegraphidentification/data/kgi/learn
What I want to do is to copy my Rels_obs file to both of these newly available directories, eval and learn.
I tried doing the following but it resulted in an error as shown below.
#!/bin/bash

readonly DATA_URL='https://linqs-data.soe.ucsc.edu/public/psl-examples-data/kgi.tar.gz'
readonly DATA_FILE='kgis.tar.gz'
readonly DATA_DIR='kgi'

function main() {
    trap exit SIGINT

    check_requirements
    fetch_file "${DATA_URL}" "${DATA_FILE}" 'data'
    extract_tar "${DATA_FILE}" "${DATA_DIR}" 'data'

    echo "COPYING"
    # I have only one file, of plain text format, within the data directory
    for file in ~/PycharmProjects/knowledgegraphidentification/data/*.txt
    do
        name="$(basename "$file" .txt)"
        cp "$file" "~/PycharmProjects/knowledgegraphidentification/data/kgi/eval"
        cp "$file" "~/PycharmProjects/knowledgegraphidentification/data/kgi/learn"
    done
    echo "SUCCESSFULLY COPIED FILES"
}
Error
COPYING
cp: cannot stat '/home/manuelanayantarajeyaraj/PycharmProjects/knowledgegraphidentification/data/.txt': No such file or directory
cp: cannot stat '/home/manuelanayantarajeyaraj/PycharmProjects/knowledgegraphidentification/data/.txt': No such file or directory
ls -l on the data directory:
total 20524
-rwxr-xr-x 1 manuelanayantarajeyaraj manuelanayantarajeyaraj     2210 Feb  5 15:19 fetchData.sh
drwxrwxr-x 4 manuelanayantarajeyaraj manuelanayantarajeyaraj     4096 Nov 19  2017 kgi
-rw-rw-r-- 1 manuelanayantarajeyaraj manuelanayantarajeyaraj 18546351 Feb  5 15:21 kgis.tar.gz
-rw-rw-r-- 1 manuelanayantarajeyaraj manuelanayantarajeyaraj  2459319 Feb  5 13:31 Rels_obs
Any suggestions in this regard will be highly appreciated.
It says no such file or directory, and the cause is visible in the reported file name:
/home/manuelanayantarajeyaraj/PycharmProjects/knowledgegraphidentification/data/.txt
The *.txt glob in the for loop matched nothing (there is no file ending in .txt in that directory), so cp was handed a pattern instead of an existing file.
As a hot fix you can use:
SRC_DIR="$HOME/PycharmProjects/knowledgegraphidentification/data"
for file in "${SRC_DIR}"/*.txt; do
    cp "$file" "${SRC_DIR}/kgi/eval"
    cp "$file" "${SRC_DIR}/kgi/learn"
done
echo "SUCCESSFULLY COPIED FILES"
(Note that a tilde inside double quotes is not expanded, hence $HOME here.)
Or you can just copy them with two commands:
SRC_DIR="$HOME/PycharmProjects/knowledgegraphidentification/data"
cp "${SRC_DIR}"/*.txt "${SRC_DIR}/kgi/eval/"
cp "${SRC_DIR}"/*.txt "${SRC_DIR}/kgi/learn/"
You mention a text file named Rels_obs, but is it named that, or is it named Rels_obs.txt? Linux (text) files do not need an extension per se; extensions are purely for the user's convenience. So if the file does not have the extension, this script will not find it, since it is looking for the extension.
If you're sure this file will always have the same name, and you only want this file, I'd simply use the name.
i.e.
cp ~/PycharmProjects/knowledgegraphidentification/data/Rels_obs ~/PycharmProjects/knowledgegraphidentification/data/kgi/eval
cp ~/PycharmProjects/knowledgegraphidentification/data/Rels_obs ~/PycharmProjects/knowledgegraphidentification/data/kgi/learn
(The tilde must stay outside quotes, or it will not be expanded.)
If you really want all plaintext files in that directory, you could do this:
for file in ~/PycharmProjects/knowledgegraphidentification/data/*
do
    # Only copy files that file(1) identifies as ASCII text
    [[ "$(file "$file")" =~ "ASCII text" ]] || continue
    cp "$file" ~/PycharmProjects/knowledgegraphidentification/data/kgi/eval
    cp "$file" ~/PycharmProjects/knowledgegraphidentification/data/kgi/learn
done
Or you could rename the file to add the .txt extension first somehow.

How to preserve timestamp of original file post zip compression?

I have a lot of files on our servers which we compress with a filter so that only files older than x days get compressed.
The zip command compresses the original, creates filename.zip, and removes the original.
This has a small problem: the timestamp changes, since the compression job runs after x days.
So when we run jobs to remove older files (which are by now zip files), not all files get removed, since the timestamp of the compressed file differs from that of the original.
I would like the zip archive to retain the original file's timestamp even though the compression runs at a later date.
One way of doing this would be to:
Get the timestamp of each original file with a date command
Compress the original, remove the original
Apply the stored timestamp to the new zip file using touch
I am looking for a simpler solution.
Some old file I had:
$ ls -l foo
-rw-r--r-- 1 james james 120 Sep 5 07:28 foo
Zip and redate:
$ zip foo.zip foo && touch -d "$(date -R -r foo)" foo.zip
Check it out:
$ ls -l foo.zip
-rw-r--r-- 1 james james 120 Sep 5 07:28 foo.zip
Remove the original:
$ rm -i foo
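To fold this into a recurring compression job, a sketch using find (assuming GNU date and touch, and a 30-day threshold; adjust to your x days):
# Compress every non-zip file older than 30 days, stamping each
# new .zip with the original file's modification time
find . -type f ! -name '*.zip' -mtime +30 | while IFS= read -r f; do
    stamp=$(date -R -r "$f")
    zip -q "$f.zip" "$f" && touch -d "$stamp" "$f.zip" && rm "$f"
done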
Yes, you can unzip a file and preserve the old timestamp from the original time it was created. The steps (on Windows) are as follows:
Click on filename.zip, then Properties.
In the General tab, the security note says "This file came from another computer and might be blocked to help protect this computer". Tick the Unblock check box and click OK.
Extract the file and voila, the extracted file has the date/time stamp from when the file was created/modified.

Run script on specific file in all subdirs

I've written a script (foo) which makes a simple sed replacement on text in the input file. I have a directory (a) containing a large number of subdirectories (a/b1, a/b2, etc.) which all have the same subdirs (c, etc.) and contain a file with the same name (d). So the rough structure is:
a/
-b1/
--c/
---d
-b2/
--c/
---d
-b3/
--c/
---d
I want to run my script on every file (d) in the tree. Unfortunately the following doesn't work:
sudo sh foo a/*/c/d
How do I use wildcards in a bash command like this? Do I have to use find with specific maxdepth and mindepth options, or is there a more elegant solution?
The wildcard expansion in your example should work, and no find should be needed. I assume a, b, and c are just generic file names used to simplify the question. Do any of your folders/files contain spaces?
If you do:
ls -l a/*/c/d
are you getting the files you need listed? If so, then the issue is how you handle the $* in your script file. Mind sharing it with us?
As you can see, wildcard expansion works
$ ls -l a/*/c/d
-rw-r--r-- 1 user wheel 0 15 Apr 08:05 a/b1/c/d
-rw-r--r-- 1 user wheel 0 15 Apr 08:05 a/b2/c/d
-rw-r--r-- 1 user wheel 0 15 Apr 08:05 a/b3/c/d
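And if you ever need to be explicit about depth, an equivalent find-based sketch (assuming foo accepts multiple file arguments, as the glob invocation implies):
# Mirrors the a/*/c/d glob with explicit depth limits
find a -mindepth 3 -maxdepth 3 -path 'a/*/c/d' -exec sudo sh foo {} +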

Linux move files without replacing if files exists

In Linux how do I move files without replacing if a particular file already exists in the destination?
I tried the following command:
mv --backup=t <source> <dest>
The file doesn't get replaced, but the issue is that the extension gets changed, because the "~" is appended after the filename's extension.
Is there another way to preserve the extension so that only the base filename gets changed when moving?
E.g.
test~1.txt instead of test.txt~1
When the extension gets mangled like this, you subsequently can't view a file just by double-clicking on it.
If you want to do it in the shell, without requiring atomicity (so if two shell processes are running the same code at the same time, you could be in trouble), you can simply use the shell's built-in test:
[ -f destfile.txt ] || mv srcfile.txt destfile.txt
If you require atomicity (something that works when two processes are simultaneously running it), things are quite difficult, and you'll need to make system calls from C. Look into renameat2(2).
Perhaps you should consider using a version control system like git?
mv has an option:
-S, --suffix=SUFFIX
override the usual backup suffix
which you might use; however, as far as I know, mv has no functionality to change part of the filename but not the extension. If you just want to be able to open the backup file with a text editor, you might consider something like:
mv --suffix=.backup.txt <source> <dest>
How this would work: suppose you have
-rw-r--r-- 1 chris users 2 Jan 25 11:43 test2.txt
-rw-r--r-- 1 chris users 0 Jan 25 11:42 test.txt
then after the command mv --suffix=.backup.txt test.txt test2.txt you get:
-rw-r--r-- 1 chris users 0 Jan 25 11:42 test2.txt
-rw-r--r-- 1 chris users 2 Jan 25 11:43 test2.txt.backup.txt
@aandroidtest: if you are able to rely upon a Bash shell script, and the source directory (where the files reside presently) and the target directory (where you want them to move to) are on the same file system, I suggest you try out a script that I wrote. You can find it at https://github.com/jmmitchell/movestough
In short, the script allows you to move files from a source directory to a target directory while taking into account new files, duplicate files (same file name, same contents), and file collisions (same file name, different contents), as well as replicating needed subdirectory structures. In addition, the script handles file-collision renaming in three forms. As an example, if /some/path/somefile.name.ext was found to be a conflicting file, it would be moved to the target directory with a name like one of the following, depending on the deconflicting style chosen (via the -u= or --unique-style= flag):
default style : /some/path/somefile.name.ext-< unique string here >
style 1 : /some/path/somefile.name.< unique string here >.ext
style 2 : /some/path/somefile.< unique string here >.name.ext
Let me know if you have any questions.
I guess the mv command is quite limited when it comes to moving files with the same filename.
Below is a bash script that moves a file and, if a file with the same name exists at the destination, appends a number to the filename while preserving the extension for easier viewing.
I modified the script that can be found here:
https://superuser.com/a/313924
#!/bin/bash

source=$1
dest=$2
file=$(basename "$source")
basename=${file%.*}
ext=${file##*.}

if [[ ! -e "$dest/$basename.$ext" ]]; then
    mv "$source" "$dest"
else
    # Find the first free numbered name, e.g. test1.txt, test2.txt, ...
    num=1
    while [[ -e "$dest/$basename$num.$ext" ]]; do
        (( num++ ))
    done
    mv "$source" "$dest/$basename$num.$ext"
fi
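A usage sketch, assuming the script above is saved as move_no_clobber.sh (a hypothetical name) and made executable:
# If dest/test.txt already exists, the source lands as dest/test1.txt
./move_no_clobber.sh test.txt dest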

How to compare two tarball's content

I want to tell whether two tarball files contain identical files, in terms of file name and file content, not including metadata like date, user, or group.
However, there are some restrictions:
First, I have no control over whether the metadata is included when making the tar file; in fact, the tar file always contains metadata, so directly diffing the two tar files doesn't work.
Second, some tar files are so large that I cannot afford to untar them into a temp directory and diff the contained files one by one. (I know that if I could untar file1.tar into file1/, I could compare them by invoking 'tar -dvf file2.tar' in file1/. But usually I cannot afford to untar even one of them.)
Any idea how I can compare the two tar files? It would be better if it could be accomplished within a shell script. Alternatively, is there any way to get each contained file's checksum without actually untarring the tarball?
Thanks.
Try also pkgdiff to visualize differences between packages (it detects added/removed/renamed files and changed content, and exits with zero code if unchanged):
pkgdiff PKG-0.tgz PKG-1.tgz
Are you controlling the creation of these tar files?
If so, the best trick would be to create an MD5 checksum file and store it within the archive itself. Then, when you want to compare two archives, you just extract the checksum files and compare them.
If you can afford to extract just one tar file, you can use the --diff option of tar to look for differences with the contents of other tar file.
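For example, a sketch of that approach, extracting one archive once and letting tar's --diff mode report differences against the other:
mkdir tmp1 && tar -xf file1.tar -C tmp1
# Compare the second archive against the extracted tree; note that
# tar -d also reports metadata differences, so expect some noise
(cd tmp1 && tar -df ../file2.tar)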
One more crude trick, if you are fine with comparing just the filenames and their sizes.
Remember, this does not guarantee that the contents are the same!
Execute tar tvf to list the contents of each archive and store the outputs in two different files. Then slice out everything besides the filename and size columns, preferably sorting the two files too. Then just diff the two lists.
Just remember that this scheme does not actually checksum anything.
Sample tar and output (all files are zero size in this example).
$ tar tvfj pack1.tar.bz2
drwxr-xr-x user/group 0 2009-06-23 10:29:51 dir1/
-rw-r--r-- user/group 0 2009-06-23 10:29:50 dir1/file1
-rw-r--r-- user/group 0 2009-06-23 10:29:51 dir1/file2
drwxr-xr-x user/group 0 2009-06-23 10:29:59 dir2/
-rw-r--r-- user/group 0 2009-06-23 10:29:57 dir2/file1
-rw-r--r-- user/group 0 2009-06-23 10:29:59 dir2/file3
drwxr-xr-x user/group 0 2009-06-23 10:29:45 dir3/
Command to generate sorted name/size list
$ tar tvfj pack1.tar.bz2 | awk '{printf "%10s %s\n",$3,$6}' | sort -k 2
0 dir1/
0 dir1/file1
0 dir1/file2
0 dir2/
0 dir2/file1
0 dir2/file3
0 dir3/
You can take two such sorted lists and diff them.
You can also use the date and time columns if that works for you.
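Putting it together, a sketch that compares two archives this way:
# Compare name/size listings of two tarballs, ignoring other metadata
diff <(tar tvfj pack1.tar.bz2 | awk '{printf "%10s %s\n",$3,$6}' | sort -k 2) \
     <(tar tvfj pack2.tar.bz2 | awk '{printf "%10s %s\n",$3,$6}' | sort -k 2)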
tarsum is almost what you need. Take its output, run it through sort to get the ordering identical for each archive, and then compare the two with diff. That should get you a basic implementation going, and it would be easy enough to pull those steps into the main program by modifying the Python code to do the whole job.
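A sketch of those steps, assuming tarsum is saved as tarsum.py and prints one checksum-and-name line per entry (its exact invocation is an assumption):
diff <(python tarsum.py 1.tar | sort) <(python tarsum.py 2.tar | sort) && echo "contents identical"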
Here is my variant; it checks the unix permissions too.
It works only if the filenames are shorter than 200 characters.
diff <(tar -tvf 1.tar | awk '{printf "%10s %200s %10s\n",$3,$6,$1}'|sort -k2) <(tar -tvf 2.tar|awk '{printf "%10s %200s %10s\n",$3,$6,$1}'|sort -k2)
EDIT: See the comment by @StéphaneGourichon
I realise that this is a late reply, but I came across the thread whilst attempting to achieve the same thing. The solution I've implemented outputs the tar contents to stdout and pipes them to whichever hash you choose:
tar -xOzf archive.tar.gz | sort | sha1sum
Note that the order of the arguments is important; particularly O, which signals tar to extract to stdout.
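To compare two archives with it, hash each and compare the digests; a sketch:
# Digest of each archive's sorted, concatenated entry contents
h1=$(tar -xOzf a.tar.gz | sort | sha1sum)
h2=$(tar -xOzf b.tar.gz | sort | sha1sum)
[ "$h1" = "$h2" ] && echo "identical contents"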
Is tardiff what you're looking for? It's "a simple perl script" that "compares the contents of two tarballs and reports on any differences found between them."
There is also diffoscope, which is more generic and allows comparing things recursively (including various formats).
pip install diffoscope
I propose gtarsum, which I have written in Go; it is an autonomous executable (no Python or other execution environment needed).
go get github.com/VonC/gtarsum
It will read a tar file, and:
sort the list of files alphabetically,
compute a SHA256 for each file content,
concatenate those hashes into one giant string
compute the SHA256 of that string
The result is a "global hash" for a tar file, based on the list of files and their content.
It can compare multiple tar files, and return 0 if they are identical, 1 if they are not.
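A usage sketch (the exact command-line syntax is an assumption; check the project README):
# Per the description above: exit status 0 means identical
gtarsum file1.tar file2.tar && echo "identical"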
Just throwing this out there, since none of the above solutions worked for what I needed.
This function gets the md5 hash of the md5 hashes of all the file paths matching a given path. If the hashes are the same, the file hierarchy and file lists are the same.
I know it's not as performant as the others, but it provides the certainty I needed.
PATH_TO_CHECK="some/path"

for template in $(find build/ -name '*.tar'); do
    # -v prints each entry name, and --to-command pipes each entry's
    # contents to md5sum, so names and hashes alternate in the output
    tar -xvf "$template" --to-command=md5sum |
        grep "$PATH_TO_CHECK" -A 1 |
        grep -v "$PATH_TO_CHECK" |
        awk '{print $1}' |
        md5sum |
        awk "{print \"$template\",\$1}"
done
*note: An invalid path simply returns nothing.
If you don't need to extract the archives or to see the differences, try diff's -q option:
diff -q 1.tar 2.tar
In this quiet mode the output is "1.tar 2.tar differ", or nothing if there are no differences. (Note that this compares the raw archive bytes, so metadata differences count too.)
There is a tool called archdiff. It is basically a Perl script that can look inside archives.
It takes two archives, or an archive and a directory, and shows a summary of the differences between them.
I had a similar question and I resolved it with Python; here is the code.
PS: although this code is used to compare two zipballs' contents, it works similarly for tarballs. Hope it helps.
import zipfile
import os
import hashlib
import shutil

def decompressZip(zipName, dirName):
    # Extract every entry into dirName and return the entry names
    zipFile = zipfile.ZipFile(zipName, "r")
    fileNames = zipFile.namelist()
    for file in fileNames:
        zipFile.extract(file, dirName)
    zipFile.close()
    return fileNames

def md5sum(filename):
    # MD5 of the whole file, as an upper-case hex digest
    with open(filename, "rb") as f:
        return hashlib.md5(f.read()).hexdigest().upper()

if __name__ == "__main__":
    oldFileList = decompressZip("./old.zip", "./oldDir")
    newFileList = decompressZip("./new.zip", "./newDir")

    # Map each entry name to the checksum of its extracted content
    oldDict = dict()
    newDict = dict()
    for oldFile in oldFileList:
        tmpOldFile = "./oldDir/" + oldFile
        if not os.path.isdir(tmpOldFile):
            oldDict[oldFile] = md5sum(tmpOldFile)
    for newFile in newFileList:
        tmpNewFile = "./newDir/" + newFile
        if not os.path.isdir(tmpNewFile):
            newDict[newFile] = md5sum(tmpNewFile)

    # Entries only in the new archive are additions; entries present in
    # both but with different checksums are modifications
    additionList = list()
    modifyList = list()
    for key in newDict:
        if key not in oldDict:
            additionList.append(key)
        elif newDict[key] != oldDict[key]:
            modifyList.append(key)

    print("new file list: %s" % additionList)
    print("modified file list: %s" % modifyList)

    shutil.rmtree("./oldDir")
    shutil.rmtree("./newDir")
