Unzip the archive with more than one entry - linux

I'm trying to decompress ~8GB .zip file piped from curl command. Everything I have tried is being interrupted at <1GB and returns a message:
... has more than one entry--rest ignored
I've tried: funzip, gunzip, gzip -d, zcat, ... also with different arguments - all end up in the above message.
The datafile is public, so it's easy to repro the issue:
curl -L https://archive.org/download/nycTaxiTripData2013/faredata2013.zip | funzip > datafile

Are you sure the mentioned file deflates to a single file? If it extracts to multiple files you unfortunately cannot unzip on the fly.
Zip is a container as well as compression format and it doesn't know where the new file begins. You'll have to download the whole file and unzip it.

Related

Splitting large tar file into multiple tar files

I have a tar file which is 3.1 TB(TeraByte)
File name - Testfile.tar
I would like to split this tar file into 2 parts - Testfil1.tar and Testfile2.tar
I tried the following so far
split -b 1T Testfile.tar "Testfile.tar"
What i get is Testfile.taraa(what is "aa")
And i just stopped my command. I also noticed that the output Testfile.taraa doesn't seem to be a tar file when I do ls in the directory. It seems like it is a text file. May be once the full split is completed it will look like a tar file?
The behavior from split is correct, from man page online: http://man7.org/linux/man-pages/man1/split.1.html
Output pieces of FILE to PREFIXaa, PREFIXab, ...
Don't stop the command let it run and then you can use cat to concatenate (join) them all back again.
Examples can be seen here: https://unix.stackexchange.com/questions/24630/whats-the-best-way-to-join-files-again-after-splitting-them
split -b 100m myImage.iso
# later
cat x* > myImage.iso
UPDATE
Just as clarification since I believe you have not understood the approach. You split a big file like this to transport it for example, files are not usable this way. To use it again you need to concatenate (join) pieces back. If you want usable parts, then you need to decompress the file, split it in parts and compress them. With split you basically split the binary file. I don't think you can use those parts.
You are doing the compression first and the partition later.
If you want each part to be a tar file, you should use 'split' first with de original file, and then 'tar' with each part.

remove extracting files after extract, split files

I have a request and a problem.
I have archived files
tar xvpf /to_arch |gzip - c | split -b10000m - /arch/to_arch.gz_
I use this comand. this is archive got my system and i need move it on other server.
on nev server i havent space for put arhive and extract it then i have idea.
Can someone help me write script in bash who can remuve extracted files.
like to_arch.gz_aa to_arch.gz_abto_arch.gz_acto_arch.gz_ad etc.
if finish extract aa file then script delete it.
cat *.gz* | tar zxvf - -i
Normaly i extract that but havent space on disk.

Is it possible to partially unzip a .vcf file?

I have a ~300 GB zipped vcf file (.vcf.gz) which contains the genomes of about 700 dogs. I am only interested in a few of these dogs and I do not have enough space to unzip the whole file at this time, although I am in the process of getting a computer to do this. Is it possible to unzip only parts of the file to begin testing my scripts?
I am trying to a specific SNP at a position on a subset of the samples. I have tried using bcftools to no avail: (If anyone can identify what went wrong with that I would also really appreciate it. I created an empty file for the output (722g.990.SNP.INDEL.chrAll.vcf.bgz) but it returns the following error)
bcftools view -f PASS --threads 8 -r chr9:55252802-55252810 -o 722g.990.SNP.INDEL.chrAll.vcf.gz -O z 722g.990.SNP.INDEL.chrAll.vcf.bgz
The output type "722g.990.SNP.INDEL.chrAll.vcf.bgz" not recognised
I am planning on trying awk, but need to unzip the file first. Is it possible to partially unzip it so I can try this?
Double check your command line for bcftools view.
The error message 'The output type "something" is not recognized' is printed by bcftools when you specify an invalid value for the -O (upper-case O) command line option like this -O something. Based on the error message you are getting it seems that you might have put the file name there.
Check that you don't have your input and output file names the wrong way around in your command. Note that the -o (lower-case o) command line option specifies the output file name, and the file name at the end of the command line is the input file name.
Also, you write that you created an empty file for the output. You don't need to do that, bcftools will create the output file.
I don't have that much experience with bcftools but generically If you want to to use awk to manipulate a gzipped file you can pipe to it so as to only unzip the file as needed, you can also pipe the result directly through gzip so it too is compressed e.g.
gzip -cd largeFile.vcf.gz | awk '{ <some awk> }' | gzip -c > newfile.txt.gz
Also zcat is an alias for gzip -cd, -c is input/output to standard out, -d is decompress.
As a side note if you are trying to perform operations on just a part of a large file you may also find the excellent tool less useful it can be used to view your large file loading only the needed parts, the -S option is particularly useful for wide formats with many columns as it stops line wrapping, as is -N for showing line numbers.
less -S largefile.vcf.gz
quit the view with q and g takes you to the top of the file.

zip command not working

I am trying to zip a file using shell script command. I am using following command:
zip ./test/step1.zip $FILES
where $FILES contain all the input files. But I am getting a warning as follows
zip warning: name not matched: myfile.dat
and one more thing I observed that the file which is at last in the list of files in a folder has the above warning and that file is not getting zipped.
Can anyone explain me why this is happening? I am new to shell script world.
zip warning: name not matched: myfile.dat
This means the file myfile.dat does not exist.
You will get the same error if the file is a symlink pointing to a non-existent file.
As you say, whatever is the last file at the of $FILES, it will not be added to the zip along with the warning. So I think something's wrong with the way you create $FILES. Chances are there is a newline, carriage return, space, tab, or other invisible character at the end of the last filename, resulting in something that doesn't exist. Try this for example:
for f in $FILES; do echo :$f:; done
I bet the last line will be incorrect, for example:
:myfile.dat :
...or something like that instead of :myfile.dat: with no characters before the last :
UPDATE
If you say the script started working after running dos2unix on it, that confirms what everybody suspected already, that somehow there was a carriage-return at the end of your $FILES list.
od -c shows the \r carriage-return. Try echo $FILES | od -c
Another possible cause that can generate a zip warning: name not matched: error is having any of zip's environment variables set incorrectly.
From the man page:
ENVIRONMENT
The following environment variables are read and used by zip as described.
ZIPOPT
contains default options that will be used when running zip. The contents of this environment variable will get added to the command line just after the zip command.
ZIP
[Not on RISC OS and VMS] see ZIPOPT
Zip$Options
[RISC OS] see ZIPOPT
Zip$Exts
[RISC OS] contains extensions separated by a : that will cause native filenames with one of the specified extensions to be added to the zip file with basename and extension swapped.
ZIP_OPTS
[VMS] see ZIPOPT
In my case, I was using zip in a script and had the binary location in an environment variable ZIP so that we could change to a different zip binary easily without making tonnes of changes in the script.
Example:
ZIP=/usr/bin/zip
...
${ZIP} -r folder.zip folder
This is then processed as:
/usr/bin/zip /usr/bin/zip -r folder.zip folder
And generates the errors:
zip warning: name not matched: folder.zip
zip I/O error: Operation not permitted
zip error: Could not create output file (/usr/bin/zip.zip)
The first because it's now trying to add folder.zip to the archive instead of using it as the archive. The second and third because it's trying to use the file /usr/bin/zip.zip as the archive which is (fortunately) not writable by a normal user.
Note: This is a really old question, but I didn't find this answer anywhere, so I'm posting it to help future searchers (my future self included).
eebbesen hit the nail in his comment for my case (but i cannot vote for comment).
Another possible reason missed in the other comments is file exceeding the file size limit (4GB).
I converted my script for unix environment using dos2unix command and executed my script as ./myscript.sh instead bash myscript.sh.
I just discovered another potential cause for this. If the permissions of the directory/subdirectory don't allow the zip to find the file, it will report this error. Actually, if you run a chmod -R 444 on the directory, and then try to zip it, you will reproduce this error, and also have a "stored 0%" report, like this:
zip warning: name not matched: borrar/enviar
adding: borrar/ (stored 0%)
Hence, try changing the permissions of the file. If you are trying to send them through email, and those email filters (like Gmail's) invent silly filters of not sending executables, don't forget that making permissions very strict when making zip compression can be the cause of the error you are reporting, of "name not matched".
spaces are not allowed:
it would fail if there are more than one files(s) in $FILES unless you put them in loop
I also encountered this issue. In my case, the line separate is CRLF in my zip shell script which causes the problem. Using LF fixed it.

Fast Concatenation of Multiple GZip Files

I have list of gzip files:
file1.gz
file2.gz
file3.gz
Is there a way to concatenate or gzipping these files into one gzip file
without having to decompress them?
In practice we will use this in a web database (CGI). Where the web will receive
a query from user and list out all the files based on the query and present them
in a batch file back to the user.
With gzip files, you can simply concatenate the files together, like so:
cat file1.gz file2.gz file3.gz > allfiles.gz
Per the gzip RFC,
A gzip file consists of a series of "members" (compressed data sets). [...] The members simply appear one after another in the file, with no additional information before, between, or after them.
Note that this is not exactly the same as building a single gzip file of the concatenated data; among other things, all of the original filenames are preserved. However, gunzip seems to handle it as equivalent to a concatenation.
Since existing tools generally ignore the filename headers for the additional members, it's not easily possible to extract individual files from the result. If you want this to be possible, build a ZIP file instead. ZIP and GZIP both use the DEFLATE algorithm for the actual compression (ZIP supports some other compression algorithms as well as an option - method 8 is the one that corresponds to GZIP's compression); the difference is in the metadata format. Since the metadata is uncompressed, it's simple enough to strip off the gzip headers and tack on ZIP file headers and a central directory record instead. Refer to the gzip format specification and the ZIP format specification.
Here is what man 1 gzip says about your requirement.
Multiple compressed files can be concatenated. In this case, gunzip will extract all members at once. For example:
gzip -c file1 > foo.gz
gzip -c file2 >> foo.gz
Then
gunzip -c foo
is equivalent to
cat file1 file2
Needless to say, file1 can be replaced by file1.gz.
You must notice this:
gunzip will extract all members at once
So to get all members individually, you will have to use something additional or write, if you wish to do so.
However, this is also addressed in man page.
If you wish to create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip. GNU tar supports the -z option to invoke gzip transparently. gzip is designed as a complement to tar, not as a replacement.
Just use cat. It is very fast (0.2 seconds for 500 MB for me)
cat *gz > final
mv final final.gz
You can then read the output with zcat to make sure it's pretty:
zcat final.gz
I tried the other answer of 'gz -c' but I ended up with garbage when using already gzipped files as input (I guess it double compressed them).
PV:
Better yet, if you have it, 'pv' instead of cat:
pv *gz > final
mv final final.gz
This gives you a progress bar as it works, but does the same thing as cat.
You can create a tar file of these files and then gzip the tar file to create the new gzip file
tar -cvf newcombined.tar file1.gz file2.gz file3.gz
gzip newcombined.tar

Resources