gzip: unexpected end of file when using gzip - linux

I have to process a file using my Linux machine.
When I try to write my output to a csv file then gzip it in the same line of script:
processing > output.csv | gzip -f output.csv
I get an 'unexpected end of file' error. Even when I download the file using the Linux machine I get the same error.
When I do not gzip via terminal (or in a single line) everything works fine.
Why does it fail like this when the commands are all in a single line?

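You should remove > output.csv and let gzip read from the pipe instead of naming a file.
For a single stream (stdout) you can either:
Use a pipe (|), or
Redirect it to a file (>),
but not both at once.
Errors go to stderr, which you can redirect to a file with 2>errors.txt; otherwise they are displayed on screen.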

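When you redirect a process' output with the > operator, that output cannot be used by a pipe afterwards (because there is no "output" left to be piped). In addition, both sides of a pipeline are started at the same time, so gzip may open output.csv before processing has finished writing it. You have two options: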
processing > output.csv &&
gzip output.csv
This writes the uncompressed output of your program to the file output.csv and then, in a second step, gzips this file, replacing it with output.csv.gz. Depending on the amount of data, this might not be feasible (the storage requirements are the full uncompressed output PLUS the compressed size).
processing | gzip > output.csv.gz
This will compress the output of your process in-line and write it directly to the output file, without storing the uncompressed output in an intermediate file.
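A minimal sketch of both options side by side. The processing function here is a hypothetical stand-in for the real program, and the file names other than output.csv are made up for the demonstration:

```shell
# Hypothetical stand-in for the real `processing` command.
processing() { printf 'id,value\n1,a\n2,b\n'; }

workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Option 1: write the CSV first, then compress it.
# gzip replaces output.csv with output.csv.gz.
processing > output.csv && gzip -f output.csv

# Option 2: compress on the fly, without an intermediate file.
processing | gzip > streamed.csv.gz

# Both archives should decompress to the same data.
gzip -cd output.csv.gz > a.csv
gzip -cd streamed.csv.gz > b.csv
cmp a.csv b.csv && echo "identical"
```

The && in option 1 matters: gzip only starts once processing has exited successfully, so it never sees a half-written file.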

Related

Is it possible to partially unzip a .vcf file?

I have a ~300 GB zipped vcf file (.vcf.gz) which contains the genomes of about 700 dogs. I am only interested in a few of these dogs and I do not have enough space to unzip the whole file at this time, although I am in the process of getting a computer to do this. Is it possible to unzip only parts of the file to begin testing my scripts?
I am trying to extract a specific SNP at a position for a subset of the samples. I have tried using bcftools to no avail. (If anyone can identify what went wrong with that I would also really appreciate it. I created an empty file for the output (722g.990.SNP.INDEL.chrAll.vcf.bgz) but it returns the following error.)
bcftools view -f PASS --threads 8 -r chr9:55252802-55252810 -o 722g.990.SNP.INDEL.chrAll.vcf.gz -O z 722g.990.SNP.INDEL.chrAll.vcf.bgz
The output type "722g.990.SNP.INDEL.chrAll.vcf.bgz" not recognised
I am planning on trying awk, but need to unzip the file first. Is it possible to partially unzip it so I can try this?
Double check your command line for bcftools view.
The error message 'The output type "something" is not recognized' is printed by bcftools when you specify an invalid value for the -O (upper-case O) command line option like this -O something. Based on the error message you are getting it seems that you might have put the file name there.
Check that you don't have your input and output file names the wrong way around in your command. Note that the -o (lower-case o) command line option specifies the output file name, and the file name at the end of the command line is the input file name.
Also, you write that you created an empty file for the output. You don't need to do that, bcftools will create the output file.
I don't have much experience with bcftools, but in general, if you want to use awk to manipulate a gzipped file, you can pipe the decompressed data into it so that nothing uncompressed is stored on disk. You can also pipe the result directly through gzip so that it is compressed too, e.g.
gzip -cd largeFile.vcf.gz | awk '{ <some awk> }' | gzip -c > newfile.txt.gz
Note that zcat is equivalent to gzip -cd: -c writes to standard output and -d decompresses.
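As a rough sketch of that pipeline, the following filters a tiny synthetic VCF-like file by region. The file contents and the name subset.vcf.gz are invented for the demonstration; the region is the one from the question. It assumes tab-separated columns with the chromosome in column 1 and the position in column 2:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Tiny synthetic stand-in for the real .vcf.gz (header lines start with '#').
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\nchr9\t55252800\t.\nchr9\t55252805\t.\nchr1\t55252805\t.\n' |
  gzip -c > largeFile.vcf.gz

# Stream-decompress, keep header lines plus records in chr9:55252802-55252810,
# and recompress the result without writing uncompressed data to disk.
gzip -cd largeFile.vcf.gz |
  awk -F'\t' '/^#/ || ($1 == "chr9" && $2 >= 55252802 && $2 <= 55252810)' |
  gzip -c > subset.vcf.gz

gzip -cd subset.vcf.gz
```

This still reads and decompresses the whole input stream; it only avoids storing the uncompressed data, so on a 300 GB file it will take a while.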
As a side note, if you only need to work on part of a large file, you may also find the excellent tool less useful: it lets you view a large file while loading only the parts you need. The -S option is particularly useful for wide formats with many columns because it stops line wrapping, and -N shows line numbers.
less -S largefile.vcf.gz
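Quit the view with q; g takes you to the top of the file. If less shows binary data, it is not set up to preprocess .gz files on your system; use zless, or gzip -cd largefile.vcf.gz | less -S, instead.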

Unzip the archive with more than one entry

I'm trying to decompress ~8GB .zip file piped from curl command. Everything I have tried is being interrupted at <1GB and returns a message:
... has more than one entry--rest ignored
I've tried: funzip, gunzip, gzip -d, zcat, ... also with different arguments - all end up in the above message.
The datafile is public, so it's easy to repro the issue:
curl -L https://archive.org/download/nycTaxiTripData2013/faredata2013.zip | funzip > datafile
Are you sure the archive contains only a single file? If it extracts to multiple files, you unfortunately cannot unzip it on the fly.
ZIP is a container format as well as a compression format, and its directory of entries is stored at the end of the archive, so a streaming reader cannot reliably tell where each new file begins. You'll have to download the whole file and unzip it.

Regarding lz4mt compression and linux buffering issue

I am using lz4mt, a multi-threaded version of lz4. In my workflow I send thousands of large files (620 MB each) from a client to a server. When a file arrives on the server, a rule triggers that compresses the file with lz4mt and then removes the uncompressed file. The problem is that sometimes, after the uncompressed file has been removed, the compressed file does not have the right size, because lz4mt returns immediately, before its output has been written to disk.
So is there any way to make lz4mt remove the uncompressed file itself after compressing, as bzip2 does?
Input: bzip2 uncompress_file
Output: Compressed file only
whereas
Input: lz4mt uncompress_file
Output: (Uncompressed + Compressed) file
I think the sync command in the script below is also not working properly.
The script that runs when my rule triggers is:
script.sh
/bin/lz4mt uncompressed_file output_file
/bin/sync
/bin/rm uncompressed_file
Please tell me how to solve this issue.
Thanks a lot
Author here. You could try the following:
Chain the commands with && or ;.
Add the lz4mt command-line options -q (suppress prompts) and -f (force overwrite).
Try it with the original lz4.
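The first suggestion could be sketched like this. The helper compress_then_remove and the stand-in fake_compress are hypothetical names introduced here so the example runs without lz4mt installed; the sketch only guarantees that the source is deleted when the compressor reports success and has produced a non-empty file:

```shell
# compress_then_remove SRC DST CMD [ARGS...]
# Runs "CMD ARGS... SRC DST" and deletes SRC only if the command succeeded
# and DST exists and is non-empty.
compress_then_remove() {
  src=$1
  dst=$2
  shift 2
  "$@" "$src" "$dst" && [ -s "$dst" ] && rm -f -- "$src"
}

# Hypothetical stand-in compressor (gzip), taking "input output" like lz4mt does.
fake_compress() { gzip -c "$1" > "$2"; }

workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'payload\n' > uncompressed_file

compress_then_remove uncompressed_file output_file fake_compress
ls
```

With the real tool the call would presumably look like compress_then_remove uncompressed_file output_file /bin/lz4mt -q -f. This guards against deleting the source after a failed run; whether it also cures the short-file symptom depends on how lz4mt behaves on your system.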

Writing Block buffered data to a File without fflush(stdout)

From what I understand about buffers: a buffer is temporarily stored data.
For example: let's assume that you wanted to implement an algorithm for determining whether something is speech or just noise. How would you do this using a constant stream flow of sound data? It would be very difficult. Therefore, by storing this into an array you can perform analysis on this data.
This array of data is called a buffer.
Now, I have a Linux command where the output is continuous:
stty -F /dev/ttyUSB0 ispeed 4800 && awk -F"," '/SUF/ {print $3,$4,$5,$6,$10,$11,substr($2,1,2),".",substr($2,3,2),".",substr($2,5,2)}' < /dev/ttyUSB0
If I redirect the output of this command to a file, nothing gets written: the output is probably block buffered, so only an empty text file is left when I terminate the command (CTRL+C).
Here is what I mean by block buffered:
The three types of buffering available are unbuffered, block
buffered, and line buffered. When an output stream is unbuffered,
information appears on the destination file or terminal as soon as
written; when it is block buffered many characters are saved up and
written as a block; when it is line buffered characters are saved
up until a newline is output or input is read from any stream
attached to a terminal device (typically stdin). The function
fflush(3) may be used to force the block out early. (See
fclose(3).) Normally all files are block buffered. When the first
I/O operation occurs on a file, malloc(3) is called, and a buffer
is obtained. If a stream refers to a terminal (as stdout normally
does) it is line buffered. The standard error stream stderr is
always unbuffered by default.
Now, executing this command,
stty -F /dev/ttyUSB0 ispeed 4800 && awk -F"," '/SUF/ {print $3,$4,$5,$6,$10,$11,substr($2,1,2),".",substr($2,3,2),".",substr($2,5,2)}' < /dev/ttyUSB0 > outputfile.txt
An empty file is generated, because the buffer block may not have been filled when I terminated the process; since I don't know the block buffer size, I have no way to wait until the block is complete.
To write the output of this command to a file I have to call fflush() inside awk, which I have already done successfully.
Here it goes:
stty -F /dev/ttyUSB0 ispeed 4800 && awk -F"," '/GGA/ {print "Latitude:",$3,$4,"Longitude:",$5,$6,"Altitude:",$10,$11,"Time:",substr($2+50000,1,2),".",substr($2,3,2),".",substr($2,5,2); fflush(stdout) }' < /dev/ttyUSB0 | head -n 2 > GPS_data.txt
But my question is:
Is there any way to set the buffer block size, so that I know when a block is written and no longer need fflush()?
OR
Is there any way to change the buffering type from block buffered to unbuffered or line buffered?
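You can use stdbuf to run a command with a modified buffer size or buffering mode.
For example, stdbuf -o 100 awk ... will run awk with a 100 byte standard output buffer; stdbuf -oL makes standard output line buffered, and stdbuf -o0 makes it unbuffered.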
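A small sketch of the line-buffered variant, assuming GNU coreutils stdbuf is installed. The input lines are synthetic and stand in for the serial device; the field numbers are chosen for the demonstration only:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Synthetic comma-separated input standing in for /dev/ttyUSB0.
printf 'SUF,101500,48.1,N\nXYZ,0,0,0\nSUF,101501,48.2,N\n' |
  stdbuf -oL awk -F',' '/SUF/ {print $3, $4}' > outputfile.txt

cat outputfile.txt
```

With a finite input like this the file ends up the same with or without stdbuf; the difference only shows on an endless stream, where each line should reach the file as soon as it is printed. stdbuf only affects programs that rely on the default C stdio buffering, so an awk implementation that manages its own buffers may ignore it.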

Fast Concatenation of Multiple GZip Files

I have list of gzip files:
file1.gz
file2.gz
file3.gz
Is there a way to concatenate these files into one gzip file
without having to decompress them?
In practice we will use this in a web database (CGI): the site receives
a query from the user, lists all the files matching the query, and returns them
to the user as a single batch file.
With gzip files, you can simply concatenate the files together, like so:
cat file1.gz file2.gz file3.gz > allfiles.gz
Per the gzip RFC,
A gzip file consists of a series of "members" (compressed data sets). [...] The members simply appear one after another in the file, with no additional information before, between, or after them.
Note that this is not exactly the same as building a single gzip file of the concatenated data; among other things, all of the original filenames are preserved. However, gunzip seems to handle it as equivalent to a concatenation.
Since existing tools generally ignore the filename headers for the additional members, it's not easily possible to extract individual files from the result. If you want this to be possible, build a ZIP file instead. ZIP and GZIP both use the DEFLATE algorithm for the actual compression (ZIP supports some other compression algorithms as well as an option - method 8 is the one that corresponds to GZIP's compression); the difference is in the metadata format. Since the metadata is uncompressed, it's simple enough to strip off the gzip headers and tack on ZIP file headers and a central directory record instead. Refer to the gzip format specification and the ZIP format specification.
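A quick way to convince yourself of this, using three small files created on the spot (their contents are made up for the demonstration):

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build three independent gzip files.
printf 'one\n' | gzip -c > file1.gz
printf 'two\n' | gzip -c > file2.gz
printf 'three\n' | gzip -c > file3.gz

# Concatenate the compressed files byte for byte.
cat file1.gz file2.gz file3.gz > allfiles.gz

# Check integrity, then decompress all members in order.
gzip -t allfiles.gz && gzip -cd allfiles.gz
```

The decompressed output is the three original lines in order, with nothing marking where one member ends and the next begins.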
Here is what man 1 gzip says about your requirement.
Multiple compressed files can be concatenated. In this case, gunzip will extract all members at once. For example:
gzip -c file1 > foo.gz
gzip -c file2 >> foo.gz
Then
gunzip -c foo
is equivalent to
cat file1 file2
Needless to say, file1 can be replaced by file1.gz.
Note this part:
gunzip will extract all members at once
So to get the members back individually, you will need an additional tool, or you will have to write one yourself.
However, the man page addresses this as well:
If you wish to create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip. GNU tar supports the -z option to invoke gzip transparently. gzip is designed as a complement to tar, not as a replacement.
Just use cat. It is very fast (0.2 seconds for 500 MB for me)
cat *gz > final
mv final final.gz
You can then read the output with zcat to make sure it's pretty:
zcat final.gz
I tried the other answer's 'gzip -c' approach, but I ended up with garbage when using already gzipped files as input (I guess it compressed them a second time).
PV:
Better yet, if you have it, 'pv' instead of cat:
pv *gz > final
mv final final.gz
This gives you a progress bar as it works, but does the same thing as cat.
You can create a tar file of these files and then gzip the tar file to create the new gzip file
tar -cvf newcombined.tar file1.gz file2.gz file3.gz
gzip newcombined.tar
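Unlike plain concatenation, a tar archive keeps the member names, so a single file can be pulled out again later. A minimal sketch with made-up file contents (the extracted directory is a name chosen for the demonstration, and the final gzip step is left out because it does not change how members are addressed):

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'one\n' | gzip -c > file1.gz
printf 'two\n' | gzip -c > file2.gz
printf 'three\n' | gzip -c > file3.gz

# Bundle the already-compressed files and list the archive members.
tar -cf newcombined.tar file1.gz file2.gz file3.gz
tar -tf newcombined.tar

# Extract only one member into a separate directory and read it.
mkdir extracted
tar -xf newcombined.tar -C extracted file2.gz
gzip -cd extracted/file2.gz
```

Since the members are already compressed, gzipping the tar file afterwards is unlikely to shrink it much further.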
