Unzip part of file - linux

Is there a way to unzip part of a .gz file without having to unzip it all?
I have a large (~139 GB) gzipped .csv.gz file. I have been told that the .csv file inside has ~540M rows of data. I only need to access a sample of the data, and I would be happy for it to be the first 1M rows (which corresponds to roughly 250 MB of the compressed file). I am happy to extract by number of rows or by number of bytes, but I would prefer not to decompress the entire file just to access a sample.

Untested (I don't have [or care to create] a gzipped CSV file that big):
zcat file.csv.gz | head -n 1000000 > extract.csv
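This works because gzip is a sequential stream format: zcat decompresses from the start, head exits after the first million lines, and the resulting SIGPIPE stops zcat, so only the needed prefix is ever decompressed. A minimal Python sketch of the same idea (file names assumed from the question):
import gzip

with gzip.open('file.csv.gz', 'rt') as src, open('extract.csv', 'w') as dst:
    for i, line in enumerate(src):
        if i >= 1_000_000:   # keep only the first million rows
            break
        dst.write(line)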

Related

linux - Convert txt file with entries on different lines to .csv?

I have used exiftool to make .txt files with the EXIF data of some images and videos I am working with. I will eventually need to create .csv manifests for these files. I know there are simple ways to convert .txt files to .csv files, but the instructions I've found assume the values destined for the different columns sit on the same line, while in my files they are spread across several lines. Is there a way to do this conversion with .txt files that are organized differently?
For example, I have seen instructions for how to convert something like this
filename date size
abc.JPG 1/1/2001 1MB
def.JPG 1/1/2001 1MB
hij.JPG 1/1/2001 1MB
to
filename,date,size
abc.JPG,1/1/2001,1MB
def.JPG,1/1/2001,1MB
hij.JPG,1/1/2001,1MB
The .txt files I have, on the other hand, are formatted like this:
========abc.JPG
File Name abc.JPG
Date/Time Original 2001:01:01 1:00:00
Size 1 MB
========def.JPG
File Name def.JPG
Date/Time Original 2001:01:01 1:01:00
Size 1 MB
========hij.JPG
File Name hij.JPG
Date/Time Original 2001:01:01 1:02:00
Size 1 MB
but I still need an output like
filename,date,size
abc.JPG,2001:01:01 1:00:00,1 MB
def.JPG,2001:01:01 1:01:00,1 MB
hij.JPG,2001:01:01 1:02:00,1 MB
Using sed and Miller, you can run
<input.txt sed -r 's/==.+//g;s/([a-zA-Z]) ([a-zA-Z])/\1-\2/' | mlr --x2c label filename,date,size >output.csv
The first sed expression blanks the ======== separator lines, leaving the empty lines that Miller's XTAB reader treats as record boundaries; the second hyphenates the first letter-space-letter sequence on each line, so multi-word keys such as File Name become single tokens (File-Name) while the values are untouched. mlr --x2c then converts the XTAB records to CSV, and label renames the fields, giving
filename,date,size
abc.JPG,2001:01:01 1:00:00,1 MB
def.JPG,2001:01:01 1:01:00,1 MB
hij.JPG,2001:01:01 1:02:00,1 MB
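If Miller isn't available, a plain-Python sketch can do the same conversion. It assumes the exact ======== separators and field labels shown above; input.txt is a placeholder name, and the CSV goes to stdout:
#!/usr/bin/env python3
import csv
import sys

fields = {'File Name': 'filename', 'Date/Time Original': 'date', 'Size': 'size'}
records, current = [], {}
for line in open('input.txt'):
    line = line.rstrip('\n')
    if line.startswith('========'):          # separator starts a new record
        if current:
            records.append(current)
        current = {}
        continue
    for label, column in fields.items():
        if line.startswith(label + ' '):
            current[column] = line[len(label) + 1:].strip()
            break
if current:
    records.append(current)

writer = csv.DictWriter(sys.stdout, fieldnames=['filename', 'date', 'size'])
writer.writeheader()
writer.writerows(records)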

Is there a tool to extract a file from a ZIP archive when that file is not present in the central directory but has its own LFH?

I'm looking for a tool that can extract files by searching aggressively through a ZIP archive. The compressed files are preceded by LFHs (local file headers), but no CDHs (central directory headers) are present, so unzip just outputs an empty folder.
I found one called binwalk, but even though it finds the hidden files inside ZIP archives, it doesn't seem to know how to extract them.
Thank you in advance.
You can try sunzip. It reads the ZIP file as a stream and extracts files as it encounters the local headers and compressed data.
Use the -r option to retain the decompressed files in the event of an error. You will be left with a temporary directory starting with _z that contains the extracted files, albeit under temporary, random names.
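To see what such a recovery involves, here is a hypothetical Python sketch that scans for local-file-header signatures and inflates each entry with zlib. It assumes the sizes are recorded in the LFH itself (general-purpose flag bit 3 unset); entries that defer their sizes to a trailing data descriptor need streaming inflation, which is exactly what sunzip does.
#!/usr/bin/env python3
import struct
import sys
import zlib

data = open(sys.argv[1], 'rb').read()
pos = 0
while True:
    pos = data.find(b'PK\x03\x04', pos)          # local file header signature
    if pos == -1:
        break
    (sig, ver, flags, method, mtime, mdate,
     crc, csize, usize, nlen, xlen) = struct.unpack_from('<4s5HIIIHH', data, pos)
    name = data[pos + 30:pos + 30 + nlen].decode('cp437', 'replace')
    start = pos + 30 + nlen + xlen
    # if flags & 0x08, the sizes live in a data descriptor after the data
    # and csize is zero here -- that case needs streaming inflation instead
    if method == 8:                              # deflate
        out = zlib.decompress(data[start:start + csize], -15)
    elif method == 0:                            # stored
        out = data[start:start + csize]
    else:
        out = None                               # unsupported method
    if out is not None:
        print(name, len(out), 'bytes recovered')
    pos = start + max(csize, 1)                  # continue past this entry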

Is it possible to remove characters from a compressed file without extracting it?

I have a compressed file that's about 200 MB, in the form of a tar.gz archive. I understand that I can extract the XML files it contains: several small ones and one 5 GB file. I'm trying to remove certain characters from these XML files.
So my very basic question is: is it even possible to accomplish this without ever extracting the content of the compressed file?
I'm trying to speed up the process of reading through the XML files looking for characters to remove.
You will have to decompress, change, and then recompress the files. There's no way around that.
However, this does not necessarily mean writing the files to storage. You may be able to make your changes in a streaming fashion, i.e. everything happens in memory without the complete decompressed file ever existing anywhere. Unix uses pipes for such tasks.
Here is an example on how to do it:
Create two sample files:
echo "hello world" > a
echo "hello world" > b
Create a compressed archive containing both:
tar -c -z -f x.tgz a b
Pipe the contents of the uncompressed archive through a changer. Unfortunately I haven't found any shell-based way to do this, but you also tagged the question with Python, and with the tarfile module you can achieve it.
Here is the file tar.py:
#!/usr/bin/env python3
import sys
import tarfile

tar_in = tarfile.open(fileobj=sys.stdin.buffer, mode='r:gz')
tar_out = tarfile.open(fileobj=sys.stdout.buffer, mode='w:gz')
for tar_info in tar_in:
    reader = tar_in.extractfile(tar_info)
    if tar_info.path == 'a':  # my example file names are "a" and "b"
        # now comes the code which makes our change:
        # we just skip the first two bytes of this file:
        reader.read(2)        # skip two bytes
        tar_info.size -= 2    # reduce the size in the info object as well
    # add the (maybe changed) file to the output:
    tar_out.addfile(tar_info, reader)
tar_out.close()
tar_in.close()
This can be called like this:
./tar.py < x.tgz > y.tgz
y.tgz will contain both files again, but the first two bytes of a will have been dropped (so its contents will be llo world).
You will have noticed that you need to know the resulting size of your change beforehand. tar is designed to handle files, so it writes each entry's size into the tar header block that precedes the entry's data in the archive; I see no way around that. And since the output is compressed, it also isn't possible to seek back after writing everything and patch the size.
But as you phrased your question, this might be possible in your case.
All you have to do is provide a file-like object (it could be a Popen object's output stream) like reader in my simple example.
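For the concrete case of removing characters, the size problem above disappears if the unwanted bytes are replaced one-for-one rather than deleted, because tar_info.size then stays valid. A hypothetical wrapper (the class name and translation table are made up for illustration):
class ByteReplacer:
    """File-like wrapper that translates bytes as they are read."""
    def __init__(self, raw, table):
        self.raw = raw
        self.table = table                        # 256-byte translation table
    def read(self, n=-1):
        return self.raw.read(n).translate(self.table)

# Usage inside the loop of tar.py, e.g. mapping two control characters to spaces:
#   table = bytes.maketrans(b'\x0b\x0c', b'  ')
#   tar_out.addfile(tar_info, ByteReplacer(tar_in.extractfile(tar_info), table))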

Appending filename information to RDD initialized by sc.textFile

I have a set of log files I would like to read into an RDD.
These files are all compressed with gzip (.gz), and the filenames are date-stamped.
The source of these files is the page view statistics data for Wikipedia:
http://dumps.wikimedia.org/other/pagecounts-raw/
The file names look like this:
pagecounts-20090501-000000.gz
pagecounts-20090501-010000.gz
pagecounts-20090501-020000.gz
What I would like to do is read in all such files in a directory and prepend the date from the filename (e.g. 20090501) to each row of the resulting RDD.
I first thought of using sc.wholeTextFiles(..) instead of sc.textFile(..), which creates a PairRDD whose key is the file name with its path, but sc.wholeTextFiles() doesn't handle compressed .gz files.
Any suggestions would be welcome.
The following seems to work fine for me in Spark 1.6.0:
sc.wholeTextFiles("file:///tmp/*.gz").flatMapValues(y => y.split("\n")).take(10).foreach(println)
Sample output:
(file:/C:/tmp/pagecounts-20160101-000000.gz,aa 271_a.C 1 4675)
(file:/C:/tmp/pagecounts-20160101-000000.gz,aa Battaglia_di_Qade%C5%A1/it/Battaglia_dell%27Oronte 1 4765)
(file:/C:/tmp/pagecounts-20160101-000000.gz,aa Category:User_th 1 4770)
(file:/C:/tmp/pagecounts-20160101-000000.gz,aa Chiron_Elias_Krase 1 4694)
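That answer stops at splitting the lines; to actually prepend the date stamp the question asks for, something along these lines should work in PySpark (the regex, path, and app name are assumptions based on the file names in the question):
import re
from pyspark import SparkContext

sc = SparkContext(appName="pagecounts")
date_re = re.compile(r"pagecounts-(\d{8})-")     # pulls 20090501 out of the path

rows = (sc.wholeTextFiles("file:///tmp/*.gz")
          .flatMap(lambda kv: [(date_re.search(kv[0]).group(1), line)
                               for line in kv[1].split("\n")]))
# each element is now e.g. ('20090501', 'aa 271_a.C 1 4675')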

Should all file structures in a ZIP file be consecutive?

While reading a ZIP file, can we safely assume that all file structures (by that I mean Local File Header + file data (compressed or stored) + Data Descriptor) are exactly consecutive? Can there be any irrelevant data in between?
The PKWARE APPNOTE says:
"Immediately following the local header for a file is the compressed or stored data for the file. The series of [local file header][file data][data descriptor] repeats for each file in the .ZIP archive."
So there should be no gaps between them.
However, I would recommend parsing and reading the central directory rather than walking the local file headers (unless you need streamed processing).
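As a hypothetical illustration of the central-directory approach, this Python sketch finds the End of Central Directory record and walks the CDH entries; each one stores the offset of its entry's local file header, so nothing has to be assumed about consecutive placement. (ZIP64 archives and trailing-comment corner cases are ignored here.)
import struct
import sys

data = open(sys.argv[1], 'rb').read()
eocd = data.rfind(b'PK\x05\x06')                 # End of Central Directory record
total, cd_size, cd_offset = struct.unpack_from('<HII', data, eocd + 10)
pos = cd_offset
for _ in range(total):
    nlen, xlen, clen = struct.unpack_from('<HHH', data, pos + 28)
    lfh_offset, = struct.unpack_from('<I', data, pos + 42)
    name = data[pos + 46:pos + 46 + nlen].decode('cp437', 'replace')
    print(name, '-> local file header at offset', lfh_offset)
    pos += 46 + nlen + xlen + clen               # fixed header + variable fields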
