Convert txt file with entries on different lines to .csv?

I have used exiftool to make .txt files with the EXIF data of some images and videos I am working with. I will eventually need to create .csv manifests for these files. I know there are simple ways to convert .txt files to .csv files, but the instructions I've found assume the .txt file already has the information for each column on a single line, whereas mine spreads each record across several lines. Is there a way to do this conversion with .txt files that are organized this way?
For example, I have seen instructions for how to convert something like this
filename date size
abc.JPG 1/1/2001 1MB
def.JPG 1/1/2001 1MB
hij.JPG 1/1/2001 1MB
to
filename,date,size
abc.JPG,1/1/2001,1MB
def.JPG,1/1/2001,1MB
hij.JPG,1/1/2001,1MB
The .txt files I have, on the other hand, are formatted like this:
========abc.JPG
File Name abc.JPG
Date/Time Original 2001:01:01 1:00:00
Size 1 MB
========def.JPG
File Name def.JPG
Date/Time Original 2001:01:01 1:01:00
Size 1 MB
========hij.JPG
File Name hij.JPG
Date/Time Original 2001:01:01 1:02:00
Size 1 MB
but I still need an output like
filename,date,size
abc.JPG,2001:01:01 1:00:00,1 MB
def.JPG,2001:01:01 1:01:00,1 MB
hij.JPG,2001:01:01 1:02:00,1 MB

Using sed to turn the ======== lines into blank record separators and to hyphenate the two-word field names, and Miller's --x2c to read the resulting blank-line-separated key/value records (XTAB) and write them as CSV, you can run
<input.txt sed -r 's/==.+//g;s/([a-zA-Z]) ([a-zA-Z])/\1-\2/' | mlr --x2c label filename,date,size >output.csv
to have
filename,date,size
abc.JPG,2001:01:01 1:00:00,1 MB
def.JPG,2001:01:01 1:01:00,1 MB
hij.JPG,2001:01:01 1:02:00,1 MB
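If sed and Miller aren't to hand, a rough Python sketch of the same conversion (untested; the file names exif.txt and manifest.csv are placeholders, and it assumes each block contains only the three tags laid out exactly as shown above):

import csv

rows = []
with open("exif.txt") as fh:
    current = {}
    for line in fh:
        line = line.rstrip("\n")
        if line.startswith("========"):
            # a ======== line starts a new block, so flush the previous one
            if current:
                rows.append(current)
            current = {}
        elif line.startswith("File Name"):
            current["filename"] = line[len("File Name"):].strip()
        elif line.startswith("Date/Time Original"):
            current["date"] = line[len("Date/Time Original"):].strip()
        elif line.startswith("Size"):
            current["size"] = line[len("Size"):].strip()
    if current:
        rows.append(current)

with open("manifest.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=["filename", "date", "size"])
    writer.writeheader()
    writer.writerows(rows)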

Related

Unzip part of file

Is there a way to unzip part of a .gz file without having to unzip it all?
I have a large (~139 GB) gzipped .csv.gz file. I have been told that the .csv file has ~540M rows of data. I only need to access a sample of the data in the .csv file, and I would be happy for it to be the first 1M rows (which would constitute about ~250 MB of the compressed file). I am happy to extract by number of rows or by number of bytes, but would prefer not to have to decompress the entire file to access only a sample of the data.
Untested (I don't have [or care to create] a gzipped CSV file that big):
zcat file.csv.gz | head -n 1000000 > extract.csv
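A sketch of the same idea in Python, in case you'd rather not shell out; it streams the gzip file and stops after the first 1,000,000 lines, so the rest is never decompressed (file names are placeholders):

import gzip
import itertools

# read the compressed stream as text and copy only the first million lines
with gzip.open("file.csv.gz", "rt") as src, open("extract.csv", "w") as dst:
    dst.writelines(itertools.islice(src, 1_000_000))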

Is it possible to remove characters from a compressed file without extracting it?

I have a compressed file that's about 200 MB, in the form of a tar.gz file. I understand that I can extract the XML files in it; it contains several small XML files and one 5 GB XML file. I'm trying to remove certain characters from the XML files.
So my very basic question is: is it even possible to accomplish this without ever extracting the content of the compressed file?
I'm trying to speed up the process of reading through the XML files looking for characters to remove.
You will have to decompress, change, and then recompress the files. There's no way around that.
However, this does not necessarily mean writing the decompressed file to storage. You may be able to make your changes in a streaming fashion, i.e. everything is done in memory without ever having the complete decompressed file anywhere. Unix uses pipes for such tasks.
Here is an example on how to do it:
Create two random files:
echo "hello world" > a
echo "hello world" > b
Create a compressed archive containing both:
tar -c -z -f x.tgz a b
Pipe the archive through a changer that works on the decompressed contents. Unfortunately I haven't found any shell-based way to do this, but you also specified Python in the tags, and with the tarfile module you can achieve it:
Here is the file tar.py:
#!/usr/bin/env python3
import sys
import tarfile

tar_in = tarfile.open(fileobj=sys.stdin.buffer, mode='r:gz')
tar_out = tarfile.open(fileobj=sys.stdout.buffer, mode='w:gz')
for tar_info in tar_in:
    reader = tar_in.extractfile(tar_info)
    if tar_info.path == 'a':  # my example file names are "a" and "b"
        # now comes the code which makes our change:
        # we just skip the first two bytes in each file:
        reader.read(2)  # skip two bytes
        tar_info.size -= 2  # reduce size in info object as well
    # add the (maybe changed) file to the output:
    tar_out.addfile(tar_info, reader)
tar_out.close()
tar_in.close()
This can be called like this:
./tar.py < x.tgz > y.tgz
y.tgz will contain both files again, but in a the first two bytes will have been skipped (so its contents will be llo world).
You will have noticed that you need to know the resulting size of your change beforehand. tar is designed to handle files, so it needs to write the size of each entry into the tar info header which precedes the entry's data in the resulting file; I see no way around this. With a compressed output it also isn't possible to seek back after writing all the output and adjust the file size.
But as you phrased your question, this might be possible in your case.
All you will have to do is provide a file-like object (could be a Popen object's output stream) like reader in my simple example case.
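For illustration, here is a hypothetical variation of tar.py along those lines: each member is piped through an external, size-preserving filter (tr here), and the filter's stdout -- a Popen output stream -- is what gets handed to addfile(). Because the filter doesn't change the number of bytes, tar_info.size stays correct; a helper thread feeds the filter so the pipes don't deadlock. Like the original sketch, it assumes the archive contains only regular files.

#!/usr/bin/env python3
import shutil
import subprocess
import sys
import tarfile
import threading

def feed(src, dst):
    # copy the member's bytes into the filter's stdin, then close it so the
    # filter sees end-of-file and flushes its output
    shutil.copyfileobj(src, dst)
    dst.close()

tar_in = tarfile.open(fileobj=sys.stdin.buffer, mode='r:gz')
tar_out = tarfile.open(fileobj=sys.stdout.buffer, mode='w:gz')
for tar_info in tar_in:
    reader = tar_in.extractfile(tar_info)
    # size-preserving change: upper-case every 'l' in the member
    proc = subprocess.Popen(['tr', 'l', 'L'],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    feeder = threading.Thread(target=feed, args=(reader, proc.stdin))
    feeder.start()
    tar_out.addfile(tar_info, proc.stdout)  # proc.stdout is the file-like object
    feeder.join()
    proc.wait()
tar_out.close()
tar_in.close()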

ZipArchive ZipFile not compressing

I am using System.IO.Compression in .NET 4.5, and ZipArchive and ZipFile are used to add txt files to the archive. There are around 75 files.
When the files are put in a folder and their size is measured, it is around 15 KB.
But when they are put into an archive using ZipArchive, the generated zip file is 21 KB.
Am I doing something wrong, or does ZipArchive just put the files into a single archive file rather than compressing them?
Is the Deflate algorithm used for the .zip creation?
This is what I have done. Is there any higher compression level that can be used? With 7-Zip the file size is even smaller, only around 1 KB.
using (ZipArchive zippedFile = ZipFile.Open(zipFileName, ZipArchiveMode.Create))
{
    foreach (string file in filesTobeZipped)
    {
        zippedFile.CreateEntryFromFile(file, Path.GetFileName(file), CompressionLevel.Optimal);
    }
}
Each entry in a zip file has an overhead for the local and central headers of 76 bytes plus the length of the file name with path, twice, plus a single end record of 22 bytes. For 75 files each with, say, a three-character name, the total overhead would be about 6K. Each file averages about 200 bytes in length uncompressed, which is too short to compress effectively. If each file remained a 200 byte entry in the zip file, then you would end up with a 21K zip file. Which is in fact what you got.
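A quick back-of-the-envelope check of that arithmetic (the numbers below are the assumptions from the explanation above, not measured values):

entries = 75          # number of txt files
name_len = 3          # assumed length of each file name
data_per_entry = 200  # ~15 KB of text spread over 75 files

# 76 bytes of fixed local + central header overhead per entry, the name stored
# twice, plus one 22-byte end-of-central-directory record
overhead = entries * (76 + 2 * name_len) + 22
total = overhead + entries * data_per_entry  # the data is too short to compress
print(overhead, total)  # about 6 KB of overhead, roughly 21 KB in total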

Appending filename information to RDD initialized by sc.textFile

I have a set of log files I would like to read into an RDD.
These files are all compressed with gzip (.gz) and the filenames are date stamped.
The source of these files is the page view statistics data for wikipedia
http://dumps.wikimedia.org/other/pagecounts-raw/
The file names look like this:
pagecounts-20090501-000000.gz
pagecounts-20090501-010000.gz
pagecounts-20090501-020000.gz
What I would like to do is read in all such files in a directory and prepend the date from the filename (e.g. 20090501) to each row of the resulting RDD.
I first thought of using sc.wholeTextFiles(..) instead of sc.textFile(..), which creates a PairRDD with the key being the file name with a path,
but sc.wholeTextFiles() doesn't handle compressed .gz files.
Any suggestions would be welcome.
The following seems to work fine for me in Spark 1.6.0:
sc.wholeTextFiles("file:///tmp/*.gz").flatMapValues(y => y.split("\n")).take(10).foreach(println)
Sample output:
(file:/C:/tmp/pagecounts-20160101-000000.gz,aa 271_a.C 1 4675)
(file:/C:/tmp/pagecounts-20160101-000000.gz,aa Battaglia_di_Qade%C5%A1/it/Battaglia_dell%27Oronte 1 4765)
(file:/C:/tmp/pagecounts-20160101-000000.gz,aa Category:User_th 1 4770)
(file:/C:/tmp/pagecounts-20160101-000000.gz,aa Chiron_Elias_Krase 1 4694)
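The snippet above is Scala; a rough PySpark equivalent that also pulls the date stamp out of the file name and prepends it to each line could look like this (untested sketch; it assumes every file name matches the pagecounts-YYYYMMDD-HHMMSS.gz pattern and that sc is the usual SparkContext from the shell):

import re

# (path, whole-file content) pairs, split into (path, line) pairs, then the
# path is replaced by the 8-digit date stamp taken from the file name
dated = (sc.wholeTextFiles("file:///tmp/*.gz")
           .flatMapValues(lambda content: content.split("\n"))
           .map(lambda kv: (re.search(r"pagecounts-(\d{8})-", kv[0]).group(1), kv[1])))

for pair in dated.take(5):
    print(pair)  # e.g. ('20160101', 'aa 271_a.C 1 4675')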

How to Grep & Cat text files based on an identifier line from a multi-text file

I am looking for an efficient way to organize and filter certain types of text files.
Let's say I have 10,000,000 text files that are concatenated to larger chunks that are formatted like this
#text_file_header
ID0001
some text
...
#text_file_header
ID0002
some text
...
#text_file_header
ID0003
some text
...
Now, I perform certain operations on those files so that I end up with 200 x 10,000,000 text files (in chunks) -- each text file now has "siblings":
#text_file_header
ID0001_1
some text
...
#text_file_header
ID0001_2
some text
...
#text_file_header
ID0001_3
some text
...
#text_file_header
ID0002_1
some text
...
#text_file_header
ID0002_2
some text
...
#text_file_header
ID0002_3
some text
However, for certain tasks, I only need certain text files, and my main question is how I can extract them based on an "id" in the text files (e.g., grep ID0001_* and ID0005_* and ID0006_* and so on).
SQLite would be one option, and I also already have an SQLite database with ID and file columns; however, the problem is that I need to do this computation, where I generate those 200 * 10,000,000 text files, on a cluster due to time constraints. The file I/O for SQLite would be too limiting right now.
My idea was now to split those files into 10,000,000 individual files, like so:
gawk -v RS="#text_file_header" 'NF{ print RS$0 > "file"++n".txt" }' all_chunk_01.txt
and after I generated those 200 "siblings", I would cat files together in that folder based on the file IDs I am currently interested in. Let's say I need the contents of 10,000 of the 10,000,000 text files; I would cat them together into a single document for further processing steps.
Now, my concern is whether it is a good idea at all to store 10,000,000 individual files in a single folder on a disk and perform the cat, or whether it would be better to grep out the records based on an ID from, say, 100 multi-text files.
For example:
grep TextToFind FileWhereToFind
returns what you want.
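If creating 10,000,000 individual files turns out to be too hard on the filesystem, a rough sketch of the grep-style approach over a smaller number of chunk files (the IDs, chunk file names, and the extract_records helper are made up for illustration; it assumes each record starts with a #text_file_header line followed by the ID line, as in the example above):

import glob

wanted = {"ID0001", "ID0005", "ID0006"}  # base IDs whose records we need

def extract_records(path, wanted_ids):
    # walk one chunk file and yield every whole record whose ID matches
    record, keep = [], False
    with open(path) as fh:
        for line in fh:
            if line.startswith("#text_file_header"):
                if keep:
                    yield "".join(record)
                record, keep = [line], False
            else:
                if len(record) == 1:
                    # the line right after the header carries the ID, e.g. ID0001_2
                    keep = line.split("_")[0].strip() in wanted_ids
                record.append(line)
        if keep:
            yield "".join(record)  # last record in the file

# concatenate all matching records, headers included, into one document
with open("selected.txt", "w") as out:
    for chunk in sorted(glob.glob("all_chunk_*.txt")):
        for rec in extract_records(chunk, wanted):
            out.write(rec)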
