Concat multiple .gz files, skipping header lines in all but the first file using Python - python-3.x

I have basically the same question as this one, but rather than using awk I'd like to use Python, assuming it's not substantially slower than some other method. I was thinking of reading line by line and compressing on the fly, but then I came across this post, and it sounds like that would be a bad idea (the compression ends up not very efficient). I also came across the nice-looking built-in Python gzip library, so I'm hoping there is a clean, fast, and efficient pythonic way to do this.
I want to go from this:
gzcat file1.gz
# header
1
2
to this:
# header
1
2
1
2
1
2
1
2
I have a few hundred files, and the total uncompressed is about 60 GB. The files are gzipped CSV files.

Since you need to remove the first line of each CSV file, you have no choice but to decompress all of the data and recompress it.
You can open a gzip output file for the result using with gzip.open('all.gz', 'wb') as g:.
You can open each input file using with gzip.open('filen.gz', 'rb') as f:, and then call x = f.readline() on that object to read the uncompressed data one line at a time. You can then, for example, discard the first line of every input file except the first one.
For the lines you want to keep, write them to the output with g.write(x).
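Putting those pieces together, here is a minimal sketch of that approach; the list of input file names is an assumption (in practice you would build it from your few hundred files), and all.gz is just the output name used above:

import gzip

input_files = ["file1.gz", "file2.gz", "file3.gz"]  # placeholder list of your files

with gzip.open("all.gz", "wb") as g:
    for i, name in enumerate(input_files):
        with gzip.open(name, "rb") as f:
            header = f.readline()       # first line of this file
            if i == 0:
                g.write(header)         # keep the header from the first file only
            for line in f:              # copy the remaining data lines
                g.write(line)

If per-line writes turn out to be slow over 60 GB of data, shutil.copyfileobj(f, g) after the readline() call should copy the rest of each input file in larger chunks.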

Related

How to get information from text and save it in a variable with Python

So I am trying to make an offline dictionary and as a source for the words, I am using a .txt file. I have some questions related to that. How can I find a specific word in my text file and save it in a variable? Also does the length of my file matter and will it affect the speed? That's just a part of my .txt file:
Abendhimmel m вечерно небе.|-|
Abendkasse f Theat вечерна каса.|-|
Abendkleid n вечерна рокля.|-|
Abendland n o.Pl. geh Западът.|-|
What I want is to save the word, for example Abendkasse, and everything else up to the symbol |-| in one variable. Thanks for your help!
I recommend looking at Python's standard library functions (on open files) called readlines() and read(). I don't know how large your file is, but you can usually just read the entire thing into RAM (with read or readlines) and then search through the string you get back. Searching can be done with regex or just with a simple loop.
The length of your file will matter somewhat, in that opening larger files takes slightly longer, though this is usually still pretty fast even for large text files. In fact, in many cases it will be faster to read the entire file first, because once it is in RAM, all operations on it will be much faster.
an example:
with open("yourlargetextfile.txt", f):
contents = f.readlines()
for line in contents:
# split every line into parts from |-| to the next |-|
parts = line.split("|-|")
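As a possible follow-up (a sketch only, assuming each entry ends at |-| and starts with the headword, as in the sample lines above):

entries = {}
with open("yourlargetextfile.txt", encoding="utf-8") as f:
    for line in f:
        entry = line.split("|-|")[0].strip()   # text up to the first |-|
        if entry:
            entries[entry.split()[0]] = entry  # first token is the headword

definition = entries.get("Abendkasse")         # -> "Abendkasse f Theat вечерна каса."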

Partially expand VCF bgz file in Linux

I have downloaded gnomAD files from - https://gnomad.broadinstitute.org/downloads
This is the bgz file
https://storage.googleapis.com/gnomad-public/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.2.vcf.bgz
When I expand using:
zcat gnomad.genomes.r2.1.1.sites.2.vcf.bgz > gnomad.genomes.r2.1.1.sites.2.vcf
The output VCF file becomes more than 330GB. I do not have that kind of space available on my laptop.
Is there a way where I can just expand - say 1 GB of the bgz file OR just 100000 rows from the bgz file?
From what I've been able to determine, a bgz file is compatible with gzip, and a VCF file is a plain text file. Since it's a gzip file and not a .tar.gz, the solution doesn't require listing any archive contents, which simplifies things a bit.
This can probably be accomplished in several ways, and I doubt this is the best way, but I've been able to successfully decompress the first 100,000 rows into a file using the following code in python3 (it should also work under earlier versions back to 2.7):
#!/usr/bin/env python3
import gzip

# bgz is gzip-compatible, so GzipFile can read it directly
ifile = gzip.GzipFile("gnomad.genomes.r2.1.1.sites.2.vcf.bgz")
ofile = open("truncated.vcf", "wb")

LINES_TO_EXTRACT = 100000
for _ in range(LINES_TO_EXTRACT):
    ofile.write(ifile.readline())   # copy one uncompressed line at a time

ifile.close()
ofile.close()
I tried this on your example file, and the truncated file is about 1.4 GiB. It took about 1 minute, 40 seconds on a raspberry pi-like computer, so while it's slow, it's not unbearably so.
While this solution is somewhat slow, it's good for your application for the following reasons:
It minimizes disk and memory usage, which could otherwise be problematic with a large file like this.
It cuts the file to exactly the given number of lines, which avoids truncating your output file mid-line.
The three input parameters can be easily parsed from the command line in case you want to make a small CLI utility for parsing other files in this manner.
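For example, a small CLI wrapper around the same logic might look like the sketch below; the argument names here are made up for illustration:

#!/usr/bin/env python3
import argparse
import gzip

parser = argparse.ArgumentParser(description="Extract the first N lines of a gzip/bgz file.")
parser.add_argument("infile", help="input .gz or .bgz file")
parser.add_argument("outfile", help="output plain-text file")
parser.add_argument("-n", "--lines", type=int, default=100000, help="number of lines to extract")
args = parser.parse_args()

with gzip.open(args.infile, "rb") as ifile, open(args.outfile, "wb") as ofile:
    for _ in range(args.lines):
        line = ifile.readline()
        if not line:          # stop early if the file has fewer lines
            break
        ofile.write(line)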

Opening a gzipped file, characters following three pipes ("|||") are not visible

My input file is a gzipped file containing genomic information. I'm trying to parse the content on a line-by-line basis and have run into a strange problem.
Any given line looks something like this:
AC=26;AF=0.00519169;AN=5008;NS=2504;DP=17308;EAS_AF=0;AMR_AF=0.0072;AFR_AF=0.0015;EUR_AF=0.0109;SAS_AF=0.0082;AA=A|||;VT=SNP
However, when I print out what is being read in...
import gzip
with gzip.open('myfile.gz', 'rt') as f:
    for line in f:
        print(line)
The line looks like this:
AC=26;AF=0.00519169;AN=5008;NS=2504;DP=17308;EAS_AF=0;AMR_AF=0.0072;AFR_AF=0.0015;EUR_AF=0.0109;SAS_AF=0.0082;AA=A|||
Whatever information comes after the "|||" has been truncated.
Moreover, I can't even search the lines for strings that follow the "|||" (e.g. "VT=SNP" in line always returns False), and line.strip("|||") doesn't help either.
Any advice on what is causing this or what I need to look at?
Thank you for any help
EDIT: ok, it looks like there was something wrong with the gzip file. I uncompressed it and the script ran fine. Then I recompressed it and the script again ran fine (using gzip.open). Is there any straightforward way to compare the two compressed files (ie, the one that doesn't get read properly vs the one that works) so that I might get a hint at the root cause?
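One straightforward check (a sketch only, with placeholder file names) is to compare the decompressed content rather than the compressed bytes, for example by hashing each decompressed stream:

import gzip
import hashlib

def gz_digest(path, chunk_size=1 << 20):
    # hash the decompressed bytes without holding the whole file in memory
    h = hashlib.sha256()
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

print(gz_digest("broken.gz") == gz_digest("working.gz"))

If the digests match, the two files contain the same data and only the compressed representations differ; if the problem file can't be decompressed to the end, this will raise an error, which is itself a useful hint.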

How to remove multiple lines from CSV file in nodejs

I have a CSV file that has 5 lines at the top of the file that I want to remove using Node.js. I then want to add my own header line that better matches the header I would use. I have no control over the original CSV file, so I am unable to do this at the source.
It will be easiest using one of the following modules:
https://www.npmjs.com/package/csv
https://www.npmjs.com/package/tsv
or other that you find in:
https://www.npmjs.com/browse/keyword/csv
https://www.npmjs.com/browse/keyword/tsv
(don't worry if it's CSV or TSV - just make sure that you use the correct delimiter which is comma in your case).
You might do it all manually parsing the file as text but using a module for that will be much less error prone.
(cat good-header.csv; tail -n +6 original-file.csv) > the-result.csv
(tail -n +6 starts output at line 6 of the original file, i.e. it skips the first five lines; adjust the number if you need to skip more or fewer.)

Shortening large CSV on debian

I have a very large CSV file and I need to write an app that will parse it, but using the >6 GB file to test against is painful. Is there a simple way to extract the first hundred or two lines without having to load the entire file into memory?
The file resides on a Debian server.
Did you try the head command?
head -200 inputfile > outputfile
head -10 file.csv > truncated.csv
will take the first 10 lines of file.csv and store them in a file named truncated.csv
"The file resides on a Debian server."- this is interesting. This basically means that even if you use 'head', where does head retrieve the data from? The local memory (after the file has been copied) which defeats the purpose.
