Shortening a large CSV on Debian - Linux

I have a very large CSV file and I need to write an app that will parse it, but testing against the full >6 GB file is painful. Is there a simple way to extract the first hundred or two hundred lines without having to load the entire file into memory?
The file resides on a Debian server.

Did you try the head command?
head -200 inputfile > outputfile

head -10 file.csv > truncated.csv
will take the first 10 lines of file.csv and store them in a file named truncated.csv.

"The file resides on a Debian server."- this is interesting. This basically means that even if you use 'head', where does head retrieve the data from? The local memory (after the file has been copied) which defeats the purpose.

Related

Partially expand VCF bgz file in Linux

I have downloaded gnomAD files from - https://gnomad.broadinstitute.org/downloads
This is the bgz file
https://storage.googleapis.com/gnomad-public/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.2.vcf.bgz
When I expand using:
zcat gnomad.genomes.r2.1.1.sites.2.vcf.bgz > gnomad.genomes.r2.1.1.sites.2.vcf
The output VCF file becomes more than 330GB. I do not have that kind of space available on my laptop.
Is there a way where I can just expand - say 1 GB of the bgz file OR just 100000 rows from the bgz file?
From what I've been able to determine, a bgz file is compatible with gzip, and a VCF file is plain text. Since it's a gzip file and not a .tar.gz, the solution doesn't require listing any archive contents, which simplifies things a bit.
This can probably be accomplished in several ways, and I doubt this is the best way, but I've been able to successfully decompress the first 100,000 rows into a file using the following code in python3 (it should also work under earlier versions back to 2.7):
#!/usr/bin/env python3
import gzip

# Read the bgz file (gzip-compatible) as a stream and copy the first N lines.
ifile = gzip.GzipFile("gnomad.genomes.r2.1.1.sites.2.vcf.bgz")
ofile = open("truncated.vcf", "wb")
LINES_TO_EXTRACT = 100000
for _ in range(LINES_TO_EXTRACT):
    ofile.write(ifile.readline())
ifile.close()
ofile.close()
I tried this on your example file, and the truncated file is about 1.4 GiB. It took about 1 minute, 40 seconds on a Raspberry Pi-like computer, so while it's slow, it's not unbearably so.
While this solution is somewhat slow, it's good for your application for the following reasons:
It minimizes disk and memory usage, which could otherwise be problematic with a large file like this.
It cuts the file to exactly the given number of lines, which avoids truncating your output file mid-line.
The three input parameters can be easily parsed from the command line in case you want to make a small CLI utility for parsing other files in this manner.
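If you just want a quick sample and prefer to stay in the shell, a roughly equivalent one-liner (a sketch, assuming zcat and head are available, as they are on most Linux systems) is:
zcat gnomad.genomes.r2.1.1.sites.2.vcf.bgz | head -n 100000 > truncated.vcf
Because head exits after writing the requested number of lines, the pipeline stops early and never decompresses the whole 330 GB.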

How to remove the nth occurrence of a substring from each line on four 100GB files

I have four 100 GB CSV files where two fields need to be concatenated. Luckily, the two fields are next to each other.
My thought is to remove the 41st occurrence of "," from each line, and then my two fields will be properly joined and ready to upload to the analytical tool that I use.
The development machine is a Windows 10 box with 4 x 3.6 GHz cores and 64 GB of RAM, and I push the files to a CentOS 7 server with 40 x 2.4 GHz cores and 512 GB of RAM. I have sudo access on the server, and I can technically change the file there if someone has a solution that depends on Linux tools. The idea is to accomplish the task in the fastest/easiest way possible; I have to repeat it monthly and would be ecstatic to automate it.
My original way of accomplishing this was to load the CSV into MySQL, concatenate the fields, drop the old fields, export the table as a CSV again, and push it to the server. This takes two days and is laborious.
Right now I'm torn between learning to use sed and using something I'm more familiar with, like node.js, to stream the files line by line into a new file and then push those to the server.
If you recommend using sed, I've read here and here but don't know how to remove the nth occurrence from each line.
Edit: Cyrus asked for a sample input/output.
Input file formatted thusly:
"field1","field2",".........","field41","field42","......
Output file formatted like so:
"field1","field2",".........","field41field42","......
If you want to remove the 41st occurrence of "," (the quoted-comma separator) then you can try:
sed -i 's/","//41' file
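Since there are four files and this is a monthly job, a minimal sketch of wrapping that in a loop on the CentOS box (the file names are placeholders; the answer above edits in place with -i, while this writes separate *_merged.csv copies instead):
for f in file1.csv file2.csv file3.csv file4.csv; do
sed 's/","//41' "$f" > "${f%.csv}_merged.csv"
done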

Combine lots of CSV files in CentOS?

I have a CentOS machine and I want to combine .csv data.
I have thousands of small documents all with the same column information.
How would I go about combining all of them into files of up to 20 MB in size?
For example, 1.csv would hold the data from the first few files, and once the 20 MB limit is reached the data would continue into 2.csv, and so on.
Any help is greatly appreciated
If they don't have headers, something as simple as:
$ cat *.csv > combined.csv
would work, run from the directory containing the files (assuming you want them in the order returned by ls *.csv).
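To also respect the 20 MB limit from the question, one option (a sketch, assuming GNU split from coreutils; the part_ prefix is arbitrary) is to cut the combined output on line boundaries at roughly that size:
cat *.csv | split -C 20m -d --additional-suffix=.csv - part_
-C 20m keeps each output file to at most 20 MB without splitting a line in half, producing part_00.csv, part_01.csv, and so on.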
You can achieve what you want with a simple tail command:
tail -q -n+2 *.csv
You only need to add the proper header row afterwards.
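A sketch of how the two steps fit together, assuming every file shares the same header and 1.csv is one of them:
head -n 1 1.csv > combined.csv
tail -q -n+2 *.csv >> combined.csv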
You might want to look at the join utility: https://www.gnu.org/software/coreutils/manual/html_node/join-invocation.html#join-invocation

multiple file view like DB-view

Is it possible, using bash, to create a view/virtual file that, when opened, combines 2 files into 1?
example:
FILE_META_1.txt
FILE_META_2.txt
combines into
FILE_META.txt
In general, this is not possible. I assume you mean you want to logically link 2 files without creating a 3rd file that is the sum of the 2 files. I've often wanted this feature too. It would have to be done at the kernel level or via a special file system, perhaps using FUSE. UnionFS provides this for directories, but not for files. FuseFile looks like it does what you want. Also take a look at the Logic File System.
You can open them as a single stream with process substitution:
cat <(cat FILE_META_1.txt; cat FILE_META_2.txt)
Here <(...) expands to a named pipe path, which the reading program can open and access like a regular file for input.
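For example, a tool that expects a file name on its command line can be pointed at the combined view without ever materializing FILE_META.txt (a sketch using wc purely for illustration):
wc -l <(cat FILE_META_1.txt FILE_META_2.txt)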

Download multiple files, with different final names

OK, what I need is fairly simple.
I want to download LOTS of different files (from a specific server) via cURL, and I want to save each one of them under a specific new filename on disk.
Is there an existing way (parameter, or whatever) to achieve that? How would you go about it?
(If there were an option to feed all URL-filename pairs to cURL from a text file, one pair per line, that would be ideal.)
E.g.
http://www.somedomain.com/some-image-1.png --> new-image-1.png
http://www.somedomain.com/another-image.png --> new-image-2.png
...
OK, I just figured out a simple way to do it myself.
1) Create a text file with pairs of URL (what to download) and filename (how to save it to disk), separated by a comma (,), one pair per line, and save it as input.txt.
2) Use the following simple Bash script:
# each line of input.txt is URL,filename
while read -r line; do
IFS=',' read -ra PART <<< "$line"
curl "${PART[0]}" -o "${PART[1]}"
done < input.txt
*Haven't thoroughly tested it yet, but I think it should work.
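If you'd rather skip the shell loop entirely, curl can read url/output pairs from a config file passed with -K/--config (a sketch; urls.txt is a hypothetical config file, separate from the comma-separated input.txt above):
cat > urls.txt <<'EOF'
url = "http://www.somedomain.com/some-image-1.png"
output = "new-image-1.png"
url = "http://www.somedomain.com/another-image.png"
output = "new-image-2.png"
EOF
curl -K urls.txt
Each output entry names the file for the url it is paired with, mirroring repeated -o options on the command line.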
