The most efficient way to break down a huge file into chunks of max 10 MB - Linux

There is a 20 GB text file. Each line of the file is a JSON object, and each odd line describes the even line that follows it.
The goal is to divide the big file into chunks of at most 10 MB, with each chunk containing an even number of lines so the pairing doesn't get lost. What would be the most efficient way to do that?
My research so far has made me lean towards:
1. The split utility in Linux. Is there any way to make it always export an even number of lines per chunk based on the size?
2. A modified version of a divide & conquer algorithm. Would this even work?
3. Estimating the average number of lines that fit within the 10 MB limit, then iterating through the file and exporting a chunk whenever it meets that criterion.
I'm thinking that 1. would be the most efficient, but I wanted to get the opinion of experts here.
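For what it's worth, option 3 can be done in a single streaming pass: read the file two lines at a time and start a new chunk whenever the next pair would push the current chunk past 10 MB. A minimal Python sketch (the output file names and the exact limit are assumptions):

    # Minimal sketch: split a file of description/object line pairs into chunks
    # of at most ~10 MB each, always writing an even number of lines per chunk.
    MAX_BYTES = 10 * 1024 * 1024

    def split_in_pairs(path, prefix="chunk"):
        chunk_idx, written = 0, 0
        out = open(f"{prefix}_{chunk_idx:05d}.jsonl", "wb")
        with open(path, "rb") as src:
            while True:
                meta = src.readline()   # odd line: the description
                data = src.readline()   # even line: the described object
                if not meta:
                    break
                pair_size = len(meta) + len(data)
                # start a new chunk if this pair would push us past the limit
                if written and written + pair_size > MAX_BYTES:
                    out.close()
                    chunk_idx += 1
                    written = 0
                    out = open(f"{prefix}_{chunk_idx:05d}.jsonl", "wb")
                out.write(meta)
                out.write(data)
                written += pair_size
        out.close()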

Related

Split a very large csv file into many smaller files

I read this question but it does not seem to address my situation (maybe I misread it):
Splitting very large csv files into smaller files
I have a large CSV file (1.0 GB) with lots of individual rows (over 1 million) and 8 columns. Two columns represent date and time, while the others hold stock-related information (price, etc.). I would like to save individual files separated by the date and time attributes. So, if there are 100 different date-time combinations, I would extract the rows for each combination and save them as a separate CSV file under a subfolder like C:/Date/Time/filename.csv.
I am currently using pandas to filter out each date-time combination, getting a dataset with the required information for that combination, and then saving each file inside a loop over the list of date-time combinations. This is taking a very long time.
Is there a better way to accomplish this?
(I will look into multithreading as well but I don't believe it will solve the speed issue significantly).
Thanks!
Adding a sample of input data:
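A single pandas groupby pass is usually far faster than filtering the frame once per combination, since each group is written out as it is produced. A hypothetical sketch (the column names, input file name, and output layout are assumptions):

    # Hypothetical sketch: one CSV per (date, time) group in a single groupby pass.
    import os
    import pandas as pd

    df = pd.read_csv("stock_data.csv")  # assumed input file

    for (date, time), group in df.groupby(["Date", "Time"]):
        # note: a time like 09:30 would need sanitizing to be a valid Windows folder name
        out_dir = os.path.join("C:/", str(date), str(time))
        os.makedirs(out_dir, exist_ok=True)
        group.to_csv(os.path.join(out_dir, "filename.csv"), index=False)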

Deleting column of huge fits file

I have a relatively large dataset (~30 GB) in FITS format which does not fit into memory. At some point I need to add one column and later remove it from that file.
Adding the column seems to work with the help of the fitsio.insert_column method; however, I have not come across a similar method in the various packages that removes the column again without having to read in the whole data. Am I missing an obvious function, or is it something about the FITS format that does not allow this to be done in a simple and memory-efficient fashion?
Thanks a lot for your help!
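I'm not aware of an in-place delete either; one workaround is to copy the table to a new file chunk by chunk, selecting every column except the unwanted one. A sketch of that idea with fitsio (it assumes the FITS, get_colnames, get_nrows, read, write and append calls behave as documented; the chunk size is arbitrary):

    # Sketch: rewrite a FITS table extension without one column, in row chunks,
    # so the full ~30 GB table never has to be held in memory at once.
    import numpy as np
    import fitsio

    def drop_column(src_path, dst_path, ext, column, chunk_rows=1_000_000):
        with fitsio.FITS(src_path) as src, fitsio.FITS(dst_path, "rw", clobber=True) as dst:
            hdu = src[ext]
            keep = [c for c in hdu.get_colnames() if c != column]
            nrows = hdu.get_nrows()
            for start in range(0, nrows, chunk_rows):
                rows = np.arange(start, min(start + chunk_rows, nrows))
                data = hdu.read(columns=keep, rows=rows)
                if start == 0:
                    dst.write(data)       # create the new table HDU
                else:
                    dst[-1].append(data)  # append the remaining chunks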

Replacing null values with zeroes in multiple columns [Spotfire]

I have about 100 columns with some empty values that I would like to replace with zeroes. I know how to do this with a single column using Calculate and Replace, but I wanted to see if there was a way to do this with multiple columns at once.
Thanks!
You could script it, but it would probably take you as long to write the script as it would to do it manually with a transformation. A better idea would be to fix it in the data source itself before you import it, so Spotfire doesn't have to do the transformation every time; if you are dealing with a large amount of data, that repeated transformation could hinder your performance.
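For example, if the data Spotfire imports comes from a CSV or similar file you can preprocess, a single pandas fillna call will zero-fill every column before Spotfire ever sees the nulls (file names here are placeholders):

    # Illustration only: zero-filling nulls at the source with pandas, assuming
    # the imported data comes from a CSV you can preprocess.
    import pandas as pd

    df = pd.read_csv("source_data.csv")        # placeholder file name
    df = df.fillna(0)                          # fill empties in all ~100 columns
    df.to_csv("source_data_zeroed.csv", index=False)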

Improve Vlookup on large file

I've a very large file that I reduced as much as possible to 3 columns and 80k rows.
I need to perform a VLOOKUP in order to bring in values from column 1 or 2 where they match values in some other spreadsheets.
The thing is, Excel doesn't seem to support such large searches and it stops responding - the computer has 4 GB of RAM and a quad core, and not much else running at the same time.
As far as I understand, since I'm not looking for exact matches, I should not use INDEX-MATCH.
The only thing I thought could help, though I'm not sure about it, is dividing the file into 2-4 parts and asking Excel to run several parallel searches instead of one big one. Could this work?
What else should I try?
Thanks!!!
Sort your data and use TRUE as the 4th VLOOKUP argument. This makes VLOOKUP use a binary search rather than a linear search and is lightning fast.
If you need to handle missing data you will need to use the double VLOOKUP trick, see
http://fastexcel.wordpress.com/2012/03/29/vlookup-tricks-why-2-vlookups-are-better-than-1-vlookup/
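In outline, the double-VLOOKUP trick described at that link first does an approximate lookup to check that the key actually exists, and only then does a second approximate lookup to fetch the value, roughly along the lines of =IF(VLOOKUP(A1,Table,1,TRUE)=A1, VLOOKUP(A1,Table,2,TRUE), NA()) - the cell and range names here are placeholders.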

Fast repeated row counting in vast data - what format?

My Node.js app needs to index several gigabytes of timestamped CSV data, in such a way that it can quickly get the row count for any combination of values, either for each minute in a day (1440 queries) or for each hour in a couple of months (also 1440). Let's say in half a second.
The column values will not be read, only the row counts per interval for a given permutation. Reducing time to whole minutes is OK. There are rather few possible values per column, between 2 and 10, and some depend on other columns. It's fine to do preprocessing and store the counts in whatever format is suitable for this single task - but what format would that be?
Storing actual values is probably a bad idea, with millions of rows and little variation.
It might be feasible to generate a short code for each combination and match with regex, but since these codes would have to be duplicated each minute, I'm not sure it's a good approach.
Or it could use an embedded database like SQLite, NeDB or TingoDB, but I am not entirely convinced, since they don't have native enum-like types and may or may not be made for this kind of counting. But maybe it would work just fine?
This must be a common problem with an idiomatic solution, but I haven't figured out what it might be called. Knowing what to call this and how to think about it would be very helpful!
I will answer with my own findings for now, but I'm still interested to know more about the theory behind this problem.
NeDB was not a good solution here, as it saved my values as plain JSON under the hood, repeating the key names for each row and adding unique IDs. It wasted lots of space and would surely have been too slow, even if only because of disk I/O.
SQLite might be better at compressing and indexing data, but I have yet to try it. Will update with my results if I do.
Instead I went with the other approach I mentioned: assign a unique letter to each column value we come across and get a short string representing a permutation. Then for each minute, add these strings as keys iff they occur, with the number of occurrences as values. We can later use our dictionary to create a regex that matches any set of combinations, and run it over this small index very quickly.
This was easy enough to implement, but would of course have been trickier if I had had more possible column values than the about 70 I found.
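The original app is Node.js, but the idea is easy to show in a short Python sketch (the column names, the alphabet, and the regex interface are assumptions, not the actual implementation):

    # Sketch of the permutation-coding idea: one symbol per distinct column value,
    # a short code string per row, and per-minute counts of each code.
    import re
    import string
    from collections import defaultdict

    ALPHABET = string.ascii_letters + string.digits   # needs >= 1 symbol per distinct value
    codes = {}                                        # (column, value) -> symbol

    def encode(row):
        """Turn a row (dict of column -> value) into a short permutation code."""
        symbols = []
        for col in sorted(row):                       # fixed column order
            key = (col, row[col])
            if key not in codes:
                codes[key] = ALPHABET[len(codes)]
            symbols.append(codes[key])
        return "".join(symbols)

    # index[minute][code] = number of rows with that permutation in that minute
    index = defaultdict(lambda: defaultdict(int))

    def add_row(minute, row):
        index[minute][encode(row)] += 1

    def count(minute, pattern):
        """Count rows in one minute whose codes match a regex over the symbols."""
        rx = re.compile(pattern)
        return sum(n for code, n in index[minute].items() if rx.fullmatch(code))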

Resources