Split many CSV files into a few bigger files in Linux

I have a bunch of small CSV files (a few hundred files, about 100 MB each) that I want to pack into several bigger files. I know how to join all (or a subset) of those files into one file: I simply use the cat command in Linux and redirect its output to a file. My problem is that the resulting files must be no bigger than some size (say, 5 GB), so merging all the small files into one is not a solution because the result would be too big. Is there a way to do this on the command line that is simpler than writing a bash script that loops over the directory?
Thanks.

The split command does exactly what you need. It can split STDIN into multiple output files based on size or number of lines, so you can pipe the output of cat straight into it. You can also specify the output file name prefix and the form of the suffix.
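A minimal sketch of this approach, assuming GNU coreutils split. The directory, the file names, and the 40-byte cap are made up so the demo stays tiny; for the case in the question the cap would be something like 5G. The -C option is used instead of -b so that no CSV line is cut in half at a chunk boundary.

```shell
# Hypothetical demo data: four small CSV files without header rows.
workdir=$(mktemp -d)
for i in 1 2 3 4; do
    printf '%s,aaaaaaaaaa\n%s,bbbbbbbbbb\n' "$i" "$i" > "$workdir/part$i.csv"
done

# -C 40 : put at most 40 bytes of whole lines into each output file
# -d    : use numeric suffixes (00, 01, ...) instead of aa, ab, ...
# -     : read from standard input
# The last argument is the prefix of the output file names.
cat "$workdir"/part*.csv | split -C 40 -d - "$workdir/merged_"

# Show the size in bytes of each resulting chunk.
wc -c "$workdir"/merged_*
```

If the small files carry a header row, every header ends up in the merged stream as an ordinary line, so the headers would need to be stripped before piping into split.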

Related

Disk read performance - Does splitting 100k+ files into subdirectories help read them faster?

I have 100k+ small JSON data files in one directory (not by choice). When accessing each of them, does a flat vs. pyramid directory structure make any difference? Does it help Node.js/Nginx/the filesystem retrieve them faster if the files are grouped, e.g. by first letter, into corresponding directories?
In other words, is it faster to get baaaa.json from /json/b/ (only b*.json here) than to get it from /json/ (all files), when it is safe to assume that each subdirectory contains 33 times fewer files? Does that make finding each file 33x faster? Or is there any disk read difference at all?
EDIT (re jfriend00's comment): I am not sure what the underlying filesystem will be yet. But let's assume an S3 bucket.

How can I find & delete duplicate strings from ~800 GB worth of text files?

I have a dataset of ~800 GB worth of text files, with about 50k .txt files in total.
I'd like to go through them and make a master .txt file with all duplicate lines removed across all the .txt files.
I can't find a way to do this that isn't going to take months for my computer to process; ideally I'd like to keep it to less than a week.
sort -u <data.txt >clean.txt
All you need is a large disk.
sort is quite efficient: it will automatically split the file into manageable bites, sort each one separately, then merge them (which can be done in O(N) time); and while merging, it will discard the dupes (due to -u option). But you will need at least the space for the output file, plus the space for all the intermediate files.
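A small sketch of the same idea applied to several input files at once, assuming GNU sort. The file names and contents are made up for the demo. The -T option points the intermediate files at a directory with enough free space, and LC_ALL=C is an added assumption here: it makes sort compare raw bytes, which is usually faster but changes the ordering for non-ASCII text.

```shell
# Hypothetical demo data: two text files with overlapping lines.
workdir=$(mktemp -d)
printf 'pear\napple\npear\n' > "$workdir/a.txt"
printf 'apple\nfig\n'        > "$workdir/b.txt"

# Directory for sort's intermediate files; on a real run this should
# live on the large disk mentioned above.
mkdir "$workdir/tmp"

# sort accepts many input files; -u drops duplicate lines while merging.
LC_ALL=C sort -u -T "$workdir/tmp" "$workdir/a.txt" "$workdir/b.txt" > "$workdir/clean.txt"

cat "$workdir/clean.txt"
```

With tens of thousands of input files a shell glob may exceed the argument-length limit; GNU sort's --files0-from option, fed by find -print0, is one way around that.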

Fastest way to sort very large files preferably with progress

I have a 200GB flat file (one word per line) and I want to sort the file, then remove the duplicates and create one clean final TXT file out of it.
I tried sort with --parallel but it ran for 3 days and I got frustrated and killed the process as I didn't see any changes to the chunk of files it created in /tmp.
I need to see the progress somehow and make sure it's not stuck and is still working. What's the best way to do so? Are there any Linux tools or open source projects dedicated to something like this?
I don't use Linux, but if this is GNU sort, you should be able to see the temporary files it creates from another window to monitor progress. The parallel feature only helps during the initial pass that sorts and creates the initial list of temporary files. After that, the default is a 16-way merge.
Say, for example, the first pass is creating temp files around 1 GB in size. In this case, GNU sort will end up creating 200 of these 1 GB temp files before starting the merge phase. The 16-way merge means that 16 of those temp files will be merged at a time, creating temp files of size 16 GB, and so on.
So one way to monitor progress is to monitor the creation of those temporary files.
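A rough sketch of such monitoring. The helper name sort_progress is hypothetical, and it assumes GNU sort's usual temp file naming (sortXXXXXX) in the directory given with -T, or in $TMPDIR or /tmp when -T is not used.

```shell
# Hypothetical helper: count the sort temp files currently in a directory.
sort_progress() {
    dir=$1
    # GNU sort names its temp files sortXXXXXX; only look at the top level.
    count=$(find "$dir" -maxdepth 1 -type f -name 'sort*' 2>/dev/null | wc -l | tr -d ' ')
    echo "temp files: $count"
}

# One-off check of the default temp directory.
sort_progress "${TMPDIR:-/tmp}"

# To poll once a minute from another terminal while sort is running:
#   while sleep 60; do date; sort_progress /tmp; done
```

During the first pass the count should keep growing; during the merge phase small files disappear and larger ones appear. Running ls -lh on the same directory shows the sizes as well.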

Creating .txt output with string and numbers in matlab

I'm new to MATLAB.
I have multiple .txt files, with up to 1000 lines each, with content like the following:
09.10.2015,08:17:02,51683,8,3286,78,6,7,0,13
I'm trying to merge all .txt files together to create one big .txt file that I can use for further analysis.
The .txt files have the same number of columns but different number of lines.
I have no difficulty merging the files if they contain only numbers, but the date and time cause problems.
Really would appreciate any help you could give.
This is not a job for MATLAB, as you would be reading data (with a format) and then writing it back out (creating a new file), which is inefficient and could blow up your memory if you have really big data.
This is a job for Bash on Unix, something like:
cat *.txt > bigFile.txt
Or in Windows (PowerShell, where cat is an alias for Get-Content):
cat *.txt >> bigFile.txt
Or, in the Windows command prompt:
copy /b *.txt bigFile.txt
You just need to read all files, store them somehow (in a matrix, in a cell), whatever suits you better.
Use fopen, fread, fid, or even simpler - http://www.mathworks.com/help/matlab/ref/fscanf.html
Once you have your information totally organized, just use this function - http://www.mathworks.com/help/matlab/ref/fprintf.html

Optimal directory structure for saving large number of files

Software we developed generates a growing number of files, currently about 70,000 per day, 3-5 MB each. We store these files on a Linux server with an ext3 file system. The software creates a new directory every day and writes the files generated that day into this directory. Writing and reading such a large number of files is getting slower and slower (I mean, per file), so one of my colleagues suggested creating a new subdirectory every hour. We will test whether this makes the system faster, but the problem can be generalized:
Has anyone measured the speed of writing and reading files, as a function of the number of files in the target directory? Is there an optimal file count above which it's faster to put the files into subdirectories? What are the important parameters which may influence the optimum?
Thank you in advance.
