How can I find & delete duplicate strings from ~800gb worth of text files?

I have a dataset of ~800 GB worth of text files, about 50k .txt files in total.
I'd like to go through them and make a master .txt file, with all duplicate lines removed across all the txt files.
I can't find a way to do this that isn't going to take months for my computer to process; ideally I'd like to keep it under a week.

sort -u <data.txt >clean.txt
All you need is a large disk.
sort is quite efficient: it will automatically split the file into manageable chunks, sort each one separately, then merge them (merging already-sorted runs is roughly linear in the total data per pass); and while merging, it will discard the duplicates (because of the -u option). But you will need at least the space for the output file, plus the space for all the intermediate files.
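For ~50k input files, a minimal sketch of how this could look with GNU sort (the data directory, temp directory, memory budget and core count below are illustrative assumptions, not requirements):

# Feed all file names to sort via --files0-from so the shell's argument-length
# limit is never an issue with ~50k files (GNU sort only).
# -u drops duplicate lines while merging; -S and --parallel are tuning knobs;
# -T points the intermediate run files at a disk with enough free space.
find data/ -name '*.txt' -print0 |
  sort -u -S 50% --parallel=8 -T /mnt/bigdisk/tmp --files0-from=- > master.txt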

Related

If Spark reads data from multiple files via glob, then does some mapping, then does take(5), would it read only the first file?

I have multiple large files, and I use a glob that matches them all to read them into a single dataframe. Then I do some mapping, i.e. processing rows independently from each other. For development purposes, I don't want to process the whole dataset, so I'm thinking of doing a df.take(5). Will Spark be smart enough to realize that it only needs to read the first five rows of the first file? Thanks!
I'm hoping it will only read the first five records, but I don't know if it does.

Fastest way to sort very large files preferably with progress

I have a 200 GB flat file (one word per line) and I want to sort the file, then remove the duplicates and create one clean final TXT file out of it.
I tried sort with --parallel but it ran for 3 days and I got frustrated and killed the process, as I didn't see any changes to the chunk files it created in /tmp.
I need to see the progress somehow and make sure it's not stuck and is actually working. What's the best way to do so? Are there any Linux tools or open-source projects dedicated to something like this?
I don't use Linux, but if this is GNU sort, you should be able to see the temporary files it creates from another window to monitor progress. The parallel feature only helps during the initial pass that sorts and creates the initial list of temporary files. After that, the default is a 16-way merge.
Say, for example, the first pass is creating temp files around 1 GB in size. In this case, GNU sort will end up creating 200 of these 1 GB temp files before starting the merge phase. The 16-way merge means that 16 of those temp files will be merged at a time, creating temp files of size 16 GB, and so on.
So one way to monitor progress is to monitor the creation of those temporary files.
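One hedged way to do that, assuming GNU sort and a temp directory chosen with -T (the file and directory names below are placeholders):

# Run the sort with a known temp directory...
sort -u --parallel=4 -T /data/sort-tmp words.txt > clean.txt &
# ...and, in another window, watch the temp files appear and grow.
# A rising count/size means the first pass is still writing runs; files being
# consumed and replaced by larger ones means the merge phase is running.
watch -n 30 'ls /data/sort-tmp | wc -l; du -sh /data/sort-tmp'

If pv is installed, piping the input through it (pv words.txt | sort -u ...) at least gives a progress bar for the initial read pass.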

Creating .txt output with string and numbers in matlab

I'm new to MATLAB.
I have multiple .txt files, up to 1000 each, with content like the following:
09.10.2015,08:17:02,51683,8,3286,78,6,7,0,13
I'm trying to merge all the .txt files together to create one big .txt file that I can use for further analysis.
The .txt files have the same number of columns but a different number of lines.
I don't have difficulties merging the files when they contain only numbers, but the date and time columns cause difficulties.
I would really appreciate any help you could give.
This is not a job for MATLAB, since you would be reading the data (with formatting) and then writing it back out (creating a new file), which is inefficient and could blow up your memory if you have really big data.
This is a job for the shell. On Unix, something like:
cat *.txt > bigFile.txt
Or on Windows:
cat *.txt >> bigFile.txt
Or
copy /b *.txt bigFile.txt
If you do want to do it in MATLAB, you just need to read all the files and store them somehow (in a matrix, in a cell array), whatever suits you better.
Use fopen and fread (with the file identifier fid they return), or even simpler, fscanf: http://www.mathworks.com/help/matlab/ref/fscanf.html
Once you have your information organized, just write it out with fprintf: http://www.mathworks.com/help/matlab/ref/fprintf.html

Split many CSV files into a few bigger files in Linux

I have a bunch of small CSV files (a few hundred files, about 100 MB each) that I want to pack into several bigger files. I know how to join all (or a subset) of those files into one file: I simply need to use the cat command in Linux and redirect its output to a file. My problem is that the resulting files must not be bigger than some size (say, 5 GB), i.e. merging all the small files into one is not a solution because that file would be too big. So I am wondering if there is a way to do it on the command line that would be simpler than writing a bash script looping over the directory?
Thanks.
The split command does exactly what you need. You can have it split STDIN into different output files based on size or number of lines. You can also specify the output file suffix.
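A sketch of how that could look here, assuming GNU split and a 5 GB limit (the file names and the merged_ prefix are placeholders):

# Concatenate the small CSVs and cut the stream into files of at most 5 GB,
# never splitting in the middle of a line (-C limits bytes per output file but
# keeps whole lines); -d gives numeric suffixes: merged_00.csv, merged_01.csv, ...
cat *.csv | split -C 5G -d --additional-suffix=.csv - merged_

Note that if every CSV has its own header row, those headers will end up scattered through the output; strip or skip them first if that matters.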

handling lots of temporary small files

I have a web server which saves cache files and keeps them for 7 days. The file names are md5 hashes, i.e. exactly 32 hex characters long, and are being kept in a tree structure that looks like this:
00/
  00/
    00000ae9355e59a3d8a314a5470753d8
    .
    .
  01/
You get the idea.
My problem is that deleting old files is taking a really long time. I have a daily cron job that runs
find cache/ -mtime +7 -type f -delete
which takes more than half a day to complete. I worry about scalability and the effect this has on the performance of the server. Additionally, the cache directory is now a black hole in my system, trapping the occasional innocent du or find.
The standard solution to an LRU cache is some sort of heap. Is there a way to scale this to the filesystem level?
Is there some other way to implement this in a way which makes it easier to manage?
Here are ideas I considered:
Create 7 top directories, one for each week day, and empty one directory every day. This increases the seek time for a cache file 7-fold, makes it really complicated when a file is overwritten, and I'm not sure what it will do to the deletion time.
Save the files as blobs in a MySQL table with indexes on name and date. This seemed promising, but in practice it's always been much slower than FS. Maybe I'm not doing it right.
Any ideas?
When you store a file, make a symbolic link to a second directory structure that is organized by date, not by name.
Retrieve your files using the "name" structure, delete them using the "date" structure.
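A rough sketch of that idea, assuming a parallel by-date tree next to the existing by-name tree (all paths and the date format are illustrative):

# Store: write the file into the by-name tree, then link it into today's
# by-date directory so that expiry never has to scan the whole cache.
hash=00000ae9355e59a3d8a314a5470753d8
name_path="cache/${hash:0:2}/${hash:2:2}/$hash"
day=$(date +%F)
mkdir -p "cache/by-date/$day"
ln -s "$PWD/$name_path" "cache/by-date/$day/$hash"

# Expire: remove everything stored 7 days ago by walking only that day's links.
old=$(date -d '7 days ago' +%F)        # GNU date
for link in cache/by-date/"$old"/*; do
  [ -L "$link" ] || continue           # skip if the directory is empty
  rm -f "$(readlink "$link")" "$link"
done
rmdir "cache/by-date/$old" 2>/dev/null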
Assuming this is ext2/3, have you tried enabling indexed directories? When you have a large number of files in any particular directory, the lookup needed to delete something will be painfully slow.
Use tune2fs -O dir_index to enable the dir_index feature (existing directories only get indexed after an e2fsck -D pass).
When mounting the file system, make sure to use the noatime option, which stops the OS from updating access-time information for the directories (it still has to modify them when files are added or removed).
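For instance (the device and mount point below are placeholders; this assumes the cache lives on its own ext2/3 filesystem that can be unmounted briefly):

tune2fs -O dir_index /dev/sdb1               # enable hashed directory lookups
umount /var/cache/app
e2fsck -fD /dev/sdb1                         # -D optimizes/reindexes existing directories
mount -o noatime /dev/sdb1 /var/cache/app    # stop atime updates on every read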
Looking at the original post, it seems as though you only have 2 levels of indirection to the files, which means you can have a huge number of files in the leaf directories. When there are more than a million entries in these, you will find that searches and changes are terribly slow. An alternative is to use a deeper hierarchy of directories, reducing the number of items in any particular directory and therefore the cost of searching and updating it.
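For instance, one extra 2-character level already divides the number of files per leaf directory by 256 (a sketch, not necessarily the poster's exact layout):

hash=00000ae9355e59a3d8a314a5470753d8
path="cache/${hash:0:2}/${hash:2:2}/${hash:4:2}/$hash"   # cache/00/00/0a/00000ae9...
mkdir -p "$(dirname "$path")"
# write the cache entry to "$path" as before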
Reiserfs is relatively efficient at handling small files. Did you try different Linux file systems? I'm not sure about delete performance - you can consider formatting (mkfs) as a substitute for individual file deletion. For example, you can create a different file system (cache1, cache2, ...) for each weekday.
How about this:
Have another folder called, say, "ToDelete"
When you add a new item, get today's date and look for a subfolder in "ToDelete" that has a name indicative of the current date
If it's not there, create it
Add a symbolic link to the item you've created in today's folder
Create a cron job that goes to the folder in "ToDelete" which is of the correct date and deletes all the items that are linked from it.
Then delete the folder which contained all the links.
How about having a table in your database that uses the hash as the key? The other field would then be the name of the file. That way the file can be stored in a date-related fashion for fast deletion, and the database can be used to quickly find the file's location from its hash.
