Fastest way to shuffle lines in a file in Linux - linux

I want to shuffle a large file with millions of lines of strings in Linux. I tried 'sort -R', but it is very slow (it takes about 50 minutes for a 16M file). Is there a faster utility that I can use in its place?

Use shuf instead of sort -R (man page).
The slowness of sort -R is probably due to it hashing every line. shuf just does a random permutation so it doesn't have that problem.
(This was suggested in a comment but for some reason not written as an answer by anyone)
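For illustration only, here is a minimal Python sketch of the "plain random permutation" idea for a file that fits in memory. It is not how shuf itself is implemented; the function names shuffle_lines and shuffle_file are made up for this example, and it assumes every line (including the last one) ends with a newline.

```python
import random


def shuffle_lines(lines, seed=None):
    """Return a new list with the given lines in random order."""
    rng = random.Random(seed)   # a seed makes the result reproducible
    shuffled = list(lines)      # copy so the caller's list is untouched
    rng.shuffle(shuffled)       # one pass over the list, no sorting or hashing
    return shuffled


def shuffle_file(src_path, dst_path, seed=None):
    """Read the whole file into memory, permute the lines, write them out."""
    with open(src_path, "r", encoding="utf-8") as src:
        lines = src.readlines()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(shuffle_lines(lines, seed))
```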

The 50 minutes is not caused by the actual mechanics of sorting, based on your description. The time is likely spent waiting on /dev/random to generate enough entropy.
One approach is to use an external source of random data (http://random.org, for example) along with a variation on a Schwartzian Transform. The Schwartzian Transform turns the data to be sorted into "enriched" data with the sort key embedded. The data is sorted using the key and then the key is discarded.
To apply this to your problem:
generate a text file with random numbers, 1 per line, with the same number of lines as the file to be sorted. This can be done at any time, run in the background, run on a different server, downloaded from random.org, etc. The point is that this randomness is not generated while you are trying to sort.
create an enriched version of the file using paste:
paste random_number_file.txt string_data.txt > tmp_string_data.txt
sort this file:
sort tmp_string_data.txt > sorted_tmp_string_data.txt
remove the random data:
cut -f2- sorted_tmp_string_data.txt > random_string_data.txt
This is the basic idea. I tried it and it does work, but I don't have 16 million lines of text or 16 million lines of random numbers. You may want to pipeline some of those steps instead of saving it all to disk.
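The paste / sort / cut steps above can also be sketched in Python. This is only an illustration of the decorate, sort, undecorate idea; the function name shuffle_by_random_keys is hypothetical, and the keys argument stands in for the pre-generated random_number_file.txt.

```python
import random


def shuffle_by_random_keys(lines, keys=None, seed=None):
    """Shuffle lines by attaching a random key to each, sorting, then dropping the key."""
    lines = list(lines)
    if keys is None:
        # No pre-generated keys supplied: draw one random number per line.
        rng = random.Random(seed)
        keys = [rng.random() for _ in lines]
    if len(keys) != len(lines):
        raise ValueError("need exactly one key per line")
    # Decorate: (key, original index, line). The index breaks ties between equal keys.
    decorated = list(zip(keys, range(len(lines)), lines))
    decorated.sort()
    # Undecorate: keep only the line.
    return [line for _key, _index, line in decorated]
```

With keys [3, 1, 2] and lines ["a", "b", "c"], the line with the smallest key comes first, so the result is ["b", "c", "a"].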

You may try my tool: HugeFileProcessor. It's capable of shuffling files of hundreds of GBs in a reasonable time.
Here are the details of the shuffling implementation. It requires specifying batchSize, the number of lines to keep in RAM when writing to output. The larger the better (unless you run out of RAM), because total shuffling time is roughly (number of lines in sourceFile) / batchSize * (time to fully read sourceFile). Please note that the program shuffles the whole file, not on a per-batch basis.
The algorithm is as follows.
Count lines in sourceFile. This is done simply by reading the whole file line by line. (See some comparisons here.) This also measures how long it takes to read the whole file once, so we can estimate how long a complete shuffle will take, because it requires Ceil(linesCount / batchSize) complete file reads.
As we now know the total linesCount, we can create an index array of linesCount size and shuffle it using Fisher–Yates (called orderArray in the code). This gives us the order in which we want the lines to appear in the shuffled file. Note that this is a global order over the whole file, not per batch or chunk.
Now the actual code. We need to get all lines from sourceFile in the order we just computed, but we can't read the whole file into memory. So we split the task.
We go through sourceFile reading all lines and storing in memory only those lines that belong to the first batchSize entries of orderArray. When we have all these lines, we write them into outFile in the required order, and batchSize/linesCount of the work is done.
Next we repeat the whole process again and again, taking the next parts of orderArray and reading sourceFile from start to end for each part. Eventually the whole orderArray is processed and we are done.
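As a rough sketch of the steps above (not HugeFileProcessor's actual code), the same algorithm can be written in a few lines of Python. The function name batched_shuffle and its parameters are made up for this example; the order list plays the role of orderArray.

```python
import random


def batched_shuffle(src_path, dst_path, batch_size, seed=None):
    """Shuffle a file using sequential reads only, keeping batch_size lines in RAM."""
    # Pass 0: count the lines by reading the file once.
    with open(src_path, "rb") as src:
        line_count = sum(1 for _ in src)

    # Global order: order[i] is the source line number written at output position i.
    order = list(range(line_count))
    random.Random(seed).shuffle(order)

    with open(dst_path, "wb") as dst:
        for start in range(0, line_count, batch_size):
            batch = order[start:start + batch_size]
            # Map each wanted source line number to its slot within this batch.
            wanted = {line_no: pos for pos, line_no in enumerate(batch)}
            slots = [None] * len(batch)
            # One full sequential read of the source per batch.
            with open(src_path, "rb") as src:
                for line_no, line in enumerate(src):
                    pos = wanted.get(line_no)
                    if pos is not None:
                        if not line.endswith(b"\n"):
                            line += b"\n"  # final line may lack a newline
                        slots[pos] = line
            dst.writelines(slots)
    return order
```

The number of full reads is Ceil(linesCount / batchSize) plus the initial counting pass, which matches the time estimate given earlier.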
Why does it work?
Because all we do is read the source file from start to end. There are no seeks forward or backward, and that's what HDDs like. The file gets read in chunks according to internal HDD buffers, FS blocks, CPU cache, etc., and everything is read sequentially.
Some numbers
On my machine (Core i5, 16GB RAM, Win8.1, HDD Toshiba DT01ACA200 2TB, NTFS) I was able to shuffle a file of 132 GB (84 000 000 lines) in around 5 hours using batchSize of 3 500 000. With batchSize of 2 000 000 it took around 8 hours. Reading speed was around 118000 lines per second.

Related

Best Way to Save Sensor Data, Split Every x Megabytes in Python

I'm saving sensor data at 64 samples per second into a CSV file. The file is about 150 MB at the end of 24 hours. It takes a bit longer than I'd like to process it, and I need to do some processing in real time.
value = str(milivolts)
logFile.write(str(datet) + ',' + value + "\n")
So I end up with single lines of date and millivolts, up to 150 MB in total. At the end of 24 hours it makes a new file and starts saving to it.
I'd like to know if there is a better way to do this. I have searched but can't find any good information on a compression to use while saving sensor data. Is there a way to compress while streaming / saving? What format is best for this?
While saving the sensor data, is there an easy way to split it into x megabyte files without data gaps?
Thanks for any input.
I'd like to know if there is a better way to do this.
One of the simplest ways is to use a logging framework. It will allow you to configure which compressor to use (if any), the approximate size of a file, and when to rotate logs. You could start with this question. Try experimenting with several different compressors to see if the speed/size trade-off is OK for your app.
While saving the sensor data, is there an easy way to split it into x megabyte files without data gaps?
A logging framework would do this for you based on the configuration. You could combine several different options: have fixed-size logs and rotate at least once a day, for example.
Generally, the size limit is accurate up to the size of one logged line, so if the data is split into lines of reasonable size, this makes life super easy. One line ends up in one file, and the next is written into a new file.
Files also rotate, so you can have the order of the data encoded in the file names:
raw_data_<date>.gz
raw_data_<date>.gz.1
raw_data_<date>.gz.2
In pseudocode it will look like this:
# Parse where to save data, should we compress data,
# what's the log pattern, how to rotate logs etc
loadLogConfig(...)
# any compression, rotation, flushing etc happens here
# but we don't care, and just write to file
logger.trace(data)
# on shutdown, save any temporary buffer to the files
logger.flush()
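One possible concrete version of the pseudocode above, assuming Python's standard logging module is acceptable: RotatingFileHandler rotates by size, and its rotator and namer hooks can gzip each finished file. The helper names (gzip_rotator, gzip_namer, make_sensor_logger) are made up for this sketch, and the resulting file names (for example raw_data.csv.1.gz) differ from the pattern shown earlier.

```python
import gzip
import logging
import logging.handlers
import os
import shutil


def gzip_rotator(source, dest):
    """Compress the finished log file into dest and remove the uncompressed copy."""
    with open(source, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    os.remove(source)


def gzip_namer(default_name):
    """Rotated files get a .gz suffix, e.g. raw_data.csv.1.gz."""
    return default_name + ".gz"


def make_sensor_logger(path, max_bytes, backup_count, name="sensor"):
    """Build a logger that writes bare lines and rotates at roughly max_bytes."""
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backup_count
    )
    handler.rotator = gzip_rotator
    handler.namer = gzip_namer
    handler.setFormatter(logging.Formatter("%(message)s"))  # no timestamp prefix
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # keep sensor rows out of the root logger
    return logger
```

A call such as logger.info(str(datet) + ',' + value) then replaces the logFile.write line from the question. The handler rolls over before writing a record that would exceed max_bytes, so a line is never split across two files. Once backup_count rotated files exist, the oldest one is deleted, so pick a value large enough for the retention you need.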

Why is Spark much faster at reading a directory compared to a list of filepaths?

I have a directory in S3 containing millions of small files. They are small (<10MB) and GZ, and I know it's inefficient for Spark. I am running a simple batch job to convert these files to parquet format. I've tried two different ways:
spark.read.csv("s3://input_bucket_name/data/")
as well as
spark.read.csv("file1", "file2"..."file8million")
where each file given in the list is located in the same bucket and subfolder.
I notice that when I feed in a whole directory, there isn't as much delay at the beginning for the driver indexing files (looks like around 20 minutes before the batch starts). In the UI for 1 directory, there is 1 task after this 20 minutes which looks like the conversion itself.
However, with individual filenames, this time for indexing increases to 2+ hours, and my job to do the conversion in the UI doesn't show up until this time. For the list of files, there are 2 tasks: (1) First one is listing leafs for 8mil files, and then (2) job that looks like the conversion itself.
I'm trying to understand why this is the case. Is there anything different about the underlying read API that would lead to this behaviour?
Spark assumes every path passed in is a directory,
so when given a list of paths, it has to do a list call on each,
which for S3 means 8M LIST calls against the S3 servers,
which are rate limited to about 3k/second, ignoring details like thread count on the client, HTTP connections, etc.,
and with LIST billed at $0.005 per 1000 calls, 8M requests come to $40.
Oh, and as the LIST returns nothing, the client falls back to a HEAD, which adds another S3 API call, doubling execution time and adding another $32 to the query cost.
in contrast,
listing a dir with 8M entries kicks off a single LIST request for the first 1K entries
and 7999 followups
s3a releases do async prefetch of the next page of results (faster, especially if the incremental list iterators are used): one thread to fetch, one to process. It will cost you 4c.
The big directory listing is the more efficient and cost-effective strategy, even ignoring EC2 server costs.
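The request counts behind this comparison can be worked through in a few lines of Python. The price, rate limit and page size below are simply the figures quoted in the answer above, not values checked against current S3 pricing, and the function names are made up for this sketch.

```python
import math

LIST_PRICE_PER_1000 = 0.005   # USD per 1000 LIST calls, as quoted in the answer
LIST_RATE_PER_SECOND = 3000   # approximate rate limit quoted in the answer
PAGE_SIZE = 1000              # entries per LIST page, as quoted in the answer


def list_cost(calls):
    """Cost in USD of the given number of LIST calls."""
    return calls / 1000 * LIST_PRICE_PER_1000


def per_path_listing(n_files):
    """One LIST call per path when each file is passed individually."""
    calls = n_files
    return {
        "calls": calls,
        "cost_usd": list_cost(calls),
        # Lower bound on elapsed time if the rate limit is the only constraint.
        "min_seconds": calls / LIST_RATE_PER_SECOND,
    }


def directory_listing(n_files):
    """Paged listing of one directory: one LIST call per page of entries."""
    calls = math.ceil(n_files / PAGE_SIZE)
    return {"calls": calls, "cost_usd": list_cost(calls)}


if __name__ == "__main__":
    print(per_path_listing(8_000_000))
    print(directory_listing(8_000_000))
```

The extra HEAD fallback per path is left out here because its price is not given in the answer.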

Proc Groovy to parse larger XML into SAS

We tried reading a 3-4 GB XML file using SAS XML Mapper, but when we PROC COPY the data from the XML engine to a SAS dataset it takes almost 5 to 6 minutes, which is too much time for us since we have to process 3000 files a day. We are running almost 10 files in parallel. One table has almost 230 columns.
Is there any faster way to process the XML?
Can we use PROC GROOVY? Will it be efficient? If yes, can anyone provide sample code?
I tried searching online but was not able to find any.
The XML has PII data and is huge, about 3 GB.
The code being run is very simple and straightforward:
filename NHL "/path/ODM.xml";
filename map "/path/odm_map.map";
libname NHL xmlv2 xmlmap=map;
proc copy in=nhl out=work;
run;
Total tables created: 54, of which more than 14 tables have ~18000 records and the remaining tables have ~1000 records.
The Log window shows
NOTE: PROCEDURE COPY used (Total process time):
real time 4:03.72
user cpu time 4:00.68
system cpu time 1.17 seconds
memory 32842.37k
OS Memory 52888.00k
Timestamp 19/05/2020 03:14:43 PM
Step Count 4 Switch Count 802
Page Faults 3
Page Reclaims 17172
Page Swaps 0
Voluntary Context Switches 3662
Involuntary Context Switches 27536
Block Input Operations 504
Block Output Operations 56512
SAS version: 9.4_M2
Total memsize is MEMSIZE=3221225472 on our server.
There are 3000 files in total, of which 1000 will be 3 to 4 GB, some will be 1 GB, and 1000 will be in the KB range. The smaller files are processed quickly; the problem is only with the big files. Processing uses almost the entire CPU.
The copy time from the XML engine varies when we reduce the number of files, but for that to happen we have to change the map file or the input XML.
We have already raised SAS tracks and asked the same question in SAS Communities, still with no luck. It looks like a limitation of the parser itself.
Any idea about the shredder in Teradata? Will it be efficient?
I would do this in two pieces, first convert XML to ascii and then into SAS. SAS isn't going to be very fast at converting XML into SAS; it's just not something SAS is optimized for. You're using nearly entirely CPU time, so you're not disk limited - you're limited by SAS's ability to parse the XML file.
Write a program in a more optimized language that can parse the XML much faster, and then read the results of that into SAS. Python might be one option - it's not super optimized either, but it's more optimized for this sort of thing than SAS I suspect - or an even lower level language (like c/c++) might be your best bet.
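If Python is the chosen route, a minimal sketch of the "XML to flat text first" step could use the standard library's streaming parser so the 3-4 GB file is never fully loaded. The function name and the record_tag / fields parameters are hypothetical: they depend on the actual XML layout, and namespaced documents need fully qualified tags such as "{namespace}Tag".

```python
import csv
import xml.etree.ElementTree as ET


def xml_records_to_csv(xml_path, csv_path, record_tag, fields):
    """Stream an XML file and write one CSV row per record_tag element.

    fields are child element names (ElementTree paths) read from each record.
    Returns the number of rows written.
    """
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(fields)
        # iterparse yields each element once its closing tag has been read.
        for _event, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag == record_tag:
                writer.writerow([elem.findtext(f, default="") for f in fields])
                count += 1
                elem.clear()  # free the record's children; an empty shell remains
    return count
```

The resulting delimited file can then be read into SAS with an ordinary import step. Whether this is actually faster than the XML engine for a given file has to be measured.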

How to design spark program to process 300 most recent files?

Situation
New small files come in periodically. I need to do a calculation on the 300 most recent files. So basically there is a window moving forward. The size of the window is 300 and I need to do the calculation on the window.
But it is very important to know that this is not Spark streaming, because in Spark streaming the unit/scope of a window is time. Here the unit/scope is the number of files.
Solution 1
I will maintain a dict of size 300. Each time a new file comes in, I turn it into a Spark data frame and put it into the dict. Then I make sure the oldest file in the dict is popped out if the length of the dict is over 300.
After this I will merge all data frames in the dict into a bigger one and do the calculation.
The above process will be run in a loop. Every time a new file comes in we go through the loop.
pseudo code for solution 1
for file in file_list:
    data_frame = get_data_frame(file)
    my_dict[ timestamp ] = data_frame
    for timestamp in my_dict.keys():
        if timestamp older than 24 hours:
            # not only unpersist, but also delete to make sure the memory is released
            my_dict[timestamp].unpersist
            del my_dict[ timestamp ]
    # pop one data frame from the dict
    big_data_frame = my_dict.popitem()
    for timestamp in my_dict.keys():
        df = my_dict.get( timestamp )
        big_data_frame = big_data_frame.unionAll(df)
    # Then we run SQL on the big_data_frame to get report
problem for solution 1
Always hit Out of memory or gc overhead limit
question
Do you see anything inappropriate in the solution 1?
Is there any better solution?
Is this the right kind of situation to use spark ?
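One observation: you probably don't want to use popitem. The keys of a Python dictionary are not sorted (popitem removes an arbitrary entry on older Python versions and the most recently inserted one on Python 3.7+), so you can't guarantee that you're popping the earliest item. Note also that popitem returns a (key, value) pair, not just the value. Instead I would recreate the dictionary each time using a sorted list of timestamps. Assuming your filenames are just timestamps: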
my_dict = {file: get_data_frame(file) for file in sorted(file_list)[-300:]}
Not sure if this will fix your problem, can you paste the full stacktrace of your error into the question? It's possible that your problem is happening in the Spark merge/join (not included in your question).
My suggestion for this is streaming, but not with respect to time. You will still have a window and a sliding interval set, say 60 seconds.
So every 60 seconds you get the DStream of file contents, in 'x' partitions. These 'x' partitions represent the files you drop onto HDFS or the file system.
This way you can keep track of how many files/partitions have been read. If there are fewer than 300, wait until there are 300. After the count hits 300 you can start processing.
If it's possible to keep track of the most recent files, or to discover them once in a while, then I'd suggest doing something like
sc.textFile(','.join(files));
or if it's possible to identify specific pattern to get those 300 files, then
sc.textFile("*pattern*");
It's even possible to have comma-separated patterns, but files that match more than one pattern might be read more than once.
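To keep track of the 300 most recent files without holding 300 data frames in memory, one option is to track only the file paths and hand the current list to a single read call, as in the textFile example above. A minimal sketch of that bookkeeping in plain Python follows; the class name RecentFiles is made up for this example and no Spark API is used in it.

```python
from collections import deque


class RecentFiles:
    """Fixed-size window over the most recently added file paths."""

    def __init__(self, size=300):
        # A deque with maxlen drops the oldest entry automatically when full.
        self._window = deque(maxlen=size)

    def add(self, path):
        """Add a new path; return the path that fell out of the window, if any."""
        evicted = None
        if len(self._window) == self._window.maxlen:
            evicted = self._window[0]
        self._window.append(path)
        return evicted

    def paths(self):
        """Current window contents, oldest first."""
        return list(self._window)
```

Each time a file arrives, call add and then pass ','.join(window.paths()) to the read call, so only one data set is built per iteration.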

Linux: sorting a 500GB text file with 10^10 records

I have a 500GB text file with about 10 billion rows that needs to be sorted in alphabetical order. What is the best algorithm to use? Can my implementation and set-up be improved?
For now, I am using the coreutils sort command:
LANG=C
sort -k2,2 --field-separator=',' --buffer-size=(80% RAM) --temporary-directory=/volatile BigFile
I am running this in AWS EC2 on a virtual machine with 120GB RAM and 16 cores. It takes most of the day.
/volatile is a 10TB RAID0 array.
The 'LANG=C' trick delivers a x2 speed gain (thanks to 1)
By default 'sort' uses 50% of the available RAM. Going up to 80-90% gives some improvement.
My understanding is that gnu 'sort' is a variant of the merge-sort algorithm with O(n log n), which is the fastest : see 2 & 3 . Would moving to QuickSort help (I'm happy with an unstable sort)?
One thing I have noticed is that only 8 cores are used. This is related to default_max_threads set to 8 in linux coreutils sort.c (See 4). Would it help to recompile sort.c with 16 ?
Thanks!
FOLLOW-UP :
@dariusz
I used Chris and your suggestions below.
As the data was already generated in batches, I sorted each bucket separately (on several separate machines) and then used the 'sort --merge' function. Works like a charm and is much faster: O(log N/K) vs O(log N).
I also rethought the project from scratch: some data post-processing is now performed while the data is generated, so that some unneeded data (noise) can be discarded before sorting takes place.
All together, data size reduction & sort/merge led to massive reduction in computing resources needed to achieve my objective.
Thanks for all your helpful comments.
The benefit of quicksort over mergesort is no additional memory overhead. The benefit of mergesort is the guaranteed O(n log n) run time, whereas quicksort can be much worse in the event of poor pivot point sampling. If you have no reason to be concerned about the memory use, don't change. If you do, just ensure you pick a quicksort implementation that does solid pivot sampling.
I don't think it would help spectacularly to recompile sort.c. It might be, on a micro-optimization scale. But your bottleneck here is going to be memory/disk speed, not amount of processor available. My intuition would be that 8 threads is going to be maxing out your I/O throughput already, and you would see no performance improvement, but this would certainly be dependent on your specific setup.
Also, you can gain significant performance by taking advantage of the distribution of your data. For example, evenly distributed data can be sorted very quickly with a single bucket sort pass followed by mergesort on each bucket. This has the added benefit of decreasing the total memory overhead of mergesort: if the memory complexity of mergesort is O(N) and you can separate your data into K buckets, your new memory overhead is O(N/K).
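The "sort the pieces separately, then merge" approach (what sort --merge does with already sorted files) can be illustrated in memory with Python's heapq.merge. This is a toy sketch: the function name sort_in_chunks is made up, and with real data each chunk would be a separate file sorted on its own machine.

```python
import heapq


def sort_in_chunks(records, chunk_size, key=None):
    """Sort fixed-size chunks independently, then k-way merge the sorted chunks."""
    chunks = [
        sorted(records[i:i + chunk_size], key=key)
        for i in range(0, len(records), chunk_size)
    ]
    # heapq.merge lazily merges already sorted inputs, holding one item per chunk.
    return list(heapq.merge(*chunks, key=key))


def second_field(line):
    """Sort key mirroring `sort -k2,2 --field-separator=','`."""
    return line.split(",")[1]
```

For example, sort_in_chunks(lines, 1_000_000, key=second_field) sorts each block of a million lines on its second comma-separated field and merges the results.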
Just an idea:
I assume the file contents are generated over quite a long period of time. Write an application (a script?) that periodically moves the file generated so far to a different location, appends its contents to another file, performs a sort on that other file, and repeats until all data is gathered.
That way your system would spend more time sorting, but the results would be available sooner, since sorting partially-sorted data will be faster than sorting the unsorted data.
I think you need to perform that sort in 2 stages:
Split into trie-like buckets, each small enough to fit into memory.
Iterate over the buckets in alphabetical order, fetch each one, sort it, and append it to the output file.
Here is an example.
Imagine you have a bucket limit of only 2 lines, and your input file is:
infile:
0000
0001
0002
0003
5
53
52
7000
On the 1st iteration, you read your input file (the "super-bucket" with an empty prefix) and split it according to the 1st letter.
There would be 3 output files:
0:
000
001
002
003
5:
(empty)
3
2
7:
000
As you see, the bucket with filename/prefix 7 contains only one record, 000, which is "7000" split into 7 (the filename) and 000 (the tail of the string). Since this is just one record, we do not need to split this file any more. But files "0" and "5" contain 4 and 3 records, which is more than the limit of 2. So we need to split them again.
After the split:
00:
00
01
02
03
5:
(empty)
52:
(empty)
53:
(empty)
7:
000
As you see, the files with prefix "5" and "7" are already split, so we just need to split file "00".
After splitting, you will have a set of relatively small files.
Thereafter, run the 2nd stage:
Sort the filenames, and process the files in sorted order.
Sort each file and append the result to the output, prepending the file name to each output string.
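The two stages can be sketched in memory with a short recursive Python function. This is only an illustration of the idea: the name prefix_bucket_sort is made up, buckets are Python lists rather than files on disk, and "limit" is the bucket limit from the example.

```python
from collections import defaultdict


def prefix_bucket_sort(tails, limit, prefix=""):
    """Yield prefix + tail for every tail, in sorted order.

    Buckets with more than `limit` entries are split again by their next character.
    """
    # Small enough (or nothing left to split on): sort this bucket directly.
    if len(tails) <= limit or all(t == "" for t in tails):
        for tail in sorted(tails):
            yield prefix + tail
        return

    buckets = defaultdict(list)
    ended_here = 0  # strings that are exactly equal to the prefix
    for tail in tails:
        if tail == "":
            ended_here += 1
        else:
            buckets[tail[0]].append(tail[1:])  # first char goes into the bucket name

    # A string equal to the prefix sorts before any longer string sharing it.
    for _ in range(ended_here):
        yield prefix
    # Visit the buckets in sorted name order, splitting further where needed.
    for ch in sorted(buckets):
        yield from prefix_bucket_sort(buckets[ch], limit, prefix + ch)
```

Calling list(prefix_bucket_sort(lines, limit=2)) on the example input returns the lines in sorted order, with "5" ahead of "52" and "53".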

Resources