How to get unique lines from a very large file in Linux?

I have a very large data file (255G; 3,192,563,934 lines). Unfortunately I only have 204G of free space on the device (and no other devices I can use). I did a random sample and found that in a given, say, 100K lines, there are about 10K unique lines... but the file isn't sorted.
Normally I would use, say:
pv myfile.data | sort | uniq > myfile.data.uniq
and just let it run for a day or so. That won't work in this case because I don't have enough space left on the device for the temporary files.
I was thinking I could use split, perhaps, and do a streaming uniq on maybe 500K lines at a time into a new file. Is there a way to do something like that?
I thought I might be able to do something like
tail -100000 myfile.data | sort | uniq >> myfile.uniq && trunc --magicstuff myfile.data
but I couldn't figure out a way to truncate the file properly.

Use sort -u instead of sort | uniq
This allows sort to discard duplicates earlier, and GNU coreutils is smart enough to take advantage of this.
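Applied to the pipeline from the question, that would look like:
pv myfile.data | sort -u > myfile.data.uniq
sort -u still needs temporary space, but since duplicates are discarded early, the temp files stay smaller on data this repetitive. If space is still tight, GNU sort's --compress-program option (e.g. --compress-program=gzip) compresses the temporary files it writes.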

Related

How to speed up grep/awk command?

I am going to process a text file (>300 GB) by splitting it into small text files (~1 GB), and I want to speed up the grep/awk commands.
I need to grep the lines that have a value in column b; here are my approaches:
# method 1:
awk -F',' '$2 ~ /a/ { print }' input
# method 2:
grep -e ".a" < input
Both approaches take about 1 minute per file. So how can I speed up this operation?
Sample of input file:
a,b,c,d
1,4a337485,2,54
4,2a4645647,4,56
6,5a3489556,3,22
9,,3,12
10,0,34,45
24,4a83944,3,22
45,,435,34
Expected output file:
a,b,c,d
1,4a337485,2,54
4,2a4645647,4,56
6,5a3489556,3,22
24,4a83944,3,22
Are you sure that grep or awk is the culprit of your perceived slowness? Do you know about cut(1) or sed(1)? Have you benchmarked the time it takes to run wc(1) on your data? The textual I/O is probably taking a lot of the time.
Please benchmark several times, and use time(1) to benchmark your program.
I have a high-end Debian desktop (an AMD 2970WX, 64 GB RAM, a 1 TB SSD system disk, and multi-terabyte 7200 RPM SATA data disks), and just running wc on a 25 GB file (a *.tar.xz archive) sitting on a hard disk takes more than 10 minutes (measured with time). wc does really simple textual processing, reading the file sequentially, so it should run faster than grep (but, to my surprise, it does not!) or awk on the same data:
wc /big/basile/backup.tar.xz 640.14s user 4.58s system 99% cpu 10:49.92 total
and (using grep on the same file to count occurrences of a)
grep -c a /big/basile/backup.tar.xz 38.30s user 7.60s system 33% cpu 2:17.06 total
general answer to your question:
Just write an equivalent program cleverly (with efficient data structures: red-black trees, hash tables, etc.) in C, C++, OCaml, or most other good languages and implementations. Or buy more RAM to increase your page cache. Or buy an SSD to hold your data. And repeat your benchmarks more than once (because of the page cache).
suggestion for your problem: use a relational database
Using a plain 300 GB text file is likely not the best approach. Huge textual files are usually the wrong choice once you need to process the same data several times; you'd be better off pre-processing it somehow.
If you repeat the same grep search or awk execution on the same data file more than once, consider using sqlite (see also this answer) or even some other real relational database (e.g. PostgreSQL or another good RDBMS) to store and then process your original data.
So a possible approach (if you have enough disk space) might be to write some program (in C, Python, OCaml, etc.), fed by your original data, that fills an sqlite database. Be sure to create sensible database indexes and take the time to design a good enough database schema, keeping database normalization in mind.
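As a minimal sketch of that sqlite route (the database, table, and file names are placeholders; sqlite's CSV import takes its column names from the first row when the table doesn't already exist):
sqlite3 logs.db <<'EOF'
.mode csv
.import input.csv records
CREATE INDEX idx_records_b ON records(b);
SELECT * FROM records WHERE b LIKE '%a%';
EOF
After the one-time import and index creation, repeated queries on column b no longer re-read the whole 300 GB of text.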
Use mawk, avoid regex and do:
$ mawk -F, '$2!=""' file
a,b,c,d
1,4a337485,2,54
4,2a4645647,4,56
6,5a3489556,3,22
10,0,34,45
24,4a83944,3,22
Let us know how long that took.
I did some tests with 10M records of your data; based on the results, use mawk with a regex:
GNU awk and regex:
$ time gawk -F, '$2~/a/' file > /dev/null
real 0m7.494s
user 0m7.440s
sys 0m0.052s
GNU awk and no regex:
$ time gawk -F, '$2!=""' file >/dev/null
real 0m9.330s
user 0m9.276s
sys 0m0.052s
mawk and no regex:
$ time mawk -F, '$2!=""' file >/dev/null
real 0m4.961s
user 0m4.904s
sys 0m0.060s
mawk and regex:
$ time mawk -F, '$2~/a/' file > /dev/null
real 0m3.672s
user 0m3.600s
sys 0m0.068s
I suspect your real problem is that you're calling awk repeatedly (probably in a loop), once per set of values of $2, generating an output file each time, e.g.:
awk -F, '$2==""' input > novals
awk -F, '$2!=""' input > yesvals
etc.
Don't do that as it's very inefficient since it's reading the whole file on every iteration. Do this instead:
awk -F, '{out=($2=="" ? "novals" : "yesvals")} {print > out}' input
That will create all of your output files with one call to awk. Once you get past about 15 output files, you'd either need GNU awk (for its internal handling of open file descriptors) or you'd need to call close(out) whenever $2 changes and use >> instead of >:
awk -F, '$2!=prev{close(out); out=($2=="" ? "novals" : "yesvals"); prev=$2} {print >> out}' input
and that would be more efficient if you sorted your input file first (GNU sort's -s gives a stable sort, which matters if you care about preserving the input order within each unique $2 value):
sort -t, -k2,2 -s
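Putting the two pieces together, the sorted variant might look something like this (a sketch; novals/yesvals are just the example output names from above):
sort -t, -k2,2 -s input |
awk -F, '$2!=prev{close(out); out=($2=="" ? "novals" : "yesvals"); prev=$2} {print >> out}'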

Random String in linux by system time

I work with Bash. I want to generate a random string based on the system time. The length of the unique string must be between 10 and 30 characters. Can anybody help me?
There are many ways to do this; my favorite uses the urandom device:
burhan@sandbox:~$ tr -cd '[:alnum:]' < /dev/urandom | fold -w30 | head -n1
CCI4zgDQ0SoBfAp9k0XeuISJo9uJMt
tr (translate) makes sure that only alphanumerics are shown
fold will wrap it to 30 character width
head makes sure we get only the first line
To use the current system time (as you have this specific requirement):
burhan@sandbox:~$ date +%s | sha256sum | base64 | head -c30; echo
NDc0NGQxZDQ4MWNiNzBjY2EyNGFlOW
date +%s = this is our date based seed
We run it through a few hashes to get a "random" string
Finally we truncate it to 30 characters
Other ways (including the two I listed above) are available at this page and others if you simply google.
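If you also want the length itself to vary between 10 and 30 on each run, here's a small sketch using bash's RANDOM:
len=$(( RANDOM % 21 + 10 ))   # random length from 10 to 30
tr -cd '[:alnum:]' < /dev/urandom | head -c "$len"; echo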
Maybe you can use uuidgen -t.
Generate a time-based UUID. This method creates a UUID based on the system clock plus the system's ethernet hardware address, if present.
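A time-based UUID is 36 characters including the dashes, so to bring it into the 10-30 range you could, for example, drop the dashes and trim:
uuidgen -t | tr -d '-' | cut -c1-30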
I recently put together a script to handle this; the output is a 33-digit md5 checksum, but you can trim it down with sed to between 10 and 30 characters.
E.g. gen_uniq_id.bsh | sed 's/\(.\{20\}\)\(.*$\)/\1/'
The script is fairly robust: it uses the current time to nanosecond precision, /dev/urandom, and mouse movement data, and it allows optionally changing the collection times for the random and mouse data.
It also has a -s option that allows an additional string argument to be incorporated, so you can random seed from anything.
https://code.google.com/p/gen-uniq-id/

Is it possible to display the progress of a sort in linux?

My job involves a lot of sorting fields from very large files. I usually do this with the sort command in bash. Unfortunately, when I start a sort I am never really sure how long it is going to take. Should I wait a second for the results to appear, or should I start working on something else while it runs?
Is there any possible way to get an idea of how far along a sort has progressed or how fast it is working?
$ cut -d , -f 3 VERY_BIG_FILE | sort -du > output
No, GNU sort does not do progress reporting.
However, if you are using sort just to remove duplicates, and you don't actually care about the ordering, then there's a more scalable way of doing that:
awk '! a[$0]++'
This writes out the first occurrence of a line as soon as it's been seen, which can give you an idea of the progress.
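Applied to the command from your question, that would be something like:
cut -d , -f 3 VERY_BIG_FILE | awk '!a[$0]++' > output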
You might want to give pv a try; it should give you a pretty good idea of what is going on in your pipe in terms of throughput.
Example (untested) injecting pv before and after the sort command to get an idea of the throughput:
$ cut -d , -f 3 VERY_BIG_FILE | pv -cN cut | sort -du | pv -cN sort > output
EDIT: I missed the -u in your sort command, so counting lines first in order to show a percentage won't work. I've removed that part from my answer.
You can execute your sort in the background;
you'll get your prompt back and can do other jobs:
$ sort ...... &   # & means run in the background

What standard commands can I use to print just the first few lines of sorted output on the command line efficiently?

I basically want the equivalent of
... | sort -arg1 -arg2 -... | head -n $k
but, my understanding is that sort will go O(n log n) over the whole input. In my case I'm dealing with lots of data, so runtime matters to me - and also I have a habit of overflowing my tmp/ folder with sort temporary files.
I'd rather have it go O(n log k) using e.g. a heap, which would presumably go faster, and which also reduces the working set memory to k as well.
Is there some combination of standard command-line tools that can do this efficiently, without me having to code something myself? Ideally it would support the full expressive sort power of the sort command. sort (on ubuntu at least) appears to have no man-page-documented switch to pull it off...
Based on the above, and some more poking, I'd say the official answer to my question is "there is no solution." You can use specialized tools, or you can use the tools you've got with their current performance, or you can write your own tool.
I'm debating tracking down the sort source code and offering a patch. In the meantime, in case this quick hack helps anybody doing something similar to what I was doing, here's what I wrote for myself. It's not the best Python, and a very shady benchmark; I offer it to anybody who cares to provide something more rigorous:
256 files, about 1.6 GB total size, all sitting on an SSD, lines
separated by \n, lines of the format [^\t]*\t[0-9]+
Ubuntu 10.04, 6 cores, 8 GB of RAM, /tmp on the SSD as well.
$ time sort -t^v<tab> -k2,2n foo* | tail -10000
real 7m26.444s
user 7m19.790s
sys 0m17.530s
$ time python test.py 10000 foo*
real 1m29.935s
user 1m28.640s
sys 0m1.220s
Analyzing with diff, the two methods differ in tie-breaking, but otherwise the sort order is the same.
test.py:
#!/usr/bin/env python
# test.py
from sys import argv
import heapq
from itertools import chain
# parse N - the size of the heap, and confirm we can open all input files
N = int(argv[1])
streams = [open(f, "r") for f in argv[2:]]
def line_iterator_to_tuple_iterator(line_i):
    for line in line_i:
        s, c = line.split("\t")
        c = int(c)
        yield (c, s)

# use heap to process inputs
rez = heapq.nlargest(N,
                     line_iterator_to_tuple_iterator(chain(*streams)),
                     key=lambda x: x[0])

for r in rez:
    print "%s\t%s" % (r[1], r[0])

for s in streams:
    s.close()
UNIX/Linux provides a generalist's toolset. For large datasets it does loads of I/O; it will do everything you could want, but slowly. If we had an idea of the input data it would help immensely.
IMO, you have some choices, none of which you will really like.
Do a multipart "radix" pre-sort: for example, have awk write all of the lines whose keys start with 'A' to one file, 'B' to another, etc. Or if you only want 'P', 'D', and 'Q', have awk just pull out what you want. Then do a full sort on that small subset. This creates 26 files named A, B, ... Z:
awk '{print $0 > substr($0,1,1)}' bigfile; sort [options here] P D Q > result
Spend $$: buy CoSort from iri.com or other commercial sort software, for example. These sorts use all kinds of optimizations, but they are not free like bash. You could also buy an SSD, which speeds up sorting on disk by several orders of magnitude (roughly 5,000 IOPS up to 75,000 IOPS). Use the TMPDIR variable to put your temp files on the SSD, and read and write only to the SSD, while still using your existing UNIX toolset (see the sketch after this list).
Use some software like R or strata, or preferably a database; all of these are meant for large datasets.
Do what you are doing now, but watch youtube while the UNIX sort runs.
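As a small illustration of the TMPDIR point above (the mount point is just a placeholder for wherever the SSD lives; GNU sort uses $TMPDIR for its temporary files unless -T overrides it):
mkdir -p /mnt/ssd/tmp
TMPDIR=/mnt/ssd/tmp sort [options here] bigfile > result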
IMO, you are using the wrong tools for large datasets when you want quick results.
Here's a crude partial solution:
#!/usr/bin/perl
use strict;
use warnings;
my @lines = ();

while (<>) {
    push @lines, $_;
    @lines = sort @lines;
    if (scalar @lines > 10) {
        pop @lines;
    }
}

print @lines;
It reads the input data only once, continuously maintaining a sorted array of the top 10 lines.
Sorting the whole array every time is inefficient, of course, but I'll guess that for a gigabyte input it will still be substantially faster than sort huge-file | head.
Adding an option to vary the number of lines printed would be easy enough. Adding options to control how the sorting is done would be a bit more difficult, though I wouldn't be surprised if there's something in CPAN that would help with that.
More abstractly, one approach to getting just the first N sorted elements from a large array is to use a partial Quicksort, where you don't bother sorting the right partition unless you need to. That requires holding the entire array in memory, which is probably impractical in your case.
You could split the input into medium-sized chunks, apply some clever algorithm to get the top N lines of each chunk, concatenate the chunks together, then apply the same algorithm to the result. Depending on the sizes of the chunks, sort ... | head might be sufficiently clever. It shouldn't be difficult to throw together a shell script using split -l ... to do this.
(Insert more hand-waving as needed.)
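A rough sketch of that chunked approach (the chunk size and top-N count are arbitrary here, and plain lexical sort stands in for whatever sort options you actually need):
split -l 1000000 input chunk_
for f in chunk_*; do
    sort "$f" | head -n 10000 > "$f.top"
done
sort -m chunk_*.top | head -n 10000 > output   # -m merges the already-sorted per-chunk results
rm chunk_*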
Disclaimer: I just tried this on a much smaller file than what you're working with (about 1.7 million lines), and my method was slower than sort ... | head.

Performance of sort command in unix

I am writing a custom apache log parser for my company and I noticed a performance issue that I can't explain. I have a text file log.txt with size 1.2GB.
The command sort log.txt is up to 3 seconds slower than the command cat log.txt | sort.
Does anybody know why this is happening?
cat file | sort is a Useless Use of Cat.
The purpose of cat is to concatenate (or "catenate") files. If it's only one file, concatenating it with nothing at all is a waste of time, and costs you a process.
It shouldn't take longer. Are you sure your timings are right?
Please post the output of:
time sort file
and
time cat file | sort
You need to run the commands a few times and get the average.
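For example, something like the following gives a feel for the variance (the first run will be slower if the file isn't in the page cache yet):
for i in 1 2 3; do time sort log.txt > /dev/null; done
for i in 1 2 3; do time cat log.txt | sort > /dev/null; done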
Instead of worrying about the performance of sort, you should change your logging:
Eliminate unnecessarily verbose output to your log.
Periodically roll the log (based on either date or size).
...fix the errors outputting to the log. ;)
Also, are you sure cat is reading the entire file? It may have a read buffer etc.
