remove dups from many csv files - linux

Given n csv files where they add up to 100 GB in size, I need to remove duplicate rows based on the following rules and conditions:
The csv files are numbered 1.csv to n.csv, and each file is about 50MB in size.
The first column is a string key; 2 rows are considered dups if their first columns are the same.
I want to remove dups by keeping the one in the later file (2.csv is considered later than 1.csv).
My algorithm is the following; I want to know if there's a better one.
merge all files into one giant file
cat *.csv > one.csv
sort the csv
sort one.csv >one_sorted.csv
not sure how to eliminate dups at this point. uniq has a -f flag that skips the first N fields, but in my case I want the opposite: compare on only the first field and ignore the rest.
I need help with the last step (eliminating dups in a sorted file). Also is there a more efficient algorithm?

Here's one way using GNU awk:
awk -F, '{ array[$1]=$0 } END { for (i in array) print array[i] }' $(ls -v *.csv)
Explanation: Reading a numerically sorted glob of files, we add each line to an associative array keyed on its first column, with the whole line as the value. In this way, the duplicate that's kept is the one that occurs in the latest file. Once complete, we loop through the keys of the array and print out the values. GNU awk does provide sorting abilities through its asort() and asorti() functions, but piping the output to sort makes things much easier to read, and is probably quicker and more efficient.
You could do this if you require numerical sorting on the first column:
awk -F, '{ array[$1]=$0 } END { for (i in array) print array[i] | "sort -nk 1" }' $(ls -v *.csv)
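You can equally pipe the unadorned awk output to sort at the shell level; a sketch (pick whatever sort options suit your first column):
awk -F, '{ array[$1]=$0 } END { for (i in array) print array[i] }' $(ls -v *.csv) | sort -t, -k1,1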

If you can keep the lines in memory
If enough of the data will fit in memory, the awk solution by steve is pretty neat, whether you write to the sort command by pipe within awk or simply by piping the output of the unadorned awk to sort at the shell level.
If you have 100 GiB of data with perhaps 3% duplication, then you'll need to be able to hold roughly 97 GiB of it (plus the hash overhead) in memory. That's a lot of main memory. A 64-bit system might handle it with virtual memory, but it is likely to run rather slowly.
If the keys fit in memory
If you can't fit enough of the data in memory, then the task ahead is much harder and will require at least two scans over the files. We need to assume, pro tem, that you can at least fit each key in memory, along with a count of the number of times the key has appeared.
Scan 1: read the files.
Count the number of times each key appears in the input.
In awk, use icount[$1]++.
Scan 2: reread the files.
Count the number of times each key has appeared; ocount[$1]++.
If icount[$1] == ocount[$1], then print the line.
(This assumes you can store the keys and counts twice; the alternative is to use icount (only) in both scans, incrementing in Scan 1 and decrementing in Scan 2, printing the value when the count decrements to zero.)
I'd probably use Perl for this rather than awk, if only because it will be easier to reread the files in Perl than in awk.
Not even the keys fit?
What about if you can't even fit the keys and their counts into memory? Then you are facing some serious problems, not least because scripting languages may not report the out of memory condition to you as cleanly as you'd like. I'm not going to attempt to cross this bridge until it's shown to be necessary. And if it is necessary, we'll need some statistical data on the file sets to know what might be possible:
Average length of a record.
Number of distinct keys.
Number of distinct keys with N occurrences for each of N = 1, 2, ... max.
Length of a key.
Number of keys plus counts that can be fitted into memory.
And probably some others...so, as I said, let's not try crossing that bridge until it is shown to be necessary.
Perl solution
Example data
$ cat x000.csv
abc,123,def
abd,124,deg
abe,125,deh
$ cat x001.csv
abc,223,xef
bbd,224,xeg
bbe,225,xeh
$ cat x002.csv
cbc,323,zef
cbd,324,zeg
bbe,325,zeh
$ perl fixdupcsv.pl x???.csv
abd,124,deg
abe,125,deh
abc,223,xef
bbd,224,xeg
cbc,323,zef
cbd,324,zeg
bbe,325,zeh
$
Note the absence of gigabyte-scale testing!
fixdupcsv.pl
This uses the 'count up, count down' technique.
#!/usr/bin/env perl
#
# Eliminate duplicate records from 100 GiB of CSV files based on key in column 1.
use strict;
use warnings;
# Scan 1 - count occurrences of each key
my %count;
my @ARGS = @ARGV; # Preserve arguments for Scan 2
while (<>)
{
$_ =~ /^([^,]+)/;
$count{$1}++;
}
# Scan 2 - reread the files; count down occurrences of each key.
# Print when it reaches 0.
@ARGV = @ARGS; # Reset arguments for Scan 2
while (<>)
{
$_ =~ /^([^,]+)/;
$count{$1}--;
print if $count{$1} == 0;
}
The 'while (<>)' notation destroys @ARGV (hence the copy to @ARGS before doing anything else), but that also means that if you reset @ARGV to the original value, it will run through the files a second time. Tested with Perl 5.16.0 and 5.10.0 on Mac OS X 10.7.5.
This is Perl; TMTOWTDI. You could use:
#!/usr/bin/env perl
#
# Eliminate duplicate records from 100 GiB of CSV files based on key in column 1.
use strict;
use warnings;
my %count;
sub counter
{
my($inc) = @_;
while (<>)
{
$_ =~ /^([^,]+)/;
$count{$1} += $inc;
print if $count{$1} == 0;
}
}
my @ARGS = @ARGV; # Preserve arguments for Scan 2
counter(+1);
@ARGV = @ARGS; # Reset arguments for Scan 2
counter(-1);
There are probably ways to compress the body of the loop, too, but I find what's there reasonably clear and prefer clarity over extreme terseness.
Invocation
You need to present the fixdupcsv.pl script with the file names in the correct order. Since you have files numbered from 1.csv through about 2000.csv, it is important not to list them in alphanumeric order. The other answers suggest ls -v *.csv using the GNU ls extension option. If it is available, that's the best choice.
perl fixdupcsv.pl $(ls -v *.csv)
If that isn't available, then you need to do a numeric sort on the names:
perl fixdupcsv.pl $(ls *.csv | sort -t. -k1.1n)
Awk solution
awk -F, '
BEGIN {
for (i = 1; i < ARGC; i++)
{
while ((getline < ARGV[i]) > 0)
count[$1]++;
close(ARGV[i]);
}
for (i = 1; i < ARGC; i++)
{
while ((getline < ARGV[i]) > 0)
{
count[$1]--;
if (count[$1] == 0) print;
}
close(ARGV[i]);
}
}'
This ignores awk's innate 'read' loop and does all reading explicitly (you could replace BEGIN by END and would get the same result). The logic is closely based on the Perl logic in many ways. Tested on Mac OS X 10.7.5 with both BSD awk and GNU awk. Interestingly, GNU awk insisted on the parentheses in the calls to close where BSD awk did not. The close() calls are necessary in the first loop to make the second loop work at all. The close() calls in the second loop are there to preserve symmetry and for tidiness — but they might also be relevant when you get around to processing a few hundred files in a single run.

My answer is based on steve's
awk -F, '!count[$1]++' $(ls -rv *.csv)
{print $0} is implied in the awk statement when the pattern is true.
Essentially, awk prints only the first line it sees for each distinct value of $1. Since the .csv files are listed in reverse natural order, this means that of all the lines that have the same value for $1, only the one in the latest file is printed.
Note: This will not work if you have duplicates in the same file (i.e. if you have multiple instances of the same key within the same file)
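If you do need to handle that case, one possible workaround (a sketch, assuming GNU tac is available) is to reverse the line order within each file as well as the file order, so the last occurrence of every key is seen first, and then restore the order at the end:
for f in $(ls -rv *.csv); do tac "$f"; done | awk -F, '!count[$1]++' | tac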

Regarding your sorting plan, it might be more practical to sort the individual files and then merge them, rather than concatenating and then sorting. The complexity of sorting with the sort program is likely to be O(n log(n)). If you have, say, 200,000 lines per 50 MB file and 2000 files, n will be about 400 million, and n log(n) ~ 10^10. If instead you treat F files of R records each separately, the cost of sorting them is O(F*R*log(R)) and the cost of merging the F sorted files is O(F*R*log(F)). These costs are high enough that separate sorting is not necessarily faster, but the process can be broken into convenient chunks, so it can be more easily checked as things go along. Here is a small-scale example, which supposes that comma can be used as a delimiter for the sort key. (A quote-delimited key field that contains commas would be a problem for the sort as shown.) Note that -s tells sort to do a stable sort, leaving lines with the same sort key in the order they were encountered.
for i in $(seq 1 8); do sort -t, -sk1,1 $i.csv > $i.tmp; done
sort -mt, -sk1,1 [1-8].tmp > 1-8.tmp
or, if you're more cautious, you might save some intermediate results:
sort -mt, -sk1,1 [1-4].tmp > 1-4.tmp
sort -mt, -sk1,1 [5-8].tmp > 5-8.tmp
cp 1-4.tmp 5-8.tmp /backup/storage
sort -mt, -sk1,1 1-4.tmp 5-8.tmp > 1-8.tmp
Also, an advantage of doing separate sorts followed by a merge or merges is the ease of splitting the workload across multiple processors or systems.
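For instance, a rough sketch of running the per-file sorts in parallel with xargs (the .tmp naming and the 4-job limit are arbitrary; ls -v keeps the merge inputs in numeric file order, so lines with equal keys stay in file order):
printf '%s\n' *.csv | xargs -P4 -I{} sh -c 'sort -t, -sk1,1 "$1" > "$1.tmp"' _ {}
sort -mt, -sk1,1 $(ls -v *.csv.tmp) > merged.csv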
After you sort and merge all the files (into, say, file X) it is fairly simple to write an awk program that at BEGIN reads a line from X and puts it in variable L. Thereafter, each time it reads a line from X, if the first field of $0 doesn't match the first field of L, it writes out L and sets L to $0; if the first fields do match, it just sets L to $0 (so the line from the later file wins). At END, it writes out L.
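A minimal sketch of that awk pass (using an NR == 1 test instead of an explicit getline in BEGIN), run on the sorted and merged file X:
awk -F, 'NR == 1 { L = $0; key = $1; next }
$1 != key { print L }
{ L = $0; key = $1 }
END { if (NR > 0) print L }' X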

Related

How to speed up grep/awk command?

I am going to process the text file (>300 GB) and split it into small text files (~1 GB). I want to speed up grep/awk commands.
I need to grep the lines which have values in column b; here are my approaches:
# method 1:
awk -F',' '$2 ~ /a/ { print }' input
# method 2:
grep -e ".a" < input
Both ways cost about 1 minute for each file. So how can I speed up this operation?
Sample of input file:
a,b,c,d
1,4a337485,2,54
4,2a4645647,4,56
6,5a3489556,3,22
9,,3,12
10,0,34,45
24,4a83944,3,22
45,,435,34
Expected output file:
a,b,c,d
1,4a337485,2,54
4,2a4645647,4,56
6,5a3489556,3,22
24,4a83944,3,22
Are you so sure that grep or awk is the culprit of your perceived slowness? Do you know about cut(1) or sed(1)? Have you benchmarked the time to run wc(1) on your data? Probably the textual I/O is taking a lot of time.
Please benchmark several times, and use time(1) to benchmark your program.
I have a high-end Debian desktop (with an AMD 2970WX, 64 GB RAM, a 1 TB SSD system disk, and multi-terabyte 7200 RPM SATA data disks), and just running wc on a 25 GB file (some *.tar.xz archive) sitting on a hard disk takes more than 10 minutes (measured with time). wc is doing some really simple textual processing by reading that file sequentially, so it should run faster than grep (but, to my surprise, does not!) or awk on the same data:
wc /big/basile/backup.tar.xz 640.14s user 4.58s system 99% cpu 10:49.92 total
and (using grep on the same file to count occurrences of a)
grep -c a /big/basile/backup.tar.xz 38.30s user 7.60s system 33% cpu 2:17.06 total
general answer to your question:
Just write an equivalent program cleverly, with efficient data structures (hash tables, red-black trees, etc.), in C or C++ or OCaml or most any other good language and implementation. Or buy more RAM to increase your page cache. Or buy an SSD to hold your data. And repeat your benchmarks more than once (because of the page cache).
suggestion for your problem: use a relational database
It is likely that using a plain text file of 300 GB is not the best approach. Having huge textual files is usually wrong, and is likely to be wrong once you need to process the same data several times. You'd be better off pre-processing it somehow.
If you repeat the same grep search or awk execution on the same data file more than once, consider instead using sqlite (see also this answer) or even some other real relational database (e.g. PostgreSQL or some other good RDBMS) to store and then process your original data.
So a possible approach (if you have enough disk space) might be to write some program (in C, Python, OCaml, etc.), fed by your original data, that fills some sqlite database. Be sure to have clever database indexes, and take the time to design a good enough database schema, being aware of database normalization.
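A hedged sketch of that route with the sqlite3 shell (the database, table and index names here are made up, and .import uses the CSV header row a,b,c,d as the column names when the table does not yet exist):
sqlite3 data.db <<'EOF'
.mode csv
.import input rows
CREATE INDEX rows_b_idx ON rows(b);
EOF
# afterwards the repeated query is an indexed lookup rather than a 300 GB scan
sqlite3 -csv data.db "SELECT * FROM rows WHERE b <> '';"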
Use mawk, avoid regex and do:
$ mawk -F, '$2!=""' file
a,b,c,d
1,4a337485,2,54
4,2a4645647,4,56
6,5a3489556,3,22
10,0,34,45
24,4a83944,3,22
Let us know how long that took.
I did some tests with 10M records of your data; based on the results, use mawk and the regex:
GNU awk and regex:
$ time gawk -F, '$2~/a/' file > /dev/null
real 0m7.494s
user 0m7.440s
sys 0m0.052s
GNU awk and no regex:
$ time gawk -F, '$2!=""' file >/dev/null
real 0m9.330s
user 0m9.276s
sys 0m0.052s
mawk and no regex:
$ time mawk -F, '$2!=""' file >/dev/null
real 0m4.961s
user 0m4.904s
sys 0m0.060s
mawk and regex:
$ time mawk -F, '$2~/a/' file > /dev/null
real 0m3.672s
user 0m3.600s
sys 0m0.068s
I suspect your real problem is that you're calling awk repeatedly (probably in a loop), once per set of values of $2 and generating an output file each time, e.g.:
awk -F, '$2==""' input > novals
awk -F, '$2!=""' input > yesvals
etc.
Don't do that as it's very inefficient since it's reading the whole file on every iteration. Do this instead:
awk -F, '{out=($2=="" ? "novals" : "yesvals")} {print > out}' input
That will create all of your output files with one call to awk. Once you get past about 15 output files it would require GNU awk for internal handling of open file descriptors or you need to add close(out)s when $2 changes and use >> instead of >:
awk -F, '$2!=prev{close(out); out=($2=="" ? "novals" : "yesvals"); prev=$2} {print >> out}' input
and that would be more efficient if you sorted your input file first with the following (GNU sort is required for -s, a stable sort, if you care about preserving input ordering within each distinct $2 value):
sort -t, -k2,2 -s
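Putting the sort and the awk together, a sketch of the whole pipeline (the extra out == "" test just handles the very first record, whose $2 may well be empty once the file is sorted):
sort -t, -k2,2 -s input |
awk -F, '$2 != prev || out == "" { if (out != "") close(out); out = ($2 == "" ? "novals" : "yesvals"); prev = $2 } { print >> out }'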

How to calculate the CPU load when viewing video (web-server on linux)

I have a shop selling video courses. How do I calculate the CPU load when users are viewing video (web server on Linux)?
You are asking for a kernel-internal value, so you won't have to compute anything; you will just query the kernel for that value.
Interactively you can use the top command, as Daniel stated in his comment.
Programmatically, top will be cumbersome. Instead, you can use uptime as a high level tool for this. Use uptime | { IFS=\ , read a b c d e f g h i j k l m; echo "$j"; } to only get the current load.
A little lower level would be to use the proc file system. The file /proc/loadavg provides the information about the load. You can use cut -d' ' -f 1 /proc/loadavg to get only the current (1-minute) load.
For the longer-term values of the load (three values are given: the 1-, 5- and 15-minute load averages), use $k or $l instead of $j in the uptime solution, and use -f 2 or -f 3 in the proc file system solution.
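For example, a small sketch that pulls all three averages out of /proc/loadavg in one go:
read load1 load5 load15 rest < /proc/loadavg
echo "1min=$load1 5min=$load5 15min=$load15"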

How to split a large variable?

I'm working with large variables, and it can be very slow "looping" through them with while read line; I found out that the smaller the variable, the faster it works.
How can I split large variable into smaller variables and then read them one by one?
for example,
What I would like to achieve:
bigVar=$(echo "$bigVar" | split_var)
for var in "${bigVar[@]}"; do
while read line; do
...
done <<< "${var}"
done
or maybe split it into bigVar1, bigVar2, bigVar3, etc., and then read them one by one.
Instead of doing
bigVar=$(someCommand)
while read line
do
...
done <<< "$bigVar"
Use
while read line
do
...
done < <(someCommand)
This way, you avoid the problem with big variables entirely, and someCommand can output gigabyte after gigabyte with no problem.
If the reason you put it in a variable was to do work in multiple steps on it, rewrite it as a pipeline.
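For example, a sketch of that idea (the grep stage is just a placeholder for whatever intermediate step you would otherwise have stored in a variable):
while read -r line
do
printf 'processed: %s\n' "$line"
done < <(someCommand | grep -v '^#')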
If bigVar is made of words, you could use xargs to split it into lines no longer than the maximum length of a command line, usually 32 kB or 64 kB:
someCommand|xargs|while read line
do
...
done
In this case xargs uses its default command, which is echo.
I'm curious about what you want to do in the while loop, as it may be optimized with a pipeline.

What standard commands can I use to print just the first few lines of sorted output on the command line efficiently?

I basically want the equivalent of
... | sort -arg1 -arg2 -... | head -n $k
but, my understanding is that sort will go O(n log n) over the whole input. In my case I'm dealing with lots of data, so runtime matters to me - and also I have a habit of overflowing my tmp/ folder with sort temporary files.
I'd rather have it go O(n log k) using e.g. a heap, which would presumably go faster, and which also reduces the working set memory to k as well.
Is there some combination of standard command-line tools that can do this efficiently, without me having to code something myself? Ideally it would support the full expressive sort power of the sort command. sort (on ubuntu at least) appears to have no man-page-documented switch to pull it off...
Based on the above, and some more poking, I'd say the official answer to my question is "there is no solution." You can use specialized tools, or you can use the tools you've got with their current performance, or you can write your own tool.
I'm debating tracking down the sort source code and offering a patch. In the meantime, in case this quick hack helps anybody doing something similar to what I was doing, here's what I wrote for myself. Not the best Python, and a very shady benchmark; I offer it to anybody else who cares to provide something more rigorous:
256 files, of about 1.6 Gigs total size, all sitting on an ssd, lines
separated by \n, lines of format [^\t]*\t[0-9]+
Ubuntu 10.4, 6 cores, 8 gigs of ram, /tmp on ssd as well.
$ time sort -t^v<tab> -k2,2n foo* | tail -10000
real 7m26.444s
user 7m19.790s
sys 0m17.530s
$ time python test.py 10000 foo*
real 1m29.935s
user 1m28.640s
sys 0m1.220s
Using diff to analyze the output, the two methods differ on tie-breaking, but otherwise the sort order is the same.
test.py:
#!/usr/bin/env python
# test.py
from sys import argv
import heapq
from itertools import chain
# parse N - the size of the heap, and confirm we can open all input files
N = int(argv[1])
streams = [open(f, "r") for f in argv[2:]]
def line_iterator_to_tuple_iterator(line_i):
for line in line_i:
s,c = line.split("\t")
c = int(c)
yield (c, s)
# use heap to process inputs
rez = heapq.nlargest(N,
line_iterator_to_tuple_iterator(chain(*streams)),
key=lambda x: x[0])
for r in rez:
print "%s\t%s" % (r[1], r[0])
for s in streams:
s.close()
UNIX/Linux provides a generalist toolset. For large datasets it does loads of I/O. It will do everything you could want, but slowly. If we had an idea of the input data it would help immensely.
IMO, you have some choices, none of which you will really like.
Do a multipart "radix" pre-sort - for example, have awk write all of the lines whose keys start with 'A' to one file, 'B' to another, etc. Or if you only want 'P', 'D' & 'Q', have awk just pull out what you want. Then do a full sort on that small subset. This creates 26 files named A, B ... Z:
awk '{print $0 > substr($0,1,1)}' bigfile; sort [options here] P D Q > result
Spend $$: (Example) Buy CoSort from iri.com or any other commercial sort software. These sorts use all kinds of optimizations, but they are not free like bash. You could also buy an SSD, which speeds up sorting on disk by several orders of magnitude: from 5000 IOPS to 75000 IOPS. Use the TMPDIR variable to put your tmp files on the SSD, reading and writing only to the SSD. But use your existing UNIX toolset.
Use some software like R or Stata, or preferably a database; all of these are meant for large datasets.
Do what you are doing now, but watch youtube while the UNIX sort runs.
IMO, you are using the wrong tools for large datasets when you want quick results.
Here's a crude partial solution:
#!/usr/bin/perl
use strict;
use warnings;
my @lines = ();
while (<>) {
push @lines, $_;
@lines = sort @lines;
if (scalar @lines > 10) {
pop @lines;
}
}
print @lines;
It reads the input data only once, continuously maintaining a sorted array of the top 10 lines.
Sorting the whole array every time is inefficient, of course, but I'll guess that for a gigabyte input it will still be substantially faster than sort huge-file | head.
Adding an option to vary the number of lines printed would be easy enough. Adding options to control how the sorting is done would be a bit more difficult, though I wouldn't be surprised if there's something in CPAN that would help with that.
More abstractly, one approach to getting just the first N sorted elements from a large array is to use a partial Quicksort, where you don't bother sorting the right partition unless you need to. That requires holding the entire array in memory, which is probably impractical in your case.
You could split the input into medium-sized chunks, apply some clever algorithm to get the top N lines of each chunk, concatenate the chunks together, then apply the same algorithm to the result. Depending on the sizes of the chunks, sort ... | head might be sufficiently clever. It shouldn't be difficult to throw together a shell script using split -l ... to do this.
(Insert more hand-waving as needed.)
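For what it's worth, a rough sketch of that chunked approach (the chunk size and k are placeholders, and plain sort | head stands in for the "clever algorithm"):
k=10000
split -l 1000000 huge-file chunk.
for c in chunk.*; do sort "$c" | head -n "$k"; done | sort | head -n "$k"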
Disclaimer: I just tried this on a much smaller file than what you're working with (about 1.7 million lines), and my method was slower than sort ... | head.

Read data from pipe and write to standard out with a delay in between. Must handle binary files too

I have been trying for about an hour now to find an elegant solution to this problem. My goal is basically to write a bandwidth control pipe command which I could re-use in various situations (not just for network transfers, I know about scp -l 1234). What I would like to do is:
Delay for X seconds.
Read Y amount (or less than Y if there isn't enough) data from pipe.
Write the read data to standard output.
Where:
X could be 1..n.
Y could be 1 Byte up to some high value.
My problem is:
It must support binary data which Bash can't handle well.
Roads I've taken or at least thought of:
Using a while read data construct: it filters all whitespace characters in the encoding you're using.
Using dd bs=1 count=1 and looping. dd doesn't seem to have different exit codes depending on whether anything was read or not, which makes it harder to know when to stop looping. This method should work if I redirect standard error to a temporary file and read it to check whether something was transferred (it's in the statistics printed on stderr), then repeat. But I suspect it would be extremely slow if used on large amounts of data, and if possible I'd like to avoid creating any temporary files.
Any ideas or suggestions on how to solve this as cleanly as possible using Bash?
Maybe pv -qL RATE?
-L RATE, --rate-limit RATE
Limit the transfer to a maximum of RATE bytes per second. A
suffix of "k", "m", "g", or "t" can be added to denote kilobytes
(*1024), megabytes, and so on.
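For example, a sketch with placeholder commands, capping the pipe at roughly 64 kB/s:
somecommand | pv -qL 64k | othercommand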
It's not very elegant, but you can use a redirection trick to capture the number of bytes copied by dd and then use it as the exit condition for a while loop:
while [ -z "$byte_copied" ] || [ "$byte_copied" -ne 0 ]; do
sleep $X;
byte_copied=$(dd bs=$Y count=1 2>&1 >&4 | awk '$2 ~ /^bytes?$/ {print $1}');
done 4>&1
However, if your intent is to limit the transfer throughput, I suggest you to use pv.
Do you have to do it in bash? Can you just use an existing program such as cstream?
cstream meets your goal of a bandwidth controlled pipe command, but doesn't necessarily meet your other criteria with regard to your specific algorithm or implementation language.
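For example, a sketch with placeholder commands (cstream's -t takes the throughput limit in bytes per second; check your local man page):
somecommand | cstream -t 65536 | othercommand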
What about using head -c ?
cat /dev/zero | head -c 10 > test.out
Gives you a nice 10-byte file.
