I am using grep in a while loop to find lines from one file in another file and saving the output to a new file. My file is quite large (226 million lines) and the script is taking forever (12 days and counting). Do you have a suggestion to speed it up? Perhaps there is a better way than grep?
(I also need the preceding line for the output, therefore grep -B 1.)
Here is my code:
#!/bin/bash
while IFS= read -r line; do
grep -B 1 $line K33.21mercounts.bf.trimmedreads.dumps.fa >> 21mercounts.bf.trimmedreads.diff.kmers.K33;
done <21mercounts.bf.trimmedreads.diff.kmers
Update:
The input file with the lines to look for is 4.7 GB and 226 million lines, and looks like this:
AAAGAAAAAAAAAGCTAAAAT
ATCTCGACGCTCATCTCAGCA
GTTCGTCGGAGAGGAGAGAAC
GAGGACTATAAAATTGTCGCA
GGCTTCAATAATTTGTATAAC
GACATAGAATCACGAGTGACC
TGGTGAGTGACATCCTTGACA
ATGAAAACTGCCAGCAAACTC
AAAAAACTTACCTTAAAAAGT
TTAGTACACAATATCTCCCAA
The file to look in is 26 GB and 2 billion lines and looks like this:
>264638
AAAAAAAAAAAAAAAAAAAAA
>1
AAAGAAAAAAAAAGCTAAAAT
>1
ATCTCGACGCTCATCTCAGCA
>1
GTTCGTCGGAGAGGAGAGAAC
>28
TCTTTTCAGGAGTAATAACAA
>13
AATCATTTTCCGCTGGAGAGA
>38
ATTCAATAAATAATAAATTAA
>2
GAGGACTATAAAATTGTCGCA
>1
GGCTTCAATAATTTGTATAAC
The expected output would be this:
>1
AAAGAAAAAAAAAGCTAAAAT
>1
ATCTCGACGCTCATCTCAGCA
>1
GTTCGTCGGAGAGGAGAGAAC
>2
GAGGACTATAAAATTGTCGCA
>1
GGCTTCAATAATTTGTATAAC
You can try this grep -f command without a shell loop, using a fixed-string search:
grep -B1 -Ff 21mercounts.bf.trimmedreads.diff.kmers \
K33.21mercounts.bf.trimmedreads.dumps.fa > 21mercounts.bf.trimmedreads.diff.kmers.K33
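If every needle should only ever match a complete sequence line (as in the samples above), adding -x to restrict matches to whole lines may also help, since it rules out accidental substring hits; a hedged variant of the same command:
grep -B1 -x -Ff 21mercounts.bf.trimmedreads.diff.kmers \
    K33.21mercounts.bf.trimmedreads.dumps.fa > 21mercounts.bf.trimmedreads.diff.kmers.K33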
There are quite a few tools (e.g. ripgrep) and options (-f, -F, and -x) to speed up your basic approach. But all of them are basically the same slow approach you are using now, "only" sped up by a huge but still constant factor. For your problem and input sizes, I'd recommend changing the approach altogether. There are many different ways to tackle your problem. First, let's define some variables to estimate the speedup of those approaches:
Problem
A 26 GB haystack file with h = 1 billion entries (description, sequence) = 2 billion lines, e.g.
>28
TCTTTTCAGGAGTAATAACAA
>13
AATCATTTTCCGCTGGAGAGA
>38
ATTCAATAAATAATAAATTAA
...
A 4.7 GB needles file with n = 226 million lines, each of length m = 21, e.g.
GACATAGAATCACGAGTGACC
TGGTGAGTGACATCCTTGACA
ATGAAAACTGCCAGCAAACTC
...
For all needles, we want to extract the corresponding entries in the haystack (if they exist).
Solutions
We assume n < h and a constant m. Therefore O(n+h) = O(h), O(m)=O(1) and so on.
Our goal is to minimize the number of times we have to iterate the biggest file (= the haystack).
Naive – O(h·n) time
Currently, you are using the naive approach. For each needle, the entire haystack is searched once.
Put needles into data structure; search haystack once – O(h) time
Store all needles in a data structure which has a fast contains() operation.
Then iterate the haystack and call needles.contains(haystackEntry) for each entry, to decide whether it is something you are searching for.
Currently, your "data structure" is a list, which takes O(1) time to "build" (because it is already in that form), but O(n) time to query once!
The data structures below take O(n) time to populate and O(1) time to query once, resulting in O(n + h·1) = O(h) time in total.
Tries (= prefix trees) can be expressed as regexes, so you can stick with grep. E.g. the needles ABC, ABX, and XBC can be stored in the Trie regex ^(AB(C|X)|XBC). But converting the list of needles to such a Trie regex is a bit complicated in bash.
Hash maps are available in awk, see sundeep's answer. But putting 4.7 GB of raw data into such an in-memory structure is probably not very efficient, if even possible (it depends on your available memory; the hash map needs to be many times bigger than the raw data).
Either way, data structures and bash don't mix very well. And even if we switched to a better language, we would have to re-build or store/load the structure each time the program runs. Therefore it is easier and nearly as efficient to ...
Sort everything; search haystack once – O( h·log(h) + h ) time
First sort the haystack and the needles, then iterate the haystack only once.
Take the first needle and search the haystack from the beginning. When reaching a haystack entry that would have to be sorted behind the current needle, take the next needle and continue the search from your current location.
This can be done easily in bash. Here we use GNU coreutils to make processing a bit easier, faster, and safer:
export LC_ALL=C # speeds up sorting
mem=66% # Max. memory to be used while sorting. More is better.
sep=$'\f' # A character not appearing in your data.
paste -d"$sep" - - < haystack > haystack2
sort -S"$mem" -o needles2 needles
sort -S"$mem" -t"$sep" -k2,2 -o haystack2 haystack2
# --nocheck-order is not needed, but speeds up the process
join -t"$sep" -22 -o2.1,2.2 --nocheck-order needles2 haystack2 |
tr "$sep" \\n
This changes the order of the output. If you need the output in the original order, use a Schwartzian transform (= decorate-sort-undecorate): Before sorting the needles/haystack, store their line numbers. Drag those along through the entire process. At the end, sort the found entries by those line numbers. Finally, remove the line numbers and print the result.
export LC_ALL=C # speeds up sorting
mem=66% # Max. memory to be used while sorting. More is better.
sep=$'\f' # A character not appearing in your data.
nl -ba -d '' -s"$sep" needles > needles2
paste -d"$sep" - - < haystack | nl -ba -d '' -s"$sep" > haystack2
sort -t"$sep" -k2,2 -S"$mem" -o needles2 needles2
sort -t"$sep" -k3,3 -S"$mem" -o haystack2 haystack2
# --nocheck-order is not needed, but speeds up the process
join -t"$sep" -12 -23 -o1.1,2.1,2.2,2.3 --nocheck-order needles2 haystack2 > result
sort -t"$sep" -k1,2n -S"$mem" -o result result
cut -d"$sep" -f3- result | tr "$sep" \\n
If preserving the original order is not required, using GNU uniq and GNU sed:
{ cat 21mercounts.bf.trimmedreads.diff.kmers
sed -n 'x;n;G;s/\n//p' K33.21mercounts.bf.trimmedreads.dumps.fa
} | LC_ALL=C sort | uniq -w21 -D |
sed -n 's/\(.*\)>\(.*\)/>\2\n\1/p' > 21mercounts.bf.trimmedreads.diff.kmers.K33
Here's a solution using awk. Not sure if it will be faster than grep or ripgrep, but it could be, thanks to hash-based lookup. This assumes your RAM is big enough to load the first file (4.7 GB and 226 million lines).
$ awk 'NR==FNR{a[$1]; next} $0 in a{print p; print} {p=$0}' f1 f2
>1
AAAGAAAAAAAAAGCTAAAAT
>1
ATCTCGACGCTCATCTCAGCA
>1
GTTCGTCGGAGAGGAGAGAAC
>2
GAGGACTATAAAATTGTCGCA
>1
GGCTTCAATAATTTGTATAAC
mawk is usually the fastest option, but I have come across examples where gawk is faster, especially for arrays like in this command. If you can install frawk, that can give you even faster results. The command needs to be slightly modified:
frawk 'NR==FNR{a[$1]; next} $0 in a{print p; print $0} {p=$0}' f1 f2
Any time I deal with files this big, I almost always end up sorting them. Sorts are slow, but they take a lot less time than your while read loop that scans 2 billion lines 226 million times.
sort 4GB>4gb.srt
and
sed '/>/{N;s/\n/ /}' 26GB |sort -t' ' -k2 >25gb.srt
which will produce a file like this:
>264638 AAAAAAAAAAAAAAAAAAAAA
>1 AAAGAAAAAAAAAGCTAAAAT
>13 AATCATTTTCCGCTGGAGAGA
>1 ATCTCGACGCTCATCTCAGCA
>38 ATTCAATAAATAATAAATTAA
>2 GAGGACTATAAAATTGTCGCA
>1 GGCTTCAATAATTTGTATAAC
>1 GTTCGTCGGAGAGGAGAGAAC
>28 TCTTTTCAGGAGTAATAACAA
Now you only have to read through each file once.
$ cat tst
awk 'BEGIN{ getline key < "4gb.srt"; }
$2 < key { next; }
$2 > key { while ($2 > key){ getline key < "4gb.srt"; } }
$2 == key { $0=gensub(/ /,"\n",1); print }' 25gb.srt
$ ./tst
>1
AAAGAAAAAAAAAGCTAAAAT
>1
ATCTCGACGCTCATCTCAGCA
>2
GAGGACTATAAAATTGTCGCA
>1
GGCTTCAATAATTTGTATAAC
>1
GTTCGTCGGAGAGGAGAGAAC
The ordering is different from yours, but otherwise does that work?
(Try some tests with smaller files first...)
addendum
Please c.f. Socowi's better implementation, but I was asked to explain the awk, so -
First, see above where I parsed the larger "haystraw" file to single lines sorted on the key field, which will be $2 in awk, and the smaller "needles" file in the same order. Making a few (not necessarily safe) assumptions, I ran through both once.
BEGIN{ getline key < "4gb.srt"; }
This just initializes the first "needle" into a variable called key by reading from the appropriate file.
Then as awk reads each line of the "haystraw" file, it automatically parses it into the fields - since we stacked them, the first field is the previous line of the original haystack, and the second field is the value to check, so we do our comparisons between key and $2.
$2 < key { next; } # skip ahead to next key/needle
If the current straw is less than the needle, throw it away and grab the next one.
$2 > key { while ($2 > key){ getline key < "4gb.srt"; } }
If the current straw is greater than the needle, then the needle wasn't in the file. The next one might not be either, so we grab needles in sequential order and compare them until they catch up.
There's actually a potential bug here - it's not confirming that something was read and could hang in an endless loop when the needles run out. This section should have been something like -
$2 > key { while ($2 > key) { if( 0 == getline key < "4gb.srt" ) key = "ZZZZZZZZZZZZZZZZZZZZZZ"; } }
Finally,
$2 == key { $0=gensub(/ /,"\n",1); print }' 25gb.srt
If they match, reinsert the newline between the previous record and the matching line, and print them both.
There really should also have been an END{ close("4gb.srt") } as well.
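Putting those pieces together, here is a sketch of the corrected script with both fixes folded in (untested at this scale; it assumes the same sorted 4gb.srt and 25gb.srt files as above, and gawk, since gensub() is a gawk extension):
awk 'BEGIN { getline key < "4gb.srt" }
$2 < key { next }
$2 > key {
    # keep pulling needles until they catch up; fall back to a sentinel at EOF
    while ($2 > key) {
        if ((getline key < "4gb.srt") <= 0) key = "ZZZZZZZZZZZZZZZZZZZZZZ"
    }
}
$2 == key { $0 = gensub(/ /, "\n", 1); print }
END { close("4gb.srt") }' 25gb.srt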
grep can search for many patterns (given in a separate file) simultaneously, so reading K33.21mercounts.bf.trimmedreads.dumps.fa will only be done once.
Something like the following might work:
#!/bin/bash
grep -f 21mercounts.bf.trimmedreads.diff.kmers -B 1 K33.21mercounts.bf.trimmedreads.dumps.fa >> 21mercounts.bf.trimmedreads.diff.kmers.K33;
However, it probably requires lots of RAM.
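If memory does turn out to be the limiting factor, one possible workaround (not part of the original suggestion) is to split the pattern file into chunks and run grep once per chunk, trading RAM for extra passes over the 26 GB file. A rough sketch, with an illustrative chunk size:
split -l 50000000 21mercounts.bf.trimmedreads.diff.kmers kmer_chunk_
for chunk in kmer_chunk_*; do
    grep -B 1 -F -f "$chunk" K33.21mercounts.bf.trimmedreads.dumps.fa
done > 21mercounts.bf.trimmedreads.diff.kmers.K33
rm kmer_chunk_*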
I have the following simple script that tries to count
the tag encoded with "CB:Z" in SAM/BAM file:
samtools view -h small.bam | grep "CB:Z:" |
sed 's/.*CB:Z:\([ACGT]*\).*/\1/' |
sort |
uniq -c |
awk '{print $2 " " $1}'
Typically it needs to process 40 million lines. That code takes around 1 hour to finish.
This line sed 's/.*CB:Z:\([ACGT]*\).*/\1/' is very time consuming.
How can I speed it up?
The reason I used the regex is that the "CB" tag's column position is not fixed. Sometimes it's at column 20 and sometimes at column 21.
Example BAM file can be found HERE.
Update
Speed comparison on complete 40 million lines file:
My initial code:
real 21m47.088s
user 26m51.148s
sys 1m27.912s
James Brown's with AWK:
real 1m28.898s
user 2m41.336s
sys 0m6.864s
James Brown's with MAWK:
real 1m10.642s
user 1m41.196s
sys 0m6.484s
Another awk, pretty much like @tripleee's, I'd assume:
$ samtools view -h small.bam | awk '
match($0,/CB:Z:[ACGT]*/) { # use match for the regex match
a[substr($0,RSTART+5,RLENGTH-5)]++ # len(CB:z:)==5, hence +-5
}
END {
for(i in a)
print i,a[i] # sample output,tweak it to your liking
}'
Sample output:
...
TCTTAATCGTCC 175
GGGAAGGCCTAA 190
TCGGCCGATCGG 32
GACTTCCAAGCC 76
CCGCGGCATCGG 36
TAGCGATCGTGG 125
...
Notice: Your sed 's/.*CB:Z:... matches the last instance, whereas my awk 'match($0,/CB:Z:[ACGT]*/)... matches the first.
Notice 2: Quoting @Sundeep in the comments: using LC_ALL=C mawk '..' will give even better speed.
With perl
perl -ne '$h{$&}++ if /CB:Z:\K[ACGT]++/; END{print "$_ $h{$_}\n" for keys %h}'
CB:Z:\K[ACGT]++ will match any sequence of ACGT characters preceded by CB:Z:. \K is used here to prevent CB:Z: from being part of the matched portion, which is available via the $& variable.
Sample timings with the small.bam input file. mawk is fastest for this input, but that might change for a larger input file.
# script.awk is the one mentioned in James Brown's answer
# result here shown with GNU awk
$ time LC_ALL=C awk -f script.awk small.bam > f1
real 0m0.092s
# mawk is faster compared to GNU awk for this use case
$ time LC_ALL=C mawk -f script.awk small.bam > f2
real 0m0.054s
$ time perl -ne '$h{$&}++ if /CB:Z:\K[ACGT]++/; END{print "$_ $h{$_}\n" for keys %h}' small.bam > f3
real 0m0.064s
$ diff -sq <(sort f1) <(sort f2)
Files /dev/fd/63 and /dev/fd/62 are identical
$ diff -sq <(sort f1) <(sort f3)
Files /dev/fd/63 and /dev/fd/62 are identical
Better to avoid parsing the output of samtools view in the first place. Here's one way to get what you need just using python and the pysam library:
import pysam
from collections import defaultdict
counts = defaultdict(int)
tag = 'CB'
with pysam.AlignmentFile('small.bam') as sam:
    for aln in sam:
        if aln.has_tag(tag):
            counts[ aln.get_tag(tag) ] += 1

for k, v in counts.items():
    print(k, v)
Following your original pipeline approach:
pcre2grep -o 'CB:Z:\K[^\t]*' small.bam |
awk '{++c[$0]} END {for (i in c) print i,c[i]}'
In case you're interested in trying to speed up sed (although it's not likely to be the fastest):
sed 't a;s/CB:Z:/\n/;D;:a;s/\t/\n/;P;d' small.bam |
awk '{++c[$0]} END {for (i in c) print i,c[i]}'
The above syntax is compatible with GNU sed.
Regarding the AWK-based solutions, I've noticed few taking advantage of FS.
I'm not too familiar with the BAM format. If CB only shows up once per line, then
mawk/mawk2/gawk -b 'BEGIN { FS = "CB:Z:";
} $2 ~ /^[ACGT]/ { # if FS never matches, $2 would be beyond
# end of line, then this would just match
# against null string, & eval to false
seen[substr($2, 1, -1 + match($2, /[^ACGT]|$/))]++
} END { for (x in seen) { print seen[x] " " x } }'
If it shows up more than once, then change that to a loop over any field greater than 1 (a sketch follows below). This version uses the laziest evaluation model possible to speed it up, then does the uniq -c part at the end.
While this is rather similar to the best answer above, having FS pre-split the fields causes match() and substr() to do a lot less work. I'm simply matching 1 single char after the genetic sequence, and directly using its return, minus 1, as the substring length, skipping RSTART and RLENGTH altogether.
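As for the multi-occurrence case mentioned above, a minimal sketch (untested) that loops over every field after the FS split; mawk is used here, but any of the awks named above should work, with input piped in the same way:
mawk 'BEGIN { FS = "CB:Z:" }
NF > 1 {
    for (i = 2; i <= NF; i++)               # each field after the 1st starts right after a CB:Z:
        if ($i ~ /^[ACGT]/)
            seen[substr($i, 1, match($i, /[^ACGT]|$/) - 1)]++
}
END { for (x in seen) print seen[x] " " x }'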
Regarding :
$ diff -sq <(sort f1) <(sort f2)
Files /dev/fd/63 and /dev/fd/62 are identical
$ diff -sq <(sort f1) <(sort f3)
Files /dev/fd/63 and /dev/fd/62 are identical
there's absolutely no need to have them physically output to disk and do a diff. Just have the output of each piped to a very high-speed hashing algorithm that adds close to no time (when the output is gigantic enough, you might even end up saving time versus going to disk).
My personal favorite is xxhash in 128-bit mode, available via python pip. It's NOT a cryptographic hash, but it's much faster than even something like MD5. This method also allows for a hassle-free comparison, since timing the benchmark also performs the accuracy check.
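For example, a sketch of that hash-based comparison in shell (this assumes the xxhsum command-line tool from the xxHash project is installed and that -H2 selects the 128-bit XXH128 variant on your version; script.awk is the one from James Brown's answer):
h1=$(LC_ALL=C mawk -f script.awk small.bam | sort | xxhsum -H2)
h2=$(perl -ne '$h{$&}++ if /CB:Z:\K[ACGT]++/; END{print "$_ $h{$_}\n" for keys %h}' small.bam | sort | xxhsum -H2)
[ "$h1" = "$h2" ] && echo "outputs identical" || echo "outputs differ"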
Is there a way to remove both duplicates and redundant substrings from a list, using shell tools? By "redundant", I mean a string that is contained within another string, so "foo" is redundant with "foobar" and "barfoo".
For example, take this list:
abcd
abc
abd
abcd
bcd
and return:
abcd
abd
uniq, sort -u and awk '!seen[$0]++' remove duplicates effectively but not redundant strings:
How to delete duplicate lines in a file without sorting it in Unix?
Remove duplicate lines without sorting
I can loop through each line recursively with grep, but this is quite slow for large files. (I have about 10^8 lines to process.)
There's an approach using a loop in Python here: Remove redundant strings based on partial strings, and Bash here: How to check if a string contains a substring in Bash, but I'm trying to avoid loops. Edit: I mean nested loops here, thanks for the clarification, @shellter.
Is there a way to use awk's match() function with an array index? This approach builds the array progressively, so it never has to search the whole file, and should therefore be faster for large files. Or am I missing some other simple solution?
An ideal solution would allow matching of a specified column, as for the methods above.
EDIT
Both of the answers below work, thanks very much for the help. Currently testing for performance on a real dataset, will update with results and accept an answer. I tested both approaches on the same input file, which has 430,000 lines, of which 417,000 are non-redundant. For reference, my original looped grep approach took 7h30m with this file.
Update:
James Brown's original solution took 3h15m and Ed Morton's took 8h59m. On a smaller dataset, James's updated version was 7m versus the original's 20m. Thank you both, this is really helpful.
The data I'm working with are around 110 characters per string, with typically hundreds of thousands of lines per file. The way in which these strings (which are antibody protein sequences) are created can lead to characters from one or both ends of the string getting lost. Hence, "bcd" is likely to be a fragment of "abcde".
An awk that on first run extracts and stores all substrings and strings to two arrays subs and strs and checks on second run:
$ awk '
NR==FNR {                                   # first run
    if(($0 in strs)||($0 in subs))          # process only unseen strings
        next
    len=length()-1                          # initial substring length
    strs[$0]                                # hash the complete strings
    while(len>=1) {
        for(i=1;i+len-1<=length();i++) {    # get all substrings of current len
            asub=substr($0,i,len)           # sub was already reserved :(
            if(asub in strs)                # if substring is in strs
                delete strs[asub]           # we do not want it there
            subs[asub]                      # hash all substrings too
        }
        len--
    }
    next
}
($0 in strs)&&++strs[$0]==1' file file
Output:
abcd
abd
I tested the script with about 30 M records of 1-20 char ACGT strings. The script ran 3m27s and used about 20 % of my 16 GB. Using strings of length 1-100 I OOM'd in a few minutes (tried it again with about 400k records of length 50-100, and it used about 200 GB and ran about an hour). (20 M records of 1-30 chars ran 7m10s and used 80 % of the mem.)
So if your data records are short or you have unlimited memory, my solution is fast but in the opposite case it's going to crash running out of memory.
Edit:
Another version that tries to preserve memory. On the first run it checks the min and max lengths of the strings, and on the second run it won't store substrings shorter than the global min. For about 400 k records of length 50-100 it used around 40 GB and ran 7 mins. My random data didn't have any redundancy, so input == output. It did remove redundancy with other datasets (2 M records of 1-20 char strings):
$ awk '
BEGIN {
    while((getline < ARGV[1])>0)            # 1st run, check min and max lengths
        if(length()<min||min=="")           # TODO: test for length()>0, too
            min=length()
        else if(length()>max||max=="")
            max=length()
    # print min,max > "/dev/stderr"         # debug
    close(ARGV[1])
    while((getline < ARGV[1])>0) {          # 2nd run, hash strings and substrings
        # if(++nr%10000==0)                 # debug
        #     print nr > "/dev/stderr"      # debug
        if(($0 in strs)||($0 in subs))
            continue
        len=length()-1
        strs[$0]
        while(len>=min) {
            for(i=1;i+len-1<=length();i++) {
                asub=substr($0,i,len)
                if(asub in strs)
                    delete strs[asub]
                subs[asub]
            }
            len--
        }
    }
    close(ARGV[1])
    while((getline < ARGV[1])>0)            # 3rd run, output
        if(($0 in strs)&&!strs[$0]++)
            print
}' file
$ awk '{print length($0), $0}' file |
sort -k1,1rn -k2 -u |
awk '!index(str,$2){str = str FS $2; print $2}'
abcd
abd
The above assumes the set of unique values will fit in memory.
EDIT
This won't work. Sorry.
@Ed's solution is the best idea I can imagine without some explicit looping, and even that is implicitly scanning over the near-entire growing history of data on every record. It has to.
Can your existing resources hold that whole column in memory, plus a delimiter per record? If not, then you're going to be stuck with either very complex optimization algorithms, or VERY slow redundant searches.
Original post left for reference in case it gives someone else an inspiration.
That's a lot of data.
Given the input file as-is,
while read next
do  [[ "$last" == "$next" ]] && continue                  # throw out repeats
    [[ "$last" =~ $next ]] && continue                    # throw out substrings
    [[ "$next" =~ $last ]] && { last="$next"; continue; } # upgrade if last is a substring of next
    echo $last                                            # distinct string
    last="$next"                                          # set new key
done < file
yields
abcd
abd
With a file of that size I wouldn't trust that sort order, though. Sorting is going to be very slow and take a lot of resources, but will give you more trustworthy results. If you can sort the file once and use that output as the input file, great. If not, replace that last line with done < <( sort -u file ) or something to that effect.
Reworking this logic in awk will be faster.
$: sort -u file | awk '1==NR{last=$0} last~$0{next} $0~last{last=$0;next} {print last;last=$0}'
Aside from the sort this uses trivial memory and should be very fast and efficient, for some value of "fast" on a file with 10^8 lines.
I'm trying to output lines of a CSV file which is quite large. In the past I have tried different things and ultimately come to find that Linux's command line interface (sed, awk, grep, etc) is the fastest way to handle these types of files.
I have a CSV file like this:
1,rand1,rand2
4,randx,randy,
6,randz,randq,
...
1001,randy,randi,
1030,rando,randn,
1030,randz,randc,
1036,randp,randu
...
1230994,randm,randn,
1230995,randz,randl,
1231869,rande,randf
Although the first column is numerically increasing, the space between each number varies randomly. I need to be able to output all lines that have a value between X and Y in their first column.
Something like:
sed ./csv -min --col1 1000 -max --col1 1400
which would output all the lines that have a first column value between 1000 and 1400.
The lines are different enough that in a >5 GB file there might only be ~5 duplicates, so it wouldn't be a big deal if it counted the duplicates only once -- but it would be a big deal if it threw an error due to a duplicate line.
I may not know whether particular line values exist (e.g. 1000 is a rough estimate and should not be assumed to exist as a first column value).
Optimizations matter when it comes to large files; the following awk command:
is parameterized (uses variables to define the range boundaries)
performs only a single comparison for records that come before the range.
exits as soon as the last record of interest has been found.
awk -F, -v from=1000 -v to=1400 '$1 < from { next } $1 > to { exit } 1' ./csv
Because awk performs numerical comparison (with input fields that look like numbers), the range boundaries needn't match field values precisely.
You can easily do this with awk, though it won't take full advantage of the file being sorted:
awk -F , '$1 > 1400 { exit(0); } $1 >= 1000 { print }' file.csv
If you know that the numbers are increasing and unique, you can use addresses like this:
sed '/^1000,/,/^1400,/!d' infile.csv
which deletes (does not print) any line outside the block between the line that matches /^1000,/ and the line that matches /^1400,/.
Notice that this doesn't work if 1000 or 1400 don't actually exist as values, i.e., it wouldn't print anything at all in that case.
In any case, as demonstrated by the answers from mklement0 and that other guy, awk is the better choice here.
Here's a bash-version of the script:
#! /bin/bash
fname="$1"
start_nr="$2"
end_nr="$3"
while IFS=, read -r nr rest || [[ -n $nr && -n $rest ]]; do
    if (( $nr < $start_nr )); then continue;
    elif (( $nr > $end_nr )); then break; fi
    printf "%s,%s\n" "$nr" "$rest"
done < "$fname"
You would then call it as: script.sh foo.csv 1000 2000
The script will start printing when the number is large enough and then immediately stops when the number gets above the limit.
I am trying to extract all the duplicates based on the first column/index of my very large text/csv file (7+ GB / 100+ Million lines). Format is like so:
foo0:bar0
foo1:bar1
foo2:bar2
The first column is any lowercase UTF-8 string and the second column is any UTF-8 string. I have been able to sort my file based on the first column, and only the first column, with:
sort -t':' -k1,1 filename.txt > output_sorted.txt
I have also been able to drop all duplicates with:
sort -t':' -u -k1,1 filename.txt > output_uniq_sorted.txt
These operations take 4-8 min.
I am now trying to extract all duplicates based on the first column and only the first column, to make sure all entries in the second columns are matching.
I think I can achieve this with awk with this code:
BEGIN { FS = ":" }
{
count[$1]++;
if (count[$1] == 1){
first[$1] = $0;
}
if (count[$1] == 2){
print first[$1];
}
if (count[$1] > 1){
print $0;
}
}
running it with:
awk -f awk.dups input_sorted.txt > output_dup.txt
Now the problem is this takes way too long (3+ hours and not yet done). I know uniq can get all duplicates with something like:
uniq -D sorted_file.txt > output_dup.txt
The problem is specifying the delimiter and only using the first column. I know uniq has -f N to skip the first N fields. Is there a way to get these results without having to change/process my data? Is there another tool that could accomplish this? I have already used python + pandas with read_csv to get the duplicates, but this leads to errors (segmentation fault) and isn't efficient, since I shouldn't have to load all the data into memory when the data is already sorted. I have decent hardware:
i7-4700HQ
16GB ram
256GB ssd samsung 850 pro
Anything that can help is welcome,
Thanks.
SOLUTION FROM BELOW
Using:
awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c==1{print p0} c'
with the command time I get the following performance.
real 0m46.058s
user 0m40.352s
sys 0m2.984s
If your file is already sorted, you don't need to store more than one line; try this:
$ awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c==1{print p0} c' sorted.input
If you try this please post the timings...
I have changed the awk script slightly because I couldn't fully understand what was happening in the above answer.
awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c>=1{if(c==1){print p0;} print $0}' sorted.input > duplicate.entries
I have tested and this produces the same output as the above but might be easier to understand.
{if(p!=$1){p=$1; c=0; p0=$0} else c++}
If the first token in the line is not the same as the previous one, we save the first token, set c to 0, and save the whole line into p0. If it is the same, we increment c.
c>=1{if(c==1){print p0;} print $0}
In the case of a repeat, we check if it's the first repeat. If that's the case, we print the saved line and the current line; if not, we just print the current line.
I would like to diff two very large files (multi-GB), using linux command line tools, and see the line numbers of the differences. The order of the data matters.
I am running on a Linux machine and the standard diff tool gives me the "memory exhausted" error. -H had no effect.
In my application, I only need to stream the diff results. That is, I just want to visually look at the first few differences, I don't need to inspect the entire file. If there are differences, a quick glance will tell me what is wrong.
'comm' seems well suited to this, but it does not display line numbers of the differences.
In general, my multi-GB files only have a few hundred lines that are different, the rest of the file is the same.
Is there a way to get comm to dump the line number? Or a way to make diff run without loading the entire file into memory? (like cutting the input files into 1k blocks, without actually creating a million 1k-files in my filesystem and cluttering everything up)?
I won't use comm, but as you said WHAT you need, in addition to HOW you thought you should do it, I'll focus on the "WHAT you need" instead :
An interesting way would be to use paste and awk: paste can show 2 files "side by side" using a separator. If you use \n as the separator, it displays the 2 files with line 1 of each, followed by line 2 of each, etc.
So the script you could use could simply be (once you know that there are the same number of lines in each file):
paste -d '\n' /tmp/file1 /tmp/file2 | awk '
NR%2 { linefirstfile=$0 ; }
!(NR%2) { if ( $0 != linefirstfile )
{ print "line",NR/2,": "; print linefirstfile ; print $0 ; } }'
(Interestingly, this solution can easily be extended to do a diff of N files in a single read, whatever the sizes of the N files are ... just add a check that all have the same number of lines before doing the comparison steps (otherwise "paste" will in the end show only lines from the bigger files).)
Here is a (short) example, to show how it works:
$ cat > /tmp/file1
A
C %FORGOT% fmsdflmdflskdf dfldksdlfkdlfkdlkf
E
$ cat > /tmp/file2
A
C sdflmsdflmsdfsklmdfksdmfksd fmsdflmdflskdf dfldksdlfkdlfkdlkf
E
$ paste -d '\n' /tmp/file1 /tmp/file2
A
A
C %FORGOT% fmsdflmdflskdf dfldksdlfkdlfkdlkf
C sdflmsdflmsdfsklmdfksdmfksd fmsdflmdflskdf dfldksdlfkdlfkdlkf
E
E
$ paste -d '\n' /tmp/file1 /tmp/file2 | awk '
NR%2 { linefirstfile=$0 ; }
!(NR%2) { if ( $0 != linefirstfile )
{ print "line",NR/2,": "; print linefirstfile ; print $0 ; } }'
line 2 :
C %FORGOT% fmsdflmdflskdf dfldksdlfkdlfkdlkf
C sdflmsdflmsdfsklmdfksdmfksd fmsdflmdflskdf dfldksdlfkdlfkdlkf
If it happens that the files don't have the same number of lines, then you can first add a check of the line counts, comparing $(wc -l /tmp/file1) and $(wc -l /tmp/file2), and only do the paste ... | awk if they have the same number of lines, to ensure the paste works correctly by always having one line of each! (But of course, in that case, there will be one extra (fast!) entire read of each file...)
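For instance, a minimal sketch of that guard, reusing the same /tmp/file1 and /tmp/file2 and the awk from above:
if [ "$(wc -l < /tmp/file1)" -eq "$(wc -l < /tmp/file2)" ]; then
    paste -d '\n' /tmp/file1 /tmp/file2 | awk '
        NR%2    { linefirstfile=$0 }
        !(NR%2) { if ( $0 != linefirstfile )
                    { print "line",NR/2,": "; print linefirstfile ; print $0 } }'
else
    echo "line counts differ, not comparing line by line" >&2
fi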
You can easily adjust it to display exactly as you need it to. And you could quit after the Nth difference (either automatically, with a counter in the awk loop, as sketched below, or by pressing CTRL-C when you have seen enough).
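A sketch of that automatic stop, adding a counter to the awk part (maxdiff is an illustrative name, not from the original answer):
paste -d '\n' /tmp/file1 /tmp/file2 | awk -v maxdiff=10 '
    NR%2    { linefirstfile=$0 }
    !(NR%2) { if ( $0 != linefirstfile ) {
                print "line",NR/2,": "; print linefirstfile ; print $0
                if (++ndiff >= maxdiff) exit      # stop after the Nth difference
              } }'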
Which versions of diff have you tried? GNU diff has a "--speed-large-files" which may help.
The comm tool assumes the lines are sorted.