How can I sort a 10GB file? - linux

I'm trying to sort a big table stored in a file. The format of the file is
(ID, intValue)
The data is sorted by ID, but what I need is to sort the data using the intValue, in descending order.
For example
ID | IntValue
1 | 3
2 | 24
3 | 44
4 | 2
to this table
ID | IntValue
3 | 44
2 | 24
1 | 3
4 | 2
How can I use the Linux sort command to do the operation? Or do you recommend another way?

How can I use the Linux sort command to do the operation? Or do you recommend another way?
As others have already pointed out, see man sort for -k & -t command line options on how to sort by some specific element in the string.
Now, the sort also has facility to help sort huge files which potentially don't fit into the RAM. Namely the -m command line option, which allows to merge already sorted files into one. (See merge sort for the concept.) The overall process is fairly straight forward:
Split the big file into small chunks. Use for example the split tool with the -l option. E.g.:
split -l 1000000 huge-file small-chunk
Sort the smaller files. E.g.
for X in small-chunk*; do sort -t'|' -k2 -nr < $X > sorted-$X; done
Merge the sorted smaller files. E.g.
sort -t'|' -k2 -nr -m sorted-small-chunk* > sorted-huge-file
Clean-up: rm small-chunk* sorted-small-chunk*
The only thing you have to take special care about is the column header.

How about:
sort -t' ' -k2 -nr < test.txt
where test.txt
$ cat test.txt
1 3
2 24
3 44
4 2
gives sorting in descending order (option -r)
$ sort -t' ' -k2 -nr < test.txt
3 44
2 24
1 3
4 2
while this sorts in ascending order (without option -r)
$ sort -t' ' -k2 -n < test.txt
4 2
1 3
2 24
3 44
in case you have duplicates
$ cat test.txt
1 3
2 24
3 44
4 2
4 2
use the uniq command like this
$ sort -t' ' -k2 -n < test.txt | uniq
4 2
1 3
2 24
3 44

Related

extract numbers from repeated block

I have a text file that contain repeated block of nos as given below(with > symbol) and I want to extract repeated values only once.
Input
>
1
2
3
4
10
100
>
1
2
3
4
10
100
expected output
1
2
3
4
10
100
I tried script
#!/bin/sh
uniq -d input > output
but it gives output same as input.can anybody suggest a solution on this.Thanks.
you need to sort it first. Try this
sort input | uniq -d
or if you want to save it as a file
sort input | uniq -d > output
Suggesting to try:
sort -n -u input > output

If first two columns are equal, select top 3 based on descending order of 3rd column

I want to select top 3 results for every line that has the same first two column.
For example the data will look like,
cat data.txt
A A 10
A A 1
A A 2
A A 5
A A 8
A B 1
A B 2
A C 6
A C 5
A C 10
A C 1
B A 1
B A 1
B A 2
B A 8
And for the result I want
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 1
B A 1
B A 2
Note that some of the "groups" do not contain 3 rows.
I have tried
sort -k1,1 -k2,2 -k3,3nr data.txt | sort -u -k1,1 -k2,2 > 1.txt
comm -23 <(sort data.txt) <(sort 1.txt)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 2.txt
comm -23 <(sort data.txt) <(cat 1.txt 2.txt | sort)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 3.txt
It seems like it's working but since I am learning to code better was wondering if there was a better way to go about this. Plus, my code will generate many files that I will have to delete.
You can do:
$ sort -k1,1 -k2,2 -k3,3nr file | awk 'a[$1,$2]++<3'
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 8
B A 2
B A 1
Explanation:
There are two key items to understand the awk program; associative arrays and fields.
If you reference an empty awk array element, it is an empty container -- ready for anything you put into it. You can use that as a counter.
You state If first two columns are equal...
The sort puts the file in order desired. The statement a[$1,$2] uses the values of the first two fields as a unique entry into an associative array.
You then state ...select top 3 based on descending order of 3rd column...
Once again, the sort put the file into the desired order, and the statement a[$1,$2]++ counts them. Now just count up to three.
awk is organized into blocks of condition {action} The statement a[$1,$2]++<3 is true until there are more than 3 of the same pattern seen.
A wordier version of the program would be:
awk 'a[$1,$2]++<3 {print $0}'
But the default action if the condition is true is to print $0 so it is not needed.
If you are processing text in Unix, you should get to know awk. It is the most powerful tool that POSIX guarantees you will have, and is commonly used for these tasks.
Great place to start is the online book Effective AWK Programming by Arnold D. Robbins
#Dawg has the best answer. This one will be a little lighter on memory, which probably won't be a concern for your data:
sort -k1,2 -k3,3nr file |
awk '
{key = $1 FS $2}
prev != key {prev = key; count = 1}
count <= 3 {print; count++}
'
You can sort the file by first two columns primarily and by the 3rd one numerically secondarily, then read the output and only print the first three lines for each combination of the first two columns.
sort -k1,2 -k3,3rn data.txt \
| while read c1 c2 n ; do
if [[ $c1 == $l1 && $c2 == $l2 ]] ; then
((c++))
else
c=0
fi
if (( c < 3 )) ; then
echo $c1 $c2 $n
l1=$c1
l2=$c2
fi
done

Find unique lines

How can I find the unique lines and remove all duplicates from a file?
My input file is
1
1
2
3
5
5
7
7
I would like the result to be:
2
3
sort file | uniq will not do the job. Will show all values 1 time
uniq has the option you need:
-u, --unique
only print unique lines
$ cat file.txt
1
1
2
3
5
5
7
7
$ uniq -u file.txt
2
3
Use as follows:
sort < filea | uniq > fileb
You could also print out the unique value in "file" using the cat command by piping to sort and uniq
cat file | sort | uniq -u
While sort takes O(n log(n)) time, I prefer using
awk '!seen[$0]++'
awk '!seen[$0]++' is an abbreviation for awk '!seen[$0]++ {print}', print line(=$0) if seen[$0] is not zero.
It take more space but only O(n) time.
I find this easier.
sort -u input_filename > output_filename
-u stands for unique.
you can use:
sort data.txt| uniq -u
this sort data and filter by unique values
uniq -u has been driving me crazy because it did not work.
So instead of that, if you have python (most Linux distros and servers already have it):
Assuming you have the data file in notUnique.txt
#Python
#Assuming file has data on different lines
#Otherwise fix split() accordingly.
uniqueData = []
fileData = open('notUnique.txt').read().split('\n')
for i in fileData:
if i.strip()!='':
uniqueData.append(i)
print uniqueData
###Another option (less keystrokes):
set(open('notUnique.txt').read().split('\n'))
Note that due to empty lines, the final set may contain '' or only-space strings. You can remove that later. Or just get away with copying from the terminal ;)
#
Just FYI, From the uniq Man page:
"Note: 'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use 'sort -u' without 'uniq'. Also, comparisons honor the rules specified by 'LC_COLLATE'."
One of the correct ways, to invoke with:
#
sort nonUnique.txt | uniq
Example run:
$ cat x
3
1
2
2
2
3
1
3
$ uniq x
3
1
2
3
1
3
$ uniq -u x
3
1
3
1
3
$ sort x | uniq
1
2
3
Spaces might be printed, so be prepared!
uniq -u < file will do the job.
uniq should do fine if you're file is/can be sorted, if you can't sort the file for some reason you can use awk:
awk '{a[$0]++}END{for(i in a)if(a[i]<2)print i}'
sort -d "file name" | uniq -u
this worked for me for a similar one. Use this if it is not arranged.
You can remove sort if it is arranged
This was the first i tried
skilla:~# uniq -u all.sorted
76679787
76679787
76794979
76794979
76869286
76869286
......
After doing a cat -e all.sorted
skilla:~# cat -e all.sorted
$
76679787$
76679787 $
76701427$
76701427$
76794979$
76794979 $
76869286$
76869286 $
Every second line has a trailing space :(
After removing all trailing spaces it worked!
thank you
Instead of sorting and then using uniq, you could also just use sort -u. From sort --help:
-u, --unique with -c, check for strict ordering;
without -c, output only the first of an equal run

Why does "uniq" count identical words as different?

I want to calculate the frequency of the words from a file, where the words are one by line. The file is really big, so this might be the problem (it counts 300k lines in this example).
I do this command:
cat .temp_occ | uniq -c | sort -k1,1nr -k2 > distribution.txt
and the problem is that it gives me a little bug: it considers the same words as different.
For example, the first entries are:
306 continua
278 apertura
211 eventi
189 murah
182 giochi
167 giochi
with giochi repeated twice as you can see.
At the bottom of the file it becomes even worse and it looks like this:
1 win
1 win
1 win
1 win
1 win
1 win
1 win
1 win
1 win
1 winchester
1 wind
1 wind
for all the words.
What am I doing wrong?
Try to sort first:
cat .temp_occ | sort| uniq -c | sort -k1,1nr -k2 > distribution.txt
Or use "sort -u" which also eliminates duplicates. See here.
The size of the file has nothing to do with what you're seeing. From the man page of uniq(1):
Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use 'sort -u' without
'uniq'. Also, comparisons honor the rules specified by 'LC_COLLATE'.`
So running uniq on
a
b
a
will return:
a
b
a
Is it possible that some of the words have whitespace characters after them? If so you should remove them using something like this:
cat .temp_occ | tr -d ' ' | uniq -c | sort -k1,1nr -k2 > distribution.txt

concatenating linux program outputs, and returning only those which repeat

I have multiple programs which each produce lines of output. How do I concatenate those outputs, and then only return one copy of each line which repeated at least once? In other words, I want to return the set intersection of all response lines.
for example:
$ progA
9
13
14
15
$ progA --someFlag
13
14
15
100
$ progB
14
15
-42
$ magicFunction 'progA' 'progA --someFlag' 'progB'
14
15
This doesn't have to be a function per se. I just wanted a unix command-line way.
How about:
( progA; progA --someFlag; progB ) | sort | uniq -d
The -d option for uniq forces it to output only lines with more than one copy.
Here's a variant of the one-liner above that does not use a subshell:
{ progA; progA --someFlag; progB; } | sort | uniq -d
This works at least in bash. Note the required terminating semicolon (;) after the last command in the curly braces.
The solutions above don't really compute the set intersection of all 3 outputs. uniq -d will also output lines which are output only by 2 of the 3 programs.
Here's my take on it:
progA | sort > f1; progA --someFlag | sort > f2; progB | sort > f3; comm -1 -2 f1 f2 | comm -1 -2 f3 -; rm f[123]

Resources