Using sort in Linux, how can I make 12 come after 2?

I have a file containing numbers:
$ cat file
1
3
13
2
4
12
When I use cat file | sort, it displays like this:
$ cat file | sort
1
12
13
2
3
4
How can I get the answer like this:
1
2
3
4
12
13

Use the -n option to enable numerical sorting:
$ cat file | sort -n
This is faster and more portable than -g, which is a GNU extension not specified by POSIX.
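The same thing without the pipe (the cat is not needed), together with the output it produces on the file above:
$ sort -n file
1
2
3
4
12
13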

Use the -g option of sort for general numeric sorting (it can be slow for large inputs):
$ sort -g file
or:
$ sort -n file
The difference can be found in a related question.
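To see the practical difference on GNU sort: -n stops parsing at the e of a scientific-notation value, while -g understands it:
$ printf '1e3\n20\n' | sort -n
1e3
20
$ printf '1e3\n20\n' | sort -g
20
1e3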
UPD: Fixed the useless cat as stated in comments.

Related

Merge text files in a numeric order

I'm stuck with a problem. I would like to merge two text files given a specific onset time.
For example:
Text1 (in column):
30
100
200
Text2 (in column):
10
50
70
My output should be 1 text file (single column) like this:
10
30
50
70
100
200
I can use cat or merge to combine the files, but I am not sure how to take care of the ordering of the onset times.
Thank you in advance for all your help!
Like this:
sort -n file1 file2
Most sort implementations (e.g. GNU coreutils, FreeBSD, OpenBSD, macOS, uutils) have a merge option for creating one sorted file from multiple files that are already sorted.
sort -m -n text1 text2
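On the sample files from the question, both variants produce the same merged column:
$ printf '30\n100\n200\n' > text1
$ printf '10\n50\n70\n' > text2
$ sort -m -n text1 text2
10
30
50
70
100
200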
The only sort without such an option I could find is BusyBox's. But even that version tolerates an -m option: it ignores it, sorts the files as usual, and therefore still gives the expected result.
I would have assumed that using -m doesn't matter much compared to just sorting the concatenated files the way BusyBox does, since sorting algorithms tend to have optimizations for already-sorted runs. However, a small test on my system with GNU coreutils 8.28 showed otherwise:
shuf -i 1-1000000000 -n 10000000 | sort -n > text1 # 10M lines, 95MB
shuf -i 1-1000000000 -n 10000000 | sort -n > text2
time sort -m -n text1 text2 | md5sum # real: 2.0s (used only 1 CPU core)
time sort -n text1 text2 | md5sum # real: 4.5s (used 2 CPU cores)
Although you could just pass both files to sort -n, it seems inelegant not to use the fact that your input files are already sorted. If your inputs are indeed sorted, you could do something like:
awk 'BEGIN { n = (getline a < "text2") > 0 }    # prime the first line of text2
  { while (n && a+0 < $1+0) { print a; n = (getline a < "text2") > 0 } }
  1                                             # then print the current text1 line
  END { while (n) { print a; n = (getline a < "text2") > 0 } }' text1

Remove duplicates from INSANE BIG WORDLIST

What is the best way of doing this? It's a 250 GB text file, one word per line.
Input:
123
123
123
456
456
874
875
875
8923
8932
8923
Output wanted:
123
456
874
875
8923
8932
I need to keep one copy of each duplicated line. I DON'T WANT to remove both copies when there are two of the SAME line; just remove the extra one, always keeping one unique line.
What I do now:
$ cat final.txt | sort | uniq > finalnoduplicates.txt
I am running this in a screen session. Is it working? I don't know, because when I check the size of the output file, it's 0:
123user@instance-1:~$ ls -l
total 243898460
-rw-rw-r-- 1 123user 249751990933 Sep 3 13:59 final.txt
-rw-rw-r-- 1 123user 0 Sep 3 14:26 finalnoduplicates.txt
123user@instance-1:~$
But when I check htop, the CPU usage of the command running in that screen session is at 100%.
Am I doing something wrong?
You can do this using just sort. (An empty output file while the command is still running is expected: sort has to read and sort the entire input before it writes the first line.)
$ sort -u final.txt > finalnoduplicates.txt
You can simplify this further and just have sort do all of it:
$ sort -u final.txt -o finalnoduplicates.txt
Finally, since your input file is purely numerical data, you can tell sort to treat it as such via the -n switch, which may further improve the overall performance of this task:
$ sort -nu final.txt -o finalnoduplicates.txt
From sort's man page:
-n, --numeric-sort
compare according to string numerical value
-u, --unique
with -c, check for strict ordering; without -c, output only the
first of an equal run
-o, --output=FILE
write result to FILE instead of standard output
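For an input in the 250 GB range it can also pay to tell GNU sort explicitly how much memory it may use and where to put its temporary merge files; a sketch, assuming GNU coreutils and a hypothetical scratch directory /mnt/scratch with enough free space:
$ LC_ALL=C sort -nu -S 50% -T /mnt/scratch --parallel=4 final.txt -o finalnoduplicates.txt
Here -S 50% allows a buffer of up to half the machine's RAM, -T picks the directory for the temporary files (which can approach the input size), --parallel caps the number of sorting threads, and LC_ALL=C avoids locale-aware comparisons, which is usually faster.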
I found out about this awesome tool called Duplicut. The entire point of the project is deduplicating huge wordlists without hitting memory limits.
It is pretty simple to install; this is the GitHub link:
https://github.com/nil0x42/duplicut

Merging files in reverse

I am working with logs; they are split across multiple files.
Let's assume the files have the following content:
file1
1
file2
2
file3
3
By using the command cat file*, the result would be:
1
2
3
But I am looking for something where, using the file* glob, the output would be something like this:
3
2
1
Could someone help me, please?
Pass the output of cat to tac:
$ cat file*
1
2
3
$ cat file* | tac
3
2
1
You may call
ls -1r file* | xargs cat
in order to specify the order of the files. Its output differs from the tac solution, since each individual log file stays in its original line order. (Perhaps this is in fact the desired output.)
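To make the difference concrete, assuming each log file has two lines:
$ printf '1a\n1b\n' > file1
$ printf '2a\n2b\n' > file2
$ cat file* | tac
2b
2a
1b
1a
$ ls -1r file* | xargs cat
2a
2b
1a
1b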

Binned histogram of timings in log file on command line

To quickly evaluate the timings of various operations from a log file on a Linux server, I would like to extract them from the log and create a textual/TSV-style histogram. To get a better idea of how the timings are distributed, I want to bin them into ranges of 0-10ms, 10-20ms, etc.
The output should look something like this:
121 10
39 20
12 30
7 40
1 100
How can I achieve this with the usual set of Unix command-line tools?
Quick answer:
cat <file> | egrep -o '[0-9]+' | sed 's|$|/10*10|' | bc | sort -n | uniq -c
Detailed answer:
grep the pattern of your timing or number; you may need multiple grep steps to extract exactly the numbers you want from your logs
use sed to append an arithmetic expression that integer-divides each value by the desired bin width and multiplies it back on
bc performs the calculation
the well-known sort | uniq combo counts the occurrences per bin (an awk-only alternative is sketched below)
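The division and counting can also be done in a single awk process; a minimal sketch, assuming a hypothetical log file app.log and a 10 ms bin width:
$ grep -Eo '[0-9]+' app.log | awk '{ count[int($1/10)*10]++ } END { for (b in count) print count[b], b }' | sort -k2,2n
Like the bc pipeline, this prints one count per bin; the final sort is needed because awk's for (b in count) iterates in no particular order.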

Sort a tab delimited file based on column sort command bash [duplicate]

I am trying to sort this file based on the fourth column, so that its rows are reordered by the values in that column.
File:
2 1:103496792:A 0 103496792
3 1:103544434:A 0 103544434
4 1:103548497:A 0 103548497
1 1:10363487:T 0 10363487
I want it sorted like this:
1 1:10363487:T 0 10363487
2 1:103496792:A 0 103496792
3 1:103544434:A 0 103544434
4 1:103548497:A 0 103548497
I tried this command:
sort -t$'\t' -k1,1 -k2,2 -k3,3 -k 4,4 <filename>
But I get an illegal variable name error. Can somebody help me with this?
To sort on the fourth column use just the -k 4,4 selector, and add n so the comparison is numeric; a plain lexicographic -k 4,4 would put 10363487 last, because strings are compared character by character:
sort -t $'\t' -k 4,4n <filename>
You might also want -V, which sorts numbers more naturally, yielding for example 1 2 10 rather than the lexicographic 1 10 2.
sort -t $'\t' -k 4,4 -V <filename>
If you're getting errors about the $'\t' then make sure your shell is bash. Perhaps you're missing #!/bin/bash at the top of your script?
I believe the problem is your shell rather than an errant $: the illegal variable name error is what csh/tcsh prints when it encounters bash's $'\t'. Note that sort -t\t would not work either, since the shell reduces \t to a literal t. If you cannot switch to bash, generate the tab portably:
sort -t "$(printf '\t')" -nk4 <filename>
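A quick check against the question's file (tab-delimited; shown here with the same spacing as in the question):
$ sort -t "$(printf '\t')" -k 4,4n file
1 1:10363487:T 0 10363487
2 1:103496792:A 0 103496792
3 1:103544434:A 0 103544434
4 1:103548497:A 0 103548497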
