sort -u cannot remove the duplicate lines - linux

I am new to Linux shell commands and I am learning the sort command.
The input file is as follows:
a 1
b 2
a 0
I want to use the first column as the sort key and use the '-u' option to remove the line "a 0", because it has the same key as the first line, and the manual says '-u' keeps only the first of an equal run.
When I used the command sort -k 1 -u text, the result is:
a 0
a 1
b 2
However, when I used the command sort -k 1,1 -u text, the output is:
a 1
b 2
Can anyone tell me what the difference between the two commands is?

-k 1
will sort from field 1 till the end of line.
-k 1,1
will sort only by the first field. You defined a stop position.
That is the reason why you got different output.
Read KEYDEF in sort man page.

The -k option sets the key as fields from a start position [to an optional stop position]. So -k1 adds nothing (it is actually useless), since it defines the whole record, which is the default. By setting -k1,1 you're asking sort to use only the first field as the key, hence the desired result.
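To see the difference concretely, here is a sketch run against the sample input above (assuming it is saved in a file named text):
sort -k 1 -u text      # key is field 1 through end of line, so "a 0" and "a 1" stay distinct
a 0
a 1
b 2
sort -k 1,1 -u text    # key is field 1 only, so "a 0" and "a 1" compare equal and only the first input line survives
a 1
b 2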

Related

How to sort text file based on number after first letter

I have a text file in linux that looks like this:
H-988 -0.5418829321861267 no
H-989 -0.5033702254295349 yes
H-990 -1.1516857147216797 hi
H-99 -0.5005123019218445 hello
I want to sort this file based on the number coming after the hyphen. So the order should be:
H-99 -0.5005123019218445 hello
H-988 -0.5418829321861267 no
H-989 -0.5033702254295349 yes
H-990 -1.1516857147216797 hi
I tried the plain sort command but it did not work. For example, it puts 95 after 949 instead of before it; the same goes for 99 and 990, as in the example provided.
You should sort numerically:
sort --numeric-sort --field-separator "-" --key 2 some.txt
or a shorter version
sort -n -t "-" -k 2 some.txt
If the first field always has only two characters before the number, you can skip them:
sort -k1.3,1n ip.txt
-k1.3,1n tells sort to start considering field 1 only from its third character (skipping the two characters before the number) and to compare the result numerically.
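As a quick sanity check, here is a hypothetical run of the first variant on the sample data (assuming it is saved as ip.txt):
sort -n -t "-" -k 2 ip.txt
H-99 -0.5005123019218445 hello
H-988 -0.5418829321861267 no
H-989 -0.5033702254295349 yes
H-990 -1.1516857147216797 hi
sort -k1.3,1n ip.txt gives the same ordering, since both keys reduce to the number after the hyphen.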

Linux bash sorting in reverse order based on a column isn't working as expected

I'm trying to sort a text file in the descending order based on the last column. It just doesn't seem to work.
cat 1.txt | sort -r -n -k 4,4
ACHG,89.46,0.08,34200
UUKJL,0.85,-15.00,200
NIMJKY,34.35,0.09,17700
TBBNHW,10.24,0.00,4600
JJkLEYE,73.67,0.48,25400
I've tried removing spaces just in case, but that hasn't helped. I also tried sorting by the other fields just to see, but I have the same problem.
I just can't work out what is wrong with the command I've issued. Please could I request help with this one?
Your command is almost right, but it is missing the field separator option -t, which should set comma as the field separator.
This should work for you:
sort -t, -rnk 4,4 1.txt
ACHG,89.46,0.08,34200
JJkLEYE,73.67,0.48,25400
NIMJKY,34.35,0.09,17700
TBBNHW,10.24,0.00,4600
UUKJL,0.85,-15.00,200
Note that there is no need to use cat | sort here.
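For the curious: without -t, sort splits fields on blanks, and since these lines contain no blanks each whole line is a single field, so the requested key -k 4,4 is empty on every line and the numeric comparison has nothing to work with. With GNU sort you can see exactly what each key covers by adding --debug:
sort --debug -t, -rnk 4,4 1.txt    # underlines the part of each line used as the key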

How to sort by name then date modification in BASH

Let's say I have a folder of .txt files whose names have a dd-MM-yyyy_HH-mm-ss timestamp followed by _name.txt. I want to be able to sort by name first, then by time. Example:
BEFORE
15-2-2010_10-01-55_greg.txt
10-2-1999_10-01-55_greg.txt
10-2-1999_10-01-55_jason.txt
AFTER
greg_1_10-2-1999_10-01-55
greg_2_15-2-2010_10-01-55
jason_1_10-2-1999_10-01-55
Edit: Apologies; as my cp line shows, I meant to copy them into another directory under a different name.
Something I tried is making a copy with a count appended, but it doesn't sort files that share a name properly by date:
cd data/unfilteredNames
for filename in *.txt; do
n=${filename%.*}
n=${filename##*_}
filteredName=${n%.*}
count=0
find . -type f -name "*_$n" | while read name; do
count=$(($count+1))
cp -p $name ../filteredNames/"$filteredName"_"$count"
done
done
I'm not sure that renaming the files is part of your expectation; if you only want to sort the file names, you don't need to rename anything.
You can do this using the GNU sort command alone:
sort -t- -k5.4 -k3.1,3.4 -k2.1,2.1 -k1.1,1.2 -k3.6,3.13 <(printf "%s\n" *.txt)
-t sets the field separator to a dash -.
-k enables sorting based on fields. As explained in the sort man page, the syntax is -k<start>,<stop>, where <start> and <stop> are each composed of <field number>.<character position>. Passing several -k options allows sorting on multiple fields; the first one on the command line takes precedence over the others.
For example, the first key, -k5.4, sorts on the 5th field starting at an offset of 4 characters. It has no stop position because the key runs to the end of the filename.
The -k3.1,3.4 option sorts based on the 3rd field starting from offset 1 to 4.
The same principle applies to other -k options.
In your example the month field has only 1 digit. If some files code the month with 2 digits, you might want to zero-pad the single-digit months first. This can be done by piping the printf output through sed, e.g. <(printf "%s\n" *.txt | sed 's/^\([0-9][0-9]*\)-\([0-9]\)-/\1-0\2-/'), and changing -k2.1,2.1 to -k2.1,2.2.
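Putting it together, here is a hypothetical run in a directory containing the three sample files, followed by a renaming loop in the spirit of the question's cp attempt (the target directory ../filteredNames is taken from the question; the counter logic needs bash 4 for associative arrays):
sort -t- -k5.4 -k3.1,3.4 -k2.1,2.1 -k1.1,1.2 -k3.6,3.13 <(printf "%s\n" *.txt)
10-2-1999_10-01-55_greg.txt
15-2-2010_10-01-55_greg.txt
10-2-1999_10-01-55_jason.txt
declare -A count
while IFS= read -r f; do
  name=${f##*_}            # e.g. greg.txt
  name=${name%.txt}        # e.g. greg
  stamp=${f%_*}            # e.g. 15-2-2010_10-01-55
  count[$name]=$(( ${count[$name]:-0} + 1 ))
  cp -p -- "$f" "../filteredNames/${name}_${count[$name]}_${stamp}"
done < <(sort -t- -k5.4 -k3.1,3.4 -k2.1,2.1 -k1.1,1.2 -k3.6,3.13 <(printf "%s\n" *.txt))
This produces greg_1_10-2-1999_10-01-55, greg_2_15-2-2010_10-01-55 and jason_1_10-2-1999_10-01-55, matching the AFTER listing.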

Bash script - delete duplicates

I need to extract names from a file and delete duplicates.
output.txt:
Server001-1
Server001-2
Server001-3
Server001-4
Server002-1
Server002-2
Server003-1
Server003-2
Server003-3
I need to have only the following output.
Server001-1
Server002-1
Server003-1
So, only print the first server line for every server group (Server00*) and delete the rest in that group.
Try it simply with awk:
awk -F"-" '!a[$1]++' Input_file
Explanation: set the field separator to - and create an array named a indexed by the current line's 1st field. The condition !a[$1] checks whether the 1st field has not been seen before in array a; if it hasn't, the line is printed. The ++ then sets that field's occurrence count to 1 in a, so later lines with the same 1st field are not printed.
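A quick run against the sample output.txt from the question:
awk -F"-" '!a[$1]++' output.txt
Server001-1
Server002-1
Server003-1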
awk -F- 'dat[$1]=="" { dat[$1]=$0 } END { for (i in dat) {print dat[i]}}' filename
result:
Server001-1
Server002-1
Server003-1
Create an array keyed on the first dash-delimited field, storing the complete line only when there is no entry yet for that key; this ensures that only the first entry per key is stored. Then loop through the array and print each stored line. Note that awk's for (i in dat) does not guarantee any particular order.
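If deterministic output order matters, a small sketch of the same idea with an explicit sort appended (assuming the keys sort lexically the way you want):
awk -F- 'dat[$1]=="" { dat[$1]=$0 } END { for (i in dat) print dat[i] }' output.txt | sort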
A simple GNU datamash solution:
datamash -t'-' -g1 first 2 <file
-t'-' - sets the field separator
-g1 - groups lines by the 1st field
first 2 - gets only the first value of the 2nd field for each group. Can also be changed to the min 2 operation
The output:
Server001-1
Server002-1
Server003-1
Since you've mentioned the string format as Server00*, you can simply use this one:
grep -P "Server\d+-1" file
Server\d+ covers cases like Server1000, Server100000, etc. (note that \d is PCRE syntax, so it needs grep -P; with grep -E use Server[0-9]+-1 instead)
or even
grep '[-]1$' file
Output for both :
Server001-1
Server002-1
Server003-1
A simple way, with just 1 command line, to get a general unique result:
nin output.txt nul "^(\w+)-\d+" -u -w
Explanation:
nul is a non-existent Windows file, like /dev/null on Linux.
-u to get unique results, -w to output whole lines. Ignore case? Use -i.
"^(\w+)-\d+" uses the same regex syntax as C++/C#/Java/Scala, etc.
Save to a file? nin output.txt nul "^(\w+)-\d+" -u -w > result.txt
Save to a file with summary info? nin output.txt nul "^(\w+)-\d+" -u -w -I > result.txt
For further automation with nin.exe: the result count is the return value %ERRORLEVEL%.
nin.exe / nin.gcc* is a single portable executable tool to get the difference or intersection of keys/lines between 2 files, or between a pipe and a file. See the tools directory of my open project: https://github.com/qualiu/msr.
You can also see the colorful built-in usage/examples: https://qualiu.github.io/msr/usage-by-running/nin-Windows.html

Sort & uniq in Linux shell

What is the difference between the following two commands?
sort -u FILE
sort FILE | uniq
Using sort -u does less I/O than sort | uniq, but the end result is the same. In particular, if the file is big enough that sort has to create intermediate files, there's a decent chance that sort -u will use slightly fewer or slightly smaller intermediate files, since it can eliminate duplicates while sorting each batch. If the data is highly duplicative, this could be beneficial; if there are few duplicates in fact, it won't make much difference (definitely a second-order performance effect, compared to the first-order effect of the pipe).
Note that there are times when the piping is appropriate. For example:
sort FILE | uniq -c | sort -n
This sorts the file into order of the number of occurrences of each line in the file, with the most repeated lines appearing last. (It wouldn't surprise me to find that this combination, which is idiomatic for Unix or POSIX, can be squished into one complex 'sort' command with GNU sort.)
There are times when not using the pipe is important. For example:
sort -u -o FILE FILE
This sorts the file 'in situ'; that is, the output file is specified by -o FILE, and this operation is guaranteed safe (the file is read before being overwritten for output).
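By contrast, naively redirecting a pipeline back onto its own input is not safe, because the shell truncates the output file before sort gets a chance to read it:
sort FILE | uniq > FILE    # wrong: FILE is truncated to empty before sort reads it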
There is one slight difference: return code.
The thing is that unless pipefail is set (set -o pipefail in bash), the return code of a pipeline is the return code of the last command, and uniq always returns zero (success). Try examining the exit code, and you'll see something like this (pipefail is not set here):
pavel@lonely ~ $ sort -u file_that_doesnt_exist ; echo $?
sort: open failed: file_that_doesnt_exist: No such file or directory
2
pavel@lonely ~ $ sort file_that_doesnt_exist | uniq ; echo $?
sort: open failed: file_that_doesnt_exist: No such file or directory
0
Other than this, the commands are equivalent.
Beware! While it's true that "sort -u" and "sort|uniq" are equivalent, any additional options to sort can break the equivalence. Here's an example from the coreutils manual:
For example, 'sort -n -u' inspects only the value of the initial numeric string when checking for uniqueness, whereas 'sort -n | uniq' inspects the entire line.
Similarly, if you sort on key fields, the uniqueness test used by sort won't necessarily look at the entire line anymore. After being bitten by that bug in the past, these days I tend to use "sort|uniq" when writing Bash scripts. I'd rather have higher I/O overhead than run the risk that someone else in the shop won't know about that particular pitfall when they modify my code to add additional sort parameters.
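A minimal sketch of the pitfall from the manual, using made-up two-column data:
printf '1 apple\n1 banana\n' | sort -n -u
1 apple
printf '1 apple\n1 banana\n' | sort -n | uniq
1 apple
1 banana
With -n -u the two lines compare equal on the initial numeric string, so only the first survives; piping to uniq compares whole lines and keeps both.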
sort -u will be slightly faster, because it does not need to pipe the output between two commands
also see my question on the topic: calling uniq and sort in different orders in shell
I have worked on some servers where sort doesn't support the '-u' option. There we have to use
sort xyz | uniq
Nothing; they will produce the same result.
