Sort hyphenated names alphabetically, then numerically - linux

I have an unsorted server list like the following:
bgsqlnp-z101
bgsqlnp-z102
bgsqlnp-z103
bgsqlnp-z2
bgsqlnp-z3
bgsqlnp-z5
dfsqlnp-z108
dfsqlnp-z4
bgsqlnp-z1
dfsqlprd-z8
fuqddev-z88
fuqhdev-z8
ghsbqudev-z18
heiappprod-z1
htsybprd-z24
Using sort to read in the file, I'm trying to get the following:
bgsqlnp-z1
bgsqlnp-z2
bgsqlnp-z3
bgsqlnp-z5
bgsqlnp-z101
bgsqlnp-z102
bgsqlnp-z103
dfsqlnp-z4
dfsqlnp-z108
dfsqlprd-z8
fuqddev-z88
fuqhdev-z8
ghsbqudev-z18
heiappprod-z1
htsybprd-z24
I'm just not able to find the right keydef for my -k option.
Here's the closest I've been able to get:
sort -k2n -t"z"
bgsqlnp-z1
bgsqlnp-z101
bgsqlnp-z102
bgsqlnp-z103
bgsqlnp-z2
bgsqlnp-z3
bgsqlnp-z5
dfsqlnp-z108
dfsqlnp-z4
dfsqlprd-z8
fuqddev-z88
fuqhdev-z8
ghsbqudev-z18
heiappprod-z1
htsybprd-z24
The numbers are in the right order, but the server names aren't sorted.
Attempts using a multi-field keydef (-k1,2n) seem to have zero effect (I get no sorting at all).
Here's some extra info about the server names:
1) All of them have a "-z[1-200]" suffix; some numbers repeat.
2) Server names are of differing lengths (4 to 16 characters), so using 'cut' is out of the question.

You can use sed to work around the multi-character separator. You can switch between numeric and dictionary order after each sort key definition; note that you need a separate -k option for each key (check the man page for details).
Something like this:
sed 's/-z/ /' file | sort -k1,1d -k2,2n | sed 's/ /-z/'
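If the sort on your system is GNU coreutils, version sort can also produce the desired order directly, without the sed round trip (a sketch; -V is an extension rather than POSIX, so check your sort's man page):
sort -V file
Version sort compares the embedded digit runs numerically, so bgsqlnp-z2 lands before bgsqlnp-z101 while the names themselves still sort alphabetically.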

Related

Linux bash: How to group an IP list into common subnets?

I have a huge list of IPs which is already sorted, but I still need to group them into subnets. For instance:
223.247.184.95
223.247.186.243
223.247.208.16
223.247.209.139
223.84.128.24
223.84.159.214 *
223.84.159.245 *
The marked IPs with the "*" should all be grouped under '223.84.159.*'. There is no database, just this text file with 10,000 entries!
I tried the awk and uniq commands, but the results are not what I want.
It is not entirely clear, as you haven't shown your expected output. Could you please try the following and let me know if it helps; it prints only those lines from Input_file which have a * in them.
awk '$2=="*"{print $1}' Input_file
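For the sample above (assuming the marked lines really do have a space before the asterisk, so that the * is a separate second field), this would print just the two flagged addresses:
223.84.159.214
223.84.159.245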
Say your IPs are in a file called ips, spread over different fields on different lines, and the field separator is the default one, namely a space.
Example:
cat ips
223.247.184.95 223.247.186.243 223.247.208.16 223.247.209.139 223.84.128.24
223.84.159.214* 223.84.159.245*
Unsorted IPs file: the following awk code
cat ips | awk '{for(i=1;i<=NF;i++){split($i,a,".");k=a[1]"."a[2]"."a[3];h[k]=h[k]" "$i}}END{for(k in h)printf(k": "h[k]"\n")}'
gives you a hash table of the different C domains:
223.247.184: 223.247.184.95
223.247.186: 223.247.186.243
223.84.159: 223.84.159.214* 223.84.159.245*
223.84.128: 223.84.128.24
223.247.208: 223.247.208.16
223.247.209: 223.247.209.139
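The same logic, expanded for readability (a sketch; it builds the three-octet prefix of each address and collects the addresses under that key):
awk '{
  for (i = 1; i <= NF; i++) {        # every IP on the line
    split($i, a, ".")                # a[1..4] = the four octets
    k = a[1] "." a[2] "." a[3]       # first three octets = the C domain
    h[k] = h[k] " " $i               # collect addresses under their prefix
  }
}
END { for (k in h) print k ":" h[k] }' ips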
Sorted IPs file: if the file containing your IPs has been sorted beforehand, then there is a more efficient and faster one-liner
cat ips | awk '{for(i=1;i<=NF;i++){split($i,a,".");cn=a[1]"."a[2]"."a[3]; if(cn != c){c=cn;printf("\n"c": ")};printf($i" ")}}END{printf("\n")}'
which likewise gives you
223.247.184: 223.247.184.95
223.247.186: 223.247.186.243
223.247.208: 223.247.208.16
223.247.209: 223.247.209.139
223.84.128: 223.84.128.24
223.84.159: 223.84.159.214* 223.84.159.245*

How to randomly sort one key while the other is kept in its original sort order with GNU "sort"

Given an input list like the following:
405:alice#level1
405:bob#level2
405:chuck#level1
405:don#level3
405:eric#level1
405:francis#level1
004:ac#jjj
004:la#jjj
004:za#zzz
101:amy#floor1
101:brian#floor3
101:christian#floor1
101:devon#floor1
101:eunuch#floor2
101:frank#floor3
005:artie#le2
005:bono#nuk1
005:bozo#nor2
(As you can see, the first field has been randomly sorted; the original input had the first field in numerical order, with 004 coming first, then 005, 101, 405, et al. The second field, however, is in alphabetical order on its first character within each group.)
What is desired is a randomized sort on the first field (as separated by a colon ':'): all lines that share the same value in the first field should stay grouped together, but the groups themselves should be randomly distributed throughout the file, and within each group the second field should be randomly sorted as well. I am unable to get this desired result, as I am not too familiar with sort keys and whatnot.
The desired output would look similar to this:
405:francis#level1
405:don#level3
405:eric#level1
405:bob#level2
405:alice#level1
405:chuck#level1
004:za#zzz
004:ac#jjj
004:la#jjj
101:christian#floor1
101:amy#floor1
101:frank#floor3
101:eunuch#floor2
101:brian#floor3
101:devon#floor1
005:bono#nuk1
005:artie#le2
005:bozo#nor2
Does anyone know how to achieve this type of sort?
Thank you!
You can do this with awk pretty easily.
As a one-liner:
awk -F: 'BEGIN{cmd="sort -R"} $1 != key {close(cmd)} {key=$1; print | cmd}' input.txt
Or, broken apart for easier explanation:
-F: - Set awk's field separator to colon.
BEGIN{cmd="sort -R"} - before we start, set a variable that is a command to do the "randomized sort". This one works for me on FreeBSD. Should work with GNU sort as well.
$1 != key {close(cmd)} - If the current line has a different first field than the last one processed, close the output pipe...
{key=$1; print | cmd} - And finally, set the "key" var, and print the current line, piping output through the command stored in the cmd variable.
This usage takes advantage of a bit of awk awesomeness. When you pipe output to a command string (whether or not it is stored in a variable), the pipe is created automatically on first use. You can close it at any time, and a subsequent use will reopen a new instance of the command.
The effect is that each time you close(cmd), you flush the current group of randomly sorted lines. awk also closes cmd automatically once you reach the end of the file.
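Here is a minimal illustration of that close-and-reopen behavior, piping to cat -n instead of sort -R so the effect is easy to see (the input is just four placeholder lines):
printf 'a\nb\nc\nd\n' |
awk '{print | "cat -n"} NR==2{close("cat -n")}'
The first two lines come out numbered 1 and 2; after close(), the remaining lines go to a fresh cat -n and are numbered 1 and 2 again.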
Of course, for this solution to work, it's vital that all lines with a shared first field are grouped together.
Not as elegant, but a different method: tag each line with a group number, shuffle everything, then pull the groups back together with a stable sort on the tag and strip it off:
$ awk -F: '!($1 in a){a[$1]=c++} {print a[$1] "\t" $0}' file |
sort -R -k2 |
sort -nk1,1 -s |
cut -f2-
Or this alternative, which doesn't assume the input is initially grouped:
$ sort -R file |
awk -F: '!($1 in a){a[$1]=c++} {print a[$1] "\t" $0}' |
sort -nk1,1 -s |
cut -f2-

Using grep for multiple patterns from multiple lines in an output file

I have data output something like this captured in a file.
List item1
attrib1: someval11
attrib2: someval12
attrib3: someval13
attrib4: someval14
List item2
attrib1: someval21
attrib2: someval12
attrib4: someval24
attrib3: someval23
List item3
attrib1: someval31
attrib2: someval32
attrib3: someval33
attrib4: someval34
I want to extract attrib1, attrib3, and attrib4 from the list of data, but only if attrib2 is "someval12".
Note that attrib3 and attrib4 could be in any order after attrib2.
So far I have tried grep with the -A and -B options, but I need to specify line counts, and that is a sort of hardcoding I want to avoid.
grep -B 1 -A 1 -A 2 "attrib2: someval12" | egrep -w "attrib1|attrib3|attrib4"
Can I use any other grep option that doesn't involve specifying the number of lines before and after the occurrence for this example?
Grep and other tools (like join, sort, uniq) work on the principle "one record per line". It is therefore possible to use a 3-step pipe:
Convert each list item to a single line, using sed.
Do the filtering, using grep.
Convert back to the original format, using sed.
First you need to pick a character that is known not to occur in the input, and use it as the separator character. For example, '|'.
Then, find the sed command for step 1, which transforms the input to the format
List item1|attrib1: someval11|attrib2: someval12|attrib3: someval13|attrib4: someval14|
List item2|attrib1: someval21|attrib2: someval12|attrib4: someval24|attrib3: someval23|
List item3|attrib1: someval31|attrib2: someval32|attrib3: someval33|attrib4: someval34|
Now step 2 is easy.
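Put together, the whole pipeline could look like this (a sketch: it assumes every attribute line starts with "attrib", that '|' never occurs in the data, and GNU sed for the final un-flattening; the last grep is essentially the egrep from the question):
sed -e :a -e '$!N;s/\n\(attrib\)/|\1/;ta' -e 'P;D' file |   # step 1: one record per line
grep 'attrib2: someval12' |                                 # step 2: keep matching records
sed 's/|/\n/g' |                                            # step 3: back to one attribute per line
grep -E -w 'attrib1|attrib3|attrib4'                        # pick the wanted attributes
The intermediate lines differ from the format shown above only in that they lack the trailing '|'.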

Performing a sort using -k1,1 only

Assume you have an unsorted file with the following content:
identifier,count=Number
identifier, extra information
identifier, extra information
...
I want to sort this file so that, for each id, the line with the count is written first, followed by the lines with extra info. I can only use the Unix sort command with the option -k1,1, but I am allowed to slightly change the lines to get this sort.
As an example, take
a,Count=1
a,giulio
aa,Count=44
aa,tango
aa,information
ee,Count=2
bb,que
f,Count=3
b,Count=23
bax,game
f,ee
c,Count=3
c,roma
b,italy
bax,Count=332
a,atlanta
bb,Count=78
c,Count=3
The output should be
a,Count=1
a,atlanta
a,giulio
aa,Count=44
aa,information
aa,tango
b,Count=23
b,italy
bax,Count=332
bax,game
bb,Count=78
bb,que
c,Count=3
c,roma
ee,Count=2
f,Count=3
f,ee
but I get:
aa,Count=44
aa,information
aa,tango
a,atlanta
a,Count=1
a,giulio
bax,Count=332
bax,game
bb,Count=78
bb,que
b,Count=23
b,italy
c,Count=3
c,Count=3
c,roma
ee,Count=2
f,Count=3
f,ee
I tried adding spaces at the end of the identifier and/or at the beginning of the count field and other characters, but none of these approaches work.
Any pointers on how to perform this sorting?
EDIT:
If you consider, for example, the products with an id starting with a, one of them has the info 'atlanta' and appears before the Count line (but I want Count to appear before any information). In addition, bb should come after b in the alphabetical order of the ids. To make my question clearer: how can I get the ids sorted in alphabetical order, and such that for a given id the line with Count appears before the others? And how can I do this using sort -k1,1 (this is a group project I am working on and I am not free to change the sorting command), possibly with slight changes to the content (I tried, for example, adding a '~' to all the infos so that Count comes first)?
You need to tell sort that the comma is used as the field separator:
sort -t, -k1,1
For ASCII sorting, make sure LC_ALL=C is set and that LANG and LANGUAGE are unset.
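Putting the two hints together, one possible invocation is (a sketch, relying on the fact that in the sample every piece of extra info starts with a lowercase letter):
LC_ALL=C sort -t, -k1,1 file
Because the key is restricted to field 1, lines with the same id fall back to sort's last-resort whole-line comparison, and in the C locale the uppercase 'C' of "Count=" sorts before any lowercase info text, so the Count line comes first without changing the data.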

Linux join utility complains about input file not being sorted

I have two files:
file1 has the format:
field1;field2;field3;field4
(file1 is initially unsorted)
file2 has the format:
field1
(file2 is sorted)
I run the 2 following commands:
sort -t\; -k1 file1 -o file1 # to sort file 1
join -t\; -1 1 -2 1 -o 1.1 1.2 1.3 1.4 file1 file2
I get the following message:
join: file1:27497: is not sorted: line_which_was_identified_as_out_of_order
Why is this happening?
(I also tried sorting file1 on the entire line, not only the first field of the line, but with no success.)
sort -t\; -c file1 doesn't output anything. Around line 27497, the situation is indeed strange, which suggests that sort isn't doing its job correctly:
XYZ113017;...
line 27497--> XYZ11301;...
XYZ11301;...
To complement Wumpus Q. Wumbley's helpful answer with a broader perspective (since I found this post researching a slightly different problem):
When using join, the input files must be sorted by the join field ONLY, otherwise you may see the warning reported by the OP.
There are two common scenarios in which more than the field of interest is mistakenly included when sorting the input files:
If you do specify a field, it's easy to forget that you must also specify a stop field - even if you target only 1 field - because sort uses the remainder of the line if only a start field is specified; e.g.:
sort -t, -k1 ... # !! FROM field 1 THROUGH THE REST OF THE LINE
sort -t, -k1,1 ... # Field 1 only
If your sort field is the FIRST field in the input, it's tempting to not specify any field selector at all.
However, if field values can be prefix substrings of each other, sorting whole lines will NOT (necessarily) result in the same sort order as just sorting by the 1st field:
sort ... # NOT always the same as 'sort -k1,1'! see below for example
Pitfall example:
#!/usr/bin/env bash
# Input data: fields separated by '^'.
# Note that, when properly sorting by field 1, the order should
# be "nameA" before "nameAA" (followed by "nameZ").
# Note how "nameA" is a substring of "nameAA".
read -r -d '' input <<EOF
nameA^other1
nameAA^other2
nameZ^other3
EOF
# NOTE: "WRONG" below refers to deviation from the expected outcome
# of sorting by field 1 only, based on mistaken assumptions.
# The commands do work correctly in a technical sense.
echo '--- just sort'
sort <<<"$input" | head -1 # WRONG: 'nameAA' comes first
echo '--- sort FROM field 1'
sort -t^ -k1 <<<"$input" | head -1 # WRONG: 'nameAA' comes first
echo '--- sort with field 1 ONLY'
sort -t^ -k1,1 <<<"$input" | head -1 # ok, 'nameA' comes first
Explanation:
When NOT limiting sorting to the first field, it is the relative sort order of the characters ^ and A (column index 6) that matters in this example. In other words, the field separator is compared to data, which is the source of the problem: ^ has a HIGHER ASCII value than A and therefore sorts after it, resulting in the line starting with nameAA^ sorting BEFORE the one starting with nameA^.
Note: It is possible for problems to surface on one platform, but be masked on another, based on locale and character-set settings and/or the sort implementation used; e.g., with a locale of en_US.UTF-8 in effect, with , as the separator and - permissible inside fields:
sort as used on OSX 10.10.2 (which is an old GNU sort version, 5.93) sorts , before - (in line with ASCII values)
sort as used on Ubuntu 14.04 (GNU sort 8.21) does the opposite: sorts - before ,[1]
[1] I don't know why - if somebody knows, please tell me. Test with sort <<<$'-\n,'
sort -k1 uses all fields starting from field 1 as the key. You need to specify a stop field.
sort -t\; -k1,1
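Applied to the commands from the question, that would be (a sketch; the -o list is written in its comma-separated form, and both files should be sorted under the same locale settings that join will run with):
sort -t\; -k1,1 file1 -o file1
join -t\; -1 1 -2 1 -o 1.1,1.2,1.3,1.4 file1 file2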
... or GNU sort is just as buggy as every other GNU command.
Try to sort Gi1/0/11 vs Gi1/0/1 and you'll never get an actual, regular textual sort suitable for join input, because someone added extra intelligence to sort that will happily use numeric or human-numeric sorting automagically in such cases, without even bothering to add a flag to force the regular behavior.
What is suitable for humans is seldom suitable for scripting.
