File is not sorted after sort - linux

I have a problem sorting my file. My file looks like this:
geom-10-11.com 1
geom-1-10.com 9
geom-1-11.com 10
geom-1-2.com 1
geom-1-3.com 2
geom-1-4.com 3
geom-1-5.com 4
geom-1-6.com 5
geom-1-7.com 6
geom-1-8.com 7
geom-1-9.com 8
geom-2-10.com 8
geom-2-11.com 9
geom-2-3.com 1
geom-2-4.com 2
geom-2-5.com 3
geom-2-6.com 4
geom-2-7.com 5
geom-2-8.com 6
geom-2-9.com 7
geom-3-10.com 7
geom-3-11.com 8
geom-3-4.com 1
geom-3-5.com 2
geom-3-6.com 3
geom-3-7.com 4
geom-3-8.com 5
geom-3-9.com 6
geom-4-10.com 6
geom-4-11.com 7
geom-4-5.com 1
geom-4-6.com 2
geom-4-7.com 3
geom-4-8.com 4
geom-4-9.com 5
geom-5-10.com 5
geom-5-11.com 6
geom-5-6.com 1
geom-5-7.com 2
geom-5-8.com 3
geom-5-9.com 4
geom-6-10.com 4
geom-6-11.com 5
geom-6-7.com 1
geom-6-8.com 2
geom-6-9.com 3
geom-7-10.com 3
geom-7-11.com 4
geom-7-8.com 1
geom-7-9.com 2
geom-8-10.com 2
geom-8-11.com 3
geom-8-9.com 1
geom-9-10.com 1
geom-9-11.com 2
So I used sort -k1.6 -k2 -n and I got
geom-1-2.com 1
geom-1-3.com 2
geom-1-4.com 3
geom-1-5.com 4
geom-1-6.com 5
geom-1-7.com 6
geom-1-8.com 7
geom-1-9.com 8
geom-1-10.com 9
geom-1-11.com 10
geom-2-3.com 1
geom-2-4.com 2
geom-2-5.com 3
geom-2-6.com 4
geom-2-7.com 5
geom-2-8.com 6
geom-2-9.com 7
geom-2-10.com 8
geom-2-11.com 9
geom-3-4.com 1
geom-3-5.com 2
geom-3-6.com 3
geom-3-7.com 4
geom-3-8.com 5
geom-3-9.com 6
geom-3-10.com 7
geom-3-11.com 8
geom-4-5.com 1
geom-4-6.com 2
geom-4-7.com 3
geom-4-8.com 4
geom-4-9.com 5
geom-4-10.com 6
geom-4-11.com 7
geom-5-6.com 1
geom-5-7.com 2
geom-5-8.com 3
geom-5-9.com 4
geom-5-10.com 5
geom-5-11.com 6
geom-6-7.com 1
geom-6-8.com 2
geom-6-9.com 3
geom-6-10.com 4
geom-6-11.com 5
geom-7-8.com 1
geom-7-9.com 2
geom-7-10.com 3
geom-7-11.com 4
geom-8-9.com 1
geom-8-10.com 2
geom-8-11.com 3
geom-9-10.com 1
geom-9-11.com 2
geom-10-11.com 1
But when I tried uniq -f1 or sort -k1.6 -k2 -n -u, I got the same long sorted output. So I ran
sort -k1.6 -k2 -n -c
and got a message that the file is not in order
(sort: glist2:2: disorder: geom-1-2.com 1).
I also tried just sort -k2 -n -u, but got
geom-10-11.com 1
geom-1-3.com 2
geom-1-4.com 3
geom-1-5.com 4
geom-1-6.com 5
geom-1-7.com 6
geom-1-8.com 7
geom-1-9.com 8
geom-1-10.com 9
geom-1-11.com 10
That is not what I need. I need to have
geom-1-2.com 1
geom-1-3.com 2
geom-1-4.com 3
geom-1-5.com 4
geom-1-6.com 5
geom-1-7.com 6
geom-1-8.com 7
geom-1-9.com 8
geom-1-10.com 9
geom-1-11.com 10
So I need geom-1-X at the beginning, not geom-10-X. It would be great to use just uniq, because I have much bigger files with more geometries (thousands of lines) but with the same structure. Thank you for your answers.

You can use this:
grep -E '^geom-1-' file | sort -k1.8n
grep keeps only the lines that start with geom-1-. sort then sorts numerically on the first field, starting at its 8th character, which is the number that follows geom-1-.
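For comparison, here is a minimal Python sketch of the same idea: filter on the first number, then sort numerically on the numbers inside the name. It assumes every line looks like geom-A-B.com N; the names geom_key and sort_geoms are made up for this illustration and are not part of any tool.

```python
import re

def geom_key(line):
    """Return (A, B) as integers for a line of the form 'geom-A-B.com N'."""
    match = re.match(r"geom-(\d+)-(\d+)\.com\b", line)
    if match is None:
        raise ValueError(f"unexpected line: {line!r}")
    return int(match.group(1)), int(match.group(2))

def sort_geoms(lines, first=None):
    """Sort numerically by A, then B; if 'first' is given, keep only A == first."""
    kept = [ln for ln in lines if first is None or geom_key(ln)[0] == first]
    return sorted(kept, key=geom_key)

# A few lines taken from the question, in their original (text) order.
sample = [
    "geom-10-11.com 1",
    "geom-1-10.com 9",
    "geom-1-11.com 10",
    "geom-1-2.com 1",
    "geom-1-3.com 2",
    "geom-2-3.com 1",
]

for line in sort_geoms(sample, first=1):
    print(line)
```

Because the key is a tuple of integers, geom-1-10 sorts after geom-1-9 and geom-10-X sorts after geom-9-X, which a plain text comparison does not give you.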

Related

How can we groupby selected row values from a column and assign it to a new column in pandas df?

Id B
1 6
2 13
1 6
2 6
1 6
2 6
1 10
2 6
2 6
2 6
I want a new column, say C, that holds the number of rows with B=6 for each Id.
Jan18.loc[Jan18['Enquiry Purpose']==6].groupby(Jan18['Member Reference']).transform('count')
Id B No_of_6
1 6 3
2 13 5
1 6 3
2 6 5
1 6 3
2 6 5
1 10 3
2 6 5
2 6 5
2 6 5
Compare the values with Series.eq (equivalent to ==), convert the result to integers, and use GroupBy.transform to fill the new column with the sum per group:
df['No_of_6'] = df['B'].eq(6).astype(int).groupby(df['Id']).transform('sum')
#alternative
#df['No_of_6'] = df.assign(B= df['B'].eq(6).astype(int)).groupby('Id')['B'].transform('sum')
print (df)
Id B No_of_6
0 1 6 3
1 2 13 5
2 1 6 3
3 2 6 5
4 1 6 3
5 2 6 5
6 1 10 3
7 2 6 5
8 2 6 5
9 2 6 5
More generally, build a boolean mask from your condition(s) and pass it in:
mask = df['B'].eq(6)
#alternative
#mask = (df['B'] == 6)
df['No_of_6'] = mask.astype(int).groupby(df['Id']).transform('sum')
A solution using map. Note that it returns NaN for any Id group that contains no 6.
df['No_of_6'] = df.Id.map(df[df.B.eq(6)].groupby('Id').B.count())
Out[113]:
Id B No_of_6
0 1 6 3
1 2 13 5
2 1 6 3
3 2 6 5
4 1 6 3
5 2 6 5
6 1 10 3
7 2 6 5
8 2 6 5
9 2 6 5
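To make clear what the mask plus transform('sum') is computing, here is a plain-Python sketch using only the standard library, with no pandas involved. The lists ids and b simply restate the Id and B columns from the question; the variable names are mine.

```python
from collections import Counter

# The Id and B columns from the question, row by row.
ids = [1, 2, 1, 2, 1, 2, 1, 2, 2, 2]
b = [6, 13, 6, 6, 6, 6, 10, 6, 6, 6]

# Step 1: count, per Id, how many rows satisfy the condition B == 6.
sixes_per_id = Counter(i for i, value in zip(ids, b) if value == 6)

# Step 2: broadcast the per-group count back onto every row of that group.
# A Counter returns 0 for an Id with no matching rows.
no_of_6 = [sixes_per_id[i] for i in ids]

print(no_of_6)
```

The lookup in step 2 gives 0 for a group without any 6, which is also what the summed mask produces; the map-based variant is the one that leaves NaN there.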

Average of multiple files with unequal row sizes in Shell

I have 15 data files with different numbers of rows, but the number of columns in each file is the same, e.g.
ifile1.dat    ifile2.dat    ifile3.dat    and so on ............
0   0         0   0         1   6
1   2         5   3         2   7
2   5         6   10        4   6
5   2         8   9         5   9
10  2         10  3         8   2
In each file the 1st column is the index number.
I would like to compute the average over all these files for each index number in column 1, i.e.
ofile.txt
0 0 [This is computed as (0+0)/2]
1 4 [This is computed as (2+6)/2]
2 6 [This is computed as (5+7)/2]
3 [no value]
4 6 [This is computed as (6)/1]
5 4.66 [This is computed as (2+3+9)/3]
6 10
7
8 5.5
9
10 2.5
I can't think of a simple way to do it. The only method I came up with seems very lengthy: first pad all the files to the same number of rows, then take the average, e.g.
ifile1.dat    ifile2.dat    ifile3.dat    and so on ............
0   0         0   0         0   0
1   2         1             1   6
2   5         2             2   7
3             3             3
4             4             4   6
5   2         5   3         5   9
6             6   10        6
7             7             7
8             8   9         8   2
9             9             9
10  2         10  3         10
$ awk '{s[$1]+=$2; c[$1]++;} END{for (i in s) print i,s[i]/c[i];}' ifile*.dat
0 0
1 4
2 6
4 6
5 4.66667
6 10
8 5.5
10 2.5
In the above code, there are two arrays, s and c. s[i] is the sum of all entries with index i and c[i] is the number of entries with index i. After we have read all the files, we print the average, s[i]/c[i], for each index i.
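The same two-array idea can be sketched in Python. This version also prints the indices that have no value, as in the desired ofile.txt, which the awk one-liner skips. Note as well that awk's for (i in s) does not guarantee any particular order. The function name average_by_index is hypothetical, and the three lists stand in for the contents of the example files rather than reading them from disk.

```python
from collections import defaultdict

def average_by_index(files):
    """files: iterable of iterables of 'index value' lines. Returns {index: mean}."""
    total = defaultdict(float)  # plays the role of s[] in the awk answer
    count = defaultdict(int)    # plays the role of c[] in the awk answer
    for lines in files:
        for line in lines:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank lines and lines without a value
            index = int(fields[0])
            total[index] += float(fields[1])
            count[index] += 1
    return {i: total[i] / count[i] for i in total}

# Stand-ins for ifile1.dat, ifile2.dat and ifile3.dat from the question.
ifile1 = ["0 0", "1 2", "2 5", "5 2", "10 2"]
ifile2 = ["0 0", "5 3", "6 10", "8 9", "10 3"]
ifile3 = ["1 6", "2 7", "4 6", "5 9", "8 2"]

means = average_by_index([ifile1, ifile2, ifile3])

# Walk the whole index range so that missing indices show up with no value.
for i in range(min(means), max(means) + 1):
    print(i, f"{means[i]:g}" if i in means else "")
```

With real files you would pass open file objects instead of lists, since iterating over a file also yields lines.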

How to calculate standard deviation across different columns in a shell script

I have a datafile with 10 columns as given below
ifile.txt
2 4 4 2 1 2 2 4 2 1
3 3 1 5 3 3 4 5 3 3
4 3 3 2 2 1 2 3 4 2
5 3 1 3 1 2 4 5 6 8
I want to add an 11th column that shows the standard deviation of each row across the 10 columns, i.e. STDEV(2 4 4 2 1 2 2 4 2 1) and so on.
I am able to do it by taking the transpose, then using the following command, and then taking the transpose again:
awk '{x[NR]=$0; s+=$1} END{a=s/NR; for (i in x){ss += (x[i]-a)^2} sd = sqrt(ss/NR); print sd}'
Can anybody suggest a simpler way to do it directly along each row?
You can also do it in a single pass over each row:
awk '{for(i=1;i<=NF;i++){s+=$i;ss+=$i*$i}m=s/NF;$(NF+1)=sqrt(ss/NF-m*m);s=ss=0}1' ifile.txt
Do you mean something like this?
awk '{for(i=1;i<=NF;i++)s+=$i;M=s/NF;
for(i=1;i<=NF;i++)sd+=(($i-M)^2);$(NF+1)=sqrt(sd/NF);M=sd=s=0}1' file
2 4 4 2 1 2 2 4 2 1 1.11355
3 3 1 5 3 3 4 5 3 3 1.1
4 3 3 2 2 1 2 3 4 2 0.916515
5 3 1 3 1 2 4 5 6 8 2.13542
You just use the fields instead of transposing and using the rows.
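Both awk answers divide by NF, so they compute the population standard deviation of each row. As a cross-check, here is a short Python sketch that does the same with statistics.pstdev from the standard library. The helper name append_row_stdev is made up for this example.

```python
from statistics import pstdev

# The four rows of ifile.txt from the question.
rows = [
    "2 4 4 2 1 2 2 4 2 1",
    "3 3 1 5 3 3 4 5 3 3",
    "4 3 3 2 2 1 2 3 4 2",
    "5 3 1 3 1 2 4 5 6 8",
]

def append_row_stdev(line):
    """Return the line with its population standard deviation appended."""
    values = [float(x) for x in line.split()]
    # pstdev divides by N (like sqrt(sd/NF) in the awk answers);
    # statistics.stdev would divide by N - 1 instead.
    return f"{line} {pstdev(values):g}"

for row in rows:
    print(append_row_stdev(row))
```

If you want the sample standard deviation instead, swap pstdev for stdev; the numbers will then differ from the awk output above.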

Sort a group of data based on a column

I have an input file that contains following data:
1 2 3 4
4 6
8 9
10
2 1 5 7
3
3 4 2 9
2 7
11
I'm trying to sort the groups of data based on the third column, to get output like this:
2 1 5 7
3
1 2 3 4
4 6
8 9
10
3 4 2 9
2 7
11
Could you tell me how to do so?
sort -nk3r
sorts in reverse numeric order on the 3rd column. Note, however, that this outputs
2 1 5 7
1 2 3 4
3 4 2 9
10
11
2 7
3
4 6
8 9
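because sort compares every line on its own and has no notion of groups, so the shorter lines are detached from the line above them. This differs from the output you posted, although it does sort on the third column as the question literally asks.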
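If the groups have to stay together, the lines need to be grouped before sorting. Below is a minimal Python sketch of that. It rests on an assumption the question does not state explicitly: a line with four fields starts a new group, and shorter lines belong to the group above them. The function name sort_groups and its parameters are hypothetical.

```python
def sort_groups(lines, column=3, header_width=4):
    """Sort whole groups by the given column of their first line, descending.

    Assumption: a line with at least header_width fields starts a new group;
    any shorter line is a continuation of the current group.
    """
    groups = []
    for line in lines:
        if len(line.split()) >= header_width or not groups:
            groups.append([line])      # start a new group
        else:
            groups[-1].append(line)    # continuation of the current group
    # Sort groups by the key column of their header line, largest first.
    groups.sort(key=lambda g: float(g[0].split()[column - 1]), reverse=True)
    return [line for group in groups for line in group]

# The input file from the question.
data = ["1 2 3 4", "4 6", "8 9", "10", "2 1 5 7", "3", "3 4 2 9", "2 7", "11"]

for line in sort_groups(data):
    print(line)
```

If the real file marks continuation lines differently (for example by leading whitespace), only the test that decides where a new group starts needs to change.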

Reverse sort order of a multicolumn file in BASH

I've the following file:
1 2 3
1 4 5
1 6 7
2 3 5
5 2 1
and I want the file sorted by the second column, from the largest number (in this case 6) to the smallest. I've tried
sort +1 -2 file.dat
but it sorts in ascending order (rather than descending).
The results should be:
1 6 7
1 4 5
2 3 5
5 2 1
1 2 3
sort -nrk 2,2
does the trick.
n for numeric sorting, r for reverse order and k 2,2 for the second column.
Have you tried -r? From the man page:
-r, --reverse
reverse the result of comparisons
As mentioned, most versions of sort have the -r option. If yours doesn't, try tac:
$ sort -nk 2,2 file.dat | tac
1 6 7
1 4 5
2 3 5
5 2 1
1 2 3
$ sort -nrk 2,2 file.dat
1 6 7
1 4 5
2 3 5
5 2 1
1 2 3
tac - concatenate and print files in reverse
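For reference, the same descending sort on the second column can be sketched in Python. The whole line is used as a tie-breaker, which I chose so that the result matches the output shown above; second_column is just a helper name for this example.

```python
# The contents of file.dat from the question.
rows = ["1 2 3", "1 4 5", "1 6 7", "2 3 5", "5 2 1"]

def second_column(line):
    """Return the second whitespace-separated field as an integer."""
    return int(line.split()[1])

# Sort by the second column, largest first; ties fall back to the whole line.
descending = sorted(rows, key=lambda line: (second_column(line), line), reverse=True)

for line in descending:
    print(line)
```

Without the tie-breaker, Python's sorted keeps tied lines in their input order, so "1 2 3" would come before "5 2 1".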
