Sort a group of data based on a column - linux

I have an input file that contains the following data:
1 2 3 4
4 6
8 9
10
2 1 5 7
3
3 4 2 9
2 7
11
I'm trying to sort the groups of data by the third column and get this output:
2 1 5 7
3
1 2 3 4
4 6
8 9
10
3 4 2 9
2 7
11
Could you tell me how to do so?

sort -nk3r
sorts in reverse order on the third column. Note, however, that this outputs
2 1 5 7
1 2 3 4
3 4 2 9
10
11
2 7
3
4 6
8 9
because sort compares every line on its own and has no notion of groups (sort is a standalone utility, not part of bash). This differs from the output you posted, although it is a literal reading of "sort on the third column".
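If the goal is the grouped output from the question, the lines have to be collected into blocks before sorting. The following is a minimal Python sketch of that idea, not a sort one-liner. It assumes that a group starts at every line with at least three fields and that the shorter lines after it belong to that group; the function name sort_groups is made up for illustration.

```python
def sort_groups(lines, column=3, reverse=True):
    """Sort blocks of lines by one column of each block's first line.

    Assumption: a line with at least `column` fields starts a new group,
    and the shorter lines that follow it belong to that group.
    """
    groups = []
    for line in lines:
        if len(line.split()) >= column or not groups:
            groups.append([line])        # header line: start a new group
        else:
            groups[-1].append(line)      # continuation line: keep with its header

    def key(group):
        fields = group[0].split()
        # Groups whose first line lacks the column sort as the smallest value.
        return float(fields[column - 1]) if len(fields) >= column else float("-inf")

    groups.sort(key=key, reverse=reverse)
    return [line for group in groups for line in group]


data = ["1 2 3 4", "4 6", "8 9", "10", "2 1 5 7", "3", "3 4 2 9", "2 7", "11"]
result = sort_groups(data)
print("\n".join(result))
```

With the sample input, the groups come out in the order 5, 3, 2 of their third column, which is the order shown in the question.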

Related

How can we groupby selected row values from a column and assign it to a new column in pandas df?

Id B
1 6
2 13
1 6
2 6
1 6
2 6
1 10
2 6
2 6
2 6
I want a new column, say C, that holds the number of rows with B = 6 for each Id.
Jan18.loc[Jan18['Enquiry Purpose']==6].groupby(Jan18['Member Reference']).transform('count')
Id B No_of_6
1 6 3
2 13 5
1 6 3
2 6 5
1 6 3
2 6 5
1 10 3
2 6 5
2 6 5
2 6 5
Compare values with Series.eq (equivalent to ==), convert the result to integers, and use GroupBy.transform to fill the new column with the per-group sum:
df['No_of_6'] = df['B'].eq(6).astype(int).groupby(df['Id']).transform('sum')
#alternative
#df['No_of_6'] = df.assign(B= df['B'].eq(6).astype(int)).groupby('Id')['B'].transform('sum')
print (df)
Id B No_of_6
0 1 6 3
1 2 13 5
2 1 6 3
3 2 6 5
4 1 6 3
5 2 6 5
6 1 10 3
7 2 6 5
8 2 6 5
9 2 6 5
More generally, build a boolean mask from your condition(s) and pass it in the same way:
mask = df['B'].eq(6)
#alternative
#mask = (df['B'] == 6)
df['No_of_6'] = mask.astype(int).groupby(df['Id']).transform('sum')
A solution using map. Note that it returns NaN for any Id group that contains no 6:
df['No_of_6'] = df.Id.map(df[df.B.eq(6)].groupby('Id').B.count())
Out[113]:
Id B No_of_6
0 1 6 3
1 2 13 5
2 1 6 3
3 2 6 5
4 1 6 3
5 2 6 5
6 1 10 3
7 2 6 5
8 2 6 5
9 2 6 5
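The NaN behaviour of the map approach does not show up in the sample above, because both Ids contain a 6. Here is a small sketch with made-up data (an extra Id 3 that has no 6) showing one way to turn the missing value into a zero count with fillna. The data is hypothetical and not taken from the question.

```python
import pandas as pd

# Hypothetical data: Id 3 has no row with B == 6.
df = pd.DataFrame({"Id": [1, 2, 1, 2, 3], "B": [6, 13, 6, 6, 10]})

# Count of B == 6 per Id; Id 3 does not appear in this Series at all.
counts = df[df["B"].eq(6)].groupby("Id")["B"].count()

# map leaves NaN where an Id has no entry; fillna(0) makes that a zero count.
df["No_of_6"] = df["Id"].map(counts).fillna(0).astype(int)
print(df)
```

The transform('sum') variants shown earlier do not need this step, since summing an all-False mask already gives 0.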

Pull Values from a Table in Excel

I am working on a user-friendly character sheet for the new Pathfinder Playtest in Excel. I have run into an issue with one section and have come here for help, though I'm not sure it's the right place.
I want a cell to return a value from a table (below) based on the values of two other cells, e.g., if A1=19 and B1=4th, it would pull the number from the matching row and column (3 in this case).
1st 2nd 3rd 4th 5th 6th 7th 8th 9th
1 2
2 3
3 3 2
4 3 3
5 3 3 2
6 3 3 3
7 3 3 3 2
8 3 3 3 3
9 3 3 3 3 2
10 3 3 3 3 3
11 3 3 3 3 3 2
12 3 3 3 3 3 3
13 3 3 3 3 3 3 2
14 3 3 3 3 3 3 3
15 3 3 3 3 3 3 3 2
16 3 3 3 3 3 3 3 3
17 3 3 3 3 3 3 3 3 2
18 3 3 3 3 3 3 3 3 3
19 3 3 3 3 3 3 3 3 3
20 3 3 3 3 3 3 3 3 3
I have tried the formulas below, as well as plain INDEX, and I can't figure this out. Any help is appreciated, thanks!
=INDEX(P137:X156,MATCH(B2,O137:O156,1),MATCH(A10,P137:P156,1))
=INDEX(O137:O156,MATCH(1,(J125=P137:P156)*(J126=Q137:Q156)*(J127=R137:R156)*(J128=S137:S156)*(J129=T137:T156)*(J130=U137:U156)*(J131=V137:V156)*(J132=W137:W156)*(J133=X137:X156),0))
Let's say your data starts at A1, with the level numbers in column A and the 1st to 9th headers in row 1.
I added two simple cells (C25 and C26 in the formula below) where the user chooses the row and the column. Both cells use data validation lists tied to your data, so no invalid value can be entered.
The formula is:
=INDEX($1:$1048576;MATCH($C$25;$A:$A;0);MATCH($C$26;$1:$1;0))
Hope you can adapt this to your needs.
You can download the sample from Google Drive if you wish:
https://drive.google.com/open?id=1QXFmmEPMtJeiHDjKKM0o6kclpMIzaw_i

File is not sorted after sort

I have a problem with sorting my file. My file looks like this:
geom-10-11.com 1
geom-1-10.com 9
geom-1-11.com 10
geom-1-2.com 1
geom-1-3.com 2
geom-1-4.com 3
geom-1-5.com 4
geom-1-6.com 5
geom-1-7.com 6
geom-1-8.com 7
geom-1-9.com 8
geom-2-10.com 8
geom-2-11.com 9
geom-2-3.com 1
geom-2-4.com 2
geom-2-5.com 3
geom-2-6.com 4
geom-2-7.com 5
geom-2-8.com 6
geom-2-9.com 7
geom-3-10.com 7
geom-3-11.com 8
geom-3-4.com 1
geom-3-5.com 2
geom-3-6.com 3
geom-3-7.com 4
geom-3-8.com 5
geom-3-9.com 6
geom-4-10.com 6
geom-4-11.com 7
geom-4-5.com 1
geom-4-6.com 2
geom-4-7.com 3
geom-4-8.com 4
geom-4-9.com 5
geom-5-10.com 5
geom-5-11.com 6
geom-5-6.com 1
geom-5-7.com 2
geom-5-8.com 3
geom-5-9.com 4
geom-6-10.com 4
geom-6-11.com 5
geom-6-7.com 1
geom-6-8.com 2
geom-6-9.com 3
geom-7-10.com 3
geom-7-11.com 4
geom-7-8.com 1
geom-7-9.com 2
geom-8-10.com 2
geom-8-11.com 3
geom-8-9.com 1
geom-9-10.com 1
geom-9-11.com 2
So I used sort -k1.6 -k2 -n and I got
geom-1-2.com 1
geom-1-3.com 2
geom-1-4.com 3
geom-1-5.com 4
geom-1-6.com 5
geom-1-7.com 6
geom-1-8.com 7
geom-1-9.com 8
geom-1-10.com 9
geom-1-11.com 10
geom-2-3.com 1
geom-2-4.com 2
geom-2-5.com 3
geom-2-6.com 4
geom-2-7.com 5
geom-2-8.com 6
geom-2-9.com 7
geom-2-10.com 8
geom-2-11.com 9
geom-3-4.com 1
geom-3-5.com 2
geom-3-6.com 3
geom-3-7.com 4
geom-3-8.com 5
geom-3-9.com 6
geom-3-10.com 7
geom-3-11.com 8
geom-4-5.com 1
geom-4-6.com 2
geom-4-7.com 3
geom-4-8.com 4
geom-4-9.com 5
geom-4-10.com 6
geom-4-11.com 7
geom-5-6.com 1
geom-5-7.com 2
geom-5-8.com 3
geom-5-9.com 4
geom-5-10.com 5
geom-5-11.com 6
geom-6-7.com 1
geom-6-8.com 2
geom-6-9.com 3
geom-6-10.com 4
geom-6-11.com 5
geom-7-8.com 1
geom-7-9.com 2
geom-7-10.com 3
geom-7-11.com 4
geom-8-9.com 1
geom-8-10.com 2
geom-8-11.com 3
geom-9-10.com 1
geom-9-11.com 2
geom-10-11.com 1
But when I tried uniq -f1 or sort -k1.6 -k2 -n -u, I got the same long sorted output. So I ran
sort -k1.6 -k2 -n -c
and got a message that the file is disordered
(sort: glist2:2: disorder: geom-1-2.com 1).
I then tried just sort -k2 -n -u, but got
geom-10-11.com 1
geom-1-3.com 2
geom-1-4.com 3
geom-1-5.com 4
geom-1-6.com 5
geom-1-7.com 6
geom-1-8.com 7
geom-1-9.com 8
geom-1-10.com 9
geom-1-11.com 10
That is not what I need. I need
geom-1-2.com 1
geom-1-3.com 2
geom-1-4.com 3
geom-1-5.com 4
geom-1-6.com 5
geom-1-7.com 6
geom-1-8.com 7
geom-1-9.com 8
geom-1-10.com 9
geom-1-11.com 10
So I need geom-1-X at the beginning, not geom-10-X. It would be great to use just uniq, because I have much bigger files with more geometries (thousands of lines) but the same structure. Thank you for your answers.
You can use this:
grep -E '^geom-1-' file | sort -k1.8n
grep keeps only the lines you want; sort then orders them numerically on the first field, starting at its 8th character.
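For files where hard-coding the geom-1- prefix is not practical, the same result can be sketched in Python. This is only one reading of the question: order the lines by the two numbers in the name, then keep the first line seen for each value of the second field. The helper name geom_key and the shortened sample list are made up for illustration.

```python
import re


def geom_key(line):
    """Return the two integers embedded in a name like 'geom-1-10.com'."""
    match = re.match(r"geom-(\d+)-(\d+)\.com", line)
    return (int(match.group(1)), int(match.group(2)))


# Shortened sample in the same format as the question's file.
lines = [
    "geom-10-11.com 1",
    "geom-1-10.com 9",
    "geom-1-2.com 1",
    "geom-2-3.com 1",
    "geom-1-3.com 2",
]

# Numeric ordering puts geom-1-X before geom-2-X and geom-10-X.
ordered = sorted(lines, key=geom_key)

# Keep only the first line for each value of the second field.
first_per_value = {}
for line in ordered:
    first_per_value.setdefault(line.split()[1], line)

unique = list(first_per_value.values())
print("\n".join(unique))
```

Because the lines are ordered numerically first, the geom-1-X entry is the one kept for each value, not geom-10-11.com.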

How do I calculate the probability of every value in a dataframe column quickly in Python?

I want to calculate the probability of every value in a DataFrame column according to the column's own distribution. For example, my data looks like this:
data
0 1
1 1
2 2
3 3
4 2
5 2
6 7
7 8
8 3
9 4
10 1
And I expect output like this:
data pro
0 1 0.155015
1 1 0.155015
2 2 0.181213
3 3 0.157379
4 2 0.181213
5 2 0.181213
6 7 0.048717
7 8 0.044892
8 3 0.157379
9 4 0.106164
10 1 0.155015
I also referred to another question (How to compute the probability ...) and got the example above. My code is as follows:
import pandas as pd
import scipy.stats
samples = [1,1,2,3,2,2,7,8,3,4,1]
samples = pd.DataFrame(samples,columns=['data'])
print(samples)
kde = scipy.stats.gaussian_kde(samples['data'].tolist())
samples['pro'] = kde.pdf(samples['data'].tolist())
print(samples)
But if my column is very long, this operation becomes slow. Is there a better way to do it in pandas? Thanks in advance.
A column's "own distribution" does not have to mean a KDE. You can use value_counts with normalize=True to get each value's relative frequency:
df.assign(pro=df.data.map(df.data.value_counts(normalize=True)))
data pro
0 1 0.272727
1 1 0.272727
2 2 0.272727
3 3 0.181818
4 2 0.272727
5 2 0.272727
6 7 0.090909
7 8 0.090909
8 3 0.181818
9 4 0.090909
10 1 0.272727
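To make clear what value_counts(normalize=True) computes, here is a standard-library sketch of the same relative-frequency idea, using the sample values from the question. It is an illustration of the arithmetic (count of a value divided by the number of rows), not a replacement for the pandas one-liner.

```python
from collections import Counter

samples = [1, 1, 2, 3, 2, 2, 7, 8, 3, 4, 1]

# How often each distinct value occurs.
counts = Counter(samples)
total = len(samples)

# Relative frequency of each row's value: occurrences / number of rows.
pro = [counts[value] / total for value in samples]

for value, p in zip(samples, pro):
    print(value, round(p, 6))
```

The value 1 occurs 3 times in 11 rows, so its relative frequency is 3/11, which matches the 0.272727 in the output above.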

how to calculate standard deviation across different columns in a shell script

I have a datafile with 10 columns as given below
ifile.txt
2 4 4 2 1 2 2 4 2 1
3 3 1 5 3 3 4 5 3 3
4 3 3 2 2 1 2 3 4 2
5 3 1 3 1 2 4 5 6 8
I want to add an 11th column that shows the standard deviation of each row across the 10 columns, i.e. STDEV(2 4 4 2 1 2 2 4 2 1) and so on.
I am able to do it by transposing the file, running the following command, and transposing the result again:
awk '{x[NR]=$0; s+=$1} END{a=s/NR; for (i in x){ss += (x[i]-a)^2} sd = sqrt(ss/NR); print sd}'
Can anybody suggest a simpler way to do it directly along each row?
You can do the same in one pass as well:
awk '{for(i=1;i<=NF;i++){s+=$i;ss+=$i*$i}m=s/NF;$(NF+1)=sqrt(ss/NF-m*m);s=ss=0}1' ifile.txt
Do you mean something like this?
awk '{for(i=1;i<=NF;i++)s+=$i;M=s/NF;
for(i=1;i<=NF;i++)sd+=(($i-M)^2);$(NF+1)=sqrt(sd/NF);M=sd=s=0}1' file
2 4 4 2 1 2 2 4 2 1 1.11355
3 3 1 5 3 3 4 5 3 3 1.1
4 3 3 2 2 1 2 3 4 2 0.916515
5 3 1 3 1 2 4 5 6 8 2.13542
You just use the fields instead of transposing and using the rows.
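For comparison, the per-row calculation can be written out in Python. This sketch assumes the population standard deviation (dividing by the number of fields), which is what both awk answers compute; the function name append_row_std is made up for illustration.

```python
import math


def append_row_std(line):
    """Return the line with its population standard deviation appended."""
    values = [float(v) for v in line.split()]
    mean = sum(values) / len(values)
    # Population variance: mean of squared deviations (divide by N, not N - 1).
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # %.6g-style formatting, the same precision awk uses by default.
    return f"{line} {math.sqrt(variance):.6g}"


rows = [
    "2 4 4 2 1 2 2 4 2 1",
    "3 3 1 5 3 3 4 5 3 3",
    "4 3 3 2 2 1 2 3 4 2",
    "5 3 1 3 1 2 4 5 6 8",
]
for row in rows:
    print(append_row_std(row))
```

For the first row the mean is 2.4 and the mean of the squares is 7, so the variance is 7 - 2.4^2 = 1.24 and the standard deviation is about 1.11355, in line with the awk output.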
