Average of multiple files with unequal row sizes in Shell - linux

I have 15 data files with unequal row sizes, but the number of columns in each file is the same, e.g.
ifile1.dat      ifile2.dat      ifile3.dat      and so on ............
 0  0            0  0            1  6
 1  2            5  3            2  7
 2  5            6 10            4  6
 5  2            8  9            5  9
10  2           10  3            8  2
In each file, the 1st column represents the index number.
I would like to compute the average over all these files for each index number in column 1, i.e.
ofile.txt
0 0 [This is computed as (0+0)/2]
1 4 [This is computed as (2+6)/2]
2 6 [This is computed as (5+7)/2]
3 [no value]
4 6 [This is computed as (6)/1]
5 4.66 [This is computed as (2+3+9)/3]
6 10
7
8 5.5
9
10 2.5
I can't think of any simple method to do it. The method I was thinking of seems very lengthy: take the average after converting all the files to the same number of rows, e.g.
ifile1.dat      ifile2.dat      ifile3.dat      and so on ............
 0  0            0  0            0  0
 1  2            1               1  6
 2  5            2               2  7
 3               3               3
 4               4               4  6
 5  2            5  3            5  9
 6               6 10            6
 7               7               7
 8               8  9            8  2
 9               9               9
10  2           10  3           10

$ awk '{s[$1]+=$2; c[$1]++;} END{for (i in s) print i,s[i]/c[i];}' ifile*.dat
0 0
1 4
2 6
4 6
5 4.66667
6 10
8 5.5
10 2.5
In the above code, there are two arrays, s and c. s[i] is the sum of all entries with index i and c[i] is the number of entries with index i. After we have read all the files, we print the average, s[i]/c[i], for each index i.
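One small caveat: awk's for (i in s) loop does not guarantee any particular order, so piping the result through sort -n is a safe way to obtain the numerically ordered listing shown above. For readers less comfortable with awk, here is a minimal Python sketch of the same sum-and-count idea (it assumes the input files match the ifile*.dat pattern used above):
import glob
from collections import defaultdict

sums = defaultdict(float)    # like s[i]: running sum of the 2nd column for index i
counts = defaultdict(int)    # like c[i]: number of entries seen for index i

for path in glob.glob("ifile*.dat"):
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) < 2:
                continue
            idx, value = fields[0], float(fields[1])
            sums[idx] += value
            counts[idx] += 1

for idx in sorted(sums, key=float):
    print(idx, sums[idx] / counts[idx])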

Related

Add ordinal number column to output of custom verb in J

If I type !i.10 it gives the factorials of the first 10 numbers.
However, if I try to add a column of ordinal numbers with >/.!i.10, 1+i.10, then J freezes or I get an "Out of memory" error. How do I create custom tables?
I think that what is happening is that you are creating something much bigger than you expect. Taking it in steps:
1+ i. 10 NB. list of 1 to 10
1 2 3 4 5 6 7 8 9 10
10 , 1+ i. 10 NB. 10 prepended
10 1 2 3 4 5 6 7 8 9 10
i. 10 , 1+ i. 10 NB. creates an 11-dimensional array with shape 10 1 2 3 4 5 6 7 8 9 10 and a largest value of 36287999
When you apply ! to that i. 10 , 1+ i. 10 you get some very large numbers. I am not sure what you are trying to do with the leading >/.
Is this what you had in mind?
(!1 + i.10),. (1+i.10) NB. using parentheses to isolate operations
1 1
2 2
6 3
24 4
120 5
720 6
5040 7
40320 8
362880 9
3.6288e6 10
To produce extended-precision integers and get rid of the 3.6288e6, we can use x:
(x:!1 + i.10),. (1+i.10)
1 1
2 2
6 3
24 4
120 5
720 6
5040 7
40320 8
362880 9
3628800 10
or tacit
(x:@! ,. ]) @ (1+i.) 10
1 1
2 2
6 3
24 4
120 5
720 6
5040 7
40320 8
362880 9
3628800 10
Or a version I find a little better
([: (,.~ !) 1x + i.) 10
1 1
2 2
6 3
24 4
120 5
720 6
5040 7
40320 8
362880 9
3628800 10

Count total rows of an Id from another column

I have a dataframe:
# Initialise data of lists
data = {'Id':['1', '2', '3', '4','5','6','7','8','9','10'], 'reply_id':[2, 2,2, 5,5,6,8,8,1,1]}
# Create DataFrame
df = pd.DataFrame(data)
Id reply_id
0 1 2
1 2 2
2 3 2
3 4 5
4 5 5
5 6 6
6 7 8
7 8 8
8 9 1
9 10 1
I want to get the total count of each Id in reply_id, in a new column new.
For example, Id=1 has 2 occurrences in reply_id, which I want in the new column new.
Desired output:
Id reply_id new
0 1 2 2
1 2 2 3
2 3 2 0
3 4 5 0
4 5 5 2
5 6 6 1
6 7 8 0
7 8 8 2
8 9 1 0
9 10 1 0
I have tried this line of code:
df['new'] = df.reply_id.eq(df.Id).astype(int).groupby(df.Id).transform('sum')
In this answer, I used Series.value_counts to count the values in reply_id and converted the result to a dict. Then I used Series.map on the Id column to associate each Id with its count. fillna(0) fills in the Ids that are not present in reply_id:
df['new'] = (df['Id']
             .astype(int)
             .map(df['reply_id'].value_counts().to_dict())
             .fillna(0)
             .astype(int))
Use Series.groupby on the column reply_id, then use the aggregation function GroupBy.count to create a mapping series counts; finally, use Series.map to map the values in the Id column to their respective counts:
counts = df['reply_id'].groupby(df['reply_id']).count()
# Id is stored as strings in the example data, so cast to int before mapping
df['new'] = df['Id'].astype(int).map(counts).fillna(0).astype(int)
Result:
# print(df)
Id reply_id new
0 1 2 2
1 2 2 3
2 3 2 0
3 4 5 0
4 5 5 2
5 6 6 1
6 7 8 0
7 8 8 2
8 9 1 0
9 10 1 0

How can we groupby selected row values from a column and assign it to a new column in pandas df?

Id B
1 6
2 13
1 6
2 6
1 6
2 6
1 10
2 6
2 6
2 6
I want a new column, say C, that holds the grouped count of B=6 at the Id level. I tried this on my actual data (where the columns have different names):
Jan18.loc[Jan18['Enquiry Purpose']==6].groupby(Jan18['Member Reference']).transform('count')
Expected output:
Id B No_of_6
1 6 3
2 13 5
1 6 3
2 6 5
1 6 3
2 6 5
1 10 3
2 6 5
2 6 5
2 6 5
Compare values with Series.eq (the == operator), convert to integers, and use GroupBy.transform to create a new column filled with the sum per group:
df['No_of_6'] = df['B'].eq(6).astype(int).groupby(df['Id']).transform('sum')
#alternative
#df['No_of_6'] = df.assign(B= df['B'].eq(6).astype(int)).groupby('Id')['B'].transform('sum')
print (df)
Id B No_of_6
0 1 6 3
1 2 13 5
2 1 6 3
3 2 6 5
4 1 6 3
5 2 6 5
6 1 10 3
7 2 6 5
8 2 6 5
9 2 6 5
In general, create a boolean mask from your condition(s) and pass it in as below:
mask = df['B'].eq(6)
#alternative
#mask = (df['B'] == 6)
df['No_of_6'] = mask.astype(int).groupby(df['Id']).transform('sum')
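Several conditions can also be combined into a single mask before the same transform step; for example (a purely hypothetical variation, not part of the original answer):
mask = df['B'].eq(6) | df['B'].eq(13)    # hypothetical: count rows where B is 6 or 13
df['No_of_6_or_13'] = mask.astype(int).groupby(df['Id']).transform('sum')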
A solution using map. This solution will return NaN for groups of Id that have no occurrence of 6:
df['No_of_6'] = df.Id.map(df[df.B.eq(6)].groupby('Id').B.count())
Out[113]:
Id B No_of_6
0 1 6 3
1 2 13 5
2 1 6 3
3 2 6 5
4 1 6 3
5 2 6 5
6 1 10 3
7 2 6 5
8 2 6 5
9 2 6 5
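If zeros are preferred over NaN for Ids that have no 6 at all (matching the other answers), the mapped counts can simply be filled in afterwards, e.g.:
df['No_of_6'] = df.Id.map(df[df.B.eq(6)].groupby('Id').B.count()).fillna(0).astype(int)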

How do I calculate the probability of every value in a dataframe column quickly in Python?

I want to calculate the probability of every value in a dataframe column according to the column's own distribution. For example, my data looks like this:
data
0 1
1 1
2 2
3 3
4 2
5 2
6 7
7 8
8 3
9 4
10 1
And the output I expect looks like this:
data pro
0 1 0.155015
1 1 0.155015
2 2 0.181213
3 3 0.157379
4 2 0.181213
5 2 0.181213
6 7 0.048717
7 8 0.044892
8 3 0.157379
9 4 0.106164
10 1 0.155015
I also referred to another question (How to compute the probability ...) and took the example above from it. My code is as follows:
import scipy.stats
import pandas as pd

samples = [1,1,2,3,2,2,7,8,3,4,1]
samples = pd.DataFrame(samples, columns=['data'])
print(samples)
kde = scipy.stats.gaussian_kde(samples['data'].tolist())
samples['pro'] = kde.pdf(samples['data'].tolist())
print(samples)
But if my column is very long, this makes the operation slow. Is there a better way to do it in pandas? Thanks in advance.
"Its own distribution" does not have to mean a KDE. You can use value_counts with normalize=True:
df.assign(pro=df.data.map(df.data.value_counts(normalize=True)))
data pro
0 1 0.272727
1 1 0.272727
2 2 0.272727
3 3 0.181818
4 2 0.272727
5 2 0.272727
6 7 0.090909
7 8 0.090909
8 3 0.181818
9 4 0.090909
10 1 0.272727
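If the KDE-based probability density is what is actually wanted and only speed is the concern, one option (a sketch, not part of the original answer) is to evaluate kde.pdf once per unique value and map the results back to the rows, since the density evaluation is the expensive step:
import scipy.stats
import pandas as pd

samples = pd.DataFrame([1,1,2,3,2,2,7,8,3,4,1], columns=['data'])
kde = scipy.stats.gaussian_kde(samples['data'])
# Evaluate the density only at the unique values, then map back to every row.
unique_vals = samples['data'].unique()
density = pd.Series(kde.pdf(unique_vals), index=unique_vals)
samples['pro'] = samples['data'].map(density)
print(samples)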

Sort a group of data based on a column

I have an input file that contains the following data:
1 2 3 4
4 6
8 9
10
2 1 5 7
3
3 4 2 9
2 7
11
I'm trying to sort the groups of data based on the third column and get output like this:
2 1 5 7
3
1 2 3 4
4 6
8 9
10
3 4 2 9
2 7
11
Could you tell me how to do so?
sort -nk3r
will sort in reverse numeric order based on the 3rd column. Note, however, that this outputs
2 1 5 7
1 2 3 4
3 4 2 9
10
11
2 7
3
4 6
8 9
because the sort utility treats each line independently. This produces a different result from the output you posted, but it is correct according to the question as written (each line is ordered by its third column, rather than keeping the groups together).
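If the goal really is to keep each group intact and order whole groups by the third column of their first line (descending, as in the expected output), a small Python sketch could look like this, under one assumption: a new group starts at every line that has four fields (input.txt is a placeholder filename):
# Group-aware sort: assumes each group begins with a four-field header line.
groups = []
with open("input.txt") as fh:
    for line in fh:
        if len(line.split()) == 4:      # header line starts a new group
            groups.append([line])
        elif groups:                    # continuation line belongs to the current group
            groups[-1].append(line)

# Order groups by the numeric 3rd field of their header line, descending.
groups.sort(key=lambda g: int(g[0].split()[2]), reverse=True)
print("".join("".join(g) for g in groups), end="")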
