How to calculate standard deviation across different columns in a shell script - Linux

I have a datafile with 10 columns as given below
ifile.txt
2 4 4 2 1 2 2 4 2 1
3 3 1 5 3 3 4 5 3 3
4 3 3 2 2 1 2 3 4 2
5 3 1 3 1 2 4 5 6 8
I want to add an 11th column that shows the standard deviation of each row across the 10 columns, i.e. STDEV(2 4 4 2 1 2 2 4 2 1) and so on.
I am able to do this by transposing the file, running the following command on each column, and transposing the result back:
awk '{x[NR]=$0; s+=$1} END{a=s/NR; for (i in x){ss += (x[i]-a)^2} sd = sqrt(ss/NR); print sd}'
Can anybody suggest a simpler way to do it directly along each row?

You can do the same with one pass as well.
awk '{for(i=1;i<=NF;i++){s+=$i;ss+=$i*$i}m=s/NF;$(NF+1)=sqrt(ss/NF-m*m);s=ss=0}1' ifile.txt

Do you mean something like this?
awk '{for(i=1;i<=NF;i++)s+=$i;M=s/NF;
for(i=1;i<=NF;i++)sd+=(($i-M)^2);$(NF+1)=sqrt(sd/NF);M=sd=s=0}1' file
2 4 4 2 1 2 2 4 2 1 1.11355
3 3 1 5 3 3 4 5 3 3 1.1
4 3 3 2 2 1 2 3 4 2 0.916515
5 3 1 3 1 2 4 5 6 8 2.13542
You just use the fields instead of transposing and using the rows.
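For comparison, here is a minimal Python sketch of the same row-wise calculation. It assumes whitespace-separated numeric fields and, like both awk commands above, computes the population standard deviation (dividing by N rather than N-1). The function names are illustrative and do not come from either answer.

```python
import math
import statistics


def row_sd_two_pass(values):
    """Population standard deviation: mean first, then squared deviations."""
    return statistics.pstdev(values)


def row_sd_one_pass(values):
    """Population standard deviation from the sum and the sum of squares."""
    n = len(values)
    s = sum(values)
    ss = sum(v * v for v in values)
    mean = s / n
    # Rounding can push the variance slightly below zero; clamp before sqrt.
    return math.sqrt(max(ss / n - mean * mean, 0.0))


def append_sd(line, sd=row_sd_two_pass):
    """Return the line with its standard deviation appended as a last field."""
    values = [float(field) for field in line.split()]
    return f"{line.strip()} {sd(values):.6g}"


if __name__ == "__main__":
    sample = [
        "2 4 4 2 1 2 2 4 2 1",
        "3 3 1 5 3 3 4 5 3 3",
        "4 3 3 2 2 1 2 3 4 2",
        "5 3 1 3 1 2 4 5 6 8",
    ]
    for line in sample:
        print(append_sd(line))
```

The one-pass form mirrors the first awk answer and the two-pass form mirrors the second. The one-pass form can lose precision when the values are large relative to their spread, which is why the sketch clamps the variance at zero.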

Related

Pull Values from a Table in Excel

I am working on creating a user friendly character sheet for the new Pathfinder Playtest in Excel. I have run into an issue with a section and I have come here for help, not sure if it's the right place.
I want a cell to return a value from a table (below) based on the values of two other cells, e.g., if A1=19 and B1=4th it should pull the number from the matching row and column (3 in this case).
1st 2nd 3rd 4th 5th 6th 7th 8th 9th
1 2
2 3
3 3 2
4 3 3
5 3 3 2
6 3 3 3
7 3 3 3 2
8 3 3 3 3
9 3 3 3 3 2
10 3 3 3 3 3
11 3 3 3 3 3 2
12 3 3 3 3 3 3
13 3 3 3 3 3 3 2
14 3 3 3 3 3 3 3
15 3 3 3 3 3 3 3 2
16 3 3 3 3 3 3 3 3
17 3 3 3 3 3 3 3 3 2
18 3 3 3 3 3 3 3 3 3
19 3 3 3 3 3 3 3 3 3
20 3 3 3 3 3 3 3 3 3
I have tried using the below as well as just Indexing and I can't figure this out. Any help is appreciated, thanks!
=INDEX(P137:X156,MATCH(B2,O137:O156,1),MATCH(A10,P137:P156,1))
=INDEX(O137:O156,MATCH(1,(J125=P137:P156)*(J126=Q137:Q156)*(J127=R137:R156)*(J128=S137:S156)*(J129=T137:T156)*(J130=U137:U156)*(J131=V137:V156)*(J132=W137:W156)*(J133=X137:X156),0))
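Let's say your data starts at A1, as in the image below:
I added two cells (C25 and C26) where the user chooses the row and the column. Both cells use data validation lists built from your data, so no invalid value can be entered.
The formula is (it uses ; as the argument separator; use , instead if your regional settings call for it):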
=INDEX($1:$1048576;MATCH($C$25;$A:$A;0);MATCH($C$26;$1:$1;0))
Hope you can adapt this to your needs.
You can download the sample from Google Drive if you wish:
https://drive.google.com/open?id=1QXFmmEPMtJeiHDjKKM0o6kclpMIzaw_i
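The INDEX/MATCH pattern above is a two-way lookup: one MATCH finds the row, the other finds the column, and INDEX reads the cell where they cross. A minimal Python sketch of the same idea follows. It copies a few rows from the table in the question and assumes that the values in each row fill the columns from 1st onward, with the remaining cells blank; the names `COLUMNS`, `TABLE` and `lookup` are made up for illustration.

```python
# Column headers are spell levels; dictionary keys are character levels.
COLUMNS = ["1st", "2nd", "3rd", "4th", "5th", "6th", "7th", "8th", "9th"]

# A few rows copied from the question's table (assumed left-aligned:
# cells to the right of the listed values are blank).
TABLE = {
    1: [2],
    2: [3],
    3: [3, 2],
    4: [3, 3],
    19: [3, 3, 3, 3, 3, 3, 3, 3, 3],
}


def lookup(level, spell_level):
    """Return the cell at (level, spell_level), or None if it is blank."""
    row = TABLE[level]                 # plays the role of MATCH on the row labels
    col = COLUMNS.index(spell_level)   # plays the role of MATCH on the headers
    return row[col] if col < len(row) else None  # plays the role of INDEX


print(lookup(19, "4th"))  # 3
print(lookup(3, "4th"))   # None (blank cell)
```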

Counting the pairs that come together with a high value in a dataset

I have a set of data with column headings A, B, C, D, E ... K, and the cells contain values between 0 and 6. I am looking for a way to count and list the pairs or triples of columns that have high values (4, 5, 6).
For example, if columns A and B have 5 and 6 respectively in the same row, that row should count as one occurrence. If the values are 1 and 6, 1 and 5, etc., the row should be skipped. A row should count only if both columns (or all of them, when there are more than 2) have high values in that row.
Basically, I want to count and list the columns that have high values in the same row. I am open to all types of solutions. I'd really appreciate it if someone could guide me on how to do this. Thanks.
Example Output:
Pairs Number of Occurrences (can be (5,6), (4,6),(5,5), (4,5), (6,6))
AB 10
BC 20
CE 30
Here is a picture of my data.
This is just a part of my actual data. Not the complete list. I am sorry, I said values between 0 and 6. I deleted 0s, and they are all blank now.
A B C D E F G H I J K L M
3 3 2 4 2 4 5 4 2 2 4 3 3
2 4 3 3 3 3 6 4 2 3 3 2 4
3 3 2 4 2 4 3 3 3 3 3 3 3
3 3 4 2 4 2 4 3 3 5 1 3 3
2 4 4 2 4 2 3 6 4 2 2 4
2 4 2 4 2 4 3 3 3 3 3 2 4
3 3 2 4 2 4 3 3 3 3 3 3 3
5 1 2 4 2 4 3 3 3 3 3 5 1
2 4 1 5 1 5 3 4 2 3 3 2 4
3 3 2 4 2 4 3 3 3 3 3 3 3
5 1 2 4 2 4 2 3 3 3 3 5 1
3 3 2 4 2 4 3 4 2 4 2 3 3
4 2 3 3 3 3 3 3 3 4 2 4 2
3 3 3 3 3 3 3 3 3 6 0 3 3
2 4 3 3 3 3 3 4 2 5 1 2 4
4 2 2 4 2 4 3 1 5 3 3 4 2
2 4 4 2 4 2 4 3 3 3 3 2 4
3 3 2 4 2 4 3 2 4 4 2 3 3
3 3 4 2 4 2 4 3 3 3 3 3 3
4 2 2 4 2 4 3 3 3 3 3 4 2
2 4 3 3 3 3 3 3 3 4 2 2 4
2 4 2 4 2 4 2 2 4 4 2 2 4
4 2 3 3 3 3 5 4 2 1 5 4 2
3 3 3 3 3 3 3 4 2 3 3 3 3
1 5 2 4 2 4 3 4 2 2 4 1 5
5 1 4 2 4 2 6 1 5 3 3 5 1
4 2 1 5 1 5 3 3 3 2 4 4 2
1 5 2 4 2 4 1 3 3 3 3 1 5
2 4 4 2 4 2 1 2 4 2 4 2 4
4 2 5 1 5 1 2 4 2 3 3 4 2
4 2 1 5 1 5 4 1 5 4 2 4 2
2 4 3 3 3 3 3 3 3 6 0 2 4
4 2 2 4 2 4 3 3 3 3 3 4 2
I made two helper columns that list the pairs of columns, then used this formula to calculate the pairs of (4,5), (4,6), and (5,6).
= SUMPRODUCT(COUNTIFS(INDEX($A:$M,0,MATCH(O2,$A$1:$M$1,0)),{4,4,5,5,6,6},
INDEX($A:$M,0,MATCH(P2,$A$1:$M$1,0)),{5,6,6,4,4,5}))
EDIT Based on your most recent comment, formula is updated to this:
= COUNTIFS(INDEX($A:$M,0,MATCH(O2,$A$1:$M$1,0)),">3",
INDEX($A:$M,0,MATCH(P2,$A$1:$M$1,0)),">3")
See the example below. I didn't do it for every single pair of columns, but it should give you a good start:
Note your original data is to the left in my spreadsheet, I didn't show it here just to save space.
Here is a VBA solution that uses a Dictionary (this requires adding a reference to the Microsoft Scripting Runtime library):
Option Explicit
Sub main()
    Dim col As Range
    Dim cell As Range
    Dim pairDict As Scripting.Dictionary
    Set pairDict = New Scripting.Dictionary
    With Worksheets("rates")
        With .Range("a1").CurrentRegion
            For Each col In .Columns.Resize(, .Columns.Count - 1) 'loop through referenced range columns except the last one
                .AutoFilter Field:=col.Column, Criteria1:=">3" 'filter referenced range on current column with values > 3 (i.e. 4, 5 or 6, as defined in the question)
                If Application.WorksheetFunction.Subtotal(103, col) > 1 Then 'if any filtered cells except header
                    For Each cell In Intersect(.Offset(, col.Column).Resize(, .Columns.Count - col.Column), .Resize(.Rows.Count - 1).Offset(1).SpecialCells(xlCellTypeVisible).EntireRow) 'loop through each row of filtered cells from one column right of current one to the last one
                        If cell.Value > 3 Then pairDict(.Cells(1, col.Column).Value & .Cells(1, cell.Column).Value) = pairDict(.Cells(1, col.Column).Value & .Cells(1, cell.Column).Value) + 1 'if current cell value is > 3 then update dictionary with key=combination of columns pair first row content and value=value+1
                    Next
                End If
                .AutoFilter 'remove current filter
            Next
        End With
        .AutoFilterMode = False 'remove filters headers
    End With
    If pairDict.Count > 0 Then 'if any pair found
        Dim key As Variant
        For Each key In pairDict.Keys 'loop through each dictionary key
            Debug.Print key, pairDict(key) 'print the key (i.e. the pair of matching columns first row content) and the value (i.e. the number of occurrences found)
        Next
    End If
End Sub
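Outside Excel, the same count can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "high" means a value of 4 or more, blank cells would be represented as None, and only the first two data rows from the question are included. The function name `count_high_groups` is hypothetical.

```python
from collections import Counter
from itertools import combinations

HEADERS = list("ABCDEFGHIJKLM")

# First two data rows from the question; None would stand for a blank cell.
ROWS = [
    [3, 3, 2, 4, 2, 4, 5, 4, 2, 2, 4, 3, 3],
    [2, 4, 3, 3, 3, 3, 6, 4, 2, 3, 3, 2, 4],
]


def count_high_groups(rows, headers, size=2, threshold=4):
    """Count how often each group of `size` columns is high in the same row."""
    counts = Counter()
    for row in rows:
        # Columns whose value in this row is at or above the threshold.
        high = [h for h, v in zip(headers, row) if v is not None and v >= threshold]
        # Every combination of those columns co-occurs once in this row.
        for group in combinations(high, size):
            counts["".join(group)] += 1
    return counts


pairs = count_high_groups(ROWS, HEADERS)
for name, n in sorted(pairs.items()):
    print(name, n)
```

Passing `size=3` counts triples instead of pairs, which covers the "pairs or triples" part of the question without any extra code.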

Repeating elements in a dataframe

Hi all, I have the following dataframe:
A | B | C
1 2 3
2 3 4
3 4 5
4 5 6
And I am trying to only repeat the last two rows of the data so that it looks like this:
A | B | C
1 2 3
2 3 4
3 4 5
3 4 5
4 5 6
4 5 6
I have tried using append, concat and repeat to no avail.
repeated = lambda x:x.repeat(2)
df.append(df[-2:].apply(repeated),ignore_index=True)
This returns the following dataframe, which is incorrect:
A | B | C
1 2 3
2 3 4
3 4 5
4 5 6
3 4 5
3 4 5
4 5 6
4 5 6
You can use numpy.repeat to repeat the last two index labels and select those rows with loc to create df1. Then drop the last 2 rows of the original with iloc and append df1 to what remains:
df1 = df.loc[np.repeat(df.index[-2:].values, 2)]
print (df1)
A B C
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
print (df.iloc[:-2])
A B C
0 1 2 3
1 2 3 4
df = df.iloc[:-2].append(df1,ignore_index=True)
print (df)
A B C
0 1 2 3
1 2 3 4
2 3 4 5
3 3 4 5
4 4 5 6
5 4 5 6
If you want to keep your own code, use iloc to drop the last 2 rows of the original before appending:
repeated = lambda x:x.repeat(2)
df = df.iloc[:-2].append(df.iloc[-2:].apply(repeated),ignore_index=True)
print (df)
A B C
0 1 2 3
1 2 3 4
2 3 4 5
3 3 4 5
4 4 5 6
5 4 5 6
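Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the append calls above fail on current versions. A minimal sketch of the same approach with pd.concat follows; the variable names `head`, `tail` and `result` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [2, 3, 4, 5], "C": [3, 4, 5, 6]})

head = df.iloc[:-2]                              # every row except the last two
tail = df.iloc[-2:]                              # the last two rows
repeated_tail = tail.loc[tail.index.repeat(2)]   # each of them twice, in order

# ignore_index=True renumbers the result 0..5, as in the output above.
result = pd.concat([head, repeated_tail], ignore_index=True)
print(result)
```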
Use pd.concat and index slicing with .iloc:
pd.concat([df,df.iloc[-2:]]).sort_values(by='A')
Output:
A B C
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
I'm partial to manipulating the index into the pattern we are aiming for then asking the dataframe to take the new form.
Option 1
Use pd.DataFrame.reindex
df.reindex(df.index[:-2].append(df.index[-2:].repeat(2)))
A B C
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
Same thing in multiple lines
i = df.index
idx = i[:-2].append(i[-2:].repeat(2))
df.reindex(idx)
Could also use loc
i = df.index
idx = i[:-2].append(i[-2:].repeat(2))
df.loc[idx]
Option 2
Reconstruct from values. Only do this if all dtypes are the same. Note that the column names are lost, as the output shows.
i = np.arange(len(df))
idx = np.append(i[:-2], i[-2:].repeat(2))
pd.DataFrame(df.values[idx], df.index[idx])
0 1 2
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
Option 3
You can also pass a NumPy array of positions to iloc
i = np.arange(len(df))
idx = np.append(i[:-2], i[-2:].repeat(2))
df.iloc[idx]
A B C
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6

Sort a group of data based on a column

I have an input file that contains the following data:
1 2 3 4
4 6
8 9
10
2 1 5 7
3
3 4 2 9
2 7
11
I'm trying to sort the groups of lines by the third column of each group's first line, to get output like this:
2 1 5 7
3
1 2 3 4
4 6
8 9
10
3 4 2 9
2 7
11
Could you tell me how to do so?
sort -nk3r
will sort in reverse numeric order on the 3rd column. Note, however, that this outputs
2 1 5 7
1 2 3 4
3 4 2 9
10
11
2 7
3
4 6
8 9
because sort compares individual lines and does not keep the continuation lines together with their group. This is a different result from the output you posted, although it does sort on the third column as the question asks.
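To keep each group together, the lines have to be grouped before sorting. The sketch below is one possible way to do that in Python. It assumes that a group starts at any line with four fields and that the shorter lines after it belong to that group; this rule is inferred from the sample data, not stated in the question. The function name `sort_groups` is made up for illustration.

```python
def sort_groups(lines, key_col=3, header_fields=4):
    """Sort groups of lines by a column of each group's first line, descending."""
    groups = []
    for line in lines:
        if not line.strip():
            continue                      # skip empty lines
        if len(line.split()) >= header_fields:
            groups.append([line])         # assumed rule: a full line starts a group
        elif groups:
            groups[-1].append(line)       # shorter line: continuation of the group
        else:
            raise ValueError("continuation line before any group header")
    # list.sort is stable, so groups with equal keys keep their input order.
    groups.sort(key=lambda g: float(g[0].split()[key_col - 1]), reverse=True)
    return [line for group in groups for line in group]


data = """1 2 3 4
4 6
8 9
10
2 1 5 7
3
3 4 2 9
2 7
11"""

print("\n".join(sort_groups(data.splitlines())))
```

On the sample input this reproduces the output requested in the question, with the group starting 2 1 5 7 first and the group starting 3 4 2 9 last.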

Data fill in specific pattern

I am trying to fill data in MS Excel. I am given the following pattern:
1 2
1
1
2 5
2 5
2
3
3 6
3
4
4
5 4
And I want my output in the following format:
1 2
1 2
1 2
2 5
2 5
2 5
3 6
3 6
3 6
4
4
5 4
I tried using if(b2,b2,c1) in column 3, but that doesn't solve the problem for a=3 and a=4.
Any idea how to do this in Excel?
With sorting thus:
(the effect of which in this case is merely to move the 6 up one cell) and a blank row above:
=IF(AND(A2<>A1,B2=""),"",IF(B2<>"",B2,C1))
Entering this in C2 and copying it down should give the result you ask for from the data sample provided.
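The rule behind the formula is: within each run of equal values in the first column, reuse the one value that is present in the second column, and leave the cell blank when the group has none. A minimal Python sketch of that rule, using the data from the question, is shown below. It assumes each first-column value has at most one distinct non-blank partner and represents blanks as None; `fill_within_groups` is a hypothetical name. Unlike the formula, it does not need the data to be sorted first.

```python
def fill_within_groups(rows):
    """Fill blank second values with the value known for the same first value."""
    known = {}
    for key, value in rows:
        # Remember the first non-blank value seen for each key.
        if value is not None and key not in known:
            known[key] = value
    # Blank cells take the remembered value; keys with no value stay blank.
    return [(key, value if value is not None else known.get(key))
            for key, value in rows]


rows = [
    (1, 2), (1, None), (1, None),
    (2, 5), (2, 5), (2, None),
    (3, None), (3, 6), (3, None),
    (4, None), (4, None),
    (5, 4),
]

for key, value in fill_within_groups(rows):
    print(key, "" if value is None else value)
```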
