Renaming first column to distinguish between startsites script - linux

I'm looking for a way to convert the file called INPUT below into OUTPUT. INPUT has three columns: a unique ID, an ID, and a value. I would like to split each ID into separate IDs, using the value to tell them apart. I tried some basic commands but could not get them to work on the main input file, which has 20,000 rows and 15,000 IDs.
Does anyone have ideas or suggestions for handling this problem?
INPUT OUTPUT
unique ID VALUE unique ID VALUE
A1 GENEA 10 -> A1 GENEAp1 10
A2 GENEA 5 -> A2 GENEAp2 5
A3 GENEA 2 -> A3 GENEAp3 2
A4 GENEB 4 -> A4 GENEBp4 4
A5 GENEB 5 -> A5 GENEBp3 5
A6 GENEB 8 -> A6 GENEBp2 8
A7 GENEB 70 -> A7 GENEBp1 70
A8 GENEC 5 -> A8 GENECp1 5
A9 GENED 50 -> A9 GENEDp2 50
A10 GENED 10 -> A10 GENEDp3 10
Preferably the p numbering is based on the value: p1 for the highest value within an ID, p2 for the second highest, and so on.

Here's a crazy one-liner that does it:
head -1 file; tail -n+2 file| nl| sort -nrk4| awk '{ ++m[$3]; print($1" "$2" "$3"p"m[$3]" "$4); }'| sort -n| cut -d' ' -f2-4| column -to' ';
Output:
unique ID VALUE
A1 GENEAp1 10
A2 GENEAp2 5
A3 GENEAp3 2
A4 GENEBp4 4
A5 GENEBp3 5
A6 GENEBp2 8
A7 GENEBp1 70
A8 GENECp1 5
A9 GENEDp1 50
A10 GENEDp2 10
It involves sorting the file by the VALUE column in descending order, and then processing it sequentially in awk, counting occurrences of each distinct ID in an associative array, so you can build up the p# count.
Additional notes:
I printed the header line (head -1) separately from the data lines (tail -n+2) so the main processing pipeline would only apply to the data lines.
I added a call to nl before the initial sort to capture the original line order in a new leading numbering column, and then sorted by that column afterward (and then cut out that numbering column) to return to the original order.
I added column -to' ' at the end to align the data lines, don't know if you want/need that. If you want to align the header line with the data lines, you can surround the head statement and main pipeline with a braced block and move the column -to' ' filter outside the braced block to align the whole thing.
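For readers who prefer a script to a pipeline, the same idea (visit rows from highest to lowest value, count how many rows of each ID have been seen, then restore the original order) can be sketched in Python. This is only an illustration under the assumption that the data lines have already been parsed into tuples; the function name label_by_rank and the hard-coded rows are hypothetical.

```python
from collections import defaultdict

def label_by_rank(rows):
    """rows: list of (unique, gene, value) tuples in file order."""
    # Visit row indices from highest to lowest value; ties keep file order.
    order = sorted(range(len(rows)), key=lambda i: rows[i][2], reverse=True)
    seen = defaultdict(int)   # rows of each gene visited so far
    labelled = list(rows)
    for i in order:
        unique, gene, value = rows[i]
        seen[gene] += 1
        labelled[i] = (unique, f"{gene}p{seen[gene]}", value)
    return labelled           # same order as the input

# Sample data copied from the INPUT table in the question.
rows = [
    ("A1", "GENEA", 10), ("A2", "GENEA", 5), ("A3", "GENEA", 2),
    ("A4", "GENEB", 4), ("A5", "GENEB", 5), ("A6", "GENEB", 8),
    ("A7", "GENEB", 70), ("A8", "GENEC", 5), ("A9", "GENED", 50),
    ("A10", "GENED", 10),
]

for unique, gene, value in label_by_rank(rows):
    print(unique, gene, value)
```

Writing the result back into its original slot plays the role of the nl / sort -n round trip in the one-liner.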

Related

Average of each entry from a column values based on unique value of another column

I have a file which looks like:
E1 32 45 + Apple
E2 54 76 + Apple
...
...
-E2 300 400 + Apple
-E1 540 760 + Apple
E1 560 600 - Orange
E2 340 440 - Orange
...
...
-E2 30 40 - Orange
-E1 20 7 - Orange
Here the E number for each unique value in the last column can range from 1 to 100, and the last column can contain several thousand unique fruits. For each E label (for example E1), I want the average, taken over the unique values of the last column, of the difference between the third and second columns:
E1 ((45-32)+(600-560))/2 = 26.5
E2 ((76-54)+(440-340))/2 = 61
I want to calculate this for E1, E2 and E3, and also for -E3, -E2 and -E1, where -E1 is the last E in every unique entry of the last column, and -E2 and -E3 are the second-to-last and third-to-last Es.
I tried groupby from pandas to approach this problem:
df1.groupby(str(line[4]))[line[2]-line[1]].mean()
I do not know whether groupby is the right approach, and I am having a hard time writing a loop for this case.
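One possible reading of the calculation, sketched in plain Python: group the rows by the label in the first column and average (third column minus second column) within each group. It assumes the first column already carries the labels exactly as shown (E1, E2, -E2, -E1); the function name mean_span_by_label is hypothetical, and the elided "..." rows of the sample are simply left out.

```python
from collections import defaultdict

def mean_span_by_label(rows):
    """rows: (label, start, end, strand, fruit) tuples, as in the sample."""
    spans = defaultdict(list)
    for label, start, end, _strand, _fruit in rows:
        spans[label].append(end - start)   # difference for this row
    # Average the differences collected for each label.
    return {label: sum(v) / len(v) for label, v in spans.items()}

# Only the rows that are visible in the question.
rows = [
    ("E1", 32, 45, "+", "Apple"),
    ("E2", 54, 76, "+", "Apple"),
    ("-E2", 300, 400, "+", "Apple"),
    ("-E1", 540, 760, "+", "Apple"),
    ("E1", 560, 600, "-", "Orange"),
    ("E2", 340, 440, "-", "Orange"),
    ("-E2", 30, 40, "-", "Orange"),
    ("-E1", 20, 7, "-", "Orange"),
]

averages = mean_span_by_label(rows)
print(averages["E1"], averages["E2"])  # 26.5 61.0
```

The same shape carries over to pandas: compute the difference as its own column first, then group by the label column and take the mean of that column.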

Need excel formula to prescribe which 6 of 8 cells to use for average

How can I prescribe which 6 of 8 cells excel uses to make an average?
e.g.:
A1 Art 86
A2 English 88
A3 Law 89
A4 Chemistry 83
A5 Biology 81
A6 Math 1 87
A7 Math 2 67
A8 PhysEd 72
e.g. Average 1 is A1:A6 / 6
e.g. Average 2 is [top 6 highest] / 6
e.g. Average 3 is Chemistry, English and the next top 4 highest / 6
I want to define the list where Average 1 = A1:A6/6, Average 2 = ?, Average 3 = ?
I assume the course descriptions are in column A and the scores are in column B (not in column A):
B1:B6 =AVERAGE(B1:B6)
Top 6 =AVERAGE(LARGE(B1:B8,{1,2,3,4,5,6}))
Chemistry + English + top 4
=(SUM(SUMIF(A1:A8,{"Chemistry","English"},B1:B8))+SUM(LARGE((A1:A8<>"Chemistry")*(A1:A8<>"English")*B1:B8,{1,2,3,4})))/6
This last one is an array formula and must be entered by holding down Ctrl + Shift while pressing Enter.
A longer formula with the same result, but which can be normally entered:
=(SUM(AGGREGATE(14,4,((A1:A8="Chemistry")+(A1:A8="English"))*B1:B8,{1,2}))+SUM(AGGREGATE(14,4,(A1:A8<>"Chemistry")*(A1:A8<>"English")*B1:B8,{1,2,3,4})))/6
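As a cross-check of what the three formulas compute, here is a small Python sketch using the scores from the question. The function names are hypothetical and the dictionary simply stands in for columns A and B.

```python
# Course names and scores copied from the question (columns A and B).
scores = {
    "Art": 86, "English": 88, "Law": 89, "Chemistry": 83,
    "Biology": 81, "Math 1": 87, "Math 2": 67, "PhysEd": 72,
}

def average_first(scores, n=6):
    """Average 1: the first n scores in list order."""
    return sum(list(scores.values())[:n]) / n

def average_top(scores, n=6):
    """Average 2: the n highest scores."""
    return sum(sorted(scores.values(), reverse=True)[:n]) / n

def average_required_plus_top(scores, required, n=6):
    """Average 3: the required courses plus the best of the rest, n in total."""
    fixed = [scores[name] for name in required]
    rest = sorted((v for k, v in scores.items() if k not in required),
                  reverse=True)
    return (sum(fixed) + sum(rest[:n - len(fixed)])) / n

print(average_first(scores))
print(average_top(scores))
print(average_required_plus_top(scores, ["Chemistry", "English"]))
```

With this particular sample all three averages come out equal (514 / 6), because the two lowest scores happen to be the last two rows and neither is Chemistry or English.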
For #2, You can use Rank() to identify the top 6, then AverageIf() to average those <=6.
For #3, give Chem & English a value of 1, then use Rank() on the remaining 6 items. Then AverageIf() <=4. This is where it helps to put Chem & English at the top of the list.
Anyway, that's ONE way to solve it... hope it helps!

Analysing data using the column name

I have the following data.
Z Y Z Z
A1 A2 A3 A4 Total
1 2 5 10 16
2 3 5 11 18
3 4 6 12 21
4 4 7 12 23
I want to sum each row using only the columns named Z. I have a big data set, so I want a function that finds the Z columns across the whole sheet and then sums them (Total).
Any help would be appreciated.
This can be done with SUMIF():
=SUMIF($1:$1,"Z",2:2)
NOTE
The formula above uses full-row references, so do NOT put "Z" above the Total column, or it will cause a circular reference.
If you want to have "Z" above it then you need to define the ranges:
=SUMIF(A$1:D$1,"Z",A2:D2)
Another option is an array formula:
=Sum(If(A$1:D$1="Z",$A$2:$D$100))
Assuming your headers are in row 1, columns A to D, and your data starts in row 2, and goes to row 100.
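The per-row SUMIF logic (pick the columns whose header matches, then add those cells in each row) can be sketched in Python as follows. The lists mirror the sample data in the question, and row_totals is a hypothetical helper name.

```python
# Header labels (row 1) and data rows copied from the question.
headers = ["Z", "Y", "Z", "Z"]
data = [
    [1, 2, 5, 10],
    [2, 3, 5, 11],
    [3, 4, 6, 12],
    [4, 4, 7, 12],
]

def row_totals(headers, data, label="Z"):
    """Sum, for each row, only the cells whose column header equals label."""
    cols = [i for i, h in enumerate(headers) if h == label]
    return [sum(row[i] for i in cols) for row in data]

print(row_totals(headers, data))  # [16, 18, 21, 23]
```

The printed totals match the Total column in the question.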

Excel: Sum columns and rows if criteria is met

I have a sheet with product names in column I and then dates from there on. For each date there are numbers of how many pieces of a certain product have to be made. I'm trying to sum all those numbers based on a product type, i.e.:
I K L M ...
30.8. 31.8. 1.9. ...
MAD23 2 0 45 ...
MMR32 5 7 33 ...
MAD17 17 56 0 ...
MAD: 120 (2+0+45+17+56+0)
MMR: 45 (5+7+33)
What I'm doing now is sum the row first:
=SUM(K6:GN6)
MAD23 = 47
MMR32 = 45
MAD17 = 73
And then sum those numbers in column J based on part of the product name in column I:
=SUMIF(Sheet1!I6:I775;"MAD*";Sheet1!J6:J775)
MAD = 120
MMR = 45
Is it possible to do this with just one formula per criterion?
Just trying it on those three rows, I get
=SUM($K$6:$M$8*(LEFT($I$6:$I$8,LEN(I10)-1)=LEFT(I10,LEN(I10)-1)))
which is an array formula and must be entered with Ctrl+Shift+Enter.
That's assuming that I10 is going to contain some characters followed by a colon and you want to match those with the first characters of I6:I8.
=SUM(IF(MID(Sheet1!I6:I775,1,3)="MAD",Sheet1!k6:gn775,""))
This must also be entered with Ctrl+Shift+Enter.
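Both array formulas do the same thing in one step: keep the rows whose product name starts with the wanted prefix, then add up every date cell in those rows. A minimal Python sketch of that logic, with the three sample rows from the question and a hypothetical helper name total_for_prefix:

```python
# Product rows copied from the question: name -> quantities per date.
rows = {
    "MAD23": [2, 0, 45],
    "MMR32": [5, 7, 33],
    "MAD17": [17, 56, 0],
}

def total_for_prefix(rows, prefix):
    """Add up all quantities in rows whose product name starts with prefix."""
    return sum(sum(values) for name, values in rows.items()
               if name.startswith(prefix))

print(total_for_prefix(rows, "MAD"))  # 120
print(total_for_prefix(rows, "MMR"))  # 45
```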

Ranking in Excel, when I want the top n I get the top n+1 (or n+x)

I have a list in Excel and I'd like to select the top 3 results, and only 3. It seems easy to do with conditional formatting in this example:
1 | 2 | 3 | 4 | 5 | 6
44 | 78 | 98 | 45 | 52 | 98
Where the 2nd, 3rd and 6th number will be highlighted.
The problem appears with something like this:
1 | 2 | 3 | 4 | 5 | 6
44 | 78 | 78 | 45 | 87 | 98
Excel will highlight the 2nd, 3rd, 5th and 6th numbers, because the two 78s tie for third place in a sorted list.
How can I make Excel select only one of them?
(The solution doesn't need to use conditional formatting; any tool available in Excel can be used, from formulas to VB, but simplicity is desirable.)
For your conditional formatting, we need to break the tie. One way to do this is to count the number of times a value has already appeared earlier in the row.
This would change your conditional formula to
=(RANK(A2,$A$2:$F$2)+COUNTIF($A$2:A$2,A2)-1)<=3
Note that absolute positions are used in some cases, and relative in others.
Breakdown:
RANK(A2,$A$2:$F$2) - Rank formula. You know this one, as you're using it now
COUNTIF($A$2:A$2,A2) - count the number of times the value appears - note that the reference does not have a $ in front of the A after the colon - this is to ensure that the range gets bigger as we process the formula along the row (1st count: A2:A2, 2nd count: A2:B2, etc)
-1 - as the count will always match one number (itself)
so, for your second example,
44 78 78 45 87 98
The new ranks are
6 3 4 5 2 1
and the formulas convert to
=RANK(A2,$A$2:$F$2)+COUNTIF($A$2:A$2,A2)-1
=RANK(B2,$A$2:$F$2)+COUNTIF($A$2:B$2,B2)-1
=RANK(C2,$A$2:$F$2)+COUNTIF($A$2:C$2,C2)-1
=RANK(D2,$A$2:$F$2)+COUNTIF($A$2:D$2,D2)-1
=RANK(E2,$A$2:$F$2)+COUNTIF($A$2:E$2,E2)-1
=RANK(F2,$A$2:$F$2)+COUNTIF($A$2:F$2,F2)-1
for the conditional formatting
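To see why the adjustment breaks ties, the RANK + COUNTIF - 1 logic can be mirrored in a few lines of Python. This is only a sketch of the arithmetic, not of Excel itself, and tie_broken_ranks is a hypothetical name.

```python
def tie_broken_ranks(values):
    """Descending rank, plus the number of equal values seen earlier."""
    ranks = []
    for i, v in enumerate(values):
        # Descending rank: 1 + how many values are strictly larger.
        rank = 1 + sum(1 for other in values if other > v)
        # Mirrors COUNTIF(first cell : this cell) - 1: earlier equal values.
        earlier_ties = values[:i].count(v)
        ranks.append(rank + earlier_ties)
    return ranks

values = [44, 78, 78, 45, 87, 98]
ranks = tie_broken_ranks(values)
highlight = [r <= 3 for r in ranks]

print(ranks)      # [6, 3, 4, 5, 2, 1]
print(highlight)  # [False, True, False, False, True, True]
```

Only the first 78 keeps rank 3; the second one is pushed to rank 4, so exactly three cells pass the <=3 test.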
I couldn't think of a one-cell solution, but here is my take with helper columns:
Rank_Row column: check whether the input is duplicated and, if so, add a fraction of the row number to its rank. C2=IF(COUNTIF($B$2:$B$7,B2)>1,RANK(B2,$B$2:$B$7)+ROW()/100,RANK(B2,$B$2:$B$7))
Filter column: a simple RANK in ascending order over the Rank_Row column; you can use this column to filter the top 3. D2=RANK(C2,$C$2:$C$7,1)
Drag both formulas down to copy them to the remaining rows.

Resources