I have an Excel sheet with thousands of samples in one column. Normally each sample appears in three rows, one for each value in a second column called Target. I want to check whether a sample appears a 4th, 5th, or 6th time and, if so, append "-r2" to the sample name; if it appears a 7th, 8th, or 9th time, append "-r3". In other words, I am looking for a formula that leaves the first three occurrences of a sample unchanged, labels occurrences 4-6 as "-r2", and labels occurrences 7-9 as "-r3".
Below is an example of what I want my new column (called "New Sample") to look like. Note that not every sample appears more than 3 times; some appear 6 or 9 times.
Sample  New Sample  Target
1       1           top
1       1           middle
1       1           bottom
1       1-r2        top
1       1-r2        middle
1       1-r2        bottom
1       1-r3        top
1       1-r3        middle
1       1-r3        bottom
2       2           top
2       2           middle
2       2           bottom
3       3           top
3       3           middle
3       3           bottom
3       3-r2        top
3       3-r2        middle
3       3-r2        bottom
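Use COUNTIFS with an expanding range, so that each row counts how many times its Sample/Target pair has appeared so far: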
=A2&IF(COUNTIFS($A$2:A2,A2,$C$2:C2,C2)>1,"-r"&COUNTIFS($A$2:A2,A2,$C$2:C2,C2),"")
If the third column does not exist or cannot be referenced, count the occurrences of the sample alone and convert the running count into blocks of three:
=A2&IF(COUNTIFS($A$2:A2,A2)>3,"-r"&INT((COUNTIFS($A$2:A2,A2)-1)/3)+1,"")
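For anyone doing the same thing outside Excel, here is a minimal pandas sketch of the second formula's logic. It assumes the data is already in a DataFrame with the column names from the question; `seen`, `rep`, and `suffix` are names made up for this illustration.

```python
import pandas as pd

# Example data from the question: sample 1 appears 9 times, 2 three times, 3 six times.
df = pd.DataFrame({
    'Sample': [1] * 9 + [2] * 3 + [3] * 6,
    'Target': ['top', 'middle', 'bottom'] * 6,
})

# 0-based running count per sample, i.e. COUNTIFS($A$2:A2,A2) - 1.
seen = df.groupby('Sample').cumcount()

# Replicate number: occurrences 1-3 -> 1, 4-6 -> 2, 7-9 -> 3.
rep = seen // 3 + 1

# Only replicates after the first get a "-rN" suffix.
suffix = ('-r' + rep.astype(str)).where(rep > 1, '')
df['New Sample'] = df['Sample'].astype(str) + suffix
```

The integer division by 3 plays the same role as INT((count-1)/3)+1 in the formula above.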
Like so:
1
1
1
1
1
1
1
2
2
2
2
2
2
2
...
Something like =IF(A1=4,1,A1+1) would work if the sequence wasn't all the same value.
I believe you are looking for division:
=INT((ROW()-1)/7)
NOTE: You will have to adjust the 1 and the 7 in the formula above to suit your needs. The 1 is a row offset and the 7 is the number of times each value repeats. For example, use -2 instead of -1 if the numbers should start on row 2. Lastly, if you want the sequence to start at 1 instead of zero, simply add 1 to the result.
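To see how the offset, the repeat count, and the optional +1 interact, here is the same arithmetic as a small Python sketch. The function name and parameter names are made up for illustration; `row` stands in for Excel's 1-based ROW().

```python
def block_number(row, offset=1, repeat=7, start=0):
    """Mirror of =INT((ROW()-offset)/repeat)+start for a 1-based row number."""
    return (row - offset) // repeat + start

# Rows 1-14, with the sequence starting at 1 instead of 0.
values = [block_number(r, start=1) for r in range(1, 15)]
```

Here `values` holds seven 1s followed by seven 2s, which is the sequence shown in the question.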
=(the previous row's value) + IF(the previous 7 rows all have the same value, 1, 0 )
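Obviously(?) this will not work for the first 7 rows; the first row could be the starting value, and the next 6 just a straight copy of it. For example, assuming the sequence lives in column A starting at row 1, the formula in A8 could be =A7+IF(COUNTIF(A1:A7,A7)=7,1,0), copied down.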
This is the ultimate Pandas challenge, in my mind, though it may be elementary to some of you out there...
I am trying to link a particular job position with the survey items to which it corresponds. For example, the president of site A would be linked to the results of any survey item on which respondents from site A provide feedback (e.g. "To what degree do you agree with the following statement: 'I think the quality of site A is sufficient overall'?").
Each site has 5 stations (0 through 4). Each job position is assigned to one or more station(s) at one or more site(s).
For example, the president of a site works at all stations within that site, while a contractor might only work at a couple of stations, maybe at 2 different sites.
Survey data was collected on the quality of each station within each site.
Some survey items pertain to certain stations within one or more sites.
For example, the "Positions" table looks like this:
import pandas as pd
import numpy as np
pos = pd.DataFrame({'Station(s)':[',1,2,,','0,1,2,3,4'],
                    'Position':['Contractor','President'],
                    'Site(s)':['A,B','A'],
                    'Item(s)':['1','1,2']
                    })
pos[['Position','Site(s)','Station(s)','Item(s)']]
     Position Site(s) Station(s) Item(s)
0  Contractor     A,B     ,1,2,,       1
1   President       A  0,1,2,3,4     1,2
The survey data table looks like this:
sd = pd.DataFrame({'Site':['A','B','B','C','A','A'],
                   'Station(s)':[',1,2,,',',1,2,,',',,,,',',1,2,,','0,1,2,,',',,2,,'],
                   'Item 1':[1,1,0,0,1,np.nan],
                   'Item 2':[1,0,0,1,0,1]})
sd[['Site','Station(s)','Item 1','Item 2']]
  Site Station(s)  Item 1  Item 2
0    A     ,1,2,,       1       1
1    B     ,1,2,,       1       0
2    B       ,,,,       0       0
3    C     ,1,2,,       0       1
4    A    0,1,2,,       1       0
5    A      ,,2,,     NaN       1
2 side notes:
The item data has been coded to 1 and 0 for unimportant reasons.
The comma separated responses are actually condensed from columns (one column per station and item). I only mention that because if it is better to not condense them, that can be done (or not).
So here's what I need:
This (I think):
   Contractor  President Site(s) Station(s)  Item 1  Item 2
0           1          1       A     ,1,2,,       1       1
1           1          0       B     ,1,2,,       1       0
2           0          0       B       ,,,,       0       0
3           0          0       C     ,1,2,,       0       1
4           0          1       A    0,1,2,,       1       0
5           1          1       A      ,,2,,     NaN       1
Logic:
The contractor works at sites A and B and should only be associated with respondents who work at either of those sites.
Among those respondents, he should only be associated with those who work at stations 1 or 2 and at no others (i.e. not station 0).
Therefore, the contractor's rows of interest in sd are indices 0, 1, and 5.
The president's rows of interest are from indices 0, 4, and 5.
...and, ultimately, this:
     Position  Overall%
0  Contractor       100
1   President        80
Logic:
Because the president is concerned with items 1 and 2, there are 5 numbers to consider: (1 and 1) from item 1 and (1, 0, and 1) from item 2.
The sum across items is 4 and the count across items is 5 (again, do not count 'NaN'), which gives 80%.
Because the contractor is only concerned with item 1, there are 2 numbers to consider from his rows of interest: 1 and 1 (the 'NaN' should not be counted). The sum is 2 and the count is 2, which gives 100%.
Thanks in advance!
Update
I know this works (top answer just under question), but how can that be applied to this situation? I tried this (just to try the first part of the logic):
for i in pos['Position']:
    sd[i]=[x for x in pos.loc['Site(s)'] if x in sd['Site']]
...but it threw this error:
KeyError: 'the label [Site(s)] is not in the [index]'
...so I'm still wrestling with it.
If I understand correctly, you want to add one column to sd for each job position in pos. (This is the first task.)
So, for each row index i in pos (we iterate over rows in pos), we can create a unique boolean column:
# PSEUDOCODE:
sd[position_name_i] = (sd['Site'] IS_CONTAINED_IN pos.loc[i,'Site(s)']) and (sd['Station(s)'] IS_CONTAINED_IN pos.loc[i,'Station(s)'])
I hope the logic here is clear and consistent with your goal.
The expression X IS_CONTAINED_IN Y may be implemented in many different ways. (I can think of X and Y being sets, in which case it's X.issubset(Y). Or in terms of bitmasks X, Y and a bitwise AND, i.e. X & Y == X.)
The name of the column, position_name_i may be simply an integer i. (Or something more meaningful as pos.loc[i,'Position'] if this column consists of unique values.)
Once this is done, we can move on to the other task. Now, sd[sd[position_name_i]] will return only the rows of sd for which position_name_i is True.
We iterate over all positions (i.e. rows in pos). For each position:
# number of non-NaN entries in 'Item 1' and 'Item 2' relevant for the position
# (restrict the column list to the items named in pos.loc[i, 'Item(s)'] if the position does not cover both):
total = sd.loc[sd[position_name_i], ['Item 1', 'Item 2']].count().sum()
# number of 1's among these entries:
partial = sd.loc[sd[position_name_i], ['Item 1', 'Item 2']].sum().sum()
The final Overall% for the given position is 100*partial/total.
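Putting the two tasks together, a runnable sketch might look like the following. It rests on a few assumptions that are not stated in the question: IS_CONTAINED_IN is implemented with Python sets, a respondent with no stations at all matches no position (which is what the expected table shows for index 2), and each position is scored only on the items listed in its Item(s) entry. The helper `to_set` is a name made up for this example.

```python
import numpy as np
import pandas as pd

pos = pd.DataFrame({
    'Position': ['Contractor', 'President'],
    'Site(s)': ['A,B', 'A'],
    'Station(s)': [',1,2,,', '0,1,2,3,4'],
    'Item(s)': ['1', '1,2'],
})
sd = pd.DataFrame({
    'Site': ['A', 'B', 'B', 'C', 'A', 'A'],
    'Station(s)': [',1,2,,', ',1,2,,', ',,,,', ',1,2,,', '0,1,2,,', ',,2,,'],
    'Item 1': [1, 1, 0, 0, 1, np.nan],
    'Item 2': [1, 0, 0, 1, 0, 1],
})

def to_set(csv):
    """Split a comma-separated string into a set, dropping empty fields."""
    return {part for part in csv.split(',') if part}

sd_stations = sd['Station(s)'].map(to_set)

rows = []
for _, p in pos.iterrows():
    sites, stations = to_set(p['Site(s)']), to_set(p['Station(s)'])
    # Assumption: an empty station set is treated as "no match".
    in_stations = sd_stations.map(lambda s: bool(s) and s <= stations)
    flag = sd['Site'].isin(sites) & in_stations
    sd[p['Position']] = flag.astype(int)

    # Score the position only on the items it is concerned with.
    items = ['Item ' + n for n in sorted(to_set(p['Item(s)']))]
    relevant = sd.loc[flag, items]
    total = relevant.count().sum()    # non-NaN entries
    partial = relevant.sum().sum()    # number of 1's (NaN is skipped)
    rows.append({'Position': p['Position'], 'Overall%': 100 * partial / total})

overall = pd.DataFrame(rows)
```

With the sample data this flags rows 0, 1, and 5 for the contractor and rows 0, 4, and 5 for the president, matching the tables in the question.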
I'm not sure how to ask this question without illustrating it, so here goes:
I have results from a test that measured people's attitudes (scores 1-5) to different voices on 16 different scales. The data set looks like this (where P1, P2, P3 are participants and A, B, C are voices):
    Aformal  Apleasant  Acool  Bformal  etc
P1        2          3      1        4
P2        5          4      2        4
P3        1          2      4        3
However, I want to rearrange my data to look like this:
     formal  pleasant  cool
P1A       3         3     5
P1B       2         1     6
P1C  etc
P1D
This would mean a lot more rows (multiple rows per participant), and a lot fewer columns. Is it doable without having to manually re-enter all the scores in a new Excel file?
Sure, no problem. I just hacked this solution:
L        M       N    O       P         Q
person#  voice#       formal  pleasant  cool
1        1       P1A  2       3         1
1        2       P1B  4       5         2
1        3       P1C  9       9         9
2        1       P2A  5       4         2
2        2       P2B  4       4         1
2        3       P2C  9       9         9
3        1       P3A  1       2         4
3        2       P3B  3       3         2
3        3       P3C  9       9         9
Basically, in columns L and M, I made two columns with index numbers. Voice numbers go from 1 to 3 and repeat every 3 rows because there are nv=3 voices (increase this if you have voices F, G, H...). Person numbers are also repeated for 3 rows each, because there are nv=3 voices.
To make the row headers (P1A, etc.), I used this formula: ="P" & L2 & CHAR(64+M2) at P1A and copied down.
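To make the new table of values, I used this formula: =OFFSET(B$2,$L2-1,($M2-1)*3) at P1A-formal, and copied down and across. Note that B$2 corresponds to the cell address for P1-Aformal in the original table of values (your example). The 3 in ($M2-1)*3 is the number of scale columns per voice in the example; with 16 scales per voice it would presumably be 16.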
I've used this indexing trick I don't know how many times to quickly rearrange tables of data inherited from other people.
EDIT: Note that the index columns are also made (semi)automatically using a formula. The first nv=3 rows are made manually, and then subsequent rows refer to them. For example, the formula in L5 is =L2+1 and the formula in M5 is =M2. Both are copied down to the end of the table.
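If the data can be loaded into pandas instead, the same rearrangement can be sketched as below. The values are the voice A and B scores from the tables above; the column naming (one voice letter followed by the scale name) is taken from the question, and `wide`, `scales`, and `long_df` are names made up for this illustration.

```python
import pandas as pd

# Wide table: one row per participant, one column per voice/scale pair.
wide = pd.DataFrame(
    {'Aformal': [2, 5, 1], 'Apleasant': [3, 4, 2], 'Acool': [1, 2, 4],
     'Bformal': [4, 4, 3], 'Bpleasant': [5, 4, 3], 'Bcool': [2, 1, 2]},
    index=['P1', 'P2', 'P3'],
)
scales = ['formal', 'pleasant', 'cool']

pieces = []
for voice in ['A', 'B']:
    # Pick this voice's block of columns, like OFFSET with ($M2-1)*3.
    part = wide[[voice + s for s in scales]].copy()
    part.columns = scales
    # Row labels become P1A, P2A, ... like ="P" & L2 & CHAR(64+M2).
    part.index = wide.index + voice
    pieces.append(part)

# Stack the per-voice blocks and order the rows by participant, then voice.
long_df = pd.concat(pieces).sort_index()
```

Each pass through the loop corresponds to one voice number in column M of the spreadsheet solution.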