Pandas Link Rows to Rows Based on Multiple Criteria - python-3.x

This is the ultimate Pandas challenge, in my mind, though it may be elementary to some of you out there...
I am trying to link a particular job position with the survey items to which it corresponds. For example, the president of site A would be attributed the results from a survey item on which respondents from site A provide feedback (e.g. "To what degree do you agree with the following statement: 'I think the quality of site A is sufficient overall'?").
Each site has 5 stations (0 through 4). Each job position is assigned to one or more station(s) at one or more site(s).
For example, the president of a site works at all stations within that site, while a contractor might only work at a couple of stations, maybe at 2 different sites.
Survey data was collected on the quality of each station within each site.
Some survey items pertain to certain stations within one or more sites.
For example, the "Positions" table looks like this:
import pandas as pd
import numpy as np
pos = pd.DataFrame({'Station(s)':[',1,2,,','0,1,2,3,4'],
'Position':['Contractor','President'],
'Site(s)':['A,B','A'],
'Item(s)':['1','1,2']
})
pos[['Position','Site(s)','Station(s)','Item(s)']]
Position Site(s) Station(s) Item(s)
0 Contractor A,B ,1,2,, 1
1 President A 0,1,2,3,4 1,2
The survey data table looks like this:
sd = pd.DataFrame({'Site':['A','B','B','C','A','A'],
'Station(s)':[',1,2,,',',1,2,,',',,,,',',1,2,,','0,1,2,,',',,2,,'],
'Item 1':[1,1,0,0,1,np.nan],
'Item 2':[1,0,0,1,0,1]})
sd[['Site','Station(s)','Item 1','Item 2']]
Site Station(s) Item 1 Item 2
0 A ,1,2,, 1 1
1 B ,1,2,, 1 0
2 B ,,,, 0 0
3 C ,1,2,, 0 1
4 A 0,1,2,, 1 0
5 A ,,2,, NaN 1
2 side notes:
The item data has been coded to 1 and 0 for unimportant reasons.
The comma separated responses are actually condensed from columns (one column per station and item). I only mention that because if it is better to not condense them, that can be done (or not).
So here's what I need:
This (I think):
Contractor President Site(s) Station(s) Item 1 Item 2
0 1 1 A ,1,2,, 1 1
1 1 0 B ,1,2,, 1 0
2 0 0 B ,,,, 0 0
3 0 0 C ,1,2,, 0 1
4 0 1 A 0,1,2,, 1 0
5 1 1 A ,,2,, NaN 1
Logic:
The contractor works at sites A and B and should only be associated with respondents who work at either of those sites.
Within those respondents, he should only be associated with those who work at stations 1 or 2 but none who also work at any others (e.g. station 0).
Therefore, the contractor's rows of interest in sd are indices 0, 1, and 5.
The president's rows of interest are indices 0, 4, and 5.
...and, ultimately, this:
Position Overall%
0 Contractor 100
1 President 80
Logic:
Because the president is concerned with items 1 and 2, there are 5 numbers to consider: (1 and 1) from item 1 and (1, 0, and 1) from item 2.
The sum across items is 4 and the count across items is 5 (again, do not count 'NaN'), which gives 80%.
Because the contractor is only concerned with item 1, there are 2 numbers to consider: 1 and 1 - 'NaN' should not be counted - (from the rows of interest, respectively). Therefore, the sum is 2 out of the count, which is 2, which gives 100%
Thanks in advance!
Update
I know this works (top answer just under question), but how can that be applied to this situation? I tried this (just to try the first part of the logic):
for i in pos['Position']:
    sd[i]=[x for x in pos.loc['Site(s)'] if x in sd['Site']]
...but it threw this error:
KeyError: 'the label [Site(s)] is not in the [index]'
...so I'm still wrestling with it.

If I understand correctly, you want to add one column to sd for each job position in pos. (This is the first task.)
So, for each row index i in pos (we iterate over rows in pos), we can create a unique boolean column:
# PSEUDOCODE:
sd[position_name_i] = (sd['Site'] IS_CONTAINED_IN pos.loc[i,'Site(s)']) and (sd['Station(s)'] IS_CONTAINED_IN pos.loc[i,'Station(s)'])
I hope the logic here is clear and consistent with your goal.
The expression X IS_CONTAINED_IN Y may be implemented in many different ways. (For example, X and Y can be Python sets, in which case the test is X.issubset(Y); or they can be bitmasks, in which case the test is X & Y == X.)
The name of the column, position_name_i may be simply an integer i. (Or something more meaningful as pos.loc[i,'Position'] if this column consists of unique values.)
If this is done, we can do the other task. Now, sd[sd[position_name_i]] will return only the rows of sd for which position_name_i is True.
We iterate over all positions (i.e. rows in pos). For each position:
# number of non-nan entries in 'Item 1' and 'Item 2' relevant for the position:
total = sd.loc[sd[position_name_i], ['Item 1', 'Item 2']].count().sum()
# number of 1's among these entries:
partial = sd.loc[sd[position_name_i], ['Item 1', 'Item 2']].sum().sum()
The final Overall% for the given position is 100*partial/total. To honor the Item(s) column, restrict the column list to the items named in pos.loc[i,'Item(s)'], so that, for example, the contractor is scored on Item 1 only.
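The two steps above can be put together as a minimal pandas sketch. It makes a few assumptions that are not stated in the answer: the site column in sd is named 'Site' (as in the printed table), the helper to_set is hypothetical, and a respondent with no stations at all is never matched (this is inferred from row 2 of the expected table in the question).

```python
import numpy as np
import pandas as pd

pos = pd.DataFrame({
    'Position': ['Contractor', 'President'],
    'Site(s)': ['A,B', 'A'],
    'Station(s)': [',1,2,,', '0,1,2,3,4'],
    'Item(s)': ['1', '1,2'],
})
sd = pd.DataFrame({
    'Site': ['A', 'B', 'B', 'C', 'A', 'A'],
    'Station(s)': [',1,2,,', ',1,2,,', ',,,,', ',1,2,,', '0,1,2,,', ',,2,,'],
    'Item 1': [1, 1, 0, 0, 1, np.nan],
    'Item 2': [1, 0, 0, 1, 0, 1],
})

def to_set(csv):
    """Hypothetical helper: turn ',1,2,,' into {'1', '2'}, dropping empty fields."""
    return {part for part in csv.split(',') if part}

sd_stations = sd['Station(s)'].map(to_set)
rows = []
for _, p in pos.iterrows():
    sites = to_set(p['Site(s)'])
    stations = to_set(p['Station(s)'])
    # Assumption: an empty station set never matches (see row 2 of the expected table).
    station_ok = sd_stations.map(lambda s: bool(s) and s <= stations).astype(bool)
    mask = sd['Site'].isin(list(sites)) & station_ok
    sd[p['Position']] = mask.astype(int)
    # Only score the items listed for this position.
    items = ['Item ' + i for i in sorted(to_set(p['Item(s)']))]
    block = sd.loc[mask, items]
    total = block.count().sum()      # non-NaN entries
    partial = block.sum().sum()      # number of 1's (NaN is skipped)
    rows.append({'Position': p['Position'], 'Overall%': 100 * partial / total})

overall = pd.DataFrame(rows)
print(sd)
print(overall)
```

On the sample data this should flag rows 0, 1 and 5 for the contractor and rows 0, 4 and 5 for the president, giving 100 and 80 respectively.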

Related

Compare multiple data from rows

I'm looking for a way to compare multiple rows of data to each other, trying to find the best possible match. Each number in every column must approximately match the other numbers in the same column.
Example:
Customer #1: 1 5 10 9 7 7 8 2 3
Customer #2: 10 5 9 3 5 7 4 3 2
Customer #3: 1 4 10 9 8 7 6 2 2
Customer #4: 9 5 6 7 2 1 10 5 6
In this example customers #1 and #3 are quite similar, and I need to find a way to highlight or sort the rows so I can easily find the best match.
I've tried using conditional formatting to highlight the numbers that are the similar, but that is quite confusing, because the amount of data is quite big.
Any ideas of how I could solve this?
Thanks!
The following formula entered in (say) L1 and pulled down gives the best match with the current row based on the sum of the absolute differences between corresponding cells:-
=MIN(IF(ROW($C$1:$K$4)<>ROW(),(MMULT(ABS($C1:$K1-$C$1:$K$4),TRANSPOSE(COLUMN($C$1:$K$4))^0))))
It is an array formula and must be entered with Ctrl+Shift+Enter.
You can then sort on column L to bring the customers with lowest similarity scores to the top or use conditional formatting to highlight rows with a certain similarity value.
EDIT
If you wanted to penalise large differences in individual columns more heavily than small differences to try and avoid pairs of customers which are fairly similar except for having some columns very different, you could try something like the square of the differences:-
=MIN(IF(ROW($C$1:$K$4)<>ROW(),(MMULT(($C1:$K1-$C$1:$K$4)^2,TRANSPOSE(COLUMN($C$1:$K$4))^0))))
then the scores for your test data would come out as 7,127,7,127.
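For readers who want to check these scores outside the spreadsheet, here is a rough NumPy sketch of the same idea (not the formula itself): for each customer, take the smallest sum of absolute or squared differences to any other customer. The variable names are illustrative.

```python
import numpy as np

customers = np.array([
    [1, 5, 10, 9, 7, 7, 8, 2, 3],    # Customer #1
    [10, 5, 9, 3, 5, 7, 4, 3, 2],    # Customer #2
    [1, 4, 10, 9, 8, 7, 6, 2, 2],    # Customer #3
    [9, 5, 6, 7, 2, 1, 10, 5, 6],    # Customer #4
])

# diff[i, j, :] holds the column-wise differences between customers i and j.
diff = customers[:, None, :] - customers[None, :, :]
abs_dist = np.abs(diff).sum(axis=2).astype(float)
sq_dist = (diff ** 2).sum(axis=2).astype(float)

# A customer must not be matched with itself.
np.fill_diagonal(abs_dist, np.inf)
np.fill_diagonal(sq_dist, np.inf)

best_abs = abs_dist.min(axis=1)        # score per customer, absolute differences
best_sq = sq_dist.min(axis=1)          # score per customer, squared differences
best_match = sq_dist.argmin(axis=1)    # 0-based index of the closest other customer

print(best_abs, best_sq, best_match)
```

A lower score means a closer match, so sorting by best_sq plays the same role as sorting on the helper column in the sheet.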
I'm assuming you want to compare customers 2-4 with customer 1 and that you are comparing only within each column. In this case, you could implement a 'scoring system' using multiple IFs. For example:
A B C D E
1 Customer 1 1 1 2
2 Customer 2 1 2 2
3 Customer 3 0 1 0
you could use in E2
=if(B2=$B$1,1,0)+if(C2=$C$1,1,0)+if(D2=$D$1,1,0)
This will return a 'score' of 1 when you have a match and a 'score' of 0 when you don't. It then adds up the scores and your highest value will be your best match. Copying down would then give
A B C D E
1 Customer 1 1 1 2
2 Customer 2 1 2 2 2
3 Customer 3 0 1 0 1
so customer 2 is the best match.
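The same scoring idea, sketched in Python for comparison: count the columns in which each customer exactly matches the reference customer. The dictionary layout is just an illustration of the small table above.

```python
reference = [1, 1, 2]                      # Customer 1 (row 1, columns B:D)
others = {
    'Customer 2': [1, 2, 2],
    'Customer 3': [0, 1, 0],
}

# One point for every column that equals the reference value.
scores = {
    name: sum(a == b for a, b in zip(values, reference))
    for name, values in others.items()
}
best = max(scores, key=scores.get)
print(scores, best)
```

Note that this only rewards exact matches; values that are merely close score 0, which is the main difference from the distance-based approach.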

Group people based on their hobbies in Spark

I am working with PySpark on a case where I need to group people based on their interests. Let's say I have n persons:
person1, movies, sports, dramas
person2, sports, trekking, reading, sleeping, movies, dramas
person3, movies, trekking
person4, reading, trekking, sports
person5, movies, sports, dramas
.
.
.
Now I want to group people based on their interests.
Group people who have at least m common interests (m is user input, it could be 2, 3, 4...)
Let's assume m=3
Then the groups are:
(person1, person2, person5)
(person2, person4)
User who belongs to x groups (x is user input)
Let's assume x=2
Then
person2 is in two groups
My response will be algebraic and not Spark/Python specific, but can be implemented in Spark.
How can we express the data in your problem?
I will go with matrix - each row represents a person, each column represents interest. So following your example:
movies,sports,trekking,reading,sleeping,dramas
P1: 1 1 0 0 0 1
P2: 1 1 1 1 1 1
P3: 1 0 1 0 0 0
P4: 0 1 1 1 0 0
P5: 1 1 0 0 0 1
What if we would like to investigate the similarity of P2 and P3, that is, check how many interests they share? We could use the following formula:
(movies)+(sports)+(trekking)+(reading)+(sleeping)+(dramas)
1*1 + 1*0 + 1*1 + 1*0 + 1*0 + 1*0 = 2
It may look familiar to you: it is one entry of a matrix multiplication.
To make full use of this observation, we have to transpose the matrix. It will look like this:
P1,P2,P3,P4,P5
movies 1 1 1 0 1
sports 1 1 0 1 1
trekking 0 1 1 1 0
reading 0 1 0 1 0
sleeping 0 1 0 0 0
dramas 1 1 0 0 1
Now if we multiply the matrices (original times transposed), we get a new matrix:
P1 P2 P3 P4 P5
P1 3 3 1 1 3
P2 3 6 2 3 3
P3 1 2 2 1 1
P4 1 3 1 3 1
P5 3 3 1 1 3
What you see here is the result you are looking for - check the value on the row/column crossing and you will get number of shared interests.
How many interests do P2 share with P4? Answer: 3
Who shares 3 interests with P1? Answer: P2 and P5
Who shares 2 interests with P3? Answer: P2
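As a small illustration of the matrix idea, here is a plain NumPy sketch (not Spark code) built directly from the five people listed in the question. The variable names are illustrative; on a cluster the same product would be computed with Spark's distributed matrix types.

```python
import numpy as np

people = {
    'person1': ['movies', 'sports', 'dramas'],
    'person2': ['sports', 'trekking', 'reading', 'sleeping', 'movies', 'dramas'],
    'person3': ['movies', 'trekking'],
    'person4': ['reading', 'trekking', 'sports'],
    'person5': ['movies', 'sports', 'dramas'],
}
hobbies = ['movies', 'sports', 'trekking', 'reading', 'sleeping', 'dramas']
names = list(people)

# One row per person, one column per hobby, 1 if the person has that hobby.
M = np.array([[int(h in people[p]) for h in hobbies] for p in names])

# shared[i, j] = number of hobbies person i and person j have in common.
shared = M @ M.T

m = 3  # minimum number of common interests (user input in the question)
pairs = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if shared[i, j] >= m
]
print(shared)
print(pairs)
```

The diagonal of shared is simply each person's own hobby count, so only the off-diagonal entries are compared against m.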
Some hints on how to apply this idea into Apache Spark
How to operate on matrices using Apache Spark?
Matrix Multiplication in Apache Spark
How to transpose matrix using Apache Spark?
Matrix Transpose on RowMatrix in Spark
EDIT 1: Adding more realistic method (after the comments)
We have a table/RDD/Dataset "UserHobby":
movies,sports,trekking,reading,sleeping,dramas
P1: 1 1 0 0 0 1
P2: 1 1 1 1 1 1
P3: 1 0 1 0 0 0
P4: 0 1 1 1 0 0
P5: 1 1 0 0 0 1
Now to find all the people that share 2 hobbies with P1 you would have to execute:
SELECT * FROM UserHobby
WHERE movies*1 + sports*1 + trekking*0 +
reading*0 + sleeping*0 + dramas*1 = 2
Now you would have to repeat this query for all the users (changing 0s and 1s to the actual values). The algorithm complexity is O(n^2 * m) - n number of users, m number of hobbies
What is nice about this method is that you don't have to generate subsets.
My answer might not be the best, but it will do the work. If you know the total list of hobbies in advance, you can write a piece of code which computes the combinations before going into the Spark part.
For example:
Filter out the people whose hobbies_count < input_number at the very start to weed out unwanted records.
If the total list of hobbies is {a,b,c,d,e,f} and the input_number is 2, the list of combinations in this case would be
{(ab)(ac)(ad)(ae)(af)(bc)(bd)(be)(bf)(cd)(ce)(cf)(de)(df)(ef)}
So you will need to generate the possible combinations in advance for the input_number.
Later, perform a filter for each combination and track the record count.
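A rough plain-Python sketch of this combination approach follows. It differs slightly from the description above: instead of enumerating every combination of the full hobby list and filtering, it enumerates the size-m combinations of each person's own hobbies, which should produce the same groups without touching combinations nobody has. The names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

people = {
    'person1': {'movies', 'sports', 'dramas'},
    'person2': {'sports', 'trekking', 'reading', 'sleeping', 'movies', 'dramas'},
    'person3': {'movies', 'trekking'},
    'person4': {'reading', 'trekking', 'sports'},
    'person5': {'movies', 'sports', 'dramas'},
}
m = 3  # minimum number of common interests
x = 2  # report people who belong to exactly x groups

groups = defaultdict(set)
for person, hobbies in people.items():
    if len(hobbies) < m:          # cannot share m hobbies with anyone
        continue
    for combo in combinations(sorted(hobbies), m):
        groups[combo].add(person)

# Keep only combinations shared by at least two people.
shared = {combo: members for combo, members in groups.items() if len(members) > 1}

membership = defaultdict(int)
for members in shared.values():
    for person in members:
        membership[person] += 1
in_x_groups = sorted(p for p, n in membership.items() if n == x)

print(shared)
print(in_x_groups)
```

Keep in mind that the number of combinations grows quickly with the number of hobbies per person, so this only suits small hobby sets.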
If the number of users is large, you can't possibly think about going for any User x User approach.
Step 0. As a first step, we should ignore all the users who don't have at least m interests (since they cannot have at least m common interests with anyone).
Step 1. Possible approaches:
i)Brute Force: If the maximum number of interests is small, say 10, you can generate all the possible interest combinations in a HashMap, and assign an interest group id to each of these. You will need just one pass over the interest set of a user to find out which interest groups do they qualify for. This will solve the problem in one pass.
ii) Locality Sensitive Hashing: After step 0 is done, we know that we only have users that have a minimum of m hobbies. If you are fine with an approximate answer, locality sensitive hashing can help. How to understand Locality Sensitive Hashing?
A sketch of the LSH approach:
a) First map each of the hobbies to an integer (which you should do anyways, if the dataset is large). So we have User -> Set of integers (hobbies)
b) Obtain a signature for each user by taking the min hash (hash each of the integers by a number k, and take the min). This gives User -> Signature
c) Explore the users with the same signature in more detail, if you want (or you can be done with an approximate answer).
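Steps a) to c) might be sketched as follows in plain Python. This is only an illustration under several assumptions: the hash functions are simple (a*i + b) mod prime maps with arbitrarily chosen constants, three of them are combined into one signature, and users are bucketed by exact signature equality. A production LSH setup would typically use many more hash functions and banding.

```python
from collections import defaultdict

people = {
    'person1': {'movies', 'sports', 'dramas'},
    'person2': {'sports', 'trekking', 'reading', 'sleeping', 'movies', 'dramas'},
    'person3': {'movies', 'trekking'},
    'person4': {'reading', 'trekking', 'sports'},
    'person5': {'movies', 'sports', 'dramas'},
}
m = 3

# Step 0: drop users with fewer than m hobbies.
eligible = {p: h for p, h in people.items() if len(h) >= m}

# a) Map each hobby to an integer.
all_hobbies = sorted({h for hs in eligible.values() for h in hs})
hobby_id = {h: i for i, h in enumerate(all_hobbies)}

# b) Min-hash signature. HASH_PARAMS and PRIME are arbitrary illustrative choices.
HASH_PARAMS = [(37, 11), (53, 29), (71, 5)]
PRIME = 101

def signature(hobbies):
    ids = [hobby_id[h] for h in hobbies]
    return tuple(min((a * i + b) % PRIME for i in ids) for a, b in HASH_PARAMS)

# c) Users with the same signature become candidates for a closer look.
buckets = defaultdict(list)
for person, hobbies in eligible.items():
    buckets[signature(hobbies)].append(person)
candidates = [sorted(members) for members in buckets.values() if len(members) > 1]

print(candidates)
```

Users with identical hobby sets always land in the same bucket; users with merely similar sets do so only with some probability, which is why the result is approximate.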

Optimal to lookup and display column/row names from a table of binary data

Job Coach  ConsumerName  Monthly  General  Goals
Anna       Joe           0        0        0
Sam        John          0        0        0
Veron      Jane          0        0        0
Bill       Jack          1        1        1
Anna       Jill          1        1        1
           Jim           0        0        0
Bill       Jiang         1        1        1
           Jolly         0        1        1
Sam        Jiant         0        0        0
           Jap           0        1        1
           Joule         1        1        1
           Aardvark      0        1        0
Drake      Darding       0        0        0
Hello, as you can see above I have two columns of strings; one column is "job coach" the other is "consumer name". There are three columns of 1's and 0's; monthly, general, and goals.
I'm trying to find the specific pattern of 1's and 0's in each of the rows, and to report it. For instance, the data says:
Consumer Jolly still has a monthly which needs to be completed;
Consumer Aardvark still has a monthly which needs to be completed;
Consumer Aardvark still has a monthly and a goals which needs to be completed.
Lookup doesn't really work, because it will only return the first instance of the corresponding variable and not additional instances.
I've tried an INDEX function like this:
{=INDEX($C$2:$E$14,SMALL(IF($C$2:$C$14=0,ROW($C$2:$C$14)),ROW(1:1)),3)}
But that only looks up a single column at a time, which makes the report rather cumbersome. I'm open to doing a loop in Excel without formulas; however, it's not a simple loop, because I'm trying to look at each cell and output the specific column name.
Any thoughts on how to best do it?
It's not exactly clear what the condition you want to check is, but if you want to check for a specific given pattern and return the customer name you can use this adjusted formula:
=INDEX($B$2:$B$14,SMALL(IF($C$2:$C$14&$D$2:$D$14&$E$2:$E$14="010",ROW($C$2:$C$14)-1),ROW(1:1)),1)
In your formula you checked only the first binary column and returned the last. You also had a mistake of returning the row number and not the index in the list which is row-1 in your case.
So notice:
The INDEX returns values from column B.
The IF checks a pattern of C&D&E equals a pattern like 010 which can be changed or set to a reference.
Then return the ROW()-1 in case your list starts at row 2, to return the index in the data and not the actual row.
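If the data can be taken out of Excel, the same pattern check is short in pandas. The sketch below is an alternative to the formula, not a translation of it, and assumes (as the question's examples suggest) that a 0 means the item still needs to be completed. It finds the consumers with a given pattern and also lists, per consumer, which columns are still 0.

```python
import pandas as pd

data = pd.DataFrame({
    'ConsumerName': ['Joe', 'John', 'Jane', 'Jack', 'Jill', 'Jim', 'Jiang',
                     'Jolly', 'Jiant', 'Jap', 'Joule', 'Aardvark', 'Darding'],
    'Monthly': [0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
    'General': [0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    'Goals':   [0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0],
})
flags = ['Monthly', 'General', 'Goals']

# Concatenate the three flags into a string such as '010', like C&D&E in the formula.
pattern = (data['Monthly'].astype(str)
           + data['General'].astype(str)
           + data['Goals'].astype(str))
matches = data.loc[pattern == '010', 'ConsumerName'].tolist()

# For every consumer, the names of the columns that are still 0.
outstanding = {
    name: [col for col in flags if row[col] == 0]
    for name, row in data.set_index('ConsumerName')[flags].iterrows()
}

print(matches)
print(outstanding['Jolly'], outstanding['Aardvark'])
```

The outstanding dictionary gives the per-consumer report directly, without one lookup per column.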

Matrix with boolean values from a list of paired observations

In the below spreadsheet, the cell values represent an ID for a person. The person in column A likes the person in column B, but it may not be mutual. So, in the first row with data, person 1 likes 2. In the second row with data person 1 likes 3.
A B
1 2
1 3
2 1
2 4
3 4
4 1
I'm looking for a way to have a 4 x 4 matrix with an entry of 1 in (i,j) to indicate person i likes person j and an entry of 0 to indicate they don't. The example above should look like this after performing the task:
1 2 3 4
1 0 1 1 0
2 1 0 0 1
3 0 0 0 1
4 1 0 0 0
So, reading the first row of the matrix we would interpret it like this: person 1 does not like person 1 (cell value = 0), person 1 likes person 2 (cell value = 1), person 1 likes person 3 (cell value =1), person 1 does not like person 4 (cell value = 0)
Note that the order of pairing matters, so [4 2] does not equal [2 4].
How could this be done?
Assuming your existing data is in A1:B6, then in A10 enter:
=COUNTIFS($A$1:$A$6, ROW()-9,$B$1:$B$6, COLUMN())
This will return a 1 or a 0 depending on whether person 1 likes person 1. They don't so you get a 0. It uses Row()-9 to return 1 and COLUMN() to return 1 to find the match.
Copy this formula over 4 columns and down 4 rows and that ROW()-9 and COLUMN() formula will return the appropriate values for the check into the COUNTIFS() formula which will look for the matching pair.
Personally, if this was something I had to do and my matrix was of indeterminate size, I would probably stick these formulas on a second tab, starting at A1 and use ROW() where I don't have to adjust it by 9. But for a one off on the same tab, to help check the results, the above is fine.
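For comparison, the same matrix can be built outside the spreadsheet with a few lines of NumPy. This is a minimal sketch that assumes the person IDs run from 1 to n without gaps.

```python
import numpy as np

# (liker, liked) pairs from columns A and B.
pairs = [(1, 2), (1, 3), (2, 1), (2, 4), (3, 4), (4, 1)]
n = 4

likes = np.zeros((n, n), dtype=int)
for i, j in pairs:
    likes[i - 1, j - 1] = 1   # IDs are 1-based, array indices are 0-based

print(likes)
```

Because the pair (i, j) only sets entry (i, j), the result is not symmetric, which matches the requirement that the order of pairing matters.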

Counting unique occurrences

I am trying to count the number of unique sensors (column 1) that are present by visit duration (column 2). Here is a small portion of the data:
Sensor ID Implant duration
13113 1
13113 1
13113 1
13144 1
13144 1
13144 2
13144 2
13144 2
13144 2
13144 2
14018 1
12184 2
13052 1
13052 1
12155 2
12155 3
12155 3
13069 2
13069 2
13018 1
13018 1
13019 1
13019 1
13049 1
13054 3
13060 3
13108 2
13108 2
So the count for:
Visit 1 should be 6 (13113, 14018, 13052, 13018, 13019, 13049),
Visit 2 should be 5 (13144, 12184, 12155, 13069, 13108), and
Visit 3 should be 3 (12155, 13054, 13060).
I tried DCOUNTA but it doesn't return the count for the first occurrence, just the total number of entries with an implant duration of 1, 2, or 3. So for example it returns 13 for Visit 1, 11 for Visit 2, and 4 for Visit 3.
I have a lot of data that needs to be preserved and counted so I don't want to apply a filter or remove duplicates.
I can do it in a step-wise way .. not sure if this helps:
I put your data in columns A and B.
In Column C:
=CONCATENATE(A2,"#",B2)
Column D:
=IF(C2=C1,"",A2)
Column E:
=IF(C2=C1,"",B2)
Column F:
=IF(D2="","",COUNTIF(D:D,D2))
Column G:
=IF(E2="","",COUNTIF(E:E,E2))
At that point, you have the data "flagged" as you need it .. just need to extract it.
If I understand the issue, the results should be:
Visit 1: 7 (13113,13144,14018,13052,13018,13019,13049)
Visit 2: 5 (13144,12184,12155,13069,13108)
Visit 3: 3 (12155,13054,13060)
There would appear to be 7 unique Sensors in your sample data (13113, 13144, 14018, 13052, 13018, 13019, 13049) for Visit ID=1, not 6.
=SUMPRODUCT((B2:B29=1)/(COUNTIFS(B2:B29, 1, A2:A29, A2:A29&"")+(B2:B29<>1)))
=SUMPRODUCT((B2:B29=2)/(COUNTIFS(B2:B29, 2, A2:A29, A2:A29&"")+(B2:B29<>2)))
=SUMPRODUCT((B2:B29=3)/(COUNTIFS(B2:B29, 3, A2:A29, A2:A29&"")+(B2:B29<>3)))
It would probably be best to put the Visit ID into a cell and reference the cell in all three places.
You might use a PivotTable with Sensor ID for ROWS and VALUES (Count of) and Implant Duration for COLUMNS, then apply =COUNT() on the columns. This shows which sensor (in order), which duration and the instances of the combinations.
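If the data can be loaded into pandas, the distinct count per duration is a one-line groupby. The sketch below re-enters the sample rows from the question as (sensor, duration, repeat count) triples purely to keep the listing short.

```python
import pandas as pd

# (sensor ID, implant duration, number of repeated rows) from the sample data.
sample = [
    (13113, 1, 3), (13144, 1, 2), (13144, 2, 5), (14018, 1, 1), (12184, 2, 1),
    (13052, 1, 2), (12155, 2, 1), (12155, 3, 2), (13069, 2, 2), (13018, 1, 2),
    (13019, 1, 2), (13049, 1, 1), (13054, 3, 1), (13060, 3, 1), (13108, 2, 2),
]
rows = [(sensor, duration) for sensor, duration, n in sample for _ in range(n)]
df = pd.DataFrame(rows, columns=['Sensor ID', 'Implant duration'])

# Number of rows per duration (what DCOUNTA returned) ...
row_counts = df.groupby('Implant duration')['Sensor ID'].count()
# ... versus the number of distinct sensors per duration.
unique_counts = df.groupby('Implant duration')['Sensor ID'].nunique()

print(row_counts.to_dict())
print(unique_counts.to_dict())
```

Nothing is filtered or removed from df, so the original rows are preserved as the question requires.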
