Group people based on their hobbies in Spark - apache-spark

I am working with PySpark on a case where I need to group people based on their interests. Let's say I have n persons:
person1, movies, sports, dramas
person2, sports, trekking, reading, sleeping, movies, dramas
person3, movies, trekking
person4, reading, trekking, sports
person5, movies, sports, dramas
.
.
.
Now I want to group people based on their interests.
Group people who have at least m common interests (m is user input, it could be 2, 3, 4...)
Let's assume m=3
Then the groups are:
(person1, person2, person5)
(person2, person4)
User who belongs to x groups (x is user input)
Let's assume x=2
Then
person2 is in two groups

My response will be algebraic and not Spark/Python specific, but can be implemented in Spark.
How can we express the data in your problem?
I will go with a matrix: each row represents a person and each column represents an interest. Following your example:
movies,sports,trekking,reading,sleeping,dramas
P1: 1 1 0 0 0 1
P2: 1 1 1 1 1 1
P3: 1 0 1 0 0 0
P4: 0 1 1 1 0 0
P5: 1 1 0 0 0 1
To measure the similarity of P2 and P3, that is, to count how many interests they share, we can multiply their rows column by column and add up the products:
(movies)+(sports)+(trekking)+(reading)+(sleeping)+(dramas)
1*1 + 1*0 + 1*1 + 1*0 + 1*0 + 1*0 = 2
This may look familiar: it is the dot product of two rows, which is exactly how a single entry of a matrix product is calculated.
To take full advantage of this observation, we transpose the matrix, which then looks like this:
P1,P2,P3,P4,P5
movies 1 1 1 0 1
sports 1 1 0 1 1
trekking 0 1 1 1 0
reading 0 1 0 1 0
sleeping 0 1 0 0 0
dramas 1 1 0 0 1
Now if you multiply the original matrix by its transpose, you get a new matrix:
P1 P2 P3 P4 P5
P1 3 3 1 1 3
P2 3 6 2 3 3
P3 1 2 2 1 1
P4 1 3 1 3 1
P5 3 3 1 1 3
What you see here is the result you are looking for: the value where a row and a column cross is the number of interests those two people share (the diagonal is simply each person's own number of interests).
How many interests does P2 share with P4? Answer: 3
Who shares 3 interests with P1? Answer: P2 and P5
Who shares 2 interests with P3? Answer: P2
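The multiplication above can be checked with a few lines of plain Python. This is only a minimal single-machine sketch, not Spark code: the hobby lists are copied from the question, and the names `shared` and `partners` are made up for illustration.

```python
# Hobby lists as given in the question.
people = {
    "P1": {"movies", "sports", "dramas"},
    "P2": {"sports", "trekking", "reading", "sleeping", "movies", "dramas"},
    "P3": {"movies", "trekking"},
    "P4": {"reading", "trekking", "sports"},
    "P5": {"movies", "sports", "dramas"},
}
hobbies = ["movies", "sports", "trekking", "reading", "sleeping", "dramas"]
names = sorted(people)

# One 0/1 row per person, one column per hobby.
matrix = [[1 if h in people[p] else 0 for h in hobbies] for p in names]

# Matrix times its transpose: entry [i][j] is the dot product of rows i and j,
# i.e. the number of interests persons i and j have in common.
shared = [
    [sum(a * b for a, b in zip(row_i, row_j)) for row_j in matrix]
    for row_i in matrix
]

def partners(person, m):
    """Everyone else who shares at least m interests with `person`."""
    i = names.index(person)
    return [names[j] for j, count in enumerate(shared[i]) if j != i and count >= m]

print(partners("P1", 3))
print(partners("P4", 3))
```

For a large number of users the same product would have to be computed with a distributed matrix type, which is what the Spark hints are about.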
Some hints on how to apply this idea in Apache Spark:
How to operate on matrices using Apache Spark? See "Matrix Multiplication in Apache Spark".
How to transpose a matrix using Apache Spark? See "Matrix Transpose on RowMatrix in Spark".
EDIT 1: Adding more realistic method (after the comments)
We have a table/RDD/Dataset "UserHobby":
movies,sports,trekking,reading,sleeping,dramas
P1: 1 1 0 0 0 1
P2: 1 1 1 1 1 1
P3: 1 0 1 0 0 0
P4: 0 1 1 1 0 0
P5: 1 1 0 0 0 1
Now, to find all the people who share exactly 2 interests with P1, you would execute:
SELECT * FROM UserHobby
WHERE movies*1 + sports*1 + trekking*0 +
reading*0 + sleeping*0 + dramas*1 = 2
You would then repeat this query for every user, replacing the 0s and 1s with that user's own values (use >= instead of = if you want "at least" that many shared interests). The algorithm complexity is O(n^2 * m), where n is the number of users and m is the number of hobbies.
What is nice about this method is that you don't have to generate subsets.
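To try the query without a cluster, here is a small sketch that runs it against an in-memory SQLite database from Python. SQLite only stands in for Spark SQL here. The `person` column, the `similar_users` helper and the `>=` comparison are assumptions added for illustration, and the rows follow the hobby lists in the question.

```python
import sqlite3

hobbies = ["movies", "sports", "trekking", "reading", "sleeping", "dramas"]
rows = [
    ("P1", 1, 1, 0, 0, 0, 1),
    ("P2", 1, 1, 1, 1, 1, 1),
    ("P3", 1, 0, 1, 0, 0, 0),
    ("P4", 0, 1, 1, 1, 0, 0),
    ("P5", 1, 1, 0, 0, 0, 1),
]

conn = sqlite3.connect(":memory:")
columns = ", ".join(h + " INTEGER" for h in hobbies)
conn.execute("CREATE TABLE UserHobby (person TEXT, " + columns + ")")
conn.executemany("INSERT INTO UserHobby VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

def similar_users(person, flags, min_shared):
    """Run the weighted-sum query for one user; `flags` are that user's 0/1 values."""
    weighted_sum = " + ".join(h + "*?" for h in hobbies)
    query = (
        "SELECT person FROM UserHobby "
        "WHERE person <> ? AND " + weighted_sum + " >= ? ORDER BY person"
    )
    return [r[0] for r in conn.execute(query, (person, *flags, min_shared))]

# One query per user: this loop is the O(n^2 * m) part.
result = {person: similar_users(person, flags, 3) for person, *flags in rows}
print(result)
```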

My answer is probably not the best, but it will do the job. If you know the full list of hobbies in advance, you can write a piece of code that computes the combinations before going into the Spark part.
For example:
Filter out the people whose hobbies_count < input_number at the very start to discard unwanted records.
If the total list of hobbies is {a,b,c,d,e,f} and the input_number is 2, the list of combinations in this case would be
{(ab)(ac)(ad)(ae)(af)(bc)(bd)(be)(bf)(cd)(ce)(cf)(de)(df)(ef)}
So you will need to generate the possible combinations in advance for the given input_number.
Later, perform a filter for each combination and track the record count.
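A minimal plain-Python sketch of the combination idea follows. It is a variant, under the assumption that it is acceptable to generate the m-sized combinations from each person's own hobbies rather than from the global hobby list, which avoids filtering on combinations nobody has. The function names are hypothetical and the data comes from the question.

```python
from collections import defaultdict
from itertools import combinations

people = {
    "person1": {"movies", "sports", "dramas"},
    "person2": {"sports", "trekking", "reading", "sleeping", "movies", "dramas"},
    "person3": {"movies", "trekking"},
    "person4": {"reading", "trekking", "sports"},
    "person5": {"movies", "sports", "dramas"},
}

def groups_by_shared_interests(people, m):
    """Map each m-sized hobby combination to the people who have all of it."""
    members = defaultdict(set)
    for person, hobbies in people.items():
        if len(hobbies) < m:  # cannot share m interests with anyone
            continue
        for combo in combinations(sorted(hobbies), m):
            members[combo].add(person)
    # A group needs at least two people.
    return {combo: sorted(group) for combo, group in members.items() if len(group) > 1}

def people_in_x_groups(groups, x):
    """People who appear in exactly x groups."""
    counts = defaultdict(int)
    for group in groups.values():
        for person in group:
            counts[person] += 1
    return sorted(person for person, count in counts.items() if count == x)

groups = groups_by_shared_interests(people, 3)
print(groups)
print(people_in_x_groups(groups, 2))
```

The number of combinations grows quickly with the number of hobbies per person, so this is only practical when hobby lists are short.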

If the number of users is large, you can't possibly think about going for any User x User approach.
Step 0. As a first step, we should ignore all the users who don't have at least m interests (since they cannot have at least m common interests with anyone).
Step 1. Possible approaches:
i) Brute Force: If the maximum number of interests is small, say 10, you can generate all the possible interest combinations in a HashMap and assign an interest group id to each of them. You then need just one pass over each user's interest set to find out which interest groups they qualify for, so this solves the problem in a single pass over the data.
ii) Locality Sensitive Hashing: After step 0 is done, we know that we only have users that have a minimum of m hobbies. If you are fine with an approximate answer, locality sensitive hashing can help. See "How to understand Locality Sensitive Hashing?"
A sketch of the LSH approach:
a) First map each of the hobbies to an integer (which you should do anyways, if the dataset is large). So we have User -> Set of integers (hobbies)
b) Obtain a signature for each user by taking the min hash (hash each of the integers by a number k, and take the min). This gives User -> Signature
c) Explore the users with the same signature in more detail, if you want (or you can be done with an approximate answer).
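Steps a) to c) could look roughly like the sketch below. It is an illustration under assumptions, not a tuned LSH implementation: the hash family `(a*x + b) mod PRIME`, the choice of two hash functions, the fixed seed and all function names are made up here. Users with identical hobby sets always land in the same bucket. Users with overlapping sets do so only with some probability, which is why the result is approximate.

```python
import random
from collections import defaultdict

PRIME = 2_147_483_647  # a prime larger than any hobby id used here

def make_hash_funcs(k, seed=0):
    """k random (a, b) pairs defining hash functions x -> (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]

def minhash_signature(hobby_ids, hash_funcs):
    """For each hash function, keep the minimum hash over the user's hobby ids."""
    return tuple(min((a * x + b) % PRIME for x in hobby_ids) for a, b in hash_funcs)

def candidate_buckets(users, hash_funcs):
    """Group users by signature; each bucket is a set of candidates to inspect."""
    buckets = defaultdict(list)
    for user, hobby_ids in users.items():
        buckets[minhash_signature(hobby_ids, hash_funcs)].append(user)
    return buckets

# a) map hobbies to integers
hobby_id = {name: i for i, name in enumerate(
    ["movies", "sports", "trekking", "reading", "sleeping", "dramas"])}
# Users that survive step 0 with m = 3 (data from the question).
raw = {
    "person1": ["movies", "sports", "dramas"],
    "person2": ["sports", "trekking", "reading", "sleeping", "movies", "dramas"],
    "person4": ["reading", "trekking", "sports"],
    "person5": ["movies", "sports", "dramas"],
}
users = {user: {hobby_id[h] for h in hs} for user, hs in raw.items()}

# b) signatures, c) buckets of users to explore in more detail
hash_funcs = make_hash_funcs(k=2)
buckets = candidate_buckets(users, hash_funcs)
for signature, bucket in buckets.items():
    print(signature, bucket)
```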

Related

What if my dataset contains only 0 and 1? Can I check for correlation for them and get the significant results in Excel?

My data set looks like this:
P T O
1 1 0
1 0 1
1 1 1
0 1 0
1 1 0
My doubt is that we only have two values, i.e. zero and one. That would logically mean that correlation cannot compute the level of significance. My assumption is that in this case the coefficient would be calculated from the occurrences of the 4 combinations, i.e. {(1,1),(1,0),(0,1),(0,0)}, rather than from the magnitude of change in the variables. But conversely, if the coefficient works only on the magnitude of change, then is this the right method for my data set?
Could anyone tell me if I am on the right track, or whether calculating such a coefficient yields no significance?

Excel How do I fill a matrix by MAXIFS comparison

I have this dataset:
Groups A A B B
location a b c d
3 4 0 5
I also have a transformed version for better clarification:
Groups location
A a 3
A b 4
B c 0
B d 5
What I want is a simple matrix that fills binary.
The function should check, for each row and column, whether it has the MAXIF of its respective group and then compare it to the second value, which is a second MAXIF from its own group. Therefore the combination b and d has to resolve to 1.
The intended output is as following:
I have a dataset with a-n locations that are grouped in a-n groups. So group A has the locations a,b; group B the locations c,d. The columns represent different features at each location.
I want to build a matrix out of it, but not the "usual" distance matrix but one that incorporates the following questions:
-When building the matrix, the maximum values of each group get compared
-I want to find out, if the value I am looking at is the maximum value in this group and if so, compare it to the second groups maximum -> if this number is larger -> set it to 1
-this should automatically fill all fields in the matrix
I need this for a network analysis of my data, to wipe out not needed connections
My current input is somewhat like this:
=IF(AND(>0(MAXIF()=value)>(AND(>0(MAXIF()=value);1;0)
How it looks like in excel:
=IF(AND(A$1<>$A7; A$3>0;(MAXIFS($A$3:$D$3;$A$1:$D$1;A$1)=A$3))<(AND(A$1<>$A7; $C7>0;MAXIFS($C$7:$C$10;$A$7:$A$10;$A7)=$C7));1;0)
However, I think that internally it does not actually compare values but TRUE and FALSE. Therefore connections that are smaller than the MAX are getting 1s. My output currently:
A A B B
a b c d
A a 0 0 0 0
A b 0 0 1 0
B c 0 0 0 0
B d 1 0 0 0
As you can see, the value a and d resolve to 1.
The output should look like this:
(The matrix is generally 0, but when beacons like d (5) and b (4) meet, it gets a "1", since both are the highest within their group. Only here is there a connection between the two groups.)
A A B B
a b c d
A a 0 0 0 0
A b 1 0 0 1
B c 0 0 0 0
B d 0 1 0 0
I understand the problem but don't know how to fix that.
I'm fairly sure this doesn't work properly, but it may help. I've restructured your data slightly to make it easier to write the formula.
The formula in C3 etc is
=IF(AND(MAXIFS($G$3:$G$6,$A$3:$A$6,$A3)=$G3,MAXIFS($C$7:$F$7,$C$1:$F$1,C$1)=C$7,$G3>MAXIFS($G$3:$G$6,$A$3:$A$6,"<>" & $A3)),1,0)
It's just checking if the value in G is the max for the group in column A, and if the value in row 7 is the max for group in row 1, and if the value in G is greater than the max of the other group. If they're all satisfied it inserts a '1'.

Pandas Link Rows to Rows Based on Multiple Criteria

This is the ultimate Pandas challenge, in my mind, though it may be elementary to some of you out there...
I am trying to link a particular job position with the survey items to which it corresponds. For example, the president of site A would be attributed to results from a survey item for which respondents from site A provide feedback (i.e "To what degree do you agree with the following statement?: "I think the quality of site A is sufficient overall"").
Each site has 5 stations (0 through 4). Each job position is assigned to one or more station(s) at one or more site(s).
For example, the president of a site works at all stations within that site, while a contractor might only work at a couple of stations, maybe at 2 different sites.
Survey data was collected on the quality of each station within each site.
Some survey items pertain to certain stations within one or more sites.
For example, the "Positions" table looks like this:
import pandas as pd
import numpy as np
pos = pd.DataFrame({'Station(s)':[',1,2,,','0,1,2,3,4'],
'Position':['Contractor','President'],
'Site(s)':['A,B','A'],
'Item(s)':['1','1,2']
})
pos[['Position','Site(s)','Station(s)','Item(s)']]
Position Site(s) Station(s) Item(s)
0 Contractor A,B ,1,2,, 1
1 President A 0,1,2,3,4 1,2
The survey data table looks like this:
sd = pd.DataFrame({'Site':['A','B','B','C','A','A'],
'Station(s)':[',1,2,,',',1,2,,',',,,,',',1,2,,','0,1,2,,',',,2,,'],
'Item 1':[1,1,0,0,1,np.nan],
'Item 2':[1,0,0,1,0,1]})
sd[['Site','Station(s)','Item 1','Item 2']]
Site Station(s) Item 1 Item 2
0 A ,1,2,, 1 1
1 B ,1,2,, 1 0
2 B ,,,, 0 0
3 C ,1,2,, 0 1
4 A 0,1,2,, 1 0
5 A ,,2,, NaN 1
2 side notes:
The item data has been coded to 1 and 0 for unimportant reasons.
The comma separated responses are actually condensed from columns (one column per station and item). I only mention that because if it is better to not condense them, that can be done (or not).
So here's what I need:
This (I think):
Contractor President Site(s) Station(s) Item 1 Item 2
0 1 1 A ,1,2,, 1 1
1 1 0 B ,1,2,, 1 0
2 0 0 B ,,,, 0 0
3 0 0 C ,1,2,, 0 1
4 0 1 A 0,1,2,, 1 0
5 1 1 A ,,2,, NaN 1
Logic:
The contractor works at sites A and B and should only be associated with respondents who work at either of those sites.
Within those respondents, he should only be associated with those who work at stations 1 or 2 but none that also work at
any others (i.e. station 0).
Therefore, the contractor's rows of interest in sd are indices 0, 1, and 5.
The president's rows of interest are from indices 0, 4, and 5.
...and, ultimately, this:
Position Overall%
0 Contractor 100
1 President 80
Logic:
Because the president is concerned with items 1 and 2, there are 5 numbers to consider: (1 and 1) from item 1 and (1, 0, and 1) from item 2.
The sum across items is 4 and the count across items is 5 (again, do not count 'NaN'), which gives 80%.
Because the contractor is only concerned with item 1, there are 2 numbers to consider: 1 and 1 - 'NaN' should not be counted - (from the rows of interest, respectively). Therefore, the sum is 2 out of the count, which is 2, which gives 100%
Thanks in advance!
Update
I know this works (top answer just under question), but how can that be applied to this situation? I tried this (just to try the first part of the logic):
for i in pos['Position']:
    sd[i]=[x for x in pos.loc['Site(s)'] if x in sd['Site']]
...but it threw this error:
KeyError: 'the label [Site(s)] is not in the [index]'
...so I'm still wrestling with it.
If I understand correctly, you want to add one column to sd for each job position in pos. (This is the first task.)
So, for each row index i in pos (we iterate over rows in pos), we can create a unique boolean column:
# PSEUDOCODE:
sd[position_name_i] = (sd['Site'] IS_CONTAINED_IN pos.loc[i,'Site(s)']) and (sd['Station(s)'] IS_CONTAINED_IN pos.loc[i,'Station(s)'])
I hope the logic here is clear and consistent with your goal.
The expression X IS_CONTAINED_IN Y may be implemented in many different ways. (I can think of X and Y being sets, and then it's X.issubset(Y). Or in terms of bitmasks X, Y and a bitwise AND: (X & Y) == X.)
The name of the column, position_name_i may be simply an integer i. (Or something more meaningful as pos.loc[i,'Position'] if this column consists of unique values.)
If this is done, we can do the other task. Now, sd[sd[position_name_i]] will return only the rows of sd for which position_name_i is True.
We iterate over all positions (i.e. rows in pos). For each position:
# number of non-NaN entries in 'Item 1' and 'Item 2' relevant for the position:
total = sd.loc[sd[position_name_i], ['Item 1', 'Item 2']].count().sum()
# number of 1's among these entries:
partial = sd.loc[sd[position_name_i], ['Item 1', 'Item 2']].sum().sum()
The final Overall% for the given position is 100*partial/total. To match the expected output, restrict the column list to the items named in pos.loc[i,'Item(s)'] (only 'Item 1' for the contractor).
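To make the containment logic concrete, here is a dependency-free sketch of the same steps using plain Python sets instead of pandas, with None standing in for NaN. The helper names are hypothetical. The extra rule that a respondent with no stations never matches is an assumption inferred from row 2 of the expected table in the question.

```python
def parse_set(text):
    """Turn a comma-separated cell such as ',1,2,,' into the set {'1', '2'}."""
    return {part for part in text.split(",") if part}

positions = [
    {"Position": "Contractor", "Site(s)": "A,B", "Station(s)": ",1,2,,", "Item(s)": "1"},
    {"Position": "President", "Site(s)": "A", "Station(s)": "0,1,2,3,4", "Item(s)": "1,2"},
]
survey = [
    {"Site": "A", "Station(s)": ",1,2,,", "Item 1": 1, "Item 2": 1},
    {"Site": "B", "Station(s)": ",1,2,,", "Item 1": 1, "Item 2": 0},
    {"Site": "B", "Station(s)": ",,,,", "Item 1": 0, "Item 2": 0},
    {"Site": "C", "Station(s)": ",1,2,,", "Item 1": 0, "Item 2": 1},
    {"Site": "A", "Station(s)": "0,1,2,,", "Item 1": 1, "Item 2": 0},
    {"Site": "A", "Station(s)": ",,2,,", "Item 1": None, "Item 2": 1},
]

def rows_of_interest(position, survey_rows):
    """Indices of survey rows whose site and stations are contained in the position's."""
    sites = parse_set(position["Site(s)"])
    stations = parse_set(position["Station(s)"])
    matched = []
    for index, row in enumerate(survey_rows):
        row_stations = parse_set(row["Station(s)"])
        # Assumption: an empty station set does not count as a match.
        if row["Site"] in sites and row_stations and row_stations <= stations:
            matched.append(index)
    return matched

def overall_percent(position, survey_rows):
    """100 * (number of 1s) / (number of non-missing answers) over the position's items."""
    items = ["Item " + number for number in sorted(parse_set(position["Item(s)"]))]
    values = [survey_rows[i][item]
              for i in rows_of_interest(position, survey_rows)
              for item in items]
    answered = [value for value in values if value is not None]
    return 100 * sum(answered) / len(answered) if answered else None

for position in positions:
    print(position["Position"], rows_of_interest(position, survey),
          overall_percent(position, survey))
```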

How to compute the maximum series of a specific condition returning true

I have a slight issue counting the MAX frequency of rows where the third column is bigger than the second. This is just a statistic with scores.
The issue is that I want to have it in one single formula without a macro.
B C
------
2 0
1 2
2 1
2 3
0 1
1 2
0 1
3 3
0 2
0 2
i have tried it with:
{=MAX(FREQUENCY(B3:B100;B3:B100>=C3:C100))} to get 1 for B
{=MAX(FREQUENCY(C3:C100;C3:C100>=B3:B100))} to get 7 for C
I expected it to deliver the longest series where the value in one column was bigger than in the other one, but I failed hard...
Try this version to get 7
=MAX(FREQUENCY(IF(C3:C100>=B3:B100,IF(B3:B100<>"",ROW(B3:B100))),IF(C3:C100<B3:B100,ROW(B3:B100))))
confirmed with CTRL+SHIFT+ENTER
obviously reverse the ranges to get your other result
See example here
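For anyone who wants to cross-check the formula's result outside Excel, the same "longest run" logic can be sketched in a few lines of Python. The lists are the B and C columns from the question, and `longest_run` is a made-up helper name.

```python
from itertools import groupby

# Columns B and C from the question.
b = [2, 1, 2, 2, 0, 1, 0, 3, 0, 0]
c = [0, 2, 1, 3, 1, 2, 1, 3, 2, 2]

def longest_run(first, second):
    """Length of the longest consecutive stretch where first >= second."""
    flags = [x >= y for x, y in zip(first, second)]
    # groupby splits the flags into consecutive runs of equal values.
    return max(
        (sum(1 for _ in run) for is_true, run in groupby(flags) if is_true),
        default=0,
    )

print(longest_run(c, b))  # longest streak where C >= B
print(longest_run(b, c))  # longest streak where B >= C
```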

MS Excel - obtaining and presenting results by combining logical operators, sums, pivot tables, and forms

Before I present the question, I would like to state that I have spent time researching for a solution for this problem, with partial luck.
So, here's the scenario:
There are X number of users, using Y number of products. Each user may use one or more products. Some of the users work from home and others from the office.
e.g.
ID Office Home Dept P1 P2 P3
1 loc1 1 dep1 1 1 0
2 loc1 0 dep1 0 0 1
3 loc2 1 dep1 1 1 1
4 loc4 1 dep2 1 0 1
5 loc3 0 dep2 1 1 0
6 loc3 0 dep1 1 1 0
7 loc1 0 dep3 1 0 0
What I want to achieve:
I would like to calculate and present, for various combinations of products, the number of users in common and the total number of users. I know that I can use the logical functions 'or' and 'and' to determine 'union' and 'intersection', and calculate a column sum to get the total number of users and number of users in common. However, I would like to be able to select various combinations of products without creating a new formula each time; something with check-boxes.
What I want to present:
As for the presentation of the data, would I be able to create something useful via a pivot table, or should I look at custom forms, which include check-boxes for P1, P2, P3, ...? Any hints and suggestions would be much appreciated.
Thank you,
amx13

Resources