Find feature or combination of features that has an effect - statistics

I am looking for a statistical model or test to answer the following question and would be grateful for some help:
I have products p1,...,p5 that my customers can subscribe to.
I have divided my customers into groups A1,...,An, and for each group and each combination of products I have counted how many customers have that combination and how it has affected their sales:
Customer_group  has_p1  has_p2  [...]  has_p5  cust_count  total_sales
A1              0       0              0       124         1234
A1              1       0              0       315         999
A1              1       1              0       199         7777
[...]
An              1       1              1       233         663
Now I want to find out which group of customers benefits from which product or combination of products.
My first idea was to use a paired t-test, pairing the customers that have a given product against those that do not have it but share the same combination of other products. For measuring the effect of p1, for example, I would pair {A1, 1, 0, 0, 1, 0} with {A1, 0, 0, 0, 1, 0} and compare the two series of total_sales/cust_count values.
However, with this test I only find out which products have an effect, not which groups they have an effect for, or whether it is significant that a product is sold in combination with another product.
Any good ideas?

So after thinking about it for a day, I found a way:
First I one-hot encoded the groups, replacing the customer_group column with n columns containing 0s and 1s.
Then I fitted a linear regression model with interaction terms:
product_i * product_j + group_k * product_i + group_k * product_i * product_j
By reducing the model, I found which product x product combinations and which group x product and group x product x product combinations were significant.
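Here is a minimal NumPy/pandas sketch of that interaction-term regression on simulated data (all column names, group labels, and effect sizes below are invented for illustration; it shows the one-hot encoding plus interaction columns, not the original dataset):

```python
import numpy as np
import pandas as pd

# Simulated per-customer data: two products, three groups (all made up).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    'group': rng.choice(['A1', 'A2', 'A3'], n),
    'p1': rng.integers(0, 2, n),
    'p2': rng.integers(0, 2, n),
})
# True effects baked in: p1 helps only group A2 (+5), p1 together with p2 adds +3.
df['avg_sales'] = (10
                   + 5 * df['p1'] * (df['group'] == 'A2')
                   + 3 * df['p1'] * df['p2']
                   + rng.normal(0, 0.1, n))

# One-hot encode the groups (dropping one level to avoid collinearity) ...
X = pd.get_dummies(df[['p1', 'p2', 'group']], columns=['group'],
                   drop_first=True, dtype=float)
# ... and build the interaction columns by multiplication.
X['p1:p2'] = X['p1'] * X['p2']
for g in ['group_A2', 'group_A3']:
    X[f'p1:{g}'] = X['p1'] * X[g]
X.insert(0, 'intercept', 1.0)

# Ordinary least squares via lstsq; coef recovers the planted effects.
beta, *_ = np.linalg.lstsq(X.to_numpy(), df['avg_sales'].to_numpy(), rcond=None)
coef = dict(zip(X.columns, beta))
```

With statsmodels available, the same model can be specified as a formula (something like `smf.ols('avg_sales ~ p1*p2 + C(group)*p1*p2', data=df)`) to get the t-statistics needed for the backward model reduction described above.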

Related

How to find the correlation between two categorical variables: num_chicken_pox and the number of times the vaccine was given

The problem is how to find the correlation between two categorical series. The situation is that I have to find the correlation between HAVING_CPOX and P_NUMVRC (the number of varicella vaccine doses) among children.
The main catch is that the HAD_CPOX column has 4 unique values:
1 - had chickenpox
2 - did not have chickenpox
77 - don't know
99 - may be NULL
In df['P_NUMVRC'] the unique values are [1, 2, 3, 0, NaN].
These are two distinct series, so how do I put them together and find the correlation? I used value_counts to get the frequency of each:
1 13781
2 213
3 1
Name: P_NUMVRC, dtype: int64
For the HAD_CPOX column:
2 27955
1 402
77 105
99 3
Name: HAD_CPOX, dtype: int64
The requirement is like this:
A positive correlation (e.g., corr > 0) means that an increase in had_chickenpox_column (which means more no's) would also increase the values of num_chickenpox_vaccine_column (which means more doses of vaccine). If there is a negative correlation (e.g., corr < 0), it indicates that having had chickenpox is related to an increase in the number of vaccine doses.
I think what you are looking for is np.corrcoef. It receives two (in your case, one-dimensional) arrays and returns the matrix of Pearson correlation coefficients (for more details see: https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html).
So basically:
valid_df = df.query('HAVING_CPOX < 3').dropna(subset=['P_NUMVRC']).copy()
valid_df['HAVING_CPOX'] = (valid_df['HAVING_CPOX'] == 1).astype(int)
corr = np.corrcoef(valid_df['HAVING_CPOX'], valid_df['P_NUMVRC'])[0, 1]
What I did is first get rid of the 77's and 99's, since you can't really rely on those, and drop the missing dose counts. Then I changed HAVING_CPOX to be binary (0 is "has no cpox" and 1 is "has cpox") so that the correlation makes sense. Then I used numpy's corrcoef.
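A self-contained version of the same pipeline on toy data (the values below are made up; only the column names and codes follow the question):

```python
import numpy as np
import pandas as pd

# Toy data in the same shape as the survey columns (values invented):
df = pd.DataFrame({
    'HAVING_CPOX': [1, 2, 2, 1, 77, 2, 99, 1, 2, 2],
    'P_NUMVRC':    [0, 2, 1, 0, 1, 2, 1, np.nan, 1, 2],
})

# Keep only the meaningful codes (1 = had cpox, 2 = did not),
# drop rows with a missing dose count, then recode to binary.
valid = df.query('HAVING_CPOX < 3').dropna(subset=['P_NUMVRC']).copy()
valid['HAVING_CPOX'] = (valid['HAVING_CPOX'] == 1).astype(int)

corr = np.corrcoef(valid['HAVING_CPOX'], valid['P_NUMVRC'])[0, 1]
```

In this toy frame the children coded as having had chickenpox all have zero doses, so the resulting coefficient comes out negative, matching the interpretation in the requirement text.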

How to get weighted sum depending on two conditions in Excel?

I have this table in Excel:
I am trying to compute a weighted sum depending on two conditions:
Whether it is Company 1 or Company 2 (share quantities differ)
Whether column A (Company 1) and column B (Company 2) contain 0 or 1 (multipliers differ)
Example:
Let's calculate the weighted sum for row 2:
Sum = 2 (multiplier for 1) * 50 (price of one share) * 3 (share quantity for Company 1) +
+ 0.5 (multiplier for 0) * 50 (price of one share) * 6 (share quantity for Company 2) = 450
So, the sum for row 2 = 450.
For now I am checking only for multipliers (1 or 0) using this code:
=COUNTIF(A2:B2,0)*$B$9*$B$8 + COUNTIF(A2:B2,1)*$B$9*$B$7
But it does not take into account the share quantities for Company 1 or Company 2. I only multiply the single-share price by the multipliers, not by the share quantities.
How can I also check whether it is Company 1 or Company 2 in order to multiply by corresponding Shares quantity?
Upd:
Rasmus0607 gave a solution when there are only two companies:
=$B$9*$E$8*IF(A2=1;$B$7;$B$8)+$B$9*$E$9*IF(B2=1;$B$7;$B$8)
Tom Sharpe gave a more general solution (number of companies can be greater than 2)
I uploaded my Excel file to DropBox:
Excel file
I can offer a more general way of doing it, with the benefit of hindsight, that you can apply to more than two columns by altering the second CHOOSE statement:-
=SUM(CHOOSE(2-A2:B2,$B$7,$B$8)*CHOOSE(COLUMN(A:B),$E$8,$E$9))*$B$9
Unfortunately it's an array formula that you have to enter with Ctrl+Shift+Enter. But it's a moot point whether it would be better just to use one of the other answers, with some repetition, and keep it simple.
You could also try this:-
=SUMPRODUCT(N(OFFSET($B$6,2-A2:B2,0)),N(OFFSET($E$7,COLUMN(A:B),0)))*$B$9
Here's how it would be for three companies
=SUM(CHOOSE(2-A2:C2,$B$7,$B$8)*CHOOSE(COLUMN(A:C),$F$8,$F$9,$F$10))*$B$9
(array formula) or
=SUMPRODUCT(N(OFFSET($B$6,2-A2:C2,0)),N(OFFSET($F$7,COLUMN(A:C),0)))*$B$9
=$B$9*$E$8*IF(A2=1;$B$7;$B$8)+$B$9*$E$9*IF(B2=1;$B$7;$B$8)
Since with the COUNTIF function you don't know beforehand which company column contains a 0 or a 1, I would suggest a longer but more systematic solution using IF:
=$B$9*$E$8*IF(A2=1;2;0,5)+$B$9*$E$9*IF(B2=1;2;0,5)
This is a bit less general, but should produce the result you expect in this case.
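As a sanity check on the row-2 arithmetic from the question, here is a small Python sketch (the multiplier map {1: 2, 0: 0.5}, the share price, and the quantities are taken from the example above):

```python
# Multipliers keyed by the 0/1 flag in columns A/B, from the question's example.
multiplier = {1: 2, 0: 0.5}
share_price = 50                              # price of one share
quantity = {'Company 1': 3, 'Company 2': 6}   # share quantities per company

# Row 2 of the example: Company 1 flag = 1, Company 2 flag = 0.
row = {'Company 1': 1, 'Company 2': 0}
total = sum(multiplier[flag] * share_price * quantity[co]
            for co, flag in row.items())
print(total)  # -> 450.0
```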

Group people based on their hobbies in Spark

I am working with PySpark on a case where I need to group people based on their interests. Let's say I have n persons:
person1, movies, sports, dramas
person2, sports, trekking, reading, sleeping, movies, dramas
person3, movies, trekking
person4, reading, trekking, sports
person5, movies, sports, dramas
.
.
.
Now I want to group people based on their interests.
Group people who have at least m common interests (m is user input, it could be 2, 3, 4...)
Let's assume m=3
Then the groups are:
(person1, person2, person5)
(person2, person4)
Find users who belong to x groups (x is user input)
Let's assume x=2
Then
person2 is in two groups
My response will be algebraic rather than Spark/Python specific, but it can be implemented in Spark.
How can we express the data in your problem?
I will go with a matrix: each row represents a person, each column represents an interest. So, following your example:
movies,sports,trekking,reading,sleeping,dramas
P1: 1 1 0 0 0 1
P2: 1 1 1 1 1 1
P3: 1 0 1 0 0 0
P4: 0 1 1 1 0 0
P5: 1 1 0 0 0 1
What if we would like to investigate the similarity of P2 and P3, i.e. check how many interests they share? We could use the following formula:
(movies)+(sports)+(trekking)+(reading)+(sleeping)+(dramas)
1*1 + 1*0 + 1*1 + 1*0 + 1*0 + 1*0 = 2
It may look familiar to you: it is part of a matrix multiplication. To take full advantage of that observation, we have to transpose the matrix; it will look like this:
         P1,P2,P3,P4,P5
movies   1 1 1 0 1
sports   1 1 0 1 1
trekking 0 1 1 1 0
reading  0 1 0 1 0
sleeping 0 1 0 0 0
dramas   1 1 0 0 1
Now if we multiply the matrices (original times transposed), we get a new matrix:
   P1 P2 P3 P4 P5
P1  3  3  1  1  3
P2  3  6  2  3  3
P3  1  2  2  1  1
P4  1  3  1  3  1
P5  3  3  1  1  3
What you see here is the result you are looking for: check the value at the row/column crossing and you get the number of shared interests.
How many interests do P2 share with P4? Answer: 3
Who shares 3 interests with P1? Answer: P2 and P5
Who shares 2 interests with P3? Answer: P2
Some hints on how to apply this idea into Apache Spark
How to operate on matrices using Apache Spark?
Matrix Multiplication in Apache Spark
How to transpose matrix using Apache Spark?
Matrix Transpose on RowMatrix in Spark
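The matrix construction above can be checked in a few lines of NumPy (the incidence matrix is built directly from each person's interest list; note that person4's interests of reading, trekking, and sports give the row 0 1 1 1 0 0):

```python
import numpy as np

hobbies = ['movies', 'sports', 'trekking', 'reading', 'sleeping', 'dramas']
people = {
    'P1': {'movies', 'sports', 'dramas'},
    'P2': {'sports', 'trekking', 'reading', 'sleeping', 'movies', 'dramas'},
    'P3': {'movies', 'trekking'},
    'P4': {'reading', 'trekking', 'sports'},
    'P5': {'movies', 'sports', 'dramas'},
}

# Person x hobby incidence matrix (rows in P1..P5 order).
M = np.array([[int(h in hob) for h in hobbies] for hob in people.values()])

# shared[i, j] = number of interests person i shares with person j;
# the diagonal holds each person's own interest count.
shared = M @ M.T
print(shared[1, 3])  # interests P2 shares with P4 -> 3
```

Reading off the product matrix: P2 shares 3 interests with P4, and P2 and P5 each share 3 interests with P1, matching the worked answers in the text.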
EDIT 1: Adding more realistic method (after the comments)
We have a table/RDD/Dataset "UserHobby":
movies,sports,trekking,reading,sleeping,dramas
P1: 1 1 0 0 0 1
P2: 1 1 1 1 1 1
P3: 1 0 1 0 0 0
P4: 0 1 1 1 0 0
P5: 1 1 0 0 0 1
Now, to find all the people that share exactly 2 interests with P1, you would execute:
SELECT * FROM UserHobby
WHERE movies*1 + sports*1 + trekking*0 +
      reading*0 + sleeping*0 + dramas*1 = 2
You would then have to repeat this query for all the users (changing the 0s and 1s to their actual values). The algorithm complexity is O(n^2 * m), where n is the number of users and m is the number of hobbies.
What is nice about this method is that you don't have to generate subsets.
Probably my answer might not be the best, but it will do the work. If you know the total list of hobbies beforehand, you can write a piece of code that computes the combinations before going into the Spark part.
For example:
Filter out the people whose hobby count is less than input_number at the very start, to discard unwanted records.
If the total list of hobbies is {a,b,c,d,e,f} and input_number is 2, the list of combinations would be
{(ab)(ac)(ad)(ae)(af)(bc)(bd)(be)(bf)(cd)(ce)(cf)(de)(df)(ef)}
So you will need to generate the possible combinations for input_number beforehand.
Later, perform a filter for each combination and track the record count.
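A plain-Python sketch of this combination-based approach on the question's example (real data would live in Spark; here m plays the role of input_number):

```python
from itertools import combinations

# Interests from the question's example.
people = {
    'person1': {'movies', 'sports', 'dramas'},
    'person2': {'sports', 'trekking', 'reading', 'sleeping', 'movies', 'dramas'},
    'person3': {'movies', 'trekking'},
    'person4': {'reading', 'trekking', 'sports'},
    'person5': {'movies', 'sports', 'dramas'},
}
m = 3  # minimum number of common interests (user input)

# Step 1: drop people with fewer than m hobbies -- they cannot qualify.
candidates = {p: h for p, h in people.items() if len(h) >= m}

# Step 2: for every hobby combination of size m, collect the people
# whose hobby set contains it; keep combinations shared by 2+ people.
hobbies = sorted(set().union(*candidates.values()))
groups = {}
for combo in combinations(hobbies, m):
    members = [p for p, h in candidates.items() if set(combo) <= h]
    if len(members) > 1:
        groups[combo] = members
```

With m=3 this reproduces the two groups from the question, (person1, person2, person5) and (person2, person4), and person2 appears in x=2 groups. In general, several combinations can map to the same set of people, so real code would deduplicate the member sets.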
If the number of users is large, you can't possibly think about going for any User x User approach.
Step 0. As a first step, we should ignore all the users who don't have at least m interests (since they cannot have at least m common interests with anyone).
Step 1. Possible approaches:
i) Brute force: if the maximum number of interests is small, say 10, you can generate all the possible interest combinations in a HashMap and assign an interest-group id to each of them. You will need just one pass over each user's interest set to find out which interest groups they qualify for. This solves the problem in one pass.
ii) Locality-sensitive hashing: after step 0 is done, we know that we only have users with at least m hobbies. If you are fine with an approximate answer, locality-sensitive hashing can help. How to understand Locality Sensitive Hashing?
A sketch of the LSH approach:
a) First map each of the hobbies to an integer (which you should do anyway if the dataset is large). So we have User -> set of integers (hobbies).
b) Obtain a signature for each user by min-hashing: hash each of the integers with each of k hash functions and keep the minimum per function. This gives User -> Signature.
c) Explore the users with the same signature in more detail if you want (or stop here with an approximate answer).
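A minimal sketch of step (b), assuming the hobbies have already been mapped to integers as in step (a); the linear hash form and the parameter choices here are illustrative, not a production scheme:

```python
import random

def minhash_signature(items, k=4, seed=42):
    """Min-hash signature of a set of integer ids, using k linear hash
    functions h(x) = (a*x + b) mod prime (a sketch, not production code)."""
    rng = random.Random(seed)
    prime = 2_147_483_647
    # Random (a, b) parameters for each of the k hash functions.
    params = [(rng.randrange(1, prime), rng.randrange(prime))
              for _ in range(k)]
    return tuple(min((a * x + b) % prime for x in items)
                 for a, b in params)

# Users with identical hobby sets get identical signatures, and similar
# sets are likely to agree on many signature components.
sig_p1 = minhash_signature({0, 1, 5})   # e.g. movies=0, sports=1, dramas=5
sig_p5 = minhash_signature({0, 1, 5})
```

Candidate pairs are then the users whose signatures collide (in full or in banded pieces), and only those pairs need the exact common-interest check.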

Pandas Link Rows to Rows Based on Multiple Criteria

This is the ultimate Pandas challenge, in my mind, though it may be elementary to some of you out there...
I am trying to link a particular job position with the survey items to which it corresponds. For example, the president of site A would be attributed the results from a survey item for which respondents from site A provide feedback (e.g., "To what degree do you agree with the following statement?: 'I think the quality of site A is sufficient overall'").
Each site has 5 stations (0 through 4). Each job position is assigned to one or more station(s) at one or more site(s).
For example, the president of a site works at all stations within that site, while a contractor might only work at a couple of stations, maybe at 2 different sites.
Survey data was collected on the quality of each station within each site.
Some survey items pertain to certain stations within one or more sites.
For example, the "Positions" table looks like this:
import pandas as pd
import numpy as np
pos = pd.DataFrame({'Station(s)': [',1,2,,', '0,1,2,3,4'],
                    'Position': ['Contractor', 'President'],
                    'Site(s)': ['A,B', 'A'],
                    'Item(s)': ['1', '1,2']})
pos[['Position', 'Site(s)', 'Station(s)', 'Item(s)']]
Position Site(s) Station(s) Item(s)
0 Contractor A,B ,1,2,, 1
1 President A 0,1,2,3,4 1,2
The survey data table looks like this:
sd = pd.DataFrame({'Site': ['A', 'B', 'B', 'C', 'A', 'A'],
                   'Station(s)': [',1,2,,', ',1,2,,', ',,,,', ',1,2,,', '0,1,2,,', ',,2,,'],
                   'Item 1': [1, 1, 0, 0, 1, np.nan],
                   'Item 2': [1, 0, 0, 1, 0, 1]})
sd[['Site', 'Station(s)', 'Item 1', 'Item 2']]
Site Station(s) Item 1 Item 2
0 A ,1,2,, 1 1
1 B ,1,2,, 1 0
2 B ,,,, 0 0
3 C ,1,2,, 0 1
4 A 0,1,2,, 1 0
5 A ,,2,, NaN 1
Two side notes:
The item data has been coded to 1 and 0 for unimportant reasons.
The comma-separated responses are actually condensed from columns (one column per station and item). I only mention that because, if it is better not to condense them, that can be done (or not).
So here's what I need:
This (I think):
Contractor President Site(s) Station(s) Item 1 Item 2
0 1 1 A ,1,2,, 1 1
1 1 0 B ,1,2,, 1 0
2 0 0 B ,,,, 0 0
3 0 0 C ,1,2,, 0 1
4 0 1 A 0,1,2,, 1 0
5 1 1 A ,,2,, NaN 1
Logic:
The contractor works at sites A and B and should only be associated with respondents who work at either of those sites.
Within those respondents, he should only be associated with those who work at stations 1 or 2, but with none who also work at any other station (i.e., station 0).
Therefore, the contractor's rows of interest in sd are indices 0, 1, and 5.
The president's rows of interest are indices 0, 4, and 5.
...and, ultimately, this:
Position Overall%
0 Contractor 100
1 President 80
Logic:
Because the president is concerned with items 1 and 2, there are 5 numbers to consider ('NaN' is not counted): 1 and 1 from item 1, and 1, 0, and 1 from item 2.
The sum across items is 4 and the count across items is 5, which gives 80%.
Because the contractor is only concerned with item 1, there are 2 numbers to consider (again, 'NaN' is not counted): 1 and 1, from his rows of interest. The sum is 2 out of a count of 2, which gives 100%.
Thanks in advance!
Update
I know this works (the top answer just under the question), but how can it be applied to this situation? I tried this (just to test the first part of the logic):
for i in pos['Position']:
sd[i]=[x for x in pos.loc['Site(s)'] if x in sd['Site']]
...but it threw this error:
KeyError: 'the label [Site(s)] is not in the [index]'
...so I'm still wrestling with it.
If I understand correctly, you want to add one column to sd for each job position in pos. (This is the first task.)
So, for each row index i in pos (we iterate over rows in pos), we can create a unique boolean column:
# PSEUDOCODE:
sd[position_name_i] = (sd['Site'] IS_CONTAINED_IN pos.loc[i,'Site(s)']) and (sd['Station(s)'] IS_CONTAINED_IN pos.loc[i,'Station(s)'])
I hope the logic here is clear and consistent with your goal.
The expression X IS_CONTAINED_IN Y may be implemented in many different ways. (I can think of X and Y being sets, and then it's X.issubset(Y). Or, in terms of bitmasks X and Y, the test X & Y == X.)
The name of the column, position_name_i may be simply an integer i. (Or something more meaningful as pos.loc[i,'Position'] if this column consists of unique values.)
If this is done, we can do the other task. Now, sd[sd[position_name_i]] will return only the rows of sd for which position_name_i is True.
We iterate over all positions (i.e., rows in pos). For each position:
# number of non-NaN entries in 'Item 1' and 'Item 2' relevant for the position:
total = sd.loc[sd[position_name_i], ['Item 1', 'Item 2']].count().sum()
# number of 1's among these entries:
partial = sd.loc[sd[position_name_i], ['Item 1', 'Item 2']].sum().sum()
The final Overall% for the given position is 100*partial/total.
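A runnable sketch of this whole approach, using the pos and sd frames from the question (assumptions: sd's site column is named 'Site' as in the printed output, the comma-separated fields parse into sets, and a respondent must work at at least one station to match a position):

```python
import numpy as np
import pandas as pd

pos = pd.DataFrame({'Position': ['Contractor', 'President'],
                    'Site(s)': ['A,B', 'A'],
                    'Station(s)': [',1,2,,', '0,1,2,3,4'],
                    'Item(s)': ['1', '1,2']})

sd = pd.DataFrame({'Site': ['A', 'B', 'B', 'C', 'A', 'A'],
                   'Station(s)': [',1,2,,', ',1,2,,', ',,,,',
                                  ',1,2,,', '0,1,2,,', ',,2,,'],
                   'Item 1': [1, 1, 0, 0, 1, np.nan],
                   'Item 2': [1, 0, 0, 1, 0, 1]})

def to_set(s):
    # ',1,2,,' -> {'1', '2'}
    return {t for t in s.split(',') if t}

def row_matches(r, sites, stations):
    # Respondent matches: works at one of the position's sites, and all of
    # their (at least one) stations fall within the position's stations.
    rs = to_set(r['Station(s)'])
    return (r['Site'] in sites) and bool(rs) and rs <= stations

results = {}
for _, p in pos.iterrows():
    mask = sd.apply(row_matches, axis=1,
                    sites=to_set(p['Site(s)']),
                    stations=to_set(p['Station(s)']))
    items = ['Item ' + i for i in sorted(to_set(p['Item(s)']))]
    sub = sd.loc[mask, items]
    # sum() and count() both skip NaN, matching the stated logic.
    results[p['Position']] = 100 * sub.sum().sum() / sub.count().sum()
```

This reproduces the target table's row selections (contractor: rows 0, 1, 5; president: rows 0, 4, 5) and the final percentages of 100 for the contractor and 80 for the president.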

MS Excel - obtaining and presenting results by combining logical operators, sums, pivot tables, and forms

Before I present the question, I would like to state that I have spent time researching a solution to this problem, with partial luck.
So, here's the scenario:
There are X number of users, using Y number of products. Each user may use one or more products. Some of the users work from home and others from the office.
e.g.
ID  Office  Home  Dept  P1  P2  P3
1   loc1    1     dep1  1   1   0
2   loc1    0     dep1  0   0   1
3   loc2    1     dep1  1   1   1
4   loc4    1     dep2  1   0   1
5   loc3    0     dep2  1   1   0
6   loc3    0     dep1  1   1   0
7   loc1    0     dep3  1   0   0
What I want to achieve:
I would like to calculate and present, for various combinations of products, the number of users in common and the total number of users. I know that I can use the logical functions OR and AND to determine union and intersection, and compute a column sum to get the total number of users and the number of users in common. However, I would like to be able to select various product combinations without creating a new formula each time; something with check-boxes.
What I want to present:
As for the presentation of the data, would I be able to create something useful via a pivot table, or should I look at custom forms, which include check-boxes for P1, P2, P3, ...? Any hints and suggestions would be much appreciated.
Thank you,
amx13
