How to find the correlation between two categorical variables (num_chicken_pox and how many times the vaccine was given) - python-3.x

The problem is how to find the correlation between two categorical [series] items.
The situation is that I have to find the correlation between HAVING_CPOX and NUM_VECILLA_veccine given among children.
The main catch is that the HAVING_CPOX column has 4 unique values:
1 - having cpox
2 - not having cpox
99 - maybe NULL
7 - I don't know
In df['P_NUMVRC'] the unique values are [1, 2, 3, 0, NaN].
These are two distinct series, so how do I put them together and find the correlation?
I used value_counts to get the frequency of each:
1 13781
2 213
3 1
Name: P_NUMVRC, dtype: int64
For the HAD_CPOX column:
2 27955
1 402
77 105
99 3
Name: HAD_CPOX, dtype: int64
The requirement is like this:
A positive correlation (e.g., corr > 0) means that an increase in had_chickenpox_column (which means more no's) would also increase the values of num_chickenpox_vaccine_column (which means more doses of vaccine). If there is a negative correlation (e.g., corr < 0), it indicates that having had chickenpox is related to an increase in the number of vaccine doses.

I think what you are looking for is np.corrcoef. It receives two (in your case, 1-dimensional) arrays and returns the Pearson correlation matrix (for more details see: https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html).
So basically:
import numpy as np

valid_df = df.query('HAVING_CPOX < 3').dropna(subset=['P_NUMVRC']).copy()  # drop the 77/99 codes and missing vaccine counts
valid_df['HAVING_CPOX'] = (valid_df['HAVING_CPOX'] == 1).astype(int)       # 1 = had cpox, 0 = did not
corr = np.corrcoef(valid_df['HAVING_CPOX'], valid_df['P_NUMVRC'])[0, 1]    # [0, 1] picks the off-diagonal coefficient
What I did is first get rid of the 99's and 7's (and the rows with a missing vaccine count), since you can't really rely on those. Then I changed HAVING_CPOX to be binary (0 is "has no cpox" and 1 is "has cpox"), so that the correlation makes sense. Then I used numpy's corrcoef and took the off-diagonal entry of the matrix it returns.
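If you prefer to stay in pandas, the same Pearson coefficient can be read straight off the cleaned frame with Series.corr (a one-line sketch, assuming valid_df from above):
corr = valid_df['HAVING_CPOX'].corr(valid_df['P_NUMVRC'])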

Related

Sequentially comparing groupby values conditionally

Given a dataframe
import pandas as pd

data = [['Bob','25'],['Alice','46'],['Alice','47'],['Charlie','19'],
        ['Charlie','19'],['Charlie','19'],['Doug','23'],['Doug','35'],['Doug','35.5']]
df = pd.DataFrame(data, columns=['Customer', 'Sequence'])
Calculate the following:
First Sequence in each group is assigned a GroupID of 1.
Compare first Sequence to subsequent Sequence values in each group.
If difference is greater than .5, increment GroupID.
If GroupID was incremented, instead of comparing subsequent values to the first, use the current Sequence.
In the desired results table below...
Bob only has 1 record so the GroupID is 1.
Alice has 2 records and the difference between the two Sequence values (46 & 47) is greater than .5 so the GroupID is incremented.
Charlie's Sequence values are all the same, so all records get GroupID 1.
For Doug, the difference between the first two Sequence values (23 & 35) is greater than .5, so the GroupID for the second Sequence becomes 2. Now, since the GroupID was incremented, I want to compare the next value of 35.5 to 35, not 23, which means the last two rows share the same GroupID.
Desired results:
CustomerID  Sequence  GroupID
Bob               25        1
Alice             46        1
Alice             47        2
Charlie           19        1
Charlie           19        1
Charlie           19        1
Doug              23        1
Doug              35        2
Doug            35.5        2
My implementation:
import numpy as np

# generate unique ID based on each customer's Sequence
df['EventID'] = df.groupby('Customer')['Sequence'].transform(
    lambda x: pd.factorize(x)[0]) + 1
# impute first Sequence for each customer for comparison
df['FirstSeq'] = np.where(df['EventID'] == 1, df['Sequence'], np.nan)
# group by Customer and fill the first Sequence forward
df['FirstSeq'] = df.groupby('Customer')['FirstSeq'].transform(lambda v: v.ffill())
# get difference of first Sequence and all others
df['FirstSeqDiff'] = abs(df['FirstSeq'] - df['Sequence'])
# create unique GroupID based on Sequence difference from first Sequence
df['GroupID'] = np.cumsum(df.FirstSeqDiff > 0.5) + 1
The above works for cases like Bob, Alice and Charlie but not Doug because it is always comparing to the first Sequence. How can I modify the code to change the compared Sequence value if the GroupID is incremented?
EDIT:
The dataframe will always be sorted by Customer and Sequence. I guess a better way to explain my goal is to assign a unique ID to all Sequence values whose differences are .5 or less, grouping by Customer.
The code has errors: adding df = df.astype({'Customer': str, 'Sequence': np.float64}) would fix them, since Sequence is read in as strings. But you still cannot get what you want with this design, because the comparison is always made against the first Sequence. Instead, define your own function myfunc, which solves the problem directly:
data = [['Bob','25'],['Alice','46'],['Alice','47'],['Charlie','19'],
        ['Charlie','19'],['Charlie','19'],['Doug','23'],['Doug','35'],['Doug','35.5']]
df = pd.DataFrame(data, columns=['Customer', 'Sequence'])
df = df.astype({'Customer': str, 'Sequence': np.float64})

def myfunc(series):
    ret = []
    series = series.sort_values().values
    for i, val in enumerate(series):
        if i == 0:
            ret.append(1)
        else:
            ret.append(ret[-1] + (series[i] - series[i - 1] > 0.5))
    return ret

df['EventID'] = df.groupby('Customer')['Sequence'].transform(lambda x: myfunc(x))
print(df)
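For reference, the same consecutive-difference logic can also be written without an explicit loop; a minimal vectorized sketch, assuming the frame is already sorted by Customer and Sequence as stated in the edit:
# cumulative count of gaps larger than 0.5 within each customer
df['EventID'] = (
    df.groupby('Customer')['Sequence']
      .transform(lambda s: (s.diff().abs() > 0.5).cumsum() + 1)
)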
Happy coding my friend.

How to find columns/Features for which there are at least X percentage of rows with Identical Values? [Python]

Let us say I have an extremely large dataset with N rows and M features. I also have two inputs:
'm': the number of features to check (m ≤ M).
'support': identical rows / total rows for the 'm'-feature subset. This is basically the minimum percentage of rows that must be identical when only those 'm' features are considered.
I need to return the groups of features for which the 'support' value is greater than a predefined value.
For Example, let us take this dataset:
d = {
'A': [100, 200, 200, 400,400], 'B': [1,2,2,4,5],
'C':['2018-11-19','2018-11-19','2018-12-19','2018-11-19','2018-11-19']
}
df = pd.DataFrame(data=d)
A B C
0 100 1 2018-11-19
1 200 2 2018-11-19
2 200 2 2018-12-19
3 400 4 2018-11-19
4 400 5 2018-11-19
In the above example if let us say that
'm' = 2
'support' = 0.4
Then the function should return both ['A','B'] and ['A','C'], as each of these feature pairs, when considered together, has at least 2 identical rows out of a total of 5 (>= 0.4).
I realize that a naive solution would be to compare all combinations of 'm' features out of 'M' and check the percentage of identical rows. However, this gets incredibly complex once the number of features reaches double digits, especially with thousands of rows. What would be an optimized way to tackle this problem?
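For reference, the naive combination check described in the question can be written directly; a minimal sketch (the function name feature_groups is mine), useful as a baseline to compare any optimized approach against:
from itertools import combinations
import pandas as pd

def feature_groups(df, m, support):
    # return every m-column subset whose largest group of identical rows
    # covers at least `support` of all rows
    n = len(df)
    result = []
    for cols in combinations(df.columns, m):
        largest = df.groupby(list(cols)).size().max()
        if largest / n >= support:
            result.append(list(cols))
    return result

d = {'A': [100, 200, 200, 400, 400], 'B': [1, 2, 2, 4, 5],
     'C': ['2018-11-19', '2018-11-19', '2018-12-19', '2018-11-19', '2018-11-19']}
df = pd.DataFrame(data=d)
print(feature_groups(df, m=2, support=0.4))   # [['A', 'B'], ['A', 'C']]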

Find feature or combination of features that has an effect

I am looking for a statistical model or test to answer the following question and would be grateful for some help:
I have five products p1, ..., p5 that my customers can subscribe to.
I have divided my customers into groups A1, ..., An, and for each group and each combination of products I have counted how many customers have that combination of products and how it has affected their sales:
Customer_group has_p1 has_p2 [...] has_p5 cust_count total_sales
A1 0 0 0 124 1234
A1 1 0 0 315 999
A1 1 1 0 199 7777
[...]
An 1 1 1 233 663
Now I want to find out which group of customers benefit from which product or combination of products.
My first idea was to use a paired t test for the group of customers that had a product versus the group that does not have a product in the same combination with other products, i.e. for measuring the effect of p1 I would pair {A1, 1, 0, 0, 1, 0} with {A1, 0, 0, 0, 1, 0} and compare the series of the two values of total_sales/cust_count.
However, with this test I only find out which of the products has an effect, not which group it has an effect for, or whether it is significant that the product is sold in combination with another product.
Any good ideas?
So after thinking for a day, I found a way:
First I did a one-hot encoding of the groups, so I replaced the customer_group column with n columns containing 0's and 1's.
Then I made a linear regression model with interaction terms:
product_i * product_j + group_k * product_i + group_k * product_i * product_j
By reducing the model, I found which product x product combinations and which group x product and group x product x product combinations were significant.
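A sketch of that model with the statsmodels formula API; statsmodels itself, the column names taken from the table above, and the choice of per-customer average sales as the response are my assumptions, not something the original post specifies (only the p1/p2 pair is written out for brevity):
import statsmodels.formula.api as smf

# df has one row per (customer group, product combination) cell, with the
# columns from the table above: Customer_group, has_p1 ... has_p5,
# cust_count and total_sales
df['avg_sales'] = df['total_sales'] / df['cust_count']

# product x product, group x product and group x product x product interactions
model = smf.wls(
    'avg_sales ~ has_p1 * has_p2'
    ' + C(Customer_group) * has_p1'
    ' + C(Customer_group) : has_p1 : has_p2',
    data=df,
    weights=df['cust_count'],   # each cell averages a different number of customers
).fit()
print(model.summary())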

How do I use Cosine similarity for this use case?

If I have a query vector A and an item vector B, it would be great if someone could guide me on how to weigh/normalize the vectors (strategies for the same).
Vector A would have the following components: property1 (binary), property2 (binary), property3 (int in the range 0 to 50), property4 (int in the range 0 to 10).
Vector B would have the same properties.
I know that the cosine similarity between these 2 vectors would give me a measure of how close they are, and I want to create a recommendation based on that similarity.
But I am not clear on how to normalize the properties and/or the vectors in this case, since it is binary + binary + int range + int range. Also, if I want to grant a higher weight to one property than the others, how do I do so? What options do I have?
I find examples of cosine similarity online with documents, but in this case the vectors A and B are not documents, so I am not using TF-IDF here.
Please advise,
Thanks
If you want to use the traditional cosine similarity between the two vectors as in tf-idf, then each term is a dimension in your vector. That is, you need to form two new vectors A' and B' and compute the similarity between these two.
These vectors have one dimension per term, and you have 66 terms:
property 1: true and false
property 2: true and false
property 3: 0 through 50
property 4: 0 through 10
So A' and B' will be vectors of length 66 and each element will be either 0 or 1:
A'(0) = 1 if A(0) = true, and 0 otherwise
A'(1) = 1 if A(0) = false, and 0 otherwise
etc.
Clearly, you can see that this is inefficient. You don't actually need to materialize A' or B' to use cosine similarity in the tf-idf style; you can just pretend you calculated them and perform the calculation on A and B. Note that length(A') = length(B') = sqrt(4), because there will be exactly 4 ones in each of A' and B'.
tf-idf may not be your best bet, though, if you want to take care of similarities within properties 3 and 4. That is, with tf-idf, a property-3 value of 40 is different from a value of 41 and different from a value of 12. However, 12 is not treated as "farther away" from 40 than 41 is; they are all just different terms.
So, if you want properties 3 and 4 to incorporate a distance (1 is really close to 2 and 50 is far from 2), then you have to define a distance metric. And if you want to weigh the Boolean values more or less than properties 3 and 4, you will have to define a different distance metric too. If these are things you want to do, forget about cosine and just come up with a value.
Here's an example:
distance = abs(A.property1 - B.property1) * 5 +
abs(A.property2 - B.property2) * 5 +
abs(A.property3 - B.property3) / 51 * 1 +
abs(A.property4 - B.property4) / 10 * 2
And then the similarity = (the maximum of all distances) - distance;
Or, if you like, similarity = 1 / distance.
You can really define it how ever you like. And if you need the similarity to be between 0 and 1, then normalize by dividing by the maximum possible distance.
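A small sketch of that weighted-distance idea in code; the dict representation of the vectors and the 1 - distance/max normalization are my choices, while the weights and range divisors are the ones from the example above:
def distance(a, b):
    # binary properties weighted 5, integer properties scaled to [0, 1] and weighted 1 and 2
    return (abs(a['property1'] - b['property1']) * 5
            + abs(a['property2'] - b['property2']) * 5
            + abs(a['property3'] - b['property3']) / 51 * 1
            + abs(a['property4'] - b['property4']) / 10 * 2)

MAX_DISTANCE = 5 + 5 + 50 / 51 * 1 + 10 / 10 * 2   # largest possible distance under these weights

def similarity(a, b):
    return 1 - distance(a, b) / MAX_DISTANCE        # normalized to [0, 1]

query = {'property1': 1, 'property2': 0, 'property3': 40, 'property4': 3}
item  = {'property1': 1, 'property2': 1, 'property3': 35, 'property4': 7}
print(similarity(query, item))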

Replace Number that falls Between Two Values (Pandas,Python3)

Simple Question Here:
b = 8143.1795845088482
d = 14723.523658084257
My Df called final:
Words score
This 90374.98788
is 80559.4495
a 43269.67002
sample 34535.01172
output Very Low
I want to replace all the scores with either 'very low', 'low', 'medium', or 'high' based on whether they fall between quartile ranges.
something like this works:
final['score'][final['score'] <= b] = 'Very Low' #This is shown in the example above
but when I try to play this immediately after it doesn't work:
final['score'][final['score'] >= b] and final['score'][final['score'] <= d] = 'Low'
This gives me the error: cannot assign operator. Anyone know what I am missing?
Firstly, you must use the bitwise operators (&, | instead of and, or), because you are comparing arrays and therefore all of the values, not a single value (it becomes ambiguous to compare arrays like this, and you cannot override the global and operator to behave the way you want). Secondly, you must use parentheses around multiple conditions due to operator precedence.
Finally, you are performing chained indexing, which may or may not work and will raise a warning. To set your column value, use loc like this:
In [4]:
b = 25
d = 50
final.loc[(final['score'] >= b) & (final['score'] <= d), 'score'] = 'Low'
final
Out[4]:
Words score
0 This 10
1 is Low
2 for Low
3 You 704
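To cover all four bands from the original question in one pass, the same boolean masks can be stacked; a minimal sketch with np.select, assuming the score column is still entirely numeric and using a hypothetical third boundary c (the question only defines b and d):
import numpy as np

c = 50000  # hypothetical cut point between 'Medium' and 'High'
conditions = [
    final['score'] <= b,                            # Very Low
    (final['score'] > b) & (final['score'] <= d),   # Low
    (final['score'] > d) & (final['score'] <= c),   # Medium
]
final['score'] = np.select(conditions, ['Very Low', 'Low', 'Medium'], default='High')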
If your DataFrame's scores were all floats,
In [234]: df
Out[234]:
Words score
0 This 90374.98788
1 is 80559.44950
2 a 43269.67002
3 sample 34535.01172
then you could use pd.qcut to categorize each value by its quartile:
In [236]: df['quartile'] = pd.qcut(df['score'], q=4, labels=['very low', 'low', 'medium', 'high'])
In [237]: df
Out[237]:
Words score quartile
0 This 90374.98788 high
1 is 80559.44950 medium
2 a 43269.67002 low
3 sample 34535.01172 very low
DataFrame columns have a dtype. When the values are all floats, the column has a float dtype, which can be very fast for numerical calculations. When the values are a mixture of floats and strings, the dtype is object, which means each value is a Python object. While this gives the values a lot of flexibility, it is also very slow, since every operation ultimately resorts to calling a Python function instead of a NumPy/pandas C/Fortran/Cython function. Thus you should try to avoid mixing floats and strings in a single column.
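A quick illustration of that dtype point, using the scores from the question:
import pandas as pd

numeric = pd.Series([90374.98788, 80559.4495, 43269.67002, 34535.01172])
print(numeric.dtype)   # float64 -- fast, vectorized NumPy operations

mixed = pd.Series([90374.98788, 'Very Low', 43269.67002, 34535.01172])
print(mixed.dtype)     # object -- every element is a Python object, so operations fall back to Python speed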
