I have extracted a table from a database and want to do some topic analysis on some of the entries. I have created an empty matrix with a column for each unique topic name; there are duplicate rows because there are potentially multiple topics associated with each 'name' entry. Ultimately, I would like a dataframe with 1s across each row wherever a topic is associated with that name. I will then remove the 'topic_label' column and, at some point, remove duplicate rows. The actual dataframe is much larger; this is just an illustration.
Here is my data:
topic_label name Misconceptions Long-term health issues Reproductive disease Inadequate research Unconscious bias
0 Misconceptions When is menstrual bleeding too much? 0 0 0 0 0
1 Long-term health issues When is menstrual bleeding too much? 0 0 0 0 0
2 Reproductive disease 10% of reproductive age women have endometriosis 0 0 0 0 0
3 Inadequate research 10% of reproductive age women have endometriosis 0 0 0 0 0
4 Unconscious bias Male bias threatens women's health 0 0 0 0 0
And I would like it to look like this:
topic_label name Misconceptions Long-term health issues Reproductive disease Inadequate research Unconscious bias
0 Misconceptions When is menstrual bleeding too much? 1 1 0 0 0
1 Long-term health issues When is menstrual bleeding too much? 1 1 0 0 0
2 Reproductive disease 10% of reproductive age women have endometriosis 0 0 1 1 0
3 Inadequate research 10% of reproductive age women have endometriosis 0 0 1 1 0
4 Unconscious bias Male bias threatens women's health 0 0 0 0 1
I have tried to use .loc in a loop to first slice the data by name, and then assign the values (after setting name as index), but this doesn't work when a row is unique:
name_set = list(set(df['name']))
df = df.set_index('name')
for i in name_set:
    df.loc[i, list(df.loc[i]['topic_label'])] = 1
I feel like I am going round in circles here... is there a better way to do this?
One option is to use get_dummies to create dummy variables for each topic_label, then call sum via groupby.transform to aggregate the dummy variables within each name:
cols = df['topic_label'].tolist()
out = df.drop(columns=cols).join(
    pd.get_dummies(df['topic_label'])
      .groupby(df['name'])
      .transform('sum')
      .reindex(df['topic_label'], axis=1)
)
print(out)
The above returns a new DataFrame out. If you want to update df instead, then you can use update:
df.update(
    pd.get_dummies(df['topic_label'])
      .groupby(df['name'])
      .transform('sum')
      .reindex(df['topic_label'], axis=1)
)
print(df)
Output:
topic_label name Misconceptions Long-term health issues Reproductive disease Inadequate research Unconscious bias
0 Misconceptions When is menstrual bleeding too much? 1 1 0 0 0
1 Long-term health issues When is menstrual bleeding too much? 1 1 0 0 0
2 Reproductive disease 10% of reproductive age women have endometriosis 0 0 1 1 0
3 Inadequate research 10% of reproductive age women have endometriosis 0 0 1 1 0
4 Unconscious bias Male bias threatens women's health 0 0 0 0 1
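Since the question also mentions dropping the topic_label column and duplicate rows afterwards, a possible follow-up on the result above (not part of the original answer) is:
final = out.drop(columns='topic_label').drop_duplicates().reset_index(drop=True)
print(final)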
My data looks like this:
It is grouped by "name"
name star atm food foodcp drink drinkcp clean cozy service
___Backyard Jr. (__Xinyi) 4 4 4 4 4 0 4 0 0
___Backyard Jr. (__Xinyi) 3 0 3 0 3 0 0 0 3
___Backyard Jr. (__Xinyi) 4 0 0 0 4 0 0 0 0
___Backyard Jr. (__Xinyi) 3 0 0 0 0 0 0 3 3
I want to calculate the mean of all columns except name, ignoring the "0" values, within each group. How can I do it?
I've tried using
df.groupby('name', as_index=False).mean()
but it does include the "0" values.
Thank you for your help!!
You can first replace all the zeros by NaN:
df = df.replace(0, np.nan)
These NaN values will be excluded from your mean.
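Putting it together with the groupby, a minimal sketch (column names as in your example):
import numpy as np

# zeros become NaN, which mean() skips by default
df = df.replace(0, np.nan)
out = df.groupby('name', as_index=False).mean()
print(out)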
As cosine similarity works on vectors, I want to know whether it can be used to get the similarity between numeric or categorical data.
For example:
A dataset of customers shopping at a supermarket has only categorical or numerical values:
CustID Gender online milk bread egg diapers
234 1 1 0 1 1 0
235 2 1 0 1 0 0
234 1 0 1 0 0 1
234 1 0 0 1 0 1
238 3 0 0 0 0 0
239 1 0 1 1 0 1
240 2 1 0 0 1 1
Gender is categorical and the rest of the variables are int64.
How can I use cosine similarity to measure the similarity in this data (specifically, the similarity between the shopping trips of a single customer, since there are multiple entries for the same customer)?
Also, which other similarity methods should I use?
Cosine similarity can be applied to categorical (binary) data like this.
# download scratch package if it's not installed on compute
# pip3 install scratch
from scratch.linear_algebra import dot, Vector
def cosine_similarity(vec1: Vector, vec2: Vector) -> float:
    return dot(vec1, vec2) / (dot(vec1, vec1) * dot(vec2, vec2)) ** 0.5
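If you prefer not to depend on the scratch helper, the same computation can be sketched with plain NumPy; the two example rows below are purchase vectors of customer 234 taken from the table above (columns online, milk, bread, egg, diapers):
import numpy as np

def cosine_similarity(vec1, vec2):
    # dot product normalized by the product of the vector norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

row_a = np.array([1, 0, 1, 1, 0])
row_b = np.array([0, 0, 1, 0, 1])
print(cosine_similarity(row_a, row_b))  # ~0.41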
I have a dataset like this:
sample = {'Theme': ['never give a ten', 'interaction speed', 'no feedback,premium'],
          'cat1': [0, 0, 0],
          'cat2': [0, 0, 0],
          'cat3': [0, 0, 0],
          'cat4': [0, 0, 0]
          }
df = pd.DataFrame(sample, columns=['Theme', 'cat1', 'cat2', 'cat3', 'cat4'])
print(df)
Theme cat1 cat2 cat3 cat4
0 never give a ten 0 0 0 0
1 interaction speed 0 0 0 0
2 no feedback,premium 0 0 0 0
Now I need to replace the values in the cat columns based on the value in Theme. If the Theme column has 'never give a ten', set cat1 to 1; similarly, if it has 'interaction speed', set cat2 to 1; if it has 'no feedback', set cat3 to 1; and for 'premium', set cat4 to 1.
In this sample I have provided 4 categories, but in total I have 21. I could write an if-word-in-string check 21 times for the 21 categories, but I am looking for a more efficient way: a function that loops over every row, applies this logic, and updates the corresponding columns. Can anyone help, please?
Thanks in advance.
It is possible to set the column names by category with Series.str.get_dummies - note that the column names are sorted:
df1 = df['Theme'].str.get_dummies(',')
print (df1)
interaction speed never give a ten no feedback premium
0 0 1 0 0
1 1 0 0 0
2 0 0 1 1
If you need the first column in the output, add DataFrame.join:
df11 = df[['Theme']].join(df['Theme'].str.get_dummies(','))
print (df11)
Theme interaction speed never give a ten no feedback \
0 never give a ten 0 1 0
1 interaction speed 1 0 0
2 no feedback,premium 0 0 1
premium
0 0
1 0
2 1
If the order of the columns is important, add DataFrame.reindex:
# remove possible duplicates while keeping the original ordering
cols = dict.fromkeys([y for x in df['Theme'] for y in x.split(',')]).keys()
df2 = df['Theme'].str.get_dummies(',').reindex(cols, axis=1)
print (df2)
never give a ten interaction speed no feedback premium
0 1 0 0 0
1 0 1 0 0
2 0 0 1 1
cols = dict.fromkeys([y for x in df['Theme'] for y in x.split(',')]).keys()
df2 = df[['Theme']].join(df['Theme'].str.get_dummies(',').reindex(cols, axis=1))
print (df2)
Theme never give a ten interaction speed no feedback \
0 never give a ten 1 0 0
1 interaction speed 0 1 0
2 no feedback,premium 0 0 1
premium
0 0
1 0
2 1
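The question also asks to fill the existing cat1..cat4 columns rather than columns named after the themes; one possible extension (not part of the original answer) uses an explicit, hypothetical phrase-to-column mapping with Series.str.contains:
# hypothetical mapping from phrase to target column
mapping = {'never give a ten': 'cat1',
           'interaction speed': 'cat2',
           'no feedback': 'cat3',
           'premium': 'cat4'}

for phrase, col in mapping.items():
    # 1 where the Theme contains the phrase, else 0
    df[col] = df['Theme'].str.contains(phrase, regex=False).astype(int)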
I have data for around 2 million active customers and around 2-5 years worth of transaction data by customer. This data includes features such as what item that customer bought, what store they bought it from, the date they purchased that item, how much they bought, how much they paid, etc.
I need to predict which of our customers will shop in the next 2 weeks.
Right now my data is set up like this
item_a item_b item_c item_d customer_id visit
dates
6/01 1 0 0 0 cust_123 1
6/02 0 0 0 0 cust_123 0
6/03 0 1 0 0 cust_123 1
6/04 0 0 0 0 cust_123 0
6/05 1 0 0 0 cust_123 1
6/06 0 0 0 0 cust_123 0
6/07 0 0 0 0 cust_123 0
6/08 1 0 0 0 cust_123 1
6/01 0 0 0 0 cust_456 0
6/02 0 0 0 0 cust_456 0
6/03 0 0 0 0 cust_456 0
6/04 0 0 0 0 cust_456 0
6/05 1 0 0 0 cust_456 1
6/06 0 0 0 0 cust_456 0
6/07 0 0 0 0 cust_456 0
6/08 0 0 0 0 cust_456 0
6/01 0 0 0 0 cust_789 0
6/02 0 0 0 0 cust_789 0
6/03 0 0 0 0 cust_789 0
6/04 0 0 0 0 cust_789 0
6/05 0 0 0 0 cust_789 0
6/06 0 0 0 0 cust_789 0
6/07 0 0 0 0 cust_789 0
6/08 0 1 1 0 cust_789 1
Should I make the target variable something like this:
df['target_variable'] = 'no_purchase'
for cust in list(set(df['customer_id'])):
    df['target_variable'] = np.where(df['visit'] > 0, cust, df['target_variable'])
or should my visit feature be the target variable? If it's the latter, should I one-hot encode all 2 million customers? If not, how should I set this up in Keras so that it classifies visits for all 2 million customers?
I think you should first understand your problem better -- it requires strong domain knowledge to model correctly, and it can be modeled in many different ways; below are just some examples:
Regression problem: use only the relative dates in a customer's purchase record, e.g.
construct a sequence like [date2-date1, date3-date2, date4-date3, ...] from your data.
[6, 7, 5, 13, ...] means a customer tends to buy things on a weekly or biweekly basis;
[24, 30, 33, ...] means a customer tends to buy things on a monthly basis.
If you organize the problem in this way, all you need is to predict the next number in a given sequence. You can easily generate such training data by:
randomly selecting a full sequence, say [a, b, c, d, e, f, ..., z];
randomly selecting a position to predict, say x;
picking the K (say K=6) preceding values [r, s, t, u, v, w] as your network input, and x as your network target.
Once this model is trained, your ultimate task can be resolved by checking whether the predicted gap (the number of days until the next purchase) falls within your two-week window.
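A rough sketch of building the per-customer gap sequences with pandas, assuming 'dates' is a regular column (use reset_index() first if it is the index) and that the real dates are full, parseable dates:
import pandas as pd

visits = df[df['visit'] == 1].copy()
visits['dates'] = pd.to_datetime(visits['dates'])

# days between consecutive purchases for each customer, e.g. [6, 7, 5, 13, ...]
gap_sequences = (visits.sort_values('dates')
                       .groupby('customer_id')['dates']
                       .apply(lambda s: s.diff().dt.days.dropna().tolist()))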
Classification problem: given a customer's purchase record over K months, predict how many purchases the customer will make in the next two months.
Again, you need to create training data from your raw data, but this time the target for a customer is how many items they purchased in months K+1 and K+2, and you may organize the K months of input data in your own way.
Please note that the number of items a customer purchases is a discrete number, but far below 1M. In fact, as in image-based age estimation, people often quantize the target into bins, e.g. 0-8, 9-16, 17-24, etc. You may do the same thing for your problem. Of course, you may also formulate this as a regression problem and directly predict the number of items.
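A quick sketch of that binning step with pandas (the bin edges are just the illustrative ones above, and the counts are made up):
import pandas as pd

# hypothetical purchase counts for the next two months
counts = pd.Series([3, 12, 20, 7, 15])

# bin into the classes 0-8, 9-16, 17-24
target_class = pd.cut(counts, bins=[0, 8, 16, 24],
                      labels=['0-8', '9-16', '17-24'], include_lowest=True)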
Why do you need to know your problem better?
As you can see, you can come up with a number of problem formulations that all look reasonable at first glance, making it very difficult to say which one is best.
It is worth noting the dependence between a problem set-up and its hidden premises (you may not notice such things until you think about the problem carefully). For example, the regression set-up that predicts the gap until the next purchase implies that the number of items a customer purchased does not matter. That assumption may or may not be fair in your problem.
You may come up with a much simpler but more effective solution if you know your problem well.
In most problems like yours, you don't have to use deep learning, or at least not as a first step. Classic approaches may work better.
I have trained a neural network to predict the outcome (Win/Lose) of a hockey game based on a few metrics.
The data I have been feeding it looks like this:
Each row represents a team in one game, so two specific rows make a match.
Won/Lost Home Away metric2 metric3 metric4 team1 team2 team3 team4
1 1 0 10 10 10 1 0 0 0
0 0 1 10 10 10 0 1 0 0
1 1 0 10 10 10 0 0 1 0
0 0 1 10 10 10 0 0 0 1
The predictions from the NN look like this:
[0.12921564 0.87078434]
[0.63811845 0.3618816 ]
[5.8682327e-04 9.9941313e-01]
[0.97831124 0.02168871]
[0.04394475 0.9560553 ]
[0.76859254 0.23140742]
[0.45620263 0.54379743]
[0.01509337 0.9849066 ]
I believe I understand that the first column is for Lost (0) and the second is for Won (1),
but what I don't understand is: who won against whom?
I don't know what to make of these predictions; do they even mean anything this way?
According to the dataset you show here, it seems that the network's output gives the probability of a team winning or losing a match depending on whether it is the host. I think you should add one more feature to your dataset that identifies the rival team in the match if you want your network to give the probability of winning against a particular opponent in a given hosting situation (and if hosting is not important for you, then you should remove the Home and Away columns).
Let us take the first two rows of your dataset:
Won/Lost Home Away metric2 metric3 metric4 team1 team2 team3 team4
1 1 0 10 10 10 1 0 0 0
0 0 1 10 10 10 0 1 0 0
#predictions
[0.12921564 0.87078434]
[0.63811845 0.3618816 ]
Team 1 played a game at home and won the match. The model's prediction aligns with this because it assigned a high probability to the second column, which, as you mentioned, is the probability of winning.
Similarly, Team 2 played away and lost the match. The model's prediction aligns here as well.
You mentioned that two specific rows make a match, but with the available information we cannot say who played against whom. It is just a model that predicts the probability of winning for each team independently.
EDIT:
Assuming that you have data like this:
gameID Won/Lost Home Away metric2 metric3 metric4 team1 team2 team3 team4
2017020001 1 1 0 10 10 10 1 0 0 0
2017020001 0 0 1 10 10 10 0 1 0 0
You could transform the data as follows, which can improve the model.
Won/Lost metric2 metric3 metric4 h_team1 h_team2 h_team3 h_team4 a_team1 a_team2 a_team3 a_team4
1 10 10 10 1 0 0 0 0 1 0 0
Note: the Won/Lost value would be for the home team, which is indicated by the h_team columns.
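A possible sketch of that reshape with pandas, assuming a gameID column and the column names shown (one way to do it, not the only one):
team_cols = ['team1', 'team2', 'team3', 'team4']

# split each game into its home row and away row, prefixing the team dummies
home = df[df['Home'] == 1].rename(columns={c: 'h_' + c for c in team_cols})
away = df[df['Away'] == 1].rename(columns={c: 'a_' + c for c in team_cols})

# one row per game: home team's Won/Lost and metrics plus both teams' dummies
merged = (home.merge(away[['gameID'] + ['a_' + c for c in team_cols]], on='gameID')
              .drop(columns=['Home', 'Away']))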