pandas assign value in multiple columns based on value in one - python-3.x

I have a dataset like this:
import pandas as pd

sample = {'Theme': ['never give a ten','interaction speed','no feedback,premium'],
          'cat1': [0,0,0],
          'cat2': [0,0,0],
          'cat3': [0,0,0],
          'cat4': [0,0,0]
         }
df = pd.DataFrame(sample, columns=['Theme','cat1','cat2','cat3','cat4'])

                 Theme  cat1  cat2  cat3  cat4
0     never give a ten     0     0     0     0
1    interaction speed     0     0     0     0
2  no feedback,premium     0     0     0     0
Now I need to set the values in the cat columns based on the value in Theme. If the Theme column contains 'never give a ten', set cat1 to 1; if it contains 'interaction speed', set cat2 to 1; if it contains 'no feedback', set cat3 to 1; and for 'premium', set cat4 to 1.
This sample has 4 categories, but in total I have 21. I could write an `if word in string` check 21 times, but I am looking for an efficient way to express this in a function that loops over every row and updates the corresponding columns. Can anyone help, please?
Thanks in advance.

You can build the indicator columns with Series.str.get_dummies (note the resulting column names are sorted alphabetically):
df1 = df['Theme'].str.get_dummies(',')
print (df1)
   interaction speed  never give a ten  no feedback  premium
0                  0                 1            0        0
1                  1                 0            0        0
2                  0                 0            1        1
If you need the original Theme column in the output, add DataFrame.join:
df11 = df[['Theme']].join(df['Theme'].str.get_dummies(','))
print (df11)
                 Theme  interaction speed  never give a ten  no feedback  premium
0     never give a ten                  0                 1            0        0
1    interaction speed                  1                 0            0        0
2  no feedback,premium                  0                 0            1        1
If the order of the columns matters, add DataFrame.reindex:
# remove possible duplicates while preserving order
cols = dict.fromkeys([y for x in df['Theme'] for y in x.split(',')]).keys()
df2 = df['Theme'].str.get_dummies(',').reindex(cols, axis=1)
print (df2)
   never give a ten  interaction speed  no feedback  premium
0                 1                  0            0        0
1                 0                  1            0        0
2                 0                  0            1        1
cols = dict.fromkeys([y for x in df['Theme'] for y in x.split(',')]).keys()
df2 = df[['Theme']].join(df['Theme'].str.get_dummies(',').reindex(cols, axis=1))
print (df2)
                 Theme  never give a ten  interaction speed  no feedback  premium
0     never give a ten                 1                  0            0        0
1    interaction speed                 0                  1            0        0
2  no feedback,premium                 0                  0            1        1
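The approach above creates columns named after the themes. If you need to fill the original cat1–cat4 columns instead, here is a minimal sketch that loops over a keyword-to-column mapping (the theme_to_cat dict is an assumption; extend it to all 21 categories):
# Hypothetical mapping from theme keyword to target category column
theme_to_cat = {'never give a ten': 'cat1',
                'interaction speed': 'cat2',
                'no feedback': 'cat3',
                'premium': 'cat4'}

for theme, cat in theme_to_cat.items():
    # regex=False treats the keyword as a literal substring
    df[cat] = df['Theme'].str.contains(theme, regex=False).astype(int)
This avoids writing 21 separate if checks: adding a category is just one more dict entry.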

Related

Pandas: how to calculate average ignoring 0 within groups?

My data looks like this (it is grouped by "name"):
name                       star  atm  food  foodcp  drink  drinkcp  clean  cozy  service
___Backyard Jr. (__Xinyi)     4    4     4       4      4        0      4     0        0
___Backyard Jr. (__Xinyi)     3    0     3       0      3        0      0     0        3
___Backyard Jr. (__Xinyi)     4    0     0       0      4        0      0     0        0
___Backyard Jr. (__Xinyi)     3    0     0       0      0        0      0     3        3
I want to calculate the mean of all columns except for name, ignoring the zeros, within groups. How can I do it?
I've tried using
df.groupby('name', as_index=False).mean()
but it includes the zeros in the average.
Thank you for your help!!
You can first replace all the zeros with NaN:
import numpy as np
df = df.replace(0, np.nan)
These NaN values will then be excluded from the mean.
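Putting both steps together, a minimal sketch (assuming df holds the columns shown above):
import numpy as np
import pandas as pd

# Zeros become NaN, and mean() skips NaN by default
result = df.replace(0, np.nan).groupby('name', as_index=False).mean()
print(result)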

Pandas in Python 3 - Return list of highest sum

I want to find the sum of each column in the dataframe below and return the column(s) with the highest sum. The code below only reports the max number; how do I update it to also include the column label (or labels, if more than one column ties for the max)?
grouped = df.sum()
mostPurchased = grouped.max()
print(grouped)
The dataframe looks like this:
           snow suit  gloves  coat  boots
january            1       0     0      0
february           1       0     1      0
march              0       0     0      0
april              0       0     1      0
may                0       0     1      1
june               0       0     0      1
july               0       1     0      1
I want this to return:
Coat 3, Boots 3
Select the columns where the column sum equals the max column sum:
grouped = df.sum()
grouped[grouped == grouped.max()]
#coat 3
#boots 3
#dtype: int64
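If you also want the exact string from the question, a small follow-up sketch (the .title() capitalization is an assumption, added to match "Coat 3, Boots 3"):
grouped = df.sum()
top = grouped[grouped == grouped.max()]
print(', '.join(f'{name.title()} {total}' for name, total in top.items()))
# Coat 3, Boots 3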

How to return all rows that have an equal number of 0 and 1 values?

I have a dataframe with 50 columns, each containing either 0 or 1. How do I return all rows that have an equal number (a tie) of 0s and 1s (25 "0" and 25 "1")?
An example with 4 columns:
A  B  C  D
1  1  0  0
1  1  1  0
1  0  1  0
0  0  0  0
Based on the above example, it should return the first and the third rows:
A  B  C  D
1  1  0  0
1  0  1  0
Since the values are only 0 and 1, a row with an equal number of each has a mean of exactly 0.5, so you can try:
df[df.mean(axis=1).eq(0.5)]
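An equivalent, more explicit sketch that counts the 1s directly; for the 50-column case a tie is exactly 25 ones:
# Keep rows where exactly half of the columns are 1
half = df.shape[1] // 2
result = df[df.eq(1).sum(axis=1) == half]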

I have a DataFrame's columns and its data in lists; I want to put the relevant data in the relevant column

Suppose you are given a list of all possible items, and separately a list of records whose lengths vary (each record may contain any number of items). You want to build a DataFrame from the records, putting each item into the right column.
For example:
columns = ['shirt','shoe','tie','hat']
data = [['hat','tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]
# and from this I want to create dummy variables like this
   shirt  shoe  tie  hat
0      0     0    1    1
1      1     1    1    0
2      1     0    1    0
If you want indicator columns filled with 0 and 1 only, use MultiLabelBinarizer, then DataFrame.reindex to reorder the columns by your list (and to add an all-zero column for any item that never appears in the data):
from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd

columns = ['shirt','shoe','tie','hat']
data = [['hat','tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]

mlb = MultiLabelBinarizer()
df = (pd.DataFrame(mlb.fit_transform(data), columns=mlb.classes_)
        .reindex(columns, axis=1, fill_value=0))
print (df)
   shirt  shoe  tie  hat
0      0     0    1    1
1      1     1    1    0
2      1     0    1    0
Or Series.str.get_dummies:
df = pd.Series(data).str.join('|').str.get_dummies().reindex(columns, axis=1, fill_value=0)
print (df)
   shirt  shoe  tie  hat
0      0     0    1    1
1      1     1    1    0
2      1     0    1    0
This is one approach using collections.Counter.
Ex:
from collections import Counter
import pandas as pd

columns = ['shirt','shoe','tie','hat']
data = [['hat','tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]

data = map(Counter, data)
# Items missing from a row come out as NaN, so fill with 0 and restore int dtype
df = pd.DataFrame(data, columns=columns).fillna(0).astype(int)
print(df)
Output:
   shirt  shoe  tie  hat
0      0     0    1    1
1      1     1    1    0
2      1     0    1    0
You can try converting data to a dataframe:
data = [['hat','tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]
df = pd.DataFrame(data)
df
      0      1      2
0   hat    tie   None
1  shoe    tie  shirt
2   tie  shirt   None
Then use:
pd.get_dummies(df.stack()).groupby(level=0).agg('sum')
   hat  shirt  shoe  tie
0    1      0     0    1
1    0      1     1    1
2    0      1     0    1
Explanation:
df.stack() returns a MultiIndex Series:
0  0      hat
   1      tie
1  0     shoe
   1      tie
   2    shirt
2  0      tie
   1    shirt
dtype: object
If we get the dummy values of this series we get:
     hat  shirt  shoe  tie
0 0    1      0     0    0
  1    0      0     0    1
1 0    0      0     1    0
  1    0      0     0    1
  2    0      1     0    0
2 0    0      0     0    1
  1    0      1     0    0
Then you just have to group by the first index level and merge the rows with sum (each entry is only 0 or 1 after get_dummies, so no information is lost):
df = pd.get_dummies(df.stack()).groupby(level=0).agg('sum')

Create counter column based on values in 2 dataframe columns

I am looking to create a counter column based on row values in 2 dataframe columns, represented here as Col1 and Col2.
An example of the dataset is as follows:
Col1 Col2
a 0
a 0
a 0
a 1
a 0
a 0
a 0
a 1
a 1
b 0
b 0
b 1
b 1
b 0
b 0
Col1 is an identification variable; I want the counter to start over when a new identification variable appears (so when 'a' switches to 'b', the counter returns to 0).
Col2 indicates a new input in the data: a 1 marks a new input, and the 0s after it correspond to measurements within that input. Each time a 1 arises, I want the counter to increment by 1; each time the value switches from 1 back to 0 (and vice versa), I also want the counter to increment by 1. Based on the above dataset, I want the output to look like the following in Col3:
Col1 Col2 Col3
a 0 0
a 0 0
a 0 0
a 1 1
a 0 2
a 0 2
a 0 2
a 1 3
a 1 4
b 0 0
b 0 0
b 1 1
b 1 2
b 0 3
b 0 3
So basically every time Col2 switches from a 0 to a 1, and each time a 1 arises, I want the counter to increment. Each time a 0 is present in Col2, I want the counter to remain the same value. And every time Col1 changes to a new ID (in this case, from 'a' to 'b') I want the counter to start over at 0.
I've been mainly doing this with conditional statements, but there are a ton of them and I'm looking to run this on a large dataset, which would take hours to run. Is there a quick and easy way to run something like this, with these conditions on both columns? Or does anyone have suggestions on transformations to this data that would make running a categorization like this easier?
I understand that this is a slightly confusing request, so please let me know if there is anything I can do to provide more clarity into what I'm looking for.
Thanks!
Flag each position where Col2 equals 1 or differs from the previous value, then take the cumulative sum of the flags within each Col1 group (Col4 below reproduces the expected Col3):
import numpy as np

df.assign(Col4=df.groupby('Col1').Col2.apply(lambda x:
    pd.Series(np.r_[False, (x.values[1:] == 1) | (x.values[1:] != x.values[:-1])].cumsum())).values)
   Col1  Col2  Col3  Col4
0     a     0     0     0
1     a     0     0     0
2     a     0     0     0
3     a     1     1     1
4     a     0     2     2
5     a     0     2     2
6     a     0     2     2
7     a     1     3     3
8     a     1     4     4
9     b     0     0     0
10    b     0     0     0
11    b     1     1     1
12    b     1     2     2
13    b     0     3     3
14    b     0     3     3
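A more readable variant of the same idea, written without apply (a sketch; column names as in the question):
s = df['Col2']
new_group = df['Col1'].ne(df['Col1'].shift())   # True at the first row of each ID
inc = (s.eq(1) | s.ne(s.shift())) & ~new_group  # increment on a 1 or on any value switch
df['Col3'] = inc.astype(int).groupby(df['Col1']).cumsum()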
