I have a survey response sheet which has questions which can have multiple answers, selected using a set of checkboxes.
When I get the data from the response sheet and import it into pandas I get this:
Timestamp Sports you like Age
0 23/11/2013 13:22:30 Football, Chess, Cycling 15
1 23/11/2013 13:22:34 Football 25
2 23/11/2013 13:22:39 Swimming,Football 22
3 23/11/2013 13:22:45 Chess, Soccer 27
4 23/11/2013 13:22:48 Soccer 30
There can be any number of sport values in the sports column (further rows have basketball, volleyball, etc.), and there are some other columns as well. I'd like to compute statistics on the answers to the question (how many people liked Football, etc.). The problem is that all of the answers are in one column, so grouping by that column and asking for counts doesn't work.
Is there a simple way within pandas to convert this sort of data frame into one with multiple columns called Sports-Football, Sports-Volleyball, Sports-Basketball, where each column is an indicator (1 for yes, 0 for no)? I can't think of a sensible way to do this.
What I need is a new dataframe that looks like this (along with the Age column):
Timestamp Sports-Football Sports-Chess Sports-Cycling ....
0 23/11/2013 13:22:30 1 1 1
1 23/11/2013 13:22:34 1 0 0
2 23/11/2013 13:22:39 1 0 0
3 23/11/2013 13:22:45 0 1 0
I got as far as this but can't proceed further:
df['Sports you like'].str.split(r',\s*')
This splits the values, but the first piece may be any sport. I need a 1 in the Football column if the user likes Football, and 0 otherwise.
The problem is that the separator is the pattern ,\s* rather than a single fixed string, so the solution is to add str.split and str.join before str.get_dummies:
df1 = (df.pop('Sports you like').str.split(r',\s*')
         .str.join('|')
         .str.get_dummies()
         .add_prefix('Sports-'))
df = df.join(df1)
print(df)
Timestamp Age Sports-Chess Sports-Cycling Sports-Football \
0 23/11/2013 13:22:30 15 1 1 1
1 23/11/2013 13:22:34 25 0 0 1
2 23/11/2013 13:22:39 22 0 0 1
3 23/11/2013 13:22:45 27 1 0 0
4 23/11/2013 13:22:48 30 0 0 0
Sports-Soccer Sports-Swimming
0 0 0
1 0 0
2 0 1
3 1 0
4 1 0
Or use MultiLabelBinarizer:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
s = df.pop('Sports you like').str.split(r',\s*')
# index=df.index keeps the rows aligned with df for the join below
df1 = pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_, index=df.index).add_prefix('Sports-')
print(df1)
Sports-Chess Sports-Cycling Sports-Football Sports-Soccer \
0 1 1 1 0
1 0 0 1 0
2 0 0 1 0
3 1 0 0 1
4 0 0 0 1
Sports-Swimming
0 0
1 0
2 1
3 0
4 0
df = df.join(df1)
print (df)
Timestamp Age Sports-Chess Sports-Cycling Sports-Football \
0 23/11/2013 13:22:30 15 1 1 1
1 23/11/2013 13:22:34 25 0 0 1
2 23/11/2013 13:22:39 22 0 0 1
3 23/11/2013 13:22:45 27 1 0 0
4 23/11/2013 13:22:48 30 0 0 0
Sports-Soccer Sports-Swimming
0 0 0
1 0 0
2 0 1
3 1 0
4 1 0
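To get the counts the question asks for (how many people liked each sport), the indicator columns can simply be summed. Below is a minimal, self-contained sketch of the str.get_dummies approach, using the sample rows from the question. The names sports and counts are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Timestamp": ["23/11/2013 13:22:30", "23/11/2013 13:22:34",
                  "23/11/2013 13:22:39", "23/11/2013 13:22:45",
                  "23/11/2013 13:22:48"],
    "Sports you like": ["Football, Chess, Cycling", "Football",
                        "Swimming,Football", "Chess, Soccer", "Soccer"],
    "Age": [15, 25, 22, 27, 30],
})

# Normalise the "," / ", " separators to "|" and one-hot encode.
sports = (df["Sports you like"]
          .str.split(r",\s*")
          .str.join("|")
          .str.get_dummies()
          .add_prefix("Sports-"))

# Each column holds 0/1, so the column sums are the per-sport counts.
counts = sports.sum()
print(counts)
```

Because the indicators are plain integers, the same frame can also be joined back onto df and grouped by other columns such as Age.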
My data looks like this:
It is grouped by "name"
name star atm food foodcp drink drinkcp clean cozy service
___Backyard Jr. (__Xinyi) 4 4 4 4 4 0 4 0 0
___Backyard Jr. (__Xinyi) 3 0 3 0 3 0 0 0 3
___Backyard Jr. (__Xinyi) 4 0 0 0 4 0 0 0 0
___Backyard Jr. (__Xinyi) 3 0 0 0 0 0 0 3 3
I want to calculate the mean of every column except name within each group, ignoring the "0" values. How can I do it?
I've tried
df.groupby('name',as_index=False).mean()
but it does include the "0" values in the calculation.
Thank you for your help!!
You can first replace all the zeros with NaN:
import numpy as np
df = df.replace(0, np.nan)
NaN values are skipped by mean() by default, so they will be excluded from the group means.
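A minimal end-to-end sketch of this idea, on made-up data shaped like the question's (the group names "A" and "B" and the names value_cols and means are hypothetical). It replaces zeros only in the value columns and then takes the mean per group:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same shape as the question's data.
df = pd.DataFrame({
    "name": ["A", "A", "A", "B", "B"],
    "star": [4, 3, 4, 5, 0],
    "food": [4, 3, 0, 0, 2],
})

value_cols = df.columns.drop("name")

# Turn zeros into NaN in the value columns, then average per group;
# mean() skips NaN by default.
means = (df[value_cols]
         .replace(0, np.nan)
         .groupby(df["name"])
         .mean())
print(means)
```

A group whose values are all zero in some column would come out as NaN for that column, since there is nothing left to average.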
In this dataframe:
Feat1 Feat2 Feat3 Feat4 Labels
-46.220314 22.862856 -6.1573067 5.6060414 2
-23.80669 20.536781 -5.015675 4.2216353 2
-42.092365 25.680704 -5.0092897 5.665794 2
-35.29639 21.709473 -4.160352 5.578346 2
-37.075096 22.347767 -3.860426 5.6953945 2
-42.8849 28.03802 -7.8572545 3.3361 2
-32.3057 26.568039 -9.47018 3.4532788 2
-24.469942 27.005375 -9.301921 4.3995037 2
-97.89892 -0.38156664 6.4163384 7.234347 1
-81.96325 0.1821717 -1.2870358 4.703838 1
-78.41986 -6.766374 0.8001185 0.83444935 1
-100.68544 -4.5810957 1.6977689 1.8801615 1
-87.05412 -2.9231584 6.817379 5.4460077 1
-64.121056 -3.7892206 -0.283514 6.3084154 1
-94.504845 -0.9999217 3.2884297 6.881124 1
-61.951996 -8.960198 -1.5915259 5.6160254 1
-108.19452 13.909201 0.6966458 -1.956591 0
-97.4037 22.897585 -2.8488266 1.4105041 0
-92.641335 22.10624 -3.5110545 2.467166 0
-199.18787 3.3090565 -2.5994794 4.0802555 0
-137.5976 6.795896 1.6793671 2.2256763 0
-208.0035 -1.33229 -3.2078092 1.5177402 0
-108.225975 14.341716 1.02891 -1.8651972 0
-121.29299 18.274035 2.2891548 2.3360753 0
I want to sort the rows by the values in the "Labels" column, in a custom order.
I am able to sort in ascending order, so that the labels appear as [0 1 2], with the command
df2 = df1.sort_values(by='Labels', ascending=True)
With ascending=False, the labels appear as [2 1 0].
How do I sort the labels as [1 0 2]?
Any help will be greatly appreciated!
Here's a way using Categorical:
df['Labels'] = pd.Categorical(df['Labels'],
                              categories=[1, 0, 2],
                              ordered=True)
df.sort_values('Labels')
Output:
Feat1 Feat2 Feat3 Feat4 Labels
11 -100.685440 -4.581096 1.697769 1.880162 1
15 -61.951996 -8.960198 -1.591526 5.616025 1
8 -97.898920 -0.381567 6.416338 7.234347 1
9 -81.963250 0.182172 -1.287036 4.703838 1
10 -78.419860 -6.766374 0.800118 0.834449 1
14 -94.504845 -0.999922 3.288430 6.881124 1
12 -87.054120 -2.923158 6.817379 5.446008 1
13 -64.121056 -3.789221 -0.283514 6.308415 1
21 -208.003500 -1.332290 -3.207809 1.517740 0
20 -137.597600 6.795896 1.679367 2.225676 0
19 -199.187870 3.309057 -2.599479 4.080255 0
18 -92.641335 22.106240 -3.511055 2.467166 0
17 -97.403700 22.897585 -2.848827 1.410504 0
16 -108.194520 13.909201 0.696646 -1.956591 0
23 -121.292990 18.274035 2.289155 2.336075 0
22 -108.225975 14.341716 1.028910 -1.865197 0
7 -24.469942 27.005375 -9.301921 4.399504 2
6 -32.305700 26.568039 -9.470180 3.453279 2
5 -42.884900 28.038020 -7.857254 3.336100 2
4 -37.075096 22.347767 -3.860426 5.695394 2
3 -35.296390 21.709473 -4.160352 5.578346 2
2 -42.092365 25.680704 -5.009290 5.665794 2
1 -23.806690 20.536781 -5.015675 4.221635 2
0 -46.220314 22.862856 -6.157307 5.606041 2
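In the output above, the rows within each label no longer appear in their original order, which is what an unstable sort algorithm can produce. If the original order within each label matters, a stable algorithm can be requested. A small sketch with made-up data follows; kind='mergesort' is one of the stable options that sort_values accepts:

```python
import pandas as pd

df = pd.DataFrame({"Feat1": [10, 20, 30, 40, 50, 60],
                   "Labels": [2, 1, 0, 1, 0, 2]})

# Ordered categorical: 1 sorts first, then 0, then 2.
df["Labels"] = pd.Categorical(df["Labels"], categories=[1, 0, 2], ordered=True)

# A stable sort keeps the original row order inside each label.
out = df.sort_values("Labels", kind="mergesort")
print(out)
```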
You can use an ordered Categorical, or if you don't want to change the DataFrame, the poor-man's variant, a mapping Series:
order = [1, 0, 2]
key = pd.Series({k:v for v,k in enumerate(order)}).get
# or
# pd.Series(range(len(order)), index=order).get
df1.sort_values(by='Labels', key=key)
Example:
df1 = pd.DataFrame({'Labels': [1,0,1,2,0,2,1]})
order = [1, 0, 2]
key = pd.Series({k:v for v,k in enumerate(order)}).get
print(df1.sort_values(by='Labels', key=key))
Labels
0 1
2 1
6 1
1 0
4 0
3 2
5 2
Here is another way to do it: create a new column with map that encodes the desired order, then sort on it as usual.
df['sort_label'] = df['Labels'].map({1: 0, 0: 1, 2: 2})
df.sort_values('sort_label')
Feat1 Feat2 Feat3 Feat4 Labels sort_label
11 -100.685440 -4.581096 1.697769 1.880162 1 0
15 -61.951996 -8.960198 -1.591526 5.616025 1 0
8 -97.898920 -0.381567 6.416338 7.234347 1 0
9 -81.963250 0.182172 -1.287036 4.703838 1 0
10 -78.419860 -6.766374 0.800119 0.834449 1 0
14 -94.504845 -0.999922 3.288430 6.881124 1 0
12 -87.054120 -2.923158 6.817379 5.446008 1 0
13 -64.121056 -3.789221 -0.283514 6.308415 1 0
21 -208.003500 -1.332290 -3.207809 1.517740 0 1
20 -137.597600 6.795896 1.679367 2.225676 0 1
19 -199.187870 3.309057 -2.599479 4.080255 0 1
18 -92.641335 22.106240 -3.511054 2.467166 0 1
17 -97.403700 22.897585 -2.848827 1.410504 0 1
16 -108.194520 13.909201 0.696646 -1.956591 0 1
23 -121.292990 18.274035 2.289155 2.336075 0 1
22 -108.225975 14.341716 1.028910 -1.865197 0 1
7 -24.469942 27.005375 -9.301921 4.399504 2 2
6 -32.305700 26.568039 -9.470180 3.453279 2 2
5 -42.884900 28.038020 -7.857254 3.336100 2 2
4 -37.075096 22.347767 -3.860426 5.695394 2 2
3 -35.296390 21.709473 -4.160352 5.578346 2 2
2 -42.092365 25.680704 -5.009290 5.665794 2 2
1 -23.806690 20.536781 -5.015675 4.221635 2 2
0 -46.220314 22.862856 -6.157307 5.606041 2 2
I have a df as shown below
df:
Id Jan20 Feb20 Mar20 Apr20 May20 Jun20 Jul20 Aug20 Sep20 Oct20 Nov20 Dec20 Amount
1 20 0 0 12 1 3 1 0 0 2 2 0 100
2 0 0 2 1 0 2 0 0 1 0 0 0 500
3 1 2 1 2 3 1 1 2 2 3 1 1 300
From the above, I would like to calculate an Activeness value, which is the number of non-zero values across the month columns listed below.
'Jan20', 'Feb20', 'Mar20', 'Apr20', 'May20', 'Jun20', 'Jul20',
'Aug20', 'Sep20', 'Oct20', 'Nov20', 'Dec20'
Expected Output:
Id Jan20 Feb20 Mar20 Apr20 May20 Jun20 Jul20 Aug20 Sep20 Oct20 Nov20 Dec20 Amount Activeness
1 20 0 0 12 1 3 1 0 0 2 2 0 100 7
2 0 0 2 1 0 2 0 0 1 0 0 0 500 4
3 1 2 1 2 3 1 1 2 2 3 1 1 300 12
I tried the code below:
df['Activeness'] = pd.Series(index=df.index, data=np.count_nonzero(
    df[['Jan20', 'Feb20', 'Mar20', 'Apr20', 'May20', 'Jun20', 'Jul20',
        'Aug20', 'Sep20', 'Oct20', 'Nov20', 'Dec20']], axis=1))
It works well, but I would like to know whether there is a faster method.
You can try:
df['Activeness'] = df.filter(like='20').ne(0).sum(axis=1)
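Note that filter(like='20') selects every column whose name contains '20', so an unrelated column with such a name would be counted too. A sketch that passes the month columns explicitly instead, using the sample data from the question (month_cols is an illustrative name):

```python
import pandas as pd

month_cols = ["Jan20", "Feb20", "Mar20", "Apr20", "May20", "Jun20",
              "Jul20", "Aug20", "Sep20", "Oct20", "Nov20", "Dec20"]
rows = [
    [1, 20, 0, 0, 12, 1, 3, 1, 0, 0, 2, 2, 0, 100],
    [2, 0, 0, 2, 1, 0, 2, 0, 0, 1, 0, 0, 0, 500],
    [3, 1, 2, 1, 2, 3, 1, 1, 2, 2, 3, 1, 1, 300],
]
df = pd.DataFrame(rows, columns=["Id"] + month_cols + ["Amount"])

# ne(0) gives a boolean frame; summing along axis=1 counts the True values per row.
df["Activeness"] = df[month_cols].ne(0).sum(axis=1)
print(df[["Id", "Activeness"]])
```

Whether this is faster than np.count_nonzero on a given dataset is something to measure rather than assume.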
I have a table like this
times v2
0 4 10
1 2 20
2 0 30/n30
3 1 40
4 0 9
What I want is to change the values of v2 when times != 0; the change consists of appending "\n0" as many times as the times column says.
times v2
0 4 10\n0\n0\n0\n0
1 2 20\n0\n0
2 0 30\n30
3 1 40\n0
4 0 9
You can do
df['v2'] += df['times'].map(lambda x: x * "\n0")
df
Out[325]:
times v2
0 4 10\n0\n0\n0\n0
1 2 20\n0\n0
2 0 30/n30
3 1 40\n0
4 0 9
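A self-contained sketch of the same idea, assuming v2 holds strings (concatenating a string onto a numeric column would fail, hence the astype(str)). The sample values follow the expected output in the question, and suffix is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({"times": [4, 2, 0, 1, 0],
                   "v2": ["10", "20", "30\n30", "40", "9"]})

# "\n0" * n is the empty string when n == 0, so those rows stay unchanged.
suffix = df["times"].map(lambda n: "\n0" * n)
df["v2"] = df["v2"].astype(str) + suffix
print(df["v2"].tolist())
```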
I have data in Excel with two columns, 'Peak Value' and 'Label'. I want to fill in the 'Label' column based on the 'Peak Value' column.
The input looks like this:
Peak Value  Label
0           0
0           0
0           0
88          0
0           0
0           0
88          0
0           0
0           0
88          0
0           0
Whenever the value in 'Peak Value' is greater than zero, 'Label' should be incremented by 1, and that value should replace all the zeros below it. At the next value greater than zero the label should be incremented to 2, replacing the zeros below it with 2.
So, the output will look like this:
Peak Value  Label
0           0
0           0
0           0
88          1
0           1
0           1
88          2
0           2
0           2
88          3
0           3
and so on....
I tried writing a function, but I am only able to set 1 where the value in 'Peak Value' is greater than 0.
def funct(row):
    if row['Peak Value'] > 0:
        val = 1
    else:
        val = 0
    return val

df['Label'] = df.apply(funct, axis=1)
Maybe you could try using cumsum and ffill:
import numpy as np
df['Labels'] = (df['Peak Value'] > 0).groupby(df['Peak Value']).cumsum()
df['Labels'] = df['Labels'].replace(0, np.nan).ffill().replace(np.nan, 0).astype(int)
Output:
Peak Value Labels
0 0 0
1 0 0
2 0 0
3 88 1
4 0 1
5 0 1
6 88 2
7 0 2
8 0 2
9 88 3
10 0 3
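The grouped cumsum above relies on every peak having the same value (88 in the sample); peaks with different values would fall into different groups and be counted separately. A simpler sketch, under the assumption that any value greater than zero starts a new label, takes the cumulative sum of the boolean mask directly:

```python
import pandas as pd

# Peaks deliberately have different heights here.
df = pd.DataFrame({"Peak Value": [0, 0, 0, 88, 0, 0, 75, 0, 0, 91, 0]})

# True at each peak; the running total of True values is the label.
df["Label"] = (df["Peak Value"] > 0).cumsum()
print(df["Label"].tolist())
```

No replace or ffill step is needed here, because the running total already carries each label forward until the next peak.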