Is there any way to make more than one dummy variable at a time? [duplicate] - python-3.x

How can one idiomatically run a function like get_dummies, which expects a single column and returns several, on multiple DataFrame columns?

With pandas 0.19, you can do that in a single line:
pd.get_dummies(data=df, columns=['A', 'B'])
The columns argument specifies which columns to one-hot encode.
>>> df
A B C
0 a c 1
1 b c 2
2 a b 3
>>> pd.get_dummies(data=df, columns=['A', 'B'])
C A_a A_b B_b B_c
0 1 1.0 0.0 0.0 1.0
1 2 0.0 1.0 0.0 1.0
2 3 1.0 0.0 1.0 0.0
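As a small runnable sketch of the call above: the default dtype of the dummy columns has changed across pandas releases (the output above shows floats), so this version passes dtype=int explicitly. That assumes a pandas version in which get_dummies accepts the dtype argument.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'],
                   'B': ['c', 'c', 'b'],
                   'C': [1, 2, 3]})

# Encode only A and B; C passes through unchanged.
# dtype=int asks for plain 0/1 integers instead of the version-dependent default.
encoded = pd.get_dummies(df, columns=['A', 'B'], dtype=int)
print(encoded)
```

The untouched column C comes first, followed by the dummy columns in the order the encoded columns were listed.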

Since pandas version 0.15.0, pd.get_dummies can handle a DataFrame directly (before that, it could only handle a single Series; see below for the workaround):
In [1]: df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
   ...:                    'C': [1, 2, 3]})
In [2]: df
Out[2]:
A B C
0 a c 1
1 b c 2
2 a b 3
In [3]: pd.get_dummies(df)
Out[3]:
C A_a A_b B_b B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
Workaround for pandas < 0.15.0
You can do it for each column separately and then concat the results:
In [111]: df
Out[111]:
A B
0 a x
1 a y
2 b z
3 b x
4 c x
5 a y
6 b y
7 c z
In [112]: pd.concat([pd.get_dummies(df[col]) for col in df], axis=1, keys=df.columns)
Out[112]:
A B
a b c x y z
0 1 0 0 1 0 0
1 1 0 0 0 1 0
2 0 1 0 0 0 1
3 0 1 0 1 0 0
4 0 0 1 1 0 0
5 1 0 0 0 1 0
6 0 1 0 0 1 0
7 0 0 1 0 0 1
If you don't want the multi-index column, then remove the keys=.. from the concat function call.
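One thing to watch when dropping keys: if two source columns share a category value, the flat dummy columns end up with the same name. A possible sketch, assuming the prefix argument of get_dummies is available in your version, is to prefix each block with its source column name:

```python
import pandas as pd

df = pd.DataFrame({'A': list('aabbcabc'),
                   'B': list('xyzxxyyz')})

# One get_dummies call per column, each prefixed with the column name,
# then glue the blocks together side by side.
flat = pd.concat(
    [pd.get_dummies(df[col], prefix=col) for col in df],
    axis=1,
)
print(flat)
```

This gives single-level column names such as A_a and B_x, similar to what get_dummies produces on a whole DataFrame.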

Somebody may have something more clever, but here are two approaches. Assuming you have a dataframe named df with columns 'Name' and 'Year' you want dummies for.
First, simply iterating over the columns isn't too bad:
In [93]: for column in ['Name', 'Year']:
    ...:     dummies = pd.get_dummies(df[column])
    ...:     df[dummies.columns] = dummies
Another idea would be to use the patsy package, which is designed to construct data matrices from R-type formulas.
In [94]: patsy.dmatrix(' ~ C(Name) + C(Year)', df, return_type="dataframe")

Unless I don't understand the question, it is supported natively in get_dummies by passing the columns argument.

The simple trick I am currently using is a for loop.
First, separate the categorical data from the DataFrame with select_dtypes(include="object"),
then apply get_dummies to each column in turn inside the loop,
as shown in the code below:
# copy so the new columns are added to independent frames, not to slices
train_cate = train_data.select_dtypes(include="object").copy()
test_cate = test_data.select_dtypes(include="object").copy()
# vectorize categorical data
for col in train_cate:
    cate1 = pd.get_dummies(train_cate[col])
    train_cate[cate1.columns] = cate1
    cate2 = pd.get_dummies(test_cate[col])
    test_cate[cate2.columns] = cate2
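When train and test are encoded separately like this, the two frames can end up with different dummy columns if a category appears in only one of them. A minimal sketch of one way to line them up, using made-up toy frames (the column names 'color' and 'size' are purely illustrative) and assuming get_dummies accepts dtype:

```python
import pandas as pd

# Hypothetical toy data; not from the original post.
train_data = pd.DataFrame({'color': ['red', 'blue', 'red'], 'size': [1, 2, 3]})
test_data = pd.DataFrame({'color': ['green', 'blue'], 'size': [4, 5]})

train_enc = pd.get_dummies(train_data, dtype=int)
test_enc = pd.get_dummies(test_data, dtype=int)

# Give the test frame exactly the training columns:
# categories missing from test become all-zero columns,
# categories never seen in training ('green') are dropped.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```

Whether dropping unseen categories is acceptable depends on the model, so treat this as one option rather than the only one.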

Related

How to perform cumulative sum inside iterrows

I have a pandas dataframe as below:
df2 = pd.DataFrame({ 'b' : [1, 1, 1]})
df2
b
0 1
1 1
2 1
I want to create a column 'cumsum' with the cumulative sum of column b, starting at the second row. I also want to use iterrows to do this. I tried the code below, but it does not seem to work.
for row_index, row in df2.iloc[1:].iterrows():
    df2.loc[row_index, 'cumsum'] = df2.loc[row_index, 'b'].cumsum()
My expected output:
b cum_sum
0 1 NaN
1 1 2
2 1 3
To meet your requirement, you can try this:
for row_index, row in df2.iloc[1:].iterrows():
    df2.loc[row_index, 'cumsum'] = df2.loc[:row_index, 'b'].sum()
Out[10]:
b cumsum
0 1 NaN
1 1 2.0
2 1 3.0
To stick to iterrows():
i=0
df2['cumsum']=0
col=list(df2.columns).index('cumsum')
for row_index, row in df2.iloc[1:].iterrows():
    df2.loc[row_index, 'cumsum'] = df2.loc[row_index, 'b']+df2.iloc[i, col]
    i+=1
Outputs:
b cumsum
0 1 0
1 1 1
2 1 2
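For comparison only, since the question insists on iterrows: the expected output (NaN in the first row, then 2 and 3) can also be produced without a loop. This is a sketch of a different technique, a vectorized cumsum followed by masking the first row, not the iterrows approach asked for.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'b': [1, 1, 1]})

# Running total over the whole column ...
running = df2['b'].cumsum()
# ... then keep it only from the second row onward (row 0 becomes NaN).
df2['cumsum'] = running.where(np.arange(len(df2)) >= 1)
print(df2)
```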

Can I use lambda inside df.apply() to insert 1s into dataframe where combination of the index and column names are in another column of the dataframe?

I have this dataframe:
In [6]: import pandas as pd
In [7]: import numpy as np
In [8]: df = pd.DataFrame(data = np.nan,
...: columns = ['A', 'B', 'C', 'D', 'E'],
...: index = ['A', 'B', 'C', 'D', 'E'])
...:
...: df['list_of_codes'] = [['A' , 'B'],
...: ['A', 'B', 'E'],
...: ['C', 'D'],
...: ['B', 'D'],
...: ['E']]
...:
...: df
Out[8]:
A B C D E list_of_codes
A NaN NaN NaN NaN NaN [A, B]
B NaN NaN NaN NaN NaN [A, B, E]
C NaN NaN NaN NaN NaN [C, D]
D NaN NaN NaN NaN NaN [B, D]
E NaN NaN NaN NaN NaN [E]
And now I want to insert a '1' where both the index and column name are present inside of the list in the column df['list_of_codes']. The result would look like this:
A B C D E list_of_codes
A 1 1 0 0 0 [A, B]
B 1 1 0 0 1 [A, B, E]
C 0 0 1 1 0 [C, D]
D 0 1 0 1 0 [B, D]
E 0 0 0 0 1 [E]
I have tried something like this:
df.apply(lambda x: 1 if x[:-1] in (x[-1]) else 0, axis=1, result_type='broadcast')
but get the error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I don't think I understand this error exactly but then I try:
df.apply(lambda x: 1 if x[:-1].any() in (x[-1]) else 0, axis=1, result_type='broadcast')
This runs but does not give me the desired result. Instead it returns:
A B C D E list_of_codes
A 0 0 0 0 0 0
B 0 0 0 0 0 0
C 0 0 0 0 0 0
D 0 0 0 0 0 0
E 0 0 0 0 0 0
Can someone help me understand what I need in my pd.apply() and lambda functions in order to broadcast the '1's in the way that I am trying to? Thanks in advance!
IIUC, you can use Series.explode and then Series.str.get_dummies to build the indicator columns. Finally, groupby.max collapses the result back to one row per index label, which we assign to the original dataframe:
df = df.assign(**df['list_of_codes'].explode()
.str.get_dummies()
.groupby(level=0).max())
print(df)
Output
A B C D E list_of_codes
A 1 1 0 0 0 [A, B]
B 1 1 0 0 1 [A, B, E]
C 0 0 1 1 0 [C, D]
D 0 1 0 1 0 [B, D]
E 0 0 0 0 1 [E]
Alternative without explode
df = df.assign(**pd.DataFrame(df['list_of_codes'].tolist(),
index = df.index).stack()
.str.get_dummies()
.groupby(level=0)
.max())
EDIT
I think explode is somewhat faster, since the alternative above creates a dataframe and then uses stack. We can rely on this post (SO explode) to perform the explode step. We can also use the level argument of sum instead of groupby. We'll try to explode using another method from that post and find the one that performs best.
s = df['list_of_codes']
index = df.index
# note: sum(level=0) was removed in pandas 2.0; use .groupby(level=0).sum() there
df[index] = pd.get_dummies(pd.Series(data = np.concatenate(s.values),
                                     index = index.repeat(s.str.len()))).sum(level=0)
Another approach with pd.Index.isin:
index=df.index
df[index] = [index.isin(l).astype(int) for l in df['list_of_codes']]
I think it could be the fastest
We could also consider writing only true or false. It would be faster.
index=df.index
df[index] = [index.isin(l) for l in df['list_of_codes']]
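For readers who want the apply/lambda form the question asked about: as far as I can tell, the original attempt fails because x[:-1] in x[-1] compares a whole Series against each list element, which yields another Series whose truth value is ambiguous. A sketch that tests each column label individually instead (the names cols and flags are mine, not from the post):

```python
import numpy as np
import pandas as pd

cols = ['A', 'B', 'C', 'D', 'E']
df = pd.DataFrame(data=np.nan, columns=cols, index=cols)
df['list_of_codes'] = [['A', 'B'], ['A', 'B', 'E'], ['C', 'D'], ['B', 'D'], ['E']]

# For each row, check every column label against that row's list.
# The lambda returns one Series per row, so apply builds a DataFrame.
flags = df.apply(
    lambda row: pd.Series([int(c in row['list_of_codes']) for c in cols], index=cols),
    axis=1,
)
df[cols] = flags
print(df)
```

This is likely slower than the isin version above on large frames, because it runs Python code for every cell.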
I cannot comment (I have less than 50 reputation), but I did test ansev's solution on a 15000*15000 df. Here is how I built the test df:
import numpy as np
import pandas as pd
nelem = 15000
elements = range(nelem)
x=np.random.randint(low=1, high=len(elements), size=nelem)
list_of_codes=[]
for i in range(nelem):
    list_of_codes.append(np.random.choice(elements,size=x[i]))
df = pd.DataFrame(data = {"list_of_codes":list_of_codes})
for x in elements:
    df[x]=np.nan
I tested it on Colab, and it gave me this result:
%timeit df[index] = [index.isin(l) for l in df['list_of_codes']]
The slowest run took 26.21 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 3.04 s per loop
So ansev's solution does work in your case.

iterating over a list of columns in pandas dataframe

I have a dataframe like the one below. I want to update the values of columns C, D, and E based on columns A and B.
If A < B, then C, D, E = A; otherwise B. I tried the code below, but I get the error ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
import pandas as pd
import math
import sys
import re
data=[[0,1,0,0, 0],
[1,2,0,0,0],
[2,0,0,0,0],
[2,4,0,0,0],
[1,8,0,0,0],
[3,2, 0,0,0]]
df
Out[59]:
A B C D E
0 0 1 0 0 0
1 1 2 0 0 0
2 2 0 0 0 0
3 2 4 0 0 0
4 1 8 0 0 0
5 3 2 0 0 0
df = pd.DataFrame(data,columns=['A','B','C', 'D','E'])
list_1 = ['C', 'D', 'E']
for i in df[list_1]:
    if df['A'] < df['B']:
        df[i] = df['A']
    else:
        df['i'] = df['B']
I'm expecting below output:
df
Out[59]:
A B C D E
0 0 1 0 0 0
1 1 2 1 1 1
2 2 0 0 0 0
3 2 4 2 2 2
4 1 8 1 1 1
5 3 2 2 2 2
np.where
Return elements are chosen from A or B depending on condition.
df.assign
Assign new columns to a DataFrame.
Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.
nums = np.where(df.A < df.B, df.A, df.B)
df = df.assign(C=nums, D=nums, E=nums)
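Since the question is about iterating over a list of columns, here is a self-contained sketch of the same np.where idea written as a loop over list_1. The name smaller is mine; the data is the sample from the question.

```python
import numpy as np
import pandas as pd

data = [[0, 1, 0, 0, 0],
        [1, 2, 0, 0, 0],
        [2, 0, 0, 0, 0],
        [2, 4, 0, 0, 0],
        [1, 8, 0, 0, 0],
        [3, 2, 0, 0, 0]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E'])
list_1 = ['C', 'D', 'E']

# Row-wise choice: A where A < B, otherwise B (computed once, outside the loop).
smaller = np.where(df['A'] < df['B'], df['A'], df['B'])
for col in list_1:
    df[col] = smaller
print(df)
```

The condition is evaluated element by element, so no if statement on a whole Series is needed.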
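Use DataFrame.mask. Note that this only overwrites rows where B > A and leaves the other rows unchanged, so row 5 keeps 0 in the output below: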
df.loc[:,df.columns != 'B']=df.loc[:,df.columns != 'B'].mask(df['B']>df['A'],df['A'],axis=0)
print(df)
A B C D E
0 0 1 0 0 0
1 1 2 1 1 1
2 2 0 0 0 0
3 2 4 2 2 2
4 1 8 1 1 1
5 3 2 0 0 0
Personally, I always use .apply to modify columns based on other columns:
list_1 = ['C', 'D', 'E']
for i in list_1:
    # column labels are upper case, so use x.A and x.B
    df[i]=df.apply(lambda x: x.A if x.A<x.B else x.B, axis=1)
I don't know what you are trying to achieve here, because the condition df['A'] < df['B'] returns the same output on every iteration of your loop. Just for the sake of understanding:
When you do if df['A'] < df['B']:
The if condition expects a Boolean, but df['A'] < df['B'] gives a Series of Boolean values. So, it says either use something like
if (df['A'] < df['B']).all():
OR
if (df['A'] < df['B']).any():
What I would do is I would only create a DataFrame with columns 'A' and 'B', and then create column 'C' in the following way:
df['C'] = df.min(axis=1)
Columns 'D' and 'E' seem to be redundant.
If you have to start with all the columns and need to have all of them as output then you can do the following:
df['C'] = df[['A', 'B']].min(axis=1)
df['D'] = df['C']
df['E'] = df['C']
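To make the error explanation above concrete, here is a small sketch that triggers the ValueError on purpose and then shows what .any() and .all() return for the question's A and B columns. The variable names cond and message are mine.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 2, 1, 3],
                   'B': [1, 2, 0, 4, 8, 2]})

cond = df['A'] < df['B']   # a boolean Series, one value per row

message = None
try:
    if cond:               # asks for a single True/False from six values
        pass
except ValueError as err:
    message = str(err)

print(message)
print(cond.any())          # True: at least one row has A < B
print(cond.all())          # False: rows 2 and 5 have A >= B
```

Neither .any() nor .all() gives a per-row decision, which is why the row-wise min above is the better fit for this problem.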
You can use the function where in numpy:
df.loc[:,'C':'E'] = np.where(df['A'] < df['B'], df['A'], df['B']).reshape(-1, 1)

Placing n rows of a pandas dataframe into their own dataframe

I have a large dataframe with many rows and columns.
An example of the structure is:
a = np.random.rand(6,3)
df = pd.DataFrame(a)
I'd like to split the DataFrame into separate data frames, each consisting of 3 rows.
You can use groupby:
g = df.groupby(np.arange(len(df)) // 3)
for n, grp in g:
    print(grp)
0 1 2
0 0.278735 0.609862 0.085823
1 0.836997 0.739635 0.866059
2 0.691271 0.377185 0.225146
0 1 2
3 0.435280 0.700900 0.700946
4 0.796487 0.018688 0.700566
5 0.900749 0.764869 0.253200
To get it into a handy dictionary:
mydict = {k: v for k, v in g}
You can use numpy.split() method:
In [8]: df = pd.DataFrame(np.random.rand(9, 3))
In [9]: df
Out[9]:
0 1 2
0 0.899366 0.991035 0.775607
1 0.487495 0.250279 0.975094
2 0.819031 0.568612 0.903836
3 0.178399 0.555627 0.776856
4 0.498039 0.733224 0.151091
5 0.997894 0.018736 0.999259
6 0.345804 0.780016 0.363990
7 0.794417 0.518919 0.410270
8 0.649792 0.560184 0.054238
In [10]: for x in np.split(df, len(df)//3):
    ...:     print(x)
    ...:
0 1 2
0 0.899366 0.991035 0.775607
1 0.487495 0.250279 0.975094
2 0.819031 0.568612 0.903836
0 1 2
3 0.178399 0.555627 0.776856
4 0.498039 0.733224 0.151091
5 0.997894 0.018736 0.999259
0 1 2
6 0.345804 0.780016 0.363990
7 0.794417 0.518919 0.410270
8 0.649792 0.560184 0.054238
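As far as I know, np.split with an integer number of sections expects the rows to divide evenly, so it is awkward when the length is not a multiple of 3. A sketch of a plain slicing alternative (a different technique from the two answers above) that leaves a shorter final chunk; n and chunks are names I picked:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(7, 3))   # 7 rows: not a multiple of 3
n = 3

# Step through the row positions n at a time and slice with iloc.
chunks = [df.iloc[start:start + n] for start in range(0, len(df), n)]

for chunk in chunks:
    print(chunk)
```

Each chunk keeps its original index labels, just like the groupby version.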

merging multiple columns in a dataframe

I have data frame like this one:
dataf = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': ['c', 'c',np.nan]})
get_dummies(df):
A_a A_b B_a B_b B_c C_c
0 1 0 0 1 0 1
1 0 1 1 0 0 1
2 1 0 0 0 1 0
I want all common attributes of dataframe to be in one column. Here for attribute 'a' we have two columns i.e. A_a & B_a. I want that in one column with name 'a' and values as UNION of A_a & B_a. And it should be applicable to all similar attributes. It should look like:
a b c
0 1 1 1
1 1 1 1
2 1 0 1
In original, I have hundreds of thousands of attributes in million+ rows. Therefore a generic formula will work. Thanks.
You can add parameters prefix and prefix_sep to get_dummies and then groupby by columns with sum:
import pandas as pd
import numpy as np
import io
dataf = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': ['c', 'c',np.nan]})
print(dataf)
A B C
0 a b c
1 b a c
2 a c NaN
df = pd.get_dummies(dataf, prefix="", prefix_sep="")
print(df)
a b a b c c
0 1 0 0 1 0 1
1 0 1 1 0 0 1
2 1 0 0 0 1 0
print(df.groupby(df.columns, axis=1).sum())
a b c
0 1 1 1
1 1 1 1
2 1 0 1
EDIT by comment, thank you John Galt:
If values have length 1 (as in the sample):
df = pd.get_dummies(dataf)
print(df)
A_a A_b B_a B_b B_c C_c
0 1 0 0 1 0 1
1 0 1 1 0 0 1
2 1 0 0 0 1 0
print(df.groupby(df.columns.str[-1:], axis=1).any().astype(int))
a b c
0 1 1 1
1 1 1 1
2 1 0 1
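To my knowledge, recent pandas versions warn about groupby(..., axis=1), so here is a hedged sketch of the same union written with a transpose instead. It assumes get_dummies accepts dtype, and it uses max rather than sum so that a value appearing in two source columns still yields 1.

```python
import numpy as np
import pandas as pd

dataf = pd.DataFrame({'A': ['a', 'b', 'a'],
                      'B': ['b', 'a', 'c'],
                      'C': ['c', 'c', np.nan]})

# Empty prefix and separator: dummy columns are named after the values only,
# so the same value coming from different columns gets the same label.
dummies = pd.get_dummies(dataf, prefix="", prefix_sep="", dtype=int)

# Transpose, merge rows that share a label with max (logical OR for 0/1),
# then transpose back.
result = dummies.T.groupby(level=0).max().T
print(result)
```

With millions of rows the two transposes copy the data, so it may be worth timing this against the axis=1 version on your pandas version.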
