merging multiple columns in a dataframe - python-3.x

I have a data frame like this one:
dataf = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': ['c', 'c', np.nan]})
pd.get_dummies(dataf) gives:
A_a A_b B_a B_b B_c C_c
0 1 0 0 1 0 1
1 0 1 1 0 0 1
2 1 0 0 0 1 0
I want all common attributes of the dataframe to be in one column. Here, for attribute 'a' we have two columns, A_a and B_a. I want those in a single column named 'a' whose values are the UNION of A_a and B_a, and the same for all other repeated attributes. It should look like:
a b c
0 1 1 1
1 1 1 1
2 1 0 1
In the original data I have hundreds of thousands of attributes across a million+ rows, so a generic solution is needed. Thanks.

You can pass the prefix and prefix_sep parameters to get_dummies and then group the columns by name along axis=1 with sum:
import pandas as pd
import numpy as np

dataf = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': ['c', 'c', np.nan]})
print(dataf)
A B C
0 a b c
1 b a c
2 a c NaN
df = pd.get_dummies(dataf, prefix="", prefix_sep="")
print(df)
a b a b c c
0 1 0 0 1 0 1
1 0 1 1 0 0 1
2 1 0 0 0 1 0
print(df.groupby(df.columns, axis=1).sum())
a b c
0 1 1 1
1 1 1 1
2 1 0 1
EDIT by comment, thank you John Galt:
If the values have length 1 (as in the sample):
df = pd.get_dummies(dataf)
print(df)
A_a A_b B_a B_b B_c C_c
0 1 0 0 1 0 1
1 0 1 1 0 0 1
2 1 0 0 0 1 0
print(df.groupby(df.columns.str[-1:], axis=1).any().astype(int))
a b c
0 1 1 1
1 1 1 1
2 1 0 1
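Note that sum counts occurrences, so if a row ever contained the value 'a' in both A and B, the merged column would hold 2 rather than 1; the .any() variant above (or .gt(0)) gives a true union. Also, groupby(..., axis=1) is deprecated in recent pandas versions, so there you may need to group the transposed frame instead. A minimal sketch of both points, assuming the same dataf as above:
import pandas as pd
import numpy as np

dataf = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': ['c', 'c', np.nan]})
df = pd.get_dummies(dataf, prefix="", prefix_sep="")

# group the transposed frame by its index (the original dummy names),
# then transpose back; .gt(0) turns counts into a 0/1 union
print(df.T.groupby(level=0).sum().T.gt(0).astype(int))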

Related

How do you maintain in Python the value of the last row in a column, like in Excel?

I have looked around and haven't found an 'elegant' solution. It can't be that this is not doable.
What I need is a column ('col A') on a dataframe that is always 0 until the adjacent column ('col B') hits 1; then it changes to 1 and all further rows stay 1 (no matter what else happens in 'col B'), until another column ('col C') hits 1, at which point 'col A' returns to 0, and this repeats. The data has thousands of rows and gets updated regularly.
Any ideas? I have tried shift, iloc and loops, but can't make it work.
The result should look something like this:
date col A col B col C
... 0 0 0
... 0 0 0
... 1 1 0
... 1 1 0
... 1 0 1
... 0 0 0
... 0 0 0
... 1 1 0
... 1 1 0
... 1 0 0
... 1 0 0
... 1 1 0
... 1 0 0
... 1 1 0
... 1 0 1
... 0 0 0
This is the base code I have been thinking about, but I can't get it to work:
df['B'] = df['A'].apply(lambda x: 1 if x == 1 else 0)
for i in range(1, len(df)):
    if df.loc[i, 'C'] == 1:
        df.loc[i, 'B'] = 0
    else:
        df.loc[i, 'B'] = df.loc[i-1, 'B']
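One vectorized way to express this kind of latch is to mark the set events ('col B' hits 1) and the reset events, then forward-fill the state between them; in the sample, 'col A' only drops back to 0 on the row after 'col C' hits 1, so the reset is shifted down one row. A minimal sketch under those assumptions (simultaneous set and reset resolved in favour of the set), assuming df holds the 'col B' and 'col C' columns from the sample:
import numpy as np
import pandas as pd

set_on = df['col B'].eq(1)                        # latch switches on here
set_off = df['col C'].shift(fill_value=0).eq(1)   # ...and off on the row after col C == 1
state = np.where(set_on, 1.0, np.where(set_off, 0.0, np.nan))
df['col A'] = pd.Series(state, index=df.index).ffill().fillna(0).astype(int)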

Flag if there was an occurrence of a value in another column of pandas dataframe?

I have the following dataframe:
import pandas as pd
df = pd.DataFrame({'a': ['x', 'x', 'y', 'w', 'x', 'z', 'z', 'y', 'w'],
                   'Flag': [1, 0, 0, 0, 1, 0, 0, 0, 1]})
I want to add a column b that flags whether any entry with the same value of a has a Flag of 1:
a Flag b
x 1 1
x 0 1
y 0 0
w 0 1
x 1 1
z 0 0
z 0 0
y 0 0
w 1 1
What I did is: group by a, take the cumsum of Flag, and mark every entry that is > 0 with 1, 0 otherwise.
Is there any simpler method or function to do this?
You could do it with isin and .astype(int):
df['b'] = df['a'].isin(df.loc[df['Flag'].eq(1), 'a']).astype(int)
>>> df
a Flag b
0 x 1 1
1 x 0 1
2 y 0 0
3 w 0 1
4 x 1 1
5 z 0 0
6 z 0 0
7 y 0 0
8 w 1 1
Or for other situations, you might need np.where:
import numpy as np

df['b'] = np.where(df['a'].isin(df.loc[df['Flag'].eq(1), 'a']), 1, 0)
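The groupby approach described in the question also reduces to a one-liner: since Flag only holds 0 and 1, the group-wise maximum is exactly the "any entry flagged" indicator. A minimal sketch of that alternative:
df['b'] = df.groupby('a')['Flag'].transform('max')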

Is there any way to make more than one dummy variable at a time? [duplicate]

How can one idiomatically run a function like get_dummies, which expects a single column and returns several, on multiple DataFrame columns?
With pandas 0.19, you can do that in a single line:
pd.get_dummies(data=df, columns=['A', 'B'])
The columns argument specifies which columns to one-hot encode.
>>> df
A B C
0 a c 1
1 b c 2
2 a b 3
>>> pd.get_dummies(data=df, columns=['A', 'B'])
C A_a A_b B_b B_c
0 1 1.0 0.0 0.0 1.0
1 2 0.0 1.0 0.0 1.0
2 3 1.0 0.0 1.0 0.0
Since pandas version 0.15.0, pd.get_dummies can handle a DataFrame directly (before that, it could only handle a single Series; see below for the workaround):
In [1]: df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
   ...:                    'C': [1, 2, 3]})
In [2]: df
Out[2]:
A B C
0 a c 1
1 b c 2
2 a b 3
In [3]: pd.get_dummies(df)
Out[3]:
C A_a A_b B_b B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
Workaround for pandas < 0.15.0
You can do it for each column separately and then concat the results:
In [111]: df
Out[111]:
A B
0 a x
1 a y
2 b z
3 b x
4 c x
5 a y
6 b y
7 c z
In [112]: pd.concat([pd.get_dummies(df[col]) for col in df], axis=1, keys=df.columns)
Out[112]:
A B
a b c x y z
0 1 0 0 1 0 0
1 1 0 0 0 1 0
2 0 1 0 0 0 1
3 0 1 0 1 0 0
4 0 0 1 1 0 0
5 1 0 0 0 1 0
6 0 1 0 0 1 0
7 0 0 1 0 0 1
If you don't want the multi-index columns, then remove keys=df.columns from the concat call.
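If you also want to keep the original columns next to the dummies, the same concat idea extends naturally; a small sketch, assuming the df from the example above:
dummies = pd.concat([pd.get_dummies(df[col]) for col in df], axis=1)
out = pd.concat([df, dummies], axis=1)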
Somebody may have something more clever, but here are two approaches. Assuming you have a dataframe named df with columns 'Name' and 'Year' you want dummies for.
First, simply iterating over the columns isn't too bad:
In [93]: for column in ['Name', 'Year']:
    ...:     dummies = pd.get_dummies(df[column])
    ...:     df[dummies.columns] = dummies
Another idea would be to use the patsy package (import patsy), which is designed to construct data matrices from R-type formulas:
In [94]: patsy.dmatrix(' ~ C(Name) + C(Year)', df, return_type="dataframe")
Unless I don't understand the question, it is supported natively in get_dummies by passing the columns argument.
The simple trick I am currently using is a for-loop.
First separate the categorical data from the DataFrame by using select_dtypes(include="object"),
then use a for-loop to apply get_dummies to each column iteratively,
as shown in the code below:
train_cate = train_data.select_dtypes(include="object")
test_cate = test_data.select_dtypes(include="object")
# vectorize categorical data
for col in train_cate:
    cate1 = pd.get_dummies(train_cate[col])
    train_cate[cate1.columns] = cate1
    cate2 = pd.get_dummies(test_cate[col])
    test_cate[cate2.columns] = cate2
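Note that assigning into the frames returned by select_dtypes can trigger a SettingWithCopyWarning, because they may be views of the original data. The columns argument shown above avoids the loop entirely; a hedged one-liner along those lines, assuming train_data is a DataFrame whose categorical columns are object-dtyped:
cat_cols = train_data.select_dtypes(include="object").columns
train_dummies = pd.get_dummies(train_data, columns=cat_cols)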

iterating over a list of columns in pandas dataframe

I have a dataframe like the one below. I want to update the values of columns C, D and E based on columns A and B:
if A < B, then C, D, E = A, else B. I tried the code below, but I get ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
import pandas as pd

data = [[0, 1, 0, 0, 0],
        [1, 2, 0, 0, 0],
        [2, 0, 0, 0, 0],
        [2, 4, 0, 0, 0],
        [1, 8, 0, 0, 0],
        [3, 2, 0, 0, 0]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E'])
df
Out[59]:
   A  B  C  D  E
0  0  1  0  0  0
1  1  2  0  0  0
2  2  0  0  0  0
3  2  4  0  0  0
4  1  8  0  0  0
5  3  2  0  0  0
list_1 = ['C', 'D', 'E']
for i in df[list_1]:
    if df['A'] < df['B']:
        df[i] = df['A']
    else:
        df[i] = df['B']
I'm expecting the output below:
df
Out[59]:
   A  B  C  D  E
0  0  1  0  0  0
1  1  2  1  1  1
2  2  0  0  0  0
3  2  4  2  2  2
4  1  8  1  1  1
5  3  2  2  2  2
np.where
Returns elements chosen from A or B depending on a condition.
df.assign
Assigns new columns to a DataFrame.
Returns a new object with all original columns in addition to the new ones. Existing columns that are re-assigned will be overwritten.
import numpy as np

nums = np.where(df.A < df.B, df.A, df.B)
df = df.assign(C=nums, D=nums, E=nums)
Use DataFrame.mask: first fill C, D and E with B, then overwrite with A wherever A < B:
cols = ['C', 'D', 'E']
df[cols] = df[['B'] * 3].to_numpy()                           # default the target columns to B
df[cols] = df[cols].mask(df['A'] < df['B'], df['A'], axis=0)  # use A where A < B
print(df)
   A  B  C  D  E
0  0  1  0  0  0
1  1  2  1  1  1
2  2  0  0  0  0
3  2  4  2  2  2
4  1  8  1  1  1
5  3  2  2  2  2
Personally, I always use .apply to modify columns based on other columns:
list_1 = ['C', 'D', 'E']
for i in list_1:
    df[i] = df.apply(lambda x: x.A if x.A < x.B else x.B, axis=1)
I don't know what you are trying to achieve here, because the condition df['A'] < df['B'] produces the same result on every iteration of your loop. Just for the sake of understanding:
When you do if df['A'] < df['B']:
the if condition expects a single Boolean, but df['A'] < df['B'] gives a Series of Boolean values. So the error tells you to use something like
if (df['A'] < df['B']).all():
OR
if (df['A'] < df['B']).any():
What I would do is I would only create a DataFrame with columns 'A' and 'B', and then create column 'C' in the following way:
df['C'] = df.min(axis=1)
Columns 'D' and 'E' seem to be redundant.
If you have to start with all the columns and need to have all of them as output then you can do the following:
df['C'] = df[['A', 'B']].min(axis=1)
df['D'] = df['C']
df['E'] = df['C']
You can use numpy's where function:
import numpy as np

df.loc[:, 'C':'E'] = np.where(df['A'] < df['B'], df['A'], df['B']).reshape(-1, 1)

pandas if else only on specific rows

I have a pandas dataframe as below, and I want to apply the following condition:
only for rows where A == 2, update columns 'C' and 'D' to -99.
I have a function like the one below, which updates the values of C and D to -99:
def func(df):
    for col in df.columns:
        if ("C" in col) or ("D" in col):
            df.loc[:, col] = -99
Now I just want to call that function when A == 2. I tried the code below, but it updates all rows of C and D to -99:
import pandas as pd

data = [[0, 1, 0, 0, 0],
        [1, 2, 0, 0, 0],
        [2, 0, 0, 0, 0],
        [2, 4, 0, 0, 0],
        [1, 8, 0, 0, 0],
        [3, 2, 0, 0, 0]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E'])

def func(df):
    for col in df.columns:
        if ("C" in col) or ("D" in col):
            df.loc[:, col] = -99

if (df['A'] == 2).any():
    func(df)
print(df)
My expected output:
   A  B    C    D  E
0  0  1    0    0  0
1  1  2    0    0  0
2  2  0  -99  -99  0
3  2  4  -99  -99  0
4  1  8    0    0  0
5  3  2    0    0  0
You can do that by filtering:
df.loc[df['A'] == 2, ['C', 'D']] = -99
The first item inside .loc filters the rows: we select only the rows where the value in column 'A' is 2. The second item selects the columns by a list of names ('C' and 'D'). We then assign -99 to that selection.
For the given sample data, we obtain:
>>> df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E'])
>>> df
   A  B  C  D  E
0  0  1  0  0  0
1  1  2  0  0  0
2  2  0  0  0  0
3  2  4  0  0  0
4  1  8  0  0  0
5  3  2  0  0  0
>>> df.loc[df['A'] == 2, ['C', 'D']] = -99
>>> df
   A  B    C    D  E
0  0  1    0    0  0
1  1  2    0    0  0
2  2  0  -99  -99  0
3  2  4  -99  -99  0
4  1  8    0    0  0
5  3  2    0    0  0
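If the columns really do need to be matched by substring, as in the original func, the same .loc pattern combines with a list comprehension. A small sketch under that assumption:
cols = [c for c in df.columns if ("C" in c) or ("D" in c)]
df.loc[df['A'] == 2, cols] = -99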
