Cannot Reindex from a duplicate axis while subsetting - python-3.x

I have the following dataframe:
print(df)
Col Col Col Name
A B C Alex
B B C Jack
B A A Mark
I would like to get the following result, where at least one A appears:
Col Col Col Name
A B C Alex
B A A Mark
I tried:
final_df = df["Col"] == "A" but it gives me "ValueError: cannot reindex from a duplicate axis"

The problem is that you have duplicated column names, so df["Col"] selects every column called Col.
A possible solution is to compare all columns and use any to check for at least one True per row:
df = df[(df == 'A').any(axis=1)]
print (df)
  Col Col Col  Name
0   A   B   C  Alex
2   B   A   A  Mark
Details:
print((df == 'A'))
     Col    Col    Col   Name
0   True  False  False  False
1  False  False  False  False
2  False   True   True  False
print((df == 'A').any(axis=1))
0 True
1 False
2 True
dtype: bool
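For completeness, here is a self-contained sketch of the whole approach (assuming the DataFrame is built with duplicated column names as in the question; selecting the Col columns via a boolean on df.columns avoids comparing the Name column at all):
import pandas as pd

# duplicated column names, as in the question
df = pd.DataFrame([['A', 'B', 'C', 'Alex'],
                   ['B', 'B', 'C', 'Jack'],
                   ['B', 'A', 'A', 'Mark']],
                  columns=['Col', 'Col', 'Col', 'Name'])

# compare only the duplicated Col columns, then keep rows with any match
mask = (df.loc[:, df.columns == 'Col'] == 'A').any(axis=1)
print(df[mask])
#   Col Col Col  Name
# 0   A   B   C  Alex
# 2   B   A   A  Mark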

Related

Pandas dataframe deduplicate rows with column logic

I have a pandas dataframe with about 100 million rows. I am interested in deduplicating it but have some criteria that I haven't been able to find documentation for.
I would like to deduplicate the dataframe, ignoring one column that will differ. If that row is a duplicate, except for that column, I would like to only keep the row that has a specific string, say X.
Sample dataframe:
import pandas as pd
df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "00X"],
                        [1, 3, "010"],
                        [1, 2, "002"]])
Desired output:
>>> df_dedup
A B C
0 1 2 00X
1 1 3 010
So, alternatively stated, row index 2 would be removed because row index 0 has the same information in columns A and B, plus an X in column C.
As this data is slightly large, I hope to avoid iterating over rows if possible; the ignore_index option is the closest thing I've found in the built-in drop_duplicates().
If there is no X in column C then the row should require that C is identical to be deduplicated.
In the case where rows have matching A and B but multiple versions of an X in C, the following would be expected.
df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "0X0"],
                        [1, 2, "X00"],
                        [1, 2, "0X0"]])
Output should be:
>>> df_dedup
A B C
0 1 2 0X0
1 1 2 X00
Use DataFrame.duplicated on columns A and B to create a boolean mask m1 for rows whose (A, B) pair has not been seen before. Then use Series.str.contains plus Series.duplicated on column C to create a mask m2 for rows where C contains the string X and is not itself a duplicate. Finally, filter the rows of df using these masks:
m1 = ~df[['A', 'B']].duplicated()
m2 = df['C'].str.contains('X') & ~df['C'].duplicated()
df = df[m1 | m2]
Result:
#1
A B C
0 1 2 00X
1 1 3 010
#2
A B C
0 1 2 0X0
1 1 2 X00
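Putting it together as a runnable sketch, re-creating both sample frames from the question:
import pandas as pd

for data in ([[1, 2, "00X"], [1, 3, "010"], [1, 2, "002"]],
             [[1, 2, "0X0"], [1, 2, "X00"], [1, 2, "0X0"]]):
    df = pd.DataFrame(columns=["A", "B", "C"], data=data)
    m1 = ~df[['A', 'B']].duplicated()                       # first occurrence of each (A, B) pair
    m2 = df['C'].str.contains('X') & ~df['C'].duplicated()  # C has an X and is not a repeat
    print(df[m1 | m2], end='\n\n')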
Does the column "C" always have X as the last character of each value? You could try creating a column D with 1 if column C has an X or 0 if it does not. Then just sort the values using sort_values and finally use drop_duplicates with keep='last'
import pandas as pd
df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "00X"],
                        [1, 3, "010"],
                        [1, 2, "002"]])
df['D'] = 0
df.loc[df['C'].str[-1] == 'X', 'D'] = 1
df.sort_values(by=['D'], inplace=True)
df.drop_duplicates(subset=['A', 'B'], keep='last', inplace=True)
This assumes you also want to drop duplicates when there is no X in column C among the rows with duplicated A and B values.
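For reference, running the snippet above leaves the following frame; the helper column D can then be dropped:
print(df.drop(columns='D'))
#    A  B    C
# 1  1  3  010
# 0  1  2  00X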
Here is another approach. I left 'count' (a helper column) in for transparency.
# use df as defined above
# count the A,B pairs
df['count'] = df.groupby(['A', 'B']).transform('count').squeeze()
m1 = (df['count'] == 1)
m2 = (df['count'] > 1) & df['C'].str.contains('X') # could be .endswith('X')
print(df.loc[m1 | m2]) # apply masks m1, m2
A B C count
0 1 2 00X 2
1 1 3 010 1
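If the helper column is not wanted in the result, a small follow-up to the snippet above drops it after filtering:
result = df.loc[m1 | m2].drop(columns='count')
print(result)
#    A  B    C
# 0  1  2  00X
# 1  1  3  010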

Creating a new column from an existing categorical column in Python

I have a data frame like this:
ID col
1 a
2 b
3 c
4 d
I want to create a new column so that if col is a or c, the new column will contain Y, and otherwise N.
So, it will look like the following:
ID col col1
1 a Y
2 b N
3 c Y
4 d N
I am working in python3.
Try this code, simple and effective:
df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Col': ['a', 'b', 'c', 'd']})
df['Col_2'] = df.apply(lambda row: 'Y' if (row.Col == 'a' or row.Col == 'c') else 'N', axis=1)
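A more idiomatic alternative is to vectorize the test with Series.isin and numpy.where instead of a row-wise apply; a sketch using the column names from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'col': ['a', 'b', 'c', 'd']})

# vectorized: Y where col is 'a' or 'c', otherwise N
df['col1'] = np.where(df['col'].isin(['a', 'c']), 'Y', 'N')
print(df)
#    ID col col1
# 0   1   a    Y
# 1   2   b    N
# 2   3   c    Y
# 3   4   d    N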

Search for value in all DataFrame columns (except first column !) and add new column with matching column name

I'd like to do a search on all columns (except the first column !) of a DataFrame and add a new column (like 'Column_Match') with the name of the matching column.
I tried something like this:
df.apply(lambda row: row.astype(str).str.contains('my_keyword').any(), axis=1)
But it's not excluding the first column, and I don't know how to return and add the matching column name.
Any help much appreciated!
If you want the column name of the first matched value per row, add a fallback column (so rows with no match get a value) with DataFrame.assign, then use DataFrame.idxmax to get the column name:
df = pd.DataFrame({
    'B': [4, 5, 4, 5, 5, 4],
    'A': list('abcdef'),
    'C': list('akabbe'),
    'F': list('eakbbb')
})
f = lambda row: row.astype(str).str.contains('e')
df['new'] = df.iloc[:,1:].apply(f, axis=1).assign(missing=True).idxmax(axis=1)
print (df)
B A C F new
0 4 a a e F
1 5 b k a missing
2 4 c a k missing
3 5 d b b missing
4 5 e b b A
5 4 f e b C
If you need all column names of all matched values, create a boolean DataFrame and take its dot product with the column names via DataFrame.dot, then clean up with Series.str.rstrip:
f = lambda row: row.astype(str).str.contains('a')
df1 = df.iloc[:,1:].apply(f, axis=1)
df['new'] = df1.dot(df.columns[1:] + ', ').str.rstrip(', ').replace('', 'missing')
print (df)
B A C F new
0 4 a a e A, C
1 5 b k a F
2 4 c a k C
3 5 d b b missing
4 5 e b b missing
5 4 f e b missing
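Both snippets build the boolean frame row by row. Since str.contains is vectorized per column, an equivalent column-wise version (a sketch on the same sample data) is usually faster:
import pandas as pd

df = pd.DataFrame({
    'B': [4, 5, 4, 5, 5, 4],
    'A': list('abcdef'),
    'C': list('akabbe'),
    'F': list('eakbbb')
})

# apply over columns (the default axis) so str.contains runs once per column
df1 = df.iloc[:, 1:].astype(str).apply(lambda col: col.str.contains('a'))
df['new'] = df1.dot(df.columns[1:] + ', ').str.rstrip(', ').replace('', 'missing')
print(df)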

Comparing equality of groupby objects

Say we have dataframe one df1 and dataframe two df2.
import pandas as pd
dict1= {'group':['A','A','B','C','C','C'],'col2':[1,7,4,2,1,0],'col3':[1,1,3,4,5,3]}
df1 = pd.DataFrame(data=dict1).set_index('group')
dict2 = {'group':['A','A','B','C','C','C'],'col2':[1,7,400,2,1,0],'col3':[1,1,3,4,5,3500]}
df2 = pd.DataFrame(data=dict2).set_index('group')
df1
col2 col3
group
A 1 1
A 7 1
B 4 3
C 2 4
C 1 5
C 0 3
df2
col2 col3
group
A 1 1
A 7 1
B 400 3
C 2 4
C 1 5
C 0 3500
In pandas it is easy to compare the equality of these two dataframes with df1.equals(df2). In this case False.
However, we can see that some of these groups (A in the given toy example) are equal between the two dataframes and some are not (groups B and C). I want to check for equality between these groups; in other words, check the equality between the sub-dataframes with index A, B, etc.
Here is my attempt. First, group the data:
g1 = df1.groupby('group')
g2 = df2.groupby('group')
Naively trying g1.equals(g2) gives the error Cannot access callable attribute 'equals' of 'DataFrameGroupBy' objects, try using the 'apply' method.
However, if we try
g1.apply(lambda x: x.equals(g2))
we get a series:
group
A False
B False
C False
dtype: bool
However, the first entry should be True, since group A is equal between the two dataframes.
I can see that I could laboriously construct nested loops to do this, but that's slow. I feel there must be a way to do this in pandas without using loops. Am I misusing the apply method?
You can call get_group on g2 to retrieve the matching group to compare against; inside the lambda, the group name is available via the .name attribute:
In[316]:
g1.apply(lambda x: x.equals(g2.get_group(x.name)))
Out[316]:
group
A True
B False
C False
dtype: bool
EDIT
To handle non-existent groups:
In[320]:
g1.apply(lambda x: x.equals(g2.get_group(x.name)) if x.name in g2.groups else False)
Out[320]:
group
A True
B False
C False
dtype: bool
Example:
In[323]:
dict1 = {'group': ['A','A','B','C','C','C','D'], 'col2': [1,7,4,2,1,0,-1],
         'col3': [1,1,3,4,5,3,-1]}
df1 = pd.DataFrame(data=dict1).set_index('group')
g1 = df1.groupby('group')
g1.apply(lambda x: x.equals(g2.get_group(x.name)) if x.name in g2.groups else False)
Out[323]:
group
A True
B False
C False
D False
dtype: bool
Here .groups returns a dict of the groups whose keys are the group names/labels; we can test for existence using x.name in g2.groups and modify the lambda to handle non-existent groups.
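As a loop-free alternative, here is a sketch that assumes both frames have the same shape and identically ordered indexes (as in the example): compare element-wise, reduce per row, then per group:
import pandas as pd

dict1 = {'group': ['A','A','B','C','C','C'], 'col2': [1,7,4,2,1,0], 'col3': [1,1,3,4,5,3]}
df1 = pd.DataFrame(data=dict1).set_index('group')
dict2 = {'group': ['A','A','B','C','C','C'], 'col2': [1,7,400,2,1,0], 'col3': [1,1,3,4,5,3500]}
df2 = pd.DataFrame(data=dict2).set_index('group')

# element-wise equality, then "all cells equal" per row, then per group
print(df1.eq(df2).all(axis=1).groupby(level='group').all())
# group
# A     True
# B    False
# C    False
# dtype: bool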

Replace values in pandas based on two other columns

I have a problem replacing values in a column conditional on two other columns.
For example, we have three columns: A, B, and C.
Columns A and B are both booleans, containing True and False, and column C contains three values: "Payroll", "Social", and "Other".
When columns A and B are both True, column C has the value "Payroll".
I want to change values in column C where both column A and B are True.
I tried the following code:
data1.replace({'C' : { 'Payroll', 'Social'}},inplace=True).where((data1['A'] == True) & (data1['B'] == True))
but it gives me this error: "'NoneType' object has no attribute 'where'".
What can be done about this problem?
I think you need all to check whether all values per row are True, and then assign the output back through the boolean mask (your error comes from chaining .where onto replace(..., inplace=True), which returns None):
data1 = pd.DataFrame({
    'A': [True, True, True, False],
    'B': [False, True, True, False],
    'C': ['Payroll', 'Other', 'Payroll', 'Social']
})
print (data1)
A B C
0 True False Payroll
1 True True Other
2 True True Payroll
3 False False Social
m = data1[['A', 'B']].all(axis=1)
#same output as
#m = data1['A'] & data1['B']
print (m)
0 False
1 True
2 True
3 False
dtype: bool
print (data1[m])
A B C
1 True True Other
2 True True Payroll
data1[m] = data1[m].replace({'C': {'Payroll': 'Social'}})
print (data1)
A B C
0 True False Payroll
1 True True Other
2 True True Social
3 False False Social
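Equivalently, the conditional replacement can be done in a single vectorized step with DataFrame.loc, restricting it to rows where both flags hold and C is currently 'Payroll' (a sketch on the same sample data):
import pandas as pd

data1 = pd.DataFrame({
    'A': [True, True, True, False],
    'B': [False, True, True, False],
    'C': ['Payroll', 'Other', 'Payroll', 'Social']
})

# both flags True AND C is currently 'Payroll'
m = data1['A'] & data1['B'] & data1['C'].eq('Payroll')
data1.loc[m, 'C'] = 'Social'
print(data1)
#        A      B        C
# 0   True  False  Payroll
# 1   True   True    Other
# 2   True   True   Social
# 3  False  False   Social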
Alternatively, you can iterate over the rows to do this. Note that the row yielded by iterrows is a copy, so you must write back through the DataFrame (for example with .at); assigning to the row itself would change nothing:
def change_value(dataframe):
    for index, row in dataframe.iterrows():
        if row['A'] and row['B']:
            dataframe.at[index, 'C'] = 'Social'  # change to whatever value you want
        else:
            pass  # change however you want
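A quick usage sketch (assuming a fresh data1 as constructed in the first answer; note that, unlike the replace-based approach, this version rewrites C for every row where both flags hold, whatever its current value):
change_value(data1)
print(data1)
#        A      B        C
# 0   True  False  Payroll
# 1   True   True   Social
# 2   True   True   Social
# 3  False  False   Social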
