Here's my dataset
Id  Column_A  Column_B  Column_C
1   Null      7         Null
2   8         7         Null
3   Null      8         7
4   8         Null      8
Here's my expected output
         Column_A  Column_B  Column_C  Total
Null     2         1         2         5
Notnull  2         3         2         7
Assuming Null is NaN, here's one option: use isna + sum to count the NaNs per column, then subtract that from the length of df to get the Notnull counts, and construct a DataFrame from the two Series.
nulls = df.drop(columns='Id').isna().sum()    # NaN count per column
notnulls = nulls.rsub(len(df))                # len(df) - nulls
out = pd.DataFrame.from_dict({'Null': nulls, 'Notnull': notnulls}, orient='index')
out['Total'] = out.sum(axis=1)
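For a self-contained run, the sample frame can be built like this (a minimal sketch, assuming the Null entries are NaN):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'Column_A': [np.nan, 8, np.nan, 8],
    'Column_B': [7, 7, 8, np.nan],
    'Column_C': [np.nan, np.nan, 7, 8],
})
Feeding this through the snippet above reproduces the expected output.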
If you're into one-liners, we could also do:
out = (df.drop(columns='Id').isna().sum().to_frame(name='Null')
       .assign(Notnull=df.drop(columns='Id').notna().sum()).T
       .assign(Total=lambda x: x.sum(axis=1)))
Output:
         Column_A  Column_B  Column_C  Total
Null     2         1         2         5
Notnull  2         3         2         7
Use Series.value_counts on the notna mask to count missing and non-missing values:
df = (df.replace('Null', np.nan)   # no-op if Null is already NaN
        .set_index('Id')
        .notna()
        .apply(pd.Series.value_counts)
        .rename({True: 'Notnull', False: 'Null'}))
df['Total'] = df.sum(axis=1)
print(df)
         Column_A  Column_B  Column_C  Total
Null     2         1         2         5
Notnull  2         3         2         7
How do I perform the following dataframe operation, going from DataFrame A to DataFrame B in pandas for Python? I have tried pivot and groupby but I keep getting errors. Any support is greatly appreciated.
DataFrame A
Col A  Col B
100    1
100    2
200    3
200    4
DataFrame B
Col A & B
1
2
100
3
4
200
One option using groupby:
out = (df
       .groupby('Col A', group_keys=False, sort=False)
       # reverse the columns, flatten each group to a Series (Col B values
       # first, then Col A) and keep a single copy of the repeated Col A value
       .apply(lambda d: d.iloc[:, ::-1].unstack().drop_duplicates())
       .reset_index(drop=True).to_frame(name='Col A&B')
)
Another with concat:
# append each distinct Col A value after its last Col B row, then let
# sort_index interleave the two Series back in the original row order
out = (pd
       .concat([df['Col B'], df['Col A'].drop_duplicates(keep='last')])
       .sort_index().reset_index(drop=True).to_frame(name='Col A&B')
)
Output:
   Col A&B
0        1
1        2
2      100
3        3
4        4
5      200
If order does not matter, you can stack:
out = df.stack().drop_duplicates().reset_index(drop=True).to_frame(name='Col A&B')
Output:
   Col A&B
0      100
1        1
2        2
3      200
4        3
5        4
Another possible solution, using np.unique (note that it returns the values sorted, hence the different order):
out = pd.DataFrame({'Col A&B': np.unique(df)})
out
Output:
   Col A&B
0        1
1        2
2        3
3        4
4      100
5      200
I have a column of type datetime64[ns] (df.timeframe).
df has columns ['id', 'timeframe', 'type'].
df['type'] can be 'A' or 'B'.
I want to get the total number of unique dates per df.id for rows where df.type == 'A'.
I tried this:
df = df.groupby(['id', 'type']).timeframe.apply(lambda x: x.dt.date()).unique().rename('test').reset_index()
But got error:
TypeError: 'Series' object is not callable
What should I do?
The error comes from x.dt.date(): .dt.date is an attribute, not a method, so drop the parentheses. For the counting itself, you could use value_counts:
out = (df[df['type']=='A'].assign(timeframe=df['timeframe'].dt.date)
       .value_counts(['id','type','timeframe'], sort=False)
       .reset_index().rename(columns={0:'count'}))
   id type   timeframe  count
0   1    A  2022-06-06      2
1   1    A  2022-06-08      1
2   1    A  2022-06-10      2
3   2    A  2022-06-07      1
4   2    A  2022-06-09      1
5   2    A  2022-06-10      1
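If what you actually want is the total number of unique dates per id (rather than a count per date), a groupby with nunique is a possible sketch, assuming the same column names (the result name 'unique_dates' is illustrative):
out = (df[df['type'] == 'A']
       .groupby('id')['timeframe']
       .apply(lambda s: s.dt.date.nunique())   # distinct calendar dates per id
       .rename('unique_dates')
       .reset_index())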
Here's my dataset
Id  Column_A  Column_B  Column_C
1   Null      7         Null
2   8         7         Null
3   Null      8         7
4   8         Null      8
If at least one column is null, Combination should be Null:
Id  Column_A  Column_B  Column_C  Combination
1   Null      7         Null      Null
2   8         7         Null      Null
3   Null      8         7         Null
4   8         Null      8         Null
Assuming Null is NaN, we could use isna + any:
df['Combination'] = df.isna().any(axis=1).map({True: 'Null', False: 'Notnull'})
If Null is a string, we could use eq + any:
df['Combination'] = df.eq('Null').any(axis=1).map({True: 'Null', False: 'Notnull'})
Output:
   Id Column_A Column_B Column_C Combination
0   1     Null        7     Null        Null
1   2        8        7     Null        Null
2   3     Null        8        7        Null
3   4        8     Null        8        Null
Use DataFrame.isna with DataFrame.any and pass the mask to numpy.where (the second line is the variant for a literal 'Null' string):
df['Combination'] = np.where(df.isna().any(axis=1), 'Null', 'Notnull')      # NaN case
df['Combination'] = np.where(df.eq('Null').any(axis=1), 'Null', 'Notnull')  # string case
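For a quick check with the sample data stored as literal 'Null' strings, a minimal runnable sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'Column_A': ['Null', 8, 'Null', 8],
    'Column_B': [7, 7, 8, 'Null'],
    'Column_C': ['Null', 'Null', 7, 8],
})

# flag rows containing at least one literal 'Null'
df['Combination'] = np.where(df.eq('Null').any(axis=1), 'Null', 'Notnull')
print(df)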
I have two dataframes. The first one (df1) has a Multi-Index A,B.
The second one (df2) has those fields A and B as columns.
How do I filter df2 (2 million rows in each dataframe) to keep only the rows where the A and B values are not in the MultiIndex of df1?
import pandas as pd
df1 = pd.DataFrame([(1,2,3),(1,2,4),(1,2,4),(2,3,4),(2,3,1)],
columns=('A','B','C')).set_index(['A','B'])
df2 = pd.DataFrame([(7,7,1,2,3),(7,7,1,2,4),(6,6,1,2,4),
(5,5,6,3,4),(2,7,2,2,1)],
columns=('X','Y','A','B','C'))
df1:
     C
A B
1 2  3
  2  4
  2  4
2 3  4
  3  1
df2 before filtering:
   X  Y  A  B  C
0  7  7  1  2  3
1  7  7  1  2  4
2  6  6  1  2  4
3  5  5  6  3  4
4  2  7  2  2  1
df2 wanted result:
   X  Y  A  B  C
3  5  5  6  3  4
4  2  7  2  2  1
Create a MultiIndex in df2 from the A and B columns, test membership with Index.isin, and invert the resulting boolean mask with ~ for boolean indexing:
df = df2[~df2.set_index(['A','B']).index.isin(df1.index)]
print(df)
   X  Y  A  B  C
3  5  5  6  3  4
4  2  7  2  2  1
Another similar solution with MultiIndex.from_arrays:
df = df2[~pd.MultiIndex.from_arrays([df2['A'],df2['B']]).isin(df1.index)]
Another solution, by @Sandeep Kadapa (note that this compares the frames row by row by position, so it assumes both have the same length and row order):
df = df2[df2[['A','B']].ne(df1.reset_index()[['A','B']]).any(axis=1)]
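For frames of this size, an anti-join via merge with indicator=True is another common pattern; a sketch (not from the original answers), reusing df1 and df2 from above:
pairs = df1.reset_index()[['A', 'B']].drop_duplicates()   # unique (A, B) keys of df1
out = (df2.merge(pairs, on=['A', 'B'], how='left', indicator=True)
          .query("_merge == 'left_only'")    # keep rows with no match in df1
          .drop(columns='_merge'))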
I'm trying to combine multiple dataframes in pandas, and I want the new dataframe to contain the element-wise maximum across the various dataframes. All of the dataframes have the same row and column labels. How can I do this?
Example:
df1 = Date    A  B  C
      1/1/15  3  5  1
      2/1/15  2  4  7

df2 = Date    A  B  C
      1/1/15  7  2  2
      2/1/15  1  5  4
I'd like the result to look like this:
df = Date    A  B  C
     1/1/15  7  5  2
     2/1/15  2  5  7
You can use np.where to return an array of the values that satisfy your boolean condition; this can then be used to construct a DataFrame:
In [5]:
vals = np.where(df1 > df2, df1, df2)
vals
Out[5]:
array([['1/1/15', 7, 5, 2],
['2/1/15', 2, 5, 7]], dtype=object)
In [6]:
pd.DataFrame(vals, columns=df1.columns)
Out[6]:
Date A B C
0 1/1/15 7 5 2
1 2/1/15 2 5 7
I don't know if Date is a column or index but the end result will be the same.
EDIT
Actually just use np.maximum:
In [8]:
np.maximum(df1,df2)
Out[8]:
Date A B C
0 1/1/15 7 5 2
1 2/1/15 2 5 7
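Since the question mentions combining multiple dataframes: np.maximum is a binary ufunc, so for an arbitrary list of frames you can fold it with functools.reduce. A sketch, assuming all frames share the same row and column labels:
from functools import reduce
import numpy as np

dfs = [df1, df2]                    # extend with df3, df4, ... as needed
result = reduce(np.maximum, dfs)    # element-wise maximum across all frames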