How to do null combination in pandas dataframe - python-3.x

Here's my dataset
Id Column_A Column_B Column_C
1 Null 7 Null
2 8 7 Null
3 Null 8 7
4 8 Null 8
If at least one column is null, Combination will be Null
Id Column_A Column_B Column_C Combination
1 Null 7 Null Null
2 8 7 Null Null
3 Null 8 7 Null
4 8 Null 8 Null

Assuming Null is NaN, we could use isna + any:
df['Combination'] = df.isna().any(axis=1).map({True: 'Null', False: 'Notnull'})
If Null is a string, we could use eq + any:
df['Combination'] = df.eq('Null').any(axis=1).map({True: 'Null', False: 'Notnull'})
Output:
Id Column_A Column_B Column_C Combination
0 1 Null 7 Null Null
1 2 8 7 Null Null
2 3 Null 8 7 Null
3 4 8 Null 8 Null
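As a runnable sketch of the string-'Null' variant (column names assumed from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'Column_A': ['Null', 8, 'Null', 8],
    'Column_B': [7, 7, 8, 'Null'],
    'Column_C': ['Null', 'Null', 7, 8],
})

# True where any column holds the literal string 'Null'
df['Combination'] = (df.eq('Null').any(axis=1)
                       .map({True: 'Null', False: 'Notnull'}))
print(df)
```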

Use DataFrame.isna with DataFrame.any and pass to numpy.where:
df['Combination'] = np.where(df.isna().any(axis=1), 'Null','Notnull')
df['Combination'] = np.where(df.eq('Null').any(axis=1), 'Null','Notnull')
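A self-contained sketch of the NaN variant with numpy.where (the data layout is assumed from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'Column_A': [np.nan, 8, np.nan, 8],
    'Column_B': [7, 7, 8, np.nan],
    'Column_C': [np.nan, np.nan, 7, 8],
})

# 'Null' if any column in the row is NaN, else 'Notnull'
df['Combination'] = np.where(df.isna().any(axis=1), 'Null', 'Notnull')
print(df)
```

Every row of the sample has at least one NaN, so Combination is 'Null' throughout.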


How to build a null notnull matrix in pandas dataframe

Here's my dataset
Id Column_A Column_B Column_C
1 Null 7 Null
2 8 7 Null
3 Null 8 7
4 8 Null 8
Here's my expected output
Column_A Column_B Column_C Total
Null 2 1 2 5
Notnull 2 3 2 7
Assuming Null is NaN, here's one option: use isna + sum to count the NaNs per column, subtract those counts from the length of df to get the Notnull counts, then construct a DataFrame from the two Series.
nulls = df.drop(columns='Id').isna().sum()
notnulls = nulls.rsub(len(df))
out = pd.DataFrame.from_dict({'Null':nulls, 'Notnull':notnulls}, orient='index')
out['Total'] = out.sum(axis=1)
If you're into one-liners, we could also do:
out = (df.drop(columns='Id').isna().sum().to_frame(name='Nulls')
.assign(Notnull=df.drop(columns='Id').notna().sum()).T
.assign(Total=lambda x: x.sum(axis=1)))
Output:
Column_A Column_B Column_C Total
Nulls 2 1 2 5
Notnull 2 3 2 7
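A self-contained version of the counting approach above (NaN standing in for Null, names as in the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'Column_A': [np.nan, 8, np.nan, 8],
    'Column_B': [7, 7, 8, np.nan],
    'Column_C': [np.nan, np.nan, 7, 8],
})

# Count NaNs per value column, then derive the Notnull counts
nulls = df.drop(columns='Id').isna().sum()
notnulls = nulls.rsub(len(df))

out = pd.DataFrame.from_dict({'Null': nulls, 'Notnull': notnulls},
                             orient='index')
out['Total'] = out.sum(axis=1)
print(out)
```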
Use Series.value_counts on the notna mask (pd.value_counts is deprecated in recent pandas, so pd.Series.value_counts is used instead):
df = (df.replace('Null', np.nan)
.set_index('Id')
.notna()
.apply(pd.Series.value_counts)
.rename({True:'Notnull', False:'Null'}))
df['Total'] = df.sum(axis=1)
print (df)
Column_A Column_B Column_C Total
Null 2 1 2 5
Notnull 2 3 2 7
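The value_counts route can be sketched end to end like this (using pd.Series.value_counts, and assuming the literal string 'Null' in the data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'Column_A': ['Null', 8, 'Null', 8],
    'Column_B': [7, 7, 8, 'Null'],
    'Column_C': ['Null', 'Null', 7, 8],
})

# Turn 'Null' strings into NaN, then count True/False per column
out = (df.replace('Null', np.nan)
         .set_index('Id')
         .notna()
         .apply(pd.Series.value_counts)
         .rename({True: 'Notnull', False: 'Null'}))
out['Total'] = out.sum(axis=1)
print(out)
```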

Remove rows from Dataframe where row above or below has same value in a specific column

Starting Dataframe:
A B
0 1 1
1 1 2
2 2 3
3 3 4
4 3 5
5 1 6
6 1 7
7 1 8
8 2 9
Desired result - e.g. remove rows where column A matches the value in the row above or below:
A B
0 1 1
2 2 3
3 3 4
5 1 6
8 2 9
You can use boolean indexing; the following condition is True when the value of A is not equal to the value of A in the previous row (df['A'].shift()), which keeps only the first row of each run of repeats:
new_df = df[df['A'].ne(df['A'].shift())]
A B
0 1 1
2 2 3
3 3 4
5 1 6
8 2 9
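A minimal runnable version of that boolean-indexing idea, using the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3, 3, 1, 1, 1, 2],
                   'B': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

# Keep a row only when A differs from A in the row above;
# this drops every later row of a run of repeated values
new_df = df[df['A'].ne(df['A'].shift())]
print(new_df)
```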

count Total rows of an Id from another column

I have a dataframe
# Initialise data of lists
data = {'Id':['1', '2', '3', '4','5','6','7','8','9','10'], 'reply_id':[2, 2,2, 5,5,6,8,8,1,1]}
# Create DataFrame
df = pd.DataFrame(data)
Id reply_id
0 1 2
1 2 2
2 3 2
3 4 5
4 5 5
5 6 6
6 7 8
7 8 8
8 9 1
9 10 1
I want the total number of occurrences of each Id within reply_id, in a new column new.
For example, Id=1 occurs 2 times in reply_id, so I want 2 in the column new for that row.
Desired output
Id reply_id new
0 1 2 2
1 2 2 3
2 3 2 0
3 4 5 0
4 5 5 2
5 6 6 1
6 7 8 0
7 8 8 2
8 9 1 0
9 10 1 0
I have tried this line of code:
df['new'] = df.reply_id.eq(df.Id).astype(int).groupby(df.Id).transform('sum')
In this answer, I used Series.value_counts to count values in reply_id, and converted the result to a dict. Then, I used Series.map on the Id column to associate counts to Id. fillna(0) is used to fill values not present in reply_id
df['new'] = (df['Id']
.astype(int)
.map(df['reply_id'].value_counts().to_dict())
.fillna(0)
.astype(int))
Use, Series.groupby on the column reply_id, then use the aggregation function GroupBy.count to create a mapping series counts, finally use Series.map to map the values in Id column with their respective counts:
counts = df['reply_id'].groupby(df['reply_id']).count()
df['new'] = df['Id'].map(counts).fillna(0).astype(int)
Result:
# print(df)
Id reply_id new
0 1 2 2
1 2 2 3
2 3 2 0
3 4 5 0
4 5 5 2
5 6 6 1
6 7 8 0
7 8 8 2
8 9 1 0
9 10 1 0
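The mapping approach can be sketched end to end (note Id is stored as strings in the question's data, hence the astype(int) before the map):

```python
import pandas as pd

df = pd.DataFrame({'Id': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'],
                   'reply_id': [2, 2, 2, 5, 5, 6, 8, 8, 1, 1]})

# Count occurrences of each value in reply_id, then look each Id up;
# Ids that never appear in reply_id map to NaN, hence fillna(0)
df['new'] = (df['Id']
             .astype(int)
             .map(df['reply_id'].value_counts())
             .fillna(0)
             .astype(int))
print(df)
```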

Get row with a symbol after a particular index

I have a df:
Index col1
1 Abc
2 xyz
3 $123
4 wer
5 exr
6 ert
7 $546
8 $456
Problem Statement:
Now I want to find the index of the row containing the dollar sign after the keyword wer.
My Code:
idx = df.col1.str.contains(r'\$').idxmax() ## this gives me index 3 but what I want is index 7
Help needed to modify my code to get the desired output.
You need to mask on the wer keyword as well:
s = (df['col1'].str.contains(r'\$')         # rows containing $
     & df['col1'].eq('wer').cumsum().gt(0)  # rows at or after the first 'wer'
    ).idxmax()
# s == 7
Use:
#df=df.set_index('Index') #if 'index' is a column
df2=df[df['col1'].eq('wer').cumsum()>0]
df2['col1'].str.contains(r'\$').idxmax()
or:
df[(df['col1'].eq('wer').cumsum()>0) & df['col1'].str.contains(r'\$')].index[0]
Output:
7
Details:
df['col1'].eq('wer').cumsum().eq(1)
Index
1 False
2 False
3 False
4 True
5 True
6 True
7 True
8 True
Name: col1, dtype: bool
print(df2)
col1
Index
4 wer
5 exr
6 ert
7 $546
8 $456
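Both answers boil down to the same condition; here is a runnable sketch (raw strings are used for the regex, since $ is a regex metacharacter and must be escaped):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['Abc', 'xyz', '$123', 'wer',
                            'exr', 'ert', '$546', '$456']},
                  index=range(1, 9))

# True only for rows at or after the first 'wer' that contain a dollar sign
mask = df['col1'].str.contains(r'\$') & df['col1'].eq('wer').cumsum().gt(0)
idx = mask.idxmax()
print(idx)
```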

How to extract value of column based on value change in other column python

I have a dataframe with two columns. I want to extract the value of the first column based on the second column: if, within the last 3 rows, the value of column 2 changes from a nonzero value to 0, extract the value of column 1 in that row.
df=pd.DataFrame({'column1':[1,5,6,7,8,11,12,14,18,20],'column2':[0,0,1,1,0,0,0,256,256,0]})
print(df)
column1 column2
0 1 0
1 5 0
2 6 1
3 7 1
4 8 0
5 11 0
6 12 0
7 14 256
8 18 256
9 20 0
out_put=pd.DataFrame({'column1':[20],'column2':[0]})
print(out_put)
column1 column2
0 20 0
I believe you need to check where, within the last 3 values of the second column, the value changes to 0:
df1 = df.tail(3)
df2 = df1[df1['column2'].eq(0).astype('i1').diff().eq(1)]
print (df2)
column1 column2
9 20 0
Details:
#last 3 rows
print (df1)
column1 column2
7 14 256
8 18 256
9 20 0
#compare second colum for equality
print (df1['column2'].eq(0))
7 False
8 False
9 True
Name: column2, dtype: bool
#convert mask to integers
print (df1['column2'].eq(0).astype('i1'))
7 0
8 0
9 1
Name: column2, dtype: int8
#get difference
print (df1['column2'].eq(0).astype('i1').diff())
7    NaN
8    0.0
9    1.0
Name: column2, dtype: float64
#compare by 1
print (df1['column2'].eq(0).astype('i1').diff().eq(1))
7 False
8 False
9 True
Name: column2, dtype: bool
And last filter by boolean indexing.
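A runnable version of the same idea, with astype('i1') in place of .view('i1') (Series.view was deprecated and later removed in pandas):

```python
import pandas as pd

df = pd.DataFrame({'column1': [1, 5, 6, 7, 8, 11, 12, 14, 18, 20],
                   'column2': [0, 0, 1, 1, 0, 0, 0, 256, 256, 0]})

# Look at the last 3 rows only
df1 = df.tail(3)

# A row qualifies when column2 becomes 0 right after a nonzero value:
# the 0/1 mask for "equals zero" jumps from 0 to 1, i.e. diff == 1
df2 = df1[df1['column2'].eq(0).astype('i1').diff().eq(1)]
print(df2)
```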
