I have a pandas dataframe as below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ORDER':["A", "A", "A", "A", "B","B"], 'A':[80, 23, np.nan, 60, 1,22], 'B': [80, 55, 5, 76, 67,np.nan]})
df
ORDER A B
0 A 80.0 80.0
1 A 23.0 55.0
2 A NaN 5.0
3 A 60.0 76.0
4 B 1.0 67.0
5 B 22.0 NaN
I want to create a column "new" as below:
If ORDER == 'A', then new=df['A']
If ORDER == 'B', then new=df['B']
This can be achieved using the below code:
df['new'] = np.where(df['ORDER'] == 'A', df['A'], np.nan)
df['new'] = np.where(df['ORDER'] == 'B', df['B'], df['new'])
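As a side note, the two chained np.where calls can be collapsed into a single np.select. This is only a stylistic variant; it still hardcodes both columns, so it does not solve the missing-column problem:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ORDER': ["A", "A", "A", "A", "B", "B"],
                   'A': [80, 23, np.nan, 60, 1, 22],
                   'B': [80, 55, 5, 76, 67, np.nan]})

# one condition/choice pair per ORDER value, NaN where nothing matches
df['new'] = np.select([df['ORDER'] == 'A', df['ORDER'] == 'B'],
                      [df['A'], df['B']], default=np.nan)
```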
The tweak here is that if ORDER does not contain the value "B", then column "B" will not be present in the dataframe, so it might look like the one below. If we use the above code on that dataframe, it will raise an error because column "B" is missing.
ORDER A
0 A 80.0
1 A 23.0
2 A NaN
3 A 60.0
4 A 1.0
5 A 22.0
Use DataFrame.lookup, so you don't need to hardcode df['B']; it looks up the column named by each row's ORDER value:
df['new'] = df.lookup(df.index, df['ORDER'])
ORDER A B new
0 A 80.0 80.0 80.0
1 A 23.0 55.0 23.0
2 A NaN 5.0 NaN
3 A 60.0 76.0 60.0
4 B 1.0 67.0 67.0
5 B 22.0 NaN NaN
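Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On recent versions the same row-wise lookup can be sketched with factorize plus NumPy fancy indexing (the replacement recipe from the pandas deprecation notes, adapted to this frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ORDER': ["A", "A", "A", "A", "B", "B"],
                   'A': [80, 23, np.nan, 60, 1, 22],
                   'B': [80, 55, 5, 76, 67, np.nan]})

# map each row's ORDER label to a column position, then index row-by-row
idx, cols = pd.factorize(df['ORDER'])
df['new'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
```

Like lookup, this reads the column names from the data, so nothing is hardcoded and a missing "B" column is never referenced when ORDER contains only "A".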
I'm trying to split a dataframe when NaN rows are found, using grps = dfs.isnull().all(axis=1).cumsum().
But this is not working when some of the rows have a NaN entry in only a single column.
import pandas as pd
from pprint import pprint
import numpy as np
d = {
't': [0, 1, 2, 0, 2, 0, 1],
'input': [2, 2, 2, 2, 2, 2, 4],
'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
'value': [0.1, 0.2, 0.3, np.nan, 2, 3, 1],
}
df = pd.DataFrame(d)
dup = df['t'].diff().lt(0).cumsum()
dfs = (
df.groupby(dup, as_index=False, group_keys=False)
.apply(lambda x: pd.concat([x, pd.Series(index=x.columns, name='').to_frame().T]))
)
pprint(dfs)
grps = dfs.isnull().all(axis=1).cumsum()
temp = [dfs.dropna() for _, dfs in dfs.groupby(grps)]
i = 0
dfm = pd.DataFrame()
for df in temp:
df["name"] = f'name{i}'
i=i+1
df = df.append(pd.Series(dtype='object'), ignore_index=True)
dfm = dfm.append(df, ignore_index=True)
print(dfm)
Input df:
t input type value
0 0.0 2.0 A 0.1
1 1.0 2.0 A 0.2
2 2.0 2.0 A 0.3
NaN NaN NaN NaN
3 0.0 2.0 B NaN
4 2.0 2.0 B 2.0
NaN NaN NaN NaN
5 0.0 2.0 B 3.0
6 1.0 4.0 A 1.0
Output obtained:
t input type value name
0 0.0 2.0 A 0.1 name0
1 1.0 2.0 A 0.2 name0
2 2.0 2.0 A 0.3 name0
3 NaN NaN NaN NaN NaN
4 2.0 2.0 B 2.0 name1
5 NaN NaN NaN NaN NaN
6 0.0 2.0 B 3.0 name2
7 1.0 4.0 A 1.0 name2
8 NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN
Expected:
t input type value name
0 0.0 2.0 A 0.1 name0
1 1.0 2.0 A 0.2 name0
2 2.0 2.0 A 0.3 name0
3 NaN NaN NaN NaN NaN
4 0.0 2.0 B NaN name1
5 2.0 2.0 B 2.0 name1
6 NaN NaN NaN NaN NaN
7 0.0 2.0 B 3.0 name2
8 1.0 4.0 A 1.0 name2
9 NaN NaN NaN NaN NaN
I am basically doing this to append names to the last column of the dataframe after splitting df using
dfs = (
df.groupby(dup, as_index=False, group_keys=False)
.apply(lambda x: pd.concat([x, pd.Series(index=x.columns, name='').to_frame().T]))
)
and appending NaN rows.
Again, I use the NaN rows to split the df into a list and add a new column. But dfs.isnull().all(axis=1).cumsum() isn't working for me, and I also get an additional NaN row at the end of the output obtained.
Suggestions on how to get the expected output will be really helpful.
Setup
df = pd.DataFrame(d)
print(df)
t input type value
0 0 2 A 0.1
1 1 2 A 0.2
2 2 2 A 0.3
3 0 2 B NaN
4 2 2 B 2.0
5 0 2 B 3.0
6 1 4 A 1.0
Simplify your approach
# assign name column before splitting
m = df['t'].diff().lt(0)
df['name'] = 'name' + m.cumsum().astype(str)
# Create null dataframes to concat
nan_rows = pd.DataFrame(index=m[m].index)
last_nan_row = pd.DataFrame(index=df.index[[-1]])
# Concat and sort index
df_out = pd.concat([nan_rows, df, last_nan_row]).sort_index(ignore_index=True)
Result
t input type value name
0 0.0 2.0 A 0.1 name0
1 1.0 2.0 A 0.2 name0
2 2.0 2.0 A 0.3 name0
3 NaN NaN NaN NaN NaN
4 0.0 2.0 B NaN name1
5 2.0 2.0 B 2.0 name1
6 NaN NaN NaN NaN NaN
7 0.0 2.0 B 3.0 name2
8 1.0 4.0 A 1.0 name2
9 NaN NaN NaN NaN NaN
Alternatively, if you still want to start from the initial input dfs, here is another approach:
dfs = dfs.reset_index(drop=True)
m = dfs.isna().all(axis=1)
dfs.loc[~m, 'name'] = 'name' + m.cumsum().astype(str)
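A likely reason the original list comprehension loses the partially-NaN row is that dropna() defaults to how='any', which drops row 3 because its value is NaN. A minimal sketch of the splitting step with how='all' (on a small hypothetical frame, not the question's exact data):

```python
import numpy as np
import pandas as pd

# two groups separated by one all-NaN row; the second group has a partial NaN
dfs = pd.DataFrame({'t': [0, 1, np.nan, 0, 2],
                    'input': [2, 2, np.nan, 2, 2],
                    'type': ['A', 'A', np.nan, 'B', 'B'],
                    'value': [0.1, 0.2, np.nan, np.nan, 2.0]})

grps = dfs.isna().all(axis=1).cumsum()
# how='all' removes only the fully-NaN separator rows,
# keeping rows that merely have a missing 'value'
parts = [g.dropna(how='all') for _, g in dfs.groupby(grps)]
```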
I have a dataframe like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(
[
[2, np.nan, np.nan, np.nan, np.nan],
[np.nan, 2, np.nan, np.nan, np.nan],
[np.nan, np.nan, 2, np.nan, np.nan],
[np.nan, 2, 2, np.nan, np.nan],
[2, np.nan, 2, np.nan, 2],
[2, np.nan, np.nan, 2, np.nan],
[np.nan, 2, 2, 2, np.nan],
[2, np.nan, np.nan, np.nan, 2]
],
index=list('abcdefgh'), columns=list('ABCDE')
)
df
A B C D E
a 2.0 NaN NaN NaN NaN
b NaN 2.0 NaN NaN NaN
c NaN NaN 2.0 NaN NaN
d NaN 2.0 2.0 NaN NaN
e 2.0 NaN 2.0 NaN 2.0
f 2.0 NaN NaN 2.0 NaN
g NaN 2.0 2.0 2.0 NaN
h 2.0 NaN NaN NaN 2.0
I would like to fill NaNs with 0 in each row, but only the single NaN immediately before and the single NaN immediately after a non-NaN value, using pandas.
So my desired output would be the following:
A B C D E
a 2.0 0.0 NaN NaN NaN
b 0.0 2.0 0.0 NaN NaN
c NaN 0.0 2.0 0.0 NaN
d 0.0 2.0 2.0 0.0 NaN
e 2.0 0.0 2.0 0.0 2.0
f 2.0 0.0 0.0 2.0 0.0
g 0.0 2.0 2.0 2.0 0.0
h 2.0 0.0 NaN 0.0 2.0
I know how to do it with for loops, but I was wondering if it is possible do it only with pandas.
Thank you very much for your help!
You can use shift forward and backward along the columns, then mask:
cond = (df.notna().shift(axis=1, fill_value=False)        # non-NaN neighbour on the left
        | df.notna().shift(-1, axis=1, fill_value=False)  # non-NaN neighbour on the right
        ) & df.isna()                                     # and the cell itself is NaN
df.mask(cond, 0)
output:
A B C D E
a 2.0 0.0 NaN NaN NaN
b 0.0 2.0 0.0 NaN NaN
c NaN 0.0 2.0 0.0 NaN
d 0.0 2.0 2.0 0.0 NaN
e 2.0 0.0 2.0 0.0 2.0
f 2.0 0.0 0.0 2.0 0.0
g 0.0 2.0 2.0 2.0 0.0
h 2.0 0.0 NaN 0.0 2.0
NB. This transformation is called a binary dilation; you can also use scipy.ndimage.binary_dilation for it. The advantage of this method is that you can use various structuring elements (not only left/right/top/bottom):
import numpy as np
from scipy.ndimage import binary_dilation  # the scipy.ndimage.morphology namespace is deprecated

struct = np.array([[True, False, True]])
df.mask(binary_dilation(df.notna(), structure=struct), 0)
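For a quick self-check, here is the same shift-and-mask logic on a minimal two-row frame (a reduced, self-contained sketch of the answer's approach):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[2, np.nan, np.nan],
                   [np.nan, 2, np.nan]],
                  index=list('ab'), columns=list('ABC'))

# a cell becomes 0 if it is NaN and has a non-NaN neighbour to its left or right
cond = (df.notna().shift(axis=1, fill_value=False)
        | df.notna().shift(-1, axis=1, fill_value=False)) & df.isna()
out = df.mask(cond, 0)
# row a: [2, 0, NaN] -- only B touches the 2; row b: [0, 2, 0]
```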
I have a DataFrame with some NaN values. In this DataFrame there are some rows with all NaN values. When I apply sum function on these rows, it is returning zero instead of NaN. Code is as follows:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, 60, size=(5, 3)),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['One', 'Two', 'Three'])
df = df.reindex(index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.loc['b'].sum())
Any Suggestion?
The sum function treats NaN values as 0.
If you want the sum over all-NaN values to be NaN instead, use the min_count parameter:
df.loc['b'].sum(min_count=1)
Output:
nan
If you apply it to all rows (after using reindex) you will get the following:
df.sum(axis=1,min_count=1)
a 137.0
b NaN
c 79.0
d NaN
e 132.0
f 95.0
g NaN
h 81.0
dtype: float64
If you now replace one NaN value in row b:
df.at['b','One']=0
print(df)
One Two Three
a 54.0 20.0 29.0
b 0.0 NaN NaN
c 13.0 24.0 27.0
d NaN NaN NaN
e 28.0 53.0 25.0
f 46.0 55.0 50.0
g NaN NaN NaN
h 47.0 26.0 48.0
df.sum(axis=1,min_count=1)
a 103.0
b 0.0
c 64.0
d NaN
e 106.0
f 151.0
g NaN
h 121.0
dtype: float64
As you can see, the result for row b is now 0.
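The behaviour boils down to how an empty (all-NaN) sum is defined; a two-line sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

print(s.sum())             # 0.0 -- NaNs are skipped, and an empty sum is 0
print(s.sum(min_count=1))  # nan -- fewer than min_count valid values present
```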
I have a pandas dataframe structured as follows:
In[1]: df = pd.DataFrame({"A":[10, 15, 13, 18, 0.6],
"B":[20, 12, 16, 24, 0.5],
"C":[23, 22, 26, 24, 0.4],
"D":[9, 12, 17, 24, 0.8 ]})
Out[1]: df
A B C D
0 10.0 20.0 23.0 9.0
1 15.0 12.0 22.0 12.0
2 13.0 16.0 26.0 17.0
3 18.0 24.0 24.0 24.0
4 0.6 0.5 0.4 0.8
From here my goal is to filter multiple columns based on the last row's (index 4) values. More in detail, I need to keep the columns that have a value < 0.6 in the last row. The output should be a df structured as follows:
B C
0 20.0 23.0
1 12.0 22.0
2 16.0 26.0
3 24.0 24.0
4 0.5 0.4
I'm trying this:
In[2]: df[(df[["A", "B", "C", "D"]] < 0.6)]
but I get the following:
Out[2]:
A B C D
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN 0.5 0.4 NaN
I even tried:
df[(df[["A", "B", "C", "D"]] < 0.6).all(axis=0)]
but it gives me an error; it doesn't work.
Can anybody help me?
Use DataFrame.loc with : to return all rows, with the column condition built by comparing the last row via DataFrame.iloc:
df1 = df.loc[:, df.iloc[-1] < 0.6]
print (df1)
B C
0 20.0 23.0
1 12.0 22.0
2 16.0 26.0
3 24.0 24.0
4 0.5 0.4
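As for why the .all(axis=0) attempt fails: that mask is a boolean Series indexed by column labels, while plain df[mask] filters rows by the row index, hence the error. Moreover, .all(axis=0) asks whether every value in a column is < 0.6, which is never true here; the condition should apply to the last row only. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 15, 13, 18, 0.6],
                   "B": [20, 12, 16, 24, 0.5],
                   "C": [23, 22, 26, 24, 0.4],
                   "D": [9, 12, 17, 24, 0.8]})

mask = df.iloc[-1] < 0.6   # boolean Series indexed by column label
df1 = df.loc[:, mask]      # column selection needs .loc[:, mask], not df[mask]
```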
I'm trying to reshape this sample dataframe from long to wide format, without aggregating any of the data.
import numpy as np
import pandas as pd

df = pd.DataFrame({'SubjectID': ['A', 'A', 'A', 'B', 'B', 'C', 'A'],
                   'Date': ['2010-03-14', '2010-03-15', '2010-03-16', '2010-03-14',
                            '2010-05-15', '2010-03-14', '2010-03-14'],
                   'Var1': [1, 12, 4, 7, 90, 1, 9],
                   'Var2': [0, 0, 1, 1, 1, 0, 1],
                   'Var3': [np.nan, 1, 0, np.nan, 0, 1, np.nan]})
df['Date'] = pd.to_datetime(df['Date']); df
Date SubjectID Var1 Var2 Var3
0 2010-03-14 A 1 0 NaN
1 2010-03-15 A 12 0 1.0
2 2010-03-16 A 4 1 0.0
3 2010-03-14 B 7 1 NaN
4 2010-05-15 B 90 1 0.0
5 2010-03-14 C 1 0 1.0
6 2010-03-14 A 9 1 NaN
To get around the duplicate values, I'm grouping by the "Date" column and getting the cumulative count for each value. Then I make a pivot table:
df['idx'] = df.groupby('Date').cumcount()
dfp = df.pivot_table(index = 'SubjectID', columns = 'idx'); dfp
                Var1                    Var2                 Var3
idx                0    1    2    3        0    1    2    3     0    2
SubjectID
A           5.666667  NaN  NaN  9.0  0.333333  NaN  NaN  1.0   0.5  NaN
B          90.000000  7.0  NaN  NaN  1.000000  1.0  NaN  NaN   0.0  NaN
C                NaN  NaN  1.0  NaN       NaN  NaN  0.0  NaN   NaN  1.0
However, I want the idx column index to be replaced by the values from the "Date" column, and I don't want to aggregate any data. The expected output is:
Var1_2010-03-14 Var1_2010-03-14 Var1_2010-03-15 Var1_2010-03-16 Var1_2010-05-15 Var2_2010-03-14 Var2_2010-03-15 Var2_2010-03-16 Var2_2010-05-15 Var3_2010-03-14 Var3_2010-03-15 Var3_2010-03-16 Var3_2010-05-15
SubjectID
A 1 9 12 4 NaN 0 1 0 1.0 NaN NaN NaN 1.0 0.0 NaN
B 7.0 NaN NaN NaN 90 1 NaN NaN 1.0 NaN NaN NaN NaN NaN 0.0
C 1 NaN NaN NaN NaN 0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
How can I do this? Eventually, I'll merge the two column indexes by dfp.columns = [col[0]+ '_' + str(col[1]) for col in dfp.columns].
You are on the correct path:
# group
df['idx'] = df.groupby('Date').cumcount()
# set index and unstack
new = df.set_index(['idx','Date', 'SubjectID']).unstack(level=[0,1])
# drop idx column
new.columns = new.columns.droplevel(1)
new.columns = [f'{val}_{date}' for val, date in new.columns]
I think this is your expected output
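Put together with the sample frame from the question, the whole pipeline might look like the sketch below (Date is kept as a plain string here, a small deviation from the question, so the column labels join cleanly without an astype(str) step):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'SubjectID': ['A', 'A', 'A', 'B', 'B', 'C', 'A'],
                   'Date': ['2010-03-14', '2010-03-15', '2010-03-16', '2010-03-14',
                            '2010-05-15', '2010-03-14', '2010-03-14'],
                   'Var1': [1, 12, 4, 7, 90, 1, 9],
                   'Var2': [0, 0, 1, 1, 1, 0, 1],
                   'Var3': [np.nan, 1, 0, np.nan, 0, 1, np.nan]})

# deduplicate repeated Dates, then pivot without aggregating
df['idx'] = df.groupby('Date').cumcount()
new = df.set_index(['idx', 'Date', 'SubjectID']).unstack(level=[0, 1])
new.columns = new.columns.droplevel(1)  # drop the idx level
new.columns = [f'{val}_{date}' for val, date in new.columns]
```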
Using map looks like it will be a little faster:
df['idx'] = df.groupby('Date').cumcount()
df['Date'] = df['Date'].astype(str)
new = df.set_index(['idx','Date', 'SubjectID']).unstack(level=[0,1])
new.columns = new.columns.droplevel(1)
#new.columns = [f'{val}_{date}' for val, date in new.columns]
new.columns = new.columns.map('_'.join)
Here is a 50,000 row test example:
#data
data = pd.DataFrame(pd.date_range('2000-01-01', periods=50000, freq='D'))
data['a'] = list('abcd')*12500
data['b'] = 2
data['c'] = list('ABCD')*12500
data.rename(columns={0:'date'}, inplace=True)
# list comprehension:
%%timeit -r 3 -n 200
new = data.set_index(['a','date','c']).unstack(level=[0,1])
new.columns = new.columns.droplevel(0)
new.columns = [f'{x}_{y}' for x,y in new.columns]
# 98.2 ms ± 13.3 ms per loop (mean ± std. dev. of 3 runs, 200 loops each)
# map with join:
%%timeit -r 3 -n 200
data['date'] = data['date'].astype(str)
new = data.set_index(['a','date','c']).unstack(level=[0,1])
new.columns = new.columns.droplevel(0)
new.columns = new.columns.map('_'.join)
# 84.6 ms ± 3.87 ms per loop (mean ± std. dev. of 3 runs, 200 loops each)