I am new here and I need some help with Python pandas.
I want to create a new column that holds the sum of two other columns plus the previous row's value of the new column itself (a running total).
Here is my example:
df = pd.DataFrame({
'column0': ['x', 'x', 'y', 'x', 'y', 'y', 'x'],
'column1': [50, 100, 30, 0, 30, 80, 0],
'column2': [0, 0, 0, 10, 0, 0, 30],
})
print(df)
column0 column1 column2
0 x 50 0
1 x 100 0
2 y 30 0
3 x 0 10
4 y 30 0
5 y 80 0
6 x 0 30
I have used loc to filter this DataFrame like this:
df = df.loc[df['column0'] == 'x']
df = df.reset_index(drop=True)
Now, when I try to calculate the result, I don't get the correct output:
df['Result'] = df['column1'] + df['column2']
df['Result'] = df['column1'] + df['column2'] + df['Result'].shift(1)
print(df)
column0 column1 column2 Result
0 x 50 0 NaN
1 x 100 0 150.0
2 x 0 10 110.0
3 x 0 30 40.0
I just want this output:
column0 column1 column2 Result
0 x 50 0 50
1 x 100 0 150.0
2 x 0 10 160.0
3 x 0 30 190.0
Thank you very much!
Your shift-based attempt only reaches one row back: df['Result'].shift(1) adds the previous row's plain column1 + column2 sum, not the accumulated total, which is why row 2 shows 110 instead of 160. You can use .cumsum() to calculate the running total directly on the filtered frame:
df = df.loc[df['column0'] == 'x'].reset_index(drop=True)
df['Result'] = (df['column1'] + df['column2']).cumsum()
This results in:
column0 column1 column2 Result
0 x 50 0 50
1 x 100 0 150
2 x 0 10 160
3 x 0 30 190
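If you would rather keep every row and restart the running total for each value in column0, a groupby-based sketch (my addition, using the unfiltered df from the question):
# row-wise sum, accumulated separately within each column0 group
df['Result'] = (df['column1'] + df['column2']).groupby(df['column0']).cumsum()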
I have the following dataframe:
import pandas as pd
df = pd.DataFrame({'a': ['x', 'x', 'y','w', 'x', 'z', 'z', 'y', 'w'],
'Flag': [1, 0, 0, 0, 1, 0, 0, 0, 1]})
I want to add a column b that flags whether a given value of a ever appears with Flag equal to 1:
a Flag b
x 1 1
x 0 1
y 0 0
w 0 1
x 1 1
z 0 0
z 0 0
y 0 0
w 1 1
What I did: group by a, take the cumulative sum of Flag, then map every entry greater than 0 to 1 and the rest to 0.
Is there any simpler method or function to do this?
You could do it with isin and .astype(int):
df['b'] = df['a'].isin(df.loc[df['Flag'].eq(1), 'a']).astype(int)
>>> df
a Flag b
0 x 1 1
1 x 0 1
2 y 0 0
3 w 0 1
4 x 1 1
5 z 0 0
6 z 0 0
7 y 0 0
8 w 1 1
Or, for situations where you need different fill values, you can use np.where (with numpy imported as np):
df['b'] = np.where(df['a'].isin(df.loc[df['Flag'].eq(1), 'a']), 1, 0)
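The groupby approach you describe can also be written directly with transform, which broadcasts each group's maximum Flag back to every row; a minimal sketch (my variant, not part of the original answer):
# a group containing any Flag == 1 gets 1 on every row, otherwise 0
df['b'] = df.groupby('a')['Flag'].transform('max')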
I need to convert a diagonal DataFrame to a one-row DataFrame.
Input:
df = pd.DataFrame([[7, 0, 0, 0],
[0, 2, 0, 0],
[0, 0, 3, 0],
[0, 0, 0, 8],],
columns=list('ABCD'))
A B C D
0 7 0 0 0
1 0 2 0 0
2 0 0 3 0
3 0 0 0 8
Expected output:
A B C D
0 7 2 3 8
What I tried so far:
df1 = df.sum().to_frame().transpose()
df1
A B C D
0 7 2 3 8
It does the job, but is there a more elegant way to do this with groupby or some other pandas builtin?
Not sure if there is a more 'elegant' way; I can only propose alternatives:
Use numpy.diagonal
pd.DataFrame([df.to_numpy().diagonal()], columns=df.columns)
A B C D
0 7 2 3 8
Use groupby with boolean (not sure if this is better than your solution):
df.groupby([True] * len(df), as_index=False).sum()
A B C D
0 7 2 3 8
You can use np.diagonal (with numpy imported as np):
pd.DataFrame(np.diagonal(df), index=df.columns).T
A B C D
0 7 2 3 8
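Another spelling of the numpy route, building a labelled Series first (my variant, not from the answers above):
import numpy as np
# label the diagonal values with the column names, then flip the Series into one row
pd.Series(np.diag(df.to_numpy()), index=df.columns).to_frame().T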
I wonder how to replace values by type in a DataFrame. In this example I want to replace all strings with 0 (or NaN). Here is my simple df; I tried:
df.replace(str, 0, inplace=True)
or
df.replace({str: 0}, inplace=True)
but neither of these works.
0 1 2
0 NaN 1 'b'
1 2 3 'c'
2 4 'd' 5
3 10 20 30
Check this code: it visits every cell in the DataFrame and, if the cell is NaN or a string, replaces it with 0:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [0, 1, 2, 3, np.nan],
'B': [np.nan, 6, 7, 8, 9],
'C': ['a', 10, 500, 'd', 'e']})
print("before >>> \n",df)
def replace_nan_and_strings(cell_value):
    if pd.isnull(cell_value) or isinstance(cell_value, str):
        return 0
    else:
        return cell_value

new_df = df.applymap(replace_nan_and_strings)
print("after >>> \n", new_df)
Try this:
df = df.replace('[a-zA-Z]', 0, regex=True)
This is how I tested it:
'''
0 1 2
0 NaN 1 'b'
1 2 3 'c'
2 4 'd' 5
3 10 20 30
'''
import pandas as pd
df = pd.read_clipboard()
df = df.replace('[a-zA-Z]', 0, regex=True)
print(df)
Output:
0 1 2
0 NaN 1 0
1 2.0 3 0
2 4.0 0 5
3 10.0 20 30
New scenario, as requested in the comments:
Input:
'''
0 '1' 2
0 NaN 1 'b'
1 2 3 'c'
2 '4' 'd' 5
3 10 20 30
'''
Output:
0 '1' 2
0 NaN 1 0
1 2 3 0
2 '4' 0 5
3 10 20 30
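Note that the regex only matches alphabetic characters, so digit strings such as '4' survive. If you want to replace cells by type instead, a mask-based sketch (assuming pandas >= 2.1 for DataFrame.map; use applymap on older versions):
is_str = df.map(lambda v: isinstance(v, str))  # boolean mask of the string cells
df = df.mask(is_str, 0)                        # set those cells to 0, keep the rest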
I have this dataframe:
a = [1, 2, 3, 4, 5]
b = ['2019-08-01', '2019-09-01', '2019-10-23', '2019-11-12', '2019-11-30']
c = [12, 0, 0, 0, 0]
d = [0, 23, 0, 0, 0]
e = [12, 24, 35, 0, 0]
f = [0, 0, 44, 56, 82]
g = [21, 22, 17, 75, 63]
df = pd.DataFrame({'ID': a, 'Date': b, 'Unit_sold_8': c,
'Unit_sold_9': d, 'Unit_sold_10': e, 'Unit_sold_11': f,
'Unit_sold_12': g})
df['Date'] = pd.to_datetime(df['Date'])
I want to calculate the average sales of each ID starting from its opening date. For example, if an ID opened in September, its average should start from the September column. I tried np.select, but that approach makes the code very long:
col = df.columns
mask1 = (df['Date'] >= "08/01/2019") & (df['Date'] < "09/01/2019")
mask2 = (df['Date'] >= "09/01/2019") & (df['Date'] < "10/01/2019")
mask3 = (df['Date'] >= "10/01/2019") & (df['Date'] < "11/01/2019")
mask4 = (df['Date'] >= "11/01/2019") & (df['Date'] < "12/01/2019")
mask5 = (df['Date'] >= "12/01/2019")
condition2 = [mask1, mask2, mask3, mask4, mask5]
result2 = [df[col[2:]].mean(skipna = True, axis = 1),
df[col[3:]].mean(skipna = True, axis = 1),
df[col[4:]].mean(skipna = True, axis = 1),
df[col[5:]].mean(skipna = True, axis = 1),
df[col[6:]].mean(skipna = True, axis = 1)]
df.loc[:, 'Mean'] = np.select(condition2, result2, default = np.nan)
Is there a faster way to solve this, especially when the time range grows (12 months, 24 months, etc.)?
Does this help?
from datetime import datetime
from dateutil import relativedelta

check_date = datetime.today()  # the output below reflects a run in late January 2020

def months_between(start, end):
    # total whole months between two dates (.months alone would ignore full years)
    d = relativedelta.relativedelta(end, start)
    return d.years * 12 + d.months

df['n_months'] = df['Date'].apply(lambda x: months_between(x, check_date))
df['total'] = df.iloc[:, range(2, df.shape[1] - 1)].sum(axis=1)  # sum the Unit_sold columns
df['avg'] = df['total'] / df['n_months']
print(df)
ID Date Unit_sold_8 ... n_months total avg
0 1 2019-08-01 12 ... 5 45 9.00
1 2 2019-09-01 0 ... 4 69 17.25
2 3 2019-10-23 0 ... 3 96 32.00
3 4 2019-11-12 0 ... 2 131 65.50
4 5 2019-11-30 0 ... 2 145 72.50
M = (df
     # melt data to pull the unit columns into long form
     .melt(id_vars=['ID', 'Date'])
     # temp variables: the open month from Date, and the month encoded in the column name
     .assign(Mth=lambda x: x['Date'].dt.month,
             oda_detail=lambda x: x.variable.str.split('_').str[-1])
     .sort_values(['ID', 'Mth'])
     # keep only rows whose column month is on or after the open month
     .loc[lambda x: x['Mth'].astype(int).le(x['oda_detail'].astype(int))]
     # group and take the mean
     .groupby(['ID', 'Date'])['value'].mean()
     .reset_index()
     .drop(['ID', 'Date'], axis=1)
     .rename({'value': 'Mean'}, axis=1))
Join back to the original dataframe:
pd.concat([df, M], axis=1)
ID Date Unit_sold_8 Unit_sold_9 Unit_sold_10 Unit_sold_11 Unit_sold_12 Mean
0 1 2019-08-01 12 0 12 0 21 9.00
1 2 2019-09-01 0 23 24 0 22 17.25
2 3 2019-10-23 0 0 35 44 17 32.00
3 4 2019-11-12 0 0 0 56 75 65.50
4 5 2019-11-30 0 0 0 82 63 72.50
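A vectorized alternative that avoids the melt entirely: mask out the columns that precede each row's open month and average the rest. This is my own sketch, and it assumes every date falls in the same year as the month numbers in the column names:
import numpy as np

unit_cols = [c for c in df.columns if c.startswith('Unit_sold_')]
col_months = np.array([int(c.split('_')[-1]) for c in unit_cols])

vals = df[unit_cols].to_numpy(dtype=float)
open_months = df['Date'].dt.month.to_numpy()[:, None]
# NaN out the months before opening, then take the row-wise mean of the rest
df['Mean'] = np.nanmean(np.where(col_months >= open_months, vals, np.nan), axis=1)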
I have a dataframe like this:
day time category count
1 1 a 13
1 2 a 47
1 3 a 1
1 5 a 2
1 6 a 4
2 7 a 14
2 2 a 10
2 1 a 9
2 4 a 2
2 6 a 1
I want to group by day and category and get a vector of the counts per time, where time can be between 1 and 10 (I have the max and min of time stored in two variables).
This is how I want the resulting dataframe to look:
day category count
1 a [13,47,1,0,2,4,0,0,0,0]
2 a [9,10,0,2,0,1,14,0,0,0]
Does anyone know how to turn this aggregation into a vector?
Use reindex with MultiIndex.from_product to add the missing time values, then groupby with list:
df = df.set_index(['day','time', 'category'])
a = df.index.levels[0]
b = range(1,11)
c = df.index.levels[2]
df = df.reindex(pd.MultiIndex.from_product([a,b,c], names=df.index.names), fill_value=0)
df = df.groupby(['day','category'])['count'].apply(list).reset_index()
print (df)
day category count
0 1 a [13, 47, 1, 0, 2, 4, 0, 0, 0, 0]
1 2 a [9, 10, 0, 2, 0, 1, 14, 0, 0, 0]
EDIT:
df = (df.set_index(['day','time', 'category'])['count']
.unstack(1, fill_value=0)
.reindex(columns=range(1,11), fill_value=0))
print (df)
time 1 2 3 4 5 6 7 8 9 10
day category
1 a 13 47 1 0 2 4 0 0 0 0
2 a 9 10 0 2 0 1 14 0 0 0
df = df.apply(list, axis=1).reset_index(name='count')
print (df)
day ... count
0 1 ... [13, 47, 1, 0, 2, 4, 0, 0, 0, 0]
1 2 ... [9, 10, 0, 2, 0, 1, 14, 0, 0, 0]
[2 rows x 3 columns]
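For completeness, the same reshape can be written with pivot_table, starting again from the original long-form df; a sketch (pivot_table's default mean aggregation is harmless here because each day/time/category combination appears once):
out = (df.pivot_table(index=['day', 'category'], columns='time',
                      values='count', fill_value=0)
         .reindex(columns=range(1, 11), fill_value=0)  # make sure all 10 time slots exist
         .apply(list, axis=1)
         .reset_index(name='count'))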