Pandas resample fill NaN - python-3.x

I have this df:
Timestamp List Power Energy Status
0 2020-01-01 01:05:50 [5, 5, 5] 7000 15000 online
1 2020-01-01 01:06:20 [6, 6, 6] 7500 16000 online
2 2020-01-01 01:08:30 [0, 0, 0] 5 0 offline
...
Now I want to resample it, using .resample as follows:
df2 = df.set_index('Timestamp').resample('min').?
I want the df in 1-minute intervals. Each interval should be filled from its rows as follows:
List: if Status == online: the last entry of the interval, else 0;
Power: if Status == online: the mean value of the interval, else 0;
Energy: if Status == online: the last entry of the interval, else 0;
Status: the last status of the interval.
How do I fill the NaN values that .resample outputs when there is no data in df? E.g. if an interval has no data, the df should be filled as follows: Power = 0; Energy = 0; Status = offline; ...
I tried something like this:
df2 = df.set_index('Timestamp').resample('T').agg({'List': 'last',
                                                   'Power': 'mean',
                                                   'Energy': 'last',
                                                   'Status': 'last'})
and got:
Timestamp List Power Energy Status
0 2020-01-01 01:05 [5, 5, 5] (average of the interval) 15000 online
1 2020-01-01 01:06 [6, 6, 6] (average of the interval) 16000 online
2 2020-01-01 01:07 NaN NaN NaN NaN
3 2020-01-01 01:08 [0, 0, 0] 5 0 offline
Expected outcome:
Timestamp List Power Energy Status
0 2020-01-01 01:05 [5, 5, 5] (average of the interval) 15000 online
1 2020-01-01 01:06 [6, 6, 6] (average of the interval) 16000 online
2 2020-01-01 01:07 [0, 0, 0] 0 0 offline
3 2020-01-01 01:08 [0, 0, 0] 5 0 offline

There is no way to pass a fillna rule that handles each column's NA values separately during .resample().agg(), as the docs show: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html
In your case even interpolation does not work, so try handling each column's NA values manually.
Firstly, let's initialize your sample frame.
import pandas as pd

data = {"Timestamp": {"0": "2020-01-01 01:05:50",
                      "1": "2020-01-01 01:06:20",
                      "2": "2020-01-01 01:08:30"},
        "List": {"0": [5, 5, 5],
                 "1": [6, 6, 6],
                 "2": [0, 0, 0]},
        "Power": {"0": 7000,
                  "1": 7500,
                  "2": 5},
        "Energy": {"0": 15000,
                   "1": 16000,
                   "2": 0},
        "Status": {"0": "online",
                   "1": "online",
                   "2": "offline"},
        }
df = pd.DataFrame(data)
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
# 'T' is the minute alias; newer pandas versions prefer 'min'
df = df.set_index('Timestamp').resample('T').agg({'List': 'last',
                                                  'Power': 'mean',
                                                  'Energy': 'last',
                                                  'Status': 'last'})
Now we can manually replace the NA values in each column separately:
df["List"] = df["List"].fillna("[0, 0, 0]")
df["Status"] = df["Status"].fillna('offline')
df = df.fillna(0)
or, more conveniently, with a dict:
values = {
    'List': '[0, 0, 0]',
    'Status': 'offline',
    'Power': 0,
    'Energy': 0
}
df = df.fillna(value=values)
Timestamp List Power Energy Status
0 2020-01-01 01:05:00 [5, 5, 5] 7000.0 15000.0 online
1 2020-01-01 01:06:00 [6, 6, 6] 7500.0 16000.0 online
2 2020-01-01 01:07:00 [0, 0, 0] 0.0 0.0 offline
3 2020-01-01 01:08:00 [0, 0, 0] 5.0 0.0 offline
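One caveat: fillna cannot take an actual list as a fill value, which is why '[0, 0, 0]' above is a string placeholder. If the List column needs real list objects, a minimal sketch using apply would be:
# anything that is not already a list (i.e. the NaN floats left by resample)
# gets replaced with a default list
df["List"] = df["List"].apply(lambda x: x if isinstance(x, list) else [0, 0, 0])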

Related

Flag the month of default - python pandas

I have a pandas dataframe like this:
pd.DataFrame({
    'customer_id': ['100', '200', '300', '400', '500', '600'],
    'Month1': [1, 1, 1, 1, 1, 1],
    'Month2': [1, 0, 1, 1, 1, 1],
    'Month3': [0, 0, 0, 0, 1, 1],
    'Month4': [0, 0, 0, 0, 0, 1]})
This shows a boolean flag for whether a customer is still paying a loan; the first month with a 0 is the month the customer defaulted. I want an output column that gives the month number in which each customer defaulted.
Output:
pd.DataFrame({
    'customer_id': ['100', '200', '300', '400', '500', '600'],
    'Month1': [1, 1, 1, 1, 1, 1],
    'Month2': [1, 0, 1, 1, 1, 1],
    'Month3': [0, 0, 0, 0, 1, 1],
    'Month4': [0, 0, 0, 0, 0, 1],
    'default_month': [3, 2, 3, 3, 4, np.nan]})
You can check whether all the 'Month' columns in a row are non-zero using ne(0) and all(axis=1), and return np.nan for those rows, meaning the person has not yet defaulted (i.e. your row 5). Then, using eq(0) and idxmax, you can find the first value in each row that equals 0 and grab that column name.
import numpy as np

m = df.filter(like='Month')
df['default_month'] = np.where(m.ne(0).all(1), np.nan,
                               m.eq(0).idxmax(1))
df
customer_id Month1 Month2 Month3 Month4 default_month
0 100 1 1 0 0 Month3
1 200 1 0 0 0 Month2
2 300 1 1 0 0 Month3
3 400 1 1 0 0 Month3
4 500 1 1 1 0 Month4
5 600 1 1 1 1 NaN
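The expected output has month numbers rather than column names. Since the np.where result mixes NaN floats with strings like 'Month3', one way (a sketch) to strip the prefix afterwards:
# pull the digits out of e.g. 'Month3' -> 3.0; NaN stays NaN
df['default_month'] = df['default_month'].str.extract(r'(\d+)', expand=False).astype(float)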
Here's some code to get your result. First we need a list of months in reverse order; from your data, I just pulled them directly from the columns.
months = list(df.columns[1:5])
months.reverse()
months is now ['Month4', 'Month3', 'Month2', 'Month1'].
We iterate backwards so that when we find an earlier month of default, it overwrites the later one:
for (i, m) in enumerate(months):
    mask = df[m] == 0  # check for a default in this month
    df.loc[mask, 'default_month'] = len(months) - i
This returns the output you are looking for.

Can't assign value to cell in multi-index dataframe (assigning to copy / slice of df?)

I am trying to assign a value (the mean of values in another column) to a cell in a multi-index Pandas dataframe, over which I iterate to calculate means over a moving window of a different column. But when I try to assign the value, it doesn't change.
I am not used to working with multi-indexes and have solved several other problems but this one has me stumped for now...
Toy code that reproduces the problem:
import numpy as np
import pandas as pd

tuples = [
    ('AFG', 1963), ('AFG', 1964), ('AFG', 1965), ('AFG', 1966), ('AFG', 1967), ('AFG', 1968),
    ('BRA', 1963), ('BRA', 1964), ('BRA', 1965), ('BRA', 1966), ('BRA', 1967), ('BRA', 1968)
]
index = pd.MultiIndex.from_tuples(tuples)
values = [[12, None], [0, None],
          [12, None], [0, 4],
          [12, 5], [0, 4],
          [12, 2], [0, 4],
          [12, 2], [0, 4],
          [1, 4], [7, 1]]
df = pd.DataFrame(values, columns=['Oil', 'Pop'], index=index)
lag = -2
lead = 0
indicator = 'Pop'
new_indicator = 'Mean_pop'
df[new_indicator] = np.nan
df
Gives:
Oil Pop Mean_pop
AFG 1963 12 NaN NaN
1964 0 NaN NaN
1965 12 NaN NaN
1966 0 4.0 NaN
1967 12 5.0 NaN
1968 0 4.0 NaN
BRA 1963 12 2.0 NaN
1964 0 4.0 NaN
1965 12 2.0 NaN
1966 0 4.0 NaN
1967 1 4.0 NaN
1968 7 1.0 NaN
Then to iterate over the df:
for country, country_df in df.groupby(level=0):
    oldestyear = country_df[indicator].first_valid_index()[1]
    latestyear = country_df[indicator].last_valid_index()[1]
    for t in range(oldestyear, latestyear + 1):
        print(country, oldestyear, latestyear, t)
        print(" For", country, ", calculate mean over ", t + lag, "to", t + lead,
              "and add to row for year", t)
        dftt = country_df.loc[(country, t + lag):(country, t + lead)]
        print(dftt[indicator])
        mean = dftt[indicator].mean(axis=0)
        print("mean for ", indicator, "in", country, "during", t + lag, "to", t + lead, "is", mean)
        df.loc[country, t][new_indicator] = mean
Diagnostic output not pasted, but df looks the same after iterating over it and I get the following warning on some iterations:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
if sys.path[0] == '':
Any pointers will be greatly appreciated.
I think it is as easy as setting the last line to:
df.loc[(country, t), new_indicator] = mean
The chained df.loc[country, t][new_indicator] = mean first selects a slice (possibly a copy) and then assigns to that copy, which is exactly what the warning is about; a single .loc call with the full (row, column) key assigns to the original frame.
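As a side note, since lag = -2 and lead = 0 describe a trailing three-year window, the loops could be replaced with a vectorized sketch (assuming years are consecutive within each country):
# rolling window of 3 = the current year plus the two preceding ones,
# computed per country via the first index level
df[new_indicator] = (df.groupby(level=0)[indicator]
                       .transform(lambda s: s.rolling(3, min_periods=1).mean()))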

Is there a way to extract code that constructs a data frame from the data frame?

I am looking for a way to extract code that constructs a data frame, from the loaded data frame.
Consider the following process.
# Code to construct a df:
df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])
# Obtain the df output:
df
num_legs num_wings num_specimen_seen
falcon 2 2 10
dog 4 0 2
spider 8 0 1
fish 0 0 8
I am looking for an automated reverse process. Suppose I start with the df, which I load from a csv file (example below, same df as above).
df = pd.read_csv('/path_to_data/df.csv', sep='\t')
df
num_legs num_wings num_specimen_seen
falcon 2 2 10
dog 4 0 2
spider 8 0 1
fish 0 0 8
At this point, is there a way to extract the code (listed below) that would construct the df, assuming that I did not have the code to begin with?
df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])
This is not always useful, but I am curious whether it can be done, for portability purposes. For instance, it would allow sharing a single Jupyter notebook without referencing anything external, and allow for fully self-contained, replicable data analysis.
You can get this information using df.to_dict('list') and df.index respectively:
In [9]: df
Out[9]:
num_legs num_wings num_specimen_seen
falcon 2 2 10
dog 4 0 2
spider 8 0 1
fish 0 0 8
In [10]: df.to_dict('list')
Out[10]:
{'num_legs': [2, 4, 8, 0],
'num_wings': [2, 0, 0, 0],
'num_specimen_seen': [10, 2, 1, 8]}
In [11]: df.index
Out[11]: Index(['falcon', 'dog', 'spider', 'fish'], dtype='object')
In [12]: new_df = pd.DataFrame(df.to_dict('list'), index=df.index)
In [13]: new_df
Out[13]:
num_legs num_wings num_specimen_seen
falcon 2 2 10
dog 4 0 2
spider 8 0 1
fish 0 0 8
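If you literally want the constructor code as text (e.g. to paste into a notebook cell), a rough sketch built on the same two pieces is shown below; it only works for values whose repr round-trips cleanly:
# emit a code string that rebuilds df; !r gives the Python literal form
code = (f"df = pd.DataFrame({df.to_dict('list')!r},\n"
        f"                  index={list(df.index)!r})")
print(code)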

Pandas: How to build a column based on another column which is indexed by another one?

I have the dataframe presented below. I tried the solution below, but I am not sure it is a good one.
import numpy as np
import pandas as pd

def creatingDataFrame():
    raw_data = {'code': [1, 2, 3, 2, 3, 3],
                'Region': ['A', 'A', 'C', 'B', 'A', 'B'],
                'var-A': [2, 4, 6, 4, 6, 6],
                'var-B': [20, 30, 40, 50, 10, 20],
                'var-C': [3, 4, 5, 1, 2, 3]}
    df = pd.DataFrame(raw_data, columns=['code', 'Region', 'var-A', 'var-B', 'var-C'])
    return df

if __name__ == "__main__":
    df = creatingDataFrame()
    df['var'] = (np.where(df['Region'] == 'A', 1.0, 0.0) * df['var-A']
                 + np.where(df['Region'] == 'B', 1.0, 0.0) * df['var-B']
                 + np.where(df['Region'] == 'C', 1.0, 0.0) * df['var-C'])
I want the variable var to take the value of column 'var-A', 'var-B', or 'var-C', depending on the region given in column 'Region'.
The result must be
df['var']
Out[50]:
0 2.0
1 4.0
2 5.0
3 50.0
4 6.0
5 20.0
Name: var, dtype: float64
You can try lookup, after stripping the column prefixes so they match the Region values:
df.columns=df.columns.str.split('-').str[-1]
df
Out[255]:
code Region A B C
0 1 A 2 20 3
1 2 A 4 30 4
2 3 C 6 40 5
3 2 B 4 50 1
4 3 A 6 10 2
5 3 B 6 20 3
df.lookup(df.index,df.Region)
Out[256]: array([ 2, 4, 5, 50, 6, 20], dtype=int64)
#df['var']=df.lookup(df.index,df.Region)
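A caveat: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer versions, a sketch of the equivalent NumPy-based indexing (reusing the renamed columns from above) would be:
import numpy as np

vals = df[['A', 'B', 'C']]                     # the per-region value columns
cols = vals.columns.get_indexer(df['Region'])  # column position for each row
df['var'] = vals.to_numpy()[np.arange(len(df)), cols]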

Efficient evaluation of weighted average variable in a Pandas Dataframe

Please consider the dataframe df generated below:
import numpy as np
import pandas as pd

def creatingDataFrame():
    raw_data = {'code': [1, 2, 3, 2, 3, 3],
                'var1': [10, 20, 30, 20, 30, 30],
                'var2': [2, 4, 6, 4, 6, 6],
                'price': [20, 30, 40, 50, 10, 20],
                'sells': [3, 4, 5, 1, 2, 3]}
    df = pd.DataFrame(raw_data, columns=['code', 'var1', 'var2', 'price', 'sells'])
    return df

if __name__ == "__main__":
    df = creatingDataFrame()
    setCode = set(df['code'])
    listDF = []
    for code in setCode:
        dfCode = df[df['code'] == code].copy()
        print(dfCode)
        lenDfCode = len(dfCode)
        if lenDfCode == 1:
            theData = {'code': [dfCode['code'].iloc[0]],
                       'var1': [dfCode['var1'].iloc[0]],
                       'var2': [dfCode['var2'].iloc[0]],
                       'averagePrice': [dfCode['price'].iloc[0]],
                       'totalSells': [dfCode['sells'].iloc[0]]
                       }
        else:
            dfCode['price*sells'] = dfCode['price'] * dfCode['sells']
            sumSells = np.sum(dfCode['sells'])
            sumProducts = np.sum(dfCode['price*sells'])
            dfCode['totalSells'] = sumSells
            av = sumProducts / sumSells
            dfCode['averagePrice'] = av
            theData = {'code': [dfCode['code'].iloc[0]],
                       'var1': [dfCode['var1'].iloc[0]],
                       'var2': [dfCode['var2'].iloc[0]],
                       'averagePrice': [dfCode['averagePrice'].iloc[0]],
                       'totalSells': [dfCode['totalSells'].iloc[0]]
                       }
        dfPart = pd.DataFrame(theData, columns=['code', 'var1', 'var2', 'averagePrice', 'totalSells'])
        listDF.append(dfPart)
    newDF = pd.concat(listDF)
    print(newDF)
I have this dataframe
code var1 var2 price sells
0 1 10 2 20 3
1 2 20 4 30 4
2 3 30 6 40 5
3 2 20 4 50 1
4 3 30 6 10 2
5 3 30 6 20 3
I want to generate the following dataframe:
code var1 var2 averagePrice totalSells
0 1 10 2 20.0 3
0 2 20 4 34.0 5
0 3 30 6 28.0 10
Note that this dataframe is created from the first by evaluating the average price and total sells for each code. Furthermore, var1 and var2 are the same within each code. The Python code above does this, but I know it is inefficient. I believe the desired solution can be obtained using groupby, but I am not able to come up with it.
It is a bit different here: use apply with pd.Series, so each group returns both aggregates at once.
df.groupby(['code', 'var1', 'var2']).apply(
    lambda x: pd.Series({'averagePrice': sum(x['sells'] * x['price']) / sum(x['sells']),
                         'totalSells': sum(x['sells'])})).reset_index()
Out[366]:
code var1 var2 averagePrice totalSells
0 1 10 2 20.0 3.0
1 2 20 4 34.0 5.0
2 3 30 6 28.0 10.0
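For larger frames, apply with a Python-level lambda can itself be slow. A vectorized sketch that precomputes the weighted sum and uses named aggregation would be:
# weighted mean = sum(price*sells) / sum(sells), grouped per code
tmp = df.assign(weighted=df['price'] * df['sells'])
out = (tmp.groupby(['code', 'var1', 'var2'], as_index=False)
          .agg(weighted=('weighted', 'sum'), totalSells=('sells', 'sum')))
out['averagePrice'] = out['weighted'] / out['totalSells']
out = out.drop(columns='weighted')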
