Find and Add Missing Column Values Based on Index Increment Python Pandas Dataframe - python-3.x

Good Afternoon!
I have a pandas dataframe with an index and a count.
dictionary = {1:5,2:10,4:3,5:2}
df = pd.DataFrame.from_dict(dictionary , orient = 'index' , columns = ['count'])
What I want to do is check that the index increases in steps of 1 from df.index.min() to df.index.max(). If a value is missing (in my case, 3), I want to add it to the index with a count of 0.
The output should look like df2 below, but produced programmatically so I can use it on a much bigger dataframe.
RESULTS EXAMPLE DF:
dictionary2 = {1:5,2:10,3:0,4:3,5:2}
df2 = pd.DataFrame.from_dict(dictionary2 , orient = 'index' , columns = ['count'])
Thank you much!!!

Ensure the index is sorted:
df = df.sort_index()
Create an array that runs from the minimum index to the maximum index:
import numpy as np
complete_array = np.arange(df.index.min(), df.index.max() + 1)
Reindex, filling the missing rows with 0, and optionally convert to a pandas nullable integer dtype:
df.reindex(complete_array, fill_value=0).astype("Int16")
count
1 5
2 10
3 0
4 3
5 2
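For reference, the steps above can be combined into one self-contained sketch using the sample data from the question. The names full_index and filled are only illustrative, and the sketch assumes an integer index:

```python
import numpy as np
import pandas as pd

# Sample data from the question: index 3 is missing
df = pd.DataFrame.from_dict({1: 5, 2: 10, 4: 3, 5: 2}, orient="index", columns=["count"])

# Every integer from the smallest to the largest index value, inclusive
full_index = np.arange(df.index.min(), df.index.max() + 1)

# Rows missing from the original index are created with count = 0
filled = df.sort_index().reindex(full_index, fill_value=0)
print(filled)
```

Because fill_value is applied during the reindex, no NaN is ever introduced, so the count column keeps its integer dtype without an explicit cast.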

Related

Apply function to specific rows of dataframe column

I have the following column in a dataframe:
COLUMN_NAME
1
0
1
1
65280
65376
65280
I want to convert the 5-digit values in a column to their corresponding binary values. I know how to convert a value with the bin() function, but I don't know how to apply it only to the rows that have 5 digits.
Note that the column contains only values with either 1 or 5 digits. The 1-digit values are only ever 1 or 0.
import pandas as pd
import numpy as np
data = {'c': [1,0,1,1,65280,65376,65280] }
df = pd.DataFrame (data, columns = ['c'])
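# create another column 'clen' holding the number of digits in 'c'
df['clen'] = df['c'].astype(str).map(len)
# cast to object so the column can hold both ints and binary strings
df['c'] = df['c'].astype(object)
# apply bin() only to the rows whose value has 5 digits
mask = df['clen'] == 5
df.loc[mask, 'c'] = df.loc[mask, 'c'].apply(bin)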
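A possible variant that skips the helper column is sketched below. It assumes, as the question states, that every value is either 0, 1, or a 5-digit number, so a simple numeric threshold identifies the rows to convert. The name converted is illustrative:

```python
import pandas as pd

df = pd.DataFrame({"c": [1, 0, 1, 1, 65280, 65376, 65280]})

# 5-digit values are exactly those >= 10000; leave everything else untouched
converted = df["c"].map(lambda x: bin(x) if x >= 10000 else x)
print(converted)
```

The result is an object-dtype Series that mixes integers and binary strings, which is what the question asks for but is slower to work with than a purely numeric column.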

How to return index of a row 60 seconds before current row

I have a large (>32 M rows) Pandas dataframe.
In column 'Time_Stamp' I have a Unix timestamp in seconds. These values are not linear, there are gaps, and some timestamps can be duplicated (ex: 1, 2, 4, 6, 6, 9,...).
I would like to set column 'Result' of current row to the index of the row that is 60 seconds before current row (closest match if there are no rows exactly 60 seconds before current row, and if more than one match, take maximum of all matches).
I've tried this to first get the list of indexes, but it always returns an empty list:
df.index[df['Time_Stamp'] <= df.Time_Stamp-60].tolist()
I cannot use a for loop due to the large number of rows.
Edit 20.01.2020:
Based on comment below, I'm adding a sample dataset, and instead of returning the index I want to return the column Value:
In [2]: df
Out[2]:
Time_Stamp Value
0 1 2.4
1 2 3.1
2 4 6.3
3 6 7.2
4 6 6.1
5 9 6.0
So, with the precious help of ALollz, I managed to achieve what I wanted in the end. Here's my code:
# look-back window in seconds (60 in the question)
Time_gap = 60
# make a copy of the dataframe
df2 = df[['Time_Stamp','Value']].copy()
# add Time_gap to Time_Stamp in df2
df2['Time_Stamp'] = df2.Time_Stamp + Time_gap
# sort df2 on Time_Stamp
df2.sort_values(by='Time_Stamp', ascending=True, inplace=True)
df2 = df2.reset_index(drop=True)
df3 = pd.merge_asof(df, df2, on='Time_Stamp', direction='forward')
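To see the shifted-key idea on the sample dataset, here is a minimal sketch. It is not the author's exact code: it uses a gap of 3 instead of 60 so the tiny sample produces matches, and it uses direction='backward', which picks the last row at or before the target time (the code above uses 'forward', which picks the first row at or after it). The names gap, lookup, and Value_prev are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Time_Stamp": [1, 2, 4, 6, 6, 9],
    "Value": [2.4, 3.1, 6.3, 7.2, 6.1, 6.0],
})

gap = 3  # assumption: 3 instead of 60 so the small sample has matches

# Shift the lookup keys forward by `gap`, so a row at time t becomes
# the candidate for rows at time t + gap and later
lookup = df[["Time_Stamp", "Value"]].rename(columns={"Value": "Value_prev"})
lookup["Time_Stamp"] = lookup["Time_Stamp"] + gap
lookup = lookup.sort_values("Time_Stamp").reset_index(drop=True)

# For each row, take the last lookup row whose shifted key is <= Time_Stamp
out = pd.merge_asof(df, lookup, on="Time_Stamp", direction="backward")
print(out)
```

Rows with nothing at least gap seconds earlier get NaN in Value_prev. Both frames must be sorted on Time_Stamp for merge_asof to accept them.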

Normalize Column Values by Monthly Averages with added Group dimension

Initial Note
I already got this running, but it takes a very long time to execute. My DataFrame is around 500MB large. I am hoping to hear some feedback on how to execute this as quickly as possible.
Problem Statement
I want to normalize the DataFrame columns by the mean of the column's values during each month. An added complexity is that I have a column named group which denotes a different sensor in which the parameter (column) was measured. Therefore, the analysis needs to iterate around group and each month.
DF example
X Y Z group
2019-02-01 09:30:07 1 2 1 'grp1'
2019-02-01 09:30:23 2 4 3 'grp2'
2019-02-01 09:30:38 3 6 5 'grp1'
...
Code (Functional, but slow)
This is the code that I used. Coding annotations provide descriptions of most lines. I recognize that the three for loops are causing this runtime issue, but I do not have the foresight to see a way around it. Does anyone know any
# Get mean monthly values for each group
mean_per_month_unit = process_df.groupby('group').resample('M', how='mean')
# Store the monthly dates created in last line into a list called month_dates
month_dates = mean_per_month_unit.index.get_level_values(1)
# Place date on multiIndex columns. future note: use df[DATE, COL_NAME][UNIT] to access mean value
mean_per_month_unit = mean_per_month_unit.unstack().swaplevel(0,1,1).sort_index(axis=1)
divide_df = pd.DataFrame().reindex_like(df)
process_cols.remove('group')
for grp in group_list:
    print(grp)
    # Iterate through month
    for mnth in month_dates:
        # Make mask where month and group
        mask = (df.index.month == mnth.month) & (df['group'] == grp)
        for col in process_cols:
            # Set values of divide_df
            divide_df.iloc[mask.tolist(), divide_df.columns.get_loc(col)] = mean_per_month_unit[mnth, col][grp]
# Divide process_df with divide_df
final_df = process_df / divide_df.values
EDIT: Example data
Here is the data in CSV format.
EDIT2: Current code (according to current answer)
def normalize_df(df):
    df['month'] = df.index.month
    print(df['month'])
    df['year'] = df.index.year
    print(df['year'])
    def find_norm(x, df_col_list): # x is a row in dataframe, col_list is the list of columns to normalize
        agg = df.groupby(by=['group', 'month', 'year'], as_index=True).mean()
        print("###################", x.name, x['month'])
        for column in df_col_list: # iterate over col list, find mean from aggregations, and divide the value by
            print(column)
            mean_col = agg.loc[(x['group'], x['month'], x['year']), column]
            print(mean_col)
            col_name = "norm" + str(column)
            x[col_name] = x[column] / mean_col # norm
        return x
    normalize_cols = df.columns.tolist()
    normalize_cols.remove('group')
    #normalize_cols.remove('mode')
    df2 = df.apply(find_norm, df_col_list = normalize_cols, axis=1)
The code runs perfectly for one iteration and then it fails with the error:
KeyError: ('month', 'occurred at index 2019-02-01 11:30:17')
As I said, it runs correctly once. However, it iterates over the same row again and then fails. I see according to df.apply() documentation that the first row always runs twice. I'm just not sure why this fails on the second time through.
Assuming that the requirement is to divide each column by its mean for the same group and month, here is another approach:
Create new columns - month and year from the index. df.index.month can be used for this provided the index is of type DatetimeIndex
type(df.index) # df is the original dataframe
#pandas.core.indexes.datetimes.DatetimeIndex
df['month'] = df.index.month
df['year'] = df.index.year # added year assuming the grouping occurs per grp per month per year. No need to add this column if year is not to be considered.
Now, group over (grp, month, year) and aggregate to find mean of every column. (Added year assuming the grouping occurs per grp per month per year. No need to add this column if year is not to be considered.)
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
Use a function to calculate the normalized values and use apply() over the original dataframe
def find_norm(x, df_col_list): # x is a row in dataframe, col_list is the list of columns to normalize
    for column in df_col_list: # iterate over col list, find mean from aggregations, and divide the value by the mean.
        mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col # norm
    return x
df2 = df.apply(find_norm, df_col_list = ['A','B','C'], axis=1)
#df2 will now have 3 additional columns - normA, normB, normC
df2:
A B C grp month year normA normB normC
2019-02-01 09:30:07 1 2 3 1 2 2019 0.666667 0.8 1.5
2019-03-02 09:30:07 2 3 4 1 3 2019 1.000000 1.0 1.0
2019-02-01 09:40:07 2 3 1 2 2 2019 1.000000 1.0 1.0
2019-02-01 09:38:07 2 3 1 1 2 2019 1.333333 1.2 0.5
Alternatively, for step 3, one can join the agg and df dataframes and find the norm.
Hope this helps!
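Since row-wise apply() can be slow on a large frame, a vectorized variant may be worth trying. The sketch below is an assumption-laden alternative, not the answer's method: it uses groupby(...).transform('mean') to broadcast each group's monthly mean back onto the rows, then divides once. It reuses the four sample rows shown in the df2 output above; keys, value_cols, means, and normed are illustrative names:

```python
import pandas as pd

idx = pd.to_datetime([
    "2019-02-01 09:30:07",
    "2019-03-02 09:30:07",
    "2019-02-01 09:40:07",
    "2019-02-01 09:38:07",
])
df = pd.DataFrame(
    {"A": [1, 2, 2, 2], "B": [2, 3, 3, 3], "C": [3, 4, 1, 1], "grp": [1, 1, 2, 1]},
    index=idx,
)

value_cols = ["A", "B", "C"]
# Group by sensor, year and month without adding helper columns
keys = [df["grp"], df.index.year, df.index.month]

# transform('mean') returns a frame aligned row-for-row with df
means = df.groupby(keys)[value_cols].transform("mean")
normed = (df[value_cols] / means).add_prefix("norm")
df2 = df.join(normed)
print(df2)
```

This performs one grouped aggregation and one division instead of a Python-level call per row, which is usually where the time goes on large frames.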
Here is what the code would look like:
# Step 1
df['month'] = df.index.month
df['year'] = df.index.year # added year assuming the grouping occurs
# Step 2
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
# Step 3
def find_norm(x, df_col_list): # x is a row in dataframe, col_list is the list of columns to normalize
    for column in df_col_list: # iterate over col list, find mean from aggregations, and divide the value by the mean.
        mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col # norm
    return x
df2 = df.apply(find_norm, df_col_list = ['A','B','C'], axis=1)

conditionally multiply values in DataFrame row

here is an example DataFrame:
df = pd.DataFrame([[1,0.5,-0.3],[0,-4,7],[1,0.12,-.06]], columns=['condition','value1','value2'])
I would like to apply a function which multiplies the values ('value1' and 'value2') in each row by 100 if the value in the 'condition' column of that row is equal to 1; otherwise, the row is left as is.
presumably some usage of .apply with a lambda function would work here but I am not able to get the syntax right. e.g.
df.apply(lambda x: 100*x if x['condition'] == 1, axis=1)
will not work
the desired output after applying this operation would be:
As simple as
df.loc[df.condition==1,'value1':]*=100
import numpy as np
df['value1'] = np.where(df['condition']==1, df['value1']*100, df['value1'])
df['value2'] = np.where(df['condition']==1, df['value2']*100, df['value2'])
In case of multiple columns:
# create a list of columns you want to apply condition
columns_list = ['value1','value2']
for i in columns_list:
    df[i] = np.where(df['condition']==1, df[i]*100, df[i])
Use df.loc[] with the condition and filter the list of cols to operate then multiply:
l=['value1','value2'] #list of cols to operate on
df.loc[df.condition.eq(1),l]=df.mul(100)
#if condition is just 0 and 1 -> df.loc[df.condition.astype(bool),l]=df.mul(100)
print(df)
Another solution using df.mask() using same list of cols as above:
df[l]=df[l].mask(df.condition.eq(1),df[l]*100)
print(df)
condition value1 value2
0 1 50.0 -30.0
1 0 -4.0 7.0
2 1 12.0 -6.0
np.where takes a boolean mask and picks the second argument where the mask is True and the third where it is False:
import numpy as np
value_cols = ['value1','value2']
mask = (df.condition == 1)
df[value_cols] = np.where(mask.to_numpy()[:, None], df[value_cols].mul(100), df[value_cols])
If you have multiple value columns such as value1, value2 ... and so on, use
value_cols = df.filter(regex=r'value\d').columns
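As a quick check, the following self-contained sketch runs two of the approaches above on the question's sample frame, each on its own copy so the results can be compared. The names scaled_loc and scaled_where are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 0.5, -0.3], [0, -4, 7], [1, 0.12, -0.06]],
    columns=["condition", "value1", "value2"],
)
value_cols = ["value1", "value2"]
mask = df["condition"].eq(1)

# Approach 1: multiply the selected rows and columns in place
scaled_loc = df.copy()
scaled_loc.loc[mask, value_cols] *= 100

# Approach 2: np.where, with the row mask broadcast across the columns
scaled_where = df.copy()
scaled_where[value_cols] = np.where(
    mask.to_numpy()[:, None], df[value_cols] * 100, df[value_cols]
)

print(scaled_loc)
```

Both leave the condition column untouched and only scale the rows where it equals 1.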

How to select pandas dataframe rows with loc using the row index?

I have a big pandas dataframe from which I'm trying to select some rows with .loc. The problem is that the condition I want to use needs an index that is stored in one of the dataframe's columns (the 'index' one). I want to select a row if its value is below a threshold that I look up in a plain list using that index.
>>> df
r v index
1 2 2
2 4 3
3 20 1
>>> list
[3,6,32]
I want something like:
df.loc[ df['v'] < list[ df['index'] ] ]
So something which refers to the index in the studied row of the dataframe.
IIUC, convert the list to an array, and use "index" as the indexer:
v = np.array([3,6,32])
df[df['v'] < v[df['index'] - 1]]
r v index
0 1 2 2
1 2 4 3
Where,
v[df['index'] - 1]
# array([ 6, 32, 3])
r = df.loc[df['v'] < v[df['index'] - 1]].copy()
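Put together as a runnable sketch with the question's data (assuming, as the answer does, that the 'index' column is 1-based; thresholds and per_row_threshold are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"r": [1, 2, 3], "v": [2, 4, 20], "index": [2, 3, 1]})
thresholds = np.array([3, 6, 32])

# Fancy indexing: pick one threshold per row; subtract 1 because the
# 'index' column is 1-based while NumPy arrays are 0-based
per_row_threshold = thresholds[df["index"].to_numpy() - 1]

# Keep the rows whose 'v' is below their own threshold
selected = df.loc[df["v"] < per_row_threshold].copy()
print(selected)
```

The .copy() avoids chained-assignment warnings if the selection is modified later.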
