How to find the min and max elements in a list that has nan in between - python-3.x

I have a dataframe that has a column named "score". I am extracting all the elements from that column into a list. It has 'nan's in between. I wish to identify the min and max of elements before every 'nan' occurs.
I was looking into converting the column into a list and traversing the list until I encounter an "nan". But how do I traverse back to find the min and max elements right before the nan?
This is the code I wrote to convert a column of a dataframe into a list and then identify the "nan".
import math

score_list = description_df['score'].tolist()
for i in score_list:
    print(i)
    if math.isnan(i):
        print("\n")
Suppose my data looks like this,
11.03680137760893
5.351482041139766
10.10019513222711
nan
0.960990030082931
nan
6.46983084276682
32.46794015293125
nan
Then, I should be able to identify 11.03680137760893 as the max and 5.351482041139766 as the min before the first "nan"; 0.960990030082931 as both the min and the max between the first and second "nan"; and 32.46794015293125 as the max and 6.46983084276682 as the min between the second and third "nan".

You can create groups by testing missing values with Series.isna and Series.cumsum, aggregate with GroupBy.agg using min and max, and finally remove the all-missing rows with DataFrame.dropna:
df = df.groupby(df['score'].isna().cumsum())['score'].agg(['min','max']).dropna()
print (df)
            min        max
score
0      5.351482  11.036801
1      0.960990   0.960990
2      6.469831  32.467940
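For reference, here is the same idea run end to end on the question's sample data (a minimal, self-contained sketch; the column is rebuilt from the values listed above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'score': [11.03680137760893, 5.351482041139766,
                             10.10019513222711, np.nan, 0.960990030082931,
                             np.nan, 6.46983084276682, 32.46794015293125,
                             np.nan]})

# each nan bumps the cumulative sum, so the rows between nans share a group id;
# min/max skip NaN, and dropna drops the group holding only the trailing nan
groups = df['score'].isna().cumsum()
print(df.groupby(groups)['score'].agg(['min', 'max']).dropna())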

You can keep two running variables, cur_max and cur_min, that start from infinite sentinel values; each time you find a nan, print (or store) them and reset:
import math

score_list = description_df['score'].tolist()
# -inf/+inf sentinels handle negative scores too; cur_max/cur_min
# avoid shadowing the built-in max() and min()
cur_max = -math.inf
cur_min = math.inf
for i in score_list:
    print(i)
    if math.isnan(i):
        print("max =", cur_max, "min =", cur_min, "\n")
        cur_max = -math.inf
        cur_min = math.inf
    else:
        if i > cur_max:
            cur_max = i
        if i < cur_min:
            cur_min = i
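Note that the final group is only printed here because the sample data happens to end with a nan; if the list can end with ordinary numbers, append a trailing sentinel such as score_list.append(float('nan')) before the loop so the last group is flushed as well.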

Related

Pandas: Compare row with all other rows by multiple conditions

I want to compare all the rows (one by one) with all the other rows in the following extract of my dataframe.
Idx  ECTRL ID   Latitude   Longitude
0    186858227  53.617750  30.866759
1    186858229  40.569012  35.138237
2    186858235  38.915970  38.782447
3    186858295  39.737594  37.005481
4    186858299  48.287601  15.487567
I want to extract "ECTRL ID"-Combinations (e.g. 186858235, 186858295), where the differences of longitude and latitude are both less than 2.
e.g.:
df.iloc[2]["Latitude"] - df.iloc[3]["Latitude"] <= 2
If it's true, then I want to return it as a tuple and append it to a list.
(186858235, 186858295)
It works with a loop, but it's pretty slow:
l = []
for idx, row in data.iterrows():
    for j, row2 in data.iterrows():
        if np.absolute(row['Longitude'] - row2['Longitude']) < 0.05 and np.absolute(row['Latitude'] - row2['Latitude']) < 0.05 and row["ECTRL ID"] != row2["ECTRL ID"]:
            tup = (row["ECTRL ID"], row2["ECTRL ID"])
            l.append(tup)
Is there any way to make this faster with the built-in pandas functions? I have not found a way without looping.
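No answer is recorded for this one, but a common vectorized approach is to build the full N x N comparison with numpy broadcasting, feasible as long as an N x N boolean matrix fits in memory. A sketch (the threshold of 2 follows the question's prose; the loop above used 0.05):
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'ECTRL ID': [186858227, 186858229, 186858235, 186858295, 186858299],
    'Latitude': [53.617750, 40.569012, 38.915970, 39.737594, 48.287601],
    'Longitude': [30.866759, 35.138237, 38.782447, 37.005481, 15.487567],
})

lat = data['Latitude'].to_numpy()
lon = data['Longitude'].to_numpy()
ids = data['ECTRL ID'].to_numpy()

# broadcasting builds every pairwise absolute difference at once: shape (N, N)
close = (np.abs(lat[:, None] - lat[None, :]) < 2) & \
        (np.abs(lon[:, None] - lon[None, :]) < 2)
np.fill_diagonal(close, False)  # exclude each row paired with itself

i, j = np.nonzero(close)
pairs = list(zip(ids[i], ids[j]))  # contains both (a, b) and (b, a)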

How to check maximum of consecutive values greater than 5

I want to check, in a dataframe column, the maximum number of consecutive rows with values greater than 5.
df=pd.DataFrame({'A':[3,4,7,8,11,6,15,3,15,16,87]})
out:
df=pd.DataFrame({'count_greater_5_max':[5]})
Use:
# compare greater than 5
a = df.A.gt(5)
# running sum
b = a.cumsum()
# counter only for consecutive values
out = b - b.mask(a).ffill()
# maximum value of counter
print(int(out.max()))
5
df = pd.DataFrame({'count_greater_5_max': [int(out.max())]})
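To see why this works, here is the pipeline rerun end to end with the intermediate series spelled out (a self-contained rerun on the sample input; the inline values were computed by hand):
import pandas as pd

df = pd.DataFrame({'A': [3, 4, 7, 8, 11, 6, 15, 3, 15, 16, 87]})
a = df.A.gt(5)   # F F T T T T T F T T T
b = a.cumsum()   # 0 0 1 2 3 4 5 5 6 7 8
# b.mask(a) keeps the running sum only where A <= 5, and ffill carries
# that value forward across each run of Trues: 0 0 0 0 0 0 0 5 5 5 5
out = b - b.mask(a).ffill()   # 0 0 1 2 3 4 5 0 1 2 3
print(int(out.max()))         # 5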

How to return index of a row 60 seconds before current row

I have a large (>32 M rows) Pandas dataframe.
In column 'Time_Stamp' I have a Unix timestamp in seconds. These values are not linear, there are gaps, and some timestamps can be duplicated (ex: 1, 2, 4, 6, 6, 9,...).
I would like to set column 'Result' of current row to the index of the row that is 60 seconds before current row (closest match if there are no rows exactly 60 seconds before current row, and if more than one match, take maximum of all matches).
I've tried this to first get the list of indexes, but it always returns an empty list:
df.index[df['Time_Stamp'] <= df.Time_Stamp-60].tolist()
I cannot use a for loop due to the large number of rows.
Edit 20.01.2020:
Based on comment below, I'm adding a sample dataset, and instead of returning the index I want to return the column Value:
In [2]: df
Out[2]:
   Time_Stamp  Value
0           1    2.4
1           2    3.1
2           4    6.3
3           6    7.2
4           6    6.1
5           9    6.0
So with the valuable help of ALollz, I managed to achieve what I wanted to do in the end. Here's my code:
# gap to look back (the question asks for 60 seconds)
Time_gap = 60
# make a copy of the dataframe
df2 = df[['Time_Stamp', 'Value']].copy()
# add Time_gap to Time_Stamp in df2
df2['Time_Stamp'] = df2.Time_Stamp + Time_gap
# sort df2 on Time_Stamp
df2.sort_values(by='Time_Stamp', ascending=True, inplace=True)
df2 = df2.reset_index(drop=True)
df3 = pd.merge_asof(df, df2, on='Time_Stamp', direction='forward')
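On the six-row sample above a 60-second gap matches nothing, so here is the same recipe rerun with an illustrative Time_gap of 2 (the gap value and the suffixes argument are the only changes; suffixes just keeps the two Value columns distinguishable):
import pandas as pd

df = pd.DataFrame({'Time_Stamp': [1, 2, 4, 6, 6, 9],
                   'Value': [2.4, 3.1, 6.3, 7.2, 6.1, 6.0]})
Time_gap = 2  # illustrative; the real data uses 60

df2 = df[['Time_Stamp', 'Value']].copy()
df2['Time_Stamp'] = df2.Time_Stamp + Time_gap  # shift every row forward by the gap
df2 = df2.sort_values('Time_Stamp').reset_index(drop=True)

# direction='forward' picks the nearest shifted row at or after each row,
# i.e. the row roughly Time_gap seconds earlier in the original data
df3 = pd.merge_asof(df, df2, on='Time_Stamp', direction='forward',
                    suffixes=('', '_before'))
print(df3)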

Normalize Column Values by Monthly Averages with added Group dimension

Initial Note
I already got this running, but it takes a very long time to execute. My DataFrame is around 500MB large. I am hoping to hear some feedback on how to execute this as quickly as possible.
Problem Statement
I want to normalize the DataFrame columns by the mean of the column's values during each month. An added complexity is that I have a column named group which denotes the sensor with which the parameter (column) was measured. Therefore, the analysis needs to iterate over each group and each month.
DF example
                     X  Y  Z  group
2019-02-01 09:30:07  1  2  1  'grp1'
2019-02-01 09:30:23  2  4  3  'grp2'
2019-02-01 09:30:38  3  6  5  'grp1'
...
Code (Functional, but slow)
This is the code that I used. Coding annotations provide descriptions of most lines. I recognize that the three for loops are causing this runtime issue, but I do not have the foresight to see a way around it. Does anyone know a better approach?
# Get mean monthly values for each group
mean_per_month_unit = process_df.groupby('group').resample('M', how='mean')
# Store the monthly dates created in the last line into a list called month_dates
month_dates = mean_per_month_unit.index.get_level_values(1)
# Place date on multiIndex columns. future note: use df[DATE, COL_NAME][UNIT] to access mean value
mean_per_month_unit = mean_per_month_unit.unstack().swaplevel(0, 1, 1).sort_index(axis=1)
divide_df = pd.DataFrame().reindex_like(df)
process_cols.remove('group')
for grp in group_list:
    print(grp)
    # Iterate through months
    for mnth in month_dates:
        # Build a mask for this month and group
        mask = (df.index.month == mnth.month) & (df['group'] == grp)
        for col in process_cols:
            # Set values of divide_df
            divide_df.iloc[mask.tolist(), divide_df.columns.get_loc(col)] = mean_per_month_unit[mnth, col][grp]
# Divide process_df by divide_df
final_df = process_df / divide_df.values
EDIT: Example data
Here is the data in CSV format.
EDIT2: Current code (according to current answer)
def normalize_df(df):
    df['month'] = df.index.month
    print(df['month'])
    df['year'] = df.index.year
    print(df['year'])

    def find_norm(x, df_col_list):  # x is a row in the dataframe, df_col_list is the list of columns to normalize
        agg = df.groupby(by=['group', 'month', 'year'], as_index=True).mean()
        print("###################", x.name, x['month'])
        for column in df_col_list:  # iterate over the column list, find the mean from the aggregation, and divide the value by it
            print(column)
            mean_col = agg.loc[(x['group'], x['month'], x['year']), column]
            print(mean_col)
            col_name = "norm" + str(column)
            x[col_name] = x[column] / mean_col  # norm
        return x

    normalize_cols = df.columns.tolist()
    normalize_cols.remove('group')
    #normalize_cols.remove('mode')
    df2 = df.apply(find_norm, df_col_list=normalize_cols, axis=1)
The code runs perfectly for one iteration and then it fails with the error:
KeyError: ('month', 'occurred at index 2019-02-01 11:30:17')
As I said, it runs correctly once, but then it iterates over the same row again and fails. I see from the df.apply() documentation that the first row always runs twice; I'm just not sure why it fails the second time through.
Assuming that the requirement is to group the columns by mean and the month, here is another approach:
Create new columns - month and year from the index. df.index.month can be used for this provided the index is of type DatetimeIndex
type(df.index) # df is the original dataframe
#pandas.core.indexes.datetimes.DatetimeIndex
df['month'] = df.index.month
df['year'] = df.index.year # added year assuming the grouping occurs per grp per month per year. No need to add this column if year is not to be considered.
Now, group over (grp, month, year) and aggregate to find the mean of every column:
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
Use a function to calculate the normalized values and use apply() over the original dataframe
def find_norm(x, df_col_list):  # x is a row in the dataframe, df_col_list is the list of columns to normalize
    for column in df_col_list:  # iterate over the column list, find the mean from the aggregation, and divide the value by the mean
        mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col  # norm
    return x
df2 = df.apply(find_norm, df_col_list = ['A','B','C'], axis=1)
#df2 will now have 3 additional columns - normA, normB, normC
df2:
                     A  B  C  grp  month  year     normA  normB  normC
2019-02-01 09:30:07  1  2  3    1      2  2019  0.666667    0.8    1.5
2019-03-02 09:30:07  2  3  4    1      3  2019  1.000000    1.0    1.0
2019-02-01 09:40:07  2  3  1    2      2  2019  1.000000    1.0    1.0
2019-02-01 09:38:07  2  3  1    1      2  2019  1.333333    1.2    0.5
Alternatively, for step 3, one can join the agg and df dataframes and find the norm.
Hope this helps!
Here is what the code would look like:
# Step 1
df['month'] = df.index.month
df['year'] = df.index.year  # added year assuming the grouping occurs per grp per month per year

# Step 2
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()

# Step 3
def find_norm(x, df_col_list):  # x is a row in the dataframe, df_col_list is the list of columns to normalize
    for column in df_col_list:  # iterate over the column list, find the mean from the aggregation, and divide the value by the mean
        mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col  # norm
    return x

df2 = df.apply(find_norm, df_col_list=['A', 'B', 'C'], axis=1)
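The join alternative mentioned above can also be written with groupby().transform, which broadcasts each group's mean back onto its rows and avoids the row-wise apply entirely. A minimal sketch, assuming the grp/month/year columns from step 1 already exist:
# one mean per (grp, month, year) group, broadcast back to every row
means = df.groupby(['grp', 'month', 'year'])[['A', 'B', 'C']].transform('mean')
df[['normA', 'normB', 'normC']] = (df[['A', 'B', 'C']] / means).to_numpy()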

Take the mean of n numbers in a DataFrame column and "drag" formula down similar to Excel

I'm trying to take the mean of n numbers in a pandas DataFrame column and "drag" the formula down each row to get the respective mean.
Let's say there are 6 rows of data with "Numbers" in column A and "Averages" in column B. I want to take the average of A1:A2, then "drag" that formula down to get the average of A2:A3, A3:A4, etc.
import statistics
import pandas as pd

numbers = [55, 6, 77, 75, 9, 127, 13]
finallist = pd.DataFrame(numbers)
finallist.columns = ['Numbers']
Below gives me the average of rows 0:2 in the Numbers column. So calling out the rows with .iloc[0:2] works, but when I try to shift down a row it doesn't work:
finallist['Average'] = statistics.mean(finallist['Numbers'].iloc[0:2])
Below I'm trying to take the average of the first two rows, then shift down by 1 as you move down the rows, but I get a value of NaN:
finallist['Average'] = statistics.mean(finallist['Numbers'].iloc[0:2].shift(1))
I expected .iloc[0:2].shift(1) to shift the mean function down 1 row but still apply to 2 total rows, but I got a value of NaN.
What's happening in your shift(1) approach is that you're actually shifting the index in your data "down" once, so this code:
df['Numbers'].iloc[0:2].shift(1)
Produces the output:
0     NaN
1    55.0
Then you take the average of these two, which evaluates to NaN, and then you assign that single value to every element of the Averages Series here:
df['Averages'] = statistics.mean(df['Numbers'].iloc[0:2].shift(1))
You can instead use rolling() combined with mean() to get a sliding average across the entire data frame like this:
import pandas as pd
values = [55,6,77,75,9,127,13]
df = pd.DataFrame(values)
df.columns = ['Numbers']
df['Averages'] = df['Numbers'].rolling(2, min_periods=1).mean()
This produces the following output:
   Numbers  Averages
0       55      55.0
1        6      30.5
2       77      41.5
3       75      76.0
4        9      42.0
5      127      68.0
6       13      70.0
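Note that rolling(2) is a trailing window, so each pair's average lands on the second row of the pair. If you want the Excel-style forward window from the question (B1 = AVERAGE(A1:A2)), one option is to shift the result up a row; a small sketch:
# forward-looking pair average: row i gets mean(A[i], A[i+1]);
# the last row has no successor, so it comes out NaN
df['Averages'] = df['Numbers'].rolling(2).mean().shift(-1)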
