How can I find the highest and lowest values between rows depending on a condition being met in another column in a pd.DataFrame? - python-3.x

I have the following DataFrame, which is regenerated every time I run the script:
df =
index  time                 value  status
0      2020-11-20 20:10:00  10     X
1      2020-11-20 20:20:00  11     X
2      2020-11-20 20:45:00  9      X
3      2020-11-20 20:45:00  5      Y
4      2020-11-20 21:00:00  4      X
5      2020-11-20 21:05:00  2      Y
6      2020-11-20 21:15:00  4      Y
7      2020-11-20 21:20:00  9      X
8      2020-11-20 21:25:00  5      X
The desired output would be:
index  time                 value  status
0      2020-11-20 20:20:00  11     X
1      2020-11-20 20:45:00  5      Y
2      2020-11-20 21:00:00  4      X
3      2020-11-20 21:05:00  2      Y
4      2020-11-20 21:20:00  9      X
So my goal here would be to create a new pd.DataFrame keeping, within each consecutive run of rows with the same status, the lowest value for Y and the highest value for X.
Thanks to everyone in advance for all the assistance and support.

You can do a groupby on consecutive values of your DataFrame where the status is the same, sort each grouped DataFrame by value, and keep either the first or last value of the sorted DataFrame depending on whether the grouped DataFrame has status equal to X or Y.
Note: I noticed the time column of your DataFrame has no impact on the answer, so I didn't include it when I recreated your DataFrame.
import pandas as pd

## the time column doesn't matter in your problem
df = pd.DataFrame({
    'value': [10, 11, 9, 5, 4, 2, 4, 9, 5],
    'status': ['X']*3 + ['Y'] + ['X'] + ['Y']*2 + ['X']*2
})

df_new = pd.DataFrame(columns=df.columns)

## perform a groupby on consecutive values
for _, g in df.groupby((df.status != df.status.shift()).cumsum()):
    g = g.sort_values(by='value')
    ## keep the highest value for X
    if g.status.values[0] == 'X':
        g = g.drop_duplicates(subset=['status'], keep='last')
    ## keep the lowest value for Y
    elif g.status.values[0] == 'Y':
        g = g.drop_duplicates(subset=['status'], keep='first')
    df_new = pd.concat([df_new, g])

df_new = df_new.reset_index(drop=True)
Output:
>>> df_new
   value status
0     11      X
1      5      Y
2      4      X
3      2      Y
4      9      X
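As a more concise alternative, here is a hedged sketch of the same idea that picks one row per consecutive run directly with idxmax/idxmin (the pick helper is mine, not part of the original answer):

runs = (df.status != df.status.shift()).cumsum()

def pick(g):
    # X runs keep the row with the highest value, Y runs the lowest
    idx = g.value.idxmax() if g.status.iat[0] == 'X' else g.value.idxmin()
    return g.loc[[idx]]

df_new = df.groupby(runs, group_keys=False).apply(pick).reset_index(drop=True)

This produces the same five rows as above without sorting each group.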

Related

Stack row under row from two different dataframe using python? [duplicate]

df1 = pd.DataFrame({'a':[1,2,3],'x':[4,5,6],'y':[7,8,9]})
df2 = pd.DataFrame({'b':[10,11,12],'x':[13,14,15],'y':[16,17,18]})
I'm trying to merge the two data frames using the keys from df1. I think I should use pd.merge for this, but how can I tell pandas to place the values in the b column of df2 into the a column of df1? This is the output I'm trying to achieve:
    a   x   y
0   1   4   7
1   2   5   8
2   3   6   9
3  10  13  16
4  11  14  17
5  12  15  18
Just use concat and rename the column for df2 so it aligns:
In [92]:
pd.concat([df1,df2.rename(columns={'b':'a'})], ignore_index=True)
Out[92]:
    a   x   y
0   1   4   7
1   2   5   8
2   3   6   9
3  10  13  16
4  11  14  17
5  12  15  18
Similarly, you can use merge, but you'd need to rename the column as above:
In [103]:
df1.merge(df2.rename(columns={'b':'a'}),how='outer')
Out[103]:
    a   x   y
0   1   4   7
1   2   5   8
2   3   6   9
3  10  13  16
4  11  14  17
5  12  15  18
Use numpy to concatenate the DataFrames, so you don't have to rename all of the columns (or explicitly ignore indexes). np.concatenate also accepts an arbitrary number of arrays, so this extends to any number of DataFrames.
df = pd.DataFrame(np.concatenate((df1.values, df2.values), axis=0))
df.columns = ['a', 'x', 'y']
df
You can rename the columns and then use concat (DataFrame.append also worked historically, but it was deprecated and removed in pandas 2.0):
df2.columns = df1.columns
pd.concat([df1, df2], ignore_index=True)
# df1.append(df2, ignore_index=True)  # pandas < 2.0 only
You can also concatenate both dataframes with vstack from numpy and convert the resulting ndarray to dataframe:
pd.DataFrame(np.vstack([df1, df2]), columns=df1.columns)
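If you'd rather not spell out the rename at all, a small hedged variant is to align df2's columns to df1 positionally with set_axis (this assumes both frames have the same number of columns in the same order):

pd.concat([df1, df2.set_axis(df1.columns, axis=1)], ignore_index=True)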

Divide values of rows based on condition which are of running count

Sample of the table for one id; multiple ids exist in the original df.
id   legend  date        running_count
101  X       24-07-2021  3
101  Y       24-07-2021  5
101  X       25-07-2021  4
101  Y       25-07-2021  6
I want to create a new column holding the division of the running_count values on the basis of the id, legend and date: X/Y for the date 24-07-2021 for a particular id, and so on.
How should I perform the calculation?
If the order X, Y is the same within each id and date, you can divide each value by the next row's value within the group:
df['new'] = df['running_count'].div(df.groupby(['id','date'])['running_count'].shift(-1))
print (df)
    id legend        date  running_count       new
0  101      X  24-07-2021              3  0.600000
1  101      Y  24-07-2021              5       NaN
2  101      X  25-07-2021              4  0.666667
3  101      Y  25-07-2021              6       NaN
If changing the output shape is acceptable, pivot instead:
df1 = df.pivot(index=['id','date'], columns='legend', values='running_count')
df1['new'] = df1['X'].div(df1['Y'])
df1 = df1.reset_index()
print (df1)
legend   id        date  X  Y       new
0       101  24-07-2021  3  5  0.600000
1       101  25-07-2021  4  6  0.666667
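If the X-before-Y row order is not guaranteed, a hedged sketch is to compute the ratio per (id, date) via the pivot above and merge it back onto the long table (unlike the shift answer, this puts the ratio on both the X and Y rows; like the answer above, the list-valued index in pivot needs pandas >= 1.1):

ratio = df.pivot(index=['id', 'date'], columns='legend', values='running_count')
ratio = (ratio['X'] / ratio['Y']).rename('new').reset_index()
out = df.merge(ratio, on=['id', 'date'], how='left')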

Sum in Column based on condition in rows in pandas dataframe [duplicate]

I have a dataframe which I want to plot with matplotlib, but the index column is the time and I cannot plot it.
This is the dataframe (df3) (shown as an image in the original post). When I try the following:
plt.plot(df3['magnetic_mag mean'], df3['YYYY-MO-DD HH-MI-SS_SSS'], label='FDI')
I'm getting an error obviously:
KeyError: 'YYYY-MO-DD HH-MI-SS_SSS'
So what I want to do is add a new extra column to my dataframe (named 'Time') which is just a copy of the index column.
How can I do it?
This is the entire code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#Importing the csv file into df
df = pd.read_csv('university2.csv', sep=";", skiprows=1)

#Changing datetime
df['YYYY-MO-DD HH-MI-SS_SSS'] = pd.to_datetime(df['YYYY-MO-DD HH-MI-SS_SSS'],
                                               format='%Y-%m-%d %H:%M:%S:%f')

#Set index from column
df = df.set_index('YYYY-MO-DD HH-MI-SS_SSS')

#Add Magnetic Magnitude Column
df['magnetic_mag'] = np.sqrt(df['MAGNETIC FIELD X (μT)']**2 + df['MAGNETIC FIELD Y (μT)']**2 + df['MAGNETIC FIELD Z (μT)']**2)

#Subtract Earth's Average Magnetic Field from 'magnetic_mag'
df['magnetic_mag'] = df['magnetic_mag'] - 30

#Copy interesting values
df2 = df[['ATMOSPHERIC PRESSURE (hPa)',
          'TEMPERATURE (C)', 'magnetic_mag']].copy()

#Hourly Average and Standard Deviation for interesting values
df3 = df2.resample('H').agg(['mean', 'std'])
df3.columns = [' '.join(col) for col in df3.columns]
df3.reset_index()

plt.plot(df3['magnetic_mag mean'], df3['YYYY-MO-DD HH-MI-SS_SSS'], label='FDI')
Thank you !!
I think you need reset_index:
df3 = df3.reset_index()
Another possible solution is reset_index with inplace=True, though I think inplace is not good practice:
df3.reset_index(inplace=True)
But if you need new column, use:
df3['new'] = df3.index
You can also handle this directly in read_csv:
df = pd.read_csv('university2.csv',
                 sep=";",
                 skiprows=1,
                 index_col='YYYY-MO-DD HH-MI-SS_SSS',
                 parse_dates=['YYYY-MO-DD HH-MI-SS_SSS'])  # if it doesn't work, use pd.to_datetime
And then omit:
#Changing datetime
df['YYYY-MO-DD HH-MI-SS_SSS'] = pd.to_datetime(df['YYYY-MO-DD HH-MI-SS_SSS'],
                                               format='%Y-%m-%d %H:%M:%S:%f')
#Set index from column
df = df.set_index('YYYY-MO-DD HH-MI-SS_SSS')
EDIT: If MultiIndex or Index is from groupby operation, possible solutions are:
df = pd.DataFrame({'A': list('aaaabbbb'),
                   'B': list('ccddeeff'),
                   'C': range(8),
                   'D': range(4, 12)})
print (df)
   A  B  C   D
0  a  c  0   4
1  a  c  1   5
2  a  d  2   6
3  a  d  3   7
4  b  e  4   8
5  b  e  5   9
6  b  f  6  10
7  b  f  7  11
df1 = df.groupby(['A','B']).sum()
print (df1)
      C   D
A B
a c   1   9
  d   5  13
b e   9  17
  f  13  21
Add parameter as_index=False:
df2 = df.groupby(['A','B'], as_index=False).sum()
print (df2)
   A  B   C   D
0  a  c   1   9
1  a  d   5  13
2  b  e   9  17
3  b  f  13  21
Or add reset_index:
df2 = df.groupby(['A','B']).sum().reset_index()
print (df2)
   A  B   C   D
0  a  c   1   9
1  a  d   5  13
2  b  e   9  17
3  b  f  13  21
You can access the index directly and plot it; the following is an example:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
#Get index in horizontal axis
plt.plot(df.index, df[0])
plt.show()
#Get index in vertical axis
plt.plot(df[0], df.index)
plt.show()
You can also use eval to achieve this:
In [2]: df = pd.DataFrame({'num': range(5), 'date': pd.date_range('2022-06-30', '2022-07-04')}, index=list('ABCDE'))
In [3]: df
Out[3]:
   num       date
A    0 2022-06-30
B    1 2022-07-01
C    2 2022-07-02
D    3 2022-07-03
E    4 2022-07-04
In [4]: df.eval('index_copy = index')
Out[4]:
   num       date index_copy
A    0 2022-06-30          A
B    1 2022-07-01          B
C    2 2022-07-02          C
D    3 2022-07-03          D
E    4 2022-07-04          E
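A further small option (a sketch, not from the answers above): assign returns a copy with the new column added, which avoids mutating df3 in place and gives the asker's desired 'Time' name directly:

df3 = df3.assign(Time=df3.index)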

Find no of days gap for a specific ID when it has a flag X in other column

I want to calculate the number of days between consecutive rows where the 'flag' column equals 'X', within each ID.
The dataframe that I have:
ID  Date       flag
1   1-1-2020   X
1   10-1-2020  null
1   15-1-2020  X
2   1-2-2020   X
2   10-2-2020  X
2   18-2-2020  X
3   15-2-2020  null
The dataframe I want:
ID  Date       flag  no_of_days
1   1-1-2020   X     14
1   10-1-2020  null  null
1   15-1-2020  X     null
2   1-2-2020   X     9
2   10-2-2020  X     8
2   18-2-2020  X     null
3   15-2-2020  null  null
Thanks in advance.
First filter the rows where flag equals X with boolean indexing, then subtract each Date from the next X row's Date within the same ID using DataFrameGroupBy.shift(-1), and finally convert the timedeltas to days with Series.dt.days:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['new'] = df[df['flag'].eq('X')].groupby('ID')['Date'].shift(-1).sub(df['Date']).dt.days
print (df)
   ID       Date flag   new
0   1 2020-01-01    X  14.0
1   1 2020-01-10  NaN   NaN
2   1 2020-01-15    X   NaN
3   2 2020-02-01    X   9.0
4   2 2020-02-10    X   8.0
5   2 2020-02-18    X   NaN
6   3 2020-02-15  NaN   NaN
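For completeness, a minimal sketch that reconstructs the sample frame used above (None stands in for the nulls; the Date strings are day-first, as in the question):

import pandas as pd

df = pd.DataFrame({
    'ID':   [1, 1, 1, 2, 2, 2, 3],
    'Date': ['1-1-2020', '10-1-2020', '15-1-2020',
             '1-2-2020', '10-2-2020', '18-2-2020', '15-2-2020'],
    'flag': ['X', None, 'X', 'X', 'X', 'X', None],
})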

Python correlation matrix 3d dataframe

I have in SQL Server a historical return table by date and asset Id like this:
[Date]  [Asset]  [1DRet]
jan     asset1   0.52
jan     asset2   0.12
jan     asset3   0.07
feb     asset1   0.41
feb     asset2   0.33
feb     asset3   0.21
...
So I need to calculate the correlation matrix for a given date range for all asset combinations: A1,A2; A1,A3; A2,A3.
I'm using pandas, and in my SQL SELECT's WHERE clause I'm filtering the date range and ordering by date.
I've tried pandas df.corr(), numpy.corrcoef, and SciPy, but I'm not able to make them work for my n-variable DataFrame.
The examples I've seen always assume a DataFrame with one asset per column and one row per day.
This is the code block where I'm doing it:
qryRet = "Select * from IndexesValue where Date > '20100901' and Date < '20150901' order by Date"
result = conn.execute(qryRet)
df = pd.DataFrame(data=list(result),columns=result.keys())
df1d = df[['Date','Id_RiskFactor','1DReturn']]
corr = df1d.set_index(['Date','Id_RiskFactor']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
print(corr)
conn.close()
I'm receiving this message:
corr.columns = corr.columns.droplevel()
AttributeError: 'Index' object has no attribute 'droplevel'
print(df1d.head())

        Date  Id_RiskFactor         1DReturn
0 2010-09-02            149            0E-12
1 2010-09-02            150  -0.004242875148
2 2010-09-02             33   0.000590000011
3 2010-09-02             28   0.000099999997
4 2010-09-02             34  -0.000010000000
print(df.head())

        Date  Id_RiskFactor           Value         1DReturn         5DReturn
0 2010-09-02            149  0.040096000000            0E-12            0E-12
1 2010-09-02            150  1.736700000000  -0.004242875148  -0.013014321215
2 2010-09-02             33  2.283000000000   0.000590000011   0.001260000048
3 2010-09-02             28  2.113000000000   0.000099999997   0.000469999999
4 2010-09-02             34  0.615000000000  -0.000010000000   0.000079999998
print(corr.columns)
Index([], dtype='object')
Create a sample DataFrame:
import pandas as pd
import numpy as np

df = pd.DataFrame({'daily_return': np.random.random(15),
                   'symbol': ['A'] * 5 + ['B'] * 5 + ['C'] * 5,
                   'date': np.tile(pd.date_range('1-1-2015', periods=5), 3)})
>>> df
    daily_return       date symbol
0       0.011467 2015-01-01      A
1       0.613518 2015-01-02      A
2       0.334343 2015-01-03      A
3       0.371809 2015-01-04      A
4       0.169016 2015-01-05      A
5       0.431729 2015-01-01      B
6       0.474905 2015-01-02      B
7       0.372366 2015-01-03      B
8       0.801619 2015-01-04      B
9       0.505487 2015-01-05      B
10      0.946504 2015-01-01      C
11      0.337204 2015-01-02      C
12      0.798704 2015-01-03      C
13      0.311597 2015-01-04      C
14      0.545215 2015-01-05      C
I'll assume you've already filtered your DataFrame for the relevant dates. You then want a pivot table where you have unique dates as your index and your symbols as separate columns, with daily returns as the values. Finally, you call corr() on the result.
corr = df.set_index(['date','symbol']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
>>> corr
symbol_2         A         B         C
symbol_1
A         1.000000  0.188065 -0.745115
B         0.188065  1.000000 -0.688808
C        -0.745115 -0.688808  1.000000
You can select the subset of your DataFrame based on dates as follows:
start_date = pd.Timestamp('2015-1-4')
end_date = pd.Timestamp('2015-1-5')
>>> df.loc[df.date.between(start_date, end_date), :]
    daily_return       date symbol
3       0.371809 2015-01-04      A
4       0.169016 2015-01-05      A
8       0.801619 2015-01-04      B
9       0.505487 2015-01-05      B
13      0.311597 2015-01-04      C
14      0.545215 2015-01-05      C
If you want to flatten your correlation matrix:
corr.stack().reset_index()
  symbol_1 symbol_2         0
0        A        A  1.000000
1        A        B  0.188065
2        A        C -0.745115
3        B        A  0.188065
4        B        B  1.000000
5        B        C -0.688808
6        C        A -0.745115
7        C        B -0.688808
8        C        C  1.000000
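If you prefer a single chain, here is a hedged sketch that combines the date filter, the reshape (pivot instead of set_index/unstack), and the correlation; the date bounds are just the sample's range, not anything from the original question:

mask = df.date.between(pd.Timestamp('2015-1-1'), pd.Timestamp('2015-1-5'))
corr = (df.loc[mask]
          .pivot(index='date', columns='symbol', values='daily_return')
          .corr())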
