Divide running_count values between rows based on a condition - python-3.x

Sample of the table for one id; multiple ids exist in the original df.
id   legend  date        running_count
101  X       24-07-2021  3
101  Y       24-07-2021  5
101  X       25-07-2021  4
101  Y       25-07-2021  6
I want to create a new column holding the result of dividing the running_count values (X/Y) for each id and date, e.g. 3/5 for id 101 on 24-07-2021, and so on.
How should I perform the calculation?
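For reference, a minimal sketch that rebuilds the sample above (values taken from the table) so the answer code below can be run directly:
import pandas as pd

# one id shown here; the real df contains multiple ids
df = pd.DataFrame({
    'id': [101, 101, 101, 101],
    'legend': ['X', 'Y', 'X', 'Y'],
    'date': ['24-07-2021', '24-07-2021', '25-07-2021', '25-07-2021'],
    'running_count': [3, 5, 4, 6],
})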

If X and Y appear in the same order for each id, it is possible to use:
df['new'] = df['running_count'].div(df.groupby(['id','date'])['running_count'].shift(-1))
print (df)
id legend date running_count new
0 101 X 24-07-2021 3 0.600000
1 101 Y 24-07-2021 5 NaN
2 101 X 25-07-2021 4 0.666667
3 101 Y 25-07-2021 6 NaN
If a change of output shape is possible:
df1 = df.pivot(index=['id','date'], columns='legend', values='running_count')
df1['new'] = df1['X'].div(df1['Y'])
df1 = df1.reset_index()
print (df1)
legend id date X Y new
0 101 24-07-2021 3 5 0.600000
1 101 25-07-2021 4 6 0.666667

Related

Merge pandas data frame based on specific conditions

I have a df as shown below
df1:
ID Job Salary
1 A 100
2 B 200
3 B 20
4 C 150
5 A 500
6 A 600
7 A 200
8 B 150
df2:
ID Type Status Age
1 2 P 23
2 1 P 28
8 1 F 33
4 3 P 48
14 1 F 23
11 2 P 28
16 2 F 23
41 3 P 38
df3:
ID T_Type Amount
1 K 20
2 L -50
1 K 30
3 K 5
1 K 100
2 L -50
1 L -30
25 K 500
1 K 20
4 L -80
19 K 30
2 K -5
Explanation about the data:
ID is the primary key of df1.
ID is the primary key of df2.
df3 does not have any primary key.
From the above, I would like to prepare the dfs below.
1. IDs which are in df1 and df2.
Expected output1:
ID Job Salary
1 A 100
2 B 200
4 C 150
8 B 150
2. IDs which are there in df1 and not in df2
output2:
ID Job Salary
3 B 20
5 A 500
6 A 600
7 A 200
3. IDs which are there in df1 and df3
output3:
ID Job Salary
1 A 100
2 B 200
3 B 20
4 C 150
4. IDs which are there in df1 and not in df3.
output4:
ID Job Salary
5 A 500
6 A 600
7 A 200
8 B 150
>>> # 1. IDs which are in df1 and df2.
>>> df1[df1['ID'].isin(df2['ID'])]
ID Job Salary
0 1 A 100
1 2 B 200
3 4 C 150
7 8 B 150
>>> # 2. IDs which are there in df1 and not in df2
>>> df1[~df1['ID'].isin(df2['ID'])]
ID Job Salary
2 3 B 20
4 5 A 500
5 6 A 600
6 7 A 200
>>> # 3. IDs which are there in df1 and df3
>>> df1[df1['ID'].isin(df3['ID'])]
ID Job Salary
0 1 A 100
1 2 B 200
2 3 B 20
3 4 C 150
>>> # 4. IDs which are there in df1 and not in df3.
>>> df1[~df1['ID'].isin(df3['ID'])]
ID Job Salary
4 5 A 500
5 6 A 600
6 7 A 200
7 8 B 150
Actually, your expected results aren't merges but selections, based on whether df1.ID is (or is not) present in the ID column of the other DataFrame.
To get your expected results, run the following commands:
result_1 = df1[df1.ID.isin(df2.ID)]
result_2 = df1[~df1.ID.isin(df2.ID)]
result_3 = df1[df1.ID.isin(df3.ID)]
result_4 = df1[~df1.ID.isin(df3.ID)]
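Since the question mentions merging: the same splits can also be produced with an actual merge by passing indicator=True (a sketch, not the answer's method); the generated _merge column marks 'both' versus 'left_only'.
marked = df1.merge(df2[['ID']].drop_duplicates(), on='ID', how='left', indicator=True)
result_1 = marked.loc[marked['_merge'] == 'both', df1.columns]
result_2 = marked.loc[marked['_merge'] == 'left_only', df1.columns]
# the same pattern against df3[['ID']] yields result_3 and result_4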

Find the number of days' gap for a specific ID when it has a flag X in another column

I want to calculate the number of days between consecutive rows where the 'flag' column equals 'X', for the same ID.
The dataframe that I have:
ID Date flag
1 1-1-2020 X
1 10-1-2020 null
1 15-1-2020 X
2 1-2-2020 X
2 10-2-2020 X
2 18-2-2020 X
3 15-2-2020 null
The dataframe I want:
ID Date flag no_of_days
1 1-1-2020 X 14
1 10-1-2020 null null
1 15-1-2020 X null
2 1-2-2020 X 9
2 10-2-2020 X 8
2 18-2-2020 X null
3 15-2-2020 null null
Thanks in advance.
First filter the rows where flag equals X with boolean indexing, then subtract the column shifted per group via DataFrameGroupBy.shift, and finally convert the timedeltas to days with Series.dt.days:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['new'] = df[df['flag'].eq('X')].groupby('ID')['Date'].shift(-1).sub(df['Date']).dt.days
print (df)
ID Date flag new
0 1 2020-01-01 X 14.0
1 1 2020-01-10 NaN NaN
2 1 2020-01-15 X NaN
3 2 2020-02-01 X 9.0
4 2 2020-02-10 X 8.0
5 2 2020-02-18 X NaN
6 3 2020-02-15 NaN NaN
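For reference, a minimal sketch that rebuilds the question's frame (the null entries become NaN via None) so the answer above can be run end to end:
import pandas as pd

df = pd.DataFrame({
    'ID':   [1, 1, 1, 2, 2, 2, 3],
    'Date': ['1-1-2020', '10-1-2020', '15-1-2020',
             '1-2-2020', '10-2-2020', '18-2-2020', '15-2-2020'],
    'flag': ['X', None, 'X', 'X', 'X', 'X', None],
})
# dayfirst=True in the answer is what lets pd.to_datetime parse these day-first dates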

How can I find the highest and the lowest value between rows, depending on a condition being met in another column of a pd.DataFrame?

I have the following DataFrame, which is generated every time I run the script; it looks like this:
df=
index time value status
0 2020-11-20 20:10:00 10 X
1 2020-11-20 20:20:00 11 X
2 2020-11-20 20:45:00 9 X
3 2020-11-20 20:45:00 5 Y
4 2020-11-20 21:00:00 4 X
5 2020-11-20 21:05:00 2 Y
6 2020-11-20 21:15:00 4 Y
7 2020-11-20 21:20:00 9 X
8 2020-11-20 21:25:00 5 X
The desired output would be :
index time value status
0 2020-11-20 20:20:00 11 X
1 2020-11-20 20:45:00 5 Y
2 2020-11-20 21:00:00 4 X
3 2020-11-20 21:05:00 2 Y
4 2020-11-20 21:20:00 9 X
So my goal here would be to create a new pd.DataFrame with the lowest values of Y and the highest values of X.
Thanks to everyone in advance for all the assistance and support.
You can do a groupby on consecutive values of your DataFrame where the status is the same, sort each grouped DataFrame by value, and keep either the first or last value of the sorted DataFrame depending on whether the grouped DataFrame has status equal to X or Y.
Note: I noticed the time column of your DataFrame has no impact on the answer, so I didn't include it when I recreated your DataFrame.
import pandas as pd

## the time column doesn't matter in your problem
df = pd.DataFrame({
    'value': [10, 11, 9, 5, 4, 2, 4, 9, 5],
    'status': ['X']*3 + ['Y'] + ['X'] + ['Y']*2 + ['X']*2
})

df_new = pd.DataFrame(columns=df.columns)
## perform a groupby on consecutive values
for _, g in df.groupby([(df.status != df.status.shift()).cumsum()]):
    g = g.sort_values(by='value')
    ## keep the highest value for X
    if g.status.values[0] == 'X':
        g = g.drop_duplicates(subset=['status'], keep='last')
    ## keep the lowest value for Y
    elif g.status.values[0] == 'Y':
        g = g.drop_duplicates(subset=['status'], keep='first')
    else:
        pass
    df_new = pd.concat([df_new, g])

df_new = df_new.reset_index(drop=True)
Output:
>>> df_new
value status
0 11 X
1 5 Y
2 4 X
3 2 Y
4 9 X
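A shorter alternative (a sketch, not the answer's code): label consecutive runs of the same status with a cumulative sum, then keep the row index of the maximum value for X runs and of the minimum value for Y runs.
runs = (df.status != df.status.shift()).cumsum()
idx = df.groupby(runs).apply(
    lambda g: g['value'].idxmax() if g['status'].iat[0] == 'X' else g['value'].idxmin()
)
df_new = df.loc[idx].reset_index(drop=True)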

Pandas print missing value column names and count only

I am using the following code to print the missing value count and the column names.
#Looking for missing data and then handling it accordingly
def find_missing(data):
    # number of missing values
    count_missing = data.isnull().sum().values
    # total records
    total = data.shape[0]
    # percentage of missing
    ratio_missing = count_missing / total
    # return a dataframe to show: feature name, # of missing and % of missing
    return pd.DataFrame(data={'missing_count': count_missing, 'missing_ratio': ratio_missing},
                        index=data.columns.values)

find_missing(data_final).head(5)
What I want is to print only the columns that contain a missing value, since I have a huge data set of about 150 columns.
The data set looks like this
A B C D
123 ABC X Y
123 ABC X Y
NaN ABC NaN NaN
123 ABC NaN NaN
245 ABC NaN NaN
345 ABC NaN NaN
In the output I would just want to see:
missing_count missing_ratio
C 4 0.66
D 4 0.66
and not the columns A and B as there are no missing values there
Use DataFrame.isna with DataFrame.sum to count missing values by column. We can also use DataFrame.isnull instead of DataFrame.isna.
new_df = (df.isna()
            .sum()
            .to_frame('missing_count')
            .assign(missing_ratio=lambda x: x['missing_count'] / len(df))
            .loc[df.isna().any()])
print(new_df)
We can also use pd.concat instead of DataFrame.assign:
count = df.isna().sum()
new_df = (pd.concat([count.rename('missing_count'),
                     count.div(len(df)).rename('missing_ratio')], axis=1)
            .loc[count.ne(0)])
Output
missing_count missing_ratio
A 1 0.166667
C 4 0.666667
D 4 0.666667
IIUC, we can assign the missing count and the total count to two variables, do some basic math, and assign the result back to a DataFrame.
import numpy as np

a = df.isnull().sum(axis=0)
b = np.round(df.isnull().sum(axis=0) / df.fillna(0).count(axis=0), 2)

missing_df = pd.DataFrame({'missing_vals': a,
                           'missing_ratio': b})
print(missing_df)
missing_vals missing_ratio
A 1 0.17
B 0 0.00
C 4 0.67
D 4 0.67
You can then filter out the columns that don't have any missing values:
missing_df = missing_df[missing_df.missing_vals.ne(0)]
print(missing_df)
missing_vals missing_ratio
A 1 0.17
C 4 0.67
D 4 0.67
You can also use concat:
s = df.isnull().sum()
result = pd.concat([s, s/len(df)], axis=1)
result.columns = ["missing_count", "missing_ratio"]
print (result)
missing_count missing_ratio
A 1 0.166667
B 0 0.000000
C 4 0.666667
D 4 0.666667
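One more compact variant (a sketch, not one of the answers above): DataFrame.isna().mean() gives the ratio directly, and a boolean filter keeps only the columns with at least one missing value.
na = df.isna()
report = pd.DataFrame({'missing_count': na.sum(), 'missing_ratio': na.mean()})
report = report[report['missing_count'] > 0]
print (report)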

Python correlation matrix 3d dataframe

I have a historical return table in SQL Server, keyed by date and asset Id, like this:
[Date] [Asset] [1DRet]
jan asset1 0.52
jan asset2 0.12
jan asset3 0.07
feb asset1 0.41
feb asset2 0.33
feb asset3 0.21
...
So I need to calculate the correlation matrix for a given date range across all asset combinations: A1,A2; A1,A3; A2,A3.
I'm using pandas, and in my SQL SELECT's WHERE clause I'm filtering the date range and ordering by date.
I'm trying to do it using pandas df.corr(), numpy.corrcoef and SciPy, but I am not able to make it work for my n-variable dataframe.
The examples I see always assume a dataframe with one asset per column and one row per day.
This my code block where I'm doing it:
qryRet = "Select * from IndexesValue where Date > '20100901' and Date < '20150901' order by Date"
result = conn.execute(qryRet)
df = pd.DataFrame(data=list(result),columns=result.keys())
df1d = df[['Date','Id_RiskFactor','1DReturn']]
corr = df1d.set_index(['Date','Id_RiskFactor']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
print(corr)
conn.close()
For it I'm receiving this message:
corr.columns = corr.columns.droplevel()
AttributeError: 'Index' object has no attribute 'droplevel'
print(df1d.head()):
Date Id_RiskFactor 1DReturn
0 2010-09-02 149 0E-12
1 2010-09-02 150 -0.004242875148
2 2010-09-02 33 0.000590000011
3 2010-09-02 28 0.000099999997
4 2010-09-02 34 -0.000010000000
print(df.head()):
Date Id_RiskFactor Value 1DReturn 5DReturn
0 2010-09-02 149 0.040096000000 0E-12 0E-12
1 2010-09-02 150 1.736700000000 -0.004242875148 -0.013014321215
2 2010-09-02 33 2.283000000000 0.000590000011 0.001260000048
3 2010-09-02 28 2.113000000000 0.000099999997 0.000469999999
4 2010-09-02 34 0.615000000000 -0.000010000000 0.000079999998
print(corr.columns):
Index([], dtype='object')
Create a sample DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'daily_return': np.random.random(15),
'symbol': ['A'] * 5 + ['B'] * 5 + ['C'] * 5,
'date': np.tile(pd.date_range('1-1-2015', periods=5), 3)})
>>> df
daily_return date symbol
0 0.011467 2015-01-01 A
1 0.613518 2015-01-02 A
2 0.334343 2015-01-03 A
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
5 0.431729 2015-01-01 B
6 0.474905 2015-01-02 B
7 0.372366 2015-01-03 B
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
10 0.946504 2015-01-01 C
11 0.337204 2015-01-02 C
12 0.798704 2015-01-03 C
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
I'll assume you've already filtered your DataFrame for the relevant dates. You then want a pivot table where you have unique dates as your index and your symbols as separate columns, with daily returns as the values. Finally, you call corr() on the result.
corr = df.set_index(['date','symbol']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
>>> corr
symbol_2 A B C
symbol_1
A 1.000000 0.188065 -0.745115
B 0.188065 1.000000 -0.688808
C -0.745115 -0.688808 1.000000
You can select the subset of your DataFrame based on dates as follows:
start_date = pd.Timestamp('2015-1-4')
end_date = pd.Timestamp('2015-1-5')
>>> df.loc[df.date.between(start_date, end_date), :]
daily_return date symbol
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
If you want to flatten your correlation matrix:
corr.stack().reset_index()
symbol_1 symbol_2 0
0 A A 1.000000
1 A B 0.188065
2 A C -0.745115
3 B A 0.188065
4 B B 1.000000
5 B C -0.688808
6 C A -0.745115
7 C B -0.688808
8 C C 1.000000
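Equivalently (a sketch, not the answer's exact code), the wide table can be built with pivot before calling corr(), which avoids the MultiIndex columns that unstack() creates and so needs no droplevel:
wide = df.pivot(index='date', columns='symbol', values='daily_return')
corr = wide.corr()  # index and columns are the symbols directly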
