select a row based on the highest value of a column python - python-3.x

The DataFrame looks something like this: name of a person and their weight on a given date.
Name date w
1 Mike 2019-01-21 89.1
2 Mike 2018-11-12 88.1
3 Mike 2018-03-14 87.2
4 Hans 2019-03-21 66.5
5 Hans 2018-03-12 57.4
6 Hans 2017-04-21 55.3
7 Hans 2016-10-12 nan
I want to select the most recent row where Hans logged his weight. So the answer would be
4 Hans 2019-03-21 66.5
Here's what I successfully managed to do:
# select Hans's rows that don't have NaNs
cond = ( data['Name'] == 'Hans' )
a = data.loc[cond]
a = a.dropna()
# get the index of the most recent weight
b = a['date'].str.split('-', expand=True)  # split the date to get the year
Now b looks like this:
print(b)
#4 2019 03 21
#5 2018 03 12
#6 2017 04 21
how can I extract the row with index=4 and then get the weight?
I cannot use idxmax because the dates are stored as strings, not floats.

You cannot use idxmax on an object (string) column, but a workaround is to use NumPy's argmax with iloc; it works here because the ISO-formatted date strings sort lexicographically in chronological order:
df2 = df.query('Name == "Hans"')
# older pandas versions
# df2.iloc[[df2['date'].values.argmax()]]
# pandas >= 0.24
df2.iloc[[df2['date'].to_numpy().argmax()]]
Name date w
4 Hans 2019-03-21 66.5
Another trick is to convert the dates to datetime with to_datetime and then to integers. You can then use idxmax with loc as usual.
df2.loc[[pd.to_datetime(df2['date']).astype(int).idxmax()]]
Name date w
4 Hans 2019-03-21 66.5
To do this for each person, use GroupBy.idxmax:
df.loc[pd.to_datetime(df.date).astype(int).groupby(df['Name']).idxmax().values]
Name date w
4 Hans 2019-03-21 66.5
1 Mike 2019-01-21 89.1
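As a footnote, since the dates are ISO strings that to_datetime parses directly, a simpler route on a reasonably recent pandas is to make the column a real datetime and call idxmax on it without the integer detour. A minimal sketch (the DataFrame below is a hypothetical reconstruction of the sample data):
import pandas as pd

df = pd.DataFrame({
    'Name': ['Mike', 'Mike', 'Mike', 'Hans', 'Hans', 'Hans', 'Hans'],
    'date': ['2019-01-21', '2018-11-12', '2018-03-14', '2019-03-21',
             '2018-03-12', '2017-04-21', '2016-10-12'],
    'w': [89.1, 88.1, 87.2, 66.5, 57.4, 55.3, None]},
    index=range(1, 8))

df['date'] = pd.to_datetime(df['date'])                  # real datetime dtype
hans = df[df['Name'] == 'Hans'].dropna(subset=['w'])     # drop rows with missing weight
latest = hans.loc[hans['date'].idxmax()]                 # idxmax works on datetime64
print(latest['w'])                                       # 66.5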

Computing 10d rolling average on a descending date column in pandas [duplicate]

Suppose I have a time series:
In [138]: rng = pd.date_range('1/10/2011', periods=10, freq='D')
In [139]: ts = pd.Series(np.arange(len(rng)), index=rng)
In [140]: ts
Out[140]:
2011-01-10 0
2011-01-11 1
2011-01-12 2
2011-01-13 3
2011-01-14 4
2011-01-15 5
2011-01-16 6
2011-01-17 7
2011-01-18 8
2011-01-19 9
Freq: D, dtype: int64
If I use one of the rolling_* functions, for instance rolling_sum, I can get the behavior I want for backward looking rolling calculations:
In [157]: pd.rolling_sum(ts, window=3, min_periods=0)
Out[157]:
2011-01-10 0
2011-01-11 1
2011-01-12 3
2011-01-13 6
2011-01-14 9
2011-01-15 12
2011-01-16 15
2011-01-17 18
2011-01-18 21
2011-01-19 24
Freq: D, dtype: float64
But what if I want to do a forward-looking sum? I've tried something like this:
In [161]: pd.rolling_sum(ts.shift(-2, freq='D'), window=3, min_periods=0)
Out[161]:
2011-01-08 0
2011-01-09 1
2011-01-10 3
2011-01-11 6
2011-01-12 9
2011-01-13 12
2011-01-14 15
2011-01-15 18
2011-01-16 21
2011-01-17 24
Freq: D, dtype: float64
But that's not exactly the behavior I want. What I am looking for as an output is:
2011-01-10 3
2011-01-11 6
2011-01-12 9
2011-01-13 12
2011-01-14 15
2011-01-15 18
2011-01-16 21
2011-01-17 24
2011-01-18 17
2011-01-19 9
i.e., I want the sum of the "current" day plus the next two days. My current solution is not sufficient because I care about what happens at the edges. I know I could solve this manually by setting up two additional columns shifted by 1 and 2 days respectively and then summing the three columns, but there has to be a more elegant solution.
Why not just do it on the reversed Series (and reverse the answer):
In [11]: pd.rolling_sum(ts[::-1], window=3, min_periods=0)[::-1]
Out[11]:
2011-01-10 3
2011-01-11 6
2011-01-12 9
2011-01-13 12
2011-01-14 15
2011-01-15 18
2011-01-16 21
2011-01-17 24
2011-01-18 17
2011-01-19 9
Freq: D, dtype: float64
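pd.rolling_sum was removed in later pandas releases; the same reversed-series trick written against the current rolling API would look like this (a minimal sketch, using the ts defined in the question):
ts[::-1].rolling(window=3, min_periods=1).sum()[::-1]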
I struggled with this, then found an easy way using shift.
If you want a rolling sum for the next 10 periods, try:
df['NewCol'] = df['OtherCol'].shift(-10).rolling(10, min_periods = 0).sum()
We use shift so that "OtherCol" shows up 10 rows ahead of where it normally would be, then we do a rolling sum over the previous 10 rows. Because we shifted, the previous 10 rows are actually the future 10 rows of the unshifted column. :)
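Applied to the question's 3-day case, the same idea reads (a sketch, using the ts above; shift(-2) pulls the next two days back so a backward window of 3 covers today plus the next two days):
ts.shift(-2).rolling(3, min_periods=1).sum()
One caveat: the first window - 1 rows are computed from a truncated window (at 2011-01-10 only the shifted value for 2011-01-12 is available), so the very start of the series differs from a true forward-looking sum; the reversed-series trick above or the FixedForwardWindowIndexer below handles both edges.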
Pandas recently added a feature that lets you implement forward-looking rolling windows; you have to upgrade to pandas 1.1.0 or later to get it.
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=3)
ts.rolling(window=indexer, min_periods=1).sum()
Maybe you can try the bottleneck module. When ts is large, bottleneck is much faster than pandas:
import bottleneck as bn
result = bn.move_sum(ts[::-1], window=3, min_count=1)[::-1]
And bottleneck has other rolling functions, such as move_max, move_argmin, move_rank.
Try this one for a rolling window of 3:
window = 3
ts.rolling(window).sum().shift(-window + 1)

Define a function to classify records within a df and add new columns. Pandas dfs

I have a list of about 20 dfs and I want to clean the data for analysis.
Can there be a function that loops through all the dfs in the list & performs the tasks below, if all the columns are the same?
Create a column [time_class] that classifies each arrival time as "early" or "late" by comparing it with the [appt_time] column. Next, I want to classify each record as "early_yes", "early_no", "late_yes" or "late_no" in another column called [time_response]. This column would check the values of [time_class], [YES] and [NO]. If a record is 'early' and has '1' for yes, then the [time_response] column should say "early_yes". Then I need a frequency table to count the [time_response] occurrences; the frequency table headers will come from the [time_response] column.
How can I check to make sure the time columns are reading as times in pandas?
How can I change the values in the yes and no column to 'yes' and 'no' instead of the 1's?
each df has this format for these specific columns:
Arrival_time  Appt_Time  YES  NO
07:25:00      08:00      1
08:24:00      08:40      1
08:12:00      09:00           1
09:20:00      09:30           1
10:01:00      10:00           1
09:33:00      09:30      1
10:22:00      10:20           1
10:29:00      10:30      1
I also have an age column in each df that I have tried binning using the cut() method, and I usually get the error that the input must be a one-dimensional array. Does this mean I cannot use this method if the df has columns other than just the age?
How can you define a function to check the age column and create bins grouped by 10 [20-100], then use these bins to create a frequency table? Ideally I'd like the freq table to be columns in each df. I am using pandas.
Any help is appreciated!!
UPDATE: When I try to compare arrival time and scheduled time, I get a type error TypeError: '<=' not supported between instances of 'int' and 'datetime.time'
Hopefully this helps you get started - you'll see that there are a few useful methods like replace in pandas and select from the numpy library. Also if you want to apply any of the code to multiple dataframes that are all in the same format, you'll want to wrap this code in a function.
import numpy as np
import pandas as pd
### this code helps recreate the df you posted
df = pd.DataFrame({
"Arrival_time": ['07:25:00', '08:24:00', '08:12:00', '09:20:00', '10:01:00', '09:33:00', '10:22:00', '10:29:00'],
"Appt_Time":['08:00', '08:40', '09:00', '09:30', '10:00', '09:30', '10:20', '10:30'],
"YES": ['1','1','','','','1','','1'],
"NO": ['','','1','1','1','','1','']})
df.Arrival_time = pd.to_datetime(df.Arrival_time, format='%H:%M:%S').dt.time
df.Appt_Time = pd.to_datetime(df.Appt_Time, format='%H:%M').dt.time
### end here
# you can start using the code from this line onward:
# creates "time_class" column based on Arrival_time being before Appt_Time
df["time_class"] = (df.Arrival_time <= df.Appt_Time).replace({True: "early", False: "late"})
# creates a new column "time_response" based on conditions
# this may need to be changed depending on whether your "YES" and "NO" columns
# are a string or an int... I just assumed a string so you can modify this code as needed
conditions = [
(df.time_class == "early") & (df.YES == '1'),
(df.time_class == "early") & (df.YES != '1'),
(df.time_class == "late") & (df.YES == '1'),
(df.time_class == "late") & (df.YES != '1')]
choices = ["early_yes", "early_no", "late_yes", "late_no"]
df["time_response"] = np.select(conditions, choices)
# creates a new df to sum up each time_response
df_time_response_count = pd.DataFrame({"Counts": df["time_response"].value_counts()})
# replace 1 with YES and 1 with NO in your YES and NO columns
df.YES = df.YES.replace({'1': "YES"})
df.NO = df.NO.replace({'1': "NO"})
Output:
>>> df
  Arrival_time Appt_Time  YES  NO time_class time_response
0     07:25:00  08:00:00  YES      early     early_yes
1     08:24:00  08:40:00  YES      early     early_yes
2     08:12:00  09:00:00       NO  early     early_no
3     09:20:00  09:30:00       NO  early     early_no
4     10:01:00  10:00:00       NO  late      late_no
5     09:33:00  09:30:00  YES      late      late_yes
6     10:22:00  10:20:00       NO  late      late_no
7     10:29:00  10:30:00  YES      early     early_yes
>>> df_time_response_count
Counts
early_yes 3
late_no 2
early_no 2
late_yes 1
To answer your question about binning, I think np.linspace() is the easiest way to create the bins you want.
So I'll add some random ages between 20 and 100 to the df:
df['age'] = [21,31,34,26,46,70,56,55]
So the dataframe looks like this:
df
  Arrival_time Appt_Time  YES  NO time_class time_response  age
0     07:25:00  08:00:00  YES      early     early_yes       21
1     08:24:00  08:40:00  YES      early     early_yes       31
2     08:12:00  09:00:00       NO  early     early_no        34
3     09:20:00  09:30:00       NO  early     early_no        26
4     10:01:00  10:00:00       NO  late      late_no         46
5     09:33:00  09:30:00  YES      late      late_yes        70
6     10:22:00  10:20:00       NO  late      late_no         56
7     10:29:00  10:30:00  YES      early     early_yes       55
Then use the value_counts method in pandas with the bins parameter:
df_age_counts = pd.DataFrame({"Counts": df.age.value_counts(bins = np.linspace(20,100,9))})
df_age_counts = df_age_counts.sort_index()
Output:
>>> df_age_counts
Counts
(19.999, 30.0] 2
(30.0, 40.0] 2
(40.0, 50.0] 1
(50.0, 60.0] 2
(60.0, 70.0] 1
(70.0, 80.0] 0
(80.0, 90.0] 0
(90.0, 100.0] 0
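The note above about wrapping this in a function so it can run over all 20 dataframes could look something like the sketch below (assuming every df in the list still has the raw string columns shown in the question; clean_df and list_of_dfs are illustrative names, not anything from the original post):
import numpy as np
import pandas as pd

def clean_df(df):
    df = df.copy()
    # parse the time columns so early/late comparisons work
    df['Arrival_time'] = pd.to_datetime(df['Arrival_time'], format='%H:%M:%S').dt.time
    df['Appt_Time'] = pd.to_datetime(df['Appt_Time'], format='%H:%M').dt.time
    # classify each record as early/late, then combine with the YES flag
    df['time_class'] = (df['Arrival_time'] <= df['Appt_Time']).replace({True: 'early', False: 'late'})
    conditions = [
        (df['time_class'] == 'early') & (df['YES'] == '1'),
        (df['time_class'] == 'early') & (df['YES'] != '1'),
        (df['time_class'] == 'late') & (df['YES'] == '1'),
        (df['time_class'] == 'late') & (df['YES'] != '1')]
    choices = ['early_yes', 'early_no', 'late_yes', 'late_no']
    df['time_response'] = np.select(conditions, choices)
    return df

cleaned = [clean_df(d) for d in list_of_dfs]                   # list_of_dfs: your list of ~20 dfs
counts = [d['time_response'].value_counts() for d in cleaned]  # one frequency table per df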

Python - Create copies of rows based on column value and increase date by number of iterations

I have a dataframe in Python:
md
Out[94]:
Key_ID ronDt multidays
0 Actuals-788-8AA-0001 2017-01-01 1.0
11 Actuals-788-8AA-0012 2017-01-09 1.0
20 Actuals-788-8AA-0021 2017-01-16 1.0
33 Actuals-788-8AA-0034 2017-01-25 1.0
36 Actuals-788-8AA-0037 2017-01-28 1.0
... ... ...
55239 Actuals-789-8LY-0504 2020-02-12 1.0
55255 Actuals-788-T11-0001 2018-08-23 8.0
55257 Actuals-788-T11-0003 2018-09-01 543.0
55258 Actuals-788-T15-0001 2019-02-20 368.0
55259 Actuals-788-T15-0002 2020-02-24 2.0
I want to create an additional record for every multiday and increase the date (ronDt) by the number of times that record was duplicated.
For example:
row[0] would repeat one time with the new date reading 2017-01-02.
row[55255] would be repeated 8 times with the corresponding dates ranging from 2018-08-24 - 2018-08-31.
When I did this in VBA, I used loops, and in Alteryx I used multirow functions. What is the best way to achieve this in Python? Thanks.
Here's a way to do it in pandas:
# build the list of dates for each row
df['datecol'] = df.apply(lambda x: pd.date_range(start=x['ronDt'], periods=int(x['multidays']), freq='D'), axis=1)
# convert the list into new rows
df = df.explode('datecol').drop('ronDt', axis=1)
# rename the columns
df.rename(columns={'datecol': 'ronDt'}, inplace=True)
print(df)
Key_ID multidays ronDt
0 Actuals-788-8AA-0001 1.0 2017-01-01
1 Actuals-788-8AA-0012 1.0 2017-01-09
2 Actuals-788-8AA-0021 1.0 2017-01-16
3 Actuals-788-8AA-0034 1.0 2017-01-25
4 Actuals-788-8AA-0037 1.0 2017-01-28
.. ... ... ...
8 Actuals-788-T15-0001 368.0 2020-02-20
8 Actuals-788-T15-0001 368.0 2020-02-21
8 Actuals-788-T15-0001 368.0 2020-02-22
9 Actuals-788-T15-0002 2.0 2020-02-24
9 Actuals-788-T15-0002 2.0 2020-02-25
# Get the count of duplications for each row, which corresponds to the multidays col
df = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: 'multidays'})
# Assuming the ronDt dtype is str, convert it to a datetime object,
# then add the multidays column to ronDt as a timedelta
df['ronDt_new'] = pd.to_datetime(df['ronDt']) + pd.to_timedelta(df['multidays'], unit='d')
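If each multiday should add rows after the original date rather than replace it (as in the example, where multidays == 1.0 for row 0 keeps 2017-01-01 and adds 2017-01-02), a small variation of the explode approach could be used; a sketch, assuming ronDt parses with to_datetime:
import pandas as pd

md['ronDt'] = pd.to_datetime(md['ronDt'])
# one row for the original date plus one row per multiday
md['ronDt'] = md.apply(lambda r: pd.date_range(start=r['ronDt'], periods=int(r['multidays']) + 1, freq='D'), axis=1)
md = md.explode('ronDt').reset_index(drop=True)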

Create a pandas column based on a lookup value from another dataframe

I have a pandas dataframe that has some data values by hour (which is also the index of this lookup dataframe). The dataframe looks like this:
In [1] print (df_lookup)
Out[1] 0 1.109248
1 1.102435
2 1.085014
3 1.073487
4 1.079385
5 1.088759
6 1.044708
7 0.902482
8 0.852348
9 0.995912
10 1.031643
11 1.023458
12 1.006961
...
23 0.889541
I want to multiply the values from this lookup dataframe to create a column of another dataframe, which has datetime as index.
The dataframe looks like this:
In [2] print (df)
Out[2]
Date_Label ID data-1 data-2 data-3
2015-08-09 00:00:00 1 2513.0 2502 NaN
2015-08-09 00:00:00 1 2113.0 2102 NaN
2015-08-09 01:00:00 2 2006.0 1988 NaN
2015-08-09 02:00:00 3 2016.0 2003 NaN
...
2018-07-19 23:00:00 33 3216.0 333 NaN
I want to calculate the data-3 column from the data-2 column, where the weight given to the 'data-2' column depends on the corresponding value in df_lookup. I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
df.loc[idx,'data-3'] = df.loc[idx, 'data-2']*df_lookup.at[idx.hour]
Is there a faster way someone could suggest?
Using .loc
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
#df['data-3']=df['data-2']*df_lookup.loc[df.index.hour].values
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)
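Another option (a sketch, assuming df_lookup can be squeezed to a Series keyed by hour as in the printout above) is to map the hour of each timestamp straight onto the lookup values, which avoids the temporary hour column:
lookup = df_lookup.squeeze()                              # Series indexed by hour 0-23
df['data-3'] = df['data-2'] * df.index.hour.map(lookup).to_numpy()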

Python correlation matrix 3d dataframe

I have a historical return table in SQL Server, by date and asset Id, like this:
[Date] [Asset] [1DRet]
jan asset1 0.52
jan asset2 0.12
jan asset3 0.07
feb asset1 0.41
feb asset2 0.33
feb asset3 0.21
...
So I need to calculate the correlation matrix for a given date range for all asset combinations: A1,A2; A1,A3; A2,A3
I'm using pandas, and in my SQL SELECT WHERE clause I'm filtering the date range and ordering by date.
I'm trying to do it using pandas df.corr(), numpy.corrcoef and SciPy, but I'm not able to do it for my n-variable dataframe.
I see some examples, but they're always for a dataframe with one asset per column and one row per day.
This is the code block where I'm doing it:
qryRet = "Select * from IndexesValue where Date > '20100901' and Date < '20150901' order by Date"
result = conn.execute(qryRet)
df = pd.DataFrame(data=list(result),columns=result.keys())
df1d = df[['Date','Id_RiskFactor','1DReturn']]
corr = df1d.set_index(['Date','Id_RiskFactor']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
print(corr)
conn.close()
For it I'm receiving this message:
corr.columns = corr.columns.droplevel()
AttributeError: 'Index' object has no attribute 'droplevel'
print(df1d.head())
Date Id_RiskFactor 1DReturn
0 2010-09-02 149 0E-12
1 2010-09-02 150 -0.004242875148
2 2010-09-02 33 0.000590000011
3 2010-09-02 28 0.000099999997
4 2010-09-02 34 -0.000010000000
print(df.head())
Date Id_RiskFactor Value 1DReturn 5DReturn
0 2010-09-02 149 0.040096000000 0E-12 0E-12
1 2010-09-02 150 1.736700000000 -0.004242875148 -0.013014321215
2 2010-09-02 33 2.283000000000 0.000590000011 0.001260000048
3 2010-09-02 28 2.113000000000 0.000099999997 0.000469999999
4 2010-09-02 34 0.615000000000 -0.000010000000 0.000079999998
print(corr.columns)
Index([], dtype='object')
Create a sample DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'daily_return': np.random.random(15),
                   'symbol': ['A'] * 5 + ['B'] * 5 + ['C'] * 5,
                   'date': np.tile(pd.date_range('1-1-2015', periods=5), 3)})
>>> df
daily_return date symbol
0 0.011467 2015-01-01 A
1 0.613518 2015-01-02 A
2 0.334343 2015-01-03 A
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
5 0.431729 2015-01-01 B
6 0.474905 2015-01-02 B
7 0.372366 2015-01-03 B
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
10 0.946504 2015-01-01 C
11 0.337204 2015-01-02 C
12 0.798704 2015-01-03 C
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
I'll assume you've already filtered your DataFrame for the relevant dates. You then want a pivot table where you have unique dates as your index and your symbols as separate columns, with daily returns as the values. Finally, you call corr() on the result.
corr = df.set_index(['date','symbol']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
>>> corr
symbol_2 A B C
symbol_1
A 1.000000 0.188065 -0.745115
B 0.188065 1.000000 -0.688808
C -0.745115 -0.688808 1.000000
You can select the subset of your DataFrame based on dates as follows:
start_date = pd.Timestamp('2015-1-4')
end_date = pd.Timestamp('2015-1-5')
>>> df.loc[df.date.between(start_date, end_date), :]
daily_return date symbol
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
If you want to flatten your correlation matrix:
corr.stack().reset_index()
symbol_1 symbol_2 0
0 A A 1.000000
1 A B 0.188065
2 A C -0.745115
3 B A 0.188065
4 B B 1.000000
5 B C -0.688808
6 C A -0.745115
7 C B -0.688808
8 C C 1.000000
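Since the question only asks for each unordered pair once (A1,A2; A1,A3; A2,A3), one way to drop the diagonal and the mirrored duplicates from the flattened matrix is to keep only the upper triangle before stacking (a sketch, using the corr frame built above):
import numpy as np

upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep only cells above the diagonal
pairs = upper.stack().reset_index()
pairs.columns = ['symbol_1', 'symbol_2', 'correlation']
print(pairs)
#   symbol_1 symbol_2  correlation
# 0        A        B     0.188065
# 1        A        C    -0.745115
# 2        B        C    -0.688808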
