I have two dataframes.
First one:
Date B
2021-12-31 NaN
2022-01-31 500
2022-02-28 540
Second one:
Date A
2021-12-28 520
2021-12-31 530
2022-01-20 515
2022-01-31 529
2022-02-15 544
2022-02-25 522
I want to combine the two dataframes based on year and month, and the resulting dataframe should look like the one below:
Date A B
2021-12-28 520 NaN
2021-12-31 530 NaN
2022-01-20 515 500
2022-01-31 529 500
2022-02-15 544 540
2022-02-25 522 540
You need a left merge on the month period:
df2.merge(df1,
left_on=pd.to_datetime(df2['Date']).dt.to_period('M'),
right_on=pd.to_datetime(df1['Date']).dt.to_period('M'),
suffixes=(None, '_'),
how='left'
)
Then drop the helper columns with drop(columns=['key_0', 'Date_']) if needed.
Output:
key_0 Date A Date_ B
0 2021-12 2021-12-28 520 2021-12-31 NaN
1 2021-12 2021-12-31 530 2021-12-31 NaN
2 2022-01 2022-01-20 515 2022-01-31 500.0
3 2022-01 2022-01-31 529 2022-01-31 500.0
4 2022-02 2022-02-15 544 2022-02-28 540.0
5 2022-02 2022-02-25 522 2022-02-28 540.0
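If you want the exact frame from the question, the cleanup can be chained onto the merge; a minimal sketch (the helper columns key_0 and Date_ are the ones produced by the merge above):

out = (df2.merge(df1,
                 left_on=pd.to_datetime(df2['Date']).dt.to_period('M'),
                 right_on=pd.to_datetime(df1['Date']).dt.to_period('M'),
                 suffixes=(None, '_'),
                 how='left')
          .drop(columns=['key_0', 'Date_']))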
I have a set of 5 time series dataframes with a resolution of 15 minutes, but they do not end on the same date and time. However, the starting date and time are the same, so I would prefer to clip them to the same length.
Then I would like to reshape the data to see a weekly or 14-day pattern.
The data looks like this:
I think what you mean by clipping is to resample the dates and take the last value (correct me if I'm wrong).
To resample you can use the .resample() method from pandas (set your timestamp column as the index before using this method), followed by .last() to take the last value.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({
... 'Timestamp':pd.date_range('2021-06-21 23:07', periods = 200, freq='15min'),
... 'Speed':range(200)
... })
>>> print(df)
Timestamp Speed
0 2021-06-21 23:07:00 0
1 2021-06-21 23:22:00 1
2 2021-06-21 23:37:00 2
3 2021-06-21 23:52:00 3
4 2021-06-22 00:07:00 4
.. ... ...
195 2021-06-23 23:52:00 195
196 2021-06-24 00:07:00 196
197 2021-06-24 00:22:00 197
198 2021-06-24 00:37:00 198
199 2021-06-24 00:52:00 199
[200 rows x 2 columns]
>>> df_grouped = df.set_index('Timestamp').resample('1H').last()
>>> print(df_grouped.head(10))
Speed
Timestamp
2021-06-21 23:00:00 3
2021-06-22 00:00:00 7
2021-06-22 01:00:00 11
2021-06-22 02:00:00 15
2021-06-22 03:00:00 19
2021-06-22 04:00:00 23
2021-06-22 05:00:00 27
2021-06-22 06:00:00 31
2021-06-22 07:00:00 35
2021-06-22 08:00:00 39
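The resampling above handles the aggregation; for the clipping part of the question (five frames that start together but end at different times), a rough sketch, assuming the frames are collected in a Python list and each has a 'Timestamp' column like the example above:

dfs = [df]  # placeholder: replace with your list of five 15-minute dataframes
common_end = min(frame['Timestamp'].max() for frame in dfs)           # earliest final timestamp
clipped = [frame[frame['Timestamp'] <= common_end] for frame in dfs]  # every frame now ends at the same time

From there, a weekly profile can be inspected by setting the timestamp as the index and grouping on day-of-week and time-of-day, e.g. s.groupby([s.index.dayofweek, s.index.time]).mean() for each clipped series s.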
Ok, so hopefully the title is understandable. I have two dataframes: one with a datetime index and a single column of values, and another with latitude and longitude and other columns.
The general layout is
df1=
factor
2015-04-15 NaN
2015-04-16 NaN
2015-04-17 NaN
2015-04-18 NaN
2015-04-19 NaN
2015-04-20 NaN
2015-04-21 NaN
2015-04-22 NaN
2015-04-23 NaN
2015-04-24 7.067218
2015-04-25 9.414628
2015-04-26 13.702154
2015-04-27 16.489926
2015-04-28 17.917428
2015-04-29 20.359118
2015-04-30 18.608707
2015-05-01 10.627798
2015-05-02 8.398942
2015-05-03 5.984976
2015-05-04 4.363621
2015-05-05 3.468062
2015-05-06 2.830794
2015-05-07 2.347879
df2=
i_lat i_lon multiplier sum ID distance
226 1092 264 -60.420166 61.420166 609 0.6142016587060164 km
228 1092 265 -129.914662 130.914662 609 1.309146617117938 km
204 1091 264 -203.371915 204.371915 609 2.043719152272311 km
206 1091 265 -233.799786 234.799786 609 2.347997860007727 km
224 1092 263 -240.718140 241.718140 609 2.417181399246371 km
.. ... ... ... ... ... ...
295 1095 268 -969.728516 970.728516 609 9.707285164114008 km
216 1092 259 -977.398084 978.398084 609 9.783980837220454 km
278 1094 269 -984.131470 985.131470 609 9.851314704203592 km
160 1088 267 -994.142285 995.142285 609 9.951422853836982 km
194 1091 259 -996.513606 997.513606 609 9.975136064824323 km
I basically need to do df1["factor"]*df2["multiplier"]+df2["sum"] for every pair of i_lat and i_lon, so that a multi-indexed dataframe like the one below is produced
df_output=
col
i_lat i_lon time
1092 264 2015-04-15 -9.000000e+33
2015-04-16 -9.000000e+33
2015-04-17 -9.000000e+33
2015-04-18 -9.000000e+33
2015-04-19 -9.000000e+33
... ...
1091 259 2015-05-05 -9.000000e+33
2015-05-06 -9.000000e+33
2015-05-07 -9.000000e+33
2015-05-08 -9.000000e+33
2015-05-09 -9.000000e+33
With col holding the result of the operation described above. I tried to use apply as df2.apply(lambda a: print(df1*a["multiplier"]+a["sum"]), axis=1) but it returns something that doesn't make sense. I don't really know how to continue from here.
Thanks!
You can do:
df2 = df2.set_index(['i_lat', 'i_lon'])

(pd.DataFrame(df1.values * df2.multiplier.values + df2['sum'].values,  # broadcast: (dates, 1) * (pairs,) + (pairs,)
              index=df1.index,     # rows: the dates from df1
              columns=df2.index    # columns: the (i_lat, i_lon) pairs
              )
   .unstack()  # pivot the columns over the dates -> Series indexed by (i_lat, i_lon, date)
)
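The chain above returns a Series whose MultiIndex levels are (i_lat, i_lon, date). If you want the single 'col' column shown in the question, something along these lines should work (assuming df2 has already been set to the (i_lat, i_lon) index as above):

result = (pd.DataFrame(df1['factor'].values[:, None] * df2['multiplier'].values + df2['sum'].values,
                       index=df1.index,
                       columns=df2.index)
            .unstack()
            .rename('col')    # name the values column as in the desired output
            .to_frame())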
I am creating a percentile rank over a rolling window of time and would like help refining my approach.
My DataFrame has a multi-index with the first level set to datetime and the second set to an identifier. Ultimately, I’d like the rolling window to evaluate the trailing n periods, including the current period, and produce the corresponding percentile ranks.
I referenced the posts shown below but found they were working with the data a bit differently than how I intend to. In those posts, the final functions group results by identifier and then by datetime, whereas I'm looking to use rolling panels of data in my function (dates and identifiers).
using rolling functions on multi-index dataframe in pandas
Panda rolling window percentile rank
This is an example of what I am after.
Create a sample DataFrame:
import numpy as np
import pandas as pd
from pandas.tseries.offsets import BDay

num_days = 5
max_value = 300  # not defined in the original post; assumed here so the snippet runs
np.random.seed(8675309)
stock_data = {
    "AAPL": np.random.randint(1, max_value, size=num_days),
    "MSFT": np.random.randint(1, max_value, size=num_days),
    "WMT": np.random.randint(1, max_value, size=num_days),
    "TSLA": np.random.randint(1, max_value, size=num_days)
}
dates = pd.date_range(
    start="2013-01-03",
    periods=num_days,
    freq=BDay()
)
sample_df = pd.DataFrame(stock_data, index=dates)
sample_df = sample_df.stack().to_frame(name='data')
sample_df.index.names = ['date', 'ticker']
Which outputs:
                   data
date       ticker
2013-01-03 AAPL 2
MSFT 93
TSLA 39
WMT 21
2013-01-04 AAPL 141
MSFT 43
TSLA 205
WMT 20
2013-01-07 AAPL 256
MSFT 93
TSLA 103
WMT 25
2013-01-08 AAPL 233
MSFT 60
TSLA 13
WMT 104
2013-01-09 AAPL 19
MSFT 120
TSLA 282
WMT 293
The code below breaks sample_df into 2-day increments and ranks within each group rather than over a rolling window of time. So it's close, but not what I'm after.
sample_df.reset_index(level=1, drop=True)[['data']] \
.apply(
lambda x: x.groupby(pd.Grouper(level=0, freq='2d')).rank()
)
I then tried what's shown below without much luck either.
from scipy.stats import rankdata

def rank(x):
    return rankdata(x, method='ordinal')[-1]

sample_df.reset_index(level=1, drop=True) \
    .rolling(window="2d", min_periods=1) \
    .apply(
        lambda x: rank(x)
    )
I finally arrived at the output I'm looking for but the formula seems a bit contrived, so I'm hoping to identify a more elegant approach if one exists.
import numpy as np
import pandas as pd
from pandas.tseries.offsets import BDay
from scipy import stats

window_length = 1
target_column = "data"

def rank(df, target_column, ids, window_length):
    percentile_ranking = []
    list_of_ids = []
    date_index = df.index.get_level_values(0).unique()
    for date in date_index:
        rolling_start_date = date - BDay(window_length)
        first_date = date_index[0] + BDay(window_length)
        trailing_values = df.loc[rolling_start_date:date, target_column]
        # Only calc the rolling percentile after the rolling window has lapsed
        if date < first_date:
            pass
        else:
            percentile_ranking.append(
                df.loc[date, target_column].apply(
                    lambda x: stats.percentileofscore(trailing_values, x, kind="rank")
                )
            )
            list_of_ids.append(df.loc[date, ids])
    ranks, output_ids = pd.concat(percentile_ranking), pd.concat(list_of_ids)
    df = pd.DataFrame(
        ranks.values, index=[ranks.index, output_ids], columns=["percentile_rank"]
    )
    return df
ranks = rank(
sample_df.reset_index(level=1),
window_length=1,
ids='ticker',
target_column="data"
)
sample_df.join(ranks)
I get the feeling that my rank function is more than what's needed here. I appreciate any ideas/feedback to help in simplifying this code to arrive at the output below. Thank you!
data percentile_rank
date ticker
2013-01-03 AAPL 2 NaN
MSFT 93 NaN
TSLA 39 NaN
WMT 21 NaN
2013-01-04 AAPL 141 87.5
MSFT 43 62.5
TSLA 205 100.0
WMT 20 25.0
2013-01-07 AAPL 256 100.0
MSFT 93 50.0
TSLA 103 62.5
WMT 25 25.0
2013-01-08 AAPL 233 87.5
MSFT 60 37.5
TSLA 13 12.5
WMT 104 75.0
2013-01-09 AAPL 19 25.0
MSFT 120 62.5
TSLA 282 87.5
WMT 293 100.0
Edited: the original answer took fixed 2-day groups without any rolling effect, simply grouping the first two days that appeared together. If you want a true rolling window of 2 days:
First, pivot the dataframe to keep the dates as the index and the tickers as columns:
pivoted = sample_df.reset_index().pivot(index='date', columns='ticker', values='data')
Output
ticker AAPL MSFT TSLA WMT
date
2013-01-03 2 93 39 21
2013-01-04 141 43 205 20
2013-01-07 256 93 103 25
2013-01-08 233 60 13 104
2013-01-09 19 120 282 293
Now we can apply a rolling function and consider all stocks that fall in the same rolling window:
import numpy as np
from scipy.stats import rankdata

def pctile(s):
    wdw = sample_df.loc[s.index, :].values.flatten()  ## get all stock values in the window
    ranked = rankdata(wdw) / len(wdw) * 100           ## their percentiles
    return ranked[np.where(wdw == s.iloc[-1])][0]     ## return this value's percentile

pivoted_pctile = pivoted.rolling('2D').apply(pctile, raw=False)
Output
ticker AAPL MSFT TSLA WMT
date
2013-01-03 25.0 100.0 75.0 50.0
2013-01-04 87.5 62.5 100.0 25.0
2013-01-07 100.0 50.0 75.0 25.0
2013-01-08 87.5 37.5 12.5 75.0
2013-01-09 25.0 62.5 87.5 100.0
To get back to the original long format, we just melt the results and restore the (date, ticker) index:
pd.melt(pivoted_pctile.reset_index(), 'date')\
    .sort_values(['date', 'ticker']).set_index(['date', 'ticker'])
Output
value
date ticker
2013-01-03 AAPL 25.0
MSFT 100.0
TSLA 75.0
WMT 50.0
2013-01-04 AAPL 87.5
MSFT 62.5
TSLA 100.0
WMT 25.0
2013-01-07 AAPL 100.0
MSFT 50.0
TSLA 75.0
WMT 25.0
2013-01-08 AAPL 87.5
MSFT 37.5
TSLA 12.5
WMT 75.0
2013-01-09 AAPL 25.0
MSFT 62.5
TSLA 87.5
WMT 100.0
If you prefer it in a single chain:
pd.melt(
    sample_df
        .reset_index()
        .pivot(index='date', columns='ticker', values='data')
        .rolling('2D').apply(pctile, raw=False)
        .reset_index(),
    'date'
).sort_values(['date', 'ticker']).set_index(['date', 'ticker'])
Note that on day 7 the result differs from what you displayed. The window here is genuinely rolling, so on day 7, because there is no day 6, the values are ranked only within that single day: the window holds just four values, and windows don't look forward.
Original
Is this something you might be looking for? I combined the groupby on the date (2 days) with transform so the number of observations is the same as the series provided. As you can see I kept the first observation of the window group.
df = sample_df.reset_index()
df['percentile_rank'] = df.groupby(pd.Grouper(key='date', freq='2D'))['data'] \
    .transform(lambda x: x.rank(ascending=True) / len(x) * 100)
Output
Out[19]:
date ticker data percentile_rank
0 2013-01-03 AAPL 2 12.5
1 2013-01-03 MSFT 93 75.0
2 2013-01-03 WMT 39 50.0
3 2013-01-03 TSLA 21 37.5
4 2013-01-04 AAPL 141 87.5
5 2013-01-04 MSFT 43 62.5
6 2013-01-04 WMT 205 100.0
7 2013-01-04 TSLA 20 25.0
8 2013-01-07 AAPL 256 100.0
9 2013-01-07 MSFT 93 50.0
10 2013-01-07 WMT 103 62.5
11 2013-01-07 TSLA 25 25.0
12 2013-01-08 AAPL 233 87.5
13 2013-01-08 MSFT 60 37.5
14 2013-01-08 WMT 13 12.5
15 2013-01-08 TSLA 104 75.0
16 2013-01-09 AAPL 19 25.0
17 2013-01-09 MSFT 120 50.0
18 2013-01-09 WMT 282 75.0
19 2013-01-09 TSLA 293 100.0
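As a side note, rank(pct=True) computes the same percentile directly, so the division by the group length is not needed (the two only differ when a group contains NaNs); an equivalent sketch of the transform above:

df['percentile_rank'] = df.groupby(pd.Grouper(key='date', freq='2D'))['data'] \
    .rank(ascending=True, pct=True) * 100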
I have two DataFrames, df1 & df2 (see below), which I would like to
merge on one of the common columns, and
conditionally update the other common column.
Sample DataFrames and the expected result:
df1:
A B C
0 123 1819. NaN
1 456 NaN 115
2 789 9012. NaN
3 121 8732. NaN
4 883 NaN 171
5 771 8871. 191
df2:
C B
0 115 41853
1 115 22723
2 115 57302
3 115 91494
4 171 43607
5 171 36327
6 191 39874
7 191 25456
8 191 76283
9 191 97506
merge on column C
# how='left' is necessary
pd.merge(df1, df2, on='C', how='left')
A B_x C B_y
0 123 1819.0 NaN NaN
1 456 NaN 115.0 41853.0
2 456 NaN 115.0 22723.0
3 456 NaN 115.0 57302.0
4 456 NaN 115.0 91494.0
5 789 9012.0 NaN NaN
6 121 8732.0 NaN NaN
7 883 NaN 171.0 43607.0
8 883 NaN 171.0 36327.0
9 771 NaN 191.0 39874.0
10 771 NaN 191.0 25456.0
11 771 NaN 191.0 76283.0
12 771 NaN 191.0 97506.0
Conditionally combine columns B_x and B_y, i.e. replace the NaN values in the left table (B_x) with the non-NaN values from the right table (B_y).
PS: Assume that both B_x and B_y are never simultaneously NaN
The End Result:
A C B
0 123 NaN 1819
1 456 115.0 41853
2 456 115.0 22723
3 456 115.0 57302
4 456 115.0 91494
5 789 NaN 9012
6 121 NaN 8732
7 883 171.0 43607
8 883 171.0 36327
9 771 191.0 39874
10 771 191.0 25456
11 771 191.0 76283
12 771 191.0 97506
I am aware of the function combine_first, but it aligns only on the index.
After the merge, use np.where:
df = pd.merge(df1, df2, on='C', how='left')
df['B'] = np.where(df.B_x.isnull(), df.B_y, df.B_x)  # take B_y where B_x is NaN, otherwise keep B_x
df.drop(columns=['B_x', 'B_y'], inplace=True)
df
Out[136]:
A C B
0 123 NaN 1819.0
1 456 115.0 41853.0
2 456 115.0 22723.0
3 456 115.0 57302.0
4 456 115.0 91494.0
5 789 NaN 9012.0
6 121 NaN 8732.0
7 883 171.0 43607.0
8 883 171.0 36327.0
9 771 191.0 8871.0
10 771 191.0 8871.0
11 771 191.0 8871.0
12 771 191.0 8871.0
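fillna on the merged frame expresses the same conditional combine, if you prefer to avoid np.where; a minimal equivalent:

df = pd.merge(df1, df2, on='C', how='left')
df['B'] = df['B_x'].fillna(df['B_y'])   # keep B_x where present, otherwise fall back to B_y
df = df.drop(columns=['B_x', 'B_y'])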
I am interested in finding out whether a sequence of strings or numbers is contained, as-is, within a bigger or larger sequence of strings or numbers.
Following is a pandas dataframe with two columns: Id and Time. This dataframe is sorted beforehand by the values of Time.
import pandas as pd
label1 = ['422','422','422','428','428','453','453','453','453','453','421','421','421','421','421','422','422','422','424','424','424']
label2 = ['13:08','13:08','13:09','13:12','13:12','13:16','13:16','13:17','13:17','13:18','13:20','13:20','13:20','13:20','13:22', '13:23','13:24','13:24', '13:25','13:25','13:26']
d = {'Id':label1,'Time':label2}
df=pd.DataFrame(d)
df
The output df looks like the following:
In [4]: df
Out[4]:
Id Time
0 422 13:08
1 422 13:08
2 422 13:09
3 428 13:12
4 428 13:12
5 453 13:16
6 453 13:16
7 453 13:17
8 453 13:17
9 453 13:18
10 421 13:20
11 421 13:20
12 421 13:20
13 421 13:20
14 421 13:22
15 422 13:23
16 422 13:24
17 422 13:24
18 424 13:25
19 424 13:25
20 424 13:26
What I have done so far: I have tried to generate a smaller dataframe as follows:
df["Id"] = df['Id'].astype('int')
bb1= df[df['Id'].diff(-1).ne(0)]
bb1
which has produced the following output:
In [59]: bb1
Out[59]:
Id Time
2 422 13:09
4 428 13:12
9 453 13:18
14 421 13:22
17 422 13:24
20 424 13:26
The bb1 dataframe contains the Ids in the order they appear: S1 = [422, 428, 453, 421, 422, 424].
There is also a given sub-sequence, S2 = [421, 422, 424], which happens to be contained in S1.
I need to find out whether the bb1 dataframe contains the sub-sequence of Ids reflected in S2 = [421, 422, 424]. If the sub-sequence is identified, the answer should be returned with the following output:
index Id Time
10 421 13:20
14 421 13:22
15 422 13:23
17 422 13:24
18 424 13:25
20 424 13:26
The desired output contains the first and last timestamps and their associated indices.
I would really appreciate your help.
Working from your bb1, the key is the sub-sequence match. I found a solution here and modified it slightly to fit your situation:
S2 = [421,422,424]
N = len(S2)
# Sub-sequence matching
sub = (bb1.Id.rolling(window=N)
          .apply(lambda x: (x == S2).all(), raw=True)  # 1.0 where the last N Ids equal S2
          .mask(lambda x: x == 0)                      # turn non-matches into NaN
          .bfill(limit=N))                             # back-fill the flag over the matching window
print(sub)
# Output
2 NaN
4 NaN
9 1.0
14 1.0
17 1.0
20 1.0
Name: Id, dtype: float64
# And for final results
sub = sub[sub.eq(1)]                                 # keep only the flagged rows
beg = sub.index[0] + 1                               # one past the previous block's last row, i.e. the start of the matching block
end = sub.index[-1]                                  # last row of the matching block
res = df.loc[beg:end].drop_duplicates(keep='first')
print(res)
# Output
Id Time
10 421 13:20
14 421 13:22
15 422 13:23
16 422 13:24
18 424 13:25
20 424 13:26
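If the rolling/bfill trick feels opaque, a plain-Python scan over bb1 finds the same start and end rows; a rough sketch under the same setup (S2, N and bb1 as above), not tested beyond this example:

ids = bb1['Id'].tolist()
pos = next((i for i in range(len(ids) - N + 1) if ids[i:i + N] == S2), None)  # first match of S2 in ids
if pos is not None:
    beg = bb1.index[pos - 1] + 1 if pos > 0 else df.index[0]  # row just after the previous Id block
    end = bb1.index[pos + N - 1]                              # last row of the matched block
    res = df.loc[beg:end].drop_duplicates(keep='first')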