Merging two pandas dataframes with date variable - python-3.x

I want to merge two pandas dataframes based on a common date variable. Below is my code:
import pandas as pd
data = pd.DataFrame({'date' : pd.to_datetime(['2010-12-31', '2012-12-31']), 'val' : [1,2]})
datarange = pd.DataFrame(pd.period_range('2009-12-31', '2012-12-31', freq='A'), columns = ['date'])
pd.merge(datarange, data, how = 'left', on = 'date')
With this I get the result below:
date val
0 2009 NaN
1 2010 NaN
2 2011 NaN
3 2012 NaN
Could you please help me merge these two dataframes correctly?

Use right_on to convert data['date'] to the same annual periods as the datarange['date'] column:
df = pd.merge(datarange,
              data,
              how='left',
              left_on='date',
              right_on=data['date'].dt.to_period('A'))
print (df)
date date_x date_y val
0 2009 2009 NaT NaN
1 2010 2010 2010-12-31 1.0
2 2011 2011 NaT NaN
3 2012 2012 2012-12-31 2.0
Or create a helper column:
df = pd.merge(datarange,
              data.assign(datetimes=data['date'], date=data['date'].dt.to_period('A')),
              how='left',
              on='date')
print (df)
date val datetimes
0 2009 NaN NaT
1 2010 1.0 2010-12-31
2 2011 NaN NaT
3 2012 2.0 2012-12-31

You need to merge on a common type.
For example you can set the year as merging key on each side:
pd.merge(datarange, data, how='left',
         left_on=datarange['date'].dt.year,
         right_on=data['date'].dt.year)
output:
key_0 date_x date_y val
0 2009 2009 NaT NaN
1 2010 2010 2010-12-31 1.0
2 2011 2011 NaT NaN
3 2012 2012 2012-12-31 2.0
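For reference, the silent NaNs in the question come from a dtype mismatch between the two key columns, which you can confirm directly (a minimal check using the question's own frames):

```python
import pandas as pd

data = pd.DataFrame({'date': pd.to_datetime(['2010-12-31', '2012-12-31']),
                     'val': [1, 2]})
datarange = pd.DataFrame(pd.period_range('2009-12-31', '2012-12-31', freq='A'),
                         columns=['date'])

# the key columns have incompatible dtypes, so no rows ever match
print(datarange['date'].dtype)  # an annual period dtype, e.g. period[A-DEC]
print(data['date'].dtype)       # datetime64[ns]
```

Both answers above work by bringing the two keys to a common type before merging.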

Related

Convert one dataframe's format and check if each row exists in another dataframe in Python

Given a small dataset df1 as follows:
city year quarter
0 sh 2019 q4
1 bj 2020 q3
2 bj 2020 q2
3 sh 2020 q4
4 sh 2020 q1
5 bj 2021 q1
I would like to create a quarterly date range from 2019-q2 to 2021-q1 as the column names of a new dataframe df2, then check, for each city, whether each row's year and quarter in df1 exists in df2.
If they exist, return 'y' for that cell; otherwise, return NaN.
The final result will look like:
city 2019-q2 2019-q3 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
0 bj NaN NaN NaN NaN y y NaN y
1 sh NaN NaN y y NaN NaN y NaN
To create column names for df2:
pd.date_range('2019-04-01', '2021-04-01', freq = 'Q').to_period('Q')
How could I achieve this in Python? Thanks.
We can use crosstab on city and the string concatenation of the year and quarter columns:
new_df = pd.crosstab(df['city'], df['year'].astype(str) + '-' + df['quarter'])
new_df:
col_0 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
city
bj 0 0 1 1 0 1
sh 1 1 0 0 1 0
We can convert to bool, replace False and True to be the correct values, reindex to add missing columns, and cleanup axes and index to get exact output:
col_names = pd.date_range('2019-01-01', '2021-04-01', freq='Q').to_period('Q')
new_df = (
    pd.crosstab(df['city'], df['year'].astype(str) + '-' + df['quarter'])
      .astype(bool)                                    # Counts to boolean
      .replace({False: np.NaN, True: 'y'})             # Fill values
      .reindex(columns=col_names.strftime('%Y-q%q'))   # Add missing columns
      .rename_axis(columns=None)                       # Cleanup axis name
      .reset_index()                                   # Reset index
)
new_df:
city 2019-q1 2019-q2 2019-q3 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
0 bj NaN NaN NaN NaN NaN y y NaN y
1 sh NaN NaN NaN y y NaN NaN y NaN
DataFrame and imports:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'city': ['sh', 'bj', 'bj', 'sh', 'sh', 'bj'],
    'year': [2019, 2020, 2020, 2020, 2020, 2021],
    'quarter': ['q4', 'q3', 'q2', 'q4', 'q1', 'q1']
})
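As a side note, the same counts can also be produced without crosstab, via a groupby plus unstack on the sample df above (a sketch, equivalent to the crosstab step):

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['sh', 'bj', 'bj', 'sh', 'sh', 'bj'],
    'year': [2019, 2020, 2020, 2020, 2020, 2021],
    'quarter': ['q4', 'q3', 'q2', 'q4', 'q1', 'q1']
})

# group on city and a combined year-quarter key, count, then spread to columns;
# missing city/period combinations become NaN instead of 0
out = (df.assign(period=df['year'].astype(str) + '-' + df['quarter'])
         .groupby(['city', 'period']).size().unstack())
```

The boolean conversion and reindex steps from the answer then apply unchanged.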

calculate difference between consecutive date records at an ID level

I have a dataframe as follows:
col 1 col 2
A 2020-07-13
A 2020-07-15
A 2020-07-18
A 2020-07-19
B 2020-07-13
B 2020-07-19
C 2020-07-13
C 2020-07-18
I want it to become the following in a new dataframe
col_3 diff_btw_1st_2nd_date diff_btw_2nd_3rd_date diff_btw_3rd_4th_date
A 2 3 1
B 6 NaN NaN
C 5 NaN NaN
I tried a groupby at the col 1 level, but I'm not getting the intended result. Can anyone help?
Use GroupBy.cumcount to create a counter per col 1 group and reshape with DataFrame.set_index plus Series.unstack, then use DataFrame.diff, remove the first all-NaN column with DataFrame.iloc, convert the timedeltas to days with Series.dt.days for every column, and rename the columns with DataFrame.add_prefix:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.set_index(['col 1', df.groupby('col 1').cumcount()])['col 2']
        .unstack()
        .diff(axis=1)
        .iloc[:, 1:]
        .apply(lambda x: x.dt.days)
        .add_prefix('diff_')
        .reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2 3.0 1.0
1 B 6 NaN NaN
2 C 5 NaN NaN
Or use DataFrameGroupBy.diff with a counter for the new columns created by DataFrame.assign, drop the NaNs in c1 with DataFrame.dropna, and reshape with DataFrame.pivot:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.assign(g=df.groupby('col 1').cumcount(),
                c1=df.groupby('col 1')['col 2'].diff().dt.days)
        .dropna(subset=['c1'])
        .pivot(index='col 1', columns='g', values='c1')
        .add_prefix('diff_')
        .rename_axis(None, axis=1)
        .reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2.0 3.0 1.0
1 B 6.0 NaN NaN
2 C 5.0 NaN NaN
You can assign a cumcount number grouped by col 1, and pivot the table using that cumcount number.
Solution
df["col 2"] = pd.to_datetime(df["col 2"])
# 1. compute date difference in days using diff() and dt accessor
df["diff"] = df.groupby(["col 1"])["col 2"].diff().dt.days
# 2. assign cumcount for pivoting
df["cumcount"] = df.groupby("col 1").cumcount()
# 3. partial transpose, discarding the first difference (NaN)
df2 = df[["col 1", "diff", "cumcount"]]\
    .pivot(index="col 1", columns="cumcount")\
    .drop(columns=[("diff", 0)])
Result
# replace column names for readability
df2.columns = [f"d{i+2}-d{i+1}" for i in range(len(df2.columns))]
print(df2)
d2-d1 d3-d2 d4-d3
col 1
A 2.0 3.0 1.0
B 6.0 NaN NaN
C 5.0 NaN NaN
df after assigning cumcount looks like this:
print(df)
col 1 col 2 diff cumcount
0 A 2020-07-13 NaN 0
1 A 2020-07-15 2.0 1
2 A 2020-07-18 3.0 2
3 A 2020-07-19 1.0 3
4 B 2020-07-13 NaN 0
5 B 2020-07-19 6.0 1
6 C 2020-07-13 NaN 0
7 C 2020-07-18 5.0 1
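For completeness, the sample frame used by all three answers can be reconstructed like this (types inferred from the question), along with the shared per-group diff step:

```python
import pandas as pd

df = pd.DataFrame({
    'col 1': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C'],
    'col 2': ['2020-07-13', '2020-07-15', '2020-07-18', '2020-07-19',
              '2020-07-13', '2020-07-19', '2020-07-13', '2020-07-18'],
})
df['col 2'] = pd.to_datetime(df['col 2'])

# per-group day differences; the first row of each group has no
# previous date, so its diff is NaN
diffs = df.groupby('col 1')['col 2'].diff().dt.days
```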

how to multiply values with group of data from pandas series without loop iteration

I have two pandas time series with different lengths and indexes, plus a Boolean series. series_1 contains the last data point of each month, indexed by the last day of the month; series_2 is daily data with a daily index; the Boolean series is True on the last day of each month and False otherwise.
I want to multiply data from series_1 (s1[0]) by the data from series_2 (s2[1:n]) that makes up the daily data of one month. Is there a way to do it without a loop?
series_1 = 2010-06-30 1
2010-07-30 2
2010-08-31 5
2010-09-30 7
series_2 = 2010-07-01 2
2010-07-02 3
2010-07-03 5
2010-07-04 6
.....
2010-07-30 7
2010-08-01 6
2010-08-02 7
2010-08-03 5
.....
2010-08-31 6
Boolean = False
          False
          ....
          True
          False
          False
          ....
          True
(with only the end of each month True)
I want to get a series s as the result, where s = series_1[i] * series_2[j:j+n] (the n data points from the same month).
How can I do this?
Thanks in advance.
Not sure if I got your question completely right but this should get you there:
series_1 = pd.Series({
    '2010-07-30': 2,
    '2010-08-31': 5
})
series_2 = pd.Series({
    '2010-07-01': 2,
    '2010-07-02': 3,
    '2010-07-03': 5,
    '2010-07-04': 6,
    '2010-07-30': 7,
    '2010-08-01': 6,
    '2010-08-02': 7,
    '2010-08-03': 5,
    '2010-08-31': 6
})
Make the series Datetime aware and resample them to daily frequency:
series_1.index = pd.DatetimeIndex(series_1.index)
series_1 = series_1.resample('1D').asfreq()
series_2.index = pd.DatetimeIndex(series_2.index)
series_2 = series_2.resample('1D').asfreq()
Put them in a dataframe and perform basic multiplication:
df = pd.DataFrame()
df['1'] = series_1
df['2'] = series_2
df['product'] = df['1'] * df['2']
Result:
>>> df
1 2 product
2010-07-30 2.0 7.0 14.0
2010-07-31 NaN NaN NaN
2010-08-01 NaN 6.0 NaN
2010-08-02 NaN 7.0 NaN
2010-08-03 NaN 5.0 NaN
[...]
2010-08-27 NaN NaN NaN
2010-08-28 NaN NaN NaN
2010-08-29 NaN NaN NaN
2010-08-30 NaN NaN NaN
2010-08-31 5.0 6.0 30.0
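If the goal is instead to scale every daily value by its own month's month-end figure (one plausible reading of the question), a loop-free sketch maps both indexes to monthly periods and reindexes series_1 over the days:

```python
import pandas as pd

series_1 = pd.Series({'2010-07-30': 2, '2010-08-31': 5})
series_2 = pd.Series({'2010-07-01': 2, '2010-07-02': 3, '2010-07-30': 7,
                      '2010-08-01': 6, '2010-08-31': 6})

# index series_1 by monthly period, then broadcast it over series_2's days
s1 = series_1.copy()
s1.index = pd.DatetimeIndex(s1.index).to_period('M')
months = pd.DatetimeIndex(series_2.index).to_period('M')
result = series_2 * s1.reindex(months).values
```

Each July value of series_2 is multiplied by 2 and each August value by 5, with no resampling or intermediate NaN rows.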

Create a pandas column based on a lookup value from another dataframe

I have a pandas dataframe that has some data values by hour (which is also the index of this lookup dataframe). The dataframe looks like this:
In [1] print (df_lookup)
Out[1] 0 1.109248
1 1.102435
2 1.085014
3 1.073487
4 1.079385
5 1.088759
6 1.044708
7 0.902482
8 0.852348
9 0.995912
10 1.031643
11 1.023458
12 1.006961
...
23 0.889541
I want to multiply the values from this lookup dataframe to create a column of another dataframe, which has datetime as index.
The dataframe looks like this:
In [2] print (df)
Out[2]
Date_Label ID data-1 data-2 data-3
2015-08-09 00:00:00 1 2513.0 2502 NaN
2015-08-09 00:00:00 1 2113.0 2102 NaN
2015-08-09 01:00:00 2 2006.0 1988 NaN
2015-08-09 02:00:00 3 2016.0 2003 NaN
...
2018-07-19 23:00:00 33 3216.0 333 NaN
I want to calculate the data-3 column from the data-2 column, where the weight given to the 'data-2' column depends on the corresponding value in df_lookup. I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
    df.loc[idx, 'data-3'] = df.loc[idx, 'data-2'] * df_lookup.at[idx.hour]
Is there a faster way someone could suggest?
Using .loc:
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
#df['data-3']=df['data-2']*df_lookup.loc[df.index.hour].values
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)
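A third option, under the same setup, is Index.map, which avoids both the loop and the temporary join columns (a sketch with made-up weights and a two-row frame; 'multiplier' is just an illustrative column name):

```python
import pandas as pd

# hypothetical hourly lookup: one weight per hour (here only hours 0 and 1)
df_lookup = pd.DataFrame({'multiplier': [1.5, 2.0]})

idx = pd.to_datetime(['2015-08-09 00:00:00', '2015-08-09 01:00:00'])
df = pd.DataFrame({'data-2': [100, 200]}, index=idx)

# map each row's hour to its weight, then multiply elementwise
df['data-3'] = df['data-2'] * df.index.hour.map(df_lookup['multiplier']).values
```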

assigning a column to be index and dropping it

As the title suggests, I am reassigning a column as the index, but I want the column to appear only as the index.
df.set_index(df['col_name'], drop = True, inplace = True)
The documentation, as I understand it, says that the above will reassign the column to the df index and drop the initial column. But when I print out the df, the column is now duplicated (as index and still as a column). Can anyone point out what I am missing?
You only need to pass the column name col_name and the parameter inplace to set_index. If you pass not the column name but the column itself, like df['a'], set_index doesn't drop the column, it only copies it into the index:
print df
col_name a b
0 1.255 2003 1
1 3.090 2003 2
2 3.155 2003 3
3 3.115 2004 1
4 3.010 2004 2
5 2.985 2004 3
df.set_index('col_name', inplace = True)
print df
a b
col_name
1.255 2003 1
3.090 2003 2
3.155 2003 3
3.115 2004 1
3.010 2004 2
2.985 2004 3
df.set_index(df['a'], inplace = True)
print df
a b
a
2003 2003 1
2003 2003 2
2003 2003 3
2004 2004 1
2004 2004 2
2004 2004 3
This works for me in Python 2.x, should work for you too (3.x)!
pandas.__version__
u'0.17.0'
df.set_index('col_name', inplace = True)
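The distinction can be checked in one short sketch:

```python
import pandas as pd

df = pd.DataFrame({'col_name': [1.255, 3.090], 'a': [2003, 2003]})

# passing the column *name*: drop=True (the default) removes the column
by_name = df.set_index('col_name')
print('col_name' in by_name.columns)    # False

# passing the column *object*: pandas only sees an array of values,
# so there is nothing to drop and the column stays
by_values = df.set_index(df['col_name'])
print('col_name' in by_values.columns)  # True
```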
