How do I select all dates that are most recent in comparison with a different date field? - python-3.x

I'm trying to gather all dates between 06-01-2020 and 06-30-2020 based on the forecast date, which can be 06-08-2020, 06-20-2020, or 06-24-2020. The problem I am running into is that I'm only grabbing the dates associated with the forecast date 06-24-2020. I need the most recent row for every date: if, say, 06-03-2020 occurs with the forecast date 06-08-2020 but not with 06-20-2020, I still need the rows associated with that forecast date. Here's the code I am currently using.
df = df[df['Forecast Date'].isin([max(df['Forecast Date'])])]
It's producing this:
            Date Media Granularity Forecast Date
5668  2020-06-25               NaN    2020-06-24
5669  2020-06-26               NaN    2020-06-24
5670  2020-06-27               NaN    2020-06-24
5671  2020-06-28               NaN    2020-06-24
5672  2020-06-29               NaN    2020-06-24
5673  2020-06-30               NaN    2020-06-24
That result has a length of 6 (len(df[df['Forecast Date'].isin([max(df['Forecast Date'])])])), but it needs to be 30, one row for each unique date. The filter only keeps the rows where Forecast Date equals the overall maximum, 06-24-2020.
I'm thinking it's something along the lines of df.sort_values(df[['Date', 'Forecast Date']]).drop_duplicates(df['Date'], keep='last'), but it's giving me a KeyError.

It was simpler than I expected: both methods take column names, not the columns themselves.
df = df.sort_values(by=['Date', 'Forecast Date']).drop_duplicates(subset=['Date'], keep='last')
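The idea can be sketched on a toy frame (column names assumed to match the question: one date appears only under an older forecast, another appears under two forecasts):

```python
import pandas as pd

# Toy frame: 06-03 occurs only under forecast 06-08,
# while 06-25 occurs under both 06-20 and 06-24.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2020-06-03', '2020-06-25', '2020-06-25']),
    'Forecast Date': pd.to_datetime(['2020-06-08', '2020-06-20', '2020-06-24']),
})

# Sort so the newest forecast for each Date comes last, then keep that last row
latest = (df.sort_values(by=['Date', 'Forecast Date'])
            .drop_duplicates(subset=['Date'], keep='last'))
```

Each unique Date survives with its most recent Forecast Date, including dates that only ever appeared under an older forecast.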

Related

How to display latitude and longitude as two separate columns from current single with no separator?

I have a data frame in Pandas with a column labeled “Location.”
The column has object dtype, and its values look like this:
Location
--------
POINT (-73.525969 41.081897)
I’d like to remove the formatting and store each of the data points in two columns: Latitude and Longitude, which would have to be created. How would I accomplish this?
Similar posts I've reviewed always have a delimiter between the numbers (such as a comma), but this doesn't.
Thank you!
   Serial Number  List Year  Date Recorded  Town         Address             Assessed Value  Sale Amount  Sales Ratio  Property Type  Residential Type  Non Use Code  Assessor Remarks  OPM remarks  Location
0  141466         2014       2015-08-06     Stamford     83 OVERBROOK DRIVE  503270.0        850000.0     0.592082     Residential    Single Family     NaN           NaN               NaN          POINT (-73.525969 41.081897)
1  140604         2014       2015-06-29     New Haven    56 HIGHVIEW LANE    86030.0         149900.0     0.573916     Residential    Single Family     NaN           NaN               NaN          POINT (-72.878115 41.30285)
2  14340          2014       2015-07-01     Ridgefield   32 OVERLOOK DR      351880.0        570000.0     0.617333     Residential    Single Family     NaN           NaN               NaN          POINT (-73.508273 41.286223)
3  140455         2014       2015-04-30     New Britain  171 BRADFORD WALK   204680.0        261000.0     0.784215     Residential    Condo             NaN           NaN               NaN          POINT (-72.775592 41.713335)
4  141195         2014       2015-06-26     Stamford     555 GLENBROOK ROAD  229330.0        250000.0     0.917320     Residential    Single Family     NaN           NaN               NaN          POINT (-73.519774 41.07203)
You can use pandas.Series.str.extract:
df['Location'].str.extract(r'POINT \((?P<longitude>[-\d.]+)\s+(?P<latitude>[-\d.]+)\)').astype(float)
Note that a WKT POINT lists longitude first, then latitude. You might need to change the \s+ separator depending on your real input.
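If the wrapper is always the literal "POINT (" prefix and ")" suffix, a regex isn't strictly needed; a plain slice-and-split also works. A minimal sketch on the sample values:

```python
import pandas as pd

df = pd.DataFrame({'Location': ['POINT (-73.525969 41.081897)',
                                'POINT (-72.878115 41.30285)']})

# Drop the 7-character "POINT (" prefix and the trailing ")",
# then split on whitespace and convert to float.
coords = df['Location'].str[7:-1].str.split(expand=True).astype(float)

# WKT points are written (longitude latitude)
df[['Longitude', 'Latitude']] = coords.to_numpy()
```

This assumes every row matches the "POINT (x y)" shape exactly; the extract-based answer above is more robust to variations.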

How to update rows based on the previous row of a dataframe in Python

I have a time series data given below:
date product price amount
11/01/2019 A 10 20
11/02/2019 A 10 20
11/03/2019 A 25 15
11/04/2019 C 40 50
11/05/2019 C 50 60
My real data is high-dimensional; I have included a simplified version with just two value columns, price and amount. I am trying to transform it into relative changes based on the time index, as illustrated below:
date product price amount
11/01/2019 A NaN NaN
11/02/2019 A 0 0
11/03/2019 A 15 -5
11/04/2019 C NaN NaN
11/05/2019 C 10 10
I am trying to get the relative change of each product over time. If no previous date exists for a product, the value should be NaN.
Is there a function to do this?
Group by product and use .diff():
df[["price", "amount"]] = df.groupby("product")[["price", "amount"]].diff()
Output:
date product price amount
0 2019-11-01 A NaN NaN
1 2019-11-02 A 0.0 0.0
2 2019-11-03 A 15.0 -5.0
3 2019-11-04 C NaN NaN
4 2019-11-05 C 10.0 10.0
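End to end, the example above can be reproduced as follows (a sketch using the sample rows from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2019-11-01', '2019-11-02', '2019-11-03',
                            '2019-11-04', '2019-11-05']),
    'product': ['A', 'A', 'A', 'C', 'C'],
    'price': [10, 10, 25, 40, 50],
    'amount': [20, 20, 15, 50, 60],
})

# .diff() runs independently inside each product group; the first row of
# each group has no previous value, so it becomes NaN automatically.
df[['price', 'amount']] = df.groupby('product')[['price', 'amount']].diff()
```

No explicit NaN handling is needed: the group boundary produces the NaN rows on its own.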

Resampling data ending on a specific date

I have the following dataframe:
data
Out[8]:
Population
date
1980-11-03 1591.4
1980-11-10 1592.9
1980-11-17 1596.3
1980-11-24 1597.2
1980-12-01 1596.1
...
2020-06-01 18152.1
2020-06-08 18248.6
2020-06-15 18328.9
2020-06-22 18429.1
2020-06-29 18424.4
[2070 rows x 1 columns]
If I resample it by year I get this:
data.resample('Y').mean()
date
1980-12-31 1596.144444
1981-12-31 1678.686538
1982-12-31 1829.826923
Is it possible to resample so that the bins are labelled with a specific date, such as Jan 1st? The dates in the above example would then be 1980-01-01 instead of 1980-12-31.
What we can do is change the frequency alias to 'YS' (year start):
df.resample('YS').mean()
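The only difference between the two aliases is the bin label: 'Y' labels each yearly bin with its end date, 'YS' with its start date; the aggregated values are the same. A small sketch (note that the exact alias spellings vary slightly across pandas versions):

```python
import pandas as pd

# Weekly observations starting Monday 1980-11-03, spilling into 1981
idx = pd.date_range('1980-11-03', periods=10, freq='W-MON')
data = pd.DataFrame({'Population': range(10)}, index=idx)

year_end = data.resample('Y').mean()    # bins labelled 1980-12-31, 1981-12-31
year_start = data.resample('YS').mean() # bins labelled 1980-01-01, 1981-01-01
```

Both results contain the same means per calendar year; only the index labels differ.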

Dividing two dataframes gives NaN

I have two dataframes, one with a metric as of the last day of the month. The other contains a metric summed for the whole month. The former (monthly_profit) looks like this:
profit
yyyy_mm_dd
2018-01-01 8797234233.0
2018-02-01 3464234233.0
2018-03-01 5676234233.0
...
2019-10-01 4368234233.0
While the latter (monthly_employees) looks like this:
employees
yyyy_mm_dd
2018-01-31 924358
2018-02-28 974652
2018-03-31 146975
...
2019-10-31 255589
I want to get profit per employee, so I've done this:
profit_per_employee = (monthly_profit['profit']/monthly_employees['employees'])*100
This is the output that I get:
yyyy_mm_dd
2018-01-01 NaN
2018-01-31 NaN
2018-02-01 NaN
2018-02-28 NaN
How could I fix this? The reason that one dataframe is the last day of the month and the other is the first day of the month is due to rolling vs non-rolling data.
monthly_profit is the result of grouping and summing daily profit data:
monthly_profit = df.groupby(['yyyy_mm_dd'])[['profit']].sum()
monthly_profit = monthly_profit.resample('MS').sum()
While monthly_employees is a running total, so I need to take the current value for the last day of each month:
monthly_employees = df.groupby(['yyyy_mm_dd'])[['employees']].sum()
monthly_employees = monthly_employees.groupby([monthly_employees.index.year, monthly_employees.index.month]).tail(1)
Change 'MS' to 'M' so the profit index uses month-end dates and both DatetimeIndexes match:
monthly_profit = monthly_profit.resample('M').sum()
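The NaN output comes from index alignment: Series division matches values by label, and a month-start label never equals a month-end label, so every division happens against a missing value. A minimal sketch with made-up numbers (100 and 200 are illustrative, not from the question):

```python
import pandas as pd

profit = pd.Series([100.0, 200.0],
                   index=pd.to_datetime(['2018-01-01', '2018-02-01']),
                   name='profit')
employees = pd.Series([10, 20],
                      index=pd.to_datetime(['2018-01-31', '2018-02-28']),
                      name='employees')

# Labels never coincide, so every result is NaN
misaligned = profit / employees

# Re-bin profit onto month-end labels; now the indexes line up
aligned = profit.resample('M').sum() / employees
```

Once both series carry month-end labels, the division produces one value per month instead of NaN.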

Create a pandas column based on a lookup value from another dataframe

I have a pandas dataframe that has some data values by hour (which is also the index of this lookup dataframe). The dataframe looks like this:
In [1] print (df_lookup)
Out[1] 0 1.109248
1 1.102435
2 1.085014
3 1.073487
4 1.079385
5 1.088759
6 1.044708
7 0.902482
8 0.852348
9 0.995912
10 1.031643
11 1.023458
12 1.006961
...
23 0.889541
I want to multiply the values from this lookup dataframe to create a column of another dataframe, which has datetime as index.
The dataframe looks like this:
In [2] print (df)
Out[2]
Date_Label ID data-1 data-2 data-3
2015-08-09 00:00:00 1 2513.0 2502 NaN
2015-08-09 00:00:00 1 2113.0 2102 NaN
2015-08-09 01:00:00 2 2006.0 1988 NaN
2015-08-09 02:00:00 3 2016.0 2003 NaN
...
2018-07-19 23:00:00 33 3216.0 333 NaN
I want to calculate the data-3 column from data-2 column, where the weight given to 'data-2' column depends on corresponding value in df_lookup. I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
    df.loc[idx, 'data-3'] = df.loc[idx, 'data-2'] * df_lookup.at[idx.hour]
Can someone suggest a faster way?
Using .loc
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
#df['data-3']=df['data-2']*df_lookup.loc[df.index.hour].values
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)
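A runnable version of the join approach, on toy data (the hour-to-weight values and the two timestamps are illustrative, not from the question):

```python
import pandas as pd

# Lookup frame: index is the hour of day, one weight column
df_lookup = pd.DataFrame({'multiplier': [1.1, 1.2]}, index=[0, 1])

df = pd.DataFrame(
    {'data-2': [100.0, 200.0]},
    index=pd.to_datetime(['2015-08-09 00:00:00', '2015-08-09 01:00:00']),
)

# Derive the join key from the datetime index, join, multiply, clean up
df['hour'] = df.index.hour
df = df.join(df_lookup, on='hour')
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(columns=['multiplier', 'hour'])
```

The join is vectorized, so it avoids the per-row .loc writes of the original loop entirely.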
