Counting and ranking dates by the number of grid cells where daily rainfall exceeds a threshold in a NetCDF file using Python (python-3.x)

I have daily gridded rainfall data with dimensions (time: 14245, lon: 40, lat: 20). I've calculated 2-, 3-, 5- and 7-day accumulated rainfall and the corresponding 90th percentiles at every grid point in my domain. I've set my condition using DataArray.where(condition, drop=True) to find when the daily rainfall amount exceeds the threshold, as shown in the code below. My current working code is here:
import numpy as np
import pandas as pd
import xarray as xr
# === reading in the data ===
data_path = '/home/wilson/Documents/PH_D/GPCC/GPCC/GPCC_daily_1982-2020.nc'
data = xr.open_dataset(data_path)
# === computing 2, 3, 5 and 7-day accumulated rainfall amounts ===
data['precip_2d'] = np.around(data.precip.rolling(time=2).sum(), decimals=2)
data['precip_3d'] = np.around(data.precip.rolling(time=3).sum(), decimals=2)
data['precip_5d'] = np.around(data.precip.rolling(time=5).sum(), decimals=2)
data['precip_7d'] = np.around(data.precip.rolling(time=7).sum(), decimals=2)
# === computing the 10% largest values at each grid point (per grid cell), i.e. the 90th percentile ===
data['accum_2d_90p'] = np.around(data.precip_2d.quantile(0.9, dim='time'), decimals=2)
data['accum_3d_90p'] = np.around(data.precip_3d.quantile(0.9, dim='time'), decimals=2)
data['accum_5d_90p'] = np.around(data.precip_5d.quantile(0.9, dim='time'), decimals=2)
data['accum_7d_90p'] = np.around(data.precip_7d.quantile(0.9, dim='time'), decimals=2)
# === locating extreme events, i.e. when daily precip exceeds the 90th percentile of each accumulated rainfall amount ===
data['extreme_2d'] = data['precip'].where(data['precip'] > data['accum_2d_90p'], drop=True)
data['extreme_3d'] = data['precip'].where(data['precip'] > data['accum_3d_90p'], drop=True)
data['extreme_5d'] = data['precip'].where(data['precip'] > data['accum_5d_90p'], drop=True)
data['extreme_7d'] = data['precip'].where(data['precip'] > data['accum_7d_90p'], drop=True)
My problem now is how to count the number of grid cells/points within my domain where the condition is true on a particular date, and then use that count to rank the dates in descending order.
The expected result should look like a table that can be saved as a txt file. For example, if cells_count is the variable that contains the desired result, print(cells_count) gives:
Date          Number of grid cells/points
1992-07-01    432
1983-09-23    407
2009-08-12    388

OK, based on your comments it sounds like what you'd like is a list of dates in your time coordinate, ranked by a global summary statistic of the 3D (time, lat, lon) data. I'll use the condition (data['precip'] > data['accum_2d_90p']) as an example.
I'd definitely recommend reducing the dimensionality of your condition first, before working with the dates, because working with ragged 3D datetime arrays is a real pain. Since you mention wanting the count of pixels satisfying the criterion, you can simply do this:
global_pixel_count = (
    (data['precip'] > data['accum_2d_90p']).sum(dim=("lat", "lon"))
)
Now, you have a 1-D array, global_pixel_count, indexed by time only. You can get the ranked order of dates from this using xr.DataArray.argsort:
sorted_date_positions = global_pixel_count.argsort()
sorted_dates = global_pixel_count.time.isel(
    time=sorted_date_positions.values
)
This returns a DataArray of dates that sorts the array in ascending order. You can reverse it to get the dates in descending order with sorted_date_positions.values[::-1], and you can select all the pixel counts in descending order with:
global_pixel_count.isel(time=sorted_date_positions.values[::-1])
You could also index your entire dataset with this indexer:
data.isel(time=sorted_date_positions.values[::-1])
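To get from there to the table in your question, one option (a sketch using the variables defined above; the output file name is just an example) is to pull the dates and counts into a pandas DataFrame, sort it in descending order, and write it out:

import pandas as pd

# table of dates and the number of grid cells exceeding the threshold on each date
cells_count = pd.DataFrame({
    'Date': global_pixel_count['time'].values,
    'Number of grid cells/points': global_pixel_count.values,
}).sort_values('Number of grid cells/points', ascending=False)

# save as a tab-separated text file
cells_count.to_csv('ranked_dates.txt', sep='\t', index=False)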

Related

Trying to group and find margins of pandas dataframe based on multiple columns. Keep getting IndexError

I am trying to calculate the margins between two values based on 2 other columns.
def calcMargin(data):
    # taking out all inquiries with ties
    marginsData = data[data.groupby('ID')['Status'].transform(lambda x: all(x != 'Tie'))]

    def difference(df):  # subtracts 'Accepted' from lowest price
        if len(df) <= 1:
            return pd.NA
        winner = df.loc[(df['Status'] == 'Accepted'), 'Price']
        df = df[df.Status != 'Accepted']
        return min(df['Price']) - winner

    winningMargins = marginsData.groupby('ID').agg(difference(marginsData)).dropna()
    winningMargins.columns = ['Margin']
    winners = marginsData.loc[(marginsData.Status == 'Accepted'), :]
    winners = winners.join(winningMargins, on='ID')
    winnersMargins = winners[['Name', 'Margin']].groupby('Name').sum().reset_index()
To explain a bit further, I am trying to find the difference between two prices. One of them is wherever the "Accepted" value is in the second column. The other price is whatever is the lowest price after the "Accepted" row is extracted, then taking the difference between the two. But this is based on grouping by a third column, the ID column. Then, trying to attach the margin to the winner, 'Name', in the fourth column.
I keep getting the error -- IndexError: index 25 is out of bounds for axis 0 with size 0. Not 100% sure how to fix this, or if my code is correct.
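No answer is included above, but a likely source of trouble is the .agg(difference(marginsData)) line: it calls difference on the whole frame instead of passing the function to groupby, and winner is a one-row Series rather than a scalar. A hedged sketch of one way to restructure it with groupby(...).apply(...), assuming the ID/Status/Price/Name columns from the question:

import pandas as pd

def difference(df):
    # margin = lowest losing price minus the accepted (winning) price
    winner = df.loc[df['Status'] == 'Accepted', 'Price']
    others = df.loc[df['Status'] != 'Accepted', 'Price']
    if winner.empty or others.empty:
        return pd.NA
    return others.min() - winner.iloc[0]

def calcMargin(data):
    # drop inquiries that contain ties
    marginsData = data[data.groupby('ID')['Status'].transform(lambda x: all(x != 'Tie'))]
    winningMargins = marginsData.groupby('ID').apply(difference).dropna().rename('Margin')
    winners = marginsData.loc[marginsData.Status == 'Accepted']
    winners = winners.join(winningMargins, on='ID')
    return winners[['Name', 'Margin']].groupby('Name').sum().reset_index()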

Arithmetic operations for groups within a dataframe

I have loaded multiple CSVs (time series) to create one dataframe. This dataframe contains data for multiple stocks. Now I want to calculate the 1-month return for all the datapoints.
There are 172 datapoints for each stock, i.e. from index 0 to 171. The time series for the next stock starts from index 0 again.
When I try to calculate the 1-month return, it is calculated correctly for all datapoints except index 0 of each new stock, because it takes the difference with index 171 of the previous stock.
I want the return to be calculated per stock name, so I tried a for loop, but it doesn't seem to work.
e.g. In the attached image (highlighted) the 1-month return for company ITC is calculated against SHREECEM. I expect the first value of 1Mreturn for SHREECEM to be NaN.
Using groupby instead of a for loop you can get the result you want:
Mreturn_function = lambda df: df['mean_price'].diff(periods=1)/df['mean_price'].shift(1)*100
gw_stocks.groupby('CompanyName').apply(Mreturn_function)
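Equivalently, pandas' built-in pct_change does the diff-and-divide in one step and restarts at each group boundary, so the first row of every company comes out NaN (a sketch assuming the same gw_stocks frame and column names):

# percentage 1-month return per company; first row of each group is NaN
gw_stocks['1Mreturn'] = gw_stocks.groupby('CompanyName')['mean_price'].pct_change() * 100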

Calculate percentage of grouped values

I have a Pandas dataframe that looks like this:
I calculated the number of Win, Lost and Draw results for each year, so it now looks like:
My goal is to calculate the percentage of each score group by year, to end up with:
But I'm stuck here.
I looked in this thread but was not able to apply it to my df.
Any thoughts?
Here is quite a simple method I wrote for this task:
1. Create a dataframe of the total score within each year:
total_score = df.groupby('year')['score'].sum().reset_index(name='total_score_each_year')
2. Merge the original and the new dataframe into a single dataframe:
df = df.merge(total_score, on='year')
3. Calculate the percentages:
df['percentages'] = 100 * (df['score'] / df['total_score_each_year'])
That's it, I hope it helps :)
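For concreteness, here is the same three-step recipe on a small made-up frame (the year/score columns mirror the answer; the numbers are invented):

import pandas as pd

df = pd.DataFrame({
    'year':   [2019, 2019, 2019, 2020, 2020, 2020],
    'result': ['Win', 'Lost', 'Draw', 'Win', 'Lost', 'Draw'],
    'score':  [6, 3, 1, 4, 4, 2],
})

total_score = df.groupby('year')['score'].sum().reset_index(name='total_score_each_year')
df = df.merge(total_score, on='year')
df['percentages'] = 100 * (df['score'] / df['total_score_each_year'])
print(df)  # 2019: Win 60%, Lost 30%, Draw 10%; 2020: Win 40%, Lost 40%, Draw 20%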
You could try using df.iat[row, column].
It would look something like this:
percentages = []
for i in range(len(df) // 3):
    draws = df.iat[3 * i, 2]
    losses = df.iat[3 * i + 1, 2]
    wins = df.iat[3 * i + 2, 2]
    nbr_of_games = draws + losses + wins
    percentages.append(draws * 100 / nbr_of_games)   # percentage of draws
    percentages.append(losses * 100 / nbr_of_games)  # percentage of losses
    percentages.append(wins * 100 / nbr_of_games)    # percentage of wins
df["percentage"] = percentages
This may not be the fastest way to do it, but I hope it helps!
Similar to @panter's answer, but in only one line and without creating any additional DataFrame:
df['percentage'] = df.merge(df.groupby('year').score.sum(), on='year', how='left').apply(
    lambda x: x.score_x * 100 / x.score_y, axis=1
)
In detail:
df.groupby('year').score.sum() creates a Series with the sum of the scores per year.
df.merge creates a DataFrame equal to the original df, but with the column score renamed to score_x and an additional column score_y that holds the sum of all the scores for that row's year; how='left' keeps only the rows of the left DataFrame, i.e., df.
.apply computes the corresponding percentage for each row, using score_x and score_y (mind the axis=1 option, which applies the lambda row by row).
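A further simplification along the same lines (my own variation, not from the answers above) skips the merge entirely with groupby(...).transform, which broadcasts each year's total back onto its rows:

# per-year totals aligned row by row, so a plain column division suffices
df['percentage'] = df['score'] * 100 / df.groupby('year')['score'].transform('sum')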

Going from monthly average dataframe to an interpolated daily timeseries

I am interested in taking the average monthly values, for each month, and setting each monthly average to be the value on the 15th day of that month (within a daily timeseries).
I start with the following (these are the monthly average values I am given):
m_avg = pd.DataFrame({'Month': ['1.527013956', '1.899169054', '1.669356146',
                                '1.44920871', '1.188557788', '1.017035727',
                                '0.950243755', '1.022453993', '1.203913739',
                                '1.369545041', '1.441827406', '1.48621651']})
EDIT: I added one more value to the dataframe so that there are now 12 values.
Next, I want to put each of these monthly values on the 15th day (within each month) for the following time period:
ts = pd.date_range(start='1/1/1950', end='12/31/1999', freq='D')
I know how to pull out the date on 15th day of an already existing daily timeseries by using:
df = df.loc[(df.index.day == 15)]  # where df is any daily timeseries
Lastly, I know how to interpolate the values once I have the average monthly values on the 15th day of each month, using:
df.loc[:, ['Col1']] = df.loc[:, ['Col1']].interpolate(method='linear', limit_direction='both', limit=100)
How do I get from the monthly DataFrame to an interpolated daily DataFrame, where I linearly interpolate between the 15th day of each month, which is the monthly value of my original DataFrame by construction?
EDIT:
Your suggestion to use np.tile() was good, but I ended up needing to do this for multiple columns. Instead of np.tile, I used:
index = pd.date_range(start='1/1/1950', end='12/31/1999', freq='MS')
m_avg = pd.concat([month] * 50, axis=0).set_index(index)
(month here is the 12-row monthly DataFrame; 50 repeats match the 600 month starts in the index.)
There may be a better solution out there, but this is working for my needs so far.
Here is one way to do it:
import pandas as pd
import numpy as np
# monthly averages, note these should be cast to float
month = np.array(['1.527013956', '1.899169054', '1.669356146',
'1.44920871', '1.188557788', '1.017035727',
'0.950243755', '1.022453993', '1.203913739',
'1.369545041', '1.441827406', '1.48621651'], dtype='float')
# expand this to 51 years, with the same monthly averages repeating each year
# (obviously not very efficient, probably there are better ways to attack the problem,
# but this was the question)
month = np.tile(month, 51)
# create DataFrame with these values
m_avg = pd.DataFrame({'Month': month})
# set the date index to the desired time period
m_avg.index = pd.date_range(start='1/1/1950', end='12/1/2000', freq='MS')
# shift the index by 14 days to land on the 15th of each month
# (shift with a freq moves the index; this replaces the deprecated tshift)
m_avg = m_avg.shift(14, freq='D')
# expand the index to daily frequency
daily = m_avg.asfreq(freq='D')
# interpolate (linearly) the missing values
daily = daily.interpolate()
# show result (use display(daily) in a notebook)
print(daily)
Output:
Month
1950-01-15 1.527014
1950-01-16 1.539019
1950-01-17 1.551024
1950-01-18 1.563029
1950-01-19 1.575034
... ...
2000-12-11 1.480298
2000-12-12 1.481778
2000-12-13 1.483257
2000-12-14 1.484737
2000-12-15 1.486217
18598 rows × 1 columns
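As a side note, the asfreq/interpolate pair can be collapsed into one resample step (a sketch on the same m_avg; equivalent for linear interpolation):

# expand to daily frequency and linearly interpolate in one go
daily = m_avg.resample('D').interpolate(method='linear')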

Using numpy present value function

I am trying to compute the present value using numpy's pv function in a pandas dataframe.
I also have 2 lists: one contains periods [6, 18, 24] and the other contains pmt values [100, 200, 300].
The present value should be computed for each value in the pmt list against each value in the period list.
Let's say in the table below the column values represent period and the rows represent pmt.
I am trying to compute the data values in a single line of code, without writing multiple lines. How can I do that?
Currently I have hard-coded the period as follows.
PRESENT_VALUE6 = np.pv(pmt=-PMT_REMAINING_PERIOD,rate=(INTEREST_RATE/12),nper=6,fv=0,when=0)
PRESENT_VALUE18 = np.pv(pmt=-PMT_REMAINING_PERIOD,rate=(INTEREST_RATE/12),nper=18,fv=0,when=0)
PRESENT_VALUE30 = np.pv(pmt=-PMT_REMAINING_PERIOD,rate=(INTEREST_RATE/12),nper=30,fv=0,when=0)
I want Python to iterate nper over the list; currently, when I do that, it does not produce the expected result.
The expected result is:
I don't know what interest rate you used in your example, so I set it to 10% below:
from itertools import product

import numpy as np
import pandas as pd

INTEREST_RATE = 0.1
# Build a Cartesian product between PMT and Period
pmt = [100, 200, 300]
period = [6, 18, 24]
df = pd.DataFrame(product(pmt, period), columns=['PMT', 'Period'])
# Calculate the PV
df['PV'] = np.pv(INTEREST_RATE / 12, nper=df['Period'], pmt=-df['PMT'])
# Final pivot
df.pivot(index='PMT', columns='Period')
Result:
PV
Period 6 18 24
PMT
100 582.881717 1665.082618 2167.085483
200 1165.763434 3330.165236 4334.170967
300 1748.645151 4995.247853 6501.256450
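One caveat worth noting (an editorial addition, not part of the original answer): the financial functions, including np.pv, were deprecated in NumPy 1.18 and removed in 1.20, so on recent NumPy the same call comes from the separate numpy-financial package:

import numpy_financial as npf  # pip install numpy-financial

# same signature as the old np.pv
df['PV'] = npf.pv(INTEREST_RATE / 12, nper=df['Period'], pmt=-df['PMT'])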
