Print the first value of a dataframe based on condition, then iterate to the next sequence - python-3.x

I'm looking to perform data analysis on 100 years of daily climatological data for select U.S. locations (8 in particular). I have a pandas DataFrame set up with columns for Max temperature, Min temperature, Avg temperature, Snowfall, Precip Total, and Day, Year, and Month values, plus an index based on a date-time value. Right now, I want to set up a for loop to print the first maximum temperature of 90 degrees F or greater from each year, but ONLY the first. Eventually, I want to narrow this down to each of my 8 locations, but first I just want to get the for loop to work.
I have experimented with various iterations of a for loop:
for year in range(len(climate['Year'])):
    if (climate['Max'][year] >= 90).all():
        print(climate.index[year])
        break
Unsurprisingly, the output of the loop I provided prints the first 90 degree day period (from the year 1919, the beginning of my data frame) and breaks.
for year in range(len(climate['Year'])):
    if (climate['Max'][year] >= 90).all():
        print(climate.index[year])
        break
1919-06-12 00:00:00
That's fine. If I take out the break statement, all of the 90 degree days print, including multiple in the same year. I just want the first value from each year to print. Do I need to set up a second for loop to increment through the years? If I explicitly state the year, as below, while trying to loop through a counter, the loop still begins in 1919 and eventually reaches an out-of-bounds index. I know this logic is incorrect.
count = 1919
while count < 2019:
    for year in range(len(climate['Year'])):
        if (climate[climate['Year']==count]['Max'][year] >= 90).all():
            print(climate.index[year])
    count = count+1
Any input is sincerely appreciated.

You can achieve this without having a second for loop. Assuming the climate dataframe is ordered chronologically, this should do what you want:
current_year = None
for i in range(climate.shape[0]):
    # .iloc gives positional access, which is needed because the index holds dates
    if climate['Max'].iloc[i] >= 90 and climate['Year'].iloc[i] != current_year:
        print(climate.index[i])
        current_year = climate['Year'].iloc[i]
Notice that we're using the current_year variable to keep track of the latest year that we've already printed the result for. Then, in the if check, we're checking if we've already printed a result for the year of the current row in the loop.
That's one way to do it, but I would suggest taking a look at pandas.DataFrame.groupby because I think it fits your use case well. You could get a dataframe that contains the first >=90 max days per year with the following (again assuming climate is ordered chronologically):
climate[climate.Max >= 90].groupby('Year').first()
This just filters the dataframe to only contain the >=90 max days, groups rows from the same year together, and retains only the first row from each group. If you had an additional column Location, you could extend this to get the same except per location per year:
climate[climate.Max >= 90].groupby(['Location', 'Year']).first()
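One thing to be aware of: groupby(...).first() returns a result indexed by the group keys, so the date index is not part of the output, and first() takes the first non-null value of each column separately. If the date of the first hot day is what you want to see, a minimal sketch along these lines may help. The sample data is made up for illustration, and drop_duplicates is swapped in here as an alternative to first(), not something the original answer used:

```python
import pandas as pd

# Hypothetical sample data; the column names mirror the question.
dates = pd.to_datetime(
    ["1919-06-11", "1919-06-12", "1919-07-01", "1920-05-30", "1920-06-02"]
)
climate = pd.DataFrame(
    {"Max": [88, 91, 95, 92, 89], "Year": [1919, 1919, 1919, 1920, 1920]},
    index=dates,
)

# Keep only the days at or above 90 degrees (data assumed chronological).
hot = climate[climate["Max"] >= 90]

# One row per year, indexed by Year; the dates are not in this result.
first_by_year = hot.groupby("Year").first()

# Keeping the first row per Year with drop_duplicates preserves the date index.
first_hot_days = hot.drop_duplicates(subset="Year", keep="first")
print(first_hot_days)
```

With a Location column, passing subset=["Location", "Year"] to drop_duplicates should give the per-location version.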

Related

Is there any function in excel to find day time between two date and time in Excel?

I need a formula to calculate the working time between two date-times, counting only work hours between 09:00 and 18:00 and excluding lunch time, holidays, and weekends.
For example, for 25/07/2022 12:00 and 29/07/2022 10:00 the answer has to be 1 day, 06:00.
Thanks in advance.
I had a formula, but it didn't work when the difference was bigger than 24 hours.
I don't know how you got to 1 day and 6 hours, but here is a customizable way to filter your time difference calculation:
=LET(
start,E3,
end,E4,
holidays,$B$3:$B$5,
array,SEQUENCE(INT(end)-INT(start)+1,24,INT(start),TIME(1,0,0)),
crit_1,array>=start,
crit_2,array<=end,
crit_3,WEEKDAY(array,2)<6,
crit_4,HOUR(array)>=9,
crit_5,HOUR(array)<=18,
crit_6,HOUR(array)<>13,
crit_7,ISERROR(MATCH(DATE(YEAR(array),MONTH(array),DAY(array)),holidays,0)),
result,SUM(crit_1*crit_2*crit_3*crit_4*crit_5*crit_6*crit_7),
result
)
Limitation
This solution only works on an hourly level, i.e. the start and end dates and times will only be considered on a full hour basis. When providing times like 12:45 as input, the 15 minute increment won't be accounted for.
Explanation
The 4th item in the LET() function SEQUENCE(INT(end)-INT(start)+1,24,INT(start),TIME(1,0,0)) creates an array that contains all hours within the start and end date of the range:
(transposed for illustrative purposes)
then, based on that array, the different 'crit_n' statements are the individual criteria you mentioned. For example, crit_1,array>=start means that only the dates and times after the start date and time will be counted, or crit_6,HOUR(array)<>13 is the lunch break (assuming the 13th hour is lunch time), ...
All of the individual crit_n's are then arrays of the same size containing TRUE and FALSE elements.
At the end of the LET() function, by multiplying all the individual crit_n arrays, the product returns a single array that will then only contain those hours where all individual criteria statements are TRUE:
So then the SUM() function is simply returning the total number of hours that fit all criteria.
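To sanity-check the hour count outside Excel, the same criteria can be mirrored in a few lines of Python. This is only a sketch of the counting logic described above, not a translation of the LET() formula itself; the function name count_work_hours and its parameters are made up for illustration, and the holiday and lunch hour are the assumptions stated in the example (the 28th and hour 13):

```python
from datetime import date, datetime, timedelta

def count_work_hours(start, end, holidays, lunch_hour=13):
    """Count the full-hour slots between start and end that pass every criterion."""
    total = 0
    # Walk every full hour from 00:00 on the start date to 23:00 on the end date,
    # like the array built by SEQUENCE(...) in the formula.
    current = datetime(start.year, start.month, start.day)
    last = datetime(end.year, end.month, end.day, 23)
    while current <= last:
        criteria = (
            current >= start,                # crit_1
            current <= end,                  # crit_2
            current.isoweekday() < 6,        # crit_3: Monday to Friday
            current.hour >= 9,               # crit_4
            current.hour <= 18,              # crit_5
            current.hour != lunch_hour,      # crit_6
            current.date() not in holidays,  # crit_7
        )
        total += all(criteria)  # True counts as 1, False as 0
        current += timedelta(hours=1)
    return total

hours = count_work_hours(
    datetime(2022, 7, 25, 12),
    datetime(2022, 7, 29, 10),
    holidays={date(2022, 7, 28)},  # assumed holiday, as in the example
)
print(hours)
```

Note that, as written, crit_5 (hour <= 18) also counts the slot that starts at 18:00, and crit_2 counts the slot that starts exactly at the end time; tightening either to a strict comparison is a matter of how you define the working day.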
Example
I assumed lunch hour to be hour 13, and I assumed the 28th to be a holiday within the given range. With those assumptions and the other criteria you already specified above, I'm getting the following result:
Which looks like this when going into the formula bar:
In cell G2, you can put the following formula:
=LET(from,A2:A4,to,B2:B4,holidays,C2:C2,startHr,E1,endHr,E2, lunchS, E3, lunchE, E4,
CALC, LAMBDA(date,isFrom, LET(noWkDay, NETWORKDAYS(date,date,holidays)=0,
IF(noWkDay, 0, LET(d, INT(date), start, d + startHr, end, d + endHr,
noOverlap, IF(isFrom, date > end, date < start), lunchDur, lunchE-lunchS,
ls, d + lunchS, le, d + lunchE,
isInner, IF(isFrom, date > start, date < end),
diff, IF(isFrom, end-date-1 - IF(date < ls, lunchDur, 0),
date-start-1 - IF(date > le, lunchDur, 0)),
IF(noOverlap, -1, IF(isInner, diff, 0)))))),
MAP(from,to,LAMBDA(ff,tt, LET(wkdays, NETWORKDAYS(ff,tt,holidays),
duration, wkdays + CALC(ff, TRUE) + CALC(tt, FALSE),
days, INT(duration), time, duration - TRUNC(duration),
TEXT(days, "d") &" days "& TEXT(time, "hh:mm") &" hrs "
)))
)
and here is the output:
Explanation
Used LET function for easy reading and composition. The main idea is first to calculate the number of working days, excluding holidays, between the from column value and the to column value. We use the NETWORKDAYS function for that. Once we have this value for each row, we need to adjust it for the first day and last day of the interval, in case they cannot be counted as full days and their hours have to be considered instead. Inner days (not the start or end of the interval) are counted as entire days.
We use MAP function to do the calculation over all values of from and to names. For each corresponding value (ff, tt) we calculate the working days (wkdays). Once we have this value, we use the user LAMBDA function CALC to adjust it. The function has a second input argument isFrom to consider both scenarios, i.e. adjustment at the beginning of the interval (isFrom = TRUE) or to the end of the interval (isFrom=FALSE). The first input argument is the given date.
In case the input date of CALC is a non-working day, we don't need to make any adjustment. We check it with the name noWkDay. If that is not the case, then we need to determine if there is no overlap (noOverlap):
IF(isFrom, date > end, date < start)
where start, end names correspond to the same date as date, but with different hours corresponding to start Hr and end Hr (E1:E2). For example for the first row, there is no overlap, because the end date doesn't have hour information, i.e. (12:00 AM), in such case the corresponding date should not be taken into account and CALC returns -1, i.e. one day needs to be subtracted.
In case we have overlap, then we need to consider the case the working hours are lower than the maximum working hours (from 9:00 to 18:00). It is identified with the name isInner. If that is the case, we calculate the actual hours. We need to subtract 1 because it is going to be one less full working day and instead to consider the corresponding hours (that should be less than 9hrs, which is the maximum workday duration). The calculation is carried under the name diff:
IF(isFrom, end-date-1 - IF(date < ls, lunchDur, 0),
date-start-1 - IF(date > le, lunchDur, 0))
If the actual start is before the start of the lunch time (ls), then we need to subtract the lunch duration (lunchDur). Similarly, if the actual end is after the end of the lunch time (le), we need to discount it too.
Finally, we use CALC to calculate the interval duration:
wkdays + CALC(ff, TRUE) + CALC(tt, FALSE)
Once we have this information, all that remains is to put it in the specified format, indicating days and hours.
Now let's review some of the sample input data and results:
The interval starts on Monday 7/25 and ends on Friday 7/29, therefore we have 5 working days, but 7/26 is a holiday, so the maximum number of working days will be 4 days.
The interval [7/25, 7/29] starts and ends at midnight (12:00 AM), therefore the last day of the interval should not be considered, so the actual working days will be 3.
Interval [7/25 10:00, 7/29 17:00]. For the start of the interval we cannot count one full day, only 8hrs, and the same applies to the end of the interval, 8hrs. So instead of 4 days we are going to have 2 days plus 16hrs, but we need to subtract the lunch duration (1hr) in both cases, so the final result will be 2 days 14hrs.

Arithmetic operations for groups within a dataframe

I have loaded multiple CSV files (time series) to create one dataframe. This dataframe contains data for multiple stocks. Now I want to calculate the 1 month return for all the datapoints.
There are 172 datapoints for each stock, i.e. from index 0 to 171. The time series for the next stock starts from index 0 again.
When I try to calculate the 1 month return, it is calculated correctly for all data points except for index 0 of each new stock, because it takes the difference with index 171 of the previous stock.
I want the return to be calculated on a per stock name basis, so I tried a for loop, but it doesn't seem to work.
E.g. in the attached image (highlighted), the 1 month return for company name SHREECEM is calculated using ITC. I expect the first value of 1Mreturn for SHREECEM to be NaN.
Using groupby instead of a for loop you can get the result you want:
Mreturn_function = lambda df: df['mean_price'].diff(periods=1)/df['mean_price'].shift(1)*100
gw_stocks.groupby('CompanyName').apply(Mreturn_function)
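As a possible shortcut, pandas also has a built-in pct_change on grouped columns, which computes the same quantity as diff divided by the shifted value. A minimal sketch, assuming each row within a company is one month after the previous one; the sample prices are made up, and only the CompanyName and mean_price names come from the answer above:

```python
import pandas as pd

# Hypothetical sample; CompanyName and mean_price mirror the names in the answer.
gw_stocks = pd.DataFrame(
    {
        "CompanyName": ["ITC", "ITC", "ITC", "SHREECEM", "SHREECEM", "SHREECEM"],
        "mean_price": [100.0, 110.0, 99.0, 200.0, 210.0, 189.0],
    }
)

# pct_change is evaluated inside each group, so it never looks across
# company boundaries and the first row of every company is NaN.
gw_stocks["1Mreturn"] = (
    gw_stocks.groupby("CompanyName")["mean_price"].pct_change() * 100
)
print(gw_stocks)
```

Because the result keeps the original row index, it can be assigned straight back as a new column.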

How to build a simple moving average measure

I want to build a measure to get the simple moving average for each day. I have a cube with a single table containing stock market information.
Schema
The expected result is that for each date, this measure shows the closing price average of the X previous days to that date.
For example, for the date 2013-02-28 and for X = 5 days, this measure would show the average closing price for the days 2013-02-28, 2013-02-27, 2013-02-26, 2013-02-25, 2013-02-22. The closing values of those days are summed, and then divided by 5.
The same would be done for each of the rows.
Example dashboard
Maybe it could be achieved just with the function tt.agg.mean(), but indicating those X previous days in the scope parameter.
The problem is that I am not sure how to obtain the X previous days for each of the dates dynamically so that I can use them in a measure.
To compute a sliding average, you can use the cumulative scope described in the atoti documentation: https://docs.atoti.io/latest/lib/atoti.scope.html#atoti.scope.cumulative.
By passing a tuple containing the date range as the window, in your case ("-5D", None), you will be able to calculate a sliding average over the past 5 days for each date in your data.
The resulting Python code would be:
import atoti as tt
# session setup
...
m, l = cube.measures, cube.levels
# measure setup
...
tt.agg.mean(m["ClosingPrice"], scope=tt.scope.cumulative(l["date"], window=("-5D", None)))
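If you want to cross-check the numbers outside atoti, the arithmetic from the question (the closing prices of 5 trading days summed and divided by 5) can be reproduced with a plain pandas rolling mean. This is not atoti code, just a sketch for verification; the prices are made up, and the window here is row-based (5 rows), which is an assumption that may not match how a "-5D" window treats weekends:

```python
import pandas as pd

# Hypothetical closing prices for the five trading days named in the question.
prices = pd.DataFrame(
    {"ClosingPrice": [10.0, 11.0, 12.0, 13.0, 14.0]},
    index=pd.to_datetime(
        ["2013-02-22", "2013-02-25", "2013-02-26", "2013-02-27", "2013-02-28"]
    ),
)

# Row-based window: the current row plus the four rows before it.
# Rows with fewer than 5 observations available come out as NaN.
prices["SMA_5"] = prices["ClosingPrice"].rolling(window=5).mean()
print(prices)
```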

Finding average age of incidents in a datetime series

I'm new to Stackoverflow and fairly fresh with Python (some 5 months give or take), so apologies if I'm not explaining this too clearly!
I want to build up a historic trend of the average age of outstanding incidents on a daily basis.
I have two dataframes.
df1 contains incident data going back 8 years, with the two most relevant columns being "opened_at" and "resolved_at", which contain datetime values.
df2 contains a column called date with the full date range from 2012-06-13 to now.
The goal is to have df2 contain the number of outstanding incidents on each date (as of 00:00:00) and the average age of all those deemed outstanding.
I know it's possible to get all rows that exist between two dates, but I believe I want the opposite and find where each date row in df2 exists between dates in opened_at and resolved_at in df1
(It would be helpful to have some example code containing an anonymized/randomized short extract of your data to try solutions on)
This is unlikely to be the most efficient solution, but I believe you could do:
df2["incs_open"] = 0 # Ensure the column exists
for row_num in range(df2.shape[0]):
    # An incident is outstanding if it was opened before the date and resolved after it
    df2.at[row_num, "incs_open"] = sum(
        (df1["opened_at"] < df2.at[row_num, "date"]) &
        (df2.at[row_num, "date"] < df1["resolved_at"])
    )
(This assumes you haven't set an index on the data frame other than the default one)
For the second part of your question, the average age, I believe you can calculate that in the body of the loop like this:
open_incs = (df1["opened_at"] < df2.at[row_num, "date"]) & \
    (df2.at[row_num, "date"] < df1["resolved_at"])
ages_of_open_incs = df2.at[row_num, "date"] - df1.loc[open_incs, "opened_at"]
avg_age = ages_of_open_incs.mean()
You'll hit some weirdnesses about rounding and things. If an incident was opened last night at 3am, what is its age "today" -- 1 day, 0 days, 9 hours (if we take noon as the point to count from), etc. -- but I assume once you've got code that works you can adjust that to taste.
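Putting the two parts together, a self-contained sketch could look like the following. The sample incidents are made up, and treating a missing resolved_at (NaT) as "still outstanding" is my assumption about how unresolved incidents are stored, so adjust that condition to match your data:

```python
import pandas as pd

# Hypothetical sample data; column names follow the question.
df1 = pd.DataFrame(
    {
        "opened_at": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-04"]),
        # Assumption: NaT marks an incident that has not been resolved yet.
        "resolved_at": pd.to_datetime(["2020-01-03", None, "2020-01-06"]),
    }
)
df2 = pd.DataFrame({"date": pd.date_range("2020-01-01", "2020-01-05")})

counts, avg_ages = [], []
for day in df2["date"]:
    # Outstanding: opened before `day`, and resolved after it or never resolved.
    is_open = (df1["opened_at"] < day) & (
        (df1["resolved_at"] > day) | df1["resolved_at"].isna()
    )
    counts.append(int(is_open.sum()))
    # The mean of an empty selection is NaT, i.e. "no outstanding incidents".
    avg_ages.append((day - df1.loc[is_open, "opened_at"]).mean())

df2["incs_open"] = counts
df2["avg_age"] = pd.to_timedelta(avg_ages)
print(df2)
```

Iterating over the dates directly avoids relying on df2 having the default integer index.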

Grab minimum time in year

As an athlete I want to keep track of my progression in Excel.
I need a formula that looks for the fastest time run in a given season (the lowest value in E for a given year). For 2017, for example, this is 13.32, for 2018 it is 12, and so on.
Can you help me further?
Instead of a formula you can use a PivotTable.
Put the Year in the Report Filter and the Time in Values. Then, in the Value Field Settings, select Min as the function to summarize the values by.
So every time you change the year in the filter, the min value will show up.
=AGGREGATE(15,6,E3:E6/(B3:B6=2017),1)
15 tells AGGREGATE to use the SMALL function, i.e. to rank the results in ascending order
6 tells AGGREGATE to ignore any errors, such as when you divide by 0
E3:E6 is your time range
B3:B6 is your Year as an integer.
B3:B6=2017 will be 1 when true and 0 when false (provided it goes through a math operation like divide), so rows from other years produce a divide-by-0 error and are ignored.
1 tells aggregate to return the 1st value in the sorted list of results
