I have currency quotes in pandas dataframe, column 0 - date/time, column 1 - close price.
And as its a 1 minute period, there are a lot of gaps.
So I need to apply .asfreq(freq='T') on close series, but also I must skip all weekends.
How do I do that? Unfourtunately, .asfreq(freq='BT') doesnt work
data = data.asfreq('15T')
data = data[data.index.dayofweek<5]
It works this way.
Related
I'm new to Stackoverflow and fairly fresh with Python (some 5 months give or take), so apologies if I'm not explaining this too clearly!
I want to build up a historic trend of the average age of outstanding incidents on a daily basis.
I have two dataframes.
df1 contains incident data going back 8 years, with the two most relevant columns being "opened_at" and "resolved_at" which contains datetime values.
df2 contains a column called date with the full date range from 2012-06-13 to now.
The goal is to have df2 contain the number of outstanding incidents on each date (as of 00:00:00) and the average age of all those deemed outstanding.
I know it's possible to get all rows that exist between two dates, but I believe I want the opposite and find where each date row in df2 exists between dates in opened_at and resolved_at in df1
(It would be helpful to have some example code containing an anonymized/randomized short extract of your data to try solutions on)
This is unlikely to be the most efficient solution, but I believe you could do:
df2["incs_open"] = 0 # Ensure the column exists
for row_num in range(df2.shape[0]):
df2.at[row_num, "incs_open"] = sum(
(df1["opened_at"] < df2.at[row_num, "date"]) &
(df2.at[row_num, "date"] < df1["opened_at"])
)
(This assumes you haven't set an index on the data frame other than the default one)
For the second part of your question, the average age, I believe you can calculate that in the body of the loop like this:
open_incs = (df1["opened_at"] < df2.at[row_num, "date"]) & \
(df2.at[row_num, "date"] < df1["opened_at"])
ages_of_open_incs = df2.at[row_num, "date"] - df1.loc[open_incs, "opened_at"]
avg_age = ages_of_open_incs.mean()
You'll hit some weirdnesses about rounding and things. If an incident was opened last night at 3am, what is its age "today" -- 1 day, 0 days, 9 hours (if we take noon as the point to count from), etc. -- but I assume once you've got code that works you can adjust that to taste.
I'm looking to perform data analysis on 100-years of climatological data for select U.S. locations (8 in particular), for each day spanning the 100-years. I have a pandas dataFrame set up with columns for Max temperature, Min temperature, Avg temperature, Snowfall, Precip Total, and then Day, Year, and Month values (then, I have an index also based on a date-time value). Right now, I want to set up a for loop to print the first Maximum temperature of 90 degrees F or greater from each year, but ONLY the first. Eventually, I want to narrow this down to each of my 8 locations, but first I just want to get the for loop to work.
Experimented with various iterations of a for loop.
for year in range(len(climate['Year'])):
if (climate['Max'][year] >=90).all():
print (climate.index[year])
break
Unsurprisingly, the output of the loop I provided prints the first 90 degree day period (from the year 1919, the beginning of my data frame) and breaks.
for year in range(len(climate['Year'])):
if (climate['Max'][year] >=90).all():
print (climate.index[year])
break
1919-06-12 00:00:00
That's fine. If I take out the break statement, all of the 90 degree days print, including multiple in the same year. I just want the first value from each year to print. Do I need to set up a second for loop to increment through the year? If I explicitly state the year, ala below, while trying to loop through a counter, the loop still begins in 1919 and eventually reaches an out of bounds index. I know this logic is incorrect.
count = 1919
while count < 2019:
for year in range(len(climate['Year'])):
if (climate[climate['Year']==count]['Max'][year] >=90).all():
print (climate.index[year])
count = count+1
Any input is sincerely appreciated.
You can achieve this without having a second for loop. Assuming the climate dataframe is ordered chronologically, this should do what you want:
current_year = None
for i in range(climate.shape[0]):
if climate['Max'][i] >= 90 and climate['Year'][i] != current_year:
print(climate.index[i])
current_year = climate['Year'][i]
Notice that we're using the current_year variable to keep track of the latest year that we've already printed the result for. Then, in the if check, we're checking if we've already printed a result for the year of the current row in the loop.
That's one way to do it, but I would suggest taking a look at pandas.DataFrame.groupby because I think it fits your use case well. You could get a dataframe that contains the first >=90 max days per year with the following (again assuming climate is ordered chronologically):
climate[climate.Max >= 90].groupby('Year').first()
This just filters the dataframe to only contain the >=90 max days, groups rows from the same year together, and retains only the first row from each group. If you had an additional column Location, you could extend this to get the same except per location per year:
climate[climate.Max >= 90].groupby(['Location', 'Year']).first()
As an athlete I want to keep track of my progression in Excel.
I need a formula that looks for the fastest time ran in a given season. (The lowest value in E for a given year. For 2017, for example, this is 13.32, for 2018 12 and so on.
Can you help me further?
Instead of formula you can use PIVOT
Keep the Year in Report Filter and Time into Value. Then on value field setting select min as summarize value by.
So every you change the year in the Filter the min value will show up.
=AGGREGATE(15,6,E3:E6/(B3:B6=2017),1)
15 tell aggregate to sort the results in ascending order
6 tells aggregate to ignore any errors such as when you divide by 0
E3:E6 is your time range
B3:B6 is you Year as an integer.
B3:B6=2017 when true will be 1 and false will be 0 (provide it goes through a math operation like divide.
1 tells aggregate to return the 1st value in the sorted list of results
I am measuring room utilization (time used/time available) from a data dump. Each row contains the available time for the day and the time used for a particular case.
The image is a simplified version of the data.
If you read the yellow and green highlights (Room 1):
In room 1, there are 200 available minutes on 1/1/2016.
Case 1 took 60 minutes, case 2 took 50 minutes.
There are 500 available minutes on 1/2/2016, and only one case occurred that day, using 350 minutes.
Room 1 utilization = (60 + 50 + 350)/(200 + 500)
The problem with summing the available time is that it double counts the 200 minutes for 1/1/2016, giving: Utilization = (60+50+350)/(200+200+500)
There are hundreds of rows in this data (and there will be multiple data dumps of differing #'s of rows) with multiple cases occurring each day.
I am trying to use a pivot table, but I cannot obtain the 'sum of averages' for a particular room (see image). I am using a macro to pull the numbers out of the grand total column.
Is this possible? Do you see another way to obtain utilization?
(note: there are lots of other columns in the data, like case start, case end, day of week, etc, that are not used in this calculation but are available)
The reason that you're getting 300 for both Average of Available Time columns is because the grand total is a grand total based on the overall average and not a sum of the averages.
Room 1: 200 + 200 + 500 / 3 = 300
Room 2: 300 + 300 + 300 / 3 = 300
I could not comment on the original question, so my solution is based on a few assumptions.
Assumption #1: The data will always be grouped. E.G. All cases in room 1 on a given day will grouped in sequential rows.
Assumption #2: The available time column is a single value for the whole day, there will never be differing available times on the same day.
Solution: Use column E as the Actual Available Time. This column will use a formula to determine if the current row has a unique combination (Date + Room + Available Time) to the previous and if so, the cell will contain that row's available time.
Formula to use in E2:
=IF(AND($A1 = $A2, $B1 = $B2, $C1 = $C2), 0, $C2)
Extend the formula as far down as necessary and then include the new column in your PivotTable data range.
End Result
I created a unique reference by combining columns and then used sumif/countif/countif.
So the formula in column E would be:
=sumif(colB,cellB,ColC)/Countif(colB,cellE)/Countif(colB,cellE)
Doesn't matter if the data is in order or not then.
Extend the formula as far down as necessary and then include the new column in your PivotTable data range.
The easiest method I would recommend is this.
=SUM(H:H)-GETPIVOTDATA("Average of Available Time",$G$3)
The first term sums the H column, and the second term subtracts the grand total value. It is a dynamic solution, and will change to fit the size of the pivot table.
My assumptions are that the Pivot Table was originally placed in cell G3.
Ive spent the last 2 days trying to get this, and I really just need a few pointers. Im using Excel 2010 w/ Power Pivot and calculating inventories. I am trying to get the amount sold between 2 dates. I recorded the quantity on hand if the item was in stock.
Item # Day Date Qty
Black Thursday 11/6/2014 2
Blue Thursday 11/6/2014 3
Green Thursday 11/6/2014 3
Black Friday 11/7/2014 2
Green Friday 11/7/2014 2
Black Monday 11/10/2014 3
Blue Monday 11/10/2014 4
Green Monday 11/10/2014 3
Is there a way to do this in dax? I may have to go back and calculate the differences for each record in excel, but Id like to avoid that if possible.
Somethings that have made this hard for me.
1) I only record the inventory Mon-Fri. I am not sure this will always be the case so i'd like to avoid a dependency on this being only weekdays.
2) When there is none in stock, I dont have a record for that day
Ive tried, CALCULATE with dateadd and it gave me results nearly right, but it ended up filtering out some of the results. Really was odd, but almost right.
Any Help is appreciated.
Bryan, this may not totally answer your question as there are a couple of things that aren't totally clear to me but it should give you a start and I'm happy to expand my answer if you provide further info.
One 'pattern' you can use involves the TOPN function which when used with the parameter n=1 can return the earliest or latest value from a table that it sorts by dates and can be filtered to be earlier/later than dates specified.
For this example I am using a 'disconnected' date table from which the user would select the two dates required in a slicer or report filter:
=
CALCULATE (
SUM ( inventory[Qty] ),
TOPN (
1,
FILTER ( inventory, inventory[Date] <= MAX ( dates[Date] ) ),
inventory[Date],
0
)
)
In this case the TOPN returns a single row table of the latest date earlier than or equal to the latest date provided. The 1st argument in the TOPN specifies the number of rows, the second the table to use, the 3rd the column to sort on and the 4th says to sort descending.
From here it is straightforward to adapt this for a second measure that finds the value for the latest date before or equal to the earliest date selected (i.e. swap MIN for MAX in MAX(dates[Date])).
Hope this helps.
Jacob
*prettified using daxformatter.com