Hi everyone, my question is how the highlighted t-statistics are calculated. The values 1.24, 0.27 and 0.97 are monthly excess returns, but how are their t-statistics obtained?
I'm working on a spreadsheet to calculate the size of a holding tank for condensate water. The goal is to size the tank so that on the worst day (sometime in February) we have 50 or 100 gallons in the tank.
I've got a data set of average monthly condensate water from 18 AC units.
Here's the monthly average, Jan to Dec (in gallons):
310
134
996
2298
3801
3289
3110
3350
3046
1454
1430
307
To make the simulation more accurate and eliminate the sudden changes that occur after the first of each month (where Feb is 134 gallons and Mar is 996), I'd like to calculate 365 data points that lie along the mathematical curve created by the 12 average points, so that the accumulation is more realistic. For simulation purposes I can assume that each monthly average falls at the middle of its month.
How would one go about such a calculation?
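One possible approach, sketched here under assumptions rather than offered as a definitive answer: convert each monthly figure to a gallons-per-day rate, pin that rate to the middle of its month, and interpolate linearly between neighboring mid-month points, wrapping from December back to January. The non-leap year, the linear interpolation, and every name below are assumptions of this sketch. Linear interpolation does not exactly preserve the monthly totals; a smoother curve (for example a spline) could be swapped in the same way.

```python
import calendar

# Monthly figures from the question, Jan to Dec, in gallons.
monthly = [310, 134, 996, 2298, 3801, 3289, 3110, 3350, 3046, 1454, 1430, 307]

# Assumption: a non-leap year (2023 is used only to get month lengths).
days_in_month = [calendar.monthrange(2023, m)[1] for m in range(1, 13)]

# Gallons per day for each month, anchored at the middle of that month.
rates = [total / n for total, n in zip(monthly, days_in_month)]
mids = []
start = 0
for n in days_in_month:
    mids.append(start + n / 2)  # mid-month position, in days since Jan 1
    start += n

# Extend the anchors by one point on each side so the curve wraps around the year.
xs = [mids[-1] - 365] + mids + [mids[0] + 365]
ys = [rates[-1]] + rates + [rates[0]]

def daily_rate(day):
    """Interpolated gallons for a 0-based day of the year, evaluated at midday."""
    x = day + 0.5
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("day out of range")

daily = [daily_rate(d) for d in range(365)]
worst_day = daily.index(min(daily))  # 0-based day of year with the least water
```

The `daily` list can then be accumulated day by day in the tank simulation in place of the stepwise monthly values.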
I've been playing with the predict-appointment-noshow notebook tutorial and I'm confused by the output of the PERCENT_TRUE primitive.
My understanding is that after feature generation, a column like locations.PERCENT_TRUE(appointments.sms_received) gives the percent of rows for which sms_received is True, given a single location, which was defined as its own Entity earlier. I'd expect that column to be the same for all rows of a single location, because that's what it was conditioned on, but I'm not finding that to be the case. Any ideas why?
Here's an example from that notebook data to demonstrate:
>>> fm.loc[fm.neighborhood == 'HORTO', 'locations.PERCENT_TRUE(appointments.sms_received)'].describe()
count 144.00
mean 0.20
std 0.09
min 0.00
25% 0.20
50% 0.23
75% 0.26
max 0.31
Name: locations.PERCENT_TRUE(appointments.sms_received), dtype: float64
Even though the location is restricted to just 'HORTO', the column ranges from 0.00-0.31. How is this being calculated?
This is a result of using cutoff times when calculating this feature matrix.
In this example, we are making predictions for every appointment at the time the appointment is scheduled. The feature locations.PERCENT_TRUE(appointments.sms_received) is therefore calculated at a specific time given by the cutoff times. It calculates, for each appointment, "the percentage of appointments at this location that received an SMS prior to the scheduled_time".
That construction is necessary to prevent the leakage of future information into the prediction for that row at that time. If we calculated PERCENT_TRUE using the whole dataset, we would necessarily be using information from appointments that hadn't yet happened, which isn't valid for predictive modeling.
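To make the effect of a cutoff time concrete, here is a minimal sketch in plain Python that does not use Featuretools at all. The toy data, the function name, the "strictly before" comparison, and the choice of 0.0 when no earlier rows exist are all assumptions made for illustration; the library's own handling of rows that fall exactly on the cutoff may differ.

```python
from datetime import datetime

# Hypothetical toy data for ONE location: (scheduled_time, sms_received).
appointments = [
    (datetime(2016, 5, 1), False),
    (datetime(2016, 5, 2), True),
    (datetime(2016, 5, 3), True),
    (datetime(2016, 5, 4), False),
]

def percent_true_before(rows, cutoff):
    """Share of rows with sms_received == True among rows strictly before cutoff."""
    seen = [flag for when, flag in rows if when < cutoff]
    # Assumption: report 0.0 when nothing has been observed yet.
    return sum(seen) / len(seen) if seen else 0.0

# With cutoff times: each row only sees appointments scheduled before it.
per_row = [percent_true_before(appointments, when) for when, _ in appointments]

# Without cutoff times: every row sees the whole dataset.
whole_dataset = sum(flag for _, flag in appointments) / len(appointments)
```

Here `per_row` differs from row to row even though every row belongs to the same location, while `whole_dataset` is a single value shared by all rows.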
If you instead want to make the predictions after all of the data is known, all you have to do is remove the cutoff_time argument from the ft.dfs call:
fm, features = ft.dfs(entityset=es,
                      target_entity='appointments',
                      agg_primitives=['count', 'percent_true'],
                      trans_primitives=['weekend', 'weekday', 'day', 'month', 'year'],
                      max_depth=3,
                      approximate='6h',
                      # cutoff_time=cutoff_times[20000:],
                      verbose=True)
Now you can see that the feature is the same when we condition on a specific location:
>>> fm.loc[fm.neighborhood == 'HORTO', 'locations.PERCENT_TRUE(appointments.sms_received)'].describe()
count 175.00
mean 0.32
std 0.00
min 0.32
25% 0.32
50% 0.32
75% 0.32
max 0.32
You can read more about how Featuretools handles time in the documentation.
To compute the travel time (across two points) over a period of time, I was logging the time at which I crossed the sub-points between the two main points. My ultimate aim was to prepare a line chart which would show the trend of the journeys, and then find out where I have lost time and where I have gained.
My time entries are in the below format
Day 1|Day 2
09:55|09:35
10:01|09:37
10:06|09:42
10:09|09:45
10:12|09:49
10:15|09:51
10:22|09:58
10:28|10:08
10:35|10:18
10:38|10:21
10:48|10:31
I drew the chart with the time series on the y axis (with journey points on the x axis). But Excel uses its own logic to decide which time values are shown on the y axis. How can I force Excel to show only the time values I want? For example, in the above case I actually need time stamps like:
09:00
09:15
09:30
09:45
10:00
10:15
10:30
10:45
11:00
How can this be done? Thanks in advance.
Found the answer. On the Axis there is a provision to set the Min, Max values, and Major and Minor points on the axis.
For the Min and Max values, divide the hour you wish to see by 24. For example, if the starting time should be 9 AM, then 9/24 = 0.375 should be given as the Min value. If 12 noon should be the Max value, then 12/24 = 0.500 should be given as the Max value. For the Major unit, if the intervals have to be 15 minutes, the formula to use is (15/60)/24 = 0.0104167.
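The same arithmetic can be checked outside Excel. The sketch below assumes only that Excel stores a time of day as a fraction of 24 hours; the function and variable names are made up for illustration. It uses the 09:00 to 11:00 range from the question rather than the 9 AM to noon example.

```python
def excel_time_fraction(hours, minutes=0):
    """Convert a clock time to the fraction of a 24-hour day that Excel stores."""
    return (hours * 60 + minutes) / (24 * 60)

axis_min = excel_time_fraction(9)        # value for the axis Min (09:00)
axis_max = excel_time_fraction(11)       # value for the axis Max (11:00)
major_unit = excel_time_fraction(0, 15)  # value for the Major unit (15 minutes)

# Labels the axis should show, built from whole minutes to avoid float drift.
tick_labels = [f"{m // 60:02d}:{m % 60:02d}" for m in range(9 * 60, 11 * 60 + 1, 15)]
```

Entering `axis_min`, `axis_max` and `major_unit` in the axis options should give the nine labels in `tick_labels`, from 09:00 to 11:00.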
I have an Excel sheet like the one below:
Estimated Cost 130
Discount (%) 10
------------------------
Desired Cost 117
Item     Quantity   Min. Rate   Max. Rate   Rate   Total (Quantity*Rate)
Wood     10         2.00        4.00        4.00   40.00
Sand     10         5.00        8.00        6.00   60.00
Cement   10         3.00        5.00        4.00   40.00
--------------------------------------------------------------------------
Actual Cost 140
Here the Rate column should be adjusted so that the Actual Cost becomes 117. Note that the adjusted Rate must not be less than Min. Rate and must not be greater than Max. Rate. Is there a formula or piece of code to handle this? The values should also be accurate to exactly 2 decimal places. I am new to Excel, so any help would be highly appreciated.
Should the rates of the items be discounted by the same amount? If yes, just add two columns (Adjusted Rate and Adjusted Total) to your sheet.
Adjusted Rate = Desired Cost / Actual Cost * Rate
Adjusted Total = Adjusted Rate * Quantity
If the discount rate can be different per item, then the sum gets complex and has multiple correct answers (e.g., you could discount Wood only to make up the difference, or split it between Wood and Sand, leaving Cement at its original rate).
As far as the decimal places, the best option is to let all the calculations use as many decimal places as they need, but round off the displayed total to two decimals. This gives the greatest accuracy, regardless of the desired precision.
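As a rough sketch of the equal-discount case, the snippet below scales every rate by the ratio of desired to actual cost (117/140), clamps the result to the item's minimum and maximum, and only then rounds to two decimals. It reads the question's table so that each item's current rate is the value whose product with the quantity gives the listed total (4.00, 6.00, 4.00); that reading, and all names used here, are assumptions of this sketch.

```python
# name: (quantity, current_rate, min_rate, max_rate), as read from the question.
items = {
    "Wood": (10, 4.00, 2.00, 4.00),
    "Sand": (10, 6.00, 5.00, 8.00),
    "Cement": (10, 4.00, 3.00, 5.00),
}

desired_cost = 117.0
actual_cost = sum(qty * rate for qty, rate, _, _ in items.values())  # 140.0
factor = desired_cost / actual_cost  # same proportional discount for every item

adjusted = {}
for name, (qty, rate, min_rate, max_rate) in items.items():
    scaled = rate * factor
    clamped = min(max(scaled, min_rate), max_rate)  # keep within the allowed range
    adjusted[name] = round(clamped, 2)              # two decimals, as requested

adjusted_total = sum(items[name][0] * adjusted[name] for name in adjusted)
```

With these numbers no rate hits its limit, but rounding each rate to two decimals gives 3.34, 5.01 and 3.34, for a total of 116.90 rather than exactly 117. That gap is the reason for carrying full precision through the calculation and rounding only what is displayed.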
I have a data set of over 15,000 records in Excel from a measurement tool that finds trends over large areas. I'm not interested in trends across the data as a whole, but in how much each record varies from its neighbors, to get a sense of how noisy the data is. In effect, I want something like the average standard deviation when the 15,000 or so records are examined only 20 at a time. Ideally the values trend gradually; sudden changes from record to record are what make the data look noisy. If I add a chart and use the "Moving Average" trendline, it shows visually how noisy the data is across the 15,000+ records. However, I was hoping for a numeric value to rate how noisy the data is compared with other datasets. Any ideas on what I could do here with Excel's built-in formulas or with an add-in? Let me know if I need to explain this any better.
Could you calculate your moving average for your 20 sample window, then use the difference between each point and the expected value to calculate a variance?
Hard to do tables here, but here is a sample of what I mean
Actual   Measured   Expected   Variance
5        5.44       4.49       0.91
6        4.34       5.84       2.26
7        8.45       7.07       1.90
8        6.18       7.84       2.75
9        8.89       9.10       0.04
10       11.98      10.01      3.89
The "measured" values were determined as
measured = actual + (rand() - 0.5) * 4
The "expected" values were calculated from a moving average (the table was pulled from the middle of the data set).
The variance is simply the square of expected minus measured.
Then you could calculate an average variance as a summary statistic.
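The idea above can be sketched outside Excel as follows. This is an illustration under assumptions, not a prescribed method: it uses a centered window of 21 points (10 on each side) as a stand-in for "about 20", skips points near the ends where a full window is not available, and all names are made up.

```python
def average_squared_residual(values, half_window=10):
    """Mean of (value - centered moving average)^2 over points with a full window."""
    width = 2 * half_window + 1
    residuals = []
    for i in range(half_window, len(values) - half_window):
        window = values[i - half_window : i + half_window + 1]
        expected = sum(window) / width          # moving average at point i
        residuals.append((values[i] - expected) ** 2)
    if not residuals:
        raise ValueError("series is shorter than one full window")
    return sum(residuals) / len(residuals)

# A gradual trend scores zero; the same trend with +/-1 jitter scores higher.
smooth = [float(i) for i in range(100)]
jittery = [i + (1.0 if i % 2 == 0 else -1.0) for i in range(100)]

smooth_score = average_squared_residual(smooth)
jittery_score = average_squared_residual(jittery)
```

Because the score is computed the same way for every series, it can be compared across datasets as long as the window size is kept fixed.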
Moving average is the correct approach, but you need one critical element: order. Do you have a date/time variable or a sequence number?
Use the OFFSET function to set up your window. For a window of about 20 records, your formula will look something like AVERAGE(OFFSET(C15,-10,0,21)), which averages the 21 rows centered on C15 (10 above, C15 itself, and 10 below). This is your moving average.
Relate that to C15, whether additively (a difference) or multiplicatively (a ratio), and you'll have your distance. All we need now is your tolerance.