Estimating linear fit as a moving average - apache-spark

Say I have the following data:
Year Day Amount
2015 1 2
2015 2 3
2015 3 4
2015 4 5
Using window functions or aggregations, I want to get a number for each row that represents the "linearity based on the previous n rows". In this simple example, for the row with day = 4, linearity would be pretty high, for obvious reasons, based on the previous n days, where n is 3.
Issues arise when some of the previous days do not exist; in that case, I would just want a default value, -1 for example, to indicate that.
I don't have an exact value I want to represent the linearity, but as an example, similar to correlation coefficients, 1 could represent high linearity, while 0 none.
Edit:
As a makeshift solution, I added a column to each row representing the day (taking the year into account) and used a window function with lag to find the previous 4 values (if they existed). After getting (or not getting) these values, I calculated the difference between each combination of points and used division to see how close they were to each other (1 would be the best). I apologize, I cannot share any code due to a code-sharing agreement.

Generate a series with all the dates you want to estimate.
Left outer join it with the input.
Replace null values for Amount with a value to estimate.
Convert to an RDD.
Generate keys with a lower bound (for 2015 4, generate keys [2015 4, 2015 3, 2015 2]) and flatten.
groupByKey.
Estimate for each group.
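The windowing idea can be tried out in plain Python before moving it to Spark. This is only a minimal sketch under several assumptions that are not in the question: the day is a single running integer (year already folded in), the window is the n consecutive days ending on the current day, "linearity" is the absolute Pearson correlation between day and amount, and the names `linearity` and `rolling_linearity` are made up for illustration.

```python
from math import sqrt

def linearity(xs, ys):
    """Absolute Pearson correlation of the points (xs, ys); 1.0 means perfectly linear."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if syy == 0:
        return 1.0  # assumption: constant amounts lie on a horizontal line
    return abs(sxy) / sqrt(sxx * syy)

def rolling_linearity(amount_by_day, n=3, default=-1.0):
    """Score each day using the n consecutive days ending on it; default if any is missing."""
    result = {}
    for day in sorted(amount_by_day):
        window = range(day - n + 1, day + 1)
        if all(d in amount_by_day for d in window):
            result[day] = linearity(list(window), [amount_by_day[d] for d in window])
        else:
            result[day] = default
    return result

# Sample data from the question: day -> amount
data = {1: 2, 2: 3, 3: 4, 4: 5}
scores = rolling_linearity(data, n=3)
print(scores)
```

Days 1 and 2 fall back to the default because their windows reach before the first row, while days 3 and 4 have complete windows.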

Related

In Excel: How to built a graph with time as X, names as Y with multiples series?

I am looking for a way to make a specific graph in Excel and I can't find a solution in Excel or on the web.
I have data about an online training with people completing parts of a course at a certain time:
FullName  Course  TIME
Name-A    Part 1  23/03/2022 10:38
Name-A    Part 2  23/03/2022 12:07
Name-A    Part 3  23/03/2022 16:55
Name-B    Part 1  11/03/2022 15:14
Name-B    Part 2  22/03/2022 12:08
Name-B    Part 3  28/03/2022 16:06
Name-B    Part 4  30/03/2022 14:55
Name-B    Part 5  18/04/2022 08:13
Name-C    Part 1  11/04/2022 15:25
Name-C    Part 2  20/04/2022 13:50
I would like to have a specific graph of this data:
On the vertical axis: one row for each user's name: Name-A, Name-B and Name-C.
On the horizontal axis: continuous time (say, in days) From the minimum time in the table (or less) to the maximum (or more)
Series of plots for the data: Each part of the course (from Part 1 to Part 5 here) would be a series of dots of a specific color, placed on the right row (for a learner's name) above the corresponding time on the horizontal axis.
Do you have any idea on how it could be achieved?
All the best, R.S.
Edit: The table does not appear as it did in the preview, so I have added a screenshot:
[Screenshot of the table]
So one way to visualise this as mentioned in the comments is to create a separate series for each person and show passing each part of the course as a vertical step:
It's based very loosely on this but I've set each day in the date range as the x-coordinates and used a lookup to transform the data in H2
=RIGHT(XLOOKUP($G2+TIME(23,59,59),FILTER($C$2:$C$11,$A$2:$A$11=H$1),FILTER($B$2:$B$11,$A$2:$A$11=H$1),0,-1))+(COLUMN()-COLUMN($G$1))*10
pulled down and across to give
Explanation
The data for the graph has dates spanning the times in the raw data for its x-coordinates (column G). I generated it manually but could have used Sequence in Excel 365.
There are three columns of y-values, H to J, generating a separate series for each person. The three lines are initially spaced out by 10 units based on the column number. In the formula above, the raw data is filtered by the person's name so the headers in columns H, I or J match the names in column A in the raw data. Xlookup is used with 'next smallest' match so where the date in column G is greater or equal to the date/time in column C it will return the corresponding course from column B. Because column C actually contains date/times, I have added almost 24 hours when matching the date in column G to make sure that a match is found if the day is the same, regardless of time. In a case like Name-A, where three courses are completed in the same day, this will automatically select the last one (Part 3). Then I take the right-hand character of the course name (which is a digit in the sample data) and add it to the relative column number multiplied by 10. If there is no match, Xlookup returns zero so you just get the initial value for each series (10, 20 or 30), otherwise the result will be an increase by one unit each time a course is passed. If you couldn't assume the last character of the course name was a digit, you would need a lookup to assign a number to each course name.
The data is then plotted on a scatter graph with points joined by straight lines. I had to adjust the x-axis manually to make the range correct and the labelling clearer.
This could be done without Excel 365, probably using Aggregate to get the highest row number with a condition on the name and date.
EDIT
I could have achieved the same result much more easily using Countifs to find how many courses had been passed by a certain person by a certain date:
=COUNTIFS($A$2:$A$11,H$1,$C$2:$C$11,"<="&$G2+TIME(23,59,59))+(COLUMN()-COLUMN($G$1))*10
This wouldn't have needed Excel 365. If you needed to give different courses different weightings, you could do this with a sumproduct and a lookup, also fairly straightforward.
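For anyone who wants to check the numbers outside Excel, the Countifs version can be sketched in Python. This is just an illustration: the function name `step_series` and the `spacing` parameter are my own, and the 10-unit offset per column mirrors the (COLUMN()-COLUMN($G$1))*10 term.

```python
from datetime import datetime, date, timedelta

# Raw data from the question: (FullName, Course, TIME)
records = [
    ("Name-A", "Part 1", "23/03/2022 10:38"),
    ("Name-A", "Part 2", "23/03/2022 12:07"),
    ("Name-A", "Part 3", "23/03/2022 16:55"),
    ("Name-B", "Part 1", "11/03/2022 15:14"),
    ("Name-B", "Part 2", "22/03/2022 12:08"),
    ("Name-B", "Part 3", "28/03/2022 16:06"),
    ("Name-B", "Part 4", "30/03/2022 14:55"),
    ("Name-B", "Part 5", "18/04/2022 08:13"),
    ("Name-C", "Part 1", "11/04/2022 15:25"),
    ("Name-C", "Part 2", "20/04/2022 13:50"),
]

def step_series(records, names, start, end, spacing=10):
    """One y-series per person: courses passed up to each day, plus an offset per person."""
    parsed = [(name, datetime.strptime(ts, "%d/%m/%Y %H:%M")) for name, _, ts in records]
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    series = {}
    for col, name in enumerate(names, start=1):
        # Comparing dates only has the same effect as adding TIME(23,59,59) in the formula
        series[name] = [
            sum(1 for who, when in parsed if who == name and when.date() <= day) + col * spacing
            for day in days
        ]
    return days, series

days, series = step_series(
    records, ["Name-A", "Name-B", "Name-C"], date(2022, 3, 11), date(2022, 4, 20)
)
print(series["Name-A"][days.index(date(2022, 3, 23))])
```

Each list in `series` can then be plotted against `days` as a scatter series with straight connecting lines.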

Excel Solver issues

I want to set up a system in which employees have a set number of points that they can use to weight against each holiday; basically, if they don't want to work on a certain holiday they would put a large number of points on that day. We would then set up holiday assignments such that there are two people on each holiday and each person works 2 holidays; there are 8 employees and 8 "holidays", so the matrix is 8x8.
I set up a preference array that has a preference number for each employee, call it P.
I set up an assignment array for year 1, call it Y1.
I then take SUM(P*Y1) to get the total points for Year 1.
I solve to minimize SUM(P*Y1), subject to the constraints above: 2 holidays/employee, 2 employees/holiday. Assignments are integers <=1.
The solver gives a solution that looks reasonable.
I then repeat the formula above, but I use a new assignment array for year 2, Y2.
I then set up a matrix of Y1+Y2, giving the total points over two years.
I also set up a matrix of Y1*Y2=0, i.e. no repeat assignments.
I use Solver to minimize SUM(P*Y1+P*Y2) by changing the year 2 assignments, Y2. Again, with the constraints of 2 employees per holiday, 2 holidays per employee.
I expect it to give me the second lowest point total possible. It does not; it gives me the same solution as in Y1, and Y1*Y2<>0.
Is this my math, or is it the Solver? It gives the absolute minimum without following the constraint of not repeating any values, i.e. Y1*Y2=0.
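For reference, the objective and the constraints described above can be checked outside Solver. The following is only a sketch on a hypothetical 4x4 toy case rather than the 8x8 sheet; the matrices P, Y1 and Y2 and the helper names are made up for illustration, and it verifies a candidate schedule rather than solving for one.

```python
def total_points(P, Y):
    """SUM(P*Y): element-wise product of preferences and assignments, summed."""
    return sum(p * y for p_row, y_row in zip(P, Y) for p, y in zip(p_row, y_row))

def is_valid_schedule(Y, per_employee=2, per_holiday=2):
    """Binary assignments with fixed row sums (employees) and column sums (holidays)."""
    binary = all(v in (0, 1) for row in Y for v in row)
    rows_ok = all(sum(row) == per_employee for row in Y)
    cols_ok = all(sum(col) == per_holiday for col in zip(*Y))
    return binary and rows_ok and cols_ok

def repeats(Y1, Y2):
    """SUM(Y1*Y2): number of assignments repeated from year 1; the constraint wants 0."""
    return sum(a * b for r1, r2 in zip(Y1, Y2) for a, b in zip(r1, r2))

# Hypothetical preference points: rows are employees, columns are holidays
P = [
    [1, 2, 3, 4],
    [4, 1, 2, 3],
    [3, 4, 1, 2],
    [2, 3, 4, 1],
]
# Year 1: employee i works holidays i and i+1; year 2: holidays i+2 and i+3 (mod 4)
Y1 = [[1 if j in (i, (i + 1) % 4) else 0 for j in range(4)] for i in range(4)]
Y2 = [[1 if j in ((i + 2) % 4, (i + 3) % 4) else 0 for j in range(4)] for i in range(4)]

print(is_valid_schedule(Y1), is_valid_schedule(Y2))
print(total_points(P, Y1) + total_points(P, Y2), repeats(Y1, Y2))
```

If `repeats` is nonzero for the Y2 that Solver returns, the no-repeat condition was not enforced for that run.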

How to create exponential growth in excel over a year

So I am trying to build an Excel model where every month the numbers increase exponentially to a point at the end of the year which is driven by annual expectations. Currently I have the annual number divided by 12, and each year there are huge jumps over the previous one, making the chart/growth very jumpy. For illustration purposes, let's say that for 2020 the desired number for the year is 12. In the current state, I would get 1 per month (12/12); however, what I want is for it to grow gradually/exponentially, for example 0.2, 0.5, 0.9 etc., with December being the largest and the sum for the entire year equaling 12. Then the next year (2021), starting in January, it would take into account the December 2020 number and grow from there again to the desired number (let's say a total of 24 for 2021), and so on. I'd love for it to have a more exponential / hockey-stick-like growth.
What would be a good way to do this?
The function RRI can be used to find an interest rate which will give you a given target value. This can be used to find terms in a geometric series which have a given sum (which is what you seem to be asking for).
For example, say you want 12 exponentially increasing numbers which, when added to 100, gets you to 2000. Starting with 100, repeatedly multiply by (1 + RRI(12,100,2000)). To get the numbers that you want (which will be 12 numbers which sum to 1900) just calculate the difference each month:
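The same calculation can be sketched in Python to see the numbers. The helper `rri` below is my own stand-in for the worksheet function, using the equivalent-rate formula (fv/pv)^(1/nper) - 1; the variable names are just for illustration.

```python
def rri(nper, pv, fv):
    """Equivalent rate per period, mirroring Excel's RRI: (fv / pv) ** (1 / nper) - 1."""
    return (fv / pv) ** (1 / nper) - 1

start, target, months = 100.0, 2000.0, 12
rate = rri(months, start, target)

# Running balance after 0..12 months, multiplying by (1 + rate) each month
balances = [start * (1 + rate) ** k for k in range(months + 1)]

# The 12 monthly numbers are the month-to-month differences of the balance
monthly = [later - earlier for earlier, later in zip(balances, balances[1:])]

print(round(rate, 4))
print([round(m, 2) for m in monthly])
print(round(sum(monthly), 6))
```

The differences themselves grow by the same factor each month, so they form a geometric series that sums to 2000 - 100 = 1900.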
I think the simplest way to solve this is by using Goal Seek. First you need to build a sheet like this:
You choose the starting value in January (B1) and every month is a constant growth rate (D1) bigger than the previous month. You also calculate the total sum at the bottom in B13.
Now you use goal seek to find the growth rate which makes the sum equal to 12:
The answer I get for a starting value of 0.1 is a growth factor of 1.376 (each month is 1.376 times the previous one):
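If you want to reproduce what Goal Seek does outside Excel, a simple bisection search is enough. This is a sketch under my own assumptions: the growth value is a month-over-month multiplier, and `year_total` and `solve_growth` are hypothetical helper names, not anything Excel provides.

```python
def year_total(first_month, growth, months=12):
    """Sum of a series that starts at first_month and multiplies by growth each month."""
    return sum(first_month * growth ** k for k in range(months))

def solve_growth(first_month, target, months=12, lo=1.0, hi=10.0, tol=1e-10):
    """Bisection on the growth multiplier; the yearly total increases with the multiplier."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if year_total(first_month, mid, months) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

growth = solve_growth(0.1, 12)
print(round(growth, 3))
print(round(year_total(0.1, growth), 6))
```

For the following year, the same search can be rerun with a different starting value and target.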

Trying to either pull or recreate trendline data using LINEST

I am trying to recreate the formula from a trendline on a graph. Basically, my company is trying to predict the corn yields for next year. All of the actual programmers are out for the week, so they passed it on to me (web developer :D). I've attempted the LINEST formula multiple times with no luck.
Basically, in column B I have the years (1-15, trying to project 16) and in column C I have the actual trend data. I am probably doing this wrong, however.
Example: =LINEST(C16:C30,B16:B30,FALSE,FALSE)
Any help would be appreciated. Just tell me if you need the actual file or more information. Thanks in advance!
The fourth argument, concerning the return of additional regression statistics, is optional and is taken as FALSE if omitted, so seems not required for your purposes. The third argument, concerning the intercept with the Y-axis (the value of y when x is 0), is also optional but taken as TRUE if omitted. In your case TRUE seems appropriate so the third parameter seems not required for your purposes.
With your data spanning 15 years, if ending with the current year, it conveniently covers 2001-2015 inclusive and has no information about the value of y (production) in year 2000 (i.e. when x is 0), but this is unlikely to have been 0, as would be assumed if the third argument is FALSE.
In a simplified example, take production of 50 in 2001, increasing by an (unrealistically!) constant 5 each year. By 2015 this has reached 120, so for 2016 at the same rate of increase production of 125 should be expected. Your formula returns 9.35 so would predict production of 129.35, though we know to expect 125, as given by:
=LINEST(C16:C30,B16:B30)
when added to the latest available (120).
The former is too high a predicted increase because it assumes growth was from 0 to 120 in sixteen years, rather than what I have taken to be from 50 to 120 in fifteen.
As has been mentioned by @Byron Wall, Excel has the TREND function that may be used for linear extrapolation to obtain the next (16th) value like so:
=TREND(C16:C30,B16:B30,16)
This directly returns 125 for the simplified sample data.
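The figures for the simplified example can be checked with ordinary least-squares formulas in Python. This is only a sketch: the function names are mine, and it reimplements the arithmetic of a fit forced through the origin, a fit with an intercept and a linear extrapolation rather than calling Excel.

```python
def slope_through_origin(xs, ys):
    """Least-squares slope with the intercept forced to 0 (third LINEST argument FALSE)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def fit_with_intercept(xs, ys):
    """Ordinary least-squares slope and intercept (third LINEST argument TRUE or omitted)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Simplified example: 50 in year 1, rising by a constant 5 to 120 in year 15
years = list(range(1, 16))
production = [50 + 5 * (x - 1) for x in years]

forced_slope = slope_through_origin(years, production)
slope, intercept = fit_with_intercept(years, production)
forecast = intercept + slope * 16  # linear extrapolation to year 16

print(round(forced_slope, 2), slope, forecast)
```

The forced-origin slope comes out near 9.35, while the fit with an intercept gives a slope of 5 and a year-16 value of 125.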
HOWEVER, all the above assumes growth is linear. Taking say Brazilian corn production (Million tons) over the period (offset one year) this has been roughly (based on USDA.gov):
The red line is the Linear trend and green a fourth order Polynomial. They happen both to end up at the same place for one year ahead (the hollow bar) but predict different results from the latest six years:
It may be worth charting the data you have, and adding different trend lines, before deciding whether linear extrapolation seems the most promising for forecasting purposes. ‘Wavy’ (cyclical) progress is evident in many datasets.

How to obtain Incremental standard deviations from a set of standard deviations?

I have a data set containing three columns, first column represents number of trials, second column represents experimental values, and the third column represents corresponding standard deviation.
With each experiment there is an increment in my experimental values. To get the incremental values, I hold my first value as the reference value and subtract this reference value from each subsequent value and use them to create fourth column of these incremental values.
My problem begins right from here. How do I create a new set of incremental standard deviations for the incremental experimental values I got? My apology if the problem is not well defined but hopefully someone will eventually be able to help me out. Many thanks!
Below is my data set,
Trial  Mean    SD     Incr Mean  Incr SD
1      45.311  4.668  0
2      56.682  2.234  11.371
3      62.197  2.266  16.886
4      70.550  4.751  25.239
5      80.528  4.412  35.217
6      87.453  4.542  42.142
7      89.979  2.185  44.668
8      96.859  3.476  51.548
To be clear, for other readers, your incremental mean is actually the difference between trial 1 and the other trials.
Variances add directly when you subtract (or add) independent normal distributions. So you first convert each standard deviation to a variance by squaring it, then add the variances, and then take the square root to turn the result back into a standard deviation. Note that when using this kind of Pythagorean combination, you are assuming that trial 1 is independent of the other trials; for example, the same sample cannot appear in both trials.
It makes sense that your so-called "incremental SD" will always be greater than either individual SD, since the uncertainty of both distributions contributes to the uncertainty of the difference.
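Applied to the data in the question, the calculation could look like the Python sketch below. It assumes, as above, that trial 1 is independent of every other trial; the 0 entered for trial 1 itself is my own choice, since a value subtracted from itself has no spread.

```python
from math import sqrt

# (Trial, Mean, SD) from the question
trials = [
    (1, 45.311, 4.668),
    (2, 56.682, 2.234),
    (3, 62.197, 2.266),
    (4, 70.550, 4.751),
    (5, 80.528, 4.412),
    (6, 87.453, 4.542),
    (7, 89.979, 2.185),
    (8, 96.859, 3.476),
]

ref_mean, ref_sd = trials[0][1], trials[0][2]

incremental = []
for trial, mean, sd in trials:
    incr_mean = mean - ref_mean
    # Square both SDs, add the variances, take the square root (independence assumed)
    incr_sd = 0.0 if trial == 1 else sqrt(ref_sd ** 2 + sd ** 2)
    incremental.append((trial, round(incr_mean, 3), round(incr_sd, 3)))

for row in incremental:
    print(row)
```

Every incremental SD for trials 2 to 8 comes out larger than both 4.668 and the SD of the trial it is paired with.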
