Looking to create weighted average of partitioned columns in Excel - excel-formula

Horrible title, but I couldn't find a way to describe what I'm trying to do concisely. This question was posed to me by a friend, and I'm usually competent in Excel, but in this case I am totally stumped.
Suppose I have the following data:
| A | B | C | D | E | F | G | H |
---------------------------------------------------------------------
1 | 0.50 | 0.50 | 1 | | | 0.30 | 0.30 | |
2 | 0.25 | 0.75 | 2 | | | 0.40 | 0.70 | |
3 | 1.00 | 1.75 | 8 | | | 0.30 | 1.00 | |
4 | 0.75 | 2.50 | 2 | | | 0.50 | 1.50 | |
5 | 1.25 | 3.75 | 3 | | | 1.75 | 3.25 | |
6 | 0.50 | 4.25 | 1 | | | 0.25 | 3.50 | |
7 | 1.00 | 5.25 | 0 | | | 0.50 | 4.00 | |
8 | 0.25 | 5.50 | 2 | | | 0.30 | 4.30 | |
9 | 0.25 | 5.75 | 9 | | | 0.25 | 4.55 | |
10 | 0.75 | 6.50 | 4 | | | 0.70 | 5.25 | |
11 | | | | | | 1.00 | 6.25 | |
12 | | | | | | 0.25 | 0.25 | |
Column A represents the distance traveled while the measurement in column C was collected. Column B represents the total distance traveled so far. So C1 represents some value produced during the process from distance 0 to 0.5, C2 represents the value from distance 0.5 to 0.75, C3 represents the value from 0.75 to 1.75, and so on.
Column F represents a PLANNED second iteration of the same process, but with different measurement intervals. What I need is a way to PREDICT column H, based on a WEIGHTED AVERAGE of values from column C, based on where the intervals in column F intersect with the intervals in column A. For example, since F2 represents the measurement taken from distance 0.30 to 0.70 (an interval of 0.4, split 50/50 across the measurements in C1 and C2), H2 would be equal to: C1*0.5 + C2*0.5: 1.5.
Another example: H3 represents the expected measurement from an interval between 0.7 and 1.0 (a length of 0.3), which is split between C2 (from 0.7 to 0.75 = 0.05) and C3 (from 0.75 to 1.0 = 0.25). So H3 = (0.05/0.3)*C2 + (0.25/0.3)*C3 = 16.7%*2 + 83.3%*8 ≈ 0.333 + 6.667 = 7.0.
I'm looking for a way to do this in an Excel spreadsheet without using VBA or breaking it down into something like a Python script to process externally, but so far I'm not finding any way to do it.
Any ideas for accomplishing this entirely within Excel, without any special add-ins or scripts installed?

It's not pretty, but I think the following should work for all except H1 (which would need an added zero row):
=(MAX(0,INDEX(B:B,MATCH(G2,B:B,1))-G1)*INDEX(C:C,MATCH(G2,B:B,1)) +
(G2-INDEX(B:B,MATCH(G2,B:B,1)))*INDEX(C:C,MATCH(G2,B:B,1)+1)) /
MAX(G2-G1,G2-INDEX(B:B,MATCH(G2,B:B,1)))
It matches the values in B and C and weights them accordingly.
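The question asks for an Excel-only solution, so the following is only a cross-check, not a replacement for the formula: a minimal Python sketch of the same length-weighted average, useful for verifying the worked examples for H2 and H3. The function name `predict` and its parameters are hypothetical, and the sketch assumes the target interval lies entirely inside the measured range.

```python
def predict(bounds, values, start, end):
    """Length-weighted average of `values` over the interval [start, end].

    bounds[i] is the cumulative distance at the end of interval i (column B);
    values[i] is the measurement collected over that interval (column C).
    Assumes [start, end] lies inside the measured range.
    """
    total = 0.0
    prev = 0.0  # the first source interval starts at distance 0
    for upper, value in zip(bounds, values):
        # Length of the part of [prev, upper] that falls inside [start, end].
        overlap = min(end, upper) - max(start, prev)
        if overlap > 0:
            total += overlap * value
        prev = upper
    return total / (end - start)


col_b = [0.50, 0.75, 1.75, 2.50, 3.75, 4.25, 5.25, 5.50, 5.75, 6.50]
col_c = [1, 2, 8, 2, 3, 1, 0, 2, 9, 4]

h2 = predict(col_b, col_c, 0.30, 0.70)  # interval G1..G2
h3 = predict(col_b, col_c, 0.70, 1.00)  # interval G2..G3
print(round(h2, 3), round(h3, 3))  # 1.5 7.0
```

Unlike the formula, this loop accumulates every overlapping source interval, so it also covers a target interval that spans more than two rows of column B.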

Related

Spark Window Functions: calculated once per frame/range?

This is a question about Window Functions in Spark.
Assume I have this DF
DATE_S | ID | STR | VALUE
-------------------------
1 | 1 | A | 0.5
1 | 1 | A | 1.23
1 | 1 | A | -0.4
2 | 1 | A | 2.0
3 | 1 | A | -1.2
3 | 1 | A | 0.523
1 | 2 | A | 1.0
2 | 2 | A | 2.5
3 | 2 | A | 1.32
3 | 2 | A | -3.34
1 | 1 | B | 1.5
1 | 1 | B | 0.23
1 | 1 | B | -0.3
2 | 1 | B | -2.0
3 | 1 | B | 1.32
3 | 1 | B | 523.0
1 | 2 | B | 1.3
2 | 2 | B | -0.5
3 | 2 | B | 4.3243
3 | 2 | B | 3.332
This is just an example! Assume that there are many more DATE_S values for each (ID, STR), many more IDs and STRs, and many more entries per (DATE_S, ID, STR). As the table shows, there can be multiple values per combination of (DATE_S, ID, STR).
Now I do this:
val w = Window.partitionBy("ID", "STR").orderBy("DATE_S").rangeBetween(-N, -1)
df.withColumn("RESULT", function("VALUE").over(w))
where N might lead to the inclusion of a large range of rows, from 100 to 100000 and more, depending on ("ID", "STR").
The result will be something like this
DATE_S | ID | STR | VALUE | RESULT
----------------------------------
1 | 1 | A | 0.5 | R1
1 | 1 | A | 1.23 | R1
1 | 1 | A | -0.4 | R1
2 | 1 | A | 2.0 | R2
3 | 1 | A | -1.2 | R3
3 | 1 | A | 0.523 | R3
1 | 2 | A | 1.0 | R4
2 | 2 | A | 2.5 | R5
3 | 2 | A | 1.32 | R6
3 | 2 | A | -3.34 | R6
1 | 1 | B | 1.5 | R7
1 | 1 | B | 0.23 | R7
1 | 1 | B | -0.3 | R7
2 | 1 | B | -2.0 | R8
3 | 1 | B | 1.32 | R9
3 | 1 | B | 523.0 | R9
1 | 2 | B | 1.3 | R10
2 | 2 | B | -0.5 | R11
3 | 2 | B | 4.3243| R12
3 | 2 | B | 3.332 | R12
There are identical "RESULT"s because, for every row with identical (DATE_S, ID, STR), the values that go into the calculation of "function" are the same.
My question is this:
Does Spark call "function" for each ROW (recalculating the same value multiple times), or does it calculate the value once per range (frame?) of values and just paste it onto all rows that fall in that range?
Thanks for reading :)
From what I can see in your data, the result may not be the same if run twice, as there is no distinct ordering possible. But let's leave that aside.
While Spark does have codegen optimization, I have found nothing to indicate that it checks, in the way you describe, whether the next row's invocation would process the same set of data. I have never read of that type of optimization. There is fusing due to the lazy evaluation approach, but that is another matter. So the function is calculated again for each row.
From a great source: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-functions-windows.html
... At its core, a window function calculates a return value for every
input row of a table based on a group of rows, called the frame. Every
input row can have a unique frame associated with it. ...
... In other words, when executed, a window function computes a value
for each and every row in a window (per window specification). ...
The biggest issue is having a suitable number of partitions for parallel processing, which is expensive, but this is big data. partitionBy("ID", "STR") is the key here, and that is a good thing.
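To make the difference concrete, here is a small plain-Python model of the frame semantics of rangeBetween(-N, -1). It is not Spark code and says nothing about Spark's internals; it only contrasts evaluating a function once per row with evaluating it once per distinct DATE_S within one partition. All names (`frame_values`, `per_row`, `per_distinct_date`, `counted_sum`) are hypothetical.

```python
def frame_values(rows, date, n):
    """Values whose DATE_S lies in [date - n, date - 1], like rangeBetween(-n, -1)."""
    return [v for d, v in rows if date - n <= d <= date - 1]


def per_row(rows, n, func):
    """Evaluate func once for every row (the behaviour the answer describes)."""
    return [func(frame_values(rows, d, n)) for d, _ in rows]


def per_distinct_date(rows, n, func):
    """Evaluate func once per distinct DATE_S and reuse the result for ties."""
    cache = {}
    out = []
    for d, _ in rows:
        if d not in cache:
            cache[d] = func(frame_values(rows, d, n))
        out.append(cache[d])
    return out


calls = {"n": 0}


def counted_sum(values):
    """Stand-in for "function" that counts how often it is invoked."""
    calls["n"] += 1
    return sum(values)


# One partition (ID=1, STR="A") from the question, as (DATE_S, VALUE) pairs.
partition = [(1, 0.5), (1, 1.23), (1, -0.4), (2, 2.0), (3, -1.2), (3, 0.523)]

row_results = per_row(partition, 2, counted_sum)
row_calls = calls["n"]

calls["n"] = 0
cached_results = per_distinct_date(partition, 2, counted_sum)
cached_calls = calls["n"]

print(row_calls, cached_calls)  # 6 3
```

Both approaches return the same values; only the number of invocations differs. If the repeated work matters in practice, one option under this model is to aggregate once per (ID, STR, DATE_S) first and join the result back to the rows.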

tensorflow timeseries different lengths

I am trying to get a time series into TensorFlow to use with an LSTM. I have 4 files, but I'm not sure how to get them working together. The biggest problem is that my first dataset has 1 data point per year, while 2 others contain monthly data that should be used as correlated inputs to predict the first set. The 4th dataset just has some metadata such as species and coordinates. Should I combine them somehow, and if so, how? Any advice in the right direction would be appreciated.
I have already looked at the TensorFlow time series documentation and also tried to follow this guide: https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/
but I struggle with combining the yearly and monthly data. I manage the data in R but run TensorFlow in Python. I'm more familiar with R in general.
Thank you all for being here!
Header samples of the Data structure:
File1.csv:
years | noaa-tree-2657 |noaa-tree-2658 |noaa-tree-2659 |noaa-tree-2662
1901 | 1.676948 | 1.305594 | 0.6756204 | 0.7149572
1902 | 1.562344 | 0.899884 | 0.5102933 | 0.6351094
1903 | 1.687270 | 1.354678 | 0.9899198 | 0.6158589
File2.csv:
noaa-tree-2657 |noaa-tree-2658 |noaa-tree-2659 |noaa-tree-2662 |noaa-tree-2664
1 6.41 | 1.85 | 0.33 | 8.61 | 6.07
2 10.45 | 3.20 | 0.38 | 8.58 | 5.30
3 10.81 | 4.30 | 1.50 | 9.34 | 8.50
File3.csv:
noaa-tree-2657 |noaa-tree-2658 |noaa-tree-2659 |noaa-tree-2662 |noaa-tree-2664
1 -0.3 | 11.0 | 10.1 | -22.4 | -15.1
2 -2.9 | 10.2 | 8.8 | -14.5 | -13.3
3 1.0 | 14.3 | 14.7 | -13.8 | -12.7
File4.csv:
noaa-tree-2657 |noaa-tree-2658 |noaa-tree-2659 |noaa-tree-2662 |noaa-tree-2664
1 QUPR | PSME | PSME | PCGL | THOC
2 280.28 | 249.65 | 250.08 | 298 | 280.72
3 39.1 | 31.45 | 32.72 | 56.55 | 48.47

Is there any faster way to copy paste in Excel?

I have a list of data in which I need to keep copying and pasting the same values.
Doing this by hand is impractical, because my Excel sheet contains more than 10,000 rows.
Is there a faster way that would let me fill in the same data within a few minutes?
Below is my data:
A B C D E F G
1 17449 2JW3-1512-P2 NPJW3A3177 0.111 3.149 0.024 0.034
2 0129 3.100 0.026 0.033
3 0130 3.200 0.023 0.025
4 0131 3.159 0.024 0.015
5 17580 2JW3-1511-P2 NPJW3A3177 7129 3.160 0.025 0.015
6 7130 3.140 0.025 0.014
7 7180 3.214 0.023 0.011
Is there a faster way to fill A2:C4 with the same data as A1:C1,
and A6:C7 with the same data as A5:C5?
Try doing this. First move your data down one row to give you an empty first row. Then in H2 put this formula:
=IF(A2="",H1,A2)
Now copy that across two columns and down to the bottom of your data.
Then your spreadsheet would look like this:
+---+-------+--------------+------------+-------+-------+-------+-------+-------+--------------+------------+
| | A | B | C | D | E | F | G | H | I | J |
+---+-------+--------------+------------+-------+-------+-------+-------+-------+--------------+------------+
| 1 | | | | | | | | | | |
| 2 | 17449 | 2JW3-1512-P2 | NPJW3A3177 | 0.111 | 3.149 | 0.024 | 0.034 | 17449 | 2JW3-1512-P2 | NPJW3A3177 |
| 3 | | | | 129 | 3.1 | 0.026 | 0.033 | 17449 | 2JW3-1512-P2 | NPJW3A3177 |
| 4 | | | | 130 | 3.2 | 0.023 | 0.025 | 17449 | 2JW3-1512-P2 | NPJW3A3177 |
| 5 | | | | 131 | 3.159 | 0.024 | 0.015 | 17449 | 2JW3-1512-P2 | NPJW3A3177 |
| 6 | 17580 | 2JW3-1511-P2 | NPJW3A3177 | 7129 | 3.16 | 0.025 | 0.015 | 17580 | 2JW3-1511-P2 | NPJW3A3177 |
| 7 | | | | 7130 | 3.14 | 0.025 | 0.014 | 17580 | 2JW3-1511-P2 | NPJW3A3177 |
| 8 | | | | 7180 | 3.214 | 0.023 | 0.011 | 17580 | 2JW3-1511-P2 | NPJW3A3177 |
+---+-------+--------------+------------+-------+-------+-------+-------+-------+--------------+------------+
You can now copy columns H2:J8 over to A2:C8 (or as far down as you need to go) and in one copy/paste you're done. Make sure you paste values, not formulas.
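The helper-column formula =IF(A2="",H1,A2) is a fill-down: each blank cell takes the last non-blank value above it. As an illustration of that logic only, here is a minimal Python sketch; the function name `fill_down` and the list-of-lists layout are assumptions, not part of the answer.

```python
def fill_down(rows, key_columns=3):
    """Copy the last non-blank value down each of the first `key_columns` columns."""
    last = [""] * key_columns  # plays the role of the empty first row
    filled = []
    for row in rows:
        row = list(row)  # copy so the input is left untouched
        for i in range(key_columns):
            if row[i] == "":
                row[i] = last[i]  # blank: reuse the value from above
            else:
                last[i] = row[i]  # non-blank: remember it for the rows below
        filled.append(row)
    return filled


rows = [
    ["17449", "2JW3-1512-P2", "NPJW3A3177", "0.111", "3.149"],
    ["", "", "", "0129", "3.100"],
    ["", "", "", "0130", "3.200"],
    ["", "", "", "0131", "3.159"],
    ["17580", "2JW3-1511-P2", "NPJW3A3177", "7129", "3.160"],
    ["", "", "", "7130", "3.140"],
    ["", "", "", "7180", "3.214"],
]

for row in fill_down(rows):
    print(row[:3])
```

Values are kept as strings here so that entries such as 0129 keep their leading zero.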
Select A:C, then HOME > Editing > Find & Select > Go To Special..., choose Blanks, and click OK.
Type =, press ↑, then press Ctrl+Enter.
Then copy down the last entries in columns A:C to suit.
Just select A1:C1 together, and then drag the bottom right cross (fill handle) down to fill in A2:C4.
Similarly, select A5:C5 together and fill A6:C7 with data by dragging the cross down again.
Alternatively, if you have large range of cells to fill, you can also use the Fill Button in the menu. For example, if you wanted to fill A5:C5 downwards for say 100 rows, you can do this:
Put in the initial data for A5:C5.
In the Excel Name Box, type in A5:C105 and press enter (the Name Box is located left of the formula bar). A5:C105 should now all be selected.
On the Home Ribbon tab > Editing Group > Fill, Select the "Down" option.
Now each row in A5:C105 should be filled exactly with whatever was in A5:C5.

Calculating median with three conditions to aggregate a large amount of data

Looking for some help here at aggregating more than 60,000 data points (a fish telemetry study). I need to calculate the median of acceleration values by individual fish, date, and hour. For example, I want to calculate the median for a fish moving from 2:00-2:59PM on June 1.
+--------+----------+-------+-------+------+-------+------+-------+-----------+-------------+
| Date | Time | Month | Diel | ID | Accel | TL | Temp | TempGroup | Behav_group |
+--------+----------+-------+-------+------+-------+------+-------+-----------+-------------+
| 6/1/10 | 01:25:00 | 6 | night | 2084 | 0.94 | 67.5 | 22.81 | High | Non-angled |
| 6/1/10 | 01:36:00 | 6 | night | 2084 | 0.75 | 67.5 | 22.81 | High | Non-angled |
| 6/1/10 | 02:06:00 | 6 | night | 2084 | 0.75 | 67.5 | 22.65 | High | Non-angled |
| 6/1/10 | 02:09:00 | 6 | night | 2084 | 0.57 | 67.5 | 22.65 | High | Non-angled |
| 6/1/10 | 03:36:00 | 6 | night | 2084 | 0.75 | 67.5 | 22.59 | High | Non-angled |
| 6/1/10 | 03:43:00 | 6 | night | 2084 | 0.57 | 67.5 | 22.59 | High | Non-angled |
| 6/1/10 | 03:49:00 | 6 | night | 2084 | 0.57 | 67.5 | 22.59 | High | Non-angled |
| 6/1/10 | 03:51:00 | 6 | night | 2084 | 0.57 | 67.5 | 22.59 | High | Non-angled |
+--------+----------+-------+-------+------+-------+------+-------+-----------+-------------+
I suggest adding a column (say hr) to your data (containing something like =HOUR(B2) copied down to suit) and pivoting your data with ID, Date, hr and Time for ROWS and Sum of Accel for VALUES. Then copy the pivot table (in Tabular format, without Grand Totals) and Paste Special, Values. On the copy, apply Subtotal At each change in: hr, Use function: Average, Add subtotal to: Sum of Accel then select the Sum of Accel column and replace SUBTOTAL(1, with MEDIAN(. Change Average to Median if required.
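For checking the spreadsheet result outside Excel, the same grouping can be sketched in Python. This is a different technique from the pivot-table approach above, shown only as a cross-check; the function name `hourly_medians` and the tuple layout are assumptions, and the records are the sample rows from the question.

```python
from collections import defaultdict
from statistics import median


def hourly_medians(records):
    """Median Accel for each (ID, Date, hour) group.

    Each record is (date, time, fish_id, accel), with time as "HH:MM:SS".
    """
    groups = defaultdict(list)
    for date, time, fish_id, accel in records:
        hour = int(time.split(":")[0])  # same idea as =HOUR(B2)
        groups[(fish_id, date, hour)].append(accel)
    return {key: median(values) for key, values in groups.items()}


records = [
    ("6/1/10", "01:25:00", 2084, 0.94),
    ("6/1/10", "01:36:00", 2084, 0.75),
    ("6/1/10", "02:06:00", 2084, 0.75),
    ("6/1/10", "02:09:00", 2084, 0.57),
    ("6/1/10", "03:36:00", 2084, 0.75),
    ("6/1/10", "03:43:00", 2084, 0.57),
    ("6/1/10", "03:49:00", 2084, 0.57),
    ("6/1/10", "03:51:00", 2084, 0.57),
]

medians = hourly_medians(records)
for key in sorted(medians):
    print(key, round(medians[key], 3))
```

For the sample rows this gives one median per hour (01, 02 and 03) for fish 2084 on 6/1/10.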

Create Line-Chart with different X-Values

I have a certain number of measurements. Each in the following form:
Table A:
| Time [s] | Value |
| 0.5 | 2.0 |
| 50.3 | 33.7 |
| 100.0 | 25.5 |
Table B:
| Time [s] | Value |
| 1.3 | 12.7 |
| 27.8 | 25.0 |
| 97.5 | 20.0 |
| 100.0 | 7.1 |
Table C:
...
The time range is always the same, from 0.0 seconds to 100.0 seconds.
As the example shows, the measurement points differ from table to table.
I now want to display the different measurements in one chart. Each table has its own line-graph. The X-Axis would display the Time.
Is something like this possible in Excel?
Solved my problem by using a Scatter graph instead of a Line graph...
