I have a questionnaire with answers in a number of different formats. I want every answer scaled to a range between -1 and 1. However, not all input ranges include negative numbers.
I need an Excel formula that converts each value as shown below, depending on the input range.
+---+--------+
| A |To this |
+---+--------+
|-3 | -1 |
|-2 | -0.66 |
|-1 | -0.33 |
| 0 | 0 |
| 1 | 1 |
+---+--------+
Or
+---+--------+
| A |To this |
+---+--------+
| 0 | 0 |
| 1 | 0.25 |
| 2 | 0.5 |
| 3 | 0.75 |
| 4 | 1 |
+---+--------+
Or
+---+--------+
| A |To this |
+---+--------+
| 1 | 0.2 |
| 2 | 0.4 |
| 3 | 0.6 |
| 4 | 0.8 |
| 5 | 1 |
+---+--------+
Or
+---+--------+
| A |To this |
+---+--------+
|-2 | -1 |
|-1 | -0.5 |
| 0 | 0 |
| 1 | 0.5 |
| 2 | 1 |
+---+--------+
etc.
This formula should do the trick:
=IFERROR(IF(A1<=0,-1*A1/(MIN(A:A)+MIN(0,MAX(A:A))),A1/(MAX(A:A))),0)
This produces this example output when autofilled down:
-3 -1
-2 -0.666666667
-1 -0.333333333
0 0
1 0.2
2 0.4
3 0.6
4 0.8
5 1
Note: 0 maps to 0 on both sides, i.e. for the negative part (-1 to 0) and the positive part (0 to 1).
If the range of input numbers is finite, even if it includes negative numbers, you can use the general range-mapping formula below.
If the range of input numbers is [X1:X2] and the range of output numbers is [Y1:Y2] (in your case [-1:+1]), then a number x is mapped to a number y in the output range by:
y = (x - X1) * (Y2 - Y1) / (X2 - X1) + Y1
provided that X2 - X1 != 0.
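As a quick check of the mapping formula, here is a minimal Python sketch. The function name `map_range` and its default output range of [-1, 1] are assumptions made for illustration. With an input range of [-2, 2] it reproduces the last table in the question, and with an output range of [0, 1] it reproduces the 0 to 4 table. The first table (-3 to 1) scales the negative and positive sides separately, so a single linear map like this one would not reproduce it.

```python
def map_range(x, x1, x2, y1=-1.0, y2=1.0):
    """Linearly map x from the input range [x1, x2] onto [y1, y2]."""
    if x2 == x1:
        # The formula divides by (X2 - X1), so a zero-width range is invalid.
        raise ValueError("input range must have non-zero width")
    return (x - x1) * (y2 - y1) / (x2 - x1) + y1


# Input range [-2, 2] mapped onto [-1, 1]
symmetric = [map_range(x, -2, 2) for x in (-2, -1, 0, 1, 2)]

# Input range [0, 4] mapped onto [0, 1]
non_negative = [map_range(x, 0, 4, 0.0, 1.0) for x in (0, 1, 2, 3, 4)]
```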
Related
This started its life as a list of activities. I first built a matrix similar to the one below to represent all activities, which I inverted to show all inactivity, before building the following matrix, where zero indicates an activity, and anything greater than zero indicates the number of days before the next activity.
+------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| Item | 01/08/2020 | 02/08/2020 | 03/08/2020 | 04/08/2020 | 05/08/2020 | 06/08/2020 | 07/08/2020 | 08/08/2020 | 09/08/2020 |
+------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| B | 3 | 2 | 1 | 0 | 0 | 3 | 2 | 1 | 0 |
| C | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| D | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | 0 |
| E | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 |
+------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
Now I need to find suitable intervals for each Item. For instance, in this case I want to find all intervals with a minimum duration of 3 days.
+------+------------+------------+------------+------------+
| Item | 1_START | 1_END | 2_START | 2_END |
+------+------------+------------+------------+------------+
| A | NaN | NaN | NaN | NaN |
| B | 01/08/2020 | 03/08/2020 | 06/08/2020 | 08/08/2020 |
| C | NaN | NaN | NaN | NaN |
| D | 01/08/2020 | 07/08/2020 | NaN | NaN |
| E | 01/08/2020 | NaN | NaN | NaN |
+------+------------+------------+------------+------------+
In reality the data is 700+ columns wide and 1,000+ rows. How can I do this efficiently?
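One possible approach, sketched in plain Python under a few assumptions: each row is scanned once from left to right, a run of non-zero values is treated as one interval, and the countdown value in the first cell of the run is taken as the interval's full length (so a run that is still open at the last column gets no end date). The name `find_intervals` and the use of `None` in place of NaN are illustrative choices, not part of the question.

```python
def find_intervals(dates, row, min_days=3):
    """Return (start, end) date pairs for runs of non-zero values.

    Only runs whose first countdown value is at least min_days are kept.
    end is None when the run continues past the last column.
    """
    intervals = []
    i, n = 0, len(row)
    while i < n:
        if row[i] == 0:
            i += 1
            continue
        start = i
        # Advance to the first zero after the run (or the end of the row).
        while i < n and row[i] > 0:
            i += 1
        # Assumption: the countdown at the start equals the length of the gap.
        if row[start] >= min_days:
            end = dates[i - 1] if i < n else None
            intervals.append((dates[start], end))
    return intervals


dates = [f"{day:02d}/08/2020" for day in range(1, 10)]
matrix = {
    "A": [0, 0, 0, 0, 0, 0, 0, 0, 0],
    "B": [3, 2, 1, 0, 0, 3, 2, 1, 0],
    "C": [0, 2, 1, 0, 1, 0, 0, 0, 0],
    "D": [7, 6, 5, 4, 3, 2, 1, 0, 0],
    "E": [11, 10, 9, 8, 7, 6, 5, 4, 3],
}
result = {item: find_intervals(dates, row) for item, row in matrix.items()}
```

Each cell is visited once, so the cost grows linearly with the number of cells. A run that started before the first column would have its length understated by this sketch.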
This is a question about Window Functions in Spark.
Assume I have this DF
DATE_S | ID | STR | VALUE
-------------------------
1 | 1 | A | 0.5
1 | 1 | A | 1.23
1 | 1 | A | -0.4
2 | 1 | A | 2.0
3 | 1 | A | -1.2
3 | 1 | A | 0.523
1 | 2 | A | 1.0
2 | 2 | A | 2.5
3 | 2 | A | 1.32
3 | 2 | A | -3.34
1 | 1 | B | 1.5
1 | 1 | B | 0.23
1 | 1 | B | -0.3
2 | 1 | B | -2.0
3 | 1 | B | 1.32
3 | 1 | B | 523.0
1 | 2 | B | 1.3
2 | 2 | B | -0.5
3 | 2 | B | 4.3243
3 | 2 | B | 3.332
This is just an example! Assume that there are many more DATE_S values for each (ID, STR), many more IDs and STRs, and many more entries per (DATE_S, ID, STR). Obviously there are multiple values per combination (DATE_S, ID, STR).
Now I do this:
val w = Window.partitionBy("ID", "STR").orderBy("DATE_S").rangeBetween(-N, -1)
df.withColumn("RESULT", function("VALUE").over(w))
where N might lead to the inclusion of a large range of rows, from 100 to 100000 and more, depending on ("ID", "STR")
The result will be something like this
DATE_S | ID | STR | VALUE | RESULT
----------------------------------
1 | 1 | A | 0.5 | R1
1 | 1 | A | 1.23 | R1
1 | 1 | A | -0.4 | R1
2 | 1 | A | 2.0 | R2
3 | 1 | A | -1.2 | R3
3 | 1 | A | 0.523 | R3
1 | 2 | A | 1.0 | R4
2 | 2 | A | 2.5 | R5
3 | 2 | A | 1.32 | R6
3 | 2 | A | -3.34 | R6
1 | 1 | B | 1.5 | R7
1 | 1 | B | 0.23 | R7
1 | 1 | B | -0.3 | R7
2 | 1 | B | -2.0 | R8
3 | 1 | B | 1.32 | R9
3 | 1 | B | 523.0 | R9
1 | 2 | B | 1.3 | R10
2 | 2 | B | -0.5 | R11
3 | 2 | B | 4.3243 | R12
3 | 2 | B | 3.332 | R12
There are identical "RESULT"s because for every row with identical (DATE_S, ID, STR), the values that go into the calculation of "function" are the same.
My question is this:
Does Spark call "function" for each ROW (recalculating the same value multiple times), or does it calculate the value once per range (frame?) of values and just paste it onto all rows that fall in that range?
Thanks for reading :)
From what I can see, the result may not be the same if this is run twice on your data, because there is no distinct ordering. But let's leave that aside.
Whilst there is codegen optimization, I have found nothing to indicate that Spark checks, in the way you describe, whether the next row's invocation would process the same set of data. I have never read of that type of optimization. There is fusing due to the lazy evaluation approach, but that is another matter. So it calculates again for each row.
From a great source: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-functions-windows.html
... At its core, a window function calculates a return value for every
input row of a table based on a group of rows, called the frame. Every
input row can have a unique frame associated with it. ...
... In other words, when executed, a window function computes a value
for each and every row in a window (per window specification). ...
The biggest issue is having a suitable number of partitions for parallel processing, which is expensive, but this is big data. partitionBy("ID", "STR") is the key here, and that is a good thing.
I have this PySpark DataFrame calculated in my algorithm:
+------+--------------------+
| A | b |
+------+--------------------+
| 1|1.000540895285929161|
| 2|1.097289726627339219|
| 3|0.963925596369865420|
| 4|0.400642772674179290|
| 5|1.136213095583983134|
| 6|1.563124989279187345|
| 7|0.924395764582530139|
| 8|0.833237679638091343|
| 9|1.381905515925928345|
| 10|1.315542676739417356|
| 11|0.496544353345593242|
| 12|1.075150956754565637|
| 13|0.912020266273109506|
| 14|0.445620998720738948|
| 15|1.440258342829831504|
| 16|0.929157554709733613|
| 17|1.168496273549324876|
| 18|0.836936489952743701|
| 19|0.629466356196215569|
| 20|1.145973619225162914|
| 21|0.987205342817734242|
| 22|1.442075381077187609|
| 23|0.958558287841447591|
| 24|0.924638906376455542|
+------+--------------------+
I need to calculate a new column named F as a kind of recursive calculation:
F(I) = F(I-1) * 0.25 + b(I+1) * 0.50 + b(I) * 0.25
where I is the row index. Only for I = 1, the value of F(1) is:
F(I) = b(I) * 0.25 + b(I+1) * 0.50 + b(I) * 0.25
How should I calculate that? Should I use the lag and lead functions?
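To make the recurrence concrete, here is a minimal plain-Python sketch that evaluates it sequentially over a list of b values already sorted by A (for example after collecting the column). The function name `recursive_f` is hypothetical, and returning `None` for the last row is an assumption, since the question does not say what to do when b(I+1) does not exist. lead can supply b(I+1), but F(I-1) is a value of the column being built, which is why this sketch uses a loop.

```python
def recursive_f(b, w_prev=0.25, w_next=0.50, w_cur=0.25):
    """Evaluate F(I) = F(I-1)*w_prev + b(I+1)*w_next + b(I)*w_cur in order.

    For the first row, b(1) stands in for the missing F(0), as in the question.
    The last row has no b(I+1), so its F is left as None (an assumption).
    """
    f = []
    for i in range(len(b)):
        if i + 1 >= len(b):
            f.append(None)  # no successor value available
            continue
        prev = b[i] if i == 0 else f[i - 1]
        f.append(prev * w_prev + b[i + 1] * w_next + b[i] * w_cur)
    return f


# Small hand-checkable input rather than the DataFrame values
example = recursive_f([1.0, 2.0, 4.0])
```

For the input `[1.0, 2.0, 4.0]`, F(1) = 1*0.25 + 2*0.5 + 1*0.25 = 1.5 and F(2) = 1.5*0.25 + 4*0.5 + 2*0.25 = 2.875.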
I'm trying to get a cell with value BBBBBBBGGGGGJJJJCCCCDDDDAA from these cells:
-----------------------------------------
| 2 | 7 | 4 | 4 | 0 | 0 | 5 | 0 | 0 | 4 |
-----------------------------------------
So it takes the highest value and writes that cell's column letter (which might have an offset) that many times. Then it takes the next highest value and does the same, until it reaches the zeroes. Is that possible in Excel?
additional samples:
------------------------------------------------------------------------------------
| 2 | 0 | 0 | 3 | 0 | 0 | 5 | 0 | 0 | 0 | GGGGGDDDAA |
------------------------------------------------------------------------------------
| 0 | 0 | 2 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | GGGGGCC |
------------------------------------------------------------------------------------
| 0 | 7 | 2 | 2 | 4 | 3 | 3 | 0 | 0 | 0 | BBBBBBBEEEEFFFGGGCCDD |
------------------------------------------------------------------------------------
| 4 | 7 | 0 | 7 | 7 | 0 | 0 | 0 | 8 | 7 | IIIIIIIIBBBBBBBDDDDDDDEEEEEEEJJJJJJJAAAA |
------------------------------------------------------------------------------------
| 0 | 2 | 0 | 2 | 8 | 0 | 8 | 0 | 7 | 10| JJJJJJJJJJEEEEEEEEGGGGGGGGIIIIIIIBBDD |
------------------------------------------------------------------------------------
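The intended logic can be sketched in Python to pin down the rule before translating it into a worksheet formula. This is only an illustration: the function name `repeat_letters` and the `first_col` offset parameter are assumptions. Ties are broken left to right, which matches the additional samples; the first example in the question lists J before C and D, so the intended tie rule may differ there.

```python
from string import ascii_uppercase


def repeat_letters(values, first_col=0):
    """Build the string of column letters, highest value first.

    first_col is the zero-based column offset of the first cell (0 means A).
    Only the 26 single-letter columns are supported in this sketch.
    """
    # sorted() is stable, so equal values keep their left-to-right order.
    order = sorted(range(len(values)), key=lambda i: -values[i])
    return "".join(
        ascii_uppercase[first_col + i] * values[i]
        for i in order
        if values[i] > 0
    )


sample = repeat_letters([2, 0, 0, 3, 0, 0, 5, 0, 0, 0])
```

Here `sample` is built from the first of the additional rows above.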
In my Excel worksheet I have a matrix like this:
+---+------------+--------+--------+--------+--------+--------+-------+
| * | A | B | C | D | E | F | Col n |
+---+------------+--------+--------+--------+--------+--------+-------+
| 1 | 01/01/2000 | -1.000 | -1.000 | -1.000 | -1.000 | -1.000 | ... |
| 2 | 01/02/2000 | | 1.200 | 500 | 500 | 500 | ... |
| 3 | 01/03/2001 | | | 1.100 | 800 | 800 | ... |
| 4 | 01/04/2000 | | | | 1.000 | 700 | ... |
| 5 | 01/05/2000 | | | | | 900 | ... |
| 6 | 01/06/2000 | | | | | | ... |
| 7 | 01/07/2000 | | | | | | ... |
+---+------------+--------+--------+--------+--------+--------+-------+
I need a formula for each column (from column B onward) with a dynamic range like this:
For Column B:
=XIRR(B1:B1,A1:A1)
For Column C:
=XIRR(C1:C2,A1:A2)
For Column D:
=XIRR(D1:D3,A1:A3)
For Column E:
=XIRR(E1:E4,A1:A4)
and so on.
Is it possible?
Thanks
I think what you are after is:
=XIRR(OFFSET(B$1,0,0,COLUMN()-1),OFFSET($A$1,0,0,COLUMN()-1))
Using OFFSET we can specify the number of rows in the range. COLUMN()-1 gives 1 for B, 2 for C, etc. The values range starts from a cell whose column is not fixed (B$1, so it moves along the columns), while the dates range starts from a fixed cell ($A$1, so it stays in column A).
This formula can simply be copied across the columns as far as necessary.