Spark Window Functions: calculated once per frame/range? - apache-spark

This is a question about Window Functions in Spark.
Assume I have this DF
DATE_S | ID | STR | VALUE
-------------------------
1 | 1 | A | 0.5
1 | 1 | A | 1.23
1 | 1 | A | -0.4
2 | 1 | A | 2.0
3 | 1 | A | -1.2
3 | 1 | A | 0.523
1 | 2 | A | 1.0
2 | 2 | A | 2.5
3 | 2 | A | 1.32
3 | 2 | A | -3.34
1 | 1 | B | 1.5
1 | 1 | B | 0.23
1 | 1 | B | -0.3
2 | 1 | B | -2.0
3 | 1 | B | 1.32
3 | 1 | B | 523.0
1 | 2 | B | 1.3
2 | 2 | B | -0.5
3 | 2 | B | 4.3243
3 | 2 | B | 3.332
This is just an example! Assume that there are many more DATE_S values for each (ID, STR), many more IDs and STRs, and many more entries per (DATE_S, ID, STR). In particular, each combination (DATE_S, ID, STR) can have multiple values.
Now I do this:
import org.apache.spark.sql.expressions.Window
// "function" is a placeholder for any aggregate function; N is the look-back range on DATE_S
val w = Window.partitionBy("ID", "STR").orderBy("DATE_S").rangeBetween(-N, -1)
df.withColumn("RESULT", function("VALUE").over(w))
where N may pull a large number of rows into the frame, from 100 to 100,000 or more, depending on (ID, STR).
The result will be something like this:
DATE_S | ID | STR | VALUE | RESULT
----------------------------------
1 | 1 | A | 0.5 | R1
1 | 1 | A | 1.23 | R1
1 | 1 | A | -0.4 | R1
2 | 1 | A | 2.0 | R2
3 | 1 | A | -1.2 | R3
3 | 1 | A | 0.523 | R3
1 | 2 | A | 1.0 | R4
2 | 2 | A | 2.5 | R5
3 | 2 | A | 1.32 | R6
3 | 2 | A | -3.34 | R6
1 | 1 | B | 1.5 | R8
1 | 1 | B | 0.23 | R8
1 | 1 | B | -0.3 | R8
2 | 1 | B | -2.0 | R10
3 | 1 | B | 1.32 | R11
3 | 1 | B | 523.0 | R11
1 | 2 | B | 1.3 | R12
2 | 2 | B | -0.5 | R13
3 | 2 | B | 4.3243 | R14
3 | 2 | B | 3.332 | R14
There are identical RESULT values because, for all rows with identical (DATE_S, ID, STR), the values that go into the calculation of "function" are the same.
My question is this:
Does Spark call "function" for each row (recalculating the same value multiple times), or does it calculate the value once per range (frame?) of values and paste it onto all rows that fall in that range?
Thanks for reading :)
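To make the frame semantics in the question concrete, here is a small pure-Python sketch (not Spark code) of which rows rangeBetween(-N, -1) selects for each input row. The sample rows are the (ID = 1, STR = A) partition from the table above; N = 2 and the helper name range_frame are assumptions chosen for illustration. The sketch only shows that rows sharing (DATE_S, ID, STR) see the same frame; it says nothing about how Spark evaluates the function internally.

```python
from collections import defaultdict

# Rows (DATE_S, ID, STR, VALUE) of the (ID=1, STR="A") partition from the question.
rows = [
    (1, 1, "A", 0.5), (1, 1, "A", 1.23), (1, 1, "A", -0.4),
    (2, 1, "A", 2.0),
    (3, 1, "A", -1.2), (3, 1, "A", 0.523),
]
N = 2  # hypothetical look-back range, not a value from the question


def range_frame(all_rows, row, n):
    """Values in the same (ID, STR) partition whose DATE_S lies in [DATE_S - n, DATE_S - 1]."""
    date_s, id_, str_, _ = row
    return tuple(
        v for (d, i, s, v) in all_rows
        if (i, s) == (id_, str_) and date_s - n <= d <= date_s - 1
    )


# Collect the distinct frames seen by rows that share (DATE_S, ID, STR).
frames = defaultdict(set)
for row in rows:
    frames[row[:3]].add(range_frame(rows, row, N))

for key, distinct in sorted(frames.items()):
    print(key, "->", len(distinct), "distinct frame(s)")
```

Every key ends up with exactly one distinct frame, which is why the RESULT column repeats for rows with the same (DATE_S, ID, STR).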

From what I can see, your data has no unique ordering within a partition, so the result may not be the same if you run this twice. But let's leave that aside.
While there is codegen optimization, I have found nothing indicating that Spark checks, in the way you describe, whether the next row's frame contains the same set of data as the previous row's. I have never read of that type of optimization. Operations are fused because of the lazy evaluation approach, but that is another matter. So, as far as I can tell, the function is calculated again for each row.
From a great source: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-functions-windows.html
... At its core, a window function calculates a return value for every
input row of a table based on a group of rows, called the frame. Every
input row can have a unique frame associated with it. ...
... In other words, when executed, a window function computes a value
for each and every row in a window (per window specification). ...
The bigger concern is having a suitable number of partitions for parallel processing. That is expensive, but this is big data. partitionBy("ID", "STR") is the key here, and it is a good thing.

Related

Splitting ID column in pandas dataframe to multiple columns

I have a pandas dataframe like below :
| ID | Value |
+----------+--------+
|1C16 | 34 |
|1C1 | 45 |
|7P.75 | 23 |
|7T1 | 34 |
|1C10DG | 34 |
+----------+--------+
I want to split the ID column (it's a string column) so that the result looks like this:
| ID | Value | Code | Core | size |
+----------+--------+-------+------+-----+
|1C16 | 34 | C | 1 | 16 |
|1C1 | 45 | C | 1 | 1 |
|7P.75 | 23 | P | 7 | .75 |
|7T1 | 34 | T | 7 | 1 |
|1C10DG | 34 | C | 1 | 10 |
+----------+--------+-------+------+-----+
How can this be achieved? Thanks.
You can try .str.extract with the regex (?P<Core>\d+)(?P<Code>[A-Z])(?P<size>[.0-9]+) to capture the patterns. The leading digits are the Core and the letter is the Code, matching the desired output:
df.ID.str.extract(r'(?P<Core>\d+)(?P<Code>[A-Z])(?P<size>[.0-9]+)')
#   Core Code size
# 0    1    C   16
# 1    1    C    1
# 2    7    P  .75
# 3    7    T    1
# 4    1    C   10
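Alternatively, use .str.extract() with multiple capturing groups and join the result back onto the DataFrame (raw string, with the dot escaped so it only matches a literal "."):
df.join(
    df['ID'].str.extract(r'(\d)(\w)(\d+|\.\d+)').rename(
        columns={0: 'Core', 1: 'Code', 2: 'Size'}))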
ID Value Core Code Size
1 1C16 34.0 1 C 16
2 1C1 45.0 1 C 1
3 7P.75 23.0 7 P .75
4 7T1 34.0 7 T 1
5 1C10DG 34.0 1 C 10

Pandas finding intervals (of n-Days) and capturing start/end dates

This started its life as a list of activities. I first built a matrix similar to the one below to represent all activities, which I inverted to show all inactivity, before building the following matrix, where zero indicates an activity, and anything greater than zero indicates the number of days before the next activity.
+------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| Item | 01/08/2020 | 02/08/2020 | 03/08/2020 | 04/08/2020 | 05/08/2020 | 06/08/2020 | 07/08/2020 | 08/08/2020 | 09/08/2020 |
+------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| B | 3 | 2 | 1 | 0 | 0 | 3 | 2 | 1 | 0 |
| C | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| D | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | 0 |
| E | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 |
+------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
Now I need to find suitable intervals for each Item. For instance, in this case I want to find all intervals with a minimum duration of 3-days.
+------+------------+------------+------------+------------+
| Item | 1_START | 1_END | 2_START | 2_END |
+------+------------+------------+------------+------------+
| A | NaN | NaN | NaN | NaN |
| B | 01/08/2020 | 03/08/2020 | 06/08/2020 | 08/08/2020 |
| C | NaN | NaN | NaN | NaN |
| D | 01/08/2020 | 07/08/2020 | NaN | NaN |
| E | 01/08/2020 | NaN | NaN | NaN |
+------+------------+------------+------------+------------+
In reality the data is 700+ columns wide and 1,000+ rows. How can I do this efficiently?
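One possible approach, sketched here in plain Python under my reading of the example: an interval is a run of consecutive non-zero values lasting at least the minimum number of days, and a run that reaches the last column is reported with an open end (None, shown as NaN in the table above). The function name find_intervals and the dict layout are assumptions for illustration; on a real DataFrame the same function could be applied to each row's values.

```python
from itertools import groupby

# The nine date columns and the five items from the example matrix.
dates = [f"{day:02d}/08/2020" for day in range(1, 10)]
matrix = {
    "A": [0, 0, 0, 0, 0, 0, 0, 0, 0],
    "B": [3, 2, 1, 0, 0, 3, 2, 1, 0],
    "C": [0, 2, 1, 0, 1, 0, 0, 0, 0],
    "D": [7, 6, 5, 4, 3, 2, 1, 0, 0],
    "E": [11, 10, 9, 8, 7, 6, 5, 4, 3],
}


def find_intervals(values, dates, min_days=3):
    """Return (start, end) date pairs for runs of non-zero values of at least min_days."""
    intervals = []
    pos = 0  # index of the first column of the current run
    for is_inactive, run in groupby(values, key=lambda v: v > 0):
        length = len(list(run))
        if is_inactive and length >= min_days:
            last = pos + length - 1
            # A run that reaches the final column has no known end date yet.
            end = dates[last] if last < len(dates) - 1 else None
            intervals.append((dates[pos], end))
        pos += length
    return intervals


result = {item: find_intervals(vals, dates) for item, vals in matrix.items()}
for item, intervals in result.items():
    print(item, intervals)
```

Each row is scanned once, so the cost grows linearly with the number of cells. For 1,000+ rows by 700+ columns that should be manageable, although a vectorized variant may be faster.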

MS Excel's alternative for ={A:A} formula in Google Sheets

This must be a simple thing to do, but somehow I am unable to find an answer to this question. In Google Sheets, if you want to reference an entire column (e.g. column A), you put ={A:A} and the entire column is referenced. How do you achieve something similar in MS Excel?
EDIT (a specific example was requested in the comments):
Let's assume the Google Sheet contains the following data:
| A | B | C |
| 1 | 5 | 9 |
| 2 | 6 | 0 |
| 3 | 7 | 9 |
| 4 | 8 | 0 |
Now if in cell D1 I type ={A:A}, the entire column A will be shown in column D.
| A | B | C | D |
| 1 | 5 | 9 |={A:A}
| 2 | 6 | 0 |
| 3 | 7 | 9 |
| 4 | 8 | 0 |
becomes
| A | B | C | D |
| 1 | 5 | 9 | 1 |
| 2 | 6 | 0 | 2 |
| 3 | 7 | 9 | 3 |
| 4 | 8 | 0 | 4 |
I don't have to drag the formula down or anything. It just shows the entire column.
How do I do the exact same thing in Excel?
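It depends on what you want to do with the reference. Excel accepts a whole-column reference inside a function, for example:
=COUNTIF(A:A,"gold")
Excel does not support open-ended references like:
=COUNTIF(A12:A,"gold")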

Assigning ranks to items that vary in order

I am trying to build a dataset from an online questionnaire. In this questionnaire, participants were asked to name 6 items. These items are represented by the numbers 1 to 6 (order of mention does not matter). Afterwards, participants were asked to rank those items from most important to least important (here the order matters). Right now I have three columns: "Named items", "Items ranked" and "Rank". The last column holds the position at which each item was ranked. The idea is to take the number in the first column ("Named items"), find its position in the second column ("Items ranked"), and write that position into the corresponding row of the third column.
Since the numbers go from 1 to 6, the process has to start again every six rows, i.e. on the 7th row, the 13th row, and so on. I have 186 participants, which means a total of 1116 items. What would be the most efficient way of doing this while preventing human error?
Here is an example of what the sheet looks like when done manually:
+----------------------+-----------------------------+------+
| Order of named items | Items ranked (# = Identity) | Rank |
+----------------------+-----------------------------+------+
| 1 | 2 | 4 |
| 2 | 5 | 1 |
| 3 | 6 | 6 |
| 4 | 1 | 5 |
| 5 | 4 | 2 |
| 6 | 3 | 3 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
| 4 | 4 | 4 |
| 5 | 5 | 5 |
| 6 | 6 | 6 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
| 4 | 4 | 4 |
| 5 | 5 | 5 |
| 6 | 6 | 6 |
| 1 | 5 | 3 |
| 2 | 6 | 4 |
| 3 | 1 | 5 |
| 4 | 2 | 6 |
| 5 | 3 | 1 |
| 6 | 4 | 2 |
| 1 | 2 | 2 |
| 2 | 1 | 1 |
| 3 | 6 | 4 |
| 4 | 3 | 5 |
| 5 | 4 | 6 |
| 6 | 5 | 3 |
+----------------------+-----------------------------+------+
You can use this non-volatile formula (enter it in row 2 and copy it down):
=MATCH(A2,INDEX(B:B,INT((ROW(1:1)-1)/6)*6+2):INDEX(B:B,INT((ROW(1:1)-1)/6)*6+7),0)
Assuming the first column starts at A2 and the second column at B2, use this formula in C2 and copy it down:
=MATCH(A2,OFFSET(B$2,6*INT((ROWS(C$2:C2)-1)/6),0,6),0)
OFFSET returns the required 6-cell range, and MATCH finds the position of the relevant item within it.
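For checking the spreadsheet results independently, the same block-wise lookup can be sketched in Python. This is only an illustration of the logic behind the formulas above, not a replacement for them; the function name block_ranks and the two sample blocks (the first and fourth participants from the example table) are chosen for the sketch.

```python
BLOCK = 6  # each participant occupies six consecutive rows

# "Named items" and "Items ranked" columns for two participants from the example.
named = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
ranked = [2, 5, 6, 1, 4, 3, 5, 6, 1, 2, 3, 4]


def block_ranks(named, ranked, block=BLOCK):
    """For each named item, return its 1-based position within its own block of ranked items."""
    ranks = []
    for i, item in enumerate(named):
        start = (i // block) * block          # first row of this participant's block
        window = ranked[start:start + block]  # the six-cell range OFFSET would return
        ranks.append(window.index(item) + 1)  # MATCH(..., 0) gives a 1-based position
    return ranks


ranks = block_ranks(named, ranked)
print(ranks)  # [4, 1, 6, 5, 2, 3, 3, 4, 5, 6, 1, 2]
```

The printed values agree with the "Rank" column of the corresponding rows in the manual example.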

Excel: Give scores based on range, where max = 1 and min = 10

I have the following problem:
I want to score a range of numbers on a scale from 1 to 10, for example:
| | A | B |
|---|------|----|
| 1 | 1209 | 1 |
| 2 | 401 | 7 |
| 3 | 123 | 9 |
| 4 | 49 | 10 |
| 5 | 30 | 10 |
(I'm not sure B is 100% correct, but roughly.)
I got the B values with
=ABS(CEILING(A1;MAX($A$1:$A$32)/10)*10/MAX($A$1:$A$32)-11)
It seems to work, but if I take numbers like these, for example, I get:
| | A | B |
|---|------|----|
| 1 | 100 | 1 |
| 2 | 90 | 2 |
| 3 | 80 | 3 |
| 4 | 70 | 4 |
| 5 | 50 | 6 |
But I want 50 to be 10.
I would like it to be scalable, so that it works with a 1-10, 1-100, 5-27 or any other scale, with any number of values in the list and any values to score.
Thanks!
Use this formula:
=$E$1 + ROUND((MIN($A:$A)-A1)/((MAX($A:$A)-MIN($A:$A))/($E$1-$E$2)),0)
It is scalable: put the maximum and minimum of the target score scale in E1 and E2.
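As a worked check of that formula, here is a short Python sketch applied to the second example from the question (100, 90, 80, 70, 50 on a 1-10 scale). The function names are made up for illustration. The expression multiplies by (top - bottom) instead of dividing by a quotient, which is algebraically the same as the Excel formula, and excel_round mimics Excel's ROUND, which rounds halves away from zero (Python's built-in round does not).

```python
import math


def excel_round(x):
    """Round to the nearest integer, halves away from zero, like Excel's ROUND(x, 0)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def scale_scores(values, top, bottom):
    """Largest value gets `bottom`, smallest gets `top` (E1 = top, E2 = bottom in the formula)."""
    lo, hi = min(values), max(values)
    # Like the spreadsheet formula, this fails if all values are equal (division by zero).
    return [top + excel_round((lo - v) * (top - bottom) / (hi - lo)) for v in values]


scores = scale_scores([100, 90, 80, 70, 50], top=10, bottom=1)
print(scores)  # [1, 3, 5, 6, 10]
```

With this mapping 100 scores 1 and 50 scores 10, as requested, and the values in between are spread linearly over the scale.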
