PySpark Dataframe recursive column

I have this PySpark Dataframe calculated in my algorithm:
+------+--------------------+
| A | b |
+------+--------------------+
| 1|1.000540895285929161|
| 2|1.097289726627339219|
| 3|0.963925596369865420|
| 4|0.400642772674179290|
| 5|1.136213095583983134|
| 6|1.563124989279187345|
| 7|0.924395764582530139|
| 8|0.833237679638091343|
| 9|1.381905515925928345|
| 10|1.315542676739417356|
| 11|0.496544353345593242|
| 12|1.075150956754565637|
| 13|0.912020266273109506|
| 14|0.445620998720738948|
| 15|1.440258342829831504|
| 16|0.929157554709733613|
| 17|1.168496273549324876|
| 18|0.836936489952743701|
| 19|0.629466356196215569|
| 20|1.145973619225162914|
| 21|0.987205342817734242|
| 22|1.442075381077187609|
| 23|0.958558287841447591|
| 24|0.924638906376455542|
+------+--------------------+
I need to calculate a new column named F, as a sort of recursive calculation:
F(I) = F(I-1) * 0.25 + b(I+1) * 0.50 + b(I) * 0.25
where I is the row index. Only for I = 1, the value is:
F(1) = b(1) * 0.25 + b(2) * 0.50 + b(1) * 0.25
How should I calculate that? Should I use the lag and lead functions?
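lead() can pull b(I+1) onto each row, but lag()/lead() only read existing columns, not the F values that are still being computed, so the recurrence does not map onto a single window expression; it has to be evaluated sequentially. A hedged sketch of one way to do that with a grouped pandas UDF, assuming column A is the row index I, that the column types are roughly bigint/double, and that the missing b(I+1) of the last row is treated as 0:
import pandas as pd
from pyspark.sql import functions as Fn
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

out_schema = StructType([
    StructField("A", LongType()),
    StructField("b", DoubleType()),
    StructField("F", DoubleType()),
])

def add_f(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf = pdf.sort_values("A").reset_index(drop=True)
    b = [float(x) for x in pdf["b"]]                # cast in case b arrives as Decimal
    f = [0.0] * len(b)
    for i in range(len(b)):
        nxt = b[i + 1] if i + 1 < len(b) else 0.0   # assumption: no b(I+1) for the last row
        prev = f[i - 1] if i > 0 else b[i]          # base case: b(1) stands in for F(0)
        f[i] = prev * 0.25 + nxt * 0.50 + b[i] * 0.25
    pdf["b"] = b
    pdf["F"] = f
    return pdf[["A", "b", "F"]]

result = (df.withColumn("g", Fn.lit(1))             # single group: the whole frame is one sequence
            .groupBy("g")
            .applyInPandas(add_f, schema=out_schema))
If the data is as small as the sample, df.orderBy("A").toPandas() and a plain Python loop would work just as well; the applyInPandas version only buys the ability to run one such sequence per group in parallel.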

Related

Spark Window Functions: calculated once per frame/range?

This is a question about Window Functions in Spark.
Assume I have this DF
DATE_S | ID | STR | VALUE
-------------------------
1 | 1 | A | 0.5
1 | 1 | A | 1.23
1 | 1 | A | -0.4
2 | 1 | A | 2.0
3 | 1 | A | -1.2
3 | 1 | A | 0.523
1 | 2 | A | 1.0
2 | 2 | A | 2.5
3 | 2 | A | 1.32
3 | 2 | A | -3.34
1 | 1 | B | 1.5
1 | 1 | B | 0.23
1 | 1 | B | -0.3
2 | 1 | B | -2.0
3 | 1 | B | 1.32
3 | 1 | B | 523.0
1 | 2 | B | 1.3
2 | 2 | B | -0.5
3 | 2 | B | 4.3243
3 | 2 | B | 3.332
This is just an example! Assume that there are many more DATE_S for each (ID, STR), many more IDs and STRs, and many more entries per (DATE_S, ID, STR). Obviously there are multiple values per combination (DATE_S, ID, STR).
Now I do this:
val w = Window.partitionBy("ID", "STR").orderBy("DATE_S").rangeBetween(-N, -1)
df.withColumn("RESULT", function("VALUE").over(w))
where N might lead to the inclusion of a large range of rows, from 100 to 100000 and more, depending on ("ID", "STR")
The result will be something like this
DATE_S | ID | STR | VALUE | RESULT
----------------------------------
1 | 1 | A | 0.5 | R1
1 | 1 | A | 1.23 | R1
1 | 1 | A | -0.4 | R1
2 | 1 | A | 2.0 | R2
3 | 1 | A | -1.2 | R3
3 | 1 | A | 0.523 | R3
1 | 2 | A | 1.0 | R4
2 | 2 | A | 2.5 | R5
3 | 2 | A | 1.32 | R6
3 | 2 | A | -3.34 | R7
1 | 1 | B | 1.5 | R8
1 | 1 | B | 0.23 | R8
1 | 1 | B | -0.3 | R9
2 | 1 | B | -2.0 | R10
3 | 1 | B | 1.32 | R11
3 | 1 | B | 523.0 | R11
1 | 2 | B | 1.3 | R12
2 | 2 | B | -0.5 | R13
3 | 2 | B | 4.3243| R14
3 | 2 | B | 3.332 | R14
There are identical "RESULT"s because for every row with identical (DATE_S, ID, STR), the values that go into the calculation of "function" are the same.
My question is this:
Does spark call "function" for each ROW (recalculating the same value multiple times) or calculate it once per range (frame?) of values and just pastes them on all rows that fall in the range?
Thanks for reading :)
From what I can see, the result may not be the same if run twice on your data, as there is no way to order the rows distinctly. But let's leave that aside.
Whilst there is codegen optimization, there is nothing documented saying that Spark checks, in the way you describe, whether the next row's invocation covers the same set of data and reuses the previous result. I have never read of that type of optimization. There is fusing due to the lazy evaluation approach, but that is another matter. So, per row, it calculates again.
From a great source: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-functions-windows.html
... At its core, a window function calculates a return value for every
input row of a table based on a group of rows, called the frame. Every
input row can have a unique frame associated with it. ...
... In other words, when executed, a window function computes a value
for each and every row in a window (per window specification). ...
The biggest issue is to have a suitable number of partitions for parallel processing, which is expensive, but this is big data. partitionBy("ID", "STR") is the clue here, and that is a good thing.
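If recomputing the aggregate for every detail row becomes a bottleneck, one workaround (a hedged sketch only, valid when "function" is decomposable, e.g. the plain sum assumed here) is to aggregate once per (ID, STR, DATE_S), run the window over that much smaller dataframe, and join the result back onto the detail rows. The sketch is in PySpark; the same shape works in the Scala API, and N is the range bound from the question.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# one row per frame key; assumes the aggregate is a plain sum
per_date = df.groupBy("ID", "STR", "DATE_S").agg(F.sum("VALUE").alias("DATE_SUM"))

# same window as in the question, but now over the pre-aggregated rows
w = Window.partitionBy("ID", "STR").orderBy("DATE_S").rangeBetween(-N, -1)
per_date = per_date.withColumn("RESULT", F.sum("DATE_SUM").over(w))

# paste the per-frame result back onto the detail rows
result = df.join(per_date.select("ID", "STR", "DATE_S", "RESULT"),
                 on=["ID", "STR", "DATE_S"], how="left")
This computes each RESULT once per (ID, STR, DATE_S) instead of once per row, at the cost of an extra shuffle for the join.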

Explode date interval over a group by and take last value in pyspark

I have a dataframe which contains some products, a date and a value. The dates have gaps of varying size between recorded values, and I want to fill them out so that I have a value for every hour from the first time the product was seen to the last; if there is no record for an hour, I want to use the latest recorded value.
So, I have a dataframe like:
| ProductId | Date | Value |
|-----------|-------------------------------|-------|
| 1 | 2020-03-12T00:00:00.000+0000 | 4 |
| 1 | 2020-03-12T01:00:00.000+0000 | 2 |
| 2 | 2020-03-12T01:00:00.000+0000 | 3 |
| 2 | 2020-03-12T03:00:00.000+0000 | 4 |
| 1 | 2020-03-12T05:00:00.000+0000 | 4 |
| 3 | 2020-03-12T05:00:00.000+0000 | 2 |
I want to create a new dataframe that looks like:
| ProductId | Date | Value |
|-----------|-------------------------------|-------|
| 1 | 2020-03-12T00:00:00.000+0000 | 4 |
| 1 | 2020-03-12T01:00:00.000+0000 | 2 |
| 1 | 2020-03-12T02:00:00.000+0000 | 2 |
| 1 | 2020-03-12T03:00:00.000+0000 | 2 |
| 1 | 2020-03-12T04:00:00.000+0000 | 2 |
| 1 | 2020-03-12T05:00:00.000+0000 | 4 |
| 2 | 2020-03-12T01:00:00.000+0000 | 3 |
| 2 | 2020-03-12T02:00:00.000+0000 | 3 |
| 2 | 2020-03-12T03:00:00.000+0000 | 4 |
| 3 | 2020-03-12T05:00:00.000+0000 | 2 |
My code so far:
def generate_date_series(start, stop):
    start = datetime.strptime(start, "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
    stop = datetime.strptime(stop, "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
    return [start + datetime.timedelta(hours=x) for x in range(0, (stop-start).hours + 1)]

spark.udf.register("generate_date_series", generate_date_series, ArrayType(TimestampType()))

df = df.withColumn("max", max(col("Date")).over(Window.partitionBy("ProductId"))) \
       .withColumn("min", min(col("Date")).over(Window.partitionBy("ProductId"))) \
       .withColumn("Dato", explode(generate_date_series(col("min"), col("max"))) \
       .over(Window.partitionBy("ProductId").orderBy(col("Dato").desc())))
window_over_ids = (Window.partitionBy("ProductId").rangeBetween(Window.unboundedPreceding, -1).orderBy("Date"))
df = df.withColumn("Value", last("Value", ignorenulls=True).over(window_over_ids))
Error:
TypeError: strptime() argument 1 must be str, not Column
So the first question is obviously how do I create and call the udf correctly so I don't run into the above error.
The second question is how do I complete the task, such that I get my desired dataframe?
So after some searching and experimenting I found a solution. I defined a udf that returns the date range between two dates in 1-hour intervals, and then do a forward fill.
I fixed the issue with the following code:
from datetime import timedelta
import sys

from pyspark.sql.functions import col, explode, lag, last, lit, udf
from pyspark.sql.types import ArrayType, TimestampType
from pyspark.sql.window import Window

def missing_hours(t1, t2):
    # hours strictly between the previous and the current timestamp
    return [t1 + timedelta(hours=x) for x in range(1, int((t2 - t1).total_seconds() / 3600))]

missing_hours_udf = udf(missing_hours, ArrayType(TimestampType()))

window = Window.partitionBy("ProductId").orderBy("Date")

# build the missing hourly rows, with a NULL Value to be filled in later
df_missing = df.withColumn("prev_timestamp", lag(col("Date"), 1, None).over(window)) \
    .filter(col("prev_timestamp").isNotNull()) \
    .withColumn("Date", explode(missing_hours_udf(col("prev_timestamp"), col("Date")))) \
    .withColumn("Value", lit(None)) \
    .drop("prev_timestamp")

# df_original is the untouched input dataframe
df = df_original.union(df_missing)

window = Window.partitionBy("ProductId").orderBy("Date") \
    .rowsBetween(-sys.maxsize, 0)

# define the forward-filled column
filled_values_column = last(df['Value'], ignorenulls=True).over(window)

# do the fill
df = df.withColumn('Value', filled_values_column)
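To sanity-check the result, something like df.orderBy("ProductId", "Date").show(50, truncate=False) should now list one row per hour for each product, with the gaps carrying the last recorded Value forward.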

Python: selecting a different number of rows for each group of a multilevel index

I have a data frame with a multilevel index. I would like to sort this data frame based on a specific column and extract the first n rows for each group of the first index, but n is different for each group.
For example:
| Index1| Index2| Sort_In_descending_order | How_manyRows_toChoose |
-----------------------------------------------------------------------
| 1 | 20 | 3 | 2 |
| | 40 | 2 | 2 |
| | 10 | 1 | 2 |
| 2 | 20 | 2 | 1 |
| | 50 | 1 | 1 |
the result should look like this:
| Index1| Index2| Sort_In_descending_order | How_manyRows_toChoose |
-----------------------------------------------------------------------
| 1 | 20 | 3 | 2 |
| | 40 | 2 | 2 |
| 2 | 20 | 2 | 1 |
I got this far:
df.groupby(level=[0,1]).sum().sort_values(['Index1','Sort_In_descending_order'],ascending=False).groupby('Index1').head(2)
However, the .head(2) picks 2 rows from each group regardless of the number in the column "How_manyRows_toChoose".
Some piece of code would be great!
Thank you!
Use a lambda function in GroupBy.apply with head, and add the parameter group_keys=False to avoid duplicated index values:
# original code
df = (df.groupby(level=[0, 1])
        .sum()
        .sort_values(['Index1', 'Sort_In_descending_order'], ascending=False))

df = (df.groupby('Index1', group_keys=False)
        .apply(lambda x: x.head(x['How_manyRows_toChoose'].iat[0])))
print (df)
               Sort_In_descending_order  How_manyRows_toChoose
Index1 Index2
1      20                             3                      2
       40                             2                      2
2      20                             2                      1

Converting a list of values to between -1 and 1

I have a questionnaire with answers in a number of different formats. I want the range to be between -1 and 1. However, not all ranges include negative numbers.
I need to create an Excel formula to convert the values as follows, depending on the range.
+---+--------+
| A |To this |
+---+--------+
|-3 | -1 |
|-2 | -0.66 |
|-1 | -0.33 |
| 0 | 0 |
| 1 | 1 |
+---+--------+
Or
+---+--------+
| A |To this |
+---+--------+
| 0 | 0 |
| 1 | 0.25 |
| 2 | 0.5 |
| 3 | 0.75 |
| 4 | 1 |
+---+--------+
Or
+---+--------+
| A |To this |
+---+--------+
| 1 | 0.2 |
| 2 | 0.4 |
| 3 | 0.6 |
| 4 | 0.8 |
| 5 | 1 |
+---+--------+
Or
+---+--------+
| A |To this |
+---+--------+
|-2 | -1 |
|-1 | -0.5 |
| 0 | 0 |
| 1 | 0.5 |
| 2 | 1 |
+---+--------+
etc.
This formula should do the trick:
=IFERROR(IF(A1<=0,-1*A1/(MIN(A:A)+MIN(0,MAX(A:A))),A1/(MAX(A:A))),0)
It produces the following example output when autofilled down:
-3 -1
-2 -0.666666667
-1 -0.333333333
0 0
1 0.2
2 0.4
3 0.6
4 0.8
5 1
Note: this treats 0 as belonging to both the negative range (scaled to -1..0) and the positive range (scaled to 0..1).
If the range of input numbers is finite, even with negative numbers, you can use the general range mapping formula as below.
If the range of input numbers is [X1:X2] and the range of output numbers is [Y1:Y2] (in your case [-1:+1]) then number x is mapped to number y in the output range with the following formula:
y = (x - X1) * (Y2 - Y1) / (X2 - X1) + Y1
when X2-X1 != 0
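For example, mapping the input range [-2:2] onto [-1:1], x = 1 gives y = (1 - (-2)) * (1 - (-1)) / (2 - (-2)) + (-1) = 3 * 2 / 4 - 1 = 0.5, which matches the last of the example tables above.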

Non-exact match vlookup with specific criteria

I have two columns of data (Time A and Time B), and I would like to find out, for each value a in A, whether there is a value b in B that meets the criterion b - a = ±0.007 (i.e. within 0.007 of a). I am trying to use VLOOKUP, but I cannot specify the criterion b - a = ±0.007. Can I do this using VLOOKUP, or is there another way to do it in Excel? Many thanks in advance for any help!
The data example is shown below.
+----------------+------------------+
| Time A | Time B |
+----------------+------------------+
| 0.000 | 0.000 |
| 1.001 | 1.001 |
| 1.852 | 1.852 |
| 2.725 | 2.729 |
| 3.356 | 3.359 |
| 4.061 | 4.070 |
| 4.423 | 4.431 |
| 4.634 | 4.642 |
| 4.750 | 4.637 |
| 5.390 | 5.398 |
| 5.788 | 5.788 |
| 6.515 | 6.522 |
| 7.010 | 7.010 |
| 7.672 | 7.500 |
| 8.017 | 7.900 |
| 8.073 | 8.200 |
+----------------+------------------+
You could use this VBA solution:
Sub main()
    Dim i As Integer
    Dim j As Integer

    For i = 2 To 16
        For j = 2 To 16
            If Abs(Cells(j, 2) - Cells(i, 1)) < 0.007 Then
                Cells(i, 3) = j
            End If
        Next j
    Next i
End Sub
In column C, it outputs the matching row index from column B.
