How to do a rolling sum in PySpark? [duplicate] - apache-spark

This question already has answers here:
Python Spark Cumulative Sum by Group Using DataFrame
(4 answers)
Closed 2 years ago.
Given the column A as shown in the following example I'd like to have the column B where each record is the sum of the current record in A and previous record in B:
+-------+
| A | B |
+-------+
| 0 | 0 |
| 0 | 0 |
| 1 | 1 |
| 0 | 1 |
| 1 | 2 |
| 1 | 3 |
| 0 | 3 |
| 0 | 3 |
So in a way I'd like to take the previous record into account in my operation. I'm aware of the F.lag function, but I don't see how it can work in this case. Any ideas on how to get this operation done?
I'm open to rephrasing if the idea can be expressed in a better way.

It seems you're trying to do a rolling sum of A. You can do a sum over a window, e.g.
from pyspark.sql import functions as F, Window
df2 = df.withColumn('B', F.sum('A').over(Window.orderBy('ordering_col')))
But you would need a column to order by, otherwise the "previous record" is not well-defined because Spark dataframes are unordered.
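For completeness, here is a minimal runnable sketch, assuming an explicit ordering column (the idx column below is hypothetical); the cumulative sum over that ordering reproduces column B from the example:
from pyspark.sql import SparkSession, functions as F, Window
spark = SparkSession.builder.getOrCreate()
# 'idx' is a hypothetical ordering column; without one, row order is undefined.
df = spark.createDataFrame(
    [(0, 0), (1, 0), (2, 1), (3, 0), (4, 1), (5, 1), (6, 0), (7, 0)],
    ['idx', 'A'])
w = Window.orderBy('idx').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn('B', F.sum('A').over(w)).show()
# B comes out as 0, 0, 1, 1, 2, 3, 3, 3 -- matching the example above.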

Related

Spark replicating rows with values of a column from different dataset

I am trying to replicate rows inside a dataset multiple times with different values for a column in Apache Spark. Let's say I have a dataset as follows:
Dataset A
| num | group |
| 1 | 2 |
| 3 | 5 |
Another dataset has different columns:
Dataset B
| id |
| 1 |
| 4 |
I would like to replicate the rows from Dataset A with the column values of Dataset B. You could say it is a join without any join condition. So the resulting dataset should look like:
| id | num | group |
| 1 | 1 | 2 |
| 1 | 3 | 5 |
| 4 | 1 | 2 |
| 4 | 3 | 5 |
Can anyone suggest how the above can be achieved? As per my understanding, a join requires a condition and columns to be matched between the 2 datasets.
What you want to do is called a Cartesian product, and df1.crossJoin(df2) will achieve it. But be careful with it, because it is a very heavy operation.
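A minimal PySpark sketch of that cross join, using hypothetical dfA and dfB stand-ins for the two datasets in the question:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Hypothetical stand-ins for Dataset A and Dataset B from the question.
dfA = spark.createDataFrame([(1, 2), (3, 5)], ['num', 'group'])
dfB = spark.createDataFrame([(1,), (4,)], ['id'])
# Every row of dfB is paired with every row of dfA: 2 x 2 = 4 rows.
dfB.crossJoin(dfA).show()
The result size grows multiplicatively (rows of A times rows of B), hence the warning above.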

Athena/Presto - UNNEST MAP to columns

Assume I have a table like this:
table: qa_list
id | question_id | question | answer |
---------+--------------+------------+-------------
1 | 100 | question1 | answer |
2 | 101 | question2 | answer |
3 | 102 | question3 | answer |
4 | ...
... | ...
and a query that gives the result below (since I couldn't find a direct way to transpose the table):
table: qa_map
id | qa_map
--------+---------
1 | {question1=answer,question2=answer,question3=answer, ....}
where qa_map is the result of a map_agg over an arbitrary number of questions and answers.
Is there a way to UNNEST qa_map to an arbitrary number of columns as shown below?
id | Question_1 | Answer_1 | Question_2 | Answer_2 | Question_3 | ....
---------+-------------+-----------+-------------+-----------+-------------+
1 | question | answer | question | answer | question | ....
AWS Athena/Presto-0.172
No, there is no way to write a query that results in a different number of columns depending on the data. The columns must be known before query execution starts. The map you have is as close as you are going to get.
If you include your motivation for wanting to do this there may be other ways we can help you achieve your end goal.

PySpark - Select users seen for 3 days a week for 3 weeks a month

I know this is a very specific problem, and it is not usual to post this kind of question on Stack Overflow, but I am in the strange situation of having an idea for a naive algorithm that would solve my issue while not being able to implement it. Hence my question.
I have a data frame
|user_id| action | day | week |
------------------------------
| d25as | AB | 2 | 1 |
| d25as | AB | 3 | 2 |
| d25as | AB | 5 | 1 |
| m3562 | AB | 1 | 3 |
| m3562 | AB | 7 | 1 |
| m3562 | AB | 9 | 1 |
| ha42a | AB | 3 | 2 |
| ha42a | AB | 4 | 3 |
| ha42a | AB | 5 | 1 |
I want to create a dataframe with users that are seen at least 3 days a week for at least 3 weeks a month. The "day" column goes from 1 to 31 and the "week" column goes from 1 to 4.
The way I thought about doing it is :
split dataframe into 4 dataframes for each week
for every week_dataframe count days seen per user.
count for every user how many weeks with >= 3 days they were seen.
only add to the new df the users seen for >= 3 such weeks.
Now I need to do this in Spark in a way that scales, and I have no idea how to implement it. Also, if you have a better idea of an algorithm than my naive approach, that would really be helpful.
I suggest using the groupBy function and selecting users with a where filter:
df.groupBy('user_id', 'week')\
.agg(countDistinct('day').alias('days_per_week'))\
.where('days_per_week >= 3')\
.groupBy('user_id')\
.agg(count('week').alias('weeks_per_user'))\
.where('weeks_per_user >= 3' )
@eakotelnikov is correct.
But if anyone is facing the error
NameError: name 'countDistinct' is not defined
then please run the statement below before executing eakotelnikov's solution:
from pyspark.sql.functions import *
Adding another solution for this problem
tdf.registerTempTable("tbl")
outdf = spark.sql("""
select user_id , count(*) as weeks_per_user from
( select user_id , week , count(*) as days_per_week
from tbl
group by user_id , week
having count(*) >= 3
) x
group by user_id
having count(*) >= 3
""")
outdf.show()
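For reference, here is a self-contained sketch of the DataFrame API approach with explicit imports, fed with the sample rows from the question:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('d25as', 'AB', 2, 1), ('d25as', 'AB', 3, 2), ('d25as', 'AB', 5, 1),
     ('m3562', 'AB', 1, 3), ('m3562', 'AB', 7, 1), ('m3562', 'AB', 9, 1),
     ('ha42a', 'AB', 3, 2), ('ha42a', 'AB', 4, 3), ('ha42a', 'AB', 5, 1)],
    ['user_id', 'action', 'day', 'week'])
result = (df.groupBy('user_id', 'week')
            .agg(F.countDistinct('day').alias('days_per_week'))
            .where('days_per_week >= 3')
            .groupBy('user_id')
            .agg(F.count('week').alias('weeks_per_user'))
            .where('weeks_per_user >= 3'))
# Empty for this tiny sample, since no user reaches 3 distinct days in any week.
result.show()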

Divide two "Calculated Values" within Spofire Graphical Table

I have a Spotfire question. Is it possible to divide two "calculated value" columns in a "graphical table"?
I have a Count([Type]) calculated value. I then limit the data within the second calculated value to arrive at a different number for Count([Type]).
I would like to divide the two in a third calculated value column.
ie.
Calculated value column 1:
Count([Type]) = 100 (NOT LIMITED)
Calculated value column 2:
Count([Type]) = 50 (Limited to [Type]="Good")
Now I would like to say 50/100 = 0.5 in the third calculated value column.
If it is possible to do this all within one calculated value column, that is even better. Graphical tables do not let you have if statements in the custom expression; the only way is to limit data. So I am struggling; any help is appreciated.
Graphical tables do allow IF() in custom expressions. To accomplish this, you are going to have to move your logic away from Limit Data Using Expressions and into your expressions directly. Here are your three axis expressions:
Count([Type])
Count(If([Type]="Good",[Type]))
Count(If([Type]="Good",[Type])) / Count([Type])
Data Set
+----+------+
| ID | Type |
+----+------+
| 1 | Good |
| 1 | Good |
| 1 | Good |
| 1 | Good |
| 1 | Good |
| 1 | Bad |
| 1 | Bad |
| 1 | Bad |
| 1 | Bad |
| 2 | Good |
| 2 | Good |
| 2 | Good |
| 2 | Good |
| 2 | Bad |
| 2 | Bad |
| 2 | Bad |
| 2 | Bad |
+----+------+
Results

Performance: Group by a subset of previous grouping columns

I have a DataFrame with two categorical columns, similar to the following example:
+----+-------+-------+
| ID | Cat A | Cat B |
+----+-------+-------+
| 1 | A | B |
| 2 | B | C |
| 5 | A | B |
| 7 | B | C |
| 8 | A | C |
+----+-------+-------+
I have some processing to do that needs two steps: The first one needs the data to be grouped by both categorical columns. In the example, it would generate the following DataFrame:
+-------+-------+-----+
| Cat A | Cat B | Cnt |
+-------+-------+-----+
| A | B | 2 |
| B | C | 2 |
| A | C | 1 |
+-------+-------+-----+
Then, the next step consists of grouping only by CatA, to calculate a new aggregation, for example:
+-----+-----+
| Cat | Cnt |
+-----+-----+
| A | 3 |
| B | 2 |
+-----+-----+
Now come the questions:
In my solution, I create the intermediate dataframe by doing
val df2 = df.groupBy("catA", "catB").agg(...)
and then I aggregate this df2 to get the last one:
val df3 = df2.groupBy("catA").agg(...)
I assume it is more efficient than aggregating the first DF again. Is that a good assumption? Or does it make no difference?
Are there any suggestions of a more efficient way to achieve the same results?
Generally speaking it looks like a good approach and should be more efficient than aggregating the data twice. Since shuffle files are implicitly cached, at least part of the work should be performed only once. So when you call an action on df2 and subsequently on df3, you should see that the stages corresponding to df2 have been skipped. Also, the partial structure enforced by the first shuffle may reduce the memory requirements of the aggregation buffer during the second agg.
Unfortunately DataFrame aggregations, unlike RDD aggregations, cannot use a custom partitioner. This means that you cannot compute both data frames using a single shuffle based on the value of catA, and that the second aggregation will require a separate exchange (hash partitioning). I doubt it justifies switching to RDDs.
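To see the skipped stages in practice, here is a small PySpark sketch of the same two-step approach (the question's Scala code maps directly; the sum in the second step is only an assumed example of the second aggregation):
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 'A', 'B'), (2, 'B', 'C'), (5, 'A', 'B'), (7, 'B', 'C'), (8, 'A', 'C')],
    ['id', 'catA', 'catB'])
# First aggregation: one shuffle keyed on (catA, catB).
df2 = df.groupBy('catA', 'catB').agg(F.count('*').alias('cnt'))
# Second aggregation reuses df2; because its shuffle output is implicitly cached,
# the stages that produced df2 show up as "skipped" in the Spark UI.
df3 = df2.groupBy('catA').agg(F.sum('cnt').alias('cnt'))
df2.show()
df3.show()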
