Let's say I have the following table with money spent per day (timestamp)
timestamp| spent
0 | 0
1 | 0
2 | 1
3 | 4
4 | 0
5 | 0
6 | 1
7 | 3
The result I'm looking for is a table adding columns for the cumulative money spent in the last "n" days, for example the last 2 days and the last 5 days, resulting in something like this:
timestamp | spent | spent-2d |spent-5d | ....
0 | 0 | null | null | ...
1 | 0 | 0 | null | ...
2 | 1 | 1 | null | ...
3 | 4 | 5 | null | ...
4 | 0 | 4 | 5 | ...
5 | 0 | 0 | 5 | ...
6 | 1 | 1 | 6 | ...
7 | 3 | 4 | 8 | ....
One possible solution is to add lagged columns and then sum them, but for, say, 180 days I would need to add 180 columns, and I want to do this for not just one but several columns in the dataframe. For example, for 100-500 columns I want the lagged sum over 1, 2, 5, 7, 15, 30, 90 and 180 days, so adding 180*500 columns seems like a bad idea.
Any other ideas on how to do this efficiently?
Window "rangeBetween" method can be used, example for 5 days column:
val lastFiveDaysWindow = Window
.orderBy("timestamp")
.rangeBetween(Window.currentRow - 4, Window.currentRow)
df
.withColumn("spent-5d",
when(
$"timestamp" >= 4,
sum("spent").over(lastFiveDaysWindow)
)
)
Note: this is suitable only for small DataFrames, since Spark emits the following warning:
No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
For bigger DataFrames, an inner join can be used, as in the answer here:
Count of all element less than the value in a row
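To give an idea of how the same window trick extends to many horizons and many columns without materialising 180 * 500 lag columns, here is a minimal PySpark sketch. It assumes a DataFrame df with a numeric timestamp column; the value_cols list is a placeholder you would replace with your own columns.
from pyspark.sql import Window
from pyspark.sql import functions as F

horizons = [1, 2, 5, 7, 15, 30, 90, 180]   # rolling-sum horizons in days
value_cols = ["spent"]                      # extend with the other 100-500 columns

for n in horizons:
    # window covering the current timestamp and the n-1 preceding ones
    w = Window.orderBy("timestamp").rangeBetween(-(n - 1), Window.currentRow)
    for c in value_cols:
        df = df.withColumn(
            f"{c}-{n}d",
            # stays null until a full n-day history exists, as in the example above
            F.when(F.col("timestamp") >= n - 1, F.sum(c).over(w)),
        )
# the same single-partition warning applies here, since the window has no partitionBy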
I'm doing a case-control study about ovarian cancer. I want to do stratified analyses for the different histotypes but haven't found a good way of doing it in SPSS. I was thinking about copying the information about the diagnoses from the cases to the controls, but I don't know the proper syntax to do it.
So - what I want to do is to find the diagnosis within the case-control pair, copy it, and paste it into the same variable for all the controls within that pair. Does anyone know a good way to do this?
ID = unique ID for the individual; casecontrol = 1 for case, 0 for control; caseset = stratum ID for each matched group of individuals.
My dataset looks like this:
ID | casecontrol | caseset | diagnosis
1 | 1 | 1 | 1
2 | 0 | 1 | 0
3 | 0 | 1 | 0
4 | 0 | 1 | 0
5 | 1 | 2 | 3
6 | 0 | 2 | 0
7 | 0 | 2 | 0
8 | 0 | 2 | 0
And I want it to look like this:
ID | casecontrol | caseset | diagnosis
1 | 1 | 1 | 1
2 | 0 | 1 | 1
3 | 0 | 1 | 1
4 | 0 | 1 | 1
5 | 1 | 2 | 3
6 | 0 | 2 | 3
7 | 0 | 2 | 3
8 | 0 | 2 | 3
Thank you very much.
According to your example, within each value of caseset you have one line where diagnosis equals some positive number, while in the rest of the lines diagnosis is zero (or missing?).
If this is true, all you need to do is this:
aggregate out=* mode=add overwrite=yes /break=caseset /diagnosis=max(diagnosis).
The above command will overwrite the original data, so make sure you have that data backed up, or use a different name for the aggregated variable (e.g. /FullDiagnosis=max(diagnosis)).
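The logic itself, independent of SPSS, is simply "take the maximum diagnosis per caseset and write it back onto every row of that caseset". Purely for illustration, a minimal pandas sketch of the same idea using the example data from the question:
import pandas as pd

# example data from the question
df = pd.DataFrame({
    "ID":          [1, 2, 3, 4, 5, 6, 7, 8],
    "casecontrol": [1, 0, 0, 0, 1, 0, 0, 0],
    "caseset":     [1, 1, 1, 1, 2, 2, 2, 2],
    "diagnosis":   [1, 0, 0, 0, 3, 0, 0, 0],
})

# copy each caseset's (single positive) diagnosis to all rows of that caseset
df["diagnosis"] = df.groupby("caseset")["diagnosis"].transform("max")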
I have a spark dataframe with hour of day, locationID and frequency.
Frequency is how many times a locationID appears in that hour of day.
+----+----------+---------+
|hour|locationID|frequency|
+----+----------+---------+
| 0 | 1 | 20 |
| 0 | 2 | 11 |
| 0 | 9 | 6 |
| 1 | 3 | 32 |
| 1 | 1 | 22 |
| 1 | 5 | 4 |
I want to take the 2 most frequent locationID per hour.
This can be done with a row_number window function. The window partitions by hour and orders by frequency in descending order; then filter for the top 2 rows.
from pyspark.sql import Window
from pyspark.sql.functions import row_number, desc

# rank rows within each hour by descending frequency
w = Window.partitionBy(df.hour).orderBy(desc('frequency'))
rnum_df = df.withColumn('rnum', row_number().over(w))

# keep the two most frequent locationIDs per hour
rnum_df.filter(rnum_df.rnum <= 2).show()
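With the sample rows from the question, the output should look something like this (rnum restarts at 1 within each hour):
+----+----------+---------+----+
|hour|locationID|frequency|rnum|
+----+----------+---------+----+
|   0|         1|       20|   1|
|   0|         2|       11|   2|
|   1|         3|       32|   1|
|   1|         1|       22|   2|
+----+----------+---------+----+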
I know this is a very specific problem and it is not usual to post this kind of question on stackoverflow, but I am in the strange situation of having an idea of a naive algorithm that would solve my issue, but not being able to implement it. Hence my question.
I have a data frame
|user_id| action | day | week |
------------------------------
| d25as | AB | 2 | 1 |
| d25as | AB | 3 | 2 |
| d25as | AB | 5 | 1 |
| m3562 | AB | 1 | 3 |
| m3562 | AB | 7 | 1 |
| m3562 | AB | 9 | 1 |
| ha42a | AB | 3 | 2 |
| ha42a | AB | 4 | 3 |
| ha42a | AB | 5 | 1 |
I want to create a dataframe with users that are seen at least 3 days a week for at least 3 weeks a month. The "day" column goes from 1 to 31 and the "week" column goes from 1 to 4.
The way I thought about doing it is:
split dataframe into 4 dataframes for each week
for every week_dataframe count days seen per user.
count for every user how many weeks with >= 3 days they were seen.
only add to the new df the users seen for >= 3 such weeks.
Now I need to do this in Spark in a way that scales, and I have no idea how to implement it. Also, if you have a better idea of an algorithm than my naive approach, that would be really helpful.
I suggest using the groupBy function and selecting users with a where filter:
# count distinct days per (user, week), keep weeks with >= 3 days seen,
# then count those weeks per user and keep users with >= 3 such weeks
df.groupBy('user_id', 'week')\
  .agg(countDistinct('day').alias('days_per_week'))\
  .where('days_per_week >= 3')\
  .groupBy('user_id')\
  .agg(count('week').alias('weeks_per_user'))\
  .where('weeks_per_user >= 3')
@eakotelnikov is correct.
But if anyone is facing the error
NameError: name 'countDistinct' is not defined
then add the following import before running eakotelnikov's solution:
from pyspark.sql.functions import *
Here is another solution for this problem, using Spark SQL:
# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView does the same thing
tdf.registerTempTable("tbl")
outdf = spark.sql("""
select user_id , count(*) as weeks_per_user from
( select user_id , week , count(*) as days_per_week
from tbl
group by user_id , week
having count(*) >= 3
) x
group by user_id
having count(*) >= 3
""")
outdf.show()
I have the following tables
Orders:
OrderID|Cost|Quarter|User
-------------------------
1 | 10 | 1 | 1
2 | 15 | 1 | 2
3 | 3 | 2 | 1
4 | 5 | 3 | 3
5 | 8 | 4 | 2
6 | 9 | 2 | 3
7 | 6 | 3 | 3
Goals:
UserID|Goal|Quarter
-------------------
1 | 20 | 1
1 | 15 | 2
2 | 12 | 2
2 | 15 | 3
3 | 5 | 3
3 | 7 | 4
Users:
UserID|Name
-----------
1 | John
2 | Bob
3 | Homer
What I'm trying to do is sum up all orders that one user had and divide that by the sum of his goals; then sum up all orders, divide the result by the sum of all goals, and finally add this overall result to each user's previous result.
The result should be:
UserID|Name |Goal|CostSum|Percentage|Sum all
---------------------------------------------------
1 |John | 35 | 13 | 0.37 |
2 |Bob | 27 | 23 | 0.85 |
3 |Homer| 12 | 20 | 1.67 |
The calculation is as follows:
CostSum: 10+3=13
Goal: 20+15=35
Percentage: CostSum/Goal=13/35=0.37
Sum all: 10+15+3+5+8+9+6=56
Goal all: 20+15+12+15+5+7=74
percentage all= Sum_all/Goal_all=56/74=0.76
Result: percentage+percentage_all=0.37+0.76=1.13 for John
1.61 for Bob
2.43 for Homer
My main problem is the last step: I can't get it to add the overall percentage, because the pivot table always filters the result, making it wrong.
To do this you're going to need to create some measures.
(I will assume you've already set your pivot table to be in tabular layout with subtotals switched off - this allows you to set UserID and Name next to each other in the row labels section.)
First, let's be sure you've set up your relationships correctly: the Users table should be related to both Orders and Goals on the user ID columns.
I believe you already have the first 5 columns set up in your pivot table, so we need to create measures for CostSumAll, GoalSumAll, PercentageAll and Result.
The key to making this work is to ensure PowerPivot ignores the row label filter for your CostSumAll and GoalSumAll measures. The ALL() function acts as an override filter when used in CALCULATE() - you just have to specify which filters you want to ignore. In this case, UserID and Name.
CostSumAll:
=CALCULATE(SUM(Orders[Cost]),ALL(Users[UserID]),ALL(Users[Name]))
GoalSumAll:
=CALCULATE(SUM(Goals[Goal]),ALL(Users[UserID]),ALL(Users[Name]))
PercentageAll:
=Orders[CostSumAll]/Orders[GoalSumAll]
Result:
=Orders[Percentage]+Orders[PercentageAll]
Download - Example file available for download here. (Don't actually read it in Google Docs - it won't be able to handle the PowerPivot stuff. Save locally to view.)
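Not a PowerPivot step, just a quick way to sanity-check the arithmetic above: a minimal pandas sketch using the Orders and Goals tables from the question.
import pandas as pd

# Orders and Goals tables from the question
orders = pd.DataFrame({"User": [1, 2, 1, 3, 2, 3, 3],
                       "Cost": [10, 15, 3, 5, 8, 9, 6]})
goals = pd.DataFrame({"UserID": [1, 1, 2, 2, 3, 3],
                      "Goal":   [20, 15, 12, 15, 5, 7]})

cost_sum = orders.groupby("User")["Cost"].sum()               # 13, 23, 20
goal_sum = goals.groupby("UserID")["Goal"].sum()              # 35, 27, 12
percentage = cost_sum / goal_sum                              # 0.37, 0.85, 1.67
percentage_all = orders["Cost"].sum() / goals["Goal"].sum()   # 56 / 74 = 0.76
# ≈ 1.13, 1.61, 2.42 (the question's 2.43 comes from adding the already-rounded 1.67 + 0.76)
print((percentage + percentage_all).round(2))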
I have the following data in Excel:
day | week | #pieces
1 | 1 | 5
2 | 1 | 5
3 | 1 | 5
4 | 1 | 5
5 | 1 | 5
1 | 2 | 5
2 | 2 | 5
3 | 2 | 5
4 | 2 | 5
5 | 2 | 5
1 | 3 | 5
2 | 3 | 5
I did a pivot table that gets the total of #pieces per week.
Now I want to get the average of #pieces per week. I tried to use a calculated field that divides #pieces by 5, but that is not always correct for the current week; see week 3, for example:
Week | Average #pieces
1 | 5
2 | 5
3 | 2 (this number '2' should also be '5')
Does anyone know how to do that?
Thanks
As far as I understand what you're trying to do, you don't need a calculated field (which is messing you up, by the way, because you're dividing by the hard-coded 5).
I would pivot with week number as your row label and average of pieces as your values. You can change sum to average in the value field options of the pivot table.
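For what it's worth, the same calculation outside Excel is just a group-by average rather than a division by a hard-coded 5; a minimal pandas sketch on the sample data from the question:
import pandas as pd

# sample data from the question: 5 pieces per day, week 3 only partially filled
df = pd.DataFrame({
    "day":     [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
    "week":    [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3],
    "#pieces": [5] * 12,
})

# averaging over the days actually present gives 5 for every week, including week 3
print(df.groupby("week")["#pieces"].mean())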