We have a spark application, in which data is shared across different executors. But we also need to compare data between executors, where some data is present in executor-1, and some data is present in executor-2. We wanted to know how can we achieve in spark?
For example: have a file with following details:
Name, Date1, Date2
A, 2019-01-01, 2019-01-23
A, 2019-02-12, 2019-03-21
A, 2019-04-01, 2019-05-31
A, 2019-06-02, 2019-12-30
B, 2019-01-01, 2019-01-21
B, 2019-02-10, 2019-03-21
B, 2019-04-01, 2019-12-31
I need to find total gaps between these elements by checking date2 of first row, against date1 of second row, and so on.. i.e.
For example: for Name A: (2019-02-12 - 2019-01-23) + (2019-04-01 - 2019-03-21) + (2019-06-02 -2019-05-31) + (2019-12-31 - 2019-12-30) .. Year is ending on 2019-12-31, so there is gap of 1 day
and also number of gaps (if difference between above formula per date > 0)
will be 4.
For Name B: (2019-02-10 - 2019-01-21) + (2019-04-01 -
2019-03-21), and number of gaps would be 2.
One approach is to use collectAsList(), which retrieves all data to driver, but is there a different efficient way to compare them directly across executors, if yes how can we do that?
Just write an SQL query with lag windowing, qualifying, check the adjacent rows for date ad date minus 1, with major key qualification being Name. Sort as well within Name.
You need not worry about Executors, Spark will hash for you automatically based on Name to a Partition serviced by an Executor.
Related
I have a range of cells in excel. How to increment numbers when meeting data validation from three different columns?
I tried using formula COUNTIF($A$2:A2,A2) which creates a number sequence. But I have other data to validate from another column for it to return the correct number sequence.
First validation: count the emp no in column range A1:A5 which return a result under Hierarchy column.
Second validation: check the % value under column L as per below level of hierarchy in which the problem comes from.
1 - 0.25
2 - 0.25
3 - 0.5
4 - 0.5
5 - 1
Third validation: check the type of Relation (see Relation column) that needs to check when returning number of sequence too. Below is the Relation Level table.
I don't know on how to join these three conditions for the result to be as below.
My really problem here is on how will i get a sequence number if a person does have 3 children and should be tagged as 2,3,4 (next to spouse which is 1) then the next relation which is parent will be tagged then as next number sequence from the last count of child wherein will be 5 given that as per relation table, Parent level is 3 but it will be adjusted as per count of relations a person has. And for this specific instance, if Parent count will be 5, it still should have 0.5 EE % (see relation table level vs % hierarchy level) even though the count of number is 5. I hope this will make sense. But let me know if you have any questions.
Hope someone could help me on this coz I am not that expert when it comes to excel formula. Thank you!
I have a data frame that looks like this- user ID and dates of activity. I need to calculate the average difference between dates using RDD functions (such as reduce and map) and not SQL.
The dates for each ID needs to be sorted by order before calculating the difference, as I need the difference between each consecutive dates.
ID
Date
1
2020-09-03
1
2020-09-03
2
2020-09-02
1
2020-09-04
2
2020-09-06
2
2020-09-16
the needed outcome for this example will be:
ID
average difference
1
0.5
2
7
thanks for helping!
You can use datediff with window function to calculate the difference, then take an average.
lag is one of the window function and it will take a value from the previous row within the window.
from pyspark.sql import functions as F
# define the window
w = Window.partitionBy('ID').orderBy('Date')
# datediff takes the date difference from the first arg to the second arg (first - second).
(df.withColumn('diff', F.datediff(F.col('Date'), F.lag('Date').over(w)))
.groupby('ID') # aggregate over ID
.agg(F.avg(F.col('diff')).alias('average difference'))
)
I need a count of how many date items fall within Data 1 & Data 2
ie:
x-1 will have a count of 2
x-2 will have a count of 1
-x-3 will have a count of 2
-y-1 will have a count of 2
What would be the best way to go abouts when approaching this?
Data 1
Data 2
Date
x
1
Date 1
x
1
Date 1
x
1
Date 2
x
2
Date 3
x
2
Date 3
y
1
Date 1
y
1
Date 1
I see only one way to interpret with the available information:
To count the number of times Date_to_test falls within Date_1 and Date_2 (screenshot below, sheet here), you could use either the sum or something like a countifs (with interim calc):
sum approach
=SUM(1*($C$2:$C$11<=$B$2:$B$11)*($A$2:$A$11<=$C$2:$C$11))
countifs + interim calc
helper
=1*(C2<=B2)*(A2<=C2)
(additional column, drag down)
countifs
=COUNTIFS($D$2:$D$11,1)
Screenshot
Alternative
as for the 'sum' approach, sumproduct variants (e.g. =SUMPRODUCT(1*($C$2:$C$11<=$B$2:$B$11),1*($B$2:$B$11>=$A$2:$A$11))) are calculation/memory intensive
despite the countifs + helper approach containing more 'visible' data - these values need only be calculated once, the countifs can then be determined independently (assuming no updates to the helper column) - thus making it more memory/calculation efficient depending upon your calculation mode, screen-updating preferences
Caveat
if, by some misfortune re: interpreting your question, you are referring to some other means of establishing whether "date items fall within Data 1 & Data 2", then without knowing what this is, there very low likelihood of being able to guess this correctly
I'm creating an excel spreadsheet to track when an item is received as well as when a response to the item having been received has been made (ie: my mail was delivered at 1:00pm (item received) but I didn't check the mail until 5:00pm (response to item having been received)).
I need to track both the date and time of the item being received and want to separate these in two separate columns. At the moment this translates to:
Column A: Date item received
Column B: Time item received
Column L: Date item was responded to having been received
Column M: Time item was responded to having been received
In essence I'm looking to run calculations on the response time between when the item is received and when it has been responded to (ie: average response time, number of responses in less than an hour, and even things like the number of responses that took between 2 and 3 hours where Bob was the person who responded).
The per-line pseudo code would look something like:
(Lr + Mr) - (Ar + Br) ' where L,M,A,B are the columns and 'r' is the row number.
An example, with the following data:
1. A B L M
2. 1/5/19 10:00 1/5/19 12:00
3. 1/5/19 21:00 1/6/19 1:00
4. 1/5/19 22:00 1/5/19 23:00
5. 1/6/19 3:00 1/6/19 4:00
The outcome for the average response time would be 2 hours (average(rows 2-5) = average(2, 4, 1, 1) = 2)
The number of items with an average response times would be as follows:
(<=1 hour) = 2
(>1 & <=2) = 2
(>2 & <=3) = 0
(>3) = 1
I don't know (or can find) a function that will perform this and then let me use it within something like a countifs() or averageifs() function.
While I could do this (fairly easily) in VBA, the practical implementation of this spreadsheet limits me to standard Excel. I suspect that sumproduct() will be fundamental to make this work, but I feel that I need something like a sumsum() function (which doesn't exist) and I'm not familiar with sumproduct() to better understand what to even look for to set something like this up.
If you are not so familiar with SUMPRODUCT() or the likes I would suggest one helper column. Like so:
You can see the formula used is:
=((C2+D2)-(A2+B2))
You can probably do all type of calculations on this helper column. Note, column is formatted hh:mm. However, if you want to look into SUMPRODUCT() you could think about these:
Formula in H2:
=SUMPRODUCT(--(ROUND((((A2:A5+B2:B5)-(C2:C5+D2:D5))*-24),2)<=1))
Formula in H3:
=SUMPRODUCT((ROUND((((A2:A5+B2:B5)-(C2:C5+D2:D5))*-24),2)>1)*(ROUND((((A2:A5+B2:B5)-(C2:C5+D2:D5))*-24),2)<=2))
Formula in H4:
=SUMPRODUCT((ROUND((((A2:A5+B2:B5)-(C2:C5+D2:D5))*-24),2)>2)*(ROUND((((A2:A5+B2:B5)-(C2:C5+D2:D5))*-24),2)<3))
Formula in H5:
=SUMPRODUCT(--(ROUND((((A2:A5+B2:B5)-(C2:C5+D2:D5))*-24),2)>3))
The helper column is the easiest approach. It gives you the time differences that you can then easily analyse however you want. Analysis without the helper column is possible, but the approach differs depending on what type of analysis you want to do.
For the example you provided, which is counting the number of time differences grouped into ranges, you would use the FREQUENCY function:
=FREQUENCY(C2:C5+D2:D5-A2:A5-B2:B5,F2:F4)
In F2:F4 (called the "bins"), enter the upper limit of each range you want to count. The Frequency function counts up to and including the first value, then counts from there up to and including the second value, and so on. Enter the bins as times, e.g. 1:00 for 1 hour.
Note that Frequency is an array-entered and an array-returning function. This you means you need to first select the range that will contain all output values, G2:G5 in this example, then enter the function, then press CTRL+SHIFT+ENTER
Also note that Frequency returns an array that is one element larger than the number of bins specified. The extra element is the count of all values greater than the largest bin specified.
I am measuring room utilization (time used/time available) from a data dump. Each row contains the available time for the day and the time used for a particular case.
The image is a simplified version of the data.
If you read the yellow and green highlights (Room 1):
In room 1, there are 200 available minutes on 1/1/2016.
Case 1 took 60 minutes, case 2 took 50 minutes.
There are 500 available minutes on 1/2/2016, and only one case occurred that day, using 350 minutes.
Room 1 utilization = (60 + 50 + 350)/(200 + 500)
The problem with summing the available time is that it double counts the 200 minutes for 1/1/2016, giving: Utilization = (60+50+350)/(200+200+500)
There are hundreds of rows in this data (and there will be multiple data dumps of differing #'s of rows) with multiple cases occurring each day.
I am trying to use a pivot table, but I cannot obtain the 'sum of averages' for a particular room (see image). I am using a macro to pull the numbers out of the grand total column.
Is this possible? Do you see another way to obtain utilization?
(note: there are lots of other columns in the data, like case start, case end, day of week, etc, that are not used in this calculation but are available)
The reason that you're getting 300 for both Average of Available Time columns is because the grand total is a grand total based on the overall average and not a sum of the averages.
Room 1: 200 + 200 + 500 / 3 = 300
Room 2: 300 + 300 + 300 / 3 = 300
I could not comment on the original question, so my solution is based on a few assumptions.
Assumption #1: The data will always be grouped. E.G. All cases in room 1 on a given day will grouped in sequential rows.
Assumption #2: The available time column is a single value for the whole day, there will never be differing available times on the same day.
Solution: Use column E as the Actual Available Time. This column will use a formula to determine if the current row has a unique combination (Date + Room + Available Time) to the previous and if so, the cell will contain that row's available time.
Formula to use in E2:
=IF(AND($A1 = $A2, $B1 = $B2, $C1 = $C2), 0, $C2)
Extend the formula as far down as necessary and then include the new column in your PivotTable data range.
End Result
I created a unique reference by combining columns and then used sumif/countif/countif.
So the formula in column E would be:
=sumif(colB,cellB,ColC)/Countif(colB,cellE)/Countif(colB,cellE)
Doesn't matter if the data is in order or not then.
Extend the formula as far down as necessary and then include the new column in your PivotTable data range.
The easiest method I would recommend is this.
=SUM(H:H)-GETPIVOTDATA("Average of Available Time",$G$3)
The first term sums the H column, and the second term subtracts the grand total value. It is a dynamic solution, and will change to fit the size of the pivot table.
My assumptions are that the Pivot Table was originally placed in cell G3.