Getting the latest value in a time range or null - apache-spark

I have a huge data set e.g.
+------------+----+-------+
| Date       | ID | Value |
+------------+----+-------+
| 10-10-2020 | 1  | 1     |
| 10-11-2020 | 1  | 2     |
| 10-12-2020 | 1  | 3     |
| 10-13-2020 | 1  | 4     |
| 10-10-2020 | 2  | 5     |
| 10-11-2020 | 2  | 6     |
| 10-12-2020 | 2  | 7     |
| 10-09-2020 | 3  | 8     |
| 10-08-2020 | 4  | 9     |
+------------+----+-------+
As you can see, this example contains 4 IDs with different date ranges.
I have special logic that calculates some derived values with the RangeBetween function. Let's assume it is a simple sum over the defined time range.
What I need to do is to generate the following result (explained below):
| ID | Value sum (last 2 days) | Value sum (last 4 days) | Value sum (prev 2 days) | Value sum (prev 4 days) | Result (2 days) | Result (4 days) |
+----+-------------------------+-------------------------+-------------------------+-------------------------+-----------------+-----------------+
| 1  | 7 (3+4)                 | 10 (1+2+3+4)            | 5 (3+2)                 | 6 (3+2+1)               | 7               | 10              |
| 2  | 7                       | 18 (5+6+7)              | 11 (5+6)                | 11 (5+6)                | 7               | 18              |
| 3  | null                    | null                    | null                    | 8                       | null            | 0               |
| 4  | null                    | null                    | null                    | null                    | null            | null            | // excluded
This example assumes that today is 10-13-2020.
For each ID I need to get the sum of the values over 2 ranges: 2 and 4 days.
1. the table contains 2 calculations for the same ranges, starting from today and from the day before (the "last X days" and "prev X days" columns)
2. if all values exist in a range, simply return the sum of the range (example with ID = 1)
3. if some of the values are missing in a range, treat them as zero (example with ID = 2)
4. if no values exist in the defined range, but there is at least 1 value in the same range shifted one day back, assume there was a sum yesterday but none today, and set the result to zero (example with ID = 3)
5. if there are no values in the range or in the range shifted one day back, do not include the ID in the result set (example with ID = 4)
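The decision logic in rules 2-5 can be condensed as a small plain-Python sketch, where `last` stands in for the windowed sum over the current range, `prev` for the sum over the range shifted one day back, and `None` for "no rows fell into the range":

```python
def combine(last, prev):
    if last is None and prev is None:
        return None  # rule 5: exclude the ID from the result set
    if last is None:
        return 0     # rule 4: there was a sum yesterday but none today
    return last      # rules 2-3: missing values inside the range act as 0
```

For ID = 1 both sums exist, so the windowed sum itself is returned; for ID = 3 only the shifted sum exists, so the result is 0; for ID = 4 neither exists, so the row is dropped.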
Right now I have this code (note: the original snippet referenced an undefined last4Days window; the prev2daysSum column should use prev2Days):
let last2Days =
    Window
        .PartitionBy("ID")
        .OrderBy(Functions.Col("Date").Cast("timestamp").Cast("long"))
        .RangeBetween(-1, 0)
let prev2Days =
    Window
        .PartitionBy("ID")
        .OrderBy(Functions.Col("Date").Cast("timestamp").Cast("long"))
        .RangeBetween(-2, -1)
df
    .WithColumn("last2daysSum", Functions.Sum("value").Over(last2Days))
    .WithColumn("prev2daysSum", Functions.Sum("value").Over(prev2Days))
    .WithColumn("result2Days", Functions.Col("last2daysSum"))
    .Where(Functions.Col("Date").EqualTo(Functions.Lit("10-13-2020")))
This works for example #1 (when the result is taken from last2daysSum).
1. Is there a simple way to get a proper result for #2 (the latest record within the defined time range)?
2. How do I combine the previous question with the condition `if last = null && prev != null then 0 else if last = null && prev = null then null else last` (example #3)?
3. How do I exclude records as per example #4?
Is it possible to solve this without reshuffling?

For question #1: if you only want to calculate for one specific date, then a groupBy and agg is simpler and should execute faster. The trick is to use when inside aggregate functions like sum.
For questions #2 and #3: you can coalesce to zero, and filter out fully-null rows before that. If you need to filter on a broader range than you want to display (i.e. include rows that had values some days before but have none now), you can add an extra aggregation for the longer period and drop it after filtering. See the code example below.
import org.apache.spark.sql.functions._

val data = Seq(
  ("2020-10-10", 1, 1),
  ("2020-10-11", 1, 2),
  ("2020-10-12", 1, 3),
  ("2020-10-13", 1, 4),
  ("2020-10-10", 2, 5),
  ("2020-10-11", 2, 6),
  ("2020-10-12", 2, 7),
  ("2020-10-09", 3, 8),
  ("2020-10-08", 4, 9)
).toDF("Date", "ID", "Value").withColumn("Date", to_date($"Date"))

def sumLastNDays(now: java.sql.Timestamp, start: Int, end: Int = 0) =
  sum(when($"Date".between(date_sub(lit(now), start - 1), date_sub(lit(now), end)), $"Value"))

val now = java.sql.Timestamp.valueOf("2020-10-13 00:00:00")

data
  .groupBy($"ID")
  .agg(
    sumLastNDays(now, 2).as("last2DaysSum"),
    sumLastNDays(now, 4).as("last4DaysSum"),
    sumLastNDays(now, 4, 2).as("prev2DaysSum"),
    sumLastNDays(now, 5).as("last5DaysSum")
  )
  .filter($"last5DaysSum".isNotNull)
  .drop($"last5DaysSum")
  .withColumn("last4DaysSum", coalesce($"last4DaysSum", lit(0)))
  .withColumn("last2DaysSum", coalesce($"last2DaysSum", lit(0)))
  .withColumn("prev2DaysSum", coalesce($"prev2DaysSum", lit(0)))
  .orderBy($"ID")
  .show()
Result:
+---+------------+------------+------------+
| ID|last2DaysSum|last4DaysSum|prev2DaysSum|
+---+------------+------------+------------+
| 1| 7| 10| 3|
| 2| 7| 18| 11|
| 3| 0| 0| 0|
+---+------------+------------+------------+
Note: I'm not sure if you meant prev2Days to be the previous 2-day interval before the current 2-day interval, or yesterday's last-2-days interval, because in the expected results table ID 1 has Oct. 11-12 summed and ID 2 has Oct. 10-11 summed for prev2Days; either way you can adjust the range parameters if you want something else. I assumed that prev2Days does not overlap with last2Days; just change it to sumLastNDays(now, 3, 1) if you want overlapping 2-day ranges.
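For readers without a Spark session handy, the same conditional-sum-per-ID idea can be sketched in pandas (an illustration with the column names of the Scala example, not the answer's code):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2020-10-10", "2020-10-11", "2020-10-12", "2020-10-13",
        "2020-10-10", "2020-10-11", "2020-10-12",
        "2020-10-09", "2020-10-08",
    ]),
    "ID": [1, 1, 1, 1, 2, 2, 2, 3, 4],
    "Value": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})
now = pd.Timestamp("2020-10-13")

def sum_last_n_days(g, start, end=0):
    # Sum Value where Date is in [now - (start - 1) days, now - end days];
    # NaN when no row falls into the window (like Spark's sum over all nulls).
    lo, hi = now - pd.Timedelta(days=start - 1), now - pd.Timedelta(days=end)
    vals = g.loc[g["Date"].between(lo, hi), "Value"]
    return vals.sum() if len(vals) else float("nan")

rows = {}
for id_, g in df.groupby("ID"):
    rows[id_] = {
        "last2DaysSum": sum_last_n_days(g, 2),
        "last4DaysSum": sum_last_n_days(g, 4),
        "prev2DaysSum": sum_last_n_days(g, 4, 2),
        "last5DaysSum": sum_last_n_days(g, 5),
    }
out = pd.DataFrame.from_dict(rows, orient="index")
out = out[out["last5DaysSum"].notna()].drop(columns="last5DaysSum").fillna(0)
```

As in the Spark version, the extra 5-day sum only exists to drop IDs with no values in the filtering window (ID 4), and is removed afterwards.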

Related

Pandas groupby compare count equal values in 2 columns in excel with subrows

I have an excel file like this:
.----.-------------.-------------------------.-----------------.
| | ID | Shareholder - Last name | DM Cognome |
:----+-------------+-------------------------+-----------------:
| 1. | 01287560153 | MASSIRONI | Bocapine Ardaya |
:----+-------------+-------------------------+-----------------:
| | | CAGNACCI | |
:----+-------------+-------------------------+-----------------:
| 2. | 05562881002 | | Directors |
:----+-------------+-------------------------+-----------------:
| 3. | 04113870655 | SABATO | Sabato |
:----+-------------+-------------------------+-----------------:
| | | VILLARI | |
:----+-------------+-------------------------+-----------------:
| 4. | 01419190846 | SALMERI | Salmeri |
:----+-------------+-------------------------+-----------------:
| | | MICALIZZI | Lipari |
:----+-------------+-------------------------+-----------------:
| | | LIPARI | |
'----'-------------'-------------------------'-----------------'
I open this file with pandas and forward-fill the ID column, since there are subrows. Then I group by ID to get the count of equal values across the "Shareholder - Last name" and "DM\nCognome" columns. However, I can't get it to work. In this case the result should be 0 for row 1, 0 for row 2, 1 for row 3 and 2 for row 4.
It should be noted that row 4 consists of 3 subrows and row 3 consists of 2 subrows.
I have 2 questions:
1. What is the best way to read an unorganised Excel file like the one above and do lots of comparisons, value replacements, etc.?
2. How can I achieve the results I mentioned earlier?
Here is what I did, but it doesn't work:
data['ID'] = data['ID'].fillna(method='ffill')
data.groupby('ID', sort=False, as_index=False)['Shareholder - Last name', 'DM\nCognome'].apply(lambda x: (x['Shareholder - Last name']==x['DM\nCognome']).count())
First, read in the table (keeping ID as a string instead of a float):
df = pd.read_excel("Workbook1.xlsx", converters={'ID': str})
df = df.drop("Unnamed: 0", axis=1)  # drop this column since it is not useful
Forward-fill the ID column and, where a shareholder is missing, replace NaN with "missing":
df['ID'] = df['ID'].fillna(method='ffill')
df["Shareholder - Last name"] = df["Shareholder - Last name"].fillna("missing")
Convert the surnames to lowercase:
df["Shareholder - Last name"] = df["Shareholder - Last name"].str.lower()
A custom function counts how many shareholder surnames occur in the other column:
def f(group):
    s = pd.Series(group["DM\nCognome"].str.lower())
    count = 0
    for surname in group["Shareholder - Last name"]:
        count += s.str.count(surname).sum()
    return count
And finally, get the count for each ID:
df.groupby("ID", sort=False)[["Shareholder - Last name", "DM\nCognome"]].apply(lambda x: f(x))
Output:
ID
01287560153 0.0
05562881002 0.0
04113870655 1.0
01419190846 2.0
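An alternative sketch of the same per-group matching, using exact substring counts over a joined string instead of per-surname str.count sums; the two-group sample data here is hypothetical, trimmed from the question's table:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["04113870655"] * 2 + ["01419190846"] * 3,
    "Shareholder - Last name": ["SABATO", "VILLARI", "SALMERI", "MICALIZZI", "LIPARI"],
    "DM\nCognome": ["Sabato", None, "Salmeri", "Lipari", None],
})

def count_matches(group):
    # Join the non-null DM surnames into one lowercase string, then count
    # how often each distinct shareholder surname occurs in it.
    dm = group["DM\nCognome"].dropna().str.lower().str.cat(sep=" ")
    surnames = set(group["Shareholder - Last name"].str.lower())
    return sum(dm.count(s) for s in surnames)

counts = {id_: count_matches(g) for id_, g in df.groupby("ID", sort=False)}
```

Looping over `groupby` groups avoids relying on `GroupBy.apply` passing the grouping column through, which newer pandas versions warn about.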

How to divide two cells based on match?

In a table 1, I have,
+---+---+----+
| | A | B |
+---+---+----+
| 1 | A | 30 |
| 2 | B | 20 |
| 3 | C | 15 |
+---+---+----+
On table 2, I have
+---+---+---+----+
| | A | B | C |
+---+---+---+----+
| 1 | A | 2 | 15 |
| 2 | A | 5 | 6 |
| 3 | B | 4 | 5 |
+---+---+---+----+
I want to divide the matching number from table 1 by the number in the second column, based on the match, and put the result in the third column. The numbers shown in the third column (15, 6, 5) are the expected results: 30/2 = 15, 30/5 = 6, 20/4 = 5.
What formula must I apply in the third column of table 2?
Please help me with this.
Thanks in advance.
You can use a VLOOKUP() formula to fetch the dividend (assuming table 1 is on Sheet1 and table 2 is on Sheet2, where we are entering this formula):
=VLOOKUP(A1,Sheet1!A:B, 2, FALSE)/Sheet2!B1
Since you mention tables: with structured references (though it seems you are not applying those here), the equivalent would be:
=VLOOKUP([#Column1],Table1[#All],2,0)/[#Column2]
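For illustration only, the same lookup-and-divide can be sketched in pandas (hypothetical column names), where a left merge plays the role of VLOOKUP:

```python
import pandas as pd

table1 = pd.DataFrame({"key": ["A", "B", "C"], "value": [30, 20, 15]})
table2 = pd.DataFrame({"key": ["A", "A", "B"], "divisor": [2, 5, 4]})

# Left-merge looks each value up by key (the VLOOKUP), then divide it.
merged = table2.merge(table1, on="key", how="left")
merged["result"] = merged["value"] / merged["divisor"]
```

This reproduces the expected third column: 30/2 = 15, 30/5 = 6, 20/4 = 5.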

How to determine the order in summarize and calculate values in power pivot

I am very new to Power Pivot and there is one thing I haven't been able to understand fully. I have a table consisting of Week, Value 1 and Value 2.
I want to first sum all the values for weeks 1, 2, 3 and so forth, and then divide the sum of Value 1 by the sum of Value 2. However, when I create a measure, Power Pivot first divides Value 1 by Value 2 on each row and then sums the results.
This is probably a very basic question, but if someone could shed some light on this for me I would be more than happy.
It is not clear what resulting table you would like to see, and this is important to understand in order to determine the correct DAX for a measure.
However given the following input data in table "tablename"
| Week | Value 1 | Value 2 |
| 2018 w1 | 200 | 4 |
| 2018 w2 | 300 | 5 |
| 2018 w3 | 250 | 3 |
| 2018 w4 | 100 | 4 |
The most obvious measure would be
Value1 by Value2 =
divide
( calculate(sum('tablename'[Value 1]))
, calculate(sum('tablename'[Value 2]))
)
This would mean that if you brought this into a table with Week in the context then you would get the following
| Week | Value 1 | Value 2 | Value1 by Value2 |
| 2018 w1 | 200 | 4 | 50 |
| 2018 w2 | 300 | 5 | 60 |
| 2018 w3 | 250 | 3 | 83.33 |
| 2018 w4 | 100 | 4 | 25 |
or if you used this for all weeks your table would be
| Value1 by Value2 |
| 53.125 |
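The sum-first-then-divide behaviour of the measure (versus a row-by-row calculated column) can be illustrated in plain Python, sketched over the same table values:

```python
import pandas as pd

df = pd.DataFrame({
    "Week": ["2018 w1", "2018 w2", "2018 w3", "2018 w4"],
    "Value 1": [200, 300, 250, 100],
    "Value 2": [4, 5, 3, 4],
})

# Row by row (what a calculated column would give per week): 50, 60, 83.33..., 25
per_week = df["Value 1"] / df["Value 2"]

# What the measure does for "all weeks": sum both columns first, then divide.
all_weeks = df["Value 1"].sum() / df["Value 2"].sum()  # 850 / 16 = 53.125
```

With Week in the filter context, each row's sums collapse to single values, which is why the measure also reproduces the per-week ratios in the weekly table.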

pandas - create new columns based on existing columns / conditional average

I am new to pandas and I am trying to learn column creation based on conditions applied to already existing columns. I am working with cellular data, and this is what my source data looks like (the 2 columns on the right are empty to begin with):
DEVICE_ID | MONTH | TYPE | DAY | COUNT | LAST_MONTH| SEASONAL_AVG
8129 | 201601 | VOICE | 1 | 8 | |
8129 | 201502 | VOICE | 1 | 5 | |
8129 | 201501 | VOICE | 1 | 2 | |
8321 | 201403 | DATA | 3 | 1 | |
2908 | 201302 | TEXT | 5 | 4 | |
8129 | 201406 | VOICE | 2 | 3 | |
8129 | 201306 | VOICE | 2 | 7 | |
3096 | 201501 | DATA | 5 | 6 | |
8129 | 201301 | VOICE | 1 | 2 | |
I created a dataframe with this data and named it df.
df = pd.DataFrame({
    'DEVICE_ID': [8129, 8129, 8129, 8321, 2908, 8129, 8129, 3096, 8129],
    'MONTH': [201601, 201502, 201501, 201403, 201302, 201406, 201306, 201501, 201301],
    'TYPE': ['VOICE', 'VOICE', 'VOICE', 'DATA', 'TEXT', 'VOICE', 'VOICE', 'DATA', 'VOICE'],
    'DAY': [1, 1, 1, 3, 5, 2, 2, 5, 1],
    'COUNT': [8, 5, 2, 1, 4, 3, 7, 6, 2],
})
I am trying to create two additional columns to df: 'LAST_MONTH' and 'SEASONAL_AVG'. Logic for these two columns:
LAST_MONTH: for the corresponding DEVICE_ID & TYPE & DAY combination, return the previous month's COUNT. E.g. for row 1 (DEVICE_ID: 8129, TYPE: VOICE, DAY: 1, MONTH: 201502), LAST_MONTH will be the COUNT from row 2 (DEVICE_ID: 8129, TYPE: VOICE, DAY: 1, MONTH: 201501). If there is no record for the previous month, LAST_MONTH will be zero.
SEASONAL_AVG: for the corresponding DEVICE_ID & TYPE & DAY combination, return the average of the corresponding month from all previous years (data starts from 201301). E.g. SEASONAL_AVG for row 0 = the average of the COUNTs of rows 2 and 8. There will always be at least one record for the corresponding month from the past. It need not exist for all TYPE and DAY combinations, but at least some of the possible combinations will be present for all DEVICE_IDs.
Your help is greatly appreciated! Thanks!
EDIT1:
def last_month(record):
    year = int(str(record['MONTH'])[:4])
    month = int(str(record['MONTH'])[-2:])
    if month in (2, 3, 4, 5, 6, 7, 8, 9, 10):
        x = str(0) + str(month - 1)
        y = int(str(year) + str(x))
        last_month = int(y)
    elif month == 1:
        last_month = int(str(year - 1) + str(12))
    else:
        last_month = int(str(year) + str(month - 1))
    day = record['DAY']
    cellular_type = record['TYPE']
    # return record['COUNT']
    return record['COUNT'][(record['MONTH'] == last_month) & (record['DAY'] == day) & (record['TYPE'] == cellular_type)]

df['last_month'] = df.apply(lambda record: last_month(record), axis=1)
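One way the LAST_MONTH column could be built without a row-wise apply is a self-merge: shift each row's month back by one period, then join on DEVICE_ID, TYPE, DAY and the shifted month. This is a hypothetical sketch of that approach, not a fix of the code above:

```python
import pandas as pd

df = pd.DataFrame({
    'DEVICE_ID': [8129, 8129, 8129, 8321, 2908, 8129, 8129, 3096, 8129],
    'MONTH': [201601, 201502, 201501, 201403, 201302, 201406, 201306, 201501, 201301],
    'TYPE': ['VOICE', 'VOICE', 'VOICE', 'DATA', 'TEXT', 'VOICE', 'VOICE', 'DATA', 'VOICE'],
    'DAY': [1, 1, 1, 3, 5, 2, 2, 5, 1],
    'COUNT': [8, 5, 2, 1, 4, 3, 7, 6, 2],
})

# Previous calendar month as a YYYYMM integer, handling the January wrap.
months = pd.to_datetime(df['MONTH'].astype(str), format='%Y%m').dt.to_period('M')
df['PREV'] = (months - 1).dt.strftime('%Y%m').astype(int)

# Self-join: a row's LAST_MONTH is the COUNT of the row one month earlier
# for the same DEVICE_ID & TYPE & DAY; missing previous months become 0.
prev = df[['DEVICE_ID', 'TYPE', 'DAY', 'MONTH', 'COUNT']].rename(
    columns={'MONTH': 'PREV', 'COUNT': 'LAST_MONTH'})
out = df.merge(prev, on=['DEVICE_ID', 'TYPE', 'DAY', 'PREV'], how='left')
out['LAST_MONTH'] = out['LAST_MONTH'].fillna(0).astype(int)
```

Row 1 (MONTH 201502) picks up the COUNT of 2 from the 201501 row, and rows with no previous-month record get 0, matching the rule in the question.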

How to assign a given value randomly in Excel?

I have 30 people waiting for a result out of 10. I want to randomly assign each of the 30 people a result of 7, 8 or 10.
The people in Excel are in:
A1:A30
The possible results are in:
B1, C1, D1 //Which means B1=8, C1=7 and D1=10
The chosen random number should go in:
E1 to E30
To get a random integer in a range, use the randbetween(start,end) function. This will produce a random integer between the start and end parameters inclusively. Since your numbers are not contiguous, you can simply index them and perform a lookup using vlookup(randbetween(startindex,endindex),...) to get a random value from the table.
Use the following steps to get your desired result:
List the people in column A
Create a lookup table in columns G and H containing your desired result values.
In your result column (column E in the example below), add the formula: =vlookup(randbetween(1,3),G:H,2,false)
Column E will now contain either the numbers 7,8,or 10 for each person.
If you want to generalize this and allow any number of different values in your result lookup table, you can change the formula in column E to: =vlookup(randbetween(1,counta(G:G)-1),G:H,2,false).
Note: The -1 is only needed if your lookup table has a header row.
This will select a random value from all non-empty rows in your result lookup table.
In the example below, I added a header row to row 1, and the people start in row 2, for clarity.
+---+---------+---+---+---+--------+---+-----------+--------------+
| | A | B | C | D | E | F | G | H |
+---+---------+---+---+---+--------+---+-----------+--------------+
| 1 | Names | | | | RESULT | | Result ID | RESULT Value |
+---+---------+---+---+---+--------+---+-----------+--------------+
| 2 | Person1 | | | | 7 | | 1 | 7 |
+---+---------+---+---+---+--------+---+-----------+--------------+
| 3 | Person2 | | | | 7 | | 2 | 8 |
+---+---------+---+---+---+--------+---+-----------+--------------+
| 4 | Person3 | | | | 10 | | 3 | 10 |
+---+---------+---+---+---+--------+---+-----------+--------------+
| 5 | Person4 | | | | 8 | | | |
+---+---------+---+---+---+--------+---+-----------+--------------+
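The indexed-lookup idea translates directly to other languages; here is a quick Python sketch mirroring the G:H table (hypothetical, for illustration only):

```python
import random

# Result ID -> Result Value, as in columns G and H of the example.
lookup = {1: 7, 2: 8, 3: 10}

# One random result per person, like =VLOOKUP(RANDBETWEEN(1,3),G:H,2,FALSE).
results = [lookup[random.randint(1, len(lookup))] for _ in range(30)]
```

As with the Excel formula, adding a fourth value only requires extending the lookup table; the upper bound of the random draw follows its size.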
