Spark - Map udf to windows in spark dataframe - apache-spark

Problem Statement:
I have to group InputDf by multiple columns (accountGuid, appID, deviceGuid, deviceMake) and order each group by time.
I need to check whether the rows of the test Df appear in the exact same sequence within each window.
If they do, I want to create a new dataframe with the columns below:
"project", "accountGuid", "appID", "deviceGuid", "deviceMake", "testCase", "result"
eg:
project | accountGuid | appID | deviceGuid | deviceMake | testCase | result
Profiles | guid1 | 9.0.77 | dGuid1 | AndroidTV | Select Admin Profile | pass
Input Df:
accountGuid | appID | deviceGuid | deviceMake | tabName | screenName | actionName | time
guid1 | 9.0.77 | dGuid1 | AndroidTV | Profiles | Profiles | Click - Profile Admin | 9:01
guid2 | 9.0.77 | dGuid2 | AndroidPhone | Profiles | Profiles | Click - Profile Admin | 8:00
guid1 | 9.0.77 | dGuid1 | AndroidTV | Profiles | Profiles | Page Load | 9:03
guid1 | 9.0.77 | dGuid1 | AndroidTV | Profiles | Profiles | Click - Add Profile | 9:05
guid3 | 9.0.77 | dGuid3 | FireTV | Profiles | Add Profile | Click - Name | 9:02
guid2 | 9.0.77 | dGuid2 | AndroidPhone | Profiles | Profiles | Page Load | 8:03
Test Df:
testCase - Select Admin Profile
tabName | screenName | actionName
Profiles | Profiles | Click - Profile Admin
Profiles | Profiles | Page Load
Approach taken:
I created a window over the grouping columns and assigned each window a unique guid, along with a row_number:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{expr, first, row_number}
val windowSpec = Window.partitionBy("accountGuid", "appID", "deviceGuid", "deviceMake")
  .orderBy("time")
val partitionedDf = inputDf.withColumn("groupguid", first(expr("uuid()")).over(windowSpec))
  .withColumn("row_number", row_number().over(windowSpec))
partitionedDf Output:
accountGuid | appID | deviceGuid | deviceMake | tabName | screenName | actionName | time | groupguid | row_number
guid1 | 9.0.77 | dGuid1 | AndroidTV | Profiles | Profiles | Click - Profile Admin | 9:01 | 1234 | 1
guid1 | 9.0.77 | dGuid1 | AndroidTV | Profiles | Profiles | Page Load | 9:03 | 1234 | 2
guid1 | 9.0.77 | dGuid1 | AndroidTV | Profiles | Profiles | Click - Add Profile | 9:05 | 1234 | 3
guid2 | 9.0.77 | dGuid2 | AndroidPhone | Profiles | Profiles | Click - Profile Admin | 8:00 | 1456 | 1
guid2 | 9.0.77 | dGuid2 | AndroidPhone | Profiles | Profiles | Page Load | 8:03 | 1456 | 2
guid3 | 9.0.77 | dGuid3 | FireTV | Profiles | Add Profile | Click - Name | 9:02 | 8907 | 1
Getting all the unique IDs into a list:
val uniqueIds = partitionedDf
  .select("groupguid")
  .distinct()
  .map(row => row.getString(0)) // needs spark.implicits._ in scope for the String encoder
  .collect()
  .toList
Looping over all windows and checking whether the rows of testDf are present, and present in the exact sequence:
for (id <- uniqueIds) { // loop to get each state; each iteration runs its own Spark job
  val windowDf = partitionedDf
    .filter($"groupguid" === id)
  val groupingDf = windowDf.select($"tabName", $"screenName", $"actionName", $"row_number")
  val isAvailable = areRowsAvailable(groupingDf, testDf)
  if (isAvailable) {
    // check if the test dataframe is present in the correct sequence
  }
}
// True when every row of testDf has a match in groupingDf (the left_anti join leaves no rows)
def areRowsAvailable(groupingDf: DataFrame, testDf: DataFrame): Boolean = {
  val df = testDf.join(groupingDf,
    testDf("tabName") === groupingDf("tabName") &&
    testDf("screenName") === groupingDf("screenName") &&
    testDf("actionName") === groupingDf("actionName"), "left_anti")
  df.rdd.isEmpty()
}
With this approach, the loop visits the windows one at a time, so the processing happens sequentially, not all executors are used, and the job takes close to a day to finish for an inputDf of 20GB.
Expectation: I want to know how to distribute the work across all windows in the dataframe rather than looping over each one and processing them sequentially.
As I am new to Spark, I am not sure whether there is a better way of doing this. Should I create a UDF for it?
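The per-group check described in the problem statement can be sketched outside Spark to make the intended logic concrete. The following is a minimal plain-Python illustration, not a Spark implementation: the function names (`minutes`, `contains_in_order`, `evaluate`, `make_event`) are hypothetical, "exact sequence" is assumed to mean a contiguous run of rows, and the `project` value is passed in because the question does not say where it comes from.

```python
from collections import defaultdict

GROUP_KEYS = ("accountGuid", "appID", "deviceGuid", "deviceMake")
STEP_KEYS = ("tabName", "screenName", "actionName")


def minutes(clock):
    """Turn an 'H:MM' string into minutes so that '10:00' sorts after '9:01'."""
    hours, mins = clock.split(":")
    return int(hours) * 60 + int(mins)


def contains_in_order(steps, pattern):
    """True if `pattern` occurs as one contiguous run inside `steps` (assumption)."""
    n = len(pattern)
    return any(steps[i:i + n] == pattern for i in range(len(steps) - n + 1))


def evaluate(events, pattern, project, test_case):
    # Bucket the events by the four grouping columns.
    groups = defaultdict(list)
    for ev in events:
        groups[tuple(ev[k] for k in GROUP_KEYS)].append(ev)

    results = []
    for key, rows in groups.items():
        # Order each group by time, then reduce it to its (tab, screen, action) steps.
        rows.sort(key=lambda ev: minutes(ev["time"]))
        steps = [tuple(ev[k] for k in STEP_KEYS) for ev in rows]
        if contains_in_order(steps, pattern):
            results.append((project, *key, test_case, "pass"))
    return results


def make_event(account, device, make, screen, action, time):
    # appID and tabName are constant in the question's sample data.
    return {"accountGuid": account, "appID": "9.0.77", "deviceGuid": device,
            "deviceMake": make, "tabName": "Profiles", "screenName": screen,
            "actionName": action, "time": time}


events = [
    make_event("guid1", "dGuid1", "AndroidTV", "Profiles", "Click - Profile Admin", "9:01"),
    make_event("guid2", "dGuid2", "AndroidPhone", "Profiles", "Click - Profile Admin", "8:00"),
    make_event("guid1", "dGuid1", "AndroidTV", "Profiles", "Page Load", "9:03"),
    make_event("guid1", "dGuid1", "AndroidTV", "Profiles", "Click - Add Profile", "9:05"),
    make_event("guid3", "dGuid3", "FireTV", "Add Profile", "Click - Name", "9:02"),
    make_event("guid2", "dGuid2", "AndroidPhone", "Profiles", "Page Load", "8:03"),
]
pattern = [
    ("Profiles", "Profiles", "Click - Profile Admin"),
    ("Profiles", "Profiles", "Page Load"),
]

results = evaluate(events, pattern, project="Profiles", test_case="Select Admin Profile")
for row in results:
    print(row)
```

On the sample data this reports a pass for the guid1 and guid2 groups and nothing for guid3. Because the check only needs one group's ordered rows at a time, the same function could in principle be applied per group in parallel instead of in a driver-side loop; how best to wire that into Spark is left open here.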

Related

SQL - HELP - Write a query to get the Full name, email id, phone of tenants who are married and paying rent > 9000 using subqueries

I have two tables to query. Below is the query I have written; I need help linking the two subqueries together.
select FIRST_NAME+ ' '+ LAST_NAME as FULL_NAME,PHONE,EMAIL
FROM PROFILES
WHERE PROFILE_ID IN
((
  SELECT PROFILE_ID
  FROM PROFILES
  WHERE MARITIAL_STATUS= 'Y' ) and
  ( SELECT PROFILE_ID
  FROM TENANCY_HISTORIES
  WHERE RENT> '9000'));
You are using and inside the list of IDs passed to the IN clause. Give each subquery its own IN and combine the two conditions instead. Since you want tenants who are married and paying rent > 9000, join them with and:
select FIRST_NAME+ ' '+ LAST_NAME as FULL_NAME,PHONE,EMAIL
FROM PROFILES
WHERE
(PROFILE_ID IN
  (
  SELECT PROFILE_ID
  FROM PROFILES
  WHERE MARITIAL_STATUS= 'Y' ) and PROFILE_ID IN
  ( SELECT PROFILE_ID
  FROM TENANCY_HISTORIES
  WHERE RENT> '9000'));
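Since the question asks for tenants who are married and pay rent above 9000, both IN conditions have to hold for the same PROFILE_ID. A minimal sketch with Python's built-in sqlite3 module shows the shape of such a query. The table layout and sample rows are assumptions made up for illustration, RENT is assumed to be numeric, and SQLite concatenates strings with `||` rather than `+`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and rows; only the column names come from the question.
conn.executescript("""
CREATE TABLE PROFILES (
    PROFILE_ID INTEGER PRIMARY KEY,
    FIRST_NAME TEXT, LAST_NAME TEXT, PHONE TEXT, EMAIL TEXT,
    MARITIAL_STATUS TEXT
);
CREATE TABLE TENANCY_HISTORIES (PROFILE_ID INTEGER, RENT INTEGER);
INSERT INTO PROFILES VALUES
    (1, 'Ann', 'Lee', '111', 'ann@example.com', 'Y'),
    (2, 'Bob', 'Ray', '222', 'bob@example.com', 'N'),
    (3, 'Cy',  'Fox', '333', 'cy@example.com',  'Y');
INSERT INTO TENANCY_HISTORIES VALUES (1, 9500), (2, 12000), (3, 8000);
""")

QUERY = """
SELECT FIRST_NAME || ' ' || LAST_NAME AS FULL_NAME, PHONE, EMAIL
FROM PROFILES
WHERE PROFILE_ID IN (SELECT PROFILE_ID FROM PROFILES WHERE MARITIAL_STATUS = 'Y')
  AND PROFILE_ID IN (SELECT PROFILE_ID FROM TENANCY_HISTORIES WHERE RENT > 9000)
"""

married_high_rent = conn.execute(QUERY).fetchall()
print(married_high_rent)
```

Only Ann satisfies both conditions here: Bob pays enough rent but is not married, and Cy is married but pays less than 9000.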

BigQuery Struct Aggregation

I am processing an ETL job on BigQuery, where I am trying to reconcile data that may come from conflicting sources. I first used array_agg(distinct my_column ignore nulls) to find out where reconciliation was needed; next I need to prioritize data per column based on the data source.
I thought to use array_agg(struct(data_source, my_column)) and hoped I could easily extract the preferred source's data for a given column. However, with this method I failed to aggregate the data as a single struct and instead got an array of structs.
Consider the simplified example below, where I would prefer to get job_title from HR and dietary_pref from Canteen:
with data_set as (
select 'John' as employee, 'Senior Manager' as job_title, 'vegan' as dietary_pref, 'HR' as source
union all
select 'John' as employee, 'Manager' as job_title, 'vegetarian' as dietary_pref, 'Canteen' as source
union all
select 'Mary' as employee, 'Marketing Director' as job_title, 'pescatarian' as dietary_pref, 'HR' as source
union all
select 'Mary' as employee, 'Marketing Manager' as job_title, 'gluten-free' as dietary_pref, 'Canteen' as source
)
select employee,
array_agg(struct(source, job_title)) as job_title,
array_agg(struct(source, dietary_pref)) as dietary_pref,
from data_set
group by employee
The data I get for John with regard to the job title is:
[{'source':'HR', 'job_title':'Senior Manager'}, {'source': 'Canteen', 'job_title':'Manager'}]
Whereas I am trying to achieve:
[{'HR' : 'Senior Manager', 'Canteen' : 'Manager'}]
With a struct output, I was hoping to then easily access the preferred source using my_struct.my_preferred_source. In this particular case I hope to invoke job_title.HR and dietary_pref.Canteen.
Hence, in pseudo-SQL, I imagine I would write:
select employee,
AGGREGATE_JOB_TITLE_AS_STRUCT(source, job_title).HR as job_title,
AGGREGATE_DIETARY_PREF_AS_STRUCT(source, dietary_pref).Canteen as dietary_pref,
from data_set group by employee
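The output would then be:
employee | job_title | dietary_pref
John | Senior Manager | vegetarian
Mary | Marketing Director | gluten-free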
I'd like help here solving this. Perhaps that's the wrong approach altogether, but given the more complex data set I am dealing with I thought this would be the preferred approach (albeit failed).
Open to alternatives. Please advise. Thanks
Notes: I edited this post after Mikhail's answer, which solved my problem using a slightly different method than I expected, and added more details on my intent to use a single struct per employee
Consider below
select employee,
array_agg(struct(source as job_source, job_title) order by if(source = 'HR', 1, 2) limit 1)[offset(0)].*,
array_agg(struct(source as dietary_source, dietary_pref) order by if(source = 'HR', 2, 1) limit 1)[offset(0)].*
from data_set
group by employee
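If applied to the sample data in your question, the output is:
employee | job_source | job_title | dietary_source | dietary_pref
John | HR | Senior Manager | Canteen | vegetarian
Mary | HR | Marketing Director | Canteen | gluten-free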
Update:
use below for clarified output
select employee,
array_agg(job_title order by if(source = 'HR', 1, 2) limit 1)[offset(0)] as job_title,
array_agg(dietary_pref order by if(source = 'HR', 2, 1) limit 1)[offset(0)] as dietary_pref
from data_set
group by employee
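with output:
employee | job_title | dietary_pref
John | Senior Manager | vegetarian
Mary | Marketing Director | gluten-free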
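The idea behind the answer, picking each column's value from its preferred source and falling back to whatever else is available, can also be sketched in plain Python. This is only an illustration of the selection rule, not BigQuery code. The names `reconcile` and `PREFERRED_SOURCE` are hypothetical, and the fallback here is simply the first value seen for that employee.

```python
rows = [
    {"employee": "John", "job_title": "Senior Manager", "dietary_pref": "vegan", "source": "HR"},
    {"employee": "John", "job_title": "Manager", "dietary_pref": "vegetarian", "source": "Canteen"},
    {"employee": "Mary", "job_title": "Marketing Director", "dietary_pref": "pescatarian", "source": "HR"},
    {"employee": "Mary", "job_title": "Marketing Manager", "dietary_pref": "gluten-free", "source": "Canteen"},
]

# Which source wins for which column (taken from the question's stated preference).
PREFERRED_SOURCE = {"job_title": "HR", "dietary_pref": "Canteen"}


def reconcile(rows, preferred_source):
    """Return {employee: {column: value}} using the preferred source per column."""
    result = {}
    for row in rows:
        picked = result.setdefault(row["employee"], {})
        for column, preferred in preferred_source.items():
            is_preferred = row["source"] == preferred
            # Keep the first value seen as a fallback; the preferred source overrides it.
            if column not in picked or is_preferred:
                picked[column] = row[column]
    return result


reconciled = reconcile(rows, PREFERRED_SOURCE)
for employee, values in reconciled.items():
    print(employee, values)
```

For the sample rows this yields Senior Manager and vegetarian for John, and Marketing Director and gluten-free for Mary, which matches the preference stated in the question.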

Python: sqlalchemy - map result only

I have some SQL to run that is not based on a single table. Below is one example (on SQLite):
SELECT C.REGION METRICSCOPE, C.METRIC METRICOPTION, ROUND(1.0*C.COUNT/T.COUNT, 4) Percentage, T.COUNT COUNT
FROM
(SELECT REGION, $metric METRIC, COUNT(*) COUNT
FROM TICKET T, USER U
WHERE T.ASSIGNEDTO = U.USERNAME
AND ASOF BETWEEN '$startDate' AND '$endDate'
GROUP BY REGION, $metric ) C,
(SELECT REGION, COUNT(*) COUNT
FROM TICKET T, USER U
WHERE ASOF BETWEEN '$startDate' AND '$endDate'
AND T.ASSIGNEDTO = U.USERNAME
GROUP BY region) T
WHERE C.REGION = T.REGION
I want to run the SQL, map the result to a class, then jsonify the class objects and return them to my webpage.
It seems to me that SQLAlchemy uses table-based mapping (each class needs to define a table name), which is not suitable for my case.
Is it possible to map only the result? I would appreciate an example.
Thanks
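One way to picture "mapping the result only" is to build plain objects from the rows a raw query returns, with no table-bound class involved. The sketch below deliberately uses the standard library (sqlite3, dataclasses, json) rather than SQLAlchemy, so it shows the idea and not SQLAlchemy's own API. The `RegionMetric` class, the sample rows, and the use of a STATUS column in place of `$metric` are all assumptions made for illustration.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass


@dataclass
class RegionMetric:
    # Field order matches the column order of the SELECT below.
    metricscope: str
    metricoption: str
    percentage: float
    count: int


conn = sqlite3.connect(":memory:")
# Hypothetical tables and rows, loosely following the question's query.
conn.executescript("""
CREATE TABLE USER (USERNAME TEXT, REGION TEXT);
CREATE TABLE TICKET (ASSIGNEDTO TEXT, STATUS TEXT, ASOF TEXT);
INSERT INTO USER VALUES ('amy', 'EMEA'), ('ben', 'EMEA');
INSERT INTO TICKET VALUES
    ('amy', 'open',   '2020-01-05'),
    ('amy', 'closed', '2020-01-06'),
    ('ben', 'open',   '2020-01-07'),
    ('ben', 'open',   '2020-01-08');
""")

QUERY = """
SELECT C.REGION, C.METRIC, ROUND(1.0 * C.N / TOT.N, 4), TOT.N
FROM (SELECT U.REGION, T.STATUS AS METRIC, COUNT(*) AS N
      FROM TICKET T JOIN USER U ON T.ASSIGNEDTO = U.USERNAME
      WHERE T.ASOF BETWEEN :start_date AND :end_date
      GROUP BY U.REGION, T.STATUS) C
JOIN (SELECT U.REGION, COUNT(*) AS N
      FROM TICKET T JOIN USER U ON T.ASSIGNEDTO = U.USERNAME
      WHERE T.ASOF BETWEEN :start_date AND :end_date
      GROUP BY U.REGION) TOT ON C.REGION = TOT.REGION
ORDER BY C.REGION, C.METRIC
"""

params = {"start_date": "2020-01-01", "end_date": "2020-01-31"}
# Map each result row positionally onto the dataclass, then serialize to JSON.
metrics = [RegionMetric(*row) for row in conn.execute(QUERY, params)]
payload = json.dumps([asdict(m) for m in metrics])
print(payload)
```

The dates are passed as bound parameters instead of being spliced into the SQL string, which avoids quoting problems. A column name such as `$metric` cannot be bound this way and would still need to be validated against a fixed list before substitution.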

Python/Peewee query with fn.MAX and alias results in "no such attribute"

I have a peewee query that looks like this:
toptx24h = Transaction.select(fn.MAX(Transaction.amount).alias('amount'), User.user_name).join(User,on=(User.wallet_address==Transaction.source_address)).where(Transaction.created > past_dt).limit(1)
My understanding is this should be equivalent to:
select MAX(t.amount) as amount, u.user_name from transaction t inner join user u on u.wallet_address = t.source_address where t.created > past_dt limit 1
My question is how do I access the resulting user_name and amount?
When I try this, I get an error saying top has no attribute named amount:
for top in toptx24h:
    top.amount # No such attribute amount
I'm just wondering how I can access the amount and user_name from the select query.
Thanks
I think you need a GROUP BY clause to ensure you're grouping by User.username.
I wrote some test code and confirmed it's working:
with self.database.atomic():
    charlie = TUser.create(username='charlie')
    huey = TUser.create(username='huey')
    data = (
        (charlie, 10.),
        (charlie, 20.),
        (charlie, 30.),
        (huey, 1.5),
        (huey, 2.5))
    for user, amount in data:
        Transaction.create(user=user, amount=amount)
amount = fn.MAX(Transaction.amount).alias('amount')
query = (Transaction
         .select(amount, TUser.username)
         .join(TUser)
         .group_by(TUser.username)
         .order_by(TUser.username))
with self.assertQueryCount(1):
    data = [(txn.amount, txn.user.username) for txn in query]
self.assertEqual(data, [
    (30., 'charlie'),
    (2.5, 'huey')])
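For reference, the grouped query in that test corresponds roughly to a MAX with GROUP BY in SQL. The sketch below runs that SQL directly with Python's sqlite3 module instead of peewee. The table names `tuser` and `txn` are stand-ins chosen for illustration (TRANSACTION is a reserved word in SQLite), and the rows mirror the test data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in tables mirroring the TUser / Transaction models from the test.
conn.executescript("""
CREATE TABLE tuser (id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE txn (
    id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES tuser(id),
    amount REAL
);
INSERT INTO tuser (id, username) VALUES (1, 'charlie'), (2, 'huey');
INSERT INTO txn (user_id, amount) VALUES
    (1, 10.0), (1, 20.0), (1, 30.0), (2, 1.5), (2, 2.5);
""")

# One row per user, carrying that user's largest transaction amount.
max_per_user = conn.execute("""
SELECT MAX(t.amount) AS amount, u.username
FROM txn AS t
JOIN tuser AS u ON u.id = t.user_id
GROUP BY u.username
ORDER BY u.username
""").fetchall()
print(max_per_user)
```

Each result row pairs a username with that user's largest amount, which is the same data the test asserts on.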

CakePHP: Pagination of complex query

I have 3 tables: accounts, transactions, statements. Account hasMany transactions & statements (foreign key is account_id).
I need to retrieve the account_name, the balance of billed transactions, the total and count of unbilled (pending) transactions, and the last statement date for all accounts with either an outstanding balance or unbilled transactions (or both).
This query returns the data as expected:
SELECT
    Account.id, Account.account_name,
    Billed.balance, Pending.items,
    Pending.amount, Stmnt.latest
FROM
    accounts AS Account
LEFT JOIN
    (SELECT
        account_id,
        (SUM(`Transaction`.`debit`) - SUM(`Transaction`.`credit`)) AS balance
    FROM
        `transactions` AS `Transaction`
    WHERE
        `Transaction`.`statement_id` > 0 AND
        `Transaction`.`void` = 0
    GROUP BY
        account_id
    ) AS Billed ON Billed.account_id = Account.id
LEFT JOIN
    (SELECT
        `account_id`,
        (SUM(`Transaction`.`debit`) - SUM(`Transaction`.`credit`)) AS amount,
        COUNT(id) AS items
    FROM
        `transactions` AS `Transaction`
    WHERE
        `Transaction`.`statement_id` = 0 AND
        `Transaction`.`void` = 0
    GROUP BY
        account_id
    ) AS Pending ON Pending.account_id = Account.id
LEFT JOIN
    (SELECT
        `account_id`,
        MAX(Statement.statement_date) AS latest
    FROM
        `statements` AS `Statement`
    WHERE
        `Statement`.`void` = 0
    GROUP BY
        account_id
    ) AS Stmnt ON Stmnt.account_id = Account.id
GROUP BY
    Account.id
HAVING
    Billed.balance > 0 OR
    Pending.items > 0
ORDER BY
    account_name ASC
I'm having difficulty successfully translating this into CakePHP 2.1 find or paginate-friendly options.
Any thoughts would be greatly appreciated.
