Physical Plan and Optimizing a Non-Equi Join in Spark SQL

I am using Spark SQL 2.4.0. I have a couple of tables as below:
CUST table:
id | name | age | join_dt
-------------------------
12 | John | 25 | 2019-01-05
34 | Pete | 29 | 2019-06-25
56 | Mike | 35 | 2020-01-31
78 | Alan | 30 | 2020-02-25
REF table:
eff_dt
------
2020-01-31
The requirement is to select all the records from CUST whose join_dt is <= eff_dt in the REF table. So, for this simple requirement, I put together the following query:
version#1:
select
c.id,
c.name,
c.age,
c.join_dt
from cust c
inner join ref r
on c.join_dt <= r.eff_dt;
Now, this produces a BroadcastNestedLoopJoin (BNLJ) in the physical plan, and as a result the query takes a long time to run.
Question 1:
Is there a better way to implement the same logic without inducing a BNLJ, so that the query executes faster? Is it possible to avoid the BNLJ altogether?
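One possible rewrite (a sketch only, not a verified fix): since REF holds a single row, the boundary date can be supplied as an uncorrelated scalar subquery. Spark evaluates the subquery separately and folds the result into a plain filter on CUST, so no join operator is needed at all:
select c.id, c.name, c.age, c.join_dt
from cust c
where c.join_dt <= (select max(eff_dt) from ref);
The max() is only there to guarantee the subquery yields a single scalar; if REF always contains exactly one row, it simply returns that row's eff_dt.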
Part 2:
Now I broke the query into two parts:
version#2:
select c.id, c.name, c.age, c.join_dt
from cust c
inner join ref r
on c.join_dt = r.eff_dt --equi join
union all
select c.id, c.name, c.age, c.join_dt
from cust c
inner join ref r
on c.join_dt < r.eff_dt; --theta join
Now, the physical plan for version#1 shows that the CUST table is scanned only once, whereas the plan for version#2 indicates that the same input table CUST is scanned twice (once for each of the two queries combined by the union). Yet, to my surprise, version#2 executes faster than version#1.
Question 2:
How does version#2 execute faster than version#1, even though version#2 scans the table twice (as opposed to once in version#1) and both versions induce a BNLJ?
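For reference, each version's plan can be inspected with EXPLAIN (EXTENDED also prints the parsed, analyzed, and optimized logical plans), e.g. for version#1:
explain extended
select c.id, c.name, c.age, c.join_dt
from cust c
inner join ref r
on c.join_dt <= r.eff_dt;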
Can anyone please clarify? Let me know if additional information is required.
Thanks.

Related

Cassandra MAX function returning mismatched rows

Hi, I am trying to get the max co-author publication count from a table in Cassandra, but it returns mismatched rows when I query:
select coauthor_name, MAX(num_of_colab) AS max_2020 from coauthor_by_author where pid = '40/2499' and year = 2020;
It returns a single row (result screenshot omitted) which is wrong, because the max value 9 belongs to another co-author.
Here is my create statement for the table:
CREATE TABLE IF NOT EXISTS coauthor_by_author (
pid text,
year int,
coauthor_name text,
num_of_colab int,
PRIMARY KEY ((pid), year, coauthor_name, num_of_colab)
) WITH CLUSTERING ORDER BY (year desc);
As proof, here is part of the original table (screenshot omitted): as you can see, Abdul Hanif Bin Zaini's publication count as a co-author should only be 1.
The MAX() function is working as advertised but I think your understanding of how it works is incorrect. Let me illustrate with an example.
Here is the schema for my table of authors:
CREATE TABLE authors_by_coauthor (
author text,
coauthor text,
colabs int,
PRIMARY KEY (author, coauthor)
)
Here is a sample data of authors, their corresponding co-authors, and the number of times they collaborated:
author | coauthor | colabs
---------+-----------+--------
edda | ramakanta | 5
edda | ruzica | 9
anita | dakarai | 8
anita | sophus | 12
anita | uche | 4
cassius | ceadda | 14
cassius | flaithri | 13
Anita has three co-authors:
cqlsh> SELECT * FROM authors_by_coauthor WHERE author = 'anita';
author | coauthor | colabs
--------+----------+--------
anita | dakarai | 8
anita | sophus | 12
anita | uche | 4
And the top number of collaborations for Anita is 12:
cqlsh> SELECT MAX(colabs) FROM authors_by_coauthor WHERE author = 'anita';
system.max(colabs)
--------------------
12
Similarly, Cassius has two co-authors:
cqlsh> SELECT * FROM authors_by_coauthor WHERE author = 'cassius';
author | coauthor | colabs
---------+----------+--------
cassius | ceadda | 14
cassius | flaithri | 13
with 14 as the most collaborations:
cqlsh> SELECT MAX(colabs) FROM authors_by_coauthor WHERE author = 'cassius';
system.max(colabs)
--------------------
14
Your question is incomplete since you haven't provided the full sample data, but I suspect you're expecting to get the name of the co-author with the most collaborations. This CQL query will NOT return the result you're after:
SELECT coauthor_name, MAX(num_of_colab)
FROM coauthor_by_author
WHERE ...
In SELECT coauthor_name, MAX(num_of_colab), you are incorrectly assuming that the returned coauthor_name corresponds to the MAX(num_of_colab). Aggregate functions only ever return ONE row, so the result set only ever contains one co-author. The co-author Abdul ... just happens to be the first row in the result, and so it is listed alongside the MAX() output.
When using aggregate functions, it only makes sense to specify the function in the SELECT statement on its own:
SELECT function(col_name) FROM table WHERE ...
Specifying other columns in the query selectors is meaningless with aggregate functions. Cheers!
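If what you are after is the co-author with the most collaborations, one common Cassandra approach (a sketch only; the table name coauthor_by_colabs is made up here, and it assumes you are free to maintain a second, query-specific table) is to cluster on the count itself so the top row can be read directly:
CREATE TABLE IF NOT EXISTS coauthor_by_colabs (
pid text,
year int,
num_of_colab int,
coauthor_name text,
PRIMARY KEY ((pid, year), num_of_colab, coauthor_name)
) WITH CLUSTERING ORDER BY (num_of_colab DESC, coauthor_name ASC);
SELECT coauthor_name, num_of_colab
FROM coauthor_by_colabs
WHERE pid = '40/2499' AND year = 2020
LIMIT 1;
With the rows clustered by num_of_colab DESC, LIMIT 1 returns the co-author with the highest collaboration count for that author and year.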

access each group in spark keyvaluegroupeddataset

I am new to Spark and I have a specific question about how to use Spark to address my problem, which may be simple.
Problem:
I have a model which predicts the sales of products. Each product also belongs to a category, like shoes, clothes, etc., and we have the actual sales data as well. So the data look like this:
+------------+----------+-----------------+--------------+
| product_id | category | predicted_sales | actual_sales |
+------------+----------+-----------------+--------------+
| pid1 | shoes | 100.0 | 123 |
| pid2 | hat | 232 | 332 |
| pid3 | hat | 202 | 432 |
+------------+----------+-----------------+--------------+
What I'd like to do is: for each category, calculate the number (or percentage) of products in the intersection between the top 5% of products ranked by actual_sales and the top 5% ranked by predicted_sales.
Doing this over all products, instead of per category, would be easy; something like below:
def getIntersectionRatio(df: DataFrame, per: Int): Double = {
  // take the top `per` percent by each metric, then count the overlap
  val limitNum = (df.count() * per / 100.0).toInt
  val intersection = df.orderBy(col("actual_sales").desc).limit(limitNum)
    .join(df.orderBy(col("predicted_sales").desc).limit(limitNum), Seq("product_id"), "inner")
  intersection.count() * 100.0 / limitNum
}
However, I need to calculate the intersection for each category. The result will be something like this
+-----------+------------------------+
| Category | intersection_percentage|
+-----------+------------------------+
My ideas
User-Defined Aggregation Functions or Aggregators
I think I can achieve my goal with groupBy or groupByKey plus a UDAF or an Aggregator, but these seem too inefficient: they consume one row at a time, so I would have to buffer every row inside the UDAF or Aggregator.
df.groupBy("category").agg(myUdaf)
class myUdaf extends UserDefinedAggregateFunction {
  // Save all the rows into an ArrayBuffer,
  // then transform the buffer back into a DataFrame,
  // and do the same thing as getIntersectionRatio above does for the whole product set.
}
Self-implemented partitioning
I can select the distinct categories and then use map to process each category, joining each row with df to obtain that category's partition:
df.select("category").distinct.map(myfun(df))
def myfun(df: DataFrame)(row: Row): Row = {
  val dfRow = row.toDF // not supported as such, but feasible with other APIs
  val group = df.join(broadcast(dfRow), Seq("category"), "inner")
  getIntersectionRatio(group)
}
Do we have a better solution for this?
Thanks in advance!
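One possible alternative (a sketch under assumptions: Spark DataFrame API, ties ignored, and the percentage taken relative to the size of the top bucket): window functions can rank products inside each category, after which a single aggregation computes the overlap, avoiding any row-by-row buffering:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

def intersectionByCategory(df: DataFrame, pct: Double = 0.05): DataFrame = {
  // Rank each product within its category by both metrics (percent_rank: 0.0 = best).
  val byActual = Window.partitionBy("category").orderBy(desc("actual_sales"))
  val byPredicted = Window.partitionBy("category").orderBy(desc("predicted_sales"))
  df.withColumn("r_actual", percent_rank().over(byActual))
    .withColumn("r_pred", percent_rank().over(byPredicted))
    .groupBy("category")
    .agg(
      // products in the top pct by BOTH metrics, as a share of the top-pct bucket
      (sum(when(col("r_actual") <= pct && col("r_pred") <= pct, 1).otherwise(0)) * 100.0 /
        sum(when(col("r_actual") <= pct, 1).otherwise(0))).as("intersection_percentage")
    )
}
This keeps the whole computation inside the DataFrame engine; each category is still shuffled into a single window partition, but no per-group ArrayBuffer is needed.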

COGNOS Report: COUNT IF

I am not sure how to go about creating a custom field to count instances given a condition.
I have a field, ID, that exists in two formats:
A#####
B#####
I would like to create two columns (one for A and one for B) and count instances by month, something like COUNTIF ID STARTS WITH 'A' for the first column, resulting in the table below. Right now I can only create a table with the total count.
+-------+------+------+
| Month | ID A | ID B |
+-------+------+------+
| Jan | 100 | 10 |
+-------+------+------+
| Feb | 130 | 13 |
+-------+------+------+
| Mar | 90 | 12 |
+-------+------+------+
Define ID A as...
CASE
WHEN ID LIKE 'A%' THEN 1
ELSE 0
END
...and set the Default aggregation property to Total.
Do the same for ID B.
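For completeness, the matching expression for ID B follows the same pattern (a sketch; adjust the column name to your model):
CASE
WHEN ID LIKE 'B%' THEN 1
ELSE 0
END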
Apologies if I misunderstood the requirement, but you may be able to spin the list into a crosstab using the section option on the toolbar; your measure value would be count(ID).
Try this:
Query 1 to count A, filtering by substring(ID,1,1) = 'A'
Query 2 to count B, filtering by substring(ID,1,1) = 'B'
Join Query 1 and Query 2 by Year/Month
List by Month with Count A and Count B

Excel : Get the most frequent value for each group

I have a table (Excel) with two columns (Time 'hh:mm:ss', Value) and I want to get the most frequent value for each group of rows.
For example, I have:
Time | Value
4:35:49 | 122
4:35:49 | 122
4:35:50 | 121
4:35:50 | 121
4:35:50 | 111
4:35:51 | 122
4:35:51 | 111
4:35:51 | 111
4:35:51 | 132
4:35:51 | 132
And I want to get the most frequent value for each Time:
Time | Value
4:35:49 | 122
4:35:50 | 121
4:35:51 | 132
Thanks in advance
UPDATE
The first answer from Scott, using the helper column, is the correct one (worksheet screenshot omitted).
You could use a helper column. First, in column C I put:
=COUNTIFS($A$2:$A$11,A2,$B$2:$B$11,B2)
Then in F2 I put the following Array Formula:
=INDEX($B$2:$B$11,MATCH(MAX(IF($A$2:$A$11=E2,IF($C$2:$C$11 = MAX(IF($A$2:$A$11=E2,$C$2:$C$11)),$B$2:$B$11))),$B$2:$B$11,0))
It is an array formula and must be confirmed with Ctrl-Shift-Enter. Then copied down.
I set it up like this: (worksheet screenshot omitted)
Here is one way to do this in MS Access:
select tv.*
from (select time, value, count(*) as cnt
from t
group by time, value
) as tv
where exists (select 1
from (select top 1 time, value, count(*) as cnt
from t as t2
where t.time = t2.time
group by time, value
order by count(*) desc, value desc
) as x
where x.time = tv.time and x.value = tv.value
);
MS Access doesn't support features such as window functions or CTEs that make this type of query easier in other databases.
Would that work? I haven't tried it, but I was inspired by a similar approach:
;WITH t3 AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY time ORDER BY c DESC, value DESC) AS rn
FROM (SELECT COUNT(*) AS c, time, value FROM t GROUP BY time, value) AS t2
)
SELECT *
FROM t3
WHERE rn = 1

Is there a way to make clustering order by data type and not string in Cassandra?

I created a table in CQL3 in the cqlsh using the following CQL:
CREATE TABLE test (
locationid int,
pulseid int,
name text, PRIMARY KEY(locationid, pulseid)
) WITH CLUSTERING ORDER BY (locationid ASC, pulseid DESC);
Note that locationid is an integer.
However, after I inserted data, and ran a select, I noticed that locationid's ascending sort seems to be based upon string, and not integer.
cqlsh:citypulse> select * from test;
locationid | pulseid | name
------------+---------+------
0 | 3 | test
0 | 2 | test
0 | 1 | test
0 | 0 | test
10 | 3 | test
5 | 3 | test
Note the ordering: 0, 10, 5. Is there a way to make it sort by its actual data type?
Thanks,
Allison
In Cassandra, the first part of the primary key is the 'partition key'. That key is used to distribute data around the cluster: rows are placed according to a hash of the partition key, which achieves an even distribution but looks random when you read the data back. This means that you cannot order results by the first part of your primary key.
What version of Cassandra are you on? In the most recent release of 1.2 (1.2.2), the create statement you have used as an example is invalid, because the partition key may not appear in the CLUSTERING ORDER BY clause.
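As a sketch of the corrected DDL (only the clustering column may appear in the clustering order clause; the partition key cannot):
CREATE TABLE test (
locationid int,
pulseid int,
name text,
PRIMARY KEY (locationid, pulseid)
) WITH CLUSTERING ORDER BY (pulseid DESC);
Rows within each locationid partition then come back ordered by pulseid descending; the order of the partitions themselves is still determined by the partitioner's hash, not by locationid's integer value.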
