What does the build side mean in the query plan text below?
BroadcastHashJoin [i_item_sk#2], [ss_item_sk#25], Inner, BuildLeft
Does that mean the right table is the one that gets broadcast?
Also, can I confirm from the query plan text that the table containing the column ss_item_sk is the right table?
Thanks.
buildSide is the side that is going to be broadcast. In your case the left relation is broadcast.
Not all join types allow both sides to be broadcast:
inner join - both sides can be broadcast
full outer join - BHJ is not supported
right outer join - only the left side can be broadcast
left outer, left semi, left anti - only the right side can be broadcast
Also, can I confirm from the query plan text that the table containing the column ss_item_sk is the right table?
Yes
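For illustration, here is a minimal Spark SQL sketch of how the build side shows up in the plan. The table names item and store_sales are assumptions (only the column names appear in the plan above); the broadcast hint marks which relation becomes the build side.
-- Hint that the left relation (item) should be broadcast, then inspect the plan.
EXPLAIN
SELECT /*+ BROADCAST(i) */ i.i_item_sk, s.ss_item_sk
FROM item i
JOIN store_sales s
  ON i.i_item_sk = s.ss_item_sk;
-- The physical plan should contain a line like
-- BroadcastHashJoin [i_item_sk], [ss_item_sk], Inner, BuildLeft
-- where the build side is the broadcast (left) relation.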
I can see that I can only do Left outer, Right outer, Inner, Full outer and Cross joins in Data Flow.
I can't see a Left Anti or Right Anti join in Data Flow, so how do I perform those joins, as in SQL, in Azure Data Factory?
As mentioned in the documentation, Data Flow offers built-in join types for Full outer, Inner, Left outer, Right outer and Cross joins, as shown in the image below.
However, as per your requirement, Data Flow has no built-in anti joins (left anti, right anti) like SQL does.
For anti joins in Data Flow, you can set up join conditions as shown in the image below.
In the join conditions, as shown in the image below, select source1 in the left column and source2 in the right column; between them you choose a comparison operator from [==, !=, <, >, <=, >=, ===, <=>]. You can use whichever of these operators your requirement calls for in the left join.
I created the data flow shown below, taking source1 as employee data and source2 as depart, and combined these sources using a left join.
After choosing the left join, in the join conditions I set column1 to employee data and column2 to depart, and used the === operator in the filter.
After applying the left join and join condition, below is the output I got.
Here is the Source1 = employee data input:
Source2 = depart input:
Alternative method:
As of now it is not possible with the Join transformation, but you can try it by using Exists.
Source1 in dataflow
Source2 data in dataflow
Next, both sources are combined with the Exists transformation. In the image below you can find the Exist type and Exists conditions; here the Exist type is set to Doesn't exist.
After validation you can see the required left anti join output in the Data preview, as shown below.
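For reference, the plain SQL equivalent of this left anti join (what the Doesn't exist option reproduces) looks roughly like this; the key column id is hypothetical:
-- Left anti join in plain SQL: rows from source1 with no match in source2.
SELECT s1.*
FROM source1 s1
LEFT JOIN source2 s2
  ON s1.id = s2.id
WHERE s2.id IS NULL;
-- Equivalent NOT EXISTS form, matching the "Doesn't exist" Exists type:
SELECT s1.*
FROM source1 s1
WHERE NOT EXISTS (SELECT 1 FROM source2 s2 WHERE s2.id = s1.id);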
I have 2 tables; both have the same 2 primary keys (anys_mes_dia and aircraftreg), and each table has other attributes. I want to join both tables by the 2 PKs.
The thing is, for some [anys_mes_dia, aircraftreg] pairs I have all the attributes of both tables, but for others I only have the attributes of one table.
How can I join these tables so as to get [anys_mes_dia, aircraftreg, dy, add, cn], with nulls only in the attributes that a specific row doesn't have?
Here is an image of what I have (some rows only have aircraftreg_1, any_mes_dia1 and CN).
In the Merge join step you have the option to define the type of join; in this case you could use the LEFT/RIGHT OUTER join (depending on which table is leading) to get the results you want.
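In SQL terms, the outer-join semantics the Merge join step applies look roughly like the sketch below; a FULL OUTER variant covers keys that exist in only one of the two tables. It assumes dy and add come from the first table and cn from the second (add is quoted because ADD is a reserved word in most dialects):
SELECT COALESCE(t1.anys_mes_dia, t2.anys_mes_dia) AS anys_mes_dia,
       COALESCE(t1.aircraftreg,  t2.aircraftreg)  AS aircraftreg,
       t1.dy,
       t1."add",
       t2.cn
FROM table1 t1
FULL OUTER JOIN table2 t2
  ON  t1.anys_mes_dia = t2.anys_mes_dia
  AND t1.aircraftreg  = t2.aircraftreg;
-- COALESCE keeps the key columns populated even when a row
-- exists in only one of the two tables.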
Is it possible to approximate the size of a derived table (in KB/MB/GB etc.) in a Spark SQL query? I don't need the exact size; an approximate value will do. It would let me plan my queries better by determining whether a table could be broadcast in a join, or whether using a filtered subquery in a join would be better than using the entire table.
For example, in the following query, is it possible to approximate the size (in MB) of the derived table named b? This will help me figure out whether it is better to use the derived table in the join vs using the entire table with the filter outside -
select
a.id, b.name, b.cust
from a
left join (select id, name, cust
from tbl
where size > 100
) b
on a.id = b.id
We use Spark SQL 2.4. Any comments appreciated.
I have had to do something similar before (to work out how many partitions to split into when writing).
What we ended up doing was working out an average row size, doing a count on the DataFrame, and then multiplying the average row size by the row count.
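A rough Spark SQL sketch of that approach, applied to the derived table b from the question. The per-row byte estimate assumes name and cust are strings and id is a 4-byte integer; LENGTH counts characters, which is a reasonable proxy for bytes here. Adjust the expression for your actual schema (and wrap nullable columns in COALESCE if needed):
SELECT
  COUNT(*)                                     AS row_count,
  AVG(4 + LENGTH(name) + LENGTH(cust))         AS approx_bytes_per_row,
  COUNT(*) * AVG(4 + LENGTH(name) + LENGTH(cust)) / 1024 / 1024 AS approx_size_mb
FROM tbl
WHERE size > 100;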
There is a table in Parquet format of 20 GB, and a simple query will return results by scanning only 1 GB of data.
select columns from table1 where id in (id1, id2, idn)
If the same query is executed with a sub-query such as -
select columns from table1 where id in (select id from table2 limit n)
this query will return results by scanning 20 GB, the whole table, even if n is a very small number such as 10, 50 or 5000.
The same happens with a LEFT JOIN.
SELECT table1.* FROM
table2 LEFT JOIN table1
ON table2.id=table1.id
Is there a way to achieve this by running a single query, instead of fetching and saving the result of the sub-query and passing it as arguments into another query?
Are there any best practices for how users currently run a LEFT JOIN or sub-query on Athena without a full table scan?
Similar questions: Question 1, Question 2
Is there a way to achieve this by running a single query, instead of fetching and saving the result of the sub-query and passing it as arguments into another query?
This is most commonly covered by "Dynamic filtering".
Currently there is no way to do this.
Athena is based on Presto and Presto doesn't support dynamic filtering yet, but will likely support it in the next release (Presto 321). You can track the issue here: https://github.com/prestosql/presto/issues/52
Athena is currently based on Presto 0.172, so it still needs to upgrade.
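Until then, the workaround is the two-step approach already mentioned in the question: run the sub-query on its own, then inline its result as literals so Athena only scans the matching data. A sketch with placeholder ids:
-- Step 1: fetch the ids separately.
select id from table2 limit n
-- Step 2: paste the returned ids into the main query as literals,
-- which scans only the relevant data (as in the first query above).
select columns from table1 where id in (id1, id2, idn)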
I wanted to know if there's a way to join two or more result sets into one.
I actually need to execute more than one query and return just one result set. I can't use the UNION or JOIN operators because I'm working with Cassandra (CQL).
Thanks in advance!
Frameworks like PlayOrm provide support for JOIN (INNER and LEFT JOIN) queries in Cassandra.
http://buffalosw.com/wiki/Command-Line-Tool/
You may see more examples at:
https://github.com/deanhiller/playorm/blob/master/src/test/java/com/alvazan/test/TestJoins.java
If you're wanting to query multiple rows within the same column family, you can use the IN keyword:
SELECT * FROM testCF WHERE key IN ('rowKeyA', 'rowKeyB', 'rowKeyZ') LIMIT 10;
This will get you back 10 results from each row.
If you're needing to join results from different CFs, or query with differing WHERE clauses, then you need to run multiple queries and merge the results in code - Cassandra doesn't cater for that kind of thing.
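For example, two separate CQL queries against different column families (the names and keys here are hypothetical), whose result sets you would then combine in your application code:
-- Run each query independently; CQL has no JOIN or UNION.
SELECT * FROM users WHERE user_id IN ('a1', 'b2');
SELECT * FROM orders WHERE user_id IN ('a1', 'b2');
-- Merge the two result sets on user_id in application code.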
PlayOrm can do joins, but you may need to have PlayOrm partitioning on so you still scale (i.e. you don't want to join 1 billion rows with 1 billion rows). Typically instead you join one partition with another partition, or a partition on the Account table with a partition on the Users table, i.e. make sure you still design for scale.