How to join on null columns with Spark Sql? - apache-spark

I am running some queries with joins using Spark SQL 3.1 where the join columns in both tables can contain null values, like:
select ...
from a
join b
on a.col_with_nulls = b.col_with_nulls
and a.col_without_nulls = b.col_without_nulls
However, the on condition does not match rows where the columns contain null values. I have also tried:
select ...
from a
join b
on a.col_with_nulls is not distinct from b.col_with_nulls
and a.col_without_nulls = b.col_without_nulls
as suggested in other solutions here, but I keep getting the same result. Any ideas?

You can use <=> (in SQL) or eqNullSafe() (in the DataFrame API) to match null values in the join.
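For example, a minimal sketch in Scala, assuming spark is the active SparkSession and dfA/dfB are the DataFrames behind tables a and b (the column names are the ones from the question):

val joined = dfA.join(
  dfB,
  dfA("col_with_nulls").eqNullSafe(dfB("col_with_nulls")) &&
    (dfA("col_without_nulls") === dfB("col_without_nulls"))
)

// Equivalent SQL form, using the null-safe equality operator <=>
spark.sql("""
  select *
  from a
  join b
    on a.col_with_nulls <=> b.col_with_nulls
   and a.col_without_nulls = b.col_without_nulls
""")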

Related

How to avoid key column name duplication in join?

I'm trying to join two tables in Spark SQL. Each table has 50+ columns, and both have a column id as the key.
spark.sql("select * from tbl1 join tbl2 on tbl1.id = tbl2.id")
The joined table has duplicated id column.
We can of course specify which id column to keep like below:
spark.sql("select tbl1.id, .....from tbl1 join tbl2 on tbl1.id = tbl2.id")
But since we have so many columns in both tables, I do not want to type out all the other column names in the query above (other than id, there are no other duplicated column names).
What should I do? Thanks.
If id is the only column name in common, you can take advantage of the USING clause:
spark.sql("select * from tbl1 join tbl2 using (id) ")
The USING clause matches columns that have the same name in both tables; with select *, the matched column appears only once.
Assuming you want to preserve the "duplicates", you can try to use an internal row id or an equivalent. This has helped me in the past when I had to delete exactly one of two identical rows.
select *, ctid from table;
In PostgreSQL this also outputs the internal row id, so previously identical rows become distinguishable. I don't know about spark.sql, but I assume you can access a similar attribute there.
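Spark has no ctid, but a rough stand-in for telling otherwise-identical rows apart is monotonically_increasing_id(); a minimal sketch, assuming spark is the active session and tbl1 is the table from the question (the ids are only unique within a run, not stable physical addresses):

import org.apache.spark.sql.functions.monotonically_increasing_id

// Tag every row with a unique id so duplicate rows become distinguishable.
val tbl1WithRowId = spark.sql("select * from tbl1")
  .withColumn("row_id", monotonically_increasing_id())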
val joined = spark
  .sql("select * from tbl1")
  .join(
    spark.sql("select * from tbl2"),
    Seq("id"),
    "inner" // optional
  )
joined should have only one id column. Tested with Spark 2.4.8

Mulitple tables join in Hive getting error - Both left and right aliases encountered in join

I am trying to join 3 tables. Following are the table details.
I am expecting following results
Here is my query; it fails with the error "both left and right aliases encountered in join 'id'".
This is caused by joining the 3rd table with both the 1st and 2nd tables (the last full join statement).
select coalesce(a.id,b.id,c.id) as id,
ref1,ref2,ref3
from v_cmo_test1 a
FULL JOIN v_cmo_test2 b on (a.id = b.id)
FULL JOIN v_cmo_test3 c on (c.id in (a.id,b.id))
If I use the query below instead, id 3 is repeated in the result, which I don't want.
select coalesce(a.id,b.id,c.id) as id,
ref1,ref2,ref3
from v_cmo_test1 a
FULL JOIN v_cmo_test2 b on a.id = b.id
FULL JOIN v_cmo_test3 c on c.id = a.id
Could anyone help me achieve the expected results? I'd really appreciate your help.
Thanks, Babu
This is a very tricky requirement. The data comes out incorrect because you are using test1 as the driving table, so the outer joins don't work properly, and the same can happen with other tables. So I am joining two tables at a time to achieve what you want.
select coalesce(inner_sq.id, c.id) as id, ref1, ref2, ref3
from
  (select coalesce(a.id, b.id) as id, ref1, ref2
   from v_cmo_test1 a
   FULL JOIN v_cmo_test2 b on a.id = b.id
  ) inner_sq
FULL JOIN v_cmo_test3 c on c.id = inner_sq.id
The inner_sq query output:
1,bab,kim
2,xxx,yyy
3,,mmm
When you full join this with test3, you should get the expected output.

Perform a Correlated Scalar SubQuery in Spark Dataframe Java API (spark v2.3.0)

I have read that in spark you can easily do a correlated scalar subquery like so:
select
column1,
(select column2 from table2 where table2.some_key = table1.id)
from table1
What I have not figured out is how to do this in the DataFrame API. The best I can come up with is to do a join. The problem with this is that in my specific case I am joining with an enum-like lookup table that actually applies to more than one column.
Below is an example of the DataFrame code.
import static org.apache.spark.sql.functions.col;

Dataset<Row> table1 = getTable1FromSomewhere();
Dataset<Row> table2 = getTable2FromSomewhere();
table1
.as("table1")
.join(table2.as("table2"),
col("table1.first_key").equalTo(col("table2.key")), "left")
.join(table2.as("table3"),
col("table1.second_key").equalTo(col("table3.key")), "left")
.select(col("table1.*"),
col("table2.description").as("first_key_description"),
col("table3.description").as("second_key_description"))
.show();
Any help would be greatly appreciated on figuring out how to do this in the DataFrame API.
What I have not figured out is how to do this in the DataFrame API.
Because there is simply no DataFrame API that can express that directly (without an explicit JOIN). It may change in the future:
https://issues.apache.org/jira/browse/SPARK-23945
https://issues.apache.org/jira/browse/SPARK-18455
Does SparkSQL support subquery?
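In the meantime, one hedged workaround is to register the Datasets as temporary views and run the SQL form from the question through spark.sql, since Spark SQL itself does support correlated scalar subqueries; a minimal sketch in Scala (view and column names are the ones assumed in the question):

// Fall back to the SQL form instead of the DataFrame API.
table1.createOrReplaceTempView("table1")
table2.createOrReplaceTempView("table2")

val withLookup = spark.sql("""
  select
    column1,
    (select column2 from table2 where table2.some_key = table1.id) as column2
  from table1
""")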

Joining two dataframes in Spark

When I'm trying to join two data frames using
DataFrame joindf = dataFrame.join(df, df.col(joinCol)); //.equalTo(dataFrame.col(joinCol)));
My program throws the exception below:
org.apache.spark.sql.AnalysisException: join condition 'url' of type
string is not a boolean.;
Here the joinCol value is url.
I need input on what could be causing this exception.
The join variants that take a Column as the second argument expect it to evaluate to a boolean expression.
If you want a simple equi-join based on a column name, use the variant that takes the column name as a String:
String joinCol = "foo";
dataFrame.join(df, joinCol);
What that means is that the join condition must evaluate to a boolean expression. Let's say we want to join two DataFrames on id; we can do the following:
With Python:
df1.join(df2, df1['id'] == df2['id'], 'left')  # 3rd parameter is the join type, in this case a left join
With Scala:
df1.join(df2, df1("id") === df2("id"))  // inner join on the id column
You cannot use df.col(joinCol) on its own because it is not a boolean expression. In order to join two DataFrames you need to specify the columns you want to join on.
Let's say you have DataFrames empDF and deptDF; joining them looks like the following in Scala:
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"inner")
.show(false)
This example is taken from Spark SQL Join DataFrames

Read records from joining Hive tables with Spark

We can easily read records from a Hive table in Spark with this command:
Row[] results = sqlContext.sql("FROM my_table SELECT col1, col2").collect();
But when I join two tables, such as:
select t1.col1, t1.col2 from table1 t1 join table2 t2 on t1.id = t2.id
How do I retrieve the records from the above join query?
The SQLContext.sql method always returns a DataFrame, so there is no practical difference between a JOIN and any other type of query.
You shouldn't use the collect method, though, unless fetching the data to the driver is really the desired outcome. It is expensive and will crash if the data cannot fit in the driver's memory.
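For example, a minimal sketch in Scala (the query is the one from the question; sqlContext stands for whatever SQLContext or SparkSession is in scope, and the output path is illustrative):

// The join query returns an ordinary DataFrame.
val joined = sqlContext.sql(
  "select t1.col1, t1.col2 from table1 t1 join table2 t2 on t1.id = t2.id"
)

// Work with it as a distributed DataFrame instead of collect()-ing to the driver.
joined.show(10)
joined.write.mode("overwrite").parquet("/tmp/joined_output")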
