Read records from joining Hive tables with Spark - apache-spark

We can easily read records from a Hive table in Spark with this command:
Row[] results = sqlContext.sql("FROM my_table SELECT col1, col2").collect();
But when I join two tables, such as:
select t1.col1, t1.col2 from table1 t1 join table2 t2 on t1.id = t2.id
How do I retrieve the records from the above join query?

The SQLContext.sql method always returns a DataFrame, so there is no practical difference between a JOIN and any other type of query.
You shouldn't use the collect method, though, unless fetching data to the driver is really the desired outcome. It is expensive and will crash if the data cannot fit in driver memory.
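For example, a minimal Scala sketch using the Spark 2+ SparkSession API (the same pattern applies to the older sqlContext; table and column names follow the question):
import org.apache.spark.sql.SparkSession
// Assumes the Hive tables from the question already exist.
val spark = SparkSession.builder()
  .appName("hive-join-example")
  .enableHiveSupport()
  .getOrCreate()
// A join query returns a DataFrame, exactly like a single-table query.
val joined = spark.sql(
  """SELECT t1.col1, t1.col2
    |FROM table1 t1
    |JOIN table2 t2 ON t1.id = t2.id""".stripMargin)
// Keep the processing distributed instead of collecting everything to the driver:
joined.show(20)                                               // inspect a small sample
joined.write.mode("overwrite").saveAsTable("joined_result")   // or persist the result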

Related

How to join on null columns with Spark SQL?

I am running some queries with joins using Spark SQL 3.1, where the same columns in both tables can contain null values, like:
select ...
from a
join b
on a.col_with_nulls = b.col_with_nulls
and a.col_without_nulls = b.col_without_nulls
However, the join does not match rows where the values are null in the ON condition. I have also tried:
select ...
from a
join b
on a.col_with_nulls is not distinct from b.col_with_nulls
and a.col_without_nulls = b.col_without_nulls
as suggested in other solutions here, but I keep getting the same result. Any idea?
You can use <=> or eqNullSafe() in order to retain nulls in the join.
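For example, a minimal sketch of both forms, reusing the table and column names from the question:
// SQL form: <=> is null-safe equality, so NULL <=> NULL evaluates to true.
val viaSql = spark.sql(
  """SELECT a.*
    |FROM a
    |JOIN b
    |  ON a.col_with_nulls <=> b.col_with_nulls
    | AND a.col_without_nulls = b.col_without_nulls""".stripMargin)
// DataFrame form: Column.eqNullSafe (or its <=> operator alias) is the equivalent.
val aDF = spark.table("a")
val bDF = spark.table("b")
val viaApi = aDF.join(bDF,
  (aDF("col_with_nulls") <=> bDF("col_with_nulls")) &&
    (aDF("col_without_nulls") === bDF("col_without_nulls")))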

Same query resulting in different outputs in Hive vs Spark

Hive 2.3.6-mapr
Spark v2.3.1
I am running the same query:
select count(*)
from TABLE_A a
left join TABLE_B b
on a.key = b.key
and b.date > '2021-01-01'
and date_add(last_day(add_months(a.create_date, -1)),1) < '2021-03-01'
where cast(a.TIMESTAMP as date) >= '2021-01-20'
and cast(a.TIMESTAMP as date) < '2021-03-01'
But I am getting 1B rows as output in Hive, while 1.01B in Spark SQL.
From some initial analysis, it seems like all the extra rows in Spark have the TIMESTAMP column set to 2021-02-28 00:00:00.000000.
Both the TIMESTAMP and create_date columns have data type string.
What could be the reason behind this?
I will give you one possibility, but I need more information.
If you drop an external table, the data remains and Spark can still read it, but the metadata in Hive says the table doesn't exist, so Hive doesn't read it.
That could explain the difference.
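One way to check that possibility is to compare what each engine reports for the table, for example whether it is managed or external and where its data lives; a small sketch (run the equivalent DESCRIBE in the Hive CLI as well):
// Compare this output with `DESCRIBE FORMATTED TABLE_A;` run in the Hive CLI.
// The table type (managed vs external) and location rows show whether both
// engines are reading the same files through the same metastore entry.
spark.sql("DESCRIBE FORMATTED TABLE_A").show(100, truncate = false)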

Insert selective columns to Hive

I want to insert selective columns to Hive and I am unable to do so. This is what I was trying via Spark:
val df2 = spark.sql("SELECT Device_Version,date, SUM(size) as size FROM table1 WHERE date='2019-06-13' GROUP BY date, Device_Version")
df2.createOrReplaceTempView("tempTable")
spark.sql("Insert into table2 PARTITION (date,ID) (Device_Version) SELECT Device_Version, date, '1' AS ID FROM tempTable")
My aim is to insert only selected fields into the table t2. Table t2 has many other columns, which I want to be padded with null. I can do the padding as long as I can specify the order; I do not want the order to be taken by default.
Something like ...
spark.sql("Insert into table2 PARTITION (date,cuboid_id) (Device_Version,OS) SELECT Device_Version, null as os, date, '10001' AS CUBOID_ID FROM tempTable")
Is there any way to do this? Any options are welcome.
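One possible workaround, in line with the padding idea above, is to project every column of the target table in its schema order and pad the unused ones with null literals; a sketch, where col_a and col_b are hypothetical stand-ins for table2's remaining columns and the partition columns come last:
// col_a and col_b are hypothetical placeholders for table2's other columns;
// with dynamic partitioning the partition columns (date, ID) go last in the SELECT.
spark.sql(
  """INSERT INTO table2 PARTITION (date, ID)
    |SELECT
    |  Device_Version,
    |  CAST(NULL AS STRING) AS col_a,
    |  CAST(NULL AS STRING) AS col_b,
    |  date,
    |  '1' AS ID
    |FROM tempTable""".stripMargin)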

Perform a Correlated Scalar SubQuery in Spark Dataframe Java API (spark v2.3.0)

I have read that in Spark you can easily do a correlated scalar subquery like so:
select
column1,
(select column2 from table2 where table2.some_key = table1.id)
from table1
What I have not figured out is how to do this in the DataFrame API. The best I can come up with is to do a join. The problem with this is that in my specific case I am joining with an enum-like lookup table that actually applies to more than one column.
Below is an example of the DataFrame code.
Dataset<Row> table1 = getTable1FromSomewhere();
Dataset<Row> table2 = getTable2FromSomewhere();
table1
  .as("table1")
  .join(table2.as("table2"),
    col("table1.first_key").equalTo(col("table2.key")), "left")
  .join(table2.as("table3"),
    col("table1.second_key").equalTo(col("table3.key")), "left")
  .select(col("table1.*"),
    col("table2.description").as("first_key_description"),
    col("table3.description").as("second_key_description"))
  .show();
Any help would be greatly appreciated on figuring out how to do this in the DataFrame API.
What I have not figured out is how to do this in the DataFrame API.
Because there is simply no DataFrame API that can express that directly (without an explicit JOIN). This may change in the future:
https://issues.apache.org/jira/browse/SPARK-23945
https://issues.apache.org/jira/browse/SPARK-18455
Does SparkSQL support subquery?
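If falling back to SQL is acceptable, one sketch (Scala here, but the same calls exist on the Java Dataset API) is to register the Datasets as temp views and run the correlated scalar subquery through spark.sql; note that Spark generally requires a correlated scalar subquery to be aggregated so it returns a single value per outer row, so max is used as a placeholder aggregate:
// Sketch only: assumes a SparkSession named `spark`.
table1.createOrReplaceTempView("table1")
table2.createOrReplaceTempView("table2")
val result = spark.sql(
  """SELECT
    |  column1,
    |  (SELECT max(column2) FROM table2 WHERE table2.some_key = table1.id) AS column2
    |FROM table1""".stripMargin)
result.show()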

Spark Dataset with Subselect in Where Condition

I am trying to recreate a SQL query in Spark SQL. Normally I would insert into a table like this:
INSERT INTO Table_B
(
primary_key,
value_1,
value_2
)
SELECT DISTINCT
primary_key,
value_1,
value_2
FROM
Table_A
WHERE NOT EXISTS
(
SELECT 1 FROM
Table_B
WHERE
Table_B.primary_key = Table_A.primary_key
);
Spark SQL is straightforward and I can load data from a TempView into a new Dataset. Unfortunately, I don't know how to reconstruct the WHERE clause.
Dataset<Row> Table_B = spark.sql("SELECT DISTINCT primary_key, value_1, value_2 FROM Table_A").where("NOT EXISTS ... ???" );
Queries with NOT EXISTS in T-SQL can be rewritten as a LEFT JOIN with a WHERE filter:
SELECT Table_A.*
FROM Table_A Left Join Table_B on Table_B.primary_key = Table_A.primary_key
Where Table_B.primary_key is null
Maybe a similar approach can be used in Spark, with a left join. For example, for DataFrames, something like:
aDF.join(bDF, aDF("primary_key") === bDF("primary_key"), "left_outer").filter(isnull(col("other_b_not_nullable_column")))
Spark SQL doesn't currently have EXISTS & IN (see "(Latest) Spark SQL / DataFrames and Datasets Guide / Supported Hive Features").
EXISTS & IN can always be rewritten using JOIN or LEFT SEMI JOIN: "Although Apache Spark SQL currently does not support IN or EXISTS subqueries, you can efficiently implement the semantics by rewriting queries to use LEFT SEMI JOIN." OR can always be rewritten using UNION, and AND NOT can be rewritten using EXCEPT.
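As a concrete illustration, the NOT EXISTS above can also be expressed with a left anti join (the complement of the LEFT SEMI JOIN mentioned in the quote), which keeps only the Table_A rows that have no match in Table_B; a sketch using the question's table and column names:
// "left_anti" keeps rows of tableA with no matching primary_key in tableB,
// which matches the NOT EXISTS semantics of the original INSERT query.
val tableA = spark.table("Table_A")
val tableB = spark.table("Table_B")
val missingRows = tableA
  .join(tableB, tableA("primary_key") === tableB("primary_key"), "left_anti")
  .select("primary_key", "value_1", "value_2")
  .distinct()
// The result can then be appended to Table_B, e.g. via insertInto or a temp view + INSERT.
missingRows.write.insertInto("Table_B")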

Resources