I am trying to recreate a SQL query in Spark SQL. Normally I would insert into a table like this:
INSERT INTO Table_B
(
primary_key,
value_1,
value_2
)
SELECT DISTINCT
primary_key,
value_1,
value_2
FROM
Table_A
WHERE NOT EXISTS
(
SELECT 1 FROM
Table_B
WHERE
Table_B.primary_key = Table_A.primary_key
);
Spark SQL is straightforward and I can load data from a TempView into a new Dataset. Unfortunately, I don't know how to reconstruct the WHERE clause.
Dataset<Row> Table_B = spark.sql("SELECT DISTINCT primary_key, value_1, value_2 FROM Table_A").where("NOT EXISTS ... ???" );
Queries with NOT EXISTS in T-SQL can be rewritten as a LEFT JOIN with a WHERE filter:
SELECT Table_A.*
FROM Table_A LEFT JOIN Table_B ON Table_B.primary_key = Table_A.primary_key
WHERE Table_B.primary_key IS NULL
A similar approach can be used in Spark with a left join. For example, for DataFrames, something like:
aDF.join(bDF, aDF("primary_key") === bDF("primary_key"), "left_outer").filter(isnull(col("other_b_not_nullable_column")))
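For the original INSERT, a minimal DataFrame sketch could look like the following (a "left_anti" join keeps only the left-side rows with no match, which mirrors NOT EXISTS; it assumes Table_A and Table_B exist in the catalog and that Table_B is writable):
// Rows of Table_A whose primary_key does not yet exist in Table_B
val newRows = spark.table("Table_A")
  .select("primary_key", "value_1", "value_2")
  .distinct()
  .join(spark.table("Table_B"), Seq("primary_key"), "left_anti")

// Append the missing rows; insertInto resolves columns by position
newRows.write.insertInto("Table_B")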
Spark SQL doesn't currently support EXISTS and IN; see "(Latest) Spark SQL / DataFrames and Datasets Guide / Supported Hive Features".
EXISTS and IN can always be rewritten using JOIN or LEFT SEMI JOIN: "Although Apache Spark SQL currently does not support IN or EXISTS subqueries, you can efficiently implement the semantics by rewriting queries to use LEFT SEMI JOIN." OR can always be rewritten using UNION, and AND NOT can be rewritten using EXCEPT.
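Applied to the question's tables, a rough sketch of both directions (LEFT SEMI JOIN keeps Table_A rows whose key exists in Table_B, the EXISTS case; LEFT ANTI JOIN, also supported in recent Spark versions, keeps those that don't, the NOT EXISTS case):
// EXISTS-style: Table_A rows whose key is present in Table_B
val existing = spark.sql(
  """SELECT a.primary_key, a.value_1, a.value_2
    |FROM Table_A a LEFT SEMI JOIN Table_B b ON b.primary_key = a.primary_key""".stripMargin)

// NOT EXISTS-style: Table_A rows whose key is absent from Table_B
val missing = spark.sql(
  """SELECT DISTINCT a.primary_key, a.value_1, a.value_2
    |FROM Table_A a LEFT ANTI JOIN Table_B b ON b.primary_key = a.primary_key""".stripMargin)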
Related
I am running some queries with joins using Spark SQL 3.1, where the same columns in both tables can contain null values, like:
select ...
from a
join b
on a.col_with_nulls = b.col_with_nulls
and a.col_without_nulls = b.col_without_nulls
However, the query is not matching null values in the ON condition. I have also tried:
select ...
from a
join b
on a.col_with_nulls is not distinct from b.col_with_nulls
and a.col_without_nulls = b.col_without_nulls
as suggested in other solutions here, but I keep getting the same result. Any idea?
You can use <=> or eqNullSafe() in order to retain nulls in the join.
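A minimal sketch of both forms, using the column names from the question (assumes tables a and b are available in the catalog):
// SQL form: <=> is null-safe equality (NULL <=> NULL evaluates to true)
val viaSql = spark.sql(
  """SELECT *
    |FROM a JOIN b
    |  ON a.col_with_nulls <=> b.col_with_nulls
    | AND a.col_without_nulls = b.col_without_nulls""".stripMargin)

// DataFrame form: eqNullSafe is the equivalent Column method
val aDF = spark.table("a")
val bDF = spark.table("b")
val viaApi = aDF.join(
  bDF,
  aDF("col_with_nulls").eqNullSafe(bDF("col_with_nulls")) &&
    aDF("col_without_nulls") === bDF("col_without_nulls"))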
Hive 2.3.6-mapr
Spark v2.3.1
I am running the same query in both:
select count(*)
from TABLE_A a
left join TABLE_B b
on a.key = b.key
and b.date > '2021-01-01'
and date_add(last_day(add_months(a.create_date, -1)),1) < '2021-03-01'
where cast(a.TIMESTAMP as date) >= '2021-01-20'
and cast(a.TIMESTAMP as date) < '2021-03-01'
But I am getting 1B rows as output in Hive, while 1.01B in spark-sql.
From some initial analysis, it seems all the extra rows in Spark have the TIMESTAMP column set to 2021-02-28 00:00:00.000000.
Both the TIMESTAMP and create_date columns have data type string.
What could be the reason behind this?
I will give you one possibility, but I need more information.
If you drop an external table, the data remains and Spark can still read it, but the metadata in Hive says the table doesn't exist, so Hive doesn't read it.
That's why you see a difference.
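One way to sanity-check this is to compare the count the metastore-backed table gives with a count over the raw files at the table's LOCATION (a sketch; the path is hypothetical and assumes a Parquet-backed table, so take the real values from DESCRIBE FORMATTED):
// Count via the table definition in the metastore
val viaMetastore = spark.table("TABLE_A").count()

// Count by reading the files directly from the table's LOCATION (hypothetical path)
val viaFiles = spark.read.parquet("/warehouse/path/to/table_a").count()

println(s"via metastore: $viaMetastore, via raw files: $viaFiles")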
I am using Spark 2.2.1 and Hive 2.1. I am trying to insert overwrite multiple partitions into an existing partitioned Hive/Parquet table.
The table was created using sparkSession.
I have a table 'mytable' with partitions P1 and P2.
I have the following set on the sparkSession object:
"hive.exec.dynamic.partition"=true
"hive.exec.dynamic.partition.mode"="nonstrict"
Code:
val df = spark.read.csv(pathToNewData)
df.createOrReplaceTempView("updateTable") // here 'df' may contain data from multiple partitions, i.e. multiple values for P1 and P2
spark.sql("insert overwrite table mytable PARTITION(P1, P2) select c1, c2,..cn, P1, P2 from updateTable") // I made sure that partition columns P1 and P2 are at the end of the projection list
I am getting the following error:
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException: Partition spec {p1=, p2=, P1=1085, P2=164590861} contains non-partition columns;
The dataframe 'df' has records for P1=1085, P2=164590861. It looks like an issue with casing (lower vs. upper). I tried both cases in my query but it's still not working.
EDIT:
The insert statement works with static partitioning, but that is not what I am looking for.
For example, the following works:
spark.sql("insert overwrite table mytable PARTITION(P1=1085, P2=164590861) select c1, c2,..cn, P1, P2 from updateTable where P1=1085 and P2=164590861")
Create table stmt:
CREATE TABLE `my_table`(
`c1` int,
`c2` int,
`c3` string,
`p1` int,
`p2` int)
PARTITIONED BY (
`p1` int,
`p2` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'maprfs:/mds/hive/warehouse/my.db/xc_bonus'
TBLPROPERTIES (
'spark.sql.partitionProvider'='catalog',
'spark.sql.sources.schema.numPartCols'='2',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{.spark struct metadata here.......}',
'spark.sql.sources.schema.partCol.0'='P1', -- Spark is using capital names for the partition columns, while Hive is using lowercase
'spark.sql.sources.schema.partCol.1'='P2',
'transient_lastDdlTime'='1533665272')
In the above, spark.sql.sources.schema.partCol.0 uses uppercase while the PARTITIONED BY clause uses lowercase for the partition columns.
Based on the exception, and assuming that the table 'mytable' was created as a partitioned table with P1 and P2 as partitions, one way to overcome this exception would be to force a dummy partition manually before executing the command. Try doing
spark.sql("alter table mytable add partition (p1=default, p2=default)")
Once that succeeds, execute your insert overwrite statement. Hope this helps.
As I mentioned in the EDIT section, the issue was in fact the difference in partition-column casing (lower vs. upper) between Hive and Spark. I created the Hive table with uppercase partition columns, but Hive stored them internally as lowercase, while the Spark metadata kept them uppercase as I intended. Recreating the table with all-lowercase partition columns fixed the issue for subsequent updates.
If you are using Hive 2.1 and Spark 2.2, make sure the following properties in the create statement use the same casing:
PARTITIONED BY (
p1 int,
p2 int)
'spark.sql.sources.schema.partCol.0'='p1',
'spark.sql.sources.schema.partCol.1'='p2',
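Putting the pieces together once the casing is consistent, a rough sketch of the dynamic-partition overwrite from the question (settings and projection follow the question; c3 stands in for the remaining columns):
// Dynamic-partition settings from the question
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

val df = spark.read.csv(pathToNewData)
df.createOrReplaceTempView("updateTable")

// Partition columns last in the projection, lowercase to match the Hive metadata
spark.sql(
  """INSERT OVERWRITE TABLE mytable PARTITION (p1, p2)
    |SELECT c1, c2, c3, p1, p2 FROM updateTable""".stripMargin)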
I have read that in Spark you can easily do a correlated scalar subquery like so:
select
column1,
(select column2 from table2 where table2.some_key = table1.id)
from table1
What I have not figured out is how to do this in the DataFrame API. The best I can come up with is to do a join. The problem with this is that in my specific case I am joining with an enum-like lookup table that actually applies to more than one column.
Below is an example of the DataFrame code.
Dataset<Row> table1 = getTable1FromSomewhere();
Dataset<Row> table2 = getTable2FromSomewhere();
table1
.as("table1")
.join(table2.as("table2"),
col("table1.first_key").equalTo(col("table2.key")), "left")
.join(table2.as("table3"),
col("table1.second_key").equalTo(col("table3.key")), "left")
.select(col("table1.*"),
col("table2.description").as("first_key_description"),
col("table3.description").as("second_key_description"))
.show();
Any help on figuring out how to do this in the DataFrame API would be greatly appreciated.
What I have not figured out is how to do this in the DataFrame API.
Because there is simply no DataFrame API that can express that directly (without an explicit JOIN). That may change in the future:
https://issues.apache.org/jira/browse/SPARK-23945
https://issues.apache.org/jira/browse/SPARK-18455
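Until then, one workaround is to register the Datasets as temp views and express the lookup in SQL, which Spark 2.0+ does support; note that the analyzer may require a correlated scalar subquery to be aggregated, hence the max() below (a sketch; names follow the question):
table1.createOrReplaceTempView("table1")
table2.createOrReplaceTempView("table2")

// Correlated scalar subquery; max() is a no-op when some_key lookups are unique
val result = spark.sql(
  """SELECT column1,
    |       (SELECT max(column2) FROM table2 WHERE table2.some_key = table1.id) AS column2
    |FROM table1""".stripMargin)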
Does SparkSQL support subquery?
We can easily read records from a Hive table in Spark with this command:
Row[] results = sqlContext.sql("FROM my_table SELECT col1, col2").collect();
But when I join two tables, such as:
select t1.col1, t1.col2 from table1 t1 join table2 t2 on t1.id = t2.id
How can I retrieve the records from the above join query?
The SQLContext.sql method always returns a DataFrame, so there is no practical difference between a JOIN and any other type of query.
You shouldn't use the collect method, though, unless fetching the data to the driver is really the desired outcome. It is expensive and will crash if the data cannot fit in driver memory.
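For example, a sketch of keeping the work distributed instead of collecting everything to the driver (the output path is illustrative):
val joined = sqlContext.sql(
  """SELECT t1.col1, t1.col2
    |FROM table1 t1 JOIN table2 t2 ON t1.id = t2.id""".stripMargin)

// Inspect a small sample on the driver instead of pulling the full result back
joined.show(20)

// Or keep the result distributed and write it out
joined.write.parquet("/tmp/joined_output")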