I have the following two queries in Hive that return the same result:
select * from table1 where col1 IN (a, b, c)
select * from table1 where col1=a OR col1=b OR col1=c
As per my understanding, IN is converted internally to a sequence of ORs.
I executed both locally in spark-sql but did not find any performance difference (execution time, amount of data scanned, etc.).
So what difference, if any, is there between IN and OR in terms of functionality?
Any help will be appreciated.
col1 IN (a, b, c) is effectively a macro that expands to col1 = a OR col1 = b OR col1 = c.
There is no performance difference.
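You can check this yourself in spark-sql by comparing the plans (a quick sketch; a, b, c stand in for your literal values):
EXPLAIN EXTENDED select * from table1 where col1 IN (a, b, c);
EXPLAIN EXTENDED select * from table1 where col1 = a OR col1 = b OR col1 = c;
-- the optimizer may print the predicate differently (an IN list vs a
-- chain of ORs), but both versions drive the same scan of table1
EXPLAIN EXTENDED shows the parsed, analyzed, and optimized plans side by side, so any difference between the two queries would be visible there.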
The query below uses row_number(), and it introduces a shuffle move when executed.
SELECT
    f.col1
    ,f.col2
    ,f.col3
    ,f.col4
    ,rowNum = ROW_NUMBER() OVER (PARTITION BY f.col2 ORDER BY f.col4 DESC)
FROM #currentData e
LEFT JOIN dbo.targetTable f
    ON e.col2 = f.col2
The #currentData temporary table and targetTable are both distributed on the col2 column.
I have also created an index on the columns used in the row_number() -- (col2 ASC, col4 DESC) -- but it didn't get rid of the shuffle move.
I have tried creating a covering index over all of the columns in the select statement and in the row_number(), but that didn't resolve the issue either.
Both tables have an index on the join column (col2).
I also made sure statistics are up to date on these two tables.
The query takes a long time to run because of the shuffle move. Is there any other way to improve its performance?
Appreciate the help!
I just found this out the hard way, unfortunately. I haven't had time to fully understand it, but I managed to cut the query time by 90% by removing the ROW_NUMBER function.
To my understanding, ROW_NUMBER requires each node to hold all of the rows for a partition so it can number them according to the ORDER BY clause. If the ORDER BY (or PARTITION BY) columns come from a big table, that means a lot of shuffling. Because we used row_number only as a primary-key generator, I was able to get rid of it, but I assume the same happens with RANK etc. as well.
With the row_number removed, the query plan does what it should: it joins without data movement.
Interested to see if anyone has a solution or better explanation.
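One rewrite worth trying (a sketch only, not tested against this schema): compute the window function on targetTable before the join, so that both the window's PARTITION BY and the join key are col2, the distribution column, and neither step should in principle require moving rows:
SELECT f.col1, f.col2, f.col3, f.col4, f.rowNum
FROM #currentData e
LEFT JOIN (
    SELECT col1, col2, col3, col4,
           ROW_NUMBER() OVER (PARTITION BY col2 ORDER BY col4 DESC) AS rowNum
    FROM dbo.targetTable
) f ON e.col2 = f.col2;
Whether the optimizer actually drops the shuffle move still depends on how it plans the derived table, so check the query plan after the change.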
Is it possible to approximate the size of a derived table (in KB/MB/GB, etc.) in a Spark SQL query? I don't need the exact size; an approximate value will do. It would let me plan my queries better by determining whether a table can be broadcast in a join, or whether using a filtered subquery in a join is better than using the entire table, etc.
For example, in the following query, is it possible to approximate the size (in MB) of the derived table named b? This would help me figure out whether it is better to use the derived table in the join vs. using the entire table with the filter outside:
select
a.id, b.name, b.cust
from a
left join (select id, name, cust
from tbl
where size > 100
) b
on a.id = b.id
We use Spark SQL 2.4. Any comments appreciated.
I have had to do something similar before (to work out how many partitions to split into when writing).
What we ended up doing was working out an average row size, doing a count on the DataFrame, and then multiplying the two.
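A sketch of that approach in plain Spark SQL, run against the derived table's definition (assuming id is a bigint at roughly 8 bytes, and using length(), which counts characters, as a rough stand-in for the byte width of the string columns):
SELECT
    count(*)                                       AS row_cnt,
    avg(8 + length(name) + length(cust))           AS avg_row_bytes,
    count(*) * avg(8 + length(name) + length(cust))
        / (1024 * 1024)                            AS approx_mb
FROM tbl
WHERE size > 100;
If approx_mb comes out well under spark.sql.autoBroadcastJoinThreshold (10 MB by default), the derived table is a reasonable candidate for a broadcast join.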
There is a 20 GB table stored in Parquet format, and a simple query returns its results by scanning only 1 GB of data:
select columns from table1 where id in (id1, id2, idn)
If the same query is executed with a sub-query, such as
select columns from table1 where id in (select id from table2 limit n)
then it returns results only after scanning 20 GB, the whole table, even when n is a very small number like 10, 50, or 5000.
The same happens with a LEFT JOIN:
SELECT table1.*
FROM table2
LEFT JOIN table1 ON table2.id = table1.id
Is there a way to achieve this with a single query, instead of fetching and saving the result of the sub-query and passing the values as arguments into another query?
Are there any best practices for how users currently run a LEFT JOIN or sub-query on Athena without a full table scan?
Similar questions: Question 1, Question 2
Is there a way to achieve this with a single query, instead of fetching and saving the result of the sub-query and passing the values as arguments into another query?
This is most commonly covered by "Dynamic filtering".
Currently there is no way to do this.
Athena is based on Presto and Presto doesn't support dynamic filtering yet, but will likely support it in the next release (Presto 321). You can track the issue here: https://github.com/prestosql/presto/issues/52
Athena is based on Presto 0.172 currently, so it still needs to upgrade.
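Until then, the usual workaround is to emulate dynamic filtering by hand in two steps (a sketch, reusing the placeholder ids from the question):
-- step 1: run the small side on its own and collect the ids
SELECT id FROM table2 LIMIT n;
-- step 2: paste the returned values into the main query as literals,
-- so Athena can prune partitions/row groups and scan far less data
SELECT columns FROM table1 WHERE id IN (id1, id2, idn);
This is exactly the fetch-and-pass-as-args pattern the question describes; dynamic filtering would simply do it automatically inside a single query.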
I would like to understand how dynamic filtering works.
What I know about it: say there are two tables, A (with a million rows) and B (with 10k rows).
Now, while joining A and B, if a predicate is applied on B, then via dynamic filtering we can avoid a full scan of A.
This means less data gets shuffled.
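For a concrete picture, this is the kind of query where it applies (a hypothetical example; the table and column names are made up):
-- B is small and carries the selective predicate
SELECT a.*
FROM A a
JOIN B b ON a.id = b.id
WHERE b.category = 'x';
-- with dynamic filtering, the set of b.id values satisfying
-- b.category = 'x' is collected at runtime and pushed down as a filter
-- on a.id, so A is pruned instead of being fully scanned and shuffled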
My questions are:
1) Does this happen automatically in Spark, or do I have to enable some property?
2) Is there a way for me to supply the filters myself, before the SQL is executed?
3) Are there any downsides to this approach?
4) Any link that provides an explanation of this?
I am learning Spark SQL and noticed that this is possible:
SELECT a, b,
       row_number() OVER (PARTITION BY a, b ORDER BY start_time DESC) AS r,
       count(*) OVER (PARTITION BY a, b) AS count
FROM tbl
WHERE ...
HAVING r <= 10
As far as I know, a HAVING clause can only be applied to an aggregation in a GROUP BY. Impala does not recognise this syntax, nor is it documented in the only reference I was able to find for Spark SQL.
What's up with that? Are the semantics the same as putting the same condition in a WHERE clause in an outer query (like I normally would)?
This issue is now resolved; see https://issues.apache.org/jira/plugins/servlet/mobile#issue/IMPALA-2215
However, it may not be updated for older versions.
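For reference, the portable rewrite with an explicit outer query, which should be semantically equivalent (a sketch; count is renamed to cnt since COUNT is a reserved word in some engines, and the WHERE ... placeholder is kept from the question):
SELECT a, b, r, cnt
FROM (
    SELECT a, b,
           row_number() OVER (PARTITION BY a, b ORDER BY start_time DESC) AS r,
           count(*) OVER (PARTITION BY a, b) AS cnt
    FROM tbl
    WHERE ...
) t
WHERE r <= 10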