Reusing subqueries in AWS Athena generates a large amount of data scanned

On AWS Athena, I am trying to reuse computed data using a WITH clause, e.g.
WITH temp_table AS (...)
SELECT ...
FROM temp_table t0, temp_table t1, temp_table t2
WHERE ...
While the query is fast, the "Data scanned" goes through the roof, as if temp_table were computed once for each time it is referenced in the FROM clause.
I don't see this issue if I create a temporary table separately and use it multiple times in the query.
Is there a way to really reuse a subquery multiple times without any penalty?
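For reference, the workaround mentioned above (materializing the temporary table separately) looks roughly like this with Athena's CTAS support. This is a sketch, not the original poster's code; the table name and the elided queries are placeholders.
CREATE TABLE temp_table
WITH (format = 'PARQUET') AS
SELECT ...;   -- the body of the original WITH clause, computed once

SELECT ...
FROM temp_table t0, temp_table t1, temp_table t2
WHERE ...;

DROP TABLE temp_table;   -- clean up the materialized copy when done
Each reference to temp_table then scans the materialized Parquet output instead of re-running the subquery against the source data.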

Related

Improve performance of a query using row_number() in Azure Synapse Analytics dedicated SQL pool

The query below uses row_number(), and it introduces a shuffle move when executed.
SELECT
    f.col1
    ,f.col2
    ,f.col3
    ,f.col4
    ,rowNum = ROW_NUMBER() OVER (PARTITION BY f.col2 ORDER BY f.col4 DESC)
FROM #currentData e
LEFT JOIN dbo.targetTable f
    ON e.col2 = f.col2
Both the #currentData temporary table and targetTable are distributed on the col2 column.
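For context, "distributed on col2" in a dedicated SQL pool means the tables were created along these lines. This is a sketch; the column types are assumptions.
CREATE TABLE dbo.targetTable
(
    col1 INT,
    col2 INT,
    col3 INT,
    col4 DATETIME2
)
WITH (DISTRIBUTION = HASH(col2));   -- rows are co-located by hash of col2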
I have also created an index on the columns used in the row_number() (col2 asc, col4 desc), but it didn't get rid of the shuffle move.
I have tried creating a covering index over all of the columns in the SELECT statement and in the row_number(), but that didn't resolve the issue either.
Both tables have an index on the join column (col2).
I have also made sure statistics are up to date on these two tables.
The query takes a long time to process due to the shuffle move. Is there any other way to improve its performance?
Appreciate the help!
I just found this out the hard way, unfortunately. I haven't had time to fully understand it, but I managed to cut the query's runtime by 90% by removing the ROW_NUMBER function.
To my understanding, ROW_NUMBER requires each node to see all the data so it can number rows according to the ORDER BY clause, and if the ORDER BY (or PARTITION BY) column comes from a big table, that means a lot of shuffling. Because we used row_number purely as a primary-key generator I was able to get rid of it, but I assume the same happens with RANK etc. as well.
With the row_number removed, the query plan does what it should: it joins without data movement.
Interested to see if anyone has a solution or better explanation.
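A sketch of both steps: first the join with the shuffle-inducing ROW_NUMBER simply removed, then one possible substitute for key generation. The IDENTITY approach is my assumption, not something from the original answer.
-- The join alone: with no global ordering requirement, a col2-to-col2
-- join between two HASH(col2)-distributed tables needs no data movement.
SELECT f.col1, f.col2, f.col3, f.col4
FROM #currentData e
LEFT JOIN dbo.targetTable f ON e.col2 = f.col2;

-- If row_number() was only generating surrogate keys, an IDENTITY
-- column can take over that job without forcing a shuffle:
CREATE TABLE dbo.resultTable
(
    rowNum BIGINT IDENTITY(1, 1),
    col1 INT, col2 INT, col3 INT, col4 DATETIME2
)
WITH (DISTRIBUTION = HASH(col2));

INSERT INTO dbo.resultTable (col1, col2, col3, col4)
SELECT f.col1, f.col2, f.col3, f.col4
FROM #currentData e
LEFT JOIN dbo.targetTable f ON e.col2 = f.col2;
Note that in Synapse dedicated pools, IDENTITY values are unique but not guaranteed sequential, which is usually fine for a surrogate key.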

Performance-related question on Spark + Cassandra (Java code)

I am using Cassandra as my dumping ground, and I have multiple jobs running against it to process the data and update different systems. Below are the job-related filters:
Job 1: filter data on active_flag, update_date_time, and expiry_time, then process the filtered data.
Job 2: filter data on update_date_time, then process the data.
Job 3: filter data on created_date_time and active_flag.
The columns the WHERE conditions run on (one or many per query) are:
Active -> yes/no
created_date -> timestamp
expiry_time -> timestamp
updated_date -> timestamp
My questions about these conditions:
How should I form my Cassandra primary key? I don't see any way to achieve uniqueness here (an id is present, but I don't need it to process the data).
Do I even need a primary key if I do the filtering in Spark code with a table scan?
This is for processing millions of records.
Answering your question: you need to have a primary key, even if it consists only of the partition key :-)
A more detailed answer really depends on how often these jobs run, how much data there is overall, how many nodes are in the cluster, what hardware is used, etc. Usually we try to push as much filtering as possible down to Cassandra, so that it returns only the relevant data rather than everything. This filtering is most effective on the first clustering column. For example, if I want to process only newly created entries, I can use a table with the following structure:
create table test.test (
    pk int,
    tm timestamp,
    c2 int,
    v1 int,
    v2 int,
    primary key(pk, tm, c2));
and then I can fetch only newly created entries by using:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("test", "test").load()
val filtered = data.filter("tm >= cast('2019-03-10T14:41:34.373+0000' as timestamp)")
Or I can fetch entries in a given time period:
val filtered = data.filter("""tm >= cast('2019-03-10T14:41:34.373+0000' as timestamp)
  AND tm <= cast('2019-03-10T19:01:56.316+0000' as timestamp)""")
The effect of the filter pushdown can be checked by calling explain on the dataframe and inspecting the PushedFilters section; conditions marked with * are executed on the Cassandra side.
But it's not always possible to design tables to match all queries, so you'll need to design the primary key around the jobs that run most often. In your case, update_date_time could be a good candidate, but if you make it a clustering column you'll need to take care when updating it: the change has to be performed as a batch, something like this:
begin batch
delete from table where pk = ... and update_date_time = old_timestamp;
insert into table (pk, update_date_time, ...) values (..., new_timestamp, ...);
apply batch;

Athena sub-query and LEFT JOIN data scanned optimization

There is a 20 GB table in Parquet format, and a simple query returns results by scanning only 1 GB of data.
select columns from table1 where id in (id1, id2, idn)
If the same query is executed with a sub-query, such as
select columns from table1 where id in (select id from table2 limit n)
it returns results by scanning 20 GB, the whole table, even when n is as small as 10, 50, or 5000.
The same happens with a LEFT JOIN:
SELECT table1.*
FROM table2
LEFT JOIN table1 ON table2.id = table1.id
Is there a way to achieve this by running a single query, instead of fetching and saving the result of the sub-query and passing it as arguments into another query?
What are the current best practices for running a LEFT JOIN or sub-query on Athena without a full table scan?
Similar questions: Question 1, Question 2
Is there a way to achieve this by running a single query, instead of fetching and saving the result of the sub-query and passing it as arguments into another query?
This is most commonly covered by "Dynamic filtering".
Currently there is no way to do this.
Athena is based on Presto, and Presto doesn't support dynamic filtering yet, but it will likely land in the next release (Presto 321). You can track the issue here: https://github.com/prestosql/presto/issues/52
Athena currently runs Presto 0.172, so it would still need to upgrade.
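Until then, the two-query workaround the question alludes to looks roughly like this (a sketch; the id literals in the second query are whatever the first one returns):
-- Query 1: run the small subquery on its own.
SELECT id FROM table2 LIMIT 10;

-- Query 2: paste the returned ids in as literals, which brings the
-- scan back down to the ~1 GB of the first example.
SELECT columns FROM table1 WHERE id IN (id1, id2, ..., idn);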

Spark not allowing separate queries on the same data source within the same Spark SQL query

Let's take an example with two newly created dataframes, empDF and deptDF.
Create views:
empDF.createOrReplaceTempView("table1")
deptDF.createOrReplaceTempView("table2")
spark.sql("select * from table1 as t1
join table2 as t2 on (...) where t1.col1 not in
(select t3.col2 from table1 as t3)"
)
Surprisingly, a runtime exception occurs complaining that the "table1" view or table does not exist. This happens when two different queries reference the same data source within one Spark SQL statement.
May I have some ideas, please?

Sorting Results by time in Cassandra

I'm trying to get some time-series data from Cassandra.
My table is shown in the picture below, and when I query it, I get data as follows:
first I see all the false rows, regardless of when I inserted them, and then I see all the true rows.
My question is: how can I sort or order the data by insertion time, so that I can read rows back in the order I inserted them?
When I try select c1 from table1 order by c2, I get the error "ORDER BY is only supported when the partition key is restricted by an EQ or an IN".
Thank you
[image: my boolean table]
Assuming that your schema is something like the following (column types inferred from the question, with c1 as the partition key and c2 as a clustering column):
CREATE TABLE table1 (
    c1 boolean,
    c2 timestamp,
    PRIMARY KEY (c1, c2))
This results in just two partitions in your table (c1 = true and c1 = false), each managed by a single node.
Your query retrieves data from across all partitions: it goes to the first partition and retrieves all of its rows, then does the same for the second, which is why you see all the false rows followed by all the true ones.
Cassandra is optimised for retrieving data within a single partition, so you should look at adjusting your schema accordingly: to use ORDER BY in a query, you must restrict it to a single partition.
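With the schema above, restricting the query to a single partition makes ORDER BY legal (column names as assumed earlier):
SELECT c1, c2 FROM table1 WHERE c1 = false ORDER BY c2;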
Depending on your use case, you could look at bucketing your data or performing the sorting in your application.
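A sketch of the bucketing idea, assuming you store an insertion timestamp and use a day-sized bucket as the partition key (all names here are illustrative):
CREATE TABLE table1_by_day (
    day date,               -- bucket: the day the row was inserted
    inserted_at timestamp,  -- clustering column: keeps rows in time order
    c1 boolean,
    PRIMARY KEY ((day), inserted_at)
) WITH CLUSTERING ORDER BY (inserted_at ASC);

-- Rows within one bucket come back in insertion order:
SELECT c1, inserted_at FROM table1_by_day WHERE day = '2019-03-10';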
