In Azure Synapse I have two tables: table A with 6 million records and table B with 2 million. When I run a simple left join query it takes around 20 minutes to execute, but when I run the same join query in on-premises SQL Server it returns the output in 1 second. I have round-robin distribution in Synapse and the columns are indexed. What can be the reason for this issue?
Let's say I have the following join (modified from Spark documentation):
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= cast(impressionTime as date) AND
    clickTime <= cast(impressionTime as date) + interval 1 day
  """)
)
Assume that both tables have trillions of rows for 2 years of data. I think that joining everything from both tables is unnecessary. What I want to do is create subsets, similar to this: create 365 * 2 * 2 smaller dataframes so that there is 1 dataframe for each day of each table for 2 years, then create 365 * 2 join queries and take a union of them. But that is inefficient. I am not sure how to do it properly. I think I should add table.repartition(factor/multiple of 365 * 2) for both tables and add write.partitionBy(cast(impressionTime as date), cast(impressionTime as date)) to the streamwriter, and set the number of executors times cores to a factor or multiple of 365 * 2.
What is a proper way to do this? Does Spark analyze the query and optimize it so that the entries from a single day are automatically put in the same partition? What if I am not joining all records from the same day, but rather from the same hour, and there are very few records from 11 pm to 1 am? Does Spark know that it is most efficient to partition by day, or will it find something even more efficient?
First, let me restate what I understood from your question. You have two tables with two years' worth of data, each holding around a trillion records. You want to join them efficiently over a timeframe that you provide, for example a specific month of a year or a custom date range, and the query should read only that much data rather than all of it.
Now, to answer your question, you can do the following:
First, when you write the data to create the tables, partition both tables by a day column so that each day's data lands in its own directory/partition. Spark won't do that for you by default; you have to decide on the partitioning based on your dataset.
Second, when you read the data and perform the join, do not run it on the whole table. Read only the specific partitions by applying a filter condition on the dataframe, so that Spark applies partition pruning and reads only the partitions that satisfy the filter.
Once you have filtered the data at read time and stored it in a dataframe, join those dataframes on the key relationship. That is the most efficient and performant approach as a first attempt.
If it is still not fast enough, you can look at bucketing your data in addition to partitioning, but in most cases that is not required.
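The idea behind the steps above can be sketched in plain Python. This is not Spark code and uses no Spark API. It only mimics what day partitioning plus a partition filter achieves: rows are grouped by day up front, and the join touches only the day partitions that were asked for. The helper names (partition_by_day, join_for_days) and the sample rows are made up for illustration; the column names and the join condition follow the question.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta


def partition_by_day(rows, time_key):
    """Group rows by calendar day, mimicking a table written with a day partition column."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[time_key].date()].append(row)
    return parts


def join_for_days(impressions_by_day, clicks_by_day, days):
    """Join only the requested day partitions; all other partitions are never read."""
    joined = []
    for day in days:
        start = datetime.combine(day, datetime.min.time())
        end = start + timedelta(days=1)
        # Candidate clicks live in this day's partition, plus the next day's
        # partition for a click at exactly midnight (the "<= day + 1 day" bound).
        candidates = clicks_by_day.get(day, []) + clicks_by_day.get(day + timedelta(days=1), [])
        clicks_by_ad = defaultdict(list)
        for click in candidates:
            if start <= click["clickTime"] <= end:
                clicks_by_ad[click["clickAdId"]].append(click)
        for imp in impressions_by_day.get(day, []):
            for click in clicks_by_ad.get(imp["impressionAdId"], []):
                joined.append((imp, click))
    return joined


# Hypothetical sample data.
impressions = [
    {"impressionAdId": 1, "impressionTime": datetime(2020, 1, 1, 10)},
    {"impressionAdId": 2, "impressionTime": datetime(2020, 1, 2, 9)},
]
clicks = [
    {"clickAdId": 1, "clickTime": datetime(2020, 1, 1, 11)},
    {"clickAdId": 2, "clickTime": datetime(2020, 1, 2, 12)},
    {"clickAdId": 1, "clickTime": datetime(2020, 1, 5, 8)},
]

imp_parts = partition_by_day(impressions, "impressionTime")
click_parts = partition_by_day(clicks, "clickTime")

# Only the 2020-01-01 partition (and its midnight boundary) is inspected.
result = join_for_days(imp_parts, click_parts, [date(2020, 1, 1)])
```

In Spark the same effect would come from partitioning the written tables by a day column and filtering on that column before the join; the sketch only shows why that limits the work to the selected days.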
I have a table in an Oracle database that receives a large number of transactions (let's say around 100 million inserts, updates or deletes in a day). I want to bring all the transactions happening in that table into Hive for processing through Spark or Hive.
For example:
Let's say a record in that Oracle table goes through an initial insert operation, followed by 5 updates to the same or different columns, and finally gets deleted. I want to capture all such operations for all the records in that table and import them into Hive.
We want to find records with number of operations that exceed a threshold for specific columns and pull a report on them.
Has anyone come across such a use case? Appreciate any help in achieving this.
There is a 20 GB table in Parquet format, and a simple query returns results by scanning only 1 GB of data.
select columns from table1 where id in (id1, id2, idn)
If the same query is executed with a sub-query, such as:
select columns from table1 where id in (select id from table2 limit n)
it returns results by scanning the whole table, 20 GB, even when n is a very small number such as 10, 50 or 5000.
The same happens with LEFT JOIN.
SELECT table1.* FROM
table2 LEFT JOIN table1
ON table2.id=table1.id
Is there a way to achieve this by running a single query, instead of fetching and saving the result of the sub-query and passing it as arguments into another query?
Are there any best practices for running a LEFT JOIN or sub-query on Athena without a full table scan?
Similar questions- Question -1, Question -2
Is there a way to achieve this by running a single query, instead of fetching and saving the result of the sub-query and passing it as arguments into another query?
This is most commonly covered by "Dynamic filtering".
Currently there is no way to do this.
Athena is based on Presto and Presto doesn't support dynamic filtering yet, but will likely support it in the next release (Presto 321). You can track the issue here: https://github.com/prestosql/presto/issues/52
Athena is based on Presto 0.172 currently, so it still needs to upgrade.
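Until dynamic filtering is available, the two-step approach mentioned in the question remains the workaround: run the small sub-query first, then pass its ids into the main query as a plain list of values. Below is a minimal sketch of that pattern. It uses Python's built-in sqlite3 module as a stand-in for Athena, so it shows only the query shape and not Athena's scan behaviour; the function name fetch_with_literal_ids, the payload column and the sample data are assumptions made for illustration.

```python
import sqlite3


def fetch_with_literal_ids(conn, n):
    """Two-step lookup: resolve the ids first, then query table1 with a plain IN list."""
    # Step 1: run the small sub-query on its own and collect the ids.
    ids = [row[0] for row in conn.execute("SELECT id FROM table2 ORDER BY id LIMIT ?", (n,))]
    if not ids:
        return []
    # Step 2: pass the ids as a list of values, so the main query
    # has the same shape as "where id in (id1, id2, idn)".
    placeholders = ", ".join("?" for _ in ids)
    query = f"SELECT id, payload FROM table1 WHERE id IN ({placeholders}) ORDER BY id"
    return conn.execute(query, ids).fetchall()


# Hypothetical in-memory tables standing in for table1 and table2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE table2 (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [(i, f"row-{i}") for i in range(100)])
conn.executemany("INSERT INTO table2 VALUES (?)", [(i,) for i in (3, 7, 42)])

rows = fetch_with_literal_ids(conn, 2)
```

Against Athena, the second step would be submitted as a separate query with the ids rendered into the IN list, which is the form the question reports as scanning only 1 GB.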
I have a table in my database that is indexed over three columns: PropertyId, ConceptId and Sequence. The table has about 90,000 rows in it.
Now, when I run this query, the total time required is greater than 2 minutes:
SELECT *
FROM MSC_NPV
ORDER BY PropertyId, ConceptId, Sequence
However, if I paginate the query like so:
SELECT *
FROM MSC_NPV
ORDER BY PropertyId, ConceptId, Sequence
OFFSET x * 10000 ROWS
FETCH NEXT 10000 ROWS ONLY
the aggregate time (x goes from 0 to 8) required is only around 20 seconds.
This seems counterintuitive to me because the pagination requires additional operations over and beyond simpler queries and we're adding on the additional latency required for sequential network calls because I haven't parallelized this query at all. And, I know it's not a caching issue because running these queries one after the other does not affect the latencies very much.
So, my question is this: why is one so much faster than the other?
This seems counterintuitive to me because the pagination requires additional operations over and beyond simpler queries
Pagination queries sometimes work very fast if you have the right index.
For example, with the query below
OFFSET x * 10000 ROWS
FETCH NEXT 10000 ROWS ONLY
the maximum number of rows you might read is only 20000 (for x = 1: 10000 skipped plus 10000 fetched). Below is an example which shows the same:
RunTimeCountersPerThread Thread="0" ActualRows="60" ActualRowsRead="60"
But with the select * query, you are reading all the rows.
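The row counts behind this argument can be worked out with a small calculation. This is a sketch under one assumption that the thread does not confirm: the plan is an ordered index scan that walks the index from the start and stops as soon as the page is filled. The function name rows_read_for_page is made up for illustration.

```python
def rows_read_for_page(total_rows, page, page_size):
    """Rows an ordered index scan walks to serve OFFSET page * page_size, FETCH page_size.

    Assumes the scan starts at the beginning of the index and stops once the page is full.
    """
    offset = page * page_size
    return min(total_rows, offset + page_size)


TOTAL_ROWS = 90_000  # table size from the question
PAGE_SIZE = 10_000

# Rows walked for each of the nine pages (x goes from 0 to 8).
per_page = [rows_read_for_page(TOTAL_ROWS, x, PAGE_SIZE) for x in range(9)]
total_walked = sum(per_page)
```

Under that assumption each page walks only offset plus page size rows, but later pages walk progressively more, so the cost of a single page depends on how deep into the table it is.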
After a prolonged search into what's going on here, I discovered that this difference in performance (> 2 minutes) was due to hosting the database on Azure. Since Azure partitions any tables you host on it across multiple partitions (i.e. multiple machines), running a query like:
SELECT *
FROM MSC_NPV
ORDER BY PropertyId, ConceptId, Sequence
would run more slowly because the query pulls data from all the partitions in before ordering them, which could result in multiple queries across multiple partitions on the same table. By paginating the query over indexed properties I was looking at a particular partition and querying over the table stored there, which is why it performed significantly better than the un-paginated query.
To prove this, I ran another query:
SELECT *
FROM MSC_NPV
ORDER BY Narrative
OFFSET x * 10000 ROWS
FETCH NEXT 10000 ROWS ONLY
This query ran very slowly compared to the first paginated query because Narrative is not a primary key and therefore is not used by Azure to build a partition key. So, ordering on Narrative required the same operation as the first query, plus additional operations on top of that, because the entire table had to be fetched beforehand.
I have to create a table consisting of 30 columns in HBase. The problem is that I have 5 different queries running on 5 columns. The question, basically, is this: the table has indexes on multiple columns in Oracle; how can I do the same in HBase?
All 5 queries are SLA-bound.
What are possible approaches to start with?
Thanks.