I have a SQL table with approximately 80 million rows. The first name and last name columns are full-text indexed, and we also have appropriate clustered and non-clustered indexes.
We run many different types of queries, and performance and response times are generally fine. But we have a few cases where we search for a specific first and last name within some time period, and the response time is a few minutes (normally it is 3-10 seconds).
When I look at the execution plan, I see an index eager spool step that accounts for 90% of the cost, preceded by a full-text step. We can't figure out why response time is good for millions of other combinations (incidentally, those queries don't have an eager spool step), but for this specific combination (a first and last name that are both very common) we have this issue.
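The slow case is a query shaped roughly like this (a simplified sketch; the table and column names are placeholders, not the real ones):

-- placeholder names; the real query filters on a common first/last name plus a date range
SELECT p.Id, p.FirstName, p.LastName, p.CreatedDate
FROM dbo.Person AS p
WHERE CONTAINS(p.FirstName, N'"John"')
  AND CONTAINS(p.LastName, N'"Smith"')
  AND p.CreatedDate BETWEEN '2014-01-01' AND '2014-01-31';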
I am using MS SQL Server 2012.
I am running this on AWS Athena, which is based on PrestoDB. My original plan was to query the past 3 months of data for analysis. However, even a query over the past 2 hours takes more than 30 minutes, at which point the query times out. Is there a more efficient way to carry out this query?
SELECT column1, dt, column2
FROM database1
WHERE date_parse(dt, '%Y%m%d%H%i%s') > CAST(now() - interval '1' hour AS timestamp)
The date column dt is recorded as a string in the form YYYYmmddhhmmss.
Most likely the problem is that the query applies a function to the column being filtered. This is inefficient because the database needs to convert every value in the column before it can filter on it; such a predicate is said to be non-SARGable.
Your primary effort should go into fixing your data model and storing dates as dates rather than strings.
That said, the string format you are using to represent dates still makes direct filtering possible, since these strings sort in the same order as the dates they represent. The idea is to convert the filter value to the target string format (rather than converting the column value to a date):
where dt > date_format(now() - interval '1' hour, '%Y%m%d%H%i%s')
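Applied to the three-month window you ultimately want, the same idea looks like this (table and column names as in your query):

-- the comparison stays string-to-string; no function is applied to the dt column
SELECT column1, dt, column2
FROM database1
WHERE dt > date_format(now() - interval '3' month, '%Y%m%d%H%i%s')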
There are a lot of different factors that influence the time it takes Athena to execute a query. The amount of data usually dominates, but other important factors are the data format (there's a huge difference between CSV and Parquet, for example) and the number of files. In contrast to many other databases, the complexity of the query is less often an important factor, and your query is very straightforward and is not the problem. It doesn't help that you apply a function on both sides of the WHERE condition, but it's not a big deal in Athena: the filtering is brute force, and applying a function to each row is cheap compared to the IO in an engine like Athena.
If you provide more information about the number of files, the data format, and so on we can probably help you better, because without that kind of information it could be just about anything. My suspicion is that you have something like a single prefix with tens or hundreds of millions of files – this is the worst possible case for Athena.
When Athena plans a query it lists the table's location on S3. S3's list operation has a page size of 1000, so if there are more files than that Athena will have to list sequentially until it gets the full listing. This cannot be parallelised, and it's also not very fast.
You need to avoid, almost at all costs, having more than 1000 files in the same prefix. If you have more files than that you can add prefixes (directories), because Athena will list S3 as if it were a file system and parallelise the listing of prefixes. 1000 files each in table-data/a/, table-data/b/, and table-data/c/ is much better than 3000 files in table-data/.
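If you can change the table layout, the usual way to get that kind of structure in Athena is a partitioned table, so that a query only lists the prefixes for the dates it needs. A rough sketch (the Parquet format, bucket name, and partition column here are assumptions; adjust them to your data):

CREATE EXTERNAL TABLE database1 (
  column1 string,
  dt      string,
  column2 string
)
PARTITIONED BY (ymd string)  -- e.g. '20200314', one S3 prefix per day
STORED AS PARQUET            -- assumed format; CSV/JSON would also work
LOCATION 's3://my-bucket/table-data/';  -- placeholder bucket/path

-- register one day's prefix, e.g. s3://my-bucket/table-data/ymd=20200314/
ALTER TABLE database1 ADD PARTITION (ymd = '20200314')
LOCATION 's3://my-bucket/table-data/ymd=20200314/';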
The reason why I suspect it's lots of small files rather than a lot of data is that if it were a lot of data you would probably have said so – and lots of data is actually something Athena is really good at. Ripping through terabytes of data is no problem, unless it's a billion tiny files.
I have a table in my database that is indexed over three columns: PropertyId, ConceptId, and Sequence. This particular table has about 90,000 rows in it.
Now, when I run this query, the total time required is greater than 2 minutes:
SELECT *
FROM MSC_NPV
ORDER BY PropertyId, ConceptId, Sequence
However, if I paginate the query like so:
SELECT *
FROM MSC_NPV
ORDER BY PropertyId, ConceptId, Sequence
OFFSET x * 10000 ROWS
FETCH NEXT 10000 ROWS ONLY
the aggregate time (x goes from 0 to 8) required is only around 20 seconds.
This seems counterintuitive to me because the pagination requires additional operations over and beyond simpler queries, and we're also adding the latency of sequential network calls, since I haven't parallelized these queries at all. And I know it's not a caching issue, because running the queries one after the other does not affect the latencies very much.
So, my question is this: why is one so much faster than the other?
This seems counterintuitive to me because the pagination requires additional operations over and beyond simpler queries
Pagination queries can work very fast if you have the right index.
For example, with the query below
OFFSET x * 10000 ROWS
FETCH NEXT 10000 ROWS ONLY
the maximum number of rows you might read is only 20,000 (the OFFSET plus the FETCH), rather than the whole table. Below is an example that demonstrates this:
RunTimeCountersPerThread Thread="0" ActualRows="60" ActualRowsRead="60"
but with the plain SELECT * query you are reading all the rows in the table.
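One way to see the difference for yourself is to compare logical reads for the two statements, for example like this (the OFFSET value is just an example):

SET STATISTICS IO ON;

-- reads and sorts the whole table
SELECT *
FROM MSC_NPV
ORDER BY PropertyId, ConceptId, Sequence;

-- a single page: with a suitable index, reading can stop after OFFSET + FETCH rows
SELECT *
FROM MSC_NPV
ORDER BY PropertyId, ConceptId, Sequence
OFFSET 10000 ROWS
FETCH NEXT 10000 ROWS ONLY;

SET STATISTICS IO OFF;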
After a prolonged search into what's going on here, I discovered that the reason behind this difference in performance (over 2 minutes versus around 20 seconds) was that the database is hosted on Azure. Since Azure partitions any table you host on it across multiple partitions (i.e. multiple machines), running a query like:
SELECT *
FROM MSC_NPV
ORDER BY PropertyId, ConceptId, Sequence
would run more slowly, because the query pulls data in from all the partitions before ordering it, which could result in multiple queries across multiple partitions of the same table. By paginating the query over the indexed properties, I was looking at a particular partition and querying the data stored there, which is why it performed significantly better than the un-paginated query.
To prove this, I ran another query:
SELECT *
FROM MSC_NPV
ORDER BY Narrative
OFFSET x * 10000 ROWS
FETCH NEXT 10000 ROWS ONLY
This query ran much more slowly than the first paginated query, because Narrative is not a primary key and therefore is not used by Azure to build a partition key. So ordering on Narrative required the same work as the first query, plus additional operations on top of that, because the entire table had to be retrieved beforehand.
We've set up an Azure Search index on our Azure SQL database of ~2.7 million records, all contained in one Capture table. Every night, our data scrapers grab the latest data, truncate the Capture table, then rewrite all the latest data - most of which is a duplicate of what was just truncated, but with a small amount of new data. We don't have any feasible way of writing only the new records each day, due to the large amount of unstructured data in a couple of fields of each record.
How should we best manage our index in this scenario? Running the indexer on a schedule requires you to indicate a "high watermark column." Because of the nature of our database (erase/replace once a day), we don't have any column that would apply here. Further, our Azure Search index either needs to go through the same full daily erase/replace, or we need some other approach so that we don't keep adding 2.7 million duplicate records to the index every day. The former likely won't work for us because it takes a minimum of 4 hours to index our whole database. That's 4 hours during which clients (worldwide) may not have a full dataset to query against.
Can someone from Azure Search make a suggestion here?
What proportion of the data actually changes every day? If that proportion is small, then you don't need to recreate the search index. Simply reset the indexer after the SQL table has been recreated and trigger reindexing (resetting an indexer clears its high-water-mark state but doesn't change the target index). Even though reindexing may take several hours, your index is still there with the mostly full dataset in the meantime. Presumably, if you update the dataset once a day, your clients can tolerate a few hours of latency in picking up the latest data.
I am working in Cognos Report Studio 10.2.1. I have two query items. The first query item is the base table, which returns some millions of records. The second query item comes from a different table. I need to LEFT OUTER JOIN the first query item with the second. In a third query item, after the join, I filter on a date column in the format YYYYMM to return records falling under 201406, i.e. the current month and year. This column is common to both tables, along with AcctNo, which is used to join them.
The problem is that when I try to view Tabular Data, the report takes forever to run; after waiting patiently for 30 minutes, I just have to cancel it. When I add the same filter criteria on the date column to the 1st query item and then view the third query item, I get the output. But in the long run I have to join multiple tables with this base table, and for one of those tables the filter criteria needs to return output for two months. I am converting SAS code to Cognos; in the SAS code there is no filter on the base table, and even then the join query takes a few seconds to run.
My question is: is there any way to improve the performance of the query so that it completes, and more importantly completes in less time? Please note: modelling my query in FM (Framework Manager) is not an option in this case.
I was able to resolve this myself after much trial and error.
What I did was create a copy of the 1st query item, filter the original 1st query item on the current month and year, and add a two-month filter to the copy. That way I was able to run my query and get the desired results.
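In plain SQL terms, the idea is to push the date filter into the base query before the join instead of filtering the joined result. Roughly (all table and column names below are made up for illustration):

-- hypothetical names; the point is where the filter is applied
WITH base AS (
    SELECT AcctNo, YearMonth, SomeMeasure
    FROM   BaseTable
    WHERE  YearMonth = '201406'   -- current month and year
)
SELECT b.AcctNo, b.YearMonth, o.OtherColumn
FROM   base AS b
LEFT OUTER JOIN OtherTable AS o
       ON o.AcctNo = b.AcctNo;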
Though this is a rare scenario, I hope it helps someone else.
We've got a Windows Azure Table Storage system in which various entity types report values during the day, so we've got the following partition and row key scheme:
There are about 4000-5000 entities and 6 entity types, roughly evenly distributed, so around 800 of each.
PartitionKey: entityType-Date
RowKey: entityId
Each row records the values for an entity for that particular day. This is currently JSON serialized.
The data is quite verbose.
We will periodically want to look back at the values in these partitions over a month or two months depending on what our website users want to look at.
The problem is that if we want to query a month of data for one entity, we have to query 31 partition keys by entityId.
This is very slow initially but after the first call the result is cached.
Unfortunately the nature of the site is that there will be a varying number of different queries so it's unlikely the data will benefit much from caching.
We could obviously make the partitions bigger, e.g. a whole week of data per partition, and expand the row keys to entityId and date.
What other options are open to me, or is it simply the case that Windows Azure tables suffer from fairly high latency?
Some options include
Make the 31 queries in parallel
Make a single query on a partition key range, that is
Partition key >= entityType-StartDate and Partition key <= entityType-EndDate and Row key = entityId.
It is possible that, depending on your data, this query may have lower latency than your current approach.