SQL Query by code or codename? Which is supposed to be more optimized? - dynamics-crm-2011

I was checking through my reports written using SQL queries to see if I could further optimize my code, when I suddenly wondered:
"Would the system be faster if I filter by code instead of codename?"
Query #1:
SELECT owneridname, scheduledstart
FROM dbo.FilteredAppointment
WHERE statecode IN (1, 3)
Query #2:
SELECT owneridname, scheduledstart
FROM dbo.FilteredAppointment
WHERE statecode IN ('1', '3')
Query #3:
SELECT owneridname, scheduledstart
FROM dbo.FilteredAppointment
WHERE statecodename IN ('Completed', 'Scheduled')
My initial thoughts:
The codename is usually retrieved from the StringMap table, while the code resides in the base table, so it should be faster to filter the query by codes instead. Looking at the three queries above, I would have thought that querying by string would be slower.
But the results show otherwise: query #3 is the fastest of the three.
Since they all query the same view, I expected no vast difference in resource usage.
Again I was wrong. Query #3 was more than 50% less expensive than Query #1 and Query #2.
Results for the 3 queries in terms of "Est Subtree cost"
Query #1: Est Subtree cost: 0.325311
Query #2: Est Subtree cost: 0.325311
Query #3: Est Subtree cost: 0.190786
Can anyone explain why it behaves in this manner?

First of all, when it comes to performance comparisons, I always refer to Eric Lippert's excellent Which is Faster? blog post.
I believe you're reading your test results incorrectly. Since you're not actually running tests but looking at estimates, the cost shown is a percentage of the query as a whole; it cannot be directly compared to the percentage costs from another estimate.
For example, let's say the units of the cost are a percentage of the CPU required. Let's also assume that, since they are less complicated, Query #1 and Query #2 only require 100 CPU cycles, but Query #3 requires 1000 CPU cycles. If we apply the percentages to those numbers, this is the result:
Query 1: 33 Cycles
Query 2: 33 Cycles
Query 3: 190 Cycles
Even though Query 3 is a smaller percentage, it takes more actual resources.
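If you want figures that can actually be compared across the three queries, one option (a minimal sketch using SQL Server's standard statistics switches, nothing CRM-specific) is to measure actual I/O and CPU per query instead of relying on estimates:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Run one variant per batch and compare the logical reads and CPU time
-- reported in the Messages tab for each of the three queries.
SELECT owneridname, scheduledstart
FROM dbo.FilteredAppointment
WHERE statecode IN (1, 3);

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;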
I'm also going to guess that the performance difference is entirely negligible, and therefore you should be looking at other issues like readability / maintainability.
WHERE statecodename IN ('Completed', 'Scheduled') is definitely more readable, but much less maintainable (what if someone changes the text values to 'Complete' and 'Schedule'?). Therefore I would suggest this instead:
WHERE statecode IN (1, 3) -- ('Completed', 'Scheduled')
And stop preemptively worrying about performance.
"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." - Donald Knuth

Related

PySpark groupby strange behaviour

I am querying a large (2 trillion records) parquet file using PySpark, partitioned by two columns, month and day.
If I run a simple query as:
SELECT month, day, count(*) FROM mytable
WHERE month >= 201801 and month< 202301 -- two years data
GROUP BY month, day
ORDER BY month, day
the query is executed in 5 min or less. Super good performance!
If I remove the WHERE condition, it will bring in the whole data lake (4 years of data). That query takes 1.5 hours to execute.
This behaviour is far from normal. My guess is that it might be related to the large amount of data being queried on the worker nodes, leading to GC or shuffle, but it is just a guess.
How can I debug the above situation?
My understanding is that Spark should be clever enough to calculate per partition (since it is a distributed environment) and take around 5 min * 2 (double the years), not such a big difference.
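One way to start checking that assumption (a sketch; it assumes the data is registered as mytable and that month and day are the partition columns) is to compare the physical plans of the two queries and see whether the month predicate shows up as a partition filter on the scan:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With the month filter: the scan node should list PartitionFilters
# that prune the read down to the requested months.
spark.sql("""
    SELECT month, day, count(*)
    FROM mytable
    WHERE month >= 201801 AND month < 202301
    GROUP BY month, day
""").explain()

# Without the filter: expect empty PartitionFilters, i.e. a full scan
# of every partition in the data lake.
spark.sql("""
    SELECT month, day, count(*)
    FROM mytable
    GROUP BY month, day
""").explain()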
Edit1: Adding information from SparkUI
I will put up the screenshots of the two runs: 4 years of data (1.7 hours) and 3 years of data (7.5 min). The 4-year run is always shown first.
General overview
Job Page
Stage 1 - Heavy stage
Stage 2
SQL
Edit 2 - New findings - Scheduler delay
In the heavy task, I have found a scheduler delay.
If this is the case, what is the approach?
Thanks a lot!
I have found what the problem was.
By increasing the memory and cores (the cores not really being important) of the driver, the problem was solved.
How did I reach this conclusion?
First, I knew my data was not very skewed (as pointed out by @samkart and @Leonid Vasilev), but I checked again.
Second, all the metrics were very similar to each other, without great differences in the numbers, soooo, it had to be something else.
Third and lastly, I opened the stage event timeline and found a very interesting issue, see edit 2.
After further investigating why my scheduler was so delayed, I didn't find the real reason, but this sentence gave me the hint: the problem was in the driver.
Scheduler delay (blue) is the time spent waiting. There is something
that the executors are waiting for - often this is waiting for the
driver that controls and coordinates the jobs.
In that post, the author also mentions something very important that I wish to add:
See all that red and blue? This is a sure sign that something is up.
What we really want to see is lots of green - the proportion of time
spent doing work - I mean real work - the part where Spark does the
number crunching.
TL;DR:
The biggest problem came from the scheduler delay, which is closely related to the driver. Increasing the driver's memory (and vCPUs) solved the issue.
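For completeness, this is roughly where that change is made (the values below are illustrative, not a recommendation; driver resources have to be set before the driver JVM starts, e.g. via spark-submit or spark-defaults.conf):
# conf/spark-defaults.conf (or the equivalent job/cluster settings)
spark.driver.memory   16g
spark.driver.cores    4

# or as spark-submit flags:
#   spark-submit --driver-memory 16g --driver-cores 4 my_job.py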

Spark optimize "DataFrame.explain" / Catalyst

I've got a complex piece of software which performs really complex SQL queries (well, not queries, Spark plans, you know). The plans are dynamic: they change based on user input, so I can't "cache" them.
I've got a phase in which Spark takes 1.5-2 min building the plan. Just to make sure, I added "logXXX", then explain(true), then "logYYY", and it takes 1 minute 20 seconds for the explain alone to execute.
I've tried breaking the lineage, but this seems to cause worse performance because the actual execution time becomes longer.
I can't parallelize driver work (already did, but this task can't be overlapped with anything else).
Any ideas/guide on how to improve the plan builder in Spark? (like for example, flags to try enabling/disabling and such...)
Is there a way to cache plans in Spark? (so I can run that in parallel and then execute it)
I've tried disabling all possible optimizer rules and setting min iterations to 30... but nothing seems to affect that concrete point :S
I tried disabling wholeStageCodegen and it helped a little, but the execution got longer, so :).
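(For reference, this is roughly how those two knobs are set; the excluded rule name below is only an illustrative example, not a recommendation.)
# Exclude specific Catalyst optimizer rules (comma-separated, fully
# qualified class names) - illustrative rule name only.
spark.conf.set(
    "spark.sql.optimizer.excludedRules",
    "org.apache.spark.sql.catalyst.optimizer.CollapseRepartition")

# Disable whole-stage code generation.
spark.conf.set("spark.sql.codegen.wholeStage", "false")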
Thanks!
PS: The plan does contain multiple unions (<20, but quite complex plans inside each union) which are the cause for the time, but splitting them apart also affects execution time.
Just in case it helps someone (and if no one provides more insights): I couldn't manage to reduce the optimizer time (and, well, I'm not sure reducing it would be good, as I might lose execution time), so I worked around it instead.
One of the last parts of my plan was scanning two big tables and getting one column from each of them (using windows, aggregations etc...).
So I split my code into two parts:
1- The big plan (cached)
2- The small plan which scans and aggregates two big tables (cached)
And added one more part:
3- Left join/enrich the big plan with the output of "2" (this takes like 10 seconds, the dataset is not so big) and finish the remaining computation.
Now I launch both actions (1, 2) in parallel (using driver-level parallelism/threads), cache the resulting DataFrames, wait for both, and afterwards perform 3.
With this, while the Spark driver (thread 1) is calculating the big plan (~2 minutes), the executors are already executing part "2" (which has a small plan, but big scans/shuffles), and then both get "mixed" in some 10-15 seconds, which is a good improvement in execution time on top of the 1:30 I save while the plan is being calculated.
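In case it helps, the parallel part looks roughly like this (a sketch; big_plan_df, small_plan_df and join_key are placeholders for the two parts and the enrichment key described above):
from concurrent.futures import ThreadPoolExecutor

# Cache both parts and materialize them in parallel from the driver.
# While one driver thread is still in the optimizer for the big plan,
# the executors can already run the small plan's scans and shuffles.
big_plan_df = big_plan_df.cache()
small_plan_df = small_plan_df.cache()

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(df.count) for df in (big_plan_df, small_plan_df)]
    for f in futures:
        f.result()  # wait for both actions to complete

# Step 3: enrich the big plan with the already-cached small one.
result = big_plan_df.join(small_plan_df, on="join_key", how="left")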
Comparing times:
Before I would have
1:30 Spark optimizing time + 6 minutes execution time
Now I have
max
(
1:30 Spark Optimizing time + 4 minutes execution time,
0:02 Spark Optimizing time + 2 minutes execution time
)
+ 15 seconds joining both parts
Not so much, but quite a few "expensive" people will be waiting for it to finish :)

Spark: rdd.countApprox() vs rdd.count()

Could someone please explain the difference between RDD countApprox() and count(), and also, if possible, which is the fastest? It would be of great help. We have a requirement where count() is very slow, taking about 30 minutes. We tried countApprox(); it was fast on the first run (about 1.2 minutes) and then slowed to 30 minutes.
This is how we used it; not sure if it's the best way to use it:
rdd.countApprox(timeout=800, confidence=0.5)
count() - Returns the number of elements in the RDD.
CountApprox - Approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished.
countApprox(timeout: Long, confidence: Double)
Default: confidence = 0.95
Note: As per the spark source code, support for countApprox is marked 'Experimental'.
With timeout=800 (milliseconds), you should have seen an approximate count in well under a minute.
Are you sure nothing else is causing this slowdown of 30 minutes?
Share your code/code snippet to get more accurate input from other members.
Not my answer, but there is a very useful and important answer here.
In short, countApprox().getFinalValue() blocks, even if this takes longer than the timeout.
getInitialValue() does not block, so you will get a response within the timeout.
BUT, as I learned from painful experience, even if you use getInitialValue(), the process will continue on to the final value.
If you are repeating this in a loop, getFinalValue() will keep running for multiple RDDs long after you have retrieved the result from getInitialValue(). This can then lead to OOM conditions and broadcast errors that are difficult to diagnose.
rdd.count() is an action, which is an eager operation.
This means that all the other transformations you had written before it will start executing now, because of Spark's lazy evaluation. So essentially it's not only the count() operation that's taking all the time, but all the other operations which were waiting to get executed.
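If the same RDD is going to be counted (or reused) again, persisting it before the first count means the lineage is only computed once, e.g. (transformed_rdd is a placeholder):
# Persist so the pending transformations are evaluated only once.
transformed_rdd = transformed_rdd.cache()
first_count = transformed_rdd.count()    # triggers the full lineage
second_count = transformed_rdd.count()   # served from the cached partitions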
Now coming back to the question of count() vs countApprox().
count() is just like doing a select count(*) from a table. countApprox() takes a timeout and a confidence level and returns a result which is approximately correct - a number you can live with.
We should use countApprox() when we are more interested in knowing an approximate number and saving time, for example in a streaming application.
count() should be used when you need the exact count, for example to log something or for auditing.
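A minimal PySpark comparison of the two (the timeout is in milliseconds; the data and numbers are illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize(range(1_000_000))

exact = rdd.count()                                     # exact, waits for all tasks
approx = rdd.countApprox(timeout=800, confidence=0.9)   # best estimate after 800 ms
print(exact, approx)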

oracle: Is there a way to check what sql_id downgraded to serial or lesser degree over the period of time

I would like to know if there is a way to check which sql_ids were downgraded to serial or to a lesser degree of parallelism in an Oracle 4-node RAC data warehouse, version 11.2.0.3. I want to write a script to check which queries are downgraded.
SELECT NAME, inst_id, VALUE FROM GV$SYSSTAT
WHERE UPPER (NAME) LIKE '%PARALLEL OPERATIONS%'
OR UPPER (NAME) LIKE '%PARALLELIZED%' OR UPPER (NAME) LIKE '%PX%'
NAME VALUE
queries parallelized 56083
DML statements parallelized 6
DDL statements parallelized 160
DFO trees parallelized 56249
Parallel operations not downgraded 56128
Parallel operations downgraded to serial 951
Parallel operations downgraded 75 to 99 pct 0
Parallel operations downgraded 50 to 75 pct 0
Parallel operations downgraded 25 to 50 pct 119
Parallel operations downgraded 1 to 25 pct 2
Does it ever refresh? What conclusion can be drawn from the above output? Is it for a day? A month? An hour? Since startup?
This information is stored as part of Real-Time SQL Monitoring. But it requires licensing the Diagnostics and Tuning packs, and it only stores data for a short period of time.
Oracle 12c can supposedly store SQL Monitoring data for longer periods of time. If you don't have Oracle 12c, or if you don't have those options licensed, you'll need to create your own monitoring tool.
Real-Time SQL Monitoring of Parallel Downgrades
select /*+ parallel(1000) */ * from dba_objects;
select sql_id, sql_text, px_servers_requested, px_servers_allocated
from v$sql_monitor
where px_servers_requested <> px_servers_allocated;
SQL_ID SQL_TEXT PX_SERVERS_REQUESTED PX_SERVERS_ALLOCATED
6gtf8np006p9g select /*+ parallel ... 3000 64
Creating a (Simple) Historical Monitoring Tool
Simplicity is the key here. Real-Time SQL Monitoring is deceptively simple and you could easily spend weeks trying to recreate even a tiny portion of it. Keep in mind that you only need to sample a very small amount of all activity to get enough information to troubleshoot. For example, just store the results of GV$SESSION or GV$SQL_MONITOR (if you have the license) every minute. If the query doesn't show up from sampling every minute then it's not a performance issue and can be ignored.
For example: create a table with create table downgrade_check(sql_id varchar2(100), total number), and create a DBMS_SCHEDULER job that runs insert into downgrade_check select sql_id, count(*) total from gv$session where sql_id is not null group by sql_id; (a sketch follows below). Keep in mind that the count from GV$SESSION will rarely be exactly the same as the DOP.
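A minimal version of that sampler might look like the sketch below (the job name and interval are illustrative, and a sample_time column is added here so the history can be queried later):
-- Table to hold one row per sql_id per sample.
create table downgrade_check(
    sql_id      varchar2(100),
    total       number,
    sample_time date default sysdate
);

-- Sample GV$SESSION once a minute.
begin
    dbms_scheduler.create_job(
        job_name        => 'DOWNGRADE_CHECK_JOB',
        job_type        => 'PLSQL_BLOCK',
        job_action      => 'begin
                              insert into downgrade_check(sql_id, total)
                              select sql_id, count(*) from gv$session
                              where sql_id is not null group by sql_id;
                              commit;
                            end;',
        repeat_interval => 'FREQ=MINUTELY;INTERVAL=1',
        enabled         => true);
end;
/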
Other Questions
V$SYSSTAT is updated pretty frequently (every few seconds?), and represents the total number of events since the instance started.
It's difficult to draw many conclusions from those numbers. From my experience, having only 2% of your statements downgraded is a good sign. You likely have good (usually default) settings and not too many parallel jobs running at once.
However, some parallel queries run for seconds and some run for weeks. If the wrong job is downgraded even a single downgrade can be disastrous. Storing some historical session information (or using DBA_HIST_ACTIVE_SESSION_HISTORY) may help you find out if your critical jobs were affected.

Performance issues in Sybase 12.5 to Sybase 15 migration

We are in the process of migrating our DB to Sybase 15. The stored procedures which were working fine in Sybase 12.5 perform poorly in Sybase 15. However, when we add 'set merge_join off', Sybase 15 performs faster. Is there any way to use the Sybase 12.5 stored procs as they are in Sybase 15, or with minimal changes? Do we have any alternative ways apart from rewriting the whole stored proc?
I think this depends on how much time and energy you have to investigate Sybase 15 and use its new optimisers.
If this is a small app and you just want it working without clueing up on some or all of the new optimisers, index statistics, datachange and login triggers, then either use compatibility mode or, maybe better, restrict the optimiser to allrows_oltp, avoiding dss and mix (which would use hash joins and merge joins respectively).
If it's a big system and you have time, I think you should find out about the above, allow at least mix if not dss too, and make sure you:
- have index statistics up to date (it is much more important to have stats on the 2nd and subsequent columns of indexes to optimise correctly for merge and hash joins) - see the short snippet at the end of this answer;
- understand DATACHANGE (to find tables that need stats updates);
- use login triggers (they can be very useful to configure some sessions/users down or up optimisation levels - see the sypron website for Rob Verschoor's write-up);
- make sure you've got access to sp_showplan (use a tool, or get sa_role, or use Rob Verschoor's CIS technique to grant it).
The new optimisers are good, but I think it's true to say that they take time and energy to understand and make work. If you don't have time and energy and don't need the extra performance, just stick to allrows_oltp, or even compatibility mode (I don't have experience of the latter, but somehow it seems wrong to me.)
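For reference, the statistics housekeeping mentioned in the list above looks something like this in ASE 15 (my_table is a placeholder):
-- Refresh statistics on all columns of every index of the table
-- (covers the 2nd and subsequent index columns as well).
update index statistics my_table

-- How much has the table changed since statistics were last updated?
-- (percentage of rows changed; null partition/column means the whole table)
select datachange('my_table', null, null)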
There is compatibility mode in sybase 15.
http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc00967.1550/html/MigrationGuide/CBHJACAF.htm
I would say try to find the root cause of the issue. We too had an issue with one of our procs where the timing went up from 27 mins to 40 mins. When diagnosed and fixed, the proc took just 6 mins to complete (down from the original 27 mins). The ASE 15 optimizer and query processing are much better than 12.5.
If you don't have time, just set compatibility mode at session level for this proc:
"set compatibility_mode on"
But do compare the results.
Additionally, if you have time, do try using DBCC 302 and 310 (and 3604 for redirection of the output) to understand why the optimizer is choosing a particular LAVA operator.
Excellent Article by Rob V
The Sybase 15 optimizer uses more join algorithms, i.e. merge join, hash join, nested loop join, etc.
Whereas in Sybase 12.5, the most used join algorithm is the nested loop join.
Apart from switching compatibility mode on (this will use the Sybase 12.5 optimizer and won't give you any benefits of the Sybase 15 optimizer), you can play with various optimization goals.
In your case I suggest you set the optimization goal to "allrows_oltp" at server level, which will make your queries use only nested loop joins.
-- server-wide default:
sp_configure 'optimization goal', 0, 'allrows_oltp'
-- session-level setting (overrides server-wide setting):
set plan optgoal allrows_oltp
-- query-level setting (overrides server-wide and session-level settings):
select * from T1, T2 where T1.a = T2.b plan '(use optgoal allrows_oltp)'
allrows_oltp resembles Sybase 12.5 way very closely, and should be tried first before trying any other optimization goals.
Note: After setting to allrows_oltp, do proper testing to see if any other query got affected by this
More info about optimization goals can be found here
