Performance issues in Sybase 12.5 to Sybase 15 migration - sap-ase

We are in the process of migrating our DB to Sybase 15. The stored procedures which were working fine in Sybase 12.5 perform poorly in Sybase 15. However, when we add 'set merge_join off', Sybase 15 performs faster. Is there any way to use the Sybase 12.5 stored procs as they are in Sybase 15, or with minimal changes? Are there any alternatives apart from rewriting the whole stored proc?

I think this depends on how much time and energy you have to investigate Sybase 15 and use its new optimisers.
If this is a small app and you just want it working without clueing up on some or all of the new optimisers, index statistics, datachange, login triggers, then use either compatibility mode or maybe better, restrict the Optimiser to be allrows_oltp, avoiding dss and mix (which would use hash joins and merge joins respectively.)
If it's a big system and you have time, I think you should find out about the above, allow at least mix if not dss too, and make sure you:
have index statistics up to date (it is much more important to have stats on the 2nd and subsequent columns of indexes so the optimiser can cost merge and hash joins correctly.)
understand DATACHANGE (to find tables that need stats updates.)
know about login triggers (can be very useful to configure optimisation levels down or up for some sessions/users - see the sypron website for Rob Verschoor's write-up.)
have access to sp_showplan (use a tool, or get sa_role, or use Rob Verschoor's CIS technique to grant.)
The new optimisers are good, but I think it's true to say that they take time and energy to understand and make work. If you don't have time and energy and don't need the extra performance, just stick to allrows_oltp, or even compatibility mode (I don't have experience of the latter, but somehow it seems wrong to me.)

There is compatibility mode in sybase 15.
http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc00967.1550/html/MigrationGuide/CBHJACAF.htm

I would say try to find the root cause of the issue. We too had an issue with one of our procs, where the run time went up from 27 mins to 40 mins. Once diagnosed and fixed, the proc took just 6 mins to complete (down from the original 27 mins). The ASE 15 optimizer and query processing are much better than in 12.5.
If you don't have time, just set the compatibility mode at session level for this proc.
"set compatibility_mode on"
But do compare the results.
Additionally, if you have time, try using DBCC trace flags 302 and 310, with 3604 to redirect the output to your session, to understand why the optimizer is choosing a particular Lava operator.
Excellent Article by Rob V

The Sybase 15 optimizer can choose from more join algorithms, i.e. merge join, hash join, nested loop join, etc.
Whereas in Sybase 12.5, the most used join algorithm is the nested loop join.
Apart from switching compatibility mode on (this will use the Sybase 12.5 optimizer and won't give you any of the benefits of the Sybase 15 optimizer), you can play with the various optimization goals.
In your case I suggest you set the optimization goal to "allrows_oltp" at server level, which will only use nested loop joins in your queries.
-- server-wide default:
sp_configure 'optimization goal', 0, 'allrows_oltp'
-- session-level setting (overrides server-wide setting):
set plan optgoal allrows_oltp
-- query-level setting (overrides server-wide and session-level settings):
select * from T1, T2 where T1.a = T2.b plan '(use optgoal allrows_oltp)'
allrows_oltp resembles Sybase 12.5 way very closely, and should be tried first before trying any other optimization goals.
Note: After setting allrows_oltp, test properly to see whether any other query is affected by it.
More info about optimization goals can be found here

Related

In HPCC ECL, when running a LOCAL, LOOKUP JOIN, does the RHS dataset get copied to all nodes, or is it kept distributed due to LOCAL?

Say I have a cluster of 400 machines, and 2 datasets. some_dataset_1 has 100M records, some_dataset_2 has 1M. I then run:
ds1:=DISTRIBUTE(some_dataset_1,hash(field_a));
ds2:=DISTRIBUTE(some_dataset_2,hash(field_b));
Then, I run the join:
j1:=JOIN(ds1,ds2,LEFT.field_a=LEFT.field_b,LOOKUP,LOCAL);
Will the distribution of ds2 "mess up" the join, meaning parts of ds2 will be incorrectly scattered across the cluster leading to low match rate?
Or, will the LOOKUP keyword take precedence and the distributed ds2 will get copied in full to each node, thus rendering the distribution irrelevant, and allowing the join to find all the possible matches (as each node will have a full copy of ds2).
I know I can test this myself and come to my own conclusion, but I am looking for a definitive answer based on the way the language is written to make sure I understand and can use these options correctly.
For reference (from the Language Reference document v 7.0.0):
LOOKUP: Specifies the rightrecset is a relatively small file of lookup records that can be fully copied to every node.
LOCAL: Specifies the operation is performed on each supercomputer node independently, without requiring interaction with all other nodes to acquire data; the operation maintains the distribution of any previous DISTRIBUTE
It seems that with the LOCAL, the join completes more quickly. There does not seem to be a loss of matches on initial trials. I am working with others to run a more thorough test and will post the results here.
First, your code:
ds1:=DISTRIBUTE(some_dataset_1,hash(field_a));
ds2:=DISTRIBUTE(some_dataset_2,hash(field_b));
Since you're intending these results to be used in a JOIN, it is imperative that both datasets are distributed on the "same" data, so that the matching values end up on the same nodes so that your JOIN can be done with the LOCAL option. So this will only work correctly if ds1.field_a and ds2.field_b contain the "same" data.
Then, your join code. I assume you've made a typo in this post, because your join code needs to be (to work at all):
j1:=JOIN(ds1,ds2,LEFT.field_a=RIGHT.field_b,LOOKUP,LOCAL);
Using both LOOKUP and LOCAL options is redundant because a LOOKUP JOIN is implicitly a LOCAL operation. That means your LOOKUP option does "override" the LOCAL in this instance.
So, all that means that you should either do it this way:
ds1:=DISTRIBUTE(some_dataset_1,hash(field_a));
ds2:=DISTRIBUTE(some_dataset_2,hash(field_b));
j1:=JOIN(ds1,ds2,LEFT.field_a=RIGHT.field_b,LOCAL);
Or this way:
j1:=JOIN(some_dataset_1,some_dataset_2,LEFT.field_a=RIGHT.field_b,LOOKUP);
Because the LOOKUP option does copy the entire right-hand dataset (in memory) to every node, it makes the JOIN implicitly a LOCAL operation and you do not need to do the DISTRIBUTEs. Which way you choose to do it is up to you.
However, I see from your Language Reference version that you may be unaware of the SMART option on JOIN, which in my current Language Reference (8.10.10) says:
SMART -- Specifies to use an in-memory lookup when possible, but use a
distributed join if the right dataset is large.
So you could just do it this way:
j1:=JOIN(some_dataset_1,some_dataset_2,LEFT.field_a=RIGHT.field_b,SMART);
and let the platform figure out which is best.
HTH,
Richard
Thank you, Richard. Yes, I am notorious for typos. I apologize. As I use a lot of legacy code, I have not had a chance to work with the SMART option, but I will certainly keep that in mind for me and the team, so thank you for that!
However, I did run a test to evaluate how the compiler and the platform would handle this scenario. I ran the following code:
sd1:=DATASET(100000,TRANSFORM({unsigned8 num1},SELF.num1 := COUNTER ));
sd2:=DATASET(1000,TRANSFORM({unsigned8 num1, unsigned8 num2},SELF.num1 := COUNTER , SELF.num2 := COUNTER % 10 ));
ds1:=DISTRIBUTE(sd1,hash(num1));
ds4:=DISTRIBUTE(sd1,random());
ds2:=DISTRIBUTE(sd2,hash(num1));
ds3:=DISTRIBUTE(sd2,hash(num2));
j11:=JOIN(sd1,sd2,LEFT.num1=RIGHT.num1 ):independent;
j12:=JOIN(sd1,sd2,LEFT.num1=RIGHT.num1,LOOKUP ):independent;
j13:=JOIN(sd1,sd2,LEFT.num1=RIGHT.num1, LOCAL):independent;
j14:=JOIN(sd1,sd2,LEFT.num1=RIGHT.num1,LOOKUP,LOCAL):independent;
j21:=JOIN(ds1,ds2,LEFT.num1=RIGHT.num1 ):independent;
j22:=JOIN(ds1,ds2,LEFT.num1=RIGHT.num1,LOOKUP ):independent;
j23:=JOIN(ds1,ds2,LEFT.num1=RIGHT.num1, LOCAL):independent;
j24:=JOIN(ds1,ds2,LEFT.num1=RIGHT.num1,LOOKUP,LOCAL):independent;
j31:=JOIN(ds1,ds3,LEFT.num1=RIGHT.num1 ):independent;
j32:=JOIN(ds1,ds3,LEFT.num1=RIGHT.num1,LOOKUP ):independent;
j33:=JOIN(ds1,ds3,LEFT.num1=RIGHT.num1, LOCAL):independent;
j34:=JOIN(ds1,ds3,LEFT.num1=RIGHT.num1,LOOKUP,LOCAL):independent;
j41:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1 ):independent;
j42:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1,LOOKUP ):independent;
j43:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1, LOCAL):independent;
j44:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1,LOOKUP,LOCAL):independent;
j51:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1 ):independent;
j52:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1,LOOKUP ):independent;
j53:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1, LOCAL,HASH):independent;
j54:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1,LOOKUP,LOCAL,HASH):independent;
dataset([{count(j11),'11'},{count(j12),'12'},{count(j13),'13'},{count(j14),'14'},
{count(j21),'21'},{count(j22),'22'},{count(j23),'23'},{count(j24),'24'},
{count(j31),'31'},{count(j32),'32'},{count(j33),'33'},{count(j34),'34'},
{count(j31),'41'},{count(j32),'42'},{count(j33),'43'},{count(j44),'44'},
{count(j51),'51'},{count(j52),'52'},{count(j53),'53'},{count(j54),'54'}
] , {unsigned8 num, string lbl});
On a 400 node cluster, the results come back as:
##    num     lbl
1     1000    11
2     1000    12
3     1000    13
4     1000    14
5     1000    21
6     1000    22
7     1000    23
8     1000    24
9     1000    31
10    1000    32
11    12      33
12    12      34
13    1000    41
14    1000    42
15    12      43
16    6       44
17    1000    51
18    1000    52
19    1       53
20    1       54
If you look at the row 12 in the result ( lbl 34 ), you will notice the match rate drops substantially, suggesting the compiler does indeed distribute the file (with the wrong hashed field) and disregard the LOOKUP option.
My conclusion is therefore that as always, it remains the developer's responsibility to ensure the distribution is right ahead of the join REGARDLESS of which join options are being used.
The manual page could be better. LOOKUP by itself is properly documented, and LOCAL by itself is properly documented. However, they represent two different concepts and can be combined without issue, so that JOIN(,,, LOOKUP, LOCAL) makes sense and can be useful.
It is probably best to consider LOOKUP as a specific kind of JOIN matching algorithm and to consider LOCAL as a way to tell the compiler that you are not a novice and that you are absolutely sure the data is already where it needs to be to accomplish what you intend.
For a normal LOOKUP join the LEFT-hand side doesn't need to be sorted or distributed in any particular way, and the whole RIGHT-hand side is copied to every slave. No matter what join value appears on the LEFT, if there is a matching value on the RIGHT then it will be found because the whole RIGHT dataset is present.
In a 400-way system with well-distributed join values, IF the LEFT side is distributed on the join value, then the LEFT dataset in each worker only contains 1/400th of the join values and only 1/400th of the values in the RIGHT dataset will ever be matched. Effectively, within each worker, 399/400th of the RIGHT data will be unused.
However, if both the LEFT and RIGHT datasets are distributed on the join value ... and you are not a novice and know that using LOCAL is what you want ... then you can specify a LOOKUP, LOCAL join. The RIGHT data is already where it needs to be. Any join value that appears in the LEFT data will, if the value exists, find a match locally in the RIGHT dataset. As a bonus, the RIGHT data only contains join values that could match ... it is only 1/400th of the LOOKUP only size.
This enables larger LOOKUP joins. Imagine your 400-way system and a 100GB RIGHT dataset that you would like to use in a LOOKUP join. Copying a 100GB dataset to each slave seems unlikely to work. However, if evenly distributed, a LOOKUP, LOCAL join only requires 250MB of RIGHT data per worker ... which seems quite reasonable.
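The behaviour described above can be mimicked with a toy model. This is a minimal sketch in plain Python, not HPCC code: the `distribute` and `local_lookup_join` helpers are hypothetical names, and rows are placed on nodes with a simple modulo rather than ECL's HASH, so the exact number of lost matches will differ from a real cluster. It only illustrates that a per-node lookup finds every match when both sides are distributed on the join value and loses matches when the RIGHT side is distributed on another field.

```python
# Toy model of a LOOKUP, LOCAL join: each node joins only the rows it already holds.
# Assumption: node placement is "key % NODES", a stand-in for DISTRIBUTE(ds, HASH(field)).
NODES = 400

def distribute(rows, key):
    """Place each row on node key(row) % NODES."""
    parts = [[] for _ in range(NODES)]
    for row in rows:
        parts[key(row) % NODES].append(row)
    return parts

def local_lookup_join(left_parts, right_parts):
    """Per node: build an in-memory table from the local RIGHT rows, probe it with local LEFT rows."""
    matches = 0
    for left, right in zip(left_parts, right_parts):
        table = {}
        for num1, num2 in right:
            table.setdefault(num1, []).append(num2)
        matches += sum(len(table.get(num1, [])) for (num1,) in left)
    return matches

left = [(n,) for n in range(1, 100_001)]        # like sd1: num1 = COUNTER
right = [(n, n % 10) for n in range(1, 1_001)]  # like sd2: num1 = COUNTER, num2 = COUNTER % 10

ds1 = distribute(left, key=lambda r: r[0])      # LEFT distributed on the join value
ds2 = distribute(right, key=lambda r: r[0])     # RIGHT distributed on the join value
ds3 = distribute(right, key=lambda r: r[1])     # RIGHT distributed on the wrong field

good = local_lookup_join(ds1, ds2)              # every RIGHT row sits next to its LEFT match
bad = local_lookup_join(ds1, ds3)               # most RIGHT rows sit on the wrong node

# Sizing from the paragraph above: 100GB of RIGHT data spread evenly over 400 workers, in MB.
per_worker_mb = 100 * 1000 / NODES
print(good, bad, per_worker_mb)
```

In this model the correctly distributed join finds all 1000 matches, while the mis-distributed one finds only a small fraction of them, which is the same pattern as rows 33 and 34 in the test results earlier in the thread.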
HTH

Spark optimize "DataFrame.explain" / Catalyst

I've got complex software which performs really complex SQL queries (well, not queries, Spark plans). The plans are dynamic: they change based on user input, so I can't "cache" them.
I've got a phase in which Spark takes 1.5-2 min building the plan. Just to make sure, I added "logXXX", then explain(true), then "logYYY", and it takes 1 minute 20 seconds for the explain to execute.
I've tried breaking the lineage, but this seems to cause worse performance because the actual execution time becomes longer.
I can't parallelize driver work (already did, but this task can't be overlapped with anything else).
Any ideas/guide on how to improve the plan builder in Spark? (like for example, flags to try enabling/disabling and such...)
Is there a way to cache plans in Spark? (so I can run that in parallel and then execute it)
I've tried disabling all possible optimizer rules, setting min iterations to 30... but nothing seems to affect that concrete point :S
I tried disabling wholeStageCodegen and it helped a little, but the execution is longer so :).
Thanks!,
PS: The plan does contain multiple unions (<20, but quite complex plans inside each union) which are the cause for the time, but splitting them apart also affects execution time.
Just in case it helps someone (and if no-one provides more insights).
I couldn't manage to reduce the optimizer time (and I'm not sure reducing it would be good anyway, as I might lose execution time).
One of the last parts of my plan was scanning two big tables and getting one column from each of them (using windows, aggregations etc...).
So I split my code into two parts:
1- The big plan (cached)
2- The small plan which scans and aggregates two big tables (cached)
And added one more part:
3- Left Join/enrich the big plan with the output of "2" (this takes like 10seconds, the dataset is not so big) and finish the remainder computation.
Now I launch both actions (1, 2) in parallel (using driver-level parallelism/threads), cache the resulting DataFrames, wait for both, and afterwards perform 3.
With this, while the Spark driver (thread 1) is calculating the big plan (~2 minutes), the executors are executing part "2" (which has a small plan, but big scans/shuffles), and then both get "mixed" in about 10-15 seconds. That is a good improvement in execution time, because the executors do useful work during the ~1:30 the driver spends calculating the plan.
Comparing times:
Before I would have
1:30 Spark optimizing time + 6 minutes execution time
Now I have
max(1:30 Spark optimizing time + 4 minutes execution time, 0:02 Spark optimizing time + 2 minutes execution time) + 15 seconds joining both parts
Not so much, but quite a few "expensive" people will be waiting for it to finish :)
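Working the comparison above through in seconds makes the saving explicit. This is a small arithmetic sketch, assuming "0:02" means 2 seconds (the same minutes:seconds notation as "1:30"); the `seconds` helper is a hypothetical name used only here.

```python
def seconds(minutes=0, secs=0):
    """Convert a minutes:seconds duration to seconds."""
    return minutes * 60 + secs

# Before: one plan, optimized and executed back to back.
before = seconds(1, 30) + seconds(6)

# Now: the two branches run in parallel, so only the slower one counts.
big_branch = seconds(1, 30) + seconds(4)    # big plan: optimizing + execution
small_branch = seconds(0, 2) + seconds(2)   # small plan: optimizing + execution
after = max(big_branch, small_branch) + seconds(0, 15)  # plus the final join

saved = before - after
print(before, after, saved)
```

Under these figures the wall-clock time drops from 450 seconds to 345 seconds, a saving of 105 seconds per run.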

SPARK parallelization of algorithm - non-typical, how to

I have a processing requirement that does not seem to fit the nice SPARK parallelization use cases. On the other hand, I may not see how it can be done in SPARK easily.
I am seeking the easiest way to parallelize the following situation:
1. Given a set of N records of record type A,
2. perform some processing on A records that generates a not yet existing set of initial results, say, of J records of record type B. Record type B has a data range aspect to it.
3. Then repeat the process for the A set of records not yet processed - the leftovers - for any records generated as part of B, but look to the left and to the right of the A records.
4. Repeat 3 until no new records generated.
This may sound odd, but it is nothing more than taking a set of trading records, and deciding for a given computed period Pn, if there is a bull or bear spread evident during this period. Once that initial period is found, then date-wise before Pn and after Pn, one can attempt to look for a bull or bear spread period that precedes or follows the initial Pn period. And so on. It all works correctly.
The algorithm I designed works on inserting records using SQL and some looping. The records generated do not exist initially and get created on the fly. I looked at dataframes and RDDs, but it is not so evident (to me) how one would do this.
Using SQL it is not such a difficult algorithm, but you need to work through the records of a given logical key set sequentially. Thus not a typical SPARK use case.
My questions are then:
How can I achieve at the very least parallelization?
Should we use mapPartitions in some way so as to at least get ranges of logical key sets to process, or is this simply not possible given the use case I attempt to present? I am going to try this, but feel I may be barking up the wrong tree here. It may just need to be a loop / while in the driver running single thread.
Some example A records, shown in tabular format, as per how this algorithm works:
        Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep
key X    -5    1    0   10    9  -20    0    5    7
would result in record B's being generated as follows:
key X Jan - Feb --> Bear
key X Apr - Jun --> Bull
This falls into the category of non-typical Spark. I solved it via a loop within a loop in Spark Scala, but with JDBC usage; it could just as well have been a plain Scala JDBC program. I also tried a variation with foreachPartition.
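The general shape the question is after, parallelism across logical keys with strictly sequential processing inside each key, can be sketched outside Spark. This is a minimal illustration in plain Python with a thread pool, not the poster's solution: `find_runs` is a hypothetical stand-in step (it only finds runs of positive values) and does not implement the bull/bear logic, and key "Y" is invented test data.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def find_runs(values):
    """Stand-in sequential step: maximal runs of strictly positive values, as (start, end) indexes."""
    runs, start = [], None
    for i, v in enumerate(values):
        if v > 0 and start is None:
            start = i
        elif v <= 0 and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(values) - 1))
    return runs

def process_by_key(records, workers=4):
    """Group (key, position, value) records by key, then handle each key on its own worker."""
    groups = defaultdict(list)
    for key, pos, value in records:
        groups[key].append((pos, value))

    def work(item):
        key, rows = item
        ordered = [v for _, v in sorted(rows)]  # sequential order within one key
        return key, find_runs(ordered)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(work, groups.items()))

values_x = [-5, 1, 0, 10, 9, -20, 0, 5, 7]      # key X from the table above, Jan..Sep
records = [("X", i, v) for i, v in enumerate(values_x)]
records += [("Y", i, -v) for i, v in enumerate(values_x)]  # invented second key
result = process_by_key(records)
print(result)
```

In Spark terms, the grouping step corresponds to partitioning by the logical key and the per-key function to what would run inside mapPartitions or foreachPartition; whether that fits depends on the per-key work being independent of other keys.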

Spark streaming - waiting for data for window aggregations?

I have data in the format { host | metric | value | time-stamp }. We have hosts all around the world reporting metrics.
I'm a little confused about using window operations (say, 1 hour) to process data like this.
Can I tell my window when to start, or does it just start when the application starts? I want to ensure I'm aggregating all data from hour 11 of the day, for example. If my window starts at 10:50, I'll just get 10:50-11:50 and miss 10 minutes.
Even if the window is perfect, data may arrive late.
How do people handle this kind of issue? Do they make windows far bigger than needed and just grab the data they care about on every batch cycle (kind of sliding)?
In the past, I worked on a large-scale IoT platform and solved that problem by considering that the windows were only partial calculations. I modeled the backend (Cassandra) to receive more than 1 record for each window. The actual value of any given window would be the addition of all -potentially partial- records found for that window.
So, a perfect window would be 1 record, a split window would be 2 records, late-arrivals are naturally supported but only accepted up to a certain 'age' threshold. Reconciliation was done at read time. As this platform was orders of magnitude heavier in terms of writes vs reads, it made for a good compromise.
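The partial-record approach described above can be sketched as follows. This is a minimal in-memory illustration in Python, not the original Cassandra-backed design: the `store` dictionary, the function names, and the six-hour lateness threshold are all assumptions made for the example. Each batch writes one partial record per clock-hour window it touches, and the value of a window is the sum of its partial records at read time.

```python
from collections import defaultdict

WINDOW_SECONDS = 3600        # one-hour windows aligned to the clock
MAX_AGE_SECONDS = 6 * 3600   # hypothetical lateness threshold

store = defaultdict(list)    # (host, metric, window_start) -> list of partial sums

def window_start(ts):
    """Align an event timestamp (in seconds) to the start of its clock hour."""
    return ts - ts % WINDOW_SECONDS

def write_batch(events, now):
    """Aggregate one batch and append one partial record per window it touches."""
    partials = defaultdict(float)
    for host, metric, value, ts in events:
        if now - ts > MAX_AGE_SECONDS:
            continue  # older than the accepted age: dropped
        partials[(host, metric, window_start(ts))] += value
    for key, partial in partials.items():
        store[key].append(partial)

def read_window(host, metric, start):
    """Reconcile at read time: the window's value is the sum of its partial records."""
    return sum(store[(host, metric, start)])

hour_10, hour_11 = 10 * 3600, 11 * 3600
# First batch straddles the hour boundary; the second contains a late arrival for hour 11.
write_batch([("h1", "cpu", 1.0, hour_11 - 600), ("h1", "cpu", 2.0, hour_11 + 10)], now=hour_11 + 60)
write_batch([("h1", "cpu", 4.0, hour_11 + 3000), ("h1", "cpu", 8.0, hour_11 + 20)], now=hour_11 + 3700)
total = read_window("h1", "cpu", hour_11)
print(total, store[("h1", "cpu", hour_11)])
```

Because windows are keyed by the event's own timestamp, it does not matter when the application started or which batch an event arrived in; hour 11 ends up with two partial records that are added together on read.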
After speaking with people in depth on MapR forums, the consensus seems to be that hourly and daily aggregations should not be done in a stream, but rather in a separate batch job once the data is ready.
When doing streaming you should stick to small batches with windows that are relatively small multiples of the streaming interval. Sliding windows can be useful for, say, trends over the last 50 batches. Using them for tasks as large as an hour or a day doesn't seem sensible though.
Also, I don't believe you can tell your batches when to start/stop, etc.

Oracle: Is there a way to check which sql_ids were downgraded to serial or a lesser degree over a period of time?

I would like to know if there is a way to check sql_ids that were downgraded to either serial or lesser degree in an Oracle 4-node RAC Data warehouse, version 11.2.0.3. I want to write a script and check the queries that are downgraded.
SELECT NAME, inst_id, VALUE FROM GV$SYSSTAT
WHERE UPPER (NAME) LIKE '%PARALLEL OPERATIONS%'
OR UPPER (NAME) LIKE '%PARALLELIZED%' OR UPPER (NAME) LIKE '%PX%'
NAME                                           VALUE
queries parallelized                           56083
DML statements parallelized                    6
DDL statements parallelized                    160
DFO trees parallelized                         56249
Parallel operations not downgraded             56128
Parallel operations downgraded to serial       951
Parallel operations downgraded 75 to 99 pct    0
Parallel operations downgraded 50 to 75 pct    0
Parallel operations downgraded 25 to 50 pct    119
Parallel operations downgraded 1 to 25 pct     2
Does it ever refresh? What conclusion can be drawn from above output? Is it for a day? month? hour? since startup?
This information is stored as part of Real-Time SQL Monitoring. But it requires licensing the Diagnostics and Tuning packs, and it only stores data for a short period of time.
Oracle 12c can supposedly store SQL Monitoring data for longer periods of time. If you don't have Oracle 12c, or if you don't have those options licensed, you'll need to create your own monitoring tool.
Real-Time SQL Monitoring of Parallel Downgrades
select /*+ parallel(1000) */ * from dba_objects;
select sql_id, sql_text, px_servers_requested, px_servers_allocated
from v$sql_monitor
where px_servers_requested <> px_servers_allocated;
SQL_ID SQL_TEXT PX_SERVERS_REQUESTED PX_SERVERS_ALLOCATED
6gtf8np006p9g select /*+ parallel ... 3000 64
Creating a (Simple) Historical Monitoring Tool
Simplicity is the key here. Real-Time SQL Monitoring is deceptively simple and you could easily spend weeks trying to recreate even a tiny portion of it. Keep in mind that you only need to sample a very small amount of all activity to get enough information to troubleshoot. For example, just store the results of GV$SESSION or GV$SQL_MONITOR (if you have the license) every minute. If the query doesn't show up from sampling every minute then it's not a performance issue and can be ignored.
For example, create a table:
create table downgrade_check(sql_id varchar2(100), total number);
Then create a job with DBMS_SCHEDULER to run:
insert into downgrade_check select sql_id, count(*) total from gv$session where sql_id is not null group by sql_id;
Bear in mind that the count from GV$SESSION will rarely be exactly the same as the DOP.
Other Questions
V$SYSSTAT is updated pretty frequently (every few seconds?), and represents the total number of events since the instance started.
It's difficult to draw many conclusions from those numbers. From my experience, having only 2% of your statements downgraded is a good sign. You likely have good (usually default) settings and not too many parallel jobs running at once.
However, some parallel queries run for seconds and some run for weeks. If the wrong job is downgraded, even a single downgrade can be disastrous. Storing some historical session information (or using DBA_HIST_ACTIVE_SESS_HISTORY) may help you find out if your critical jobs were affected.
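The "about 2%" figure mentioned above can be checked against the V$SYSSTAT values posted in the question. This is a small arithmetic sketch in Python that treats the "Parallel operations ..." counters as the full population, which is an assumption about how to read those statistics rather than something the thread states.

```python
# Counter values copied from the V$SYSSTAT output in the question.
stats = {
    "Parallel operations not downgraded": 56128,
    "Parallel operations downgraded to serial": 951,
    "Parallel operations downgraded 75 to 99 pct": 0,
    "Parallel operations downgraded 50 to 75 pct": 0,
    "Parallel operations downgraded 25 to 50 pct": 119,
    "Parallel operations downgraded 1 to 25 pct": 2,
}

# Every counter except "not downgraded" records some level of downgrade.
downgraded = sum(v for name, v in stats.items() if "not downgraded" not in name)
total = downgraded + stats["Parallel operations not downgraded"]
pct = 100 * downgraded / total
print(downgraded, total, round(pct, 1))
```

That gives 1072 downgraded operations out of 57200, roughly 1.9% since instance startup, most of them downgraded all the way to serial.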
