Use of Subquery in Informix for Left Outer Join

I have inherited a slow query in Informix. I suspect part of the slowness is due to the use of subqueries to do left outer joins. Here is a sample of the code:
FROM intide_rec AS IDE
LEFT OUTER JOIN (SELECT idp_cmpy_id, idp_idc_ctl_no, idp_itm_ctl_no, idp_brh, idp_invt_typ, idp_frm, idp_grd, idp_size, idp_fnsh, idp_whs, idp_mill, idp_heat, idp_tag_no, idp_num_size1, idp_num_size2, idp_num_size3, idp_num_size4, idp_num_size5, idp_wdth, idp_lgth, idp_idia, idp_odia, idp_ga_size, idp_ohd_mat_val, idp_ohd_pcs, idp_ohd_wgt, idp_invt_sts, idp_invt_qlty, idp_bgt_for, idp_ownr_id FROM intidp_rec) AS IDP ON (IDE.ide_cmpy_id = IDP.idp_cmpy_id AND IDE.ide_idc_ctl_no = IDP.idp_idc_ctl_no)
LEFT OUTER JOIN (SELECT prm_pep, prm_frm, prm_grd, prm_size, prm_fnsh FROM inrprm_rec) AS PRM ON
(IDP.idp_frm = PRM.prm_frm AND IDP.idp_grd = PRM.prm_grd AND IDP.idp_size = PRM.prm_size AND IDP.idp_fnsh = PRM.prm_fnsh)
Notice that the subqueries are simply retrieving columns. There is no manipulation of the columns. What is odd to me is why there are SELECT statements, i.e. subqueries, here.
Why not just remove the subqueries, move the columns out of the subqueries and into the main SELECT statement since there is no manipulation of columns and write the joins like this:
FROM intide_rec AS IDE
LEFT OUTER JOIN intidp_rec AS IDP ON (IDE.ide_cmpy_id = IDP.idp_cmpy_id AND IDE.ide_idc_ctl_no = IDP.idp_idc_ctl_no)
LEFT OUTER JOIN inrprm_rec AS PRM ON (IDP.idp_frm = PRM.prm_frm AND IDP.idp_grd = PRM.prm_grd AND IDP.idp_size = PRM.prm_size AND IDP.idp_fnsh = PRM.prm_fnsh)
What are your thoughts on the original code and subqueries vs the way I have rewritten the code? Is it inefficient from a performance perspective? Or is it acceptable from a performance perspective?
Thanks for any thoughts.

One way to provide some answer is to analyze the output from SET EXPLAIN ON for the two queries. Ideally, there shouldn't be a difference between the query plans. If the query plans are demonstrably 'the same' or 'equivalent', then the optimizer is doing its stuff well. Determining that they are equivalent may be harder than either of us would like. However, if there is a major difference in the query plans, the subqueries probably are slower and your rewrite should be at least as fast as the original and probably faster. Also, remember that query plans are only indicative of what the optimizer thinks will happen — time the different queries on production data as well.
You don't mention which version of Informix you're using or which platform you're using it on. It probably doesn't matter and it must be a relatively recent version to support the LEFT OUTER JOIN notation (this millennium rather than the last, at any rate). However, it is beneficial to state that. Note that only versions 12.10 and 14.10 are under support unless you've made special arrangements with IBM or HCL.
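To make the "compare the plans, then time both" advice concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in, because it runs anywhere; Informix's SET EXPLAIN output and optimizer behaviour will differ. The table names (ide_tab, idp_tab), the cut-down column lists, and the sample rows are all hypothetical.

```python
import sqlite3
import time

# SQLite is only a stand-in here; table and column names are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ide_tab (cmpy_id TEXT, ctl_no INTEGER);
    CREATE TABLE idp_tab (cmpy_id TEXT, ctl_no INTEGER, frm TEXT);
    INSERT INTO ide_tab VALUES ('A', 1), ('A', 2), ('B', 3);
    INSERT INTO idp_tab VALUES ('A', 1, 'bar'), ('B', 3, 'rod');
""")

# Original style: the joined table is wrapped in a pass-through subquery.
WITH_SUBQUERY = """
    SELECT IDE.cmpy_id, IDE.ctl_no, IDP.frm
    FROM ide_tab AS IDE
    LEFT OUTER JOIN (SELECT cmpy_id, ctl_no, frm FROM idp_tab) AS IDP
      ON IDE.cmpy_id = IDP.cmpy_id AND IDE.ctl_no = IDP.ctl_no
"""

# Rewritten style: join the table directly.
DIRECT_JOIN = """
    SELECT IDE.cmpy_id, IDE.ctl_no, IDP.frm
    FROM ide_tab AS IDE
    LEFT OUTER JOIN idp_tab AS IDP
      ON IDE.cmpy_id = IDP.cmpy_id AND IDE.ctl_no = IDP.ctl_no
"""

def rows(sql):
    # Sort by repr so rows containing NULL (None) can still be ordered.
    return sorted(conn.execute(sql).fetchall(), key=repr)

def plan(sql):
    # The last column of EXPLAIN QUERY PLAN holds the human-readable step.
    return [r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql)]

def timed(sql, repeats=100):
    # Wall-clock time for repeated execution; plans are only estimates.
    start = time.perf_counter()
    for _ in range(repeats):
        conn.execute(sql).fetchall()
    return time.perf_counter() - start

same_rows = rows(WITH_SUBQUERY) == rows(DIRECT_JOIN)
```

The first thing to check is same_rows: a rewrite has to return the same result before its speed matters. After that, compare plan(...) for the two forms and the timed(...) figures on realistic data volumes.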

Related

Difference in using __icontains and lowercasing the query?

Is there any difference in performance if I lower case the query before going to use __contains or directly using __icontains. In code:
This
def search(request):
    # Default to "" so .lower() is never called on None when "q" is missing.
    query = request.GET.get("q", "").lower()
    if query:
        users = User.objects.filter(location__contains=query)
VS
def search(request):
    query = request.GET.get("q")
    if query:
        users = User.objects.filter(location__icontains=query)
I lowercased the location while inserting it into database. And, query is the query which can be in any cases.
Feel free to ask!!!
Usually, a case-insensitive search (or LIKE operation) is carried out by converting both the LHS and the RHS to the same case, either lower or upper.
Something like this,
SELECT *
FROM YourTable
WHERE UPPER(YourColumn) = UPPER('VALUE')
If you are sure that your DB column location only contains lowercase characters, the first option is better.
Note: You may not see any performance difference in small databases (around 10k entries), but you will see it in bigger ones.
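The two strategies can be compared side by side in a small sketch. It uses sqlite3 rather than Django, and the SQL is an illustration of the idea, not the SQL Django actually generates (that depends on the backend). The users table and its rows are hypothetical; instr() is SQLite's case-sensitive substring function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, location TEXT)")
# Locations are stored lowercased, as the question states.
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("ann", "new york"), ("bob", "york"), ("cy", "paris")],
)

def search_lowered(query):
    # Option 1: lowercase the query once, then do a case-sensitive match.
    sql = "SELECT name FROM users WHERE instr(location, ?) > 0 ORDER BY name"
    return [r[0] for r in conn.execute(sql, (query.lower(),))]

def search_case_insensitive(query):
    # Option 2: normalise both sides for every row (the UPPER(...) pattern).
    sql = ("SELECT name FROM users "
           "WHERE instr(upper(location), upper(?)) > 0 ORDER BY name")
    return [r[0] for r in conn.execute(sql, (query,))]
```

Both functions return the same names when the column really is all lowercase. The difference is that the second form applies a function to the column for every row it examines, which is the extra work that becomes visible on large tables.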

SQLite3 Simulate RIGHT OUTER JOIN with LEFT OUTER JOIN's without being able to change table order

I am new to SQL and have recently started implementing joins into my code, the data I wish to retrieve can be done with the following SQL statement. However, as you know SQLite3 does not support RIGHT OUTER and FULL OUTER JOINs.
Therefore I would like to re-write this statement using LEFT OUTER JOINs as only these are supported, any help would be appreciated.
Before you go ahead and mark this question as duplicate, I have looked at answers to other similar questions but none have explained the general rules when it comes to rearranging queries to use LEFT JOINs only.
I also think this particular example is slightly different in the sense that the table (periods) cannot be joined with either of the tables (teacher_subjects, classroom_subjects) without first joining the (class_placement) table.
FROM P
LEFT JOIN CP
ON P.PID = CP.PID
RIGHT JOIN CS
ON CP.CID = CS.CID
RIGHT JOIN TS
ON CP.TID = TS.TID
WHERE (CP.CID IS NULL
AND CP.TID IS NULL)
ORDER BY P.PID;
Unsurprisingly, the error I get from running this query is:
sqlite3.OperationalError: RIGHT and FULL OUTER JOINs are not currently supported
Sorry in advance if I am being really stupid but if you require any extra information please ask. Many Thanks.
Ignoring column order, x right join y on c is y left join x on c. This is commonly explicitly said. (But you can also just apply the definitions of the joins to your original expression to get subexpressions with the values you want.) You can read the FROM grammar to see how you can parenthesize or subquery for precedence. Applying the identity, with x, y and z standing for your three ON conditions in order, we get ts left join (cs left join (p left join cp on x) on y) on z.
Similarly, ignoring column order, x full join y on c is y full join x on c. Expressing full join in terms of left join & right join is a frequently asked duplicate.
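The right/left identity above can be checked with a small pure-Python model of the two joins over lists of dicts. This is a sketch of the definitions, not of how any database executes them; the sample rows and the column names (pid, cid, cs_cid, room) are hypothetical.

```python
def left_join(xs, ys, on):
    """Keep every row of xs; unmatched rows get None for ys' columns."""
    y_cols = sorted({k for y in ys for k in y})
    out = []
    for x in xs:
        matches = [y for y in ys if on(x, y)]
        if matches:
            out.extend({**x, **y} for y in matches)
        else:
            out.append({**x, **{c: None for c in y_cols}})
    return out

def right_join(xs, ys, on):
    """Keep every row of ys; unmatched rows get None for xs' columns."""
    x_cols = sorted({k for x in xs for k in x})
    out = []
    for y in ys:
        matches = [x for x in xs if on(x, y)]
        if matches:
            out.extend({**x, **y} for x in matches)
        else:
            out.append({**{c: None for c in x_cols}, **y})
    return out

def as_bag(rows):
    # Order-insensitive view of a result: ignores row order and column order.
    return sorted((tuple(sorted(r.items())) for r in rows), key=repr)

cp = [{"pid": 1, "cid": 10}]
cs = [{"cs_cid": 10, "room": "R1"}, {"cs_cid": 20, "room": "R2"}]
on = lambda x, y: x["cid"] == y["cs_cid"]

via_right = right_join(cp, cs, on)
via_left = left_join(cs, cp, lambda y, x: on(x, y))  # operands swapped
```

Comparing as_bag(via_right) with as_bag(via_left) shows the two results contain the same rows, including the unmatched R2 row padded with None.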

spark SQL scala DSL subquery support

Does SparkSQL support subquery? lists that currently no subquery support is available for spark 2.0.
Has this changed recently?
Your comment is correct. Your question is a little vague. However, I take your point and find also the concepts fine and also subject to this sort of question, so there you go.
So, this is now possible for the DataFrame API, not DataSet or DSL as you state.
SELECT A.dep_id,
A.employee_id,
A.age,
(SELECT MAX(age)
FROM employee B
WHERE A.dep_id = B.dep_id) max_age
FROM employee A
ORDER BY 1,2
An example - borrowed from the Internet, shows clearly the distinction between DS and DF implying that a SPARK SQL correlated sub-query (not shown here of course) does also not happen against a DataSet - by deduction:
sql("SELECT COUNT(*) FROM src").show()
val sqlDF = sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key")
val stringsDS = sqlDF.map {case Row(key: Int, value: String) => s"Key: $key, Value: $value"}
stringsDS.show()
The SQL runs against some source like Hive or Parquet or against SPARK TempViews, not against a DS. From a DF you can go to the DS and then enjoy the more typesafe approach, but only with the limited interface on select. I did a good search to find something that disproves this, but this is not the case. DS and DF are sort of interchangeable anyway as I have stated I think to you earlier. But, I see you are very thorough!
Moreover, there are at least 2 techniques for converting nested correlated subqueries to "normal" JOINs, which is what SPARK and indeed other optimizers do in the background, e.g. RewriteCorrelatedScalarSubquery and PullupCorrelatedPredicates.
But for a DSL, which you allude to, you can re-write your query by hand to achieve the same, by using JOIN, LEFT JOIN, OUTER JOIN, whatever the case may be. Except that is not so obvious for all oddly enough.
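As an illustration of rewriting by hand, the correlated scalar subquery from the example above can be expressed as a join against a grouped aggregate. The sketch below checks the equivalence with Python's sqlite3, used only as a convenient stand-in for Spark; the employee rows are made up, and the join form is one possible rewrite, not necessarily what Spark's optimizer produces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (dep_id INTEGER, employee_id INTEGER, age INTEGER);
    INSERT INTO employee VALUES (1, 100, 30), (1, 101, 45), (2, 200, 28);
""")

# The correlated scalar subquery, as in the SQL example above.
CORRELATED = """
    SELECT A.dep_id, A.employee_id, A.age,
           (SELECT MAX(age) FROM employee B WHERE A.dep_id = B.dep_id) AS max_age
    FROM employee A
    ORDER BY 1, 2
"""

# Hand rewrite: aggregate once per department, then join back.
JOINED = """
    SELECT A.dep_id, A.employee_id, A.age, M.max_age
    FROM employee A
    LEFT JOIN (SELECT dep_id, MAX(age) AS max_age
               FROM employee
               GROUP BY dep_id) M
      ON A.dep_id = M.dep_id
    ORDER BY 1, 2
"""

correlated_rows = conn.execute(CORRELATED).fetchall()
joined_rows = conn.execute(JOINED).fetchall()
```

The join form maps directly onto a groupBy/agg followed by a join in a DataFrame-style DSL, which is why it is the usual manual workaround when subqueries are not available.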

Raw sql with many columns

I'm building a CRUD application that pulls data using Persistent and executes a number of fairly complicated queries, for instance using window functions. Since these aren't supported by either Persistent or Esqueleto, I need to use raw sql.
A good example is that I want to select rows in which the value does not deviate strongly from the previous value, so in pseudo-sql the condition is WHERE val - lag(val) <= x. I need to run this selection in SQL, rather than pulling all data and then filtering in Haskell, because otherwise I'd have way too much data to handle.
These queries return many columns. However, the RawSql instance maxes out at tuples with 8 elements. So now I am writing additional functions from9, to9, from10, to10 and so on. And after that, all these are converted using functions with type (Single a, Single b, ...) -> DesiredType. Even though this could be shortened using code generation, the approach is simply hacky and clearly doesn't feel like good Haskell. This concerns me because I think most of my queries will require rawSql.
Do you have suggestions on how to improve this? Currently, my main thought is to un-normalize the database and duplicate data, e.g. by including the lagged value as column, so that I can query the data with Esqueleto.

Sparql 'langmatch' seems extremely slow on Virtuoso (DBpedia)

I have a sparql performance issue with DBpedia. I'd like to extract ordered information from DBpedia sparql endpoint page by page. My first example query looked like this:
select distinct ?objProperty ?label where {
?x ?objProperty <http://dbpedia.org/resource/United_States>.
?objProperty a owl:ObjectProperty.
OPTIONAL{?objProperty rdfs:label ?label}
}order by ?label limit 10 offset 3
It took about 2s on average for me (if you try it yourself and see timings under a second, increment 'offset', because DBpedia's Virtuoso seems to cache request results).
However the result returned is not suitable for pagination, because it is a mess of lines with labels from different languages. I want English language for labels and for precise pagination I want exactly 10 different object properties to be returned as a result. Also they have to be ordered by label. Ok. Another try:
select distinct ?objProperty ?label where {
?a ?objProperty <http://dbpedia.org/resource/United_States>.
?objProperty a owl:ObjectProperty.
OPTIONAL{?objProperty rdfs:label ?label}
FILTER ( LANGMATCHES(lang(?label),"EN") || LANG(?label) = "")
}order by ?label limit 10 offset 3
For me this request returned what I expected,.. but it was executed about 7 seconds on avg!!! So sloooow!!! Without order by and langmatch, query works about 1s on avg. Without order by but with langmatch, it takes about 6s, so it seems that langmatch eats ~ 5s on avg for this query.
I do not understand (these are questions by the way):
Am I doing something wrong? :)
Why langmatch slows query SOOO much? I wish langmatch is not regex based? If this performance issue is unavoidable using langmatch, is there a faster way to work with languages? If no, I can't imagine how semantic technologies would conquer the world in nearest future as people expect :))
Is there a better (faster) way to build pagination based requests than using limit/offset? If no, what is the best way to avoid performance issues like mentioned above with limit/offset?
1. Am I doing something wrong? :)
I think there's a slight issue that could make your query a bit faster. You've got the ?label as optional, but I think that the filter will only succeed when ?label is bound, effectively making ?label non-optional. My reasoning is as follows: in the case where ?label is not bound, the expression lang(?label) will be an error (unless an implementation extends lang()), and both langMatches and = expect non-error values, so we'd have this reduction:
langMatches(lang(?label),"en") || lang(?label) = ""
langMatches(error, "en") || error = ""
error || error
error (which FILTER treats like false, so the solution is dropped)
I'm basing this on section 17.2 of the SPARQL 1.1 recommendation, which says:
17.2 Filter Evaluation
Functions invoked with an argument of the wrong type will produce a type error. Effective boolean value arguments (labeled "xsd:boolean (EBV)" in the operator mapping table below), are coerced to xsd:boolean using the EBV rules in section 17.2.2.
Apart from BOUND, COALESCE, NOT EXISTS and EXISTS, all functions and operators operate on RDF Terms and will produce a type error if any arguments are unbound.
Any expression other than logical-or (||) or logical-and (&&) that encounters an error will produce that error.
Based on that, I'd rewrite the query as the following. My impression is that it's a little bit faster, but that might just be confirmation bias. It's not much faster, though.
select distinct ?p ?label where {
?x ?p dbpedia:United_States .
?p a owl:ObjectProperty ;
rdfs:label ?label .
filter( langMatches(lang(?label),"en") || lang(?label) = "" )
}
order by ?label
limit 10
offset 3
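The reasoning in point 1 (an unbound ?label makes the whole filter an error, so the row is dropped) can be modelled in a few lines. This is a simplified sketch of SPARQL's error-aware logical-or, not an engine implementation: the ERROR sentinel, the (text, language_tag) tuple used to represent a literal, and all function names are assumptions made for the example.

```python
class _Error:
    def __repr__(self):
        return "ERROR"

ERROR = _Error()  # stands in for a SPARQL type error

def sparql_or(a, b):
    # Logical-or: true wins over an error; otherwise any error propagates.
    if a is True or b is True:
        return True
    if a is ERROR or b is ERROR:
        return ERROR
    return False

def lang(label):
    # label is a hypothetical (text, language_tag) pair, or None if unbound.
    return ERROR if label is None else label[1]

def lang_matches(tag, language_range):
    # Simplified matching: exact or prefix match, case-insensitive.
    if tag is ERROR:
        return ERROR
    tag, language_range = tag.lower(), language_range.lower()
    return tag == language_range or tag.startswith(language_range + "-")

def equals(a, b):
    return ERROR if a is ERROR or b is ERROR else a == b

def filter_keeps(label):
    # FILTER keeps a solution only when the expression is true.
    result = sparql_or(lang_matches(lang(label), "en"), equals(lang(label), ""))
    return result is True
```

With this model, filter_keeps(None) is False: the OPTIONAL can leave ?label unbound, but the filter then rejects the row, so the label is effectively required.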
2. Why langmatch slows query SOOO much? I wish langmatch is not regex based? If this performance issue is unavoidable using langmatch, is there a faster way to work with languages?
The public DBpedia SPARQL endpoint can be a bit slow at times, but that doesn't seem to be the issue here. When I run your original query, or the new one above, it takes six or seven seconds to get the results. Two things to note though:
langMatches isn't regular expression based. The docs for langMatches say that "Returns true if language-tag (first argument) matches language-range (second argument) per the basic filtering scheme defined in RFC4647 section 3.3.1. language-range is a basic language range per Matching of Language Tags RFC4647 section 2.1. A language-range of "*" matches any non-empty language-tag string." The basic filtering is case insensitive, but it's not regex.
langMatches isn't the only thing that might be causing some slower results. Note that to find the first 10 of something (or, in general, the mth through the nth), you have to visit all the elements. You don't have to sort all of them, but you have to visit all of them, which means that there's no way to get just the results from the desired page (unless there's some special indexing going on; keep making this query and maybe it will speed up over time :)). This leads us into the next point, though.
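To show how little work the basic filtering scheme involves, here is a minimal Python sketch of it as described in the quoted documentation: a case-insensitive exact match or prefix match on a hyphen boundary, with "*" matching any non-empty tag. The function name lang_matches is ours, and this says nothing about how Virtuoso implements it internally.

```python
def lang_matches(tag: str, language_range: str) -> bool:
    """Basic filtering in the style of RFC 4647 section 3.3.1 (sketch)."""
    if language_range == "*":
        # "*" matches any non-empty language tag.
        return tag != ""
    tag, language_range = tag.lower(), language_range.lower()
    # Match the whole tag, or a prefix that ends at a "-" boundary.
    return tag == language_range or tag.startswith(language_range + "-")
```

For example, lang_matches("en-US", "EN") is true while lang_matches("eng", "en") is false. The check is a couple of string comparisons per literal, so the cost seen on the endpoint is more likely to come from how many literals have to be examined than from the matching itself.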
3. Is there a better (faster) way to build pagination based requests than using limit/offset? If no, what is the best way to avoid performance issues like mentioned above with limit/offset?
While the original and updated queries take six or seven seconds to retrieve the 10 results with limit 10, asking for limit 1000, or limit 5000, also only takes about six or seven seconds. Using limit/offset is the correct way to do pagination, but ordering the results can be expensive, since to find the elements in some particular range, you have to look at all the elements (though you don't necessarily have to order all the elements). It probably makes sense, then, to make those pages as big as possible, and to do any presentation paging locally. E.g., instead of running 100 queries for 10 results each (100 queries × 7 seconds = 700 seconds = 11 minutes and 40 seconds), you can run 1 query for 1000 results (1 query × 7 seconds = 7 seconds), and do any important paged presentation locally.
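Both ideas in this answer can be sketched in Python: selecting one ordered page requires visiting every element but not fully sorting them, and a single large result can be paged locally. The function names are hypothetical, and the 7-second figure is simply the timing reported above.

```python
import heapq

def top_page(items, limit, offset, key=None):
    # Visits every item once but only keeps offset + limit candidates,
    # so the whole input is never fully sorted.
    return heapq.nsmallest(offset + limit, items, key=key)[offset:]

def local_pages(results, page_size):
    # Split one large, already-ordered result list into display pages.
    for start in range(0, len(results), page_size):
        yield results[start:start + page_size]

SECONDS_PER_QUERY = 7  # rough timing observed in this answer

many_small_queries = 100 * SECONDS_PER_QUERY  # 100 pages of 10 results
one_large_query = 1 * SECONDS_PER_QUERY       # 1 page of 1000 results
```

In practice this means issuing the query once with a large limit, keeping the rows in memory, and feeding them through something like local_pages(rows, 10) for display.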
How a language filter is handled is up to the SPARQL engine. How does it store literals? Can it use indexes or another technique to avoid a full scan when fetching the literals for the desired language?
You can store a literal as a "chat"#en string, but selecting all English literals for a given property would then require scanning all of that property's literals for a #en match.
In some SPARQL engines you can get the actual execution plan. Virtuoso, for example, can show one (see "Virtuoso execution plan"); however, you can't use it on the public endpoint.
Query optimization, execution, and query hints are very well documented for RDBMSs: you can easily find out what the database really does to answer your query and how to modify the schema or query to get the best results. IMHO, SPARQL engines are not that mature in this respect.
