Sparql 'langmatch' seems extremely slow on Virtuoso (DBpedia) - pagination

I have a sparql performance issue with DBpedia. I'd like to extract ordered information from DBpedia sparql endpoint page by page. My first example query looked like this:
select distinct ?objProperty ?label where {
?x ?objProperty <http://dbpedia.org/resource/United_States>.
?objProperty a owl:ObjectProperty.
OPTIONAL{?objProperty rdfs:label ?label}
}order by ?label limit 10 offset 3
It took about 2 s on average for me. (If you try it yourself and see timings under a second, increment the offset, because DBpedia's Virtuoso seems to cache query results.)
However the result returned is not suitable for pagination, because it is a mess of lines with labels from different languages. I want English language for labels and for precise pagination I want exactly 10 different object properties to be returned as a result. Also they have to be ordered by label. Ok. Another try:
select distinct ?objProperty ?label where {
?a ?objProperty <http://dbpedia.org/resource/United_States>.
?objProperty a owl:ObjectProperty.
OPTIONAL{?objProperty rdfs:label ?label}
FILTER ( LANGMATCHES(lang(?label),"EN") || LANG(?label) = "")
}order by ?label limit 10 offset 3
For me this query returned what I expected, but it took about 7 seconds on average. So slow! Without order by and langMatches, the query takes about 1 s on average. Without order by but with langMatches, it takes about 6 s, so langMatches seems to cost about 5 s on average for this query.
I do not understand (these are questions by the way):
Am I doing something wrong? :)
Why does langMatches slow the query down so much? I hope langMatches is not regex based? If this performance issue is unavoidable with langMatches, is there a faster way to work with languages? If not, I can't imagine how semantic technologies will conquer the world in the near future as people expect :))
Is there a better (faster) way to build pagination-based requests than using limit/offset? If not, what is the best way to avoid performance issues like the one above with limit/offset?

1. Am I doing something wrong? :)
I think there's a slight issue that could make your query a bit faster. You've got the ?label as optional, but I think that the filter will only succeed when ?label is bound, effectively making ?label non-optional. My reasoning is as follows: in the case where ?label is not bound, the expression lang(?label) will be an error (unless an implementation extends lang()), and both langMatches and = expect non-error values, so we'd have this reduction:
langMatches(lang(?label),"en") || lang(?label) = ""
langMatches(error, "en") || error = ""
error || error
error (which FILTER treats like false, so the solution is dropped)
I'm basing this on section 17.2 of the SPARQL 1.1 recommendation, which says:
17.2 Filter Evaluation
Functions invoked with an argument of the wrong type will produce a type error. Effective boolean value arguments (labeled "xsd:boolean (EBV)" in the operator mapping table below), are coerced to xsd:boolean using the EBV rules in section 17.2.2.
Apart from BOUND, COALESCE, NOT EXISTS and EXISTS, all functions and operators operate on RDF Terms and will produce a type error if any arguments are unbound.
Any expression other than logical-or (||) or logical-and (&&) that encounters an error will produce that error.
Based on that, I'd rewrite the query as follows, making rdfs:label a required pattern instead of an optional one. My impression is that it's a little bit faster, but that might just be confirmation bias. It's not much faster, though.
select distinct ?p ?label where {
?x ?p dbpedia:United_States .
?p a owl:ObjectProperty ;
rdfs:label ?label .
filter( langMatches(lang(?label),"en") || lang(?label) = "" )
}
order by ?label
limit 10
offset 3
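To make the reduction in point 1 concrete, here is a minimal Python sketch of how || treats errors, as I read section 17.2. Everything in it is made up for illustration: ERROR is a sentinel, a label is modelled as a (text, language tag) tuple, None stands for an unbound ?label, and the language check is simplified to "en". No real engine is implemented this way.

```python
# Toy model of SPARQL filter evaluation with errors (an illustration only).
ERROR = object()  # sentinel standing in for a SPARQL type error


def logical_or(left, right):
    """|| returns true if either side is true, else an error if either side errored."""
    if left is True or right is True:
        return True
    if left is ERROR or right is ERROR:
        return ERROR
    return False


def lang(label):
    """lang() of an unbound variable is an error; otherwise return the tag."""
    if label is None:
        return ERROR
    return label[1]


def lang_matches_en(tag):
    """Simplified langMatches(tag, "en"); errors propagate."""
    if tag is ERROR:
        return ERROR
    tag = tag.lower()
    return tag == "en" or tag.startswith("en-")


def equals(left, right):
    """The = operator; errors propagate."""
    if left is ERROR or right is ERROR:
        return ERROR
    return left == right


def filter_keeps(label):
    """FILTER keeps a solution only when the expression is true (not false, not error)."""
    tag = lang(label)
    return logical_or(lang_matches_en(tag), equals(tag, "")) is True
```

With this model, filter_keeps(None) is False, which is why the OPTIONAL in the original query has no effect once the filter is applied.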
2. Why does langMatches slow the query down so much? I hope langMatches is not regex based? If this performance issue is unavoidable with langMatches, is there a faster way to work with languages?
The public DBpedia SPARQL endpoint can be a bit slow at times, but that doesn't seem to be the issue here. When I run your original query, or the new one above, it takes six or seven seconds to get the results. Two things to note though:
langMatches isn't regular expression based. The docs for langMatches say that it "returns true if language-tag (first argument) matches language-range (second argument) per the basic filtering scheme defined in RFC4647 section 3.3.1. language-range is a basic language range per Matching of Language Tags RFC4647 section 2.1. A language-range of "*" matches any non-empty language-tag string." Basic filtering is case insensitive, but it's not regex.
langMatches isn't the only thing that might be slowing things down. Note that to find the first 10 of something in sorted order (or, in general, the mth through the nth), you have to visit all the elements. You don't have to sort all of them, but you have to visit all of them, which means that there's no way to fetch just the results for the desired page (unless there's some special indexing going on; keep making this query and maybe it will speed up over time :)). This leads us into the next point, though.
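For reference, the basic filtering scheme that langMatches is defined by is small enough to sketch in a few lines of Python. This is my own reading of RFC 4647 section 3.3.1, and the function name lang_matches is made up; it is not how Virtuoso implements it.

```python
def lang_matches(language_tag: str, language_range: str) -> bool:
    """Basic filtering: case-insensitive match on whole subtags, no regex involved."""
    if language_range == "*":
        # "*" matches any non-empty language tag.
        return language_tag != ""
    tag = language_tag.lower()
    rng = language_range.lower()
    # The range matches if it equals the tag, or is a prefix of the tag
    # that ends exactly at a "-" subtag boundary.
    return tag == rng or tag.startswith(rng + "-")
```

So lang_matches("en-GB", "EN") is true while lang_matches("eng", "en") is not. That is a couple of string comparisons per literal, which suggests the cost comes from how many literals have to be touched rather than from the comparison itself.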
3. Is there a better (faster) way to build pagination-based requests than using limit/offset? If not, what is the best way to avoid performance issues like the one above with limit/offset?
While the original and updated queries take six or seven seconds to retrieve the 10 results with limit 10, asking for limit 1000, or limit 5000, also takes only about six or seven seconds. Using limit/offset is the correct way to do pagination, but ordering the results can be expensive, since to find the elements in some particular range, you have to look at all the elements (though you don't necessarily have to order all the elements). It probably makes sense, then, to make those pages as big as possible, and to do any presentation paging locally. E.g., instead of running 100 queries for 10 results each (100 queries × 7 seconds = 700 seconds = 11 minutes and 40 seconds), you can run 1 query for 1000 results (1 query × 7 seconds = 7 seconds), and do any important paged presentation locally.
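A rough Python sketch of that "one big query, page locally" idea follows. The helper names build_query and pages are hypothetical, the query text is the rewritten one from point 1 (relying on the endpoint's predefined dbpedia:, owl: and rdfs: prefixes), and actually sending the query to the endpoint is left out.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def build_query(limit: int, offset: int = 0) -> str:
    """Return the rewritten query with one large server-side page."""
    return (
        "select distinct ?p ?label where {\n"
        "  ?x ?p dbpedia:United_States .\n"
        "  ?p a owl:ObjectProperty ;\n"
        "     rdfs:label ?label .\n"
        '  filter( langMatches(lang(?label),"en") || lang(?label) = "" )\n'
        "}\n"
        "order by ?label\n"
        f"limit {limit}\n"
        f"offset {offset}"
    )


def pages(rows: Sequence[T], page_size: int) -> Iterator[List[T]]:
    """Split already-fetched, already-ordered rows into presentation pages."""
    for start in range(0, len(rows), page_size):
        yield list(rows[start:start + page_size])
```

You would run build_query(1000) once, keep the rows in memory, and serve each 10-row page from pages(rows, 10) without going back to the endpoint.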

How a language filter is handled is up to the SPARQL engine. How does it store literals? Can it use indexes or another technique to avoid a full scan when fetching the literals for a desired language?
You can store a literal as a "chat"#en string, but selecting all English literals for a given property would then require scanning all of that property's literals for a #en match.
In some SPARQL engines, you can get the actual execution plan. For example, here is the way to do it in Virtuoso: Virtuoso execution plan. However, you can't use it on the public endpoint.
Query optimization, execution, and query hints are very well documented for RDBMSs: you can easily find out what the database really does to answer your query and how to modify the schema or query to get the best results. IMHO, SPARQL engines are not that mature in this respect.

Related

Mongo DB like search with count is very slow on 50 million collection data

In my application, I have a collection of 50 million documents. I am using a LIKE-style search and then counting the results on a particular field (i.e. Patientfirstname). I also created an index on the Patientfirstname field. It improved the performance, but the query still takes a lot of time.
db.patients.find({"Patientfirstname":{"$regex":"Testuser"}}).count()
Without an index: 40 sec. After adding an index on the Patientfirstname field: 31 sec.
I tried with a different approach (aggregate) but still, response is very slow
db.patients.aggregate([{$match:{"Patientfirstname":{"$regex":"Testuser"}}},
{$project:{"Patientfirstname":1,"_id":1}},
{$group : {_id:"$Patientfirstname", count:{$sum:1}}},
{$sort:{"count":-1}} ])
This query takes the same time to fetch the results: 31 sec.
I tried another approach, but the results are not correct:
select only the field from the entire collection, then apply the like search and count the result.
db.patients.find({},{Patientfirstname:1,_id:1}).count({"Patientfirstname":{"$regex":"Testuser"}})
Applying a filter in count() is not working; the count of the entire collection is displayed.
Please help me make this query fetch results faster. Thanks in advance.
So here is the deal:
As rightly pointed out in the comments, $regex is an operator that does not perform well with or without indexes. Here is the reason why:
Queries without indexes are slow because they are executed using COLLSCAN, which is essentially an iteration over all 50 million documents on disk one by one, filtering the data and returning only the ones that match. Disks being an inherently slow piece of hardware does not help the situation either.
Now, when indexed, MongoDB builds a B-tree that it keeps in RAM. Because the $regex operator is not very selective, it forces a complete scan of the index B-tree (as opposed to the reduced, partial scan you get with equalities or ranges), which is almost as bad as a collection scan. The only reason you gain 9 seconds is that this tree scan happens in RAM rather than on disk.
Having said that, there are a few alternatives to it:
Optimize your $regex. From the MongoDB Documentation itself:
For case sensitive regular expression queries, if an index exists for the field, then MongoDB matches the regular expression against the values in the index, which can be faster than a collection scan. Further optimization can occur if the regular expression is a "prefix expression", which means that all potential matches start with the same string. This allows MongoDB to construct a "range" from that prefix and only match against those values from the index that fall within that range.
A regular expression is a "prefix expression" if it starts with a caret (^) or a left anchor (\A), followed by a string of simple symbols. For example, the regex /^abc.*/ will be optimized by matching only against the values from the index that start with abc.
Additionally, while /^a/, /^a.*/, and /^a.*$/ match equivalent strings, they have different performance characteristics. All of these expressions use an index if an appropriate index exists; however, /^a.*/, and /^a.*$/ are slower. /^a/ can stop scanning after matching the prefix.
Case insensitive regular expression queries generally cannot use indexes effectively. The $regex implementation is not collation-aware and is unable to utilize case-insensitive indexes.
Create a Text Index - this tokenizes your text strings and enables faster text-based searches.
If you are deployed on MongoDB Atlas - then you can use Atlas Search, which is a Lucene-based text search engine (works almost like Elasticsearch on steroids). It offers significantly better performance and features like fuzzy text search, text autocomplete, etc.
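To illustrate why a prefix expression is so much cheaper than an unanchored $regex, here is a small Python sketch. It models the index as a plain sorted list and uses made-up helper names (prefix_scan, full_scan); it is only an analogy for the "range" idea in the quoted documentation, not MongoDB's actual implementation.

```python
import bisect
import re
from typing import List


def prefix_scan(sorted_keys: List[str], prefix: str) -> List[str]:
    """Anchored case (like /^Testuser/): jump to the prefix, stop at the first miss."""
    start = bisect.bisect_left(sorted_keys, prefix)
    end = start
    # Keys sharing a prefix are contiguous in sorted order, so only
    # the matching slice (plus one key) is ever examined.
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]


def full_scan(sorted_keys: List[str], pattern: str) -> List[str]:
    """Unanchored case (like /Testuser/): every key has to be tested."""
    regex = re.compile(pattern)
    return [key for key in sorted_keys if regex.search(key)]
```

If your search can be expressed as "starts with Testuser", the anchored form lets the index do most of the work. If you really need "contains", the prefix trick does not apply and a text index or Atlas Search is the better fit.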

Understanding Solr Function Query Performance

I'm working with "edismax" and "function-query" parsers in Solr and have difficulty in understanding whether the query time taken by "function-query" makes sense. The query I'm trying to optimize looks as follows:
q={!func sum($q1,$q2,$q3)} where q1,q2,q3 are edismax queries.
The QTime for each edismax query on its own is well under 50ms, but the function query seems to be the rate-determining step, since the combined query above takes around 200-300ms. I also analyzed the performance of the function query using only constants.
The QTime results for different q are as follows:
097ms for q={!func} sum(10,20)
109ms for q={!func} sum(10,20,30)
127ms for q={!func} sum(10,20,30,40)
145ms for q={!func} sum(10,20,30,40,50)
Does this trend make sense? Are function-queries expected to be this slow?
What makes edismax queries so much faster?
What can I do to optimize my original query (which has edismax subqueries q1,q2,q3) to work under 100ms?
A func query enumerates all docs, so it provides no selectivity. You probably don't need to evaluate it on docs that don't match the dismax queries, e.g.:
q=+{!v=$q1} +{!v=$q2} +{!v=$q3} {!func sum($q1,$q2,$q3)}
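The effect of adding the required clauses can be pictured with a toy Python model. It is purely illustrative: docs are dicts holding hypothetical per-subquery scores, a score of 0 means "no match", and none of this reflects Solr internals. The point is only that the sum is computed for far fewer docs once the dismax clauses restrict the candidate set.

```python
from typing import Dict, List, Tuple

Doc = Dict[str, float]


def sum_score(doc: Doc) -> float:
    """Stand-in for sum($q1,$q2,$q3)."""
    return doc["q1"] + doc["q2"] + doc["q3"]


def score_every_doc(docs: List[Doc]) -> Tuple[List[float], int]:
    """Function query alone: the sum is evaluated for every doc in the index."""
    scores = [sum_score(doc) for doc in docs]
    return scores, len(docs)


def score_matching_docs(docs: List[Doc]) -> Tuple[List[float], int]:
    """Required clauses first: only docs matching q1, q2 and q3 are scored."""
    matching = [d for d in docs if d["q1"] > 0 and d["q2"] > 0 and d["q3"] > 0]
    scores = [sum_score(doc) for doc in matching]
    return scores, len(matching)
```

Each function returns the scores together with the number of evaluations, so you can compare how much work each variant does on the same set of docs.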

How to quickly filter a list using regex?

Well... I have a trivial request: build an Entry that filters a list of entries on the fly (think of an editor auto-complete feature).
The request is to support a regex filter over the whole list and display only matching entries.
e.g.,
The list contains:
abc.efg.hij.entry
abc.ddd.hij.entry2
hij.some.value.entry
Typing in the Entry
Value : List
hij : abc.efg.hij.entry, abc.ddd.hij.entry2, hij.some.value.entry
ddd : abc.ddd.hij.entry2
dd*entry : abc.ddd.hij.entry2
val : hij.some.value.entry
Here is the code i'm using for filtering the list:
regex = re.compile(r"{0}".format(entry_value), re.IGNORECASE)
display_list = list(filter(regex.search, display_list))
The real life list contains ~300K entries of strings (up to 100 char each) and the performance of the above is very poor, considering a GUI response time.
I've profiled my real test case and it takes ~0.8s per keystroke in the Entry.
Is there a faster way?
If you are doing regular expression pattern matching against a normal python list that contains 300,000 items, it's just naturally going to be slow. Also, if you are going to display 300,000 items in a listbox it's going to be slow to display all of those items.
Your best bet might be to pick a better data structure. For example, on my system I can run your filter against 300,000 items in about 250ms, but a query against an in-memory sqlite database with 300,000 rows takes about half that time. In either case, it can add another second to fully update the display if the result is very large (for example, if all 300,000 match).
Of course, sqlite doesn't support regex out of the box, but you can translate some common patterns to sql patterns (eg: 'foo.*bar' could be translated to 'foo%bar'). For more information on sqlite and regex see How do I use regex in a SQLite query?
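Here is a minimal sketch of the sqlite idea, assuming you only need to support the '.*' wildcard. The function names (regex_to_like, build_db, search) and the table layout are made up for the example, and every other regex feature is treated as literal text.

```python
import sqlite3
from typing import Iterable, List


def regex_to_like(pattern: str) -> str:
    """Translate a tiny regex subset to LIKE: only '.*' is understood."""
    parts = pattern.split(".*")
    # Escape LIKE's own wildcards so they are matched literally.
    escaped = [
        p.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
        for p in parts
    ]
    # Surround with % so the pattern can match anywhere, like regex.search.
    return "%" + "%".join(escaped) + "%"


def build_db(entries: Iterable[str]) -> sqlite3.Connection:
    """Load the entries into an in-memory table once, up front."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entries (name TEXT)")
    conn.executemany("INSERT INTO entries (name) VALUES (?)", [(e,) for e in entries])
    return conn


def search(conn: sqlite3.Connection, pattern: str) -> List[str]:
    """Return matching entries in insertion order."""
    rows = conn.execute(
        "SELECT name FROM entries WHERE name LIKE ? ESCAPE '\\' ORDER BY rowid",
        (regex_to_like(pattern),),
    )
    return [row[0] for row in rows]
```

SQLite's LIKE is case-insensitive for ASCII by default, which lines up with the re.IGNORECASE flag in the question. Whether it beats the list filter on your machine is something you would need to measure.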
Another strategy to employ would be to not search on every character typed. Wait until the user pauses in their typing. So, for example, if they type "Lorem", you don't need to search on "L" and then "Lo", and then "Lor", etc. Instead, schedule the search to happen in 100 ms, and with each keypress you can reschedule the search. This will prevent the searching from slowing down, while still giving the user what appears to be a fairly rapid result.
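The rescheduling logic can be sketched independently of any GUI toolkit. The Debouncer class below is a hypothetical helper built on threading.Timer, so the callback runs on a timer thread; in a Tkinter app you would normally get the same effect with the widget's after() and after_cancel() methods instead, since Tk widgets should only be touched from the main thread.

```python
import threading
from typing import Any, Callable, Optional


class Debouncer:
    """Run callback only after `delay_seconds` have passed without a new trigger."""

    def __init__(self, delay_seconds: float, callback: Callable[..., Any]) -> None:
        self._delay = delay_seconds
        self._callback = callback
        self._timer: Optional[threading.Timer] = None
        self._lock = threading.Lock()

    def trigger(self, *args: Any) -> None:
        """Call on every keypress; cancels the pending search and schedules a new one."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._delay, self._callback, args=args)
            self._timer.daemon = True
            self._timer.start()
```

With something like Debouncer(0.1, run_search), where run_search is your own filter function, typing "Lorem" quickly triggers the debouncer five times but runs the search once, with the final text.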

solr how to properly use boost factor in a query?

Ok, so I am using many fields with qf, like:
[qf] => frpId^5 fundraise_title^3 fundraiser_display_name^3 charity_name^2 participantFname^2 participantLname^2 participantEmail^1 groupName^3 fundraise_text^ fundraiseTitleExact^15 fundraiserDisplayNameExact^15 charityNameExact^15 participantFnameExact^10 participantLnameExact^10 groupNameExact^10 all^
but I really want exact matches on the field fundraiseTitleExact to be on top.
With this qf setup, they are at position 32.
Let's say that I am boosting fundraiseTitleExact like:
[qf] => frpId^5 fundraise_title^3 fundraiser_display_name^3 charity_name^2 participantFname^2 participantLname^2 participantEmail^1 groupName^3 fundraise_text^ fundraiseTitleExact^15000000000000000 fundraiserDisplayNameExact^15 charityNameExact^15 participantFnameExact^10 participantLnameExact^10 groupNameExact^10 all^
But even now the fundraiseTitleExact exact match only reaches position 27 (5 positions up) and doesn't go any higher.
How can I prioritise this field over the rest?
This looks more like a tuning problem, however you have several options:
Tune your relevancy by adjusting all the boosts until you get the expected results (I would advise working with lower boosts than the ones in your question and then increasing the boost of the most important field);
If you are using the edismax query parser, you probably want to check the bq and bf parameters in order to boost your term;
If worst comes to worst, you could use the Query Elevation Component to put some entries at the top of the list.
I advise to read the following books to widen your knowledge of solr boosting and relevancy mechanisms:
Solr in Action
Relevant Search

What's wrong with this Solr range filter query?

The following filter query returns zero results (using *:* as query):
-startDate:[* TO *] OR startDate:[* TO NOW/DAY+1DAY]
But if I filter only by:
-startDate:[* TO *]
I get 3 results.
If I filter only by:
startDate:[* TO NOW/DAY+1DAY]
I get 161 results.
Why is the combined FQ returning zero results? What I want is the filter to return any doc whose start date is null or start date is before today.
EDIT:
I'm using Solr 4.2.1.2013.03.26.08.26.55
EDIT:
Well, strange as it may sound, a colleague suggested putting parentheses around the two parts like this:
(-startDate:[* TO *]) OR (startDate:[* TO NOW/DAY+1DAY])
And somehow it worked. I'm still curious why that made a difference. Hope someone can shed some light.
Thanks!
Solr supports pure negative queries. They do this, essentially, by expanding the pure negative to something like:
*:* -startDate:[* TO *]
However, when you combine it in a BooleanQuery, I don't believe it applies this sort of logic anymore. A negative query does not, in lucene, fetch anything, but rather filters out matches brought in by other, positive, query terms. This differs from SQL queries, which in a sense start with an implicit *:*, or a full table of results, and allow you to pare it down.
I believe your OR is effectively being ignored, since it doesn't, strictly speaking, make sense in context. Generally, OR is just syntactic sugar, I believe (field:this OR field:that is equivalent to field:this field:that).
So, in effect your query is: startDate:[* TO NOW/DAY+1DAY] -startDate:[* TO *], which makes the results you see more obvious. When you wrap it in parentheses, then each term query is treated separately, and you gain access to solr's support of lonely negative queries.
A much better idea is to store a default value, if you need to search for unset/null values. *:* and by extension pure negative queries like this have to scan the entire index, and so perform very poorly. Providing a default value will improve performance, and prevent this sort of confusing situation.
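If it helps, the behaviour described above can be mimicked with plain Python sets. This is a toy model under my own assumptions, not Lucene's implementation: doc ids are integers, each clause is the set of ids it matches, and boolean_query is a made-up helper in which positive clauses select docs and negative clauses only subtract.

```python
from typing import Iterable, Set


def boolean_query(
    should: Iterable[Set[int]] = (),
    must_not: Iterable[Set[int]] = (),
) -> Set[int]:
    """Positive clauses bring docs in; negative clauses can only remove them."""
    matched: Set[int] = set()
    for clause in should:
        matched |= clause
    for clause in must_not:
        matched -= clause
    return matched


ALL_DOCS = {1, 2, 3, 4, 5}           # what *:* matches
HAS_START_DATE = {1, 2, 3}           # startDate:[* TO *]
STARTS_BEFORE_TOMORROW = {1, 2}      # startDate:[* TO NOW/DAY+1DAY]

# Flat form: the negative clause is subtracted from the range matches.
flat = boolean_query(should=[STARTS_BEFORE_TOMORROW], must_not=[HAS_START_DATE])

# Anchoring the negative clause to *:* turns it into a real set of docs.
no_start_date = boolean_query(should=[ALL_DOCS], must_not=[HAS_START_DATE])
fixed = boolean_query(should=[STARTS_BEFORE_TOMORROW, no_start_date])
```

In this model flat comes out empty, because every doc in the date range also has a start date, while fixed contains both the docs in range and the docs with no start date.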
I used femtoRgon's answer and was able to construct a query that included a range and blank values.
The following includes all docs with a StartDate on or after 1/1/2014 and all docs without a StartDate.
(StartDate:[2014-01-01T00:00:00Z TO *]) OR (-StartDate:([* TO *]) AND *:*)
The magic is (-StartDate:([* TO *]) AND *:*). This will select the docs without a StartDate.
Pure negative queries don't work, because they are omitting results from nothing.
Try:
*:* AND -startDate:[* TO *]
When you query with -startDate:[* TO *] you get documents which do not have any data for the startDate field.
When you query for startDate:[* TO NOW/DAY+1DAY] you get documents which have a value less than or equal to NOW/DAY+1DAY in the startDate field.
You could try -startDate:* OR startDate:[* TO NOW/DAY+1DAY]. The first part selects documents that do not have a value, and the second part selects documents that have a value less than or equal to NOW/DAY+1DAY in the startDate field.

Resources