Inconsistent counts in Virtuoso 7.1 for large graphs - dbpedia

I have an instance of Virtuoso 7.1 running, with DBpedia set up as clearly described in this blog. Now I have a very basic requirement: finding some count values. However, I am confused by the results of my query:
select count(?s)
where {?s ?p ?o .
FILTER(strstarts(str(?s),"http://dbpedia.org/resource")) }
With this query I'd like to see how many resources are present in DBpedia whose URI starts with "http://dbpedia.org/resource". Essentially my hope is to find resources of the kind <http://dbpedia.org/resource/Hillary_Clinton> or <http://dbpedia.org/resource/Bill_Clinton> and so on.
My confusion lies in the fact that Virtuoso returns different results each time.
I tried it on two different machines, a local machine and our server, and in both cases I get wildly different results. To give a sense of how wildly, here are some of the counts returned: 1101000, 36314, 328014, 292014.
As for the execution timeout: I did try changing it from the default 0 to 5000 and to 8000, but that did not noticeably change the counts.
I know DBpedia provides statistics for their dump, but I'd like to compute this directly in Virtuoso. Why this anomaly?
I also saw this discussion, where they refer to something that might be related. I would just like to know how to get the counts right for DBpedia in Virtuoso. If not Virtuoso, is there another graph store (e.g., Jena, RDF4J, Fuseki) that would do this right?

First thing -- Virtuoso 7.1 is very old (shipped 2014-02-17). I'd strongly advise updating to a current build, 7.2.4 (version string 07.20.3217) or later, whether Commercial or Open Source.
Now -- the query you're running has to do a lot of work to produce your result. It has to check every ?s for your string, and then tot up the count. That's going to need a (relatively) very long time to run; exactly how long is dependent on the runtime environment and total database size, among other significant factors.
HTTP response headers (specifically, X-SQL-Message) will include notice of such query timeouts, as seen here:
$ curl -LI "http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=SELECT+COUNT%28%3Fs%29+%0D%0AWHERE%0D%0A++%7B++%3Fs++%3Fp++%3Fo++.+%0D%0A+++++FILTER%28strstarts%28str%28%3Fs%29%2C%22http%3A%2F%2Fdbpedia.org%2Fresource%22%29%29+%0D%0A++%7D+&format=text%2Fhtml&CXML_redir_for_subjs=121&CXML_redir_for_hrefs=&timeout=3000000&debug=on"
HTTP/1.1 200 OK
Date: Tue, 06 Sep 2016 16:39:44 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 128
Connection: keep-alive
Server: Virtuoso/07.20.3217 (Linux) i686-generic-linux-glibc212-64 VDB
X-SPARQL-default-graph: http://dbpedia.org
X-SQL-State: S1TAT
X-SQL-Message: RC...: Returning incomplete results, query interrupted by result timeout. Activity: 12.43M rnd 13.28M seq 0 same seg 8.023M same pg 3.369M same par 0 disk 0 spec disk 0B / 0 m
X-Exec-Milliseconds: 121040
X-Exec-DB-Activity: 12.43M rnd 13.28M seq 0 same seg 8.023M same pg 3.369M same par 0 disk 0 spec disk 0B / 0 messages 0 fork
Expires: Tue, 13 Sep 2016 16:39:44 GMT
Cache-Control: max-age=604800
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Access-Control-Allow-Methods: HEAD, GET, POST, OPTIONS
Access-Control-Allow-Headers: DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Accept-Encoding
Accept-Ranges: bytes
On your own instance, you can set larger or even infinite timeouts (e.g., MaxQueryExecutionTime) to ensure that you always get the most complete result available.
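If you want to detect from a script whether you got a complete or a partial (Anytime) result, you can check those same headers programmatically. Here is a minimal Python sketch, assuming the requests library and a Virtuoso SPARQL endpoint at http://localhost:8890/sparql (adjust the URL, graph, and timeout to your own instance); it runs the count and prints a warning if X-SQL-Message reports a result timeout.
import requests

ENDPOINT = "http://localhost:8890/sparql"   # assumed local Virtuoso endpoint; adjust as needed

QUERY = """
SELECT COUNT(?s)
WHERE { ?s ?p ?o .
        FILTER(strstarts(str(?s), "http://dbpedia.org/resource")) }
"""

resp = requests.get(
    ENDPOINT,
    params={
        "default-graph-uri": "http://dbpedia.org",  # same graph URI as in the curl example above
        "query": QUERY,
        "timeout": 3000000,                         # Anytime timeout in ms, as in the curl example
    },
    headers={"Accept": "application/sparql-results+json"},
)

# Virtuoso reports partial (Anytime) results in the X-SQL-* headers.
if "X-SQL-Message" in resp.headers:
    print("Incomplete result:", resp.headers["X-SQL-Message"])

for row in resp.json()["results"]["bindings"]:
    print(row)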

Related

Why do CouchDB views support compaction but mango indexes do not?

As I was reading the CouchDB documentation I found it weird that views needed compaction while mango indexes did not. Are they not essentially the same thing and subject to the same requirement of cleaning out unused or old entries? It seems like an oversight to me.
I suppose I just need some clarification on how the index trees are different between them.
Thanks!
One may in fact compact a mango index, because every index created at the /db/_index endpoint [1] has a "ddoc" (design doc) just like the design docs for map/reduce views.
Quoting from the /db/_index documentation,
Mango is a declarative JSON querying language for CouchDB databases. Mango wraps several index types, starting with the Primary Index out-of-the-box. Mango indexes, with index type json, are built using MapReduce Views. [1]
Now look at the /db/_compact/design-doc [2] endpoint's documentation*
Compacts the view indexes associated with the specified design document. It may be that compacting a large view can return more storage than compacting the actual db. Thus, you can use this in place of the full database compaction if you know a specific set of view indexes have been affected by a recent database change.
*Emphasis mine
Since every "mango index" has a design-doc, it follows that any mango index may be compacted with the /db/_compact/design-doc endpoint.
This may be verified easily with curl. Say there is a mango index with ddoc="foo-json-index-ddoc" in the "stack" database,
curl -v -X POST -H "Content-Type: application/json" http://localhost:5984/stack/_compact/foo-json-index-ddoc
The verbose (successful) response will look like this
< HTTP/1.1 202 Accepted
< Cache-Control: must-revalidate
< Content-Length: 12
< Content-Type: application/json
< Date: Tue, 18 May 2021 14:30:33 GMT
< Server: CouchDB/2.3.1 (Erlang OTP/19)
< X-Couch-Request-ID: bbf2b7b0c9
< X-CouchDB-Body-Time: 0
<
{"ok":true}
* Connection #0 to host localhost left intact
I left authorization out for clarity.
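The same verification can be scripted. Below is a small Python sketch, assuming the requests library and an unauthenticated CouchDB at http://localhost:5984 with the same hypothetical "stack" database: it lists the mango indexes via /db/_index to find their backing design docs, then compacts each one via /db/_compact/design-doc.
import requests

BASE = "http://localhost:5984"   # assumed local CouchDB; authorization omitted as above
DB = "stack"

# GET /db/_index returns the mango indexes together with their backing ddocs.
indexes = requests.get(f"{BASE}/{DB}/_index").json()["indexes"]

for idx in indexes:
    ddoc = idx.get("ddoc")       # e.g. "_design/foo-json-index-ddoc"; None for the special primary index
    if not ddoc:
        continue
    design_name = ddoc.split("/", 1)[1]
    # POST /db/_compact/design-doc compacts the view index behind the mango index.
    r = requests.post(
        f"{BASE}/{DB}/_compact/{design_name}",
        headers={"Content-Type": "application/json"},
    )
    print(design_name, r.status_code, r.json())   # expect 202 {'ok': True}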
[1] /db/_index
[2] /db/_compact/design-doc

CosmosDb unexpected continuation token

Please note:
I believe this question is different from the one here talking about why the continuation token is null. The problem listed here is about discussing this unexpected behaviour and seeing if there is any solution to it.
I've also reported this on the Cosmos DB GitHub issues, because at this stage I think this could very well be an SDK or Cosmos API bug.
Here it goes:
Basically I am getting no result with a continuation token in an unexpected situation.
The only similar experience (no result but a continuation token) I had with CosmosDb was when there were not enough RUs and the query needed more RUs to finish its job, for example when counting all the documents and you need to continue a couple of times.
How to reproduce the issue?
This is very hard to reproduce, as the consumer does not control the shard (physical partition) distribution. But you need a Cosmos DB that has a few logical partitions and at least two shards, and your query should be formed aiming for the data in the second shard. Do not provide a partition key, and make the query cross-partition.
Expected behavior
When:
the query is cross partition
there is enough RU
the query costs a very small RU
I'm expecting to receive the result in the first call.
Actual behavior
Query result is empty
Response has an unusual continuation token
The token looks like below:
{"token":null,"range":{"min":"05C1DFFFFFFFFC","max":"FF"}}
Following is the sample code with which I can reproduce the issue every single time. In this case I have a document sitting in partition 2 (index 1), which I assume is the second shard.
var client = new DocumentClient(ServiceEndpoint, AuthKey);
const string query = "select * from c where c.title='JACK CALLAGHAN'";
var collection = UriFactory.CreateDocumentCollectionUri(DatabaseId, CollectionId);
var cQuery = client.CreateDocumentQuery<dynamic>(collection, query, new FeedOptions
{
    EnableCrossPartitionQuery = true,
    PopulateQueryMetrics = true
}).AsDocumentQuery();
var response = cQuery.ExecuteNextAsync().GetAwaiter().GetResult();
Console.WriteLine($"response.AsEnumerable().Count()= {response.AsEnumerable().Count()}");
foreach (string headerKey in response.ResponseHeaders.Keys)
{
    Console.WriteLine($"{headerKey}");
    var keyValues = response.ResponseHeaders[headerKey].Split(";");
    foreach (var keyValue in keyValues)
    {
        Console.WriteLine($"{keyValue}");
    }
    Console.WriteLine();
}
And the output including all the headers:
response.AsEnumerable().Count()= 0
Cache-Control
no-store, no-cache
Pragma
no-cache
Transfer-Encoding
chunked
Server
Microsoft-HTTPAPI/2.0
Strict-Transport-Security
max-age=31536000
x-ms-last-state-change-utc
Wed, 03 Apr 2019 00:50:35.469 GMT
x-ms-resource-quota
documentSize=51200
documentsSize=52428800
documentsCount=-1
collectionSize=52428800
x-ms-resource-usage
documentSize=184
documentsSize=164076
documentsCount=94186
collectionSize=188910
lsn
118852
x-ms-item-count
0
x-ms-schemaversion
1.7
x-ms-alt-content-path
dbs/bettingedge/colls/fixtures
x-ms-content-path
S8sXAPPiCdc=
x-ms-xp-role
1
x-ms-documentdb-query-metrics
totalExecutionTimeInMs=0.27
queryCompileTimeInMs=0.04
queryLogicalPlanBuildTimeInMs=0.02
queryPhysicalPlanBuildTimeInMs=0.03
queryOptimizationTimeInMs=0.00
VMExecutionTimeInMs=0.06
indexLookupTimeInMs=0.05
documentLoadTimeInMs=0.00
systemFunctionExecuteTimeInMs=0.00
userFunctionExecuteTimeInMs=0.00
retrievedDocumentCount=0
retrievedDocumentSize=0
outputDocumentCount=0
outputDocumentSize=49
writeOutputTimeInMs=0.00
indexUtilizationRatio=0.00
x-ms-global-Committed-lsn
118851
x-ms-number-of-read-regions
0
x-ms-transport-request-id
12
x-ms-cosmos-llsn
118852
x-ms-session-token
0:-1#118852
x-ms-request-charge
2.86
x-ms-serviceversion
version=2.2.0.0
x-ms-activity-id
c4bc4b76-47c2-42e9-868a-9ecfe0936b1e
x-ms-continuation
{"token":null,"range":{"min":"05C1DFFFFFFFFC","max":"FF"}}
x-ms-gatewayversion
version=2.2.0.0
Date
Fri, 05 Apr 2019 05:40:21 GMT
Content-Type
application/json
If we continue the query with the composite continuation token we can see the result.
Is that a normal behavior or a bug?
The .NET SDK will handle the continuation token for you if you keep calling ExecuteNextAsync while HasMoreResults is true:
var query = Client.CreateDocumentQuery<dynamic>(
    UriFactory.CreateDocumentCollectionUri(databaseId, collectionId),
    sqlQuery, feedOptions).AsDocumentQuery();

while (query.HasMoreResults)
{
    var response = await query.ExecuteNextAsync();
    results.AddRange(response);
}
Adding feedback provided through the linked GitHub issue:
Regarding your issues, we have faced the same situations.
I have some comments.
We didn't know about this behavior; our client received an empty list with a continuation token, and it pretty much broke our flow.
Now on the server side we handle this situation and continue until we get a result. The issue is: what if there are 1000 partitions? Do we have to continue 100 times? Is that how CosmosDB protects its SLA and under-10-ms response time? #ThinkingLoud
Yes, this broke our flow the first time too. But we handle it on the server side as well (together with MaxNumberOfObjects, we continue to serve the request until we receive the number of items the client wants), and the pattern you're seeing is due to the underlying architecture of CosmosDB, consisting of physical + logical partitions. It sounds like you're implementing paging in the interplay with your client, and that is fine. However, I don't think this is what CosmosDB refers to with their SLA times.
What other undocumented unexpected behaviors are there that are going to get us by surprise in production environment? #ThinkingLoudAgain
This is a bit vague, but my advice would be for you to read up on all FeedOptions together with CosmosDB Performance tips and make sure you understand them as well as their application areas.
EDIT: Also, another warning. I am currently running into an issue with Continuation Token and DISTINCT keyword in the SQL query. It does not work as expected without an ORDER BY.
This is normal behavior for CosmosDb. There are several things that can cause the query to time out internally and result in a response with few results or even an empty collection.
From CosmosDb documentation:
The number of items returned per query execution will always be less than or equal to MaxItemCount. However, it is possible that other criteria might have limited the number of results the query could return. If you execute the same query multiple times, the number of pages might not be constant. For example, if a query is throttled there may be fewer available results per page, which means the query will have additional pages. In some cases, it is also possible that your query may return an empty page of results.
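In other words, an empty page does not mean the query is finished; the only reliable stop condition is the absence of a continuation token. Here is a minimal, SDK-agnostic Python sketch of that drain loop, where execute_page is a hypothetical helper wrapping whichever SDK call you use:
def drain_query(execute_page, query):
    # execute_page(query, continuation) is assumed to return (items, next_token),
    # with next_token None/empty once the query is fully drained.
    results = []
    continuation = None
    while True:
        items, continuation = execute_page(query, continuation)
        results.extend(items)       # a page may legitimately be empty
        if not continuation:        # stop only when no token comes back
            return results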

SPARK parallelization of algorithm - non-typical, how to

I have a processing requirement that does not seem to fit the nice Spark parallelization use cases. On the other hand, I may not see how it can be done in Spark easily.
I am seeking the easiest way to parallelize the following situation:
1. Given a set of N records of record type A,
2. perform some processing on the A records that generates a not-yet-existing set of initial results, say J records of record type B. Record type B has a date-range aspect to it.
3. Then repeat the process for the A records not yet processed - the leftovers - for any records generated as part of B, but looking to the left and to the right of the A records.
4. Repeat step 3 until no new records are generated.
This may sound odd, but it is nothing more than taking a set of trading records and deciding, for a given computed period Pn, whether there is a bull or bear spread evident during that period. Once that initial period is found, then, date-wise before Pn and after Pn, one can attempt to look for a bull or bear spread period that precedes or follows the initial Pn period. And so on. It all works correctly.
The algorithm I designed works by inserting records using SQL and some looping. The records generated do not exist initially and get created on the fly. I looked at DataFrames and RDDs, but it is not so evident (to me) how one would do this.
Using SQL it is not such a difficult algorithm, but you need to work through the records of a given logical key set sequentially, so it is not a typical Spark use case.
My questions are then:
How can I achieve at least some parallelization?
Should we use mapPartitions in some way so as to at least get ranges of logical key sets to process, or is this simply not possible given the use case I am trying to present? I am going to try this, but I feel I may be barking up the wrong tree here. It may just need to be a while loop in the driver, running single-threaded.
Some example A records, shown in tabular format as per how this algorithm works:
Jan Feb Mar Apr May Jun Jul Aug Sep
key X -5 1 0 10 9 -20 0 5 7
would result in record B's being generated as follows:
key X Jan - Feb --> Bear
key X Apr - Jun --> Bull
This falls into the category of non-typical Spark. It was solved via looping within a loop in Spark Scala, but with JDBC usage; it could just as well have been a plain Scala JDBC program. There is also a variation using foreachPartition.
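For illustration only (this is not the author's Scala/JDBC solution), here is a minimal PySpark sketch of the parallelization idea: keep each logical key's records together, run the sequential bull/bear detection per key, and let Spark parallelize across keys. find_spreads is a hypothetical placeholder for the real iterative algorithm.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spread-detection-sketch").getOrCreate()

def find_spreads(key, month_values):
    # Hypothetical per-key routine: runs the sequential, iterative detection
    # over one key's ordered monthly values and returns the generated B records.
    ordered = [value for _, value in sorted(month_values)]
    # ... the real looping / period-extension logic would go here ...
    return [(key, "Apr-Jun", "Bull")]            # placeholder output only

# A records as (key, (month_index, value)); in practice these come from a table.
a_records = spark.sparkContext.parallelize([
    ("X", (1, -5)), ("X", (2, 1)), ("X", (3, 0)),
    ("X", (4, 10)), ("X", (5, 9)), ("X", (6, -20)),
    ("X", (7, 0)), ("X", (8, 5)), ("X", (9, 7)),
])

# groupByKey keeps one key's records on one worker; the per-key work stays
# sequential, while different keys are processed in parallel.
b_records = (a_records
             .groupByKey()
             .flatMap(lambda kv: find_spreads(kv[0], list(kv[1]))))

print(b_records.collect())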

Hazelcast - Error in reading cache with 2 million objects with apprx 500 requests/second read

We have approximately 2 million distributed data objects (not replicated) in the cache of a 10-node cluster (approximately 500 MB of data). The backup count is one. We are seeing the errors/warnings given below.
Do you know in which situations these errors can occur? I have sanitized the logs so as not to share anything sensitive. The majority of the time we do cache reads (around 400 requests/second), and the whole cache gets reinitialized every 2 hours.
I know that we could use a replicated cache to improve performance, but I am wondering what is going wrong here. When I run with a smaller cluster (e.g. 5 nodes), everything works fine.
Hazelcast version 3.6.3
Server size 8 core, 16 GB
Windows Server 2012 R2
IO input thread count is 30
IO output thread count is 50
2017-06-24 23:46:22.679 ERROR (hz._hzInstance_1_My-App.partition-operation.thread-5) [c.h.m.i.o.GetOperation] - [192.168.111.11]:5701 [My-App] [3.6.3] Cannot send response: HeapData{type=-2, hashCode=113248027, partitionHash=113248027, totalSize=722, dataSize=714, heapCost=742} to Address[192.168.111.13]:5701. Op: com.hazelcast.map.impl.operation.GetOperation{identityHash=1124265765, serviceName='hz:impl:mapService', partitionId=189, replicaIndex=0, callId=3490089, invocationTime=1498362385498 (Sat Jun 24 23:46:25 EDT 2017), waitTimeout=-1, callTimeout=8000, name=HKF/my-cache-id-3, name=HKF/my-cache-id-3}
com.hazelcast.spi.exception.ResponseNotSentException: Cannot send response: HeapData{type=-2, hashCode=113248027, partitionHash=113248027, totalSize=722, dataSize=714, heapCost=742} to Address[192.168.111.13]:5701. Op: com.hazelcast.map.impl.operation.GetOperation{identityHash=1124265765, serviceName='hz:impl:mapService', partitionId=189, replicaIndex=0, callId=3490089, invocationTime=1498362385498 (Sat Jun 24 23:46:25 EDT 2017), waitTimeout=-1, callTimeout=8000, name=HKF/my-cache-id-3, name=HKF/my-cache-id-3}
at com.hazelcast.spi.impl.operationservice.impl.RemoteInvocationResponseHandler.sendResponse(RemoteInvocationResponseHandler.java:54)
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.sendResponse(OperationRunnerImpl.java:278)
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.handleResponse(OperationRunnerImpl.java:251)
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:173)
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:393)
at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.processPacket(OperationThread.java:184)
Why do you have such a huge number of input and output threads (30/50)? In most cases the default of 3+3 is more than sufficient. If you don't have 50+ connections, all these threads will be idle. Even with 50+ connections, you will not get good performance with so many IO threads.
The error you are seeing seems to indicate a networking issue: a response can't be sent. The big question is why this is happening.
Can you enable diagnostics:
http://docs.hazelcast.org/docs/latest-development/manual/html/Management/Diagnostics/Enabling_Diagnostics_Logging.html
and send the log files to peter at hazelcast dot com so I can have a look at them.

What are the units for the time remaining value in GitHub's rate limiting message?

According to GitHub's API documentation, when you go over the rate limit, you get a response that looks like this:
HTTP/1.1 403 Forbidden
Date: Tue, 20 Aug 2013 14:50:41 GMT
Status: 403 Forbidden
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1377013266
{
"message": "API rate limit exceeded for xxx.xxx.xxx.xxx. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)",
"documentation_url": "https://developer.github.com/v3/#rate-limiting"
}
What are the units on the X-RateLimit-Reset value? In other words, how can I tell from the error message how long in seconds or minutes I need to wait before I can send another request?
It's a Unix timestamp; see this note from the GitHub API documentation.
With the timestamp from that example, the reset time would have been 20 Aug 2013 at 15:41:06 UTC.
According to a Wikipedia article the GitHub docs link to, a Unix timestamp is:
defined as the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970, not counting leap seconds.
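So the wait time is simply the reset timestamp minus the current Unix time. A small Python sketch using the value from the example response above:
import time
from datetime import datetime, timezone

reset = 1377013266                                     # X-RateLimit-Reset header value
print(datetime.fromtimestamp(reset, tz=timezone.utc))  # 2013-08-20 15:41:06+00:00

# Seconds to wait before the limit resets (0 if it has already passed).
wait_seconds = max(0, reset - int(time.time()))
print(f"Retry after {wait_seconds} seconds")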
