Why do CouchDB views support compaction but mango indexes do not?

As I was reading the CouchDB documentation I found it weird that views needed compaction while mango indexes did not. Are they not essentially the same thing and subject to the same requirement of cleaning out unused or old entries? It seems like an oversight to me.
I suppose I just need some clarification on how the index trees are different between them.
Thanks!

One may in fact compact a mango index because every index created at the /db/_index endpoint [1] has a "ddoc" (design doc) just like the design docs for map/reduce views.
Quoting from the /db/_index documentation,
Mango is a declarative JSON querying language for CouchDB databases.
Mango wraps several index types, starting with the Primary Index
out-of-the-box. Mango indexes, with index type json, are built using
MapReduce Views. [1]
Now look at the /db/_compact/design-doc [2] endpoint's documentation*
Compacts the view indexes associated with the specified design
document. It may be that compacting a large view can return more
storage than compacting the actual db. Thus, you can use this in place
of the full database compaction if you know a specific set of view
indexes have been affected by a recent database change.
*Emphasis mine
Since every "mango index" has a design-doc, it follows that any mango index may be compacted with the /db/_compact/design-doc endpoint.
This may be verified easily with curl. Say there is a mango index with ddoc="foo-json-index-ddoc" in the "stack" database:
curl -v -X POST -H "Content-Type: application/json" http://localhost:5984/stack/_compact/foo-json-index-ddoc
The verbose (successful) response will look like this:
< HTTP/1.1 202 Accepted
< Cache-Control: must-revalidate
< Content-Length: 12
< Content-Type: application/json
< Date: Tue, 18 May 2021 14:30:33 GMT
< Server: CouchDB/2.3.1 (Erlang OTP/19)
< X-Couch-Request-ID: bbf2b7b0c9
< X-CouchDB-Body-Time: 0
<
{"ok":true}
* Connection #0 to host localhost left intact
I left authorization out for clarity.
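For completeness, here is a rough TypeScript sketch of the same idea end to end: it lists the database's indexes via GET /db/_index to find each mango index's ddoc, then POSTs to /db/_compact/{ddoc}. Only the "stack" database name comes from the example above; the host and the helper itself are illustrative, and authorization is again left out.
// List every mango (json) index in the "stack" database and compact its backing ddoc.
const base = "http://localhost:5984"; // auth omitted for clarity, as above

const { indexes } = await (await fetch(`${base}/stack/_index`)).json();

for (const idx of indexes) {
  if (idx.type !== "json" || !idx.ddoc) continue; // skip the primary index, which has no ddoc
  const ddocName = idx.ddoc.replace(/^_design\//, ""); // _index reports the ddoc as "_design/<name>"
  const res = await fetch(`${base}/stack/_compact/${ddocName}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
  });
  console.log(ddocName, res.status); // expect 202 Accepted, i.e. {"ok":true}
}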
[1] /db/_index
[2] /db/_compact/design-doc

Related

Best way to split up large array in a Node.js environment

I’m pulling data from the Cloudflare API, getting all web request logs for a very high traffic website in a certain time frame (less than 7 days of data).
The Cloudflare API takes start and end parameters for the dates you want to pull logs for. The start can be no more than 7 days in the past, and the difference between start and end cannot exceed one hour. So, to pull the data I need (usually 3-4 days' worth), I wrote some custom code to generate a range of dates separated by one hour from the start to the end I need.
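For reference, a minimal sketch of generating such hourly windows (the dates below are placeholders):
// Build consecutive [start, end] pairs one hour apart between two dates.
function hourlyRanges(start: Date, end: Date): Array<{ start: Date; end: Date }> {
  const HOUR_MS = 60 * 60 * 1000;
  const ranges: Array<{ start: Date; end: Date }> = [];
  for (let t = start.getTime(); t < end.getTime(); t += HOUR_MS) {
    ranges.push({ start: new Date(t), end: new Date(Math.min(t + HOUR_MS, end.getTime())) });
  }
  return ranges;
}

// Example: three days of data -> 72 one-hour windows.
const windows = hourlyRanges(new Date("2021-05-15T00:00:00Z"), new Date("2021-05-18T00:00:00Z"));
console.log(windows.length); // 72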
With this range, I query the API in a loop and concatenate each response array into a single large array, since I need to run my analysis over all the data. This array typically has ~1 million entries (objects). I'm sure you can see the problem here.
I'm using Deno (a Node.js alternative) and, at first, the program wouldn't even run because it ran out of memory. I found a workaround by passing a v8 engine flag to the run command: --max-old-space-size=8000. It runs now with my massive array, but it's very slow and my computer essentially becomes a brick while it's running.
My question is, how can I better deal with data of this size, specifically in a Node.js style environment?
Proposed Idea (please tell me if it’s stupid)
Deno gives a nice interface for creating temp directories and files, so I was thinking of saving the data from each API request to a temporary .json file and then reading the file(s) where I need them, since the next step for this data is to filter it down.
Would this approach improve speed?
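As a rough Deno sketch of that temp-file idea (the fetchHourOfLogs stub and the example filter are placeholders; the point is that only one hourly chunk is ever held in memory):
// Placeholder for the real Cloudflare API call that returns one hour's worth of log objects.
async function fetchHourOfLogs(w: { start: Date; end: Date }): Promise<Array<Record<string, unknown>>> {
  return []; // in the real program this would call the Cloudflare logs endpoint for [w.start, w.end)
}

const hourlyWindows = [
  { start: new Date("2021-05-15T00:00:00Z"), end: new Date("2021-05-15T01:00:00Z") },
  // ...one entry per hour in the range being pulled
];

const dir = await Deno.makeTempDir();

// 1. Write each hourly response straight to disk instead of concatenating into one huge array.
for (const [i, w] of hourlyWindows.entries()) {
  const entries = await fetchHourOfLogs(w);
  await Deno.writeTextFile(`${dir}/chunk-${i}.json`, JSON.stringify(entries));
}

// 2. Filter one chunk at a time, so only the (much smaller) filtered result stays in memory.
const filtered: Array<Record<string, unknown>> = [];
for (let i = 0; i < hourlyWindows.length; i++) {
  const entries: Array<Record<string, unknown>> = JSON.parse(await Deno.readTextFile(`${dir}/chunk-${i}.json`));
  filtered.push(...entries.filter((e) => e["ClientRequestURI"] !== undefined)); // example filter only
}
This won't necessarily make the overall run faster, but it avoids holding the full ~1M-entry array (and the --max-old-space-size workaround) in memory.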
To elaborate on my comment, the following awk script counts the number of log entries by IP. I'd start there, and then grep that IP to list the visited resources.
./ip-histogram mylogfile.log
# Output
1 127.0.0.1
3 127.0.2.2
mylogfile.log
127.0.2.2 - - [28/Jul/2006:10:27:10 -0300] "GET /foo" 200 3395
127.0.2.2 - - [28/Jul/2006:10:27:10 -0300] "GET /bar" 200 3395
127.0.2.2 - - [28/Jul/2006:10:27:10 -0300] "GET /baz" 200 3395
127.0.0.1 - - [28/Jul/2006:10:22:04 -0300] "GET /foo" 200 2216
ip-histogram
#!/usr/bin/awk -f
# Counts the frequency of each IP in a log file.
# Expects the IP to be in the first ($1) column.
#
# Sample Output:
# ./ip-histogram mylogfile.log
# 12 1.1.1.1
# 18 2.2.2.2
{
histogram[$1]++
}
END {
for (ip in histogram)
print histogram[ip], ip | "sort -n"
}
If you save the one-hour responses as mylog0001.log, mylog0002.log, and so on, you can aggregate them with:
./ip-histogram mylog*.log

CosmosDb unexpected continuation token

Please note:
I believe this question is different from the one here about why the continuation token is null. The problem described here is about discussing this unexpected behaviour and seeing whether there is any solution to it.
I've also reported this on the Cosmos DB GitHub issues, because at this stage I think this could very well be an SDK or Cosmos API bug.
Here it goes:
Basically I am getting no result with a continuation token in an unexpected situation.
The only similar experience I've had with CosmosDb (no result but a continuation token) was when there weren't enough RUs and the query needed more RUs to finish its job, for example when counting all the documents and you need to continue a couple of times.
How to reproduce the issue?
This is very hard to reproduce, as the consumer does not control the shard (physical partition) distribution. You need a Cosmos DB collection that has a few logical partitions and at least two shards, and your query should be aimed at data in the second shard. Do not provide a partition key, so the query is cross-partition.
Expected behavior
When:
the query is cross-partition
there are enough RUs
the query costs very few RUs
I'm expecting to receive the result in the first call.
Actual behavior
Query result is empty
Response has an unusual continuation token
The token looks like below:
{"token":null,"range":{"min":"05C1DFFFFFFFFC","max":"FF"}}
Following is sample code with which I can reproduce the issue every single time. In this case I have a document sitting in partition 2 (index 1), which I assume is the second shard.
var client = new DocumentClient(ServiceEndpoint, AuthKey);
const string query = "select * from c where c.title='JACK CALLAGHAN'";
var collection = UriFactory.CreateDocumentCollectionUri(DatabaseId, CollectionId);
var cQuery = client.CreateDocumentQuery<dynamic>(collection, query, new FeedOptions
{
    EnableCrossPartitionQuery = true,
    PopulateQueryMetrics = true
}).AsDocumentQuery();
var response = cQuery.ExecuteNextAsync().GetAwaiter().GetResult();
Console.WriteLine($"response.AsEnumerable().Count()= {response.AsEnumerable().Count()}");
foreach (string headerKey in response.ResponseHeaders.Keys)
{
    Console.WriteLine($"{headerKey}");
    var keyValues = response.ResponseHeaders[headerKey].Split(";");
    foreach (var keyValue in keyValues)
    {
        Console.WriteLine($"{keyValue}");
    }
    Console.WriteLine();
}
And the output including all the headers:
response.AsEnumerable().Count()= 0
Cache-Control
no-store, no-cache
Pragma
no-cache
Transfer-Encoding
chunked
Server
Microsoft-HTTPAPI/2.0
Strict-Transport-Security
max-age=31536000
x-ms-last-state-change-utc
Wed, 03 Apr 2019 00:50:35.469 GMT
x-ms-resource-quota
documentSize=51200
documentsSize=52428800
documentsCount=-1
collectionSize=52428800
x-ms-resource-usage
documentSize=184
documentsSize=164076
documentsCount=94186
collectionSize=188910
lsn
118852
x-ms-item-count
0
x-ms-schemaversion
1.7
x-ms-alt-content-path
dbs/bettingedge/colls/fixtures
x-ms-content-path
S8sXAPPiCdc=
x-ms-xp-role
1
x-ms-documentdb-query-metrics
totalExecutionTimeInMs=0.27
queryCompileTimeInMs=0.04
queryLogicalPlanBuildTimeInMs=0.02
queryPhysicalPlanBuildTimeInMs=0.03
queryOptimizationTimeInMs=0.00
VMExecutionTimeInMs=0.06
indexLookupTimeInMs=0.05
documentLoadTimeInMs=0.00
systemFunctionExecuteTimeInMs=0.00
userFunctionExecuteTimeInMs=0.00
retrievedDocumentCount=0
retrievedDocumentSize=0
outputDocumentCount=0
outputDocumentSize=49
writeOutputTimeInMs=0.00
indexUtilizationRatio=0.00
x-ms-global-Committed-lsn
118851
x-ms-number-of-read-regions
0
x-ms-transport-request-id
12
x-ms-cosmos-llsn
118852
x-ms-session-token
0:-1#118852
x-ms-request-charge
2.86
x-ms-serviceversion
version=2.2.0.0
x-ms-activity-id
c4bc4b76-47c2-42e9-868a-9ecfe0936b1e
x-ms-continuation
{"token":null,"range":{"min":"05C1DFFFFFFFFC","max":"FF"}}
x-ms-gatewayversion
version=2.2.0.0
Date
Fri, 05 Apr 2019 05:40:21 GMT
Content-Type
application/json
If we continue the query with the composite continuation token we can see the result.
Is that a normal behavior or a bug?
The .NET SDK will handle the continuation token for you natively if you keep calling ExecuteNextAsync while HasMoreResults is true:
var query = Client.CreateDocumentQuery(
    UriFactory.CreateDocumentCollectionUri(databaseId, collectionId),
    sqlQuery,
    feedOptions).AsDocumentQuery();

while (query.HasMoreResults)
{
    var response = await query.ExecuteNextAsync();
    results.AddRange(response);
}
Adding feedback provided through the linked GitHub issue:
Regarding your issues, we have faced the same situations.
I have some comments.
We didn't know about this behavior; our client received an empty list with a continuation token, and it pretty much broke our flow.
Now, on the server side, we handle this situation and continue until we get a result. The issue is: what if there are 1000 partitions? Do we have to continue 100 times? Is that how CosmosDB protects their SLA and under-10-ms response time? #ThinkingLoud
Yes, this broke our flow the first time too. But we handle it on the server side as well (together with MaxNumberOfObjects, we continue to serve the request until we receive the number of items the client wants), and the pattern you're seeing is due to the underlying architecture of CosmosDB, consisting of physical + logical partitions. It sounds like you're implementing paging with the interplay of your client, and that is fine. However, I don't think this is what CosmosDB refers to with their SLA times.
What other undocumented unexpected behaviors are there that are going to get us by surprise in production environment? #ThinkingLoudAgain
This is a bit vague, but my advice would be for you to read up on all FeedOptions together with CosmosDB Performance tips and make sure you understand them as well as their application areas.
EDIT: Also, another warning. I am currently running into an issue with Continuation Token and DISTINCT keyword in the SQL query. It does not work as expected without an ORDER BY.
This is normal behavior for CosmosDb. There are several things that can cause the query to time out internally and result in a response with few results or even an empty collection.
From CosmosDb documentation:
The number of items returned per query execution will always be less than or equal to MaxItemCount. However, it is possible that other criteria might have limited the number of results the query could return. If you execute the same query multiple times, the number of pages might not be constant. For example, if a query is throttled there may be fewer available results per page, which means the query will have additional pages. In some cases, it is also possible that your query may return an empty page of results.
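For anyone hitting the same behavior from the JavaScript SDK (@azure/cosmos) rather than the .NET SDK shown above, a hedged sketch of the equivalent drain loop; the endpoint and key are placeholders, and the database/container names are taken from the response headers above:
import { CosmosClient } from "@azure/cosmos";

async function queryAllPages(): Promise<unknown[]> {
  const client = new CosmosClient({ endpoint: "https://<account>.documents.azure.com", key: "<key>" });
  const container = client.database("bettingedge").container("fixtures");

  // Keep draining while the iterator reports more results: an empty page that still carries a
  // continuation token is not the end of the result set.
  const iterator = container.items.query("select * from c where c.title = 'JACK CALLAGHAN'");
  const results: unknown[] = [];
  while (iterator.hasMoreResults()) {
    const { resources } = await iterator.fetchNext(); // a page may legitimately be empty
    results.push(...(resources ?? []));
  }
  return results;
}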

Cassandra data modeling timeseries data

I have this data about visited users for an app/service:
contentId (assume uuid),
platform (e.g. website, mobile, etc.),
softwareVersion (e.g. sw1.1, sw2.5, etc.),
regionId (e.g. us144, uk123, etc.)
....
I have modeled it very simply like
id(String) time(Date) count(int)
contentid1-sw1.1 Feb06 30
contentid1-sw2.1 Feb06 20
contentid1-sw1.1 Feb07 10
contentid1-sw2.1 Feb07 10
contentid1-us144 Feb06 23
contentid1-sw1.1-us144 Feb06 10
....
The reason is that there's a popular query where someone can ask for contentId=foo, platform=bar, regionId=baz, or any combination of those, for a range of time (say between Jan 01 and Feb 05).
But another query that's not easily answerable is:
Return the top K 'platform' values for contentId=foo between Jan 01 and Feb 05. By top, I mean sorted by the sum of 'count' over that range. So for the above data, a query for the top 2 platforms for contentId=contentid1 between Feb 06 and Feb 08 must return:
sw1.1 40
sw2.1 30
Not sure how to model that in C* to get answers for top K queries. Anyone have any ideas?
PS: there are 1 billion+ entries for each day.
Also I am open to using Spark or any other frameworks along with C* to get these answers.
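To make the top-K semantics concrete, here is a small sketch of the aggregation itself (group by platform, sum the counts over the range, sort descending, take K), using the sample counts for contentid1 above. It only illustrates the expected result; it is not a Cassandra data model, and at 1 billion+ rows per day this would need to be pre-aggregated or pushed into something like Spark, as the question already suggests.
interface Row { platform: string; time: string; count: number; }

// The contentid1 rows from the question, restricted to the Feb06-Feb08 range.
const rows: Row[] = [
  { platform: "sw1.1", time: "Feb06", count: 30 },
  { platform: "sw2.1", time: "Feb06", count: 20 },
  { platform: "sw1.1", time: "Feb07", count: 10 },
  { platform: "sw2.1", time: "Feb07", count: 10 },
];

function topK(rows: Row[], k: number): Array<[string, number]> {
  const totals = new Map<string, number>();
  for (const r of rows) totals.set(r.platform, (totals.get(r.platform) ?? 0) + r.count);
  return [...totals.entries()].sort((a, b) => b[1] - a[1]).slice(0, k);
}

console.log(topK(rows, 2)); // [ [ "sw1.1", 40 ], [ "sw2.1", 30 ] ]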

Inconsistent counts in Virtuoso 7.1 for large graphs

I have an instance of Virtuoso 7.1 running and DBpedia set up as clearly elucidated in this blog. Now I have a very basic requirement of finding some count values. However I am confused by the results of my query:
select count(?s)
where {?s ?p ?o .
FILTER(strstarts(str(?s),"http://dbpedia.org/resource")) }
With this query I'd like to see how many resources are present in DBpedia that have a URI starting with "http://dbpedia.org/resource". Essentially my hope is to find resources of the kind <http://dbpedia.org/resource/Hillary_Clinton> or <http://dbpedia.org/resource/Bill_Clinton> and so on.
My confusion lies in the fact that Virtuoso returns different results each time.
Now I have tried it on two different machines, a local machine and our server, and in both cases I see wildly different results. To give a sense of how wildly, sample counts were 1101000, 36314, 328014, and 292014.
Also, about the execution timeout: I did try changing it to 5000 or 8000 from the default 0, but that did not exactly increase the results.
I know DBpedia provides statistics for their dump, but I'd like to do this right in Virtuoso. Why does this anomaly occur?
Furthermore, I saw this discussion as well, where they refer to something that might be related. I would just like to know how to get the counts right for DBpedia in Virtuoso. If not Virtuoso, is there another graph store (e.g., Jena, rdf4j, Fuseki) that would do this right?
First thing -- Virtuoso 7.1 is very old (shipped 2014-02-17). I'd strongly advise updating to a current build, 7.2.4 (version string 07.20.3217) or later, whether Commercial or Open Source.
Now -- the query you're running has to do a lot of work to produce your result. It has to check every ?s for your string, and then tot up the count. That's going to need a (relatively) very long time to run; exactly how long is dependent on the runtime environment and total database size, among other significant factors.
The HTTP response headers (specifically, X-SQL-Message) will include notice of such query timeouts, as seen here:
$ curl -LI "http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=SELECT+COUNT%28%3Fs%29+%0D%0AWHERE%0D%0A++%7B++%3Fs++%3Fp++%3Fo++.+%0D%0A+++++FILTER%28strstarts%28str%28%3Fs%29%2C%22http%3A%2F%2Fdbpedia.org%2Fresource%22%29%29+%0D%0A++%7D+&format=text%2Fhtml&CXML_redir_for_subjs=121&CXML_redir_for_hrefs=&timeout=3000000&debug=on"
HTTP/1.1 200 OK
Date: Tue, 06 Sep 2016 16:39:44 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 128
Connection: keep-alive
Server: Virtuoso/07.20.3217 (Linux) i686-generic-linux-glibc212-64 VDB
X-SPARQL-default-graph: http://dbpedia.org
X-SQL-State: S1TAT
X-SQL-Message: RC...: Returning incomplete results, query interrupted by result timeout. Activity: 12.43M rnd 13.28M seq 0 same seg 8.023M same pg 3.369M same par 0 disk 0 spec disk 0B / 0 m
X-Exec-Milliseconds: 121040
X-Exec-DB-Activity: 12.43M rnd 13.28M seq 0 same seg 8.023M same pg 3.369M same par 0 disk 0 spec disk 0B / 0 messages 0 fork
Expires: Tue, 13 Sep 2016 16:39:44 GMT
Cache-Control: max-age=604800
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Access-Control-Allow-Methods: HEAD, GET, POST, OPTIONS
Access-Control-Allow-Headers: DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Accept-Encoding
Accept-Ranges: bytes
On your own instance, you can set larger or even infinite timeouts (e.g., MaxQueryExecutionTime) to ensure that you always get the most complete result available.
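You can also detect the truncation programmatically; here is a small sketch (assuming Virtuoso's default SPARQL endpoint on port 8890 of your own instance) that surfaces the X-SQL-State / X-SQL-Message headers so a timeout-truncated count is not mistaken for a complete one:
const endpoint = "http://localhost:8890/sparql"; // default Virtuoso HTTP port; adjust for your setup
const query = `SELECT COUNT(?s) WHERE { ?s ?p ?o . FILTER(strstarts(str(?s), "http://dbpedia.org/resource")) }`;

const params = new URLSearchParams({ query, format: "application/sparql-results+json" });
const res = await fetch(`${endpoint}?${params}`);

const warning = res.headers.get("x-sql-message");
if (warning) {
  // e.g. "RC...: Returning incomplete results, query interrupted by result timeout. ..."
  console.warn("Partial result:", warning);
}
console.log(JSON.stringify(await res.json(), null, 2));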

Generating unique short IDs - MongoDB to manage collisions

I am evaluating the following code to generate a short ID in my Node server (inspired by a previous post: Short user-friendly ID for mongo):
> b = crypto.pseudoRandomBytes(6)
<SlowBuffer d3 9a 19 fe 08 e2>
> rid = b.readUInt32BE(0)*65536 + b.readUInt16BE(4)
232658814503138
> rid.toString(36).substr(0,8).toUpperCase()
'2AGXZF2Z'
This may not guarantee uniqueness, but my requirements are to have a short ID of maximum length 8 characters and it must also be all upper case. The purpose of this is to make the ID user friendly.
To ensure that there are no collisions, I am planning to create a collection in MongoDB that contains documents that map the short ID, which will be an indexed field, onto the MongoDB ObjectID of the actual document I want the short ID to refer to.
What is the best strategy for doing this to ensure scalability and performance in a concurrent environment where multiple processes on multiple physical servers will be checking for short ID uniqueness?
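One possible shape for that mapping collection, sketched with the official Node driver: a unique index on the short ID plus insert-and-retry on duplicate-key errors, so concurrent processes on different servers never need a separate existence check. The database/collection names and the target payload are placeholders, and randomBytes replaces the deprecated pseudoRandomBytes used above.
import { randomBytes } from "node:crypto";
import { MongoClient, ObjectId } from "mongodb";

// Same scheme as in the question, producing at most 8 upper-case base-36 characters.
function makeShortId(): string {
  const b = randomBytes(6);
  const rid = b.readUInt32BE(0) * 65536 + b.readUInt16BE(4);
  return rid.toString(36).substring(0, 8).toUpperCase();
}

async function reserveShortId(client: MongoClient, target: ObjectId): Promise<string> {
  const coll = client.db("app").collection("short_ids"); // placeholder names
  await coll.createIndex({ shortId: 1 }, { unique: true }); // idempotent; normally done once at startup
  for (;;) {
    const shortId = makeShortId();
    try {
      await coll.insertOne({ shortId, target }); // the unique index arbitrates concurrent inserts
      return shortId;
    } catch (err: any) {
      if (err?.code !== 11000) throw err; // 11000 = duplicate key, i.e. a collision: retry with a new ID
    }
  }
}
Because the unique index does the arbitration, the worst case under contention is an occasional retry rather than a read-then-write race.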
