Document Schema Performance

Document Schema Performance - couchdb

I am trying to determine the best document schema for a CouchDB (2.3.1) project. In researching this, I am finding conflicting information and no relevant guidelines for the latest version of CouchDB or for similar scenarios. If this data does not lend itself to CouchDB, or if a method other than what's detailed below is preferred, I would like to better understand why.
My scenario is to track the manufacturing details of widgets:
- 100,000-300,000 widget types must be tracked
- Each widget type is manufactured between 200-1,800 times a day
- Widget type manufacturing may burst to ~10,000 in a day
- Each widget creation and its associated details must be recorded and updated
- Widget creation is stored for 30 days
- Query widget details by widget type and creationStartTime/creationEndTime
- I am not concerned with revisions, and can just update and use the same _rev if this may increase performance
Method 1:
{
"_id": "*",
"_rev": "*",
"widgetTypeId": "1831",
"creation": [{
"creationId": "da17faef-3591-4579-b5f6-ff0a719a6da7",
"creationStartTime": 1556471139,
"creationEndTime": 1556471173,
"color": "#ffffff",
"styleId": "92811",
"creatorId": "82812"
},{
"creationId": "893fede7-3874-44ed-b290-7001b4901bc9",
"creationStartTime": 1556471481,
"creationEndTime": 1556471497,
"color": "#cccccc",
"styleId": "75343",
"creatorId": "3211"
}]
}
Using method one would limit my document creation to 100,000-300,000 documents. However, these documents would be very tall and frequently updated.
Method 2:
{
"_id": "*",
"_rev": "*",
"widgetTypeId": "1831",
"creationId": "da17faef-3591-4579-b5f6-ff0a719a6da7",
"creationStartTime": 1556471139,
"creationEndTime": 1556471173,
"color": "#ffffff",
"styleId": "92811",
"creatorId": "82812"
},{
"_id": "*",
"_rev": "*",
"widgetTypeId": "1831",
"creationId": "893fede7-3874-44ed-b290-7001b4901bc9",
"creationStartTime": 1556471481,
"creationEndTime": 1556471497,
"color": "#cccccc",
"styleId": "75343",
"creatorId": "3211"
}
Method 2 creates a tall database, with one small document per widget creation.

It's a common problem to be faced with. In general terms, small, immutable documents will likely be more performant than few, huge, mutable documents. The reasons for this include:
- There is no support for partial updates (patch) in CouchDB. So if you need to insert data into an array in a big document, you need to fetch all of the data, unpack the JSON, insert the data, repack the JSON and send the whole thing back to CouchDB over the wire.
- Larger documents incur more internal overhead, too, especially when it comes to indexing.
- It's best to let the data that changes as a unit make up a document. Ever-growing lists in documents are a bad idea.
It seems to me that your second alternative is a perfect fit for what you want to achieve: a set of small documents that can be made immutable. Then make a set of views so you can query on time ranges and widget type.
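As a rough sketch of such a view, assuming the Method 2 document shape from the question: the map function below emits a composite key of widget type and start time, and a small helper builds the key range for a query. The names mapByTypeAndStart and rangeParams are made up for illustration, and emit is stubbed here only so the code runs outside CouchDB, where the view server normally provides it.

```javascript
// Stand-in for the emit() that CouchDB provides to map functions,
// defined here only so the sketch runs outside the database.
var emitted = [];
function emit(key, value) { emitted.push({ key: key, value: value }); }

// Map function for a hypothetical view.
// The composite key sorts first by widget type, then by start time.
function mapByTypeAndStart(doc) {
  if (doc.widgetTypeId && doc.creationStartTime) {
    emit([doc.widgetTypeId, doc.creationStartTime], null);
  }
}

// Build query parameters for one widget type and a start-time window.
// CouchDB expects startkey/endkey as JSON-encoded values.
function rangeParams(widgetTypeId, fromTime, toTime) {
  return {
    startkey: JSON.stringify([widgetTypeId, fromTime]),
    endkey: JSON.stringify([widgetTypeId, toTime]),
    include_docs: true
  };
}
```

Calling rangeParams("1831", 1556471000, 1556472000) would give the parameters for all creations of widget type 1831 that started inside that window. A second view keyed on creationEndTime would follow the same pattern.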

Related

RESTful API design - naming an "activity" resource

When designing the endpoints for an activity resource that provides information on the activity of other resources such as users and organisations we are struggling with naming conventions.
What would be more semantic:
/organisations/activity
/organisations/activity/${activityId}
/users/activity
/users/activity/${activityId}
OR
/activity/users/${activityId}
/activity/users
/activity/organisations/${activityId}
/activity/organisations

There's not a generic answer for this, especially since the mechanisms doing the lookup/retrieval at the other end, and associated back-ends vary so drastically, not to mention the use case purpose and intended application.
That said, assuming for all intents and purposes the "schema" (or ... endpoint convention from the point of view of the end user) was just going to be flat, I have seen many more of the latter activity convention, as that is the actual resource, which is what many applications and APIs are developed around.
I've come to expect the following style of representation from APIs today (how they achieve the referencing and mapping is a different story, but this is the view from the API reference):
{
"Activity": [
{
"date": "1970-01-01 08:00:00",
"some_other_resource_reference_uuid": "f1c4a41e-1639-4e35-ba98-e7b169d1c92d",
"user": "b3ababc4-461b-404a-a1a2-83b4ca8c097f",
"uuid": "0ccf1b41-aecf-45f9-a963-178128096c97"
}
],
"Users": [
{
"email": "johnanderson@mycompany.net",
"first": "John",
"last": "Anderson",
"user_preference_1": "somevalue",
"user_property_1": "somevalue",
"uuid": "b3ababc4-461b-404a-a1a2-83b4ca8c097f"
}
]
}
The StackExchange API also allows retrieving objects through multiple methods.
For example, the User type looks like this:
{
"view_count": 1000,
"user_type": "registered",
"user_id": 9999,
"link": "http://example.stackexchange.com/users/1/example-user",
"profile_image": "https://www.gravatar.com/avatar/a007be5a61f6aa8f3e85ae2fc18dd66e?d=identicon&r=PG",
"display_name": "Example User"
}
And on the Question type, the same user is shown underneath the owner object:
{
"owner": {
"user_id": 9999,
"user_type": "registered",
"profile_image": "https://www.gravatar.com/avatar/a007be5a61f6aa8f3e85ae2fc18dd66e?d=identicon&r=PG",
"display_name": "Example User",
"link": "https://example.stackexchange.com/users/1/example-user"
},
"is_answered": false,
"view_count": 31415,
"favorite_count": 1,
"down_vote_count": 2,
"up_vote_count": 3,
"answer_count": 0,
"score": 1,
"last_activity_date": 1494871135,
"creation_date": 1494827935,
"last_edit_date": 1494896335,
"question_id": 1234,
"link": "https://example.stackexchange.com/questions/1234/an-example-post-title",
"title": "An example post title",
"body": "An example post body"
}
On the Posts type reference (using this as a separate example because there are only a handful of methods to reach this type), you'll see an example at the bottom:
Methods That Return This Type
  posts
  posts/{ids}
  users/{ids}/posts 2.2
  me/posts 2.2
So whilst you can access resources (or "types" as it is on StackExchange), through a number of ways including filters and complex queries, there still exists the ability to see the desired resource through a number of more direct transparent URI conventions.
Different applications will clearly have different requirements. For example, the Gmail API is user-based all the way. This makes sense from a user's point of view, given that in the context of the authenticated credential you're separating one user's objects from another's.
This doesn't mean Google uses the same convention for all of their APIs; their Activities API resource is all about the activity.
Even looking at the Twitter API, there is a Direct Messages endpoint resource that has sender and receiver objects within.
I've not seen many APIs at all that are limited to accessing resources purely via a user endpoint, unless the situation obviously calls for it, i.e. the Gmail example above.
Regardless of how flexible a REST API can be, the minimum I have come to expect is that some kind of activity, location, physical object, or other entity is usually its own resource, and the user association is plugged in and referenced at various degrees of flexibility (at a minimum, the example given at the top of this post).

It should be pointed out that in a true REST API the URI holds no meaning. It's the link relationships from your organisations and users resources that matter.
Clients should just discover those URLs, and should also adapt to the new situation if you decide that you want a different URL structure after all.
That being said, it's nice to have a logical structure for this type of thing. However, either is fine. You're asking for an opinion; there is not really a standard or best practice. That said, I would choose option #1.
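To illustrate link discovery, here is a minimal sketch. It assumes a HAL-style "_links" object on each representation; that layout, the "activity" relation name, the sample resource and the follow helper are all assumptions for illustration, not something the question specifies.

```javascript
// Hypothetical representation of an organisation. The "_links" layout and
// the "activity" relation name are assumptions for illustration.
const organisation = {
  name: "Example Org",
  _links: {
    self: { href: "/organisations/42" },
    activity: { href: "/activity/organisations" }
  }
};

// Resolve a URL by link relation instead of building it from a URI template.
function follow(resource, rel) {
  const link = resource._links && resource._links[rel];
  if (!link) {
    throw new Error("No link with relation '" + rel + "'");
  }
  return link.href;
}
```

A client written this way calls follow(organisation, "activity") and keeps working if the server later moves the activity resource to a different URL.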

Aggregate query for IBM Cloudant which is basically couchDB

I am a contributor at http://airpollution.online/, an open environment web platform built as open source, with IBM Cloudant as its database service.
The platform's architecture requires us to fetch the latest data for each air pollution measurement device from a collection. Drawing on my experience with MongoDB, I have written an aggregate query that fetches each device's latest data, based on the epoch time key present in every document in the collection.
The sample aggregate query is:
db.collection("hourly_analysis").aggregate([
{
$sort: {
"time": -1,
"Id": -1
}
}, {
$project: {
"Id": 1,
"data": 1,
"_id": 0
}
}, {
$group: {
"_id": "$Id",
"data": {
"$first": "$$ROOT"
}
}
}
]);
If someone has ideas or suggestions about how I can write design documents in IBM Cloudant to do the same, please help me! Thanks!
P.S. We have yet to open-source the backend for this project (it may take some time).

In CouchDB/Cloudant this is usually better done as a view than an ad-hoc query. It's a tricky one but try this:
- a map step that emits the device ID and timestamp as two parts of a composite key, plus the device reading as the value
- a reduce step that looks for the largest timestamp and returns both the biggest (most recent) timestamp and the reading that goes with it (both values are needed because when rereducing, we need to know the timestamp so we can compare them)
- the view with group_level set to 1 will give you the newest reading for each device.
In most cases in Cloudant you can use the built-in reduce functions, but here you need a custom one, because the result depends on comparing the timestamps within each device's key.
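The map and reduce steps described above could look roughly like this. It is a sketch that assumes documents carry the Id, time and data fields used in the question's MongoDB query; the function names are made up, and emit is stubbed so the code runs outside the database. The view functions stick to plain var and for loops, since the JavaScript engine behind views may be older than the one in a modern browser or Node.

```javascript
// Stand-in for the emit() that the view server provides to map functions.
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// Map: composite key [device ID, timestamp]. The value repeats the timestamp
// so that the reduce step can compare readings without looking at the keys.
function mapLatest(doc) {
  if (doc.Id !== undefined && doc.time !== undefined) {
    emit([doc.Id, doc.time], { time: doc.time, data: doc.data });
  }
}

// Reduce: keep the value with the largest timestamp. Reduce output has the
// same shape as the mapped values, so the same logic also handles rereduce.
function reduceLatest(keys, values, rereduce) {
  var newest = values[0];
  for (var i = 1; i < values.length; i++) {
    if (values[i].time > newest.time) {
      newest = values[i];
    }
  }
  return newest;
}
```

Querying such a view with group_level=1 would then return one row per device, holding its newest timestamp and reading.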
(The way that I solved this problem previously was to copy incoming data into a "newest readings" store as well as writing it to a database in the normal way. This makes it very quick to access if you only ever want the newest reading.)

How to backup/dump structure of graphs in arangoDB

Is there a way to dump the graph structure of an ArangoDB database? Unfortunately, arangodump just dumps the data of edges and collections.

According to the documentation, in order to dump structural information of all collections (including system collections), you run the following:
arangodump --dump-data false --include-system-collections true --output-directory "dump"
If you do not want the system collections to be included then don't provide the argument (it defaults to false) or provide a false value.
This is how the structure and data of collections are dumped, according to the documentation:
Structural information for a collection will be saved in files with name pattern .structure.json. Each structure file will contain a JSON object with these attributes:
- parameters: contains the collection properties
- indexes: contains the collection indexes
Document data for a collection will be saved in files with name pattern .data.json. Each line in a data file is a document insertion/update or deletion marker, alongside with some meta data.

For testing I often want to extract a subgraph with a known structure. I use that to test my queries against. The method is not pretty but it might address your question. I blogged about it here.

Although @Raf's answer is accepted, --dump-data false will only give structure files for all the collections; the data won't be there. Including --include-system-collections true would give the _graphs system collection's structure, which won't have information pertaining to the creation/structure of individual graphs.
To get the graph creation data as well, the right command is as follows.
arangodump --server.database <DB_NAME> --include-system-collections true --output-directory <YOUR_DIRECTORY>
We are interested in the file named _graphs_<long_id>.data.json, which has the data format below.
{
"type": 2300,
"data":
{
"_id": "_graphs/social",
"_key": "social",
"_rev": "_WaJxhIO--_",
"edgeDefinitions": [
{
"collection": "relation",
"from": ["female", "male"],
"to": ["female", "male"]
}
],
"numberOfShards": 1,
"orphanCollections": [],
"replicationFactor": 1
}
}
Hope this helps other users who have the same requirement!
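Once the _graphs data file is dumped, the graph definitions can be pulled out of it with a few lines of code. The sketch below assumes each non-empty line of the file is one JSON object shaped like the example above (a "data" attribute holding the graph document); the function name readGraphDefinitions is made up, and the file layout may differ between ArangoDB versions.

```javascript
// Parse the text of a _graphs data file and summarise each graph definition.
// Assumes one JSON marker per line, with the graph document under "data".
function readGraphDefinitions(dumpText) {
  return dumpText
    .split("\n")
    .filter(line => line.trim() !== "")
    .map(line => JSON.parse(line))
    .filter(marker => marker.data && marker.data.edgeDefinitions)
    .map(marker => ({
      name: marker.data._key,
      edgeDefinitions: marker.data.edgeDefinitions,
      orphanCollections: marker.data.orphanCollections || []
    }));
}
```

In Node, the input would typically come from something like fs.readFileSync(pathToDataFile, "utf8"), where pathToDataFile points at the dumped _graphs data file.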

Currently ArangoDB manages graphs via documents in the system collection _graphs.
One document equals one graph. It contains the graph name, the vertex collections involved, and the edge definitions that configure the directions of the edge collections.

How to find nearest points using latitude and longitude from a DocumentDB collection?

I'm using Microsoft's DocumentDB to store locations of hundreds and thousands of geographic locations in the UK. A document from the database collection looks like the following.
{
"CommonName": "Cassell Road",
"CommonNameLang": "en",
"Street": "Downend Road",
"StreetLang": "en",
"Indicator": "SW-bound",
"IndicatorLang": "en",
"Bearing": "SW",
"LocalityName": "Fishponds",
"ParentLocalityName": "Bristol",
"Easting": 364208,
"Northing": 176288,
"Longitude": -2.5168477762,
"Latitude": 51.4844052488
}
What type of DocumentDB query could I use based on the user's latitude and longitude to find the nearest points from the collection?
Thanks

You'll need a geospatial function like ST_NEAR or ST_DISTANCE. This is not currently available; you can check the status of the feature here.
Short-term, especially if you have < 1000 documents, you might be able to use an existing geospatial library, e.g. System.Spatial, and perform the processing client-side. Alternatively, you can use a JavaScript library like GeoJSON Utils within stored procedures to do this.
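As a minimal sketch of the client-side approach, the code below computes great-circle distances with the haversine formula and sorts the documents by distance from the user. It assumes the documents have already been fetched and carry the Latitude and Longitude fields shown in the question; the function names haversineKm and nearest are made up for illustration, and no library is used.

```javascript
// Great-circle distance in kilometres between two latitude/longitude points.
function haversineKm(lat1, lon1, lat2, lon2) {
  const R = 6371; // mean Earth radius in km
  const toRad = deg => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(a));
}

// Return the k documents closest to (lat, lon), nearest first.
function nearest(docs, lat, lon, k) {
  return docs
    .map(doc => ({ doc: doc, km: haversineKm(lat, lon, doc.Latitude, doc.Longitude) }))
    .sort((a, b) => a.km - b.km)
    .slice(0, k);
}
```

This scans every document, so it only suits small collections; for larger ones, pre-filtering by a bounding box on Latitude and Longitude before computing distances would cut the work down.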

How to find an object which is at nth nested level in mongoDB? (single collection, single document)

I am trying to find an object nested at the nth level, using its '_id', within a single document.
Any suggestions or references or code samples would be appreciated.
(e.g)
Document will look as below:
{
"_id": "xxxxx",
"name": "One",
"pocket": [{
"_id": "xxx123",
"name": "NestedOne",
"pocket": []
}, {
"_id": "xxx1234",
"name": "NestedTwo",
"pocket": [{
"_id": "xxx123456",
"name": "NestedTwoNested",
"pocket": [{"_id": "xxx123666",
"name": "NestedNestedOne",
"pocket": []
}]
}]
}]
}
Pockets can hold more pockets, and the nesting depth is dynamic.
Here, I would like to search "pocket" by "_id", say "xxx123456", without using a static reference.
Thanks again.

I highly recommend you change your document structure to something easier to manage/search, as this will only become more of a pain to work with.
Why not use multiple collections, like explained in this answer?
So an easy way to think about this for your situation, which I hope is easier for you to reason about than dropping some schema code...
Store all of your things as children in the same document. Give them unique _ids.
Store all of the contents of their pockets as collections. The collections simply hold all the ids that would normally be inside the pocket (instead of referencing the pockets themselves).
That way, a lot of your work can happen outside of DB calls. You just batch pull out the items you need when you need them, instead of searching nested documents!
However, if you can work with the entire document:
Looks like you want to do a recursive search a certain number of levels deep. I'll give you a general idea with some code, in hopes that you'll be able to figure the rest out.
Say your function will be:
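function SearchNDeep(obj, n, id) {
  // Go down 1 level, to the things in the top-level pocket.
  let things = obj.pocket || [];
  // Each pass checks one level of things for further pockets and moves
  // into them, incrementing the level counter as it goes.
  for (let level = 1; level < n && things.length > 0; level++) {
    things = things.flatMap(thing => thing.pocket || []);
  }
  // The counter has reached the right level: look for the thing
  // whose '_id' is `id` (undefined if there is none).
  return things.find(thing => thing._id === id);
}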
That's the general idea. There is a cleaner, recursive way to do this where you call SearchNDeep while passing a number for how deep you are, base case being no more levels to go, or the object is found.
Remember to return false or undefined if you don't find it, and the right object if you do! Good luck!
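The recursive variant mentioned above might look like the following sketch. The name findById and the optional maxDepth parameter are assumptions for illustration; the document shape (an "_id" plus a "pocket" array on every level) is taken from the question.

```javascript
// Depth-first search through nested "pocket" arrays.
// maxDepth limits how many levels below `node` are searched.
function findById(node, id, maxDepth = Infinity) {
  if (node._id === id) {
    return node; // base case: the object is found
  }
  if (maxDepth === 0) {
    return undefined; // base case: no more levels to go
  }
  for (const child of node.pocket || []) {
    const hit = findById(child, id, maxDepth - 1);
    if (hit) {
      return hit;
    }
  }
  return undefined; // not found anywhere below this node
}
```

With the sample document from the question, findById(doc, "xxx123456") walks into the second top-level pocket and returns the "NestedTwoNested" object, without any hard-coded path.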
