couchdb - Map Reduce - How to join different documents and group results within a reduce function

I am struggling to implement a map/reduce function that joins two document types and sums the results in the reduce.
First document type is Categories. Each category has an ID and within the attributes I stored a detail category, a main category and a division ("Bereich").
{
  "_id": "a124",
  "_rev": "8-089da95f148b446bd3b33a3182de709f",
  "detCat": "Life_Ausgehen",
  "mainCat": "COL_LEBEN",
  "mainBereich": "COL",
  "type": "Cash",
  "dtCAT": true
}
The second document type is a transaction. The attributes show all the details for each transaction, including the field "newCat" which is a reference to the category ID.
{
  "_id": "7568a6de86e5e7c6de0535d025069084",
  "_rev": "2-501cd4eaf5f4dc56e906ea9f7ac05865",
  "Value": 133.23,
  "Sender": "Comtech",
  "Booking Date": "11.02.2013",
  "Detail": "Oki Drucker",
  "newCat": "a124",
  "dtTRA": true
}
Now I want to develop a map/reduce to get the result in the form:
e.g.: "Name of Main Category", "Sum of all values in transactions".
I figured out that I could reference another document with "_id" and ?include_docs=true, but in that case I cannot use a reduce function.
I looked at other postings here, but couldn't find a suitable example.
It would be great if somebody had an idea how to solve this issue.

I understand that multiple Category documents may have the same mainCat value. The technique called view collation is suitable for some cases where a single join would be used in a relational model. In your case it will not help: although you use two document schemes, you really have a three-level structure: main-category <- category <- transaction. I think you should consider changing the DB design a bit.
Duplicating the data, by storing the mainCat value in the transaction document as well, would help. I suggest using a meaningful ID for the transaction instead of a generated one. You could consider, for example, "COL_LEBEN-7568a6de86e5e" (the mainCat concatenated with some random value, where the - delimiter never appears in the mainCat). Then, with a simple parser in the map function, you emit ["COL_LEBEN", "7568a6de86e5e"] for transactions, ["COL_LEBEN"] for categories, and reduce to get the sum.
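A minimal sketch of that approach, assuming the transaction _id really has the form "<mainCat>-<random>" with the - delimiter as suggested above, and using the built-in _sum as the reduce:
function (doc) {
  if (doc.dtTRA) {
    var parts = doc._id.split("-");        // "COL_LEBEN-7568a6de86e5e" -> ["COL_LEBEN", "7568a6de86e5e"]
    emit([parts[0], parts[1]], doc.Value); // key: [mainCat, random part], value: transaction amount
  } else if (doc.dtCAT) {
    emit([doc.mainCat], 0);                // categories contribute nothing to the sum
  }
}
Querying the view with ?group_level=1 then returns one row per main category with the summed transaction values.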

Related

ArangoDB populate relation as field over graph query

I recently started using Arango since I want to make use of the advantages of graph databases. However, I'm not yet sure what the most elegant and efficient approach is to query an item from a document collection and apply fields to it that are part of a relation.
I'm used to using population or joins in SQL and NoSQL databases, but I'm not sure how it works here.
I created a document collection called posts. For example, this is a post:
{
  "title": "Foo",
  "content": "Bar"
}
And I also have a document collection called tags. A post can have any amount of tags, and my goal is to fetch either all or specific posts, but with their tags included, so for example this as my returning query result:
{
  "title": "Foo",
  "content": "Bar",
  "tags": ["tag1", "tag2"]
}
I tried creating those two document collections and an edge collection post-tags-relation, in which I added an edge from the post to each of its tags. I also created a graph, although I'm not yet sure what the vertex field is used for.
My query looked like this
FOR v, e, p IN 1..2 OUTBOUND 'posts/testPost' GRAPH post-tags-relation RETURN v
And it did give me the tags, but my goal is to fetch a post and include the tags in the same document... The path vertices do contain all the tags and the post, but in separate arrays, which is neither nice nor easy to use (and probably not the right way). I'm probably missing something important here. Hopefully someone can help.
You're really close - it looks like your query to get the tags is correct. Now, just add a bit to return the source document:
FOR post IN posts
  FILTER post._key == 'testPost'
  LET tags = (
    FOR v IN 1..2 OUTBOUND post
      GRAPH 'post-tags-relation'
      RETURN v.value
  )
  RETURN MERGE(
    post,
    { tags }
  )
Or, if you want to skip the FOR/FILTER process:
LET post = DOCUMENT('posts/testPost')
LET tags = (
  FOR v IN 1..2 OUTBOUND post
    GRAPH 'post-tags-relation'
    RETURN v.value
)
RETURN MERGE(
  post,
  { tags }
)
As for the graph definition, there are three required fields:
edge definitions (an edge collection)
from collections (where your edges come from)
to collections (where your edges point to)
The non-obvious vertex collections field is there to allow you to include a set of vertex-only documents in your graph. When these documents are searched and how they're filtered remains a mystery to me. Personally, I've never used this feature (my data has always been connected) so I can't say when it would be valuable, but someone thought it was important to include.
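For reference, a hedged arangosh sketch of such a graph definition; the edge collection name postTags is an assumption (it isn't given above), and the module path is the ArangoDB 3.x one:
// Create the named graph with one edge definition; vertex-only ("orphan")
// collections would go in the third argument.
var graphModule = require("@arangodb/general-graph");
graphModule._create(
  "post-tags-relation",
  [graphModule._relation("postTags", ["posts"], ["tags"])],  // edges go from posts to tags
  []                                                         // no vertex-only collections
);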

learning mapreduce in Fauxton

I am brand new to NoSQL, CouchDB, and MapReduce and need some help.
I have the same question discussed here {How to use reduce in Fauxton} but do not understand the answer :(
I have a working map function:
function (foo) {
  if(foo.type == "blog post");
  emit(foo)
}
which returns 11 individual documents. I want to modify this to return foo.type along with a count of 1.
I have tried:
function (doc) {
  if(doc.type == "blog post");
  return count(doc)
}
and "_count" from the Reduce panel, but clearly am doing something wrong as the View does not return anything.
Thanks in advance for any assistance or guidance!
In Fauxton, the Reduce step is kind of awkward and unintuitive to find.
Select _count in the "Reduce (optional)" popup below where you type in your Map.
Select "Save Document and then Build Index". That will display your map results.
Find the "Options" button at the top next to a gears icon. If you see a green band instead, close the green band with the X.
Select Options, then the "Reduce" check-circle. Select Run Query.
Map
So when you build a map function, you are literally creating a dictionary or map, which is a key:value data structure.
Your map function should emit the keys that you will query. You can also emit a value, but if you intend to simply get the associated document, you don't have to emit any value. Why? Because there is a query parameter that can be used to return the associated document (?include_docs=true).
Reduce
Then, you can have a reduce function, which is called for the view results: every result with the same key is processed through your reduce function to reduce the value.
Corrected example
So in your case, I suppose you want to map documents by type.
You could create a function that emits the documents that have a type property.
function (doc) {
  if (doc.type)
    emit(doc.type);
}
If you query this view, you will see that the key of each row is the type of the document. If you choose the _count reduce function, you should get the number of documents per type.
When querying the view, you have to specify: group=true&reduce=true
Also, you can get all the documents of type blog post by querying with this parameter: ?key="blog post"
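For example, with a database, design document and view name that are just placeholders here:
GET /mydb/_design/example/_view/by_type?group=true
  -> one row per document type with its count
GET /mydb/_design/example/_view/by_type?key="blog post"
  -> the count of documents with type "blog post"
GET /mydb/_design/example/_view/by_type?key="blog post"&reduce=false&include_docs=true
  -> the "blog post" documents themselves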

loopback relational database hasManyThrough pivot table

I seem to be stuck on a classic ORM issue and don't really know how to handle it, so at this point any help is welcome.
Is there a way to get the pivot table on a hasManyThrough query? Better yet, to apply some filter or sort to it? A typical example:
Table products
id, title
Table categories
id, title
Table products_categories
productsId, categoriesId, orderBy, main
So, in the above scenario, say you want to get all categories of product X that are main (main = true), or you want to sort the product categories by orderBy.
What happens now is a first SELECT on products to get the product data, a second SELECT on products_categories to get the categoriesId, and a final SELECT on categories to get the actual categories. Ideally, filters and sorts should be applied to the 2nd SELECT, like:
SELECT `id`, `productsId`, `categoriesId`, `orderBy`, `main` FROM `products_categories` WHERE `productsId` IN (180) AND `main` = 1 ORDER BY `orderBy` DESC
Another typical example would be wanting to order the product images based on the order the user wants them in,
so you would have a products_images table
id, image, productsId, orderBy
and you would want to
SELECT * FROM `products_images` WHERE `productsId` IN (180) ORDER BY `orderBy` ASC
Is that even possible?
EDIT: Here is the relationship needed for an intermediate table to get what I need based on my schema.
Products.hasMany(Images, {
  as: "Images",
  "foreignKey": "productsId",
  "through": ProductsImagesItems,
  scope: function (inst, filter) {
    return {active: 1};
  }
});
The thing is, the scope function is giving me access to the final result and not to the intermediate table.
I am not sure I fully understand your problem(s), but for sure you need to move away from the table concept and express your problem in terms of Models and Relations.
The way I see it, you have two models, Product (properties: title) and Category (properties: main).
Then, you can have relations between the two, potentially:
Product belongsTo Category
Category hasMany Product
This means a product will belong to a single category, while a category may contain many products. There are other relations available.
Then, using the generated REST API, you can filter GET requests to get items based on their properties (like main in your case), or use custom GET requests (automatically generated when you add relations) to get, for instance, all products belonging to a specific category, as sketched below.
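For example, assuming the default REST endpoints for those two models and a relation named products on Category (the category id 1 is just an example):
GET /api/Categories?filter={"where":{"main":true}}
GET /api/Categories/1/products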
Does this help?
Based on what you have here I'd probably recommend using the scope option when defining the relationship. The LoopBack docs show a very similar example of the "product - category" scenario:
Product.hasMany(Category, {
  as: 'categories',
  scope: function (instance, filter) {
    return { type: instance.type };
  }
});
In the example above, instance is a category that is being matched, and each product would have a new categories property that would contain the matching Category entities for that Product. Note that this does not follow your exact data schema, so you may need to play around with it. Also, I think your API query would have to specify that you want the categories related data loaded (it is not included by default):
/api/Products/13?filter={"include":["categories"]}
I suggest you define a custom / remote method in Product.js that does the work for you.
// The join-model and relation names below (ProductsCategories, 'product', 'category')
// are assumptions; adjust them to your actual setup.
Product.getCategories = function (productId, cb) {
  // If you take the product title as a param instead of productId,
  // you will first need to look up the product ID.
  // Then execute a find query on products_categories with:
  // 1. a where filter to get only main categories with productsId = productId
  // 2. an include filter to include the product and category objects
  // 3. an order filter to sort items based on the orderBy column
  Product.app.models.ProductsCategories.find({
    where: {productsId: productId, main: true},
    include: ['product', 'category'],
    order: 'orderBy DESC'
  }, function (err, items) {
    // items is an array of products_categories;
    // each item / object in the array has nested Product and Category objects.
    cb(err, items);
  });
};

Cloudant 1 to many function

I’ve just started to use Cloudant and I just can’t get my head around the map functions. I’ve been fiddling with the data below but it isn’t working out as I expected.
The relationship is: a user can have many vehicles; a vehicle belongs to 1 user. The vehicle ‘userId’ is the key of the user. There is a bit of redundancy, as in the user document the _id and userId are the same; I guess the latter is not required.
Anyhow, how can I find, for a/every user, the vehicles which belong to it? The closest I’ve come through trial and error is a result which displays the owner of every vehicle, but I would like it the other way round: the user and the vehicles belonging to it. All the examples I’ve found use another document which ‘joins’ two or more documents, but I don’t need to do that?
Any point in the right direction appreciated - I really have no idea.
function (doc) {
  if (doc.$doctype == "vehicle") {
    emit(doc.userId, {_id: doc.userId});
  }
}
EDIT: Getting closer. I'm not sure exactly what I was expecting, but the result seems a bit 'messy'. Row[0] is the user document, row[n > 0] are the vehicle documents. I guess it's fine when a startkey/endkey is used, but without one the results are a bit jumbled up.
function (doc) {
  if (doc.$doctype == 'user') {
    emit([doc._id, 0], doc);
  } else if (doc.$doctype == 'vehicle') {
    emit([doc.userId, 1, doc._id], doc);
  }
}
A user is described as,
{
  "_id": "user:10",
  "firstname": "firstnamehere",
  "secondname": "secondnamehere",
  "userId": "user:10",
  "$doctype": "user"
}
a vehicle is described as,
{
  "_id": "vehicle:4002",
  "name": "avehicle",
  "userId": "user:10",
  "$doctype": "vehicle"
}
You're heading in the right direction! You already got that right with the global IDs. Having the type of the document as part of the ID in some form is a very good idea, so that you don't get confused later (all documents are in the same "pot").
Here are some minor problems with your current solution (before getting to your actual question):
Don't emit the doc as the value in emit(key, value). You can always ask for the document that belongs to a view row by querying with include_docs=true. Having the doc as the view value increases the size of the view index a lot. When you don't need a specific value, use emit(key, null).
You also don't need the ID in the emit value. You'll get the ID of the document that belongs to a view row as part of the row anyway. (See the revised map function below.)
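Applying both points, the map function from your edit becomes:
function (doc) {
  if (doc.$doctype === 'user') {
    emit([doc._id, 0], null);
  } else if (doc.$doctype === 'vehicle') {
    emit([doc.userId, 1, doc._id], null);
  }
}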
View Collation
Now to your problem of aggregating the vehicles with their user. You got the basic pattern right. This pattern is called view collation; you can read more about it in the CouchDB docs (ignore that it is in the "Couchapp" section).
The trick with view collation is that you return two or more types of documents, but make sure that they are sorted in a way that allows for direct grouping. Thus it is important to understand how CouchDB sorts the view result. See the collation specification for more information on that one. An important key to understanding view collation is that rows with array keys are sorted by key elements. So when two rows have the same key[0], they sort by key[1]. If that's equal as well, key[2] is considered, and so on.
Your map function first groups users and vehicles by user ID (key[0]). It then uses the fact that 0 sorts before 1 in the second element of the key, so your view will contain the following:
user 1
vehicle of user 1
vehicle of user 1
vehicle of user 1
user 2
user 3
vehicle of user 3
user 4
etc.
As you can see, the vehicles of a user immediately follow their user. Thus you can group this result into aggregates without performing expensive sort or lookup operations, for example with a small client-side fold like the one sketched below.
Note that users are sorted according to their ID, and vehicles within a user also according to their ID. This is because you use the IDs in the key array.
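A minimal sketch of that client-side grouping, assuming the view above is queried with include_docs=true:
function groupUsersWithVehicles(rows) {
  var result = [];
  var current = null;
  rows.forEach(function (row) {
    if (row.key[1] === 0) {              // user row
      current = {user: row.doc, vehicles: []};
      result.push(current);
    } else if (current) {                // vehicle rows follow their user
      current.vehicles.push(row.doc);
    }
  });
  return result;
}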
Creating Queries
Now that view isn't worth much if you can't query according to your needs. A view as you have it supports the following queries:
Get all users with their vehicles
Get a range of users with their vehicles
Get a single user with its vehicles
Get a single user without vehicles (you could also use the _all_docs view for that though)
Example query for "all users between user 1 and user 3 (inclusive) with their vehicles"
We want to query for a range, so we use startkey and endkey in the query:
startkey=["user:1", 0]
endkey=["user:3", 1, {}]
Note the use of {} as a sentinel value, which is required so that the end key is larger than any row that has a key of ["user:3", 1, (anyConceivableVehicleId)].
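Put together as a request (the database, design document and view names are placeholders, and the key arrays need to be URL-encoded in practice):
GET /db/_design/app/_view/users_and_vehicles?include_docs=true&startkey=["user:1",0]&endkey=["user:3",1,{}]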

ElasticSearch default scoring mechanism

What I am looking for is a plain, clear explanation of how the default scoring mechanism of ElasticSearch (Lucene) really works. I mean, does it use Lucene scoring, or does it use scoring of its own?
For example, I want to search for documents by the "Name" field. I use the .NET NEST client to write my queries. Let's consider this type of query:
IQueryResponse<SomeEntity> queryResult = client.Search<SomeEntity>(s =>
    s.From(0)
     .Size(300)
     .Explain()
     .Query(q => q.Match(a => a.OnField(q.Resolve(f => f.Name)).QueryString("ExampleName")))
);
which is translated to such JSON query:
{
  "from": 0,
  "size": 300,
  "explain": true,
  "query": {
    "match": {
      "Name": {
        "query": "ExampleName"
      }
    }
  }
}
There are about 1.1 million documents that the search is performed on. What I get in return is (this is only part of the result, formatted on my own):
650 "ExampleName" 7,313398
651 "ExampleName" 7,313398
652 "ExampleName" 7,313398
653 "ExampleName" 7,239194
654 "ExampleName" 7,239194
860 "ExampleName of Something" 4,5708737
where the first field is just an Id, the second is the Name field on which ElasticSearch performed its search, and the third is the score.
As you can see, there are many duplicates in the ES index. As some of the found documents have a different score despite being exactly the same (with only a different Id), I concluded that different shards performed the search on different parts of the whole dataset, which leads me to suspect that the score is somewhat based on the overall data in a given shard, not exclusively on the document that is actually considered by the search engine.
The question is, how exactly does this scoring work? I mean, could you tell me/show me/point me to the exact formula used to calculate the score for each document found by ES? And eventually, how can this scoring mechanism be changed?
The default scoring is the DefaultSimilarity algorithm in core Lucene, largely documented here. You can customize scoring by configuring your own Similarity, or by using something like a custom_score query.
The odd score variation in the first few results shown seems small enough that it doesn't concern me much as far as the validity of the query results and their ordering goes, but if you want to understand the cause of it, the explain api can show you exactly what is going on there.
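For instance, a hedged sketch of calling the explain api for one of the duplicated documents (the host, index, type and document id are placeholders here):
curl -XGET 'localhost:9200/index/type/{id}/_explain?pretty=true' -d '{
  "query": {
    "match": {
      "Name": {
        "query": "ExampleName"
      }
    }
  }
}'
The response breaks that document's score down into its term frequency, inverse document frequency and norm components.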
The score variation is based on the data in a given shard (like you suspected). By default ES uses a search type called 'query then fetch', which sends the query to each shard and finds all the matching documents with scores using local TF-IDFs (these will vary based on the data on a given shard - here's your problem).
You can change this by using the 'dfs query then fetch' search type - it pre-queries each shard asking about term and document frequencies and then sends the query to each shard, etc.
You can set it in the URL:
$ curl -XGET 'localhost:9200/index/type/_search?pretty=true&search_type=dfs_query_then_fetch' -d '{
  "from": 0,
  "size": 300,
  "explain": true,
  "query": {
    "match": {
      "Name": {
        "query": "ExampleName"
      }
    }
  }
}'
There is a great explanation in the ElasticSearch documentation:
What is relevance:
https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-intro.html
Theory behind relevance scoring:
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
