Storing and Grouping IP Addresses in CouchDB - couchdb

I have a Couch Database that contains a stack of IP address documents like this:
{
"_id": "09eea172ea6537ad0bf58c92e5002199",
"_rev": "1-67ad27f5ab008ad9644ce8ae003b1ec5",
"1stOctet": "10",
"2ndOctet": "1",
"3rdOctet": "3",
"4thOctet": "55"
}
The documents are IP addresses that belong to different subnet ranges.
I need a way to reduce/group these documents based on the 1st, 2nd, 3rd and 4th octets in order to produce a reduced list of subnets.
Has anybody done anything like this before?
Best Regards,
Carlskii

I'm not sure this is exactly what you're looking for. If you can provide an example of your desired output, I can likely be of more help.
First, I would have your document structure look like this: (if you can't change that structure, it's not a big deal)
{
"ip": "10.1.3.55"
}
Your map function would look like:
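function (doc) {
  if (doc.ip) {
    // Key is the array of octets, e.g. ["10", "1", "3", "55"]
    emit(doc.ip.split("."), null);
  }
}
You'll need a reduce function. I've just used the built-in one below in my testing: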
_count
Then I would use the group_level view query parameter to group based on each octet.
1 = group based on 1st octet
2 = group based on 1st-2nd octet
3 = group based on 1st-3rd octet
4 = group based on all four octets (the entire address)
group=true is functionally the same in this case as group_level=4
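To see what the grouped output looks like, here is a minimal local simulation in JavaScript. It does not talk to CouchDB: countByGroupLevel is a hypothetical helper that mimics what the view above would return with the _count reduce and a given group_level. A real view would also return the rows sorted by key.

```javascript
// Same key as the map function above: the array of octets.
function mapIp(doc) {
  return doc.ip.split(".");
}

// Hypothetical helper: counts documents per key prefix, like _count + group_level.
function countByGroupLevel(docs, groupLevel) {
  const counts = new Map();
  for (const doc of docs) {
    const key = mapIp(doc).slice(0, groupLevel);
    const id = JSON.stringify(key);
    counts.set(id, (counts.get(id) || 0) + 1);
  }
  return [...counts].map(([id, value]) => ({ key: JSON.parse(id), value }));
}

// Sample documents, made up for illustration.
const docs = [
  { ip: "10.1.3.55" },
  { ip: "10.1.3.56" },
  { ip: "10.1.4.1" },
  { ip: "192.168.0.1" }
];

// group_level=3 groups on the first three octets, i.e. one row per /24 subnet.
const subnets = countByGroupLevel(docs, 3);
console.log(subnets);
```

Note that the octets are strings, so CouchDB sorts them as strings ("10" before "9"). If numeric ordering matters, emitting numbers instead would be one option.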

Related

Cloudant 1 to many function

I’ve just started to use Cloudant and I just can’t get my head around the map functions. I’ve been fiddling with the data below but it isn’t working out as I expected.
The relationship is: a user can have many vehicles, and a vehicle belongs to one user. The vehicle ‘userId’ is the key of the user. There is a bit of redundancy, as in the user document the _id and userId are the same; I guess the latter is not required.
Anyhow, how can I find for a/every user, the vehicles which belong to it? The closest I’ve come through trial and error is a result which displays the owner of every vehicle, but I would like it the other way round, the user and the vehicles belonging to it. All the examples I’ve found use another document which ‘joins’ two or more documents, but I don’t need to do that?
Any point in the right direction appreciated - I really have no idea.
function (doc) {
if (doc.$doctype == "vehicle")
{
emit(doc.userId, {_id: doc.userId});
}
}
EDIT: Getting closer. I'm not sure exactly what I was expecting, but the result seems a bit 'messy'. Row[0] is the user document, row[n > 0] are the vehicle documents. I guess it's fine when a startkey/endkey is used, but without the results are a bit jumbled up.
function (doc) {
if (doc.$doctype == 'user') {
emit([doc._id, 0], doc);
} else if (doc.$doctype == 'vehicle') {
emit([doc.userId, 1, doc._id], doc);
}
}
A user is described as,
{
"_id": "user:10",
"firstname": "firstnamehere",
"secondname": "secondnamehere",
"userId": "user:10",
"$doctype": "user"
}
a vehicle is described as,
{
"_id": "vehicle:4002",
"name": "avehicle",
"userId": "user:10",
"$doctype": "vehicle"
}
You're getting in the right direction! You already got that right with the global IDs. Having the type of the document as part of the ID in some form is a very good idea, so that you don't get confused later (all documents are in the same "pot").
Here are some minor problems with your current solution (before getting to your actual question):
Don't emit the doc as the value in emit(key, value). You can always ask for the document that belongs to a view row by querying with include_docs=true. Having the doc as the view value increases the size of the view index a lot. When you don't need a specific value, use emit(key, null).
You also don't need the ID in the emit value. You'll get the ID of the document that belongs to a view row as part of the row anyway.
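Applied to the map function from the question, the two points above could look like the following sketch. The emit stub and the two sample documents exist only so the snippet runs outside CouchDB; inside a design document you would keep just the map function.

```javascript
// Stand-in for CouchDB's emit so the sketch runs outside the database.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Same keys as the map function in the question, but with null values.
function map(doc) {
  if (doc.$doctype === "user") {
    emit([doc._id, 0], null);
  } else if (doc.$doctype === "vehicle") {
    emit([doc.userId, 1, doc._id], null);
  }
}

// Sample documents shaped like the ones in the question.
map({ _id: "user:10", userId: "user:10", $doctype: "user" });
map({ _id: "vehicle:4002", name: "avehicle", userId: "user:10", $doctype: "vehicle" });
console.log(rows);
```

Query the view with include_docs=true whenever you need the full documents.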
View Collation
Now to your problem of aggregating the vehicles with their user. You got the basic pattern right. This pattern is called view collation; you can read more about it in the CouchDB docs (ignore that it is in the "Couchapp" section).
The trick with view collation is that you return two or more types of documents, but make sure that they are sorted in a way that allows for direct grouping. Thus it is important to understand how CouchDB sorts the view result. See the collation specification for more information on that one. An important key to understanding view collation is that rows with array keys are sorted by key elements. So when two rows have the same key[0], they sort by key[1]. If that's equal as well, key[2] is considered, and so on.
Your map function first groups users and vehicles by user ID (key[0]). It then uses the fact that 0 sorts before 1 in the second element of the key, so your view will contain the following:
user 1
vehicle of user 1
vehicle of user 1
vehicle of user 1
user 2
user 3
vehicle of user 3
user 4
etc.
As you can see, the vehicles of a user immediately follow their user. Thus you can group this result into aggregates without performing expensive sort or lookup operations.
Note that users are sorted according to their ID, and vehicles within users also according to their ID. This is because you use the IDs in the key array.
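Because the rows arrive in this order, a single pass on the client is enough to build the aggregates. The sketch below assumes the rows array of a view response, where each row carries id and key; groupRows is a hypothetical helper name and the sample rows are made up.

```javascript
// Walks rows that are already sorted by the view and attaches each
// vehicle row to the user row that precedes it.
function groupRows(rows) {
  const users = [];
  for (const row of rows) {
    const last = users[users.length - 1];
    if (row.key[1] === 0) {
      // A user row starts a new aggregate.
      users.push({ userId: row.key[0], vehicleIds: [] });
    } else if (last && last.userId === row.key[0]) {
      // A vehicle row belongs to the most recent user.
      last.vehicleIds.push(row.id);
    }
  }
  return users;
}

// Sample rows in view order.
const sampleRows = [
  { id: "user:1", key: ["user:1", 0] },
  { id: "vehicle:7", key: ["user:1", 1, "vehicle:7"] },
  { id: "vehicle:9", key: ["user:1", 1, "vehicle:9"] },
  { id: "user:2", key: ["user:2", 0] }
];

const grouped = groupRows(sampleRows);
console.log(JSON.stringify(grouped));
```

Vehicle rows whose user row is missing from the result (for example when a range query starts mid-user) are skipped in this sketch.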
Creating Queries
Now that view isn't worth much if you can't query according to your needs. A view as you have it supports the following queries:
Get all users with their vehicles
Get a range of users with their vehicles
Get a single user with its vehicles
Get a single user without vehicles (you could also use the _all_docs view for that though)
Example query for "all users between user 1 and user 3 (inclusive) with their vehicles"
We want to query for a range, so we use startkey and endkey in the query:
startkey=["user:1", 0]
endkey=["user:3", 1, {}]
Note the use of {} as a sentinel value, which is required so that the end key is larger than any row that has a key of ["user:3", 1, (anyConceivableVehicleId)].
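In an actual request the keys must be JSON and URL-encoded. A minimal sketch of building such a query string follows; the database name mydb, the design document app, and the view name users_vehicles are placeholders, not names from the question.

```javascript
// Builds the view URL for "all users from firstUserId to lastUserId
// (inclusive) with their vehicles". Path segments are hypothetical.
function buildRangeQuery(firstUserId, lastUserId) {
  const startkey = JSON.stringify([firstUserId, 0]);
  // {} sorts after every string, so it closes the range after the last vehicle.
  const endkey = JSON.stringify([lastUserId, 1, {}]);
  return "/mydb/_design/app/_view/users_vehicles" +
    "?startkey=" + encodeURIComponent(startkey) +
    "&endkey=" + encodeURIComponent(endkey);
}

const url = buildRangeQuery("user:1", "user:3");
console.log(url);
```

For a single user with its vehicles, pass the same ID for both arguments.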

couchdb - Map Reduce - How to Join different documents and group results within a Reduce Function

I am struggling to implement a map / reduce function that joins two documents and sums the result with reduce.
First document type is Categories. Each category has an ID and within the attributes I stored a detail category, a main category and a division ("Bereich").
{
"_id": "a124",
"_rev": "8-089da95f148b446bd3b33a3182de709f",
"detCat": "Life_Ausgehen",
"mainCat": "COL_LEBEN",
"mainBereich": "COL",
"type": "Cash",
"dtCAT": true
}
The second document type is a transaction. The attributes show all the details for each transaction, including the field "newCat" which is a reference to the category ID.
{
"_id": "7568a6de86e5e7c6de0535d025069084",
"_rev": "2-501cd4eaf5f4dc56e906ea9f7ac05865",
"Value": 133.23,
"Sender": "Comtech",
"Booking Date": "11.02.2013",
"Detail": "Oki Drucker",
"newCat": "a124",
"dtTRA": true
}
Now if I want to develop a map/reduce to get the result in the form:
e.g.: "Name of Main Category", "Sum of all values in transactions".
I figured out that I could reference another document with "_id" and ?include_docs=true, but in that case I cannot use a reduce function.
I looked in other postings here, but couldn't find a suitable example.
Would be great if somebody has an idea how to solve this issue.
I understand that multiple category documents may have the same mainCat value. The technique called view collation is suitable in some cases where a single join would be used in a relational model. In your case it will not help: although you use two document schemas, you really have a three-level structure: main-category <- category <- transaction. I think you should consider changing the DB design a bit.
Duplicating the data, by also storing the mainCat value in the transaction document, would help. I suggest using a meaningful ID for the transaction instead of a generated one. You could use, for example, "COL_LEBEN-7568a6de86e5e" (mainCat concatenated with some random value, where the - delimiter never occurs in mainCat). Then, with a simple parser in the map function, you emit ["COL_LEBEN", "7568a6de86e5e"] for transactions and ["COL_LEBEN"] for categories, and reduce to get the sum.
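A rough sketch of that idea, run locally: the emit stub and sumByMainCat stand in for CouchDB's emit and for a sum reduce queried with group_level=1. The dtTRA, dtCAT, Value and mainCat fields come from the question; the sample IDs and amounts, and the choice to emit 0 for category rows so they do not affect the sum, are assumptions.

```javascript
// Stand-in for CouchDB's emit so the sketch runs outside the database.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

function map(doc) {
  if (doc.dtTRA) {
    // Transaction IDs are assumed to look like "<mainCat>-<random part>".
    var cut = doc._id.indexOf("-");
    emit([doc._id.slice(0, cut), doc._id.slice(cut + 1)], doc.Value);
  } else if (doc.dtCAT) {
    emit([doc.mainCat], 0);
  }
}

// Local equivalent of a sum reduce grouped on the first key element.
function sumByMainCat(viewRows) {
  var totals = {};
  viewRows.forEach(function (row) {
    totals[row.key[0]] = (totals[row.key[0]] || 0) + row.value;
  });
  return totals;
}

map({ _id: "a124", mainCat: "COL_LEBEN", dtCAT: true });
map({ _id: "COL_LEBEN-7568a6de86e5e", Value: 100.5, dtTRA: true });
map({ _id: "COL_LEBEN-11aa22bb33cc4", Value: 20.25, dtTRA: true });

var totals = sumByMainCat(rows);
console.log(totals);
```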

How to efficiently store this document structure in Cassandra?

I want to migrate this complex document structure to cassandra:
foo = {
1: {
:some => :data,
},
2: {
:some => :data
},
...
99 :{
:some => :data
}
'seen' => {1 => 1347682901, 2 => 1347682801}
}
The problem:
It has to be retrievable (readable) as one row/record in under ~5 milliseconds.
So far, I am serializing the data, but that is not optimal as I always need to update the whole thing.
Another thing is that I would like to use Cassandra's TTL feature for the values in the 'seen' hash.
Any ideas on how the sub-structures (1..n) could work in cassandra, as they are totally dynamic but should be readable all with one query?
Create a column family and store the data as follows:
rowKey = foo
columnName Value
-----------------------------------
1 {:some => :data,..}
2 {:some => :data,..}
...
...
99 {:some => :data,..}
seen {1 => 1347682901, 2 => 1347682801}
The column names 1, 2, ... and "seen" are all dynamic.
If you are worried about updating just one of these columns, it works the same way as inserting a new column into a column family. See "Cassandra update column":
$column_family->insert('foo', array('42' => '{:some => :newdata,..}'));
I haven't had to use TTL yet, but it looks simple. See a pretty easy way to achieve this in "Expiring Columns in Cassandra 0.7+".
Update
Q1. Just for my understanding: Do you suggest creating 99 columns? Or is it possible to keep that dynamic?
A column family, unlike an RDBMS table, has a flexible structure. A row key can have any number of dynamically created columns. For example:
myColumnFamily{
"rowKey1": {
"attr1": "some_values",
"attr2": "other_value",
"seen" : 823648223
},
"rowKey2": {
"attr1": "some_values",
"attr3": "other_value1",
"attr5": "other_value2",
"attr7": "other_value3",
"attr9": "other_value4",
"seen" : 823648223
},
"rowKey3": {
"name" : "naishe",
"log" : "s3://bucket42.aws.com/naishe/logs",
"status" : "UNKNOWN",
"place" : "Varanasi"
}
}
This is an old article, worth reading: WTF is a SuperColumn? Here is a typical quote that will answer your query (emphasis mine):
One thing I want to point out is that there’s no schema enforced at this [ColumnFamily] level. The Rows do not have a predefined list of Columns that they contain. In our example above you see that the row with the key “ieure” has Columns with names “age” and “gender” whereas the row identified by the key “phatduckk” doesn’t. It’s 100% flexible: one Row may have 1,989 Columns whereas the other has 2. One Row may have a Column called “foo” whereas none of the rest do. This is the schemaless aspect of Cassandra.
. . . .
Q2. And you suggest serializing the sub-structure?
It's up to you. If you do not want to serialize, you should probably use a SuperColumn. My rule of thumb is this: if the value in a column represents a unit whose parts cannot be accessed independently, use a Column (that is, serialize the value). If the column has subparts that may need to be accessed directly, use a SuperColumn.

Couchdb: filter and group in a single view

I have a Couchdb database with documents of the form: { Name, Timestamp, Value }
I have a view that shows a summary grouped by name with the sum of the values. This is a straightforward reduce function.
Now I want to filter the view to only take into account documents where the timestamp occurred in a given range.
AFAIK this means I have to include the timestamp in the emitted key of the map function, e.g. emit([doc.Timestamp, doc.Name], doc)
But as soon as I do that, the reduce function no longer sees the rows grouped together to calculate the sum. If I put the name first I can group at level 1 only, but how do I filter at level 2?
Is there a way to do this?
I don't think this is possible with only one HTTP fetch and/or without additional logic in your own code.
If you emit([time, name]) you would be able to query startkey=[timeA]&endkey=[timeB]&group_level=2 to get items between timeA and timeB grouped where their timestamp and name were identical. You could then post-process this to add up whenever the names matched, but the initial result set might be larger than you want to handle.
An alternative would be to emit([name,time]). Then you could first query with group_level=1 to get a list of names [if your application doesn't already know what they'll be]. Then for each one of those you would query startkey=[nameN]&endkey=[nameN,{}]&group_level=2 to get the summary for each name.
(Note that in my query examples I've left the JSON start/end keys unencoded, so as to make them more human readable, but you'll need to apply your language's equivalent of JavaScript's encodeURIComponent on them in actual use.)
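The post-processing step for the emit([time, name]) variant could look like the sketch below. It assumes the rows of a group_level=2 response, each with key [time, name] and the summed value; sumByName is a hypothetical helper and the sample rows are made up.

```javascript
// Adds up the per-(time, name) sums so that there is one total per name.
function sumByName(rows) {
  const totals = {};
  for (const row of rows) {
    const name = row.key[1];
    totals[name] = (totals[name] || 0) + row.value;
  }
  return totals;
}

// Sample rows as they might come back for a time range query.
const rangeRows = [
  { key: [1000, "alpha"], value: 2 },
  { key: [1001, "alpha"], value: 3 },
  { key: [1001, "beta"], value: 1 }
];

const totalsByName = sumByName(rangeRows);
console.log(totalsByName);
```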
You cannot build a view on top of another view. You need to write another map-reduce view that does the filtering in the map step and the grouping in the reduce step. Something like:
map:
function(doc) {
  // start and end are placeholders for the hard-coded bounds of the range
  if (doc.Timestamp > start && doc.Timestamp < end) {
    emit(doc.Name, doc.Value);
  }
}
reduce:
function(key, values, rereduce) {
return sum(values);
}
I suppose you cannot store this view and have to run it as an ad-hoc query from your application.

Querying documents containing two tags with CouchDB?

Consider the following documents in a CouchDB:
{
"name":"Foo1",
"tags":["tag1", "tag2", "tag3"],
"otherTags":["otherTag1", "otherTag2"]
}
{
"name":"Foo2",
"tags":["tag2", "tag3", "tag4"],
"otherTags":["otherTag2", "otherTag3"]
}
{
"name":"Foo3",
"tags":["tag3", "tag4", "tag5"],
"otherTags":["otherTag3", "otherTag4"]
}
I'd like to query all documents that contain ALL (not any!) tags given as the key.
For example, if I request using '["tag2", "tag3"]' I'd like to retrieve Foo1 and Foo2.
I'm currently doing this by querying by tag, first for "tag2", then for "tag3", and computing the intersection manually afterwards.
This seems to be awfully inefficient and I assume that there must be a better way.
My second question - but they are quite related, I think - would be:
How would I query for all documents that contain "tag2" AND "tag3" AND "otherTag3"?
I hope a question like this hasn't been asked/answered before. I searched for it and didn't find one.
Do you have a maximum number of:
Tags per document, and
Tags allowed in the query?
If so, you have an upper bound on the number of tag combinations to be indexed. For example, with a maximum of 5 tags per document and 5 tags allowed in the AND query, you could simply output every 1, 2, 3, 4, and 5-tag combination into your index, for a maximum of 1 (five-tag combo) + 5 (four-tag combos) + 10 (three-tag combos) + 10 (two-tag combos) + 5 (one-tag combos) = 31 rows in the view for that document.
That may be acceptable to you, considering that it's quite a powerful query. The disk usage may be acceptable, especially if you simply emit(tags, {_id: doc._id}) to minimize data in the view; you can use ?include_docs=true to get the full document later. The final thing to remember is to always emit the key array sorted, and always query it the same way, because you are emitting only tag combinations, not permutations.
That can get you so far, however it does not scale up indefinitely. For full-blown arbitrary AND queries, you will indeed be required to split into multiple queries, or else look into CouchDB-Lucene.
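As a sketch of the combination index described above, the helper below generates every non-empty combination of a document's tags in sorted order. tagCombinations is a hypothetical name; inside a map function you would call emit(combo, null) for each result, and depending on your CouchDB version's JavaScript engine you may need to rewrite the ES6 syntax.

```javascript
// Returns all non-empty combinations of the given tags. Each combination
// keeps the sorted order, so a query key must be sorted the same way.
function tagCombinations(tags) {
  const sorted = [...new Set(tags)].sort();
  const combos = [];
  const total = 1 << sorted.length;
  // Each bitmask from 1 to 2^n - 1 selects one subset of the sorted tags.
  for (let mask = 1; mask < total; mask++) {
    combos.push(sorted.filter((_, i) => mask & (1 << i)));
  }
  return combos;
}

// Foo1 from the question: 3 tags give 2^3 - 1 = 7 keys.
const foo1Keys = tagCombinations(["tag1", "tag2", "tag3"]);
// A document with 5 tags gives the 31 keys counted above.
const fiveTagKeys = tagCombinations(["a", "b", "c", "d", "e"]);
console.log(foo1Keys.length, fiveTagKeys.length);
```

A query for documents containing both tag2 and tag3 would then use key=["tag2","tag3"]. To cover the second question, one option is to build the combinations over the concatenation of tags and otherTags, at the cost of more rows per document.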

Resources