Cassandra data model for web logging

I've been playing around with Cassandra and am trying to work out the best data model for storing things like views or hits for unique page IDs. Would it be best to have a single column family per page ID, or one super column (logs) with a column per page ID? Each page has a unique ID, and I would like to store the date and some other metrics for each view.
I am just not sure which solution scales better: lots of column families, or one giant super column?
page-92838 { date:sept 2, browser:IE }
page-22939 { date:sept 2, browser:IE5 }
OR
logs {
page-92838 {
date:sept 2,
browser:IE
}
page-22939 {
date:sept 2,
browser:IE5
}
}
Secondly, how would I handle lots of different date: entries for page-92838?

You don't need a column-family per pageid.
One solution is to have a row for each page, keyed on the pageid.
You could then have a column for each page-view or hit, keyed and sorted on time-UUID (assuming having the views in time-sorted order would be useful) or other unique, always-increasing counter. Note that all Cassandra columns are time-stamped anyway, so you would have a precise timestamp 'for free' regardless of what other time- or date- stamps you use. Using a precise time-UUID as the key also solves the problem of storing many hits on the same date.
The value of each column could then be a textual value or JSON document containing any other metadata you want to store (such as browser).
page-12345 -> {timeuuid1:metadata1}{timeuuid2:metadata2}{timeuuid3:metadata3}...
page-12346 -> ...
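The row-per-page layout above can be sketched as a small in-memory model. This is plain JavaScript, not a Cassandra client: the `rows` map, `makeColumnName`, `recordHit`, and `hitsBetween` are hypothetical names, and the zero-padded timestamp plus sequence number is only a stand-in for a real time-UUID, chosen so that string order matches time order.

```javascript
// In-memory model of "one wide row per page"; NOT a Cassandra client.
// pageId -> array of { name, value } columns, kept sorted by column name.
const rows = new Map();
let sequence = 0;

// Stand-in for a time-UUID (assumption): zero-padded millisecond timestamp
// plus a sequence number, so two hits in the same millisecond stay unique.
function makeColumnName(timestampMs) {
  sequence += 1;
  return String(timestampMs).padStart(15, "0") + "-" + String(sequence).padStart(8, "0");
}

// Append one hit as a new column; the value is a JSON document with metadata.
function recordHit(pageId, timestampMs, metadata) {
  const columns = rows.get(pageId) || [];
  columns.push({ name: makeColumnName(timestampMs), value: JSON.stringify(metadata) });
  columns.sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
  rows.set(pageId, columns);
}

// Column slice: all hits for a page with fromMs <= timestamp <= toMs.
function hitsBetween(pageId, fromMs, toMs) {
  const lo = String(fromMs).padStart(15, "0");
  const hi = String(toMs + 1).padStart(15, "0");
  return (rows.get(pageId) || [])
    .filter((c) => c.name >= lo && c.name < hi)
    .map((c) => JSON.parse(c.value));
}
```

Because the column names sort by time, many hits on the same date are simply adjacent columns in the same row, and a date range becomes a slice of that row.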

With Cassandra, it is best to start with the queries you need to run, and model your schema to support those queries.
Assuming you want to query hits on a page, and hits by browser, you can have a counter column for each, like this:
stats { #cf
page-id { #key
hits : # counter column for hits
browser-ie : #counts of views with ie
browser-firefox : ....
}
}
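As a rough illustration of the counter layout above, here is an in-memory sketch in plain JavaScript. It does not talk to Cassandra; `stats`, `increment`, `recordView`, and `readCounter` are hypothetical names, and the `browser-` column prefix simply follows the naming used in the schema above.

```javascript
// In-memory model of a "stats" column family with counter columns.
// NOT a Cassandra client: pageId (row key) -> Map(columnName -> count).
const stats = new Map();

// Add `by` to one counter column of one row, creating either if missing.
function increment(pageId, columnName, by = 1) {
  const row = stats.get(pageId) || new Map();
  row.set(columnName, (row.get(columnName) || 0) + by);
  stats.set(pageId, row);
}

// One page view bumps the total hit counter and the per-browser counter.
function recordView(pageId, browser) {
  increment(pageId, "hits");
  increment(pageId, "browser-" + browser.toLowerCase());
}

// Read a counter; a column that was never incremented reads as 0.
function readCounter(pageId, columnName) {
  const row = stats.get(pageId);
  return row && row.has(columnName) ? row.get(columnName) : 0;
}
```

The trade-off of this design is that each query you care about (hits per page, hits per browser) gets its own pre-aggregated counter, so reads are a single row lookup, but individual views are not kept.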
If you need to do time-based queries, look at how Twitter's Rainbird denormalizes as it writes to Cassandra.

Related

Couchdb - date range + multiple query parameters

I want to be able to query CouchDB between dates. I know that this can be done with startkey and endkey (it works fine), but is it possible to run a query like this, for example:
SELECT *
FROM TABLENAME
WHERE
DateTime >= '2011-04-12T00:00:00.000' AND
DateTime <= '2012-05-25T03:53:04.000'
AND
Status = 'Completed'
AND
Job_category = 'Installation'
Generally-speaking, establishing indexes on multiple fields grows in complexity as the number of fields increases.
My main question is: do Status and Job_category need to be queried dynamically too? If not, your view is simple:
function (doc) {
if (doc.Status === 'Completed' && doc.Job_category === 'Installation') {
emit(doc.DateTime); // this line may change depending on how you break up and emit the datetimes
}
}
Views are fairly cheap (depending on the size of your database), so don't be afraid to establish several that cover different cases. I would expect something like Status to have a predefined list of available options, as opposed to Job_category, which seems like it could be more related to user input.
If you need those fields to be dynamic, you can just add them to the index as well:
function (doc) {
emit([ doc.Status, doc.Job_category, doc.DateTime ]);
}
Then you can use an array as your start_key. For example:
start_key=["Completed", "Installation", ...]
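To make the array-key query concrete, here is a minimal sketch of building the query string for the "dynamic" view above. The helper name `buildRangeQuery` is hypothetical, and the sketch assumes DateTime values are ISO-8601 strings in one fixed format, so that they sort chronologically as strings.

```javascript
// Builds the query string for a view whose map function emits
// [doc.Status, doc.Job_category, doc.DateTime] as the key.
// Assumption: both dates use the same fixed-width ISO-8601 format.
function buildRangeQuery(status, category, fromDate, toDate) {
  // Keys are JSON arrays; Status and Job_category are pinned to one value
  // each, so only the third element (the date) actually spans a range.
  const startkey = JSON.stringify([status, category, fromDate]);
  const endkey = JSON.stringify([status, category, toDate]);
  // JSON contains characters such as [ " and : that must be URL-encoded.
  return "startkey=" + encodeURIComponent(startkey) +
         "&endkey=" + encodeURIComponent(endkey);
}
```

The order of the fields in the emitted array matters: the fields matched by equality come first and the range field comes last, because rows are sorted element by element.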
tl;dr: use "static" views where you have a predetermined list of values for a given field. While it is possible to query "dynamic" views with multiple fields, the complexity grows very quickly.

Cloudant 1 to many function

I’ve just started to use Cloudant and I just can’t get my head around the map functions. I’ve been fiddling with the data below but it isn’t working out as I expected.
The relationship is: a user can have many vehicles, and a vehicle belongs to 1 user. The vehicle ‘userId’ is the key of the user. There is a bit of redundancy, as in the user document the _id and userId are the same; I guess the latter is not required.
Anyhow, how can I find for a/every user, the vehicles which belong to it? The closest I’ve come through trial and error is a result which displays the owner of every vehicle, but I would like it the other way round, the user and the vehicles belonging to it. All the examples I’ve found use another document which ‘joins’ two or more documents, but I don’t need to do that?
Any point in the right direction appreciated - I really have no idea.
function (doc) {
if (doc.$doctype == "vehicle")
{
emit(doc.userId, {_id: doc.userId});
}
}
EDIT: Getting closer. I'm not sure exactly what I was expecting, but the result seems a bit 'messy'. Row[0] is the user document, row[n > 0] are the vehicle documents. I guess it's fine when a startkey/endkey is used, but without the results are a bit jumbled up.
function (doc) {
if (doc.$doctype == 'user') {
emit([doc._id, 0], doc);
} else if (doc.$doctype == 'vehicle') {
emit([doc.userId, 1, doc._id], doc);
}
}
A user is described as,
{
"_id": "user:10",
"firstname": "firstnamehere",
"secondname": "secondnamehere",
"userId": "user:10",
"$doctype": "user"
}
a vehicle is described as,
{
"_id": "vehicle:4002",
"name": "avehicle",
"userId": "user:10",
"$doctype": "vehicle"
}
You're heading in the right direction! You already got the global IDs right. Having the type of the document as part of the ID in some form is a very good idea, so that you don't get confused later (all documents are in the same "pot").
Here are some minor problems with your current solution (before getting to your actual question):
Don't emit the doc as value in emit(key, value). You can always ask for the document that belongs to a view row by querying with include_docs=true. Having the doc as the view value increases the size of the view index a lot. When you don't need a specific value, use emit(key, null).
You also don't need the ID in the emit value. You'll get the ID of the document that belongs to a view row as part of the row anyway.
View Collation
Now to your problem of aggregating the vehicles with their user. You got the basic pattern right. This pattern is called view collation, you can read more about it in the CouchDB docs (ignore that it is in the "Couchapp" section).
The trick with view collation is that you return two or more types of documents, but make sure that they are sorted in a way that allows for direct grouping. Thus it is important to understand how CouchDB sorts the view result. See the collation specification for more information on that one. An important key to understanding view collation is that rows with array keys are sorted by key elements. So when two rows have the same key[0], they sort by key[1]. If that's equal as well, key[2] is considered, and so on.
Your map function first groups users and vehicles by user ID (key[0]). It then uses the fact that 0 sorts before 1 in the second element of the key, so your view will contain the following:
user 1
vehicle of user 1
vehicle of user 1
vehicle of user 1
user 2
user 3
vehicle of user 3
user 4
etc.
As you can see, the vehicles of a user immediately follow their user. Thus you can group this result into aggregates without performing expensive sort or lookup operations.
Note that users are sorted according to their ID, and vehicles within users also according to their ID. This is because you use the IDs in the key array.
Creating Queries
Now that view isn't worth much if you can't query according to your needs. A view as you have it supports the following queries:
Get all users with their vehicles
Get a range of users with their vehicles
Get a single user with its vehicles
Get a single user without vehicles (you could also use the _all_docs view for that though)
Example query for "all users between user 1 and user 3 (inclusive) with their vehicles"
We want to query for a range, so we use startkey and endkey in the query:
startkey=["user:1", 0]
endkey=["user:3", 1, {}]
Note the use of {} as sentinel value, which is required so that the end key is larger than any row that has a key of ["user:3", 1, (anyConceivableVehicleId)]
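On the client side, the collated rows can be folded into one aggregate per user in a single pass. The following is a minimal sketch, assuming the view was queried with include_docs=true (so each row carries a `doc`) and that keys have the shapes `[userId, 0]` and `[userId, 1, vehicleId]` from the map function above; `groupUsersWithVehicles` is a hypothetical helper name.

```javascript
// Folds view rows (already in collated order) into user aggregates.
// Assumes row.key is [userId, 0] for users and [userId, 1, vehicleId]
// for vehicles, and that row.doc is present (include_docs=true).
function groupUsersWithVehicles(rows) {
  const users = [];
  let current = null;
  for (const row of rows) {
    const userId = row.key[0];
    const kind = row.key[1];
    if (kind === 0) {
      // A user row opens a new group; its vehicles follow immediately.
      current = { userId: userId, user: row.doc, vehicles: [] };
      users.push(current);
    } else if (kind === 1 && current !== null && current.userId === userId) {
      current.vehicles.push(row.doc);
    }
    // Vehicle rows without a preceding user row are skipped in this sketch.
  }
  return users;
}
```

No sorting or lookup is needed here, because the view's sort order already places each user's vehicles directly after the user row.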

How to efficiently store this document structure in Cassandra?

I want to migrate this complex document structure to cassandra:
foo = {
1: {
:some => :data,
},
2: {
:some => :data
},
...
99 :{
:some => :data
}
'seen' => {1 => 1347682901, 2 => 1347682801}
}
The problem:
It has to be retrievable (readable) as one row/record in under ~5 milliseconds.
So far, I am serializing the data, but that is not optimal as I always need to update the whole thing.
Another thing is that I would like to use Cassandra's TTL feature for the values in the 'seen' hash.
Any ideas on how the sub-structures (1..n) could work in Cassandra, as they are totally dynamic but should all be readable with one query?
Create a column family and store the data as follows:
rowKey = foo
columnName Value
-----------------------------------
1 {:some => :data,..}
2 {:some => :data,..}
...
...
99 {:some => :data,..}
seen {1 => 1347682901, 2 => 1347682801}
1,2,... "seen" are all dynamic.
If you are worried about updating just one of these columns, it works the same way as inserting a new column into a column family. See here: Cassandra update column
$column_family->insert('foo', array('42' => '{:some => :newdata,..}'));
I haven't had to use TTL yet, but it looks simple. See a pretty easy way to achieve this here: Expiring Columns in Cassandra 0.7+
Update
Q1. Just for my understanding: Do you suggest creating 99 columns? Or is it possible to keep that dynamic?
A column family, unlike an RDBMS table, has a flexible structure. You can have a practically unlimited number of columns per row key, created dynamically. For example:
myColumnFamily{
"rowKey1": {
"attr1": "some_values",
"attr2": "other_value",
"seen" : 823648223
},
"rowKey2": {
"attr1": "some_values",
"attr3": "other_value1",
"attr5": "other_value2",
"attr7": "other_value3",
"attr9": "other_value4",
"seen" : 823648223
},
"rowKey3": {
"name" : "naishe",
"log" : "s3://bucket42.aws.com/naishe/logs",
"status" : "UNKNOWN",
"place" : "Varanasi"
}
}
This is an old article, worth reading: WTF is a SuperColumn? Here is a typical quote that will answer your query (emphasis mine):
One thing I want to point out is that there’s no schema enforced at this [ColumnFamily] level. The Rows do not have a predefined list of Columns that they contain. In our example above you see that the row with the key “ieure” has Columns with names “age” and “gender” whereas the row identified by the key “phatduckk” doesn’t. It’s 100% flexible: one Row may have 1,989 Columns whereas the other has 2. One Row may have a Column called “foo” whereas none of the rest do. This is the schemaless aspect of Cassandra.
. . . .
Q2. And you suggest serializing the sub-structure?
It's up to you. If you do not want to serialize, you should probably use a SuperColumn. My rule of thumb is this: if the value in a column represents a unit whose parts cannot be accessed independently, use a Column (that means serializing the value). If the column has subparts that may need to be accessed directly, use a SuperColumn.

Couchdb: filter and group in a single view

I have a Couchdb database with documents of the form: { Name, Timestamp, Value }
I have a view that shows a summary grouped by name with the sum of the values. This is a straightforward reduce function.
Now I want to filter the view to only take into account documents where the timestamp occurred in a given range.
AFAIK this means I have to include the timestamp in the emitted key of the map function, e.g. emit([doc.Timestamp, doc.Name], doc)
But as soon as I do that, the reduce function no longer sees the rows grouped together to calculate the sum. If I put the name first I can group at level 1 only, but how do I filter at level 2?
Is there a way to do this?
I don't think this is possible with only one HTTP fetch and/or without additional logic in your own code.
If you emit([time, name]) you would be able to query startkey=[timeA]&endkey=[timeB]&group_level=2 to get items between timeA and timeB grouped where their timestamp and name were identical. You could then post-process this to add up whenever the names matched, but the initial result set might be larger than you want to handle.
An alternative would be to emit([name,time]). Then you could first query with group_level=1 to get a list of names [if your application doesn't already know what they'll be]. Then for each one of those you would query startkey=[nameN]&endkey=[nameN,{}]&group_level=2 to get the summary for each name.
(Note that in my query examples I've left the JSON start/end keys unencoded, so as to make them more human readable, but you'll need to apply your language's equivalent of JavaScript's encodeURIComponent on them in actual use.)
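The post-processing step for the first approach (keys of the form [time, name], queried with group_level=2) could look roughly like this. It is a sketch under assumptions: `sumByName` is a hypothetical helper, and each row is assumed to have the shape `{ key: [time, name], value: number }`, where value is the reduced sum for that key.

```javascript
// Adds up reduced view rows per name on the client.
// Assumes each row looks like { key: [time, name], value: number }.
function sumByName(rows) {
  const totals = new Map(); // name -> running sum
  for (const row of rows) {
    const name = row.key[1];
    totals.set(name, (totals.get(name) || 0) + row.value);
  }
  return totals;
}
```

This keeps everything to one HTTP fetch, at the cost of transferring one row per distinct [time, name] pair in the range.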
You cannot build a view on top of a view. You need to write another map-reduce view that does the filtering and then the grouping at the end. Something like:
map:
function(doc) {
if (doc.timestamp > start && doc.timestamp < end) { // start and end must be literal values written into the function
emit(doc.name, doc.value);
}
}
reduce:
function(key, values, rereduce) {
return sum(values);
}
I suppose you cannot store this view, and have to run it as an ad-hoc query from your application.

Querying documents containing two tags with CouchDB?

Consider the following documents in a CouchDB:
{
"name":"Foo1",
"tags":["tag1", "tag2", "tag3"],
"otherTags":["otherTag1", "otherTag2"]
}
{
"name":"Foo2",
"tags":["tag2", "tag3", "tag4"],
"otherTags":["otherTag2", "otherTag3"]
}
{
"name":"Foo3",
"tags":["tag3", "tag4", "tag5"],
"otherTags":["otherTag3", "otherTag4"]
}
I'd like to query all documents that contain ALL (not any!) tags given as the key.
For example, if I request using '["tag2", "tag3"]' I'd like to retrieve Foo1 and Foo2.
I'm currently doing this by querying by tag, first for "tag2", then for "tag3", and creating the intersection manually afterwards.
This seems to be awfully inefficient and I assume that there must be a better way.
My second question - but they are quite related, I think - would be:
How would I query for all documents that contain "tag2" AND "tag3" AND "otherTag3"?
I hope a question like this hasn't been asked/answered before. I searched for it and didn't find one.
Do you have a maximum number of:
Tags per document, and
Tags allowed in the query?
If so, you have an upper bound on the number of tag combinations to be indexed. For example, with a maximum of 5 tags per document, and 5 tags allowed in the AND query, you could simply output every 1-, 2-, 3-, 4-, and 5-tag combination into your index, for a maximum of 1 (five-tag combo) + 5 (four-tag combos) + 10 (three-tag combos) + 10 (two-tag combos) + 5 (one-tag combos) = 31 rows in the view for that document.
That may be acceptable to you, considering that it's quite a powerful query. The disk usage may be acceptable, especially if you simply emit(tags, {_id: doc._id}) to minimize data in the view; you can use ?include_docs=true to get the full document later. The final thing to remember is to always emit the key array sorted, and always query it the same way, because you are emitting only tag combinations, not permutations.
That can get you quite far; however, it does not scale up indefinitely. For full-blown arbitrary AND queries, you will indeed be required to split into multiple queries, or else look into CouchDB-Lucene.
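The combination index described above can be sketched as follows. `tagCombinations` is a hypothetical helper written in plain JavaScript; inside an actual CouchDB map function it may need adapting to whatever syntax the view server's JavaScript engine supports.

```javascript
// Returns every non-empty combination of the given tags, each one sorted,
// so that a query key sorted the same way matches exactly one combination.
function tagCombinations(tags) {
  // De-duplicate and sort once; every subset then inherits the sort order.
  const sorted = Array.from(new Set(tags)).sort();
  const result = [];
  // Each bitmask from 1 to 2^n - 1 selects one non-empty subset.
  for (let mask = 1; mask < (1 << sorted.length); mask++) {
    const combo = [];
    for (let i = 0; i < sorted.length; i++) {
      if (mask & (1 << i)) {
        combo.push(sorted[i]);
      }
    }
    result.push(combo);
  }
  return result;
}
```

A map function would then emit each combination as a key, and a query for documents with both "tag2" and "tag3" would use key=["tag2","tag3"], sorted the same way. For the second question, one option under the same assumptions is to concatenate tags and otherTags before generating combinations, which raises the row count per document accordingly.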

Resources