CouchDB has a special _all_docs view, which returns documents sorted by ID. But as IDs are random by default, that ordering is meaningless.
I always need to sort by 'date added'. Now I have two options:
Generate my own IDs and make sure they start with a timestamp
Use standard GUIDs, but add a timestamp to the JSON and sort on that
Now the second solution is less hackish, but I suspect the first solution to be much more efficient and faster, because all queries will be done on the real row id, which is indexed.
Is it true that both solutions differ in performance? And if it's true, which one is likely to be faster or preferred?
Is it true that both solutions differ in performance?
Your two examples describe the primary-index and secondary-index approaches in CouchDB.
_all_docs is the only primary index and is always up to date. Secondary indexes (views), as in your second solution, are updated when they are requested.
That's the reason why, from the requester's point of view, _all_docs might seem "faster". In reality there is no difference when the requested index is already up to date. Two workarounds for potentially outdated views (secondary indexes) are the stale query parameter (stale=ok returns the index as it is, and stale=update_after additionally triggers an update after the response has been sent) and so-called "view heaters" (send a simple HTTP GET to the view to trigger the update process).
And if it's true, which one is [...] preferred?
Secondary indexes give you significantly more ways to build a useful index and response payload.
If you want to use the primary index, you have to "design" your IDs as you described. That is a big up-front decision that constrains what else you can do with the documents and their IDs.
My recommendation is to use secondary indexes (views). Only if you need data to be available in real time, or in high-concurrency scenarios, should you consider the primary index as a candidate for serving your requests.
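A minimal sketch of the secondary-index option, assuming each document carries a hypothetical created_at field holding an ISO 8601 timestamp. The emit stub and the sort below only imitate what CouchDB does when it builds and orders a view; they are stand-ins for illustration, not CouchDB's API.

```javascript
// Map function as it might appear in a design document.
// Assumption: documents have a `created_at` ISO 8601 string field.
function byDateAdded(doc) {
  if (doc.created_at) {
    emit(doc.created_at, null);
  }
}

// Local stand-in for CouchDB's view engine, for illustration only.
const rows = [];
let currentId = null;
function emit(key, value) {
  rows.push({ id: currentId, key: key, value: value });
}

function buildView(docs) {
  rows.length = 0;
  for (const doc of docs) {
    currentId = doc._id;
    byDateAdded(doc);
  }
  // View rows come back ordered by key; ISO timestamps in the same
  // format and time zone sort chronologically as plain strings.
  return rows.slice().sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

const sorted = buildView([
  { _id: 'b7f1', created_at: '2013-05-02T10:00:00Z' },
  { _id: '02ac', created_at: '2013-05-01T09:30:00Z' },
  { _id: '9d3e' } // no timestamp, so it is not indexed
]);
console.log(sorted.map(r => r.id));
```

The document IDs stay random; only the view key decides the order.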
Related
Given a scenario where you have a User table, with id as PRIMARY KEY.
You have a column called email, and a column called name.
You want to UPDATE User.name based on User.email
I realized that the UPDATE command requires you to pass in a PRIMARY KEY. Does this mean I can't use a pure CQL migration, and would need to first query for the User.id primary key before I can UPDATE?
In this case, I DO know the PRIMARY KEY because the UUIDs are the same for dev and prod, but it feels dirty.
Yes, you're correct: you need to know the primary key of a record to update or delete it. There are several options here, depending on your data model:
Perform a full scan of the table using an efficient token range scan (see this answer for more details);
If this is required very often, you can create a materialized view with User.email as the partition key, and fetch all message IDs that you can update (but you'll need to do this from your application, as there is no nested query support in CQL). Be aware that materialized views are an "experimental" feature in Cassandra and may not work all the time (they are more stable in DataStax Enterprise). Also, if you have some users with hundreds of thousands of emails, this may create big partitions.
Do the same as in the second item, but in your own code, by maintaining an additional table
I think Alex's answer covers your question -- "how can I find a value in a PK column working backwards from a non-PK column's value?".
However, I think it's worth noting that asking this question indicates you should reconsider your data model. A rule of thumb in C* data model design is that you begin by considering the queries you need, and you've missed the UPDATE query use case. You can probably make things work without changing your model for now, but if you find you need to make other queries you're unprepared for, you'll run into operational issues with lots of indexes and/or MVs.
More generally, search around for articles and other resources about Cassandra data modeling. It sounds like you're basically using C* for a relational use case so you'll want to look into that.
TL;DR: which of the three options below is the most efficient for paginating with Redis?
I'm implementing a website with multiple user-generated posts, which are saved in a relational DB, and then copied to Redis in form of Hashes with keys like site:{site_id}:post:{post_id}.
I want to perform simple pagination queries against Redis, in order to implement lazy-load pagination (ie. user scrolls down, we send an Ajax request to the server asking for the next bunch of posts) in a Pinterest-style interface.
Then I created a Set to keep track of published post IDs, with keys like site:{site_id}:posts. I've chosen Sets because I don't want duplicate IDs in the collection, and I can add them quickly with a simple SADD (no need to check whether the ID exists) on every DB update.
Well, as Sets aren't ordered, I'm weighing the pros and cons of the options I have to paginate:
1) Using SSCAN command to paginate my already-implemented sets
In this case, I could persist the returned Scan cursor in the user's session, then send it back to the server on the next request (it doesn't seem reliable with multiple users accessing and updating the database: at some point the cursor would be invalid and return weird results, unless there is some caveat that I'm missing).
2) Refactor my sets to use Lists or Sorted Sets instead
Then I could paginate using LRANGE or ZRANGE. A List seems to be the most performant and natural option for my use case. It's perfect for pagination and ordering by date, but I simply can't check for the existence of a single item without looping over the whole list. Sorted Sets seem to combine the advantages of both Sets and Lists, but consume more server resources.
3) Keep using regular sets and store the page number as part of the key
It would be something like site:{site_id}:{page_number}:posts. It was the recommended way before Scan commands were implemented.
So, the question is: which one is the most efficient / simplest approach? Is there any other recommended option not listed here?
"Best" is best served subjective :)
I recommend you go with the 2nd approach, but definitely use Sorted Sets over Lists. Not only do they make sense for this type of job (see ZRANGE), they're also more efficient in terms of complexity compared to LRANGE-ing a List.
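As a rough sketch of the Sorted Set approach, assuming the publish time in milliseconds is used as the score (an assumption, the question does not say which score to use). The helpers below only build the argument lists for ZADD and ZREVRANGE; they do not talk to Redis, and their names are made up for illustration.

```javascript
// Builds Redis command argument lists; no Redis client is involved.
// Assumption: score = publish time in milliseconds, member = post id.
function addPostCommand(siteId, postId, publishedAtMs) {
  return ['ZADD', `site:${siteId}:posts`, String(publishedAtMs), String(postId)];
}

// Page numbers start at 0. ZREVRANGE takes inclusive start/stop indexes,
// so a page of `pageSize` items spans [start, start + pageSize - 1].
function pageCommand(siteId, page, pageSize) {
  const start = page * pageSize;
  const stop = start + pageSize - 1;
  return ['ZREVRANGE', `site:${siteId}:posts`, String(start), String(stop)];
}

console.log(addPostCommand(42, 1001, 1367488800000));
console.log(pageCommand(42, 0, 20)); // first page, newest first
console.log(pageCommand(42, 1, 20)); // second page
```

Adding a member that already exists only updates its score, so the no-duplicates property of the Set is kept, and ZSCORE can serve as an existence check for a single post ID.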
I'm using CouchDB with node.js. Right now there is one node involved, and even in the remote future there are no plans to change that. While I can remove most of the cases where a short, auto-increment-like ID (it can be sparse, but not random-looking) is required, there remains one place where users actually need to enter the ID of a product. I'd like to keep this ID as short as possible and in a more human-readable format than something like '4ab234acde242349b', as it sometimes has to be typed by hand.
However, in the database it can be stored with whatever ID pleases CouchDB (using the default auto-generated UUID), but it should be possible to give it a number that can be used to identify it as well. What I have thought about is creating a document that consists of an array with all the UUIDs from CouchDB. When I create a new product in node, I would run an update handler that appends the new UUID to the end of said document. To obtain a product's ID I'd then query the array, and client-side, using indexOf, I could get the index as a short ID.
I don't know if this is feasible. From the performance point of view I can say the following: there are more queries that go from numerical ID -> UUID than from UUID -> numerical ID. There will be at most 7000 new entries a year in the database. Also, there is no use case yet where a product can be deleted, but I'd rather not rely on that.
Are there any other applicable ways to generate a shorter and more human-readable ID that can be associated with my document?
/EDIT
From a technical point of view, it seems to be working. I can do both conversions (number <-> UUID) and it seems to go well. I don't know if this works well with replication and so on, but as there is said array I guess it should, right?
You have two choices here:
Set your human-readable ID as the _id field. You can simply set it in the create-document call to the DB, and CouchDB will accept it. This can be the more lightweight solution, but it comes with some limitations:
It has to be unique. You should also be careful about clients that try to create documents but overwrite existing ones instead.
It should only contain alphanumeric characters or a few special characters. In my experience, other character types are asking for trouble.
It should not be longer than a sensible string length limit (CouchDB doesn't define one, but you should). Long IDs will increase the size of your views (indexes) considerably, and may make them slower.
If none of these is a problem for you, go with this solution.
As you said yourself, let the _id be a UUID and store the human-readable ID in another field. To reach the document by its human-readable ID, create a view that emits the human-readable ID as the key, and then either emit the document as the value or fetch the document with the include_docs=true option. Whenever the view is queried, CouchDB updates it incrementally and returns the list. This is really the same as maintaining a document with an array/object of IDs inside it, except that with a CouchDB view you get better performance.
This might also be slightly slower for querying and inserting. If the IDs are inserted sequentially, it's fine; if not, CouchDB will take slightly more time to insert each one at the right place. This does not work well with a huge volume of inserts hitting the DB.
Querying shouldn't take more than 10% longer than with the first option, and I think 10% is really a big number; it will most probably be less than 5%. I remember that in my CouchDB application I switched from reading by _id to reading from a view by key, and the slowdown was so small that, from the user's end, it wasn't noticeable even when making 100 queries at the same time.
This is how people query documents by fields other than the ID, for example fetching a user document by email when the user logs in.
If you don't know how CouchDB views work, you should read the views chapter of the book CouchDB: The Definitive Guide.
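A sketch of what the second choice could look like. The design document name (products), the view name (by_short_id) and the field name (short_id) are assumptions made up for this example; only the _view path layout and the include_docs parameter come from CouchDB itself.

```javascript
// Design document holding the view. The names `products`, `by_short_id`
// and `short_id` are illustrative assumptions.
const designDoc = {
  _id: '_design/products',
  language: 'javascript',
  views: {
    by_short_id: {
      map: "function (doc) { if (doc.short_id !== undefined) { emit(doc.short_id, null); } }"
    }
  }
};

// View keys are JSON values, so the key is JSON-encoded first and then
// URL-encoded before it goes into the query string.
function lookupPath(db, shortId) {
  const key = encodeURIComponent(JSON.stringify(shortId));
  return `/${db}/_design/products/_view/by_short_id?key=${key}&include_docs=true`;
}

console.log(lookupPath('mydb', 1042));
console.log(lookupPath('mydb', 'P-1042'));
```

Emitting null as the value and asking for include_docs=true keeps the index small, at the cost of one extra document lookup per row.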
Also make sure you stay away from documents with huge arrays inside them. I think CouchDB has a limit of 4 GB per document. I remember having many such documents, and querying took really long because the view had to iterate over each array item. In the end I created one document per array item instead. It was way faster.
I want to generate IDs for use with CouchDB. I'd like the IDs to be lexicographically ascending by time so that I can sort on ID without maintaining a separate timestamp field. I know that CouchDB will generate IDs with this property, but I don't want the performance hit of querying the database; I'd rather just run an algorithm on my servers. I'd go with an implementation of RFC 4122, except that the results aren't lexicographically ascending. Is there any good reason I shouldn't just do:
(Date.now()) + 'x' + Math.round(Math.random() *1E18)
(I'm using Node.js.) Are there any costs to using a non-standard UUID, or to relying on JavaScript's built-in random function?
You have some choices when it comes to uuids.
The first choice is whether you want the _id generated client-side (node, browser, etc.) or by Couch. It sounds like you want to generate your own UUID on the client side. That is fine. Just put the result of your function into the _id field of the doc you save to CouchDB. Couch will just use that.
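One possible client-side generator, as a sketch only. It follows the idea in the question (timestamp first, random suffix second) but pads both parts to a fixed width so that plain string comparison keeps following the timestamp. The function name and the widths are assumptions, not anything CouchDB prescribes.

```javascript
// Assumption: a millisecond timestamp, zero-padded to a fixed width so that
// string comparison matches numeric comparison, plus a fixed-width random suffix.
function timeOrderedId(now = Date.now()) {
  const time = String(now).padStart(15, '0');
  // 8 hex characters of randomness; Math.random() is not cryptographically secure.
  const random = Math.floor(Math.random() * 0x100000000)
    .toString(16)
    .padStart(8, '0');
  return time + random;
}

const earlier = timeOrderedId(1367488800000);
const later = timeOrderedId(1367488800001);
console.log(earlier, later, earlier < later);
```

IDs created within the same millisecond are ordered only by their random suffix, so the ordering is by time at millisecond resolution and nothing finer.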
Alternatively, you could have Couch create the ID. CouchDB only generates an _id if you don't choose one yourself. By default CouchDB uses a 'sequential' UUID generation algorithm. You can switch to another algorithm via Futon and the config: there is a section called 'uuids' with a key named 'algorithm'. You can see the source for these algorithms here:
https://github.com/apache/couchdb/blob/master/src/couchdb/couch_uuids.erl
With descriptions about them here:
http://wiki.apache.org/couchdb/HttpGetUuids?highlight=%28utc%5C_random%29
As you can see, the utc_random function is very similar to your suggestion. And if you wanted your own algorithm and were so inclined, you could add it on the server side and recompile Couch.
The second part of your question is about the performance of choosing different algorithms. I am going to quote Dave Cottlehuber from a user list post:
CouchDB will have best insert time when your doc ids are continually increasing, as this minimises rewrites to the B-tree. This will also help your view build time for the same reason, and also minimises wasted doc space, although that would also be recovered during compaction.
So both your algorithm and utc_random should be fine, as their doc IDs are continually increasing thanks to the helpfully one-directional flow of time.
I would recommend sticking with the UUID that CouchDB generates for you, but you can configure the server to use utc_random which will prefix a timestamp which you can sort your records by.
http://wiki.apache.org/couchdb/HttpGetUuids
Since CouchDB has no equivalent of SQL's AUTO_INCREMENT, what would be your approach to generating sequential, unique numeric IDs for your documents?
I am using numeric ids for:
User-friendly IDs (e.g. TASK-123, RQ-001, etc.)
Integration with libraries/systems that require numeric primary key
I am aware of the problems with replication, etc. That's why I am interested in how people try to overcome this issue.
As Dominic Barnes says, auto-increment integers are not scalable, not distributed-friendly or cloud-friendly. It seems every app nowadays needs a mobile version with offline support, and that is not directly compatible with auto-increment integers. We all know this, but it is also true that auto-increment integers are necessary for legacy code and arguably other things.
In both scenarios, you are responsible for producing the auto-incrementing integer. A view runs emit(the_numeric_id, null). (You could also add a "type" namespace, e.g. with emit([doc.type, the_numeric_id], null).) Query for the final row (e.g. with startkey=MAXINT&descending=true&limit=1), increment the value returned, and that is your next ID. The attempt to save runs in a loop that retries if there was a collision.
You can also play tricks if you don't need 100% density of the list of IDs. For example, you can add timestamps to the emit() rows, and estimate the document creation velocity, and increment by that velocity times your computation and transmit time. You could also simply increment by a random integer between 1 and N, so most of the time the first insert works, at a cost of non-homogeneous ID numbers.
As for where to store the integer, I think there are two options: the id strategy and the try and check strategy.
The id strategy is simpler and quicker in the short term. Document IDs are an integer (perhaps prefixed with a type to add a namespace). Since Couch guarantees uniqueness on the _id field, you just worry about the auto-incrementing. Do this in a loop: 409 Conflict triggers a retry, 201 Created means you're done.
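The retry loop of the id strategy can be sketched as follows. FakeStore is a made-up in-memory stand-in that mimics only the two status codes mentioned above; against a real database, put would be an HTTP PUT and maxId would be the view query for the highest ID, and both names are hypothetical.

```javascript
// In-memory stand-in for a database that rejects duplicate ids with 409.
// It is an illustration only, not a CouchDB client.
class FakeStore {
  constructor() {
    this.docs = new Map();
  }
  put(id, doc) {
    if (this.docs.has(id)) return 409; // Conflict: id already taken
    this.docs.set(id, doc);
    return 201; // Created
  }
  // Plays the role of the view query for the highest id in use.
  maxId() {
    let max = 0;
    for (const id of this.docs.keys()) max = Math.max(max, Number(id));
    return max;
  }
}

// Try max + 1, and on 409 re-read the maximum and try again.
function saveWithNextId(store, doc, maxAttempts = 10) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const nextId = store.maxId() + 1;
    const status = store.put(String(nextId), doc);
    if (status === 201) return nextId;
    if (status !== 409) throw new Error('unexpected status ' + status);
  }
  throw new Error('gave up after ' + maxAttempts + ' conflicts');
}

const store = new FakeStore();
console.log(saveWithNextId(store, { title: 'first' }));
console.log(saveWithNextId(store, { title: 'second' }));
```

The cap on attempts is a design choice for the sketch: under heavy write contention an unbounded loop could spin for a long time.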
I think the major pain with this trick is that, if and when you get conflicts, you have two completely unrelated documents, and one of them must be copied into a fresh document. If there were relationships with other documents, they must all be corrected. (The CouchDB 0.11 emit(key, {_id: some_foreign_doc_id}) trick comes to mind.)
The try and check strategy uses the default UUID as the doc._id, so every insert will succeed. Ideally, all or most of your inter-document relations are based on the immutable UUID _id, not the integer. That is just used for users and UI. The auto-incrementing integer is simply a field in the document, {"int_id":20}. The view of course does emit(doc.int_id, null). (You can look up a document by integer ID with the ?key=23&include_docs=true parameters of the view.)
Of course, after a replication, you might have id conflicts (not official CouchDB conflicts, but just documents using the same numeric id). The view which emits by ID would also have a reduce phase: simply _count should be enough. Next you must patrol the DB, querying this view with ?group=true and looking for any row (corresponding to an integer id) which has a count > 1. On the plus side, correcting the numeric id of a document is a minor change because it does not require new document creation.
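The patrol step could look roughly like this. The rows array has the shape a view returns when it is reduced with _count and queried with ?group=true (one row per key, with the count as the value); the sample data and the function name are invented for the example.

```javascript
// `rows` has the shape of a view response queried with ?group=true when the
// view's reduce function is `_count`: one row per key, value = count.
function findDuplicateIds(rows) {
  return rows.filter(row => row.value > 1).map(row => row.key);
}

// Invented sample response for illustration.
const response = {
  rows: [
    { key: 20, value: 1 },
    { key: 21, value: 2 }, // two documents claim int_id 21
    { key: 22, value: 1 }
  ]
};

console.log(findDuplicateIds(response.rows));
```

Each key returned here would then need one of its documents renumbered.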
Those are my ideas. Now that I wrote them down, I feel like you must do relation-shepherding regardless of where the id is stored; so perhaps using _id is better after all. The only other downside I see is that you are permanently married to a fundamentally broken naming model—for some definition of "permanently."
Is there any particular reason you want to use numeric IDs over the UUIDs that CouchDB can generate for you? UUIDs are perfect for the distributed paradigm that CouchDB uses, so stick with what is built in.
If you find yourself with any more than 1 CouchDB node in your architecture, you're going to get conflicting document IDs if you rely on something like "auto increment" when it comes time for replication. Even if you're only using 1 node now, that's probably not always going to be the case, especially since CouchDB works so well in a distributed and "offline" architecture.
I have had pretty good luck just using an ISO-formatted date as my key:
http://wiki.apache.org/couchdb/IsoFormattedDateAsDocId
It's pretty simple to do, human-readable and it basically builds in a few querying options by just existing. :-)
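A small sketch of the idea, assuming UTC timestamps from Date.prototype.toISOString are acceptable as IDs. The helper name is made up for this example.

```javascript
// Uses the UTC ISO 8601 string of a date as the document id.
function isoDateId(date = new Date()) {
  return date.toISOString();
}

const ids = [
  new Date(Date.UTC(2010, 4, 2, 9, 30)),
  new Date(Date.UTC(2009, 11, 31, 23, 59)),
  new Date(Date.UTC(2010, 4, 1, 12, 0))
].map(isoDateId);

// Plain string sorting puts the ids in chronological order.
ids.sort();
console.log(ids);
```

Because such IDs share prefixes like "2010-05", a startkey/endkey range on _all_docs can select a month or a day without a separate view. Two documents created in the same millisecond would collide, so this fits low write rates best.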
Keeping in mind the issues around replication and conflicts, you can use an update function to generate incrementing IDs that are guaranteed unique in a single master setup.
function(doc, req) {
    if (!doc) {
        doc = {
            _id: req.id,
            type: 'idGenerator',
            count: 0
        };
    }
    doc.count++;
    return [doc, toJSON(doc.count)];
}
Include this function in a design document like so:
{
    "_id": "_design/application",
    "language": "javascript",
    "updates": {
        "generateId": "function (doc, req) {\n\t\t\tif (!doc) {\n\t\t\t\tdoc = {\n\t\t\t\t\t_id: req.id,\n\t\t\t\t\ttype: 'idGenerator',\n\t\t\t\t\tcount: 0\n\t\t\t\t};\n\t\t\t}\n\n\t\t\tdoc.count++;\n\t\t\t\n\t\t\treturn [doc, toJSON(doc.count)];\n\t\t}"
    }
}
Then call it like so:
curl -XPOST http://localhost:5984/mydb/_design/application/_update/generateId/entityId
Replace entityId with whatever you like to create several independent ID sequences.
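To see what the update function returns without a running server, it can be exercised locally. This is only a sketch: toJSON is provided by CouchDB's query server, so a JSON.stringify stand-in is assumed here, and the function is given a name so it can be called directly.

```javascript
// CouchDB's query server provides toJSON; this stand-in makes the
// function runnable locally.
const toJSON = JSON.stringify;

function generateId(doc, req) {
  if (!doc) {
    doc = { _id: req.id, type: 'idGenerator', count: 0 };
  }
  doc.count++;
  return [doc, toJSON(doc.count)];
}

// First call: no counter document exists yet, so doc is null.
let [counter, body] = generateId(null, { id: 'entityId' });
console.log(body);

// Later calls receive the stored counter document.
[counter, body] = generateId(counter, { id: 'entityId' });
console.log(body);
```

The first element of the returned pair is the document that gets saved, and the second is the response body, so each call both stores the new count and hands it back to the caller.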
Not a perfect solution but something that worked for me. Create an independent service that generates auto-incremented ids. Yes, you probably say "this breaks the offline model of couchdb" but what if you get a pool of N ids that you can then use whenever you need to get a new auto-incremented id. Then every time you're online you get some more ids and if you are running out of ids you tell your users - please go online. If the pool is big enough (say the monthly traffic) this shouldn't happen. Again, not perfect but maybe can be helpful to some people.
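The client side of such a pool might be sketched like this. The class, its method names and the low-water mark are all assumptions for illustration; the ID service itself is left out.

```javascript
// Client-side pool of pre-allocated ids handed out by a central service.
class IdPool {
  constructor(lowWaterMark = 10) {
    this.ids = [];
    this.lowWaterMark = lowWaterMark;
  }
  // Called while online, with a fresh batch obtained from the id service.
  refill(ids) {
    this.ids.push(...ids);
  }
  // True when it is time to ask the service for more ids.
  needsRefill() {
    return this.ids.length <= this.lowWaterMark;
  }
  // Hands out the next id, or fails when the pool has run dry.
  next() {
    if (this.ids.length === 0) {
      throw new Error('ID pool exhausted: please go online to fetch more');
    }
    return this.ids.shift();
  }
}

const pool = new IdPool(1);
pool.refill([101, 102, 103]);
console.log(pool.next(), pool.needsRefill());
console.log(pool.next(), pool.needsRefill());
```

Checking needsRefill whenever the client is online lets it top up well before the pool runs dry.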
Instead of explicitly constructing an increasing integer key, you could use the implicit index that CouchDB accepts for paging.
The skip parameter accepts an integer that will effectively provide the auto-incrementing index you are used to.
http://wiki.apache.org/couchdb/HTTP_view_API#Querying_Options
The drawback is that it is not a viable solution for "User-friendly IDs". The index is not tied to the doc, and is subject to change if you are rewriting history.
If your only constraint is "integration with libraries/systems that require numeric primary key", this will bridge the gap without losing the benefits of CouchDB's key structure.