TL;DR: which of the three options below is the most efficient for paginating with Redis?
I'm implementing a website with multiple user-generated posts, which are saved in a relational DB and then copied to Redis in the form of Hashes with keys like site:{site_id}:post:{post_id}.
I want to perform simple pagination queries against Redis, in order to implement lazy-load pagination (i.e. the user scrolls down and we send an Ajax request to the server asking for the next batch of posts) in a Pinterest-style interface.
Then I created a Set to keep track of published post IDs, with keys like site:{site_id}:posts. I chose Sets because I don't want duplicate IDs in the collection, and I can add IDs quickly with a simple SADD (no need to check whether the ID exists) on every DB update.
Since Sets aren't ordered, I'm weighing the pros and cons of the options I have for paginating:
1) Using SSCAN command to paginate my already-implemented sets
In this case, I could persist the returned scan cursor in the user's
session, then send it back to the server on the next request. This doesn't
seem reliable with multiple users accessing and updating the database: at
some point the cursor would be invalid and return weird results,
unless there is some caveat that I'm missing.
2) Refactor my sets to use Lists or Sorted Sets instead
Then I could paginate using LRANGE or ZRANGE. A List seems to
be the most performant and natural option for my use case. It's
perfect for pagination and ordering by date, but I can't check
whether a single item exists without looping over the whole list. Sorted Sets
seem to combine the advantages of both Sets and Lists, but consume more
server resources.
3) Keep using regular sets and store the page number as part of the key
It would be something like site:{site_id}:{page_number}:posts. It
was the recommended way before Scan commands were implemented.
So, the question is: which one is the most efficient / simplest approach? Is there any other recommended option not listed here?
"Best" is best served subjective :)
I recommend you go with the 2nd approach, but definitely use Sorted Sets over Lists. Not only do they make sense for this type of job (see ZRANGE), they're also more efficient in terms of complexity: ZRANGE is O(log(N)+M) for M returned elements, while LRANGE is O(S+N), where S is the distance of the start offset from the head of the list.
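As a minimal sketch of the Sorted Set approach, assuming the redis-py client and the publish time as the score (both are assumptions, not something stated in the question), pagination could look like this. The key layout follows the question; `page_bounds`, `publish_post` and `fetch_page` are hypothetical helper names.

```python
def page_bounds(page, per_page):
    """Translate a zero-based page number into inclusive range indexes."""
    start = page * per_page
    stop = start + per_page - 1  # Redis range ends are inclusive
    return start, stop


def publish_post(client, site_id, post_id, published_at):
    """Index a post ID, scored by its publish time (assumed Unix timestamp)."""
    # ZADD keeps members unique: re-adding an ID only updates its score.
    client.zadd(f"site:{site_id}:posts", {post_id: published_at})


def fetch_page(client, site_id, page, per_page=20):
    """Return one page of post IDs, newest first."""
    start, stop = page_bounds(page, per_page)
    return client.zrevrange(f"site:{site_id}:posts", start, stop)
```

With a real connection this would be called as `fetch_page(redis.Redis(), site_id, page)`. Checking whether a single post is published can then be done with ZSCORE on the same key, so the uniqueness that the Set provided is not lost.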
We are storing users and friends (relationships) in Redis sets.
This is probably easy but we can't figure out how to get back results when paginating.
Example: when showing a logged-in user's friends, we need the first 20 results, then on the following click, the next 20 results, etc. We don't really care about the order, provided we don't get repeated data in the following queries.
We prefer sets to sorted sets, as sets let us use cheap SINTER for other queries.
What would the recommended approach be? Storing them as both sets and sorted sets? Sounds a bit redundant.
You can paginate through a Set using SSCAN; note that it can return the same element more than once, though. Alternatively, Sorted Sets are the best fit for that kind of task. Lastly, Lists can also work, but LRANGE is an expensive operation for offsets deep into a long list.
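A rough sketch of the SSCAN route, assuming a redis-py style client whose `sscan` returns a `(cursor, members)` pair; `iter_unique_members` is a hypothetical helper name. It filters out the repeats that SSCAN is allowed to produce:

```python
def iter_unique_members(client, key, count=20):
    """Yield each member of a Set once, walking it with SSCAN."""
    seen = set()
    cursor = 0
    while True:
        # SSCAN returns the next cursor plus a batch; a cursor of 0 ends the scan.
        cursor, batch = client.sscan(key, cursor=cursor, count=count)
        for member in batch:
            if member not in seen:  # SSCAN may repeat members, so filter them
                seen.add(member)
                yield member
        if int(cursor) == 0:
            break
```

COUNT is only a hint, so batches are not guaranteed to hold exactly 20 members. To page across separate requests, the cursor (and the IDs already shown) would have to be kept per user, which is the main reason a Sorted Set tends to be simpler here.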
CouchDB has a special _all_docs view, which returns documents sorted by ID. But as IDs are random by default, the sorting makes no sense.
I always need to sort by 'date added'. Now I have two options:
1) Generating my own IDs and making sure they start with a timestamp
2) Using standard GUIDs, but adding a timestamp to the JSON and sorting on that
Now the second solution is less hackish, but I suspect the first solution to be much more efficient and faster, because all queries will be done on the real row id, which is indexed.
Is it true that both solutions differ in performance? And if it's true, which one is likely to be faster or preferred?
Is it true that both solutions differ in performance?
The examples you give describe the primary and secondary index approaches in CouchDB.
_all_docs is the only primary index and is always up to date. Secondary indexes (views), as in your second solution, are updated when they are requested.
That's the reason why, from the requester's point of view, _all_docs might seem "faster". In reality there is no difference when requesting an index that is already up to date. Two workarounds for potentially outdated views (secondary indexes) are the query parameter stale=ok (respond from the existing index without updating it; stale=update_after additionally triggers the update after the response) or so-called "view heaters" (send a simple HTTP GET to the view to trigger the update process).
And if it's true, which one is [...] preferred?
The capabilities for building a useful index and response payload are significantly greater with secondary indexes.
When you want to use the primary index, you have to "design" your ID as you have described. That is a big up-front decision, because it constrains what else can be done with the document and its ID.
My recommendation would be to use secondary indexes (views). Only if you need data to be queryable in real time, or in high-concurrency scenarios, should you consider the primary index as a candidate for requesting the data.
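If the primary index is chosen anyway, the first option from the question (IDs that begin with a timestamp) might look like the sketch below. The ID layout and the name `make_sortable_id` are assumptions for illustration; the random suffix is only there to keep IDs created in the same millisecond distinct.

```python
import time
import uuid


def make_sortable_id(now_ms=None):
    """Build a document ID whose string order matches creation order."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Zero-pad the timestamp so that string order equals numeric order.
    return f"{now_ms:013d}-{uuid.uuid4().hex}"
```

Such an ID would be passed as `_id` when creating the document, after which _all_docs returns documents in creation order (to millisecond resolution).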
I'm using CouchDB with node.js. Right now there is one node involved, and even in the remote future there are no plans to change that. While I can remove most of the cases where a short, auto-increment-like ID (it can be sparse, but not random) is required, there remains one place where users actually need to enter the ID of a product. I'd like to keep this ID as short as possible and in a more human-readable format than something like '4ab234acde242349b', as it sometimes has to be typed by hand and so on.
However, in the database it can be stored with whatever ID pleases CouchDB (using the default auto-generated UUID), but it should be possible to give it a number that can be used to identify it as well. What I have thought about is creating a document that consists of an array with all the UUIDs from CouchDB. When I create a new product in node, I would run an update handler that appends the new UUID to the end of said document. To obtain the product's ID, I'd then query the array and, client side, use indexOf to get the index as a short ID.
I don't know if this is feasible. From the performance point of view I can say the following: there are more queries that map numerical ID -> UUID than UUID -> numerical ID. There will be at most 7000 new entries a year in the database. Also, there is currently no use case where a product can be deleted, yet I'd rather not rely on that.
Are there any other applicable ways to generate a shorter and more human-readable ID that can be associated with my document?
/EDIT
From a technical point of view: it seems to be working. I can do both conversions (number <-> UUID) and it seems to go well. I don't know if this works well with replication and so on, but as there is said array I guess it should, right?
You have two choices here:
1) Set your human-readable ID as the _id field. You can simply set it in the create-document call to the DB, and it will be accepted. This is the more lightweight solution, but it comes with some limitations:
It has to be unique. You should also be careful about clients that try to create a document but overwrite an existing one instead.
It should only contain alphanumeric characters or a few special characters. In my experience, allowing extra character types is asking for trouble.
It should not exceed a sensible length limit (CouchDB doesn't define one, but you should). Long IDs will badly increase the size of your views (indexes), and may make them slower.
If these limitations are no problem for you, then you should go with this solution.
2) As you said yourself, let the _id be a UUID and store the human-readable ID in another field. To reach the document by the human-readable ID, create a view that emits the human-readable ID as the key, and then either emit the document as the value or fetch the document via the include_docs=true option. Whenever the view is queried, CouchDB updates it incrementally and returns the matching rows. This is really the same as creating a document with an array/object of IDs inside it, except that with a CouchDB view you get better performance.
This might also be slightly slower for querying and inserting. If the IDs are inserted sequentially, it's fine; if not, CouchDB will take slightly more time to insert each one at the right place. Out-of-order inserts don't work well when huge numbers of inserts hit the DB.
Querying shouldn't take more than 10% longer than with the first option, and I think 10% is really a generous estimate; it will most probably be less than 5%. In my CouchDB application I switched from reading by _id to reading from a view by key, and the slowdown was so small that, from the user's end, it wasn't noticeable even when making 100 queries at the same time.
This is how people query documents by fields other than the ID, for example looking up a user document by email when the user is logging in.
If you don't know how CouchDB views work, you should read the views chapter of the book CouchDB: The Definitive Guide.
Also make sure you stay away from documents with huge arrays inside them. I think CouchDB has a limit of 4GB per document. I remember having many such documents, and querying took really long because the view had to iterate over each array item. In the end I created one document per array item instead. It was way faster.
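To make the view lookup from the second choice concrete, here is a small sketch that builds the request URL for fetching a document by its human-readable ID. It assumes a view whose map function emits that ID as the key; the database, design document and view names used in the example call (`products`, `lookup`, `by_short_id`) are hypothetical.

```python
import json
from urllib.parse import urlencode


def view_lookup_url(base_url, db, design_doc, view, key):
    """Build the URL for fetching the document emitted under `key`."""
    params = urlencode({
        "key": json.dumps(key),   # view keys must be JSON-encoded
        "include_docs": "true",   # return the full document with the row
    })
    return f"{base_url}/{db}/_design/{design_doc}/_view/{view}?{params}"
```

For example, `view_lookup_url("http://localhost:5984", "products", "lookup", "by_short_id", 1042)` yields a URL ending in `?key=1042&include_docs=true`, which can be requested with any HTTP client.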
I am experimenting with Dojo's dgrid (which is great!). I am using Nodejs/Mongoose on the server side.
I want to write a "log browser": I have a big mongodb table containing lots of log entries; using dgrid, I want to be able to 1) Filter by certain parameters 2) Paginate using dgrid's native pagination.
Dojo's JsonRest stores will send a request like this:
Accept:application/javascript, application/json
Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3
...
Host:localhost:3000
Range:items=0-24
Hence the problem: it will give a range (that's all it can do, really) and will display things on the client side according to what it receives from the server.
It's unrealistic to expect a client-side JsonRest object to make requests other than "ranges". However, I am aware that skip/limit doesn't go very well with Mongoose:
What is the best way to do ajax pagination with MongoDb and Nodejs?
My idea was to render the dgrid, allow the users to pick filters, and let them happily paginate through their logs. However, since skip/limit are out of the question, I am in a bit of a pickle...
Any pearls of wisdom, other than ditching dgrid altogether and implementing pagination on my own without using Dojo stores?
Merc.
Front-end
The filtering isn't as feature-full in dgrid as it is in the dojo EnhancedGrid filter plugin so you will probably need to implement that part yourself.
The good news is you get the paging simply by mixing-in "dgrid/OnDemandGrid" when you create your grid.
Back-end
The docs seem to indicate that your best bet for performance is to do some tricks with indices and query based on those to get your ranges.
You are probably already referencing these, but here they are:
http://mongoosejs.com/docs/api.html#query_Query-skip
http://docs.mongodb.org/manual/reference/method/cursor.skip/
Since log data is usually sequential and rarely modified, you could probably just use a monotonically increasing index for each row of log data and query using those to get the right offset into and count of the rows.
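The idea of turning the store's Range header into an indexed range query, rather than skip/limit, can be sketched as follows. This assumes every log row carries a gap-free, monotonically increasing counter in a field called `seq` (a hypothetical name); the function names are illustrative too.

```python
import re


def parse_items_range(header):
    """Parse a header value such as 'items=0-24' into (first, last)."""
    match = re.fullmatch(r"items=(\d+)-(\d+)", header.strip())
    if match is None:
        raise ValueError(f"unsupported Range header: {header!r}")
    first, last = int(match.group(1)), int(match.group(2))
    if last < first:
        raise ValueError("range end precedes range start")
    return first, last


def seq_filter(first, last):
    """Build a MongoDB-style filter on the hypothetical indexed `seq` field."""
    return {"seq": {"$gte": first, "$lte": last}}


def content_range(first, last, total):
    """Format the Content-Range value that tells the store the total count."""
    return f"items {first}-{last}/{total}"
```

The filter returned by `seq_filter` would be passed to the collection's find call. Note that this only lines up with the requested offsets while no additional filter is applied; once rows are filtered out, the `seq` values of the result are no longer contiguous and a different scheme is needed.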
I come from a SQL world where lookups are done by several object properties (published = TRUE or user_id = X) and there are no joins anywhere (because of the 1:1 cache layer). It seems that a document database would be a good fit for my data.
I am trying to figure out whether there is a way to pass one (or more) object properties to a CouchDB map/reduce function to find matching documents in a database without creating dozens of views for each document type.
Is it possible to pass the desired document property key(s) to match at run-time to CouchDB and have it return the objects that match (or the count of objects that match, for pagination)?
For example, on one page I want all posts with a doc.user_id of X that are doc.published. On another page I might want all documents with doc.tags[] with the tag "sport".
You could build a view that iterates over the keys in the document and emits a key of [propertyName, propertyValue]. That way you're building a single index with EVERY prop/value pair in it. It would be massive; I have no idea how it would perform at build time, or how much disk it would use (probably a lot).
Map function would look something like:
// note - totally untested, my CouchDB fu is rusty
function(doc) {
  // declare prop locally instead of leaking an implicit global
  for (var prop in doc) {
    emit([prop, doc[prop]], null);
  }
}
Works for the basic case of simple properties, and can be extended to be smart about arrays, and emit a prop/value pair for each item in the array. That would let you handle the tags.
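To query on it, set [prop, value] as your query key on the view. To match every value of one property, use a key range instead: startkey=[prop] and endkey=[prop, {}].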
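The array extension mentioned above can be modelled in a few lines. This is a Python model of the logic only (the real map function would be JavaScript running inside CouchDB), and `emitted_keys` is a hypothetical name:

```python
def emitted_keys(doc):
    """Model the [property, value] keys the map function would emit."""
    keys = []
    for prop, value in doc.items():
        if isinstance(value, list):
            # One key per array element, so each tag is individually queryable.
            keys.extend([prop, item] for item in value)
        else:
            keys.append([prop, value])
    return keys
```

A document tagged "sport" would then produce the key ["tags", "sport"], which is what the second example page in the question would query for. A single key lookup still matches one property at a time, so a combined condition such as "user_id is X and published" is not covered by this index alone.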
Basically, no.
The key difference between something like Couch and a SQL DB is that the only way to query in CouchDB is essentially through the views/indexes. Indexes in SQL are optional. They exist (mostly) to boost performance. For example, if you have a small DB, your app will run just fine on SQL with 0 indexes. (Might be some issue with unique constraints, but that's a detail.)
The overall point is that the query processor in a SQL database includes other methods of data access beyond indexes, notably table scans, merge joins, etc.
Couch has no query processor. It has views (defined by JS) used to define B-Tree indexes.
And that's it. That's the hammer of Couch. It's a good hammer. It has served the data processing world for basically 40 years.
Indexes are somewhat expensive to create in Couch (based on data volume) which is why "temporary views" are frowned upon. And they have a cost in maintenance as well, so views need to be a conscious design element in your database. At the same time, they're a bit more powerful than normal SQL indexes as well.
You can readily add your own query processing on top of Couch, but that will be more work for you. You can create a few select views on your most popular or selective criteria, and then filter the resulting documents by other criteria in your own code. Yes, you have to do that yourself, so you have to question whether the effort involved is worth more than whatever benefits you feel Couch is offering you (HTTP API, replication, safe and always-consistent datastore, etc.) over a SQL solution.
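The "filter in your own code" step can be as small as the sketch below. It assumes the documents have already been fetched from one selective view and are available as plain dictionaries; `filter_docs` is a hypothetical helper.

```python
def filter_docs(docs, **criteria):
    """Keep only the documents whose fields equal every given criterion."""
    return [
        doc for doc in docs
        if all(doc.get(field) == wanted for field, wanted in criteria.items())
    ]
```

For instance, after querying a view keyed on user_id, `filter_docs(docs, published=True)` narrows the result to published posts without needing a second view.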
I ran into a similar issue, and built a quick workaround using CouchDB-Python (which is a great library). It's not a pretty solution (goes against the principles of CouchDB), but it works.
CouchDB-Python gives you the function "query", which allows you to "execute an ad-hoc temporary view against the database". You can read about it here
What I do is store the JavaScript function as a string in Python, and then concatenate it with variables that I define in Python.
In some_function.py
import json

variable = value
# Map function (in JavaScript); json.dumps renders the Python value as a JS literal
map_fn = """function(doc) {
<javascript code>
var survey_match = """ + json.dumps(variable) + """;
<javascript code>
}"""
# Iterates through rows
for row in db.query(map_fn):
    <python code>
It sure isn't pretty, and probably breaks a bunch of CouchDB philosophies, but it works.