CouchDB view query with multiple key values

I am currently trying to create a view and query to fit this SQL query:
SELECT * FROM articles
WHERE articles.location="NY" OR articles.location="CA"
ORDER BY articles.release_date DESC
I tried to create a view with a complex key:
function(doc) {
  if (doc.type == "Article") {
    emit([doc.location, doc.release_date], doc)
  }
}
And then using startkey and endkey to retrieve one location and ordering the result on the release date.
.../_view/articles?startkey=["NY", {}]&endkey=["NY"]&limit=5&descending=true
This works fine.
However, how can I send multiple startkeys and endkeys to my view in order to mimic
WHERE articles.location="NY" OR articles.location="CA" ?

My arch nemesis, Dominic, is right.
Furthermore, it is never possible to query by criteria A and then sort by criteria B in CouchDB. In exchange for that inconvenience, CouchDB guarantees scalable, dependable, logarithmic query times. You have a choice.
Store the view output in its own database, and make a new view to sort by criteria B
or, sort the rows afterward, which can be either
Sort client-side, once you receive the rows
Sort server-side, in a _list function. This is great, but remember it's not ultimately scalable. If you have millions of rows, the _list function will probably crash (see the sketch below).
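For illustration, a minimal sketch of such a sorting _list function (untested; it assumes the view emits the whole doc as its value, as in the question above):

function(head, req) {
  // buffer every row in memory -- exactly why this does not scale
  var rows = [];
  var row;
  while ((row = getRow())) {
    rows.push(row);
  }
  // newest release_date first
  rows.sort(function(a, b) {
    return a.value.release_date < b.value.release_date ? 1 : -1;
  });
  send(JSON.stringify({ rows: rows }));
}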

The short answer is, you currently cannot use multiple startkey/endkey combinations.
You'll either have to make 2 separate queries, or you could always add on the Lucene search engine to get much more robust search capabilities.
It is possible to pass multiple keys in a single query. See the Couchbase CouchDB documentation on multi-document fetching.
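For example (a hedged sketch: the keys parameter matches complete keys exactly, so it assumes a simpler, hypothetical view keyed on location alone):

function(doc) {
  if (doc.type == "Article") {
    emit(doc.location, doc)
  }
}

and then POST the wanted keys in the request body:

POST .../_view/by_location
{ "keys": ["NY", "CA"] }

Note that the rows come back grouped in the order of the supplied keys, so the ORDER BY release_date part would still have to happen client-side.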

Related

How to get all documents from a collection in FaunaDB?

I already have an answer:
const faunadb = require('faunadb')
const q = faunadb.query

exports.handler = async (event, context) => {
  const client = new faunadb.Client({
    secret: process.env.FAUNADB_SERVER_SECRET
  })
  try {
    // Getting the refs with a first query
    let refs = await client.query(q.Paginate(q.Match(q.Index('skus'))))
    // Forging a second query with the retrieved refs
    const bigQuery = refs.data.map((ref) => q.Get(ref))
    // Sending over that second query
    let allDocuments = await client.query(bigQuery)
    // All my documents are here!
    console.log('#allDocuments: ', allDocuments);
    //...
  } catch (err) {
    // ...
  }
}
But I find it unsatisfying because I'm making 2 queries for what seems like one of the most trivial DB calls. It seems inefficient and wordy to me.
As I'm just learning about FaunaDB, there's probably something I don't grasp here.
My question could be split into 3:
Can I query for all documents in a single call?
If not, why not? What's the logic behind such a design?
Could I make such a query without an index?
FaunaDB's FQL language is quite similar to JavaScript (which helps a lot if you want to do conditional transactions etc).
In essence, FaunaDB also has a Map. Given that your index contains only one value, the reference, you can write this:
q.Map(
  q.Paginate(q.Match(q.Index('skus'))),
  q.Lambda(x => q.Get(x))
)
For this specific case, you actually do not need an index since each collection has a built-in default index to do a select all via the 'Documents' function.
q.Map(
  q.Paginate(q.Documents(q.Collection('<your collection>'))),
  q.Lambda(x => q.Get(x))
)
Now, in case the index that you are using returns multiple values (because you want to sort on something other than 'ref'), you need to give the Lambda the same number of parameters as the number of values defined in the index. Let's say my index has ts and ref in values because I want to sort on time; the query to get all values then becomes:
q.Map(
  q.Paginate(q.Match(q.Index('<your index with ts and ref values>'))),
  q.Lambda((ts, ref) => q.Get(ref))
)
Values are used for range queries and sorting, but they also define what the index returns.
Coming back to your questions:
- Can I query for all documents in a single call?
Absolutely, and I would advise you to do so. Note that the documents you get back are paginated automatically. You can set the page size by passing a parameter to Paginate, and you will get back an 'after' or 'before' attribute when the result set is bigger than one page. That cursor can be passed back to the Paginate function as a parameter to get the next or previous page: https://docs.fauna.com/fauna/current/api/fql/functions/paginate
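A hedged sketch of that cursor-based paging (reusing the client and the 'skus' index from the question; the page size is arbitrary):

const firstPage = await client.query(
  q.Map(
    q.Paginate(q.Match(q.Index('skus')), { size: 100 }),
    q.Lambda(x => q.Get(x))
  )
)
// 'after' is only present when there are more pages
if (firstPage.after) {
  const nextPage = await client.query(
    q.Map(
      q.Paginate(q.Match(q.Index('skus')), { size: 100, after: firstPage.after }),
      q.Lambda(x => q.Get(x))
    )
  )
}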
- Could I make such a query without an index?
No, but you can use the built-in index as explained above. FaunaDB protects users from querying without an index. Since it is a scalable database that could contain massive data and is pay-as-you-go, it's a good idea to prevent users from shooting themselves in the foot :). Pagination and mandatory indexes help to do that.
As to why FQL is different: FQL is not declarative like many query languages. Instead it's procedural: you write exactly how you fetch data. That has advantages:
By writing how data is retrieved, you can predict exactly how a query behaves, which is a nice-to-have in a pay-as-you-go system.
The same language can be used for security rules or complex conditional transactions (updating certain entities, or many entities ranging over different collections, depending on certain conditions). It's quite common in Fauna to write a query that does many things in one transaction.
Our flavour of 'stored procedures', called User Defined Functions, is written in plain FQL and not another language.
Querying is also discussed in this tutorial that comes with code in a GitHub repository which might give you a more complete picture: https://css-tricks.com/rethinking-twitter-as-a-serverless-app/
Can I query for all documents in a single call?
Yes, if your collection is small. The Paginate function defaults to fetching 64 documents per page. You can adjust the page size up to 100,000 documents. If your collection has more than 100,000 documents, then you have to execute multiple queries, using cursors to fetch subsequent documents.
See the Pagination tutorial for details: https://docs.fauna.com/fauna/current/tutorials/indexes/pagination
If not, why not? What's the logic behind such a design?
For an SQL database, SELECT * FROM table is both convenient and, potentially, a resource nightmare. If the table contains billions of rows, attempting to serve results for that query could consume the available resources on the server and/or the client.
Fauna is a shared database resource. We want queries to perform well for any user with any database, and that requires that we put sensible limits on the number of documents involved in any single transaction.
Could I make such a query without an index?
No, and yes.
Retrieving multiple results from Fauna requires an index, unless you are independently tracking the references for documents. However, with the Documents function, Fauna maintains an internal index so you don't need to create your own index to access all documents in a collection.
See the Documents reference page for details: https://docs.fauna.com/fauna/current/api/fql/functions/documents
Returning to your example code, you are executing two queries, but they could easily be combined into one. FQL is highly composable. For example:
let allDocuments = await client.query(
  q.Map(
    q.Paginate(q.Documents(q.Collection("skus"))),
    q.Lambda("X", q.Get(q.Var("X")))
  )
)
Your observation that FQL is wordy is correct. Many functional languages exhibit that wordiness. The advantage is that any function that accepts expressions can be composed at will. One of the best examples of composability, and of how to manage inter-document references, is presented in our E-commerce tutorial, specifically the section describing the submit_order function: https://docs.fauna.com/fauna/current/tutorials/ecommerce#function

Can you find a specific document's position in a sorted Azure Search index?

We have several Azure Search indexes that use a Cosmos DB collection of 25K documents as a source and each index has a large number of document properties that can be used for sorting and filtering.
We have a requirement to allow users to sort and filter the documents, and then search for and jump to a specific document's page in the paginated result set.
Is it possible to query an Azure Search index with sorting and filtering and get the position/rank of a specific document id from the result set? Would I need to look at an alternative option? I believe there could be a way of doing this with a SQL back-end but obviously that would be a major undertaking to implement.
I've yet to find a way of doing this other than writing a query to paginate through until I find the required document which would be a relatively expensive and possibly slow task in terms of processing on the server.
There is no mechanism in Azure Search for filtering within the resultset of another query. You'd have to page through results, looking for the document ID on the client side. If your queries aren't very selective and produce many pages of results, this can be slow as $skip actually re-evaluates all results up to the page you specify.
You could use caching to make this faster. At least one Azure Search customer is using Redis to cache search results. If your queries are selective enough, you could even cache the results in memory so you'd only pay the cost of paging once.
Trying this at the moment. I'm using a two-step process:
Generate your query, but set $count=true and $top=0. The query result should contain a field named @odata.count.
You can then pick a position, and use $top=1 and $skip=<position> to return a single entry. There is one caveat: $skip will only accept numbers less than 100,000.
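A hedged sketch of those two steps against the Azure Search REST API (the service name, index name, api-version, and sort field are placeholders):

const base = 'https://<service>.search.windows.net/indexes/<index>/docs'
const headers = { 'api-key': process.env.SEARCH_QUERY_KEY }
const common = 'api-version=2020-06-30&search=*&$orderby=release_date%20desc'

// Step 1: total count only ($top=0 returns no documents)
const countRes = await fetch(`${base}?${common}&$count=true&$top=0`, { headers })
const total = (await countRes.json())['@odata.count']

// Step 2: the single entry at a chosen position (must be < 100000)
const position = 41
const docRes = await fetch(`${base}?${common}&$top=1&$skip=${position}`, { headers })
const doc = (await docRes.json()).value[0]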

Is startkey pagination required for Mango queries in CouchDB 2.0

I've been searching all over on this one. I'm running CouchDB 2.0 and understand I have a choice to make between using traditional views or the newer Mango query when retrieving a set of data.
So I'm currently using the Mango query syntax and getting the results I need - however I now need to implement pagination. When researching pagination in CouchDB 2.0 I found this excellent discussion around the topic:
http://docs.couchdb.org/en/2.0.0/couchapp/views/pagination.html
It suggests that the best way to paginate large data sets is not to use skip but instead to use startkey and perform a kind of linked list pagination from one page to the next.
So this makes sense to me and works for my application, but when I then turn to the Mango/_find API I can't see any way to pass in startkey:
http://docs.couchdb.org/en/2.0.0/api/database/find.html
Confusingly enough, it does accept a skip parameter, but there is no startkey.
Is anybody able to explain what is happening here? Are the performance characteristics much different in Mango/_find such that we can safely use skip on large data sets? Or should we be using views with startkey when traversing larger collections of data?
This particular question doesn't seem to get answered in any recent documentation AFAIK. Any help would be greatly appreciated.
You could perhaps work around the lack of startkey/endkey support by including a constraint in the selector:
"selector": {
"_id": { "$gte": "myStartKey", "$lte": "myEndKey"}
}
(just my two cents; maybe somebody else has a more complete answer)
The CouchDB-documented approach to pagination applies only to map/reduce (view) queries and cannot be applied to Mango queries. That is primarily because, for views, there is a single key field used for sorting, so it's easy to skip previous docs by using that startkey (and, when keys are not unique, by adding a startkey_docid).
For selector queries, to effectively skip previous records you have to look at the sort keys specified in the original query and add another condition (or conditions) to skip the docs that have already been processed. For example, if you sorted ascending on a numeric field and have processed up to value = 10, you could add { "field": { "$gte": 10 } } as $and logic within the original selector (see the sketch below). This becomes complicated if you have multiple sort fields, so skip/limit may be the easier approach to paginating selector queries.
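A minimal sketch of that cursor-style approach as a _find request body (field names are hypothetical, and the sort assumes a JSON index exists on that field; $gt rather than $gte avoids re-including the boundary doc, at the cost of skipping ties):

{
  "selector": {
    "type": "article",
    "price": { "$gt": 10 }
  },
  "sort": [{ "price": "asc" }],
  "limit": 20
}

Take the price of the last row on each page and substitute it into the next request.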

Get the last documents?

CouchDB has a special _all_docs view, which returns documents sorted by ID. But as IDs are random by default, that sort order is meaningless.
I always need to sort by 'date added'. Now I have two options:
Generating my own IDs and making sure they start with a timestamp
Using standard GUIDs, but adding a timestamp in the JSON, and sorting on that
Now the second solution is less hackish, but I suspect the first solution is much more efficient and faster, because all queries are done on the real row ID, which is indexed.
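For example, option 1 could be as simple as this (an untested sketch; any zero-padded, monotonically increasing prefix would do):

// millisecond timestamp, zero-padded so string order matches numeric order,
// plus a random suffix to avoid collisions
const id = String(Date.now()).padStart(15, '0') + '-' + Math.random().toString(16).slice(2)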
Is it true that both solutions differ in performance? And if it's true, which one is likely to be faster or preferred?
Is it true that both solutions differ in performance?
Your examples describe the primary and the secondary index approach in CouchDB.
_all_docs is the only primary index and is always up to date. Secondary indexes (views), as in your second solution, get updated when they are requested.
That's the reason why, from the requester's point of view, _all_docs might seem "faster". In reality there is no difference when you request indexes that are already up to date. Two workarounds for potentially outdated views (secondary indexes) are the query params stale=ok (respond from the existing index without updating it) and stale=update_after (respond immediately, then update the view), or so-called "view heaters" (send a simple HTTP GET to the view to trigger the update process).
And if it's true, which one is [...] preferred?
The capability to build a useful index and response payload is significantly higher on the side of secondary indexes.
When you want to use the primary index, you have to "design" your IDs as you have described. You can imagine that this is a huge pre-decision that constrains what else can be done with the docs and their IDs.
My recommendation would be to use secondary indexes (views). Only if you need real-time results or are in high-concurrency scenarios should you consider the primary index in the search for the best fit to request data.
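As a minimal sketch of the recommended secondary index (the added_at field name is hypothetical):

function(doc) {
  if (doc.added_at) {
    emit(doc.added_at, null);
  }
}

and then query the newest documents first with:

.../_design/by_date/_view/added?descending=true&limit=10&include_docs=true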

CouchDB map/reduce by any document property at runtime?

I come from a SQL world where lookups are done by several object properties (published = TRUE or user_id = X) and there are no joins anywhere (because of the 1:1 cache layer). It seems that a document database would be a good fit for my data.
I am trying to figure out if there is a way to pass one (or more) object properties to a CouchDB map/reduce function to find matching documents in a database, without creating dozens of views for each document type.
Is it possible to pass the desired document property key(s) to match at run-time to CouchDB and have it return the objects that match (or the count of objects that match, for pagination)?
For example, on one page I want all posts with a doc.user_id of X that are doc.published. On another page I might want all documents with doc.tags[] containing the tag "sport".
You could build a view that iterates over the keys in the document and emits a key of [propertyName, propertyValue] - that way you're building a single index with every prop/value pair in it. It would be massive; no idea how long it would take to build, and disk usage would probably be bad.
Map function would look something like:
// note - totally untested, my CouchDB fu is rusty
function(doc) {
  for (var prop in doc) {
    emit([prop, doc[prop]], null);
  }
}
Works for the basic case of simple properties, and can be extended to be smart about arrays, emitting a prop/value pair for each item in the array (see the extended sketch below). That would let you handle the tags.
To query it, pass [prop, value] as the key for the view.
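The array-aware extension might look like this (equally untested):

function(doc) {
  for (var prop in doc) {
    var value = doc[prop];
    if (Array.isArray(value)) {
      // one row per array element, e.g. ["tags", "sport"]
      for (var i = 0; i < value.length; i++) {
        emit([prop, value[i]], null);
      }
    } else {
      emit([prop, value], null);
    }
  }
}

and query, e.g., for all docs tagged "sport":

.../_view/by_prop?key=["tags","sport"]&include_docs=true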
Basically, no.
The key difference between something like Couch and a SQL DB is that the only way to query in CouchDB is essentially through the views/indexes. Indexes in SQL are optional. They exist (mostly) to boost performance. For example, if you have a small DB, your app will run just fine on SQL with 0 indexes. (Might be some issue with unique constraints, but that's a detail.)
The overall point being is that part of the query processor in a SQL database includes other methods of data access beyond simply indexes, notably table scans, merge joins, etc.
Couch has no query processor. It has views (defined by JS) used to define B-Tree indexes.
And, that's it. That's the hammer of Couch. It's a good hammer. It has served the data processing world well for basically 40 years.
Indexes are somewhat expensive to create in Couch (based on data volume) which is why "temporary views" are frowned upon. And they have a cost in maintenance as well, so views need to be a conscious design element in your database. At the same time, they're a bit more powerful than normal SQL indexes as well.
You can readily add your own query processing on top of Couch, but that will be more work for you. You can create a few select views on your most popular or selective criteria, and then filter the resulting documents by other criteria in your own code (see the sketch below). Yes, you have to do it yourself, so you have to ask whether the effort involved is worth more than whatever benefits you feel Couch is offering you (HTTP API, replication, a safe, always-consistent datastore, etc.) over a SQL solution.
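A hedged sketch of that approach (the view name, field names, and URL are hypothetical):

// fetch the pre-selected rows from a view keyed on user_id ...
const res = await fetch('http://localhost:5984/db/_design/posts/_view/by_user?key=%22X%22&include_docs=true')
const rows = (await res.json()).rows
// ... then apply the remaining criteria in application code
const published = rows.filter(r => r.doc.published === true)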
I ran into a similar issue like this, and built a quick workaround using CouchDB-Python (which is a great library). It's not a pretty solution (goes against the principles of CouchDB), but it works.
CouchDB-Python gives you the function "Query", which allows you to "execute an ad-hoc temporary view against the database". You can read about it in the CouchDB-Python documentation.
What I do is store the JavaScript function as a string in Python, and then concatenate it with variable values that I define in Python.
In some_function.py:
variable = value  # must be a string that renders as a valid JavaScript literal
# Map function (in javascript)
map_fn = """function(doc) {
    <javascript code>
    var survey_match = """ + variable + """;
    <javascript code>
}"""
# Iterate through rows
for row in db.query(map_fn):
    <python code>
It sure isn't pretty, and probably breaks a bunch of CouchDB philosophies, but it works.
