How to get all documents from a collection in FaunaDB? - node.js

I already have an answer:
const faunadb = require('faunadb')
const q = faunadb.query

exports.handler = async (event, context) => {
  const client = new faunadb.Client({
    secret: process.env.FAUNADB_SERVER_SECRET
  })
  try {
    // Getting the refs with a first query
    let refs = await client.query(q.Paginate(q.Match(q.Index('skus'))))
    // Forging a second query with the retrieved refs
    const bigQuery = refs.data.map((ref) => q.Get(ref))
    // Sending over that second query
    let allDocuments = await client.query(bigQuery)
    // All my documents are here!
    console.log('#allDocuments: ', allDocuments);
    // ...
  } catch (err) {
    // ...
  }
}
But I find it unsatisfying because I'm making 2 queries for what seems like one of the most trivial DB calls. It seems inefficient and wordy to me.
As I'm just learning about FaunaDB, there's probably something I don't grasp here.
My question could be split into 3:
Can I query for all documents in a single call?
If not, why not? What's the logic behind such a design?
Could I make such a query without an index?

FaunaDB's FQL language is quite similar to JavaScript (which helps a lot if you want to do conditional transactions etc).
In essence, FaunaDB also has a Map. Given that your index contains only one value (the reference), you can write this:
q.Map(
  q.Paginate(q.Match(q.Index('skus'))),
  q.Lambda(x => q.Get(x))
)
For this specific case, you actually do not need an index since each collection has a built-in default index to do a select all via the 'Documents' function.
q.Map(
  q.Paginate(q.Documents(q.Collection('<your collection>'))),
  q.Lambda(x => q.Get(x))
)
Now, in case the index that you are using returns multiple values (because you want to sort on something other than 'ref'), you need to give the Lambda the same number of parameters as the number of values defined in the index. Let's say my index has ts and ref in values because I want to sort on time; then the query to get all values becomes:
q.Map(
  q.Paginate(q.Match(q.Index('<your index with ts and ref values>'))),
  q.Lambda((ts, ref) => q.Get(ref))
)
Values are used for range queries/sorting, but they also define what the index returns.
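For illustration, such a two-value index could be created like this (the index name is made up, not from the original question):

q.CreateIndex({
  name: 'skus_by_ts',
  source: q.Collection('skus'),
  values: [
    { field: ['ts'] },   // first value: timestamp, used for sorting
    { field: ['ref'] }   // second value: the reference that Get needs
  ]
})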
Coming back to your questions:
- Can I query for all documents in a single call?
Absolutely, and I would advise you to do so. Note that the documents you get back are paginated automatically. You can set the page size by providing a parameter to Paginate, and you will get back an 'after' or 'before' attribute in case there are more pages. That after or before value can be passed back to the Paginate function as a parameter to get the next or previous page: https://docs.fauna.com/fauna/current/api/fql/functions/paginate
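As a sketch, cursoring through pages could look like this (reusing the 'skus' collection from the question; the page size is arbitrary):

// First page of up to 100 documents...
const page1 = await client.query(
  q.Map(
    q.Paginate(q.Documents(q.Collection('skus')), { size: 100 }),
    q.Lambda(x => q.Get(x))
  )
)
// ...then hand the returned cursor back to Paginate for the next page.
if (page1.after) {
  const page2 = await client.query(
    q.Map(
      q.Paginate(q.Documents(q.Collection('skus')), { size: 100, after: page1.after }),
      q.Lambda(x => q.Get(x))
    )
  )
}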
- Could I make such a query without an index?
No, but you can use the built-in index as explained above. FaunaDB protects users from querying without an index. Since it is a scalable database that could contain massive data and is pay-as-you-go it's a good idea to prevent users from shooting themselves in the foot :). Pagination and mandatory Indexes help to do that.
As to why FQL is different: FQL is not declarative like many query languages. Instead it's procedural; you write exactly how you fetch data. That has advantages:
- By writing how data is retrieved, you can predict exactly how a query behaves, which is nice to have in a pay-as-you-go system.
- The same language can be used for security rules or complex conditional transactions (updating one entity or many entities ranging over different collections, depending on certain conditions). It's quite common in Fauna to write a query that does many things in one transaction.
- Our flavour of 'stored procedures', called User Defined Functions, are just written in FQL and not another language; a sketch follows below.
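As a small sketch of that last point (the function name and the pagination-size parameter are made up):

// Define the UDF once; its body is plain FQL.
q.CreateFunction({
  name: 'all_skus',
  body: q.Query(
    q.Lambda(
      'size',
      q.Map(
        q.Paginate(q.Documents(q.Collection('skus')), { size: q.Var('size') }),
        q.Lambda('x', q.Get(q.Var('x')))
      )
    )
  )
})

// Any client can then call it as a one-liner:
q.Call(q.Function('all_skus'), 100)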
Querying is also discussed in this tutorial that comes with code in a GitHub repository which might give you a more complete picture: https://css-tricks.com/rethinking-twitter-as-a-serverless-app/

Can I query for all documents in a single call?
Yes, if your collection is small. The Paginate function defaults to fetching 64 documents per page. You can adjust the page size up to 100,000 documents. If your collection has more than 100,000 documents, then you have to execute multiple queries, using cursors to fetch subsequent documents.
See the Pagination tutorial for details: https://docs.fauna.com/fauna/current/tutorials/indexes/pagination
If not, why not? What's the logic behind such a design?
For an SQL database, SELECT * FROM table is both convenient and, potentially, a resource nightmare. If the table contains billions of rows, attempting to serve results for that query could consume the available resources on the server and/or the client.
Fauna is a shared database resource. We want queries to perform well for any user with any database, and that requires that we put sensible limits on the number of documents involved in any single transaction.
Could I make such a query without an index?
No, and yes.
Retrieving multiple results from Fauna requires an index, unless you are independently tracking the references for documents. However, with the Documents function, Fauna maintains an internal index so you don't need to create your own index to access all documents in a collection.
See the Documents reference page for details: https://docs.fauna.com/fauna/current/api/fql/functions/documents
Returning to your example code, you are executing two queries, but they could easily be combined into one. FQL is highly composable. For example:
let allDocuments = await client.query(
  q.Map(
    q.Paginate(q.Documents(q.Collection("skus"))),
    q.Lambda("X", q.Get(q.Var("X")))
  )
)
Your observation that FQL is wordy is correct. Many functional languages exhibit that wordiness. The advantage is that any function that accepts expressions can be composed at will. One of the best examples of composability, and of how to manage inter-document references, is presented in our E-commerce tutorial, specifically the section describing the submit_order function: https://docs.fauna.com/fauna/current/tutorials/ecommerce#function

Related

Using and populating (real) DBRef arrays with Mongoose / mongoose-dbref

Mongoose doesn't appear to support Mongo DBRefs. Apparently they released "DBRef" support but it was actually just plain references (no ability to reference documents from different collections). I've finally managed to craft a schema that allows me to hold an array of ObjectID references and populate them, which is great for certain parts of my schema, but it would be extremely convenient if I could use proper DBRefs to create an array that lets me refer to documents from a number of collections.
Luckily(?) there's a module that can monkey patch DBRef support into mongoose: https://github.com/goulash1971/mongoose-dbref
Unluckily, I can't make any sense of the documents. The best I can tell is that there is no ability to use DBRefs in an array (there is a 'fetch' method to dereference, but it takes a single dbref); 'populate' doesn't seem to be patched to fill in DBRefs, and I can't tell how I'm supposed to assign a DBRef given a source document [collection.items.push(?????)].
From the internet, it appears that I can assign an object of the form { $id: document._id, $ref: 'Collection' } -- when logging the result, it appears to have "taken" as a DBRef data type, but I am unsure if this is correct since I cannot seem to do anything useful with it (turn the ref back into a document).
What I really want is a way to represent an ordered list of items from multiple collections; any solution to this is fine by me, but so far DBRefs are the best I've got. Help?
A DBRef (as explained in detail here) is a tuple containing the ObjectId, collection name, and possibly the database container name of a referenced object in another collection.
Internally in the MongoDB server these serve no purpose and are just data within a document. The point is for use in some drivers and ODM implementations to allow for some sort of automatic expansion by issuing additional queries to the server in order to have the data that is elsewhere appear to be an ordinary sub-document part of the referencing document. This can be automatic or a lazy load depending on the implementation, but is always done over the wire and processed on the client side. The server will do nothing to traverse or join this data.
Additionally, MongoDB collections are schemaless, so there is nothing as in the relational sense that says all documents in a collection have to have the same structure.
In the case of Mongoose, there are built-in functions to do this sort of loading for you as a convenience. While not strictly DBRefs, plain references (or documents with a different schema in the same collection) amount to the same thing: the referenced data is stored external to the referencing document.
It is important to consider the data access patterns of your application and not to simply opt for the same sort of relational design you are used to. Keeping in mind that you are only ever reading from one collection at a time, it is most desirable to get at the data you need in a single read or write, without multiple operations over the wire, which will slow things down considerably.
In short, you should always consider embedding sub-documents first, and then use external references, in whatever form is best supported, only when you absolutely have to. Your application users will thank you in the end.
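As a hedged sketch of that advice (schema and model names are made up):

const mongoose = require('mongoose');

const ItemSchema = new mongoose.Schema({ name: String, kind: String });

const ListSchema = new mongoose.Schema({
  // Embedded first: the whole list comes back in a single read.
  embeddedItems: [ItemSchema],
  // References only when you must: expanded client-side via populate().
  referencedItems: [{ type: mongoose.Schema.Types.ObjectId, ref: 'Item' }]
});

const Item = mongoose.model('Item', ItemSchema);
const List = mongoose.model('List', ListSchema);

// The dereferencing happens in the driver, not on the server:
List.findOne().populate('referencedItems').exec(function (err, list) {
  // list.referencedItems now contains full Item documents
});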

CouchDB map/reduce by any document property at runtime?

I come from a SQL world where lookups are done by several object properties (published = TRUE or user_id = X) and there are no joins anywhere (because of the 1:1 cache layer). It seems that a document database would be a good fit for my data.
I am trying to figure-out if there is a way to pass one (or more) object properties to a CouchDB map/reduce function to find matching documents in a database without creating dozens of views for each document type.
Is it possible to pass the desired document property key(s) to match at run-time to CouchDB and have it return the objects that match (or the count of object that match for pagination)?
For example, on one page I want all posts with a doc.user_id of X that are doc.published. On another page I might want all documents with doc.tags[] with the tag "sport".
You could build a view that iterates over the keys in the document, and emits a key of [propertyName, propertyValue] - that way you're building a single index with EVERY prop/value pair in it. It would be massive; no idea how the build performance or disk usage would be (probably bad).
Map function would look something like:
// note - totally untested, my CouchDB fu is rusty
function (doc) {
  for (var prop in doc) {
    emit([prop, doc[prop]], null);
  }
}
Works for the basic case of simple properties, and can be extended to be smart about arrays, and emit a prop/value pair for each item in the array. That would let you handle the tags.
To query on it, set [propertyName, propertyValue] as your query key on the view.
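As an untested sketch of the array-aware variant mentioned above (same caveats as the original function):

function (doc) {
  for (var prop in doc) {
    var val = doc[prop];
    if (Object.prototype.toString.call(val) === '[object Array]') {
      // One row per array element, so ?key=["tags", "sport"] works.
      for (var i = 0; i < val.length; i++) {
        emit([prop, val[i]], null);
      }
    } else {
      emit([prop, val], null);
    }
  }
}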
Basically, no.
The key difference between something like Couch and a SQL DB is that the only way to query in CouchDB is essentially through the views/indexes. Indexes in SQL are optional. They exist (mostly) to boost performance. For example, if you have a small DB, your app will run just fine on SQL with 0 indexes. (Might be some issue with unique constraints, but that's a detail.)
The overall point being is that part of the query processor in a SQL database includes other methods of data access beyond simply indexes, notably table scans, merge joins, etc.
Couch has no query processor. It has views (defined by JS) used to define B-Tree indexes.
And, that's it. That's the hammer of Couch. It's a good hammer. It has served the data processing world for basically 40 years.
Indexes are somewhat expensive to create in Couch (based on data volume) which is why "temporary views" are frowned upon. And they have a cost in maintenance as well, so views need to be a conscious design element in your database. At the same time, they're a bit more powerful than normal SQL indexes as well.
You can readily add your own query processing on top of Couch, but that will be more work for you. You can create a few select views, on your most popular or selective criteria, and then filter the resulting documents by other criteria in your own code. Yes, you have to do it, so you have to question whether the effort involved is worth more than whatever benefits you feel Couch is offering you (HTTP API, replication, safe, always consistent datastore, etc.) over a SQL solution.
I ran into a similar issue like this, and built a quick workaround using CouchDB-Python (which is a great library). It's not a pretty solution (goes against the principles of CouchDB), but it works.
CouchDB-Python gives you the function "Query", which allows you to "execute an ad-hoc temporary view against the database". You can read about it here
What I have is that I store the JavaScript function as a string in Python, and then concatenate it with variable values that I define in Python.
In some_function.py:

variable = value

# Map function (in JavaScript), built up as a Python string
map_fn = """function(doc) {
    <javascript code>
    var survey_match = """ + variable + """;
    <javascript code>
}"""

# Iterate through the rows returned by the ad-hoc view
for row in db.query(map_fn):
    <python code>
It sure isn't pretty, and probably breaks a bunch of CouchDB philosophies, but it works.

CouchDB view query with multiple key values

I am currently trying to create a view and query to fit this SQL query:
SELECT * FROM articles
WHERE articles.location="NY" OR articles.location="CA"
ORDER BY articles.release_date DESC
I tried to create a view with a complex key:
function(doc) {
  if(doc.type == "Article") {
    emit([doc.location, doc.release_date], doc)
  }
}
And then using startkey and endkey to retrieve one location and ordering the result on the release date.
.../_view/articles?startkey=["NY", {}]&endkey=["NY"]&limit=5&descending=true
This works fine.
However, how can I send multiple startkeys and endkeys to my view in order to mimic
WHERE articles.location="NY" OR articles.location="CA" ?
My arch nemesis, Dominic, is right.
Furthermore, it is never possible to query by criteria A and then sort by criteria B in CouchDB. In exchange for that inconvenience, CouchDB guarantees scalable, dependable, logarithmic query times. You have a choice.
- Store the view output in its own database, and make a new view to sort by criteria B
- or, sort the rows afterward, which can be done either:
  - client-side, once you receive the rows, or
  - server-side, in a _list function. This is great, but remember it's not ultimately scalable: if you have millions of rows, the _list function will probably crash.
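For the server-side option, a minimal untested sketch of such a _list function (it assumes the view above, which emits the whole doc as the value, so release_date is reachable there):

function (head, req) {
  var row, rows = [];
  while ((row = getRow())) {
    rows.push(row);
  }
  // Re-sort by criteria B: newest release_date first.
  rows.sort(function (a, b) {
    return a.value.release_date < b.value.release_date ? 1 : -1;
  });
  send(JSON.stringify({ rows: rows }));
}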
The short answer is, you currently cannot use multiple startkey/endkey combinations.
You'll either have to make 2 separate queries, or you could always add on the lucene search engine to get much more robust searching capabilities.
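If you take the two-query route, a rough client-side sketch (the design doc path, view name, and use of the fetch API are assumptions, not from the original answer):

// Query the view once per location, merge the pages, then re-sort by
// release_date (the second element of the emitted key), newest first.
async function articlesFor(locations) {
  const pages = await Promise.all(locations.map(async (loc) => {
    const url = '/db/_design/app/_view/articles' +
      '?startkey=' + encodeURIComponent(JSON.stringify([loc, {}])) +
      '&endkey=' + encodeURIComponent(JSON.stringify([loc])) +
      '&descending=true&limit=5';
    const res = await fetch(url);
    return (await res.json()).rows;
  }));
  return pages
    .reduce((all, rows) => all.concat(rows), [])
    .sort((a, b) => (a.key[1] < b.key[1] ? 1 : -1));
}

// usage: const rows = await articlesFor(['NY', 'CA'])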
It is possible to use multiple key parameters in a query. See the Couchbase CouchDB documentation on multi-document fetching.

Approaches to generate auto-incrementing numeric ids in CouchDB

Since CouchDB does not have support for SQL alike AUTO_INCREMENT what would be your approach to generate sequential unique numeric ids for your documents?
I am using numeric ids for:
- User-friendly IDs (e.g. TASK-123, RQ-001, etc.)
- Integration with libraries/systems that require a numeric primary key
I am aware of the problems with replication, etc. That's why I am interested in how people try to overcome this issue.
As Dominic Barnes says, auto-increment integers are not scalable, not distributed-friendly or cloud-friendly. It seems every app nowadays needs a mobile version with offline support, and that is not directly compatible with auto-increment integers. We all know this, but it's true: auto-increment integers are necessary for legacy code and arguably other stuff.
In both scenarios, you are responsible for producing the auto-incrementing integer. A view runs emit(the_numeric_id, null). (You could also add a "type" namespace, e.g. emit([doc.type, the_numeric_id], null).) Query for the final row (e.g. with startkey=MAXINT&descending=true&limit=1), increment the value returned, and that is your next id. The attempt to save happens in a loop which can retry if there was a collision.
You can also play tricks if you don't need 100% density of the list of IDs. For example, you can add timestamps to the emit() rows, and estimate the document creation velocity, and increment by that velocity times your computation and transmit time. You could also simply increment by a random integer between 1 and N, so most of the time the first insert works, at a cost of non-homogeneous ID numbers.
About where to store the integer, I think there is the id strategy and the try and check strategy.
The id strategy is simpler and quicker in the short term. Document IDs are an integer (perhaps prefixed with a type to add a namespace). Since Couch guarantees uniqueness on the _id field, you just worry about the auto-incrementing. Do this in a loop: 409 Conflict triggers a retry, 201 Created means you're done.
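A minimal untested sketch of that loop against CouchDB's HTTP API (the database name, view name, and use of the fetch API are my assumptions):

// Read the current max id from a descending view, try to claim max+1
// as the new document's _id, and retry whenever someone beats us to it.
async function insertWithNextId(doc) {
  for (;;) {
    const res = await fetch('/db/_design/app/_view/by_int_id?descending=true&limit=1');
    const rows = (await res.json()).rows;
    const nextId = rows.length ? rows[0].key + 1 : 1;
    const put = await fetch('/db/' + nextId, {
      method: 'PUT',
      body: JSON.stringify(doc)
    });
    if (put.status === 201) return nextId;      // 201 Created: id claimed
    if (put.status !== 409) throw new Error('unexpected status ' + put.status);
    // 409 Conflict: another writer took this id; loop and retry
  }
}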
I think the major pain with this trick is, that if and when you get conflicts, you have two completely unrelated documents, and one of them must be copied into a fresh document. If there were relationships with other documents, they must all be corrected. (The CouchDB 0.11 emit(key, {_id: some_foreign_doc_id}) trick comes to mind.)
The try-and-check strategy uses the default UUID as the doc._id, so every insert will succeed. Ideally, all or most of your inter-document relations are based on the immutable UUID _id, not the integer; that is just used for users and UI. The auto-incrementing integer is simply a field in the document, {"int_id":20}. The view of course does emit(doc.int_id, null). (You can look up a document by integer id with a ?key=23&include_docs=true parameter on the view.)
Of course, after a replication, you might have id conflicts (not official CouchDB conflicts, but just documents using the same numeric id). The view which emits by ID would also have a reduce phase: simply _count should be enough. Next you must patrol the DB, querying this view with ?group=true and looking for any row (corresponding to an integer id) which has a count > 1. On the plus side, correcting the numeric id of a document is a minor change because it does not require new document creation.
Those are my ideas. Now that I wrote them down, I feel like you must do relation-shepherding regardless of where the id is stored; so perhaps using _id is better after all. The only other downside I see is that you are permanently married to a fundamentally broken naming model—for some definition of "permanently."
Is there any particular reason you want to use numeric IDs over the UUIDs that CouchDB can generate for you? UUIDs are perfect for the distributed paradigm that CouchDB uses, stick with what is built in.
If you find yourself with any more than 1 CouchDB node in your architecture, you're going to get conflicting document IDs if you rely on something like "auto increment" when it comes time for replication. Even if you're only using 1 node now, that's probably not always going to be the case, especially since CouchDB works so well in a distributed and "offline" architecture.
I have had pretty good luck just using an iso formatted date as my key:
http://wiki.apache.org/couchdb/IsoFormattedDateAsDocId
It's pretty simple to do, human-readable and it basically builds in a few querying options by just existing. :-)
Keeping in mind the issues around replication and conflicts, you can use an update function to generate incrementing IDs that are guaranteed unique in a single master setup.
function(doc, req) {
  if (!doc) {
    doc = {
      _id: req.id,
      type: 'idGenerator',
      count: 0
    };
  }
  doc.count++;
  return [doc, toJSON(doc.count)];
}
Include this function in a design document like so:
{
  "_id": "_design/application",
  "language": "javascript",
  "updates": {
    "generateId": "function (doc, req) {\n\t\t\tif (!doc) {\n\t\t\t\tdoc = {\n\t\t\t\t\t_id: req.id,\n\t\t\t\t\ttype: 'idGenerator',\n\t\t\t\t\tcount: 0\n\t\t\t\t};\n\t\t\t}\n\n\t\t\tdoc.count++;\n\t\t\t\n\t\t\treturn [doc, toJSON(doc.count)];\n\t\t}"
  }
}
Then call it like so:
curl -XPOST http://localhost:5984/mydb/_design/application/_update/generateId/entityId
Replace entityId with whatever you like to create several independent ID sequences.
Not a perfect solution, but something that worked for me: create an independent service that generates auto-incremented ids. Yes, you could say "this breaks the offline model of couchdb", but what if you get a pool of N ids that you can then use whenever you need a new auto-incremented id? Then every time you're online you get some more ids, and if you are running out of ids you tell your users: please go online. If the pool is big enough (say, a month's traffic) this shouldn't happen. Again, not perfect, but maybe helpful to some people. A sketch follows.
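A hedged sketch of the pool idea (the reservation endpoint is entirely hypothetical):

// Keep a local pool of reserved ids; refill it from the central
// service while online, hand ids out locally in the meantime.
let pool = [];

async function nextId() {
  if (pool.length === 0) {
    const res = await fetch('/id-service/reserve?count=1000');
    pool = await res.json(); // assumed to return e.g. [5001, 5002, ...]
  }
  return pool.shift();
}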
Instead of explicitly constructing an increasing integer key, you could use the implicit index couchDB accepts for paging.
The skip parameter accepts an integer that will effectively provide the auto-incrementing index you are used to.
http://wiki.apache.org/couchdb/HTTP_view_API#Querying_Options
The drawback is that it is not a viable solution for "User-friendly IDs". The index is not tied to the doc, and is subject to change if you are rewriting history.
If your only constraint is "integration with libraries/systems that require numeric primary key", this will bridge the gap without losing the benefits of couchDB's key structure.

representing a many-to-many relationship in couchDB

Let's say I'm writing a log analysis application. The main domain object would be a LogEntry. In addition, users of the application define a LogTopic which describes what log entries they are interested in. As the application receives log entries it adds them to couchDB, and also checks them against all the LogTopics in the system to see if they match the criteria in the topic. If they do, the system should record that the entry matches the topic. Thus, there is a many-to-many relationship between LogEntries and LogTopics.
If I were storing this in a RDBMS I would do something like:
CREATE TABLE Entry (
  id int,
  ...
)
CREATE TABLE Topic (
  id int,
  ...
)
CREATE TABLE TopicEntryMap (
  entry_id int,
  topic_id int
)
Using CouchDB I first tried having just two document types. I'd have a LogEntry type, looking something like this:
{
  'type': 'LogEntry',
  'severity': 'DEBUG',
  ...
}
and I'd have a LogTopic type, looking something like this:
{
  'type': 'LogTopic',
  'matching_entries': ['log_entry_1','log_entry_12','log_entry_34',....],
  ...
}
You can see that I represent the relationship by using a matching_entries field in each LogTopic documents to store a list of LogEntry document ids. This works fine up to a point, but I have issues when multiple clients are both attempting to add a matching entry to a topic. Both attempt optimistic updates, and one fails. The solution I'm using now is to essentially reproduce the RDBMS approach, and add a third document type, something like:
{
  'type': 'LogTopicToLogEntryMap',
  'topic_id': 'topic_12',
  'entry_id': 'entry_15'
}
This works, and gets past the concurrent update issues, but I have two reservations:
1. I worry that I'm just using this approach because it's what I'd do in a relational DB. I wonder if there's a more couchDB-like (relaxful?) solution.
2. My views can no longer retrieve all the entries for a specific topic in one call. My previous solution allowed that (if I used the include_docs parameter).
Anyone have a better solution for me? Would it help if I also posted the views I'm using?
I cross-posted this question to the couchdb users mailing list and Nathan Stott pointed me to a very helpful blog post by Christopher Lenz
Your approach is fine. Using CouchDB doesn't mean you'll just abandon relational modeling. You will need to run two queries, but that's because this is a "join". SQL queries with joins are also slow, but the SQL syntax lets you express the query in one statement.
In my few months of experience with CouchDB this is what I've discovered:
- No schema, so designing the application models is fast and flexible
- CRUD is there, so developing your application is fast and flexible
- Goodbye SQL injection
- What would be a SQL join takes a little bit more work in CouchDB
Depending on your needs I've found that couchdb-lucene is also useful for building more complex queries.
I'd try setting up the relation so that LogEntrys know to which LogTopics they belong. That way, inserting a LogEntry won't produce conflicts as the LogTopics won't need to be changed.
Then, a simple map function would emit the LogEntry once for each LogTopic it belongs to, essentially building up your TopicEntryMap on the fly:
"map": function (doc) {
doc.topics.map(function (topic) {
emit(topic, doc);
});
}
This way, querying the view with a ?key=<topic> argument will give you all the entries that belong to a topic.
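For illustration, the map function above assumes LogEntry documents carry their topic ids directly, something like this (the 'topics' field name comes from the code above; the rest is made up):

{
  'type': 'LogEntry',
  'severity': 'DEBUG',
  'topics': ['topic_12', 'topic_31'],
  ...
}

Since the view emits the whole doc as the value, a single ?key="topic_12" request returns the full entries without even needing include_docs, which also addresses the second reservation.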
