I come from a SQL world where lookups are done by several object properties (published = TRUE or user_id = X) and there are no joins anywhere (because of the 1:1 cache layer). It seems that a document database would be a good fit for my data.
I am trying to figure out if there is a way to pass one (or more) object properties to a CouchDB map/reduce function to find matching documents in a database, without creating dozens of views for each document type.
Is it possible to pass the desired document property key(s) to match at run-time to CouchDB and have it return the objects that match (or the count of objects that match, for pagination)?
For example, on one page I want all posts with a doc.user_id of X that are doc.published. On another page I might want all documents whose doc.tags[] array contains the tag "sport".
You could build a view that iterates over the keys in the document and emits a key of [propertyName, propertyValue] - that way you're building a single index with every prop/value pair in it. It would be massive; I have no idea how it would perform to build, and disk usage would probably be bad.
The map function would look something like:
// note - totally untested, my CouchDB fu is rusty
function (doc) {
  for (var prop in doc) {
    emit([prop, doc[prop]], null);
  }
}
This works for the basic case of simple properties, and it can be extended to be smart about arrays and emit a prop/value pair for each item in the array; that would let you handle the tags.
To query it, pass the desired [propertyName, propertyValue] pair as the key parameter of the view.
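Here is a hedged sketch of that extended, array-aware version (untested, like the original; it assumes array properties hold scalar values such as tag strings and skips CouchDB's internal underscore fields):
// sketch: emits [propertyName, value] for scalars and one row per array item
function (doc) {
  for (var prop in doc) {
    if (prop[0] === '_') continue; // skip _id, _rev, etc.
    var value = doc[prop];
    if (Array.isArray(value)) {
      // e.g. ["tags", "sport"] for each tag
      for (var i = 0; i < value.length; i++) {
        emit([prop, value[i]], null);
      }
    } else {
      // e.g. ["user_id", "X"]
      emit([prop, value], null);
    }
  }
}
You would then query the view with, say, ?key=["user_id","X"] or ?key=["tags","sport"] (JSON, URL-encoded), adding &include_docs=true to fetch the documents themselves; and if you also define the built-in _count reduce on the view, the same request with reduce=true returns just the number of matches for pagination.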
Basically, no.
The key difference between something like Couch and a SQL DB is that the only way to query in CouchDB is essentially through the views/indexes. Indexes in SQL are optional. They exist (mostly) to boost performance. For example, if you have a small DB, your app will run just fine on SQL with 0 indexes. (Might be some issue with unique constraints, but that's a detail.)
The overall point is that the query processor in a SQL database includes other methods of data access beyond indexes, notably table scans, merge joins, etc.
Couch has no query processor. It has views (defined by JS) used to define B-Tree indexes.
And, that's it. That's the hammer of Couch. It's a good hammer. It has served the data-processing world well for basically 40 years.
Indexes are somewhat expensive to create in Couch (based on data volume) which is why "temporary views" are frowned upon. And they have a cost in maintenance as well, so views need to be a conscious design element in your database. At the same time, they're a bit more powerful than normal SQL indexes as well.
You can readily add your own query processing on top of Couch, but that will be more work for you. You can create a few select views on your most popular or selective criteria, and then filter the resulting documents by other criteria in your own code. Yes, you have to do it yourself, so you have to ask whether the effort involved is worth more than whatever benefits you feel Couch is offering you (HTTP API, replication, a safe, always-consistent datastore, etc.) over a SQL solution.
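As a rough sketch of that "filter in your own code" approach, assuming a hypothetical posts database with a byUserId view that emits doc.user_id as its key (all names here are made up):
// Query one selective CouchDB view over HTTP, then filter further in app code.
async function publishedPostsByUser(userId) {
  const url = 'http://localhost:5984/posts/_design/app/_view/byUserId'
    + '?key=' + encodeURIComponent(JSON.stringify(userId))
    + '&include_docs=true';
  const res = await fetch(url);
  const body = await res.json();
  // The view only indexes user_id; the "published" criterion is applied here.
  return body.rows.map(row => row.doc).filter(doc => doc.published === true);
}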
I ran into a similar issue and built a quick workaround using CouchDB-Python (which is a great library). It's not a pretty solution (it goes against the principles of CouchDB), but it works.
CouchDB-Python gives you the function "Query", which allows you to "execute an ad-hoc temporary view against the database". You can read about it here
What I do is store the JavaScript function as a string in Python and then concatenate it with variable values that I define in Python.
In some_function.py:
# "variable" must already be a string containing a valid JavaScript literal,
# since it is spliced directly into the view source below.
variable = value

# Map function (in JavaScript), assembled as a string around the Python value
map_fn = """function(doc) {
    <javascript code>
    var survey_match = """ + variable + """;
    <javascript code>
}"""

# Iterate through the matching rows
for row in db.query(map_fn):
    <python code>
It sure isn't pretty, and probably breaks a bunch of CouchDB philosophies, but it works.
I already have an answer:
const faunadb = require('faunadb')
const q = faunadb.query

exports.handler = async (event, context) => {
  const client = new faunadb.Client({
    secret: process.env.FAUNADB_SERVER_SECRET
  })
  try {
    // Getting the refs with a first query
    let refs = await client.query(q.Paginate(q.Match(q.Index('skus'))))
    // Forging a second query with the retrieved refs
    const bigQuery = refs.data.map((ref) => q.Get(ref))
    // Sending over that second query
    let allDocuments = await client.query(bigQuery)
    // All my documents are here!
    console.log('#allDocuments: ', allDocuments);
    //...
  } catch (err) {
    // ...
  }
}
But I find it unsatisfying because I'm making two queries for what seems like one of the most trivial DB calls. It seems inefficient and wordy to me.
As I'm just learning about FaunaDB, there's probably something I don't grasp here.
My question could be split into 3:
Can I query for all documents in a single call?
If not, why not? What's the logic behind such a design?
Could I make such a query without an index?
FaunaDB's FQL language is quite similar to JavaScript (which helps a lot if you want to do conditional transactions etc).
In essence, FaunaDB also has a Map. Given that your index contains only one value, which is the reference, you can write this:
q.Map(
  q.Paginate(q.Match(q.Index('skus'))),
  q.Lambda(x => q.Get(x))
)
For this specific case, you actually do not need an index since each collection has a built-in default index to do a select all via the 'Documents' function.
q.Map(
  q.Paginate(q.Documents(q.Collection('<your collection>'))),
  q.Lambda(x => q.Get(x))
)
Now, in case the index that you are using returns multiple values (because you want to sort on something other than 'ref'), you need to provide the same number of parameters to the Lambda as the number of values that were defined in the index. Let's say my index has ts and ref in values because I want to sort on time; then the query to get all values becomes:
q.Map(
  q.Paginate(q.Match(q.Index('<your index with ts and ref values>'))),
  q.Lambda((ts, ref) => q.Get(ref))
)
Values are used for range queries and sorting, but they also define what the index returns.
Coming back to your questions:
- Can I query for all documents in a single call?
Absolutely, and I would advise you to do so. Note that the documents you get back are paginated automatically. You can set the page size by providing a parameter to Paginate, and you will get back an 'after' or 'before' attribute in case the result set is bigger than the page. That after or before cursor can be passed to the Paginate function as a parameter to get the next or previous page: https://docs.fauna.com/fauna/current/api/fql/functions/paginate
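For instance, with the JavaScript driver (the page size of 100 and the use of the built-in Documents index are just illustrative choices):
// First page of 100 documents.
const firstPage = await client.query(
  q.Map(
    q.Paginate(q.Documents(q.Collection('skus')), { size: 100 }),
    q.Lambda(ref => q.Get(ref))
  )
)

// If more documents exist, firstPage.after is the cursor for the next page.
const nextPage = await client.query(
  q.Map(
    q.Paginate(q.Documents(q.Collection('skus')), { size: 100, after: firstPage.after }),
    q.Lambda(ref => q.Get(ref))
  )
)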
- Could I make such a query without an index?
No, but you can use the built-in index as explained above. FaunaDB protects users from querying without an index. Since it is a scalable database that could contain massive amounts of data and is pay-as-you-go, it's a good idea to prevent users from shooting themselves in the foot :). Pagination and mandatory indexes help with that.
As to why FQL is different: FQL is not declarative like many query languages. Instead it's procedural; you write exactly how you fetch data. That has advantages:
By writing how data is retrieved, you can predict exactly how a query behaves, which is nice to have in a pay-as-you-go system.
The same language can be used for security rules or complex conditional transactions (update certain entities or many entities ranging over different collections depending on certain conditions). It's quite common in Fauna to write a query that does many things in one transaction.
Our flavour of 'stored procedures', called User Defined Functions, is written in FQL rather than another language.
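As a small illustration (the function name all_skus and its pageSize parameter are made up for this sketch), the "get all documents" query above can be stored once as a UDF and then called with a single short expression:
// Create the UDF once.
await client.query(
  q.CreateFunction({
    name: 'all_skus',
    body: q.Query(
      q.Lambda(
        'pageSize',
        q.Map(
          q.Paginate(q.Documents(q.Collection('skus')), { size: q.Var('pageSize') }),
          q.Lambda(ref => q.Get(ref))
        )
      )
    )
  })
)

// Call it from any client.
const allDocuments = await client.query(q.Call(q.Function('all_skus'), 100))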
Querying is also discussed in this tutorial that comes with code in a GitHub repository which might give you a more complete picture: https://css-tricks.com/rethinking-twitter-as-a-serverless-app/
Can I query for all documents in a single call?
Yes, if your collection is small. The Paginate function defaults to fetching 64 documents per page. You can adjust the page size up to 100,000 documents. If your collection has more than 100,000 documents, then you have to execute multiple queries, using cursors to fetch subsequent documents.
See the Pagination tutorial for details: https://docs.fauna.com/fauna/current/tutorials/indexes/pagination
If not, why not? What's the logic behind such a design?
For an SQL database, SELECT * FROM table is both convenient and, potentially, a resource nightmare. If the table contains billions of rows, attempting to serve results for that query could consume the available resources on the server and/or the client.
Fauna is a shared database resource. We want queries to perform well for any user with any database, and that requires that we put sensible limits on the number of documents involved in any single transaction.
Could I make such a query without an index?
No, and yes.
Retrieving multiple results from Fauna requires an index, unless you are independently tracking the references for documents. However, with the Documents function, Fauna maintains an internal index so you don't need to create your own index to access all documents in a collection.
See the Documents reference page for details: https://docs.fauna.com/fauna/current/api/fql/functions/documents
Returning to your example code, you are executing two queries, but they could easily be combined into one. FQL is highly composable. For example:
let allDocuments = await client.query(
  q.Map(
    q.Paginate(q.Documents(q.Collection("skus"))),
    q.Lambda("X", q.Get(q.Var("X")))
  )
)
Your observation that FQL is wordy is correct. Many functional languages exhibit that wordiness. The advantage is that any functions that accept expressions can be composed at will. One of the best examples of composability, and of how to manage inter-document references, is presented in our E-commerce tutorial, specifically the section describing the submit_order function: https://docs.fauna.com/fauna/current/tutorials/ecommerce#function
I know this question was asked before, but at the time we had EF Core 2.x. The short answer was "no, you can't", which obviously is not very helpful.
The other answers involved ugly hacks like changing migration files after they were created by the tool.
I'm building an application Code First. My models were created with lots of foreign keys and database joins in mind.
But here comes the unpleasant surprise (I'm a little new to EF): those joins written in LINQ are pretty slow; as a matter of fact they do not produce a database JOIN but fetch whole tables instead.
Of course that's totally unacceptable. I import an old database with millions of records; with the joins I get results in milliseconds, without them I get lags of a couple of seconds, and that's on my very fast internet connection (in a real-world scenario it would be much worse).
I need views, and AFAIK EF won't create them for me. Is that still true for EF Core 3.0?
Then, what would be the cleanest way to create views in SQL and to make entities for them - considering that the database models will change over time and the database structure will have to be updated?
Well, I would prefer not doing my joins in SQL views; I'd rather just have queries return JOIN results, especially for some non-obvious joins. Let's say table B has a column that is a foreign key referencing table A. I want to get results from table A joined with B for details, with normal SQL JOIN performance.
I checked the database: there is no significant performance difference between "select * from A" and "select * from A join B...". In LINQ, the difference is huge.
I figured out that in Code First database views are redundant.
The "views" can be created as models (ordinary classes) that have a field or property set to the joined entity; I use private fields for that purpose. Then I use LINQ Join() to create my view entity. The query may refer ONLY to the fields set to joined entities, nothing else. Such a query, if written properly, translates cleanly to a SQL JOIN and runs at full speed. In my application it's the equivalent of a database view.
Why private fields and not properties, you may ask. Partly because joined entities are "implementation details", but another reason is that my presentation code uses reflection to operate on entities' public properties, so it's good to have those joined entities hidden from it. Otherwise I would probably need attributes to hide those "columns".
BTW, such views can be ordered with OrderBy() and filtered with Where() at virtually no cost. The constraint is to maintain the collection's IQueryable interface and never refer to joined entities indirectly. So even if X refers to A.B, never reference X in a LINQ query; always use A.B, where A is the direct entity reference assigned in the Join() query.
To build dynamic queries at runtime one must use expressions.
This set of features of EF Core 3.0 lets you build a database application without writing SQL while keeping full SQL speed. However, the database/entity structure must be relatively simple to achieve that.
I am using Azure DocumentDB and all my experience in NoSql has been in MongoDb. I looked at the pricing model and the cost is per collection. In MongoDb I would have created 3 collections for what I was using: Users, Firms, and Emails. I noted that this approach would cost $24 per collection per month.
I was told by the people I work with that I'm doing it wrong: I should have all three of those things stored in a single collection with a field to describe what the data type is, and each collection should be partitioned by date or geographic area so one part of the world has a smaller portion to search,
and to:
"Combine different types of documents into a single collection and add a field across all to separate them in searching, like a type field or something."
I would never have dreamed of doing that in Mongo, as it would make indexing, shard keys, and other things hard to get right.
There might not be many fields that overlap between the objects (for example, the Email and Firm objects).
I can do it this way, but I can't seem to find a single example of anyone else doing it that way - which indicates to me that maybe it isn't right. Now, I don't need an example, but can someone point me to some location that describes which is the 'right' way to do it? Or, if you do create a single collection for all data - other than Azure's pricing model, what are the advantages / disadvantages in doing that?
Any good articles on DocumentDb schema design?
Yes. In order to leverage CosmosDB to its full potential, you need to think of a collection as an entire database system and not as a "table" designed to hold only one type of object.
Sharding in Cosmos is exceedingly simple. You just specify a field that all of your documents will populate and select that as your partition key. If you pick a generic property name such as key or partitionKey, you can easily separate the storage of your inbound emails, your users, and anything else by choosing appropriate values.
class InboundEmail
{
    public string Key { get; set; } = "EmailsPartition";
    // other properties
}

class User
{
    public string Key { get; set; } = "UsersPartition";
    // other properties
}
What I'm showing is still only an example though. In reality your partition key values should be even more dynamic. It's important to understand that queries against a known partition are extremely quick. As soon as you need to scan across multiple partitions you'll see much slower and more costly results.
So, in an app that ingests a lot of user data, keeping a single user's activity together in one partition might make sense for that particular entity.
If you want evidence that this is the appropriate way to use CosmosDB, consider the addition of the new Gremlin Graph APIs. Graphs are inherently heterogeneous, as they contain many different entities and entity types as well as the relationships between them. The query boundary of Cosmos is at the collection level, so if you tried putting your entities into different collections, none of the Graph APIs or queries would work.
EDIT:
I noticed in the comments you made this statement: "And you would have an index on every field in both objects." CosmosDB does automatically index every field of every document. It uses a proprietary path-based indexing mechanism that ensures every path of your JSON tree has indices on it. You have to specifically opt out of this auto-indexing feature.
I am new to the NoSQL concept, so when I started to learn PouchDB I found this conversion chart. My confusion is: how does PouchDB handle it if, let's say, I have multiple tables? Does that mean I need to create multiple databases? Because from my understanding, in PouchDB a database can store a lot of documents, but does a document mean a row in SQL, or have I misunderstood?
The answer to this question seems to be surprisingly under-documented. While #llabball clearly gave a decent answer, I don't think that views are always the way to go.
As you can read here in the section When not to use map/reduce, Nolan explains that for simpler applications, the key is to abuse _ids, and leverage the power of allDocs().
In other words, if you had two separate types (say artists and albums), then you could prefix the id of each type to obtain an easily searchable data set. For example, _id: 'artist_name' and _id: 'album_title' would allow you to easily retrieve artists in name order.
Laying out the data this way results in better performance, since no extra indexes are required, and less code. Clearly, however, if your data requirements are more complex, then views are the way to go.
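A minimal sketch of that _id-prefix pattern (the database name and document fields are invented):
const PouchDB = require('pouchdb');
const db = new PouchDB('music');

async function demo() {
  // Two "types" distinguished only by their _id prefix.
  await db.bulkDocs([
    { _id: 'artist_anton_karas', name: 'Anton Karas' },
    { _id: 'album_the_third_man', title: 'The Third Man', artist: 'artist_anton_karas' }
  ]);

  // Fetch only the artists, in name order, via an _id range - no extra index needed.
  const artists = await db.allDocs({
    startkey: 'artist_',
    endkey: 'artist_\ufff0',
    include_docs: true
  });
  console.log(artists.rows.map(row => row.doc.name));
}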
... does it mean that i need to create multiple databases?
No.
... a document mean a row in sql or am i misunderstood?
That's right. The SQL table defines the column headers (name and type); those correspond to the JSON property names of the doc.
So, all docs (rows) with the same properties (a so-called "schema") are the equivalent of your SQL table. You can have as many different schemata in one database as you want (visit json-schema.org for some inspiration).
How do you request them separately? Create CouchDB views! You can get all/some "rows" of your tabular data (docs with the same schema) with one request, just as you know it from SQL.
To write such views easily, a type property is very common in CouchDB docs. The table name you know from SQL can become your type, e.g. doc.type: "animal".
Your view names might then be animalByName or animalByWeight, depending on your needs.
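A sketch of the map function behind such a view, following the doc.type convention above (the emitted key is an assumption):
// Map function for a view like "animalByName":
// only docs of the "animal" schema end up in this index, keyed by name.
function (doc) {
  if (doc.type === 'animal') {
    emit(doc.name, null);
  }
}
Querying that view returns just the "rows" of that one schema, already sorted by the key, and ?include_docs=true gives you the full docs.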
Sometimes multiple-databases plan is a good option, like a database per user or even a database per user-feature. Take a look at this conversation on CouchDB mailing list.
So for a new project, I'm building a system for an ecommerce site. The idea is to import products from suppliers and instead of inserting them directly into our catalog, we would store all the information in a staging area. Each supplier has their own stage (i.e. table in the database), and then I will flatten the multiple staging areas into a single entity (currently a single table but later on perhaps into Sphinx or Solr). Then our merchandisers would be able to search the staging products' relevant fields (name and description) and be shown a list of products that match and then choose to have those products pushed into the live catalog. The search will query on the single table (the flattened staging areas).
My design calls for storing only searchable and filterable fields in the single flattened table - e.g. name, description, supplier_id, supplier_prod_id, etc. The search queries will return only the IDs of the matching items plus a class (supplier_id) that identifies which staging area each product is from.
Another senior engineer feels the flattened search table should include other meta fields (which would not be searched on, but could be used when 'pushing' the products from stage to the live catalog). He also feels that the query should return all this other information.
I feel pretty strongly about only having searchable fields in the flattened table and having the search return only class/id pairs, which could then be used to fetch all the other necessary metadata about the product (a simple select * from class_table where id in (1,2,3)).
Part of my reasoning is that this will make it easier later on to switch the flattened table from database to a search server like sphinx or solr and the rest of the code wouldn't have to be changed just because implementation of the search changed.
Am I on the right path? How can I convince the other engineer why it is important to keep only searchable fields and return only IDs? Or, more specifically, why should a search application return only the IDs of matching objects?
I think that you're on the right path. If those other fields provide no value to either uniquely identify a staged item or to allow the user to filter staged items, then the data is fundamentally useless until the item is pushed to the live environment. If the other engineer feels that the extra metadata will help the users make a more informed decision, then you might as well make those extra fields searchable (thereby meeting your stated purpose for the table(s).)
The only reason I could think of to pre-fetch that other, non-searchable data would be for a performance improvement on the push to the live environment.
You should use each tool for what it does best. A full text search engine, such as Solr or Sphinx, excels at searching textual fields and ranking the hits quickly. It has no special advantage in retrieving stored data in a select-like fashion. A database is optimized for that. So, yes, you are on the right path. Please see Search Engine versus DBMS for other issues involved in deciding what to store inside the search engine.
In the case of Sphinx, it only returns document IDs and named attributes back to you anyway (attributes being numerical data, for the most part). I'd say you've got the right idea, as the other metadata is just a simple JOIN away from the flattened table if you need it.
You can regard Solr as a powerful index, so since an index gives IDs back, it is logical that Solr does the same.
You can use the Solr query parameter fl to ask for identifier-only results, for instance fl=id.
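For example, a sketch of such a request from JavaScript (host, core name, and the name field are placeholders):
// Ask Solr for matching identifiers only (fl=id); fetch full rows from the DB afterwards.
async function searchIds(term) {
  const params = new URLSearchParams({
    q: 'name:' + term, // hypothetical searchable field
    fl: 'id',          // return identifiers only
    wt: 'json'
  });
  const res = await fetch('http://localhost:8983/solr/products/select?' + params);
  const body = await res.json();
  return body.response.docs.map(doc => doc.id);
}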
However, there is one feature that requires Solr to give you back some data too: highlighting of the search terms in the matched documents. If you don't need it, then using Solr to retrieve only the identifiers is fine (I assume you need only the document list and no other features like facets, related docs, or spell checking).
That said, what matters is how you build your objects in your search function: either from the DB, using Solr only to retrieve IDs, or from the fields Solr returns (provided they're stored), or even a mix of both - think Solr for the "highlighted" content fields and the DB for the other ones. Again, if you don't need highlighting, this is not an issue.
I'm using Solr with thousands of documents but only return the IDs, for the following reasons:
For Solr:
- if some sync mistake happens, it's not a big deal (in your case especially, displaying a wrong price could be a big issue; this way it's as if the item just isn't in the right place in the results, but the data shown are right)
- you will save a lot of time when you don't ask Solr to return the 'description' of documents (I mean many lines of text)
For your DB:
- you can cache your results, so it's even faster with an ID (you don't need all the data from Solr every time!)
- you build your results in the same way (you don't need one method to build HTML from Solr results and another from your DB)
I think there is a lot more...