Is CouchDB a fit for a query like this? - couchdb

I have been looking for an escape from GAE as the datastore does not support a lot of the things I want to do with it.
So I have looked at CouchDB (among others) and I really like the REST interface and the hosting option I found at Cloudant.
But for all my googling and reading any docs I could find, I still am not sure if it is a good fit.
So I come here in the hope that someone might have more insight.
I write web apps and a lot of the projects I want to do will involve a query that looks like this:
Find all entries that are within a user-input-lat/long bounding box and where start-time is less than user-input-time-1 and end-time is greater than user-input-time-2 and has all tags in user-input-list-of-tags.
That's not even pseudocode, but I hope it makes sense anyway.
I am not just looking for a "You cannot do that in CouchDB"; I'd like some kind of explanation, and perhaps something like "If you can live without the tags then you can do this:"
I would like to use the Cloudant service, so GeoCouch is apparently out of the question, but they offer something that is supposed to work like Lucene. Does that mean the queries are slow?
As you can tell, I am a bit confused here, so just do your best to straighten me out and I'll be grateful :)

Setting aside the tags (which are already a problem in themselves), what you describe is a multi-dimensional query: you have several "coordinates" (lat, long, start-time, end-time) and provide a range for each of them.
On its own, CouchDB cannot perform multi-dimensional queries at all — you only get single-dimension queries across one coordinate.
Tags certainly are possible, but it depends on whether you need documents that have at least one tag from the list, or documents that have all tags in the list. The first case is easy (run one query per tag using the bulk API); the second might require excessive amounts of memory (a document with N tags needs to emit 2^N - 1 tag-sets in order to match every possible tag combination involving it, so you should place an upper bound on either the number of tags in a document or the number of tags in a query).
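To make the "all tags" case concrete, here is a minimal sketch of such a view, assuming each document carries a `tags` array (the field name is my assumption, not from the question):

```javascript
// Minimal sketch of the "all tags" approach described above.
// The map function emits every non-empty, sorted subset of the document's tags,
// so querying with key=<sorted list of tags> returns exactly the documents
// that contain all of those tags.
function (doc) {
  if (!doc.tags || !doc.tags.length) return;
  var tags = doc.tags.slice().sort();
  var subsets = [[]];
  for (var i = 0; i < tags.length; i++) {
    var len = subsets.length;
    for (var j = 0; j < len; j++) {
      subsets.push(subsets[j].concat(tags[i]));   // extend every existing subset with this tag
    }
  }
  for (var k = 1; k < subsets.length; k++) {      // skip the empty subset at index 0
    emit(subsets[k], null);                       // 2^N - 1 emits for N tags
  }
}
```

A request like `?key=["couchdb","geo"]` (with the tags sorted) then returns only the documents tagged with both, which is exactly why the number of emits explodes as the tag count grows.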
Lucene does allow multi-dimensional and keyword-based queries, though I cannot vouch for their performance.
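For completeness, a rough sketch of what the Lucene route might look like on Cloudant (the field names, design doc name, and index name are my assumptions; check the exact syntax against Cloudant's search documentation). You define an index function in a design document, and the bounding box, the two time bounds, and the tags all become clauses of a single query:

```javascript
// Cloudant Search index function (stored in a design doc, e.g. _design/search,
// under an index named "events"); all names here are placeholders.
function (doc) {
  if (doc.lat !== undefined)   index('lat', doc.lat);
  if (doc.lon !== undefined)   index('lon', doc.lon);
  if (doc.start !== undefined) index('start', doc.start);
  if (doc.end !== undefined)   index('end', doc.end);
  (doc.tags || []).forEach(function (t) { index('tag', t); });
}
```

The query against `/db/_design/search/_search/events` would then be something along the lines of `q=lat:[50 TO 51] AND lon:[3 TO 4] AND start:[-Infinity TO T1] AND end:[T2 TO Infinity] AND tag:music AND tag:live` (URL-encoded in practice).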

Related

MongoDB most efficient Query Strategy

I should state that I have already looked in the Mongo documentation, but I have not found what I am looking for. I've also read similar questions, but they always deal with very simple queries. I'm working with the native MongoDB driver for Node. This is a scalability problem, so the collections I am talking about can have millions of records or just a few dozen.
Basically I have a query and I need to validate all results (which have a complex structure). Two possible solutions come to mind:
I create a query that is as specific as possible and try to validate the results directly on the server
I use a cursor to go through the documents one by one from the client (this would also allow me to stop early if I am looking for only one result; see the sketch after this list)
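A minimal sketch of the second option, assuming a driver version whose cursors are async-iterable; the collection name, database name, and filter are placeholders, not your actual code:

```javascript
// Sketch of option 2: iterate a cursor from the client and stop early.
const { MongoClient } = require('mongodb');

async function findFirstValid(uri, isValid) {
  const client = await MongoClient.connect(uri);
  try {
    const cursor = client.db('mydb').collection('items')
      .find({ /* as specific a filter as the schema allows */ });
    // The cursor pulls documents from the server in batches; the next batch is
    // only requested once the current one has been consumed.
    for await (const doc of cursor) {
      if (isValid(doc)) return doc;   // stop as soon as one valid result is found
    }
    return null;
  } finally {
    await client.close();
  }
}
```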
Here is the question: what is the most efficient way in terms of latency, overall time, bandwidth use, and computational load on the server and client? There is probably no single answer; in fact, I'd like to understand the pros and cons of the different approaches (and which one you would recommend). I know the solution should be determined case by case, but I am trying to figure out what would best cover most cases.
Also, to be more specific:
A) Since it is a complex query (several nested objects with allowed ranges and lists of values), performing the validation on the server would certainly save bandwidth, but is it always possible? And in terms of computation, could it be more efficient to do it on the client?
B) I don't understand the cursor behavior: is it a stream that stays open until it is closed by the server or the client? Also, does next() consume resources on the server/client ahead of time, or only at the moment of the call?
If anyone knows, I'd also like to know how Mongoose solved these "problems", for example in the case of custom validators.

Search documents that contain some text, but keep information about what field matches

I'm learning Node.js and MongoDB, and I'm learning by solving some problems. I want to make a site that can search a video database. Each video has a title, description, author, and a subarray of notes (you can think of them as comments). Each note has a subarray of manual references to tag documents that exist in a tags collection.
I need to search for some text in the videos collection. For each resulting video I need to know whether the search criteria match some of the basic fields (author, title, description), or whether some of its notes, including the names of their tags, match the criteria. Or both.
I know that this may not be the right task for a beginner, but I would really like to make this work. I have some ideas about how to do this, but they are probably not good since I don't know much about Mongo and its capabilities.
What do you suggest, what should I use? Should I use the text search capabilities plus some aggregation? Should I offload some of the work to the application rather than to Mongo?
I probably don't need details, just directions.
Thank you.
Since nobody has answered, I decided to share my idea and how I did it. There is probably a better solution; that is why I asked this question.
I did two separate queries using regex and merged the results in application code.
I used an ES6 Map to make a union of the two result sets.
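A sketch of that approach, with illustrative collection and field names (not necessarily the exact ones used):

```javascript
// Sketch of the two-query + merge approach described above; the collection and
// field names (`videos`, `title`, `notes.text`, ...) are illustrative only.
async function searchVideos(db, text) {
  const re = new RegExp(text, 'i');

  // Query 1: regex match against the basic fields.
  const basicMatches = await db.collection('videos')
    .find({ $or: [{ title: re }, { description: re }, { author: re }] })
    .toArray();

  // Query 2: regex match against the notes (tag names assumed denormalised
  // into the note, or resolved in a separate step).
  const noteMatches = await db.collection('videos')
    .find({ 'notes.text': re })
    .toArray();

  // Union of the two result sets via a Map keyed by _id, keeping track of
  // which query (or both) each video came from.
  const byId = new Map();
  for (const v of basicMatches) {
    byId.set(String(v._id), { video: v, matchedBasicFields: true, matchedNotes: false });
  }
  for (const v of noteMatches) {
    const entry = byId.get(String(v._id));
    if (entry) entry.matchedNotes = true;
    else byId.set(String(v._id), { video: v, matchedBasicFields: false, matchedNotes: true });
  }
  return [...byId.values()];
}
```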

Implement site-wide search with neo4j db using node-neo4j

I am using node-neo4j to communicate with my Neo4j database. Following github.com/aseemk/node-neo4j-template was a real help to get started. Still learning my way around, I am looking to solve a few issues, and I'd appreciate any pointers you can give me.
Implement site wide search.
We have users indexed by their email IDs, and we want to index stories/posts by tags or keywords. How do we search across all nodes? Do we maintain indices for all nodes of the various types? What would be a good approach? Should I go with Google to enable this feature? How do I index the same node with multiple tags/keywords?
Specify custom id's for nodes
We are fine with integer IDs for nodes, but since these IDs can be re-used, we would like to identify nodes with unique IDs. Is there a way to make Neo4j use UUIDs? Adding a uid attribute would do, but we want to avoid having to maintain two IDs.
Traversing nodes
How do we traverse nodes using node-neo4j? Cypher looks like the answer, but I have yet to get used to it. Does node-neo4j support it out of the box?
Transactions
I may sound silly, but can I do transactional operations with node-neo4j?
Too many questions, I know; I feel most of my doubts will clear up once I get more used to querying the db, but any input from you will give me a head start.
You probably should have broken this up into separate questions. I can answer a couple of them but not all.
Yes, node-neo4j can handle Cypher out of the box, with the query method: https://github.com/thingdom/node-neo4j/blob/develop/lib/GraphDatabase._coffee#L179. For help with Cypher, you should watch this intro video: http://vimeopro.com/neo4j/webinars/video/48603403
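A minimal sketch of what that looks like in practice (the relationship type and property names below are made up for illustration):

```javascript
// Sketch of running Cypher through node-neo4j's query method (callback-style API).
var neo4j = require('neo4j');
var db = new neo4j.GraphDatabase('http://localhost:7474');

var cypher = 'START user = node({userId}) ' +
             'MATCH (user)-[:WROTE]->(post) ' +
             'RETURN post';

db.query(cypher, { userId: 123 }, function (err, results) {
  if (err) throw err;
  results.forEach(function (row) {
    console.log(row['post'].data);   // each column holds a Node; .data has its properties
  });
});
```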
For your uuid, you probably should add a separate attribute to the nodes, and have an index on it--just ignore the regular ids except during transient queries where it's more convenient. As far as I know there's no way to override the incrementing ID--that sure would be nice, though.
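For the uid attribute, something along these lines should work with node-neo4j's legacy indexing API (the `node-uuid` module, index name, and properties are my assumptions):

```javascript
// Sketch: store an application-level uuid on each node and index it, so lookups
// never depend on the internal (re-usable) integer id. Names are placeholders.
var uuid = require('node-uuid');

var node = db.createNode({ uid: uuid.v4(), email: 'alice@example.com' });
node.save(function (err, saved) {
  if (err) throw err;
  saved.index('nodes', 'uid', saved.data.uid, function (err) {
    if (err) throw err;
    // Later: db.getIndexedNode('nodes', 'uid', someUid, callback) to look it up.
  });
});
```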
Hope that helps.

In CouchDB, are there ways to improve performance of the View index process?

I have some basic views and some map/reduce views with logic. Nothing too complex. Not too many documents. I've tried with 250k, 75k, and 10k documents. Seems like I'm always waiting for view indexing.
Does better, more efficient code in the view help? I'm assuming the view is basically processed at all levels of aggregation, so there must be some room for improvement there.
Does emit()-ing less data help? emit(doc.id, doc) vs specifying fewer fields?
Do more or less complex keys impact view indexing?
Or is it all about memory, CPU cores, and processor speed?
There must be some documentation out there, but I can't find anything referencing ways to improve performance.
I would take a deeper look at the reduce function. Try to use the built-in Erlang functions like _sum and _count instead of writing JavaScript.
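For example, emit a 1 per tag from the map and set the view's reduce to the built-in "_sum" rather than a JavaScript function (the field names here are illustrative):

```javascript
// Map function: one row per tag per document (doc.tags is an assumed field).
// Pair this with "reduce": "_sum" in the design document instead of a
// hand-written JavaScript reduce.
function (doc) {
  (doc.tags || []).forEach(function (t) {
    emit(t, 1);
  });
}
```

Queried with group=true this gives a count per tag, with the reduce handled natively rather than by the JavaScript view server.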
Complex views can take hours or more; that's normal.
Maybe post one of these "not too complex" map/reduce functions.
And don't forget: indexing all docs is only done once after changing the view (or pushing a whole bunch of new docs). Subsequent new docs are indexed incrementally.
Use a view with stale=ok to retrieve the "old" data instantly, so you don't have to wait. (But pay attention: you always have to call the view without stale=ok at least once to trigger the indexing process.) Or better: use stale=update_after.
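For instance, a request like the sketch below returns whatever is currently in the index immediately and triggers a refresh afterwards (database, design doc, and view names are placeholders):

```javascript
// Sketch: querying a view with stale=update_after from Node.
var http = require('http');

http.get('http://localhost:5984/mydb/_design/stats/_view/count_by_tag?stale=update_after',
  function (res) {
    var body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
      console.log(JSON.parse(body).rows);   // possibly out-of-date rows, returned instantly
    });
  });
```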
The code you write in views is more like CREATE INDEX than SELECT. It should be irrelevant how long it takes, as long as the view builds keep up with the document change rate. Building a view is a sunk (one-time) cost.
When you query the view, that is always a B-tree lookup, which operates against a static data set in logarithmic time. That is usually the performance people care about more (in production).
If you are not seeing behavior like I describe, perhaps we could discuss your view functions and your general approach to your problem. CouchDB is very different from relational databases. In the latter, you have highly structured data and free-form queries. In CouchDB, you have free-form data but highly structured index definitions (views). Except during development, changing and rebuilding views should be rare.
Not emitting anything at all helps the most, but when that can't be avoided, building the view in smaller batches (there are scripts that do this automatically) helps more than anything else.

riak search result excerpt

Does anybody know if Riak Search has the ability to generate an excerpt with highlight points in it, similar to what Lucene does?
Riak Search doesn't expose this functionality out of the box, but with a little work you can create a rough approximation.
Riak Search allows you to feed search results into a MapReduce job. If you do this, then your Map or Reduce function will also get a list of token positions in the document that matched the query (this is exposed as keydata, http://www.basho.com/search.php?q=keydata). Using these positions, you can write code to mark up the document or excerpt portions of text.
I think this functionality will hardly ever be implemented in Riak, since its philosophy implies that it doesn't care about what exactly is stored in the values and therefore does not process them in any meaningful way beyond providing some metadata like indices.
