Including documents in the emit compared to include_docs = true in CouchDB - couchdb

I ran across a mention somewhere that doing an emit(key, doc) will increase the amount of time an index takes to build (or something to that effect).
Is there any merit to it, and is there any reason not to just always do emit(key, null) and then include_docs = true?

Yes, it will increase the size of your index, because CouchDB effectively copies the entire document in those cases. For cases in which you can, use include_docs=true.
There is, however, a race condition to be aware of when using this that is mentioned in the wiki. It is possible, during the time between reading the view data and fetching the document, that said document has changed (or has been deleted, in which case _deleted will be true). This is documented here under "Querying Options".

This is a classic time/space tradeoff.
Emitting document data into your index will increase the size of the index file on disk because CouchDB includes the emitted data directly into the index file. However, this means that, when querying your data, CouchDB can just stream the content directly from the index file on disk. This is obviously quite fast.
Relying instead on include_docs=true will decrease the size of your on-disk index, it's true. However, on querying, CouchDB must perform a document read for every returned row. This involves essentially random document lookups from the main data file, meaning that the cost and time of returning data increases significantly.
While the query time difference for small numbers of documents is slow, it will add up over every call made by the application. For me, therefore, emitting needed fields from a document into the index is usually the right call -- disk is cheap, user's attention spans less so. This is broadly similar to using covering indexes in a relational database, another widely echoed piece of advice.
I did a totally unscientific test on this to get a feel for what the difference is. I found about an 8x increase in response time and 50% increase in CPU when using include_docs=true to read 100,000 documents from a view when compared to a view where the documents were emitted directly into the index itself.

Related

Sorting enormous dataset

I have an enormous dataset (over 300 million documents). It is a system for archiving data and rollback capability.
The rollback capability is a cursor which iterates trough the whole dataset and performs few post requests to some external end points, it's a simple piece of code.
The data being iterated over needs to be send ordered by the timestamp (filed in the document). The DB was down for some time, so backup DB was used, but has received older data which has been archived manually, and later all was merged with the main DB.
Older data breaks the order. I need to sort this dataset, but the problem is the size; there is not enough RAM available to perform this operation at once. How I can achieve this sorting?
PS: The documents do not contain any indexed fields.
There's no way to do an efficient sort without an index. If you had an index on the date field then things would already be sorted (in a sense), so getting things in a desired order is very cheap (after the overhead of the index).
The only way to sort all entries without an index is to fetch the field you want to sort for every single document and sort them all in memory.
The only good options I see are to either create an index on the date field (by far the best option) or increase the RAM on the database (expensive and not scalable).
Note: since you have a large number of documents it's possible that even your index wouldn't be super scalable -- in that case you'd need to look into sharding the database.

MongoDB (+ Node/Mongoose): How to track processed records without "marking" all of them?

I have several large "raw" collections of documents which are processed in a queue, and the processed results are all placed into a single collection.
The queue only runs when the system isn't otherwise indisposed, and new data is being added into the "raw" collections all the time.
What I need to do is make sure the queue knows which documents it has already processed, so it doesn't either (a) process any documents more than once, or (b) skip documents. Updating each raw record with a "processed" flag as I go isn't a good option because it adds too much overhead.
I'm using MongoDB 4.x, with NodeJS and Mongoose. (I don't need a strictly mongoose-powered answer, but one would be OK).
My initial attempt was to do this by retrieving the raw documents sorted by _id in a smallish batch (say 100), then grabbing the first and last _id values in the return result, and storing those values, so when I'm ready to process the next batch, I can limit my find({}) query to records with an _id greater than what I stored as the last-processed result.
But looking into it a bit more, unless I'm misunderstanding something, it appears I can't really count on a strict ordering by _id.
I've looked into ways to implement an auto-incrementing numeric ID field (SQL style), which would have a strict ordering, but the solutions I've seen look like they add a nontrivial amount of overhead each time I create a record (not dissimilar to what it would take to mark processed records, just would be on the insertion end instead of the processing end), and this system needs to process a LOT of records very fast.
Any ideas? Is there a way to do an auto-incrementing numeric ID that's super efficient? Will default _id properties actually work in this case and I'm misunderstanding? Is there some other way to do it?
As per the documentation of ObjectID:
While ObjectId values should increase over time, they are not
necessarily monotonic. This is because they:
Only contain one second of temporal resolution, so ObjectId values created within the same second do not have a guaranteed ordering, and
Are generated by clients, which may have differing system clocks.
So if you are creating that many records per second then _id ordering is not for you.
However Timestamp within a mongo instance is guaranteed to be unique.
BSON has a special timestamp type for internal MongoDB use and is not
associated with the regular Date type. Timestamp values are a 64 bit
value where:
the first 32 bits are a time_t value (seconds since the Unix epoch)
the second 32 bits are an incrementing ordinal for operations within a
given second.
Within a single mongod instance, timestamp values are always unique.
Although it clearly states that this is for internal use it maybe something for you to consider. Assuming you are dealing with a single mongod instance you can decorate your records when they are getting into the "raw" collections with timestamps ... then you could remember the last processed record only. Your queue would only pick records with timestamps larger that the last processed timestamp.

How to reduce the Querying time for larger size data in MongoDB

I am trying to querying the collection in MongoDB which matches more than 10000 data for the query. Even though I have used index, the querying time exceeds 25 seconds.
For example, I am having a table People with field name, age.
I need to fetch the People data whose age is 25, if query finds the matched objects is 10000, then it takes time to fetch the whole data.
I have created index like db.people.createIndex({"age":1})
Here, how can I reduce the querying time
run db.collection.find().explain() and make sure that your index is in fact used. Make sure that you do not have COLLSCANs there https://docs.mongodb.com/manual/reference/explain-results/.
if your documents have some/many large attributes and you need only some attributes try to request only them (e.g. only _id or _id and name). Less data transferred gives higher speed.
if your db does not fit in memory, make it fit in memory. Once the database does not fit the performance will be much worse.
if you are not running on a sharded cluster, create one based on a reasonable sharding key. Age may not be a good one because than all age=25 documents will end up on one node. Even if you have one computer with multiple CPUs it still may work better for you (if you have enough memory for that). It may even work the other way around. If you have a sharded cluster on one computer and your replicas do not fit in the memory, it may be better to use just one node.

CouchDB _change Feed

for my application I implemented a logical seperation of my documents with a type attribute. I have several views. I implemented for every view a dedicated change feed which gets triggerd if a certain document was added or updated. At the moment the performance is quite well, do I have to expect a slow down in the future?
Well, every filter function associated with your feed is executed once for each new (or updated) document. So, you may expect a slowdown with a large number of concurrent inserts and updates. It's not something related to the database dimension, but to the number of concurrent updates.

Why is search performance is slow for about 1M documents - how to scale the application?

I have created a search project that based on lucene 4.5.1
There are about 1 million documents and each of them is about few kb, and I index them with fields: docname(stored), lastmodified,content. The overall size of index folder is about 1.7GB
I used one document (the original one) as a sample, and query the content of that document against index. the problems now is each query result is coming up slow. After some tests, I found that my queries are too large although I removed stopwords, but I have no idea how to reduce query string size. plus, the smaller size the query string is, the less accurate the result comes.
This is not limited to specific file, because I also tested with other original files, the performance of search is relatively slow (often 1-8 seconds)
Also, I have tried to copy entire index directory to RAMDirectory while search, that didn't help.
In addition, I have one index searcher only across multiple threads, but in testing, I only used one thread as benchmark, the expected response time should be a few ms
So, how can improve search performance in this case?
Hint: I'm searching top 1000
If the number of fields is large a nice solution is to not store them then serialize the whole object to a binary field.
The plus is, when projecting the object back out after query, it's a single field rather than many. getField(name) iterates over the entire set so O(n/2) then getting the values and setting fields. Just one field and deserialize.
Second might be worth at something like a MoreLikeThis query. See https://stackoverflow.com/a/7657757/277700

Resources