ArangoDB - Document Revision Increases Size on Disk

The requirement is to update documents stored in ArangoDB with new values every minute across a large dataset. Reading and updating the documents is fast and works correctly, but with every update the size of the documents on disk keeps growing gradually.
I found that a document revision is a mechanism that keeps track of the previous state of a record before each new update. According to the official documentation on revisions, this is not configurable and is managed by ArangoDB itself.
The question is: if the data is updated every minute, will the size on disk keep growing over time?
Does ArangoDB clear out previous revisions periodically? If yes, how frequently does that happen?
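For concreteness, here is a minimal sketch of the once-a-minute update pattern described above, using ArangoDB's HTTP document API from Python. The collection name, document key and credentials are placeholders, not part of the original question.

# Minimal sketch of the per-minute update pattern; "metrics", "sensor-1"
# and the credentials below are hypothetical placeholders.
import time
import requests

DOC_URL = "http://localhost:8529/_db/_system/_api/document/metrics/sensor-1"
AUTH = ("root", "password")  # assumed credentials

while True:
    # PATCH performs a partial update; ArangoDB assigns a new _rev each time.
    resp = requests.patch(DOC_URL, json={"value": 42, "ts": int(time.time())}, auth=AUTH)
    resp.raise_for_status()
    print("new revision:", resp.json()["_rev"])
    time.sleep(60)  # one update per minute, as in the question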

Related

Why does delete/insert result in a larger database size than using upsert?

I have a table with 1.6 million records. This table has multiple interleaved tables. Periodically I receive updates. If I apply an update by deleting and then inserting, the database size ends up approximately 35% larger than when using UPSERT.
The database retention period is set to 1 hour. Even after the retention period has passed, the database size does not go down.
Any idea why this is?
Update: The backup size for the database is the same regardless of the update method.
Thanks
The database retention period, referred to as version_retention_period in the official documentation, is the period for which all versions of the data are guaranteed to exist.
There are various background processes that perform compaction and similar housekeeping; depending on the load on the database, they can take anywhere from a few hours to a few days to run. So your observation is as expected.
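For reference, the retention window itself is just a database option that can be changed with a schema update. Below is a hedged sketch using the Cloud Spanner Python client; the instance and database names are placeholders.

# Hedged sketch: adjust version_retention_period via a DDL statement.
# "my-instance" and "my-database" are placeholders.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")

op = database.update_ddl(
    ["ALTER DATABASE `my-database` SET OPTIONS (version_retention_period = '1h')"]
)
op.result()  # wait for the schema change to complete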

CouchDB taking a lot of space due to revisions

We have a project that involves syncing a database with PouchDB on mobile devices. We have run into an issue when updating many documents (8,400 docs per minute): internal storage keeps growing by around 20 MB per minute.
We figured out that one main reason for this is CouchDB revisions, so we decided to lower the database _revs_limit to around 5. But we have heard that this may impact the replication process between CouchDB and PouchDB. My first question is:
how does lowering the revision limit impact the replication process?
We also found that views take up more space than the document storage itself. My second question: is there any way to reduce CouchDB view size?
Your data model (fast updates) doesn't play to CouchDB's strengths. Even after compaction, old revisions (including tombstones) take up space. CouchDB is happiest when using small, immutable documents. Such a model is also less likely to suffer from update conflicts.
Look at your documents: can they be broken apart so that updates become new document writes? Typical indicators are nested objects or arrays that grow in documents over time.
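As a rough illustration of that model (not from the original answer), here is a sketch that writes one small, immutable document per change through CouchDB's HTTP API; the database name, credentials and payload fields are made up for the example.

# Sketch of the "new document per change" model; database name, credentials
# and the reading payload are illustrative only.
import time
import requests

DB_URL = "http://admin:password@127.0.0.1:5984/readings"

def record_reading(device_id, value):
    # Each change becomes a small immutable document instead of another
    # revision of one ever-growing document.
    doc = {"device_id": device_id, "value": value, "ts": int(time.time())}
    resp = requests.post(DB_URL, json=doc)
    resp.raise_for_status()
    return resp.json()["id"]

record_reading("sensor-42", 3.14)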

MongoDB to store vehicle tracking information

I am building an app similar to Uber for tracking vehicles. Since the update frequency is so high (accounting for many users), I want to know the general practices for making writes to a MongoDB collection faster.
I am maintaining a database to store historical location information from all vehicles, but it is bound to grow very fast once we go into production. I also need to get the list of vehicles closest to a point. For this, should I maintain a separate table (with one row per vehicle) that gets updated after every location update, or is there a better/faster way to do this using the existing table?
Two separate collections would likely be the best option here.
A vehicles collection which includes the current location. It could even include the 50 most recent location entries, added with $push and $slice to not have unbounded array growth. http://docs.mongodb.org/manual/reference/operator/update/slice/#up._S_slice
A locationHistory collection which includes all previous vehicle movements. You could index this by vehicle ID, and/or date.
One thing you definitely want to avoid is having an unbounded array inside a document:
{
  _id: ObjectID,
  VIN: String,
  pastLocations: [ /* ...unbounded array... */ ]
}
When mongodb allocates space for a new vehicle entry, it will use an average of the existing vehicle sizes to determine how much disk space to allocate. Having vastly different sizes of vehicle entries (some move more than others, or are newer etc) will negatively impact performance, and cause a lot more page faults.
The key here is that you're trying to avoid page faults. Keeping 50 entries of vehicle history (if they're just GPS coordinates) as a subdocument array isn't super huge. Keeping an entire year's worth of history that could be more than 1MB would be a big deal (heh) and cause page faults all the time when accessing different vehicles.
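A minimal PyMongo sketch of the pattern described above (two collections plus a $push/$slice-bounded array); the collection and field names are illustrative, not prescribed by the answer.

# Sketch of the bounded-array pattern: current position + last 50 points in
# the vehicle document, full history in a separate collection.
from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient()["fleet"]  # database name is a placeholder

def record_position(vin, lon, lat):
    point = {"loc": [lon, lat], "ts": datetime.now(timezone.utc)}

    # $slice: -50 keeps only the 50 most recent entries, so the array in the
    # vehicle document never grows without bound.
    db.vehicles.update_one(
        {"VIN": vin},
        {
            "$set": {"lastLocation": point},
            "$push": {"recentLocations": {"$each": [point], "$slice": -50}},
        },
        upsert=True,
    )

    # Full movement history goes to its own collection, indexed by VIN/date.
    db.locationHistory.insert_one({"VIN": vin, **point})

record_position("1HGCM82633A004352", -73.97, 40.77)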
I did some extensive data loading of 20 GB+ into MongoDB over a couple of months (deployed the latest stable version in Aug. 2014). I noticed that the database became corrupted on Windows (using high-performance storage, iSCSI over Fibre Channel), so the MongoDB service just stopped and could not be started again. I can still reproduce the issue by reaching high data loads. I cannot recommend MongoDB for any production deployment; I hope you can find a better DBMS.
Performance should get better in MongoDB thanks to the WiredTiger integration in the newest version (not stable yet: http://blog.mongodb.org/post/102461818738/announcing-mongodb-2-8-0-rc0-release-candidate-and).

CouchDB .view file growing out of control?

I recently encountered a situation where my CouchDB instance used all available disk space on a 20GB VM instance.
Upon investigation I discovered that a directory in /usr/local/var/lib/couchdb/ contained a bunch of .view files, the largest of which was 16GB. I was able to remove the *.view files to restore normal operation. I'm not sure why the .view files grew so large or how CouchDB manages them.
A bit more information: I have a VM running Ubuntu 9.10 (karmic) with 512 MB of RAM and CouchDB 0.10. The VM has a cron job which invokes a Python script that queries a view. The cron job runs once every five minutes. Every time the view is queried, the size of the .view file increases. I've written a job to monitor this on an hourly basis, and after a few days I don't see the file rolling over or otherwise decreasing in size.
Does anyone have any insights into this issue? Is there a piece of documentation I've missed? I haven't been able to find anything on the subject but that may be due to looking in the wrong places or my search terms.
CouchDB is very disk hungry, trading disk space for performance. Views will increase in size as items are added to them. You can recover disk space that is no longer needed with cleanup and compaction.
Every time you create, update or delete a document, the view indexes will be updated with the relevant changes to the documents. The update to the view happens when it is queried. So if you are making lots of document changes, you should expect your index to grow, and it will need to be managed with compaction and cleanup.
If your views are very large for a given set of documents then you may have poorly designed views. Alternatively, your design may just require large views, and you will need to manage that as you would any other resource.
It would be easier to tell what is happening if you could describe what document updates (including creates and deletes) are happening and what your view functions are emitting, especially for the large view.
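For reference, compaction and view cleanup can be triggered over CouchDB's HTTP API; a small sketch follows, with the database and design document names as placeholders.

# Sketch: trigger database compaction, view compaction and view cleanup.
# "mydb", "mydesign" and the credentials are placeholders.
import requests

BASE = "http://admin:password@127.0.0.1:5984/mydb"
HEADERS = {"Content-Type": "application/json"}

# Compact the database itself (reclaims space held by old revisions).
requests.post(BASE + "/_compact", headers=HEADERS).raise_for_status()

# Compact the view index of one design document.
requests.post(BASE + "/_compact/mydesign", headers=HEADERS).raise_for_status()

# Remove index files for views that no longer exist.
requests.post(BASE + "/_view_cleanup", headers=HEADERS).raise_for_status()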
Your .view files grow each time you access a view because CouchDB updates views on access. CouchDB views need compaction just like databases do. If you have frequent changes to your documents, resulting in changes to your views, you should run view compaction from time to time. See http://wiki.apache.org/couchdb/HTTP_view_API#View_Compaction
To reduce the size of your views, have a look at the data you are emitting. When you emit(foo, doc) the entire document is copied into the view so that it is instantly available when you query the view. The function(doc) { emit(doc.title, doc); } will result in a view as big as the database itself. You could instead emit(doc.title, null); and use the include_docs option to let CouchDB fetch the document from the database when you access the view (which comes with a slight performance penalty). See http://wiki.apache.org/couchdb/HTTP_view_API#Querying_Options
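A short sketch of that pattern over the HTTP API (the database, design document and view names are placeholders): the map function emits only a light value, and include_docs=true fetches the documents at query time.

# Sketch: keep the view small by emitting null, fetch docs with include_docs.
# "articles", "light" and "by_title" are placeholder names.
import requests

DB = "http://admin:password@127.0.0.1:5984/articles"

design = {
    "views": {
        "by_title": {"map": "function (doc) { emit(doc.title, null); }"}
    }
}
# Create the design document (returns 409 if it already exists).
requests.put(DB + "/_design/light", json=design).raise_for_status()

# include_docs=true makes CouchDB load each document at query time, keeping
# the view index itself small at a slight read-time cost.
resp = requests.get(DB + "/_design/light/_view/by_title",
                    params={"include_docs": "true"})
for row in resp.json()["rows"]:
    print(row["key"], row["doc"]["_id"])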
Use sequential or monotonic IDs for documents instead of random ones
Yes, CouchDB is very disk hungry, and it needs regular compaction. But there is another thing that can help reduce this disk usage, especially when it is unnecessary.
CouchDB uses B+ trees for storing data/documents, which is a very good data structure for retrieval performance. However, the B+ tree trades disk space for that performance. With completely random IDs, the B+ tree fans out quickly: since the minimum fill rate is 1/2 for every internal node, the nodes stay mostly only about half full (as the data spreads evenly due to its randomness), generating more internal nodes. New insertions can also cause a rewrite of the full tree. That's what randomness can cause ;)
Using sequential or monotonic IDs instead avoids all of this.
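One simple way to get roughly monotonic IDs (purely illustrative, not from the original answer) is to prefix a random suffix with a zero-padded timestamp:

# Illustrative only: time-prefixed IDs sort by creation time, so new
# documents land near the right edge of the B+ tree instead of randomly.
import time
import uuid

def monotonic_id():
    return "%016d-%s" % (int(time.time() * 1e6), uuid.uuid4().hex[:8])

print(monotonic_id())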
I've had this problem too, trying out CouchDB for a browser-based game.
We had about 100,000 unexpected visitors on the first day of a site launch, and within 2 days the CouchDB database was taking up about 40 GB of space. This made the server crash because the disk was completely full.
Compaction brought that back to about 50 MB. I also set the _revs_limit (which defaults to 1000) to 10, since we didn't care about revision history, and it's been running perfectly since. After almost 1M users, the database size is usually about 2-3 GB. When I run compaction, it drops to about 500 MB.
Setting document revision limit to 10:
curl -X PUT -d "10" http://dbuser:dbpassword@127.0.0.1:5984/yourdb/_revs_limit
Or without user:password (not recommended):
curl -X PUT -d "10" http://127.0.0.1:5984/yourdb/_revs_limit

Solr for constantly updating index

I have a news site with 150,000 news articles. About 250 new articles are added to the database daily, at intervals of 5-15 minutes. I understand that Solr is optimized for millions of records and that my 150K won't be a problem for it. But I am worried that the frequent updates will be a problem, since the cache gets invalidated with every update. On my dev server, a cold load of a page takes 5-7 seconds (since every page runs a few MLT queries).
Will it help if I split my index into two: an archive index and a latest index? The archive index would be updated only once every day.
Can anyone suggest any ways to optimize my installation for a constantly updating index?
Thanks
My answer is: test it! Don't try to optimize yet if you don't know how it performs. Like you said, 150K is not a lot; it should be quick to build an index of that size for your tests. After that, run a couple of MLT queries from different concurrent threads (to simulate users) while you index more documents, to see how it behaves.
One setting that you should keep an eye on is auto-commit. Since you are indexing constantly, you can't commit on each document (you would bring Solr down). The value you choose for this setting lets you tune the latency of the system (how long it takes for new documents to show up in results) while keeping it responsive.
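As a sketch of that indexing pattern (assuming the pysolr client and a core named news, neither of which is mentioned in the original answer), documents can be added without an explicit commit and left to auto-commit / commitWithin:

# Hedged sketch: index without hard-committing per document; commitWithin
# (milliseconds) lets Solr batch commits. Core name and fields are made up.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/news", timeout=10)

docs = [{"id": "article-123", "title": "Example headline", "body": "..."}]

# No per-document hard commit; let Solr commit within ~15 seconds.
solr.add(docs, commit=False, commitWithin="15000")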
Consider using mlt=true in the main query instead of issuing per-result MoreLikeThis queries. You'll save the round trips, so it will be faster.
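A rough pysolr sketch of that request shape (field names and query are placeholders): the mlt.* parameters ask the MoreLikeThis component to return similar documents alongside the main result list.

# Hedged sketch: request MoreLikeThis results inside the main query instead
# of one MLT round trip per article. Field names and query are placeholders.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/news", timeout=10)

results = solr.search(
    "category:sports",
    **{
        "mlt": "true",           # enable the MoreLikeThis component
        "mlt.fl": "title,body",  # fields used to find similar articles
        "mlt.count": 3,          # similar docs to return per result
        "rows": 10,
    }
)

# The per-document moreLikeThis section comes back alongside the normal
# result list in the raw JSON response.
for doc in results:
    print(doc["id"])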
