Size of CouchDB documents

I have the following document in a couchdb database:
{
  "_id": "000013a7-4df6-403b-952c-ed767b61554a",
  "_rev": "1-54dc1794443105e9d16ba71531dd2850",
  "tags": [
    "auto_import"
  ],
  "ZZZZZZZZZZZ": "910111",
  "UUUUUUUUUUUUU": "OOOOOOOOO",
  "RECEIVING_OPERATOR": "073",
  "type": "XXXXXXXXXXXXXXXXXXX",
  "src_file": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
}
This JSON file takes exactly 319 bytes if saved on my local filesystem. My documents are all like this, give or take a couple of bytes, since some of the fields have varying lengths.
In my database I currently have around 6 million documents, and they use 15 GB. That works out to roughly 2.5 KB per document, which means the documents take about 8 times more space in CouchDB than they would as plain files on disk.
Why is that?

The problem is related to the way the document id is used: it is stored not only in the document itself but also in other data structures. That means that using a standard UUID (000013a7-4df6-403b-952c-ed767b61554a, 36 characters) is going to use up a lot of disk space. If collision is a minor concern, with a base-64 encoding you can number 16 million documents with just 4 characters, and over a billion documents with 5 characters. A good choice of dictionary is one that is ordered (in the "View Collation" sense):
-#0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
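A minimal sketch of how such short ids could be generated from a sequential counter, using the dictionary above (the helper name and the fixed width are illustrative, not part of CouchDB):

ALPHABET = "-#0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ"

def short_id(n, width=4):
    # Encode a sequential counter in base 64; 4 characters cover ~16.7 million ids.
    chars = []
    for _ in range(width):
        chars.append(ALPHABET[n % 64])
        n //= 64
    return "".join(reversed(chars))

# short_id(0) == "----", short_id(1) == "---#"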
Using this method, I have reduced the size of my database from 2.5 KB/doc to 0.4 KB/doc. My new database uses only 16% of the space of the old one, which I would say is a very big improvement.

CouchDB uses something called MVCC, which basically means it keeps previous versions of documents as you modify them. It uses these previous versions to help with replication in case of conflicts, and by default it keeps 1000 revisions (see this for more info).
You can lower the number of revisions to keep if you aren't using replication or somehow know that those sorts of conflicts will never happen.
You also might want to familiarize yourself with compaction, as that can help (temporarily) lower the storage footprint as well.
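For example, a hedged sketch of triggering database compaction over CouchDB's HTTP API (host and database name are placeholders), using Python's requests library:

import requests

# Compaction runs in the background once the request is accepted; it rewrites
# the database file without the old revision bodies.
resp = requests.post("http://127.0.0.1:5984/yourdb/_compact",
                     headers={"Content-Type": "application/json"})
resp.raise_for_status()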

Related

Azure Cosmos DB partition key design selection

Selecting a partition key is a simple but important design choice in Azure Cosmos DB, both for performance and for cost (RUs). Azure Cosmos DB does not allow us to change the partition key later, so it is very important to select the right one.
I have gone through the Microsoft documentation (Link), but I am still unsure which partition key to choose.
Below is the item structure I am planning to create:
{
  "id": "unique id like UUID", # just to keep some unique ID for the item
  "file_location": "/videos/news/finance/category/sharemarket/it-sectors/semiconductors/nvidia.mp4", # this value sometimes contains special symbols like spaces, dollars, caps and many more
  "createatedby": "andrew",
  "ts": "2022-01-10 16:07:25.773000",
  "directory_location": "/videos/news/finance/category/sharemarket/it-sectors/semiconductors/",
  "metadata": [
    {
      "codec": "apple",
      "date_created": "2020-07-23 05:42:37",
      "date_modified": "2020-07-23 05:42:37",
      "format": "mp4",
      "internet_media_type": "video/mp4",
      "size": "1286011"
    }
  ],
  "version_id": "48ad8200-7231-11ec-abda-34519746721"
}
I am using the Azure Cosmos DB SQL API. By default, Azure Cosmos DB takes care of indexing all data, so in the above case all properties are indexed.
For reading items I use the file_location property. Can I make file_location the partition key? Is there anything else to consider?
A few notes:
file_location values contain special characters like spaces, commas, dollars and many more.
Some containers contain 150 million entries and some contain just 20 million.
My operations are:
mostly reads, frequent writes as new videos are added, and fewer updates in case a video changes.
A few things to keep in mind while selecting partition keys:
Observe the query parameters you use while reading data; they give you good hints as to what the partition key candidates are.
You mentioned that a few containers contain 150 million documents and a few contain 20 million. Rather than the number of documents stored in a container, what matters is which containers receive the higher number of requests. If a few containers are getting too many requests, that is a good indicator of a poorly designed partition key.
Try to distribute the request load as evenly as possible among containers so that it gets distributed evenly among the physical partitions. Otherwise you will get hot-partition issues and end up working around them by increasing throughput, which will cost you more.
Try to limit cross-partition queries as much as possible; point reads that supply both the id and the partition key value (sketched below) are the cheapest way to fetch a single item.
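A minimal sketch of such a point read with /file_location as the partition key, assuming the azure-cosmos Python SDK (account URL, key, database, container, and item id are placeholders):

from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient(url="https://<your-account>.documents.azure.com:443/",
                      credential="<your-key>")
database = client.create_database_if_not_exists("media")
container = database.create_container_if_not_exists(
    id="videos",
    partition_key=PartitionKey(path="/file_location"),
)

# Point read: id plus partition key value, the cheapest (lowest-RU) way to fetch one item.
item = container.read_item(
    item="<item-id>",
    partition_key="/videos/news/finance/category/sharemarket/it-sectors/semiconductors/nvidia.mp4",
)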

ArangoDB Key/Value Model: value maximum size

With regard to the Key/Value model of ArangoDB, does anyone know the maximum size per value? I have spent hours searching the Internet for this information but to no avail; you would think this is classified information. Thanks in advance.
The answer depends on different things, such as the storage engine and whether you mean the theoretical or the practical limit.
In the case of MMFiles, the maximum document size is determined by the startup option wal.logfile-size if wal.allow-oversize-entries is turned off. If it is on, then there is no immediate limit.
In the case of RocksDB, it might be limited by some of the server startup options such as rocksdb.intermediate-commit-size, rocksdb.write-buffer-size, rocksdb.total-write-buffer-size or rocksdb.max-transaction-size.
When using arangoimport to import a 1 GB JSON document, you will run into the default batch-size limit. You can increase it, but it appears to max out at 805,306,368 bytes (0.75 GB). The HTTP API seems to have the same limitation (/_api/cursor with bindVars).
What you should keep in mind: mutating the document is potentially a slow operation because of the append-only nature of the storage layer. In other words, a new copy of the document with a new revision number is persisted, and the old revision will be compacted away some time later (I'm not familiar with all the technical details, but I think this is fair to say). For a 500 MB document it seems to take a few seconds to update or copy it using RocksDB on a rather strong system. It's much better to have many small documents.
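A hedged sketch of that many-small-documents approach, splitting one large value into fixed-size chunk documents (assumes the python-arango driver and an existing collection; host, credentials, collection and field names are placeholders):

import base64
from arango import ArangoClient

CHUNK_SIZE = 1 * 1024 * 1024  # 1 MB per chunk document instead of one huge value

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("_system", username="root", password="passwd")
chunks = db.collection("value_chunks")

def store_value(key, value):
    # Each chunk is a small document that can be read or rewritten independently.
    for seq, start in enumerate(range(0, len(value), CHUNK_SIZE)):
        chunks.insert({
            "_key": f"{key}-{seq}",
            "parent": key,
            "seq": seq,
            "data": base64.b64encode(value[start:start + CHUNK_SIZE]).decode("ascii"),
        })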

How can I add a data of size more than 2 mb as a single entry in cosmos db?

As there is a size limit in Cosmos DB for a single entry, how can I add data larger than 2 MB as a single entry?
The 2MB limit is a hard-limit, not expandable. You'll need to work out a different model for your storage. Also, depending on how your data is encoded, it's likely that the actual limit will be under 2MB (since data is often expanded when encoded).
If you have content within an array (the typical reason why a document would grow so large), consider refactoring this part of your data model (perhaps store references to other documents, within the array, vs the subdocuments themselves). Also, with arrays, you have to deal with an "unbounded growth" situation: even with documents under 2MB, if the array can keep growing, then eventually you'll run into a size limit issue.
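A hedged sketch of that reference pattern, assuming the azure-cosmos Python SDK with two containers (container handles, field names, and the partition key value are placeholders):

import uuid

def add_entry(parents, entries, parent_id, payload, partition_key_value):
    # Write the bulky payload as its own small item in a separate container...
    entry_id = str(uuid.uuid4())
    entries.create_item(body={"id": entry_id, "parent_id": parent_id, **payload})

    # ...and append only its id to the parent document's array, so the parent
    # grows by a short id per entry instead of a whole subdocument.
    parent = parents.read_item(item=parent_id, partition_key=partition_key_value)
    parent.setdefault("entry_ids", []).append(entry_id)
    parents.replace_item(item=parent_id, body=parent)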

Why is search performance slow for about 1M documents, and how can I scale the application?

I have created a search project based on Lucene 4.5.1.
There are about 1 million documents, each a few KB in size, and I index them with the fields docname (stored), lastmodified, and content. The overall size of the index folder is about 1.7 GB.
I used one document (the original one) as a sample and queried its content against the index. The problem is that each query comes back slowly. After some tests I found that my queries are too large even though I removed stopwords, but I have no idea how to reduce the query string size. Besides, the smaller the query string, the less accurate the results.
This is not limited to a specific file; I also tested with other original files, and search performance is consistently slow (often 1-8 seconds).
I have also tried copying the entire index directory to a RAMDirectory while searching; that didn't help.
In addition, I have only one IndexSearcher shared across multiple threads, but in testing I only used one thread as a benchmark; the expected response time should be a few ms.
So, how can I improve search performance in this case?
Hint: I'm searching for the top 1000 hits.
If the number of fields is large, a nice solution is not to store them individually but to serialize the whole object into a single binary field.
The plus is that, when projecting the object back out after a query, you read a single field rather than many: getField(name) iterates over the whole field set (O(n/2) on average per lookup) before you even get to reading values and setting fields. With a single field you just fetch it once and deserialize.
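The core idea, sketched here in Python and independent of the Lucene API (the record layout is made up for illustration): one serialized blob per document, one read plus one deserialize at query time.

import pickle

record = {"docname": "report-2013.pdf", "lastmodified": "2013-11-02", "content_hash": "abc123"}

blob = pickle.dumps(record)    # store this single value in one binary stored field
restored = pickle.loads(blob)  # instead of reading many stored fields per hit
assert restored == record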
Second, it might be worth looking at something like a MoreLikeThis query. See https://stackoverflow.com/a/7657757/277700

CouchDB .view file growing out of control?

I recently encountered a situation where my CouchDB instance used all available disk space on a 20GB VM instance.
Upon investigation I discovered that a directory in /usr/local/var/lib/couchdb/ contained a bunch of .view files, the largest of which was 16 GB. I was able to remove the *.view files to restore normal operation. I'm not sure why the .view files grew so large or how CouchDB manages them.
A bit more information: I have a VM running Ubuntu 9.10 (karmic) with 512 MB of RAM and CouchDB 0.10. The VM has a cron job which invokes a Python script that queries a view; the cron job runs once every five minutes. Every time the view is queried, the size of a .view file increases. I've written a job to monitor this on an hourly basis, and after a few days I don't see the file rolling over or otherwise decreasing in size.
Does anyone have any insights into this issue? Is there a piece of documentation I've missed? I haven't been able to find anything on the subject but that may be due to looking in the wrong places or my search terms.
CouchDB is very disk hungry, trading disk space for performance. Views will increase in size as items are added to them. You can recover disk space that is no longer needed with cleanup and compaction.
Every time you create, update, or delete a document, the view indexes will be updated with the relevant changes. The update to the view happens when it is queried, so if you are making lots of document changes you should expect your index to grow, and it will need to be managed with compaction and cleanup.
If your views are very large for a given set of documents, then you may have poorly designed views. Alternatively, your design may just require large views, and you will need to manage them as you would any other resource.
It would be easier to tell what is happening if you could describe what document updates (including creates and deletes) are happening and what your view functions are emitting, especially for the large view.
That your .view files grow each time you access a view is because CouchDB updates views on access. CouchDB views need compaction just like databases do. If you have frequent changes to your documents, resulting in changes in your view, you should run view compaction from time to time. See http://wiki.apache.org/couchdb/HTTP_view_API#View_Compaction
To reduce the size of your views, have a look at the data you are emitting. When you emit(foo, doc), the entire document is copied into the view so that it is instantly available when you query the view. A map function like function(doc) { emit(doc.title, doc); } will result in a view as big as the database itself. You could instead emit(doc.title, null); and use the include_docs option to let CouchDB fetch the document from the database when you access the view (which comes with a slight performance penalty). See http://wiki.apache.org/couchdb/HTTP_view_API#Querying_Options
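A hedged sketch of both ideas over the HTTP API, assuming a design document named "app" with a view "by_title" (both names are placeholders) on a local CouchDB, using Python's requests library:

import requests

DB = "http://127.0.0.1:5984/yourdb"

# With null emitted values, include_docs pulls the documents from the database
# instead of duplicating them inside the .view file.
rows = requests.get(
    f"{DB}/_design/app/_view/by_title",
    params={"include_docs": "true", "limit": 10},
).json()["rows"]

# Compact just the view indexes belonging to that design document.
requests.post(f"{DB}/_compact/app", headers={"Content-Type": "application/json"})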
Use sequential or monotonic id's for documents instead of random
Yes, CouchDB is very disk hungry and it needs regular compaction. But there is another thing that can help reduce this disk usage, especially when it's unnecessary.
CouchDB uses B+ trees for storing data/documents, which is a very good data structure for retrieval performance. However, the B-tree trades disk space for that performance. With completely random ids, the B+ tree fans out quickly: since the minimum fill rate is 1/2 for every internal node, the nodes are mostly filled up to about half (the data spreads evenly due to its randomness), generating more internal nodes. New insertions can also cause a rewrite of the full tree. That's what randomness can cause ;)
Instead, using sequential or monotonic ids avoids all of this.
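For illustration, a minimal sketch of a monotonic id: a sortable timestamp prefix plus a short random suffix to avoid collisions between concurrent writers (the format is made up, not something CouchDB mandates):

import time, secrets

def monotonic_id():
    # Zero-padded hex microsecond timestamp: new ids sort after older ones, so
    # inserts append to the right edge of the B+ tree instead of scattering.
    return f"{int(time.time() * 1_000_000):016x}-{secrets.token_hex(3)}"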
I've had this problem too, trying out CouchDB for a browser-based game.
We had about 100,000 unexpected visitors on the first day of the site launch, and within 2 days the CouchDB database was taking up about 40 GB. This made the server crash because the disk was completely full.
Compaction brought that back to about 50 MB. I also set _revs_limit (which defaults to 1000) to 10, since we didn't care about revision history, and it has been running perfectly since. After almost 1M users, the database size is usually about 2-3 GB. When I run compaction it drops to about 500 MB.
Setting document revision limit to 10:
curl -X PUT -d "10" http://dbuser:dbpassword#127.0.0.1:5984/yourdb/_revs_limit
Or without user:password (not recommended):
curl -X PUT -d "10" http://127.0.0.1:5984/yourdb/_revs_limit
