Most efficient way to know the size of a remote CMIS folder

A CMIS folder can contain many sub-folders and documents, so it can be arbitrarily big.
I would like to warn the user before downloading a folder.
For instance: "You are about to download 35 TB. Are you sure?"
It would also be useful for a progress bar.
What is the most efficient way to know the recursive size of a remote CMIS folder?
My naive approach (simplified code) with getDescendants:
// Get all descendants
depth = -1; // -1 means "all levels" in CMIS (cleaner than a huge magic number), but it still fetches every object
descendants = cmisFolder.getDescendants(depth);
// Add up all sizes
size = 0;
foreach (descendant in descendants) {
    size += descendant.getSize();
}
Is there a more efficient approach, with less traffic between me and the server?

Using a query is a bit more efficient because it lets you filter out non-document objects (e.g. folders) and documents without content. It also returns a flat list, which is easier to traverse, and it supports paging, which can help both client and server when many documents are returned.
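A minimal sketch of that approach in TypeScript, assuming a hypothetical runQuery helper that stands in for your CMIS client's query call (the statement itself is standard CMIS QL; skipCount/maxItems are the usual paging parameters):

// Sum cmis:contentStreamLength over every document in the folder's subtree.
// CMIS QL has no SUM aggregate, so the adding up still happens client-side,
// but only one number per document crosses the wire, and paging bounds each
// round trip. `runQuery` is a hypothetical helper wrapping your binding's
// query operation and returning one page of content stream lengths.
async function folderSize(
  folderId: string,
  runQuery: (statement: string, skipCount: number, maxItems: number) => Promise<number[]>
): Promise<number> {
  const statement =
    `SELECT cmis:contentStreamLength FROM cmis:document WHERE IN_TREE('${folderId}')`;
  const pageSize = 1000;
  let total = 0;
  for (let skip = 0; ; skip += pageSize) {
    const lengths = await runQuery(statement, skip, pageSize);
    total += lengths.reduce((sum, n) => sum + n, 0);
    if (lengths.length < pageSize) break; // last page reached
  }
  return total;
}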

Related

Image storage performance on file system with Nodejs and Mongo

My Node.js application currently stores uploaded images on the file system, with the paths saved in a MongoDB database. Each document (maybe 2,000 at most in the future) has between 4 and 10 images. I don't believe I need to store the images in the database directly for my usage (I do not need to track versions, etc.); I am only concerned with performance.
Currently I store all images in one folder, with the associated paths stored in the database. However, as the number of documents (and hence images) increases, will having so many files in a single folder slow performance?
Alternatively, I could have a folder for each document. Does this extra level of folder complexity affect performance? Also, using MongoDB, the obvious folder-naming scheme would be to use the ObjectID, but do folder names of that length (24 characters) affect performance? Should I be using a custom ObjectID?
Are there more efficient ways? Thanks in advance.
For simply accessing files, the number of items in a directory does not really affect performance. However, it is common to split directories up for this, as listing a directory's contents can certainly be slow when it holds thousands of files. In addition, file systems have limits on the number of files per directory. (What that limit is depends on your file system.)
If I were you, I'd just have a separate directory for each document, and load the images in there. If you are going to have more than 10,000 documents, you might split those a bit. Suppose your hash is 7813258ef8c6b632dde8cc80f6bda62f. It's pretty common to have a directory structure like /7/8/1/3/2/5/7813258ef8c6b632dde8cc80f6bda62f.
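A small Node.js sketch of that fan-out layout (the base directory, fan-out depth, and helper names are illustrative, not prescriptive):

import { promises as fs } from 'fs';
import * as path from 'path';

// Build a fanned-out path from a hex id (a Mongo ObjectID or a content hash),
// e.g. /uploads/7/8/1/3/7813258ef8c6.../image.jpg
function fannedPath(baseDir: string, id: string, fileName: string, fanout = 4): string {
  const levels = id.slice(0, fanout).split(''); // first N characters become nested directories
  return path.join(baseDir, ...levels, id, fileName);
}

async function saveImage(baseDir: string, id: string, fileName: string, data: Buffer) {
  const target = fannedPath(baseDir, id, fileName);
  await fs.mkdir(path.dirname(target), { recursive: true }); // create missing directories
  await fs.writeFile(target, data);
  return target; // store this path (or just id + fileName) in MongoDB
}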

How does Couchdb store duplicated attachments?

I have a CouchDB database which stores mostly document attachments.
The files are stored in the database with URLs following this structure:
/db-name/numeric-file-id/official-human-readable-file-name.ext
There is always only one attachment to one document.
Today I have computed the md5 sums of all of the files and it seems that many of them are duplicates.
I am wondering whether CouchDB is aware of duplicate attachments and internally stores only some kind of pointer to the file, keeping track of a reference count, or whether it simply stores each attachment as-is.
I mean, if I put 5 identical 100MB files as attachments, will the database use 100MB or 500MB?
I also couldn't find a direct answer to this question in the CouchDB docs, so I devised a simple empirical test (using CouchDB 1.4):
Experiment:
I incrementally added 3 documents, each with several large (multi MB) attachments that were identical between documents. I then examined the size on-disk of the resulting db.couch file after each document insert.
Results:
The db.couch file increased from 8MB to 16MB and then 24MB for the 1st, 2nd and 3rd document inserts, respectively. So, CouchDB does not appear to be deduplicating identical attachments on different documents. Manually compacting the database after the three documents were added made no difference in the file size, so it's also unlikely that some background maintenance process would get around to noticing/fixing this.
This lack of attachment deduplication is a curious omission given the following three observations:
The authors were concerned enough about efficiently handling large attachments that they added automatic gzip compression of stored attachments (for those with MIME types that indicate some kind of text content).
Adding an attachment causes an MD5 digest to be calculated and stored with the metadata for the attachment.
CouchDB does seem to deduplicate identical attachments shared among multiple revs of the same document that are still being held in the DB (probably one use of the MD5 digest).
Given these factors, it is surprising that CouchDB isn't more intelligent in this regard, as it would be a valuable and (likely) straightforward optimization.
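For anyone who wants to check this against their own data, the MD5 digests mentioned above are visible in each document's attachment stubs. A sketch against CouchDB's standard HTTP API (the base URL, database, and document id are placeholders; fetch assumes Node 18+ or a polyfill):

// Print the digest and size of every attachment stub on a document.
// Identical attachments on different documents show identical digests,
// even though (per the experiment above) they are stored separately on disk.
async function printAttachmentDigests(baseUrl: string, db: string, docId: string) {
  const res = await fetch(`${baseUrl}/${db}/${encodeURIComponent(docId)}`);
  const doc: any = await res.json();
  for (const [name, meta] of Object.entries(doc._attachments ?? {}) as [string, any][]) {
    console.log(name, meta.digest, meta.length); // digest looks like "md5-<base64>"
  }
}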

On-disk lookup table with node.js bindings

For a project I am creating a queuing library, and I basically store URLs in a Set (it's actually an object where I set keys to true, but one can see it as an array), so the queue only takes every URL once. This works really well; however, I am facing the problem that there are many URLs, so the RAM usage becomes really high.
Therefore I want to use an on-disk key-value store (actually only keys are required; no idea whether there is some different approach) with the following requirements:
No need to load the whole data set into RAM
Speedy lookups
Node.js bindings
It doesn't have to be too safe (losing data once in a while isn't a huge problem; low RAM requirements are more important), and even though I use Node.js in this scenario, the lookup doesn't necessarily need to run async.
A side question would be whether there is a better approach than an on-disk key-value store. A search term would be nice; searching for "lookup table" somehow always leads me to data sets (IPs, ZIP codes, etc.).
I'd use a SQL table with a single column (to store the URL). It gives better control over memory usage than Redis (which pretty much stores everything in memory). It is:
easy to check if there is already the same value
easy to insert
easy to remove one element
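A sketch of that single-column table from Node.js, assuming the better-sqlite3 package (any SQLite driver gives you the same shape):

import Database from 'better-sqlite3';

// One table, one column; the PRIMARY KEY provides an index for fast lookups
// and lets INSERT OR IGNORE reject duplicates in a single statement.
const db = new Database('urls.db');
db.exec('CREATE TABLE IF NOT EXISTS seen_urls (url TEXT PRIMARY KEY)');

const insert = db.prepare('INSERT OR IGNORE INTO seen_urls (url) VALUES (?)');
const remove = db.prepare('DELETE FROM seen_urls WHERE url = ?');

// Returns true if the URL was new (and should be queued), false if already seen.
function markSeen(url: string): boolean {
  return insert.run(url).changes === 1;
}

function forget(url: string): void {
  remove.run(url);
}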
If it really "doesn't have to be too safe", another design would be to keep storing everything in memory but limit the number of URLs you store, for example by using an LRU cache.
You could either use a cache in node.js (easy to find via Google) or use a separate memcached server, possibly on the same machine.
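If the LRU route looks attractive, here is a minimal in-memory sketch built on Map's insertion-order iteration (no library needed; the class name and size limit are illustrative):

// A bounded "seen URLs" set that evicts the least recently used entry.
class LruSet {
  private map = new Map<string, true>();
  constructor(private maxSize: number) {}

  has(url: string): boolean {
    if (!this.map.has(url)) return false;
    // Refresh recency: re-insert so the key moves to the newest end.
    this.map.delete(url);
    this.map.set(url, true);
    return true;
  }

  add(url: string): void {
    if (this.map.has(url)) this.map.delete(url);
    this.map.set(url, true);
    if (this.map.size > this.maxSize) {
      // The oldest entry is the first key in iteration order.
      const oldest = this.map.keys().next().value as string;
      this.map.delete(oldest);
    }
  }
}

const seen = new LruSet(100_000);
function shouldQueue(url: string): boolean {
  if (seen.has(url)) return false;
  seen.add(url);
  return true;
}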

Including documents in the emit compared to include_docs = true in CouchDB

I ran across a mention somewhere that doing an emit(key, doc) will increase the amount of time an index takes to build (or something to that effect).
Is there any merit to it, and is there any reason not to just always do emit(key, null) and then include_docs = true?
Yes, it will increase the size of your index, because CouchDB effectively copies the entire document into the index in those cases. Where you can, use include_docs=true instead.
There is, however, a race condition to be aware of when using this, which is mentioned in the wiki: it is possible, during the time between reading the view data and fetching the document, that the document has changed (or has been deleted, in which case _deleted will be true). This is documented under "Querying Options".
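For reference, the include_docs variant looks like this (database and view names are placeholders; the map function is assumed to emit(key, null)). The doc field reflects the document at read time, which is exactly where the race condition above can bite:

// Query a view and let CouchDB join the documents in at read time.
async function latestPosts(baseUrl: string, db: string) {
  const url = `${baseUrl}/${db}/_design/posts/_view/by_date` +
              '?include_docs=true&descending=true&limit=20';
  const res = await fetch(url);
  const body: any = await res.json();
  // Each row looks like { id, key, value: null, doc: <full document> }.
  return body.rows.map((row: any) => row.doc);
}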
This is a classic time/space tradeoff.
Emitting document data into your index will increase the size of the index file on disk because CouchDB includes the emitted data directly into the index file. However, this means that, when querying your data, CouchDB can just stream the content directly from the index file on disk. This is obviously quite fast.
Relying instead on include_docs=true will decrease the size of your on-disk index, it's true. However, on querying, CouchDB must perform a document read for every returned row. This involves essentially random document lookups from the main data file, meaning that the cost and time of returning data increases significantly.
While the query-time difference for small numbers of documents is small, it adds up over every call made by the application. For me, therefore, emitting the needed fields from a document into the index is usually the right call; disk is cheap, users' attention spans less so. This is broadly similar to using covering indexes in a relational database, another widely echoed piece of advice.
I did a totally unscientific test on this to get a feel for what the difference is. I found about an 8x increase in response time and 50% increase in CPU when using include_docs=true to read 100,000 documents from a view when compared to a view where the documents were emitted directly into the index itself.
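For comparison, a sketch of the "emit the fields you need" pattern (the document type and field names are just examples):

// Design document whose map function copies only the fields the UI needs
// into the index, so queries are answered from the view file alone.
const designDoc = {
  _id: '_design/posts',
  views: {
    by_date: {
      map: `function (doc) {
        if (doc.type === 'post') {
          // The value holds just the projected fields, not the whole document.
          emit(doc.created_at, { title: doc.title, author: doc.author });
        }
      }`,
    },
  },
};

// Query without include_docs; the projected fields come back in each row's value:
// GET /mydb/_design/posts/_view/by_date?descending=true&limit=20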

How do we get around the Lotus Notes 60 Gb database barrier

Are there ways to get around the upper database size limit on Notes databases? We are compacting a database that is still approaching 60 gigs in size. Thank you very much if you can offer a suggestion.
Even if you could find a way to get over the 64GB limit it would not be the recommended solution. Splitting up the application into multiple databases is far better if you wish to improve performance and retain the stability of your Domino server. If you think you have to have everything in the same database in order to be able to search, please look up domain search and multi-database search in the Domino Administrator help.
Maybe some parts of the data are "old" and could be moved into one or more archive databases instead?
Maybe you have a lot of large attachments and can store them in a series of attachment databases?
Maybe you have a lot of complicated views that can be streamlined or eliminated and thereby save a lot of space and keep everything in the same database for the time being? (Remove sorting on columns where not needed, using "click on column header to sort" is a sure way to increase the size of the view index.)
I'm assuming your database is large because of file attachments as well. In that case look into DAOS - it will store all file attachments on filesystem (server functionality - transparent to clients and existing applications).
As a bonus it finds duplicates and stores them only once.
More here: http://www.ibm.com/developerworks/lotus/library/domino-green/
Just a stab in the dark:
Use the DB2 storage method instead of standard NSF storage on the Domino server?
I'm guessing that 80-90% of that space is taken up by file attachments. My suggestion is to move all the attachments to a file share, provided everyone can access that share, or to an FTP server that everyone can connect to.
It's not ideal because security becomes an issue - now you need to manage credentials to the Notes database AND to the external file share - however it'll be worth the effort from a Notes administrator's perspective.
In the Notes documents, just provide a link to the file. If users are adding these files via a Notes form, perhaps you can add some background code to extract the file from the document after it has been saved, and replace it with a link to that file.
The 64GB limit is not actually absolute; you can go above it. I've seen 80GB and even close to 100GB, although once you're past 64GB you can run into problems at any time. The limit is not actually Notes itself but the underlying file system; I've seen this on AS/400. The great thing about Notes is that if you do get a huge crash, you can still access all the documents and pull everything out to new copies using scheduled agents, even if you can no longer get views to open in the client.
Your best bet is regular archiving. If it is file attachments, then anything over two years old doesn't need to be in the main system, just a brief synopsis and a link; you could even have a 5-year archive, a 2-year archive, a 1-year archive, and so on. Data will continue to accumulate and has to be managed, irrespective of what platform you use to store it.
If the issue really is large file attachments, I would certainly recommend looking into implementing DAOS on your server / database. It is only available with Domino Server 8.5 and later. On the other hand, if your database contains over 100,000 documents, you may want to look seriously at dividing the data into multiple NSFs; at that number of documents, you need to be very careful about your view design, your lookup code, etc.
Some documented successes with DAOS:
http://www.edbrill.com/ebrill/edbrill.nsf/dx/yet-another-daos-success-story-from-darren-duke?opendocument&comments
If your database is getting to 60GB, don't use a Domino solution; you need to switch to a relational database. You need to archive or move documents across several databases. Although you can get to 60GB, you shouldn't do it. The performance hit for active databases is significant; it is not so much a problem for static databases.
I would also look at removing any unnecessary views and their indexes. View indexes can occupy 80-90% of your disk space. If you can't remove them, simplify their sorting arrangements/formulas and remove any unnecessary column-sorting options. I halved a 50GB database down to 25GB with a few simple changes like this, and virtually no users noticed.
One path could be, for once, to start with the user. Do all the users need to access all that data all the time? If not, it's time to split or archive. If yes, there is probably a flaw in the design of the application.
Technically, I would add to the previous comments a suggestion to check the many options for compaction. Quick and dirty: discard all view indexes, but be sure to rebuild at least the one for the default view if you don't want your users to riot. See updall.
One more thing to check: make sure you have checked
[x] Use LZ1 compression for attachments
in db properties.
