How does CouchDB store duplicated attachments? - couchdb

I have a CouchDB database which stores mostly document attachments.
The files are stored in the database with URLs following this structure:
/db-name/numeric-file-id/official-human-readable-file-name.ext
There is always exactly one attachment per document.
Today I computed the MD5 sums of all of the files, and it seems that many of them are duplicates.
I am wondering whether CouchDB is aware of duplicate attachments and internally stores only some kind of pointer to the file plus a reference count, or whether it simply stores each attachment as is.
I mean, if I put 5 identical 100MB files as attachments, will the database use 100MB or 500MB?

I also couldn't find a direct answer to this question in the CouchDB docs, so I devised a simple empirical test (using CouchDB 1.4):
Experiment:
I incrementally added 3 documents, each with several large (multi MB) attachments that were identical between documents. I then examined the size on-disk of the resulting db.couch file after each document insert.
Results:
The db.couch file increased from 8MB to 16MB and then 24MB for the 1st, 2nd and 3rd document inserts, respectively. So, CouchDB does not appear to be deduplicating identical attachments on different documents. Manually compacting the database after the three documents were added made no difference in the file size, so it's also unlikely that some background maintenance process would get around to noticing/fixing this.
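For anyone who wants to repeat the experiment, here is a rough sketch of how it could be scripted against CouchDB's HTTP API. The server URL, database name, document IDs and sample file are placeholders; in the 1.x API the database info document reports the on-disk size as disk_size.

```python
import requests

COUCH = "http://127.0.0.1:5984"
DB = "dedup-test"                      # placeholder database name

requests.put(f"{COUCH}/{DB}")          # create the test DB (a 412 just means it already exists)

with open("big-sample.bin", "rb") as f:  # any multi-MB file will do
    payload = f.read()

for i in range(1, 4):
    doc_id = f"doc-{i}"
    # Create an empty document, then attach the *same* bytes to it.
    rev = requests.put(f"{COUCH}/{DB}/{doc_id}", json={}).json()["rev"]
    requests.put(
        f"{COUCH}/{DB}/{doc_id}/sample.bin",
        params={"rev": rev},
        data=payload,
        headers={"Content-Type": "application/octet-stream"},
    )
    # The database info document reports the file size on disk
    # ("disk_size" in 1.x; newer versions report it under "sizes").
    info = requests.get(f"{COUCH}/{DB}").json()
    print(doc_id, "disk_size:", info.get("disk_size"))
```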
This lack of attachment deduplication is a curious omission given the following three observations:
The authors were concerned enough about efficiently handling large attachments that they added automatic gzip compression of stored attachments (for those with MIME types that indicate some kind of text content.)
Adding an attachment causes an MD5 digest to be calculated and stored with the metadata for the attachment.
CouchDB does seem to deduplicate identical attachments shared among multiple revs of the same document that are still being held in the DB (probably one use of the MD5 digest).
Given these factors, it is surprising that CouchDB isn't more intelligent in this regard, as it would be a valuable and (likely) straightforward optimization.
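In the meantime, the stored MD5 digest at least makes it easy to detect duplicates from the client side. A minimal sketch (server URL and database name are placeholders):

```python
import requests
from collections import defaultdict

COUCH = "http://127.0.0.1:5984"       # placeholder server URL
DB = "my-db"                          # placeholder database name

by_digest = defaultdict(list)
rows = requests.get(f"{COUCH}/{DB}/_all_docs").json()["rows"]
for row in rows:
    doc = requests.get(f"{COUCH}/{DB}/{row['id']}").json()
    for name, stub in doc.get("_attachments", {}).items():
        # The attachment stub carries a digest like "md5-<base64>"; equal
        # digests mean byte-identical attachments.
        by_digest[stub["digest"]].append((doc["_id"], name))

for digest, owners in by_digest.items():
    if len(owners) > 1:
        print("duplicate attachment", digest, owners)
```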

Related

Sorting enormous dataset

I have an enormous dataset (over 300 million documents). It is a system for archiving data with rollback capability.
The rollback capability is a cursor which iterates through the whole dataset and performs a few POST requests to some external endpoints; it's a simple piece of code.
The data being iterated over needs to be sent ordered by the timestamp (a field in the document). The DB was down for some time, so a backup DB was used, but it received older data which had been archived manually, and later everything was merged back into the main DB.
The older data breaks the order. I need to sort this dataset, but the problem is its size; there is not enough RAM available to perform this operation at once. How can I achieve this sorting?
PS: The documents do not contain any indexed fields.
There's no way to do an efficient sort without an index. If you had an index on the date field then things would already be sorted (in a sense), so getting things in a desired order is very cheap (after the overhead of the index).
The only way to sort all entries without an index is to fetch the field you want to sort by for every single document and sort them all in memory.
The only good options I see are to either create an index on the date field (by far the best option) or increase the RAM on the database (expensive and not scalable).
Note: since you have a large number of documents it's possible that even your index wouldn't be super scalable -- in that case you'd need to look into sharding the database.
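If adding an index really isn't an option, a classic external merge sort keeps memory use bounded: sort the data in chunks that fit in RAM, spill each sorted chunk to disk, then stream a k-way merge over the chunks. A rough sketch, assuming the documents have been exported to a newline-delimited JSON file and each carries a timestamp field (file and field names are illustrative):

```python
import heapq
import json
import tempfile

CHUNK = 500_000  # documents sorted in memory at a time; tune to the available RAM

def _flush(buf):
    """Sort one in-memory chunk by timestamp and spill it to a temp file."""
    buf.sort(key=lambda d: d["timestamp"])
    tmp = tempfile.NamedTemporaryFile("w", delete=False, suffix=".jsonl")
    for doc in buf:
        tmp.write(json.dumps(doc) + "\n")
    tmp.close()
    return tmp.name

def sorted_chunks(path):
    """Split the export into timestamp-sorted temporary files."""
    chunks, buf = [], []
    with open(path) as src:
        for line in src:
            buf.append(json.loads(line))
            if len(buf) >= CHUNK:
                chunks.append(_flush(buf))
                buf = []
    if buf:
        chunks.append(_flush(buf))
    return chunks

def in_timestamp_order(chunk_paths):
    """K-way merge of the sorted chunks; only one record per chunk is held in memory."""
    streams = ((json.loads(line) for line in open(p)) for p in chunk_paths)
    yield from heapq.merge(*streams, key=lambda d: d["timestamp"])

for doc in in_timestamp_order(sorted_chunks("export.jsonl")):
    ...  # POST the document to the external endpoint here, strictly in order
```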

Image storage performance on file system with Nodejs and Mongo

My Node.js application currently stores uploaded images on the file system, with the paths saved in a MongoDB database. Each document (maybe 2,000 max in the future) has between 4 and 10 images. I don't believe I need to store the images in the database directly for my usage (I do not need to track versions, etc.); I am only concerned with performance.
Currently I store all images in one folder, with the associated paths stored in the database. However, as the number of documents (and hence the number of images) increases, will having so many files in a single folder slow performance?
Alternatively, I could have a folder for each document. Does this extra level of folder complexity affect performance? Also, using MongoDB, the obvious folder naming scheme would be to use the ObjectID, but do folder names of that length (24 characters) affect performance? Should I be using a custom ObjectID?
Are there more efficient ways? Thanks in advance.
For simply accessing files, the number of items in a directory does not really affect performance. However, it is common to split out directories for this as getting the directory index can certainly be slow when you have thousands of files. In addition, file systems have limits to the number of files per directory. (What that limit is depends on your file system.)
If I were you, I'd just have a separate directory for each document, and load the images in there. If you are going to have more than 10,000 documents, you might split those a bit. Suppose your hash is 7813258ef8c6b632dde8cc80f6bda62f. It's pretty common to have a directory structure like /7/8/1/3/2/5/7813258ef8c6b632dde8cc80f6bda62f.
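A small sketch of that fan-out idea, deriving the nested path from the leading characters of the ID; the base directory and the depth of 2 are arbitrary choices for illustration:

```python
import os

def sharded_path(base_dir, object_id, filename, depth=2):
    """Nest the file under one directory per leading character of the ID."""
    parts = list(object_id[:depth])                    # e.g. ["7", "8"]
    directory = os.path.join(base_dir, *parts, object_id)
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, filename)

# sharded_path("/srv/uploads", "7813258ef8c6b632dde8cc80f6bda62f", "photo1.jpg")
# -> /srv/uploads/7/8/7813258ef8c6b632dde8cc80f6bda62f/photo1.jpg
```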

Data retrieval - Database VS Programming language

I have been working with databases recently and before that I was developing standalone components that do not use databases.
With all this DB work, a few questions have sprung up.
Why is a database query faster than retrieving the data from a file in a programming language?
To elaborate on my question further:
Assume I have a table called Employee, with fields Name, ID, DOB, Email and Sex. For reasons of simplicity we will also assume they are all strings of fixed length and they do not have any indexes or primary keys or any other constraints.
Imagine we have 1 million rows of data in the table. At the end of the day this table is going to be stored somewhere on the disk. When I write a query Select Name,ID from Employee where DOB="12/12/1985", the DBMS picks up the data from the file, processes it, filters it and gives me a result which is a subset of the 1 million rows of data.
Now, assume I store the same 1 million rows in a flat file, each field similarly being fixed length string for simplicity. The data is available on a file in the disk.
When I write a program in C++ or C or C# or Java and do the same task of finding the Name and ID where DOB="12/12/1985", I will read the file record by record and check each row to see whether DOB="12/12/1985"; if it matches, I present the row to the user.
This way of doing it by a program is too slow when compared to the speed at which a SQL query returns the results.
I assume the DBMS is also written in some programming language and there is also an additional overhead of parsing the query and what not.
So what happens in a DBMS that makes it faster to retrieve data than through a programming language?
If this question is inappropriate for this forum, please delete it, but do give me some pointers to where I may find an answer.
I use SQL Server if that is of any help.
Why is a database query faster than retrieving the data from a file in a programming language?
That depends on many things - network latency and disk seek speeds being two of the important ones. Sometimes it is faster to read from a file.
In your description of finding a row within a million rows, a database will normally be faster than seeking in a file because it employs indexing on the data.
If you pre-process your data file and provide index files for the different fields, you could speed up data lookups from the filesystem as well.
Note: databases are normally used not for this feature, but because they are ACID compliant and therefore suitable for environments where you have multiple processes (normally many clients on many computers) querying the database at the same time.
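To make the index-file idea concrete, here is a rough sketch that builds a DOB-to-byte-offset index over a fixed-length-record file and then uses it for lookups. The record layout is invented for illustration:

```python
import pickle
from collections import defaultdict

# Invented fixed-length layout: Name 32 bytes, ID 8, DOB 10, Email 40, Sex 1.
RECORD_LEN = 91
DOB_OFFSET, DOB_LEN = 40, 10

def build_index(data_path, index_path):
    """One full scan: map each DOB to the byte offsets of its records."""
    index = defaultdict(list)
    with open(data_path, "rb") as f:
        offset = 0
        while record := f.read(RECORD_LEN):
            dob = record[DOB_OFFSET:DOB_OFFSET + DOB_LEN].decode()
            index[dob].append(offset)
            offset += RECORD_LEN
    with open(index_path, "wb") as out:
        pickle.dump(dict(index), out)

def lookup(data_path, index_path, dob):
    """Seek straight to the matching records instead of scanning the whole file."""
    with open(index_path, "rb") as f:
        index = pickle.load(f)
    with open(data_path, "rb") as data:
        for offset in index.get(dob, []):
            data.seek(offset)
            yield data.read(RECORD_LEN)

# for record in lookup("employees.dat", "employees.dob.idx", "12/12/1985"): ...
```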
There are lots of techniques to speed up various kinds of access. As @Oded says, indexing is the big solution to your specific example: if the database has been set up to maintain an index by date, it can go directly to the entries for that date instead of reading through the entire file. (Note that maintaining an index does take up space and time, though -- it's not free!)
On the other hand, if such an index has not been set up, and the database has not been stored in date order, then a query by date will need to go through the entire database, just like your flat-file program.
Of course, you can write your own programs to maintain and use a date index for your file, which will speed up date queries just like a database. And, you might find that you want to add other indices, to speed up other kinds of queries -- or remove an index that turns out to use more resources than it is worth.
Eventually, managing all the features you've added to your file manager may become a complex task; you may want to store this kind of configuration in its own file, rather than hard-coding it into your program. At the minimum, you'll need features to make sure that changing your configuration will not corrupt your file...
In other words, you will have written your own database.
...an old one, I know... just in case somebody finds this: the question contained "assume ... do not have any indexes"
...so the question was about a sequential data-read fight between the database and a flat file WITHOUT indexes, which the database wins...
And the answer is: if you read record by record from disk you do a lot of disk seeking, which is expensive performance-wise. A database, by design, always loads whole pages -- so a batch of records at once. Less disk seeking is definitely faster. If you did a memory-buffered read from a flat file you could achieve the same or better read performance.
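To illustrate the point about page-sized reads, here is a small sketch that reads a large block at a time and slices fixed-length records out of it in memory, rather than issuing one read per record; the record layout is assumed, as in the earlier sketch:

```python
RECORD_LEN = 91                      # same invented layout as the earlier sketch
PAGE_SIZE = RECORD_LEN * 8192        # read roughly 750 KB per I/O, like a DB page/extent

def scan_paged(path, predicate):
    """Read big blocks and slice records out of them in memory."""
    with open(path, "rb") as f:
        while page := f.read(PAGE_SIZE):
            usable = len(page) - len(page) % RECORD_LEN   # ignore a trailing partial record
            for i in range(0, usable, RECORD_LEN):
                record = page[i:i + RECORD_LEN]
                if predicate(record):
                    yield record

# e.g. all employees born on 12/12/1985 (DOB occupies bytes 40..49):
# matches = list(scan_paged("employees.dat", lambda r: r[40:50] == b"12/12/1985"))
```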

Can you find the logical size of a single NotesDocument in a DAOS-enabled database or an uploaded file size?

I'm doing some feasibility work for an XPages application. One of the aspects is checking the amount of space used by each user.
The database will be DAOS-enabled to minimise the size of the NSF. Is it possible to identify the logical size of a NotesDocument that has a DAOS'ed attachment? I know I can find the logical size of the overall database, but I need to break it down by user.
LotusScript or Java would be acceptable options.
The other option is to capture file sizes at upload time and store that information against the user. Is it possible to identify the attachment size at the point of upload and deletion? This would need to be captured before the attachment was moved to the DAOS store.
Paul,
As far as I know, from the client's point of view you can't see whether a database/document has been DAOS'ed or not. So this means that using LotusScript against the document would report the document size as if the attachment(s) were still in the document. I haven't tested it myself, so I can't give you a 100% guarantee, but you could test it for yourself very easily by enabling a database for DAOS and then creating 10 docs, each with the exact same attachment attached. If the docs report a size of around the attachment size when accessed via LotusScript, you will have your answer!
You could check the logical size of the database before and after saving the document. But unfortunately, you would have to rig the critical section of this code with a semaphore or some other mechanism that assures that only one instance can run at a time, otherwise two simultaneous saves would give you bad results.
Build a view with a column whose formula is @DocLength or @Sum(@AttachmentLengths). This will show the logical size of the docs as if DAOS were not active.
/Newbs

Including documents in the emit compared to include_docs = true in CouchDB

I ran across a mention somewhere that doing an emit(key, doc) will increase the amount of time an index takes to build (or something to that effect).
Is there any merit to it, and is there any reason not to just always do emit(key, null) and then include_docs = true?
Yes, it will increase the size of your index, because CouchDB effectively copies the entire document in those cases. For cases in which you can, use include_docs=true.
There is, however, a race condition to be aware of when using this that is mentioned in the wiki. It is possible, during the time between reading the view data and fetching the document, that said document has changed (or has been deleted, in which case _deleted will be true). This is documented here under "Querying Options".
This is a classic time/space tradeoff.
Emitting document data into your index will increase the size of the index file on disk because CouchDB includes the emitted data directly into the index file. However, this means that, when querying your data, CouchDB can just stream the content directly from the index file on disk. This is obviously quite fast.
Relying instead on include_docs=true will decrease the size of your on-disk index, it's true. However, on querying, CouchDB must perform a document read for every returned row. This involves essentially random document lookups from the main data file, meaning that the cost and time of returning data increases significantly.
While the query time difference for small numbers of documents is small, it will add up over every call made by the application. For me, therefore, emitting needed fields from a document into the index is usually the right call -- disk is cheap, users' attention spans less so. This is broadly similar to using covering indexes in a relational database, another widely echoed piece of advice.
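To make the two query styles concrete, here is a rough sketch over CouchDB's HTTP view API. The database, design document and view names are placeholders; the first view is assumed to emit(key, null) and the second to emit the needed fields as its value:

```python
import requests

COUCH = "http://127.0.0.1:5984"   # placeholder server URL
DB = "mydb"                       # placeholder database name

# Variant 1: slim index (emit(key, null)); CouchDB must fetch every returned doc.
slim = requests.get(
    f"{COUCH}/{DB}/_design/app/_view/by_key",
    params={"include_docs": "true"},
).json()
docs = [row["doc"] for row in slim["rows"]]

# Variant 2: fatter index (fields emitted as the value); the rows already hold
# what we need, so no per-row document lookups happen.
fat = requests.get(f"{COUCH}/{DB}/_design/app/_view/by_key_with_fields").json()
values = [row["value"] for row in fat["rows"]]
```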
I did a totally unscientific test on this to get a feel for what the difference is. I found about an 8x increase in response time and 50% increase in CPU when using include_docs=true to read 100,000 documents from a view when compared to a view where the documents were emitted directly into the index itself.
