Storing lots of attachments in single CouchDB document

tl;dr: Should I store directories in CouchDB as a list of attachments, or as a single tar?
I've been using CouchDB to store project documents. I just create documents via Futon and upload them directly from there. I've also written a script to bulk-upload directories. I am using it like a basic content repository. I replicate it, so other people on my team have a copy of the repository.
I noticed that saving directories as a series of files seems to have a lot of storage overhead, so instead I upload a .tar.gz file containing the directory. This does significantly reduce the size of the document but now any change to the directory requires replicating the entire tarball.
I am looking for thoughts or perspective on the matter.

It really depends on what you want to achieve. I will try to provide some options for you to consider.
Storing one tar.gz will save you space, but it does make it harder to work with. If you are simply archiving it may work for you.
Storing all the attachments on one document works well for couchapps. The workflow is that you work on the attachments until you are ready to release the application; replication then adds little overhead, because it usually happens only once. It is nice that they are on one document because they all move/replicate as one bundle. The downsides of this approach for a content management system are that you can get a lot of history baggage that you have to compact on your local couch. You will also get a lot of conflicts during replication between couches, and couch will keep conflicts around for you to resolve. Therefore, if you choose this model, you should compact frequently to reduce disk size.
For a content management system, I would recommend using one document per attachment. That gives you fewer conflicts. There is a slight overhead, since each doc has some space allocated for the doc itself, but the savings from not having to do frequent compaction and/or conflict resolution outweigh it.
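As a rough illustration of the one-document-per-file layout, here is a minimal Python sketch that prepares one PUT request per file against CouchDB's standalone attachment endpoint (PUT /{db}/{docid}/{attachment}). The server URL, the database name, the choice of the relative path as document ID, and the helper names are all assumptions made for this example; nothing is sent until the requests are passed to urllib.request.urlopen.

```python
import mimetypes
import urllib.request
from pathlib import Path
from urllib.parse import quote


def attachment_request(base_url, db, rel_path, data):
    """Build a PUT request that stores one file as the sole attachment
    of its own document. Using the relative path as the document ID is
    an assumption made for this sketch."""
    doc_id = quote(rel_path, safe="")  # slashes in document IDs must be escaped
    att_name = quote(Path(rel_path).name, safe="")
    content_type = mimetypes.guess_type(rel_path)[0] or "application/octet-stream"
    return urllib.request.Request(
        f"{base_url}/{db}/{doc_id}/{att_name}",
        data=data,
        method="PUT",
        headers={"Content-Type": content_type},
    )


def plan_directory_upload(base_url, db, root):
    """Return one request per regular file found under root."""
    root = Path(root)
    return [
        attachment_request(base_url, db, p.relative_to(root).as_posix(), p.read_bytes())
        for p in sorted(root.rglob("*"))
        if p.is_file()
    ]
```

With this layout, a change to one file only touches one small document. Updating a document that already exists additionally requires its current revision (for example as a ?rev= query parameter), which this sketch leaves out.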
Hope that gives you some options to weigh out.


Apache Chemistry CMIS session.createDocument vs folder.createDocument

I would like someone to give me the difference between the session createDocument and folder createDocument methods.
Also, within this context, is there a sample showing how to use the document appendContentStream() method? I have struggled to find an example online. I have a requirement where document sizes can be up to 300-350 MB, and I wanted to know more about appendContentStream() after Jeff Potts recommended it at the Nuxeo webinar, though he mentioned sizes around 1 GB.
Session.createDocument() creates a document and returns the document ID. Folder.createDocument() creates a document and returns a complete Document object. To do that, Folder.createDocument() needs one more round-trip to the server. If you just want to create a document and you are not interested in the document properties, or the document permissions, or the document renditions, etc., use the Session variant. It's faster.
The CMIS specification does not limit the document size. Some repositories support uploading a document of several GBs in one go. If such an upload fails, for example because of a connection problem, you have to repeat the complete upload, though. appendContentStream() allows uploading a document in chunks. If uploading a chunk fails, you only have to repeat the upload of that one chunk. Whether that makes sense depends on your application, your repository, and your network.
There is an appendContentStream() code example (maybe not a good one) in the OpenCMIS TCK:
https://svn.apache.org/viewvc/chemistry/opencmis/trunk/chemistry-opencmis-test/chemistry-opencmis-test-tck/src/main/java/org/apache/chemistry/opencmis/tck/tests/crud/SetAndDeleteContentTest.java?view=markup
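To make the chunk-and-retry idea concrete, here is a small language-neutral sketch in Python. It is not OpenCMIS code: send_chunk is a hypothetical callable standing in for whatever appends content on the repository side, and the chunk size and retry count are arbitrary assumptions. The is_last flag is only meant to mirror the "last chunk" indicator that a chunked append needs.

```python
import io


def upload_in_chunks(stream, send_chunk, chunk_size=4 * 1024 * 1024, max_retries=3):
    """Read stream in fixed-size chunks and pass each to send_chunk(data, is_last).

    send_chunk is a hypothetical callable; it should raise OSError when a
    transfer fails. Only the failing chunk is retried, never the whole
    upload. Returns the number of bytes sent.
    """
    sent = 0
    chunk = stream.read(chunk_size)
    while chunk:
        next_chunk = stream.read(chunk_size)  # look ahead to detect the last chunk
        for attempt in range(1, max_retries + 1):
            try:
                send_chunk(chunk, not next_chunk)
                break
            except OSError:
                if attempt == max_retries:
                    raise  # give up after max_retries failures on this chunk
        sent += len(chunk)
        chunk = next_chunk
    return sent
```

A caller would open the file in binary mode and pass it as stream; for a 350 MB document and the default chunk size, a dropped connection costs at most one 4 MB retransmission instead of the whole file.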

How can I alter the incoming documents on replication in CouchDB

I need to replicate in CouchDB data from one database to another but in the process I want to alter the documents being replicated over,
mostly stripping out particular fields (but other applications mentioned in comments).
The replication would always be 100% one way (but other applications mentioned in comments could use bi-directional and sync)
I would prefer if this process did not increment their revision ID but that might be asking for too much.
But I don't see any of the design document functions that do what I am trying to do.
As it seems doesn't do this, what plans are there for adding this? And meanwhile, what workarounds are there?
No, there is no out-of-the-box solution, as this would defeat the whole purpose of the multi-master, MVCC design.
The only option I can see here is to create your own solution, but I would not call this a replication, but rather ETL (Extract, Transform, Load). And for ETL there are tools available that will let you do the trick, like (mixing open source and commercial here):
Scriptella
CloverETL
Pentaho Data Integration, or to be more specific Kettle
Jaspersoft ETL
Talend have some tools as well
There are plenty more ETL tools on the market.
I believe the best approach here would be to break out the fields you want to filter out into a separate document and then filter out the document during replication.
Of course, the best way would be to have built-in support for this. A workaround that occurs to me is to replace the built-in replication with a custom replication that performs the additional alterations/transformations, while still using the other built-ins rather than going beneath them. With good coding, in many situations (especially if each master can push to its slaves), this could be nearly as efficient.
This requires efficient triggers on each source/master to detect any changes, which I believe CouchDB offers (or at least PouchDB appears to); these would then copy the changes to another location while applying the full alterations.
If the source of the change is unable to push the change to the final destination, this fixed store may have to be local to the source, where the destination can pull from it. That could get pretty expensive, especially in multi-master, as each location has to store and maintain not only its own data but also the data it sends to everyone else.
This replication would also place each source document's revision ID in the document's copy...
...which is ideal in any case, and essential if the copy is itself going to be updated (i.e. is a master, too).
...in the form of either:
ideally, the normal "_rev" property. This looks quite possible, since preserving revisions ("preserve their revisions ID") is already done by the normal replication algorithm using the built-in "Bulk Docs API", which our variant would seemingly use, too
otherwise, a new copy object (with its own _rev) plus another field such as "_rev_original" holding the original rev. But would that work?
Clearly such copy could be created no problem.
Probably no big deal if the destination is just reading the data.
Seems hairy if the destination is also writing the data. As we'd now have to merge with these non-standard revisions. But doable.
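A minimal sketch of the first option (keeping the normal "_rev"), in Python: it strips fields from each document and builds a body for POST /{db}/_bulk_docs with "new_edits" set to false, which asks CouchDB to store the documents with the revisions they already carry. The function names and the example field names are assumptions for illustration; fetching the changed documents from the source and sending the body to the target are left out.

```python
import json


def strip_fields(doc, fields):
    """Return a copy of doc without the given top-level fields.
    _id and _rev are always kept so the copy carries the source revision."""
    protected = {"_id", "_rev"}
    return {k: v for k, v in doc.items() if k in protected or k not in fields}


def bulk_docs_payload(docs, fields):
    """Build a JSON body for POST /{db}/_bulk_docs.
    "new_edits": false asks CouchDB to keep the supplied revisions
    instead of generating new ones."""
    return json.dumps(
        {"new_edits": False, "docs": [strip_fields(d, fields) for d in docs]},
        sort_keys=True,
    )
```

Note the trade-off: the stripped copy shares a revision ID with a source document whose content differs, so this only seems safe for strictly one-way, read-only targets, which matches the concern above about destinations that also write.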
Relevant to coding a custom/improved replication (to provide this apparently missing functionality, ideally without altering Pouch and especially Couch source code): as starter material, here's the normal Couch replication algorithm, which unfortunately doesn't clearly say that it only uses built-in operations, although it looks like it, and also the official overview of what it does. I suspect Pouch implements this, likely in Pouch's replicate.js (latest release as of 2014.07).
Further implementation particulars? Those who would know, please put them here.
This is a "community wiki" answer so please extend it.
Also, please comment with links to and details of anyone or any system already doing or trying to do this or something similar.

Should I use NSFileWrappers in UIManagedDocument?

I am trying to store a plist and several binary files (let's say images) as part of a UIManagedDocument. The names of the binary files are an attribute in Core Data and I don't need to enumerate them, just access the right one when showing the related entity.
The file structure that I want to have is:
- <File yyyyMMdd-HHmmss>.extdoc
  - StoreContent
    - persistentStore
  - AdditionalContent
    - ListStatus.plist (used to store per document defaults)
    - Images
      - uuid1.png
      - uuid2.png
      - ...
      - uuidn.png
So far, I have successfully followed the instructions in How do I save additional content into my UIManagedDocument file packages?, but when I try to add the binary files there are some things that I don't know how to do.
Should I treat the URL /the/path/File yyyyMMdd-HHmmss.extdoc/AdditionalContent (the default one provided with readAdditionalContentFromURL:error:) as an NSFileWrapper? Are there any advantages/disadvantages vs just using the URLs? I find the file wrapper more complicated to use: the plist has to be read using the file wrapper accessors and NSCoder (I guess), and for the files I have to store the file wrapper for the Images directory and then obtain the corresponding node with objectForKey (I assume). But Apple's Document-Based Apps Programming Guide for iOS, regarding custom formats instead of NSData or NSFileWrapper, states "Keep in mind that your code will have to duplicate what UIDocument does for you, and so you must deal with greater complexity and a greater possibility of error." Am I misunderstanding this?
Per document defaults are declared as properties: the setter modifies the NSDictionary that maps the plist and marks the document as updated, and the getter accesses the dictionary with the proper key. How do I expose the ability to read/write the binary files? Should I add a method to my subclass of UIManagedDocument? - (void)writeImage:(NSString*)uuid; and -(UIImage *)readImage:(NSString *)uuid; And should I keep this data in memory until the document is saved? How?
Assuming that NSFileWrapper is the way to go, if I plan to use this document with iCloud should I use file coordinators with the file wrapper? If so, how?
Any source code for each question will be greatly appreciated. Thank you.
P.S.: I know that I could save some binary data inside of Core Data, but I don't feel comfortable with that solution. Among other reasons, I'd rather store the PNG data for image files than a serialized version of UIImage that won't be compatible with NSImage if I want to create a desktop app.
I'd like to say that, in general I rather like UIManagedDocument. It has a few advantages over raw Core Data. For example, it sets up the entire core data stack for you automatically. It also sets up nested managed object contexts for you, so you get free background saving. None of that is particularly earth-shattering, but it's a lot of functionality from a tiny amount of code.
I haven't played around with saving additional information...but here are my thoughts.
First, you shouldn't need to treat the new URL as a file wrapper. You should just be able to do regular file operations on the provided URL. Just make sure you have everything implemented properly in additionalContentForURL:error:, writeAdditionalContent:toURL:originalContentsURL:error: and readAdditionalContentFromURL:error:. The read and write operations need to be symmetric. And you should probably snapshot your data in additionalContentForURL:error: so that everything will be saved in a known, good state (since the save operations are asynchronous).
As an alternative, have you considered using the Store in External Record File flag in your data model instead of saving it manually? This should force Core Data to (depending on the size of the binary data) automatically store them externally. I looked at the release notes, and I didn't see anything saying you couldn't use this feature with iCloud. That might be the easiest fix.
Attacking a side point for the moment (as I have not had ANY good experience with UIManagedDocument).
You can save the binary inside of Core Data for an iOS 5.0+ application using the external file reference. Then you can save the PNG of the image to Core Data directly and not need to worry about a UIManagedDocument or about bloating the sqlite file.
There is nothing stopping you from storing the PNG instead of a UIImage.
One other thought. You may need to use an NSFileCoordinator for the read and write operations. Technically, any read or write operations in the iCloud container need to use a file coordinator (to coordinate with the iCloud sync service--this prevents accidentally corrupting a file by reading it while another process is writing to it).
I know that UIDocument wraps most of its input and output methods automatically. I'd guess that these methods are similarly wrapped (since they give you a URL to use)--However, the docs aren't very clear.

How does one store history of edits effectively?

I was wondering how sites like Stack Overflow and Wikipedia store the history of edits indefinitely and allow users to roll back edits. Can someone recommend any resources/books/articles on how to do this using any suitable technology (such as databases)?
Thanks a lot!
There are a number of options, the simplest, of course, being to simply record all versions independently. For a site like Stack Overflow, where posts aren't usually edited very many times, this is appropriate. However for something like Wikipedia, one needs to be more clever to save space.
In the case of Wikipedia, pages are initially stored with each version separate, in the text table. Periodically, a number of older revisions are compressed together, then packed into a single field. Since there will be a lot of repetition, you save a lot of space this way.
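The space saving from packing similar revisions together is easy to see with a toy experiment. The Python sketch below uses zlib on made-up revisions; it only illustrates the effect of compressing repetitive revisions as one blob and makes no claim about the storage format Wikipedia actually uses.

```python
import zlib


def compressed_sizes(revisions):
    """Return (total size when each revision is compressed separately,
    size when all revisions are compressed together as one blob)."""
    separate = sum(len(zlib.compress(r)) for r in revisions)
    together = len(zlib.compress(b"\0".join(revisions)))
    return separate, together


# Twenty made-up revisions of one page, each differing by a short edit.
base = b"CouchDB stores documents as JSON and replicates them between nodes. " * 5
revisions = [base + b"Edit number %d." % i for i in range(20)]
separate, together = compressed_sizes(revisions)
print(separate, together)
```

Because every revision repeats most of the previous one, the compressor can encode later revisions as back-references to earlier ones, so the combined blob comes out smaller than the separately compressed pieces.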
You might also want to look into how some version control systems do it. For example, Subversion uses skip deltas, where revisions are stored as a difference from a revision halfway down the history. This means that one will have to examine at most O(lg n) revisions to reconstruct one's revision of interest.
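The logarithmic bound can be checked with a few lines of Python. The sketch assumes one common way of describing the skip-delta rule: the delta base of revision n is n with its lowest set bit cleared, and revision 0 is stored in full. It illustrates the idea only and is not a description of Subversion's exact on-disk behaviour.

```python
def skip_delta_base(n):
    """Base revision for revision n: n with its lowest set bit cleared."""
    return n & (n - 1)


def reconstruction_chain(n):
    """Revisions that must be read to rebuild revision n, ending at
    revision 0, which is assumed to be stored in full."""
    chain = [n]
    while n:
        n = skip_delta_base(n)
        chain.append(n)
    return chain


print(reconstruction_chain(54))  # [54, 52, 48, 32, 0]
```

Each step clears one bit, so the number of deltas to apply equals the number of 1 bits in n, which is at most about lg n.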
Git, on the other hand, uses something more similar to Wikipedia's approach.
Revisions are stored as individually compressed 'loose' objects at first, then periodically git takes all of the loose objects, sorts them according to a somewhat complex heuristic, then builds compressed deltas between 'nearby' objects and dumps the result as a packfile.
The number of revisions that need to be read to reconstruct a file is bounded by an argument to the pack building process. This has the interesting property that deltas can be built between objects that are unrelated, in some cases.

Deleting files securely in delphi7

I need to delete my input file securely once I have finished with it. At the moment I'm overwriting all the data with zeros. This is messy, as my temp folder becomes full of old files, and the names of the files are also a security issue.
Rather than moving them to the recycle bin, I would like them to skip it and just disappear, in addition to being wiped byte-wise, since data recovery software can recover items from beyond the recycle bin. As the name is also important, I need to rename them before I delete them.
This is a progressive problem. What is "secure" for one application is insecure for another. If security is really important and you find yourself asking these kinds of questions on Stack Overflow, then you most likely need to contract with an external security consultant. Examples of really important include financial information, medical records, or anything else where there is a law or contract requiring the securing of the data. I don't say this to be mean or imply that you are incapable of solving the problem, but to point out that this is a rather complex and evolving problem.
Basically to accomplish what you want to accomplish:
Once your code finishes, truncate the file to zero length; this makes recovery more difficult because the original file size is lost.
Then rename the file (RenameFile) to a different name.
Finally delete the file using DeleteFile, which does not move the file to the recycle bin.
Make sure you maintain an exclusive handle on the files the whole time they are on the disk too, or they can just be copied before they are deleted.
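The steps above can be sketched as follows. The question is about Delphi 7; this Python version is only meant to show the order of operations (overwrite, truncate, rename, delete), and the function name, the random overwrite, and the number of passes are assumptions for illustration. It does not hold an exclusive handle for the file's whole lifetime, and it gives no guarantee on journaling file systems or flash storage.

```python
import os
import secrets
import tempfile


def wipe_and_delete(path, passes=1):
    """Best-effort wipe: overwrite, truncate, rename, then delete."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            remaining = size
            while remaining > 0:
                block = min(remaining, 1024 * 1024)
                f.write(secrets.token_bytes(block))  # random data, not zeros
                remaining -= block
            f.flush()
            os.fsync(f.fileno())  # push the overwrite to the device
        f.truncate(0)  # hide the original file size
        f.flush()
        os.fsync(f.fileno())
    # Rename to a random name so the original name is not the one deleted.
    anonymous = os.path.join(os.path.dirname(path) or ".", secrets.token_hex(8))
    os.rename(path, anonymous)
    os.remove(anonymous)  # plain delete, bypasses any recycle bin


# Demo on a throwaway temp file.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"sensitive input" * 1000)
wipe_and_delete(demo_path)
```

In Delphi the same sequence would be built from a file stream plus RenameFile and DeleteFile, as named in the steps.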
As I said, this is a progressive problem. This is a really basic solution, and is subject to a number of vulnerabilities. So depending on the level of security needed you might consider never letting the file be written to disk, or using multiple pass overwrites. If security is really important, then actually burning the hard drive platter at a high temperature, and then smashing it is the only way to be sure.
Edit: It appears you removed your code sample.
There are third-party utilities to do this kind of thing from the command line. I found that PGP Command Line has this feature; if you search around you can probably find a free app that will do this from the command line. You could then just call the command from your app in order to securely delete the file.
I would say that if you insist on writing your own code to do this, then instead of using all 0's, write random bytes to the disk. And don't use the built-in C++ rand function; use a more secure random number generator.
As Jim McKeeth said, this is not something you want to do yourself if there are serious legal repercussions for getting it wrong.
Jim has described well the issues with solving your problem in code. The problem is indeed progressive, and any solution you implement will only approximate complete security without ever attaining it. So one thing to do is to decide exactly what you need to protect the file against (snooping family members? co-workers? corporate espionage? totalitarian governments?), then design your solution accordingly and document its limitations.
I have a somewhat orthogonal suggestion, though. Instead of, or in addition to, implementing secure wiping in code, you can require cooperation from users. For example, you can suggest (or require) that input files be stored on an encrypted volume. In corporate environments PGP Disk might be preferred, since it's a recognizable brand, while home users would be well served by the free and well-tested TrueCrypt. Both products support creating virtual encrypted volumes as well as encrypting whole partitions. This would go a long way toward keeping the names and contents of input files secure, even before you write a single line of code.
Deleting a file can be a touchy subject...
Depending on the needs of your customer, I would like to point to the data remanence phenomenon, which is the residual data left after a simple overwrite. Data erasure is a method of destroying that residual data.
There are a few standards on how to erase the residual data, DoD 5220.22-M is mostly referred to by "secure file delete" applications, but apparently the rules have changed.
"As of the June 2007 edition of the DSS C&SM, overwriting is no longer acceptable for sanitization of magnetic media; only degaussing or physical destruction is acceptable."
So what I'm saying is, try to get the rules which your customer has to follow.
Beware of "wear leveling" algorithms used with flash storage. To promote even wear, files are moved around on the drive, and this is invisible to your app and even to the operating system. So you can "secure delete" the file all you want, and you will only affect the most recent copy of the file; prior copies remain recoverable/discoverable with recovery software. The only way to solve that is to encrypt the file contents.
