I'm creating a RESTful API with Node, Express, and MongoDB, and the book I'm using as a reference recommends using GridFS (namely gridfs-stream) for cases where one needs to handle files larger than the MongoDB document size limit (16 MB).
I'm not sure if my app will ever need to handle files that size, but I'm wondering if there are cons to using it anyway in case I need that feature later.
Are there any cons (e.g. significant unnecessary performance penalties, stability issues) that I should be aware of to help make this decision?
I'm also open to suggestions for alternate file management solutions that you may have.
Thanks!
Don't use GridFS for small binary data.
GridFS requires two queries: one to fetch a file's metadata and one to fetch its contents. Therefore, if you use GridFS to store small files, you are doubling the number of queries that your application has to do. GridFS is basically a way of breaking up large binary objects for storage in the database.
GridFS is for storing big data: files larger than will fit in a single document. As a rule of thumb, anything that is too big to load all at once on the client is probably not something you want to load all at once on the server. Therefore, anything you're going to stream to a client is a good candidate for GridFS. Things that will be loaded all at once on the client, such as images, sounds, or even small video clips, should generally just be embedded in your main document.
Furthermore, if your files are all smaller than the 16 MB BSON document size limit, consider storing each file manually within a single document instead of using GridFS. You may use the BinData data type to store the binary data. See your driver's documentation for details on using BinData.
see https://docs.mongodb.com/manual/core/gridfs/
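To make the size rule concrete, here is a minimal Python sketch (not driver code) of the decision and of the kind of splitting GridFS performs. The function names and the overhead_bytes allowance are made up for illustration; the 16 MB limit and the 255 KB default chunk size are MongoDB's documented values.

```python
BSON_MAX_DOCUMENT_BYTES = 16 * 1024 * 1024  # MongoDB's per-document limit
CHUNK_BYTES = 255 * 1024                    # GridFS default chunk size


def choose_storage(size_bytes, overhead_bytes=1024):
    """Return "inline" when the payload fits in one document, else "gridfs".

    overhead_bytes is a hypothetical allowance for the other fields
    stored alongside the binary data in the same document.
    """
    if size_bytes + overhead_bytes <= BSON_MAX_DOCUMENT_BYTES:
        return "inline"
    return "gridfs"


def split_into_chunks(file_id, data, chunk_bytes=CHUNK_BYTES):
    """Mimic the GridFS layout: one metadata record plus numbered chunks."""
    chunks = [
        {"files_id": file_id, "n": n, "data": data[start:start + chunk_bytes]}
        for n, start in enumerate(range(0, len(data), chunk_bytes))
    ]
    metadata = {"_id": file_id, "length": len(data), "chunkSize": chunk_bytes}
    return metadata, chunks
```

Reading a file back means fetching the metadata record and then the chunks, which is where the two queries come from. For a 50 KB avatar that is pure overhead compared with a single BinData field.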
I would like someone to give me the difference between the session createDocument and folder createDocument methods.
Also, within this context, is there a sample showing how to use the document appendContentStream() method? I struggled to find an example online. I have a requirement where document sizes can be up to 300-350 MB, and I was keen to know more about appendContentStream() after Jeff Potts recommended it at the Nuxeo webinar, though he did mention sizes around 1 GB.
Session.createDocument() creates a document and returns the document ID. Folder.createDocument() creates a document and returns a complete Document object. To do that, Folder.createDocument() needs one more round-trip to the server. If you just want to create a document and you are not interested in the document properties, or the document permissions, or the document renditions, etc., use the Session variant. It's faster.
The CMIS specification does not limit the document size. Some repositories support uploading a document of several GBs in one go. If such an upload fails, for example if there is a connection problem, you have to repeat the complete upload, though. appendContentStream() allows uploading a document in chunks. If uploading a chunk fails, you only have to repeat the upload of that one chunk. If that makes sense depends on your application, your repository, and your network.
There is an appendContentStream() code example (maybe not a good one) in the OpenCMIS TCK:
https://svn.apache.org/viewvc/chemistry/opencmis/trunk/chemistry-opencmis-test/chemistry-opencmis-test-tck/src/main/java/org/apache/chemistry/opencmis/tck/tests/crud/SetAndDeleteContentTest.java?view=markup
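The retry-one-chunk idea can also be sketched independently of any CMIS client. The Python below is only an illustration: append_chunk is a hypothetical callable standing in for an appendContentStream() call (which takes a content stream and an "is last chunk" flag), and the chunk size and attempt count are arbitrary assumptions.

```python
def upload_in_chunks(stream, append_chunk, chunk_bytes=4 * 1024 * 1024, max_attempts=3):
    """Read stream in chunks and send each one with append_chunk(data, is_last).

    append_chunk is a hypothetical stand-in for a CMIS appendContentStream
    call; it is expected to raise OSError when an upload fails.
    Only the failing chunk is retried, never the chunks already sent.
    """
    sent = 0
    current = stream.read(chunk_bytes)
    while current:
        # Read one chunk ahead so we know whether the current one is the last.
        following = stream.read(chunk_bytes)
        is_last = not following
        for attempt in range(1, max_attempts + 1):
            try:
                append_chunk(current, is_last)
                break
            except OSError:
                if attempt == max_attempts:
                    raise
        sent += len(current)
        current = following
    return sent
```

With a 350 MB document and 4 MB chunks, a dropped connection costs one 4 MB retry rather than a full re-upload. Whether the repository supports appending at all is something to check first.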
I am building a personal project in Node, and I want my users to be able to upload 'cover' photos. The approach I am using right now is to save these images to my filesystem and just add the path to that image in a MongoDB database. So if a user adds an image, I add that image to my images folder with the name, let's say, userID.jpg, and I save "/public/images/userID.jpg" as a string in my database. I have a feeling that this approach might not be the most efficient. Should I be saving it directly to the database? What are the advantages or disadvantages?
Storing your images as files is actually pretty efficient (unless we're talking about tens of thousands of images, and even then it may hold up). Also, on the serving side, the images would probably be handled by something like express.static() (assuming that you're using Express), which is also quite lightweight.
However, if you eventually want to be a bit more scalable, you can take a look at using GridFS, which implements a file-system like storage on top of MongoDB. I use gridfs-stream for something similar (uploads/downloads) and it works fine.
If the images are small enough (smaller than about 16MB, which is the size limit for BSON documents), you might not even need to use GridFS and just store the images as Binary type.
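For reference, the file-plus-path approach from the question can be sketched in a few lines. This is Python rather than Node, purely as an illustration; save_cover and its parameters are hypothetical names, and user_id is assumed to be a trusted value generated by the app (not raw user input).

```python
from pathlib import Path


def save_cover(images_dir, user_id, image_bytes, extension="jpg"):
    """Write the image to images_dir and return the path to store in the database.

    The file is written under a temporary name first and then renamed,
    so a reader never sees a half-written image.
    """
    images_dir = Path(images_dir)
    images_dir.mkdir(parents=True, exist_ok=True)
    target = images_dir / f"{user_id}.{extension}"
    temporary = target.with_name(target.name + ".tmp")
    temporary.write_bytes(image_bytes)
    temporary.replace(target)  # atomic when both names are on the same filesystem
    return target.as_posix()
```

The returned string is what goes into the user's document. The database stays small, and the web server can serve the file directly.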
I am trying to store a plist and several binary files (let's say images) as part of an UIManagedDocument. The name of the binary files are an attribute in Core Data and I don't need to enumerate them, just access the right one when showing the related entity.
The file structure that I want to have is:
- <File yyyyMMdd-HHmmss>.extdoc
  - StoreContent
    - persistentStore
  - AdditionalContent
    - ListStatus.plist (used to store per document defaults)
    - Images
      - uuid1.png
      - uuid2.png
      - ...
      - uuidn.png
So far, I have successfully followed the instructions in How do I save additional content into my UIManagedDocument file packages?, but when I try to add the binary files there are some things that I don't know how to do.
Should I treat the URL /the/path/File yyyyMMdd-HHmmss.extdoc/AdditionalContent (the default one provided with readAdditionalContentFromURL:error:) as a NSFileWrapper? Are there any advantages/disadvantages vs just using the URLs? I find it more complicated to use the file wrapper, since the plist has to be read using the file wrapper accessors and NSCoder (I guess), and the files, I have to store the file wrapper for the Images directory and then obtain the corresponding node with objectForKey (I assume). But Apple's Document-Based Apps Programming Guide for iOS regarding custom formats instead of NSData or NSFileWrapper, states "Keep in mind that your code will have to duplicate what UIDocument does for you, and so you must deal with greater complexity and a greater possibility of error." Am I misunderstanding this?
Per document defaults are declared as properties: the setter modifies the NSDictionary that maps the plist and marks the document as updated, and the getter accesses the dictionary with the proper key. How do I expose the ability to read/write the binary files? Should I add a method to my subclass of UIManagedDocument? - (void)writeImage:(NSString*)uuid; and -(UIImage *)readImage:(NSString *)uuid; And should I keep this data in memory until the document is saved? How?
Assuming that NSFileWrapper is the way to go, if I plan to use this document with iCloud should I use file coordinators with the file wrapper? If so, how?
Any source code for each question will be greatly appreciated. Thank you.
P.S.: I know that I could save some binary data inside of Core Data, but I don't feel comfortable with that solution. Among other reasons, I'd rather store the PNG data for image files than a serialized version of UIImage that won't be compatible with NSImage if I want to create a desktop app.
I'd like to say that, in general I rather like UIManagedDocument. It has a few advantages over raw Core Data. For example, it sets up the entire core data stack for you automatically. It also sets up nested managed object contexts for you, so you get free background saving. None of that is particularly earth-shattering, but it's a lot of functionality from a tiny amount of code.
I haven't played around with saving additional information...but here are my thoughts.
First, you shouldn't need to treat the new URL as a file wrapper. You should just be able to do regular file operations on the provided URL. Just make sure you have everything implemented properly in additionalContentForURL:error:, writeAdditionalContent:toURL:originalContentsURL:error: and readAdditionalContentFromURL:error:. The read and write operations need to be symmetric. And you should probably snapshot your data in additionalContentForURL:error: so that everything will be saved in a known, good state (since the save operations are asynchronous).
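To illustrate what "snapshot, then symmetric read and write" means, here is a language-neutral sketch in Python rather than Objective-C. None of these functions are Cocoa API. They only mirror the roles of the three methods named above, using the ListStatus.plist and Images layout from the question. The copy taken by snapshot() is the part that protects against edits made while an asynchronous save is still pending.

```python
import copy
import plistlib
from pathlib import Path


def snapshot(defaults, images):
    """Copy the in-memory state so later edits cannot affect a pending save."""
    # images maps uuid -> PNG bytes; bytes are immutable, so a shallow copy is enough.
    return {"defaults": copy.deepcopy(defaults), "images": dict(images)}


def write_additional_content(content, directory):
    """Write the snapshot: one plist plus one PNG file per image."""
    directory = Path(directory)
    (directory / "Images").mkdir(parents=True, exist_ok=True)
    with open(directory / "ListStatus.plist", "wb") as handle:
        plistlib.dump(content["defaults"], handle)
    for uuid, png_bytes in content["images"].items():
        (directory / "Images" / f"{uuid}.png").write_bytes(png_bytes)


def read_additional_content(directory):
    """Mirror of write_additional_content: same layout, same keys."""
    directory = Path(directory)
    with open(directory / "ListStatus.plist", "rb") as handle:
        defaults = plistlib.load(handle)
    images = {p.stem: p.read_bytes() for p in (directory / "Images").glob("*.png")}
    return {"defaults": defaults, "images": images}
```

Whatever the write side produces, the read side must be able to consume unchanged. A round trip through both functions should give back exactly the snapshot.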
As an alternative, have you considered using the Store in External Record File flag in your data model instead of saving it manually? This should force Core Data to (depending on the size of the binary data) automatically store them externally. I looked at the release notes, and I didn't see anything saying you couldn't use this feature with iCloud. That might be the easiest fix.
Attacking a side point for the moment (as I have not had ANY good experience with UIManagedDocument).
You can save the binary inside of Core Data for an iOS 5.0+ application using the external file reference. Then you can save the PNG of the image to Core Data directly and not need to worry about a UIManagedDocument or about bloating the sqlite file.
There is nothing stopping you from storing the PNG instead of a UIImage.
One other thought. You may need to use an NSFileCoordinator for the read and write operations. Technically, any read or write operations in the iCloud container need to use a file coordinator (to coordinate with the iCloud sync service--this prevents accidentally corrupting a file by reading it while another process is writing to it).
I know that UIDocument wraps most of its input and output methods automatically. I'd guess that these methods are similarly wrapped (since they give you a URL to use)--However, the docs aren't very clear.
I am trying to build excerpts for each document returned as a search result on my website. I am using the Sphinx search engine and the Apache web server on Linux CentOS. The function within the Sphinx API that I'd like to use is called BuildExcerpts. This function requires you to pass an array of strings where each string contains a document's contents.
I'm wondering what the best practice is for retrieving the document contents in real time as I serve the results on the web. Currently, these documents are in text files on my system, spread across multiple drives. There are roughly 100 million of them and they take up a few terabytes of space.
It's easy for me to call something like file_get_contents(), but that feels like the wrong way to do this. My databases are already gigantic ( 100GB+ ) and I don't particularly want to throw the document contents in there along with the document attributes that already exist. Perhaps this is the best way to do this, however.
Suggestions?
Well, the source needs to be fetched from somewhere. If you don't want to duplicate it in your database, then you will need to fetch it from the filesystem (using file_get_contents() or similar).
However, the BuildExcerpts function does give you one extra option, "load_files":
... then sphinx will read the data from the filename for you.
What problem are you experiencing with reading it from files? Is it too slow? If so maybe use some caching in front - using memcache maybe.
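If reading from disk does turn out to be the bottleneck, the caching idea looks roughly like this. The sketch is in Python with an in-process LRU cache standing in for memcache; the function names and the cache size are assumptions, and a shared cache such as memcache would be needed for the entries to be visible across Apache worker processes.

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def read_document(path):
    """Return the document text, reading from disk only on a cache miss."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read()


def documents_for_excerpts(paths):
    """Collect the contents of the matched documents, in result order."""
    return [read_document(path) for path in paths]
```

The list returned by documents_for_excerpts() is what would be handed to BuildExcerpts. Only the documents that actually appear in popular result pages end up cached, which matters when the full corpus is a few terabytes.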
tl;dr: Should I store directories in CouchDB as a list of attachments, or as a single tar?
I've been using CouchDB to store project documents. I just create documents via Futon and upload them directly from there. I've also written a script to bulk-upload directories. I am using it like a basic content repository. I replicate it, so other people on my team have a copy of the repository.
I noticed that saving directories as a series of files seems to have a lot of storage overhead, so instead I upload a .tar.gz file containing the directory. This does significantly reduce the size of the document but now any change to the directory requires replicating the entire tarball.
I am looking for thoughts or perspective on the matter.
It really depends on what you want to achieve. I will try to provide some options for you to consider.
Storing one tar.gz will save you space, but it does make it harder to work with. If you are simply archiving it may work for you.
Storing all the attachments on one document works well for couchapps. The workflow is that you mess around with attachments until you are ready to release the application; then there is not a lot of overhead for replication, because it is usually a one-time event. It is nice that they are on one document because they all move/replicate as one bundle. The downsides of this approach for a content management system are that you can accumulate a lot of history baggage that you have to compact on your local couch. You will also get a lot of conflicts during replication between couches, and couch will keep conflicts around for you to resolve. Therefore, if you choose this model, you should compact frequently to reduce disk size.
For a content management system, I might recommend using one document per attachment. That would give you fewer conflicts. There will be a slight overhead, as each doc will have some space allocated for the doc itself, but the savings from not having to do frequent compaction and/or conflict resolution will outweigh it.
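A bulk-upload script for the one-document-per-attachment model could build its documents along these lines. This is a Python sketch under assumptions: documents_for_directory is a made-up name, using the relative path as the _id is just one possible convention, and the actual HTTP upload (for example a POST to _bulk_docs) is left out. The inline "_attachments" shape with "content_type" and base64 "data" is CouchDB's documented format.

```python
import base64
import mimetypes
from pathlib import Path


def documents_for_directory(root):
    """Build one CouchDB-style document per file under root.

    The _id is the file's path relative to root, so re-uploading a
    changed file only touches that one document.
    """
    root = Path(root)
    documents = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        relative = path.relative_to(root).as_posix()
        content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        documents.append({
            "_id": relative,
            "_attachments": {
                path.name: {
                    "content_type": content_type,
                    "data": base64.b64encode(path.read_bytes()).decode("ascii"),
                }
            },
        })
    return documents
```

Because each file lives in its own document, two people editing different files never conflict with each other, and a one-line change replicates one small document instead of a whole tarball. Note that an _id containing "/" has to be URL-encoded when the document is addressed over HTTP.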
Hope that gives you some options to weigh out.