How to prevent saving same named file on CouchDb? - c#-4.0

I am using CouchDB with Divan, a C# interfacing library for CouchDB.
A file can be uploaded to CouchDB many times. Each upload produces a new "id", while the "rev" remains the same.
This happens even if all of the custom attributes defined for the uploaded file match those of an existing file on CouchDB with the same name.
Is there any way to avoid uploading a file with the same name when all of its custom attributes are the same? Fetching all files and checking them for a repeated file name would be one way, but it is definitely not preferable, given the time it would take depending on other factors.
Thank you.

Let's say you have 3 attributes for a file:
name
size in bytes
date of modification
I see two main possibilities to avoid duplicates in your database.
Client approach
Using a view, query the database to check whether a document with the same attributes already exists. If it does not, create it.
User defined id
You could generate an id from the attributes, as this library does.
For example, if my document has these attributes:
"name":"test.txt",
"size":"512",
"lastModified":"2016-11-08T15:44:29.563Z"
You could build a unique id like this:
"_id":"test.txt/2016-11-08T15:44:29.563Z/512"

Related

How to store metadata with Ceph?

I want to store user files in CephFS. The problem is that I also need to store some metadata for these files (download date and verification status, for example), as well as the ability to sort by date or to fetch a given number of them. If I use MongoDB for the metadata, I have a synchronization problem (a file can be in the database but not in CephFS, or vice versa).
The file structure in CephFS is as follows:
/{user.id}/{Media collection name}/{media.id}
The media.id is a UUIDv4.
The idea I have:
Create a "meta" folder in which to put the metadata of the files by their id, but without the date the file was uploaded to CephFS. To access the date, use the data from CephFS (it stores the date a file was created or changed, just like any other file system(?)).
I didn't find information in the Ceph documentation that it stores such metadata as well, so I'm not sure whether this option would work.
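A minimal C# sketch of that "meta" folder idea, assuming the CephFS volume is mounted as an ordinary file system (the mount point and folder layout are illustrative). The upload date is never duplicated into the metadata; it is read from the file system itself, so it cannot drift out of sync with the file:

using System;
using System.IO;

class MediaStore
{
    const string Root = "/mnt/cephfs"; // illustrative CephFS mount point

    // Store the metadata as a sidecar JSON file in a parallel "meta" tree,
    // keyed by the same media id as the file itself.
    static void SaveMetadata(string userId, string collection, Guid mediaId, string metadataJson)
    {
        string metaDir = Path.Combine(Root, "meta", userId, collection);
        Directory.CreateDirectory(metaDir);
        File.WriteAllText(Path.Combine(metaDir, mediaId + ".json"), metadataJson);
    }

    // The upload date is not duplicated into the metadata: it is read from the
    // file system, which records creation/modification times for every file.
    static DateTime GetUploadDateUtc(string userId, string collection, Guid mediaId)
    {
        string mediaPath = Path.Combine(Root, userId, collection, mediaId.ToString());
        return File.GetCreationTimeUtc(mediaPath);
    }
}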

CouchDB document replication (updating specific attributes of a document)

I have an issue with replication and I need your help with it. In CouchDB replication, I want to reset/update some specific attributes of a document during replication, and these edited documents should then be saved in the replicated database without affecting the original ones. For example:
A document named Student with attributes id, name, class etc.
And I want to replicate this document in such a way that its name and class are reset/updated.
Will you please tell me how I can achieve this?
Thanks.
You can't update docs during replication.
But you can exclude docs from being replicated with the help of a CouchDB filter (e.g. preventing all docs with a revision higher than 1 from being replicated).
If you want to have multiple versions of the same dataset (e.g. to keep dataset revisions), you have to store them as separate docs that each have a unique id and a reference property like original: "UUID_of_the_original". (I use the term "dataset" instead of "doc" to make clear that CouchDB's internal doc revision handling is not involved.)
You can't use the CouchDB doc revision handling for that purpose (which is what many people think when they see the _rev property in the docs).
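Filters are JavaScript functions stored in a design document on the source database. A minimal sketch of the revision filter mentioned above (the design document and filter names are illustrative):

{
  "_id": "_design/replication",
  "filters": {
    "first_revision_only": "function(doc, req) { return parseInt(doc._rev, 10) <= 1; }"
  }
}

When triggering the replication, reference it as "filter": "replication/first_revision_only" in the body POSTed to _replicate; docs whose revision number is higher than 1 are then kept out of the target database.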

Keeping elasticsearch in sync with key or versioning

So I have a situation where I receive a lot of large XML files, and I want that data synchronized with Elasticsearch.
Current way
Have index_1
When data is updated, create a blank index_2
Load all of the latest data into index_2
Alias to index_2 and delete index_1
Proposed way
Have a synced.xml file which has been synchronized with Elasticsearch
When a new time-stamped XML file is available, compare it against synced.xml
If anything is new in the time-stamped XML file, add just that to ES
Rename the time-stamped XML file to synced.xml
This means that out of 500,000 items, I only have to add the 5,000 items that have changed, for example, rather than duplicating all 500,000 items.
Question
In a scenario like this, how do I ensure they stay synchronized? For example, what happens if Elasticsearch gets wiped; how can I tell my program that it needs to add the whole lot again? Is there a way to use some sort of synchronization key on Elasticsearch, or perhaps a better approach?
Here is what I recommend...
Add a stored field to your type to store a hash like MD5
Use Scan/Scroll to export the ID and hash from ES
In your backing dataset, export the ID and hash
Use something like MapReduce to "join" on the exported ids from each set
Where there are differences, via comparing the hash or finding missing keys, index/update
The hash is only useful if you want to detect document changes. This also assumes that either you persist ES's IDs back to your backing store or that you self-assign IDs.
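A minimal C# sketch of that recommendation, assuming both sides have already been exported into id-to-hash dictionaries (the MD5-over-serialized-item choice is illustrative; any stable hash works):

using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

class SyncDiff
{
    // Hash of the serialized item; store this in ES alongside the document.
    static string Md5Hex(string serializedItem)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(serializedItem));
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }

    // Join the backing store's id->hash map with the one exported from ES via
    // scan/scroll, returning the ids that must be indexed or updated.
    static List<string> IdsToIndex(Dictionary<string, string> source, Dictionary<string, string> es)
    {
        var toIndex = new List<string>();
        foreach (var pair in source)
        {
            string esHash;
            // Missing key: ES never saw the item (or was wiped). Different hash: it changed.
            if (!es.TryGetValue(pair.Key, out esHash) || esHash != pair.Value)
                toIndex.Add(pair.Key);
        }
        return toIndex;
    }
}

If Elasticsearch has been wiped, the es dictionary comes back empty, every id is returned, and the whole lot is re-indexed, which covers the recovery case from the question.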

Search keywords in files stored in mongodb

I have stored a .txt file in MongoDB using GridFS with node.js.
Can we store .pdf and other formats? When I tried to store a .pdf and retrieve the content on the console, it displayed the text in the document along with some junk values. I used this line to retrieve it: GridStore.read(db, id, function(err, fileData))
Is there any other, better way to do it?
Can we do a text search on the content of the files stored in MongoDB directly? If so, how can we do that?
Also, can you please tell me where the data of files stored in MongoDB lives, and in what format?
Any help in this regard will be great.
--Thanks
What you really seem to want here is "text search" capability, which in MongoDB requires you to store the "text" in a field or fields within your document. Putting "text" into MongoDB is very simple: you supply the "text" as the content of a field and MongoDB stores it, and the same goes for data of just about any other type, which is merely stored under the field you specify.
Before implementing that, though, let's talk about what GridFS actually is, what it is not, and why it is most certainly not what you think it is.
GridFS
GridFS is not software or a special function of MongoDB. It is in fact a specification, implemented by the available drivers, with the sole intent of enabling you to store content that exceeds the 16MB BSON document size limit.
For this purpose, the implementation uses two collections. By default these are named fs.files and fs.chunks, but they can be whatever you tell your driver implementation to use. These collections store what their default names indicate: fs.files holds the unique identifier and metadata for the "file", and fs.chunks holds the actual content, split into manageable chunks.
Here is a quick snippet of what happens to the data you send via the GridFS API as a document in the "chunks" collection:
{
"_id" : ObjectId("539fc66ac8b5e6dc058b4568"),
"files_id" : ObjectId("539fc66ac8b5e6dc058b4567"),
"n" : NumberLong(0),
"data" : BinData(2,"agQAADw/cGhwCgokZGJ....
}
For context, that data belongs to a "text" file I sent via the GridFS API functions. As you can see, despite the actual content being text, what is displayed here is the raw binary data in base64-encoded form.
This is in fact what the API functions do: they read the data you provide as a stream of bytes and submit that binary stream in manageable "chunks", so in all likelihood parts of your "file" will not be kept in the same document. Which is actually the point of the implementation.
To MongoDB itself these are just ordinary collections, and you can treat them as such for all general operations such as find, delete and update. The GridFS API spec, as implemented by your driver, gives you functions to "read" from all of those chunks and even return the data as if it were a file. But in fact it is just data in a collection, in binary format, and split across documents; none of that will help you perform a "search", as the content is neither "text" nor contained in a single document.
Text Search
So what you really seem to want here is "text search", to allow you to find the words you are looking for. If you want to store the "text" from a PDF file, for example, then you would need to extract that text externally and store it in documents. Or otherwise use an external text search system, which will do much the same.
For the MongoDB implementation, any extracted text would be stored in a document, or possibly several documents, covered by a "text index" that enables the search functionality. Basically you would do this on a collection like this:
db.collection.ensureIndex({ "content": "text" })
Once the field or "fields" on the documents in your collection are covered by a text index, you can actually search using the $text operator with .find():
db.collection.find({ "$text": { "$search": "word" } })
This form of query allows you to match documents on the terms you specify in your search and to also determine a relevance to your search and "rank" the documents accordingly.
More information can be found in the tutorials section on text search.
Combined
There is nothing stopping you from in fact taking a combined approach. Here you would store your original data using the GridFS API methods, and then store the extracted "text" in another collection that is aware of, and contains a reference to, the original fs.files document for your large text document or PDF file or whatever.
But you would need to extract the "text" from the original "documents" and store it within the MongoDB documents in your collection. Otherwise a similar approach can be taken with an external text search solution, where it is quite common to provide interfaces that can do things such as extract the text from things like PDF documents.
With an external solution, you would also send the reference to the GridFS form of the document, allowing the original content to be retrieved with another request from any search result, if delivering the original content is your intention.
So ultimately you see that the two methods are in fact for different things. You can build your own approach around "combining" the functionality, but "search" is for search and the "chunk" storage is for doing exactly what you want it to do.
Of course, if your content is always under 16MB, then just store it in a document as you normally would. But if that is binary data and not text, it is no good to you for search unless you explicitly extract the text.
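A minimal sketch of that combined approach, here assuming the official MongoDB .NET driver (the collection and field names extracted_texts, fileId and content are illustrative). The externally extracted text is indexed for $text search, and each document carries the _id of the corresponding fs.files entry so a hit can be turned back into the original file via the GridFS API:

using MongoDB.Bson;
using MongoDB.Driver;

class TextSearchStore
{
    static void Main()
    {
        var db = new MongoClient("mongodb://localhost:27017").GetDatabase("test");
        var texts = db.GetCollection<BsonDocument>("extracted_texts");

        // Text index over the extracted content, equivalent to the shell command above.
        texts.Indexes.CreateOne(new CreateIndexModel<BsonDocument>(
            Builders<BsonDocument>.IndexKeys.Text("content")));

        // One document per stored file: the externally extracted text plus a
        // reference back to the fs.files document that holds the raw bytes.
        texts.InsertOne(new BsonDocument
        {
            { "fileId", new ObjectId("539fc66ac8b5e6dc058b4567") }, // _id of the fs.files doc
            { "content", "text extracted from the PDF ..." }
        });

        // $text search; the fileId of each hit can be handed to the GridFS API
        // to retrieve the original content with another request.
        var hits = texts.Find(Builders<BsonDocument>.Filter.Text("word")).ToList();
        foreach (var hit in hits)
            System.Console.WriteLine(hit["fileId"]);
    }
}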

Redefine folder structure of document library with metadata

I have a problem with my SharePoint document library structure. Currently the document library consists of a folder/sub-folder structure used to store documents by category. Now our client wants to redefine this folder structure with a metadata structure.
Can anyone tell me how I can use metadata instead of the folder/sub-folder structure?
Any related articles or links will be appreciated.
Thanks
Sachin
As already stated, you need to use columns for the metadata, preferably through a new Content Type. After creating this Content Type, you need to attach it to the library and convert all documents to it. Lastly, you also need to modify the views of the library, e.g. depending on your metadata you might only want to display certain columns or filter them.
There is an excellent whitepaper from Microsoft on Content Types available here:
http://technet.microsoft.com/en-us/library/cc262729.aspx
You can also read more about content type planning on Technet:
http://technet.microsoft.com/en-us/library/cc262735.aspx
And here's some info about Views:
http://office.microsoft.com/en-us/sharepointtechnology/HA100215771033.aspx
You must define columns for the metadata fields you want to have, create a content type that includes these columns, and assign this content type to your documents.
You might also change the default view of your document library, or create a new view, to make the new metadata columns visible.
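A minimal sketch of those steps using the SharePoint server object model (the site URL, library name, column name, choices and content type name are all illustrative):

using System;
using Microsoft.SharePoint;

class LibraryMetadata
{
    static void Main()
    {
        using (SPSite site = new SPSite("http://sharepoint/sites/demo"))
        using (SPWeb web = site.OpenWeb())
        {
            // Site column that replaces the old folder level, e.g. the category.
            string fieldName = web.Fields.Add("DocCategory", SPFieldType.Choice, false);
            SPFieldChoice category = (SPFieldChoice)web.Fields.GetFieldByInternalName(fieldName);
            category.Choices.Add("Invoices");
            category.Choices.Add("Contracts");
            category.Update();

            // Content type based on Document that carries the new column.
            SPContentType ct = new SPContentType(
                web.AvailableContentTypes[SPBuiltInContentTypeId.Document],
                web.ContentTypes, "Categorized Document");
            ct.FieldLinks.Add(new SPFieldLink(category));
            web.ContentTypes.Add(ct);

            // Attach the content type to the library instead of using folders.
            SPList library = web.Lists["Documents"];
            library.ContentTypesEnabled = true;
            library.ContentTypes.Add(ct);
            library.Update();
        }
    }
}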
