I have the following code (error checking removed to keep it concise), which uses node-mongodb-native:
var mongo = require('mongodb').MongoClient;
var grid = require('mongodb').GridStore;
var fs = require('fs');

var url = 'mongodb://localhost:27017/mydatabase';

mongo.connect(url, function(err, db) {
    var gs = new grid(db, 'myfile.txt', 'w', {
        "metadata": {
            // metadata here
        }
    });
    gs.open(function(err, store) {
        gs.writeFile('~/myfile.txt', function(err, doc) {
            fs.unlink(req.files.save.path, function (err) {
                // error checking etc
            });
        });
    });
});
If I run that once it works fine and stores the file in GridFS.
Now, if I delete that file on my system and create a new one with the same name but different contents, and run it through that code again, it uploads it. However, it seems to overwrite the file that is already stored in GridFS. The _id stays the same, but the md5 has been updated to the new value. So even though the file is different, because the name is the same it overwrites the current file in GridFS.
Is there a way to upload two files with the same name? If _id is unique, why does the driver overwrite the file based on file name alone?
I found a similar issue on GitHub, but I am using the latest version of the driver from npm and it does what I explained above.
Like a real filesystem, the filename becomes the logical key in GridFS for reading and writing. You cannot have two files with the same name.
You'll need to come up with a secondary index of some sort or a new generated file name.
For example, add a timestamp to the file name.
Or, create another collection that maps generated file names to the GridFS structure to whatever it is that you need.
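For instance, a minimal sketch of the timestamp approach (the makeUniqueName helper is just an illustration, not part of the driver; the original name is kept in metadata so it can still be found later):
var mongo = require('mongodb').MongoClient;
var grid = require('mongodb').GridStore;

// Hypothetical helper: prepend a timestamp so every upload gets a distinct GridFS name.
function makeUniqueName(originalName) {
    return Date.now() + '-' + originalName;
}

mongo.connect('mongodb://localhost:27017/mydatabase', function(err, db) {
    var storedName = makeUniqueName('myfile.txt');
    var gs = new grid(db, storedName, 'w', {
        // Keep the original name in metadata so you can query for it later.
        "metadata": { originalName: 'myfile.txt' }
    });
    gs.open(function(err, store) {
        gs.writeFile('~/myfile.txt', function(err, doc) {
            // doc describes the stored file; look it up later via metadata.originalName
        });
    });
});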
To avoid having to create additional unique identifiers for your files, you should omit the 'write mode' option. This will allow GridFS to create a new file even if it contains the exact same data.
'w' overwrites data, which is why you are overwriting the existing file.
http://mongodb.github.io/node-mongodb-native/api-generated/gridstore.html
Related
I'm working with a MEAN application, and one of its functions is to upload CSV files, convert them to JSON, and persist them in a MongoDB database. Every month I receive a CSV with new records and with records that already exist in the database (with new information or not). In short, I need to update many documents and create them if they don't exist. My question is: what is the best way to do this, given that these files are very large?
The current version just creates these records like this:
Patient.create(records, function(err, records) {
    if (err) {
        console.log(err);
        return res.send(err);
    }
    res.json(records);
});
You can do this by following some simple steps:
In Node.js, use a CSV parser to convert the CSV into JSON.
Then take all the data from the collection and compare it with the new data.
For comparing the data you can use the lodash library; its methods are really fast.
Once you have the _ids of the documents that are new or need to be updated, you can use the query below (a fuller sketch of the whole flow follows at the end of this answer):
db.collectionName.update({"_id": {$in: ids}}, {$set: {"key": "value"}}, {upsert: true, multi: true}, function(err, doc) {
    console.log(doc);
});
Don't forget to set upsert: true, because you also have new data.
Hope it helps!
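Putting those steps together, here is a rough sketch of the whole flow. It assumes the csv-parse and lodash packages and the Patient Mongoose model from the question; the patientId key is a placeholder for whatever field identifies a record in your CSV:
var fs = require('fs');
var parse = require('csv-parse/lib/sync');   // assumption: using the csv-parse package
var _ = require('lodash');

// 1. Parse the monthly CSV into an array of JSON objects.
var records = parse(fs.readFileSync('patients.csv'), { columns: true });

// 2. Load the existing documents and split new vs. existing with lodash.
Patient.find({}).lean().exec(function(err, existing) {
    if (err) return console.log(err);

    var existingIds = _.map(existing, 'patientId');
    var toInsert = _.filter(records, function(r) { return !_.includes(existingIds, r.patientId); });
    var toUpdate = _.filter(records, function(r) { return _.includes(existingIds, r.patientId); });

    // 3. Create the new documents, then upsert each changed one by its key.
    Patient.create(toInsert, function(err) {
        if (err) return console.log(err);
        toUpdate.forEach(function(r) {
            Patient.update({ patientId: r.patientId }, { $set: r }, { upsert: true }, function(err) {
                if (err) console.log(err);
            });
        });
    });
});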
I have a file storage service built in Node/Meteor which utilizes GridFS, and it is replicated across several containers. What I'm currently trying to find out is whether this piece of code is actually aware of read/write consistency:
db.command({
    filemd5: someFileId,
    root: 'fs'
}, function callback(err, results) {
    ...
})
I'm uploading the file in chunks, and after all chunks are merged into a single file that command is executed. I have a feeling that it's using secondary members (I got a couple of md5 values that are those of an empty file - d41d8cd98f00b204e9800998ecf8427e). Is there any documentation or additional settings for it?
Those 2 params are the only options described in the docs: https://docs.mongodb.com/manual/reference/command/filemd5/
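For reference, the driver's db.command accepts a read preference in its options, so one way to rule out stale secondaries is to force the command onto the primary. A rough sketch, assuming the 2.x Node.js driver and the same someFileId as above:
var ReadPreference = require('mongodb').ReadPreference;

db.command({
    filemd5: someFileId,
    root: 'fs'
}, { readPreference: ReadPreference.PRIMARY }, function(err, results) {
    // results.md5 is now computed by the primary, which has seen all acknowledged chunk writes
    console.log(err, results);
});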
UPDATE
The exact code for merging the chunks is here in a 3rd party package:
cursor = files.find(
   {
      'metadata._Resumable.resumableIdentifier': file.metadata._Resumable.resumableIdentifier
      length:
         $ne: 0
   },
   {
      fields:
         length: 1
         metadata: 1
      sort:
         'metadata._Resumable.resumableChunkNumber': 1
   }
)
https://github.com/vsivsi/meteor-file-collection/blob/master/src/resumable_server.coffee#L26
And then there are lines 111-119, which execute filemd5 first and then run an update on the file:
@db.command md5Command, (err, results) ->
   if err
      lock.releaseLock()
      return callback err
   # Update the size and md5 to the file data
   files.update { _id: fileId }, { $set: { length: file.metadata._Resumable.resumableTotalSize, md5: results.md5 }},
      (err, res) =>
         lock.releaseLock()
         callback err
https://github.com/vsivsi/meteor-file-collection/blob/master/src/resumable_server.coffee#L111-L119
After writing the last chunk, the cursor = files.find() with all the merging logic is launched, hence if the read preference is secondaryPreferred, the chunks might not be there yet? Should that code be refactored to use the primary only?
GridFS creates 2 collections: files and chunks.
A typical files entry looks like the following:
{
    "_id" : ObjectId("58cfbc8b6900bb31c7b1b8d9"),
    "length" : 4,
    "chunkSize" : 261120,
    "uploadDate" : ISODate("2017-03-20T11:27:07.812Z"),
    "md5" : "d3b07384d113edec49eaa6238ad5ff00",
    "filename" : "foo.txt"
}
The filemd5 administrative command should simply return the md5 field of the relevant file document (and the number of chunks).
files.md5
An MD5 hash of the complete file returned by the filemd5 command. This value has the String type.
source: GridFS docs
It should represent the full file's hash, or at least the hash of the file as it was originally saved.
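To make that concrete, here is a rough way to look at both values side by side from Node, assuming the default 'fs' root and a fileId you already have:
// Run the administrative command...
db.command({ filemd5: fileId, root: 'fs' }, function(cmdErr, result) {
    // ...and read the md5 that was stored in the files collection on upload.
    db.collection('fs.files').findOne({ _id: fileId }, function(findErr, fileDoc) {
        console.log('filemd5 command:', result.md5);
        console.log('stored files.md5:', fileDoc.md5);
        // An empty-file hash (d41d8cd98f00b204e9800998ecf8427e) from the command suggests
        // the member answering it had not seen the chunks yet.
    });
});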
What is the ‘md5’ field of a files collection document and how is it used?
‘md5’ holds an MD5 checksum that is computed from the original contents of a user file. Historically, GridFS did not use acknowledged writes, so this checksum was necessary to ensure that writes went through properly. With acknowledged writes, the MD5 checksum is still useful to ensure that files in GridFS have not been corrupted. A third party directly accessing the 'files' and ‘chunks’ collections under GridFS could, inadvertently or maliciously, make changes to documents that would make them unusable by GridFS. Comparing the MD5 in the files collection document to a re-computed MD5 allows detecting such errors and corruption. However, drivers now assume that the stored file is not corrupted, and applications that want to use the MD5 value to check for corruption must do so themselves.
source: GridFS spec
If it is updated in such a way that the driver's mongoc_gridfs_file_save is not used (for example, streaming), the md5 field will not be updated.
Actually, further reading the spec:
Why store the MD5 checksum instead of creating the hash as-needed?
The MD5 checksum must be computed when a file is initially uploaded to GridFS, as this is the only time we are guaranteed to have the entire uncorrupted file. Computing it on-the-fly as a file is read from GridFS would ensure that our reads were successful, but guarantees nothing about the state of the file in the system. A successful check against the stored MD5 checksum guarantees that the stored file matches the original and no corruption has occurred.
And that is what we are doing. Only the mongoc_gridfs_file_save will calculate a md5 sum for the file and store it. Any other entry points, such as streaming, expect the user having created all the supporting mongoc_gridfs_file_opt_t and properly calculating the md5
source: JIRA issue
I have a use case where I have to remove a subset of entities stored in couchbase, e.g. removing all entities with keys starting with "pii_".
I am using the NodeJS SDK, but there is only one remove method, which takes one key at a time: http://docs.couchbase.com/sdk-api/couchbase-node-client-2.0.0/Bucket.html#remove
In some cases thousands of entities need to be deleted, and it takes a very long time if I delete them one by one, especially because I don't keep a list of keys in my application.
I agree with @ThinkFloyd when he says: a delete on the server should be a delete on the server, rather than requiring three steps like getting the data from the server, iterating over it on the client side, and finally firing a delete on the server again for each record.
In this regard, I think old-fashioned RDBMSs were better: all you need to do is DELETE FROM table WHERE something = something.
Fortunately, there is something similar to SQL available in Couchbase, called N1QL (pronounced "nickel"). I am not aware of the JavaScript (and other language) syntax, but this is how I did it in Python (a rough Node.js equivalent is sketched at the end of this answer).
Query to be used: DELETE FROM <bucketname> b WHERE META(b).id LIKE "<prefix>%"
layer_name_prefix = cb_layer_key + "|" + "%"
query = ""
try:
    query = N1QLQuery('DELETE from `test-feature` b where META(b).id LIKE $1', layer_name_prefix)
    cb.n1ql_query(query).execute()
except CouchbaseError, e:
    logger.exception(e)
To achieve the same thing, an alternative query could be as below, if you are storing 'type' and/or other metadata like 'parent_id':
DELETE from <bucket_name> where type='Feature' and parent_id=8;
But I prefer to use the first version of the query as it operates on the key, and I believe Couchbase must have some internal index to operate/query faster on keys (and other metadata).
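For completeness, a rough Node.js equivalent of the same prefix delete, assuming the 2.x Node SDK and that a primary index (or a suitable index on META().id) exists on the bucket; the bucket name and prefix are placeholders:
var couchbase = require('couchbase');
var N1qlQuery = couchbase.N1qlQuery;

var cluster = new couchbase.Cluster('couchbase://localhost');
var bucket = cluster.openBucket('test-feature');

// Same idea as the Python version: delete every document whose key starts with the prefix.
var query = N1qlQuery.fromString('DELETE FROM `test-feature` b WHERE META(b).id LIKE $1');
bucket.query(query, ['pii_%'], function(err, rows) {
    if (err) return console.error(err);
    console.log('Deleted all documents whose keys start with pii_');
});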
The best way to accomplish this is to create a Couchbase view by key and then range query over that view via your NodeJS code, making deletes on the results.
http://docs.couchbase.com/admin/admin/Views/views-querySample.html
http://docs.couchbase.com/couchbase-manual-2.0/#couchbase-views-writing-querying-selection-partial
http://docs.couchbase.com/sdk-api/couchbase-node-client-2.0.8/ViewQuery.html
For example, your Couchbase view could look like the following:
function(doc, meta) {
emit(meta.id, null);
}
Then in your NodeJS code, you could have something that looks like this:
var couchbase = require('couchbase');
var ViewQuery = couchbase.ViewQuery;
var query = ViewQuery.from('designdoc', 'by_id');
query.range("pii_", "pii_" + "\u0000", false);
var myBucket = myCluster.openBucket();
myBucket.query(query, function(err, results) {
for(i in results) {
// Delete code in here
}
});
Of course your Couchbase design document and view will be named differently than the example that I gave, but the important part is the ViewQuery.range function that was used.
All document ids prefixed with pii_ would be returned, in which case you can loop over them and start deleting.
I want to update an existing document in CouchDB. I have an image and I want to add it to an existing document in the db without losing the previous fields.
I'm using nodejs with nano.
Thanks, this put me in the right direction. In the end I did it this way:
var fs = require('fs');

db.get(id, { revs_info: true }, function (error, objeto) {
    if (error) {
        return console.log('wrong id');
    }
    fs.readFile('image.jpg', function (err, data) {
        if (!err) {
            db.attachment.insert(id, 'imagen.jpg', data, 'image/jpg', { rev: objeto._rev }, function (err, body) {
                if (!err) {
                    console.log(body);
                }
            });
        }
    });
});
Your question is not really clear about the specific problem, so here is just some general guidance on updating documents.
When designing the database, make sure you set the ID yourself rather than letting CouchDB generate it. This way you can access the document directly when updating it.
When updating, you are required to prove that you are updating the most recent version of the document. I usually retrieve the document first to make sure I have the most recent '_rev' in the document I'll insert.
Finally, the update may fail if a different process has edited the document in the time between retrieving and updating it, so you should catch a failure on the insert and repeat the process until you succeed (see the sketch below).
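A minimal sketch of that retrieve/modify/retry loop with nano; applyChanges is a hypothetical callback standing in for whatever changes you want to make, and myimageurl is a placeholder as in the snippet further down:
function updateWithRetry(db, id, applyChanges, callback) {
    db.get(id, function (err, doc) {
        if (err) return callback(err);
        applyChanges(doc);                      // mutate the freshly fetched doc; _id and _rev stay intact
        db.insert(doc, function (err, body) {
            if (err && err.statusCode === 409) {
                // Conflict: another process updated the doc in the meantime, so fetch and retry.
                return updateWithRetry(db, id, applyChanges, callback);
            }
            callback(err, body);
        });
    });
}

// Usage: add an image URL field without touching the other fields.
updateWithRetry(db, 'some-doc-id', function (doc) { doc.mynewimage = myimageurl; }, function (err, body) {
    if (err) console.log(err);
});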
That being said, there are two ways you can store an image:
As an attachment: I believe nano supports the attachment.insert() and attachment.get() functions to do so.
As a reference: I would usually rather store the images elsewhere and just store the URL or file path to access them. I've not used nano much, but I believe you can do this with something like the below:
db.get(docname, function (err, doc) {          // get the document with all existing fields
    if (err) return console.log(err);
    doc.mynewimage = myimageurl;               // add the new image (this assumes doc is a plain object)
    db.insert(doc, function (err, body) {      // the _id (= docname) and _rev from the fetched doc make this an update
        if (err) console.log(err);
    });
});
The easy way to add or update files in JGit is like this:
git.add().addFilepattern(file).call()
But that assumes that the file exists in the Git working directory.
If I have a multi-threaded setup (using Scala and Akka), is there a way to work only on a bare repository, writing the data directly to JGit, avoiding having to first write the file in the working directory?
For getting the file, that seems to work with:
git.getRepository().open(objId).getBytes()
Is there something similar for adding or updating files?
"Add" is a high-level abstraction that places a file in the index. In a bare repository, you lack an index, so this is not a 1:1 correspondence between the functionality. Instead, you can create a file in a new commit. To do this, you would use an ObjectInserter to add objects to the repository (one per thread, please). Then you would:
Add the contents of the file to the repository, as a blob, by inserting its bytes (or providing an InputStream).
Create a tree that includes the new file, by using a TreeFormatter.
Create a commit that points to the tree, by using a CommitBuilder.
For example, to create a new commit (with no parents) that contains only your file:
ObjectInserter repoInserter = repository.newObjectInserter();
ObjectId commitId;
try
{
    // Add a blob to the repository
    ObjectId blobId = repoInserter.insert(Constants.OBJ_BLOB, "Hello World!\n".getBytes());

    // Create a tree that contains the blob as file "hello.txt"
    TreeFormatter treeFormatter = new TreeFormatter();
    treeFormatter.append("hello.txt", FileMode.REGULAR_FILE, blobId);
    ObjectId treeId = treeFormatter.insertTo(repoInserter);

    // Create a commit that contains this tree
    CommitBuilder commit = new CommitBuilder();
    PersonIdent ident = new PersonIdent("Me", "me@example.com");
    commit.setCommitter(ident);
    commit.setAuthor(ident);
    commit.setMessage("This is a new commit!");
    commit.setTreeId(treeId);
    commitId = repoInserter.insert(commit);

    repoInserter.flush();
}
finally
{
    repoInserter.release();
}
Now you can git checkout the commit id returned as commitId.