I have a file storage service built in Node/Meteor that uses GridFS and is replicated across several containers. What I'm currently trying to find out is whether this piece of code is actually aware of read/write consistency:
db.command({
  filemd5: someFileId,
  root: 'fs'
}, function callback(err, results) {
  ...
})
I'm uploading the file in chunks, and that command is executed after all chunks are merged into a single file. I have a feeling it's reading from secondary members (I got a couple of md5 values that are the hash of an empty file - d41d8cd98f00b204e9800998ecf8427e). Is there any documentation or additional setting for this?
Those 2 params are the only options described in the docs: https://docs.mongodb.com/manual/reference/command/filemd5/
UPDATE
The exact code for merging the chunks is here in a 3rd party package:
cursor = files.find(
  {
    'metadata._Resumable.resumableIdentifier': file.metadata._Resumable.resumableIdentifier
    length:
      $ne: 0
  },
  {
    fields:
      length: 1
      metadata: 1
    sort:
      'metadata._Resumable.resumableChunkNumber': 1
  }
)
https://github.com/vsivsi/meteor-file-collection/blob/master/src/resumable_server.coffee#L26
And then there are lines 111-119, which execute filemd5 first and then run an update on the file:
@db.command md5Command, (err, results) ->
  if err
    lock.releaseLock()
    return callback err
  # Update the size and md5 to the file data
  files.update { _id: fileId }, { $set: { length: file.metadata._Resumable.resumableTotalSize, md5: results.md5 }},
    (err, res) =>
      lock.releaseLock()
      callback err
https://github.com/vsivsi/meteor-file-collection/blob/master/src/resumable_server.coffee#L111-L119
After the last chunk is written, the cursor = files.find() with all the merging logic is launched, hence if the read preference is secondaryPreferred the chunks might not be visible there yet? Should that code be refactored to use the primary only?
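For what it's worth, the Node driver's db.command accepts per-call options, so a minimal sketch of pinning this particular command to the primary (an assumption on my side, not something the package exposes directly) would be:
// Sketch only: run filemd5 against the primary regardless of the connection's
// default read preference, so the freshly written chunks are guaranteed visible.
db.command(
  { filemd5: someFileId, root: 'fs' },
  { readPreference: 'primary' },
  function (err, results) {
    // results.md5 and results.numChunks are computed on the primary
  }
);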
GridFS creates 2 collections: files and chunks.
A typical files entry looks like the following:
{
    "_id" : ObjectId("58cfbc8b6900bb31c7b1b8d9"),
    "length" : 4,
    "chunkSize" : 261120,
    "uploadDate" : ISODate("2017-03-20T11:27:07.812Z"),
    "md5" : "d3b07384d113edec49eaa6238ad5ff00",
    "filename" : "foo.txt"
}
The filemd5 administrative command should simply return the md5 field of the relevant file document (and the number of chunks).
files.md5
An MD5 hash of the complete file returned by the filemd5 command. This value has the String type.
source: GridFS docs
It should represent the full file's hash, or at least the hash of the file as originally saved.
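For example, running the command against the document above (shell syntax) should return the same hash plus the chunk count:
db.runCommand({ filemd5: ObjectId("58cfbc8b6900bb31c7b1b8d9"), root: "fs" })
// { "numChunks" : 1, "md5" : "d3b07384d113edec49eaa6238ad5ff00", "ok" : 1 }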
What is the ‘md5’ field of a files collection document and how is it used?
‘md5’ holds an MD5 checksum that is computed from the original contents of a user file. Historically, GridFS did not use acknowledged writes, so this checksum was necessary to ensure that writes went through properly. With acknowledged writes, the MD5 checksum is still useful to ensure that files in GridFS have not been corrupted. A third party directly accessing the 'files' and ‘chunks’ collections under GridFS could, inadvertently or maliciously, make changes to documents that would make them unusable by GridFS. Comparing the MD5 in the files collection document to a re-computed MD5 allows detecting such errors and corruption. However, drivers now assume that the stored file is not corrupted, and applications that want to use the MD5 value to check for corruption must do so themselves.
source: GridFS spec
If the file is updated in a way that does not go through the driver's mongoc_gridfs_file_save (for example, streaming), the md5 field will not be updated.
Actually, reading further in the spec:
Why store the MD5 checksum instead of creating the hash as-needed?
The MD5 checksum must be computed when a file is initially uploaded to GridFS, as this is the only time we are guaranteed to have the entire uncorrupted file. Computing it on-the-fly as a file is read from GridFS would ensure that our reads were successful, but guarantees nothing about the state of the file in the system. A successful check against the stored MD5 checksum guarantees that the stored file matches the original and no corruption has occurred.
And that is what we are doing. Only the mongoc_gridfs_file_save will calculate a md5 sum for the file and store it. Any other entry points, such as streaming, expect the user having created all the supporting mongoc_gridfs_file_opt_t and properly calculating the md5
source: JIRA issue
Related
I understand that SQL Server uses GZip for COMPRESS/DECOMPRESS. I've read that the data is stored in SQL Server as hex and adds extra characters to the beginning.
It must be possible to send and receive compressed data to and from SQL Server. My case relates to large JSON payloads, however this would be applicable to importing large amounts of structured data from CSV/XML as well.
Stored procedure:
CREATE OR ALTER PROCEDURE gzipTest
    @compressed VARBINARY(MAX)
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;
    SELECT CAST(DECOMPRESS(@compressed) AS VARCHAR(MAX)) AS json
END;
Example Code:
mssql.exec({
  data: {
    compressed: {
      type: Datalayer.sql.VarBinary(Datalayer.sql.MAX),
      value: deflateSync(json),
    },
  },
  spName: 'gzipTest',
})
A co-worker found this example code, for receiving compressed data:
https://jsfiddle.net/58mgsy9a/
However iterating over the zip file buffer array.map(pair => parseInt(pair, 16)) seems like way too much processing for something that should adhere to standards on both sides.
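For comparison, here is a hedged sketch of the round trip using the plain mssql package instead of the Datalayer wrapper (that substitution is an assumption): zlib.gzipSync produces the GZIP framing that DECOMPRESS expects, and a Node Buffer can be bound to VARBINARY(MAX) directly, so no hex conversion should be needed on either side.
const sql = require('mssql');
const { gzipSync, gunzipSync } = require('zlib');

// Send: compress the JSON in Node, let the stored procedure DECOMPRESS it.
async function sendCompressed(pool, json) {
  const result = await pool.request()
    .input('compressed', sql.VarBinary(sql.MAX), gzipSync(Buffer.from(json)))
    .execute('gzipTest'); // SELECT CAST(DECOMPRESS(@compressed) AS VARCHAR(MAX)) AS json
  return result.recordset[0].json;
}

// Receive: a VARBINARY column produced by COMPRESS(...) comes back as a Buffer,
// which gunzipSync can unpack directly - no hex parsing needed.
function inflate(varbinaryValue) {
  return gunzipSync(varbinaryValue).toString('utf8');
}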
I am building a website where users upload their details and work to the database, but their work is a combination of text, videos and images. I don't know how to upload such files to MongoDB at once with multer. Could someone please help?
I tried uploading without multer, but when a document exceeds 16MB it displays an error.
A single MongoDB document can be no more than 16MB. Multer is not the problem here.
Luckily there is a solution to storing large files in MongoDB: GridFS.
The file is broken up into chunks that are smaller than 16MB and then stored in a separate collection. Rather than this:
{
  _id: ...,
  name: "John Smith",
  work: [{
    file: (a very large binary file),
    filename: "foo.mp4"
  }]
}
You'd save something like this:
GridFS will create a document in a collection named, for example 'work.files'
{
  _id: ...,
  filename: "foo.mp4",
  ...
}
GridFS will create multiple documents, each smaller than 16MB, in a collection named for example 'work.chunks'
{
  _id: ...,
  files_id: (a reference to a work.files._id),
  n: 1,
  data: ...
}, {
  _id: ...,
  files_id: (the same reference to a work.files._id),
  n: 2,
  data: ...
}
Rather than the embedded file, you store a reference to the entry in work.files.
{
  _id: ...,
  name: "John Smith",
  work: [{
    _id: (a reference to a work.files._id)
  }]
}
You don't need to create these collections yourself, and there's no need to split the files into chunks yourself either, as there are libraries to help you with this, such as Mongoose-GridFS. Most libraries provide methods to stream files both into and out of a GridFS collection, so it can be very performant as well.
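As a rough illustration (a sketch only: the route, field name, bucket name 'work' and database name are all placeholders), multer can keep the upload in memory and the official driver's GridFSBucket can stream it into GridFS:
const express = require('express');
const multer = require('multer');
const { MongoClient, GridFSBucket } = require('mongodb');

const app = express();
const upload = multer({ storage: multer.memoryStorage() }); // uploaded file arrives as req.file.buffer

MongoClient.connect('mongodb://localhost:27017', function (err, client) {
  if (err) throw err;
  const db = client.db('mydatabase');
  const bucket = new GridFSBucket(db, { bucketName: 'work' });

  app.post('/upload', upload.single('work'), function (req, res) {
    // Stream the in-memory buffer into GridFS; chunking is handled for you.
    const uploadStream = bucket.openUploadStream(req.file.originalname, {
      metadata: { owner: req.body.userId } // any custom metadata you like
    });
    uploadStream
      .on('error', function () { res.sendStatus(500); })
      .on('finish', function () { res.json({ fileId: uploadStream.id }); }); // save this id in the user's work array
    uploadStream.end(req.file.buffer);
  });

  app.listen(3000);
});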
For very small files (such as an avatar) it might still be advantageous to store them embedded in your document.
A side note: there's not a lot "special" about GridFS. It's not an extension you need to install, but more of a blessed standard that many people follow for interoperability, and the documents are stored in standard, normal MongoDB collections. As such you can store any metadata/custom fields (see the example from the previously mentioned library) in (whatever).files if you want. Depending on how large the files you're going to store are, and whether you're going to stream them (e.g. video), you may want to tune the chunk size (default 255KB), even per file.
I'm using the Microsoft.Azure.Storage.DataMovement nuget package to transfer multiple, very large (150GB) files into Azure cold storage using
TransferManager.UploadDirectoryAsync
It works very well, but a choke point in my process is that after upload I am attaching to the FileTransferred event and reading the local file all over again to calculate the md5 checksum and compare it to the remote copy:
private void FileTransferredCallback(object sender, TransferEventArgs e)
{
    var sourceFile = e.Source.ToString();
    var destinationFile = (ICloudBlob) e.Destination;
    var localMd5 = CalculateMd5(e.Source.ToString());
    var remoteMd5 = destinationFile.Properties.ContentMD5;

    if (localMd5 == remoteMd5)
    {
        destinationFile.Metadata.Add(Md5VerifiedKey, DateTimeOffset.UtcNow.ToDisplayText());
        destinationFile.SetMetadata();
    }
}
It is slower than it needs to be since every file is getting double handled - first by the library, then by my MD5 check.
Is this check even necessary or is the library already doing the heavy lifting for me? I can see Md5HashStream but after quickly looking through the source it isn't clear to me if it is being used to verify the entire remote file.
Note that the blob.Properties.ContentMD5 metadata of the entire blob is actually set by the Microsoft.Azure.Storage.DataMovement library from its own local calculation after it has uploaded all the blocks of the blob; it is not computed by the Azure Storage Blob Service.
The data integrity of the blob upload is guaranteed by the Content-MD5 HTTP header sent with every single block, not by the blob.Properties.ContentMD5 metadata of the whole blob, since the Azure Storage Blob Service doesn't actually validate that value when the Microsoft.Azure.Storage.DataMovement library sets it as metadata (see the description of the x-ms-blob-content-md5 HTTP header).
The main purpose of blob.Properties.ContentMD5 is to verify data integrity when downloading the blob back to local disk via the Microsoft.Azure.Storage.DataMovement library (if DownloadOptions.DisableContentMD5Validation is set to false, which is the default behavior).
Is this check even necessary or is the library already doing the heavy lifting for me?
Based on my knowledge, we just need to check whether the blob has a value for the ContentMD5 property.
When using Microsoft.Azure.Storage.DataMovement to upload a large file, the upload actually consists of multiple PutBlock requests plus one final PutBlockList request. Each PutBlock request uploads only part of the content, so the MD5 in such a request covers only that block and cannot be used as the final MD5 value of the blob.
The body of the PutBlockList request is the list of identities of all the blocks uploaded above, so the Content-MD5 of that request can only verify the integrity of that list.
Once all of these requests are validated, the integrity of the content is guaranteed. For the sake of performance, the storage server does not read back all the blocks from the previous requests to calculate the MD5 value of the entire blob; instead it provides a special request header, x-ms-blob-content-md5, and stores that header's value as the blob's MD5. So as long as the client sets the MD5 of the entire content in x-ms-blob-content-md5 on the final PutBlockList request, the upload is verified and the blob also gets an MD5 value.
So the MD5-based integrity workflow for a block upload is:
The file to upload is divided into blocks
Each block is sent as a PutBlock request, with the MD5 value of that block in the Content-MD5 header
After all the blocks have been sent, the PutBlockList request is sent:
Calculate the MD5 value of the entire uploaded file and set it in the x-ms-blob-content-md5 header
Use the list of identities of the blocks sent earlier as the body of the request
Set the MD5 value of that block ID list in the Content-MD5 header
The service then assigns the x-ms-blob-content-md5 value from the PutBlockList request to the blob's MD5 property
In summary, for a block upload, whether the blob ends up with an MD5 value depends on whether x-ms-blob-content-md5 was set.
I have an analytics system that tracks customers and their attributes as well as their behavior in the form of events. It is implemented using Node.js and MongoDB (with Mongoose).
Now I need to implement a segmentation feature that allows grouping stored users into segments based on certain conditions, for example something like purchases > 3 AND country = 'Netherlands'.
In the frontend this would look something like a visual condition builder (screenshot omitted here).
An important requirement here is that the segments get updated in realtime and not just periodically. This basically means that every time a user's attributes change or he triggers a new event, I have to check again which segments he belongs to.
My current approach is to store the conditions for the segments as MongoDB queries, that I can then execute on the user collection in order to determine which users belong to a certain segment.
For example a segment to filter out all users that are using Gmail would look like this:
{
  _id: '591638bf833f8c843e4fef24',
  name: 'Gmail Users',
  condition: { 'email': { $regex: '.*gmail.*' } }
}
When a user matches the condition I would then store that he belongs to the 'Gmail Users' segment directly on the user's document:
{
  username: 'john.doe',
  email: 'john.doe@gmail.com',
  segments: ['591638bf833f8c843e4fef24']
}
However, by doing this I would have to execute the queries for all segments every time a user's data changes, just to check whether he is part of each segment or not. This feels complicated and cumbersome from a performance point of view.
Can you think of any alternative way to approach this? Maybe use a rule-engine and do the processing in the application and not on the database?
Unfortunately I don't know a better approach but you can optimize this solution a little bit.
I would do the same:
Store the segment conditions in a collection
Once you find a matching user, store the segment id in the user's document (segments)
An important requirement here is that the segments get updated in realtime and not just periodically.
You have no choice: you need to run the segmentation query every time a segment changes.
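A sketch of what that re-run could look like when a segment's definition changes (collection and field names follow the question; the two-step $addToSet/$pull bookkeeping is just one way to do it):
// Rebuild the membership of a single segment after its condition changed.
function rebuildSegment(db, segment, callback) {
  const users = db.collection('users');
  // Tag every user that currently matches the condition...
  users.updateMany(segment.condition, { $addToSet: { segments: segment._id } }, function (err) {
    if (err) return callback(err);
    // ...and untag everyone who is marked but no longer matches.
    users.updateMany(
      { $nor: [segment.condition], segments: segment._id },
      { $pull: { segments: segment._id } },
      callback
    );
  });
}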
I would have to execute all queries for all segments every time a user's data changes
This is where I would change your solution, actually just optimise it a little bit:
You don't need to run the segmentation queries on the whole collection. If you put the user's id into the query with an $and, MongoDB will fetch the user first and only then check the rest of the segmentation conditions. You need to make sure MongoDB uses the user's _id index; you can use .explain() to check it or .hint() to force it. Unfortunately you need to run N+1 queries if you have N segments (+1 is for the user update).
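A sketch of one of those scoped queries for a single segment (the $addToSet/$pull update is my own illustration; in the N+1 scheme above you would instead collect the matches from all N queries and write the segments array back in one final update):
// Re-evaluate a single segment for a single user: the _id clause pins the query
// to one document, so only that user is checked against the segment condition.
function checkUserAgainstSegment(db, userId, segment, callback) {
  const users = db.collection('users');
  users.findOne({ $and: [{ _id: userId }, segment.condition] }, function (err, match) {
    if (err) return callback(err);
    const update = match
      ? { $addToSet: { segments: segment._id } } // user matches: make sure the id is present
      : { $pull: { segments: segment._id } };    // user no longer matches: remove the id
    users.updateOne({ _id: userId }, update, callback);
  });
}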
I would fetch every segment and store them in a cache (Redis). If someone changes a segment I would update the cache as well (or just invalidate the cache and let the next query handle the rest, depending on the implementation). The point is that I would have every segment available without hitting the database, and when a user record is updated I would go through every segment in Node.js, validate the user against the conditions, and update the user's segments array in the original update query, so it would not require any extra database operation.
I know it can be a pain in the ass to implement something like this, but it doesn't overload the database...
Update
Let me give you some technical details about my second suggestion:
(This is just a pseudo code!)
Segment cache
module.exports = function() {
  return new Promise(function(resolve, reject) {
    Redis.get('cache:segments', function(err, cached) {
      if (err) return reject(err);
      // Segments are cached
      if (cached) {
        return resolve(JSON.parse(cached));
      }
      // Fetch segments and save them to the cache
      Segments.find().exec(function(err, segments) {
        if (err) return reject(err);
        // Save to the cache with a 60 second expiration
        Redis.set('cache:segments', JSON.stringify(segments), 'EX', 60, function(err) {
          if (err) return reject(err);
          return resolve(segments);
        });
      });
    });
  });
};
User update
// ...
let user = yield User.findOne({ _id: ObjectId(req.body.userId) });
// etc ...
// Fetch segments from the cache or from the database
let segments = yield segmentCache();
let userSegments = [];
segments.forEach(function(segment) {
  if (checkSegment(user, segment)) {
    userSegments.push(segment._id);
  }
});
// Override the user's segments with userSegments
This is where the magic happens: somehow you need to define the conditions in a way that lets you evaluate them in an if statement.
Hint: Lodash has functions for this: _.gt, _.gte, _.eq, ...
Check segments
module.exports = function(user, segment) {
  // The user belongs to the segment only if every condition key matches.
  // (Real conditions also need operator handling, e.g. $gt or $regex;
  // this sketch only covers plain equality.)
  let keys = Object.keys(segment.condition);
  return keys.every(function(key) {
    return user[key] === segment.condition[key];
  });
};
You are already storing an entire segment "query" in a document in the segments collection - why not include a field in the same document that enumerates which fields of the users document affect membership in that particular segment.
Since the action of changing user data knows which fields are being changed, it can fetch only the segments that are computed from the changed fields, significantly reducing the number of segmentation "queries" you have to re-run.
Note that a change in user's data may add them to a segment they are not currently a member of, so checking only the segments currently stored in the user is not sufficient.
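A minimal sketch of that idea (the dependsOn field name is invented for illustration):
// Each segment document lists the user fields its condition depends on:
const segment = {
  _id: '591638bf833f8c843e4fef24',
  name: 'Gmail Users',
  condition: { email: { $regex: '.*gmail.*' } },
  dependsOn: ['email']
};

// On a user update, fetch only the segments computed from fields that changed.
function segmentsAffectedBy(db, changedFields, callback) {
  db.collection('segments')
    .find({ dependsOn: { $in: changedFields } })
    .toArray(callback);
}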
I have the following code (removed error checking to keep it concise) which uses the node-mongodb-native.
var mongo = require('mongodb').MongoClient;
var grid = require('mongodb').GridStore;
var fs = require('fs');
var url = 'mongodb://localhost:27017/mydatabase';
mongo.connect(url, function(err, db) {
  var gs = new grid(db, 'myfile.txt', 'w', {
    "metadata": {
      // metadata here
    }
  });
  gs.open(function(err, store) {
    gs.writeFile('~/myfile.txt', function(err, doc) {
      fs.unlink(req.files.save.path, function (err) {
        // error checking etc
      });
    });
  });
});
If I run that once it works fine and stores the file in GridFS.
Now, if I delete that file on my system, create a new one with the same name but different contents, and run it through that code again, it uploads it. However, it seems to overwrite the file that is already stored in GridFS. _id stays the same, but md5 has been updated to the new value. So even though the file is different, because the name is the same it overwrites the current file in GridFS.
Is there a way to upload two files with the same name? If _id is unique, why does the driver overwrite the file based on the file name alone?
I found a similar issue on GitHub, but I am using the latest version of the driver from npm and it does what I explained above.
Like a real filesystem, the filename becomes the logical key in GridFS for reading and writing. You cannot have two files with the same name.
You'll need to come up with a secondary index of some sort or a new generated file name.
For example, add a timestamp to the file name.
Or, create another collection that maps generated file names to the GridFS entries, or to whatever it is that you need.
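A sketch of the generated-name approach using the same GridStore API as in the question, keeping the human-readable name in metadata so it can still be looked up:
var mongo = require('mongodb').MongoClient;
var grid = require('mongodb').GridStore;

mongo.connect('mongodb://localhost:27017/mydatabase', function (err, db) {
  // Store under a unique generated name; keep the original name in metadata.
  var uniqueName = Date.now() + '-' + 'myfile.txt';
  var gs = new grid(db, uniqueName, 'w', {
    "metadata": { "originalName": 'myfile.txt' }
  });
  gs.open(function (err, store) {
    gs.writeFile('~/myfile.txt', function (err, doc) {
      // doc._id differs for every upload, even when the original name repeats
    });
  });
});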
To avoid having to create additional unique identifiers for your files you should omit the 'write mode' option. This will allow gridfs to create a new file even if it contains the exact same data.
'w' overwrites data, which is why you are overwriting the existing file.
http://mongodb.github.io/node-mongodb-native/api-generated/gridstore.html