Sending GZipped data to SQL Server - node.js

I understand that SQL Server uses GZip for COMPRESS/DECOMPRESS. I've read that the data is stored in SQL Server as hex and that extra characters are added to the beginning.
It must be possible to send and receive compressed data to and from SQL Server. My case relates to large JSON payloads; however, this would be applicable to importing large amounts of structured data from CSV/XML as well.
Stored procedure:
CREATE OR ALTER PROCEDURE gzipTest
    @compressed VARBINARY(MAX)
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;
    SELECT CAST(DECOMPRESS(@compressed) AS VARCHAR(MAX)) AS json
END;
Example Code:
mssql.exec({
  data: {
    compressed: {
      type: Datalayer.sql.VarBinary(Datalayer.sql.MAX),
      value: deflateSync(json),
    },
  },
  spName: 'gzipTest',
})
A co-worker found this example code for receiving compressed data:
https://jsfiddle.net/58mgsy9a/
However, iterating over the zip file buffer with array.map(pair => parseInt(pair, 16)) seems like far too much processing for something that should adhere to standards on both sides.
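One thing worth noting: SQL Server's COMPRESS/DECOMPRESS work on the GZIP format, so on the Node side zlib.gzipSync (rather than deflateSync, which emits raw zlib/deflate output) should produce data that DECOMPRESS can unpack directly, with no hex round-tripping needed. A minimal sketch against the plain mssql package, since the Datalayer wrapper above looks like an in-house abstraction:

// Sketch only (not the poster's Datalayer wrapper): send gzipped JSON
// straight to the gzipTest procedure using the `mssql` package.
const sql = require('mssql');
const { gzipSync } = require('zlib');

async function sendCompressedJson(pool, payload) {
  // gzipSync produces GZIP framing, which is what DECOMPRESS expects.
  const buffer = gzipSync(JSON.stringify(payload));

  const result = await pool.request()
    .input('compressed', sql.VarBinary(sql.MAX), buffer) // maps to @compressed VARBINARY(MAX)
    .execute('gzipTest');

  // The procedure CASTs DECOMPRESS(@compressed) back to VARCHAR(MAX) as `json`
  return result.recordset[0].json;
}

// Usage (assuming an existing connection pool):
// const pool = await sql.connect(config);
// const json = await sendCompressedJson(pool, { hello: 'world' });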

Related

ADF: How to pass binary data to stored procedure in Azure Data Factory

I am reading binary data (a JPEG image) using an API (Web Action) and I want to store it as varbinary or base64 in Azure SQL Server.
As far as I can tell, there is no way to base64-encode binary data using Azure Data Factory. Is that correct?
So I am trying to pass it as byte[] using a varbinary parameter. The parameter of the stored procedure looks like this:
@Photo varbinary(max) NULL
The parameter in the stored procedure activity in ADF is configured the same way (screenshot not included).
But this also does not seem to work, because the pipeline fails with this error:
The value of the property 'Value' is invalid for the stored procedure parameter 'Photo'.
Is it possible to store that image using this approach? And if not, how can this be achieved (using ADF and a stored procedure)?
Just to be safe, are you missing a '@' before the activity? Can't see it in the picture.
Peace

Microsoft.Azure.Storage.DataMovement MD5 check

I'm using the Microsoft.Azure.Storage.DataMovement nuget package to transfer multiple, very large (150GB) files into Azure cold storage using
TransferManager.UploadDirectoryAsync
It works very well, but a choke point in my process is that after upload I am attaching to the FileTransferred event and reading the local file all over again to calculate the md5 checksum and compare it to the remote copy:
private void FileTransferredCallback(object sender, TransferEventArgs e)
{
    var sourceFile = e.Source.ToString();
    var destinationFile = (ICloudBlob) e.Destination;
    var localMd5 = CalculateMd5(e.Source.ToString());
    var remoteMd5 = destinationFile.Properties.ContentMD5;
    if (localMd5 == remoteMd5)
    {
        destinationFile.Metadata.Add(Md5VerifiedKey, DateTimeOffset.UtcNow.ToDisplayText());
        destinationFile.SetMetadata();
    }
}
It is slower than it needs to be, since every file is handled twice: first by the library, then by my MD5 check.
Is this check even necessary, or is the library already doing the heavy lifting for me? I can see Md5HashStream, but after quickly looking through the source it isn't clear to me whether it is being used to verify the entire remote file.
Note that the blob.Properties.ContentMD5 metadata of the entire blob is actually set by the Microsoft.Azure.Storage.DataMovement library from its own local calculation after uploading all the blocks of the blob; it is not computed by the Azure Storage Blob Service.
The data integrity of the blob upload is guaranteed by the Content-MD5 HTTP header when putting every single block, not by the blob.Properties.ContentMD5 metadata of the entire blob, since the Azure Storage Blob Service doesn't actually validate that value when the Microsoft.Azure.Storage.DataMovement library sets it (see the introduction of the x-ms-blob-content-md5 HTTP header).
The main purpose of blob.Properties.ContentMD5 is to verify data integrity when downloading the blob back to local disk via the Microsoft.Azure.Storage.DataMovement library (if DownloadOptions.DisableContentMD5Validation is set to false, which is the default behavior).
Is this check even necessary or is the library already doing the heavy lifting for me?
Based on my knowledge, we just need to check whether the blob has a value for the ContentMD5 property.
When using Microsoft.Azure.Storage.DataMovement to upload a large file, the upload actually consists of multiple PutBlock requests plus one PutBlockList request. Each PutBlock request uploads only part of the content, so the MD5 in such a request covers only the content of that block and cannot be used as the final blob MD5 value.
The body of the PutBlockList request is a list of the identities of all the blocks uploaded above, so the MD5 value of that request can only provide an integrity check of that list.
When all of these requests are validated, the integrity of the content is guaranteed. For the sake of performance, the storage service does not stitch together the contents of all the previously uploaded blocks to calculate the MD5 value of the entire blob; instead it provides a special request header, x-ms-blob-content-md5, and stores that header's value as the blob's MD5 value. So as long as the client sets the MD5 of the entire content in x-ms-blob-content-md5 on the final PutBlockList request, verification is ensured and the blob also ends up with an MD5 value.
So the workflow for MD5-based integrity during a block upload is:
The uploaded file is divided into blocks
Send each block as a PutBlock request and set the MD5 value of the current block in the Content-MD5 header
After all the blocks have been sent, send the PutBlockList request:
Calculate the MD5 value of the entire uploaded file and set it in the x-ms-blob-content-md5 header
Make a list of the identities of the blocks sent earlier as the body of the request
Set the MD5 value of that block ID list in the Content-MD5 header
The service then assigns the x-ms-blob-content-md5 value from the PutBlockList request to the blob's MD5 property
In summary, for block uploads it comes down to whether x-ms-blob-content-md5 has a value.
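To make the two headers concrete, here is a rough Node sketch of the same flow using the @azure/storage-blob SDK (a different client than the .NET DataMovement library in the question, shown only to illustrate where Content-MD5 and x-ms-blob-content-md5 end up); containerClient and chunks are assumed inputs:

// Illustrative only: stage blocks with a per-block transactional MD5
// (Content-MD5 header) and commit the block list with the whole-file MD5
// (x-ms-blob-content-md5 header).
const crypto = require('crypto');

async function uploadWithMd5(containerClient, blobName, chunks) {
  const blockBlob = containerClient.getBlockBlobClient(blobName);
  const fullHash = crypto.createHash('md5');
  const blockIds = [];

  for (let i = 0; i < chunks.length; i++) {
    const blockId = Buffer.from(String(i).padStart(6, '0')).toString('base64');
    const blockMd5 = crypto.createHash('md5').update(chunks[i]).digest(); // per-block MD5
    fullHash.update(chunks[i]);

    // Put Block: the service validates this block against Content-MD5
    await blockBlob.stageBlock(blockId, chunks[i], chunks[i].length, {
      transactionalContentMD5: blockMd5,
    });
    blockIds.push(blockId);
  }

  // Put Block List: x-ms-blob-content-md5 is simply stored as the blob's
  // ContentMD5 property; the service does not recompute it.
  await blockBlob.commitBlockList(blockIds, {
    blobHTTPHeaders: { blobContentMD5: fullHash.digest() },
  });
}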

Does MongoDB's filemd5 have ability to set readPreference

I have a file storage service built in Node/Meteor, which utilizes GridFS, and it is replicated across several containers. What I'm currently trying to find out is whether this piece of code is actually aware of read/write consistency:
db.command({
  filemd5: someFileId,
  root: 'fs'
}, function callback(err, results) {
  ...
})
I'm uploading the file in chunks, and after merging all chunks into a single file that command is executed. I have a feeling that it's using secondary members (I got a couple of md5 values which are those of an empty file - d41d8cd98f00b204e9800998ecf8427e). Is there any documentation or additional settings for it?
Those 2 params are the only options described in the docs: https://docs.mongodb.com/manual/reference/command/filemd5/
UPDATE
The exact code for merging the chunks is here in a 3rd party package:
cursor = files.find(
  {
    'metadata._Resumable.resumableIdentifier': file.metadata._Resumable.resumableIdentifier
    length:
      $ne: 0
  },
  {
    fields:
      length: 1
      metadata: 1
    sort:
      'metadata._Resumable.resumableChunkNumber': 1
  }
)
https://github.com/vsivsi/meteor-file-collection/blob/master/src/resumable_server.coffee#L26
And then there are lines 111-119, which execute filemd5 first and then run an update on the file:
@db.command md5Command, (err, results) ->
  if err
    lock.releaseLock()
    return callback err
  # Update the size and md5 to the file data
  files.update { _id: fileId }, { $set: { length: file.metadata._Resumable.resumableTotalSize, md5: results.md5 }},
    (err, res) =>
      lock.releaseLock()
      callback err
https://github.com/vsivsi/meteor-file-collection/blob/master/src/resumable_server.coffee#L111-L119
After writing the last chunk, the cursor = files.find() is launched with all the merging logic, so if the read preference is secondaryPreferred the chunks might not be there yet? Should that code be refactored to use the primary only?
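If stale secondary reads are the cause, would forcing the filemd5 command to the primary be the fix? A sketch of what I mean, assuming the plain node-mongodb-native driver rather than the Meteor wrapper (someFileId as above):

// Sketch: run filemd5 against the primary so it sees the freshly written
// chunks, regardless of the connection's default read preference.
db.command(
  { filemd5: someFileId, root: 'fs' },
  { readPreference: 'primary' },
  function callback(err, results) {
    if (err) { return handleError(err); } // handleError is a placeholder
    console.log(results.md5, results.numChunks);
  }
);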
GridFS creates 2 collections: files and chunks.
A typical files entry looks like the following:
{
"_id" : ObjectId("58cfbc8b6900bb31c7b1b8d9"),
"length" : 4,
"chunkSize" : 261120,
"uploadDate" : ISODate("2017-03-20T11:27:07.812Z"),
"md5" : "d3b07384d113edec49eaa6238ad5ff00",
"filename" : "foo.txt"
}
The filemd5 administrative command should simply return the md5 field of the relevant file document (and the number of chunks).
files.md5
An MD5 hash of the complete file returned by the filemd5 command. This value has the String type.
source: GridFS docs
It should represent the full file's hash, or at least the hash of the file as originally saved.
What is the ‘md5’ field of a files collection document and how is it used?
‘md5’ holds an MD5 checksum that is computed from the original contents of a user file. Historically, GridFS did not use acknowledged writes, so this checksum was necessary to ensure that writes went through properly. With acknowledged writes, the MD5 checksum is still useful to ensure that files in GridFS have not been corrupted. A third party directly accessing the 'files' and ‘chunks’ collections under GridFS could, inadvertently or maliciously, make changes to documents that would make them unusable by GridFS. Comparing the MD5 in the files collection document to a re-computed MD5 allows detecting such errors and corruption. However, drivers now assume that the stored file is not corrupted, and applications that want to use the MD5 value to check for corruption must do so themselves.
source: GridFS spec
If the file is updated in such a way that the driver's mongoc_gridfs_file_save is not used (for example, streaming), the md5 field will not be updated.
Actually, further reading the spec:
Why store the MD5 checksum instead of creating the hash as-needed?
The MD5 checksum must be computed when a file is initially uploaded to GridFS, as this is the only time we are guaranteed to have the entire uncorrupted file. Computing it on-the-fly as a file is read from GridFS would ensure that our reads were successful, but guarantees nothing about the state of the file in the system. A successful check against the stored MD5 checksum guarantees that the stored file matches the original and no corruption has occurred.
And that is what we are doing. Only the mongoc_gridfs_file_save will calculate a md5 sum for the file and store it. Any other entry points, such as streaming, expect the user having created all the supporting mongoc_gridfs_file_opt_t and properly calculating the md5
source: JIRA issue
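Since the spec leaves corruption checks to the application, you can recompute the hash yourself by streaming the stored chunks back; a rough sketch using the official Node driver's GridFSBucket (not the Meteor package from the question):

// Sketch: recompute the MD5 of a stored GridFS file and compare it with the
// md5 field in the files collection document.
const crypto = require('crypto');
const { GridFSBucket } = require('mongodb');

async function verifyGridFsMd5(db, fileId) {
  const bucket = new GridFSBucket(db, { bucketName: 'fs' });
  const hash = crypto.createHash('md5');

  // Stream the stored chunks back and hash them locally.
  await new Promise((resolve, reject) => {
    bucket.openDownloadStream(fileId)
      .on('data', chunk => hash.update(chunk))
      .on('error', reject)
      .on('end', resolve);
  });

  // Compare against the md5 stored in the files collection document.
  const doc = await db.collection('fs.files').findOne({ _id: fileId });
  return hash.digest('hex') === doc.md5;
}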

How to implement a fast, queryable and persistent database in phantomjs?

I have been using phantomjs for doing some heavy lifting for me in a server-side DOM environment. Till now I have been putting my data structures in memory (i.e. doing nothing special with them) and everything was fine.
But recently under some use cases I started running into the following problems:
memory usage becoming too high, making swap kick in and seriously affecting my performance.
not being able to resume from the last save point, since in-memory data structures are not persistent (obviously)
This forced me to look for a database solution to be used on phantom, but again I am running into issues while deciding on a solution:
I don't want my performance to get too affected.
it has to be persistent and queryable
how do I even connect to a database from inside a phantom script?
Can anyone guide me to a satisfactory solution?
Note: I have almost decided on SQLite, but connecting to it from phantom is still an issue. Node.js provides the sqlite3 node module; I am trying to browserify it for phantom.
Note Note: Browserify didn't work! Back to ground zero!! :-(
Thanx in advance!
Phantomjs' filesystem API allows you to read and write binary files with:
buf = fs.read(FILENAME, 'b') and
fs.write(FILENAME, buf, 'b')
sql.js (https://github.com/kripken/sql.js/) gives you a JavaScript SQLite implementation you can run in phantomjs.
Combine the two and you have a fast, persistent, queryable SQL database.
Example walkthrough
Get javascript SQLite implementation (saving to /tmp/sql.js)
$ wget https://raw.githubusercontent.com/kripken/sql.js/master/js/sql.js -O /tmp/sql.js
Create a test SQLite database using the command-line sqlite3 app (showing it is persistent and external to your phantomjs application).
sqlite3 /tmp/eg.db
sqlite> CREATE TABLE IF NOT EXISTS test(id INTEGER PRIMARY KEY AUTOINCREMENT, created INTEGER NOT NULL DEFAULT CURRENT_TIMESTAMP);
sqlite> .quit
Save this test phantomjs script to add entries to the test database and verify behaviour.
$ cat /tmp/eg.js
var fs = require('fs'),
    sqlite3 = require('./sql.js'),
    dbfile = '/tmp/eg.db',
    sql = 'INSERT INTO test(id) VALUES (NULL)',
    // fs.read returns binary 'string' (not 'String' or 'Uint8Array')
    read = fs.read(dbfile, 'b'),
    // Database argument must be a 'string' (binary) not 'Uint8Array'
    db = new sqlite3.Database(read),
    write,
    uint8array;

try {
    db.run(sql);
} catch (e) {
    console.error('ERROR: ' + e);
    phantom.exit();
}

// db.export() returns 'Uint8Array' but we must pass binary 'string' to write
uint8array = db.export();
write = String.fromCharCode.apply(null, Array.prototype.slice.apply(uint8array));
fs.write(dbfile, write, 'b');
db.close();
phantom.exit();
Run the phantomjs script to test
$ /usr/local/phantomjs-2.0.0-macosx/bin/phantomjs /tmp/eg.js
Use external tool to verify changes were persisted.
sqlite3 /tmp/eg.db
sqlite> SELECT * FROM test;
id created
1 2015-03-28 10:21:09
sqlite>
Some things to keep in mind:
The database is modified on disk only when you call fs.write.
Any changes you make are invisible to external programs accessing the same SQLite database file until you call fs.write.
The entire database is read into memory with fs.read.
You may want to have different OS files for different tables -- or versions of tables -- depending on your application and the amount of data in the tables, to address the memory requirements you mentioned.
Passing what is returned by sqlite3.export() to fs.write will corrupt the SQLite database file on disk (it will no longer be a valid SQLite database file).
Uint8Array is NOT the correct type for the fs.write parameter.
Writing binary data in phantomjs works like this:
var db_file = fs.open(db_name, {mode: 'wb', charset: ''});
db_file.write(String.fromCharCode.apply(null, db.export()));
db_file.close();
You have to set the charset to '', because otherwise the data is written with a text encoding and the database file gets corrupted.
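If it helps, both answers can be folded into a small helper (the saveDb name is made up) that exports the in-memory sql.js database and writes it back using the binary-safe fs.open approach above:

// Sketch: persist an in-memory sql.js database to disk from phantomjs.
// `db` is a sql.js Database instance; `path` is the target file.
function saveDb(db, path) {
    var fs = require('fs'),
        out = fs.open(path, {mode: 'wb', charset: ''}); // charset '' keeps the bytes raw
    // db.export() returns a Uint8Array; convert it to a binary string for write()
    out.write(String.fromCharCode.apply(null, db.export()));
    out.close();
}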

Azure Table Storage access time - inserting/reading from

I'm making a program that stores data from CSV files in Azure tables and reads it back. The CSV files can have a varying number of columns, and between 3k and 50k rows. What I need to do is upload that data into an Azure table. So far I have managed to both upload data and retrieve it.
I'm using the REST API, and for uploading I'm creating an XML batch request with 100 rows per request. Now that works fine, except it takes a bit too long to upload, e.g. for 3k rows it takes around 30 seconds. Is there any way to speed that up? I noticed that most of the time is spent processing the response (the ReadToEnd() call). I read somewhere that setting the proxy to null could help, but it doesn't do much in my case.
I also found somewhere that it is possible to upload the whole XML request to a blob and then execute it from there, but I couldn't find any example of doing that.
using (Stream requestStream = request.GetRequestStream())
{
    requestStream.Write(content, 0, content.Length);
}

using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    Stream dataStream = response.GetResponseStream();
    using (var reader = new StreamReader(dataStream))
    {
        String responseFromServer = reader.ReadToEnd();
    }
}
As for retrieving data from Azure tables, I managed to get 1000 entities per request. For that, it takes me around 9 seconds for a CSV with 3k rows. It also takes the most time when reading from the stream, when I'm calling this part of the code (again ReadToEnd()):
response = request.GetResponse() as HttpWebResponse;
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
    string result = reader.ReadToEnd();
}
Any tips?
As you mentioned you are using the REST API, you have to write extra code and rely on your own methods to improve performance, rather than using the client library. In your case, using the Storage client library would be best, as you can use its already-built features to expedite insert, upsert, etc., as described here.
However, if you are using the Storage Client Library and ADO.NET, you can use the article below, written by the Windows Azure Table team, as the supported way to improve Azure table access performance:
.NET and ADO.NET Data Service Performance Tips for Windows Azure Tables
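For comparison only, and not the .NET library referred to above: with the current Node SDK for tables, @azure/data-tables, the same batching idea looks like the sketch below. Each entity group transaction is limited to 100 entities sharing one partition key, which matches the 100-row batches in the question; the connection string, table name and row shape are assumptions.

// Sketch: insert rows in batches of 100 using entity group transactions.
const { TableClient } = require('@azure/data-tables');

async function insertRows(connectionString, tableName, rows) {
  const client = TableClient.fromConnectionString(connectionString, tableName);

  for (let i = 0; i < rows.length; i += 100) {
    const actions = rows.slice(i, i + 100).map(row => ['create', {
      partitionKey: row.partitionKey, // every entity in one transaction must share this
      rowKey: row.rowKey,
      ...row.values,
    }]);
    await client.submitTransaction(actions); // one round trip per 100 rows
  }
}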
