Large file upload into WSS v3 - SharePoint

I've built a WSSv3 application which uploads files in small chunks; as each data piece arrives, I temporarily store it in a SQL 2005 image data type field, for performance reasons**.
The problem comes when the upload ends: I need to move the data from SQL Server into a SharePoint document library through the WSSv3 object model.
Right now, I can think of two approaches:
SPFileCollection.Add(string, (byte[])reader[0]); // OutOfMemoryException
and
SPFile file = folder.Files.Add("filename", new byte[]{ });
using (Stream stream = file.OpenBinaryStream())
{
    // ... init vars and stuff ...
    while ((bytes = reader.GetBytes(0, offset, buffer, 0, BUFFER_SIZE)) > 0)
    {
        stream.Write(buffer, 0, (int)bytes); // Timeout issues
        offset += bytes; // advance the read offset so the loop terminates
    }
    file.SaveBinary(stream);
}
Is there any other way to complete this task successfully?
** Performance reasons: if you try to write every chunk directly to SharePoint, you'll notice performance degradation as the file grows (>100 MB).

I ended up with the following code:
myFolder.Files.Add("filename",
    new DataRecordStream(dataReader,
        dataReader.GetOrdinal("Content"), length));
You can find the DataRecordStream implementation here. It's basically a Stream that reads data from a DbDataRecord through .GetBytes.
This approach is similar to OpenBinaryStream()/SaveBinary(stream), but it doesn't keep the whole byte[] in memory while you transfer the data. At some point, DataRecordStream will be read by Microsoft.SharePoint.SPFile.CloneStreamToSPFileStream in 64k chunks.
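In case it helps anyone, here is a minimal sketch of what such a stream wrapper can look like. This is an illustrative reimplementation under my own assumptions, not the linked code: a forward-only, read-only Stream that pulls bytes on demand through IDataRecord.GetBytes.
using System;
using System.Data;
using System.IO;

// Illustrative only: read-only, forward-only Stream over a BLOB column.
public class DataRecordStream : Stream
{
    private readonly IDataRecord _record;
    private readonly int _ordinal;
    private readonly long _length;
    private long _position;

    public DataRecordStream(IDataRecord record, int ordinal, long length)
    {
        _record = record;
        _ordinal = ordinal;
        _length = length;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        // Pull the next chunk straight from the data reader; nothing is buffered here.
        long read = _record.GetBytes(_ordinal, _position, buffer, offset, count);
        _position += read;
        return (int)read;
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { return _length; } }
    public override long Position
    {
        get { return _position; }
        set { throw new NotSupportedException(); }
    }

    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}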
Thank you all for the valuable info!

The first thing I would say is that SharePoint is really, really not designed for this. It stores all files in its own database so that's where these large files are going. This is not a good idea for lots of reasons: scalability, cost, backup/restore, performance, etc... So I strongly recommend using file shares instead.
You can increase the timeout of the web request by changing the executionTimeout attribute of the httpRuntime element in web.config.
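For example, something like the following (the element lives under system.web; executionTimeout is in seconds, maxRequestLength in KB, and the values here are only illustrative):
<configuration>
  <system.web>
    <!-- Allow long-running upload requests and, if needed, larger request bodies -->
    <httpRuntime executionTimeout="3600" maxRequestLength="512000" />
  </system.web>
</configuration>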
Apart from that, I'm not sure what else to suggest. I haven't heard of such large files being stored in SharePoint. If you absolutely must do this, try also asking on Server Fault.

As mentioned previously, storing large files in SharePoint is generally a bad idea. See this article for more information: http://blogs.msdn.com/joelo/archive/2007/11/08/what-not-to-store-in-sharepoint.aspx
With that said, it is possible to use external storage for BLOBs, which may or may not help your performance issues -- Microsoft released a half-complete external BLOB storage provider that does the trick, but it unfortunately works at the farm level and affects all uploads. Ick.
Fortunately, since you can implement your own external BLOB provider, you may be able to write something to better handle these specific files. See this article for details: http://207.46.16.252/en-us/magazine/2009.06.insidesharepoint.aspx
Whether or not this would be worth the overhead depends on how much of a problem you're having. :)

Related

Node uploaded image save - stream vs buffer

I am working on image upload and don't know how to properly deal with storing the received file. It would be nice to analyze the file first to check whether it is really an image or someone just changed the extension. Luckily, I use the package sharp, which has exactly such a feature. I am currently working with two approaches.
Buffering approach
I can parse the multipart form as a buffer and easily decide whether to save the file or not.
const metadata = await sharp(buffer).metadata();
if (metadata) {
  saveImage(buffer);
} else {
  throw new Error('It is not an image');
}
Streaming approach
I can parse the multipart form as a readable stream. First I need to pipe the readable stream into a writable one and store the file on disk. Afterward, I need to create a readable stream from the saved file again and verify whether it is really an image; otherwise, revert everything.
// save uploaded file to file system with stream
readableStream.pipe(createWriteStream('./uploaded-file.jpg'));

// verify whether it is an image
createReadStream('./uploaded-file.jpg').pipe(
  sharp().metadata((err, metadata) => {
    if (!metadata) {
      revertAll();
      throw new Error('It is not an image');
    }
  })
);
My intention was to avoid using a buffer because, as far as I know, it needs to hold the whole file in RAM. On the other hand, the streaming approach seems really clunky.
Can someone help me understand how these two approaches differ in terms of performance and resource usage? Or is there a better way to deal with such a situation?
In buffer mode, all the data coming from a resource is collected into a buffer, think of it as a data pool, until the operation is completed; it is then passed back to the caller as one single blob of data. Buffers in V8 are limited in size. You cannot allocate more than a few gigabytes of data, so you may hit a wall way before running out of physical memory if you need to read a big file.
On the other hand, streams allow us to process the data as soon as it arrives from the resource. So streams process their data without storing it all in memory. Streams can be more efficient in terms of both space (memory usage) and time (computation clock time).
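To make the difference concrete, here is a minimal sketch (the file names are made up) that copies an upload to disk both ways: the buffered version holds the entire file in memory at once, while the streamed version only ever holds one small chunk at a time.
const fs = require('fs');
const { pipeline } = require('stream');

// Buffered: the entire file is read into RAM before anything is written.
fs.readFile('./incoming-upload.tmp', (err, buffer) => {
  if (err) throw err;
  fs.writeFile('./uploaded-file.jpg', buffer, (err) => {
    if (err) throw err;
    console.log('buffered copy done, peak memory roughly equals the file size');
  });
});

// Streamed: data moves in small chunks (64 KB by default for fs read streams),
// so memory usage stays roughly constant regardless of the file size.
pipeline(
  fs.createReadStream('./incoming-upload.tmp'),
  fs.createWriteStream('./uploaded-file-streamed.jpg'),
  (err) => {
    if (err) throw err;
    console.log('streamed copy done, peak memory roughly equals one chunk');
  }
);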

Objective-C for loops with @autoreleasepool and ARC

As part of an app that allows auditors to create findings and associate photos with them (saved as Base64 strings due to a limitation of the web service), I have to loop through all findings and their photos within an audit and set their sync value to true.
Whilst I perform this loop I see a memory spike jumping from around 40MB up to 500MB (for roughly 350 photos and 255 findings), and this number never goes down. On average our users are creating around 1000 findings and 500-700 photos before attempting to use this feature. I have attempted to use @autoreleasepool blocks to keep the memory down, but it never seems to get released.
for (Finding * __autoreleasing f in self.audit.findings) {
    @autoreleasepool {
        [f setToSync:@YES];
        NSLog(@"%@", f.idFinding);
        for (FindingPhoto * __autoreleasing p in f.photos) {
            @autoreleasepool {
                [p setToSync:@YES];
                p = nil;
            }
        }
        f = nil;
    }
}
The relationships and retain cycles look like this:
Audit has a strong reference to Finding
Finding has a weak reference to Audit and a strong reference to FindingPhoto
FindingPhoto has a weak reference to Finding
What am I missing in terms of being able to effectively loop through these objects and set their properties without causing such a huge spike in memory. I'm assuming it's got something to do with so many Base64 strings being loaded into memory when looping through but never being released.
So, first, make sure you have a batch size set on the fetch request. Choose a relatively small number, but not too small because this isn't for UI processing. You want to batch a reasonable number of objects into memory to reduce loading overhead while keeping memory usage down. Try 50 or 100 and see how it goes, then consider upping the batch size a little.
If all of the objects you're loading are managed objects then the correct way to evict them during processing is to turn them into faults. That's done by calling refreshObject:mergeChanges: on the context. BUT - that discards any changes, and your loop is specifically there to make changes.
So, what you should really be doing is batch saving the objects you've modified and then turning those objects back into faults to remove the data from memory.
So, in your loop, keep a counter of how many you've modified and save the context each time you hit that count and refresh all the objects that were processed so far. The batch on the fetch and the batch size to save should be the same number.
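A rough sketch of that loop, in case it is useful (the Finding/FindingPhoto names and setToSync: come from the question; the managed object context, the fetch request and the batch size of 100 are illustrative assumptions):
static const NSUInteger kBatchSize = 100;

// Fetch in batches so Core Data does not materialise every object up front.
NSFetchRequest *request = [NSFetchRequest fetchRequestWithEntityName:@"Finding"];
[request setFetchBatchSize:kBatchSize];

NSError *error = nil;
NSArray *findings = [context executeFetchRequest:request error:&error];

NSUInteger processed = 0;
NSMutableArray *modified = [NSMutableArray array];

for (Finding *f in findings) {
    @autoreleasepool {
        [f setToSync:@YES];
        [modified addObject:f];
        for (FindingPhoto *p in f.photos) {
            [p setToSync:@YES];
            [modified addObject:p];
        }
        processed++;

        if (processed % kBatchSize == 0) {
            // Persist the changes, then turn the saved objects back into
            // faults so their (large) data can be evicted from memory.
            [context save:&error];
            for (NSManagedObject *object in modified) {
                [context refreshObject:object mergeChanges:NO];
            }
            [modified removeAllObjects];
        }
    }
}

// Save whatever is left from the last partial batch.
[context save:&error];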
There's probably a big difference in size between your "Finding" objects and the associated images. So your primary aim should be to redesign your database so that unfaulting (loading) a Finding object does not automatically load the base64 encoded image.
That's actually one of the major strengths of Core Data: loading only part of an object hierarchy. Just try to move the base64 encoded data into its own (managed) object so that Core Data does not load it. It will still be loaded as needed when the reference is touched.

Give reads priority over writes in Elasticsearch

I have an EC2 server running Elasticsearch 0.9 with an nginx server for read/write access. My index has about 750k small-to-medium documents. I have a pretty continuous stream of minimal writes (mainly updates) to the content. The speed and consistency I get with search are fine with me, but I have some sporadic timeout issues with multi-get (/_mget).
On some pages in my app, our server will request a multi-get of a dozen to a few thousand documents (this usually takes less than 1-2 seconds). The requests that fail, fail with a 30,000 millisecond timeout from the nginx server. I am assuming this happens because the index was temporarily locked for writing/optimizing purposes. Does anyone have any ideas on what I can do here?
A temporary solution would be to lower the timeout and return a user friendly message saying documents couldn't be retrieved (however they still would have to wait ~10 seconds to see an error message).
Some of my other thoughts were to give read priority over writes. Anytime someone is trying to read a part of the index, don't allow any writes/locks to that section. I don't think this would be scalable and it may not even be possible?
Finally, I was thinking I could have a read-only alias and a write-only alias. I can figure out how to set this up through the documentation, but I am not sure if it will actually work like I expect it to (and I'm not sure how I can reliably test it in a local environment). If I set up aliases like this, would the read-only alias still have moments where the index was locked due to information being written through the write-only alias?
I'm sure someone else has come across this before. What is the typical solution to make sure a user can always read data from the index, with a higher priority than writes? I would consider increasing our server power if required. Currently we have 2 m2.xlarge EC2 instances; one is the primary and one the replica, each with 4 shards.
An example dump of cURL info from a failed request (with an error of Operation timed out after 30000 milliseconds with 0 bytes received):
{
  "url": "127.0.0.1:9200\/_mget",
  "content_type": null,
  "http_code": 100,
  "header_size": 25,
  "request_size": 221,
  "filetime": -1,
  "ssl_verify_result": 0,
  "redirect_count": 0,
  "total_time": 30.391506,
  "namelookup_time": 7.5e-5,
  "connect_time": 0.0593,
  "pretransfer_time": 0.059303,
  "size_upload": 167002,
  "size_download": 0,
  "speed_download": 0,
  "speed_upload": 5495,
  "download_content_length": -1,
  "upload_content_length": 167002,
  "starttransfer_time": 0.119166,
  "redirect_time": 0,
  "certinfo": [],
  "primary_ip": "127.0.0.1",
  "redirect_url": ""
}
After more monitoring using the Paramedic plugin, I noticed that I would get timeouts when my CPU would hit ~80-98% (no obvious spikes in indexing/searching traffic). I finally stumbled across a helpful thread on the Elasticsearch forum. It seems this happens when the index is doing a refresh and large merges are occurring.
Merges can be throttled at the cluster or index level, and I've lowered indices.store.throttle.max_bytes_per_sec from the default 20mb to 5mb. This can be done at runtime with the cluster update settings API:
PUT /_cluster/settings HTTP/1.1
Host: 127.0.0.1:9200

{
  "persistent" : {
    "indices.store.throttle.max_bytes_per_sec" : "5mb"
  }
}
So far Paramedic is showing a decrease in CPU usage, from an average of ~5-25% down to an average of ~1-5%. Hopefully this helps me avoid the 90%+ spikes that were locking up my queries before; I'll report back by selecting this answer if I don't have any more problems.
As a side note, I guess I could have opted for more balanced EC2 instances (rather than memory-optimized). I think I'm happy with my current choice, but my next purchase will also take more CPU into account.

GDBM file import and export

I am migrating a system from the old server (Slackware) to the new one (Red Hat). The system includes some .gdbm files. I found out that on my new server, when running
$WEB_SERVICES = 'file.gdbm';
tie( %webservices, 'GDBM_File', $WEB_SERVICES, O_RDONLY, 0 );
the %webservices hash turns out to be empty. But this was working fine on my old server.
So my question is: can .gdbm files simply be transferred (using the scp command) from one server to another (with a different operating system and a different version of gdbm)?
I also read the documentation at http://www.gnu.org.ua/software/gdbm/manual/gdbm.html#SEC12, which says .gdbm files need to be converted into a flat format before being sent over the network, but I'm still not sure how to do that.
Please help, thanks in advance!
On the old system, tie the hash to GDBM and dump the hash. Move the dump to the new system. Read the dump into a hash, then tie it to GDBM to write it out.
For dumping, use a platform-independent serialisation format (Sereal is best), or, if the dump needs to be human-readable, Data::Dumper or similar for writing and Data::Undump for reading.
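A rough sketch of the round trip using Data::Dumper (file names are illustrative and error handling is minimal):
# dump_gdbm.pl -- run on the OLD server: write the tied hash out as Perl code
use strict;
use warnings;
use GDBM_File;
use Data::Dumper;

tie my %webservices, 'GDBM_File', 'file.gdbm', &GDBM_READER, 0640
    or die "Cannot tie file.gdbm: $!";

open my $out, '>', 'webservices.dump' or die "Cannot open dump file: $!";
print {$out} Dumper(\%webservices);   # writes: $VAR1 = { ... };
close $out;
untie %webservices;

# load_gdbm.pl -- run on the NEW server: rebuild the GDBM file from the dump
use strict;
use warnings;
use GDBM_File;

# do() runs the dump as Perl and returns the value of its last expression,
# i.e. the hashref assigned to $VAR1.
my $data = do './webservices.dump' or die "Cannot read dump: $@ $!";

tie my %webservices, 'GDBM_File', 'file.gdbm', &GDBM_WRCREAT, 0640
    or die "Cannot tie file.gdbm: $!";
%webservices = %$data;   # copies every key/value into the new GDBM file
untie %webservices;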

Saving a base 64 string to a file via createWriteStream

I have an image coming into my Node.js application via email (through cloud service provider Mandrill). The image comes in as a base64 encoded string, email.content in the example below. I'm currently writing the image to a buffer, and then a file like this:
//create buffer and write to file
var dataBuffer = new Buffer(email.content, 'base64');
var writeStream = fs.createWriteStream(tmpFileName);
writeStream.once('open', function(fd) {
    console.log('Our stream is open, lets write to it');
    writeStream.write(dataBuffer);
    writeStream.end();
}); //writeStream.once('open')
writeStream.on('close', function() {
    fileStats = fs.statSync(tmpFileName);
    // ...
});
This works fine and is all well and good, but am I essentially doubling the memory requirements for this section of code, since I have my image in memory (as the original string), and then create a buffer of that same string before writing the file? I'm going to be dealing with a lot of inbound images so doubling my memory requirements is a concern.
I tried several ways to write email.content directly to the stream, but it always produced an invalid file. I'm a rank amateur with modern coding, so you're welcome to tell me this concern is completely unfounded as long as you tell me why so some light will dawn on marble head.
Thanks!
Since you already have the entire file in memory, there's no point in creating a write stream. Just use fs.writeFile:
fs.writeFile(tmpFileName, email.content, 'base64', callback)
@Jonathan's answer is a better way to shorten the code you already have, so definitely do that.
I will expand on your question about memory, though. The fact is that Node will not write anything to a file without converting it to a Buffer first, so given what you have told us about email.content, there is nothing more you can do.
If you are really worried about this, though, then you would need some way to process the value of email.content as a stream as it comes in from wherever you are getting it. Then, as the data is streamed into the server, you immediately write it to a file, thus not taking up any more RAM than needed.
If you elaborate more, I can try to fill in more info.
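As a rough illustration of that idea, assuming you could obtain the base64 payload as a readable stream (base64Source below is hypothetical), you could decode it chunk by chunk and pipe it straight to disk:
const fs = require('fs');
const { Transform, pipeline } = require('stream');

// Hypothetical sketch only: base64Source stands for a readable stream of the
// base64 text, which you would have to obtain by parsing the incoming request
// incrementally instead of reading email.content as one string.
function base64DecodeStream() {
  let carry = '';
  return new Transform({
    transform(chunk, encoding, callback) {
      const text = carry + chunk.toString('ascii');
      const usable = text.length - (text.length % 4); // decode whole 4-char groups only
      carry = text.slice(usable);
      callback(null, Buffer.from(text.slice(0, usable), 'base64'));
    },
    flush(callback) {
      callback(null, Buffer.from(carry, 'base64'));
    },
  });
}

pipeline(base64Source, base64DecodeStream(), fs.createWriteStream(tmpFileName), function (err) {
  if (err) console.error('Streaming decode failed', err);
});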
