Sending multiple sha1 checksums in a single file?

I'm working on an application that needs to verify a large number of files that have been transferred over the web. Our current thinking is to put the checksums for multiple files into a single file and name it foobar.sha1.
I have some hesitation about using this extension as it seems to mostly be used for communicating a single checksum (as opposed to a large batch). Is this common usage?
Google has not yielded a clear answer.
Thanks
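(For reference: the de facto format used by GNU sha1sum is already designed to hold many checksums in one file, one "<hex digest>  <filename>" line per entry, verifiable with sha1sum -c. A minimal Node sketch that writes that format; the file names are hypothetical:

const crypto = require('crypto');
const fs = require('fs');

// Write one "<sha1 hex>  <filename>" line per file, the same format
// that `sha1sum` emits and that `sha1sum -c` verifies.
function writeChecksumFile(files, outPath) {
  const lines = files.map(file => {
    const digest = crypto.createHash('sha1')
      .update(fs.readFileSync(file))
      .digest('hex');
    return digest + '  ' + file;
  });
  fs.writeFileSync(outPath, lines.join('\n') + '\n');
}

writeChecksumFile(['a.bin', 'b.bin'], 'foobar.sha1');

So a multi-entry foobar.sha1 is conventional, not an abuse of the extension.)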

Related

Stream files generated on request in memory

I have a loop where I generate files (around 500KB each), and if there is too much data Node throws an out-of-memory error (no wonder, it's around 4GB of data). I read about streams and I'm trying to understand how I can incorporate them in my app.
Most of the information I find is about streaming a file that is already on disk. What I want to do is create files on the fly (which I already do), send them one by one (or in whatever chunks make sense) as they are generated, and hand them to the client in a zip when it's done (so it's easy on the RAM).
I'm not asking for specific code - more about where to look so I can read up on it.
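(One common pattern for this is a streaming zip library such as the third-party archiver package - an assumption here, not something named in the question. Wrapping each generated file in a lazy Readable means its data is only produced when the archive drains that entry, so memory stays bounded. A rough sketch; generateFile is a stand-in for the asker's generator:

const archiver = require('archiver'); // assumed third-party package
const { Readable } = require('stream');

function generateFile(i) {
  // Stand-in for the real generator: ~500KB of data per call.
  return Buffer.alloc(500 * 1024, i % 256);
}

// Express-style handler: stream thousands of generated files into a
// zip without holding more than one file's data in memory at a time.
function downloadAll(req, res) {
  res.setHeader('Content-Type', 'application/zip');
  res.setHeader('Content-Disposition', 'attachment; filename="files.zip"');

  const archive = archiver('zip');
  archive.on('error', err => res.destroy(err));
  archive.pipe(res); // zip bytes flow to the client as they are produced

  for (let i = 0; i < 10000; i++) {
    // Lazy source: generateFile(i) runs only when archiver actually
    // processes this entry, not when the loop queues it.
    const source = new Readable({
      read() {
        this.push(generateFile(i));
        this.push(null); // end of this entry
      },
    });
    archive.append(source, { name: 'file-' + i + '.bin' });
  }
  archive.finalize(); // no more entries; the response ends when done
}

The Node docs on stream.Readable and backpressure are the place to read up on why this keeps RAM flat.)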

What is the optimal way to merge a few lines or a few words into a large file using NodeJS?

I would appreciate insight from anyone who can suggest the best (or a better) solution for editing large files, anywhere from 1MB to 200MB, using nodejs.
Our process needs to merge lines into an existing file in the filesystem. We get the changed data in the following format, which needs to be merged into the file at the position defined in the change details.
[{"range":{"startLineNumber":3,"startColumn":3,"endLineNumber":3,"endColumn":3},"rangeLength":0,"text":"\n","rangeOffset":4,"forceMoveMarkers":false},{"range":{"startLineNumber":4,"startColumn":1,"endLineNumber":4,"endColumn":1},"rangeLength":0,"text":"\n","rangeOffset":5,"forceMoveMarkers":false},{"range":{"startLineNumber":5,"startColumn":1,"endLineNumber":5,"endColumn":1},"rangeLength":0,"text":"\n","rangeOffset":6,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":1,"endLineNumber":6,"endColumn":1},"rangeLength":0,"text":"f","rangeOffset":7,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":2,"endLineNumber":6,"endColumn":2},"rangeLength":0,"text":"a","rangeOffset":8,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":3,"endLineNumber":6,"endColumn":3},"rangeLength":0,"text":"s","rangeOffset":9,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":4,"endLineNumber":6,"endColumn":4},"rangeLength":0,"text":"d","rangeOffset":10,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":5,"endLineNumber":6,"endColumn":5},"rangeLength":0,"text":"f","rangeOffset":11,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":6,"endLineNumber":6,"endColumn":6},"rangeLength":0,"text":"a","rangeOffset":12,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":7,"endLineNumber":6,"endColumn":7},"rangeLength":0,"text":"s","rangeOffset":13,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":8,"endLineNumber":6,"endColumn":8},"rangeLength":0,"text":"f","rangeOffset":14,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":9,"endLineNumber":6,"endColumn":9},"rangeLength":0,"text":"s","rangeOffset":15,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":10,"endLineNumber":6,"endColumn":10},"rangeLength":0,"text":"a","rangeOffset":16,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":11,"endLineNumber":6,"endColumn":11},"rangeLength":0,"text":"f","rangeOffset":17,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":12,"endLineNumber":6,"endColumn":12},"rangeLength":0,"text":"s","rangeOffset":18,"forceMoveMarkers":false}]
Just opening the full file and merging those details would work, but it breaks down if we receive too many of these change sets too frequently: the file is opened many times, which can cause out-of-memory issues and is also very inefficient.
There is a similar question aimed specifically at C# here. If we open the file in stream mode, is there a similar example in nodejs?
I would appreciate insight from anyone who can suggest the best (or a better) solution for editing large files, anywhere from 1MB to 200MB, using nodejs.
Our process needs to merge lines into an existing file in the filesystem. We get the changed data in the following format, which needs to be merged into the file at the position defined in the change details.
General OS file systems do not directly support the concept of inserting info into a file. So, if you have a flat file and you want to insert data into it starting at a particular line number, you have to do the following steps:
Open the file and start reading from the beginning.
As you read data from the file, count lines until you reach the desired line number.
Then, before overwriting anything, read ahead and buffer into memory an amount of data equal to the size of the data you intend to insert, so it isn't lost.
Then write the data you're inserting to the file at the insertion position.
Now, using a second buffer the size of the data you inserted, alternate between reading the next chunk into one buffer and writing out the previous one, leapfrogging your way down the file.
Continue until the end of the file is reached and all data is written back to the file (after the newly inserted data).
This has the effect of rewriting all the data after the insertion point back to the file so it will now correctly be in its new location in the file.
As you can tell, this is not efficient at all for large files as you have to read the entire file a buffer at a time and you have to write the insertion and everything after the insertion point.
In node.js, you can use features in the fs module to carry out all these steps, but you have to write the logic to connect them all together as there is no built-in feature to insert new data into a file while pushing the existing data after it.
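To make those steps concrete, here is a minimal sketch (not a definitive implementation): it streams line by line into a temporary file and renames it over the original, which has the same cost profile as the in-place leapfrog described above but is simpler to get right.

const fs = require('fs');
const readline = require('readline');

// Insert `text` before 1-based line `lineNumber`, streaming so that
// memory stays bounded even for very large files.
async function insertAtLine(srcPath, lineNumber, text) {
  const tmpPath = srcPath + '.tmp';
  const out = fs.createWriteStream(tmpPath);
  const rl = readline.createInterface({
    input: fs.createReadStream(srcPath),
    crlfDelay: Infinity,
  });
  let current = 0;
  for await (const line of rl) {
    current += 1;
    if (current === lineNumber) out.write(text);
    out.write(line + '\n'); // note: normalizes line endings to \n
  }
  await new Promise((resolve, reject) => {
    out.on('error', reject);
    out.end(resolve);
  });
  fs.renameSync(tmpPath, srcPath); // replace the original in one step
}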
There is a similar question aimed specifically at c# here. If we open the file in stream mode, is there similar example in nodejs?
The C# example you reference appears to just be appending new data onto the end of the file. That's trivial to do in pretty much any file system library. In node.js, you can do that with fs.appendFile() or you can open any file handle in append mode and then write to it.
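For example, with the standard fs API:

const fs = require('fs');

// Appending is cheap because nothing after the write point has to move.
fs.appendFile('data.txt', 'a new last line\n', err => {
  if (err) throw err;
});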
To insert data into a file more efficiently, you would need to use a more efficient storage system than a single flat file for all the data. For example, if you stored the file in pieces in approximately 100 line blocks, then to insert data you'd only have to rewrite a portion of one block of data and then perhaps have some cleanup process that rebalances the block boundaries if a block gets way too big or too small.
For efficient line management, you would need to maintain an accurate index of how many lines each file piece contains and obviously what order the pieces should be in. This would allow you to insert data at a somewhat fixed cost no matter how big the entire file was as the most you would need to do is to rewrite one or two blocks of data, even if the entire content was hundreds of GB in size.
Note, you would essentially be building a new file system on top of the OS file system in order to give yourself more efficient inserts or deletions within the overall data. Obviously, the chunks of data could also be stored in a database too and managed there.
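As a purely hypothetical illustration of that idea (in memory here; in practice each block would be its own file or database row), an insert only ever touches one block, plus an occasional split:

// Hypothetical in-memory sketch of block-based line storage.
class BlockedLines {
  constructor() {
    this.blocks = [[]]; // ordered list of arrays of lines
  }
  // Map a 1-based line number to { block index, offset within block }.
  locate(n) {
    let remaining = n;
    for (let i = 0; i < this.blocks.length; i++) {
      if (remaining <= this.blocks[i].length) {
        return { block: i, offset: remaining - 1 };
      }
      remaining -= this.blocks[i].length;
    }
    const last = this.blocks.length - 1;
    return { block: last, offset: this.blocks[last].length }; // append
  }
  insertLine(n, text) {
    const { block, offset } = this.locate(n);
    this.blocks[block].splice(offset, 0, text);
    // Rebalance: split a block that has grown past 200 lines, so the
    // cost of any single insert stays roughly constant.
    if (this.blocks[block].length > 200) {
      const secondHalf = this.blocks[block].splice(100);
      this.blocks.splice(block + 1, 0, secondHalf);
    }
  }
}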
Note, if this project is really an editor, text editing a line-based structure is a very well studied problem and you could also study the architectures used in previous projects for further ideas. It's a bit beyond the scope of a typical answer here to study the pros and cons of various architectures. If your system is also a client/server editor where the change instructions are being sent from a client to a server, that also affects some of the desired tradeoffs in the design since you may desire differing tradeoffs in terms of the number of transactions or the amount of data to be sent between client and server.
If some other language handles this in an optimal way, then I think it would be better to find that option, since you're saying nodejs might not have one.
This doesn't really have anything to do with the language you choose. This is about how modern and typical operating systems store data in files.
The fs module has a function named appendFile. It lets you append data to your file. Link.

Resize image when uploading to server or when serving from server to client?

My website uses many images. Even on a slow day, users will upload hundreds of new images.
I'm trying to figure out the best practice for manipulating image sizes.
This project uses Node.js with the gm module for manipulating images, but I don't think this question is Node- or gm-specific.
I came up with several strategies, but I can't make a decision as to which is the best, and I am not sure if I am missing an obvious best-practice strategy.
Please enlighten me with your thoughts and experience.
Option 1: Resize the file with gm on every client request.
Option 1 pros:
If I run a gm operation every time I serve a file, I can control the size, quality, compression, filters and so on whenever I need to.
On the server I only save one full-quality, full-size version of each file, saving storage space.
Option 1 cons:
gm is very resource intensive, which means I will be abusing my RAM for every single image served to every single client.
It also means I will always be working from a big file, which makes things even worse.
I will always have to fetch the file from my storage (in my case S3) to the server, then manipulate it, then serve it. That seems like it would create redundant bandwidth usage.
Option 2: Resize the file on first upload and keep multiple sizes of the file on the server.
Option 2 pros:
I will only have to use gm on uploads.
Serving the files will require almost no resources.
Option 2 cons:
I will use more storage because I will be saving multiple versions of the same file (i.e. full, large, medium, small, x-small) instead of only one version.
I will be limited to using only the sizes that were created when the user uploaded their image.
Not flexible - If in the future I decide I need an additional size version (x-x-small for instance) I will have to run a script that processes every image in my storage to create the new version of the image.
Option 3: Process files on upload as in option 2, but keep a resize module for serving file sizes that don't have a stored version in my storage.
Option 3 pros:
I will be able to reduce resource usage significantly when serving files in a selection of set sizes.
Option 3 cons:
I would still use more storage, as in option 2 vs option 1.
I will still have to process files when I serve them in cases where I don't have the file size I want.
Option 4: I do not create multiple versions of files on upload. I resize the images when I serve them, BUT whenever an image size is requested, that version of the file is saved in my storage, so for future requests I will not have to process the image again.
Option 4 pros:
I will only use storage for the versions I use.
I can add a new file size whenever I need it; it will be created automatically on demand if it doesn't already exist.
Heavy resource usage happens only once per file.
Option 4 cons:
Files that are only accessed once will be both resource intensive AND storage intensive: I will access the file, see that the size version I need doesn't exist, create the new file version, spend the resources needed, and save it to my storage, wasting storage space on a file that will only be used once. (Note: I can't know how many times files will be used.)
I will have to check if the file already exists for every request.
So,
Which would you choose? Why?
Is there a better way than the ways I suggested?
The solution depends heavily on how your resources are used. If utilisation is intensive, option 2 is by far the better choice. If not, option 1 could also work nicely.
From a qualitative point of view, option 4 is of course the best. But for simplicity and automation, option 2 is way better.
Because simplicity matters, I suggest mixing options 2 and 4: keep a fixed list of sizes (e.g. large, medium, small), but don't process them on upload; generate them when requested, as in option 4.
That way, in the worst case, you simply converge on the option 2 layout.
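A rough sketch of that mix, using Express and the gm module from the question (the route shape, directory names, and size whitelist are all hypothetical choices):

const fs = require('fs');
const path = require('path');
const gm = require('gm');           // the module already used in the project
const express = require('express'); // assumed for the route shape

const app = express();
const SIZES = { small: 200, medium: 600, large: 1200 }; // widths, hypothetical

app.get('/img/:size/:name', (req, res) => {
  const width = SIZES[req.params.size];
  if (!width) return res.status(400).send('unknown size'); // whitelist only
  // Real code should sanitize req.params.name against path traversal.
  const original = path.resolve('uploads', req.params.name);
  const resized = path.resolve('cache', req.params.size, req.params.name);
  fs.access(resized, err => {
    if (!err) return res.sendFile(resized); // already generated earlier
    gm(original)
      .resize(width) // width only; height keeps the aspect ratio
      .write(resized, writeErr => { // assumes cache/<size>/ dirs exist
        if (writeErr) return res.status(500).send('resize failed');
        res.sendFile(resized); // cached for every later request
      });
  });
});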
My final word would be that you should also use the <img> and/or <canvas> elements in your website to perform the final sizing, so that this small computational overhead is not incurred on the server side.

For a peer-to-peer app that can resume file transfers, is it sufficient to check filesize/modified date for changes before resuming a file?

I'm working on a networked application that has a peer-to-peer file transfer component (think instant messenger), and I'd like to make it able to resume file transfers gracefully.
If there is an ongoing file transfer and one user drops out, the recipient still knows how much of the file he has successfully received, and therefore where to resume the transfer from. However, if the file has changed in the meantime, how can this be detected? To be clear, I'm not focused here on corruption by the network so much as on the source file being altered.
The way I was starting out on this was by having the sender hash the file before sending it, so the recipient has a hash to check the finished file against. However, this only detects corruption at the very end, unless each resume also hashes. That problem could be alleviated by viewing the file in chunks and hashing each of those. The bigger problem with hashing, though, is that it can take a really, really long time, which is just a bad user experience when a user wants to send something immediately (e.g. the file to send is a Linux ISO sitting on a slow network share).
I was thinking about changing to simply checking the file size and modified date each time a transfer begins or is resumed. While this is clearly not foolproof, unless I'm missing something (and please correct me if I am), almost every means an end user would use to alter a file is well-behaved and will at the very least update the modified date; even if not, the change in size should catch 99% of cases. Does this seem like an acceptable compromise? Bad idea?
How do the established protocols handle this?
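(For reference, the size/modified-date check itself is one cheap fs.stat call; a minimal sketch, where the saved values are assumed to have been recorded when the transfer started:

const fs = require('fs');

// Compare the file's current size and mtime against values recorded
// when the transfer began; a mismatch means restart, not resume.
function looksUnchanged(filePath, savedSize, savedMtimeMs) {
  const stat = fs.statSync(filePath);
  return stat.size === savedSize && stat.mtimeMs === savedMtimeMs;
})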
The quick answer to your question is that it will work in most cases, unless files are modified often.
Instead of cryptographic hashes, use checksums (CRC32, for example). These are much faster for checking whether a file has been modified.
If a connection breaks, you only need to send the computed chunk checksums back to the source, which can work out whether any of the chunks so far have been modified in the meantime. Then it can decide what to resend and transmit the missing chunks.
Chunks plus checksums are the best trade-off between whole-file hashing and user experience.
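A sketch of the per-chunk checksum side, with a hand-rolled CRC-32 so no extra packages are needed (the 1MB chunk size is an arbitrary choice here):

const fs = require('fs');

// Standard CRC-32 lookup table, built once.
const CRC_TABLE = (() => {
  const table = new Uint32Array(256);
  for (let n = 0; n < 256; n++) {
    let c = n;
    for (let k = 0; k < 8; k++) {
      c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
    }
    table[n] = c >>> 0;
  }
  return table;
})();

function crc32(buf) {
  let c = 0xffffffff;
  for (let i = 0; i < buf.length; i++) {
    c = CRC_TABLE[(c ^ buf[i]) & 0xff] ^ (c >>> 8);
  }
  return (c ^ 0xffffffff) >>> 0;
}

// One checksum per fixed-size chunk; the recipient sends this list on
// resume so the sender can tell which chunks changed in the meantime.
function chunkChecksums(filePath, chunkSize = 1024 * 1024) {
  const fd = fs.openSync(filePath, 'r');
  const buf = Buffer.alloc(chunkSize);
  const sums = [];
  try {
    let bytesRead;
    while ((bytesRead = fs.readSync(fd, buf, 0, chunkSize, null)) > 0) {
      sums.push(crc32(buf.subarray(0, bytesRead)));
    }
  } finally {
    fs.closeSync(fd);
  }
  return sums;
}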

Uploading & extracting archive (zip, rar, targz, tarbz) automatically - security issue?

I'd like to create the following functionality for my web-based application:
the user uploads an archive file (zip/rar/tar.gz/tar.bz etc) containing several image files
the archive is automatically extracted after upload
the images are shown in an HTML list (or whatever)
Are there any security issues involved with the extraction process? E.g. the possibility of malicious code execution contained within the uploaded files (or a well-prepared archive file), or anything else?
Aside from the possibility of exploiting the system with things like buffer overflows if the extraction is not implemented carefully, there can be issues if you blindly extract a well-crafted compressed file that contains a large, highly redundant file inside (a zip bomb). The compressed version is very small, but when you extract it, it will fill the whole disk, causing denial of service and possibly crashing the system.
Also, if you are not careful enough, the client might hand you a zip file containing server-side executable content (.php, .asp, .aspx, ...) and then request those files over HTTP, which, if the server is not configured properly, can result in arbitrary code execution on the server.
In addition to Mehrdad's answer: hosting user-supplied content is a bit tricky. If you host a zip file, it can double as a store for Java class files (the format is also used for other things), and so the "same origin policy" can be broken. (There was the GIFAR attack, where a zip was attached to the end of another file, but that no longer works with the Java PlugIn/WebStart.) Image files should at the very least be checked to confirm they actually are image files. There is also the problem of web browsers having buffer overflow vulnerabilities, meaning your site could be used to attack your visitors (this may make you unpopular). You may find some client-side software using, say, regexes to parse data, so data in the middle of an image file could end up being executed. And zip files may have naughty file names (for instance, directory traversal with ../ and strange characters).
What to do (not necessarily an exhaustive list):
Host user supplied files on a completely different domain.
The domain with user files should use different IP addresses.
If possible decode and re-encode the data.
There's another stackoverflow question on zip bombs - I suggest decompressing using ZipInputStream and stopping if it gets too big (see the sketch after this list).
Where native code touches user data, do it in a chroot gaol.
Whitelist characters, or replace file names entirely.
Potentially you could use an IDS of some description to scan for suspicious data (I really don't know how much this gets done - make sure your IDS isn't written in C!).
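The ZipInputStream suggestion above is Java; a comparable Node sketch, assuming the third-party yauzl package, enforces a size cap and rejects suspicious names before anything is written to disk (the cap value is arbitrary):

const yauzl = require('yauzl'); // assumed third-party package

const MAX_TOTAL = 100 * 1024 * 1024; // arbitrary 100MB cap on extracted size

// Walk the archive's entries, rejecting traversal-style names and
// aborting once the declared decompressed size passes the cap.
function checkZip(zipPath, done) {
  yauzl.open(zipPath, { lazyEntries: true }, (err, zipfile) => {
    if (err) return done(err);
    let total = 0;
    zipfile.on('entry', entry => {
      if (entry.fileName.includes('..') || entry.fileName.startsWith('/')) {
        zipfile.close();
        return done(new Error('suspicious file name: ' + entry.fileName));
      }
      // Header sizes can lie, so a real extractor must also count bytes
      // while inflating and abort if a stream exceeds its declared size.
      total += entry.uncompressedSize;
      if (total > MAX_TOTAL) {
        zipfile.close();
        return done(new Error('archive too big when decompressed'));
      }
      zipfile.readEntry();
    });
    zipfile.on('end', () => done(null, total));
    zipfile.readEntry();
  });
}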
