Does ActiveStorage use the checksum for anything?

In a very real sense, my question is actually 'can I skip generating a checksum', but answering that question rests on the above question.
To give you some background, I'm (finally) converting from Paperclip to ActiveStorage, and one of the pains of my particular conversion is that I'm storing a decent number of fairly large files: in addition to normal-sized thumbnail images, I'm also storing large multimedia files, some in excess of 10 GB (I'm currently poking at a 15 GB file).
The basic conversion process has me downloading each file to generate a checksum, plus a few other minor details that could be handled with a HEAD request instead of downloading the full file. We also copy the file from its old 'home' to its new 'home', but that is done as an S3-to-S3 copy and doesn't take as long as downloading and uploading.
I'd love to skip the download-and-checksum step, or at least put it off for another day as a cleanup task that isn't important to what we're actually doing.
So the question is: does the checksum actually do anything in ActiveStorage, or is it just a 'nice-to-have' feature that would allow me to, for example, publish the checksum if someone wanted to verify their version?

From the Rails source:
"Prior to uploading, we compute the checksum, which is sent to the service for transit integrity validation. If the checksum does not match what the service receives, an exception will be raised."
So the checksum is not merely a nice-to-have: the storage service uses it to verify the file in transit at upload time.
You can create your own checksum without downloading the whole file into memory, using the same code Rails does (from ActiveStorage::Blob):
def compute_checksum_in_chunks(io)
  OpenSSL::Digest::MD5.new.tap do |checksum|
    # Read in 5 MB chunks (5.megabytes is ActiveSupport) so large files
    # never have to fit in memory at once.
    while chunk = io.read(5.megabytes)
      checksum << chunk
    end
    io.rewind
  end.base64digest  # ActiveStorage stores the base64 digest, not hex
end

Related

Simplest format to archive a directory

I have a script that needs to work on multiple platforms and machines. Some of those machines don't have any available archiving software (e.g. zip, tar). I can't download any software onto these machines.
The script creates a directory containing output files. I need to package all those files into a single file so I can download it easily.
What is the simplest possible archiving format to implement, so that I can easily roll my own implementation in the script? It doesn't have to support compression.
I could make up something ad-hoc, e.g.
file1 base64EncodedContents
dir1/file1 base64EncodedContents
etc.
However, if one already exists, that will save me having to roll my own unpacking as well as packing (only packing would be needed), which would be nice. Bonus points if it's zip-compatible, so that I can zip it with compression where possible, implement my own uncompressed version otherwise, and not have to worry about which it is on the other side.
The tar archive format is extremely simple - simple enough that I was able to implement a tar archiver in PowerShell in a couple of hours.
It consists of a sequence of file header, file data, file header, file data, and so on.
The header is plain ASCII (numeric fields are written as octal digit strings), so it requires no bit manipulation - you can literally append strings. Once you've written the header, you append the file bytes, then pad with NUL bytes until the data is a multiple of 512 bytes. You then repeat for the next file.
Wikipedia has more details on the exact format: https://en.wikipedia.org/wiki/Tar_(computing).
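To make the layout concrete, here is a rough sketch of a minimal ustar-style writer (TypeScript/Node here, but any language with byte buffers works the same way). Offsets follow the Wikipedia layout; the function names are my own, and directories, long names, and error handling are ignored:

// Minimal tar (ustar) writer sketch: one 512-byte header block per file,
// then the file data padded to a 512-byte boundary.
import { writeFileSync } from "node:fs";

function octal(value: number, width: number): string {
  // Numeric fields are ASCII octal, zero-padded, NUL-terminated.
  return value.toString(8).padStart(width - 1, "0") + "\0";
}

function tarHeader(name: string, size: number): Buffer {
  const header = Buffer.alloc(512);              // unused bytes stay NUL
  header.write(name, 0, 100, "ascii");           // file name
  header.write(octal(0o644, 8), 100);            // mode
  header.write(octal(0, 8), 108);                // uid
  header.write(octal(0, 8), 116);                // gid
  header.write(octal(size, 12), 124);            // file size in bytes
  header.write(octal(Math.floor(Date.now() / 1000), 12), 136); // mtime
  header.write("        ", 148);                 // checksum: spaces while summing
  header.write("0", 156);                        // typeflag: '0' = regular file
  header.write("ustar\0" + "00", 257);           // ustar magic + version
  const sum = header.reduce((a, b) => a + b, 0); // byte sum of the whole block
  header.write(sum.toString(8).padStart(6, "0") + "\0 ", 148); // final checksum
  return header;
}

function tarFile(name: string, data: Buffer): Buffer {
  // Pad the data out to the next 512-byte boundary.
  const padding = Buffer.alloc((512 - (data.length % 512)) % 512);
  return Buffer.concat([tarHeader(name, data.length), data, padding]);
}

// An archive is the concatenated entries plus two empty 512-byte blocks.
const body = tarFile("dir1/file1.txt", Buffer.from("hello world\n"));
writeFileSync("out.tar", Buffer.concat([body, Buffer.alloc(1024)]));

The only fiddly part is the checksum field, which is summed as if it contained eight spaces and then written back in octal.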

How can I detect corrupt/incomplete MP3 file, from a node.js app?

The most common situation in which an MP3 file's integrity is broken is when the file has been only partially uploaded to the server. In that case, the indicated audio duration doesn't correspond to what is really in the MP3 file: we can hear the beginning, but at some point playback stops, and the duration shown by the audio player is wrong.
I tried libraries like node-ffprobe, but it seems they just read metadata without comparing it against the real audio data in the file. Is there a way to efficiently detect a corrupted or incomplete MP3 file from Node.js?
Note: the client uploading the MP3 files is a hardware device (an audio recorder) uploading to an FTP server, not a browser, so I'm not able to send potentially more useful data from the client.
MP3 files don't normally have a duration. They're just a series of MPEG frames. Sometimes, there is an ID3 tag indicating duration, but not always.
Players can determine duration by choosing one of a few methods:
Decode the entire audio file. This is the slowest method, but if you're going to decode the file anyway, you might as well go this route, as it gives you an exact duration.
Read the whole file, skimming through frame headers. You'll have to read the whole file from disk, but you won't have to decode it. Can be slow if I/O is slow, but it gives you an exact duration.
Read the first frame's bitrate and estimate duration from the file size. Definitely the fastest method, and the one most commonly used by players. The duration is only an estimate; it is reasonably accurate for CBR, but can be wildly inaccurate for VBR.
What I'm getting at is that these files might not actually be broken. They might just be VBR files that your player doesn't know the duration of.
If you're convinced they are broken (such as stopping in the middle of content), then you'll have to figure out how you want to handle it. There are probably only a couple ways to determine this:
Ideally, there's an ID3 tag indicating duration, and you can decode the whole file and determine its real duration to compare.
Usually, that ID3 tag won't exist, so you'll have to check to see if the last frame is complete or not.
Beyond that, you don't really have a good way of knowing if the stream is incomplete, since there is no outer container that actually specifies number of frames to expect.
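If you do want to implement the "is the last frame complete" check, a rough sketch in TypeScript/Node might look like the following. It assumes MPEG-1 Layer III frames, skips a leading ID3v2 tag, and ignores trailing ID3v1 tags and free-format bitrates, so treat it as a starting point rather than a validator:

import { readFileSync } from "node:fs";

// MPEG-1 Layer III bitrate table (kbps), indexed by the 4-bit header field.
const BITRATES = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320];
const SAMPLE_RATES = [44100, 48000, 32000]; // MPEG-1 sample rates (Hz)

function lastFrameComplete(path: string): boolean {
  const buf = readFileSync(path);
  let pos = 0;
  // Skip an ID3v2 tag if present: "ID3", version, flags, syncsafe size.
  if (buf.toString("ascii", 0, 3) === "ID3") {
    pos = 10 + ((buf[6] << 21) | (buf[7] << 14) | (buf[8] << 7) | buf[9]);
  }
  while (pos + 4 <= buf.length) {
    const h = buf.readUInt32BE(pos);
    if (h >>> 21 !== 0x7ff) return false;        // lost frame sync: garbage bytes
    const bitrate = BITRATES[(h >> 12) & 0x0f];
    const sampleRate = SAMPLE_RATES[(h >> 10) & 0x03];
    if (!bitrate || !sampleRate) return false;   // invalid header fields
    const padding = (h >> 9) & 0x01;
    // Layer III frame length: 144 * bitrate / sampleRate, plus padding byte.
    pos += Math.floor((144 * bitrate * 1000) / sampleRate) + padding;
  }
  // Complete only if the last frame ends exactly at end-of-file.
  return pos === buf.length;
}

console.log(lastFrameComplete("recording.mp3") ? "looks complete" : "truncated?");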
The expression for estimating the file size of a constant-bitrate MP3 from its duration and encoding (from this answer) is quite simple:
x = length of song in seconds
y = bitrate in kilobits per second
(x * y) / 8 / 1024 = approximate file size in MB
(The division by 8 converts kilobits to kilobytes; ID3 tags add a little on top.)
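As a quick worked example (a trivial helper; the names are mine):

// Estimate CBR MP3 audio size from duration and bitrate.
function estimatedSizeMB(seconds: number, kbps: number): number {
  return (seconds * kbps) / 8 / 1024; // kilobits -> kilobytes -> MB
}

// A 3-minute track at 192 kbps is roughly 4.2 MB of audio data.
console.log(estimatedSizeMB(180, 192).toFixed(1));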
There is also a JavaScript implementation for the Web Audio API in another answer on that same question; perhaps that would be useful in your Node implementation.
MP3 Diags is older open-source software for fixing MP3s, and it was great for batch processing jobs like this. The source is C++ and still available if you're feeling nosy and want to see how some of these features are implemented.
It's worth a look, since it has some features that might be useful in your context:
detecting low-quality audio
detecting a missing VBR header
detecting missing normalization data
correcting files that show an incorrect song duration
correcting files in which the player cannot seek correctly

ZIP file format. How to read file properly?

I'm currently working on a Node.js project. I want to be able to read, modify, and write a ZIP file without saving it to the filesystem (we receive it over TCP and send it back after modifications are made), and so far that looks possible because of the simple ZIP file structure. Currently I refer to this documentation.
So ZIP file has simple structure:
File header 1
File data 1
File data descriptor 1
File header 2
File data 2
File data descriptor 2
...
[other not important yet]
First we need to read the file header, which contains a compressed size field; that looks like the perfect way to read file data 1, since we'd know its length. But it's actually not: this field may contain 0 or 0xFFFFFFFF, and then it doesn't describe the actual length, so we have to read the file data without knowing how long it is. But how?
The compression/decompression algorithm descriptions look pretty complex to me, and I plan to use zlib for the compression itself anyway. So if something useful is described there, I missed the point.
Can someone explain the proper way to read those files?
P.S. Please avoid suggesting npm modules. I don't want to just solve the problem; I also want to understand how things work.
Note - I'm assuming you want to read and process the zip file as it comes off the socket, rather than reading the complete zip file into memory before processing. Both options are valid.
I'd initially ignore the use cases where the compressed size has a value of 0 or 0xFFFFFFFF. The former is only present in zip files created in streaming mode (general purpose flag bit 3, where the sizes follow the data in a data descriptor); the latter is a marker that the real sizes live in a ZIP64 extra field, used for entries of 4 GiB or more.
Dealing with them adds a lot of complexity - you can add support for them later, if necessary. Whether you ever need to support these use cases depends on the nature of the zip files you intend to process.
When the compression method is deflate (8), use zlib for compression/decompression. You also need to support the stored method (0); it gets used for very small files where compression isn't appropriate.
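As a starting point, here is a minimal sketch (TypeScript/Node) of reading the fixed part of one local file header, ignoring the 0/0xFFFFFFFF cases as suggested above. Offsets are from PKWARE's APPNOTE; the function name is my own:

// Sketch: parse the fixed 30-byte part of one ZIP local file header.
// All multi-byte fields are little-endian. ZIP64 and the streaming
// (flag bit 3) cases are not handled here.
function readLocalFileHeader(buf: Buffer, offset: number) {
  if (buf.readUInt32LE(offset) !== 0x04034b50) {
    throw new Error("not a local file header"); // signature "PK\x03\x04"
  }
  const flags = buf.readUInt16LE(offset + 6);
  const nameLength = buf.readUInt16LE(offset + 26);
  const extraLength = buf.readUInt16LE(offset + 28);
  return {
    method: buf.readUInt16LE(offset + 8),           // 0 = stored, 8 = deflate
    streamed: (flags & 0x0008) !== 0,               // sizes deferred to a data descriptor
    compressedSize: buf.readUInt32LE(offset + 18),  // 0 / 0xFFFFFFFF in the hard cases
    uncompressedSize: buf.readUInt32LE(offset + 22),
    name: buf.toString("utf8", offset + 30, offset + 30 + nameLength),
    dataStart: offset + 30 + nameLength + extraLength,
  };
}

For method 8, the data that follows is a raw deflate stream, so zlib.inflateRawSync (not inflateSync, which expects a zlib header) decompresses it.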

Stream definition: Ignore all files but one filetype

We have a server with a depot that does not allow committing files through a client mapping, therefore I need a stream configuration.
Now I struggle with a task which I would assume should be simple:
We have a very large stream with lots of different file types and I would like to check out the entire stream but get only a certain file type.
Can this be done with Perforce without blacklisting every file type in question?
Edit: sorry that I (for some reason) omitted so much information in my question.
I am already setting up a virtual stream where the UI gives me three nice fields:
Paths – where I can enter import, share, and isolate paths
Remapping – ignored in my case
Ignored – here I can enter wildcards to ignore directories or files
I was hoping that by creating a virtual stream I could actually define the file types I want, e.g. write an import statement like
import RootDir/....txt //Depot/mainline/RootDir/....txt (note the four dots: Perforce's '...' wildcard followed by the literal '.txt')
however the stream definition does not support this and only allows me to write
import RootDir/... //Depot/mainline/RootDir/...
Since I was not able to find a way to whitelist the files I wanted, the only approach I knew was to blacklist everything I did not want, but I would like to avoid that because my Ignored list would be dozens of entries long.
Now I will look into that sync hint because I could use the full stream spec without filter and only sync the files I need on disk, which might be very good.
There are a few different things going on in your question but this seems the most like a statement of what you're trying to do so I'm going to zero in on it:
I would like to check out the entire stream but get only a certain file type.
If by "check out" you mean you only want to sync that file type to your local workspace:
p4 sync ....TXT
If by "check out" you mean you want to open only that file type for edit:
p4 edit ....TXT
ANY operation in Perforce that operates on files accepts an arbitrary file path, because Perforce tracks all of its state per-file. This is true whether you're using classic clients or streams.
There needs to be some mechanism for telling the Helix (Perforce) server that you only want to retrieve certain files from the stream.
Virtual Streams may be a good fit here, as they allow you to filter the view of an existing stream.
This means you can sync only the files you want and when you submit you will be submitting directly back to the stream your virtual stream is based on.
More information is available here:
https://www.perforce.com/perforce/doc.current/manuals/p4v/p4v_virtual_streams.html
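For illustration, a virtual stream that shares only the .txt files from its parent might look roughly like this (a sketch: the stream and depot names are made up, and the exact spec syntax can vary by server version):

Stream:  //Depot/mainline-txt
Owner:   yourname
Name:    mainline-txt
Parent:  //Depot/mainline
Type:    virtual
Paths:
    share RootDir/....txt

Because the stream is virtual, a sync through it fetches only the matching files, while submits go directly to //Depot/mainline.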

GDBM file import and export

I am migrating a system from the old server (Slackware) to the new one (Red Hat). The system includes some .gdbm files. I found that on my new server, when running
my $WEB_SERVICES = 'file.gdbm';
tie( %webservices, 'GDBM_File', $WEB_SERVICES, O_RDONLY, 0 );
the %webservices turns out to be empty. But this was working fine on my old server.
So my question is: can .gdbm files simply be transferred (using scp) from one server to another (different operating system and different version of gdbm)?
I also read the documentation at http://www.gnu.org.ua/software/gdbm/manual/gdbm.html#SEC12, which says .gdbm files need to be converted into a flat format before sending them over the network, but I'm still not sure how to do that.
Please help, thanks in advance!
On the old system, tie the hash to the GDBM file and dump the hash. Move the dump to the new system. There, read the dump into a hash and tie it to GDBM to write it out.
For the dump, use a platform-independent serialisation format (Sereal is best), or, if the dump needs to be human-readable, Data::Dumper or similar for writing and Data::Undump for reading.
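A rough sketch of that round trip, using core Storable for the flat file (Sereal or Data::Dumper would slot in the same way); the filenames are made up:

use GDBM_File;
use Storable qw(nstore retrieve);

# Old server: tie, copy into a plain hash, dump to a portable flat file.
tie my %old, 'GDBM_File', 'webservices.gdbm', &GDBM_READER, 0 or die "tie: $!";
my %copy = %old;                      # untied, in-memory copy of the data
nstore \%copy, 'webservices.dump';    # nstore = network byte order, portable
untie %old;

# New server: read the dump, tie a fresh GDBM file, write everything back.
my $data = retrieve 'webservices.dump';
tie my %new, 'GDBM_File', 'webservices.gdbm', &GDBM_WRCREAT, 0640 or die "tie: $!";
%new = %$data;
untie %new;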
