When do browsers send application/octet-stream as Content-Type? - jsf

I'm developing a file upload with JSF. The application saves three pieces of data about the file:
Filename
Bytes
Content-Type as submitted by the browser.
My problem is that some files are saved with content type = application/octet-stream even if they are *.doc or *.pdf files.
When does the browser submit such a content type?
I would like to clean up the database, so I need to know when the browser's information is incorrect.

Ignore the value sent by the browser. This is indeed dependent on the client platform, browser and configuration used.
If you want full control over content types based on the file extension, then better determine it yourself using ServletContext#getMimeType().
String mimeType = servletContext.getMimeType(filename);
The default mime types are defined in the web.xml of the servletcontainer in question. In for example Tomcat, it's located in /conf/web.xml. You can extend/override it in the webapp's /WEB-INF/web.xml as follows:
<mime-mapping>
<extension>xlsx</extension>
<mime-type>application/vnd.openxmlformats-officedocument.spreadsheetml.sheet</mime-type>
</mime-mapping>
You can also determine the mime type based on the actual file content (because the file extension may not per se be accurate, it can be fooled by the client), but this is a lot of work. Consider using a 3rd party library to do all the work. I've found JMimeMagic useful for this. You can use it as follows:
String mimeType = Magic.getMagicMatch(file, false).getMimeType();
Note that it doesn't detect all mimetypes equally reliably. You can also consider a combination of both approaches. E.g. if one returns null or application/octet-stream, use the other. Or if both return a different but "valid" mimetype, prefer the one returned by JMimeMagic.
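A minimal sketch of that combination, under some assumptions: the class and method names (MimeTypeChooser, choose) are made up for illustration, and the two arguments stand for whatever ServletContext#getMimeType() and JMimeMagic returned for the same upload. It only encodes the preference rule described above and does no detection itself.

```java
final class MimeTypeChooser {
    private static final String GENERIC = "application/octet-stream";

    private MimeTypeChooser() {}

    /** True if a detector returned something more specific than "unknown binary". */
    static boolean isUsable(String mimeType) {
        return mimeType != null
                && !mimeType.isEmpty()
                && !GENERIC.equalsIgnoreCase(mimeType);
    }

    /**
     * byExtension: result of the extension lookup (may be null).
     * byContent:   result of content sniffing (may be null).
     */
    static String choose(String byExtension, String byContent) {
        if (isUsable(byContent)) {
            return byContent;      // content-based result wins when both are usable
        }
        if (isUsable(byExtension)) {
            return byExtension;    // otherwise fall back to the extension lookup
        }
        return GENERIC;            // neither detector knew anything better
    }
}
```

Preferring the content-based result when the two disagree follows the advice above; swap the first two checks if you trust your extension table more.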
Oh, I almost forgot to add, in JSF you can obtain the ServletContext as follows:
ServletContext servletContext = (ServletContext) FacesContext.getCurrentInstance().getExternalContext().getContext();
Or if you happen to use JSF 2.x already, use ExternalContext#getMimeType() instead.

It depends on the OS, the browser, and how the user has configured them. It's based on the way the browser determines the file type of local files (to display them). On most OS/browser combinations this is based on the file's extension, but on some it may be determined by other means (e.g. on Mac OS).
In any case, you shouldn't really rely on the Content-type sent by the browser. The best approach would be to actually look at the contents of the file. You could probably also use the filename, but keep in mind that browsers aren't necessarily going to be good about telling you that either (though it's probably still a lot more reliable than the Content-type they send).

Related

How do you change the MIME type of a file from the terminal?

What I'm looking for is a counterpart to file -I (Darwin; -i on Linux).
For example, given:
$ file -I filename.pdf
filename.pdf: application/octet-stream; charset=binary
I would like to be able to do something like this:
$ [someCommand] filename.pdf application/pdf
The result would be that filename.pdf would then be typed as application/pdf.
The reason for the question is that sometimes web servers use the wrong MIME type, which results in programs refusing to open the file. (Most often text/plain, in my experience.)
I've been searching man, the web and this site for about two and a half hours. Tried everything from hex dumps to xattr to text editors.
Your help would very much be appreciated.
Chris
The thing about MIME types is they're almost entirely fictional.
MIME and HTTP ask us to pretend that all of our files have a piece of metadata identifying the "content type". When we send files around the network, the "content type" metadata goes with them, so nobody ever misinterprets the content of a file.
The truth is this metadata doesn't exist. By the time MIME was invented, it was really too late to convince any OS vendors to adopt a new type system for files. Unix had settled on magic numbers, DOS had settled on 3-letter filename suffixes, and classic MacOS had its creator codes and type codes. (MacOS type codes were closest to the MIME model, since they actually were separate from both the filename and the content. But being only 4 letters long, MIME types wouldn't fit.)
Nobody stores MIME-compatible content types in their filesystem. When a MIME message composer or HTTP server wants to send a file, it decides the file type in the traditional way (filename suffix and/or magic number) and maps the result to a MIME type.
In contrast to the theory (where MIME eliminates file type guessing), MIME as implemented in practice has moved the "guess file type based on filename suffix and/or magic number" logic from the receiver of the file to the sender. As you have noticed, the sender doesn't usually do a better job than the receiver would have done if forced to figure it out for itself. Frequently in the case of a web server, the server's eagerness to slap a Content-type on a file makes things worse. There's no reason for a web server to know anything about the format of files it serves when it is only being used to distribute them and has no need to interpret their contents.
The file command guesses file type by reading the content and looking for magic numbers and strings. The -I option doesn't change that. It just chooses a different output format.
To change the Content-Type header that a web server sends for a specific file, you should be looking in your web server's configuration manual. There's nothing you can do to the file itself.
It's a bit of a category mistake to talk about ‘the MIME type of a file’ – ‘files’ don't have MIME types; only octet streams have them (I'm not necessarily disagreeing with #wumpus-q-wumbley's description of MIME types as ‘fictional’, but this is another way of thinking about it).
MIME stands for Multipurpose Internet Mail Extensions, as originally described in RFC 2045, and MIME types were originally intended to describe what a receiver is supposed to do with the bunch of bytes soon to follow down the wire, in the rest of the email message. They were very naturally repurposed in (for example) the HTTP protocol, to let a client understand how it is to interpret the bytes in the HTTP response which this MIME type forms the header of.
The fact that the file command can display a MIME type suggests the further extension of the idea, to act as the key which lets a windowing system look up the name of an application which should be used to open the file.
Thus, if ‘the MIME type of a file’ means anything, it means ‘the MIME type which a web server would prefix to this file if it were to be delivered in response to an HTTP request’ (or something like that). Thought of like that, it's clear that the MIME type is part of the web server's configuration, and not anything intrinsic to the file – a single file might be delivered with various MIME types depending on the URL which retrieves it, and details of the request and configuration. Thus an XHTML file might be delivered as text/html or application/xml or application/octet-stream depending on the details of the HTTP request, the directory the file's located in, or indeed the phase of the moon (the latter would be an unhelpful server configuration).
A web server might have a number of mechanisms for deciding on this MIME type, which might include a lookup table based on any file extension, a .htaccess file, or indeed the output of the file command.
So the answer to your question is: it depends.
If what you want to do is change how a web server delivers this file, then you need to look at either your web server documentation, or the contents of your system's /etc/mime.types file (if your system uses that and if the server is configured to fall back on that).
If what you want to do is to change the application which opens a given (type of) file, then your OS/window-manager documentation should help.
If you need to change the output of the file command specifically, for some other reason, then man file is your friend, and you'll probably need to grub around in the magic numbers file, reasonably carefully.
If you have a PDF and the command file --mime-type reports application/octet-stream instead of application/pdf, something in your file is corrupted.
PDF readers will open it anyway and ignore the problem, but if you upload the file to a web application, the application will detect the MIME type as application/octet-stream. That can be a problem, mainly if you validate the MIME type (I sometimes run into this in my own application).
For a quick fix, rewrite the file with Ghostscript like this:
gs -o new.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress old.pdf

How to use ImageMagick to test if received input is an image (for security purposes)?

Imagine an environment in which users can upload images to a website by either uploading it from their pc or referring to a remote url.
As part of some security checks I'd like to make sure that the referenced object is indeed an image.
In the case of a remote-url, I of course check the content-type, but this isn't bullet-proof.
I figured I could use ImageMagick for the task, perhaps by executing the ImageMagick.identify() method: if no error is returned and the returned type is one of JPG, GIF, etc., the content is an image. (In a quick check I noticed that TXT files are identified correctly as well, so I would have to blacklist those.)
Is there a better way of doing this?
You could probably simply load the image via ImageMagick's appropriate function for your language of choice. If the image isn't formatted properly (in terms of internal formatting, not its aesthetic properties, that is), I would expect ImageMagick to refuse to load it and report an error. In PHP, for example, readImage returns false if the image fails to load.
Alternatively, you could read the first few hundred bytes of the file and determine if the expected image file format headers are present; e.g., "GIF89" etc.
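A rough sketch of the header check, assuming you have already read the first bytes of the upload into an array. The class name ImageSignature and the choice of three formats are mine, for illustration only. A matching signature says nothing about whether the rest of the file is well-formed, so treat this as a cheap first filter.

```java
import java.nio.charset.StandardCharsets;

final class ImageSignature {
    private ImageSignature() {}

    /** Returns "gif", "png", "jpeg", or null if the header matches none of them. */
    static String sniff(byte[] head) {
        if (startsWith(head, "GIF87a".getBytes(StandardCharsets.US_ASCII))
                || startsWith(head, "GIF89a".getBytes(StandardCharsets.US_ASCII))) {
            return "gif";
        }
        // PNG files begin with a fixed 8-byte signature.
        if (startsWith(head, new byte[] {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A})) {
            return "png";
        }
        // JPEG streams begin with the SOI marker FF D8, followed by another marker (FF ..).
        if (startsWith(head, new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF})) {
            return "jpeg";
        }
        return null;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```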
These checks may backfire if your image is in a compressible format (PNG, GIF) and is constructed in a way similar to a zip bomb: https://en.wikipedia.org/wiki/Zip_bomb
Some examples at ftp://ftp.aerasec.de/pub/advisories/decompressionbombs/pictures/ (nothing special about that site, I just googled decompression bombs)
Another related issue is that formats like SVG are in fact XML and some image processing tools are prone to a variant of "billion laughs" attack https://en.wikipedia.org/wiki/Billion_laughs
You should not store the original file. The generally recommended approach is to always re-process the image and convert it to an entirely new file. Vulnerabilities have been exploited inside valid image files (see GIFAR), so checking validity alone would not have helped.
Never expose your visitors to an image file that you have not written out yourself and for which you did not choose the file name yourself.
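To illustrate the re-processing idea, here is a sketch in Java using the JDK's javax.imageio instead of ImageMagick (my substitution, to keep the example self-contained). The names ImageReencoder, reencodeAsPng and newFileName are hypothetical. It decodes the upload to pixels, writes a fresh PNG from those pixels, and picks the stored file name itself. It does not guard against decompression bombs, so you would still want a size limit in front of it.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.UUID;
import javax.imageio.ImageIO;

final class ImageReencoder {
    private ImageReencoder() {}

    /**
     * Decodes the uploaded bytes and encodes the pixels again as a new PNG.
     * Throws IOException if the bytes cannot be decoded as an image.
     */
    static byte[] reencodeAsPng(byte[] uploaded) throws IOException {
        BufferedImage decoded = ImageIO.read(new ByteArrayInputStream(uploaded));
        if (decoded == null) {
            // ImageIO.read returns null when no registered reader understands the data.
            throw new IOException("upload is not a decodable image");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (!ImageIO.write(decoded, "png", out)) {
            throw new IOException("no PNG writer available");
        }
        return out.toByteArray();
    }

    /** Server-chosen name; nothing from the client's file name is reused. */
    static String newFileName() {
        return UUID.randomUUID() + ".png";
    }
}
```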

zip mime types, when to pick which one

So far for Mime Types for Zip files I've seen:
application/octet-stream
multipart/x-zip
application/zip
application/zip-compressed
application/x-zip-compressed
I guess my question is which is the "best" and why? Why are there so many choices? I use WinRAR and it doesn't seem to care what the MIME type is, but WinZip seems to only like multipart/x-zip and application/octet-stream. Is there a MIME type I can have all zip files downloaded as that will work in all programs?
thanks!
The MIME type registered with IANA is application/zip: http://www.iana.org/assignments/media-types/application/zip
WinZip is not a reference implementation (the ZIP format was originally developed by PKWARE).
Some facts about MIME types
MIME types follow a format: media-type/subtype-identifier. Example: image/png.
IANA maintains a list of all registered media types and subtypes.
The x- prefix of a subtype-identifier indicates that it is experimental and non-standard (not registered with IANA).
Now about the zip specifically...
application/zip is a standard MIME type for zip files, officially registered with IANA. It seems like a good first choice :)
application/octet-stream is defined in RFC 2045 and 2046: The “octet-stream” subtype is used to indicate that a body contains arbitrary binary data, so the content can be anything, not just zip.
multipart/x-zip - unlike a “discrete” type, the “multipart” type is one which represents a document that's comprised of multiple component parts, each of which may have its own individual MIME type. I suspect that the logic here is that a compressed file consists of multiple files. Thus, zip fits the “multipart” definition. But to me, it looks like overinterpretation, I would expect plain-text delimiters between parts to classify content as multipart. Moreover, it's not registered as a standard.
application/zip-compressed - a non-standard type, the naming violates RFC2046: A media type value beginning with the characters “X-” is a private value, to be used by consenting systems by mutual agreement. Any format without a rigorous and public definition must be named with an “X-” prefix
application/x-zip-compressed - some non-standard convention, I'm not sure if there is any significant usage
application/x-zip - some non-standard convention, I'm not sure if there is any significant usage
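If you receive these variants from clients and want one value in your own data, something like the following sketch could fold them into the registered type. ZipMimeTypes and normalize are hypothetical names, and the alias list is just the one from this thread. application/octet-stream is deliberately left alone, since it can mean any binary content.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class ZipMimeTypes {
    static final String REGISTERED = "application/zip";

    // Non-standard spellings mentioned above, plus the registered one.
    private static final Set<String> ALIASES = new HashSet<>(Arrays.asList(
            "application/zip",
            "application/x-zip",
            "application/zip-compressed",
            "application/x-zip-compressed",
            "multipart/x-zip"));

    private ZipMimeTypes() {}

    /** Maps known zip aliases to application/zip; returns anything else unchanged. */
    static String normalize(String mimeType) {
        if (mimeType == null) {
            return null;
        }
        String key = mimeType.trim().toLowerCase(Locale.ROOT);
        return ALIASES.contains(key) ? REGISTERED : mimeType;
    }
}
```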

What does the 'x' in the extensions aspx, docx, xlsx, etc. represent?

Or at least describe about.aspx
For .aspx I assumed it stands for:
Active Server Page eXtended format
Though another opinion is that:
these files typically contain static (X)HTML markup, as well as markup defining server-side Web Controls and User Controls
Apparently it was the cool thing to do at the time (the quote actually talks about the original name XSP, but doesn't rule it out as an option):
The initial prototype was called "XSP"; Guthrie explained in a 2007 interview that, "People would always ask what the X stood for. At the time it really didn't stand for anything. XML started with that; XSLT started with that. Everything cool seemed to start with an X, so that's what we originally named it."
For the office documents, since they are in XML format, it stands for XML.
I guess it stands for XML.
Since XML was used heavily in .NET Framework and later on in Open XML formats for Excel, Word.
If I was correctly informed, it stands for 'XML' - these files are renamed, zipped XML documents. That goes for .docx, .xlsx etc.; don't know about .aspx since that's web stuff.
They usually contain static XHTML
This is to indicate that the page contains (X)HTML and the rest of the code is in the code-behind (e.g. about.aspx.cs or about.aspx.vb)

What is the difference between: image/x-citrix-pjpeg and image/pjpeg

Some files are uploaded with a reported MIME type:
image/x-citrix-pjpeg
They are valid jpeg files and I accept them as such.
I was wondering however: why is the MIME type different?
Is there any difference in the format? or was this mimetype invented by some light bulb at citrix for no apparent reason?
Update:
OK, I did some more searching and testing on this question, and it turns out they're all lying about the MIME type (never trust any info sent by the client, I know).
I've checked a bunch of files with different encodings (created with libjpeg)
Official MIME type for jpeg files: image/jpeg
But some applications (most notably MS Internet Explorer but also Yahoo! Mail) send jpeg files as image/pjpeg
I thought I knew that pjpeg stood for 'progressive' jpeg. It turns out that progressive/standard encoding has nothing to do with it.
MS Internet Explorer sends all jpeg files as pjpeg regardless of the contents of the file.
The same goes for Citrix: all jpeg files sent from a Citrix client are reported with the image/x-citrix-pjpeg MIME type.
The files themselves are untouched (identical before and after upload). So it turns out that the difference in MIME type is only an indication of the software used to send the file?
Why would people invent a new MIME type if there are no differences in the file contents?
image/x-citrix-pjpeg seems to be the MIME type sent by images which are exported from a Citrix session.
I haven't come across any format differences between them and regular JPEGs - most image conversion utilities will handle them the same as a regular pjpeg, once the appropriate mime-type rule is added.
It's possible that in a Citrix session there is some internal magic done when managing jpegs which led them to create this mime-type, which they leave on the file when it's exported from their systems, but that's only my guess. As I say, I haven't noticed any actual format differences from the occasional files in this format we receive.
The closest I have come to finding out what this is, is this thread. Hope it helps.
http://forums.citrix.com/message.jspa?messageID=713174
For some reason, when people are running Internet Explorer via Citrix, it changes the mime type for GIF and JPG files.
JPG: image/x-citrix-pjpeg
GIF: image/x-citrix-gif
Based on my testing, PNG files are not affected. I don't know if this is an Internet Explorer issue or Citrix.
It's to do with a feature of Citrix called SpeedBrowse, which intercepts jpegs and gifs in webpages on the [Citrix] server side, so that it can send them whole via ICA (the Citrix remoting protocol) -- this is more efficient than screen-scraping them. As a previous poster suggested, this is implemented by marking the images with a changed mime type.
IIRC it hooks FindMimeFromData in IE to change the mime type on the fly, but this is being applied to uploaded files as well as downloaded ones - surely a bug.
From what I recall the Progressive JPG format is the one that would allow the image to be shown with progressively higher resolution as the download of the file progressed. I am not entirely aware of the details, but if you remember back in the days of dial up, some files would show blurry, then better and eventually complete as they were downloaded. For this to work the data needs to be sent in a different order than a JPEG would typically be sent.
The actual data, once you view it, is identical; it is just sent in a different order. The JPEG encoding itself may very well group pixels differently, I forget.
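If you want to check whether a given file really is progressive, rather than going by the reported MIME type, one way is to look at which start-of-frame marker the JPEG stream contains: SOF0 (FF C0) for baseline, SOF2 (FF C2) for progressive. The sketch below is a simplified marker walk written for illustration (JpegKind and frameType are made-up names). It is not a full JPEG parser and gives up on anything unusual.

```java
final class JpegKind {
    private JpegKind() {}

    /** Returns "baseline", "progressive", "other", or null if no frame header is found. */
    static String frameType(byte[] data) {
        if (data == null || data.length < 4
                || (data[0] & 0xFF) != 0xFF || (data[1] & 0xFF) != 0xD8) {
            return null;                               // missing SOI marker: not a JPEG stream
        }
        int pos = 2;
        while (pos + 3 < data.length) {
            if ((data[pos] & 0xFF) != 0xFF) {
                return null;                           // expected a marker here
            }
            int marker = data[pos + 1] & 0xFF;
            if (marker == 0xFF) {                      // fill byte before a marker
                pos++;
                continue;
            }
            if (marker == 0x01 || (marker >= 0xD0 && marker <= 0xD8)) {
                pos += 2;                              // stand-alone markers carry no length
                continue;
            }
            if (marker == 0xC0) {
                return "baseline";
            }
            if (marker == 0xC2) {
                return "progressive";
            }
            if (marker >= 0xC1 && marker <= 0xCF
                    && marker != 0xC4 && marker != 0xC8 && marker != 0xCC) {
                return "other";                        // remaining SOF variants
            }
            if (marker == 0xDA || marker == 0xD9) {
                return null;                           // scan data or end of image, no frame header seen
            }
            int length = ((data[pos + 2] & 0xFF) << 8) | (data[pos + 3] & 0xFF);
            if (length < 2) {
                return null;                           // corrupt segment length
            }
            pos += 2 + length;                         // skip marker plus segment payload
        }
        return null;
    }
}
```

Running a check like this over files uploaded as image/jpeg, image/pjpeg and image/x-citrix-pjpeg would be one way to confirm the observation above that the reported type and the actual encoding are unrelated.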

Resources