I have a table with documents saved some of them in pdf, some of them image.
I want to create a web app, to show the images (that can be either pdf, either jpg) in the same control.
I can manage to see pdf, if I set the Response.ContentType = "application/pdf" or image if I set "application/jpg". But the problem is that how can I get the file type, having only the stream saved into the database? Does it have the stream the file type information in it?
Thanks.
No, a stream does not have a content type associated with it. If you had the original filename, you could attempt to derive the content type from that, but it wouldn't be foolproof.
Many file formats have a series of "magic bytes" that allow you to detect what (might) be in the file. PDF, for example, begins with the bytes "%PDF" (note: I'm not an expert on PDF, and there may be situations where that is not true).
If you have no other option, you could attempt to parse the file using various libraries until you found one that worked (System.Drawing.Image.FromStream(), iTextSharp, etc).
Related
In work place everyday we used different type of documents to hold data. For example, DOC, XLSX, PDF files. And sometime we use software (like adobe reader) excel to PDF converter.
As far i know another way to convert a document from excel to pdf is changing the document type from the SaveAs option (correct me if i am wrong) or changing the file extension.
My question is when we change the Document type from save as option does it change the code behind the file?
Another silly question is if we can convert the file by changing extension why we are paying for 3rd party software?!
Every document type has its format. So behind the screen, every type has its style. For example, XLSX format is a combination of XML and zip compression. PDF is a rich document representation format created by Adobe uses PostScript.
When you save a document as XLSX, the document will be saved as its standards. The saving method will be changed. As an answer to your first question, Yes the coding(method) will be changed when saving.
For the second question, the changing file format is not always an easy task. You need to change the encoding of the file when performing the conversion. When you change the extension you do not apply any conversion operation. You say your computer "This is an ... file.". But the encoding of the file is still unchanged.
I'm using ldap3.
I can connect and read all attributes without any issue, but I don't know how to display the photo of the attribute thumbnailPhoto.
If I print(conn.entries[0].thumbnailPhoto) I get a bunch of binary values like b'\xff\xd8\xff\xe0\x00\x10JFIF.....'.
I have to display it on a bottle web page. So I have to put this value in a jpeg or png file.
How can I do that?
The easiest way is to save the raw byte value in a file and open it with a picture editor. The photo is probably a jpeg, but it can be in any format.
Have a look at my answer at Display thumbnailPhoto from Active Directory in PHP. It's especially for PHP but the concept is the same for Python.
basically it's about either using the base64 encoded raw-data as data-stream or actually using a temporary file that is serverd (or used to determine the mime-type)
The Problem
I recieved a pdf file at work which I then printed. In the pdf file there were several optional fields where one could enter information such as "place of birth" etc. If I open the pdf file on my computer, I can see a set of input information A (a travel request with dates from this year 2017).
If I print the pdf on the local printer, the printed document contains a set of information B which for example contained travel request dates from 2015.
This information was not visible when opening the file on my computer.
I have been able to reproduce the error multiple times.
Why is this a problem?
It seems that previous entries into the pdf were yet somehow stored in the pdf contrary to what was visible when opening the pdf. When printing, the printer seems to access only the oldest entries and prints those.
This is a potential breach regarding data privacy and security since the pdf file seems to save all previous entries without anyone knowing.
Especially at work, some of these pdfs contains bank account information and other identity related information.
The Question
Did anyone experience a simliar issue or knows how to delete the invisible old information yet stored in the pdf?
UPDATE1: I could not reproduce the error on other printers. It seems this error is caused by the specific printer. Yet the information must be present in the PDF file, which is the specific cause of my question.
UPDATE2: Using the information from the accepted answer, I used the program "PDF CHAIN" and selected the option "drop XFA from document". I then saved the manipulated document again and printed it on the same printer.
Finally, the correct information was printed.
At a guess (and that's all it is without being able to see the original file) the PDF contains optional content or annotations which contain different field data for Print and Screen.
If you open the file using a PDF consumer (eg Acrobat) then what you see is the 'screen' result. Depending on the consumer you are using it may then either send the screen data to the printer, or substitute with the 'Print' data.
The printer you note as being a problem is capable of direct PDF printing, you haven't stated if that's how you are printing the PDF file, or whether you are using an application, nor whether the other printers are PDF capable or not.
My guess is that there is a different decision being made somewhere in the 2 print paths as to which is the 'correct' information to print.
Note that this does not mean that the PDF 'seems to save all previous entries without anyone knowing'; that's not really possible with a PDF file.
A malicious PDF processing application could do so, by adding comments to the PDF file, but only that application would be able to retrieve it.
But it is possible to have multiple entries of different types for different purposes, and if they aren't the same (because of the tool used to edit the file) then you can get strange results like this.
Note that if this is a problem for you then you probably shouldn't be using PDF, but you can mitigate the issue by digitally signing your documents. Signed PDF files include means (secure cryptographic hash) for verifying that the document has not been tampered with . Of course, you can't then edit the PDF file without re-signing it.
Oh, one other possibility would be that the PDF was actually an XFA form; its possible to have part of the document be a valid PDF which prints 'something' when a PDF consumer can't handle an XFA form, but that need bear no relation to what you see when you use an XFA processor.
My money's on optional content, AcroForm fields, or annotations where the Print data is different from the Screen data though.
Imagine an environment in which users can upload images to a website by either uploading it from their pc or referring to a remote url.
As part of some security checks I'd like to make sure that the referenced object is indeed an image.
In the case of a remote-url, I of course check the content-type, but this isn't bullet-proof.
I figured I could use ImageMagick to do the task. Perhaps executing the ImageMagick.identify() method and if no error is returned and returned type is either JPG|GIF|,etc. the content is an image. (In a quick check I noticed that TXT files are identified correctly as well, so I have to blacklist these)
Is there any better way in doing this?
You could probably simply load the image via ImageMagick's appropriate function for your language of choice. If the image isn't formatted properly (in terms of internal formatting, not its aesthetic properties, that is), I would expect ImageMagick to refuse to load it and report an error. In PHP, for example, readImage returns false if the image fails to load.
Alternatively, you could read the first few hundred bytes of the file and determine if the expected image file format headers are present; e.g., "GIF89" etc.
These checks may backfire, if your image is in a compressable format (PNG, GIF) and it is constructed in a way similar to a zip bomb https://en.wikipedia.org/wiki/Zip_bomb
Some examples at ftp://ftp.aerasec.de/pub/advisories/decompressionbombs/pictures/ (nothing special about that site, I just googled decompression bombs)
Another related issue is that formats like SVG are in fact XML and some image processing tools are prone to a variant of "billion laughs" attack https://en.wikipedia.org/wiki/Billion_laughs
You should not store the original file. The generally recommended approach is to always re-process the image and convert it to an entirely new file. There have been vulnerabilites exploited inside valid image files (see GIFAR), so checking for this would have been useless.
Never expose your visitors to an image file that you have not written out yourself and for which you did not choose the file name yourself.
I am trying to assess our security risk if we allow to have a form in our public website that lets the user upload any type of file and get it stored in the database.
I am worried about the following:
Robots uploading information
A huge increment of the size of the database
The form is an resume upload so HR people will be downloading those files in a jpeg or doc or pdf format but actually getting a virus.
You can use captchas for dealing with robots
Set a reasonable file size limit for each upload
You can do multiple checking for your file upload control.
1) Checking the extension of file (.wmv, .exe, .doc). This can be implemented by Regex expression.
2) Actually check the file header or definition type (ex: gif, word, image, etc, xls). Sometimes file extension is not sufficient.
3) Limit the file size. (Ex: 20mb)
4) Never accept the filename provided by the user. Always rename the file to some GUID according to your specifications. This way hacker wont be able to predict the actual name of the file which is stored on the server.
5) Store all the files out of web virtual directory. Preferably store in separate File Server.
6) Also implement the Captcha for File upload.
In general, if you really mean to allow any kind of file to be uploaded, I'd recommend:
A minimal type check using mime magic numbers that the extension of the file corresponds to the given one (though this doesn't solve much if you are not going to limit the kinds of files that can be uploaded).
Better yet, have an antivirus (free clamav for example) check the file after uploading.
On storage, I always prefer to use the filesystem for what it was created: storing files. I would not recommend storing files in the database (suposing a relational database). You can store the metadata of the file on the database and a pointer to the file on the file system.
Generate a unique id for the file and you can use a 2-level directory structure to store the data: E.g: Id=123456 => /path/to/store/12/34/123456.data
Said that, this can vary depending on what you want to store and how do you want to manage it. It's not the same to service a document repository, a image gallery or a simple "shared directory"