Get the actual file extension using NiFi - python-3.x

I am using Apache NiFi to ingest data from Azure Storage. The file I want to read is huge (100+ GB) and can have any extension, so I want to read the file's header to get its actual extension.
I found the python-magic package, which uses libmagic to read the file's header and determine its type, but this requires the file to be present locally.
The NiFi pipeline to ingest the data looks like this
I need a way to get the file extension in this NiFi pipeline. Is there a way to read the file's header from the Content Repo? If yes, how do we do it? FlowFile has only the metadata which says the content-type as text/plain for a CSV.

There is no such thing as a generic 'header' that all files have that gives you its "real" extension. A file is just a collection of bits, and we sometimes choose to give it an extension, header, footer, etc. so that we know how to interpret those bits.
We tend to add that 'type' information in two ways: via a file extension, e.g. .mp4, and/or via some metadata that accompanies the file. This is sometimes a header, which is sometimes plaintext and easily readable, but this is not always true. Additionally, it is up to the user and/or the application to set this information, and up to the user and/or application to read it, neither of which is a given.
If you do not trust that the file has the proper extension applied (e.g. video.txt when it's actually an mp4) then you could also try to interrogate the metadata that is held in Azure Blob Storage (ContentType) and see what that says - however, this is also up to the user/application to set when the file is uploaded to ABS, so there is no guarantee that it is any more accurate than the file extension.
text/plain is not invalid for a plaintext CSV, as CSVs are just formatted plaintext - similar to JSON. However, you can be more specific and use e.g. text/csv for CSV and application/json for JSON.
NiFi does have IdentifyMimeType, which can try to work it out for you by interrogating the file, but it is more complex than just accessing some 'header'. This processor uses Apache Tika for the detection, and adds a mime.type attribute to the FlowFile.
If your file is some kind of custom format, then this processor likely won't help you. If you know your files have a specific header, then you'll need to provide more information for your exact situation.
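As a rough illustration of why the whole 100+ GB file does not need to be local: signature-based detection only looks at the first few bytes of the content. The sketch below is a minimal, standard-library-only example. The signature table, the HEAD_SIZE value and the guess_extension name are assumptions made for illustration, not part of NiFi, Tika or python-magic. (python-magic also offers magic.from_buffer, which accepts bytes instead of a path, so the same "read only the head" idea should apply there.)

```python
import io

# Illustrative signature table: (offset, magic bytes, guessed extension).
# Real detectors such as libmagic or Apache Tika know far more formats.
SIGNATURES = [
    (0, b"%PDF-", "pdf"),
    (0, b"\x89PNG\r\n\x1a\n", "png"),
    (0, b"\xff\xd8\xff", "jpg"),
    (0, b"PK\x03\x04", "zip"),  # also used by docx/xlsx/jar containers
    (0, b"\x1f\x8b", "gz"),
    (4, b"ftyp", "mp4"),        # ISO base media family (mp4, mov, ...)
]

HEAD_SIZE = 4096  # assumption: a few KB is enough for these signatures


def guess_extension(stream, head_size=HEAD_SIZE):
    """Read only the first bytes of a stream and guess a type from them."""
    head = stream.read(head_size)
    for offset, magic, ext in SIGNATURES:
        if head[offset:offset + len(magic)] == magic:
            return ext
    return None  # unknown, or plain text with no signature (e.g. CSV)


if __name__ == "__main__":
    # Stand-in for the start of a large blob; only the head is ever read.
    fake_blob = io.BytesIO(b"\x00\x00\x00\x18ftypmp42" + b"\x00" * 100)
    print(guess_extension(fake_blob))
```

Note that a CSV has no signature at all, which is one reason a detector falls back to something like text/plain for it.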

Related

Azure Logic Apps: Check for file type

I set up an Azure Logic App that checks for newly created files in a OneDrive folder and then sends these (images) to the MS Vision API for tagging. This flow works fine.
How can I set up a condition to only react to a specific file type (images) or, even better, only when the file has a certain file ending, like ".jpg", ".png" etc.?
I tried to set up a condition on the "File content type" but couldn't figure out the appropriate value for the condition ("image" doesn't work).
I couldn't find any hints on the webs and neither on SO. Any help is very much appreciated.
When reading file attachments using the GMail action, I had to use starts with because the Content-Type property contained the MIME type followed by the file name.
The following example is for checking if the file is an Excel file (.xlsx, not .xls):
I also used http://mime.ritey.com/ to upload my files and ensure I had the MIME type correct.
File name is part of the metadata provided by the OneDrive Connector.
Using that, you can apply conditions/filters based on the extension. File content type is probably pretty reliable but in practice, the extension might be better.
I think I found a solution. I was able to kind of reverse engineer the file types by setting up an app that is triggered by new files and writes the file content type to a text file in a different folder.
image/jpg and image/png are image files
application/x-zip-compressed is a zipped file
So it seems that Azure uses standard MIME types to identify the file type (which very much makes sense... :0)
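The two conditions discussed in these answers ("content type starts with" and "file name ends with") can be sketched as plain logic. This is Python for illustration only, not Logic Apps expression syntax; the is_image name and the extension allowlist are assumptions.

```python
import os

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}  # assumed allowlist


def is_image(file_name, content_type):
    """True if either the MIME type or the file extension says 'image'."""
    # A "starts with" test tolerates suffixes such as '; name="photo.jpg"'.
    by_type = (content_type or "").lower().startswith("image/")
    ext = os.path.splitext(file_name or "")[1].lower()
    return by_type or ext in IMAGE_EXTENSIONS
```

Matching on the "image/" prefix rather than the bare word "image" is what makes the content-type condition work for values like image/jpg and image/png.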

Convert .AIB (Audio Media indexer ) file into readable format (String)

I want to parse a .AIB file that I got as output from an Azure Media Indexer job, so that I can read the data in the file and extract the required information.
The AIB file format is a binary blob designed for use with the SQL Server Add-in. Have you read through the blog post here? https://azure.microsoft.com/en-us/blog/using-aib-files-with-azure-media-indexer-and-sql-server/
If you have any more questions after reading that blog post, or you already have, I can help you further.
Would you mind sharing your scenario that requires the AIB file? Almost all of our customers have found that the plain-text outputs are sufficient for their use cases.
Adarsh

How to use ImageMagick to test if received input is an image (for security purposes)?

Imagine an environment in which users can upload images to a website by either uploading it from their pc or referring to a remote url.
As part of some security checks I'd like to make sure that the referenced object is indeed an image.
In the case of a remote-url, I of course check the content-type, but this isn't bullet-proof.
I figured I could use ImageMagick to do the task, perhaps by executing the ImageMagick.identify() method: if no error is returned and the returned type is JPG, GIF, etc., the content is an image. (In a quick check I noticed that TXT files are identified correctly as well, so I have to blacklist these.)
Is there any better way in doing this?
You could probably simply load the image via ImageMagick's appropriate function for your language of choice. If the image isn't formatted properly (in terms of internal formatting, not its aesthetic properties, that is), I would expect ImageMagick to refuse to load it and report an error. In PHP, for example, readImage returns false if the image fails to load.
Alternatively, you could read the first few hundred bytes of the file and determine if the expected image file format headers are present; e.g., "GIF89" etc.
These checks may backfire if your image is in a compressible format (PNG, GIF) and is constructed in a way similar to a zip bomb: https://en.wikipedia.org/wiki/Zip_bomb
Some examples at ftp://ftp.aerasec.de/pub/advisories/decompressionbombs/pictures/ (nothing special about that site, I just googled decompression bombs)
Another related issue is that formats like SVG are in fact XML and some image processing tools are prone to a variant of "billion laughs" attack https://en.wikipedia.org/wiki/Billion_laughs
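For the decompression-bomb concern, one cheap precaution is to read the declared dimensions from the file header before handing the file to an image library. The sketch below does this for PNG only, using the fixed layout of the signature and IHDR chunk. The MAX_PIXELS limit and the function names are assumptions, and this check only covers declared dimensions, so it is a partial mitigation rather than a complete defense.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MAX_PIXELS = 50_000_000  # assumed limit; tune for your service


def png_dimensions(head):
    """Return (width, height) from a PNG's IHDR chunk, or None if not a PNG."""
    # Layout: 8-byte signature, 4-byte chunk length, b"IHDR", width, height.
    if len(head) < 24 or head[:8] != PNG_SIGNATURE or head[12:16] != b"IHDR":
        return None
    return struct.unpack(">II", head[16:24])


def is_safe_png(head, max_pixels=MAX_PIXELS):
    """Reject non-PNG data and PNGs whose declared area exceeds the limit."""
    dims = png_dimensions(head)
    if dims is None:
        return False
    width, height = dims
    return 0 < width * height <= max_pixels
```

Only the first 24 bytes of the upload are needed, so the check can run before the image is decoded.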
You should not store the original file. The generally recommended approach is to always re-process the image and convert it to an entirely new file. There have been vulnerabilities exploited inside valid image files (see GIFAR), so checking for this would have been useless.
Never expose your visitors to an image file that you have not written out yourself and for which you did not choose the file name yourself.

Render image or pdf stream from SQL database in asp.net

I have a table with saved documents; some of them are PDFs and some are images.
I want to create a web app that shows the documents (which can be either PDF or JPG) in the same control.
I can display a PDF if I set Response.ContentType = "application/pdf", or an image if I set "application/jpg". The problem is: how can I get the file type when I only have the stream saved in the database? Does the stream contain the file type information?
Thanks.
No, a stream does not have a content type associated with it. If you had the original filename, you could attempt to derive the content type from that, but it wouldn't be foolproof.
Many file formats have a series of "magic bytes" that allow you to detect what (might) be in the file. PDF, for example, begins with the bytes "%PDF" (note: I'm not an expert on PDF, and there may be situations where that is not true).
If you have no other option, you could attempt to parse the file using various libraries until you found one that worked (System.Drawing.Image.FromStream(), iTextSharp, etc).
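The approach in this answer can be sketched as follows: check the magic bytes first, fall back to the original file name if one was stored, and otherwise use a generic type. The example is Python for illustration rather than ASP.NET; pick_content_type is a hypothetical helper, and the returned string is what would be assigned to the response content type.

```python
import mimetypes


def pick_content_type(data, file_name=None):
    """Choose a Content-Type from leading bytes, then from the file name."""
    if data.startswith(b"%PDF"):
        return "application/pdf"
    if data.startswith(b"\xff\xd8\xff"):  # JPEG start-of-image marker
        return "image/jpeg"
    if file_name:
        # Extension-based guess; not foolproof, as noted above.
        guessed, _encoding = mimetypes.guess_type(file_name)
        if guessed:
            return guessed
    return "application/octet-stream"
```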

What security issues we acquire if we publish a form that lets you upload any type of file into our database?

I am trying to assess our security risk if we put a form on our public website that lets the user upload any type of file and have it stored in the database.
I am worried about the following:
Robots uploading information
A huge increment of the size of the database
The form is a resume upload, so HR people will be downloading those files expecting a jpeg, doc or pdf but actually getting a virus.
You can use captchas for dealing with robots
Set a reasonable file size limit for each upload
You can do multiple checking for your file upload control.
1) Check the extension of the file (.wmv, .exe, .doc). This can be implemented with a regular expression.
2) Actually check the file header or definition type (ex: gif, word, image, etc, xls). Sometimes file extension is not sufficient.
3) Limit the file size. (Ex: 20mb)
4) Never accept the filename provided by the user. Always rename the file to some GUID according to your specifications. This way an attacker won't be able to predict the actual name of the file that is stored on the server.
5) Store all the files out of web virtual directory. Preferably store in separate File Server.
6) Also implement the Captcha for File upload.
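Checks 1, 3 and 4 from the list above can be sketched together as a single gate. The allowlist pattern, the accept_upload name and the choice to keep the validated extension on the generated name are assumptions for illustration; the 20 MB limit is the example value from the list.

```python
import re
import uuid

# Assumed allowlist for a resume upload form: pdf, doc, docx, jpg, jpeg.
ALLOWED_EXT = re.compile(r"\.(pdf|docx?|jpe?g)$", re.IGNORECASE)
MAX_BYTES = 20 * 1024 * 1024  # the 20 MB example limit


def accept_upload(original_name, size_bytes):
    """Return a server-chosen file name, or None if the upload is rejected."""
    match = ALLOWED_EXT.search(original_name)
    if match is None or not 0 < size_bytes <= MAX_BYTES:
        return None
    # Never reuse the client's name; keep only the validated extension.
    return uuid.uuid4().hex + match.group(0).lower()
```

Anchoring the pattern at the end of the name means a file such as cv.pdf.exe is rejected, since only the final extension is considered.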
In general, if you really mean to allow any kind of file to be uploaded, I'd recommend:
A minimal type check using MIME magic numbers to confirm that the file's content corresponds to its extension (though this doesn't solve much if you are not going to limit the kinds of files that can be uploaded).
Better yet, have an antivirus (free clamav for example) check the file after uploading.
On storage, I always prefer to use the filesystem for what it was created for: storing files. I would not recommend storing files in the database (supposing a relational database). You can store the metadata of the file in the database along with a pointer to the file on the file system.
Generate a unique id for the file and you can use a 2-level directory structure to store the data: E.g: Id=123456 => /path/to/store/12/34/123456.data
That said, this can vary depending on what you want to store and how you want to manage it. Serving a document repository is not the same as serving an image gallery or a simple "shared directory".
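The two-level directory layout from the example (Id=123456 => /path/to/store/12/34/123456.data) can be sketched like this. The storage_path name and the zero-padding of short ids are assumptions added for illustration.

```python
from pathlib import PurePosixPath


def storage_path(file_id, root="/path/to/store"):
    """Map a numeric id to root/<first 2 digits>/<next 2 digits>/<id>.data."""
    # Pad short ids so there are always four digits for the two levels.
    digits = str(file_id).zfill(4)
    return PurePosixPath(root) / digits[0:2] / digits[2:4] / f"{file_id}.data"
```

Spreading files over subdirectories this way keeps any single directory from growing too large.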

Resources