I want to parse a .AIB file that I received as output from an Azure Media Indexer job, so that I can read the data in the file and extract the required information.
The AIB file format is a binary blob designed for use with the SQL Server Add-in. Have you read through the blog post here? https://azure.microsoft.com/en-us/blog/using-aib-files-with-azure-media-indexer-and-sql-server/
If you have any more questions after reading that blog post, or if you have already read it, I can help you further.
Would you mind sharing your scenario that requires the AIB file? Almost all of our customers have found that the plain-text outputs are sufficient for their use cases.
Adarsh
I am using Apache NiFi to ingest data from Azure Storage. The file I want to read is huge (100+ GB) and can have any extension, and I want to read the file's header to get its actual extension.
I found the python-magic package, which uses libmagic to read the file's header and fetch the extension, but it requires the file to be present locally.
The NiFi pipeline to ingest the data looks like this
I need a way to get the file extension in this NiFi pipeline. Is there a way to read the file's header from the Content Repository? If so, how do we do it? The FlowFile only has metadata, which reports the content type as text/plain for a CSV.
There is no such thing as a generic 'header' that all files have that gives you their "real" extension. A file is just a collection of bits, and we sometimes choose to give it extensions/headers/footers/etc. so that we know how to interpret those bits.
We tend to add that 'type' information in two ways: via a file extension (e.g. .mp4) and/or via some metadata that accompanies the file. This is sometimes a header, which is sometimes plaintext and easily readable, but this is not always true. Additionally, it is up to the user and/or the application to set this information, and up to the user and/or application to read it - neither of which is a given.
If you do not trust that the file has the proper extension applied (e.g. video.txt when it's actually an MP4), then you could also try to interrogate the metadata held in Azure Blob Storage (ContentType) and see what that says. However, this is also up to the user/application to set when the file is uploaded to Blob Storage, so there is no guarantee that it is any more accurate than the file extension.
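If you want to check that property programmatically, here is a minimal sketch using the Python azure-storage-blob (v12) SDK; the connection string, container and blob names are placeholders:

```python
# Minimal sketch, assuming the azure-storage-blob (v12) Python SDK.
# Connection string, container and blob names are placeholders.
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<storage-connection-string>",
    container_name="ingest",
    blob_name="incoming/video.txt",
)

# ContentType is whatever the uploading user/application set,
# so it may be missing or wrong.
properties = blob.get_blob_properties()
print(properties.content_settings.content_type)
```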
text/plain is not invalid for a plaintext CSV, as CSVs are just formatted plaintext - similar to JSON. However, you can be more specific and use e.g. text/csv for CSV and application/json for JSON.
NiFi does have IdentifyMimeType, which can try to work it out for you by interrogating the file, but it is more complex than just accessing some 'header'. This processor uses Apache Tika for the detection, and adds a mime.type attribute to the FlowFile.
If your file is some kind of custom format, then this processor likely won't help you. If you know your files have a specific header, then you'll need to provide more information for your exact situation.
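As an aside on the python-magic limitation mentioned in the question: libmagic only needs the leading bytes of the file, so you don't need the whole 100+ GB object locally - fetching just the first few kilobytes is enough. A rough sketch, assuming the python-magic package and a placeholder local sample file:

```python
# Rough sketch: detect the type from the leading bytes ("magic numbers") only.
# Assumes the python-magic package; the file path is a placeholder.
import magic

def sniff_mime(first_bytes: bytes) -> str:
    # from_buffer works on in-memory bytes, so the full file never
    # has to exist on local disk.
    return magic.from_buffer(first_bytes, mime=True)

with open("sample.bin", "rb") as f:
    print(sniff_mime(f.read(8192)))  # e.g. 'application/pdf' or 'video/mp4'
```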
I am working for a customer in the medical business (so excuse the many redactions in the screenshots). I am pretty new here, so please excuse any mistakes I might make.
We are trying to fill a SQL database table with data coming from 2 different sources (CSV files). Both are delivered on a BLOB storage where we have read access.
The first flow I built to do this with Azure Data Factory works perfectly, so I just thought to clone that flow and point it to the second source. However, the CSV files from the second source are tab-delimited and UTF-16LE encoded. Luckily you can set these parameters when you create a dataset:
Dataset Settings
When I verify the dataset by using the "Preview Data" option, I see a nice list with data coming from the CSV file (screenshot: Output from preview data). So it appears to work fine!
Now I create a new data flow, and in the source I use the newly created data source. I left all settings at default (screenshot: data flow settings).
Now when I open Data Preview and click refresh, I get garbage and NULL outputs instead of the nice data I received when testing the data source (screenshot: output from source block in dataflow). In the first data flow I created, this does produce the expected data from the CSV file, but somehow the data is now scrambled.
Could someone please help me with what I am missing or doing wrong here?
I tried to repro this, and here you can see that if, in the dataset settings, you set the
Encoding to UTF-8 instead of UTF-16, then you will be able to preview the data.
Data Preview inside the Dataflow:
And even if I enable UTF-16LE for the encoding, I see the same issues:
Hence, for now you could change the encoding and use the pipeline.
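If changing the source files is an option, one possible workaround (my own suggestion, not part of the repro above) is to re-encode them as UTF-8 before the data flow reads them. A minimal sketch with placeholder file names:

```python
# Hypothetical pre-processing step: re-encode a UTF-16LE, tab-delimited
# CSV as UTF-8 so a dataset with Encoding = UTF-8 reads it correctly.
# File names are placeholders.
with open("source_utf16.csv", "r", encoding="utf-16") as src, \
     open("source_utf8.csv", "w", encoding="utf-8", newline="") as dst:
    for line in src:
        dst.write(line)
# Note: the "utf-16" codec handles a BOM automatically; use "utf-16-le"
# if the file has no BOM.
```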
I need help setting up reference data in Stream Analytics. I want to add my application's default settings data into Stream Analytics. I can add the reference data, and via "Upload sample file" I can upload a JSON or CSV file. However, when I fire a join query it returns 0 rows, as if the reference data hasn't been stored (so NULL values with a left outer join).
I investigated the issue and I think it is due to the Path Pattern, but I do not know much about it.
Based on your description, you seem sure that the issue is caused by the Path Pattern/Path Prefix Pattern, but I cannot give a helpful suggestion without any details, such as a screenshot of your Path Pattern setting.
So I will just list some resources as references for you; I hope these help resolve your issue.
Two screenshots of the Path Prefix Pattern/Path Pattern setting, which are introduced in Link 1 & 2.
The sample Use Stream Analytics to process exported data from Application Insights introduces how to read stream data from Blob Storage in its section Create an Azure Stream Analytics instance; the steps are similar for reference data.
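For reference, a reference-data Path Pattern typically uses the {date} and {time} tokens, for example something like settings/{date}/{time}/settings.json (the folder and file names here are just an illustration); the blobs must actually exist at paths matching that pattern and the configured date/time formats, otherwise the join will only produce NULLs.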
Hope it helps.
The issue was due to an improperly formatted JSON file.
I'm wondering if there's an easy way to download a large number of files of one arbitrary type, e.g., downloading 10,000 XML files. In the past, I've used Bing's API. It's free and offers unlimited queries. However, it doesn't index as many types of files as Google does. Google indexes XML files, CSV files, and KML files. (These can all be found by doing searches like "filetype:XML".) As far as I know, Bing doesn't index these in a way that's easily searchable. Is there another API that has these capabilities?
How about using wget? You can give wget a URL (for example, a Google search result) and tell it to follow all the links on that page and download them (I bet you could also give it a filter).
Just tried it and got an ERROR 403: Forbidden. Apparently Google blocks requests from wget. You'll have to provide a different user agent. A quick search provided this example:
http://www.mail-archive.com/wget#sunsite.dk/msg06564.html
Then it worked with the example given.
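The gist of that example is simply to pass a browser-like user agent. A rough sketch of the command (the search URL, query, and agent string are placeholders, and Google may still throttle or block automated requests):

```
wget -r -l 1 -H -A xml --user-agent="Mozilla/5.0" \
  "https://www.google.com/search?q=filetype:xml+weather"
```

Here -r/-l 1 follows links one level deep, -H allows following links to other hosts, and -A xml keeps only files ending in .xml.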
I have a table with documents saved in it; some of them are PDFs, some of them are images.
I want to create a web app to show the documents (which can be either PDF or JPG) in the same control.
I can manage to see a PDF if I set Response.ContentType = "application/pdf", or an image if I set "application/jpg". But the problem is: how can I get the file type when I only have the stream saved in the database? Does the stream have the file type information in it?
Thanks.
No, a stream does not have a content type associated with it. If you had the original filename, you could attempt to derive the content type from that, but it wouldn't be foolproof.
Many file formats have a series of "magic bytes" that allow you to detect what (might) be in the file. PDF, for example, begins with the bytes "%PDF" (note: I'm not an expert on PDF, and there may be situations where that is not true).
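As an illustration of that magic-byte check (shown in Python for brevity; the same byte comparisons translate directly to reading the first bytes of your stream in C#):

```python
# Sketch of magic-byte detection. The signatures used here are well known:
# PDF starts with "%PDF", JPEG with FF D8 FF, PNG with 89 50 4E 47 0D 0A 1A 0A.
def guess_content_type(leading_bytes: bytes) -> str:
    if leading_bytes.startswith(b"%PDF"):
        return "application/pdf"
    if leading_bytes.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if leading_bytes.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    return "application/octet-stream"  # unknown; fall back to a generic type
```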
If you have no other option, you could attempt to parse the file using various libraries until you found one that worked (System.Drawing.Image.FromStream(), iTextSharp, etc).