Is there an Azure platform service that can extract text from PDF files and save that unstructured data in a database?

Our organization is migrating its routine work onto the Azure cloud platform. One of my tasks is using Python to read many PDF files and convert all the text/unstructured data into tables, e.g.
the first column holds the file name and the second column holds all the text data, etc.
I'm just wondering whether there is a service on the Azure platform that can do this automatically. I am a new Azure user, so not quite familiar with what's available. Thanks heaps for any help.

I would recommend looking at Azure Form Recognizer. You can train it to recognize tables and extract data from PDF files.
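If it helps to see what that looks like in code, here is a minimal sketch with the Azure.AI.FormRecognizer .NET SDK (the service also has Python and REST APIs); the endpoint, key and file name below are placeholders, and it uses the prebuilt read model rather than a custom-trained one:

    using System;
    using System.IO;
    using Azure;
    using Azure.AI.FormRecognizer.DocumentAnalysis;

    // Placeholder endpoint and key from your Form Recognizer resource.
    var client = new DocumentAnalysisClient(
        new Uri("https://<your-resource>.cognitiveservices.azure.com/"),
        new AzureKeyCredential("<your-key>"));

    using FileStream pdf = File.OpenRead("report.pdf");

    // "prebuilt-read" extracts plain text; "prebuilt-layout" also returns tables.
    AnalyzeDocumentOperation operation =
        await client.AnalyzeDocumentAsync(WaitUntil.Completed, "prebuilt-read", pdf);
    AnalyzeResult result = operation.Value;

    // result.Content holds the extracted text; pair it with the file name and
    // write the two columns to whatever database you use (e.g. Azure SQL).
    Console.WriteLine($"report.pdf: {result.Content.Length} characters extracted");

Writing the (file name, text) pairs into Azure SQL, Cosmos DB or Table Storage is then a separate step.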

Related

Azure Scanned PDF to Searchable PDF

I am looking for a way, using Azure services, to create a text-searchable PDF from an image-based (scanned) PDF. I have looked at the Computer Vision Read and OCR APIs from Cognitive Services, but they both return JSON with the bounding boxes of the text, and what I want is a PDF that is the image with a hidden text layer.
I am specifically looking for an Azure service to do this for me. I know there are other services that do this (AWS Textract, ABBYY, etc.) and I know I could write code to do it, but neither of those options is what my client is looking for.
Thank you!

Office365 Excel as source for GCP BigQuery

We are using Office365 Excel and manually creating some data that we need in BigQuery. What solution would you build to automatically load the data from this Excel file into a BigQuery table? We are not allowed to use Google Sheets (which would solve all our problems).
We use Matillion and GCP products.
I have no idea how to solve this and I can't find any information about it, so any suggestion or idea is appreciated.
Cheers,
Cris
You can save your data as CSV and then load it into BigQuery by using one of the following:
The Google Cloud Console
The bq command-line tool's bq load command
The API
The client libraries (a sketch follows below)
You can find more details here: Loading data from local files
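As a rough illustration of the client-library route, here is a minimal sketch with the Google.Cloud.BigQuery.V2 .NET library; the project, dataset, table, schema and file names are placeholders, and it assumes application default credentials are already set up:

    using System.IO;
    using Google.Cloud.BigQuery.V2;

    // Placeholder project, dataset, table and schema.
    var client = BigQueryClient.Create("my-gcp-project");

    TableSchema schema = new TableSchemaBuilder
    {
        { "name", BigQueryDbType.String },
        { "amount", BigQueryDbType.Float64 }
    }.Build();

    using FileStream csv = File.OpenRead("export-from-excel.csv");

    // Upload the CSV as a load job and wait for it to finish.
    BigQueryJob job = client.UploadCsv(
        "my_dataset", "my_table", schema, csv,
        new UploadCsvOptions { SkipLeadingRows = 1 });
    job.PollUntilCompleted().ThrowOnAnyError();

The bq CLI and the Cloud Console do the same thing without any code.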
As a different approach, you can also try this other option:
BigQuery: loading excel file
For this you will need to use Google Drive and federated tables.
Basically, you upload your Excel files to Google Drive with the option "Convert uploaded files to Google Docs editor format" checked in your settings, and then query them from BigQuery as federated (external) tables backed by Google Drive.

Upload Excel 2013 Workbook to website hosted on Azure

Does anyone have guidance and/or example code (which would be awesome) on how I would go about the following?
With a Web application using C# / ASP.NET MVC and hosted on Azure:
Allow a user to upload an Excel Workbook (multiple worksheets) via a web page UI
Populate a Dataset by reading in the worksheets so I can then process the data
A couple of things I'm unclear on:
I've read that Azure doesn't have the ACE OLEDB provider, which is what Excel 2007+ requires, and that I'd have to use the Open XML SDK. Is this true? Is this the only way?
Is it possible to read the file into memory and not actually save it to Azure storage?
I DO NOT need to modify the uploaded spreadsheet. Only read the data in and then throw the spreadsheet away.
Well, that's many questions in one post; let me see if we can tackle them one by one.
On the upload and in-memory questions: you can let the user upload the Excel workbook to a temp location and, once you have read it, do the cleanup yourself; you can also write a script that cleans up any files in temp that failed to get deleted for whatever reason.
Alternatively, if you want to keep the files, you should store them in Azure Storage and fetch/read them when you need to.
Check out this thread: read excelsheet in azure uploaded as a blob
By default, when you upload a file it is written to local disk, and you can later choose to save it to Azure Storage or wherever else.
For reading the Excel data you can use any of the NuGet packages listed here http://nugetmusthaves.com/Tag/Excel; I prefer GemBox and NPOI (a rough NPOI sketch follows after the link below).
http://www.aspdotnet-suresh.com/2014/12/how-to-upload-files-in-asp-net-mvc-razor.html
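To make the NPOI route concrete, here is a rough sketch that fills a DataSet from the uploaded stream entirely in memory, so nothing needs to be written to Azure storage; the column naming is made up and you would adapt it to your worksheets:

    using System.Data;
    using System.IO;
    using NPOI.SS.UserModel;
    using NPOI.XSSF.UserModel;

    public static class ExcelReader
    {
        // Builds a DataSet from an uploaded .xlsx stream (e.g. HttpPostedFileBase.InputStream)
        // without ever saving the file to disk or blob storage.
        public static DataSet ReadWorkbook(Stream uploadedFile)
        {
            var workbook = new XSSFWorkbook(uploadedFile);   // .xlsx; use HSSFWorkbook for legacy .xls
            var dataSet = new DataSet();

            for (int s = 0; s < workbook.NumberOfSheets; s++)
            {
                ISheet sheet = workbook.GetSheetAt(s);
                var table = new DataTable(sheet.SheetName);

                for (int r = sheet.FirstRowNum; r <= sheet.LastRowNum; r++)
                {
                    IRow row = sheet.GetRow(r);
                    if (row == null) continue;                    // skip blank rows

                    DataRow dataRow = table.NewRow();
                    for (int c = 0; c < row.LastCellNum; c++)
                    {
                        if (table.Columns.Count <= c)
                            table.Columns.Add("Column" + c);      // placeholder column names
                        dataRow[c] = row.GetCell(c)?.ToString() ?? string.Empty;
                    }
                    table.Rows.Add(dataRow);
                }
                dataSet.Tables.Add(table);
            }
            return dataSet;
        }
    }

This also answers the ACE OLEDB question indirectly: NPOI (like GemBox or the Open XML SDK) reads the workbook directly, so no OLEDB provider needs to be installed in the web role.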

How to add a table manually on Windows Azure Mobile Services Data

I want to upload a table [Excel file] to Windows Azure Mobile Services without coding. Can server-side scripting be used for this? Is there any other option to upload it to Azure Mobile Services Data?
No, there's no automatic way to do that. You will need to read your table from the Excel file and upload the rows to the server. That should be fairly easy to implement: save your file as a comma-separated value list (or a tab-separated value list, which should make parsing easier). In a program which uses the Mobile Services SDK you'd read the lines from the CSV (or TSV) file, convert each one to the appropriate structure (either directly to JSON via the JObject type, or a typed class) and call the InsertAsync method in the client to insert the data on the server, as sketched below.
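A minimal sketch of that approach, using the Azure Mobile Services managed SDK and the untyped JObject path; the service URL, application key, table name and CSV layout below are all made up:

    using System.IO;
    using System.Linq;
    using System.Threading.Tasks;
    using Microsoft.WindowsAzure.MobileServices;
    using Newtonsoft.Json.Linq;

    class CsvUploader
    {
        static async Task Main()
        {
            // Placeholder service URL and application key.
            var client = new MobileServiceClient(
                "https://yourservice.azure-mobile.net/", "YOUR-APPLICATION-KEY");
            IMobileServiceTable table = client.GetTable("Items");

            // Skip the header row, then insert one record per CSV line.
            // Note: the naive Split(',') won't handle quoted commas; use a real CSV parser if needed.
            foreach (string line in File.ReadLines("items.csv").Skip(1))
            {
                string[] fields = line.Split(',');
                var item = new JObject
                {
                    ["name"] = fields[0],
                    ["quantity"] = int.Parse(fields[1])
                };
                await table.InsertAsync(item);
            }
        }
    }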

Lucene.NET, Azure Blob storage and IFilter

What would be the best way to use IFilter to extract textual content from pdf/word/whatever in an Azure solution?
I've seen examples of IFilter that use a stream, but what should the content of the stream be?
Should it contain some sort of OLE headers and what not?
Sending the raw file content as a stream to IFilter doesn't seem to work.
Or would it be better to save the files to local file storage and let the IFilter read them from that location?
Using IFilter in Azure will be tricky because several of the IFilters that are common on a desktop aren't available in an Azure web/worker role.
You could create a durable VM in Azure and install the missing IFilters.
However, if you're going to build your Lucene index via a web upload, you could just process the files into text as they are uploaded, then index the text and save the file off separately. Add a field to your index that lets you get back to the original source document.
Might be an easier way, but that's how I solved the same issue.
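To make the "index the extracted text and keep a pointer back to the original" idea concrete, here is a minimal Lucene.NET 4.8 sketch; the index path, blob URI and field names are placeholders, and the text extraction itself (IFilter or any other extractor run at upload time) is assumed to have already produced extractedText:

    using System.IO;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.Store;
    using Lucene.Net.Util;

    const LuceneVersion version = LuceneVersion.LUCENE_48;

    // Placeholder values: the text you extracted at upload time and the blob it came from.
    string extractedText = "...text pulled out of the PDF/Word file...";
    string blobUri = "https://myaccount.blob.core.windows.net/docs/contract.pdf";

    using var dir = FSDirectory.Open(new DirectoryInfo(@"D:\home\lucene-index"));
    var analyzer = new StandardAnalyzer(version);
    using var writer = new IndexWriter(dir, new IndexWriterConfig(version, analyzer));

    var doc = new Document();
    doc.Add(new TextField("content", extractedText, Field.Store.NO));   // searchable, not stored
    doc.Add(new StringField("sourceUri", blobUri, Field.Store.YES));    // stored, links hits back to the blob
    writer.AddDocument(doc);
    writer.Commit();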
