Form Recognizer Labeling - Training model - Azure

I am trying to use Azure Form Recognizer with the Labeling tool to train a model and extract text from images.
As per the documentation:
First, make sure all the training documents are of the same format. If you have forms in multiple formats, organize them into subfolders based on common format. When you train, you'll need to direct the API to a subfolder. (https://learn.microsoft.com/en-us/azure/cognitive-services/form-recognizer/quickstarts/label-tool#set-up-input-data)
In my case I have differently formatted images. I can create different projects, label the images, train them, and get the expected output.
The challenge with this approach is that I need to create different projects, train them separately, and maintain several model IDs.
So I just wanted to know: is there any way to train different formats together as a single model? Basically, can I use a single model ID to extract key-value pairs from differently formatted images?

This is a feature that a few customers have asked for. We are working on a solution and expect it to arrive in a few months. For now, we suggest you train models separately and maintain multiple model IDs.
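For reference, here is a minimal sketch of the multi-model approach using the azure-ai-formrecognizer Python SDK (v3.x); the endpoint, key, and SAS URLs below are placeholders you would fill in:

```python
from azure.ai.formrecognizer import FormRecognizerClient, FormTrainingClient
from azure.core.credentials import AzureKeyCredential

endpoint = "https://<your-resource>.cognitiveservices.azure.com/"
credential = AzureKeyCredential("<your-key>")

training_client = FormTrainingClient(endpoint, credential)

# Train one labeled model per format subfolder and keep the model IDs.
# The format names and SAS URLs are placeholders for your own data.
format_sas_urls = {
    "invoice_a": "<SAS-url-to-invoice-a-subfolder>",
    "invoice_b": "<SAS-url-to-invoice-b-subfolder>",
}
model_ids = {}
for form_format, sas_url in format_sas_urls.items():
    poller = training_client.begin_training(sas_url, use_training_labels=True)
    model_ids[form_format] = poller.result().model_id

# At extraction time, pick the model ID that matches the document's format.
recognizer = FormRecognizerClient(endpoint, credential)
poller = recognizer.begin_recognize_custom_forms_from_url(
    model_ids["invoice_a"], "<url-to-document>"
)
for form in poller.result():
    for name, field in form.fields.items():
        print(name, "->", field.value)
```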

If there are only a few different types (e.g., 2-4), and they are easily distinguishable, you can also try training them all together. For that to work, though, you'll need to label more files, and the results are still likely not to be as good as separate models.
To try that, put approximately the same number of images of each type in the same folder and label them all together.
If there are many different types, this is not likely to work.

Related

Building on existing models in spaCy

This is a question about training models with spaCy 3.x.
I couldn't find a good answer/solution on Stack Overflow, hence the query.
I am using an existing spaCy model, like the en model, and want to add my own entities and train it. Since I work in the biomedical domain, these would be entities like virus name, shape, length, temperature, temperature value, etc. I don't want to lose the entities spaCy already tags, such as organization names, countries, etc.
All suggestions are appreciated.
Thanks
There are a few ways to do that.
The best way is to train your own model separately and then combine both models in one pipeline, with one running before the other. See the double NER example project for an overview of that approach.
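A minimal sketch of that combined pipeline in spaCy 3.x (the path to your trained biomedical model is a placeholder):

```python
import spacy

# Load the stock English pipeline (keeps ORG, GPE, etc.).
nlp = spacy.load("en_core_web_sm")

# Load your separately trained pipeline containing the biomedical NER.
# "training/model-best" is a hypothetical path to your trained model;
# sourcing works best when both pipelines share compatible vocab/vectors.
bio_nlp = spacy.load("training/model-best")

# Copy the custom NER into the stock pipeline under a distinct name.
# Running it before the stock "ner" lets the stock component fill in
# the remaining tokens around the already-set (preset) entities.
nlp.add_pipe("ner", source=bio_nlp, name="bio_ner", before="ner")

doc = nlp("The H5N1 virus was isolated at 37 degrees Celsius in London.")
print([(ent.text, ent.label_) for ent in doc.ents])
```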
It's also possible to update the pretrained NER model; see this example project. However, this isn't usually a good idea, and definitely not if you're adding completely different entities. You'll run into what's called "catastrophic forgetting": even though you're technically just updating the model, it ends up forgetting everything not represented in your current training data.

Automatic HTML data extraction using deep learning

We are dealing with web pages. The objective is to have a web crawler extract data items/fields from them and put the data in a database table automatically, without manually configuring every HTML page. We have enough training samples and are trying to use deep learning. Here are the approaches we have come up with:
End-to-end mapping from web page to structured data in the database. I want to use a question-answering or summarization paradigm, but current papers on these subjects use a paragraph of text as input, not an HTML page. Is there a deep learning model that fits the HTML situation?
Break down the problem (so that a deep learning model can handle it): deal with the <td></td> tags separately and classify each tag into database items/fields using some CNN or RNN text-classification model (a rough sketch of this follows below). The problem is that many tags may contain the same class of information (company name, time, etc.), and we can't know which one we want. Maybe we can combine some "position" features of the HTML, but it's still not clear how to define these features and how to merge them into the classification model to get a somewhat end-to-end framework.
Is there a better way?
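To make the second option concrete, here is a rough sketch (the libraries, features, and toy data are illustrative, not a recommendation): extract each <td> with BeautifulSoup, attach simple row/column position features, and classify the cell text.

```python
from bs4 import BeautifulSoup
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def extract_cells(html):
    """Return (text, row_index, col_index) for every <td> in the page."""
    soup = BeautifulSoup(html, "html.parser")
    cells = []
    for r, row in enumerate(soup.find_all("tr")):
        for c, td in enumerate(row.find_all("td")):
            cells.append((td.get_text(strip=True), r, c))
    return cells

# Hypothetical training data: cell texts, their positions, and field labels.
texts = ["Acme Corp", "2015-04-01", "New York"]
positions = [[0, 0], [0, 1], [0, 2]]
labels = ["company", "time", "location"]

# Combine bag-of-words text features with the position features.
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(texts)
X = hstack([X_text, csr_matrix(positions)])

clf = LogisticRegression().fit(X, labels)
```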

Train multiple models with various measures and accumulate predictions

So I have been playing around with Azure ML lately, and I have one dataset with multiple values I want to predict. Each of them uses a different algorithm, and when I try to train multiple models within one experiment, it says the "train model can only predict one value", and there are not enough input ports on the Train Model module to take in multiple values even if I were to use the same algorithm for each measure. I tried launching the column selector and making rules, but I get the same error. How do I predict multiple values and later put the predicted columns together for the web service output, so I don't have to maintain multiple APIs?
What you want to do is train each model and save it as an already-trained model.
So create a new experiment, train your models, and save them by right-clicking on each model; they will show up in the left nav bar in the Studio. Now you can drag your models onto the canvas and have them score predictions, eventually merging them into the same output, as I have done in my example through the "Add Columns" module. I made this example for Ronaldo (the Real Madrid CF player) on how he will perform in a match after a training day. You can see my demo at http://ronaldoinform.azurewebsites.net
For a more detailed explanation of how to save the models and train multiple values, check out Raymond Langaeian's (MSFT) answer in the comment section at this link:
https://azure.microsoft.com/en-us/documentation/articles/machine-learning-convert-training-experiment-to-scoring-experiment/
You have to train a model for each variable that you are going to predict. Then add all those predicted columns together to get a single output for the web service.
The algorithms available in Azure ML can only predict a single variable at a time based on the inputs they receive.
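The same pattern is easy to see outside the Studio UI. Here is a minimal scikit-learn analogue (not Azure ML itself, just an illustration of the idea): one model per target variable, with the per-target predictions stitched back together into a single output table.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Toy data with 3 target columns (stand-ins for the multiple values to predict).
X, Y = make_regression(n_samples=200, n_features=5, n_targets=3, random_state=0)

# One independently trained model per target column.
models = [LinearRegression().fit(X, Y[:, i]) for i in range(Y.shape[1])]

# "Add Columns": concatenate the per-model predictions into one matrix.
predictions = np.column_stack([m.predict(X) for m in models])
print(predictions.shape)  # (200, 3) -- one column per predicted value
```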

How do I handle entities that have multiple data per entity in scikit-learn?

I have an SVM-based classifier that classifies a chunk of data into some categories. Now I want to classify entities that each have multiple chunks of this data into the same categories, perhaps using majority voting or something similar, and then produce reports like precision/recall/confusion matrix, etc.
Does scikit-learn offer ways to easily do that?
All scikit-learn models expect a flat feature vector for each sample. So to deal with more structured input (or output), you will have to come up with your own wrapper. Based on your succinct description of the task, a majority-voting scheme seems like a reasonable approach.
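As an illustration, here is one possible wrapper following that majority-voting idea; predict_entities and the data layout are hypothetical, not a scikit-learn API:

```python
from collections import Counter

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

def predict_entities(clf, entity_chunks):
    """entity_chunks: list of 2-D arrays, one array of chunk features per entity.

    Classifies every chunk with the trained classifier, then assigns each
    entity the most common label among its chunks.
    """
    preds = []
    for chunks in entity_chunks:
        chunk_labels = clf.predict(chunks)               # one label per chunk
        majority = Counter(chunk_labels).most_common(1)[0][0]
        preds.append(majority)
    return np.array(preds)

# Usage: compare entity-level predictions against entity-level ground truth.
# y_pred = predict_entities(svm_clf, entity_chunks)
# print(classification_report(y_true, y_pred))
# print(confusion_matrix(y_true, y_pred))
```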

Need training data for categories like Sports, Entertainment, Health, etc. and all the subcategories

I am experimenting with classification algorithms in ML and am looking for a corpus to train my model to distinguish among different categories like sports, weather, technology, football, cricket, etc.
I need some pointers on where I can find datasets with these categories.
Another option for me is to crawl Wikipedia to get data for the 30+ categories, but I wanted some brainstorming and opinions on whether there is a better way to do this.
Edit
Train the model using the bag-of-words approach for these categories.
Test: classify new/unknown websites into these predefined categories depending on the content of the web page.
The UCI Machine Learning Repository contains a searchable archive of datasets for supervised learning.
You might get better answers if you provide more specific information about what inputs and outputs your ideal dataset would have.
Edit:
It looks like dmoz has a dump that you can download.
Another option is the 20 Newsgroups corpus: a dataset of newsgroup messages, classified by subject.
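That corpus ships with scikit-learn, which makes it convenient for a quick bag-of-words baseline over categories like these; a minimal sketch, assuming scikit-learn is installed:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A few sports/science-style categories from the 20 Newsgroups dataset.
categories = ["rec.sport.baseball", "rec.sport.hockey", "sci.med", "sci.space"]
train = fetch_20newsgroups(subset="train", categories=categories)
test = fetch_20newsgroups(subset="test", categories=categories)

# Bag-of-words features feeding a simple Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train.data, train.target)
print("accuracy:", model.score(test.data, test.target))
```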
