How to merge text from nearby fields together in OCR - text

I would like to understand how text from OCR can be grouped together into a single block when extracting entities, as Google Cloud Vision does in the following image.
Any papers, recommendations, or suggestions on this would be welcome.

Related

Neuroimage MRI scan CNN model preparation

I would like to clear up a couple of points of confusion. I want to work with a dataset of neuroimaging MRI scans from the ADNI database.
Each Alzheimer's disease (AD) MRI scan has multiple slices.
Do I have to separate the slices of each scan and label each of them as AD, or should I combine all the slices into a single image and label that for classification?
Most medical neuroimages are in DICOM, NIfTI (.nii), or similar formats. Is it mandatory to convert them to PNG or JPG for a CNN model, or can I keep them in NIfTI (.nii) format?
I have read several papers on neuroimaging for Alzheimer's disease but did not find an answer to these questions. I even emailed the author of one paper; they replied, with apologies, that they were too busy to help.
It would be very helpful if anyone could clear up my confusion.
Thank you.
You can train with NIfTI directly using, for example, TorchIO. There's no need to separate the slices; you can use the 3D image as is.
You can find some examples in the documentation.
Disclaimer: I'm the main developer of TorchIO.

What is the process to create an FAQ bot using Spacy?

I am a beginner in machine learning and NLP, and I have to create a bot based on an FAQ dataset. Each FAQ Excel file contains two columns, "Questions" and "Answers".
E.g., a record from an Excel file (a question and its answer):
Question - What is RASA-NLU?
Answer - Rasa NLU is trained to identify intent and entities. Better the training, better the identification...
We have 3K+ Excel files, each of which has around 10K to 20K such records.
To implement the bot, I would have followed exactly this FAQ bot approach, which uses RASA-NLU, but RASA, Chatterbot, and Microsoft's QnA Maker are not allowed in my organization.
spaCy does the NER extraction perfectly for me, so I am looking to build the bot with spaCy, but I don't know how to proceed after extracting the entities. (IMHO, I will have to predict the exact question from the dataset, and its answer from the knowledge base, given the user's query to the bot.)
I don't know which NLP algorithm or ML process to use, or whether there is an easier way to create the FAQ bot using the extracted named entities.
One way to achieve your FAQ bot is to transform the problem into a classification problem. You have questions and the answers can be the "labels". I suppose that you always have multiple training questions which map to the same answer. You can encode each answer in order to get smaller labels (for instance, you can map the text of the answer to an id).
Then you can use your training data (the questions) and your labels (the encoded answers) to train a classifier. After training, the classifier can predict the label of unseen questions.
Of course, this is a supervised approach, so you will need to extract features from your training sentences (the questions). In this case, you can use a bag-of-words representation as features and even include the named entities.
An example of how to do text classification in spaCy is available here: https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/
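The classification idea described above could look roughly like the following sketch. It uses scikit-learn rather than spaCy (a swapped-in library, chosen only to keep the example short), weights the bag of words with TF-IDF, and leaves out named-entity features. The questions and answers are invented toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy FAQ data (invented): several questions map to the same answer.
faq = [
    ("How do I reset my password?", "Use the 'Forgot password' link on the login page."),
    ("I forgot my password, what now?", "Use the 'Forgot password' link on the login page."),
    ("What payment methods do you accept?", "We accept credit cards and bank transfers."),
    ("Can I pay with a credit card?", "We accept credit cards and bank transfers."),
    ("How long does shipping take?", "Shipping takes 3 to 5 business days."),
    ("When will my order be delivered?", "Shipping takes 3 to 5 business days."),
]

# Encode each distinct answer as a small integer id (the "label").
answers = sorted({answer for _, answer in faq})
answer_to_id = {answer: i for i, answer in enumerate(answers)}

questions = [question for question, _ in faq]
labels = [answer_to_id[answer] for _, answer in faq]

# Bag-of-words features (TF-IDF weighted) fed into a linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(questions, labels)


def reply(user_query):
    """Predict the answer id for an unseen question and return the answer text."""
    predicted_id = model.predict([user_query])[0]
    return answers[predicted_id]


print(reply("I need to reset my password"))
```

With thousands of distinct answers, a plain classifier may scale poorly, so this is only a starting point for small label sets.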

Detecting questions in text

I have a project where I need to analyze a text to determine whether the user who posted it needs help with something. I tried sentiment analysis, but it didn't work as expected. My idea was to take the negative posts, extract the main words from each one, and suggest some articles on that subject to the user. If there is another approach that could help, please post it below. Thanks.
The dataset I used was one for sentiment analysis, but I have found that it doesn't work for this, and I need a dataset suited to this task.
Please apply NLP preprocessing before running the sentiment analysis. Use TF-IDF or Word2Vec to create vectors from the given dataset, and then try the sentiment analysis. You may also need GloVe vectors for the analysis.
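To make the TF-IDF step concrete, here is a small standard-library sketch. It uses the plain, unsmoothed variant (term frequency times log(N / document frequency)); library implementations typically add smoothing and normalization, so their numbers will differ. The example posts are invented.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, vectors) with one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(doc_freq)

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens) or 1  # avoid dividing by zero for empty documents
        vectors.append([
            (counts[term] / length) * math.log(n_docs / doc_freq[term])
            for term in vocab
        ])
    return vocab, vectors


posts = [
    "how do i install the driver",
    "the driver crashes on startup",
    "i love the new update",
]
vocab, vectors = tfidf(posts)

# A word present in every post carries no weight; a rare word does.
print(vectors[1][vocab.index("the")])
print(vectors[1][vocab.index("crashes")])
```

The resulting vectors can then be passed to whatever classifier is used for the sentiment step.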
On this topic, I found that this field of machine learning is called "Natural Language Questions". It is a field where machine learning models are trained to detect questions in text and suggest answers to them based on the dataset you are working with. Check this article for more detail.

How to train and test data for classification using Machine learning algorithms

I have collected tweets from the Twitter API. The tweets are not labelled, and I have no clue how to start. All the tutorials use data that is already labelled. How do I label the data? Can labelling only be done manually? Any good tutorial answering my questions would be of great help.
I assume that when you extract the data from the Twitter API, it's in JSON format. Use the keys as your dataframe column headings and the values as the cell values. As for the labels, it depends on what you are going to do with the dataset. If you want to do sentiment analysis, then you need to label the dataset manually (or just download a pre-labelled Twitter dataset from the internet).
For reference, here is a great tutorial on how to mine and deal with the raw data, get insight from it, and apply clustering algorithms. Hope it helps!
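A minimal sketch of the JSON-to-dataframe step, assuming pandas is installed. The field names (`id`, `text`, `lang`) and the two tweets are invented for illustration; real API responses contain many more keys.

```python
import json

import pandas as pd

# Invented sample of what a saved API response might look like.
raw = """[
  {"id": 1, "text": "Loving the new phone!", "lang": "en"},
  {"id": 2, "text": "Worst customer service ever.", "lang": "en"}
]"""

records = json.loads(raw)

# The JSON keys become the column headings, the values fill the rows.
df = pd.DataFrame(records)

# Add an empty column to be filled in during manual labelling.
df["label"] = None
df.loc[df["id"] == 1, "label"] = "positive"  # one hand-assigned label

print(df)
```

The frame could then be exported with `df.to_csv(...)` so that the remaining labels can be filled in by hand in a spreadsheet.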

Sentence extraction from documents using NLP or Deep Learning

I am looking for references (papers) and suggestions on how to use deep learning in a text extraction task.
Recently I was given a task to extract important information from documents of similar type, say for example legal merger documents. I have thousands of legal merger documents as inputs. A paralegal would go through the entire document and highlight important points from the document. This is the extracted text.
What I want to do: given a document (say, a legal merger document), I want to use DL or NLP to extract information from it similar to what a paralegal would extract.
I am currently using a bag-of-words model to extract text from the document, calculating sentiment, and displaying the sentences with positive or negative sentiment. This has yielded very bad results.
Can anyone please provide me with some references and suggestions on how to tackle this issue?

Resources