Improving identified OCR text accuracy?

Let's say I have a piece of text returned from an OCR system that reads "Hllo Wrld". I need to convert this output into a user-friendly one that reads "Hello World". How can I get this done?
I am a final-year Software Engineering student from Sri Lanka. For my final year project I am going to implement an OCR system that converts Braille text to Sinhala text. Sinhala is a low-resource language used in Sri Lanka. I went through previous research papers and found that researchers had already developed a system that could translate Braille into Sinhala text, but the resulting text is not user-friendly for the end user (as in the example above). What I want to do is convert that identified text into a meaningful one. I think I should use NLP techniques to get this done. I would be truly thankful if you could guide me in the right direction.
Thank You
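
To make the goal concrete, one simple direction would be dictionary-based post-correction: compare each OCR word against a word list and replace it with the closest match by edit distance. The sketch below uses Python's standard difflib and a tiny placeholder vocabulary; it is only an illustration, and a real system would need a large Sinhala word list and ideally a language model to pick among candidates using context.

    # A minimal sketch of dictionary-based OCR post-correction via fuzzy matching.
    # The vocabulary, cutoff and function names are illustrative assumptions.
    from difflib import get_close_matches

    VOCABULARY = ["hello", "world", "help", "held"]  # placeholder word list

    def correct_word(word, vocabulary=VOCABULARY, cutoff=0.6):
        """Return the closest dictionary word, or the word itself if nothing is close."""
        matches = get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
        return matches[0] if matches else word

    def correct_text(text):
        return " ".join(correct_word(w) for w in text.split())

    print(correct_text("Hllo Wrld"))  # -> "hello world"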

Related

How to extract brand names from text?

I'm new to natural language processing, and it's quite difficult for me; I have no idea where to start. I'd be very grateful if someone could give me some advice. The text comes from social data such as Weibo, Baidu, Twitter and so on. The brands come from all industries. (Please excuse my poor English.)
I have searched for open-source brand datasets; most of them are divided by industry. I could use NER or dictionary-based segmentation. For NER, the data is not big enough and is laborious to annotate; for dictionary segmentation, the full brand list is so big that it may slow down responses. I really need some guidance from someone who has related experience.

NLP algorithm to extract part of sentence in language translation

I am trying to solve a problem but cannot find a way to do it other than collecting training data and building a classifier.
Problem:
The user asks to translate a particular sentence from one language to another. I have the user's speech as text, and I need to extract these three things from it:
Sentence to be translated.
The target language it is supposed to be translated into.
The origin language.
When we humans ask for this, it is usually phrased like these examples:
What is I love you in French from English?
Can you translate I love you from English to French?
What is French for I love you in English?
And any other possible way that a person can ask for translation.
I need to extract I love you, French (the language translated into) and English (the language translated from) from the sentence.
The first thing that came to my mind was to use regular expressions, but I found that they can only be used to detect the languages, not the part of the sentence to be translated.
The other possible solution seems to be to use the various forms of the sentence as a training data set and train a classifier, but I still feel that this NLP problem can be solved with some algorithm, and I am not able to come up with one.
This seems to be a popular problem, so is there any way it can be done?
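
For the fixed phrasings listed above, a pattern-based approach can in fact capture both the sentence and the two languages. The sketch below is only an illustration with a placeholder language list and three hand-written patterns; free-form requests would still need a learned model or a grammar-based parser.

    # Illustrative regex patterns covering only the three example phrasings above.
    # The language list and pattern set are assumptions, not a general solution.
    import re

    LANGS = r"(English|French|German|Spanish)"  # placeholder language list

    PATTERNS = [
        # "What is I love you in French from English?"
        re.compile(rf"what is (?P<text>.+) in (?P<target>{LANGS}) from (?P<source>{LANGS})", re.I),
        # "Can you translate I love you from English to French?"
        re.compile(rf"can you translate (?P<text>.+) from (?P<source>{LANGS}) to (?P<target>{LANGS})", re.I),
        # "What is French for I love you in English?"
        re.compile(rf"what is (?P<target>{LANGS}) for (?P<text>.+) in (?P<source>{LANGS})", re.I),
    ]

    def parse_request(utterance):
        for pattern in PATTERNS:
            match = pattern.search(utterance)
            if match:
                return match.group("text"), match.group("source"), match.group("target")
        return None

    print(parse_request("Can you translate I love you from English to French?"))
    # -> ('I love you', 'English', 'French')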

Pattern Recognition OR Named Entity Recognition for Information Extraction in NLP

There are some event description texts.
I want to extract the entrance fee of the events.
Sometimes the entrance fee is conditional.
What I want to achieve is to extract the entrance fee and its conditions (if available). It's fine to retrieve the whole phrase or sentence that states the entrance fee and its conditions.
Note I: The texts are in German language.
Note II: Often the sentences are not complete, as they are mainly event flyers or advertisements.
What would be the category of this problem in NLP? Is it Named Entity Recognition, and could it be solved by training my own model with Apache OpenNLP?
Or, I thought, it might be easier to detect the pattern via the usual keywords of the use case (entrance, $, but, only till, [number]am/pm, ...).
Please shed some light on this.
Input Examples:
- "If you enter the club before 10pm, the entrance is for free. Afterwards it is 6$."
- "Join our party tonight at 11pm till 5am. The entrance fee is 8$. But for girls and students it's half price."
This is broadly a structure learning problem. You might have to combine Named Entity Recognition/tagging with coreference resolution. Read some papers on these as well as related GitHub code and take it from there. Here is a good discussion of the current state-of-the-art tools for these: https://www.reddit.com/r/MachineLearning/comments/3dz3fl/dl_architectures_for_entity_recognition_and_other/
Hope that helps.
You might try Stanford's CoreNLP for the named entity extraction part. It should be able to help you pick out the money values, and there is also a link to models trained for German (https://nlp.stanford.edu/software/CRF-NER.shtml).
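
As a rough sketch of how that might be wired up from Python, NLTK ships a wrapper around the Stanford CRF-NER classifier linked above (it requires Java and a local copy of the Stanford NER distribution). The jar and model paths below are placeholders for wherever you unpack the download and whichever German model you choose.

    # Sketch of running Stanford's CRF-NER on German text through NLTK's wrapper.
    # The jar and model paths are placeholders; Java must be installed.
    import nltk
    from nltk.tag import StanfordNERTagger
    from nltk.tokenize import word_tokenize

    nltk.download("punkt", quiet=True)

    tagger = StanfordNERTagger(
        "path/to/german.crf.ser.gz",   # placeholder: German CRF model file
        "path/to/stanford-ner.jar",    # placeholder: Stanford NER jar
        encoding="utf-8",
    )

    sentence = "Der Eintritt kostet 8 Dollar, aber bis 22 Uhr ist er frei."
    tokens = word_tokenize(sentence, language="german")
    print(tagger.tag(tokens))  # list of (token, entity-label) pairs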
Given that it's fine to extract the entire sentence that contains the information, I'd suggest taking a binary sentence classification approach. You could probably get quite far just by using n-grams and some named entity information as features. That means you'd want to build a pipeline that automatically segments your documents into sentence-like chunks. You could try a sentence segmentation tool (also provided by Stanford CoreNLP, https://stanfordnlp.github.io/CoreNLP/) as a first go. Since this would form the basis for all further work, you'd want to ensure that the results are at least decent. Perhaps the structure of the document itself gives you enough information to segment it without even using a sentence segmentation tool.
After you have this pipeline in place, you'd want to annotate the sentences extracted from a large set of documents as relevant or non-relevant to make it a binary classification task. Then train a model based on that dataset. Finally, when you apply it to unseen data, first use the sentence segmentation approach, and then classify each sentence.
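
A minimal sketch of that classification step, assuming scikit-learn and word n-gram features; the training sentences and labels below are toy placeholders standing in for your annotated, segmented documents.

    # Minimal sketch of the suggested binary sentence classifier using word n-grams.
    # Training data here is a toy placeholder for your annotated sentences.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train_sentences = [
        "The entrance fee is 8$.",
        "Before 10pm the entrance is free.",
        "Join our party tonight at 11pm.",
        "Live music until the early morning.",
    ]
    train_labels = [1, 1, 0, 0]  # 1 = states a fee or its conditions, 0 = not relevant

    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
        LogisticRegression(max_iter=1000),
    )
    model.fit(train_sentences, train_labels)

    print(model.predict(["For girls and students it's half price."]))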

Streets recognition, deduction of severity

I'm trying to analyse a set of phrases, and I don't know exactly how natural language processing can help me, or whether someone can share their knowledge with me.
The objective is to extract streets and locations. Often this kind of information is not presented to the reader in a structured way, and it's hard to find a way to parse it. I have two main objectives.
First, the extraction of the streets themselves. As far as I know, NLP libraries can help me tokenize a phrase and perform an analysis that will identify nouns (for example). But where does a street name begin and where does it end? I assume that I will need to compare that analysis with a streets database, but I don't know which method is optimal.
Also, I would like to deduce the level of severity, for example in car accidents. I'm assuming that the only way is to establish some heuristic based on the words present in the phrase (for example, if the word "deceased" appears, +100). Am I correct?
Thanks a lot as always! :)
The first part of what you want to do ("First, the extraction of the streets themselves. [...] But where does a street name begin and where does it end?") is a subfield of NLP called Named Entity Recognition. There are many libraries available which can do this. I like NLTK for Python myself. Depending on your choice, I assume that a street-name database would be useful for training the recognizer, but you might be able to get reasonable results with the default corpus. Read the documentation for your NLP library for that.
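
A quick sketch of what NLTK's built-in named-entity chunker produces; the stock English models only give coarse labels (PERSON, GPE, LOCATION, ...), so picking out full street names would likely still need a street-name gazetteer or a custom-trained model, as suggested above.

    # Sketch of NLTK's default named-entity chunker on a toy sentence.
    import nltk

    for resource in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
        nltk.download(resource, quiet=True)

    sentence = "Two cars collided on Smith Street in London yesterday."
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = nltk.ne_chunk(tagged)

    # Print every chunk the chunker labelled as a named entity.
    for subtree in tree.subtrees(lambda t: t.label() != "S"):
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))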
The second part, recognizing accident severity, can be treated as an independent problem at first. You could take the raw words or their part of speech tags as features, and train a classifier on it (SVM, HMM, KNN, your choice). You would need a fairly large, correctly labelled training set for that; from your description I'm not certain you have that?
"I'm assuming that the only way is to stablish some heuristic by the present words in the phrase " is very vague, and could mean a lot of things. Based on the next sentence it kind of sounds like you think scanning for a predefined list of keywords is the only way to go. In that case, no, see the paragraph above.
Once you have both parts working, you can combine them and count the number of accidents and their severity per street. Using some geocoding library you could even generalize to neighborhoods or cities. Another challenge is the detection of synonyms ("Smith Str" vs "John Smith Street") and homonyms ("Smith Street" in London vs "Smith Street" in Leeds).
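
For the synonym problem mentioned at the end, one simple starting point is fuzzy string matching against a canonical street list after normalizing common abbreviations. The sketch below uses Python's standard difflib; the abbreviation table, canonical names and similarity cutoff are assumptions, and homonyms would still need city or geocoding context to disambiguate.

    # Sketch of matching street-name variants ("Smith Str") to canonical names.
    # Abbreviation table, street list and cutoff are illustrative assumptions.
    from difflib import get_close_matches

    CANONICAL_STREETS = ["John Smith Street", "High Street", "Victoria Avenue"]
    ABBREVIATIONS = {"str": "street", "st": "street", "ave": "avenue"}

    def normalize(name):
        words = name.lower().replace(".", "").split()
        return " ".join(ABBREVIATIONS.get(w, w) for w in words)

    def match_street(mention, candidates=CANONICAL_STREETS, cutoff=0.6):
        normalized = [normalize(c) for c in candidates]
        hits = get_close_matches(normalize(mention), normalized, n=1, cutoff=cutoff)
        return candidates[normalized.index(hits[0])] if hits else None

    print(match_street("Smith Str"))  # likely -> "John Smith Street"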

How can I analyze pieces of text for positive or negative words?

I'm looking for some sort of module (preferably for python) that would allow me to give that module a string about 200 characters long. The module should then return how many positive or negative words that string had. (e.g. love, like, enjoy vs. hate, dislike, bad)
I'd really like to avoid having to reinvent the wheel in natural language processing, so if there is anything you guys know of that would allow me to do what I described above, it'd be a huge time-saver if you could share.
Thanks for the help!
I think you're looking for sentiment analysis. Here's a Twitter sentiment app.
Here's a question about sentiment analysis using Python.
Before you analyse pieces of text, you need to preprocess the given text: strip punctuation, fix obvious spelling errors, split on whitespace, lowercase the whole text, and store the words in an iterable data structure.
For some basic sentiment analysis, the following techniques can be used:
Bag of words
In the bag-of-words technique we basically go through a bag (file) of words and check whether the iterable we built contains them. If it does, we assign some value to each word's presence in order to weigh the total sentiment of the text.
This link should help you understand more about this
https://en.wikipedia.org/wiki/Bag-of-words_model
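
A tiny scorer in that spirit, using the example words from the question; the word lists and the counting scheme are placeholders, since real sentiment lexicons are far larger and often weighted.

    # Tiny bag-of-words sentiment scorer; word lists are illustrative placeholders.
    import string

    POSITIVE = {"love", "like", "enjoy"}
    NEGATIVE = {"hate", "dislike", "bad"}

    def preprocess(text):
        # strip punctuation, lowercase, split on whitespace
        cleaned = text.translate(str.maketrans("", "", string.punctuation)).lower()
        return cleaned.split()

    def sentiment_counts(text):
        words = preprocess(text)
        positives = sum(1 for w in words if w in POSITIVE)
        negatives = sum(1 for w in words if w in NEGATIVE)
        return positives, negatives

    print(sentiment_counts("I love this movie, but the ending was bad."))  # -> (1, 1)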
Keyword Extraction and Tagging
Keywords and important information can be extracted from the input text by tagging the elements and then removing unwanted data.
For example:
My name is John.
Here "John" and "name" carry the information, and "is" isn't really needed.
Similarly verbs and other unimportant things can be removed in order to retain only the main information.
Chunking and chinking help with this.
This link should be of help:
http://nltk.org/book/ch07.html
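
A small chunking sketch in the spirit of that NLTK book chapter, run on the example sentence above; the noun-phrase grammar is a minimal illustration rather than a production rule set.

    # Chunk simple noun phrases out of the example sentence with NLTK.
    import nltk

    for resource in ("punkt", "averaged_perceptron_tagger"):
        nltk.download(resource, quiet=True)

    sentence = "My name is John."
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

    # Optional determiner/possessive, any adjectives, then one or more nouns.
    grammar = "NP: {<DT|PRP\\$>?<JJ>*<NN.*>+}"
    chunker = nltk.RegexpParser(grammar)
    tree = chunker.parse(tagged)

    for subtree in tree.subtrees(lambda t: t.label() == "NP"):
        print(" ".join(word for word, tag in subtree.leaves()))  # e.g. "My name", "John"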
You can tokenize your text and get the sentiment using existing sentiment analysis tools. The most comprehensive resource on sentiment analysis tools that I know of is SentiBench, which is basically a survey study of sentiment analysis tools, along with code and examples of how to use them.
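
If you want a concrete off-the-shelf example in Python (my suggestion, not one singled out above), NLTK ships the VADER sentiment analyzer, which returns positive/negative/neutral proportions and a compound score per piece of text.

    # Example of an off-the-shelf sentiment tool: NLTK's VADER analyzer.
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)

    analyzer = SentimentIntensityAnalyzer()
    print(analyzer.polarity_scores("I love this product, but the delivery was bad."))
    # -> dict with 'neg', 'neu', 'pos' and a 'compound' score in [-1, 1]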

Resources