Any interesting OCR/NLP related projects for a CS final year project?

I am a final year CS student and very interested in OCR and NLP.
The problem is that I don't know anything about OCR yet, and my project duration is only 5 months. What OCR & NLP work would be viable for my project?
Is writing a (simple) OCR engine for a single language too hard for my project? What about adding language support to existing FOSS OCR software?

My background is on the commercial side of OCR, and in my experience writing anything but a simple OCR engine would take a fair amount of time. To get even reasonable results, your input files would have to contain very clean text characters, or you would need lots of marked-up training data to train the engine. This would limit your input to high-quality printed documents and computer-generated documents, such as a Word document exported to a TIFF image. Commercial OCR engines do a much better job reading standard scanned invoices and letters than even Tesseract OCR, and they still make mistakes.
You could write a simple OCR engine and use NLP and language analysis to show how they can improve the OCR results. Most OCR engines do this already, but it could be an interesting project. The commercial engines have had years of fine-tuning to improve their recognition accuracy, and they use every trick they can think of.
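To make the language-analysis idea concrete, here is a minimal sketch of one common post-correction trick: snapping each OCR token to the closest dictionary word when it is close enough. The toy dictionary and the sample OCR output are made up for illustration; a real system would use a full lexicon or a language model.

```python
import difflib

# Toy dictionary; a real engine would use a full lexicon or language model.
DICTIONARY = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]

def correct_token(token, dictionary=DICTIONARY):
    """Replace an OCR token with its closest dictionary word, if close enough."""
    matches = difflib.get_close_matches(token.lower(), dictionary, n=1, cutoff=0.75)
    return matches[0] if matches else token

def correct_line(line):
    return " ".join(correct_token(t) for t in line.split())

# "qu1ck" and "br0wn" are typical OCR confusions (1 vs i, 0 vs o).
print(correct_line("the qu1ck br0wn fox"))  # -> the quick brown fox
```

The cutoff of 0.75 is an arbitrary choice; too low a cutoff will "correct" words that were actually right.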
This article may give you some ideas on one way how to write an OCR engine:
http://www.codeproject.com/KB/dotnet/simple_ocr.aspx
You may be able to contribute to the Tesseract project, but you would first need to research what has already been implemented, what is missing, and whether anyone else is working on the same problem.

Related

Transcript transformation for sentiment analysis

I'm doing sentiment analysis on users' transcripts from UX website testing. I get the transcript from the testing session and then analyze it for sentiment: what the user's opinion of the website is, what problems they encountered, and whether they got stuck or lost. Since this is quite domain-specific, I'm testing both TextBlob and VADER to see which gives better results. My issue is at the beginning of the process: the speech-to-text API's transcript isn't perfect. Sentence boundaries (periods) are missing or minimal, so I'm not sure at what level the analysis should run, since I was hoping to do it at sentence level. I tried making n-grams and analyzing those short chunks of text, but it isn't ideal and the results are somewhat hard to read, because some parts are repeated. Apart from this, I do classical text cleaning, tokenization, POS tagging, and lemmatization, and feed the result to TextBlob and VADER.
Transcript example: okay so if I go just back over here it has all the information I need it seems like which is great so I'm pretty impressed with it similar to how a lot of government websites are set up over here it looks like I have found all the information I need it's a great website it has everything overall though it had more than enough information...
I did:
from textblob import TextBlob

ngram_object = TextBlob(lines)
ngrams = ngram_object.ngrams(n=4)
which gives me something like (actually a WordList): [okay so if I, so if I go, if I go just...]
Then the results look like (index, n-gram, polarity, label):
62  little bit small                -0.21875      Negative
61  like little bit                 -0.18750      Negative
0   information hard find not see   -0.291666667  Negative
1   hard find not see information   -0.291666667  Negative
Is there a better way to analyze unstructured text in chunks rather than a full transcript?
This makes it difficult to capture what the issue with the website was. Changing the API isn't really an option, since it was given to me as the data-collection method for this particular sentiment analysis problem.
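For reference, a sliding-window variant of the n-gram chunking can reduce the repeated fragments: advance by a stride smaller than the window instead of by one word. This is a stdlib-only sketch of just the chunking step (the window and stride sizes are arbitrary choices); each chunk would then be fed to TextBlob or VADER as before.

```python
def window_chunks(text, size=8, stride=4):
    """Split an unpunctuated transcript into overlapping word windows.

    size: words per chunk; stride: words to advance between chunks.
    stride < size keeps some overlapping context; stride == size gives
    disjoint chunks and no repeated output rows at all.
    """
    words = text.split()
    chunks = []
    for start in range(0, len(words), stride):
        chunk = words[start:start + size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + size >= len(words):
            break
    return chunks

transcript = ("okay so if I go just back over here it has all the "
              "information I need it seems like which is great")
for c in window_chunks(transcript):
    print(c)
```

With n=4 one-word-step n-grams, every word appears in up to four chunks; with size=8 and stride=4 each word appears in at most two, so the scored output is much easier to scan.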
Any tips or suggestions would be highly appreciated, couldn't find anyone doing something similar to this.
I am not sure exactly what you want, but maybe you could take a look at speech sentiment analysis? I have read about RAVDESS, a database useful for emotion classification from speech. Take a look: https://smartlaboratory.org/ravdess/

What is the difference between OCR and Recognize Text on the Azure pricing page?

As I found at https://azure.microsoft.com/en-us/pricing/details/cognitive-services/computer-vision/, OCR has a different price from Recognize Text. It is quite confusing. What is the difference? I can't find any clue in the documents.
The difference is described here in the docs: https://learn.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text#ocr-optical-character-recognition-api
In a few words:
OCR is synchronous and uses an earlier recognition model, but works with more languages.
Recognize Text (and the Read API, its successor) uses updated recognition models, but is asynchronous.
If you want to process handwritten text, for example, you should use the latter.

Tutorial tensorflow audio pitch analysis

I'm a beginner with TensorFlow and Python, and I'm trying to build an app that automatically detects key moments (yellow/red cards, goals, etc.) in a football (soccer) match.
I'm starting to understand how to do video analysis by training the program on a dataset I built myself, downloading images from the web and tagging them. To get better results, I was wondering if someone had suggestions for tutorials on how to also train my app on audio files, so the program can detect pitch variation in the video's audio and combine the video and audio analysis.
Thank you in advance
Since you are new to Python and to TensorFlow, I recommend you focus on just audio for now, especially since it's a strong indicator of important events in a football match (red/yellow cards, nasty fouls, goals, strong chances, good plays, etc.).
Very simply, without using much ML at all, you can use the average volume of a time period to infer significance. If you want to get a little more sophisticated, you can consider speech-to-text libraries to look for keywords in commentator speech.
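The average-volume idea can be sketched without any ML at all: compute the RMS loudness per fixed-size window and flag windows well above the mean. The synthetic samples and the 2x-mean threshold below are made-up illustrations, not tuned values.

```python
import math

def rms_per_window(samples, window_size):
    """RMS loudness for each fixed-size window of audio samples."""
    out = []
    for i in range(0, len(samples) - window_size + 1, window_size):
        window = samples[i:i + window_size]
        out.append(math.sqrt(sum(s * s for s in window) / window_size))
    return out

def loud_windows(samples, window_size, factor=2.0):
    """Indices of windows whose RMS exceeds `factor` times the mean RMS."""
    rms = rms_per_window(samples, window_size)
    mean = sum(rms) / len(rms)
    return [i for i, r in enumerate(rms) if r > factor * mean]

# Synthetic "audio": quiet crowd noise with a loud burst in the middle.
quiet = [0.1, -0.1] * 50   # 100 quiet samples
loud = [0.9, -0.9] * 50    # 100 loud samples (e.g. a crowd roar)
samples = quiet + loud + quiet
print(loud_windows(samples, window_size=100))  # -> [1]
```

On real audio you would load the samples with a library such as soundfile or the stdlib wave module, and map window indices back to timestamps via the sample rate.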
Using video to try to determine when something important is happening is much, much more challenging.
This page can help you get started with audio signal processing in Python.
https://bastibe.de/2012-11-02-real-time-signal-processing-in-python.html

Search in a book with speech

I am trying to build a program that will find which page/sentence of a book is being read into a microphone. I have the book's text and its audio content. The user will start reading from a random page, and the program is supposed to sync to the user and show the section of the book being read. It might seem like a useless program, but please bear with me.
Would an approach similar to Shazam-like programs work? I am not sure how effective those algorithms are for speech. Also, the speaker will be different and might have an accent and read at a different speed.
Another approach would be converting the speech to text and searching for the text in the book. The problem is that the language of the book is a rare one, for which no language model is available. In addition, the script does not use Latin characters, which makes programming difficult (for me, at least).
Are there any solutions anyone can recommend? Would extracting features from the audio file and comparing them with the features extracted in real time from the microphone work? Which features?
Any implementation/code that I can start with? Any language is OK, but I prefer C.
You need to use a speech recognizer.
Create a language model directly from the book text. That will make recognition of the book reading very accurate, both for the original recording and for the user's reading.
Use this language model to recognize the book audio and assign timestamps to the words, or use a more advanced algorithm to perform text-to-audio alignment.
Recognize the user's speech with the book-specific language model and use the recognized text to display a position in the book.
You can use CMUSphinx for the mentioned tasks.
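The final step, mapping the recognized words onto a position in the book, can be sketched with a stdlib fuzzy match over word windows, which tolerates a few recognition errors. The book text here is a toy example, and the 0.5 score cutoff is an arbitrary assumption.

```python
import difflib

def locate_in_book(book_text, recognized, min_ratio=0.5):
    """Find the word offset in the book best matching the recognized speech.

    Slides a window the same length as the recognized text over the book
    and keeps the best SequenceMatcher score; returns (offset, score),
    or (None, score) if nothing matches well enough.
    """
    book_words = book_text.split()
    n = len(recognized.split())
    best = (None, 0.0)
    for i in range(len(book_words) - n + 1):
        window = " ".join(book_words[i:i + n])
        score = difflib.SequenceMatcher(None, window, recognized).ratio()
        if score > best[1]:
            best = (i, score)
    if best[1] < min_ratio:
        return None, best[1]
    return best

book = "in the beginning the wizard crossed the silent river at dawn"
# Recognizer output with one wrong word ("silver" instead of "silent"):
offset, score = locate_in_book(book, "crossed the silver river")
print(offset)  # word offset of "crossed" in the book
```

For a full book you would index pages so the search runs per page rather than over the whole text, and track the previous match to keep the search local as the user reads on.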

read text document from scanned image

Is there any way to get the text from a scanned document in JPG, JPEG, or any other format? I am using Ruby as my programming language, but I guess if I can get the text with help from another programming language, it will not be much of a problem to integrate.
Thanks.
Yes, you can use an OCR library. There are additional details at https://stackoverflow.com/questions/1085/free-ocr-library.
In brief, you may wish to consider using tessnet (http://www.pixel-technology.com/freeware/tessnet2/).
This technology is called optical character recognition (OCR).
For programming, check out this question, which recommends tesseract-ocr.
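One language-neutral way to use Tesseract from Ruby (or anything else) is to shell out to its CLI. This sketch builds and runs the command from Python for illustration, but the same invocation works from Ruby's backticks; it assumes the tesseract binary is installed and on the PATH, and scan.jpg is a placeholder filename.

```python
import shutil
import subprocess

def tesseract_cmd(image_path, out_base="stdout", lang="eng"):
    """Build the tesseract CLI invocation: tesseract <image> <out> -l <lang>.

    out_base="stdout" makes tesseract print the recognized text instead of
    writing <out_base>.txt, which is the easiest way to capture it from
    another language.
    """
    return ["tesseract", image_path, out_base, "-l", lang]

cmd = tesseract_cmd("scan.jpg")
if shutil.which("tesseract"):  # only run if the binary is installed
    text = subprocess.run(cmd, capture_output=True, text=True).stdout
    print(text)
```

From Ruby the equivalent is simply: text = `tesseract scan.jpg stdout -l eng`.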
OCR for Ruby? Check out this question.
If it's just a couple images, here's a site that supposedly does it for free.
OCR Terminal (http://www.ocrterminal.com) has been the best (most accurate) free tool out of at least a dozen that I have used. It works especially well with formatted (table) data.
