Text/Image processing in Python - python-3.x

Brief introduction:
I'm trying to extract certain pieces of text from an image that contains a lot of text.
Off the top of my head, there seem to be at least two ways to handle this problem:
One way is to segment the image by text area first: for example, train a neural network on a set of sample images that contain the sample texts, let the trained model locate the corresponding text areas in the real image, then crop those areas out of the image and save them. Then use, for instance, pytesseract to convert the cropped images to strings.
The other way is to reverse the process: first convert the whole image into strings, then train the neural network on sample real texts, and let the trained model find the corresponding passages in the text converted from the images.
So, my questions are listed below:
Can this problem be solved without training a neural network? Would that be more efficient than an NN, in terms of the time taken to run the program and the accuracy of the results?
Of the two methods I described above, which one is better, in terms of the time taken to run the program and the accuracy of the results?
Any other experienced suggestions?
Additional background information if needed:
So, I have a number of groups of screenshots of different web pages, each of which contains a lot of text, and I want to extract certain paragraphs from that large volume of text. The paragraphs I want to extract express similar things but in different contexts.
For example, on a large mixed online forum platform, many comments are made on different things: some on mountain landscapes, some on politics, some on science, etc. Since the platform has far more than one page, there must be hundreds of pages where countless users leave their comments. Now I want to extract specifically the comments on politics from the entire forum, i.e. from all the pages the platform has. So I would use Python + Selenium to scrape the pages and save the screenshots. Then we are back to the questions asked above: what to do now?
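For concreteness, a minimal sketch of that scrape-then-OCR step might look like the following (Python; it assumes Selenium with a working Chrome driver and pytesseract/Tesseract are installed, and the URL and file names are placeholders):

from selenium import webdriver
from PIL import Image
import pytesseract

driver = webdriver.Chrome()                      # assumes a Chrome driver is available
driver.get("https://example-forum.com/page/1")   # placeholder URL
driver.save_screenshot("page_1.png")             # save the rendered page as an image
driver.quit()

# OCR the screenshot back into a string
text = pytesseract.image_to_string(Image.open("page_1.png"))
print(text[:500])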
Update:
A thought just occurred to me. An NN trained on images that contain text probably cannot locate the wanted text very accurately, because the NN might only be looking at arrangements of pixels rather than at the words, or even the meaning, that make up the sentences or paragraphs. So maybe the second method, text processing, is better in this case? (something like NLP?)

So, you have decided not to parse the text, but to save it as an image and then detect the text from that image.
Text -> Image -> Text
That is the worst-case scenario for parsing web pages.
When dealing with OCR you should expect many problems, such as:
High CPU consumption;
Different fonts;
Hidden elements (like 'See full text');
And the main one: you can't OCR with 100% accuracy.
If you try to create a general parser that crawls only the required text from any given page without any "garbage", that is an almost utopian idea.
The closest thing I know of is "HTML readability" technology (browsers like Safari and Firefox use it). How well it works with forums I can't say; forums are a very particular page format.
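To keep the pipeline Text -> Text instead, one option is to take the rendered HTML from Selenium and pull the comment text out of it directly. A minimal sketch, assuming BeautifulSoup is installed; the URL and the CSS selector are placeholders for whatever element actually wraps a comment on the target forum:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://example-forum.com/page/1")   # placeholder URL
html = driver.page_source                        # grab the rendered HTML instead of a screenshot
driver.quit()

soup = BeautifulSoup(html, "html.parser")
# Placeholder selector: inspect the forum to find the element that wraps each comment
comments = [div.get_text(strip=True) for div in soup.select("div.comment")]
for comment in comments:
    print(comment)

With the comments as plain strings, deciding which ones are about politics becomes a text classification problem rather than an image problem.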

Related

How to parse screenshots in Power Automate

I have a bit of a complicated process in Power Automate. I'm trying to parse user-uploaded screenshots and categorize them into different variables. At first it seemed that an obvious choice would be to build and train the AI Model, but the issue is that the data in the screenshots can vary (i.e. some images will contain more rows, some won't contain the relevant data, and the data can be located in different regions of the screenshot).
Some examples of images a user can upload are: (i) Samsung 1 metrics, (ii) Samsung 2 metrics, (iii) iPhone metrics.
My attempt was to perform OCR on the uploaded screenshot and then do string parsing, so I tried the flow shown in the "Flow Diagram" screenshot, and specifically the substring parsing shown in the "Substring parsing" screenshot.
Basically, I'm performing OCR on the screenshot and then searching for a substring which corresponds to the values I'm interested in. I'm unsure whether this is the best way to do this, as it isn't dynamic (i.e. I have to offset the substring index by a certain number of characters). Any advice is greatly appreciated.
I believe you should be able to train a custom Form Processing model to extract the information you need. You can use two different collections in your training dataset to have the model recognize both Samsung and iPhone layouts.
All you'll need is 5 samples for each collection and you should be good to go.

How to get three dimensional vector embedding for a list of words

I have been asked to create three dimensional vector embeddings for a series of words. Although I understand what an embedding is and that word2vec will be able to create the vector embeddings, I cannot find a resource that shows me how to create a three dimensional vector (all the resources show many more dimensions than this).
The format I have to create the file in is:
house 34444 0.3232 0.123213 1.231231
dog 14444 0.76762 0.76767 1.45454
which is in the format <token>\t<word_count>\t<vector_embedding_separated_by_spaces>
Can anyone point me towards a resource that will show me how to create the desired file format given some training text?
Once you've decided on a programming language and a word2vec library, its documentation will likely highlight a configurable parameter that lets you specify the dimensionality of the vectors it trains. So, you just need to change that parameter from its typical values, like 100 or 300, to 3.
(Note, though, that 3-dimensional word-vectors are unlikely to show the interesting & useful properties of higher-dimensional vectors.)
Once you've used such a library to create the vectors-in-memory, writing them out in your specified format becomes just a file-IO problem, unrelated to word2vec itself. In typical languages, you'd open a new file for writing, loop over your data printing each line properly, then close the file.
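If, for example, you picked Python and the gensim library, a rough sketch might look like this (gensim 4.x parameter names are assumed, and the tiny corpus and output path are placeholders):

from gensim.models import Word2Vec

# Placeholder corpus: replace with your tokenised training text
sentences = [["the", "dog", "sat", "by", "the", "house"],
             ["the", "house", "was", "large"]]

# vector_size=3 is the parameter that controls the embedding dimensionality
model = Word2Vec(sentences, vector_size=3, min_count=1, epochs=50)

with open("embeddings.txt", "w") as f:
    for word in model.wv.index_to_key:
        count = model.wv.get_vecattr(word, "count")           # word count from the training corpus
        vector = " ".join(f"{x:.6f}" for x in model.wv[word])
        f.write(f"{word}\t{count}\t{vector}\n")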
(To get a more detailed answer from StackOverflow, you'd want to pick a specific language/library, show what you've already tried with actual code, and show how the results/errors achieved fall short of your goal.)

Text classification

I have a trivial understanding of NLP so please keep things basic.
I would like to run some PDFs at work through a keyword extractor/classifier and build a taxonomy - in the hope of delivering some business intelligence.
For example, given a few thousand PDFs to mine, I would like to determine the markets they apply to (we serve about 5 major industries, each of which has several minor industries). Each industry and sub-industry has a specific market, and in most cases those deal with OEMs, which in turn deal with models, which further subdivide into component parts, etc.
I would love to crunch these PDFs into a semi-structured (more a graph actually) output like:
Aerospace
Manufacturing
Repair
PT Support
M250
C20
C18
Distribution
Can text classifiers do that? Is this too specific? How do you train a system like this to know that the C18 is a "model" from "manufacturer" Rolls Royce in the M250 series and that "PT SUPPORT" is a sub-component?
I could build this data manually, but it would take forever...
Is there a way I could use a text classifier framework and build something more efficiently than regex and python?
Just looking for ideas at this point... Watched a few tutorials on R and python libs but they didn't sound quite like what I am looking for.
OK, let's break your problem into small sub-problems first. I would break the task down as:
Read the PDFs and extract data and metadata from them - take a look at the Apache Tika library.
Any classifier needs training data to be effective - create training data for the text classifier.
Then apply any suitable classification algorithm (a sketch of this pipeline follows below).
You can also have a look at the Carrot2 clustering algorithm; it will automatically analyse the data and group the PDFs into different categories.
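A rough sketch of those steps in Python (assuming the tika and scikit-learn packages; the file names and labels are placeholders, and a TF-IDF plus linear classifier is just one reasonable starting choice):

from tika import parser                     # tika-python needs a Java runtime available
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# 1. Extract text from the PDFs with Apache Tika
def pdf_text(path):
    return parser.from_file(path).get("content") or ""

# 2. Hand-labelled training data (placeholder paths and labels)
train_files = ["a1.pdf", "a2.pdf", "r1.pdf", "d1.pdf"]
train_labels = ["Manufacturing", "Manufacturing", "Repair", "Distribution"]
train_texts = [pdf_text(p) for p in train_files]

# 3. TF-IDF features plus a simple linear classifier
clf = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)

# Classify a new document
print(clf.predict([pdf_text("unknown.pdf")]))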

image processing / computer vision - body part recognition - posture ( standing/ sitting) - supervised learning

I'm after advice from the image processing / computer vision experts here. I'm trying to develop a robust, scalable algorithm to extract the dimensions of a person's body, for example their upper-body width.
problems:
images without faces
person sitting
multiple faces
person is holding something, thus covering part of his body
ways of doing this:
* Haar cascades - unsupervised; a lot of training data of different body parts, and hope for the best.
* HOG - face detection first, then HOG with assumptions along the way and different filters.
Note: all images will be scaled to the same size.
Obviously computation time for the second approach MIGHT be more demanding (doubtful though), but for the first method, training is almost impossible and would take much more time.
P.S.
I know there's a paper about using pedestrian data, but that would work for full body + standing, not for sitting.
I'm open to hearing all your ideas; ask away if you have anything to add.
Implementation would hopefully be done via node.js.
Thank you
DPM is widely used in computer vision for object detection, and it tends to work in the case of occlusion and also when only part of an object is present in the image. The grammar model for humans is very good and has state-of-the-art results on standard datasets. It takes around a second to perform detection on a single image; it's MATLAB code, so it's expected to be slow.
http://www.cs.berkeley.edu/~rbg/latent/
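For the HOG route mentioned in the question, OpenCV also ships a pretrained HOG + linear SVM pedestrian detector. A minimal sketch in Python (for illustration only, since the question mentions node.js; note this detector is trained on standing pedestrians, so sitting people will likely be missed):

import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

img = cv2.imread("person.jpg")                   # placeholder image path
rects, weights = hog.detectMultiScale(img, winStride=(8, 8), padding=(8, 8), scale=1.05)

for (x, y, w, h) in rects:
    # The box width is only a crude proxy for upper-body width
    print("person box:", x, y, w, h)
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("detected.jpg", img)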

I need a function that describes a set of sequences of zeros and ones?

I have multiple sets with a variable number of sequences. Each sequence is made of 64 numbers that are either 0 or 1 like so:
Set A
sequence 1: 0,0,0,0,0,0,1,1,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0
sequence 2:
0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
sequence 3:
0,0,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,1,1,0
...
Set B
sequence1:
0,0,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1
sequence2:
0,0,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,1,1,1,1,1,0
...
I would like to find a mathematical function that describes all possible sequences in the set, maybe even predicts more, and that does not match the sequences in the other sets.
I need this because I am trying to recognize different gestures in a mobile app based on the cells in a grid that have been touched (1 touch/ 0 no touch). The sets represent each gesture and the sequences a limited sample of variations in each gesture.
Ideally, the function describing the sequences in a set would allow me to test user touches against it to determine which set/gesture they are part of.
I searched for a solution, either using Excel or Mathematica, but being very ignorant about both and mathematics in general I am looking for the direction of an expert.
Suggestions for basic documentation on the subject is also welcome.
It looks as if you are trying to treat what is essentially 2D data in 1D. For example, let s1 represent the first sequence in set A in your question. Then the command
ArrayPlot[Partition[s1, 8]]
produces this picture:
The other sequences in the same set produce similar plots. One of the sequences from the second set produces, in response to the same operations, the picture:
I don't know what sort of mathematical function you would like to define to describe these pictures, but I'm not sure that you need to if your objective is to recognise user gestures.
You could do something much simpler, such as calculate the 'average' picture for each of your gestures. One way to do this would be to calculate the average value for each of the 64 pixels in each of the pictures. Perhaps there are 6 sequences in your set A describing gesture A. Sum the sequences element-by-element. You will now have a sequence with values ranging from 0 to 6. Divide each element by 6. Now each element represents a sort of probability that a new gesture, one you are trying to recognise, will touch that pixel.
Repeat this for all the sets of sequences representing your set of gestures.
To recognise a user gesture, simply compute the difference between the sequence representing the gesture and each of the sequences representing the 'average' gestures. The smallest (absolute) difference will direct you to the gesture the user made.
I don't expect that this will be entirely foolproof, it may well result in some user gestures being ambiguous or not recognisable, and you may want to try something more sophisticated. But I think this approach is simple and probably adequate to get you started.
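A minimal sketch of this averaging-and-nearest-template idea, shown here in Python/NumPy purely for illustration (the two tiny example sets are placeholders for your real gesture data):

import numpy as np

# Each gesture set is a list of 64-element 0/1 sequences (placeholders for your real data)
sets = {
    "A": np.array([[0, 0, 0, 1] * 16, [0, 0, 1, 1] * 16]),
    "B": np.array([[1, 1, 0, 0] * 16, [1, 0, 0, 0] * 16]),
}

# 'Average picture' per gesture: the per-cell touch probability
templates = {name: seqs.mean(axis=0) for name, seqs in sets.items()}

def classify(sequence):
    sequence = np.asarray(sequence)
    # The template with the smallest total absolute difference wins
    return min(templates, key=lambda name: np.abs(sequence - templates[name]).sum())

print(classify([0, 0, 1, 1] * 16))   # prints "A"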
In Mathematica the following expression will enumerate all the possible combinations of {0,1} of length 64.
Tuples[{1, 0}, {64}]
But there are 2^64, or 18446744073709551616, of them, so I'm not sure what use that will be to you.
Maybe you just wanted the unique sequences contained in each set; in that case, all you need is the Mathematica Union[] function applied to the set. If you have the sets grouped together in a list in Mathematica, say mySets, then you can apply the Union operator to every set in the list by using the Map operator:
Union /@ mySets
If you want to do some type of prediction a little more information might be useful.
Thank you for the clarifications.
Machine Learning
The task you want to solve falls under the disciplines known by a variety of names, but probably most commonly as Machine Learning or Pattern Recognition. If you know which examples represent the same gestures, your case would be known as supervised learning.
Question: in your case, do you know which gesture each example represents?
You have a series of examples for which you know a label (the form of gesture it is), from which you want to train a model and then use that model to assign an unseen example to one of a finite set of classes - in your case, one of a number of gestures. This is typically known as classification.
Learning Resources
There is a very extensive body of research on this topic, but a popular introduction to the subject is Pattern Recognition and Machine Learning by Christopher Bishop.
Stanford has a series of machine learning video lectures (Stanford ML) available on the web.
Accuracy
You might want to consider how you will determine the accuracy of your system at predicting the type of gesture for an unseen example. Typically you train the model using some of your examples and then test its performance using examples the model has not seen. Two of the most common methods used to do this are 10-fold cross-validation and repeated 50/50 holdout. Having a measure of accuracy enables you to compare one method against another to see which is superior.
Have you thought about what level of accuracy you require in your task? Is 70% accuracy enough, 85%, 99%, or better?
Machine learning methods are typically quite sensitive to the specific type of data you have and to the number of examples you have to train the system with; the more examples, generally the better the performance.
You could try the method suggested above and compare it against a variety of well-proven methods, among which would be random forests, support vector machines and neural networks. All of these, and many more, are available to download in a variety of free toolboxes.
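For illustration, such a comparison in Python with scikit-learn might look like the following sketch (the gesture data and labels are placeholders; the point is the 10-fold cross-validation over several model types):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data: rows of 64 zero/one cells, one gesture label per row
X = np.random.randint(0, 2, size=(60, 64))
y = np.array(["swipe"] * 20 + ["circle"] * 20 + ["tap"] * 20)

models = {
    "random forest": RandomForestClassifier(n_estimators=100),
    "SVM": SVC(),
    "neural net": MLPClassifier(max_iter=2000),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)       # 10-fold cross-validation accuracy
    print(f"{name}: {scores.mean():.2%} +/- {scores.std():.2%}")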
Toolboxes
Mathematica is a wonderful system, infinitely flexible and my favourite environment, but out of the box it doesn't have a great deal of support for machine learning.
I suspect you will make a great deal of progress more quickly by using a toolbox designed for machine learning. Two of the most popular free toolboxes are WEKA and R; both support more than 50 different methods for solving your task, along with methods for measuring the accuracy of the solutions.
With just a little data reformatting, you can convert your gestures to a simple file format called ARFF, load them into WEKA or R, and experiment with dozens of different algorithms to see how each performs on your data. The Explorer tool in WEKA is definitely the easiest to use, requiring little more than a few mouse clicks and typing some parameters to get started.
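That reformatting step is small. A sketch in Python that writes labelled 0/1 sequences into an ARFF file WEKA can open (the gesture names and example rows are placeholders):

# Placeholder labelled sequences: (64-element 0/1 list, gesture label)
examples = [
    ([0, 1] * 32, "gestureA"),
    ([1, 0] * 32, "gestureB"),
]

with open("gestures.arff", "w") as f:
    f.write("@RELATION gestures\n")
    for i in range(64):
        f.write(f"@ATTRIBUTE cell{i} {{0,1}}\n")       # one nominal attribute per grid cell
    f.write("@ATTRIBUTE gesture {gestureA,gestureB}\n")
    f.write("@DATA\n")
    for cells, label in examples:
        f.write(",".join(str(c) for c in cells) + "," + label + "\n")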
Once you have an idea of how well the established methods perform on your data you have a good starting point to compare a customised approach against should they fail to meet your criteria.
Handwritten Digit Recognition
Your problem is similar to a very well-researched machine learning problem known as handwritten digit recognition. The methods that work well on this public data set of handwritten digits are likely to work well on your gestures.
