String-matching algorithm for noisy text

I have used OCR (optical character recognition) to extract texts from images. The images contain book covers. Because the images are so noisy, some characters are misrecognised, or some noise is recognised as characters.
Examples:
"w COMPUTER Nnwonxs i I "(Compuer Networks)
"s.ll NEURAL NETWORKS C "(Neural Networks)
"1llllll INFRODUCIION ro PROBABILITY ti iitiiili My "(Introduction of Probability)
I built a dictionary of words, but I want to somehow match the recognised text against the dictionary. I tried LCS (longest common subsequence), but it's not very effective.
What is the best string-matching algorithm for this kind of problem? (Part of the string is just noise, but the important part of the string can also contain misrecognised characters.)

That's really a big question. The following is what I know about it; for more details, you can read some related papers.
For a single word, use the Hamming distance to calculate the similarity between the word recognized by OCR and the words in your dictionary;
this step is used to correct words that have been recognized by OCR but do not exist in the dictionary.
Eg:
If the result of OCR is INFRODUCIION, which doesn't exist in your dictionary, you can find that the Hamming distance to the word 'INTRODUCTION' is 2. So 'INFRODUCIION' is probably a misrecognition of 'INTRODUCTION'.
However, an OCR result may be the same Hamming distance away from several different dictionary words.
Eg: If the result of OCR is CAY, you may find that CAR and CAT both have a Hamming distance of 1, which is ambiguous.
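To make the dictionary lookup concrete, here is a minimal sketch in Python of Hamming-distance matching; the toy dictionary and the distance threshold are my own assumptions, not part of the original question:

```python
def hamming(a, b):
    """Number of differing positions; only defined for equal-length strings."""
    if len(a) != len(b):
        return None
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def candidates(word, dictionary, max_dist=2):
    """Dictionary words of the same length within max_dist of the OCR output."""
    scored = []
    for entry in dictionary:
        d = hamming(word.upper(), entry.upper())
        if d is not None and d <= max_dist:
            scored.append((d, entry))
    return sorted(scored)

# Toy dictionary for illustration only.
dictionary = ["INTRODUCTION", "PROBABILITY", "NETWORKS", "CAR", "CAT"]
print(candidates("INFRODUCIION", dictionary))       # [(2, 'INTRODUCTION')]
print(candidates("CAY", dictionary, max_dist=1))    # [(1, 'CAR'), (1, 'CAT')], still ambiguous
```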
In this case, there are several things that can be used for analysis:
Still for a single word, the image difference between CAT and CAY is smaller than that between CAR and CAY. For this reason, CAT seems to be the right word with a greater probability.
Then let us use the context to calculate another probability. If the whole sentence is 'I drove my new CAY this morning', then since people usually drive a CAR and not a CAT, we have a better chance of treating CAY as CAR rather than CAT.
For the frequency of words used in similar articles, use TF-IDF.
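As a rough illustration of the context idea above, one could rank the remaining candidates by how often they follow the previous word in some reference corpus; the bigram counts here are invented purely for the example:

```python
# Invented bigram counts; in practice these would come from a reference corpus.
bigram_counts = {("my", "car"): 120, ("my", "cat"): 35, ("new", "car"): 90, ("new", "cat"): 4}

def pick_by_context(prev_word, candidates):
    """Choose the candidate that most often follows prev_word in the reference counts."""
    return max(candidates, key=lambda w: bigram_counts.get((prev_word.lower(), w.lower()), 0))

print(pick_by_context("new", ["CAR", "CAT"]))  # CAR
```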

Are you saying you have a dictionary that defines all words that are acceptable?
If so, it should be fairly straightforward to take each word and find the closest match in your dictionary. Set a match threshold and discard the word if it does not reach the threshold.
I would experiment with the Soundex and Metaphone algorithms or the Levenshtein Distance algorithm.
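A minimal sketch of that closest-match idea using the Levenshtein distance; the dictionary, the relative-distance threshold of 0.34, and the noise handling are assumptions for illustration:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def closest_match(word, dictionary, max_relative_dist=0.34):
    """Closest dictionary word, or None if the best match is still too far away (noise)."""
    best = min(dictionary, key=lambda entry: levenshtein(word.upper(), entry.upper()))
    dist = levenshtein(word.upper(), best.upper())
    if dist / max(len(word), len(best)) > max_relative_dist:
        return None  # below the match threshold: discard the token as noise
    return best

dictionary = ["NEURAL", "NETWORKS", "INTRODUCTION", "PROBABILITY", "COMPUTER"]
print(closest_match("NFRODUCIION", dictionary))  # INTRODUCTION
print(closest_match("1llllll", dictionary))      # None (discarded)
```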

Related

Use the polarity distribution of words to detect the sentiment of new words

I have just started a project in NLP. Suppose I have a graph for each word that shows the polarity distribution of sentiments for that word in different sentences. I want to know what I can use to recognize the sentiment of new words. I would also be happy to hear about any other uses you have in mind.
I apologize for any possible errors in my writing. Thanks a lot
Assuming you've got some words that have been hand-labeled with positive/negative sentiments, but then you encounter some new words that aren't labeled:
If you encounter the new words totally alone, outside of contexts, there's not much you can do. (Maybe you could go out and find extra texts with those new words, such as via dictionaries or the web, then use those larger texts in the next approach.)
If you encounter the new words inside texts that also include some of your hand-labeled words, you could try guessing that the new words are most like the words you already know that are closest-to, or used-in-the-same-places. This would leverage what's called "the distributional hypothesis" – words with similar distributions have similar meanings – that underlies a lot of computer natural-language analysis, including word2vec.
One simple thing to try along these lines: across all your texts, for every unknown word U, tally up the counts of all neighboring words within N positions. (N could be 1, or larger.) From that, pick the top 5 words occurring most often near the unknown word, look up your prior labels, and average them together (perhaps weighted by the number of occurrences).
You'll then have a number for the new word.
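A rough sketch of that counting approach; the tokenized texts, the hand-labeled scores in [-1, 1], and the window size are all made-up placeholders:

```python
from collections import Counter

# Placeholder data: tokenized texts and hand-labeled sentiment scores.
texts = [["the", "service", "was", "frabjous", "and", "great"],
         ["a", "frabjous", "and", "wonderful", "evening"]]
labels = {"great": 1.0, "wonderful": 0.9, "terrible": -1.0}

def guess_label(unknown, texts, labels, window=2, top_k=5):
    """Average the labels of the known words occurring most often near the unknown word."""
    neighbours = Counter()
    for tokens in texts:
        for i, tok in enumerate(tokens):
            if tok != unknown:
                continue
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i and tokens[j] in labels:
                    neighbours[tokens[j]] += 1
    top = neighbours.most_common(top_k)
    if not top:
        return None
    total = sum(count for _, count in top)
    return sum(labels[word] * count for word, count in top) / total  # occurrence-weighted average

print(guess_label("frabjous", texts, labels))  # 0.95, i.e. it looks positive
```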
Alternatively, you could train a word2vec set-of-word-vectors on all of your texts, including the unknown & known words. Then, ask that model for the N most-similar neighbors to your unknown word. (Again, N could be small or large.) Then, from among those neighbors with known labels, average them together (again perhaps weighted by similarity), to get a number for the previously unknown word.
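A sketch of the word2vec variant using gensim; the corpus, labels, and the query word are placeholders, and a real corpus would need to be far larger for the vectors to be meaningful:

```python
from gensim.models import Word2Vec

# Placeholder corpus of tokenized sentences and hand-labeled sentiment scores.
sentences = [["the", "food", "was", "great"],
             ["a", "frabjous", "and", "wonderful", "evening"],
             ["the", "service", "was", "terrible"]]
labels = {"great": 1.0, "wonderful": 0.9, "terrible": -1.0}

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

def guess_label(unknown, model, labels, topn=10):
    """Similarity-weighted average of the labels of the nearest known neighbours."""
    known = [(w, sim) for w, sim in model.wv.most_similar(unknown, topn=topn) if w in labels]
    if not known:
        return None
    return sum(labels[w] * sim for w, sim in known) / sum(sim for _, sim in known)

print(guess_label("frabjous", model, labels))
```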
I wouldn't particularly expect either of these techniques to work very well. The idea that individual words can have specific sentiment is somewhat weak given the way that, in actual language, their meaning is heavily modified, or even reversed, by the surrounding grammar/context. But in each case these simple calculate-from-neighbors techniques are probably better than random guesses.
If your real aim is to calculate the overall sentiment of longer texts, like sentences, paragraphs, reviews, etc., then you should discard your labels of individual words and acquire/create labels for full texts, and apply real text-classification techniques to those larger texts. A simple word-by-word approach won't do very well compared to other techniques – as long as those techniques have plenty of labeled training data.

Trying to detect products from text while using a dictionary

I have a list of product names and a collection of text generated from random users. I am trying to detect products mentioned in the text while taking into account spelling variation. For example, the text
Text = i am interested in galxy s8
Mentions the product samsung galaxy s8
But note the difference in spellings.
I've implemented the following approaches:
1- I max-tokenized the product names and the users' text (I split words on punctuation and digits, so s8 is tokenized into 's' and '8'). Then I checked each token of the user's text against my vocabulary with a Damerau-Levenshtein distance <= 1 to allow for spelling variation. Once I had detected a sequence of tokens that exist in the vocabulary, I searched for the product matching the query, again checking the Damerau-Levenshtein distance on each token. This gave poor results, mainly because a sequence of tokens that exists in the vocabulary does not necessarily represent a product. For example, since the text is max-tokenized, numbers can be found in the vocabulary, and so dates get detected as products.
2- I constructed bigram and trigram indices from the list of products and converted each user text into a query, but the results weren't great either, given the spelling variation.
3- I manually labeled 270 sentences and trained a named entity recognizer with the labels 'O' and 'Product'. I split the data into 80% training and 20% test. Note that I didn't use the list of products as part of the features. The results were okay, but not great.
None of the above approaches achieved reliable performance. I tried regular expressions, but with so many different combinations to consider it became too complicated. Are there better ways to tackle this problem? I suppose NER could give better results with more training data, but supposing there isn't enough training data, what do you think a better solution would be?
If I come up with a better alternative to the ones I've already mentioned, I'll add it to this post. In the meantime I'm open to suggestions.
Consider splitting your problem into two parts.
1) Conduct a spelling check using a dictionary of known product names (this is not an NLP task, and there should be guides on how to implement spell checking).
2) Once you have done the pre-processing (spell checking), use your NER algorithm.
It should improve your accuracy.
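A rough sketch of step 1, correcting user tokens against a product-name vocabulary before the text reaches the NER model; the vocabulary and the 0.75 similarity threshold are assumptions:

```python
import re
from difflib import SequenceMatcher

# Hypothetical vocabulary built from the product list.
product_vocab = {"samsung", "galaxy", "s8", "iphone", "note"}

def correct_token(token, vocab, min_ratio=0.75):
    """Replace a token with its closest vocabulary entry if it is similar enough."""
    if token in vocab or len(token) < 2:
        return token
    best = max(vocab, key=lambda v: SequenceMatcher(None, token, v).ratio())
    return best if SequenceMatcher(None, token, best).ratio() >= min_ratio else token

def preprocess(text, vocab):
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(correct_token(t, vocab) for t in tokens)

print(preprocess("i am interested in galxy s8", product_vocab))
# "i am interested in galaxy s8"  -> this corrected text is what the NER model sees
```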

Part-of-speech tagging: tagging unknown words

In a part-of-speech tagger, the most probable tag sequence for a given sentence is determined with an HMM by
T* = argmax_T ∏ P(Word|Tag) * P(Tag|TagPrev)
But when 'Word' did not appear in the training corpus, P(Word|Tag) is zero for all possible tags, which leaves no room for choosing the best one.
I have tried a few ways:
1) Assigning a small constant probability to all unknown words, P(UnknownWord|AnyTag) ~ epsilon. This completely ignores P(Word|Tag) for unknown words by assigning them a constant, so the decision for an unknown word is made by the prior probabilities alone. As expected, it does not produce good results.
2) Laplace smoothing
I am confused by this one; I don't see the difference between (1) and it. As I understand it, Laplace smoothing adds a constant (lambda) to the counts of all unknown and known words, so all unknown words get a constant probability (a fraction of lambda), while the probabilities of known words stay relatively the same, since every word's count is increased by lambda.
Is Laplace smoothing the same as the previous approach?
*) Is there any better way of dealing with unknown words?
Your two approaches are similar, but, if I understand correctly, they differ in one key way. In (1) you are assigning extra mass to counts of unknown words and in (2) you are assigning extra mass to all counts. You definitely want to do (2) and not (1).
One of the problems with Laplace smoothing is that it gives too much of a boost to unknown words and drags down the probabilities of high-probability words too much (relatively speaking). Your version (1) would actually worsen this problem. Basically, it would over-smooth.
Laplace smoothing works OK for an HMM, but it's not great. Most people do add-one smoothing, but you could experiment with things like add-one-half or whatever.
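To make the difference between the two approaches concrete, here is a small sketch of add-k (Laplace) smoothed emission probabilities; the counts and vocabulary size are made up:

```python
# Made-up emission counts: how often each word was seen with the tag NOUN.
emission_counts = {"dog": 50, "table": 30, "run": 2}
vocab_size = 10000  # assumed size of the full vocabulary, including unseen words

def smoothed_emission(word, counts, vocab_size, k=1.0):
    """Add-k smoothing: every word, seen or unseen, has k added to its count."""
    total = sum(counts.values()) + k * vocab_size
    return (counts.get(word, 0) + k) / total

print(smoothed_emission("dog", emission_counts, vocab_size))            # known word, still the largest
print(smoothed_emission("zebra", emission_counts, vocab_size))          # unseen word, small non-zero mass
print(smoothed_emission("zebra", emission_counts, vocab_size, k=0.5))   # the add-one-half variant
```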
If you want to move beyond this naive approach to smoothing, check out "one-count smoothing", as described in the Appendix of Jason Eisner's HMM tutorial. The basic idea here is that for unknown words more probability mass should be given to tags that appear with a wider variety of low frequency words. For example, since the tag NOUN appears on a large number of different words and DETERMINER appears on a small number of different words, it is more likely that an unseen word will be a NOUN.
If you want to get even fancier, you could use a Chinese Restaurant Process model taken from non-parametric Bayesian statistics to put a prior distribution on unseen word/tag combinations. Kevin Knight's Bayesian inference tutorial has details.
I think the HMM-based TnT tagger provides a better approach to handle unknown words (see the approach in TnT tagger's paper).
The accuracy results (for known words and unknown words) of TnT and other two POS and morphological taggers on 13 languages including Bulgarian, Czech, Dutch, English, French, German, Hindi, Italian, Portuguese, Spanish, Swedish, Thai and Vietnamese, can be found in this article.

Financial news headers classification to positive/negative classes

I'm doing a small research project where I try to classify financial news article headers into positive and negative classes. For classification I'm using an SVM approach. The main problem I see right now is that not many features can be produced for ML. News articles contain a lot of named entities and other "garbage" elements (from my point of view, of course).
Could you please suggest ML features which can be used for training? Current results are: precision = 0.6, recall = 0.8
Thanks
The task is not trivial at all.
The straightforward approach would be to find or create a training set: a set of headers with positive news and a set of headers with negative news.
You turn the training set into a TF-IDF representation and then train a linear SVM to separate the two classes. Depending on the quality and size of your training set you can achieve something decent, though I'm not sure about a 0.7 break-even point.
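A minimal sketch of that pipeline with scikit-learn; the headers and labels below are placeholders for a real training set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder training data: headers labeled 1 (positive) or 0 (negative).
headers = ["Company X posts record quarterly profit",
           "Company Y files for bankruptcy after downturn",
           "Markets rally as employment figures improve",
           "Shares plunge amid fraud investigation"]
labels = [1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True), LinearSVC())
model.fit(headers, labels)

print(model.predict(["Profit warning hits Company Z shares"]))
```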
Then, to get better results you need to go for NLP approaches. Try using a part-of-speech tagger to identify adjectives (trivial), and then score them using some sentiment DB like SentiWordNet.
There is an excellent overview on Sentiment Analysis by Bo Pang and Lillian Lee you should read:
How about these features?
1) Length of the article header in words
2) Average word length
3) Number of words in a dictionary of "bad" words, e.g. dictionary = {terrible, horrible, downturn, bankruptcy, ...}. You may have to generate this dictionary yourself.
4) Ratio of words in that dictionary to total words in the sentence
5) Similar to 3, but the number of words in a "good" dictionary, e.g. dictionary = {boon, booming, employment, ...}
6) Similar to 4, but using the "good"-word dictionary
7) Time of the article's publication
8) Date of the article's publication
9) The medium through which it was published (you'll have to do some subjective classification)
10) A count of certain punctuation marks, such as the exclamation point
If you're allowed access to the actual article, you could use surface features from it, such as its total length and perhaps even the number of responses or the level of opposition to that article. You could also look at many other dictionaries online, such as Ogden's 850-word Basic English dictionary, and see whether good/bad articles would be likely to draw many words from those. I agree that it seems difficult to come up with a long list (e.g. 100 features) of useful features for this purpose.
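A sketch of how a few of the dictionary- and surface-based features listed above could be computed for a single header; the good/bad word lists are toy examples:

```python
import string

bad_words = {"terrible", "horrible", "downturn", "bankruptcy"}   # toy "bad" dictionary
good_words = {"boon", "booming", "employment", "rally"}          # toy "good" dictionary

def header_features(header):
    """Return a dict of simple features for one news header."""
    words = [w.strip(string.punctuation).lower() for w in header.split()]
    words = [w for w in words if w]
    n = len(words)
    return {
        "length_in_words": n,
        "avg_word_length": sum(len(w) for w in words) / n,
        "bad_count": sum(w in bad_words for w in words),
        "bad_ratio": sum(w in bad_words for w in words) / n,
        "good_count": sum(w in good_words for w in words),
        "good_ratio": sum(w in good_words for w in words) / n,
        "exclamation_marks": header.count("!"),
    }

print(header_features("Markets rally as employment figures improve!"))
```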
iliasfl is right, this is not a straightforward task.
I would use a bag of words approach but use a POS tagger first to tag each word in the headline. Then you could remove all of the named entities - which as you rightly point out don't affect the sentiment. Other words should appear frequently enough (if your dataset is big enough) to cancel themselves out from being polarised as either positive or negative.
One step further, if you still aren't close, could be to select only the adjectives and verbs from the tagged data, as they are the words that tend to convey the emotion or mood.
I wouldn't be too disheartened by your precision and recall figures, though; an F score of 0.8 and above is actually quite good.

Finding related texts (correlation between two texts)

I'm trying to find similar articles in a database via correlation.
So I split the text into an array of words, then delete frequently used words (articles, pronouns and so on), then compare two texts with a Pearson correlation function. For some texts it works, but for others it's not so good (longer texts get higher coefficients).
Can somebody advise a good method for finding related texts?
Some of the problems you mention boil down to normalizing over document length and overall word frequency. Try tf-idf.
First and foremost, you need to specify what you precisely mean by similarity and when two documents are (more/less) similar.
If the similarity you are looking for is literal, then I would vectorise the documents using term frequencies, and use the cosine similarity to liken them to each other given that texts are inherently directional data. tf-idf and log-entropy weighting schemes may be tested depending on your use-case. The edit distance is inefficient with long texts.
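A sketch of that literal-similarity route with scikit-learn; the documents below are placeholders for the articles in the database:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder documents; in practice these come from the database.
docs = ["The central bank raised interest rates again this quarter.",
        "Interest rates were raised by the central bank.",
        "The local football team won the championship final."]

tfidf = TfidfVectorizer(stop_words="english")   # stop-word removal replaces the manual word filtering
matrix = tfidf.fit_transform(docs)

# Pairwise cosine similarities; document length no longer dominates the score.
print(cosine_similarity(matrix).round(2))
```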
If you care more about the semantics, word embeddings are your ally.
