Given this data (relative letter frequency from both languages):
spanish => 'e' => 13.72, 'a' => 11.72, 'o' => 8.44, 's' => 7.20, 'n' => 6.83,
english => 'e' => 12.60, 't' => 9.37, 'a' => 8.34, 'o' => 7.70, 'n' => 6.80,
And then computing the letter frequency for the string "this is a test" gives me:
"t"=>21.43, "s"=>14.29, "i"=>7.14, "r"=>7.14, "y"=>7.14, "'"=>7.14, "h"=>7.14, "e"=>7.14, "l"=>7.14
So, what would be a good approach for matching the given string's letter frequencies against a language (and thereby detecting the language)? I've seen (and tested) some examples using Levenshtein distance, and it seems to work fine until you add more languages.
"this is a test" gives (shortest distance:) [:english, 13] ...
"esto es una prueba" gives (shortest distance:) [:spanish, 13] ...
Have you considered using cosine similarity to measure how similar two frequency vectors are?
The first vector would be the letter frequencies extracted from the test string (to be classified), and the second vector would be for a specific language.
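For example, a minimal Python sketch of this idea (using only the five per-language frequencies quoted in the question as stand-ins for full tables; a real classifier would cover every letter):

    import math

    # Truncated frequency tables from the question; real tables would cover all letters.
    spanish = {'e': 13.72, 'a': 11.72, 'o': 8.44, 's': 7.20, 'n': 6.83}
    english = {'e': 12.60, 't': 9.37, 'a': 8.34, 'o': 7.70, 'n': 6.80}

    def letter_frequencies(text):
        """Relative frequency (in percent) of each letter in the text."""
        letters = [c for c in text.lower() if c.isalpha()]
        return {c: 100.0 * letters.count(c) / len(letters) for c in set(letters)}

    def cosine_similarity(freq_a, freq_b):
        """Cosine of the angle between two sparse frequency vectors."""
        keys = set(freq_a) | set(freq_b)
        dot = sum(freq_a.get(k, 0.0) * freq_b.get(k, 0.0) for k in keys)
        norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
        norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
        return dot / (norm_a * norm_b)

    sample = letter_frequencies("this is a test")
    scores = {lang: cosine_similarity(sample, table)
              for lang, table in (('spanish', spanish), ('english', english))}
    print(max(scores, key=scores.get))  # the language with the highest similarity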
You're currently extracting single-letter frequencies (unigrams). I would suggest extracting higher-order n-grams, such as bigrams or trigrams (and even larger if you have enough training data). For example, for bigrams you would compute the frequencies of "aa", "ab", "ac" ... "zz", which lets you capture more information than single-character frequencies alone.
Be careful, though: you need more training data when you use higher-order n-grams, otherwise you will have many zero values for character combinations you haven't seen before.
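As a rough sketch of the extraction step (a real model would be trained per language on a large corpus; the resulting dictionaries can be compared with the same cosine similarity as above):

    from collections import Counter

    def ngram_frequencies(text, n=2):
        """Relative frequencies of character n-grams (bigrams by default)."""
        text = ''.join(c for c in text.lower() if c.isalpha() or c == ' ')
        grams = [text[i:i + n] for i in range(len(text) - n + 1)]
        counts = Counter(grams)
        total = sum(counts.values())
        return {g: c / total for g, c in counts.items()}

    print(ngram_frequencies("this is a test", n=2))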
In addition, a second possibility is to use tf-idf (term-frequency inverse-document-frequency) weightings instead of pure letter (term) frequencies.
Research
Here is a good slideshow on language identification for (very) short texts, which uses machine learning classifiers (but also has some other good info).
Here is a short paper, "A Comparison of Language Identification Approaches on Short, Query-Style Texts", that you might also find useful.
The examples you gave each consisted of a short sentence. Statistically, if your input were longer (e.g. a paragraph), the characteristic frequencies should be easier to identify.
If you can't rely on the user giving a longer input, perhaps also look for common words of each language (e.g. is, as, and, but ...), in addition to matching the letter frequencies.
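A tiny sketch of that common-word idea (the stopword sets below are illustrative only; real lists would be much longer):

    # Hypothetical, very short stopword lists; real ones would be much longer.
    common_words = {
        'english': {'the', 'is', 'and', 'a', 'of', 'to'},
        'spanish': {'el', 'la', 'es', 'una', 'de', 'y'},
    }

    def stopword_votes(text):
        """Count how many known common words of each language appear in the text."""
        tokens = set(text.lower().split())
        return {lang: len(tokens & words) for lang, words in common_words.items()}

    print(stopword_votes("esto es una prueba"))  # spanish should get the most hits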
N-graphs (letter n-grams) certainly help with short texts, and help a great deal. With any reasonable-length text (a paragraph?), simple letter frequencies work well. As an example, I wrote a short demo of this; you can download the source at http://georgeflanagin.com/free.code.php
It's the last example on the page.
Related
I have just started a project in NLP. Suppose I have a graph for each word that shows the polarity distribution of sentiments for that word across different sentences. I want to know what I can use to recognize the sentiment of new words. If you have any other ideas in mind, I will be happy to hear them.
I apologize for any possible errors in my writing. Thanks a lot.
Assuming you've got some words that have been hand-labeled with positive/negative sentiments, but then you encounter some new words that aren't labeled:
If you encounter the new words totally alone, outside of any context, there's not much you can do. (Maybe you could go out and find extra texts containing those new words, e.g. via dictionaries or the web, then use those larger texts in the next approach.)
If you encounter the new words inside texts that also include some of your hand-labeled words, you could try guessing that the new words are most like the words you already know that are closest-to, or used-in-the-same-places. This would leverage what's called "the distributional hypothesis" – words with similar distributions have similar meanings – that underlies a lot of computer natural-language analysis, including word2vec.
One simple thing to try along these lines: across all your texts, for every unknown word U, tally up the counts of all neighboring words within N positions. (N could be 1, or larger.) From that, pick the top 5 words occurring most often near the unknown word, look up your prior labels, and average them together (perhaps weighted by the number of occurrences).
You'll then have a number for the new word.
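A minimal sketch of that tally-and-average idea (the tiny texts and labels here are made up purely for illustration):

    from collections import Counter

    # Made-up inputs: tokenized texts, and sentiment labels in [-1, 1] for known words.
    texts = [
        ['the', 'service', 'was', 'awful', 'and', 'sluggish'],
        ['sluggish', 'checkout', 'and', 'awful', 'support'],
    ]
    labels = {'awful': -1.0, 'great': 1.0}

    def guess_sentiment(unknown, texts, labels, window=2, top_k=5):
        """Average the labels of the known words seen most often near `unknown`."""
        neighbors = Counter()
        for tokens in texts:
            for i, tok in enumerate(tokens):
                if tok != unknown:
                    continue
                lo, hi = max(0, i - window), i + window + 1
                neighbors.update(t for t in tokens[lo:hi] if t != unknown)
        known = [(w, c) for w, c in neighbors.most_common() if w in labels][:top_k]
        if not known:
            return None  # no labeled neighbors were ever seen
        total = sum(c for _, c in known)
        return sum(labels[w] * c for w, c in known) / total  # count-weighted average

    print(guess_sentiment('sluggish', texts, labels))  # inherits the negative label of 'awful'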
Alternatively, you could train a word2vec set of word-vectors on all of your texts, including the unknown & known words. Then, ask that model for the N most-similar neighbors to your unknown word. (Again, N could be small or large.) Then, from among those neighbors with known labels, average them together (again perhaps weighted by similarity) to get a number for the previously unknown word.
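For the word2vec route, a sketch using gensim (assuming gensim 4.x is installed; texts and labels as in the previous sketch, though in practice you would want far more text than that):

    from gensim.models import Word2Vec  # assumes gensim 4.x

    model = Word2Vec(texts, vector_size=50, window=5, min_count=1, epochs=50)

    def guess_sentiment_w2v(unknown, model, labels, topn=20):
        """Similarity-weighted average of labels among the nearest known neighbors."""
        neighbors = model.wv.most_similar(unknown, topn=topn)
        known = [(w, sim) for w, sim in neighbors if w in labels]
        if not known:
            return None
        return sum(labels[w] * sim for w, sim in known) / sum(sim for _, sim in known)

    print(guess_sentiment_w2v('sluggish', model, labels))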
I wouldn't particularly expect either of these techniques to work very well. The idea that individual words can have specific sentiment is somewhat weak, given that in actual language their meaning is heavily modified, or even reversed, by the surrounding grammar/context. But in each case these simple calculate-from-neighbors techniques are probably better than random guesses.
If your real aim is to calculate the overall sentiment of longer texts, like sentences, paragraphs, reviews, etc., then you should discard your labels of individual words, acquire/create labels for full texts, and apply real text-classification techniques to those larger texts. A simple word-by-word approach won't do very well compared to such techniques, as long as they have plenty of labeled training data.
I have used OCR (optical character recognition) to get text from images. The images contain book covers. Because the images are so noisy, some characters are misrecognised, or some noise is recognised as characters.
Examples:
"w COMPUTER Nnwonxs i I "(Compuer Networks)
"s.ll NEURAL NETWORKS C "(Neural Networks)
"1llllll INFRODUCIION ro PROBABILITY ti iitiiili My "(Introduction of Probability)
I built a dictionary of words, but I want to somehow match the recognised text against the dictionary. I tried LCS (longest common subsequence), but it's not very effective.
What is the best string-matching algorithm for this kind of problem? (Part of the string is just noise, but the important part of the string can also contain misrecognised characters.)
That's really a big question. The following is what I know about it; for more details, you can read some related papers.
For a single word, use Hamming distance to calculate the similarity between the word recognised by OCR and the words in your dictionary;
this step corrects words that OCR produced but that do not exist in the dictionary.
E.g.:
If the OCR result is INFRODUCIION, which doesn't exist in your dictionary, you can find that the Hamming distance to the word 'INTRODUCTION' is 2. So 'INTRODUCTION' may have been misrecognised as 'INFRODUCIION'.
However, several dictionary words may sit at the same Hamming distance from the recognised word.
E.g.: if the OCR result is CAY, you may find that CAR and CAT both have a Hamming distance of 1, which is ambiguous.
In this case, there are several things that can be used to disambiguate:
Still at the single-word level, the visual difference between CAT and CAY is smaller than that between CAR and CAY. For this reason, CAT is the right word with greater probability.
Then use the context to calculate another probability. If the whole sentence is 'I drove my new CAY this morning', then since people usually drive a CAR and not a CAT, there is a better chance that CAY should be read as CAR rather than CAT.
For the frequency of words used in similar articles, use TF-IDF.
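A minimal sketch of the Hamming-distance step (the dictionary entries are just the examples above):

    def hamming(a, b):
        """Number of positions at which two equal-length strings differ."""
        return sum(x != y for x, y in zip(a, b))

    def closest_by_hamming(word, dictionary):
        """Same-length dictionary entries, sorted by Hamming distance to `word`."""
        same_length = [w for w in dictionary if len(w) == len(word)]
        return sorted((hamming(word.upper(), w.upper()), w) for w in same_length)

    dictionary = ['INTRODUCTION', 'PROBABILITY', 'NETWORKS', 'CAR', 'CAT']
    print(closest_by_hamming('INFRODUCIION', dictionary))  # INTRODUCTION at distance 2
    print(closest_by_hamming('CAY', dictionary))           # CAR and CAT tie at distance 1

Note that Hamming distance only compares strings of equal length, which is one reason OCR insertions/deletions usually push you toward Levenshtein distance instead (as the next answer suggests).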
Are you saying you have a dictionary that defines all words that are acceptable?
If so, it should be fairly straightforward to take each word and find the closest match in your dictionary. Set a match threshold and discard the word if it does not reach the threshold.
I would experiment with the Soundex and Metaphone algorithms or the Levenshtein Distance algorithm.
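As a sketch of the threshold idea using only Python's standard library (difflib's similarity ratio is not exactly Levenshtein distance, but it behaves similarly for this purpose; the dictionary here is just illustrative):

    import difflib

    dictionary = ['introduction', 'probability', 'neural', 'networks', 'computer']

    def best_match(word, dictionary, cutoff=0.7):
        """Closest dictionary entry above the similarity cutoff, or None (treat as noise)."""
        matches = difflib.get_close_matches(word.lower(), dictionary, n=1, cutoff=cutoff)
        return matches[0] if matches else None

    for token in '1llllll INFRODUCIION ro PROBABILITY ti iitiiili My'.split():
        print(token, '->', best_match(token, dictionary))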
Let's assume that we have following strings:
q8GDNG8h029751
DNS
stackoverflow.com
28743.8.4.919
q7Q5w5dP012855
Martin_Luther
0000000100000000-0000000160000000
1344444967\.962
ExTreme_penguin
Obviously our brains can classify some of those as strings containing information, strings that have some "meaning" for humans. On the other hand, there are strings like "q7Q5w5dP012855" that are clearly codes that could mean something only to a computer.
My question is: Can we calculate some probability that string can actually tell us something?
I have some thoughts, such as doing frequency analysis or counting capital letters, etc., but it would be convenient to have something more 'scientific'.
If you know the language the strings are in, you could use digram or trigram letter frequencies for the words in that language. These are quite small lookup tables, [26 x 26]
or [26 x 26 x 26]; each entry is a floating-point number giving the probability of that letter sequence occurring in the language. Many of these would be zero for a meaningless string. You could add them up, or simply count the number of zero-probability sequences.
Of course this needs setting up for each language.
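A rough sketch of the zero-count variant (english_corpus.txt is a hypothetical training file you would supply; with a real corpus, human-readable strings should give a much lower ratio of unseen bigrams than random codes):

    from collections import Counter

    def train_bigram_counts(corpus_text):
        """Count each letter bigram in a training corpus of the target language."""
        text = ''.join(c for c in corpus_text.lower() if c.isalpha())
        return Counter(text[i:i + 2] for i in range(len(text) - 1))

    def zero_bigram_ratio(s, bigram_counts):
        """Fraction of the string's letter bigrams never seen in the training corpus."""
        text = ''.join(c for c in s.lower() if c.isalpha())
        grams = [text[i:i + 2] for i in range(len(text) - 1)]
        if not grams:
            return 1.0  # nothing but digits/punctuation: treat as meaningless
        return sum(1 for g in grams if bigram_counts[g] == 0) / len(grams)

    counts = train_bigram_counts(open('english_corpus.txt').read())  # hypothetical corpus file
    print(zero_bigram_ratio('Martin_Luther', counts))    # expected: low ratio
    print(zero_bigram_ratio('q8GDNG8h029751', counts))   # expected: higher ratio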
In a part-of-speech tagger, the most probable tag sequence for a given sentence is determined using an HMM by
P(T*) = argmax_T  P(Word | Tag) * P(Tag | TagPrev)
But when 'Word' did not appear in the training corpus, P(Word | Tag) is zero for all possible tags, which leaves no basis for choosing the best one.
I have tried a few ways:
1) Assigning a small constant probability to all unknown words, P(UnknownWord | AnyTag) ~ epsilon. This effectively ignores P(Word | Tag) for unknown words, so the decision for an unknown word is made by the prior probabilities alone. As expected, it does not produce good results.
2) Laplace Smoothing
I am confused by this; I don't know what the difference between (1) and this is. As I understand it, Laplace smoothing adds a constant (lambda) to the counts of all unknown and known words, so all unknown words get a constant probability (a fraction of lambda), and the known words' probabilities stay the same relative to each other, since every word's count is increased by lambda.
Is Laplace smoothing the same as the previous approach?
*) Is there any better way of dealing with unknown words?
Your two approaches are similar, but, if I understand correctly, they differ in one key way. In (1) you are assigning extra mass to counts of unknown words and in (2) you are assigning extra mass to all counts. You definitely want to do (2) and not (1).
One of the problems with Laplace smoothing is that it gives too much of a boost to unknown words and drags down the probabilities of high-probability words too much (relatively speaking). Your version (1) would actually worsen this problem. Basically, it would over-smooth.
Laplace smoothing works OK for an HMM, but it's not great. Most people do add-one smoothing, but you could experiment with things like add-one-half or whatever.
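For reference, a minimal sketch of add-lambda smoothing of the emission probabilities (lam=1 is add-one/Laplace; the tiny corpus is made up):

    from collections import Counter

    def smoothed_emissions(tagged_corpus, lam=1.0):
        """Return P(word | tag) with add-lambda smoothing.

        tagged_corpus: list of (word, tag) pairs.
        """
        pair_counts = Counter(tagged_corpus)
        tag_counts = Counter(tag for _, tag in tagged_corpus)
        vocab_size = len({w for w, _ in tagged_corpus}) + 1  # +1 reserves mass for unseen words

        def emit(word, tag):
            return (pair_counts[(word, tag)] + lam) / (tag_counts[tag] + lam * vocab_size)

        return emit

    corpus = [('the', 'DET'), ('dog', 'NOUN'), ('barks', 'VERB'), ('the', 'DET'), ('cat', 'NOUN')]
    emit = smoothed_emissions(corpus)
    print(emit('dog', 'NOUN'))       # seen pair: higher probability
    print(emit('aardvark', 'NOUN'))  # unseen word: small but non-zero probability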
If you want to move beyond this naive approach to smoothing, check out "one-count smoothing", as described in the Appendix of Jason Eisner's HMM tutorial. The basic idea here is that for unknown words more probability mass should be given to tags that appear with a wider variety of low frequency words. For example, since the tag NOUN appears on a large number of different words and DETERMINER appears on a small number of different words, it is more likely that an unseen word will be a NOUN.
If you want to get even fancier, you could use a Chinese Restaurant Process model taken from non-parametric Bayesian statistics to put a prior distribution on unseen word/tag combinations. Kevin Knight's Bayesian inference tutorial has details.
I think the HMM-based TnT tagger provides a better approach to handle unknown words (see the approach in TnT tagger's paper).
The accuracy results (for known and unknown words) of TnT and two other POS and morphological taggers on 13 languages, including Bulgarian, Czech, Dutch, English, French, German, Hindi, Italian, Portuguese, Spanish, Swedish, Thai and Vietnamese, can be found in this article.
I want to be able to find sentences with the same meaning. I have a query sentence and a long list of millions of other sentences. Sentences are made of words, or of a special type of word called a symbol, which simply stands for some object being talked about.
For example, my query sentence is:
Example: add (x) to (y) giving (z)
There may be a list of sentences already existing in my database, such as:
1. the sum of (x) and (y) is (z)
2. (x) plus (y) equals (z)
3. (x) multiplied by (y) does not equal (z)
4. (z) is the sum of (x) and (y)
The example should match sentences 1, 2 and 4 in my database, but not 3. There should also be some weight (score) for each sentence match.
It's not just math sentences; it's any sentence which can be compared to any other sentence based upon the meaning of the words. I need some way to compare a sentence against many other sentences and find the ones with the closest relative meaning, i.e. a mapping between sentences based upon their meaning.
Thanks! (the tag is language-design as I couldn't create any new tag)
First off: what you're trying to solve is a very hard problem. Depending on what's in your dataset, it may be AI-complete.
You'll need your program to know or learn that add, plus and sum refer to the same concept, while multiplies is a different concept. You may be able to do this by measuring distance between the words' synsets in WordNet/FrameNet, though your distance calculation will have to be quite refined if you don't want to find multiplies. Otherwise, you may want to manually establish some word-concept mappings (such as {'add' : 'addition', 'plus' : 'addition', 'sum' : 'addition', 'times' : 'multiplication'}).
If you want full sentence semantics, you will in addition have to parse the sentences and derive the meaning from the parse trees/dependency graphs. The Stanford parser is a popular choice for parsing.
You can also find inspiration for this problem in Question Answering research. There, a common approach is to parse sentences, then store fragments of the parse tree in an index and search for them by common search engines techniques (e.g. tf-idf, as implemented in Lucene). That will also give you a score for each sentence.
You will need to stem the words in your sentences down to a common synonym, then compare those stems and use the ratio of stem matches in a sentence (e.g. 5 out of 10 words) against some threshold to decide that the sentence is a match, for example all sentences with a word match of over 80% (or whatever percentage you deem accurate). At least that is one way to do it.
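A minimal sketch of the stem-overlap ratio using NLTK's Porter stemmer (note that a plain stemmer only normalises word forms like add/adding; it will not map add to sum, as a later answer points out):

    from nltk.stem import PorterStemmer  # assumes NLTK is installed

    stemmer = PorterStemmer()

    def stem_set(sentence):
        """Set of word stems in a sentence (very rough whitespace tokenization)."""
        return {stemmer.stem(w) for w in sentence.lower().split()}

    def stem_overlap(query, candidate):
        """Fraction of the query's stems that also appear in the candidate."""
        q, c = stem_set(query), stem_set(candidate)
        return len(q & c) / len(q) if q else 0.0

    query = 'add (x) to (y) giving (z)'
    candidates = [
        'the sum of (x) and (y) is (z)',
        '(x) plus (y) equals (z)',
        '(x) multiplied by (y) does not equal (z)',
    ]
    for cand in candidates:
        print(round(stem_overlap(query, cand), 2), cand)  # keep those above your threshold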
Write a function which creates some kind of hash, or "expression", from a sentence, which can easily be compared with other sentences' hashes.
Roughly:
1. "the sum of (x) and (y) is (z)" => x + y = z
4. "(z) is the sum of (x) and (y)" => z = x + y
Some tips for the transformation: omit words like "the", convert two-word terms into a single word ("sum of" => "sumof"), find the operator word and replace "and" with it.
Not that easy ^^
You should use a stopword filter first, to get non-information-bearing words out of it. Here are some good ones
Then you want to handle synonyms. That's actually a really complex topic, because you need some kind of word sense disambiguation to do it, and most state-of-the-art methods are only a little better than the easiest solution: take the most-used meaning of a word. You can do that with WordNet: you can get the synsets for a word, which contain all of its synonyms. Then you can generalize the word (that's called a hypernym), take the most-used meaning, and replace the search term with it.
Just to be clear, handling synonyms is pretty hard in NLP. If you just want to handle different word forms, like add and adding for example, you could use a stemmer, but no stemmer will help you get from add to sum (WSD is the only way there).
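For illustration, the most-used-meaning shortcut with NLTK's WordNet interface (requires nltk and the wordnet corpus to be downloaded):

    from nltk.corpus import wordnet  # requires: import nltk; nltk.download('wordnet')

    def first_synset_lemmas(word, pos=wordnet.VERB):
        """Synonyms of the most common sense; WordNet lists synsets roughly by frequency."""
        synsets = wordnet.synsets(word, pos=pos)
        return synsets[0].lemma_names() if synsets else [word]

    print(first_synset_lemmas('add'))                               # synonyms of the top verb sense
    print(wordnet.synsets('sum', pos=wordnet.NOUN)[0].hypernyms())  # a hypernym to generalize with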
And then you have different word orderings in your sentences, which shouldn't be ignored either if you want exact answers (x+y=z is different from x+z=y). So you also need word dependencies, so you can see which words depend on each other. The Stanford Parser is actually the best for that task if you want to use English.
Perhaps you should just extract the nouns and verbs from a sentence, do all the preprocessing on them, and query the dependencies in your search index.
A dependency would look like
x (sum, y)
y (sum, x)
sum (x, y)
which you could use for your search.
So you need to tokenize, generalize, get dependencies, and filter unimportant words to get your result. And if you want to do it in German, you need a word decompounder as well.
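If you'd rather not set up the Stanford Parser, here is a sketch of extracting dependencies with spaCy instead (an alternative library, not what the answer above used; assumes the en_core_web_sm model is installed):

    import spacy  # alternative to the Stanford Parser; assumes en_core_web_sm is installed

    nlp = spacy.load('en_core_web_sm')
    doc = nlp('the sum of x and y is z')

    for token in doc:
        # e.g. which words depend on 'sum', and what 'sum' itself depends on
        print(token.text, token.dep_, '->', token.head.text)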