I am building a very basic result ranking algorithm, and one thing I'd like is a way to determine which words are generally more important in a given phrase. It doesn't have to be exact, just general.
Obviously I'd drop any word under 4 letters and identify names. But what other ways can I pick out the 3 most significant words in a sentence?
In the absence of any other information, it is fair to assume that important words are rare words. Count how many times each word appears in your set of documents. The words with the lowest counts are more important, while the words with the highest counts are less important (if not nearly useless).
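As a rough illustration, here's a minimal Java sketch of that idea; the document list, the 4-letter cutoff, and the simple punctuation-based tokenization are just assumptions for demonstration:

import java.util.*;
import java.util.stream.*;

public class RareWords {
    // Count how often each word appears across all documents.
    static Map<String, Long> wordCounts(List<String> documents) {
        return documents.stream()
                .flatMap(doc -> Arrays.stream(doc.toLowerCase().split("\\W+")))
                .filter(w -> w.length() >= 4)                  // drop very short words, as suggested
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    // Pick the n rarest words of a phrase, judged against the whole document set.
    static List<String> significantWords(String phrase, Map<String, Long> counts, int n) {
        return Arrays.stream(phrase.toLowerCase().split("\\W+"))
                .filter(counts::containsKey)
                .distinct()
                .sorted(Comparator.comparing(counts::get))     // rarest first
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "the quick brown fox jumps over the lazy dog",
                "the dog barks at the quick fox",
                "a peculiar fox admires the aurora");
        Map<String, Long> counts = wordCounts(docs);
        System.out.println(significantWords("the quick peculiar fox", counts, 3));
    }
}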
Related reading:
http://en.wikipedia.org/wiki/Stop_words
http://en.wikipedia.org/wiki/Googlewhack
http://en.wikipedia.org/wiki/Statistically_Improbable_Phrases
What is the exact meaning of lexicographical order? How is it different from alphabetical order?
Lexicographical order is alphabetical order. The other type is numerical ordering. Consider the following values:
1, 10, 2
Those values are in lexicographical order. 10 comes after 2 in numerical order, but 10 comes before 2 in "alphabetical" order.
Alphabetical order is a specific kind of lexicographical ordering. The term lexicographical often refers to the mathematical rules of sorting. These include, for example, proving logically that sorting is possible. Read more about lexicographical order on Wikipedia.
Alphabetical ordering includes variants that differ in how they handle spaces, uppercase characters, numerals, and punctuation. Purists believe that allowing characters other than a-z makes the sort not "alphabetic" and that it must therefore fall into the larger class of "lexicographic". Again, Wikipedia has additional details.
In computer programming, a related question is dictionary order versus ASCII code order. In dictionary order, the uppercase "A" sorts adjacent to the lowercase "a". However, in many computer languages, the default string compare uses ASCII codes. With ASCII, all uppercase letters come before any lowercase letters, which means that "Z" will sort before "a". This is sometimes called ASCIIbetical order.
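To see the difference in Java (the word list is made up purely for demonstration):

import java.util.*;

public class SortOrders {
    public static void main(String[] args) {
        List<String> words = new ArrayList<>(List.of("apple", "Zebra", "banana"));

        // Default String ordering compares char values (ASCIIbetical):
        // all uppercase letters sort before any lowercase letter.
        words.sort(Comparator.naturalOrder());
        System.out.println(words);                       // [Zebra, apple, banana]

        // Case-insensitive ordering is closer to dictionary order.
        words.sort(String.CASE_INSENSITIVE_ORDER);
        System.out.println(words);                       // [apple, banana, Zebra]
    }
}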
This simply means "dictionary order", i.e., the way in which words are ordered in a dictionary. If you were to determine which one of the two words would come before the other in a dictionary, you would compare the words letter by the letter starting from the first position. For example, the word "children" will appear before (and can be considered smaller) than the word "chill" because the first four letters of the two words are the same but the letter at the fifth position in "children" (i.e. d ) comes before (or is smaller than) the letter at the fifth position in "chill" (i.e. l ). Observe that lengthwise, the word "children" is bigger than "chill" but length is not the criteria here. For the same reason, an array containing 12345 will appear
before an array containing 1235. (Deshmukh, OCP Java SE 11 Programmer I 1Z0815 Study guide 2019)
Lexicographical ordering means dictionary order.
For example: in a dictionary, 'ado' comes after 'adieu' because 'o' comes after 'i' in the English alphabet.
This ordering is not based on the length of the string, but on the first position at which the letters differ.
I want to add an answer that is more related to the programming side of the term rather than the mathematical side of it.
Lexicographical order is not always equivalent to "dictionary order"; at least, that definition is not complete in the realm of programming. Rather, it refers to an ordering based on multiple criteria.
For example, almost all popular programming languages have standard tools for sorting collections of objects. Now what if you want to sort a collection based on more than one thing? For instance, let's say you want to sort some items based on their prices first AND then based on their popularity. This is an example of lexicographical order.
For example in Java (8+), you could do something like this:
// sorts items from the cheapest AND the most popular ones
// towards the most expensive AND the least popular ones.
Collections.sort(items,
    Comparator.comparing(Item::price)
              .thenComparing(Item::popularity, Comparator.reverseOrder())
);
And the Java documentation uses this term too, to refer to this type of ordering, when explaining the "thenComparing()" method:
Returns a lexicographic-order comparator with another comparator.
Lexicographical order is nothing but dictionary order, i.e., the order in which words appear in a dictionary. For example, let's take three strings: "short", "shorthand" and "small". In the dictionary, "short" comes before "shorthand", and "shorthand" comes before "small". This is lexicographical order.
I have a huge list of strings (city-names) and I want to find the name of a city even if the user makes a typo.
Example
User types "chcago" and the system finds "Chicago"
Of course I could calculate the Levenshtein distance of the query for all strings in the list but that would be horribly slow.
Is there any efficient way to perform this kind of string-matching?
I think the basic idea is to use Levenshtein distance, but on a subset of the names. One approach that works if the names are long enough is to use n-grams. You can store the n-grams and then use more efficient techniques to require that at least x n-grams match. Alas, your example misspelling has only 2 matching 3-grams with "Chicago" out of 5 (unless you count partials at the beginning and end).
For shorter names, another approach is to store the set of letters in each name. So "Chicago" would turn into 6 "tuples": "c", "h", "i", "a", "g", "o". You would do the same for the name entered and then require that 4 or 5 match. This is a fairly simple match operation, so it can go quite fast.
Then, on this reduced set, apply Levenshtein distance to determine what the closest match is.
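Here's a minimal Java sketch of that two-stage idea; the letter-overlap threshold (allowing two missing letters) and the city list are assumptions chosen only for illustration:

import java.util.*;

public class FuzzyCityLookup {
    // Distinct letters of a lowercased name, e.g. "Chicago" -> {c, h, i, a, g, o}.
    static Set<Character> letters(String s) {
        Set<Character> set = new HashSet<>();
        for (char c : s.toLowerCase().toCharArray()) set.add(c);
        return set;
    }

    // Classic dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    static String closest(String query, List<String> cities) {
        Set<Character> queryLetters = letters(query);
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        for (String city : cities) {
            // Cheap filter: require most of the query's letters to appear in the candidate.
            Set<Character> common = letters(city);
            common.retainAll(queryLetters);
            if (common.size() + 2 < queryLetters.size()) continue;
            // Expensive check only on the reduced set.
            int d = levenshtein(query.toLowerCase(), city.toLowerCase());
            if (d < bestDist) { bestDist = d; best = city; }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> cities = List.of("Chicago", "Boston", "Seattle", "Houston");
        System.out.println(closest("chcago", cities));   // Chicago
    }
}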
You're asking to determine Levenshtein without using Levenshtein.
You would have to determine how far a word can deviate before it can no longer be identified, and decide whether it would be acceptable to apply a less accurate algorithm. For instance, you could look up commonly transposed or mistyped letters and limit the search to those. Or apply the first/last letter rule from this paper. You could also assume the first few letters are correct, look up the city in a sorted list, and if you don't find it, apply Levenshtein to the n-1 and n+1 words, where n is the location of the last lookup (or some variant of it).
There are several ideas, but I don't think there is a single best solution for what you are asking, without more assumptions.
An efficient way to search for fuzzy matches on a text string based on Levenshtein distance (or any other metric that obeys the triangle inequality) is a Levenshtein automaton. It's implemented in the Lucene project (Java) and in particular in the Lucene.net project (C#). This method works fast, but is very complex to implement.
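For reference, a rough sketch of what a fuzzy lookup might look like with Lucene's FuzzyQuery (which is backed by a Levenshtein automaton); the field name, the in-memory index, and the maxEdits value are assumptions, and the exact API varies between Lucene versions:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;

public class LuceneFuzzyCity {
    public static void main(String[] args) throws Exception {
        // Build a tiny in-memory index of city names.
        Directory dir = new ByteBuffersDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
        for (String city : new String[] {"Chicago", "Boston", "Seattle"}) {
            Document doc = new Document();
            doc.add(new TextField("name", city, Field.Store.YES));
            writer.addDocument(doc);
        }
        writer.close();

        // FuzzyQuery with maxEdits = 2 finds terms within 2 edits of the query term.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        TopDocs hits = searcher.search(new FuzzyQuery(new Term("name", "chcago"), 2), 10);
        for (ScoreDoc hit : hits.scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("name"));   // Chicago
        }
    }
}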
I'm trying to group redundancies in a dataset for some analysis. My primary tool for the analysis is their titles.
I might have things like "blue bird" "big blue bird" "brown dog" "red dog", etc.
In this case, I want to group "blue bird" and "big blue bird" together but none of the other elements should be grouped.
I know about String Metrics, but, in general, how effective are they on phrases as opposed to single words or noisy strings, and which would be an effective solution for this problem?
You could use the same logic that people usually put into programs to sort an array: fix a variable (in this case a string, say the first word) and compare it with the strings you have, always looking for an equal word; when one is equal, place it in a separate vector or at a specific position.
However, doing so you would spend a lot of time, and it is probably not the best way to go because it works phrase by phrase, word by word, letter by letter. Instead, it may help to separate the strings into large groups by the initial letter of the first word. This way you spend less time searching for repeated words, which also optimizes memory use.
I found this paper from Carnegie Mellon University; it seems very interesting and talks about this problem, so you should take a closer look:
String Metric
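A rough Java sketch of that bucketing idea; the titles and the "group by the first letter of the first word" rule are just assumptions for illustration:

import java.util.*;
import java.util.stream.*;

public class TitleBuckets {
    public static void main(String[] args) {
        List<String> titles = List.of("blue bird", "big blue bird", "brown dog", "red dog");

        // Coarse pass: bucket titles by the initial letter of their first word,
        // so finer comparisons only run within a bucket instead of over everything.
        Map<Character, List<String>> buckets = titles.stream()
                .collect(Collectors.groupingBy(t -> Character.toLowerCase(t.charAt(0))));

        buckets.forEach((letter, group) -> System.out.println(letter + " -> " + group));
        // b -> [blue bird, big blue bird, brown dog]
        // r -> [red dog]
    }
}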
String metrics don't care whether your strings contain spaces or not. Thus phrases are mostly just longer strings than single words (in this regard), so string metrics work just as well if you are performing a fuzzy search (although you might want to search for every word individually).
Since you seem to be looking for exact matches, though, I would recommend building a suffix tree from the concatenation of your titles. You can then search that tree for each of your titles and build title groups whenever you get more than one match. However, you will need to decide what you want to do with combinations like
blue bird
big blue bird
small blue bird
Following the brown/red dog example, you would not want to group "big blue bird" with "small blue bird", but "blue bird" would be grouped with both of these.
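This isn't a full suffix tree, but here is a small Java sketch of the exact-match grouping behaviour described above, assuming the rule is "one title's words appear as a contiguous sequence inside another title":

import java.util.*;

public class TitleGrouping {
    // True if every word of "inner" appears as a contiguous word sequence in "outer".
    static boolean containsPhrase(String outer, String inner) {
        return (" " + outer + " ").contains(" " + inner + " ");
    }

    public static void main(String[] args) {
        List<String> titles = List.of("blue bird", "big blue bird", "small blue bird",
                                      "brown dog", "red dog");
        for (String a : titles) {
            List<String> group = new ArrayList<>();
            for (String b : titles) {
                if (!a.equals(b) && containsPhrase(b, a)) group.add(b);
            }
            if (!group.isEmpty()) System.out.println(a + " groups with " + group);
        }
        // blue bird groups with [big blue bird, small blue bird]
    }
}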
Suppose the given word is" connnggggggrrrraaatsss" and we need to convert it to congrats .
Or for other example is "looooooovvvvvveeeeee" should be changed to "love" .
Here the given words can be repeated for any number of times but it should be changed to correct form. We need to write a java based program.
You cannot simply collapse every repeated letter, because certain words legitimately have more than one of the same letter in a row in their spelling. So one way you could go is:
check each letter in the word and restrict its number of consecutive appearances to two
then run the new spelling through a spell checker; you might want to try Hunspell, as it is widely used by many word-processing programs
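A small Java sketch of that approach; the tiny word set below merely stands in for a real spell checker such as Hunspell:

import java.util.*;

public class Deduplicator {
    // Stand-in for a real spell checker: just a small word set.
    static final Set<String> DICTIONARY = Set.of("congrats", "love", "coffee", "cool");

    // Split the word into runs of repeated letters, cap each run at two,
    // then try every combination of keeping one or two letters per run
    // and return the first candidate the dictionary recognizes.
    static String normalize(String word) {
        List<int[]> runs = new ArrayList<>();            // {letter, maxKeep}
        for (int i = 0; i < word.length(); ) {
            int j = i;
            while (j < word.length() && word.charAt(j) == word.charAt(i)) j++;
            runs.add(new int[] {word.charAt(i), Math.min(j - i, 2)});
            i = j;
        }
        return search(runs, 0, new StringBuilder());
    }

    static String search(List<int[]> runs, int index, StringBuilder prefix) {
        if (index == runs.size()) {
            String candidate = prefix.toString();
            return DICTIONARY.contains(candidate) ? candidate : null;
        }
        int[] run = runs.get(index);
        for (int keep = 1; keep <= run[1]; keep++) {     // try one, then two, of this letter
            int len = prefix.length();
            for (int k = 0; k < keep; k++) prefix.append((char) run[0]);
            String found = search(runs, index + 1, prefix);
            prefix.setLength(len);
            if (found != null) return found;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(normalize("connnggggggrrrraaatsss"));   // congrats
        System.out.println(normalize("looooooovvvvvveeeeee"));     // love
        System.out.println(normalize("cooool"));                   // cool
    }
}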
I want to know the best way to rank sentences based on similarity from a set of documents.
For example, let's say:
1. There are 5 documents.
2. Each document contains many sentences.
3. Let's take Document 1 as the primary, i.e., the output will contain sentences from this document.
4. The output should be a list of sentences ranked in such a way that the sentence with the FIRST rank is the most similar sentence across all 5 documents, then the 2nd, then the 3rd...
Thanks in advance.
I'll cover the basics of textual document matching...
Most document similarity measures work on a word basis, rather than sentence structure. The first step is usually stemming. Words are reduced to their root form, so that different forms of similar words, e.g. "swimming" and "swims", match.
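A deliberately naive Java illustration of stemming (a real system would use a proper stemmer such as the Porter stemmer; the suffix list here is just an assumption for demonstration):

import java.util.List;

public class NaiveStemmer {
    // Strip a common suffix so different forms of a word collapse to a shared root.
    static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : List.of("ing", "ed", "es", "s")) {
            if (w.length() > suffix.length() + 2 && w.endsWith(suffix)) {
                w = w.substring(0, w.length() - suffix.length());
                // Collapse a doubled final consonant left behind, e.g. "swimm" -> "swim".
                int n = w.length();
                if (n >= 2 && w.charAt(n - 1) == w.charAt(n - 2)) w = w.substring(0, n - 1);
                return w;
            }
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("swimming"));   // swim
        System.out.println(stem("swims"));      // swim
    }
}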
Additionally, you may wish to filter the words you match to avoid noise. In particular, you may wish to ignore occurrences of "the" and "a". In fact, there are a lot of conjunctions and pronouns that you may wish to omit, so usually you will have a long list of such words; this is called a "stop list".
Furthermore, there may be bad words you wish to avoid matching, such as swear words or racial slurs. So you may have another exclusion list with such words in it, a "bad list".
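A minimal Java sketch of such filtering; the word lists here are tiny placeholders, real stop lists are much longer:

import java.util.*;
import java.util.stream.*;

public class WordFilter {
    static final Set<String> STOP_LIST = Set.of("the", "a", "an", "and", "of", "to");
    static final Set<String> BAD_LIST  = Set.of("darn");   // placeholder entry

    // Tokenize, lowercase, and drop anything on either exclusion list.
    static List<String> filter(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .filter(w -> !STOP_LIST.contains(w) && !BAD_LIST.contains(w))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(filter("The cat sat on a mat"));   // [cat, sat, on, mat]
    }
}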
So now you can count similar words in documents. The question becomes how to measure total document similarity. You need to create a score function that takes the shared words as input and gives a value of "similarity". Such a function should give a high value if the same word appears multiple times in both documents. Additionally, such matches are weighted by overall word frequency, so that matches on uncommon words are given more statistical weight.
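A toy Java sketch of one possible score function along these lines; the inverse-collection-frequency weighting is just one common choice, not a prescribed formula:

import java.util.*;
import java.util.stream.*;

public class SimilarityScore {
    static Map<String, Long> counts(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    // Score two documents: each shared word contributes the product of its counts,
    // down-weighted by how common the word is in the whole collection.
    static double score(String a, String b, Map<String, Long> collectionCounts) {
        Map<String, Long> ca = counts(a), cb = counts(b);
        double total = 0;
        for (String word : ca.keySet()) {
            if (cb.containsKey(word)) {
                long collectionFreq = collectionCounts.getOrDefault(word, 1L);
                total += (ca.get(word) * cb.get(word)) / (double) collectionFreq;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("the whale swims in the sea",
                                    "a whale was seen near the harbour",
                                    "the market opened near the harbour");
        Map<String, Long> collection = counts(String.join(" ", docs));
        System.out.println(score(docs.get(0), docs.get(1), collection));
        System.out.println(score(docs.get(0), docs.get(2), collection));
    }
}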
Apache Lucene is an open-source search engine written in Java that provides practical detail about these steps. For example, here is the information about how they weight query similarity:
http://lucene.apache.org/java/2_9_0/api/all/org/apache/lucene/search/Similarity.html
Lucene combines the Boolean model (BM) of Information Retrieval with the Vector Space Model (VSM) of Information Retrieval - documents "approved" by BM are scored by VSM.
All of this is really just about matching words in documents. You did specify matching sentences. For most people's purposes, matching words is more useful as you can have a huge variety of sentence structures that really mean the same thing. The most useful information of similarity is just in the words. I've talked about document matching, but for your purposes, a sentence is just a very small document.
Now, as an aside, if you don't care about the actual nouns and verbs in the sentence and only care about grammar composition, you need a different approach...
First you need a link grammar parser to interpret the language and build a data structure (usually a tree) that represents the sentence. Then you have to perform inexact graph matching. This is a hard problem, but there are algorithms to do this on trees in polynomial time.
As a starting point you can compute the soundex code for each word and then compare documents based on the frequencies of the soundex codes.
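For illustration, a compact (slightly simplified) Soundex encoder in Java; a production system might instead use a library implementation such as the one in Apache Commons Codec:

public class SoundexDemo {
    // American Soundex: keep the first letter, map the remaining letters to digits,
    // drop vowels and adjacent duplicate codes, then pad/truncate to four characters.
    static String soundex(String word) {
        String codes = "01230120022455012623010202";   // digit for each letter a..z
        String s = word.toUpperCase().replaceAll("[^A-Z]", "");
        if (s.isEmpty()) return "";
        StringBuilder out = new StringBuilder().append(s.charAt(0));
        char prev = codes.charAt(s.charAt(0) - 'A');
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char code = codes.charAt(s.charAt(i) - 'A');
            if (code != '0' && code != prev) out.append(code);
            prev = code;
        }
        while (out.length() < 4) out.append('0');
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(soundex("Robert"));    // R163
        System.out.println(soundex("Rupert"));    // R163 - same code despite the typo-like difference
        System.out.println(soundex("swimming"));  // S552
    }
}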
Tim's overview is very nice. I'd just like to add that for your specific use case, you might want to treat the sentences from Doc 1 as documents themselves, and compare their similarity to each of the four remaining documents. This might give you a quick aggregate similarity measure per sentence without forcing you to go down the route of syntax parsing etc.