Detecting syllables in a word containing non-alphabetical characters - nlp

I'm implementing a readability test and have implemented a simple algorithm for detecting syllables.
I detect sequences of vowels and count them per word; for example, the word "should" contains one vowel sequence, 'ou'. Before counting, I remove suffixes like -les, -e, -ed (for example, the word "like" contains one syllable but two vowel sequences, so this step makes the method work).
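Roughly, my counting step is a sketch like this (the suffix list is simplified):

    import re

    def count_syllables(word):
        # Rough syllable count: strip a few silent suffixes, then count vowel groups.
        w = word.lower()
        for suffix in ("les", "ed", "e"):
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                w = w[:-len(suffix)]
                break
        # Count maximal runs of vowels, e.g. "should" -> ["ou"] -> 1 syllable.
        return max(1, len(re.findall(r"[aeiouy]+", w)))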
But...
Consider these words / sequences:
x-ray (it contains two syllables)
I'm (one syllable; maybe I should just remove all apostrophes from the text?)
goin'
I'd've
n' (for example Pork n' Beans)
3rd (how should I treat this?)
12345
What should I do with special characters? Remove them all? That would be fine for most words, but not for "n'" and "x-ray". And how should I treat digits?
These are special cases, but I'd be very glad to hear any experience or ideas on this subject.

I'd advise you to first determine how much of your data consists of these kinds of words and how much it matters to your program's overall performance. Also compile some statistics of which kinds occur most.
There's no simple correct solution for this problem, but I can suggest a few heuristics:
A ' between two consonants (shouldn't) seems to mark the elision of a syllable
A ' with a vowel or word boundary on one side (I'd, goin') seems not to do so (but note that goin' is still two syllables)
Any word, including n' is at least one syllable long
Dashes (-) may be handled by treating the text on both sides as separate words
3rd can be handled by code that writes ordinals out as words, or by simpler heuristics.
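A rough sketch of how these heuristics might be wired together (count_word_syllables stands in for whatever vowel-group counter you already have; all names here are illustrative):

    def syllables_in_token(token, count_word_syllables):
        # Dashes: treat the text on both sides as separate words ("x-ray" -> "x" + "ray").
        if "-" in token:
            return sum(syllables_in_token(part, count_word_syllables)
                       for part in token.split("-") if part)
        # Ordinals and other digit strings: ideally spell them out as words first;
        # here we just fall back to a crude guess of one syllable.
        if any(ch.isdigit() for ch in token):
            return 1
        # Apostrophes: between two consonants ("shouldn't") they mark an elided
        # syllable, and next to a vowel or word boundary ("I'd") dropping them is
        # harmless for vowel-group counting (though "goin'" will still be undercounted).
        token = token.replace("'", "")
        # Any word, including "n'", is at least one syllable long.
        return max(1, count_word_syllables(token)) if token else 1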

Related

How to find strings of a list in a text with typos

I'm trying to check whether any of the strings in a list appear in a given text, but the given text can have some typos. For example, let's take this.
text: The brownw focx and the cat are in th eforest.
and my list is: [brown fox, forest, cat]
What I do actually to do this is that I separate my text in multiple groups, groups of one word and two words like so:
[The, brownw, focx, and, the, cat, are, in, th, eforest, The brownw, brownw focx, focx and, and the, the cat, cat are, are in, in th, th eforest]
Then I iterate over each group of words and check with the Levenshtein algorithm how closely the two strings match each other. If the similarity is more than 90%, I consider them the same.
This approach however is very time consuming and I wonder if I can find an alternative to this.
Instead of using the full Levenshtein distance (which is slow to compute), you could do a couple of sanity checks beforehand, to try and exclude candidates which are obviously wrong:
word length: the will never match brown fox, as it is far too short. Count the word length, and exclude all candidates that are more than a few letters shorter or longer.
letters: just check what letters are in the word. For example, the does not contain a single letter from fox, so you can rule it out straightaway. With short words it might not make a big difference in performance, but for longer words it will. Additional optimisation: look for rare characters (x, q, w) first, or simply ignore common ones (e, t, s) which are more likely to be present anyway.
Heuristics such as these will of course not give you the right answer, but they can help to filter out those that are definitely not going to match. Then you only need to perform the more expensive full check on a much smaller number of candidate words.
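For example, a couple of cheap pre-checks might look like this (the thresholds and the levenshtein_ratio function are placeholders for whatever you already use):

    def could_match(candidate, target, max_len_diff=3, rare_letters="jqxzw"):
        # Length check: something far shorter or longer than the target
        # can never reach ~90% similarity.
        if abs(len(candidate) - len(target)) > max_len_diff:
            return False
        # Letter check: rare letters of the target should also appear in the candidate.
        candidate_letters = set(candidate)
        for ch in set(target) & set(rare_letters):
            if ch not in candidate_letters:
                return False
        return True

    # Only run the expensive similarity computation on the survivors:
    # matches = [g for g in groups
    #            if could_match(g, "brown fox") and levenshtein_ratio(g, "brown fox") > 0.9]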

Algorithm to un-concatenate words from string without spaces and punctuation

I've been given a problem in my data structures class and need to find a solution. It's similar to an interview question. Could someone explain the thinking process or a solution to the problem? Pseudocode can be used. So far I've been thinking of using a trie to hold the dictionary and look up words that way for efficiency.
This is the problem:
Oh, no! You have just completed a lengthy document when you have an unfortunate Find/Replace mishap. You have accidentally removed all spaces, punctuation, and capitalization in the document. A sentence like "I reset the computer. It still didn't boot!" would become "iresetthecomputeritstilldidntboot". You figure that you can add back in the punctuation and capitalization later, once you get the individual words properly separated. Most of the words will be in a dictionary, but some strings, like proper names, will not.
Given a dictionary (a list of words), design an algorithm to find the optimal way of "unconcatenating" a sequence of words. In this case, "optimal" is defined to be the parsing which minimizes the number of unrecognized sequences of characters.
For example, the string "jesslookedjustliketimherbrother" would be optimally parsed as "JESS looked just like TIM her brother". This parsing has seven unrecognized characters, which we have capitalized for clarity.
For each index, n, into the string, compute the cost C(n) of the optimal solution (i.e. the number of unrecognised characters in the optimal parsing) starting at that index.
Then, the solution to your problem is C(0).
There's a recurrence relation for C. At each n, either you match a word of i characters, or you skip over character n, incurring a cost of 1, and then parse the rest optimally. You just need to find which of those choices incurs the lowest cost.
Let N be the length of the string, and let W(n) be a set containing the lengths of all words starting at index n in your string. Then:
C(N) = 0
C(n) = min({C(n+1) + 1} union {C(n+i) for i in W(n)})
This can be implemented using dynamic programming by constructing a table of C(n) starting from the end backwards.
If the length of the longest word in your dictionary is L, then the algorithm runs in O(NL) time in the worst case and can be implemented to use O(L) memory if you're careful.
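A minimal sketch of that table construction in Python, using a plain set for the dictionary rather than a trie (this version keeps the whole table, so it uses O(N) memory):

    def min_unrecognized(s, dictionary):
        words = set(dictionary)
        max_len = max((len(w) for w in words), default=0)
        N = len(s)
        C = [0] * (N + 1)              # C(N) = 0
        for n in range(N - 1, -1, -1):
            best = C[n + 1] + 1        # skip character n: one unrecognized character
            for i in range(1, min(max_len, N - n) + 1):
                if s[n:n + i] in words:  # i is in W(n)
                    best = min(best, C[n + i])
            C[n] = best
        return C[0]

    # e.g. min_unrecognized("jesslookedjustliketimherbrother",
    #                       {"looked", "just", "like", "her", "brother"})  -> 7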
You could use rolling hashes of different lengths to speed up the search.
You can try a partial pattern matcher, for example the Aho-Corasick algorithm. Basically it builds a trie of the dictionary words augmented with failure links, so all dictionary matches in the text can be found in a single pass.

Algorithm (or pointer to literature) sought for string processing challenge

A group of amusing students write essays exclusively by plagiarising portions of the complete works of William Shakespeare. At one end of the scale, an essay might consist exclusively of a verbatim copy of a soliloquy... at the other, one might see work so novel that - while using a common alphabet - no two adjacent characters in the essay were used adjacently by Will.
Essays need to be graded. A score of 1 is assigned to any essay which can be found (character-by-character identical) in the plain-text of the complete works. A score of 2 is assigned to any work that can be successfully constructed from no fewer than two distinct (character-by-character identical) passages in the complete works, and so on... up to the limit - for an essay with N characters - which scores N if, and only if, no two adjacent characters in the essay were also placed adjacently in the complete works.
The challenge is to implement a program which can efficiently (and accurately) score essays. While any (practicable) data-structure to represent the complete works is acceptable - the essays are presented as ASCII strings.
Having considered this teasing question for a while, I came to the conclusion that it is much harder than it sounds. The naive solution, for an essay of length N, involves 2**(N-1) traversals of the complete works - which is far too inefficient to be practical.
While, obviously, I'm interested in suggested solutions - I'd also appreciate pointers to any literature that deals with this, or any similar, problem.
CLARIFICATIONS
Perhaps some examples (ranging over much shorter strings) will help clarify the 'score' for 'essays'?
Assume Shakespeare's complete works are abridged to:
"The quick brown fox jumps over the lazy dog."
Essays scoring 1 include "own fox jump" and "The quick brow". The essay "jogging" scores 6 (despite being short) because it can't be represented in fewer than 6 segments of the complete works... It can be segmented into six strings that are all substrings of the complete works as follows: "[j][og][g][i][n][g]". N.B. Establishing scores for this short example is trivial compared to the original problem, because in this example's "complete works" there is very little repetition.
Hopefully, this example segmentation helps clarify the 2**(N-1) figure. Each of the (N-1) gaps between the N characters in the essay may either be a break between segments, or not... resulting in ~2**(N-1) segmentation hypotheses, each requiring substring searches of the complete works to test.
An (N)DFA would be a wonderful solution - if it were practical. I can see how to construct something that solved 'substring matching' in this way - but not scoring. The state space for scoring, on the surface at least, seems wildly too large (for any substantial complete works of Shakespeare). I'd welcome any explanation that undermines my assumption that the (N)DFA would be too large to be practical to compute/store.
A general approach for plagiarism detection is to append the student's text to the source text separated by a character not occurring in either and then to build either a suffix tree or suffix array. This will allow you to find in linear time large substrings of the student's text which also appear in the source text.
I find it difficult to be more specific because I do not understand your explanation of the score - the method above would be good for finding the longest stretch in the student's work which is an exact quote, but I don't understand your N - is it the number of distinct sections of source text needed to construct the student's text?
If so, there may be a dynamic programming approach. At step k, we work out the least number of distinct sections of source text needed to construct first k characters of the student's text. Using a suffix array built just from the source text or otherwise, we find the longest match between the source text and characters x..k of the student's text, where x is of course as small as possible. Then the least number of sections of source text needed to construct the first k characters of student text is the least needed to construct 1..x-1 (which we have already worked out) plus 1. By running this process for k=1..the length of the student text we find the least number of sections of source text needed to reconstruct the whole of it.
(Or you could just search StackOverflow for the student's text, on the grounds that students never do anything these days except post their question on StackOverflow :-)).
I claim that repeatedly moving along the target string from left to right, using a suffix array or tree to find the longest match at any time, will find the smallest number of different strings from the source text that produces the target string. I originally found this by looking for a dynamic programming recursion but, as pointed out by Evgeny Kluev, this is actually a greedy algorithm, so let's try and prove this with a typical greedy algorithm proof.
Suppose not. Then there is a solution better than the one you get by going for the longest match every time you run off the end of the current match. Compare the two proposed solutions from left to right and look for the first time when the non-greedy solution differs from the greedy solution. If there are multiple non-greedy solutions that do better than the greedy solution I am going to demand that we consider the one that differs from the greedy solution at the last possible instant.
If the non-greedy solution is going to do better than the greedy solution, and there isn't a non-greedy solution that does better and differs later, then the non-greedy solution must find that, in return for breaking off its first match earlier than the greedy solution, it can carry on its next match for longer than the greedy solution. If it can't, it might somehow do better than the greedy solution, but not in this section, which means there is a better non-greedy solution which sticks with the greedy solution until the end of our non-greedy solution's second matching section, which is against our requirement that we want the non-greedy better solution that sticks with the greedy one as long as possible.
So we have to assume that, in return for breaking off the first match early, the non-greedy solution gets to carry on its second match longer. But this doesn't work, because, when the greedy solution finally has to finish using its first match, it can jump on to the same section of matching text that the non-greedy solution is using, just entering that section later than the non-greedy solution did, but carrying on for at least as long as the non-greedy solution. So there is no non-greedy solution that does better than the greedy solution and the greedy solution is optimal.
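A sketch of that greedy scan; the naive "in" substring test here stands in for the suffix tree or suffix array lookup a real implementation would use:

    def greedy_score(essay, works):
        score, i, n = 0, 0, len(essay)
        while i < n:
            # Extend the current segment as far as it still occurs in the works.
            j = i
            while j < n and essay[i:j + 1] in works:
                j += 1
            if j == i:
                j = i + 1   # character never occurs in the works: its own segment
            score += 1
            i = j
        return score

    # greedy_score("jogging", "The quick brown fox jumps over the lazy dog.")  # -> 6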
Have you considered using N-Grams to solve this problem?
http://en.wikipedia.org/wiki/N-gram
First read the complete works of Shakespeare and build a trie. Then process the string left to right. We can greedily take the longest substring that matches one in the data because we want the minimum number of strings, so there is no factor of 2^N. The second part is dirt cheap O(N).
The depth of the trie is limited by the available space. With a gigabyte of RAM you could reasonably expect to exhaustively cover all Shakespearean English strings of length at least 5 or 6. I would require that the leaf nodes are unique (which also gives a rule for constructing the trie) and keep a pointer to their place in the actual works, so you have access to the continuation.
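One way to sketch such a depth-limited trie over the complete works; the node layout and the max_depth value are illustrative, and each node keeps offsets back into the text so a match can be continued:

    class TrieNode:
        def __init__(self):
            self.children = {}
            self.positions = []   # start offsets in the works where this prefix occurs

    def build_trie(works, max_depth=6):
        root = TrieNode()
        for start in range(len(works)):
            node = root
            for depth in range(min(max_depth, len(works) - start)):
                ch = works[start + depth]
                node = node.children.setdefault(ch, TrieNode())
                node.positions.append(start)
        return root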
This feels like a problem of partial matching a very large regular expression.
If so it can be solved by a very large non deterministic finite state automata or maybe more broadly put as a graph representing for every character in the works of Shakespeare, all the possible next characters.
If necessary for efficiency reasons the NDFA is guaranteed to be convertible to a DFA. But then this construction can give rise to 2^n states, maybe this is what you were alluding to?
This aspect of the complexity does not really worry me. The NDFA will have M + C states; one state for each character and C states where C = 26*2 + #punctuation to connect to each of the M states to allow the algorithm to (re)start when there are 0 matched characters. The question is would the corresponding DFA have O(2^M) states and if so is it necessary to make that DFA, theoretically it's not necessary. However, consider that in the construction, each state will have one and only one transition to exactly one other state (the next state corresponding to the next character in that work). We would expect that each one of the start states will be connected to on average M/C states, but in the worst case M meaning the NDFA will have to track at most M simultaneous states. That's a large number but not an impossibly large number for computers these days.
The score would be derived by initializing it to 1 and then incrementing it every time a non-accepting state is reached.
It's true that one of the approaches to string searching is building a DFA. In fact, for the majority of the string search algorithms, it looks like a small modification on failure to match (increment counter) and success (keep going) can serve as a general strategy.

Algorithm for string processing

I am looking for an algorithm for string processing. I have searched for one but couldn't find an algorithm that meets my requirements. I will explain what the algorithm should do with an example.
There are two sets of word sets defined as shown below:
**Main_Words**: swimming, driving, playing
**Words_in_front**: I am, I enjoy, I love, I am going to go
The program will search through a huge set of words; as soon as it finds a word that is defined in Main_Words, it will check the words in front of that word to see if they match anything defined in Words_in_front.
i.e., if the program encounters the word "Swimming", it has to check whether the words in front of "Swimming" are one of these: I am, I enjoy, I love, I am going to go.
Are there any algorithms that can do this?
A straightforward way to do this would be to just do a linear scan through the text, always keeping track of the last N+1 words (or characters) you see, where N is the number of words (or characters) in the longest phrase contained in your words_in_front collection. When you have a "main word", you can just check whether the sequence of N words/characters before it ends with any of the prefixes you have.
This would be a bit faster if you transformed your words_in_front set into a nicer data structure, such as a hashmap (perhaps keyed by the last letter in the phrase) or a prefix/suffix tree of some sort, so you wouldn't have to do an .endsWith over every single member of the set of prefixes each time you have a matching "main word." As was stated in another answer, there is much room for optimization and a few other possible implementations, but that's a start.
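A sketch of that linear scan (the whitespace tokenization and all names here are illustrative):

    from collections import deque

    def find_phrases(text, main_words, words_in_front):
        main_words = {w.lower() for w in main_words}
        prefixes = [p.lower().split() for p in words_in_front]
        window = deque(maxlen=max(len(p) for p in prefixes))
        hits = []
        for word in text.lower().split():
            if word in main_words:
                before = list(window)
                # Does the text just before the main word end with a known phrase?
                hits.extend((" ".join(p), word) for p in prefixes
                            if len(before) >= len(p) and before[-len(p):] == p)
            window.append(word)
        return hits

    # find_phrases("I love swimming and I am going to go driving",
    #              ["swimming", "driving"],
    #              ["I am", "I enjoy", "I love", "I am going to go"])
    # -> [("i love", "swimming"), ("i am going to go", "driving")]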
Create a map/dictionary/hash/associative array (whatever is available in your language) whose keys are the Main_Words, with the Words_in_front attached as a linked list to the entry pointed to by each key. Whenever you encounter a word matching a key, go to the table and see if the attached list contains the words you have in front of it.
That's the basic idea, it can be optimized for both speed and space.
You should be able to build a regular expression along these lines:
I (am|enjoy|love|am going to go) (swimming|driving|playing)
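For instance, in Python that pattern could be applied like this (the word boundaries and the ignore-case flag are optional additions):

    import re

    pattern = re.compile(r"\bI (am|enjoy|love|am going to go) (swimming|driving|playing)\b",
                         re.IGNORECASE)
    for m in pattern.finditer("I love swimming and I am going to go driving"):
        print(m.group(0))   # "I love swimming", then "I am going to go driving"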

How to understand and add syllable break in this example?

I am new to machine learning and computing probabilities. This is an example from Lingpipe of adding syllabification to a word by training on data.
Given a source model p(h) for hyphenated words, and a channel model p(w|h) defined so that p(w|h) = 1 if w is equal to h with the hyphens removed and 0 otherwise, we then seek to find the most likely source message h to have produced message w by:
ARGMAX_h p(h|w) = ARGMAX_h p(w|h) p(h) / p(w)
              = ARGMAX_h p(w|h) p(h)
              = ARGMAX_{h : strip(h) = w} p(h)
where we use strip(h) = w to mean that w is equal to h with the hyphenations stripped out (in Java terms, h.replaceAll(" ","").equals(w)). Thus with a deterministic channel, we wind up looking for the most likely hyphenation h according to p(h), restricting our search to h that produce w when the hyphens are stripped out.
I do not understand how to use it to build a syllabification model.
If there is a training set containing:
a bid jan
a bide
a bie
a bil i ty
a bim e lech
How can I build a model that will syllabify words? I mean, what needs to be computed in order to find the possible syllable breaks of a new word?
First compute what? Then compute what? Can you please be specific, with an example?
Thanks a lot.
The method described in the article is based on a statistical technique that lets you recover the correct value from an observed noisy value. In other words, the non-syllabified word is the noisy or incorrect form, like picnic, and the goal is finding the probably correct value, which is pic-nic.
Here is an excellent video lesson on this very topic (scroll to 1:25, but the whole set of lectures is worth watching).
This method is especially useful for word segmentation, but some use it for syllabification as well. Chinese has space delimiters only for logical constructs, and most words follow each other with no delimiters; however, each character is a syllable, without exception.
There are other languages that have more complicated grammar. For instance, Thai has no spaces between the words, but each syllable may be constructed from several symbols, e.g. สวัสดี -> ส-วัส-ดี. Rule-based syllabification may be hard but possible.
As for English, I would not bother with Markov chains and N-grams, and would instead just use several simple rules that give a pretty good match ratio (not perfect, however):
Two consonants between two vowels VCCV - split between them VC-CV as in cof-fee, pic-nic, except for "cluster consonants" that represent a single sound: meth-od, Ro-chester, hang-out
Three or more consonants between the vowels VCCCV - split keeping the blends together as in mon-ster or child-ren (this seems the most difficult as you cannot avoid a dictionary)
One consonant between two vowels VCV - split after the first vowel V-CV as in ba-con, a-rid
The rule above also has an exception based on blends: cour-age, play-time
Two vowels together VV - split between them, except when they represent a "cluster vowel": po-em, but glacier, earl-ier
I would start with the "main" rules first, and then cover them with "guard" rules preventing cluster vowels and consonants from being split. Also, there would be an obvious guard rule preventing a single consonant from becoming a syllable. When done, I would add another guard rule based on a dictionary.
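A very rough sketch of just the main VCCV and VCV rules (none of the guard rules for consonant blends, cluster vowels, or the dictionary check are included):

    import re

    def rough_syllabify(word):
        w = word.lower()
        # VCCV -> VC-CV: split between two consonants flanked by vowels (cof-fee, pic-nic).
        w = re.sub(r"([aeiouy])([^aeiouy])([^aeiouy])([aeiouy])", r"\1\2-\3\4", w)
        # VCV -> V-CV: split after the first vowel (ba-con).
        w = re.sub(r"([aeiouy])([^aeiouy])([aeiouy])", r"\1-\2\3", w)
        return w

    # rough_syllabify("picnic")  # -> "pic-nic"
    # rough_syllabify("bacon")   # -> "ba-con"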
