How is a hyphenation dictionary used for hyphenation? - ms-office

I've read about hyphenation and learned that it depends on the dictionary used for the particular language. For some words Microsoft Office hyphenates differently than LibreOffice. I tried to open the dictionary hyph_en_US.dic but couldn't understand the content.
What I don't understand is how the dictionary is used.
Does it contain a list of words to hyphenate?
Does it contain rules that decide how to hyphenate a word?
Note: I know they use algorithms as well to make the hyphenation better, but to what extent does the dictionary play a role?

LibreOffice, like TeX and many other programs, uses the hyphenation algorithm created by Franklin M. Liang. The algorithm uses pattern matching to find hyphenation points in words, with a separate dictionary file of patterns for each language. According to Liang's thesis:
These patterns find 89% of the hyphens in a pocket dictionary word list, with essentially no error.
As for how Word does it, that is hard to tell, since it is proprietary software. My guess is that it does not use such an algorithm but an actual dictionary with the fully correct hyphenation points added in. That would explain why Word's hyphenation is different and more accurate.
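To make the pattern idea concrete, here is a minimal sketch of Liang-style pattern application in Python. The two patterns used below are hand-picked toy examples (the real hyph_en_US.dic contains thousands): odd digits favour a break between the surrounding letters, even digits inhibit one, and the highest digit at each position wins.

```python
def liang_hyphenate(word, patterns):
    """Find break points in `word` using Liang-style patterns.

    A pattern such as "hy3ph" means: when the letters "hyph" occur,
    put priority 3 between "hy" and "ph".  After all matching patterns
    have been applied, odd priorities mark allowed hyphenation points.
    """
    # Parse each pattern into its letters plus inter-letter priorities.
    parsed = {}
    for pat in patterns:
        letters, scores = "", [0]
        for ch in pat:
            if ch.isdigit():
                scores[-1] = int(ch)
            else:
                letters += ch
                scores.append(0)
        parsed[letters] = scores

    w = "." + word.lower() + "."            # '.' marks the word boundary
    points = [0] * (len(w) + 1)             # points[p] sits before w[p]
    for i in range(len(w)):
        for j in range(i + 1, len(w) + 1):
            scores = parsed.get(w[i:j])
            if scores:
                for k, s in enumerate(scores):
                    points[i + k] = max(points[i + k], s)

    # Odd priority => break; keep at least two letters on each side.
    pieces, last = [], 0
    for i in range(2, len(word) - 1):       # break after word[:i]
        if points[i + 1] % 2 == 1:
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces)
```

With the toy patterns `["hy3ph", "hen5at"]`, `liang_hyphenate("hyphenation", ...)` yields `hy-phen-ation`.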

Related

Implementations for Pattern/String mining using Suffix Arrays/Trees

I am trying to solve a pattern-mining problem on strings, and I think suffix trees or suffix arrays might be a good fit.
I will quickly outline the problem:
I have a set of strings of different lengths (the quotation marks are just there to mark repetitions for the explanation):
C"BECB"ECCECCECEEB"BECB"FCCECCECCECCECCFCBFCBFCC
DCBBDCDDCCECBCEC"BECB""BECB"BECB"BECB"BECB"EDB"BECB""BECB"BEC
etc.
I now would like to find repeated patterns within each string and repeated patterns that are common between the strings. A repeated pattern in string (1) would be "BECB". Moreover, the pattern "BECB" is also repeated in string (2). Of course there are several criteria to decide on, for example the minimum number of repetitions or the minimum number of symbols in a pattern.
From the book by Dan Gusfield "Algorithms on Strings, Trees and Sequences" I know that it is possible to find such repeats (e.g. maximal pairs, maximal repetitive structures etc.) using certain algorithms and a suffix tree data structure. This comes in handy as I would like to use probabilistic suffix trees to also calculate some predictions on these sequences. (But this is not the topic of this post.)
Unfortunately I can't find any implementations of these algorithms. Hence I am wondering whether I am even on the right path using suffix trees to solve this problem. It seems strange to me that no packages (in R or Python, for example) are available for such a well-described problem.
Are there any alternatives (with existing packages) that solve my problem?
Or do you know any implementation of algorithms for suffix trees?
Here is an implementation in C++ that follows the approach of Dan Gusfield's book: https://cp-algorithms.com/string/suffix-tree-ukkonen.html
You could rewrite it in Python. Such algorithms are quite specialised for high-performance applications, so it's normal that they don't appear in any standard library; nonetheless they are well known, and you can usually find good implementations on the net.
Both suffix trees and suffix arrays are good data structures to help in solving the kinds of problems you want to solve.
Building a (multi-string) suffix tree in Python is probably not a good idea: it involves a lot of operations on individual characters, and the resulting data structure consumes a lot of memory unless you spend a lot of code avoiding that.
Building a suffix array in Python is more approachable, and the resulting data structure (essentially just an array of integers) is reasonably compact.
It doesn't seem too difficult to find suffix array code in Python on the web, and there's a nice explanation here: https://louisabraham.github.io/articles/suffix-arrays
It would be more difficult to find one that already supports multiple strings, so you would have to decide how to handle that. In any case, the prefix-doubling algorithm is easy to implement if you leverage the built-in sort(), and in Python that will probably produce the fastest result.
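As a rough illustration of the suffix-array route, here is a sketch of prefix doubling built on Python's built-in sort, plus Kasai's algorithm for the LCP array; a repeated substring then shows up as a non-zero LCP entry. For multiple strings you would first concatenate them with unique separator characters (not shown here).

```python
def suffix_array(s):
    """Prefix-doubling construction on top of the built-in sort."""
    n = len(s)
    rank, sa, k = [ord(c) for c in s], list(range(n)), 1
    while True:
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for a, b in zip(sa, sa[1:]):
            new_rank[b] = new_rank[a] + (key(a) < key(b))
        rank = new_rank
        if n < 2 or rank[sa[-1]] == n - 1:  # all ranks distinct: done
            break
        k *= 2
    return sa

def lcp_array(s, sa):
    """Kasai's algorithm: lcp[i] = common prefix of sa[i-1] and sa[i]."""
    n = len(s)
    rank = [0] * n
    for i, suf in enumerate(sa):
        rank[suf] = i
    lcp, h = [0] * n, 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            h = max(h - 1, 0)
        else:
            h = 0
    return lcp

def longest_repeat(s):
    """Longest substring occurring at least twice: the maximal LCP entry."""
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    best = max(range(len(s)), key=lambda i: lcp[i])
    return s[sa[best]:sa[best] + lcp[best]]
```

For example, `longest_repeat("banana")` returns `"ana"`. Enumerating all repeats above a minimum length or count is a matter of walking the LCP array instead of just taking its maximum.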

How to take the suffix in smoothing of Part of speech tagging

I am building a part-of-speech tagger and I am handling unknown words via their suffixes.
But the main issue is how to decide the suffix length: should it be fixed in advance (as in Weischedel's approach), or should I take the last few letters of the word (as in Samuelsson's approach)?
Which approach would be better?
Quick googling suggests that Weischedel's approach is sufficient for English, which has only rudimentary morphological inflection. Samuelsson's approach seems to work better (which makes sense intuitively) for inflecting languages.
A quote from A Resource-light Approach to Morpho-syntactic Tagging (p. 9):
To handle unknown words Brants (2000) uses Samuelsson's (1993) suffix analysis, which seems to work best for inflected languages.
(This is not in a direct comparison to Weischedel's approach, though.)
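For intuition, here is a sketch of the Samuelsson/Brants-style idea in Python: record tag counts for every word-final suffix up to a maximum length at training time, then tag an unknown word using the longest suffix actually observed, backing off to shorter ones. A real tagger would smooth and interpolate the suffix distributions rather than just take the most frequent tag.

```python
from collections import Counter, defaultdict

def build_suffix_model(tagged_words, max_len=5):
    """Count tags for every word-final suffix up to max_len characters."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        for k in range(1, min(max_len, len(word)) + 1):
            counts[word[-k:]][tag] += 1
    return counts

def guess_tag(word, counts, max_len=5):
    """Tag an unknown word by its longest observed suffix, backing off."""
    for k in range(min(max_len, len(word)), 0, -1):
        suffix = word[-k:]
        if suffix in counts:
            return counts[suffix].most_common(1)[0][0]
    return None
```

Trained on words like ("running", "VBG") and ("happily", "RB"), this would tag the unseen "jumping" as VBG via the suffix "ing".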

Unicode characters

In my application I have Unicode strings, and I need to tell which language each string is in.
I want to do this by narrowing the list of possible languages based on which Unicode ranges the characters of the string fall into.
I have the ranges from http://jrgraphix.net/research/unicode_blocks.php
and the possible languages from http://unicode-table.com/en/
The problem is that the algorithm has to detect all languages. Does anyone know of a more comprehensive mapping of Unicode ranges to languages?
This is not really possible, for a couple of reasons:
Many languages share the same writing system. Look at English and Dutch, for example. Both use the Basic Latin alphabet. By only looking at the range of code points, you simply cannot distinguish between them.
Some languages use more characters, but there is no guarantee that a specific piece of text contains them. German, for example, uses the Basic Latin alphabet plus "ä", "ö", "ü" and "ß". While these letters are not particularly rare, you can easily create whole sentences without them. So, a short text might not contain them. Thus, again, looking at code points alone is not enough.
Text is not always "pure". An English text may contain French letters because of a French loanword (e.g. "déjà vu"). Or it may contain foreign words, because the text is talking about foreign things (e.g. "Götterdämmerung is an opera by Richard Wagner...", or "The Great Wall of China (万里长城) is..."). Looking at code points alone would be misleading.
To sum up, no, you cannot reliably map code point ranges to languages.
What you could do: Count how often each character appears in the text and heuristically compare with statistics about known languages. Or analyse word structures, e.g. with Markov chains. Or search for the words in dictionaries (taking inflection, composition etc. into account). Or a combination of these.
But this is hard and a lot of work. You should rather use an existing solution, such as those recommended by deceze and Esailija.
I like the suggestion of using something like Google Translate, as it will do all the work for you.
You might be able to build a rule-based system that gets you part of the way there. Build heuristic rules for languages and see if that is sufficient. Certain Tibetan characters do indicate Tibetan, and many languages have unique characters that are a giveaway. But as the other answer pointed out, a limited sample of text may not be enough, as you may not have a clear indicator.
Languages will however differ in the frequencies that each character appears, so you could have a basic fingerprint of each language you need to classify and make guesses based on letter frequency. This probably goes a bit further than a rule-based system. Probably a good tool to build this would be a text classification algorithm, which will do all the analysis for you. You would train an algorithm on different languages, instead of having to articulate the actual rules yourself.
A much more sophisticated version of this is presumably what Google does.
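A minimal sketch of the letter-frequency fingerprint idea, assuming you can supply a sample text per language to build each profile; cosine similarity between the frequency vectors then picks the closest language. Real detectors use character n-grams and much larger training samples.

```python
import math
from collections import Counter

def fingerprint(text):
    """Relative letter frequencies of a text (its 'fingerprint')."""
    letters = [c.lower() for c in text if c.isalpha()]
    total = len(letters)
    return {c: n / total for c, n in Counter(letters).items()}

def cosine(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))

def guess_language(text, profiles):
    """Pick the profile whose letter distribution is closest to the text's."""
    fp = fingerprint(text)
    return max(profiles, key=lambda lang: cosine(fp, profiles[lang]))
```

With profiles built from (much larger) English and German samples, a German input containing "ü" or "ß" scores clearly closer to the German profile.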

Text comparison algorithm

We have a requirement in the project to compare two texts (update1, update2) and come up with an algorithm that determines how many words and how many sentences have changed.
Are there any algorithms that I can use?
I am not even looking for code. If I know the algorithm, I can code it in Java.
Typically this is accomplished by finding the Longest Common Subsequence (commonly called the LCS problem). This is how tools like diff work. Of course, diff is a line-oriented tool, and it sounds like your needs are somewhat different. However, I'm assuming that you've already constructed some way to compare words and sentences.
"An O(NP) Sequence Comparison Algorithm" is used by Subversion's diff engine.
For your information, I have implementations of it in various programming languages on the following GitHub page:
https://github.com/cubicdaiya/onp
Some kind of diff variant might be helpful, e.g. wdiff.
If you decide to devise your own algorithm, you're going to have to address the situation where a sentence has been inserted. For example for the following two documents:
The men are bad. I hate the men
and
The men are bad. John likes the men. I hate the men
Your tool should be able to look ahead to recognise that in the second, I hate the men has not been replaced by John likes the men but instead is untouched, and a new sentence inserted before it. i.e. it should report the insertion of a sentence, not the changing of four words followed by a new sentence.
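For reference, Python's standard difflib (which uses Ratcliff/Obershelp matching, not Myers's algorithm) already does this kind of look-ahead at whatever granularity you feed it. A sketch at sentence granularity, using a naive split on full stops:

```python
import difflib

def sentence_diff(old, new):
    """Report sentence-level changes between two texts."""
    split = lambda t: [s.strip() for s in t.split(".") if s.strip()]
    old_s, new_s = split(old), split(new)
    changes = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(
            a=old_s, b=new_s).get_opcodes():
        if tag != "equal":
            changes.append((tag, old_s[i1:i2], new_s[j1:j2]))
    return changes
```

On the two documents above this reports a single inserted sentence ("John likes the men"), not four changed words followed by a new sentence.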
The specific algorithm used by diff and most other comparison utilities is Eugene Myers's "An O(ND) Difference Algorithm and Its Variations". There's a Java implementation of it available in the java-diff-utils package.
Here are two papers that describe other text comparison algorithms that should generally output 'better' (e.g. smaller, more meaningful) differences:
Tichy, Walter F., "The String-to-String Correction Problem with Block Moves" (1983). Computer Science Technical Reports. Paper 378.
Paul Heckel, "A Technique for Isolating Differences Between Files", Communications of the ACM, April 1978, Volume 21, Number 4
The first paper cites the second and mentions this about its algorithm:
Heckel [3] pointed out similar problems with LCS techniques and proposed a linear-time algorithm to detect block moves. The algorithm performs adequately if there are few duplicate symbols in the strings. However, the algorithm gives poor results otherwise. For example, given the two strings aabb and bbaa, Heckel's algorithm fails to discover any common substring.
The first paper was mentioned in this answer and the second in this answer, both to the similar SO question:
Is there a diff-like algorithm that handles moving block of lines? - Stack Overflow
The difficulty comes when comparing large files efficiently and with good performance. I therefore implemented a variation of Myers's O(ND) diff algorithm, which performs quite well and accurately (and supports filtering based on regular expressions):
Algorithm can be tested out here: becke.ch compare tool web application
And a little bit more information on the home page: becke.ch compare tool
The most famous algorithm is the O(ND) Difference Algorithm, also used in the Notepad++ Compare plugin (written in C++) and GNU diff(1). You can find a C# implementation here:
http://www.mathertel.de/Diff/default.aspx

Libraries or tools for generating random but realistic text

I'm looking for tools for generating random but realistic text. I've implemented a Markov Chain text generator myself and while the results were promising, my attempts at improving them haven't yielded any great successes.
I'd be happy with tools that consume a corpus or that operate based on a context-sensitive or context-free grammar. I'd like the tool to be suitable for inclusion into another project.
Most of my recent work has been in Java so a tool in that language is preferred, but I'd be OK with C#, C, C++, or even JavaScript.
This is similar to this question, but larger in scope.
Extending your own Markov chain generator is probably your best bet, if you want "random" text. Generating something that has context is an open research problem.
Try (if you haven't):
Tokenise punctuation separately, or include punctuation in your chain if you aren't already. This includes paragraph marks.
If you're using a 2- or 3-history Markov chain, try dropping back to a 1-history one when you encounter full stops or newlines.
Alternatively, you could use WordNet in two passes with your corpus:
Analyse sentences to determine common sequences of word types, i.e. nouns, verbs, adjectives, and adverbs. WordNet includes these. Everything else (pronouns, conjunctions, whatever) is excluded, but you could essentially pass those straight through.
This would turn "The quick brown fox jumps over the lazy dog" into "The [adjective] [adjective] [noun] [verb(s)] over the [adjective] [noun]"
Reproduce sentences by randomly choosing a template sentence and replacing the [adjective], [noun] and [verb] slots with actual adjectives, nouns and verbs.
There are quite a few problems with this approach too: for example, you need context from the surrounding words to know which homonym to choose. Looking up "quick" in WordNet yields the senses about being fast, but also the bit of your fingernail.
I know this doesn't solve your requirement for a library or a tool, but might give you some ideas.
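Since you've already built a Markov generator this may be redundant, but for other readers: a word-level chain is only a few lines of Python. A minimal sketch:

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each `order`-word window to the words that can follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, length=20, seed=None):
    """Random walk over the chain, starting from a random key."""
    rng = random.Random(seed)
    klen = len(next(iter(chain)))         # recover the chain's order
    out = list(rng.choice(list(chain)))
    while len(out) < length:
        followers = chain.get(tuple(out[-klen:]))
        if not followers:                 # dead end: stop early
            break
        out.append(rng.choice(followers))
    return " ".join(out)
```

Trained on a real corpus (and with punctuation kept as tokens, as suggested above), this produces locally plausible but globally incoherent text, which is exactly the limitation the WordNet template idea tries to address.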
I've used many data sets for this purpose, including Wikinews articles.
I've extracted text from them using this tool:
http://alas.matf.bg.ac.rs/~mr04069/WikiExtractor.py
