Unicode characters - string

In my application I have Unicode strings and I need to tell which language each string is in.
I want to do this by narrowing down the list of possible languages, by determining which Unicode ranges the characters of the string fall into.
I have the ranges from http://jrgraphix.net/research/unicode_blocks.php
and the possible languages from http://unicode-table.com/en/
The problem is that the algorithm has to detect all languages. Does anyone know of a wider mapping of Unicode ranges to languages?
Thanks,
Wojciech

This is not really possible, for a few reasons:
Many languages share the same writing system. Look at English and Dutch, for example. Both use the Basic Latin alphabet. By only looking at the range of code points, you simply cannot distinguish between them.
Some languages use more characters, but there is no guarantee that a specific piece of text contains them. German, for example, uses the Basic Latin alphabet plus "ä", "ö", "ü" and "ß". While these letters are not particularly rare, you can easily create whole sentences without them. So, a short text might not contain them. Thus, again, looking at code points alone is not enough.
Text is not always "pure". An English text may contain French letters because of a French loanword (e.g. "déjà vu"). Or it may contain foreign words, because the text is talking about foreign things (e.g. "Götterdämmerung is an opera by Richard Wagner...", or "The Great Wall of China (万里长城) is..."). Looking at code points alone would be misleading.
To sum up, no, you cannot reliably map code point ranges to languages.
What you could do: Count how often each character appears in the text and heuristically compare with statistics about known languages. Or analyse word structures, e.g. with Markov chains. Or search for the words in dictionaries (taking inflection, composition etc. into account). Or a combination of these.
But this is hard and a lot of work, so you are better off using an existing solution, such as those recommended by deceze and Esailija.
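For illustration only, the character-frequency idea above could start out as simply as the sketch below. The two language profiles and their numbers are made up for the example; a real system would derive full profiles from large corpora and use a proper similarity measure.
    import java.util.HashMap;
    import java.util.Map;

    // Minimal sketch of the character-frequency idea: compare the letter
    // distribution of an input string against per-language profiles.
    // The profile numbers below are invented for illustration only; a real
    // system would derive them from large text corpora.
    public class FrequencyGuesser {

        // Hypothetical relative frequencies of a few letters per language.
        static final Map<String, Map<Character, Double>> PROFILES = new HashMap<>();
        static {
            Map<Character, Double> en = new HashMap<>();
            en.put('e', 0.13); en.put('t', 0.09); en.put('z', 0.001);
            Map<Character, Double> de = new HashMap<>();
            de.put('e', 0.16); de.put('t', 0.06); de.put('z', 0.011);
            PROFILES.put("English", en);
            PROFILES.put("German", de);
        }

        static String guess(String text) {
            // Count letters in the input.
            Map<Character, Integer> counts = new HashMap<>();
            int total = 0;
            for (char c : text.toLowerCase().toCharArray()) {
                if (Character.isLetter(c)) {
                    counts.merge(c, 1, Integer::sum);
                    total++;
                }
            }
            if (total == 0) return "unknown";

            // Pick the profile with the smallest summed deviation.
            String best = "unknown";
            double bestScore = Double.MAX_VALUE;
            for (Map.Entry<String, Map<Character, Double>> profile : PROFILES.entrySet()) {
                double score = 0;
                for (Map.Entry<Character, Double> expected : profile.getValue().entrySet()) {
                    double observed = counts.getOrDefault(expected.getKey(), 0) / (double) total;
                    score += Math.abs(observed - expected.getValue());
                }
                if (score < bestScore) {
                    bestScore = score;
                    best = profile.getKey();
                }
            }
            return best;
        }
    }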

I like the suggestion of using something like Google Translate, as they will be doing all the work for you.
You might be able to build a rule-based system that gets you part of the way there. Build heuristic rules for languages and see if that is sufficient. Certain Tibetan characters do indicate Tibetan, and there are unique characters in many languages that will be a giveaway. But as the other answer pointed out, a limited sample of text may not be that accurate, as you may not have a clear indicator.
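As a sketch of that kind of rule-based narrowing, Java already exposes the Unicode block of each code point through Character.UnicodeBlock, so a first pass could collect hints roughly like this (the block-to-language mapping shown is deliberately tiny and only illustrative):
    import java.util.LinkedHashSet;
    import java.util.Set;

    // Rule-based sketch: narrow the candidate languages by the Unicode blocks
    // that actually occur in the string. The mapping below is illustrative and
    // deliberately incomplete; many blocks map to many languages.
    public class ScriptHints {

        static Set<String> hint(String text) {
            Set<String> hints = new LinkedHashSet<>();
            text.codePoints().forEach(cp -> {
                Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
                if (block == Character.UnicodeBlock.TIBETAN) {
                    hints.add("Tibetan");
                } else if (block == Character.UnicodeBlock.GREEK) {
                    hints.add("Greek");
                } else if (block == Character.UnicodeBlock.HIRAGANA
                        || block == Character.UnicodeBlock.KATAKANA) {
                    hints.add("Japanese");
                } else if (block == Character.UnicodeBlock.BASIC_LATIN) {
                    // Dozens of languages use Basic Latin; the block alone
                    // cannot decide between them.
                    hints.add("some Latin-script language");
                }
            });
            return hints;
        }
    }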
Languages will, however, differ in the frequencies with which each character appears, so you could build a basic fingerprint of each language you need to classify and make guesses based on letter frequency. This goes a bit further than a rule-based system. A good tool for building it would be a text classification algorithm, which will do the analysis for you: you train it on samples of different languages instead of having to articulate the actual rules yourself.
A much more sophisticated version of this is presumably what Google does.

Related

Fixing English text with no spaces

I have a lot of lines of English text with mostly no spaces between the words. The text is normal English from 19th century historical records. I can look at the text and add spaces, but it is very time consuming, not to mention boring. Is there a "simple" script or program that could work out where to put the spaces? For some definition of "simple"? Clearly it would need a dictionary. I would prefer a script language I could adjust a bit and hopefully it would run on linux/BSD/MacOS.
This is (close to) impossible. You need something that can understand the meaning of the text.
Remember the old joke about expertsexchange.com, the domain name of Stack Overflow's "competitor", Experts Exchange … or is it Expert Sex Change? You cannot correctly place the spaces without knowing and understanding the context, and since both readings have the same grammatical structure, even grammatical analysis cannot tell you which is correct.
There is nothing "simple" about this; you are in territory that could earn you a degree in Artificial Intelligence and Natural Language Processing if you get it right. At the very least it would be publishable in a reputable top-tier scientific journal.
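That said, the purely mechanical part can be attacked with a word list and dynamic programming. The sketch below finds one segmentation whenever the whole line can be covered by dictionary words; the tiny dictionary is a stand-in for a real one, and the sketch deliberately ignores the scoring needed to rank competing segmentations (the expertsexchange problem).
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Sketch of dictionary-based segmentation with dynamic programming.
    // best.get(i) holds one segmentation of the first i characters, or null
    // if none was found. A real tool would add frequency-based scoring to
    // choose between competing segmentations.
    public class Segmenter {

        static List<String> segment(String text, Set<String> dictionary) {
            int n = text.length();
            List<List<String>> best = new ArrayList<>();
            for (int i = 0; i <= n; i++) {
                best.add(null);
            }
            best.set(0, new ArrayList<>());
            for (int i = 1; i <= n; i++) {
                for (int j = 0; j < i; j++) {
                    String word = text.substring(j, i);
                    if (best.get(j) != null && dictionary.contains(word)) {
                        List<String> candidate = new ArrayList<>(best.get(j));
                        candidate.add(word);
                        best.set(i, candidate);
                        break;
                    }
                }
            }
            return best.get(n); // null means the line could not be fully segmented
        }

        public static void main(String[] args) {
            // Toy dictionary and input, purely for demonstration.
            Set<String> dictionary = new HashSet<>(
                    Arrays.asList("john", "smith", "was", "born", "in", "york"));
            System.out.println(segment("johnsmithwasborninyork", dictionary));
            // prints: [john, smith, was, born, in, york]
        }
    }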

How is a hyphenation dictionary used for hyphenation?

I've read about hyphenation and came to know that hyphenation depends on the dictionary we are using for the particular language. For some words Microsoft Office hyphenates differently than LibreOffice. I tried to open the dictionary hyph_en_US.dic but couldn't understand its content.
What I didn't get is how the dictionary is used.
Does it contain the list of words to hyphenate?
Does it contain the rules to decide as to how to hyphenate the word?
Note: I know they use algorithms as well to make the hyphenation better but to what extent does the dictionary play a role?
Any help will be much appreciated.
Regards,
Ankur Vashishtha
LibreOffice, like TeX and a lot of other programs, uses the hyphenation algorithm created by Franklin M. Liang. This algorithm uses a pattern-matching technique to find hyphenation points in words. A separate dictionary file containing the patterns is used for each language. According to Franklin M. Liang's thesis:
These patterns find 89% of the hyphens in a pocket dictionary word list, with essentially no error.
As to how Word does it, it is hard to tell, since it is proprietary software. My guess is that it does not use such an algorithm but a real dictionary with the 100% correct hyphenation points added in. This would explain why the hyphenation is different and more accurate in Word.
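To give a feel for what the patterns in a file like hyph_en_US.dic do, here is a stripped-down sketch of Liang's scheme. The three patterns used in main are made up for the example rather than taken from the real dictionary, which contains thousands of patterns plus an exception list.
    import java.util.Arrays;
    import java.util.List;

    // Stripped-down sketch of Liang-style pattern hyphenation, the technique
    // used by TeX and the LibreOffice hyphenation dictionaries. Digits inside
    // a pattern rate the inter-letter positions they sit on; every matching
    // pattern is applied, the maximum value per position is kept, and
    // positions with an odd value become hyphen points.
    public class LiangSketch {

        static String hyphenate(String word, List<String> patterns) {
            String padded = "." + word.toLowerCase() + ".";
            int[] values = new int[padded.length() + 1];

            for (String pattern : patterns) {
                String letters = pattern.replaceAll("\\d", "");
                for (int start = 0; start + letters.length() <= padded.length(); start++) {
                    if (!padded.startsWith(letters, start)) {
                        continue;
                    }
                    int pos = start;
                    for (char c : pattern.toCharArray()) {
                        if (Character.isDigit(c)) {
                            values[pos] = Math.max(values[pos], c - '0');
                        } else {
                            pos++;
                        }
                    }
                }
            }

            // Insert hyphens where the accumulated value is odd, keeping at
            // least two letters on each side of a break.
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < word.length(); i++) {
                if (i > 1 && i < word.length() - 1 && values[i + 1] % 2 == 1) {
                    out.append('-');
                }
                out.append(word.charAt(i));
            }
            return out.toString();
        }

        public static void main(String[] args) {
            // Toy patterns, not from hyph_en_US.dic.
            List<String> patterns = Arrays.asList("hy3p", "hena4", "hen5at");
            System.out.println(hyphenate("hyphenation", patterns)); // hy-phen-ation
        }
    }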

How to take the suffix in smoothing of Part of speech tagging

I am making a part-of-speech tagger. I am handling unknown words using suffixes.
The main issue is how to decide the suffix length: should it be fixed in advance (as in the Weischedel approach), or should I take the last few letters of the word (as in the Samuelsson approach)?
Which approach would be better?
Quick googling suggests that the Weischedel approach is sufficient for English, which has only rudimentary morphological inflection. The Samuelsson approach seems to work better (which makes sense intuitively) when it comes to processing inflected languages.
A Resource-light Approach to Morpho-syntactic Tagging (Google Books, p. 9) says:
To handle unknown words Brants (2000) uses Samuelsson's (1993) suffix analysis, which seems to work best for inflected languages.
(This is not in a direct comparison to Weischedel's approach, though.)
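To make the difference less abstract: both approaches come down to collecting tag statistics for word endings at training time and backing off over them when an unknown word is seen. Below is a simplified sketch with a fixed maximum suffix length and none of the proper smoothing weights, so it is closer in spirit to the Samuelsson/Brants idea than a faithful implementation.
    import java.util.HashMap;
    import java.util.Map;

    // Simplified sketch of suffix-based tag guessing for unknown words:
    // collect tag counts for word suffixes seen in training, then, for an
    // unknown word, back off from the longest recorded suffix to shorter
    // ones and take the most frequent tag.
    public class SuffixGuesser {

        static final int MAX_SUFFIX = 4;
        // suffix -> (tag -> count), filled from tagged training data.
        final Map<String, Map<String, Integer>> suffixTagCounts = new HashMap<>();

        void train(String word, String tag) {
            for (int len = 1; len <= Math.min(MAX_SUFFIX, word.length()); len++) {
                String suffix = word.substring(word.length() - len);
                suffixTagCounts
                        .computeIfAbsent(suffix, s -> new HashMap<>())
                        .merge(tag, 1, Integer::sum);
            }
        }

        String guess(String unknownWord) {
            // Back off from the longest suffix to the shortest one seen in training.
            for (int len = Math.min(MAX_SUFFIX, unknownWord.length()); len >= 1; len--) {
                String suffix = unknownWord.substring(unknownWord.length() - len);
                Map<String, Integer> counts = suffixTagCounts.get(suffix);
                if (counts != null) {
                    return counts.entrySet().stream()
                            .max(Map.Entry.comparingByValue())
                            .get().getKey();
                }
            }
            return "NN"; // fall back to a default tag
        }
    }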

How do I pad a Unicode string to a specific visible length?

I'd like to create a left pad function in a programming language. The function pads a string with leading characters to a specified total length. Strings are UTF-16 encoded in this language.
There are a few things in Unicode that make it complicated:
Surrogates: 2 surrogate code units (a surrogate pair) = 1 Unicode code point
Combining characters: 1 non-combining character + any number of combining characters = 1 visible character
Invisible characters: 1 invisible character = 0 visible characters
What other factors have to be taken into consideration, and how would they be dealt with?
When you’re first starting out trying to understand something, it’s really frustrating. We’ve all been there. But while it’s very easy to call it stupid, and to call everyone who made it stupid, you’re not going to get very far doing that. With an attitude like that, you’re implying that people who do understand it are also stupid for wasting their time on something so obviously stupid. After calling the people who do understand it stupid, it’s extremely unlikely that anyone who does understand it will take the time to explain it to you.
I understand the frustration. Unicode’s really complicated and it was a huge pain for me before I understood it, and it’s still a pain for a lot of things I don’t have experience with. But the reason it’s so complicated isn’t because the people who made it were stupid and trying to ruin your life. It’s complicated because it attempts to provide a standard way of representing every human writing system ever used. Writing systems are insanely complicated, and throughout history developing a new and different writing system has been a fairly standard part of identifying yourself as a different culture from the people across the river or over the next mountain range. You start off by identifying yourself as Hungarian based on the language you speak. Having once tried to pronounce a Hungarian professor’s name, I know that Hungarian is very complicated compared to English, just as English is very complicated compared to Hungarian. How would you feel if I were having trouble with Hungarian and asked you, “Boy, Hungarian sure is a stupid language! It must have been designed by idiots! By the way, how do I pronounce this word??”
There’s just no simple way to express something that’s inherently complicated in a very simple way. Human writing systems are inherently complicated and intentionally different from each other. As complicated as Unicode is, it’s better than what people had to do before, when instead of one single complicated standard there were multiple complicated standards in every country and you’d have to understand all of the different ‘standards.’
I’m not sure what your general life strategy is, but what I usually do when I don’t understand something is to pick up a few textbooks on the topic, read them through, and work through the examples. A good textbook will not only tell you how things are and what you need to do, but also how they got to be that way and why you need to do what you need to do.
I found Unicode Demystified to be an excellent book, and the newer book Unicode Explained has even higher ratings on Amazon.
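On the practical side of the original question, one reasonable starting point on the JVM is java.text.BreakIterator, which iterates over user-perceived characters. The sketch below builds a left pad on it; it handles surrogate pairs and combining sequences, but deliberately does nothing special about invisible characters or complex emoji sequences, which need extra rules.
    import java.text.BreakIterator;

    // Sketch of a left pad that counts user-perceived characters rather than
    // UTF-16 code units, using java.text.BreakIterator. It copes with
    // surrogate pairs and combining sequences; invisible characters
    // (zero-width space etc.) and complex emoji sequences are not handled.
    public class VisiblePad {

        static int visibleLength(String s) {
            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(s);
            int count = 0;
            while (it.next() != BreakIterator.DONE) {
                count++;
            }
            return count;
        }

        static String leftPad(String s, int targetLength, char padChar) {
            StringBuilder sb = new StringBuilder();
            for (int i = visibleLength(s); i < targetLength; i++) {
                sb.append(padChar);
            }
            return sb.append(s).toString();
        }

        public static void main(String[] args) {
            // "e" + combining acute accent: two code units, one visible character.
            System.out.println(leftPad("e\u0301", 5, '.')); // ....é
        }
    }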

Text comparison algorithm

We have a requirement in the project to compare two texts (update1, update2) and come up with an algorithm to determine how many words and how many sentences have changed.
Are there any algorithms that I can use?
I am not even looking for code. If I know the algorithm, I can code it in Java.
Typically this is accomplished by finding the Longest Common Subsequence (commonly called the LCS problem). This is how tools like diff work. Of course, diff is a line-oriented tool, and it sounds like your needs are somewhat different. However, I'm assuming that you've already constructed some way to compare words and sentences.
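A bare-bones sketch of the LCS idea on word tokens is shown below; tokens that fall outside the common subsequence are counted as added or removed, and running the same code on sentence tokens instead of word tokens gives a changed-sentence count.
    // Bare-bones sketch of counting changed words with the longest common
    // subsequence (LCS). Tokens that are not part of the LCS were inserted
    // or deleted; running the same code on sentence tokens gives a
    // changed-sentence count.
    public class WordDiff {

        static int lcsLength(String[] a, String[] b) {
            int[][] dp = new int[a.length + 1][b.length + 1];
            for (int i = 1; i <= a.length; i++) {
                for (int j = 1; j <= b.length; j++) {
                    dp[i][j] = a[i - 1].equals(b[j - 1])
                            ? dp[i - 1][j - 1] + 1
                            : Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
            return dp[a.length][b.length];
        }

        public static void main(String[] args) {
            String[] update1 = "the quick brown fox".split("\\s+");
            String[] update2 = "the quick red fox jumps".split("\\s+");
            int common = lcsLength(update1, update2);                    // 3: the, quick, fox
            System.out.println("removed: " + (update1.length - common)); // 1 (brown)
            System.out.println("added: " + (update2.length - common));   // 2 (red, jumps)
        }
    }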
An O(NP) Sequence Comparison Algorithm is used by Subversion's diff engine.
For your information, I have implementations of it in various programming languages on the following GitHub page:
https://github.com/cubicdaiya/onp
Some kind of diff variant might be helpful, e.g. wdiff.
If you decide to devise your own algorithm, you're going to have to address the situation where a sentence has been inserted. For example for the following two documents:
The men are bad. I hate the men
and
The men are bad. John likes the men. I hate the men
Your tool should be able to look ahead and recognise that in the second document, "I hate the men" has not been replaced by "John likes the men" but is untouched, with a new sentence inserted before it. That is, it should report the insertion of a sentence, not the changing of four words followed by a new sentence.
The specific algorithm used by diff and most other comparison utilities is Eugene Myers's An O(ND) Difference Algorithm and Its Variations. There's a Java implementation of it available in the java-diff-utils package.
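For what it's worth, assuming a recent java-diff-utils (the 4.x line with the com.github.difflib packages; older releases used different package names), a word-level comparison might look roughly like this:
    import com.github.difflib.DiffUtils;
    import com.github.difflib.patch.AbstractDelta;
    import com.github.difflib.patch.Patch;

    import java.util.Arrays;
    import java.util.List;

    // Rough usage sketch: diff two texts as word lists and inspect the
    // deltas. Assumes java-diff-utils 4.x (com.github.difflib); adjust the
    // imports for other versions.
    public class DiffUtilsExample {

        public static void main(String[] args) {
            List<String> update1 = Arrays.asList("the quick brown fox".split("\\s+"));
            List<String> update2 = Arrays.asList("the quick red fox jumps".split("\\s+"));

            Patch<String> patch = DiffUtils.diff(update1, update2);
            for (AbstractDelta<String> delta : patch.getDeltas()) {
                System.out.println(delta.getType() + ": "
                        + delta.getSource().getLines() + " -> "
                        + delta.getTarget().getLines());
            }
        }
    }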
Here are two papers that describe other text comparison algorithms that should generally output 'better' (e.g. smaller, more meaningful) differences:
Tichy, Walter F., "The String-to-String Correction Problem with Block Moves" (1983). Computer Science Technical Reports. Paper 378.
Paul Heckel, "A Technique for Isolating Differences Between Files", Communications of the ACM, April 1978, Volume 21, Number 4
The first paper cites the second and mentions this about its algorithm:
Heckel [3] pointed out similar problems with LCS techniques and proposed a linear-time algorithm to detect block moves. The algorithm performs adequately if there are few duplicate symbols in the strings. However, the algorithm gives poor results otherwise. For example, given the two strings aabb and bbaa, Heckel's algorithm fails to discover any common substring.
The first paper was mentioned in this answer and the second in this answer, both to the similar SO question:
Is there a diff-like algorithm that handles moving block of lines? - Stack Overflow
The difficulty comes when comparing large files efficiently and with good performance. I therefore implemented a variation of the Myers O(ND) diff algorithm, which performs quite well and accurately (and supports filtering based on regular expressions):
Algorithm can be tested out here: becke.ch compare tool web application
And a little bit more information on the home page: becke.ch compare tool
The most famous algorithm is the O(ND) difference algorithm, also used in the Notepad++ Compare plugin (written in C++) and in GNU diff(1). You can find a C# implementation here:
http://www.mathertel.de/Diff/default.aspx
