How to write an algorithm which generates furigana readings for Japanese kanji based on the kana reading

I'm currently working on a multi-language online dictionary for Japanese words and kanji. My current problem is generating furigana for kanji compounds within expressions, sentences and words. I have the kanji and kana versions (as separate strings) available in each case, but I can't get a reliable algorithm to work that derives the reading of each kanji compound from the kana reading.
I don't need the exact reading of each individual kanji, which is clearly impossible based on the data I have, but it should be possible to determine the readings of all kanji compounds since I have the full sentence/word/expression in kana.
I have:
kanji = 私は学生です
kana = わたしはがくせいです
I want to automatically assign
私 to わたし
and
学生 to がくせい.
I tried to iterate over the kanji string, check where the characters 'change' between kana and kanji, and look ahead to the matching position in the kana string. This approach works for all sentences except those where a kanji is followed by a hiragana syllable that is the same as the syllable its reading ends with.
Another idea of mine was to remove all hiragana runs of the kanji string from the kana string, and take the remaining kana runs as readings for the kanji. This clearly doesn't work in every case.
How can I write such an algorithm, which works in every case?

The standard way to do this is to use a Part-of-Speech and Morphological Analyzer like MeCab.
It will split up the sentence into a bunch of tokens, and use a dictionary to generate the reading.
If you feed the CLI with your example sentence, it will look like this:
私 名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
学生 名詞,一般,*,*,*,*,学生,ガクセイ,ガクセイ
です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS
The next-to-last column is the reading (in katakana), and the last one is the pronunciation.
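As a rough sketch of how that output can be turned into (surface, reading) pairs: this assumes the nine-field IPADIC feature layout shown above, with the surface separated from the features by a tab, which is how MeCab prints it by default. The function name parse_mecab_output is made up for illustration and is not part of MeCab.

```python
def parse_mecab_output(output):
    """Return (surface, reading) pairs from MeCab's default text output."""
    pairs = []
    for line in output.splitlines():
        if not line or line == "EOS":
            continue
        surface, _, features = line.partition("\t")
        fields = features.split(",")
        # Field 7 (0-based) is the reading in the IPADIC layout (assumption).
        # Unknown words may carry fewer fields, so report no reading for them.
        reading = fields[7] if len(fields) > 7 else None
        pairs.append((surface, reading))
    return pairs


sample = (
    "私\t名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ\n"
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n"
    "学生\t名詞,一般,*,*,*,*,学生,ガクセイ,ガクセイ\n"
    "です\t助動詞,*,*,*,特殊・デス,基本形,です,デス,デス\n"
    "EOS\n"
)
pairs = parse_mecab_output(sample)
print(pairs)
```

With the mecab-python3 bindings, the output string would typically come from MeCab.Tagger().parse(text); other dictionaries may use a different field order, so the index would need adjusting.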
For choosing which dictionary to use, check out this article.
MeCab has Python bindings (and probably bindings for many other programming languages).
IMPORTANT NOTE: It will NOT always produce the correct readings. There are two reasons for this:
The tokenization may be incorrect
A word can have different readings depending on the context, whereas MeCab always uses a single reading for each word
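When the full kana reading is already known, as in the question, a different approach is to align the two strings with a regular expression instead of looking readings up. The sketch below is only an illustration, not what MeCab does: every run of kanji becomes a capture group, every kana character becomes a literal, and the regex engine's backtracking finds a split. The function names and the kanji range check are assumptions, and the first alignment found is not guaranteed to be the right one when several are possible.

```python
import re


def is_kanji(ch):
    # Assumption: CJK Unified Ideographs plus the iteration mark is enough.
    return "\u4e00" <= ch <= "\u9fff" or ch == "々"


def kata_to_hira(text):
    # Katakana U+30A1..U+30F6 sit exactly 0x60 above their hiragana.
    return "".join(
        chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c for c in text
    )


def align_furigana(kanji, kana):
    """Return [(kanji_run, reading), ...] or None if no alignment exists."""
    pattern = ""
    runs = []
    i = 0
    while i < len(kanji):
        if is_kanji(kanji[i]):
            j = i
            while j < len(kanji) and is_kanji(kanji[j]):
                j += 1
            runs.append(kanji[i:j])
            pattern += "(.+?)"  # reading of this kanji run
            i = j
        else:
            pattern += re.escape(kata_to_hira(kanji[i]))
            i += 1
    match = re.fullmatch(pattern, kata_to_hira(kana))
    if match is None:
        return None
    return list(zip(runs, match.groups()))


pairs = align_furigana("私は学生です", "わたしはがくせいです")
print(pairs)
```

For the example sentence this pairs 私 with わたし and 学生 with がくせい. Because the whole string must match, a kanji followed by the same kana its reading ends with (for example 話し read as はなし) is still split correctly.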

Related

Is there a module or regex in Python to convert all fonts to a uniform font? (Text is coming from Twitter)

I'm working with some text from twitter, using Tweepy. All that is fine, and at the moment I'm just looking to start with some basic frequency counts for words. However, I'm running into an issue where the ability of users to use different fonts for their tweets is making it look like some words are their own unique word, when in reality they're words that have already been encountered but in a different font/font size, like in the picture below (those words are words that were counted previously and appear in the spreadsheet earlier up).
This messes up the accuracy of the counts. I'm wondering if there's a package or general solution to make all the words a uniform font/size - either while I'm tokenizing it (just by hand, not using a module) or while writing it to the csv (using the csv module). Or any other solutions for this that I may not be considering. Thanks!
You can (mostly) solve your problem by normalising your input, using unicodedata.normalize('NFKC', str).
The KC normalization form (NF is short for "normalization form") first does a "compatibility decomposition" on the text, which replaces Unicode characters which represent style variants, and then does a canonical composition on the result, so that ñ, which is converted to an n and a separate ~ diacritic by the decomposition, is then turned back into an ñ, the canonical composite for that character. (If you don't want the recomposition step, use NFKD normalisation.) See Unicode Standard Annex #15 for a more precise description, with examples.
Unicode contains a number of symbols, mostly used for mathematics, which are simply stylistic variations on some letter or digit. Or, in some cases, on several letters or digits, such as ¼ or ℆. In particular, this includes commonly-used symbols written with font variants which have particular mathematical or other meanings, such as ℒ (the Laplace transform) and ℚ (the set of rational numbers). Compatibility decomposition will strip out the stylistic information, which reduces those four examples to '1⁄4' (with U+2044 FRACTION SLASH rather than an ASCII slash), 'c/u', 'L' and 'Q', respectively.
The first published Unicode standard defined a Letterlike Symbols block in the Basic Multilingual Plane (BMP). (All of the above examples are drawn from that block.) In Unicode 3.1, complete Latin and Greek alphabets and digits were added in the Mathematical Alphanumeric Symbols block, which includes 13 different font variants of the 52 upper- and lower-case letters of the roman alphabet, 58 greek letters in five font variants (some of which could pass for roman letters, such as 𝝪 which is upsilon, not capital Y), and the 10 digits in five variants (𝟎 𝟙 𝟤 𝟯 𝟺). And a few loose characters which mathematicians apparently asked for.
None of these should be used outside of mathematical typography, but that's not a constraint which most users of social networks care about. So people compensate for the lack of styled text in Twitter (and elsewhere) by using these Unicode characters, despite the fact that they are not properly rendered on all devices, make life difficult for screen readers, cannot readily be searched, and all the other disadvantages of using hacked typography, such as the issue you are running into. (Some of the rendering problems are also visible in your screenshot.)
Compatibility decomposition can go a long way in resolving the problem, but it also tends to erase information which is really useful. For example, x² and H₂O become just x2 and H2O, which might or might not be what you wanted. But it's probably the best you can do.
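A minimal sketch of what this looks like for word counts, using only the standard library. The token list is invented for illustration, and the styled tokens are written as escapes so that the code does not depend on how the characters render.

```python
import unicodedata
from collections import Counter


def normalize_token(token):
    # NFKC folds font variants (bold, fullwidth, ...) into plain letters.
    return unicodedata.normalize("NFKC", token)


tokens = [
    "hello",
    "\U0001D421\U0001D41E\U0001D425\U0001D425\U0001D428",  # mathematical bold
    "\uFF48\uFF45\uFF4C\uFF4C\uFF4F",                      # fullwidth
]
counts = Counter(normalize_token(t) for t in tokens)
print(counts)

# The information loss mentioned above: superscripts and subscripts vanish.
squared = unicodedata.normalize("NFKC", "x\u00b2")  # x followed by superscript 2
water = unicodedata.normalize("NFKC", "H\u2082O")   # H, subscript 2, O
print(squared, water)
```

All three tokens end up under the single key hello. Whether to also lowercase or casefold the result is a separate decision that depends on how the counts are used.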

How to use Unicode::Normalize to create most compatible windows-1252 encoded string?

I have a legacy app in Perl that processes XML, most likely encoded in UTF-8, and needs to store some data from that XML in a database which uses windows-1252 for historical reasons. Yes, this setup can't support all possible characters of the Unicode standard, but in practice I don't need to anyway and can try to be reasonably compatible.
The specific problem currently is a file containing LATIN SMALL LETTER U, COMBINING DIAERESIS (U+0075 U+0308), which makes Perl break the existing encoding of the Unicode string to windows-1252 with the following exception:
"\x{0308}" does not map to cp1252
I was able to work around that problem using Unicode::Normalize::NFKC, which creates the character U+00FC (ü), which maps to windows-1252 without problems. That led to another problem of course, e.g. in the case of the character VULGAR FRACTION ONE HALF (½, U+00BD), because NFKC creates DIGIT ONE, FRACTION SLASH, DIGIT TWO (1⁄2, U+0031 U+2044 U+0032) for that and Perl dies again:
"\x{2044}" does not map to cp1252
According to normalization rules, this is perfectly fine for NFKC. I used that because I thought it would give me the most compatible result, but that was wrong. Using NFC instead fixed both problems, as both characters provide a normalization compatible with windows-1252 in that case.
This approach becomes even more problematic for characters for which a windows-1252-compatible normalization exists, just not the one NFC produces. One example is LATIN SMALL LIGATURE FI (ﬁ, U+FB01). According to its normalization rules, its representation after NFC is incompatible with windows-1252, while NFKC this time results in two characters compatible with windows-1252: fi (U+0066 U+0069).
My current approach is to simply try encoding as windows-1252 as is; if that fails I use NFC and try again, if that fails I use NFKC and try again, and if that fails I give up for now. This works in the cases I'm currently dealing with, but obviously fails if all three characters of my examples above are present in a string at the same time. There's always one character then which results in windows-1252-incompatible output, regardless of the order of NFC and NFKC. The only question is which character breaks when.
BUT the important point is that each character by itself could be normalized to something being compatible with windows-1252. It only seems that there's no one-shot-solution.
So, is there some API I'm missing, which already converts in the most backwards compatible way?
If not, what's the approach I would need to implement myself to support all the above characters within one string?
Sounds like I would need to process each string Unicode character by Unicode character, normalize each one individually with whatever is most compatible with windows-1252, and then concatenate the results again. Is there some incremental Unicode-character parser available which already deals with combining characters and the like? Does a simple Unicode-character based regular expression handle this already?
Unicode::Normalize provides additional functions to work on partial strings and such, but I must admit that I currently don't fully understand their purpose. The examples focus on concatenation as well, but from my understanding I first need some parsing to be able to normalize individual characters differently.
I don't think you're missing an API because a best-effort approach is rather involved. I'd try something like the following:
Normalize using NFC. This combines decomposed sequences like LATIN SMALL LETTER U, COMBINING DIAERESIS.
Extract all codepoints which aren't combining marks using the regex /\PM/g. This throws away all combining marks remaining after NFC conversion which can't be converted to Windows-1252 anyway. Then for each code point:
If the codepoint can be converted to Windows-1252, do so.
Otherwise try to normalize the codepoint with NFKC. If the NFKC mapping differs from the input, apply all steps recursively on the resulting string. This handles things like ligatures.
As a bonus: If the codepoint is invariant under NFKC, convert to NFD and try to convert the first codepoint of the result to Windows-1252. This converts characters like Ĝ to G.
Otherwise ignore the character.
There are of course other approaches that convert unsupported characters to ones that look similar, but they require creating the mappings manually.
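The steps above can be sketched as follows. This is a Python illustration of the idea rather than a drop-in replacement for the Perl code; the function names are invented, and Python's built-in cp1252 codec stands in for Perl's Encode.

```python
import unicodedata


def _encodable(text):
    try:
        text.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


def _convert_char(ch):
    if _encodable(ch):
        return ch
    compat = unicodedata.normalize("NFKC", ch)
    if compat != ch:
        # e.g. a ligature: run all steps again on the expansion
        return to_cp1252_text(compat)
    # Bonus step: keep only the base letter of an accented character.
    base = unicodedata.normalize("NFD", ch)[0]
    return base if _encodable(base) else ""


def to_cp1252_text(text):
    """Best-effort rewrite of text using only cp1252-encodable characters."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if unicodedata.category(ch).startswith("M"):
            continue  # combining mark left over after NFC: drop it
        out.append(_convert_char(ch))
    return "".join(out)


# u + combining diaeresis, one half, fi ligature, G with circumflex
converted = to_cp1252_text("u\u0308\u00bd\ufb01\u011c")
encoded = converted.encode("cp1252")
print(converted, encoded)
```

All three characters from the question survive in one string here: the decomposed ü is composed, ½ is kept as is because it encodes directly, and the ligature is expanded to fi. Characters with no usable mapping are silently dropped, which may or may not be acceptable for the data in question.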
Since it seems that you can convert individual characters as needed (to cp-1252 encoding), one way is to process character by character, as proposed, once a word fails the procedure.
The \X in Perl's regex matches a logical Unicode character, an extended grapheme cluster, either as a single codepoint or a sequence. So if you indeed can convert all individual (logical) characters into the desired encoding, then with
while ($word =~ /(\X)/g) { ... }
you can access the logical characters and apply your working procedure to each.
In case you can't handle all logical characters that may come up, piece together an equivalent of \X using specific character properties, for finer granularity with combining marks or such (like /((.)\p{Mn}?)/, or \p{Nonspacing_Mark}). The full, grand, list is in perluniprops.
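For comparison, Python's built-in re module has no \X, so the same per-character walk has to be approximated by hand. The sketch below groups each base character with the combining marks that follow it, which is only a rough stand-in for a real extended grapheme cluster (it ignores things like emoji sequences and Hangul jamo); the function name is an assumption.

```python
import unicodedata


def rough_clusters(text):
    """Split text into base characters plus their trailing combining marks."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)
    return clusters


parts = rough_clusters("u\u0308ber")
print(parts)
```

Each element of the result can then be handed to whatever per-character conversion already works.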

Case-insensitive string comparison in Julia

I'm sure this has a simple answer, but how does one compare two strings and ignore case in Julia? I've hacked together a rather inelegant solution:
function case_insensitive_match(a::S, b::S) where {S<:AbstractString}
    lowercase(a) == lowercase(b)
end
There must be a better way!
Efficiency Issues
The method that you have selected will indeed work well in most settings. If you are looking for something more efficient, you're not apt to find it. The reason is that capital vs. lowercase letters are stored with different bit encoding. Thus it isn't as if there is just some capitalization field of a character object that you can ignore when comparing characters in strings. Fortunately, the difference in bits between capital vs. lowercase is very small, and thus the conversions are simple and efficient. See this SO post for background on this:
How do uppercase and lowercase letters differ by only one bit?
Accuracy Issues
In most settings, the method that you have will work accurately. But if you encounter characters such as capital vs. lowercase Greek letters, it could fail. For that, you would be better off with the normalize function (see docs for details) with the casefold option:
normalize("ad", casefold=true)
See this SO post in the context of Python which addresses the pertinent issues here and thus need not be repeated:
How do I do a case-insensitive string comparison?
Since it's talking about the underlying issues with utf encoding, it is applicable to Julia as well as Python.
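To make the difference concrete, here is a small Python sketch (Python rather than Julia, since that is what the linked discussion uses). It contrasts plain lowercasing with case folding; the helper name caseless_equal is an assumption, and the NFC step is an extra precaution that is not strictly part of case folding.

```python
import unicodedata


def caseless_equal(a, b):
    def fold(s):
        # casefold() is more aggressive than lower(); NFC tidies the result
        return unicodedata.normalize("NFC", s.casefold())
    return fold(a) == fold(b)


# German sharp s: lower() keeps it, casefold() expands it to "ss".
lower_match = "Stra\u00dfe".lower() == "STRASSE".lower()
fold_match = caseless_equal("Stra\u00dfe", "STRASSE")

# Greek final sigma vs. capital sigma.
sigma_lower = "\u03c2".lower() == "\u03a3".lower()
sigma_fold = caseless_equal("\u03c2", "\u03a3")

print(lower_match, fold_match, sigma_lower, sigma_fold)
```

In both pairs the lowercase comparison reports a mismatch while the case-folded comparison reports a match.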
See also this Julia Github discussion for additional background and specific examples of places where lowercase() can fail:
https://github.com/JuliaLang/julia/issues/7848

When to use Unicode Normalization Forms NFC and NFD?

The Unicode Normalization FAQ includes the following paragraph:
Programs should always compare canonical-equivalent Unicode strings as equal ... The Unicode Standard provides well-defined normalization forms that can be used for this: NFC and NFD.
and continues...
The choice of which to use depends on the particular program or system. NFC is the best form for general text, since it is more compatible with strings converted from legacy encodings. ... NFD and NFKD are most useful for internal processing.
My questions are:
What makes NFC best for "general text"? What defines "internal processing" and why is it best left to NFD? And finally, never minding what is "best," are the two forms interchangeable as long as two strings are compared using the same normalization form?
The FAQ is somewhat misleading, starting from its use of “should” followed by the inconsistent use of “requirement” about the same thing. The Unicode Standard itself (cited in the FAQ) is more accurate. Basically, you should not expect programs to treat canonically equivalent strings as different, but neither should you expect all programs to treat them as identical.
In practice, it really depends on what your software needs to do. In most situations, you don’t need to normalize at all, and normalization may destroy essential information in the data.
For example, U+0387 GREEK ANO TELEIA (·) is defined as canonical equivalent to U+00B7 MIDDLE DOT (·). This was a mistake, as the characters are really distinct and should be rendered differently and treated differently in processing. But it’s too late to change that, since this part of Unicode has been carved into stone. Consequently, if you convert data to NFC or otherwise discard differences between canonically equivalent strings, you risk getting wrong characters.
There are risks that you take by not normalizing. For example, the letter “ä” can appear as a single Unicode character U+00E4 LATIN SMALL LETTER A WITH DIAERESIS or as two Unicode characters U+0061 LATIN SMALL LETTER A U+0308 COMBINING DIAERESIS. It will mostly be the former, i.e. the precomposed form, but if it is the latter and your code tests for data containing “ä”, using the precomposed form only, then it will not detect the latter. But in many cases, you don’t do such things but simply store the data, concatenate strings, print them, etc. Then there is a risk that the two representations result in somewhat different renderings.
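A short Python sketch of that risk, using only the standard library; the sample word is made up for illustration.

```python
import unicodedata

precomposed = "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # a + COMBINING DIAERESIS

# The two spellings are canonically equivalent but not equal as strings.
raw_equal = precomposed == decomposed

# A naive substring test misses the decomposed spelling.
text = "M" + decomposed + "dchen"
found_raw = precomposed in text
found_nfc = precomposed in unicodedata.normalize("NFC", text)

# After normalizing both sides to the same form, they compare equal.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", precomposed)
             == unicodedata.normalize("NFD", decomposed))

print(raw_equal, found_raw, found_nfc, nfc_equal, nfd_equal)
```

Either form works for the comparison, as long as both strings go through the same one.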
It also matters whether your software passes character data to other software somehow. The recipient might expect, due to naive implicit assumptions or consciously and in a documented manner, that its input is normalized.
NFC is the general common sense form that you should use, ä is 1 code point there and that makes sense.
NFD is good for certain internal processing: if you want to do accent-insensitive searching or sorting, having your string in NFD makes it much easier and faster. Another use is making more robust slug titles. These are just the most obvious ones; I am sure there are plenty more.
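As an illustration of why NFD helps here, a minimal sketch: once the text is decomposed, the accents are separate code points and can simply be filtered out. The function names and the very simple slug rule are assumptions for the example, not a complete slug algorithm.

```python
import unicodedata


def strip_accents(text):
    decomposed = unicodedata.normalize("NFD", text)
    # Mn = nonspacing mark, which is where the decomposed accents end up
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def simple_slug(title):
    return "-".join(strip_accents(title).lower().split())


plain = strip_accents("Cr\u00e8me Br\u00fbl\u00e9e")
slug = simple_slug("Cr\u00e8me Br\u00fbl\u00e9e")
print(plain, slug)
```

An accent-insensitive search can then compare strip_accents(a) with strip_accents(b) instead of the raw strings.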
If two strings x and y are canonical equivalents, then
toNFC(x) = toNFC(y)
toNFD(x) = toNFD(y)
Is that what you meant?

How do I determine if a random string sounds like English?

I have an algorithm that generates strings based on a list of input words. How do I keep only the strings that sound like English words? I.e., discard RDLO while keeping LORD.
EDIT: To clarify, they do not need to be actual words in the dictionary. They just need to sound like English. For example KEAL would be accepted.
You can build a Markov chain from a huge English text.
Afterwards you can feed words into the Markov chain and check how high the probability is that the word is English.
See here: http://en.wikipedia.org/wiki/Markov_chain
At the bottom of the page you can see the Markov text generator. What you want is exactly the reverse of it.
In a nutshell: for each character, the Markov chain stores the probabilities of which character will follow next. You can extend this idea to two or three characters if you have enough memory.
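A minimal sketch of that idea with a first-order chain over characters. The tiny training list is only a placeholder for a real corpus, and the start/end markers, the add-one smoothing and the function names are assumptions made for this example.

```python
import math
from collections import defaultdict

START, END = "^", "$"


def train(words):
    """Count how often each symbol follows each other symbol."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in words:
        symbols = [START] + list(word.lower()) + [END]
        for a, b in zip(symbols, symbols[1:]):
            counts[a][b] += 1
    return counts


def score(word, counts, alphabet_size=27):
    """Average log-probability per transition, with add-one smoothing.

    alphabet_size is 26 letters plus the end marker (assumption).
    """
    symbols = [START] + list(word.lower()) + [END]
    total = 0.0
    for a, b in zip(symbols, symbols[1:]):
        row = counts.get(a, {})
        seen = row.get(b, 0)
        row_total = sum(row.values())
        total += math.log((seen + 1) / (row_total + alphabet_size))
    return total / (len(symbols) - 1)


training_words = ["lord", "word", "ford", "cord", "order", "border", "load",
                  "road", "low", "lot", "long", "lost", "old", "gold", "hold"]
model = train(training_words)
lord_score = score("LORD", model)
rdlo_score = score("RDLO", model)
print(lord_score, rdlo_score)
```

LORD scores higher than RDLO with this toy model; a usable cutoff would have to be tuned against a much larger corpus.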
The easy way with Bayesian filters (Python example from http://sebsauvage.net/python/snyppets/#bayesian)
from reverend.thomas import Bayes
guesser = Bayes()
guesser.train('french','La souris est rentrée dans son trou.')
guesser.train('english','my tailor is rich.')
guesser.train('french','Je ne sais pas si je viendrai demain.')
guesser.train('english','I do not plan to update my website soon.')
>>> print guesser.guess('Jumping out of cliffs it not a good idea.')
[('english', 0.99990000000000001), ('french', 9.9999999999988987e-005)]
>>> print guesser.guess('Demain il fera très probablement chaud.')
[('french', 0.99990000000000001), ('english', 9.9999999999988987e-005)]
You could approach this by tokenizing a candidate string into bigrams (pairs of adjacent letters) and checking each bigram against a table of English bigram frequencies.
Simple: if any bigram is sufficiently low on the frequency table (or outright absent), reject the string as implausible. (String contains a "QZ" bigram? Reject!)
Less simple: calculate the overall plausibility of the whole string in terms of, say, a product of the frequencies of each bigram divided by the mean frequency of a valid English string of that length. This would allow you to both (a) accept a string with an odd low-frequency bigram among otherwise high-frequency bigrams, and (b) reject a string with several individual low-but-not-quite-below-the-threshold bigrams.
Either of those would require some tuning of the threshold(s), the second technique more so than the first.
Doing the same thing with trigrams would likely be more robust, though it'll also likely lead to a somewhat more strict set of "valid" strings. Whether that's a win or not depends on your application.
Bigram and trigram tables based on existing research corpora may be available for free or purchase (I didn't find any freely available but only did a cursory google so far), but you can calculate a bigram or trigram table yourself from any good-sized corpus of English text. Just crank through each word as a token and tally up each bigram; you might handle this as a hash with a given bigram as the key and an incremented integer counter as the value.
English morphology and English phonetics are (famously!) less than isometric, so this technique might well generate strings that "look" English but present troublesome pronunciations. This is another argument for trigrams rather than bigrams: the weirdness produced by analysis of sounds that use several letters in sequence to produce a given phoneme will be reduced if the n-gram spans the whole sound. (Think "plough" or "tsunami", for example.)
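The tallying and the "simple" rejection rule described above might look like this in Python. The seven-word corpus is a stand-in for a real one, and the names bigram_counts, plausible and min_count are assumptions for the sketch.

```python
from collections import Counter


def bigram_counts(corpus_words):
    """Tally every pair of adjacent letters across the corpus."""
    table = Counter()
    for word in corpus_words:
        w = word.upper()
        table.update(w[i:i + 2] for i in range(len(w) - 1))
    return table


def plausible(candidate, table, min_count=1):
    """Reject the candidate if any of its bigrams is too rare."""
    c = candidate.upper()
    return all(table[c[i:i + 2]] >= min_count for i in range(len(c) - 1))


corpus = ["long", "order", "keel", "meal", "lake", "road", "word"]
table = bigram_counts(corpus)
results = {w: plausible(w, table) for w in ["LORD", "RDLO", "KEAL"]}
print(results)
```

With this toy table, LORD and KEAL pass while RDLO is rejected because the bigram DL never occurs. The weighted whole-string score from the "less simple" variant would replace the all(...) test with a product or sum over the bigram frequencies.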
It's quite easy to generate English sounding words using a Markov chain. Going backwards is more of a challenge, however. What's the acceptable margin of error for the results? You could always have a list of common letter pairs, triples, etc, and grade them based on that.
You should research "pronounceable" password generators, since they're trying to accomplish the same task.
A Perl solution would be Crypt::PassGen, which you can train with a dictionary (so you could train it to various languages if you need to). It walks through the dictionary and collects statistics on 1, 2, and 3-letter sequences, then builds new "words" based on relative frequencies.
I'd be tempted to run the soundex algorithm over a dictionary of English words and cache the results, then soundex your candidate string and match against the cache.
Depending on performance requirements, you could work out a distance algorithm for soundex codes and accept strings within a certain tolerance.
Soundex is very easy to implement - see Wikipedia for a description of the algorithm.
An example implementation of what you want to do would be:
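def soundex(name, length=4):
    # Soundex digit for each letter A..Z; '0' marks letters that get dropped.
    digits = '01230120022455012623010202'
    sndx = ''
    fc = ''  # first letter of the name, kept verbatim
    for c in name.upper():
        if 'A' <= c <= 'Z':  # only ASCII letters have an entry in the table
            if not fc:
                fc = c
            d = digits[ord(c) - ord('A')]
            # Skip a digit that merely repeats the previous one.
            if not sndx or d != sndx[-1]:
                sndx += d
    sndx = fc + sndx[1:]
    sndx = sndx.replace('0', '')
    return (sndx + length * '0')[:length]
real_words = load_english_dictionary()
soundex_cache = {soundex(word) for word in real_words}
if soundex(candidate) in soundex_cache:
    print("keep")
else:
    print("discard")
Obviously you'll need to provide an implementation of load_english_dictionary and set candidate to the string being tested.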
EDIT: Your example of "KEAL" will be fine, since it has the same soundex code (K400) as "KEEL". You may need to log rejected words and manually verify them if you want to get an idea of failure rate.
Metaphone and Double Metaphone are similar to SOUNDEX, except they may be tuned more toward your goal than SOUNDEX. They're designed to "hash" words based on their phonetic "sound", and are good at doing this for the English language (but not so much other languages and proper names).
One thing to keep in mind with all three algorithms is that they're extremely sensitive to the first letter of your word. For example, if you're trying to figure out if KEAL is English-sounding, you won't find a match to REAL because the initial letters are different.
Do they have to be real English words, or just strings that look like they could be English words?
If they just need to look like possible English words you could do some statistical analysis on some real English texts and work out which combinations of letters occur frequently. Once you've done that you can throw out strings that are too improbable, although some of them may be real words.
Or you could just use a dictionary and reject words that aren't in it (with some allowances for plurals and other variations).
You could compare them to a dictionary (freely available on the internet), but that may be costly in terms of CPU usage. Other than that, I don't know of any other programmatic way to do it.
That sounds like quite an involved task! Off the top of my head, a consonant phoneme needs a vowel either before or after it. Determining what a phoneme is will be quite hard though! You'll probably need to manually write out a list of them. For example, "TR" is ok but not "TD", etc.
I would probably evaluate each word using a SOUNDEX algorithm against a database of English words. If you're doing this on a SQL server it should be pretty easy to set up a database containing a list of most English words (using a freely available dictionary), and MSSQL server has SOUNDEX implemented as an available search algorithm.
Obviously you can implement this yourself if you want, in any language - but it might be quite a task.
This way you'd get an evaluation of how much each word sounds like an existing English word, if any, and you could set up some limits for how low you'd want to accept results. You'd probably want to consider how to combine results for multiple words, and you would probably tweak the acceptance limits based on testing.
I'd suggest looking at the phi test and index of coincidence. http://www.threaded.com/cryptography2.htm
I'd suggest a few simple rules and standard pairs and triplets would be good.
For example, English-sounding words tend to follow the pattern of vowel-consonant-vowel, apart from some diphthongs and standard consonant pairs (e.g. th, ie and ei, oo, tr). With a system like that you should strip out almost all words that don't sound like they could be English. You'd find on closer inspection that you will probably strip out a lot of words that do sound like English as well, but you can then start adding rules that allow for a wider range of words and 'train' your algorithm manually.
You won't remove all false negatives (e.g. I don't think you could manage to come up with a rule to include 'rhythm' without explicitly coding in that rhythm is a word) but it will provide a method of filtering.
I'm also assuming that you want strings that could be english words (they sound reasonable when pronounced) rather than strings that are definitely words with an english meaning.

Resources