The code doesn't need to be exactly correct: an English pronunciation can stand in for a similar pronunciation in Chinese, e.g. /ʈ͡ʂ/ can be represented by "CH".
How close do you want to get?
ARPABET was designed for General American English.
You can compare the charts for consonants in the above article to a similar chart for Mandarin. It looks to me like there are some significant differences.
But I suspect a bigger issue is going to be that tonality is fundamental to vowel production in Mandarin.
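If you do want a rough approximation anyway, here is a minimal Python sketch of mapping a few Mandarin initials onto the nearest ARPABET consonants. The pinyin keys and the chosen symbols are my own illustrative picks, not any standard, and tone is ignored entirely.

    # Illustrative only: approximate a few Mandarin (pinyin) initials with the
    # nearest ARPABET consonant. These pairs are rough approximations, not a standard.
    MANDARIN_TO_ARPABET = {
        'zh': 'CH',   # /ʈ͡ʂ/   -> closest English affricate /tʃ/
        'ch': 'CH',   # /ʈ͡ʂʰ/  -> aspiration is lost
        'sh': 'SH',   # /ʂ/     -> /ʃ/
        'x':  'SH',   # /ɕ/     -> also collapsed onto /ʃ/
        'q':  'CH',   # /t͡ɕʰ/  -> /tʃ/
        'r':  'R',    # /ʐ/~/ɻ/ -> /ɹ/
    }

    def approximate(pinyin_initial: str) -> str:
        """Return a rough ARPABET stand-in, or the input unchanged if unknown."""
        return MANDARIN_TO_ARPABET.get(pinyin_initial, pinyin_initial)

    print(approximate('zh'))  # CH

Note how the retroflex/palatal distinctions collapse onto the same English symbols; that loss of detail is exactly the kind of difference the consonant charts show.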
What I want to do is create an API that translates human speech into IPA (International Phonetic Alphabet) notation. My question is: where can I find resources on decoding speech at the level of the raw audio waveform? I looked for an API, but most of what I found just transcribes straight to the Roman alphabet. I'm looking to create something a little more accurate in its ability to distinguish vocal phonetics.
I would just like to start out by saying that this project is much more difficult and complicated than you might think. Speech-to-text processing is a very large and complicated field with a huge amount of research behind it. The reason most parsers go straight to Roman characters is that most of their processing is a probabilistic matching of vague sounds, in the context of other vague sounds, to guess which words make sense together. You are much more likely to find something that gives you Soundex than IPA. That said, this problem has been approached on several fronts. Your best bet is probably the Sphinx project from CMU.
http://cmusphinx.sourceforge.net/wiki/start
That will give you a good start, but you are assuming that speech-to-text processing is a lot more developed than it actually is; there is no simple way of translating speech to IPA from the waveform with any kind of accuracy. Sphinx is very modular and completely open source, so it puts a huge amount of power at your fingertips, and at that point whether you can figure out how to make this work is up to you. But again: this is not a solved problem in any way.
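If you do go the Sphinx route, here is a rough sketch of phone-level decoding followed by a crude ARPABET-to-IPA lookup. It assumes the pocketsphinx Python bindings with their bundled en-us acoustic model and phone language model; the file names, decoder options, and the API itself vary between versions, so treat this as a sketch rather than working code.

    # Sketch: phone-level recognition with CMU Sphinx, then ARPABET -> IPA lookup.
    # Assumes the pocketsphinx-python bindings and their bundled en-us models;
    # model file names and decoder options differ across versions.
    import os
    from pocketsphinx import Pocketsphinx, get_model_path

    model_path = get_model_path()
    ps = Pocketsphinx(
        hmm=os.path.join(model_path, 'en-us'),                    # acoustic model
        allphone=os.path.join(model_path, 'en-us-phone.lm.bin'),  # phone-level LM
        lw=2.0, beam=1e-20, pbeam=1e-20,
    )
    ps.decode(audio_file='speech.wav')  # expects 16 kHz, 16-bit mono PCM

    # Partial ARPABET -> IPA table (illustrative; extend to the full phone set).
    ARPABET_TO_IPA = {
        'AA': 'ɑ', 'IY': 'i', 'UW': 'u', 'EH': 'ɛ', 'AH': 'ʌ',
        'CH': 't͡ʃ', 'SH': 'ʃ', 'TH': 'θ', 'DH': 'ð', 'NG': 'ŋ', 'HH': 'h',
    }

    phones = [p for p in ps.segments() if p != 'SIL']   # e.g. ['HH', 'AH', 'L', 'OW']
    print(''.join(ARPABET_TO_IPA.get(p, p.lower()) for p in phones))

Even with this, you only get ARPABET-style phones for General American English mapped onto IPA symbols; it says nothing about how accurately the acoustic model distinguishes them.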
What if I have a captcha that displays a series of English characters. Will people who don't speak English have trouble interpreting and/or typing these characters? If this is the case then what is the best solution for an internationalized captcha?
Since 99% of URLs are regular ASCII, I don't think you will have a problem... after all, how would they get to Google or Yahoo if they couldn't type the URL?
That said, I have on occasion run across Chinese characters used in captchas.
Image-based CAPTCHA has two main advantages over text-based CAPTCHA:
International
Harder to solve algorithmically (see PWNtcha - captcha decoder)
There are several flavors, such as:
Classification: see Captcha The Dog, KittenAuth, Microsoft Asirra
3D projection: see 3D images: A human way to create Captchas and 3D-based Captchas become reality
Detection: see Image-Based CAPTCHA from Confident Technologies and Pic-Capture
Rotation: see A Dynamic, User-Friendly Captcha With Pictures
Puzzle: see Key Captcha
It would be a problem for users on a native, non-Latin keyboard layout, for example Russians and Greeks. They would be forced to switch keyboard layouts just to fill in the security question.
Another issue is the ability to even recognize the words: somebody who doesn't speak English could have real trouble getting the word right. Even I sometimes do (for less common words), although I am quite proficient...
In other words, don't make this mistake; your application should be easy to use for all users.
It's definitely a concern. Dictionary-based CAPTCHAs should ideally adapt to the user's language preferences and ask them to recognize words that match their language preferences and by extension the character set they are most familiar with.
But in the absence of such internationalization, I would say that numerals and mathematical expressions are the most universal solution, and for word-based CAPTCHAs a random series of ASCII characters (which being random would be culture-neutral) would be the most accessible as pretty much any user around the world has the ability to enter these characters even if some have to switch their input method.
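As a tiny sketch of that numeric approach, the following generates a simple arithmetic challenge and checks the answer. The operand range and the plain-text rendering are my own choices; a real deployment would render the expression as a distorted image rather than plain text, otherwise a bot can trivially evaluate it.

    # Minimal sketch of a culture-neutral arithmetic challenge.
    # In practice the expression would be rendered as a distorted image,
    # not sent as plain text.
    import random

    def make_challenge():
        a, b = random.randint(1, 9), random.randint(1, 9)
        return f"{a} + {b} = ?", a + b

    question, answer = make_challenge()
    print(question)                 # e.g. "3 + 7 = ?"
    user_input = "10"               # whatever the user typed
    print(int(user_input) == answer)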
Now where it really gets tricky is providing accessibility alternatives for visually impaired users. Making a universal audio CAPTCHA seems pretty much impossible (you could consider a set of universally recognized sounds instead of spoken words, but I doubt this would provide sufficient security). And internationalized (multilingual) spoken word generation is far from trivial.
No, because English captchas are ASCII, and ASCII is always available, even if people have a Japanese, Chinese, or Russian keyboard. So this should not be a problem! And image-based captchas only require the person to read the letters, which should be possible for anybody on the web who can see, as SQLMenace pointed out.
The other way around is a problem though.
Google's reCaptcha has a little icon where the user can get a different captcha if for some reason the captcha is not readable or contains foreign characters.
I would recommend that you use Google's reCaptcha, rather than implementing it yourself.
Added Benefit:
Google's reCaptcha is also available in other languages, by the way (http://www.google.com/recaptcha/faq), which makes it possible for you to internationalize the captcha for the user's default locale.
EDIT:
There is a workaround to get Google's reCaptcha working with Flash!
Check here:
http://groups.google.com/group/recaptcha/browse_thread/thread/e22d7e3c91bcc9db
Sure they are a problem. Would a Russian captcha be a problem for you? What about a Chinese one?
The URLs are indeed ASCII, but that is only relevant for geeks.
Regular people go to Google, type some text in their own language, and then click on one of the results. They never get to type a URL.
Yes, this could represent a problem to a small percentage of users. Is it a large enough problem to take into consideration when building the UI for your site to better the UX? That's up to you. If it were up to me, probably not.
To help you in the right direction though, I would use Google's reCAPTCHA. It serves a great cause and works like a charm. There's also a great API where you can customize the language it displays. You could use PHP to detect the user's country and write some code to change the settings so it displays in their native language.
Here's a sample of changing reCAPTCHA's language; "fr" is French:
<script type="text/javascript">
  var RecaptchaOptions = {
    lang : 'fr'
  };
</script>
Google reCAPTCHA's API:
http://code.google.com/apis/recaptcha/docs/customization.html#i18n
I believe that the 26 letters of the English alphabet are familiar to the vast majority of users around the world. We have Chinese, Japanese, Cyrillic and Arabic users, but all of them have the ability to switch to an English keyboard layout within their operating systems.
English has no diacritics, which makes everything a lot easier and makes the system more easily adaptable all over the world. Everyone can type ASCII, and they are still able to switch back to their own region- or language-specific characters.
Are there any good NLP or statistical techniques for detecting garbled characters in OCR-ed text? Off the top of my head I was thinking that looking at the distribution of n-grams in text might be a good starting point but I'm pretty new to the whole NLP domain.
Here is what I've looked at so far:
N-gram Statistics in English and Chinese: Similarities and Differences
Statistical Distributions of English Text
The text will mostly be in English, but a general solution would be nice. The text is currently indexed in Lucene, so any ideas on a term-based approach would be useful too.
Any suggestions would be great! Thanks!
Yes, the most powerful tool in this case is n-grams. You should collect them from a related text corpus (on the same topic as your OCR texts). This problem is very similar to spell checking: if a small character change leads to a large increase in probability, it was probably a mistake. Check this tutorial on how to use n-grams for spell checking.
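Here is a rough sketch of the idea in Python, assuming you have a clean reference corpus to train on; the padding, add-one smoothing, threshold, and tokenization are arbitrary choices for illustration, and the corpus file name is a placeholder.

    # Sketch: flag likely-garbled OCR tokens by their character-trigram probability
    # under a model trained on clean text. Threshold and smoothing are arbitrary.
    import math
    from collections import Counter

    def trigrams(word):
        padded = f"  {word.lower()} "          # pad so word edges get trigrams too
        return [padded[i:i + 3] for i in range(len(padded) - 2)]

    def train(clean_text):
        counts = Counter(t for w in clean_text.split() for t in trigrams(w))
        return counts, sum(counts.values())

    def avg_log_prob(word, counts, total):
        # add-one smoothing; average log-probability per trigram
        lp = [math.log((counts[t] + 1) / (total + len(counts))) for t in trigrams(word)]
        return sum(lp) / len(lp)

    counts, total = train(open('clean_corpus.txt', encoding='utf-8').read())
    for token in ['statistical', 'st4t1stlcal', 'rn0del']:
        score = avg_log_prob(token, counts, total)
        print(token, round(score, 2), 'GARBLED?' if score < -9.0 else 'ok')

The cutoff would need tuning against your own data; tokens full of trigrams never seen in clean text score far lower than ordinary words.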
I used n-grams for this some years ago, with pretty decent results. I used Apache Nutch's language detector, which uses word and intra-word n-grams internally. The "n-gram profile" of your text is then compared to the n-gram profiles of the training material. Nutch gives a score/confidence value in addition to the language, and I used hard cutoffs based on the language (it should be the one the docs are in) and the score. This kept most of the garbled text out, but it's somewhat computationally costly.
Is it possible to programmatically convert regular English into English haiku? Or is this something too complicated to contemplate? I have a feeling that this is a lot more involved than a Pig Latin formatter.
Must count syllables
Need nature references
Haiku's not easy
Pig Latin is text substitution. Haiku is poetry.
Find a regular expression to convert prose to poetry and you'll be rich.
Pig Latin is easy.
Haiku is much different.
Syllables, not words.
You'd first need to find a way to count the number of syllables in a given word; take a look at the answers to Detecting Syllables in a Word.
Keep in mind that the top-voted answer references an entire thesis, so this is definitely more involved than a Pig Latin formatter.
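As a starting point, here is a small sketch of syllable counting using the CMU Pronouncing Dictionary via NLTK (which has to be installed, with its cmudict corpus downloaded); words missing from the dictionary fall back to a crude vowel-group heuristic, so the counts are only approximate.

    # Sketch: count syllables with the CMU Pronouncing Dictionary (via NLTK),
    # falling back to a crude vowel-group heuristic for out-of-vocabulary words.
    # Requires: pip install nltk; then nltk.download('cmudict')
    import re
    from nltk.corpus import cmudict

    PRONUNCIATIONS = cmudict.dict()

    def count_syllables(word):
        word = word.lower()
        if word in PRONUNCIATIONS:
            # vowel phones in cmudict carry a stress digit (AH0, EH1, ...): count them
            return min(sum(ph[-1].isdigit() for ph in pron)
                       for pron in PRONUNCIATIONS[word])
        # fallback: runs of vowel letters, very rough
        return max(1, len(re.findall(r'[aeiouy]+', word)))

    line = "an old silent pond"
    print(sum(count_syllables(w) for w in line.split()))  # 5, with luck

Getting from syllable counts to anything resembling poetry (line breaks that respect phrases, the nature references) is the genuinely hard part.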
I have a program that reads a bunch of text and analyzes it. The text may be in any language, but I need to test for Japanese and Chinese specifically in order to analyze them in a different way.
I have read that I can test each character's Unicode code point to find out whether it is in the range of CJK characters. This is helpful; however, I would like to separate the two, if possible, in order to process the text against different dictionaries. Is there a way to test whether a character is Japanese OR Chinese?
You won't be able to test a single character and tell with certainty that it is Japanese or Chinese, because of the way the Unihan code points are implemented in the Unicode standard. Basically, every Chinese character is a potential Japanese character. However, the reverse is not true. Also, there are a number of conventions that can be used to test whether a block of text is in one language or the other.
Simplifications - if the character you are testing is a PRC simplification such as 门, it is only used in mainland Chinese.
Kana - if the character is one of the many Japanese kana characters such as あいうえお, then the text block you are working with is definitely Japanese.
The problem arises from the sheer number of characters and words that the two languages share. However, if I needed a quick and dirty solution to this problem, I would check my entire block of text for kana - if the text contains kana, then I know it is Japanese. If you need to distinguish Korean as well, I would test for Hangul. Also, if you need to distinguish which type of Chinese, testing for types of simplifications would be the best approach.
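A small sketch of that quick-and-dirty check, using Unicode ranges for hiragana, katakana, and hangul; the ranges below cover only the main blocks, and the fallback to "Chinese" is, of course, just a guess.

    # Sketch: classify a block of CJK text by script-specific characters.
    # Hiragana U+3040-U+309F, Katakana U+30A0-U+30FF, Hangul syllables U+AC00-U+D7A3.
    # Main blocks only; kana or hangul found => Japanese/Korean, otherwise guess Chinese.
    def guess_cjk_language(text):
        for ch in text:
            cp = ord(ch)
            if 0x3040 <= cp <= 0x30FF:        # hiragana or katakana
                return 'Japanese'
            if 0xAC00 <= cp <= 0xD7A3:        # hangul syllables
                return 'Korean'
        return 'Chinese (probably)'

    print(guess_cjk_language('これは日本語です'))   # Japanese
    print(guess_cjk_language('这是中文'))           # Chinese (probably)
    print(guess_cjk_language('한국어입니다'))        # Korean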
The process of developing Unicode included the Han Unification. This is because a lot of the Japanese characters are derived from, or the same as, Chinese characters; similarly with Korean. There are some characters (katakana and hiragana - see chapter 12 of the Unicode standard v5.1.0) commonly used in Japanese that would indicate that the text was Japanese rather than Chinese, but I believe it would be a statistical test rather than definitive.
Check out the O'Reilly book on CJKV Information Processing (CJKV is short for Chinese, Japanese, Korean, Vietnamese; I have the CJK predecessor lurking somewhere). There's also the O'Reilly book on Unicode Explained which may be some help, though probably not for this question (I don't recall a discussion of how to identify Japanese and Chinese text).
You probably can't do that reliably. Japanese uses a lot of the same characters as Chinese. I think the best you could do is to look at a block of text. If you see any uniquely Japanese characters, then you can assume the whole block is Japanese. If not, then it's probably Chinese.
However, I'm just learning Chinese, so I'm not an expert.
Testing for characters in the katakana or hiragana ranges should be a very reliable means of determining whether or not the text is Japanese, especially if you are dealing with 'regular' user-generated text. If you are looking at legal documents or other more official fare, it might be slightly more difficult, as there will be a much greater preponderance of complex Chinese characters - but it should still be pretty reliable.
A workaround is to check the encoding before it is converted to Unicode.
There are many characters which are only (commonly) used in Japanese or only used in Chinese.
Japan and China both simplified many characters but often in different ways. You can check for Japanese Shinjitai and Simplified Chinese characters. There are many more of the latter than the former. If there are none of either then you probably have Traditional Chinese.
Of course if you're dealing with Unicode text you may find occasional rare characters or mixed languages which could throw off a heuristic so you're better off going with counting the types of characters to make a judgement.
A good way to find out which characters are common in one language and not in the others is to compare the legacy encodings against each other. You can find mappings of each to Unicode easily on the internet.
I used to have some code I wrote which did a binary search by codepoint and it was extremely fast even in JavaScript - I may have lost it in my travels though (-:
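Here is a minimal sketch of the counting approach, using tiny hand-picked character sets purely for illustration; real sets would be derived from the legacy-encoding-to-Unicode mappings mentioned above.

    # Sketch: guess Japanese vs Simplified vs Traditional Chinese by counting
    # characters that only occur in one simplification scheme. The sets below are
    # tiny illustrative samples; build real ones from legacy encoding mappings.
    SHINJITAI_ONLY = set('図円気広駅')        # Japanese simplifications
    SIMPLIFIED_ONLY = set('门图汉电书')       # PRC simplifications

    def guess_han_variant(text):
        ja = sum(ch in SHINJITAI_ONLY for ch in text)
        zh = sum(ch in SIMPLIFIED_ONLY for ch in text)
        if ja > zh:
            return 'Japanese'
        if zh > ja:
            return 'Simplified Chinese'
        return 'Traditional Chinese (or undecided)'

    print(guess_han_variant('気になる駅前の広場'))   # Japanese
    print(guess_han_variant('我们的电脑和图书馆'))   # Simplified Chinese

Counting rather than stopping at the first hit makes the heuristic more robust to the occasional rare character or mixed-language passage.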