Is it possible to programmatically convert regular English into English haiku? Or is this something too complicated to contemplate? I have a feeling that this is a lot more involved than a Pig Latin formatter.
Must count syllables
Need nature references
Haiku's not easy
Pig Latin is text substitution. Haiku is poetry.
Find a regular expression to convert prose to poetry and you'll be rich.
Pig Latin is easy.
Haiku is much different.
Syllables, not words.
You'd first need a way to count the number of syllables in a given word; take a look at the answers in Detecting Syllables in a Word.
Keep in mind that the top-voted answer references an entire thesis, so this is definitely more involved than a Pig Latin formatter.
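To give a feel for the syllable-counting step, here is a rough sketch. It is only a vowel-group heuristic that I am assuming for illustration (the names count_syllables and is_haiku are made up), not the approach from the linked thesis, and it will miscount plenty of English words.

```python
import re


def count_syllables(word: str) -> int:
    """Very rough estimate: count runs of consecutive vowels."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return 0
    count = len(re.findall(r"[aeiouy]+", word))
    # A trailing "e" is usually silent ("late"), except in "-le" ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def line_syllables(line: str) -> int:
    """Sum the estimated syllables of every whitespace-separated word."""
    return sum(count_syllables(w) for w in line.split())


def is_haiku(lines) -> bool:
    """True if the three lines have an estimated 5-7-5 syllable pattern."""
    return [line_syllables(line) for line in lines] == [5, 7, 5]
```

Checking a candidate is then just is_haiku([line1, line2, line3]). Generating a haiku from prose is the hard part; this only verifies the syllable pattern.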
The code doesn't need to be exactly correct. An English pronunciation can stand in for a similar Chinese one, e.g. /ʈ͡ʂ/ can be represented by "CH".
How close do you want to get?
ARPABET was designed for General American English.
You can compare the charts for consonants in the above article to a similar chart for Mandarin. It looks to me like there are some significant differences.
But I suspect a bigger issue is going to be that tonality is fundamental to vowel production in Mandarin.
I have a text corpus which is already aligned at sentence level by construction - it is a list of pairs of English strings and their translation in another language. I have about 10 000 strings of 5 - 20 words each and their translations. My goal is to try to build a metric of the quality of the translation - automatically of course, because I'm dealing with languages I know nothing about :)
I'd like to build a dictionary from this list of translations that would give me the (most probable) translation of each word in the source English strings into the other language. I know the dictionary will be far from perfect, but I'm hoping I can have something good enough to flag when a word is not consistently translated. For example, if my dictionary says "Store" is to be translated into French as "Magasin", then if I spot some place where "Store" is translated as "Boutique" I can suspect that something is wrong.
So I'd need to:
build a dictionary from my corpus
align the words inside the string/translation pairs
Do you have good references on how to do this? Known algorithms? I found many links about text alignment but they seem to be more at the sentence level than at the word level...
Any other suggestion on how to automatically check whether a translation is consistent would be greatly appreciated!
Thanks in advance.
A freely available (specifically, GPL-licensed) tool for word alignment is GIZA++. It trains the well-known IBM models mentioned in other answers, as well as other statistical models.
You can download it from the GIZA++ site at Google Code, and there is a brief introduction to its usage found at the GIZA++ Apertium. It boils down to this procedure:
Create your parallel corpus, sentence-aligned (you seem to have this already)
Apply the plain2snt tool included in GIZA++ to extract word lists and sentence lists in GIZA++ format
(Optional – only used for some models:) Generate word classes using the mkcls tool (also included)
Run the actual word alignment tool GIZA++. There are various optional configuration settings you can use to determine the type of model generated.
Before you can do this, you must build the tool from source code by running make. The code is written in C++ and compiles well with recent GCC versions.
A few final notes:
There is often more than one possible translation for a word; you shouldn't assume that a specific translation found in one text is necessarily wrong just because the same word is translated differently in another text;
One word may be translated into a (not necessarily contiguous) sequence of several words, and vice versa. Some words are not translated at all;
GIZA++ is a statistical tool that approximates the correct word alignment; many of the alignments it generates are questionable or incorrect.
This is a pretty standard statistical machine translation problem called 'word alignment'.
There are a bunch of EM-based models developed by researchers at IBM, which I think are the basis for most of the cooler models being developed today.
Google for 'ibm word alignment models' to find out about IBM Models 1 to 5.
This presentation - http://www.stanford.edu/class/cs224n/handouts/cs224n-lecture-05-2011-MT.pdf seems like a good place to start.
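To make the EM idea concrete, here is a minimal sketch of IBM Model 1 training, the simplest of the five models. It assumes pre-tokenized sentence pairs, leaves out the NULL source word, and uses function names of my own (train_ibm_model1, best_translation); treat it as an illustration rather than a substitute for a real tool.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """pairs: list of (source_tokens, target_tokens).

    Returns a dict t where t[(f, e)] estimates P(target word f | source word e).
    """
    target_vocab = {f for _, tgt in pairs for f in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # start from a uniform table

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) alignments
        total = defaultdict(float)  # expected count of e being aligned at all
        # E-step: spread each target word over the source words of its pair.
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: re-estimate the translation probabilities.
        t = defaultdict(float, {fe: c / total[fe[1]] for fe, c in count.items()})
    return t


def best_translation(t, source_word):
    """Most probable target word for source_word, or None if unseen."""
    candidates = {f: p for (f, e), p in t.items() if e == source_word}
    return max(candidates, key=candidates.get) if candidates else None
```

With the resulting table you could build the rough dictionary from the question by calling best_translation for every source word, then flag sentence pairs where that word's usual translation is missing. With only 10 000 short strings the estimates will be noisy for rare words.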
Are you using spaces between words? Whatever delimiter you are using, you might check out the cut command in Linux. It lets you pick out the fields between spaces or other delimiter characters.
Which searching algorithm does Adobe Reader use in its find function, such that it is able to find any pattern in a very large document within seconds?
I am not sure precisely what Adobe is using, but I would guess that it was some known fast string matching algorithm (perhaps Rabin-Karp, Boyer-Moore, or KMP) probably run in parallel across all the document pages at once. For short text strings, this should be very, very fast.
Hope this helps!
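For reference, here is a small sketch of one algorithm from that family, the Horspool simplification of Boyer-Moore. I have no idea whether Adobe uses anything like it; the function name horspool_search is mine, and the point is only to show why such searches are fast: on a mismatch the window can skip ahead by up to the full pattern length.

```python
def horspool_search(text: str, pattern: str) -> list:
    """Return the start index of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # For each character (except the pattern's last one), how far the window
    # may move so that its rightmost occurrence lines up with the window end.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

    matches = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            matches.append(pos)
        # Characters that never occur in the pattern allow a full-length skip.
        pos += shift.get(text[pos + m - 1], m)
    return matches
```

Each page's text could be searched independently with the same function, which is what makes running the search in parallel across pages straightforward.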
Are there any good NLP or statistical techniques for detecting garbled characters in OCR-ed text? Off the top of my head I was thinking that looking at the distribution of n-grams in text might be a good starting point but I'm pretty new to the whole NLP domain.
Here is what I've looked at so far:
N-gram Statistics in English and Chinese: Similarities and Differences
Statistical Distributions of English Text
The text will mostly be in English, but a general solution would be nice. The text is currently indexed in Lucene, so any ideas on a term-based approach would be useful too.
Any suggestions would be great! Thanks!
Yes, the most powerful tool in this case is n-grams. You should collect them from a related text corpus (on the same topic as your OCR texts). This problem is very similar to spellchecking: if a small character change leads to a large increase in probability, the original was probably a mistake. Check this tutorial on how to use n-grams for spellchecking.
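As a rough illustration of the n-gram idea, the sketch below trains character trigram counts on a reference text and scores a string by its average smoothed log-probability. The class name NgramScorer and the add-one smoothing are my own assumptions, and any cutoff for "garbled" would have to be tuned on your own data.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> list:
    """Lower-cased character n-grams, padded with spaces at the start."""
    padded = " " * (n - 1) + text.lower()
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramScorer:
    def __init__(self, reference_text: str, n: int = 3):
        self.n = n
        self.counts = Counter(char_ngrams(reference_text, n))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen n-grams

    def score(self, text: str) -> float:
        """Average add-one-smoothed log-probability per n-gram (higher is cleaner)."""
        grams = char_ngrams(text, self.n)
        if not grams:
            return float("-inf")
        log_prob = sum(
            math.log((self.counts[g] + 1) / (self.total + self.vocab))
            for g in grams
        )
        return log_prob / len(grams)
```

Text whose score falls well below the typical score of known-good text from the same corpus is a candidate for being garbled.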
I used n-grams for this some years ago, with pretty decent results. I used Apache Nutch's language detector, which uses word and intraword n-grams internally. The "n-gram profile" of your text is then compared to the n-gram profiles of the training material. Nutch gives a score/confidence value in addition to the language, and I used hard cutoffs based on the language (it should be the one the docs are in) and the scores. This kept most of the garbled text out, but it's somewhat computationally costly.
I have a program that reads a bunch of text and analyzes it. The text may be in any language, but I need to test for Japanese and Chinese specifically to analyze them a different way.
I have read that I can test each character on its Unicode number to find out if it is in the range of CJK characters. This is helpful; however, I would like to separate them if possible to process the text against different dictionaries. Is there a way to test if a character is Japanese OR Chinese?
You won't be able to test a single character to tell with certainty that it is Japanese or Chinese because of the way the unihan code points are implemented in the Unicode standard. Basically, every Chinese character is a potential Japanese character. However, the reverse is not true. Also, there are a number of conventions that could be used to test to see if a block of text is in one language or the other.
Simplifications - if the character you are testing is a PRC simplification such as 门, it is only used in mainland Chinese.
Kana - if the character is one of the many Japanese kana characters such as あいうえお then the text block you are working with is definitely Japanese.
The problem arises with the sheer number of characters and words that are in common. However, if I needed a quick and dirty solution to this problem, I would check my entire blocks of text for kana - if the text contains kana then I know it is Japanese. If you need to distinguish Korean as well, I would test for Hangul. Also, if you need to distinguish what type of Chinese, testing for types of simplifications would be the best approach.
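The quick and dirty check described above might look something like the sketch below. The function name guess_cjk_language is made up, and the ranges are the basic Hiragana/Katakana, Hangul and CJK Unified Ideographs blocks only (no extension blocks), so this is a block-level heuristic rather than a definitive test.

```python
def guess_cjk_language(text: str) -> str:
    """Guess 'japanese', 'korean', 'chinese' or 'unknown' for a block of text."""
    has_kana = has_hangul = has_han = False
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:      # Hiragana (3040-309F), Katakana (30A0-30FF)
            has_kana = True
        elif 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:  # Hangul syllables, jamo
            has_hangul = True
        elif 0x4E00 <= cp <= 0x9FFF:    # CJK Unified Ideographs (shared Han characters)
            has_han = True

    if has_kana:
        return "japanese"   # kana is taken as decisive, as suggested above
    if has_hangul:
        return "korean"
    if has_han:
        return "chinese"    # Han characters only: most likely Chinese
    return "unknown"
```

A short Japanese string written entirely in kanji would be reported as Chinese, which is why this works better on whole blocks of text than on single words.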
The process of developing Unicode included the Han Unification. This is because a lot of the Japanese characters are derived from, or the same as, Chinese characters; similarly with Korean. There are some characters (katakana and hiragana - see chapter 12 of the Unicode standard v5.1.0) commonly used in Japanese that would indicate that the text was Japanese rather than Chinese, but I believe it would be a statistical test rather than definitive.
Check out the O'Reilly book on CJKV Information Processing (CJKV is short for Chinese, Japanese, Korean, Vietnamese; I have the CJK predecessor lurking somewhere). There's also the O'Reilly book on Unicode Explained which may be some help, though probably not for this question (I don't recall a discussion of how to identify Japanese and Chinese text).
You probably can't do that reliably. Japanese uses a lot of the same characters as Chinese. I think the best you could do is to look at a block of text. If you see any uniquely Japanese characters, then you can assume the whole block is Japanese. If not, then it's probably Chinese.
However, I'm just learning Chinese, so I'm not an expert.
Testing for characters in the katakana or hiragana ranges should be a very reliable means of determining whether or not the text is Japanese, especially if you are dealing with 'regular' user-generated text. If you are looking at legal documents or other more official fare it might be slightly more difficult, as there will be a much greater preponderance of complex Chinese characters, but it should still be pretty reliable.
A workaround is to check the encoding before it is converted to Unicode.
There are many characters which are only (commonly) used in Japanese or only used in Chinese.
Japan and China both simplified many characters but often in different ways. You can check for Japanese Shinjitai and Simplified Chinese characters. There are many more of the latter than the former. If there are none of either then you probably have Traditional Chinese.
Of course, if you're dealing with Unicode text you may find occasional rare characters or mixed languages, which could throw off a heuristic based on single characters, so you're better off counting the types of characters to make a judgement.
A good way to find out which characters are common in one language and not in the others is to compare the legacy encodings against each other. You can find mappings of each to Unicode easily on the internet.
I used to have some code I wrote which did a binary search by codepoint and it was extremely fast even in JavaScript - I may have lost it in my travels though (-:
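The legacy-encoding comparison can be tried directly in Python, which ships codecs for Shift JIS, GB2312 and Big5. The sketch below is my own (the names LEGACY_ENCODINGS, encodable_in and languages_for are made up): it asks which of those encodings can represent a given character. Note that it is only meaningful for Han characters, since GB2312, for example, also contains the kana.

```python
# Assumed choice of one representative legacy encoding per language.
LEGACY_ENCODINGS = {
    "japanese": "shift_jis",
    "simplified_chinese": "gb2312",
    "traditional_chinese": "big5",
}


def encodable_in(ch: str, encoding: str) -> bool:
    """True if the character can be represented in the given legacy encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def languages_for(ch: str) -> set:
    """Names of the languages whose legacy encoding contains this character."""
    return {
        lang for lang, enc in LEGACY_ENCODINGS.items() if encodable_in(ch, enc)
    }
```

Counting, over a whole text, how many Han characters land in each set gives the kind of character-type tally described above; characters found in only one encoding are the most telling.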