I have a text corpus which is already aligned at sentence level by construction - it is a list of pairs of English strings and their translation in another language. I have about 10 000 strings of 5 - 20 words each and their translations. My goal is to try to build a metric of the quality of the translation - automatically of course, because I'm dealing with languages I know nothing about :)
I'd like to build a dictionary from this list of translations that would give me the (most probable) translation of each word in the source English strings into the other language. I know the dictionary will be far from perfect, but I'm hoping I can have something good enough to flag when a word is not consistently translated. For example, if my dictionary says "Store" is to be translated into French as "Magasin", then if I spot some place where "Store" is translated as "Boutique" I can suspect that something is wrong.
So I'd need to:
build a dictionary from my corpus
align the words inside the string/translation pairs
Do you have good references on how to do this? Known algorithms? I found many links about text alignment but they seem to be more at the sentence level than at the word level...
Any other suggestion on how to automatically check whether a translation is consistent would be greatly appreciated!
Thanks in advance.
A freely available (specifically, GPL-licensed) tool for word alignment is GIZA++. It trains the well-known IBM models mentioned in other answers, as well as other statistical models.
You can download it from the GIZA++ site at Google Code, and there is a brief introduction to its usage on the Apertium page about GIZA++. It boils down to this procedure:
Create your parallel corpus, sentence-aligned (you seem to have this already)
Apply the plain2snt tool included in GIZA++ to extract word lists and sentence lists in GIZA++ format
(Optional – only used for some models:) Generate word classes using the mkcls tool (also included)
Run the actual word alignment tool GIZA++. There are various optional configuration settings you can use to determine the type of model generated.
Before you can do this, you must build the tool from source code by running make. The code is written in C++ and compiles well with recent GCC versions.
A few final notes:
There is more than one possible translation for every word; you shouldn't assume that a specific translation found in one text is necessarily wrong just because the same word is translated differently in another text;
One word may be translated into a (not necessarily contiguous) sequence of several words, and vice versa. Some words are not translated at all;
GIZA++ is a statistical tool that approximates the correct word alignment; many of the alignments it generates are questionable or incorrect.
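Given the caveats above, a consistency check probably works best as a soft flag rather than a verdict. Here is a minimal sketch in Python, assuming you have already turned the aligner's output into a list of (source word, target word) links. The function name flag_inconsistent and the min_count and dominance thresholds are made up for illustration, not part of GIZA++:

```python
from collections import Counter, defaultdict

def flag_inconsistent(aligned_pairs, min_count=3, dominance=0.8):
    """Flag source words that have one dominant translation plus rare alternatives.

    aligned_pairs: iterable of (source_word, target_word) links (assumed input format).
    A word is only flagged when it was seen at least min_count times and its most
    frequent translation covers at least `dominance` of those occurrences.
    """
    by_source = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        by_source[src.lower()][tgt.lower()] += 1

    flags = {}
    for src, counts in by_source.items():
        total = sum(counts.values())
        top, top_n = counts.most_common(1)[0]
        if len(counts) > 1 and total >= min_count and top_n / total >= dominance:
            suspects = {t: n for t, n in counts.items() if t != top}
            flags[src] = {"expected": top, "suspects": suspects}
    return flags
```

Words with several common translations are deliberately left unflagged, since they are more likely to be genuinely ambiguous than mistranslated.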
This is a pretty standard statistical machine translation problem called 'word alignment'.
There are a bunch of EM clustering-based models developed by researchers at IBM, which I think are the basis for most of the other cooler models being developed today.
Google for 'ibm word alignment models' to find out about IBM Models 1 to 5.
This presentation - http://www.stanford.edu/class/cs224n/handouts/cs224n-lecture-05-2011-MT.pdf seems like a good place to start.
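To give a feel for what the simplest of these models does, here is a rough sketch of IBM Model 1 training with EM in plain Python. It is a toy under simplifying assumptions (no NULL word, no smoothing, whitespace-tokenized input), and the function names are my own, so treat it as an illustration rather than a replacement for a real aligner:

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """pairs: list of (source_tokens, target_tokens).

    Returns a dict t where t[(tgt, src)] approximates P(tgt | src).
    """
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(tgt_vocab)
    t = defaultdict(lambda: uniform)  # start from a uniform table

    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each target word over the source words of its sentence
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: renormalize the expected counts into probabilities
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

def best_translation(t, word):
    """Most probable target word for a source word, or None if unseen."""
    candidates = {f: p for (f, e), p in t.items() if e == word}
    return max(candidates, key=candidates.get) if candidates else None

corpus = [
    ("the store".split(), "le magasin".split()),
    ("the book".split(), "le livre".split()),
    ("a book".split(), "un livre".split()),
]
table = train_ibm_model1(corpus)
print(best_translation(table, "store"))
```

On a 10 000-pair corpus the same loop should still be feasible, but rare words will get unreliable probabilities.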
Are you using spaces between words? Whatever character you are using, you might check out the cut command in Linux. It lets you pick out the fields between spaces or other delimiter characters.
I have text from a huge text/PDF file. I am doing sentence tokenization on the text using the period (punctuation), but I am running into problems with cases like ['Dr.', 'Mrs', 'D.C.', 'Inc.','.com']. To deal with this, I am looking for a complete list of such words. Where can I find a corpus of all these prefixes/abbreviations/suffixes?
Thanks.
It would probably be best to use a segmentation library instead of trying to write something yourself. Segmentation involves more than just splitting at a period.
To answer your question though, here is a list of English abbreviations.
This README has some additional info about segmentation and links to various research papers as well as various segmentation libraries.
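To illustrate why a plain split on periods falls short, here is a small sketch of an abbreviation-aware splitter in Python. The ABBREVIATIONS set is a tiny illustrative sample, not a complete list, and the approach has a known weakness: a sentence that really ends in an abbreviation (e.g. "...in Washington D.C.") will not be split. A proper segmentation library handles such cases better.

```python
# Tiny illustrative sample; a real list would be much longer.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "inc.", "d.c.", "e.g.", "i.e."}

def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split on tokens ending in . ! or ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token.endswith((".", "!", "?"))
        if ends_sentence and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences
```

Tokens such as "example.com" are left alone because the period is not at the end of the token.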
I have many audio files with clean audio and only spoken voice in Mandarin Chinese. I need an estimate of how many syllables are spoken in each file. Is there a tool for OS X, Windows, or Linux that can estimate these?
sample01.wav 15
sample02.wav 8
sample03.wav 5
sample04.wav 1
sample05.wav 18
As there are many files, command-line or batch-capable software is preferred, e.g.:
$ application sample01.wav
15
A solution that uses speech-to-text and then counts the number of characters present would be suitable too.
The automatic segmentation of speech is an active scientific domain, meaning that there is no method that works perfectly.
In 2009, de Jong and Wempe proposed a method to automatically detect syllables in a human speech signal using Praat. This method compares well with manual segmentation and has been employed in many third-party scientific studies. You can find a detailed description of the method in their scientific article (pdf), along with a historical perspective on previously proposed methods. The Praat script per se and a couple of tutorials can be found on a dedicated website (www - speechrate).
You may also be interested in another segmentation algorithm developed by Harma that has been implemented in Matlab (Harma Syllable Segmentation)
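To give a rough idea of the intensity-peak approach, here is a much-simplified sketch in plain Python. It is not the Praat script: it only counts local maxima of frame energy that rise above a floor and follow a dip, and it skips the voicing check entirely. The thresholds (25 dB below the maximum, 2 dB dip) and all function names are assumptions for illustration, and it expects 16-bit mono WAV input.

```python
import array
import math
import sys
import wave

def read_mono_wav(path):
    """Read a 16-bit mono WAV file; returns (samples scaled to [-1, 1], sample rate)."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("this sketch assumes 16-bit mono audio")
        rate = wav.getframerate()
        raw = array.array("h")
        raw.frombytes(wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        raw.byteswap()  # WAV sample data is little-endian
    return [s / 32768.0 for s in raw], rate

def frame_energies_db(samples, rate, frame_ms=20):
    """RMS energy in dB for consecutive non-overlapping frames."""
    size = max(1, int(rate * frame_ms / 1000))
    energies = []
    for start in range(0, len(samples) - size + 1, size):
        frame = samples[start:start + size]
        rms = math.sqrt(sum(x * x for x in frame) / size)
        energies.append(20 * math.log10(max(rms, 1e-10)))
    return energies

def count_syllable_peaks(energies_db, floor_below_max_db=25.0, min_dip_db=2.0):
    """Count energy peaks above the floor that are preceded by a sufficient dip."""
    if len(energies_db) < 3:
        return 0
    threshold = max(energies_db) - floor_below_max_db
    count = 0
    lowest_since_peak = energies_db[0]
    for i in range(1, len(energies_db) - 1):
        e = energies_db[i]
        lowest_since_peak = min(lowest_since_peak, e)
        is_local_max = energies_db[i - 1] < e >= energies_db[i + 1]
        if is_local_max and e > threshold and e - lowest_since_peak >= min_dip_db:
            count += 1
            lowest_since_peak = e  # the next peak must dip below this one first
    return count
```

For a file you would call read_mono_wav on the path, pass the result to frame_energies_db, and print what count_syllable_peaks returns. Expect to tune the thresholds against a few hand-counted files.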
You can use formants to determine this. Each syllable should correspond to a formant. Here is more information on formants:
https://en.wikipedia.org/wiki/Formants
This might be of interest for you
http://sites.google.com/site/speechrate/
Your question really calls for a speech-to-text solution.
I doubt that any free, open-source library that is easily available will serve the purpose.
I have used one, but for the reverse purpose: text to speech.
Though it is not a free library, it may still help; just Google "annosoft lipsync"...
http://www.annosoft.com/lipsync-sdks
This library is available for SDK evaluation as well....
I am trying to build a program that will find which page/sentence of a book is being read into a microphone. I have the book's text and its audio content. The user will start reading from a random page, and the program is supposed to sync to the user and show the section of the book that is being read. It might seem like a useless program, but please bear with me.
Would an approach similar to Shazam-like programs work? I am not sure how effective those algorithms are for speech. Also, the speaker will be different and might have an accent and read at a different speed.
Another approach would be converting the speech to text and searching for that text in the book. The problem is that the language of the book is a rare one for which there is no language model available. In addition, the script does not use Latin characters, which makes programming difficult (for me at least).
Are there any solutions that anyone can recommend? Would extracting features from the audio file and comparing them with features extracted in "real time" from the microphone work? Which features?
Any implementation/code that I can start with? Any language is ok but prefer C.
You need to use a speech recognizer.
Create a language model directly from the book text. That will make the recognition of the book reading very accurate, both the original reading and the reading by the user.
Use this language model to recognize the book and assign timestamps to the words, or use a more advanced algorithm to perform text-to-audio alignment.
Recognize user's speech with the book-specific language model and use the recognized text to display a position in a book.
You can use CMUSphinx for the mentioned tasks.
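The last step, turning recognized text into a position in the book, can be sketched independently of the recognizer. Below is a minimal Python example using difflib from the standard library. It assumes the recognizer hands you a string and that the script separates words with spaces; locate_passage and the cutoff value are hypothetical names and numbers chosen for illustration.

```python
import difflib

def locate_passage(recognized_text, book_sentences, cutoff=0.5):
    """Return (index, score) of the best-matching sentence, or (None, score) if too weak.

    Matching is done on word sequences so that a few misrecognized words
    still leave a usable similarity score.
    """
    recognized = recognized_text.lower().split()
    best_index, best_score = None, 0.0
    for i, sentence in enumerate(book_sentences):
        matcher = difflib.SequenceMatcher(None, recognized, sentence.lower().split())
        score = matcher.ratio()
        if score > best_score:
            best_index, best_score = i, score
    if best_score < cutoff:
        return None, best_score
    return best_index, best_score
```

For a whole book a linear scan over every sentence may be too slow for live use, so in practice you would probably restrict the search to a window around the last known position.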
I am having some trouble getting pointers to how to perform what appears to be a deceptively easy task:
Given an audio stream, how do you count the number of words that have been spoken, in real-time?
I don't need to recognize what the words are, but rather just have an accurate counter on words that have been uttered. The counter doesn't have to be too accurate and could even consider utterances and other "grunts" like coughs.
It appears that all Speech Recognition systems depend on a pre-defined grammar to be provided before they can analyze the phonemes that are spoken to convert to known words with some degree of accuracy. But I don't care about the accuracy at all, but rather the rate of words being spoken.
What is important is that this runs in real time and allows the system to provide alerts after a certain number of words have been spoken. The system will show a visual cue encouraging the speaker to pause, and then the speaker can continue.
I've looked at CMU Sphinx FAQ and found that the idea of "word spotting" is not yet supported. I don't really need a real time search of particular words, but it approximates more closely to what I am looking for. Looking for very small silences in the waveform seems to be a very crude way of doing this and probably not very accurate at all, but that's all I have right now.
Any pointers on algorithms, research papers or any other insights would be appreciated!
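For reference, the crude silence-based idea mentioned above could look roughly like the following Python sketch. It counts bursts of sound separated by short silences, which in fluent speech is closer to phrases than to words, so it is only a baseline. The class name, the -35 dB threshold and the frame counts are all assumptions to be tuned.

```python
import math

class BurstCounter:
    """Counts loud bursts in a stream of audio frames and signals every N bursts."""

    def __init__(self, threshold_db=-35.0, min_silence_frames=3, alert_every=50):
        self.threshold_db = threshold_db
        self.min_silence_frames = min_silence_frames
        self.alert_every = alert_every
        self.count = 0
        self.in_burst = False
        self.silent_run = 0

    def feed(self, frame):
        """frame: sequence of samples in [-1, 1]. Returns True when an alert is due."""
        rms = math.sqrt(sum(x * x for x in frame) / max(1, len(frame)))
        loud = 20 * math.log10(max(rms, 1e-10)) > self.threshold_db
        alert = False
        if loud:
            if not self.in_burst:  # a new burst starts here
                self.in_burst = True
                self.count += 1
                alert = self.count % self.alert_every == 0
            self.silent_run = 0
        elif self.in_burst:
            self.silent_run += 1
            if self.silent_run >= self.min_silence_frames:
                self.in_burst = False  # silence was long enough to end the burst
                self.silent_run = 0
        return alert
```

Each call to feed handles one short frame (for example 20 ms of samples), so it can run inside an audio callback without buffering the whole stream.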
Are there any good NLP or statistical techniques for detecting garbled characters in OCR-ed text? Off the top of my head I was thinking that looking at the distribution of n-grams in text might be a good starting point but I'm pretty new to the whole NLP domain.
Here is what I've looked at so far:
N-gram Statistics in English and Chinese: Similarities and Differences
Statistical Distributions of English Text
The text will mostly be in English, but a general solution would be nice. The text is currently indexed in Lucene, so any ideas on a term-based approach would be useful too.
Any suggestions would be great! Thanks!
Yes, the most powerful tool in that case is n-grams. You should collect them from related text corpora (on the same topic as your OCR texts). This problem is very similar to spellchecking: if a small character change leads to a large increase in probability, the original was probably a mistake. Check this tutorial on how to use n-grams for spellchecking.
I used n-grams for this some years ago, with pretty decent results. I used Apache Nutch's language detector, which uses word and intraword n-grams internally. The "n-gram profile" of your text is then compared to the n-gram profiles of the training material. Nutch gives a score/confidence value in addition to the language, and I used hard cutoffs based on the language (it should be the one the docs are in) and the scores. This kept most of the garbled text out, but it's somewhat computationally costly.
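As a rough illustration of the n-gram idea (not of what Nutch does internally), here is a small Python sketch that scores text by how surprising its character trigrams are relative to a clean reference corpus. The function names and the add-one smoothing are my own choices, and any cutoff on the score would have to be calibrated on your own data.

```python
import math
from collections import Counter

def train_trigram_model(clean_text):
    """Count character trigrams in known-good text. Returns (counts, total)."""
    text = clean_text.lower()
    trigrams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return trigrams, sum(trigrams.values())

def garble_score(text, model):
    """Average negative log-probability per trigram; higher means more suspicious."""
    trigrams, total = model
    text = text.lower()
    grams = [text[i:i + 3] for i in range(len(text) - 2)]
    if not grams:
        return 0.0
    vocab = len(trigrams) + 1  # add-one smoothing, with one slot for unseen trigrams
    log_prob = sum(math.log((trigrams[g] + 1) / (total + vocab)) for g in grams)
    return -log_prob / len(grams)
```

Since your text is already in Lucene, the same scoring could be applied per term rather than per document, flagging terms whose score is far above the corpus average.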