Which search algorithm does Adobe Reader use in its Find function such that it can locate any pattern in a very large document within seconds?
I am not sure precisely what Adobe is using, but I would guess it is some well-known fast string-matching algorithm (perhaps Rabin-Karp, Boyer-Moore, or KMP), probably run in parallel across all the document's pages at once. For short search strings, this should be very, very fast.
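As a rough illustration of why these algorithms are so quick, here is a minimal sketch of the Boyer-Moore-Horspool variant in Python (just one member of the family mentioned above, not necessarily what Adobe uses); the precomputed skip table lets the search jump ahead by up to the pattern length on a mismatch instead of advancing one character at a time:

```python
def horspool_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1.

    Boyer-Moore-Horspool: on a mismatch, shift the window by a precomputed
    amount based on the text character aligned with the pattern's last position.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return 0 if m == 0 else -1

    # Bad-character table: how far we may safely shift when `c` is the text
    # character under the last position of the current window.
    shift = {c: m - i - 1 for i, c in enumerate(pattern[:-1])}
    default = m

    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        pos += shift.get(text[pos + m - 1], default)
    return -1


print(horspool_search("a very large document text", "large"))  # -> 7
```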
Hope this helps!
I have a text corpus that is already aligned at the sentence level by construction: it is a list of pairs of English strings and their translations in another language. I have about 10,000 strings of 5-20 words each, plus their translations. My goal is to try to build a metric of translation quality - automatically, of course, because I'm dealing with languages I know nothing about :)
I'd like to build a dictionary from this list of translations that would give me the (most probable) translation of each word in the source English strings into the other language. I know the dictionary will be far from perfect, but I'm hoping it can be good enough to flag when a word is not consistently translated. For example, if my dictionary says "Store" should be translated into French as "Magasin", then if I spot some place where "Store" is translated as "Boutique", I can suspect that something is wrong.
So I'd need to:
build a dictionary from my corpus
align the words inside the string/translation pairs
Do you have good references on how to do this? Known algorithms? I found many links about text alignment but they seem to be more at the sentence level than at the word level...
Any other suggestion on how to automatically check whether a translation is consistent would be greatly appreciated!
Thanks in advance.
A freely available (specifically, GPL-licensed) tool for word alignment is GIZA++. It trains the well-known IBM models mentioned in other answers, as well as other statistical models.
You can download it from the GIZA++ site at Google Code, and there is a brief introduction to its usage on the GIZA++ page of the Apertium wiki. It boils down to this procedure:
Create your parallel corpus, sentence-aligned (you seem to have this already)
Apply the plain2snt tool included in GIZA++ to extract word lists and sentence lists in GIZA++ format
(Optional – only used for some models:) Generate word classes using the mkcls tool (also included)
Run the actual word alignment tool GIZA++. There are various optional configuration settings you can use to determine the type of model generated.
Before you can do this, you must build the tool from source code by running make. The code is written in C++ and compiles well with recent GCC versions.
A few final notes:
There is usually more than one possible translation for a word; you shouldn't assume that a specific translation found in one text is necessarily wrong just because the same word is translated differently in another text;
One word may be translated into a (not necessarily contiguous) sequence of several words, and vice versa. Some words are not translated at all;
GIZA++ is a statistical tool that approximates the correct word alignment; many of the alignments it generates are questionable or incorrect.
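To show what you can do with the alignments once you have them, here is a minimal sketch in Python. The alignment format here is just a list of (source_index, target_index) pairs per sentence pair; parsing GIZA++'s actual output files into that form is assumed and left out. It builds a majority-vote dictionary and flags links that disagree with it, which is exactly the "Magasin" vs "Boutique" check from the question (keeping the caveats above in mind: a flag is a hint, not proof of an error):

```python
from collections import Counter, defaultdict

def build_dictionary(sentence_pairs, alignments):
    """sentence_pairs: list of (src_tokens, tgt_tokens).
    alignments: one list of (i, j) pairs per sentence, meaning
    src_tokens[i] is aligned to tgt_tokens[j]."""
    counts = defaultdict(Counter)
    for (src, tgt), links in zip(sentence_pairs, alignments):
        for i, j in links:
            counts[src[i]][tgt[j]] += 1
    # Majority vote: most frequently aligned target word per source word.
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def flag_inconsistencies(sentence_pairs, alignments, dictionary):
    """Yield (sentence_index, source_word, expected, observed) wherever an
    aligned translation disagrees with the majority dictionary."""
    for k, ((src, tgt), links) in enumerate(zip(sentence_pairs, alignments)):
        for i, j in links:
            expected = dictionary.get(src[i])
            if expected is not None and tgt[j] != expected:
                yield k, src[i], expected, tgt[j]

# Toy example with hand-made alignments (indices are 0-based):
pairs = [("the store".split(), "le magasin".split()),
         ("the store".split(), "le magasin".split()),
         ("the store".split(), "la boutique".split())]
links = [[(0, 0), (1, 1)]] * 3
d = build_dictionary(pairs, links)
print(list(flag_inconsistencies(pairs, links, d)))  # flags "la" and "boutique" in the third pair
```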
This is a pretty standard statistical machine translation problem called 'word alignment'.
There are a bunch of EM clustering-based models developed by researchers at IBM, which I think are the basis for most of the cooler models being developed today.
Google for 'ibm word alignment models' to read about IBM Models 1 to 5.
This presentation - http://www.stanford.edu/class/cs224n/handouts/cs224n-lecture-05-2011-MT.pdf seems like a good place to start.
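If you want a feel for what those models actually do before setting up a full toolkit, IBM Model 1 is simple enough to sketch directly. This is a bare-bones EM trainer (no NULL word, no smoothing), just to show the idea; for real use you'd want GIZA++ or similar, as in the other answer:

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """pairs: list of (source_tokens, target_tokens) sentence pairs.
    Returns t, where t[(s, w)] approximates P(target word w | source word s)."""
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    # Start from a uniform translation table.
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))

    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (source, target) links
        total = defaultdict(float)   # expected count of each source word
        for src, tgt in pairs:
            for tw in tgt:
                # E-step: distribute tw's alignment probability over the source words.
                z = sum(t[(sw, tw)] for sw in src)
                for sw in src:
                    c = t[(sw, tw)] / z
                    count[(sw, tw)] += c
                    total[sw] += c
        # M-step: re-estimate the translation probabilities.
        t = defaultdict(float, {(sw, tw): count[(sw, tw)] / total[sw]
                                for (sw, tw) in count})
    return t

# Toy example: pick the most probable translation of each source word.
toy = [("the store".split(), "le magasin".split()),
       ("the house".split(), "la maison".split())]
t = train_ibm_model1(toy)
best = {}
for (sw, tw), p in t.items():
    if p > best.get(sw, ("", 0.0))[1]:
        best[sw] = (tw, p)
print({sw: tw for sw, (tw, _) in best.items()})
```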
Are you using spaces between words? Whatever delimiter you are using, you might check out the cut command on Linux. It lets you filter out the words between spaces or other delimiter characters.
What I want to do is create an API that translates human speech into IPA (the International Phonetic Alphabet). My question is: where are the resources on how to decode speech at the level of the original audio waveform? I looked for an API, but most of what I found just translates straight to the roman alphabet. I'm looking to create something a little more accurate in its ability to distinguish vocal phonetics.
I would just like to start by saying that this project is much more difficult and complicated than you think it is. Speech-to-text processing is a very large and complicated field with a huge amount of research behind it. The reason most parsers send things straight to roman characters is that most of their processing is a probabilistic matching of vague sounds, in the context of other vague sounds, to guess which words make sense together. You are much more likely to find something that will give you Soundex rather than IPA. That said, this is a problem that has been approached on several fronts. Your best bet is probably the Sphinx project from CMU.
http://cmusphinx.sourceforge.net/wiki/start
That will give you a good start, but you are assuming speech-to-text processing is a lot more developed than it actually is; there is no simple way of translating speech to IPA from the waveform with any kind of accuracy. Sphinx is very modular and completely open source, so it puts a huge amount of power at your fingertips, and whether you can figure out how to make this work from there is up to you. But again: this is not a solved problem in any way.
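One practical note if you do go the Sphinx route: the phoneme-level output of its English models is based on the ARPAbet-style CMUdict phone set rather than IPA, so you'd still need a small post-processing step. That part is just a lookup table; here is a partial sketch (the vowel mappings are General American approximations of my own, not something Sphinx ships):

```python
# Partial ARPAbet -> IPA mapping (General American, stress digits stripped).
# This is an approximate table for illustration, not an official one.
ARPABET_TO_IPA = {
    "AA": "ɑ", "AE": "æ", "AH": "ʌ", "AO": "ɔ", "AW": "aʊ", "AY": "aɪ",
    "EH": "ɛ", "ER": "ɝ", "EY": "eɪ", "IH": "ɪ", "IY": "i",
    "OW": "oʊ", "OY": "ɔɪ", "UH": "ʊ", "UW": "u",
    "B": "b", "CH": "tʃ", "D": "d", "DH": "ð", "F": "f", "G": "ɡ",
    "HH": "h", "JH": "dʒ", "K": "k", "L": "l", "M": "m", "N": "n",
    "NG": "ŋ", "P": "p", "R": "ɹ", "S": "s", "SH": "ʃ", "T": "t",
    "TH": "θ", "V": "v", "W": "w", "Y": "j", "Z": "z", "ZH": "ʒ",
}

def arpabet_to_ipa(phones):
    """Convert a sequence of ARPAbet phone labels to an IPA string,
    stripping stress digits (AH0 -> AH, etc.)."""
    return "".join(ARPABET_TO_IPA.get(p.rstrip("012"), p) for p in phones)

# "hello" in CMUdict is HH AH0 L OW1; AH0 is really a schwa, so this is approximate.
print(arpabet_to_ipa(["HH", "AH0", "L", "OW1"]))  # -> hʌloʊ
```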
Are there any good NLP or statistical techniques for detecting garbled characters in OCR-ed text? Off the top of my head I was thinking that looking at the distribution of n-grams in text might be a good starting point but I'm pretty new to the whole NLP domain.
Here is what I've looked at so far:
N-gram Statistics in English and Chinese: Similarities and Differences
Statistical Distributions of English Text
The text will mostly be in English, but a general solution would be nice. The text is currently indexed in Lucene, so any ideas on a term-based approach would be useful too.
Any suggestions would be great! Thanks!
Yes, the most powerful tool in that case is n-grams. You should collect them from a related text corpus (on the same topic as your OCR'd texts). The problem is very similar to spell checking: if a small character change leads to a large increase in probability, it was a mistake. Check this tutorial on how to use n-grams for spell checking.
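A minimal sketch of that idea in Python: character trigrams with crude add-one smoothing. The training corpus, the n-gram order, and the cutoff threshold are all things you'd have to tune on your own data:

```python
import math
from collections import Counter

def train_char_ngrams(corpus, n=3):
    """Count character n-grams over a clean reference corpus (one big string)."""
    padded = " " * (n - 1) + corpus.lower()
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def avg_log_prob(text, counts, n=3, vocab_size=100):
    """Average per-n-gram log-probability under a crude add-one-smoothed
    model. Lower scores suggest garbled text."""
    total = sum(counts.values())
    padded = " " * (n - 1) + text.lower()
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    logp = 0.0
    for g in grams:
        logp += math.log((counts[g] + 1) / (total + vocab_size ** n))
    return logp / max(len(grams), 1)

counts = train_char_ngrams("this is a small sample of clean english text " * 50)
print(avg_log_prob("this is clean text", counts))
print(avg_log_prob("th1s 1$ g@rbl3d t3xt", counts))  # noticeably lower
```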
I used n-grams for this some years ago, with pretty decent results. I used Apache Nutch's language detector, which uses word and intra-word n-grams internally. The "n-gram profile" of your text is then compared to the n-gram profiles of the training material. Nutch gives a score/confidence value in addition to the language, and I used hard cutoffs based on the language (it should be the one the docs are in) and the scores. This kept most of the garbled text out, but it's somewhat computationally costly.
I find myself making repetitive mistakes when typing keywords and sentences in my code comments. I notice it's getting worse, since my fingers just keep "practicing" the incorrect words.
Is there any solution to this? Like a typing tutor designed to help correct repetitive mistakes?
The only way to correct this is to retrain your muscle memory. If it's important enough to take the time, the only way to retrain muscle memory is repetition.
For example, I tend to spell the word "the" as "teh" because of the same scenario you're asking about. To retrain the memory I would just spell the word over and over, starting slowly, striving for 100% accuracy, and increasing the speed. It's the same technique I use to get better at Guitar Hero.
Try a different keyboard layout. That way you start from scratch and completely retrain your fingers. Done properly, you should be able to type just as fast as you could with QWERTY within a few weeks. For example, Dvorak.
</shameless promotion of dvorak>
If this were SMBC, the alt-text drawing thingy would be a giraffe hooker fluttering her eyelashes.
Try Texter, from one of Lifehacker's editors.
Maybe a book? Mastering Computer Typing: A Painless Course for Beginners and Professionals. I haven't read it, but it has good reviews on Amazon.
One of the best websites to avoid repetitive mistakes is http://www.keybr.com/
It will actually keep track of the letters you make the most mistakes on and generate typing lessons accordingly.
I would recommend practicing on TouchTyping.guru - you can choose the test with the most popular words there, so you'll quickly improve your general performance; making a mistake causes the app to generate the next words using the letter you got wrong.
And if you have problems with particular letters, you can practice with a restriction on the number of letters used to form the words. The focus is put on the last letter you choose, and the letters are ordered by frequency of occurrence in the given language.
We have many WIDE html grids which scroll horizontally within a DIV in our web application.
I would like to find the best strategy for printing these grids on a portrait A4 page.
What I would like to know is what is the best way to present/display grids/data like this.
This question is not HTML specific, I am looking for design strategies and not CSS #page directives.
There's actually a whole book dedicated (amongst other things) to fast methods for the computation of \pi: 'Pi and the AGM', by Jonathan and Peter Borwein (available on Amazon).
I studied the AGM and related algorithms quite a bit: it's quite interesting (though sometimes non-trivial).
Note that to implement most modern algorithms to compute \pi, you will need a multiprecision arithmetic library (GMP is quite a good choice, though it's been a while since I last used it).
The time complexity of the best algorithms is O(M(n) log(n)), where M(n) is the time complexity of multiplying two n-bit integers (M(n) = O(n log(n) log(log(n))) using FFT-based algorithms, which are usually needed when computing digits of \pi; such an algorithm is implemented in GMP).
Note that even though the mathematics behind the algorithms might not be trivial, the algorithms themselves are usually a few lines of pseudo-code, and their implementation is usually very straightforward (if you chose not to write your own multiprecision arithmetic :-) ).
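For a concrete sense of how short these algorithms are, here is the Gauss-Legendre (AGM-based) iteration sketched with Python's decimal module standing in for a proper multiprecision library like GMP; each iteration roughly doubles the number of correct digits (decimal is far slower than GMP, so this is only an illustration):

```python
from decimal import Decimal, getcontext

def gauss_legendre_pi(digits):
    """Approximate pi to about `digits` digits with the Gauss-Legendre (AGM) iteration."""
    getcontext().prec = digits + 10  # a few guard digits against rounding error
    a = Decimal(1)
    b = Decimal(1) / Decimal(2).sqrt()
    t = Decimal(1) / Decimal(4)
    p = Decimal(1)
    # Quadratic convergence: roughly log2(digits) iterations are enough.
    for _ in range(digits.bit_length()):
        a_next = (a + b) / 2
        b = (a * b).sqrt()
        t -= p * (a - a_next) ** 2
        a = a_next
        p *= 2
    return (a + b) ** 2 / (4 * t)

print(gauss_legendre_pi(50))
```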
I guess it really depends on what your purpose is.
In a book format: I usually try to span two facing pages.
For a conference or poster: Find an extra wide printer and print it out on a large sheet of paper.
Something more informal: Span regular pages and tape them together.
PowerPoint: Don't show the whole chart; they won't be able to read the details anyway. Just show the relevant information.