Is there any way to get the text from a scanned document in JPG, JPEG, or any other format? I am using Ruby as my programming language, but I guess that if I can get the text with some help from another programming language, it will not be much of a problem to integrate.

Yes, you can use an OCR library.
You may wish to consider using tessnet.

This technology is called optical character recognition (OCR).
For programming, check out this question, which recommends tesseract-ocr.
OCR for Ruby? Check out this question.
If it's just a couple images, here's a site that supposedly does it for free.

OCR Terminal has been the best (most accurate) free tool out of at least a dozen that I have used. It works especially well with formatted (table) data.


Can Tesseract work with languages such as Bengali? If so, with what accuracy, and what steps should I follow to use it for the Bengali language?

I want to implement an offline program that can detect Bengali text in an image (black text on a white background). I need to know how to approach this work to begin with.
Yes. Tesseract is trained for Bengali. List of languages supported. You have to use the language code ben for that. The rest of the implementation details are given here; simply follow them.
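As a minimal sketch of the ben language code in use, the following Python wrapper shells out to the tesseract command-line tool. It assumes Tesseract and its Bengali trained data are installed and on the PATH. The function names and the file name scan.png are illustrative only and do not come from the answer above.

```python
import subprocess


def build_tesseract_command(image_path, lang="ben"):
    """Build the tesseract CLI call that prints recognized text to stdout."""
    # Passing "stdout" as the output base makes tesseract print the text
    # instead of writing a .txt file; "-l" selects the language (ben = Bengali).
    return ["tesseract", image_path, "stdout", "-l", lang]


def recognize_text(image_path, lang="ben"):
    """Run tesseract on an image and return the recognized text.

    Assumes the tesseract binary and the trained data for `lang` are installed.
    """
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        encoding="utf-8",  # Bengali output is decoded as UTF-8
        check=True,        # raise CalledProcessError if tesseract fails
    )
    return result.stdout
```

Calling recognize_text("scan.png") would then return whatever text Tesseract recognizes in that image. Accuracy depends heavily on scan quality, so it is worth testing on a few representative images first.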

Define pronunciation starting time for each word in script

I have a text script that is used to create podcasts. So the words in podcast audio are exactly the same as in my text. Now what I want to have is the following:
Word in text | Pronunciation started at
Hello | 0:0:0.000
my | 0:0:1.125
friends | 0:0:2.750
Is that possible to do at all?
A keyword you could start with to approach this problem is "forced alignment". This site also covers questions on the topic, e.g. here, which leads you to questions and answers concerning HTK (the Hidden Markov Model Toolkit) via the related threads.
You can find a more hands-on style description of how to use forced alignment in automated audio segmentation here.
So the answer is: yes, it is possible, but it is algorithmically very complex and even in its best implementations it is not error-free.
PS.: I found you a really simple tool
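Once a forced aligner has produced a start time for each word, turning those times into the table from the question is the easy part. The sketch below assumes the aligner output is already available as (word, start in seconds) pairs. That input format and the function names are hypothetical, not the output of any particular tool.

```python
def format_timestamp(seconds):
    """Format a time in seconds as H:M:S.mmm, as in the question's table."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours}:{minutes}:{secs}.{ms:03d}"


def alignment_table(word_starts):
    """Render (word, start_seconds) pairs as a two-column text table."""
    lines = ["Word in text | Pronunciation started at"]
    for word, start in word_starts:
        lines.append(f"{word} | {format_timestamp(start)}")
    return "\n".join(lines)


# Hypothetical aligner output for the first three words of the script.
example = [("Hello", 0.0), ("my", 1.125), ("friends", 2.75)]
print(alignment_table(example))
```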

How to count the number of spoken syllables in an audio file?

I have many audio files with clean audio and only spoken voice in Mandarin Chinese. I need an estimate of how many syllables are spoken in each file. Is there a tool for OS X, Windows, or Linux that can estimate these?
sample01.wav 15
sample02.wav 8
sample03.wav 5
sample04.wav 1
sample05.wav 18
As there are many files, command-line or batch-capable software is preferred, e.g.:
$ application sample01.wav
A solution that uses speech-to-text and then counts the number of characters present would be suitable too.
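The character-counting half of that idea is small enough to sketch. In written Mandarin each Chinese character corresponds to roughly one spoken syllable, so counting the Han characters in a transcript gives an estimate. The sketch below assumes the transcript has already been produced by some speech-to-text tool, which is not shown, and the function name is illustrative.

```python
import re

# Basic CJK Unified Ideographs block; rarer extension blocks are ignored here.
HAN_CHARACTER = re.compile(r"[\u4e00-\u9fff]")


def estimate_syllables(transcript):
    """Estimate the syllable count as the number of Han characters.

    Punctuation, digits, Latin letters and whitespace are not counted.
    """
    return len(HAN_CHARACTER.findall(transcript))
```

This remains an estimate: it inherits every recognition error in the transcript, and digits or Latin text in the transcript are skipped entirely.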
The automatic segmentation of speech is an active scientific domain, meaning that there is no method that works perfectly.
In 2009, de Jong and Wempe proposed a method to automatically detect syllables in a human speech signal using Praat. This method compares well with manual segmentation and has been employed in many third-party scientific studies. You can find a detailed description of the method in their scientific article (pdf), along with a historical perspective on previously proposed methods. The Praat script itself and a couple of tutorials can be found on a dedicated website (www - speechrate).
You may also be interested in another segmentation algorithm developed by Harma that has been implemented in Matlab (Harma Syllable Segmentation)
You can use formants to determine this. Each syllable should correspond to a formant. Here is more information on formants:
This might be of interest for you
Your question calls for a speech-to-text solution specifically.
I really doubt that any free, easily available open source library will serve this purpose.
I have used one, but for the reverse purpose: text to speech.
Though it is not a free library, I would love to help: just Google "annosoft lipsync".
This library is available for SDK evaluation as well.

How does OCR work? and how to add OCR to an alphabet

I have an alphabet which has not been tackled before, so when scanned, there's no way to detect the letters for recognition with OCR. I'm trying to program OCR for it, but don't have much experience in this. I'd appreciate some hints as to where to get started, and how such a system is normally implemented.
Take a look at this page--it describes the training process for an open source OCR engine.
The free Stanford Online Machine Learning class has a great set of lessons on Photo OCR in Part XVIII.
This blog post has a brief description of the example taught in the class.
There are some excellent resources at Google Books. Likewise, if you search for Optical Character Recognition on Amazon, there are some pretty up-to-date books that look to be fairly thick and intellectually challenging :D heh
btw - I'm well aware this post has some age, but you never know when some other person might stumble across this and find just what they need. And if this even has the chance of helping out, then so be it. OCR is such a strange subject, that there's not too much out there that can really really answer the deep-machine ended questions. Especially if you're going to attempt to write your own library. :P

Libraries of audio samples (spoken text)

For a project we're currently working on, we need a library of spoken words in many different languages.
Two options seem possible: text-to-speech or "real" recordings by native speakers. As the quality is important to us, we're thinking about going the latter path.
In order to create a prototype for our application, we're looking for libraries that contain as many words in different languages as possible. To get a feeling for the quality of our approach, this library should not be made up of synthesized speech.
Do you know of any available/accessible libraries?
A co-worker just found this community based library, which is nice, but rather small in size:
I've just found this on the Audacity wiki: VoxForge. From their site:
VoxForge was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac).
We will make available all submitted audio files under the GPL license, and then 'compile' them into acoustic models for use with Open Source speech recognition engines such as Sphinx, ISIP, Julius and HTK (note: HTK has distribution restrictions).
There is also Old time radio, not sure if this is the sort of spoken word you're after though.
My guess is that you won't find a library anywhere that consists of just individual words. Whatever you find, you're going to have to open the audio up in an editor (like Pro Tools or Cool Edit) and chop it up into individual words.
You would probably be better off creating a list of all the words you need for each language, and then finding native speakers to read them while you record. You can have them read slowly, so that you'll have an easy time chopping up each individual word.
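If the speakers do pause between words, the chopping itself can be partly automated by splitting on silence. The following is a minimal sketch of that idea on a plain list of amplitude values. The function name, threshold and minimum silence length are assumptions that would need tuning, and real recordings would first have to be decoded into samples (for example with Python's wave module).

```python
def split_on_silence(samples, threshold, min_silence):
    """Return (start, end) index pairs of the non-silent runs in `samples`.

    A sample is silent when its absolute value is below `threshold`.
    A run ends once `min_silence` consecutive silent samples follow it.
    `end` is exclusive, so samples[start:end] is one candidate word.
    """
    segments = []
    start = None     # index where the current run began, if any
    silent_run = 0   # consecutive silent samples seen inside the run
    for i, value in enumerate(samples):
        if abs(value) >= threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence:
                # Close the run just after its last loud sample.
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        # The recording ended while a run was still open.
        segments.append((start, len(samples) - silent_run))
    return segments
```

Short pauses inside a word (shorter than min_silence) do not split it, which is why slow reading with clear gaps makes this approach more reliable.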
One I used to use a lot:
Easy access to the recordings.
