Unguided speech to text conversion

I am trying to come up with a way to convert speech to text, and I am trying to use Sphinx to achieve this. What I mean by unguided speech to text is that the speaker is not bound to a fixed set of sentences; they might speak any sentence. So it's not possible for me to have a grammar file where each word is one of the alternatives pre-written in the grammar. I understand that I would have to train Sphinx somehow to do this.
But I am a beginner with Sphinx. How do I start training Sphinx to convert unguided speech? Is it possible to achieve unguided conversion with Sphinx?

The task you are attempting is, as of right now, not possible to complete, at least not with satisfying accuracy.
As for the Sphinx-based solution: you will have to create a dictionary with all the words to be recognized. There is no other way.
Once you have the dictionary, you can generate a simple n-gram model based on it, with only unigrams: each unigram will be one word. The probability of each may be the same, or you may attempt to do some statistical analysis of the words that will be used.
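To make the equal-probability unigram idea concrete, here is a minimal sketch that writes such a model in the ARPA text format. The function name and the word list are hypothetical, and adding the <s> and </s> sentence markers is an assumption about what the decoder expects; check the result against your Sphinx version before relying on it.

```python
import math


def uniform_unigram_arpa(words):
    """Return ARPA-format text for a unigram model with equal word probabilities."""
    # Assumption: sentence start/end markers are added because ARPA consumers
    # commonly expect them in the vocabulary.
    vocab = ["<s>", "</s>"] + sorted(set(words))
    # Every entry gets the same probability, stored as log base 10.
    log_prob = math.log10(1.0 / len(vocab))
    lines = ["\\data\\", f"ngram 1={len(vocab)}", "", "\\1-grams:"]
    lines += [f"{log_prob:.4f}\t{word}" for word in vocab]
    lines += ["", "\\end\\"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical dictionary words, for illustration only.
    print(uniform_unigram_arpa(["hello", "world", "sphinx"]))
```

To weight words by how often they are expected, replace the single log_prob with a per-word value estimated from a text sample.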

Related

NLP Structure Question (best way for doing feature extraction)

I am building an NLP pipeline and I am trying to get my head around the optimal structure. My understanding at the moment is the following:
Step 1 - Text pre-processing [a. lowercasing, b. stopword removal, c. stemming, d. lemmatisation]
Step 2 - Feature extraction
Step 3 - Classification, using different types of classifier (LinearSVC etc.)
From what I read online there are several approaches to feature extraction, but there isn't a solid example/answer.
a. Is there a solid strategy for feature extraction?
I read online that you can do [a. vectorising using scikit-learn, b. TF-IDF],
but I also read that you can use part-of-speech tags, word2vec or other embeddings, and named entity recognition.
b. What is the optimal process/structure for using these?
c. For the text pre-processing, I am doing the processing on a text column of a df, and the last modified version of it is what I use as input to my classifier. If you do feature extraction, do you do it in the same column, or do you create a new one and send only the features from that column to the classifier?
Thanks so much in advance
The preprocessing pipeline depends mainly on the problem you are trying to solve. TF-IDF, word embeddings and the other options each have their own restrictions and advantages.
You need to understand the problem and also the data associated with it. To make the best use of the data, you need to implement the proper pipeline.
Specifically for text-related problems, you will find word embeddings to be very useful. TF-IDF is useful when the problem calls for emphasising the less frequent words. Word embeddings, on the other hand, convert the text to an N-dimensional vector which may turn out to be similar to some other vector. This can bring a sense of association into your data, and the model can learn the best features possible.
In simple cases, we can use a bag-of-words representation of the tokenized texts.
So, you need to discover the best approach for your problem. If you are solving a problem which closely resembles the well-known NLP problems, like IMDB review classification or sentiment analysis on Twitter data, then you can find a number of approaches on the internet.
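To illustrate the difference between a plain bag of words and TF-IDF, here is a small sketch using only the standard library. The tf * log(N / df) weighting and the whitespace tokenizer are assumptions chosen for brevity; libraries such as scikit-learn use a smoothed variant, so the numbers will not match theirs.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)  # this Counter is the bag-of-words vector
        total = len(tokens)
        weights.append(
            {term: (count / total) * math.log(n_docs / df[term])
             for term, count in counts.items()}
        )
    return weights


if __name__ == "__main__":
    docs = ["the movie was great", "the movie was awful"]
    for doc, vector in zip(docs, tf_idf(docs)):
        print(doc, "->", vector)
```

Words shared by every document get weight zero, while the words that tell the documents apart keep a positive weight. The resulting dicts would normally go into a new feature matrix that is passed to the classifier, rather than back into the original text column.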

Calculating grammar similarity between two sentences

I'm making a program which provides English sentences for the user to learn.
For example:
First, I provide the sentence "I have to go school today" to the user.
Then, if the user wants to learn more sentences like that, I find some sentences which have high grammar similarity with that sentence.
I think the only way to provide such sentences is to calculate similarity.
Is there a way to calculate grammar similarity between two sentences?
Or is there a better way to design that algorithm?
Any advice or suggestions would be appreciated. Thank you.
My approach to this problem would be to do part-of-speech tagging using a tool like NLTK and compare the tree structure of your phrase with those in your database.
Alternatively, if you already have a training dataset, use WEKA to apply a machine learning approach to connect the phrases.
You can parse your sentence as either a constituent or a dependency tree and use these representations to formulate some form of query that you can use to find candidate sentences with similar structures.
You can check this available tool from Stanford NLP:
Tregex is a utility for matching patterns in trees, based on tree relationships and regular expression matches on nodes (the name is short for "tree regular expressions"). Tregex comes with Tsurgeon, a tree transformation language. Also included from version 2.0 on is a similar package which operates on dependency graphs (class SemanticGraph), called semgrex.
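As a much simpler stand-in for tree matching, the part-of-speech idea above can be sketched by comparing tag sequences directly. This is a rough illustration, not Tregex: the tags below are written by hand (in practice they would come from a tagger such as NLTK's), and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def pos_similarity(tags_a, tags_b):
    """Return a ratio in [0, 1] describing how closely two POS tag sequences match."""
    return SequenceMatcher(None, tags_a, tags_b).ratio()


def rank_by_structure(query_tags, candidates):
    """Sort (sentence, tags) pairs by similarity to query_tags, most similar first."""
    return sorted(
        candidates,
        key=lambda pair: pos_similarity(query_tags, pair[1]),
        reverse=True,
    )


if __name__ == "__main__":
    # Hand-written Penn Treebank style tags, assumed for illustration.
    query = ["PRP", "VBP", "TO", "VB", "NN", "NN"]  # "I have to go school today"
    database = [
        ("The cat sat on the mat", ["DT", "NN", "VBD", "IN", "DT", "NN"]),
        ("She wants to visit London tomorrow", ["PRP", "VBZ", "TO", "VB", "NNP", "NN"]),
    ]
    for sentence, tags in rank_by_structure(query, database):
        print(round(pos_similarity(query, tags), 2), sentence)
```

A flat tag sequence ignores nesting, so two sentences with the same tags but different parse trees score as identical; that is the gap a tree-based tool is meant to close.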

What is the best way to split a sentence for a keyword extraction task?

I'm doing keyword extraction using TF-IDF on a large number of documents. Currently I'm splitting each sentence into n-grams; more specifically, I'm using trigrams. However, this is not the best way to split each sentence into its constituent keywords. For example, a noun phrase like 'triple heart bypass' may not always get detected as one term.
The alternative for chunking each sentence into its constituent elements looks to be part-of-speech tagging and chunking in OpenNLP. With this approach a phrase like 'triple heart bypass' always gets extracted as a whole, but the downside is that in TF-IDF the frequency of the extracted terms (phrases) drops dramatically.
Does anyone have any suggestions on either of these two approaches, or any other ideas to improve the quality of the keywords?
What is:
the goal of your application?
-- this impacts the tokenization rules and defines the quality of your keywords
the type of documents?
-- chunking is not the same for forum data as for news article data.
You can implement a boundary recognizer yourself, or use a statistical model as in OpenNLP.
The typical pipeline is to first tokenize as simply as possible, apply stop-word removal (language-dependent), and then, if needed, POS-tagging-based filtering (but this is a costly operation).
Other options: java.text.BreakIterator, com.ibm.icu.text.BreakIterator, com.ibm.icu.text.RuleBasedBreakIterator...
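One cheap way to build a boundary recognizer yourself is to treat stop words and punctuation as phrase boundaries and keep each run of remaining tokens as a candidate term. The sketch below assumes that heuristic; the stop-word list is a tiny illustrative sample and the function name is hypothetical.

```python
import re

# Tiny illustrative stop-word list; a real one is language-dependent and much longer.
STOP_WORDS = {"a", "an", "the", "of", "and", "had", "was", "is", "in", "to", "after"}


def candidate_phrases(text, stop_words=STOP_WORDS):
    """Split text into runs of consecutive non-stop-word tokens."""
    phrases = []
    # Punctuation ends a phrase, so handle each punctuation-delimited segment separately.
    for segment in re.split(r"[.,;:!?]+", text.lower()):
        current = []
        for token in re.findall(r"[a-z0-9']+", segment):
            if token in stop_words:
                # A stop word closes the phrase collected so far.
                if current:
                    phrases.append(" ".join(current))
                    current = []
            else:
                current.append(token)
        if current:
            phrases.append(" ".join(current))
    return phrases


if __name__ == "__main__":
    text = "The patient had a triple heart bypass. Recovery after the bypass was slow."
    print(candidate_phrases(text))
```

Multi-word terms such as 'triple heart bypass' survive as one unit without a POS tagger, and the candidates can then be scored with TF-IDF. The drop in phrase frequency mentioned in the question still applies, so it may help to score the individual words as well.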

How to create a simple feature to detect sentiment of a sentence using CRFs?

I want to use a CRF for sentence-level sentiment classification (positive or negative). But I am lost on how to create a very simple feature to detect this using either CRFsuite or CRF++. I have been trying for a few days. Can anyone suggest how to design a simple feature which I can use as a starting point to understand how to use the tools?
Thanks.
You could start by providing gazetteers containing words separated by sentiment (e.g. positive adjectives, negative nouns, etc.) and then use the CRF to label the relevant portions of the sentences. With gazetteers you can also provide lists of other words which won't be labeled themselves but could help identify sentiment terms. You could also use WordNet instead of gazetteers. Your gazetteer features could be binary, i.e. gazetteer matched or not matched. Check out http://crfpp.googlecode.com for more examples and references.
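A minimal sketch of such binary gazetteer features follows. The word lists, feature names and labels are all hypothetical, and the output layout (label first, then tab-separated attributes, one token per line) is an assumption modeled on the kind of input CRFsuite reads; CRF++ expects a different column layout plus a template file, so check each tool's documentation.

```python
# Hypothetical gazetteers, for illustration only.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "terrible", "boring"}
# Context words that are not labeled themselves but help identify sentiment terms.
NEGATORS = {"not", "never", "no"}


def token_features(tokens, i):
    """Return the list of binary attributes that fire for the token at position i."""
    word = tokens[i].lower()
    feats = ["w=" + word]
    if word in POSITIVE:
        feats.append("gaz_pos")
    if word in NEGATIVE:
        feats.append("gaz_neg")
    if i > 0 and tokens[i - 1].lower() in NEGATORS:
        feats.append("prev_negator")
    return feats


def to_training_lines(tokens, labels):
    """Format one sentence as 'label<TAB>attr<TAB>attr...' lines, one per token."""
    return ["\t".join([label] + token_features(tokens, i))
            for i, label in enumerate(labels)]


if __name__ == "__main__":
    tokens = ["The", "plot", "was", "not", "good"]
    labels = ["O", "O", "O", "O", "NEG"]  # hypothetical label scheme
    print("\n".join(to_training_lines(tokens, labels)))
```

A feature is simply present or absent on a line, which is what makes it binary. A sentence-level decision could then be derived from the token labels, for example by majority vote.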
I hope this helps!

Sentiment analysis with NLTK python for sentences using sample data or webservice?

I am embarking upon a NLP project for sentiment analysis.
I have successfully installed NLTK for Python (seems like a great piece of software for this). However, I am having trouble understanding how it can be used to accomplish my task.
Here is my task:
I start with one long piece of data (let's say several hundred tweets on the subject of the UK election from their webservice)
I would like to break this up into sentences (or pieces no longer than 100 or so chars) (I guess I can just do this in Python?)
Then I search through all the sentences for specific instances within each sentence, e.g. "David Cameron"
Then I would like to check for positive/negative sentiment in each sentence and count them accordingly
NB: I am not too worried about accuracy because my data sets are large, and I am also not too worried about sarcasm.
Here are the troubles I am having:
All the data sets I can find, e.g. the movie review corpus that comes with NLTK, aren't in webservice format. It looks like they have had some processing done already. As far as I can see the processing (by Stanford) was done with WEKA. Is it not possible for NLTK to do all this on its own? Here all the data sets have already been organised into positive/negative, e.g. the polarity dataset http://www.cs.cornell.edu/People/pabo/movie-review-data/ How is this done? (To organise the sentences by sentiment, is it definitely WEKA, or something else?)
I am not sure I understand why WEKA and NLTK would be used together. They seem to do much the same thing. If I'm processing the data with WEKA first to find sentiment, why would I need NLTK? Is it possible to explain why this might be necessary?
I have found a few scripts that get somewhat near this task, but all are using the same pre-processed data. Is it not possible to process this data myself to find sentiment in sentences, rather than using the data samples given in the link?
Any help is much appreciated and will save me much hair!
Cheers Ke
The movie review data has already been marked by humans as being positive or negative (the person who made the review gave the movie a rating which is used to determine polarity). These gold standard labels allow you to train a classifier, which you could then use for other movie reviews. You could train a classifier in NLTK with that data, but applying the results to election tweets might be less accurate than randomly guessing positive or negative. Alternatively, you can go through and label a few thousand tweets yourself as positive or negative and use this as your training set.
For a description of using Naive Bayes for sentiment analysis with NLTK: http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/
Then in that code, instead of using the movie corpus, use your own data to calculate word counts (in the word_feats method).
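To show what training on your own labeled tweets amounts to, here is a small standard-library sketch of a Naive Bayes classifier over word-presence features. It is a stand-in for the NLTK classifier, not the code from the linked article: only the word_feats name comes from the text above, the add-one smoothing is an assumption, and the four "tweets" are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def word_feats(words):
    """Bag-of-words presence features: every word maps to True."""
    return {word: True for word in words}


def train(labeled):
    """labeled: list of (feature dict, label) pairs. Returns the counts the model needs."""
    label_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for feats, label in labeled:
        label_counts[label] += 1
        for feat in feats:
            feat_counts[label][feat] += 1
            vocab.add(feat)
    return label_counts, feat_counts, vocab


def classify(model, feats):
    """Return the label with the highest log-probability, using add-one smoothing."""
    label_counts, feat_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)
        denom = sum(feat_counts[label].values()) + len(vocab)
        for feat in feats:
            if feat in vocab:  # words never seen in training are ignored
                score += math.log((feat_counts[label][feat] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Made-up, hand-labeled examples standing in for a few thousand labeled tweets.
    training = [
        (word_feats("great speech today".split()), "pos"),
        (word_feats("love this policy".split()), "pos"),
        (word_feats("terrible debate performance".split()), "neg"),
        (word_feats("awful policy failure".split()), "neg"),
    ]
    model = train(training)
    print(classify(model, word_feats("great policy".split())))
```

Filtering the sentences for a name such as "David Cameron" before classifying, and counting the predicted labels, would give the per-subject tallies described in the question.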
Why don't you use word sense disambiguation (WSD)? Use a disambiguation tool to find the word senses, and map polarity to the senses instead of to the words. That way you will get somewhat more accurate results than with word-level polarity.
