What is the best way to split a sentence for a keyword extraction task? - nlp

I'm doing keyword extraction using TF-IDF on a large number of documents. Currently I'm splitting each sentence into n-grams; more specifically, I'm using tri-grams. However, this is not the best way to split each sentence into its constituent keywords. For example, a noun phrase like 'triple heart bypass' may not always get detected as one term.
The alternative for chunking each sentence into its constituent elements appears to be part-of-speech tagging and chunking in OpenNLP. With this approach, a phrase like 'triple heart bypass' always gets extracted as a whole, but the downside is that in TF-IDF the frequency of the extracted terms (phrases) drops dramatically.
Does anyone have any suggestions on either of these two approaches, or any other ideas to improve the quality of the keywords?

What is:
the goal of your application?
-- this impacts the tokenization rules and defines the quality of your keywords
the type of documents?
-- chunking is not the same for forum data as for news article data.
You can implement a boundary recognizer yourself, or use a statistical model as in OpenNLP.
The typical pipeline is to first tokenize as simply as possible, apply stop-word removal (language-dependent), and then, if needed, POS-tagging-based filtering (but this is a costly operation).
Other options: java.text.BreakIterator, com.ibm.icu.text.BreakIterator, com.ibm.icu.text.RuleBasedBreakIterator...
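The pipeline described above (simple tokenization, stop-word removal, then TF-IDF over n-gram candidates) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the stop-word list, the regex tokenizer and the function names are made up for the example, and a real system would more likely use an existing library.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are language-dependent).
STOP_WORDS = {"a", "an", "the", "of", "for", "is", "was", "and", "to", "in"}

def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def candidate_terms(text, max_n=3):
    """Return all 1..max_n-grams built from the tokens left after stop-word removal.

    Note: because stop words are dropped first, an n-gram may bridge a removed word.
    """
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    terms = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            terms.append(" ".join(tokens[i:i + n]))
    return terms

def tf_idf(documents, max_n=3):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    term_lists = [candidate_terms(doc, max_n) for doc in documents]
    doc_freq = Counter()
    for terms in term_lists:
        doc_freq.update(set(terms))  # count each term once per document
    n_docs = len(documents)
    scores = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms) or 1  # avoid division by zero for empty documents
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores
```

With this scheme a phrase such as 'triple heart bypass' is scored as one candidate alongside its unigrams, so you can compare how often the whole phrase survives against the shorter terms.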


Removing junk sentences

I have transcripts of phone calls with customers and agents. I'm trying to find promises which were made by an agent to a customer.
I already did punctuation restoration. But there are a lot of sentences that don't make any sense. I would like to remove them from the transcript. Most of them are just a set of unconnected words.
I wonder what approach is the best for this task?
My ideas are:
• Use TF-IDF and word2vec to create vectors from all sentences. After that we can do some kind of anomaly detection, e.g. look for and delete vectors that deviate strongly from most other vectors.
• Spam filters. Is it possible to apply spam filters to this task?
• Create some pattern of part-of-speech tags that a proper sentence must include. For example, any good sentence must include noun + verb. Or we can use, for example, dependency labels from spaCy.
Example of a sentence that I want to keep:
There's no charge once sent that you'll get a ups tracking number.
Example of a junk sentence:
Kinder pr just have to type it in again, clock drives bethel.
Another junk sentence:
Just so you have it on and said this is regarding that.
One thing I would try is to treat this as a classification problem (junk vs non-junk). You can train a model based on a labelled set (i.e. you need to label some subset of your dataset) and then classify the rest of the corpus.
You could use a pre-trained language model like BERT and fine-tune it with your labeled set, as in here (https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb).
The advantage of using a language model like this is that you don't have to worry too much about linguistic (pre-)processing, meaning you don't have to get the part-of-speech or syntactic structure.
Comments regarding your ideas:
Anomaly detection with TF-IDF and word2vec: It depends on the proportion of junk sentences in your corpus. If it's more than 15%, I would think that they might not be so anomalous. Also, I am assuming your junk sentences come from noisy automatic speech-to-text transcription. I am not sure to what extent parts of these junk sentences are correctly transcribed, and what effect the correctly transcribed portion might have on the extent of the anomaly.
If you mean pre-existing spam filters that are trained on spam email, I would guess that the spamminess of emails is quite different from the junkiness of your transcripts.
Use POS tags or syntactic structure to manually create rules for valid sentences:
This seems a bit tedious to me, and I am also not sure that you will discover all junk with this. For instance, in your junk examples, the syntactic structure does not strike me as too unusual, e.g. "clock drives bethel" might well receive a tag sequence that is quite common. The junkiness in this case comes from the meaning of the words.
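To get a feel for the anomaly-detection idea discussed above, here is a rough sketch that scores each sentence by how far it lies from the average sentence vector. It uses plain word counts rather than TF-IDF or word2vec (a deliberate simplification), and the function names are made up for the example. As noted above, it can only help if junk sentences are rare enough to be outliers.

```python
import math
from collections import Counter

def bow(sentence):
    """Bag-of-words count vector for one sentence (whitespace tokenization)."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def centroid(vectors):
    """Average of all sentence vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {w: c / len(vectors) for w, c in total.items()}

def outlier_scores(sentences):
    """Return 1 - cosine(sentence, centroid) per sentence; higher means more unusual."""
    vectors = [bow(s) for s in sentences]
    centre = centroid(vectors)
    return [1.0 - cosine(v, centre) for v in vectors]
```

You would then inspect the highest-scoring sentences by hand before deleting anything, since a high score only means "different from the bulk of the corpus", not necessarily "junk".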

NLP Structure Question (best way for doing feature extraction)

I am building an NLP pipeline and I am trying to get my head around the optimal structure. My understanding at the moment is the following:
Step 1 - Text pre-processing [a. lowercasing, b. stopword removal, c. stemming, d. lemmatisation]
Step 2 - Feature extraction
Step 3 - Classification - using different types of classifiers (LinearSVC etc.)
From what I read online there are several approaches in regard to feature extraction but there isn't a solid example/answer.
a. Is there a solid strategy for feature extraction ?
I read online that you can do [a. vectorising using scikit-learn, b. TF-IDF]
but I also read that you can use part-of-speech tags, word2vec or other embeddings, and named entity recognition.
b. What is the optimal process/structure for using these?
c. For the text pre-processing, I am doing the processing on a text column of a DataFrame, and the last modified version of it is what I use as input to my classifier. If you do feature extraction, do you do that in the same column, or do you create a new one and only send the features from that column to the classifier?
Thanks so much in advance
The preprocessing pipeline depends mainly on the problem you are trying to solve. TF-IDF, word embeddings etc. each have their own restrictions and advantages.
You need to understand the problem and also the data associated with it. In order to make the best use of the data, we need to implement the proper pipeline.
Specifically for text-related problems, you will find word embeddings to be very useful. TF-IDF is useful when the problem calls for emphasising less frequent words. Word embeddings, on the other hand, convert the text to an N-dimensional vector which may show similarity with other vectors. This can bring a sense of association into your data, and the model can learn the best features possible.
In simple cases, we can use a bag-of-words representation of the texts.
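As a rough illustration of the bag-of-words idea, the sketch below builds a vocabulary from the training texts and turns each text into a fixed-length count vector that a classifier could consume. The function names are made up for the example, and whitespace splitting stands in for whatever tokenizer you actually use.

```python
from collections import Counter

def build_vocabulary(texts):
    """Map each distinct lowercase token to a column index (sorted for stability)."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def bag_of_words(text, vocabulary):
    """Return a count vector; tokens outside the vocabulary are ignored."""
    counts = Counter(text.lower().split())
    return [counts.get(tok, 0) for tok in vocabulary]

texts = ["good movie", "bad movie bad plot"]
vocabulary = build_vocabulary(texts)
features = [bag_of_words(text, vocabulary) for text in texts]
```

Here `features` is a separate list of numeric vectors, one per text; the original text is left untouched and only the vectors would be passed to the classifier.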
So, you need to discover the best approach for your problem. If you are solving a problem that closely resembles well-known NLP problems such as IMDB review classification or sentiment analysis on Twitter data, then you can find a number of approaches on the internet.

NLP: Pre-processing in doc2vec / word2vec

A few papers on the topics of word and document embeddings (word2vec, doc2vec) mention that they used the Stanford CoreNLP framework to tokenize/lemmatize/POS-tag the input words/sentences:
The corpora were lemmatized and POS-tagged with the Stanford CoreNLP (Manning et al., 2014) and each token was replaced with its lemma and POS tag
For pre-processing, we tokenise and lowercase the words using Stanford CoreNLP
So my questions are:
Why does the first paper apply POS-tagging? Would each token then be replaced with something like {lemma}_{POS} and the whole thing used to train the model? Or are the tags used to filter tokens?
For example, gensim's WikiCorpus applies lemmatization by default and then only keeps a few types of part of speech (verbs, nouns, etc.) and gets rid of the rest. So what is the recommended way?
The quote from the second paper seems to me like they only split up words and then lowercase them. This is also what I first tried before I used WikiCorpus. In my opinion, this should give better results for document embeddings, as most POS types contribute to the meaning of a sentence. Am I right?
In the original doc2vec paper I did not find details about their pre-processing.
For your first question, the answer is "it depends on what you are trying to accomplish!"
There isn't a recommended way, per se, to pre-process text. To clean a text corpus, the first steps are usually tokenization and lemmatization. Next, to remove unimportant terms/tokens, you can remove stop words or even apply POS tags so that you can remove tokens based on their grammatical category, on the assumption that some grammatical categories (such as adjectives) do not contain valuable information for, say, modelling a topic. But this depends purely on the type of analysis you are going to perform after the pre-processing step.
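The two uses of POS tags raised in the question (replacing each token with a lemma-plus-tag string, or filtering tokens by grammatical category) can be sketched as below. The tagged triples are hand-written stand-ins for the output of a real tagger such as Stanford CoreNLP, and the coarse tag names and function names are assumptions made for the example.

```python
# Hand-written (token, lemma, POS) triples standing in for real tagger output.
TAGGED = [
    ("The", "the", "DET"),
    ("cats", "cat", "NOUN"),
    ("were", "be", "AUX"),
    ("sleeping", "sleep", "VERB"),
    ("quietly", "quietly", "ADV"),
]

def lemma_pos_tokens(tagged):
    """Option 1: replace every token with '<lemma>_<POS>' and keep all of them."""
    return [f"{lemma}_{pos}" for _, lemma, pos in tagged]

def filter_by_pos(tagged, keep=frozenset({"NOUN", "VERB"})):
    """Option 2: keep only the lemmas whose POS tag is in the allowed set."""
    return [lemma for _, lemma, pos in tagged if pos in keep]
```

Option 1 keeps every word but lets the model distinguish, for example, a noun from a verb with the same spelling; option 2 shrinks the vocabulary at the cost of discarding whole categories.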
For the second part of your question, as explained above, tokenisation and lowercasing are standard parts of the pre-processing routine. So I also suspect that, regardless of the ML algorithm used later on, your results will be better if you carefully pre-process your data. I am not sure whether POS tags contribute to the meaning of a sentence, though.
Hope I provided some valuable feedback to your research. If not you could provide a code sample to further discuss this issue.

How to find the most meaningful words in the text with using word2vec?

So, for instance, I type in, as input, a sentence with some semantic meaning and, as output, I get a list of the closest (in cosine distance) words (mostly single words).
But I want to understand which cluster my sentence belongs to and compute how far each word is located from it, and then eliminate non-meaningful words from the sentence.
For example:
"I want to buy a pizza";
"pizza": 0.99123
"buy": 0.7834
"want": 0.1443
How can such a requirement be achieved out of the box, without any C coding?
Maybe I need to compute cosine distance equation for this?
Thank you!
It seems like you need topic modeling instead of word2vec. Word2vec is used to capture local information; it is not a good idea to use it directly to classify or cluster words or sentences.
Another aspect is stop-word removal, since you mention non-meaningful words. By the way, they are not really non-meaningful; they are just not aligned with any topic, which is why you think of them as non-meaningful.
I believe you should use the LDA topic modeling approach, and you don't need to implement anything yourself since there are many LDA implementations out there.
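If you still want to experiment with the cosine computation mentioned in the question, it is short enough to write directly in Python, with no C involved. The sketch below scores each word against the mean vector of the sentence; the three-dimensional vectors are invented toy values, not real word2vec output, and the function names are made up for the example.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Component-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def rank_words(words, embeddings):
    """Sort the known words by similarity to the sentence's mean vector, best first."""
    known = [w for w in words if w in embeddings]
    centre = mean_vector([embeddings[w] for w in known])
    scored = [(w, cosine_similarity(embeddings[w], centre)) for w in known]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Invented toy vectors, for illustration only.
toy_embeddings = {
    "pizza": [0.9, 0.1, 0.0],
    "buy": [0.7, 0.3, 0.1],
    "want": [0.0, 0.2, 0.9],
}
ranking = rank_words("i want to buy a pizza".split(), toy_embeddings)
```

Words missing from the embedding table ("i", "to", "a" here) are simply skipped, which already acts as a crude stop-word filter.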

Unguided speech to text conversion

I am trying to come up with a way to convert speech to text, and I am trying to use Sphinx to achieve this. What I mean by unguided speech to text is that the speaker is not bound to speak from a definite set of sentences; rather, he might speak any sentence. So it's not possible for me to have a grammar file in which each word is one of the alternatives pre-written in the grammar file. I understand that I would have to train Sphinx somehow to do this.
But I am a beginner in Sphinx. How do I start training Sphinx to convert unguided speech? Is it possible to achieve unguided conversion with Sphinx?
The task you are attempting is, as of right now, not possible to complete, at least not with satisfying accuracy.
As for the Sphinx-based solution: you will have to create a dictionary with all the words to be recognized. There is no other way.
Once you have the dictionary, you can generate a simple n-gram model based on it, with only unigrams: each unigram will be one word. The probability of each may be the same, or you may attempt some statistical analysis of the words that will be used.
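The two ways of assigning unigram probabilities mentioned above (all equal, or estimated from text) can be sketched as follows. The function names are made up for the example, and the values are stored as base-10 logarithms, as in ARPA-format language-model files; writing the result out in the exact file format Sphinx expects is not shown here.

```python
import math
from collections import Counter

def uniform_unigrams(vocabulary):
    """Give every distinct dictionary word the same probability, 1 / vocabulary size."""
    words = sorted(set(vocabulary))
    log_prob = math.log10(1.0 / len(words))
    return {w: log_prob for w in words}

def counted_unigrams(corpus_tokens):
    """Estimate each word's probability from its relative frequency in sample text."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {w: math.log10(c / total) for w, c in counts.items()}
```

The counted variant only covers words that occur in the sample text, so in practice you would need some smoothing for dictionary words that never appear in it.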
