I have a text file from which I have to eliminate all the statements that do not make any sense; in other words, I have to check whether each statement is a proper sentence or not.
For example:
1. John is a heart patient.
2. Dr. Green, Rob is the referring doctor for the patient.
3. Jacob Thomas, M.D. is the ordering provider
4. Xray Shoulder PA, Oblique, TRUE Lateral, 18° FOSSA LAT LT; Status: Complete;
Sentences 1, 2, and 3 make sense, but sentence 4 does not, so I want to eliminate it.
May I know how it could be done?
This task seems very difficult; however, assuming you have the training data, you could likely use XGBoost, which uses boosted decision trees (and random forests). You would train it to answer positive or negative (yes, it makes sense, or no, it doesn't).
You would then need to come up with features. You could derive features from NLTK part-of-speech (POS) tags. The number of occurrences of each type of tag in the sentence would be a good first model. That can set your benchmark for how good an "easy" solution is.
You may also be able to look into the utility of a (word/sentence)-to-vector model, such as those provided by gensim, for creating features for your model.
First I would see what happens with just the number of occurrences of each POS tag and XGBoost. Train and test a model and see how well it does. Then look at adding other features, such as position, or using doc2vec as your input to XGBoost.
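As a minimal sketch of that baseline (assuming you have your own annotated data; the `sentences` and `labels` below are toy placeholders), the POS-tag counts could be turned into a feature matrix for XGBoost roughly like this:

```python
# Minimal sketch: counts of NLTK POS tags as features for an XGBoost classifier.
# `sentences` and `labels` are toy placeholders (1 = makes sense, 0 = does not).
from collections import Counter

import nltk
import xgboost as xgb
from sklearn.feature_extraction import DictVectorizer

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")  # newer NLTK versions may need "averaged_perceptron_tagger_eng"

def pos_counts(sentence):
    """Return how often each POS tag occurs in the sentence."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]
    return Counter(tags)

sentences = [
    "John is a heart patient.",
    "Xray Shoulder PA, Oblique, TRUE Lateral, 18 FOSSA LAT LT; Status: Complete;",
]
labels = [1, 0]  # in practice you need many more labelled examples

vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform([pos_counts(s) for s in sentences])

model = xgb.XGBClassifier(n_estimators=100, max_depth=3)
model.fit(X, labels)

new = vectorizer.transform([pos_counts("Jacob Thomas, M.D. is the ordering provider")])
print(model.predict(new))
```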
Last resort would be a neural network (which would only be recommended if the prior ideas fail, and you have lots and lots of data). If you did use a neural net I would think an LSTM would likely be useful.
You would have to experiment and the amount of data matters, but you can start simple and then test and add to your model iteratively.
It's very hard to be 100% confident but let's try.
You can use Amazon Comprehend (Natural Language Processing and Text Analytics) and create your own metrics over the sentences. For example:
John is a heart patient.
Amazon will give you: "." Punctuation, "a" Determiner, "heart" Noun, "is" Verb, "John" Proper Noun, "patient" Noun.
That is: 1 punctuation, 1 determiner, 2 nouns, 1 verb, 1 proper noun. You will probably need at least a noun and a verb to have a valid sentence.
For your last sentence we have:
3 punctuation, 1 numeral, 11 proper nouns. There is no action (verb), so this sentence probably isn't valid.
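A rough sketch of that idea with boto3 (AWS credentials must be configured; the region and the noun/verb rule below are assumptions of mine, not something Comprehend provides):

```python
# Count Comprehend's POS tags per sentence and require a noun-like token and a verb.
from collections import Counter

import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")  # example region

def looks_like_sentence(text):
    response = comprehend.detect_syntax(Text=text, LanguageCode="en")
    tags = Counter(token["PartOfSpeech"]["Tag"] for token in response["SyntaxTokens"])
    has_subject = tags["NOUN"] + tags["PROPN"] + tags["PRON"] > 0
    has_verb = tags["VERB"] + tags["AUX"] > 0
    return bool(has_subject and has_verb)

print(looks_like_sentence("John is a heart patient."))                 # expected: True
print(looks_like_sentence("Xray Shoulder PA, Oblique, TRUE Lateral"))  # expected: False
```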
Take, for example, the following sentence:
"Don't pay attention to people if they say it's no good."
As humans, we understand that the overall sentiment of the sentence is positive.
Technique of "Bag of Words" or BOW
Then, we have the two categories of "positive" words as Polarity of 1, "negative" words of Polarity of 0.
In this case, the word of "good" fits into category, but here it is accidentally correct.
Thus, this technique is eliminated.
Still use BOW technique (sort of "Word Embedding")
But take into consideration of its surrounding words, in this case, the "no" word preceding it, thus, it's "no good", not the adj alone "good". However, "no good" is not what the author intended from the context of the entire sentence.
Thus, this question. Thanks in advance.
Word embeddings are one possible way to take into account the complexity coming from the sequence of terms in your example. Using models pre-trained on general English, such as BERT, should give you interesting results for your sentiment analysis problem. You can leverage the implementations provided by the Hugging Face library.
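For instance, a quick way to try a pre-trained model on your example sentence (assuming the transformers library is installed; the default model the pipeline downloads is just a convenient starting point):

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("Don't pay attention to people if they say it's no good."))
# e.g. [{'label': 'POSITIVE' or 'NEGATIVE', 'score': ...}]
```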
Another approach, which doesn't rely on compute-intensive techniques (such as word embeddings), would be to use n-grams, which will capture the sequence aspect and should provide good features for sentiment estimation. You can try different depths (unigrams, bigrams, trigrams...) and combine them with different types of preprocessing and/or tokenizers. Scikit-learn provides a good reference implementation for n-grams in its CountVectorizer class.
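A small sketch of that route (the texts and labels below are toy placeholders standing in for your own annotated data):

```python
# Unigram-to-trigram counts feeding a simple linear classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "it's no good",
    "it's really good",
    "don't pay attention to people if they say it's no good",
]
labels = [0, 1, 1]  # 0 = negative, 1 = positive (toy labels)

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 3), lowercase=True),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)
print(model.predict(["they say it's no good"]))
```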
I have transcripts of phone calls with customers and agents. I'm trying to find promises which were made by an agent to a customer.
I have already done punctuation restoration, but there are a lot of sentences that don't make any sense. I would like to remove them from the transcript. Most of them are just sets of unconnected words.
I wonder what approach is the best for this task?
My ideas are:
• Use tf-idf and word2vec to create vectors from all sentences. After that we can do some kind of anomaly detection, e.g. look for and delete vectors that deviate strongly from most other vectors.
• Spam filters. Maybe it is possible to apply spam filters to this task?
• Create some pattern of part-of-speech tags that a proper sentence must include. For example, any good sentence must include a noun + verb. Or we could use, for example, dependency tokens from spaCy (a rough sketch of this idea follows this list).
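A minimal sketch of that last idea with spaCy, using the example sentences below (assumes the en_core_web_sm model is installed; note that syntactically plausible junk can still pass a rule like this):

```python
# Keep a sentence only if it contains a noun-like token and a verb-like token.
# Install the model first: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def has_noun_and_verb(sentence):
    pos = {token.pos_ for token in nlp(sentence)}
    return bool({"NOUN", "PROPN", "PRON"} & pos) and bool({"VERB", "AUX"} & pos)

print(has_noun_and_verb("There's no charge once sent that you'll get a ups tracking number."))
print(has_noun_and_verb("Kinder pr just have to type it in again, clock drives bethel."))
```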
Examples
Example of a sentence that I want to keep:
There's no charge once sent that you'll get a ups tracking number.
Example of a junk sentence:
Kinder pr just have to type it in again, clock drives bethel.
Another junk sentence:
Just so you have it on and said this is regarding that.
One thing I would try is to treat this as a classification problem (junk vs non-junk). You can train a model based on a labelled set (i.e. you need to label some subset of your dataset) and then classify the rest of the corpus.
You could use a pre-trained language model like BERT and fine-tune it with your labelled set, as in here (https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb).
The advantage of using a language model like this is that you don't have to worry too much about linguistic (pre-)processing, meaning you don't have to extract part-of-speech tags or syntactic structure.
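If you prefer the Hugging Face transformers library over the linked notebook, a fine-tuning run could look roughly like this (a sketch only: the model name, hyperparameters, and the two toy training examples are assumptions, and a real run needs a much larger labelled set):

```python
# Hedged sketch of fine-tuning a pre-trained model for junk vs. non-junk.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

train_texts = [
    "There's no charge once sent that you'll get a ups tracking number.",
    "Kinder pr just have to type it in again, clock drives bethel.",
]
train_labels = [1, 0]  # 1 = keep, 0 = junk (toy labels)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

dataset = Dataset.from_dict({"text": train_texts, "label": train_labels})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="junk-classifier", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
)
trainer.train()
```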
Comments regarding your ideas:
Anomaly detection with tf-idf and word2vec: it depends on the proportion of junk sentences in your corpus. If it's more than 15%, I would think that they might not be so anomalous. Also, I am assuming your junk sentences come from noisy automatic speech-to-text transcription. I am not sure to what extent parts of these junk sentences are correctly transcribed, and what effect the correctly transcribed portions might have on the degree of anomaly.
Spam filters: if you mean pre-existing spam filters that are trained on spam email, I would guess that the spamminess of emails is quite different from the junkiness of your transcripts.
Use POS tags or syntactic structure to manually create rules for valid sentences:
This seems a bit tedious to me, and I am also not sure whether you will discover all the junk with this. For instance, in your junk examples, the syntactic structure does not strike me as too unusual; e.g. "clock drives bethel" might be tagged as NOUN VERB NOUN, which is quite a common tag sequence. The junkiness in this case comes from the meaning of the words.
I have various restaurant labels with me, and I have some words that are unrelated to restaurants as well, like below:
vegan
vegetarian
pizza
burger
transportation
coffee
Bookstores
Oil and Lube
I have a mix of around 500 such labels. I want to know whether there is a way to pick out the labels that are related to food choices and leave out words like Oil and Lube or transportation.
I tried using word2vec, but some of the labels have more than one word and I could not figure out the right way to handle them.
The brute-force approach is to tag them manually, but I want to know whether there is a way, using NLP or Word2Vec, to cluster all related labels together.
Word2Vec could help with this, but key factors to consider are:
How are your word-vectors trained? Using off-the-shelf vectors (like, say, the popular GoogleNews vectors trained on a large corpus of news stories) is unlikely to closely match the senses of these words in your domain, or to include multi-word tokens like 'oil_and_lube'. But if you have a good training corpus from your own domain, with multi-word tokens from a controlled vocabulary (like oil_and_lube) that are used in context, you might get quite good vectors for exactly the tokens you need.
The similarity of word-vectors isn't strictly 'synonymity' but often covers other forms of close relation, including oppositeness and other ways words can be interchangeable or be used in similar contexts. So whether or not the word-vector similarity values provide a good threshold cutoff for your particular "related to food" test is something you'd have to try out and tinker with. (For example: whether words that are drop-in replacements for each other are closest to each other, or words that are common in the same topics are closest to each other, can be influenced by whether the window parameter is smaller or larger. So you could find that tuning Word2Vec training parameters improves the resulting vectors for your specific needs.)
Making more recommendations for how to proceed would require more details on the training data you have available – where do these labels come from? what's the format they're in? how much do you have? – and your ultimate goals – why is it important to distinguish between restaurant- and non-restaurant- labels?
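As an illustration of one way this could be wired up (a sketch only: the GoogleNews file name, the anchor words, and the crude averaging of multi-word labels are all assumptions, not a recommendation):

```python
# Score each label against a few "food" anchor words using pre-trained vectors.
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

FOOD_ANCHORS = ["food", "restaurant", "meal", "cuisine"]
anchor_vec = np.mean([kv[w] for w in FOOD_ANCHORS], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def food_score(label):
    """Average the vectors of the label's in-vocabulary words and compare to the anchors."""
    words = [w for w in label.lower().split() if w in kv]
    if not words:
        return 0.0
    return cosine(np.mean([kv[w] for w in words], axis=0), anchor_vec)

for label in ["vegan", "pizza", "Oil and Lube", "transportation"]:
    print(label, round(food_score(label), 3))
```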
OK, thank you for the details.
In order to train a word2vec model, you should take into account the following facts:
1. You need a huge and varied text dataset. Review your training set and make sure it contains the useful data you need in order to obtain what you want.
2. Set one sentence/phrase per line.
3. For preprocessing, delete punctuation and set all strings to lower case.
4. Do NOT lemmatize or stem, because the text will become less complex!
5. Try different settings:
5.1 Algorithm: I used word2vec and I can say CBOW (continuous bag-of-words) provided better results, on different training sets, than skip-gram.
5.2 Number of layers: 200 layers provide good results.
5.3 Vector size: a vector length of 300 is OK.
Now run the training algorithm. Then, use the obtained model to perform different tasks. For example, in your case, for synonymy, you can compare two words (i.e. their vectors) with cosine similarity. From my experience, cosine similarity provides a satisfactory result: the similarity between two words is given by a value between 0 and 1. Synonyms have high cosine values; you must find the threshold between words which are synonyms and others that are not.
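A minimal sketch of those steps with gensim (parameter names from gensim 4.x; corpus.txt is a placeholder for your own one-sentence-per-line file, and the example words are arbitrary):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("corpus.txt")  # one lower-cased, punctuation-free sentence per line
model = Word2Vec(
    sentences,
    vector_size=300,  # vector length, as suggested above
    sg=0,             # 0 = CBOW, 1 = skip-gram
    window=5,
    min_count=5,
    workers=4,
)

# Cosine similarity between two words: values close to 1 suggest near-synonyms.
print(model.wv.similarity("car", "automobile"))
print(model.wv.most_similar("car", topn=5))
```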
Folks,
I have searched Google for different types of papers/blogs/tutorials, etc., but haven't found anything helpful. I would appreciate it if anyone could help me. Please note that I am not asking for step-by-step code but rather an idea/blog/paper or some tutorial.
Here's my problem statement:
Just like sentiment analysis is used for identifying positive and
negative tone of a sentence, I want to find whether a sentence is
a forward-looking (future outlook) statement or not.
I do not want to use a bag-of-words approach that simply sums up the number of forward-looking words/phrases such as "going forward", "in near future" or "in 5 years from now". I am not sure if word2vec or doc2vec can be used. Please enlighten me.
Thanks.
It seems what you are interested in doing is finding temporal statements in texts.
Not sure of your final output, but let's assume you want to find temporal phrases or sentences which contain them.
One methodology could be the following:
Create list of temporal terms [days, years, months, now, later]
Pick only sentences with key terms
Use sentences in doc2vec model
Infer vector and use distance metric for new sentence
GMM Cluster + Limit
Distance from average
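A rough sketch of those steps with gensim's Doc2Vec (the temporal term list, the toy corpus, and the simple centroid-similarity score are placeholders for illustration; the GMM clustering step is left out):

```python
# Keep sentences containing a temporal term, train Doc2Vec on them, then score
# a new sentence by its cosine similarity to the average temporal-sentence vector.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

TEMPORAL_TERMS = {"days", "years", "months", "now", "later"}

corpus = [
    "we expect revenue to grow in the next five years",
    "the company was founded many years ago",
    "the office is located in london",
]

temporal_sents = [s for s in corpus if TEMPORAL_TERMS & set(s.split())]
tagged = [TaggedDocument(words=s.split(), tags=[i]) for i, s in enumerate(temporal_sents)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)
centroid = np.mean([model.dv[i] for i in range(len(tagged))], axis=0)

def temporal_score(sentence):
    vec = model.infer_vector(sentence.split())
    return float(np.dot(vec, centroid) / (np.linalg.norm(vec) * np.linalg.norm(centroid)))

print(temporal_score("in five years from now we plan to expand"))
print(temporal_score("the headquarters is in berlin"))
```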
Another methodology could be:
Create list of temporal terms [days, years, months, now, later]
Do Bigram and Trigram collocation extraction
Keep relevant collocations with temporal terms
Use relevant collocations in a kind of bag-of-collocations approach
Matched binary feature vectors for relevant collocations
Train classifier to recognise higher level text
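And a rough sketch of the second methodology with NLTK's collocation finders (corpus.txt and the temporal term list are placeholders; PMI is just one of several possible association measures):

```python
# Extract bigram/trigram collocations, keep those containing a temporal term,
# and use their presence as binary features.
import nltk
from nltk.collocations import (BigramAssocMeasures, BigramCollocationFinder,
                               TrigramAssocMeasures, TrigramCollocationFinder)

nltk.download("punkt")

TEMPORAL_TERMS = {"days", "years", "months", "now", "later", "forward", "future"}

tokens = nltk.word_tokenize(open("corpus.txt").read().lower())

bigrams = BigramCollocationFinder.from_words(tokens).nbest(BigramAssocMeasures().pmi, 200)
trigrams = TrigramCollocationFinder.from_words(tokens).nbest(TrigramAssocMeasures().pmi, 200)

# Keep only collocations that contain a temporal term.
temporal_collocations = [c for c in bigrams + trigrams if TEMPORAL_TERMS & set(c)]

def features(sentence):
    """Binary bag-of-collocations vector for one sentence."""
    joined = " ".join(sentence.lower().split())
    return [1 if " ".join(c) in joined else 0 for c in temporal_collocations]

print(temporal_collocations[:10])
print(features("going forward we expect growth in five years from now"))
```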
This sounds like a good case for a Bootstrapping approach if you have large amounts of texts.
Both are really semi-supervised, since there is some need for finding the initial temporal terms, but even that could be automated using a word2vec scheme and bootstrapping.
I've successfully used OpenNLP for document categorization, and I was also able to extract names using trained samples and regular expressions.
I was wondering if it is also possible to extract names (or more generally speaking, subjects) based on their position in a sentence?
E.g. instead of training with concrete names that are known a priori, like Travel to <START:location> New York </START>, I would prefer not to provide concrete examples but to let OpenNLP decide that anything appearing at the specified position could be an entity. That way, I wouldn't have to provide each and every possible option (which is impossible in my case anyway) but only provide examples of the possible surrounding sentence.
That is context-based learning, and OpenNLP already does that. You have to train it with proper and more numerous examples to get good results.
For example, when "Professor X" appears in a sentence, a trained OpenNLP model.bin will output X as a name, whereas when X is present in a sentence without "Professor" in front of it, it might not output X as a name.
According to its documentation, give it 15,000 sentences of training data and you can expect good results.