Distinguishing well formed English sentences from "word salad"

Distinguishing well formed English sentences from "word salad" - nlp

I'm looking for a library easily usable from C++, Python or F#, which can distinguish well formed English sentences from "word salad". I tried The Stanford Parser and unfortunately, it parsed this:
Some plants have with done stems animals with exercise that to predict?
without a complaint. I'm not looking for something very sophisticated, able to handle all possible corner cases. I only need to filter out an obvious nonsense.

Here is something I just stumbled upon:
A general-purpose sentence-level nonsense detector, by a Stanford student named Ian Tenney.
Here is the code from the project, undocumented but available on GitHub.
If you want to develop your own solution based on this, I think you should pay attention the 4th group of features used, ie the language model, under section 3 "Features and preprocessing".
It might not suffice, but I think getting a probability score of each subsequences of length n is a good start. 3-grams like "plants have with", "have with done", "done stems animals", "stems animals with" and "that to predict" seem rather improbable, which could lead to a "nonsense" label on the whole sentence.
This method has the advantage of relying on a learned model rather than on a set of hand-made rules, which afaik is your other option. Many people would point you to Chapter 8 of NLTK's manual, but I think that developing your own context-free grammar for general English is asking a bit much.

The paper was useful, but goes into too much depth for solving this problem. Here is the author's basic approach, heuristically:
Baseline sentence heuristic: first letter is Capitalized,
and line ends with one of .?! (1 feature).
Number of characters, words, punctuation, digits, and named entities (from Stanford CoreNLP NER tagger), and normalized versions by text length (10 features).
Part-of-speech distributional tags: (# / # words) for each Penn treebank tag (45 features).
Indicators for the part of speech tag of the first
and last token in the text (45x2 = 90 features).
Language model raw score (s lm = log p(text))
and normalized score (s¯lm = s lm / # words) (2 features).
However, after a lot of searching, the github repo only includes the tests and visualizations. The raw training and test data are not there. Here is his function for calculating these features:
(note: this uses pandas dataframes as df)
def make_basic_features(df):
"""Compute basic features."""
df['f_nchars'] = df['__TEXT__'].map(len)
df['f_nwords'] = df['word'].map(len)
punct_counter = lambda s: sum(1 for c in s
if (not c.isalnum())
and not c in
[" ", "\t"])
df['f_npunct'] = df['__TEXT__'].map(punct_counter)
df['f_rpunct'] = df['f_npunct'] / df['f_nchars']
df['f_ndigit'] = df['__TEXT__'].map(lambda s: sum(1 for c in s
if c.isdigit()))
df['f_rdigit'] = df['f_ndigit'] / df['f_nchars']
upper_counter = lambda s: sum(1 for c in s if c.isupper())
df['f_nupper'] = df['__TEXT__'].map(upper_counter)
df['f_rupper'] = df['f_nupper'] / df['f_nchars']
df['f_nner'] = df['ner'].map(lambda ts: sum(1 for t in ts
if t != 'O'))
df['f_rner'] = df['f_nner'] / df['f_nwords']
# Check standard sentence pattern:
# if starts with capital, ends with .?!
def check_sentence_pattern(s):
ss = s.strip(r"""`"'""").strip()
return s[0].isupper() and (s[-1] in '.?!')
df['f_sentence_pattern'] = df['__TEXT__'].map(check_sentence_pattern)
# Normalize any LM features
# by dividing logscore by number of words
lm_cols = {c:re.sub("_lmscore_", "_lmscore_norm_",c)
for c in df.columns if c.startswith("f_lmscore")}
for c,cnew in lm_cols.items():
df[cnew] = df[c] / df['f_nwords']
return df
So I guess that's a function you can use in this case. For the minimalist version:
raw = ["This is is a well-formed sentence","but this ain't a good sent","just a fragment"]
import pandas as pd
df = pd.DataFrame([{"__TEXT__":i, "word": i.split(), 'ner':[]} for i in raw])
the parser seems to want a list of the words, and named entities recognized (NER) using the Stanford CoreNLP library, which is written in Java. You can pass in nothing (an empty list []) and the function do calculate everything else. You'll get back a dataframe (like a matrix) with all the features of sentences that you can then used to decide what to call "well formed" by the rules given.
Also, you don't HAVE to use pandas here. A list of dictionaries will also work. But the original code used pandas.
Because this example involved a lot of steps, I've created a gist where I run through an example up to the point of producing a clean list of sentences and a dirty list of not-well-formed sentences
my gist: https://gist.github.com/marcmaxson/4ccca7bacc72eb6bb6479caf4081cefb
This replaces the Stanford CoreNLP java library with spacy - a newer and easier to use python library that fills in the missing meta data, such as sentiment, named entities, and parts of speech used to determine if a sentence is well-formed. This runs under python 3.6, but could work under 2.7. all the libraries are backwards compatible.

Related

Use the polarity distribution of word to detect the sentiment of new words

I have just started a project in NLP. Suppose I have a graph for each word that shows the polarity distribution of sentiments for that word in different sentences. I want to know what I can use to recognize the feelings of new words? Any other use you have in mind I will be happy to share.
I apologize for any possible errors in my writing. Thanks a lot

Assuming you've got some words that have been hand-labeled with positive/negative sentiments, but then you encounter some new words that aren't labeled:
If you encounter the new words totally alone, outside of contexts, there's not much you can do. (Maybe, you could go out to try to find extra texts with those new words, such as vis dictionaries or the web, then use those larger texts in the next approach.)
If you encounter the new words inside texts that also include some of your hand-labeled words, you could try guessing that the new words are most like the words you already know that are closest-to, or used-in-the-same-places. This would leverage what's called "the distributional hypothesis" – words with similar distributions have similar meanings – that underlies a lot of computer natural-language analysis, including word2vec.
One simple thing to try along these lines: across all your texts, for every unknown word U, tally up the counts all neighboring words within N positions. (N could be 1, or larger.) From that, pick the top 5 words occuring most often near the unknown word, and look up your prior labels, and avergae them together (perhaps weighted by the number of occurrences.)
You'll then have a number for the new word.
Alternatively, you could train a word2vec set-of-word-vectors for all of your texts, including the unknown & know words. Then, ask that model for the N most-similar neighbors to your unknown word. (Again, N could be small or large.) Then, from among those neighbors with known labels, average them together (again perhaps weighted by similarity), to get a number for the previously unknown word.
I wouldn't particularly expect either of these techniques to work very well. The idea that individual words can have specific sentiment is somewhat weak given the way that in actual language, their meaning is heavily modified, or even reversed, by the surrounding grammar/context. But in each case these simple calculate-from-neighbors techniqyes are probably better than random guesses.
If your real aim is to calculate the overall sentiment of longer texts, like sentences, paragraphs, reviews, etc, then you should discard your labels of individual words an acquire/create labels for full texts, and apply real text-classification techniques to those larger texts. A simple word-by-word approach won't do very well compared to other techniques – as long as those techniques have plenty of labeled training data.

Extract Acronyms and Māori (non-english) words in a dataframe, and put them in adjacent columns within the dataframe

Regular expression seems a steep learning curve for me. I have a dataframe that contains texts (up to 300,000 rows). The text as contained in outcome column of a dummy file named foo_df.csv has a mixture of English words, acronyms and Māori words. foo_df.csv is as thus:
outcome
0 I want to go to DHB
1 Self Determination and Self-Management Rangatiratanga
2 mental health wellness and AOD counselling
3 Kai on my table
4 Fishing
5 Support with Oranga Tamariki Advocacy
6 Housing pathway with WINZ
7 Deal with personal matters
8 Referral to Owaraika Health services
The result I desire is in form of a table below such that has Abreviation and Māori_word columns:
outcome Abbreviation Māori_word
0 I want to go to DHB DHB
1 Self Determination and Self-Management Rangatiratanga Rangatiratanga
2 mental health wellness and AOD counselling AOD
3 Kai on my table Kai
4 Fishing
5 Support with Oranga Tamariki Advocacy Oranga Tamariki
6 Housing pathway with WINZ WINZ
7 Deal with personal matters
8 Referral to Owaraika Health services Owaraika
The approach I am using is to extract the ACRONYMS using regular expression and extract the Māori words using nltk module.
I have been able to extract the ACRONYMS using regular expression with this code:
pattern = '(\\b[A-Z](?:[\\.&]?[A-Z]){1,7}\\b)'
foo_df['Abbreviation'] = foo_df.outcome.str.extract(pattern)
I have been able to extract non-english words from a sentence using the code below:
import nltk
nltk.download('words')
from nltk.corpus import words
words = set(nltk.corpus.words.words())
sent = "Self Determination and Self-Management Rangatiratanga"
" ".join(w for w in nltk.wordpunct_tokenize(sent) \
if not w.lower() in words or not w.isalpha())
However, I got an error TypeError: expected string or bytes-like object when I tried to iterate the above code over a dataframe. The iteration I tried is below:
def no_english(text):
words = set(nltk.corpus.words.words())
" ".join(w for w in nltk.wordpunct_tokenize(text['outcome']) \
if not w.lower() in words or not w.isalpha())
foo_df['Māori_word'] = foo_df.apply(no_english, axis = 1)
print(foo_df)
Any help in python3 will be appreciated. Thanks.

You can't magically tell if a word is English/Māori/abbreviation with a simple short regex. Actually, it is quite likely that some words can be found in multiple categories, so the task itself is not binary (or trinary in this case).
What you want to do is natural language processing, here are some examples of libraries for language detection in python. What you'll get is a probability that the input is in a given language. This is usually ran on full texts but you could apply it to single words.
Another approach is to use Māori and abbreviation dictionaries (=exhaustive/selected lists of words) and craft a function to tell if a word is one of them and assume English otherwise.

How to identify words with the same meaning in order to reduce number of tags/categories/classes in a dataset

So here is an example of a column in my data-set:
"industries": ["Gaming", "fitness and wellness"]
The industries column has hundreds of different tags, some of which can have the same meaning, for example, some rows have: "Gaming" and some have "video games" and others "Games & consoles".
I'd like to "lemmatize" these tags so I could query the data and not worry about minute differences in the presentation (if they are basically the same).
What is the standard solution in this case?

I don't know that there is a "standard" solution, but I can suggest a couple of approaches, ranked by increasing depth of knowledge, or going from the surface form to the meaning.
String matching
Lemmatisation/stemming
Word embedding vector distance
String matching is based on the calculating the difference between strings, as a measure of how many characters they share or how many editing steps it takes to transform one into the other. Levenshtein distance is one of the most common ones. However, depending on the size of your data, it might be a bit inefficient to use. This is a really cool approach to find most similar strings in a large data set.
However, it might not be the most suitable one for your particular data set, as your similarities seem more semantic and less bound to the surface form of the words.
Lemmatisation/stemming goes beyond the surface by analysing the words apart based on their morphology. In your example, gaming and games both have the same stem game, so you could base your similarity measure on matching stems. This can be better than pure string matching as you can see that *go" and went are related
Word embeddings go beyond the surface form by encoding meaning as the context in which words appear and as such, might find a semantic similarity between health and *fitness", that is not apparent from the surface at all! The similarity is measured as the cosine distance/similarity between two word vectors, which is basically the angle between the two vectors.
It seems to me that the third approach might be most suitable for your data.

This is a tough NLU question! Basically 'what are synonyms or near synonyms of each other, even if there's not exact string overlap?'.
1. Use GLoVE word embeddings to judge synonymous words
It might be interesting to use spaCy's pre-trained GLoVE model (en_vectors_web_lg) for word embeddings, to get the pairwise distances between tokens, and use that as a metric for judging 'closeness'.
nlp = spacy.load('en_vectors_web_lg')
doc1 = nlp('board games')
doc2 = nlp('Games & Recreation')
doc3 = nlp('video games')
for doc in [doc1, doc2, doc3]:
for comp in [doc1, doc2, doc3]:
if doc != comp:
print(f'{doc} | {comp} | similarity: {round(doc.similarity(comp), 4)}')
board games | Games & Recreation | similarity: 0.6958
board games | video games | similarity: 0.7732
Games & Recreation | board games | similarity: 0.6958
Games & Recreation | video games | similarity: 0.675
video games | board games | similarity: 0.7732
video games | Games & Recreation | similarity: 0.675
(GLoVE is cool - really nice mathematical intuition for word embeddings.)
PROS: GLoVE is robust, spaCy has it built in, vector space comparisons are easy in spaCy
CONS: It doesn't handle out of vocabulary words well, spaCy's just taking the average of all the token vectors here (so it's sensitive to document length)
2. Try using different distance metrics/fuzzy string matching
You might also look at different kinds of distance metrics -- cosine distance isn't the only one.
FuzzyWuzzy is a good implementation of Levenshtein distance for fuzzy string matching (no vectors required).
This library implements a whole slew of string-matching algorithms.
PROS: Using a preconfigured library saves you some coding, other distance metrics might help you find new correlations, don't need to train a vector model
CONS: More dependencies, some kinds of distance aren't appropriate and will miss synonymous words without literal string overlap
3. Use WordNet to get synonym sets
You could also get a sort of dictionary of synonym sets ('synsets') from WordNet, which was put together by linguists as a kind of semantic knowledge graph.
The nice thing about this is it gets you some textual entailment -- that is, given sentence A, would a reader think sentence B is most likely true?
Because it was handmade by linguists and grad students, WordNet isn't as dependent on string overlap and can give you nice semantic enrichment. It also provides things like hyponyms/meroynms and hypernyms/holonyms -- so you could, e.g., say 'video game' is a subtype of 'game', which is a subset of 'recreation' -- just based off of WordNet.
You can access WordNet in python through the textblob library.
from textblob import Word
from textblob.wordnet import NOUN
game = Word('game').get_synsets(pos=NOUN)
for synset in game:
print(synset.definition())
a contest with rules to determine a winner
a single play of a sport or other contest
an amusement or pastime
animal hunted for food or sport
(tennis) a division of play during which one player serves
(games) the score at a particular point or the score needed to win
the flesh of wild animals that is used for food
a secret scheme to do something (especially something underhand or illegal)
the game equipment needed in order to play a particular game
your occupation or line of work
frivolous or trifling behavior
print(game[0].hyponyms())
[Synset('athletic_game.n.01'),
Synset('bowling.n.01'),
Synset('card_game.n.01'),
Synset('child's_game.n.01'),
Synset('curling.n.01'),
Synset('game_of_chance.n.01'),
Synset('pall-mall.n.01'),
Synset('parlor_game.n.01'),
Synset('table_game.n.01'),
Synset('zero-sum_game.n.01')]
Even cooler, you can get the similarity based on these semantic features between any words you like.
print((Word('card_game').synsets[0]).shortest_path_distance(Word('video_game').synsets[0]))
5
PROS: Lets you use semantic information like textual entailment to get at your objective, which is hard to get in other ways
CONS: WordNet is limited to what is in WordNet, so again out-of-vocabulary words may be a problem for you.

I Suggest to use the word2vector approcah or the lemmatisation approach:
With the first one you can compute vectors starting from words, and so you have a projection into a vectorial space. With this projection you can compute the similarity between words (with cosine similarity as #Shnipp sad) and then put a threshold, above which you say that two words belong do different arguments.
Using lemmatisation you can compare bare words/lemma using SequenceMatcher. In this case you condition of equality could be based on the presence of very similar lemmas (similarity above 95%).
It's us to you to choose the best for your purpouse. If you want something solid and structured use word2vec. Othervise if you something simple and fast to implement use the lemmatisation approcah.

Financial news headers classification to positive/negative classes

I'm doing a small research project where I should try to split financial news articles headers to positive and negative classes.For classification I'm using SVM approach.The main problem which I see now it that not a lot of features can be produced for ML. News articles contains a lot of Named Entities and other "garbage" elements (from my point of view of course).
Could you please suggest ML features which can be used for ML training? Current results are: precision =0.6, recall=0.8
Thanks

The task is not trivial at all.
The straightforward approach would be to find or create a training set. That is a set of headers with positive news and a set of headers with negative news.
You turn the training set to a TF/IDF representation and then you train a Linear SVM to separate the two classes. Depending on the quality and size of your training set you can achieve something decent - not sure for 0.7 break even point.
Then, to get better results you need to go for NLP approaches. Try use a part-of-speech tagger to identify adjectives (trivial), and then score them using some sentiment DB like SentiWordNet.
There is an excellent overview on Sentiment Analysis by Bo Pang and Lillian Lee you should read:

How about these features?
Length of article header in words
Average word length
Number of words in a dictionary of "bad" words, e.g. dictionary = {terrible, horrible, downturn, bankruptcy, ...}. You may have to generate this dictionary yourself.
Ratio of words in that dictionary to total words in sentence
Similar to 3, but number of words in a "good" dictionary of words, e.g. dictionary = {boon, booming, employment, ...}
Similar to 5, but use the "good"-word dictionary
Time of the article's publication
Date of the article's publication
The medium through which it was published (you'll have to do some subjective classification)
A count of certain punctuation marks, such as the exclamation point
If you're allowed access to the actual article, you could use surface features from the actual article, such as its total length and perhaps even the number of responses or the level of opposition to that article. You could also look at many other dictionaries online such as Ogden's 850 basic english dictionary, and see if bad/good articles would be likely to extract many words from those. I agree that it seems difficult to come up with a long list (e.g. 100 features) of useful features for this purpose.

iliasfl is right, this is not a straightforward task.
I would use a bag of words approach but use a POS tagger first to tag each word in the headline. Then you could remove all of the named entities - which as you rightly point out don't affect the sentiment. Other words should appear frequently enough (if your dataset is big enough) to cancel themselves out from being polarised as either positive or negative.
One step further along, if you still aren't close could be to only select the adjectives and verbs from the tagged data as they are the words that tend to convey the emotion or mood.
I wouldn't be too disheartened in your precision and recall figures though, an F number of 0.8 and above is actually quite good.

Find sentences with similar relative meaning from a list of sentences against an example one

I want to be able to find sentences with the same meaning. I have a query sentence, and a long list of millions of other sentences. Sentences are words, or a special type of word called a symbol which is just a type of word symbolizing some object being talked about.
For example, my query sentence is:
Example: add (x) to (y) giving (z)
There may be a list of sentences already existing in my database such as: 1. the sum of (x) and (y) is (z) 2. (x) plus (y) equals (z) 3. (x) multiplied by (y) does not equal (z) 4. (z) is the sum of (x) and (y)
The example should match the sentences in my database 1, 2, 4 but not 3. Also there should be some weight for the sentence matching.
Its not just math sentences, its any sentence which can be compared to any other sentence based upon the meaning of the words. I need some way to have a comparison between a sentence and many other sentences to find the ones with the closes relative meaning. I.e. mapping between sentences based upon their meaning.
Thanks! (the tag is language-design as I couldn't create any new tag)

First off: what you're trying to solve is a very hard problem. Depending on what's in your dataset, it may be AI-complete.
You'll need your program to know or learn that add, plus and sum refer to the same concept, while multiplies is a different concept. You may be able to do this by measuring distance between the words' synsets in WordNet/FrameNet, though your distance calculation will have to be quite refined if you don't want to find multiplies. Otherwise, you may want to manually establish some word-concept mappings (such as {'add' : 'addition', 'plus' : 'addition', 'sum' : 'addition', 'times' : 'multiplication'}).
If you want full sentence semantics, you will in addition have to parse the sentences and derive the meaning from the parse trees/dependency graphs. The Stanford parser is a popular choice for parsing.
You can also find inspiration for this problem in Question Answering research. There, a common approach is to parse sentences, then store fragments of the parse tree in an index and search for them by common search engines techniques (e.g. tf-idf, as implemented in Lucene). That will also give you a score for each sentence.

You will need to stem the words in your sentences down to a common synonym, and then compare those stems and use the ratio of stem matches in a sentence (5 out of 10 words) to compare against some threshold that the sentence is a match. For example all sentences with a word match of over 80% (or what ever percentage you deem acurate). At least that is one way to do it.

Write a function which creates some kinda hash, or "expression" from a sentence, which can be easy compared with other sentences' hashes.
Cca:
1. "the sum of (x) and (y) is (z)" => x + y = z
4. "(z) is the sum of (x) and (y)" => z = x + y
Some tips for the transformation: omit "the" words, convert double-word terms to a single word "sum of" => "sumof", find operator word and replace "and" with it.

Not that easy ^^
You should use a stopword filter first, to get non-information-bearing words out of it. Here are some good ones
Then you wanna handle synonyms. Thats actually a really complex theme, cause you need some kind of word sense disambiguation to do it. And most state of the art methods are just a little bit better then the easiest solution. That would be, that you take the most used meaning of a word. That you can do with WordNet. You can get synsets for a word, where all synonyms are in it. Then you can generalize that word (its called a hyperonym) and take the most used meaning and replace the search term with it.
Just to say it, handling synonyms is pretty hard in NLP. If you just wanna handle different wordforms like add and adding for example, you could use a stemmer, but no stemmer would help you to get from add to sum (wsd is the only way there)
And then you have different word orderings in your sentences, which shouldnt be ignored aswell, if you want exact answers (x+y=z is different from x+z=y). So you need word dependencies aswell, so you can see which words depend on each other. The Stanford Parser is actually the best for that task if you wanna use english.
Perhaps you should just get nouns and verbs out of a sentence and make all the preprocessing on them and ask for the dependencies in your search index.
A dependency would look like
x (sum, y)
y (sum, x)
sum (x, y)
which you could use for ur search
So you need to tokenize, generalize, get dependencies, filter unimportant words to get your result. And if you wanna do it in german, you need a word decompounder aswell.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string