Shortest path dependency between three entities using spaCy - SVM

I'm trying to extract ternary intra-sentence relationships. One of the methods I'm considering is the shortest dependency path algorithm, using the POS tag sequence and the shortest dependency path sequence as features for a kernel-based SVM. I'm not sure how to formulate these features.
import spacy
import networkx as nx

nlp = spacy.load('en_core_web_sm')  # any English pipeline works here

txt = 'Domestic revenues increased 14% to $680.8 million and were 77% of total revenues for the year ended December 31, 2015.'
doc = nlp(txt)
for token in doc:
    print((token.head.text, token.text, token.dep_, token.pos_))

# build an undirected graph over the dependency edges
edges = []
for token in doc:
    for child in token.children:
        edges.append((token.lower_, child.lower_))
graph = nx.Graph(edges)
The shortest path between the second token of 'Domestic revenues' and '2015' looks like this:
shortest path length: 7
shortest path: ['revenues', 'increased', 'were', 'for', 'year', 'ended', 'december', '2015']
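For reference, a minimal sketch of the calls that produce this output, assuming the graph built above:

print('shortest path length:', nx.shortest_path_length(graph, source='revenues', target='2015'))
print('shortest path:', nx.shortest_path(graph, source='revenues', target='2015'))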
How do I use this dependency graph as a feature sequence for a ternary relationship?
( Audit-nsubj-increased-quantmod-million--conj-were )
How do I use generalized POS tags for these entity relationships?
(Audit-verb-num-num).
Since the entities in question are compounds, I'm OK with the model classifying the last tokens of the entities as a ternary relationship, like this: (revenues, million, 2015) --> (Audit, value, date)
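One way to formulate such features, as a sketch rather than a definitive recipe: build the graph over token indices instead of lowercase strings, so each node keeps its POS tag and dependency label, and read the sequences off the shortest path. The entity_token helper and the entity choices below are assumptions made to match the example above.

import networkx as nx

# graph over token indices, so POS and dependency labels stay accessible
graph = nx.Graph((token.i, child.i) for token in doc for child in token.children)

def entity_token(text):
    # index of the first token whose lowercase form matches (entities reduced to their last token)
    return next(t.i for t in doc if t.lower_ == text)

path = nx.shortest_path(graph, source=entity_token('revenues'), target=entity_token('2015'))

# lexicalized dependency-path feature, e.g. 'revenues-nsubj-increased-...'
dep_feature = '-'.join('{}-{}'.format(doc[i].lower_, doc[i].dep_) for i in path)

# generalized POS feature, e.g. 'NOUN-VERB-...-NUM'
pos_feature = '-'.join(doc[i].pos_ for i in path)

For a ternary relation you would extract one such path per entity pair of the triple (revenues-million, revenues-2015, million-2015) and feed the resulting sequences to a string/subsequence kernel, or one-hot encode them for a standard SVM kernel.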

Related

How to measure similarity between sentences inside a cluster after clustering?

I'm conducting topic modeling analysis on messages from public Telegram groups. I'm super new to this area, so I'm just learning.
I've been following this example here (https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6), and tried swapping out the HDBSCAN clustering algorithm with util.community_detection from the Sentence-BERT documentation (https://www.sbert.net/docs/package_reference/util.html).
When I output the results of the clusters in this example (4899 Telegram messages), I get something that looks like this.
Topic: just a cluster label
Doc: all the messages in that cluster combined together
0: top keywords found via tf-idf
The problem I'm concerned with is that there are clearly a ton of messages that are basically identical to each other (I've marked them in yellow). A few examples:
Cluster 3: this is just a bunch of "hellos" and variations thereof
Cluster 5: this is just a bunch of "Ok"s, people saying yes / ok
Cluster 7: people just saying thanks and variations on that
Cluster 9: some variations and misspellings of the word "gas"
Cluster 19: just "siap" which I think means "sorry if I already posted"
To a human reader I feel like this type of text should just be excluded from the analysis altogether; the question is how do I detect it.
Since they're already grouped together by the clustering algorithm, the algorithm must have some way of measuring the "similarity" between messages within a cluster. But I can't seem to find these values exposed anywhere, or even what the measure is called. For example, with the HDBSCAN algorithm (https://hdbscan.readthedocs.io/en/latest/basic_hdbscan.html#), I skimmed through the docs a few times and didn't find any such property or measure exposed. Am I missing something here?
My hypothesis is that for the cases where it's just a word or a short phrase repeated over and over again, this similarity value must be extremely high, and I'd just say "clusters whose internal similarity is higher than this threshold get thrown out".
Any help & advice would be greatly appreciated, thanks!
Index the corpus of interest (e.g., with FAISS). Just to give an idea, example code is below:
import faiss
import numpy as np

def build_index(self):
    """Builds a dense FAISS index over the encoded search documents."""
    vectors = [self.encode(document) for document in self.documents]
    # inner-product index; 768 is the dimensionality of the vector space
    index = faiss.IndexIDMap(faiss.IndexFlatIP(768))
    # Add document vectors after converting them to numpy arrays; IDs match positions in self.documents
    index.add_with_ids(np.array([vec.numpy() for vec in vectors]), np.array(range(len(self.documents))))
    return index
Then apply any similarity metric, such as L2 (Euclidean) distance or cosine similarity via dot products. Essentially, the idea is that once we map texts into vectors in an n-dimensional space, vectors with similar semantics end up close together, so computing similarity amounts to computing the angle between two vectors and taking its cosine. Similar vectors have a smaller angle and therefore a higher cosine value, and vice versa.
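As a minimal sketch, cosine similarity can be computed directly with NumPy:

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))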
Check the following topics for your problem.
Cosine Similarity
FAISS
Sentence Vectors (similar to word vectors, but better suited to longer text)
Check this repository for a better understanding of sentence vectorization and computing similarity to retrieve top n sentences.
In short,
Create an index file using FAISS for your data of interest.
Compute similarity by calling one of its methods.
Get top n most similar results.
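Putting the steps together, querying might look like this (a sketch; index and encode are the hypothetical pieces from the build_index example above, and vectors should be L2-normalized if you want the inner product to behave like cosine similarity):

# encode the query into the same 768-d space as the documents
query = encode("hello there").numpy().reshape(1, -1).astype('float32')
scores, ids = index.search(query, 5)  # top-5 most similar documents
for score, doc_id in zip(scores[0], ids[0]):
    print(doc_id, score)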
Removing stop words:
Essentially, your problem can be reduced to a finite list of stop words. If you can enumerate them up to some manageable size (say, around 25 distinct keywords at most), the task becomes stop word removal. Use the NLTK or spaCy libraries for easy stop word removal. You can also maintain your own list of strings and add a condition: if a token matches one of those strings, it is dropped from downstream processing. Stop word removal is a standard and often necessary pre-processing step in NLP. Your Telegram data task is also similar to Twitter analysis. Check this & this.
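A minimal sketch with NLTK, extended with a hypothetical custom list for the Telegram-specific fillers seen in the clusters above:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
custom = {'hello', 'hi', 'ok', 'okay', 'thanks', 'siap'}  # hypothetical fillers from the clusters above
stop = set(stopwords.words('english')) | custom

def remove_fillers(tokens):
    # drop any token that matches the combined stop list
    return [t for t in tokens if t.lower() not in stop]

print(remove_fillers('Ok thanks see you at the station'.split()))  # ['see', 'station']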

How to cluster customers with the model (Customer -> Item list -> Word list in items) using an unsupervised algorithm

Need advice on a clustering model. There is a list of customers -> each customer has a list of products -> each product is described by several words.
I want to cluster clients into several groups by type of activity - that is, by general topics.
How would you turn such a model into vectors for clustering, for example for K-means?
My hypothesis so far: turn every word into a fastText vector, select the top 100 words per customer by TF-IDF, and concatenate their vectors (100 words * 100, the size of the fastText vector), which yields 10,000 columns. Maybe there is something more economical computationally?
This is very related to recommendation systems. I'd recommend reading about content-based vs. collaborative-filtering recommender systems. An okay introduction is this blog post.
So, you can cluster based on many properties. Your proposed idea might work. If you have domain knowledge about the product, you could appeal to that before looking to word vectors. For example, let's say all the products are shelves. You could vectorize the products directly, say
vec = [
    width,
    depth,
    height,
    width * depth,  # footprint/surface area is important on its own
    width * depth * height,
    color,  # numeric representation
    popularity,  # possibly using a metric like sales
]
This is just an example, but it shows how you can directly vectorize your products without resorting to NLP.
If there is no way you can think of to directly vectorize your products, and you don't/can't use collaborative filtering (cold start problem, perhaps), then you might want to look at vectorizing the entire product description using Universal Sentence Encoder, which will output 512 dimensional vectors, regardless of input size.
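A minimal sketch of the Universal Sentence Encoder via TensorFlow Hub (the model URL is the publicly documented one; the product descriptions are made up):

import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
descriptions = ["oak bookshelf 80x30x200 cm", "steel wall shelf with brackets"]
vectors = embed(descriptions)  # shape (2, 512), regardless of input length
print(vectors.shape)

The resulting 512-dimensional vectors can then be fed directly to K-means or any other clustering algorithm.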

Trying to detect products from text while using a dictionary

I have a list of product names and a collection of text generated by random users. I am trying to detect products mentioned in the text while taking into account spelling variation. For example, the text
Text = i am interested in galxy s8
Mentions the product samsung galaxy s8
But note the difference in spellings.
I've implemented the following approaches:
1- Max-tokenized the product names and user text (I split words by punctuation and digits, so 's8' is tokenized into 's' and '8'). Then I checked each token in the user's text to see whether it is in my vocabulary, allowing a Damerau-Levenshtein distance <= 1 for variation in spelling. Once I had detected a sequence of tokens that exist in the vocabulary, I searched for the product matching the query, again checking the Damerau-Levenshtein distance on each token. This gave poor results, mainly because a sequence of tokens that exists in the vocabulary does not necessarily represent a product. For example, since the text is max-tokenized, numbers can be found in the vocabulary, and as a result dates are detected as products.
2- I constructed bigram and trigram indices from the list of products and converted each user text into a query, but the results weren't great either, given the spelling variation.
3- I manually labeled 270 sentences and trained a named entity recognizer with the labels 'O' and 'Product'. I split the data into 80% training and 20% test. Note that I didn't use the list of products as part of the features. Results were okay, but not great.
None of the above achieved reliable performance. I tried regular expressions, but with so many different combinations to consider it became too complicated. Are there better ways to tackle this problem? I suppose NER could give better results if I trained on more data, but supposing there isn't enough training data, what do you think a better solution would be?
If I come up with a better alternative to the ones I've already mentioned, I'll add it to this post. In the meantime I'm open to suggestions.
Consider splitting your problem into two parts.
1) Conduct a spell check using a dictionary of known product names (this is not really an NLP task, and there are guides on how to implement spell checking).
2) Once you have done the pre-processing (spell checking), apply your NER algorithm.
It should improve your accuracy.
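For step 1, a minimal spell-check sketch against the product vocabulary, using difflib from the Python standard library (the vocabulary and cutoff are hypothetical):

import difflib

vocabulary = ['samsung', 'galaxy', 's8']  # hypothetical product vocabulary

def correct(token):
    # return the closest vocabulary entry above the similarity cutoff, else the token unchanged
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else token

text = 'i am interested in galxy s8'
print(' '.join(correct(t) for t in text.split()))  # i am interested in galaxy s8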

How is polarity calculated for a sentence? (sentiment analysis)

How is the polarity of each word in a statement calculated? For example, in
"i am successful in accomplishing the task, but in vain"
how is each word scored? (e.g., successful: 0.7, accomplishing: 0.8, but: -0.5, vain: -0.8)
How is each word given a value or score? What is going on behind the scenes? As I am doing sentiment analysis, I have a few things to clear up, so it would be great if someone could help. Thanks in advance.
If you are willing to use Python and NLTK, then check out Vader (http://www.nltk.org/howto/sentiment.html and skip down to the Vader section)
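A quick sketch of VADER on the example sentence (the vader_lexicon download is a one-time step):

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
# returns neg/neu/pos proportions plus a compound score in [-1, 1]
print(sia.polarity_scores("i am successful in accomplishing the task, but in vain"))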
The scores for individual words can come from predefined word lists such as ANEW, General Inquirer, SentiWordNet, LabMT, or my AFINN. They have been scored either by individual experts, by students, or by Amazon Mechanical Turk workers. Obviously, these scores are not the ultimate truth.
Word scores can also be computed by supervised learning on annotated texts, or estimated from word ontologies or co-occurrence patterns.
As for aggregating the individual word scores, there are various ways. One is to sum all the individual scores (valences); another is to take the maximum valence among the words; a third is to normalize (divide) by the number of words or by the number of scored words (i.e., a mean score), or to divide by the square root of that number. The results may differ a bit.
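As a toy illustration of these aggregations, using the scores from the question (the numbers are the question's own, not from any real word list):

scores = {'successful': 0.7, 'accomplishing': 0.8, 'but': -0.5, 'vain': -0.8}
valences = list(scores.values())

total = sum(valences)                 # sum of valences: 0.2
mean = total / len(valences)          # mean over scored words: 0.05
peak = max(valences, key=abs)         # max-magnitude valence: 0.8
root = total / len(valences) ** 0.5   # sum / sqrt(n): 0.1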
I made some evaluation with my AFINN word list: http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/6028/pdf/imm6028.pdf
Another approach uses recursive models like Richard Socher's. The sentiment values of the individual words are aggregated in a tree-like structure, and such a model should find that the "but in vain" part of your example carries the most weight.

Applied NLP: how to score a document against a lexicon of multi-word terms?

This is probably a fairly basic NLP question, but I have the following task at hand: I have a collection of text documents that I need to score against an (English) lexicon of terms that can be 1, 2, 3, ... N words long. N is bounded by some "reasonable" number, but the distribution of terms in the lexicon across the various values of n = 1, ..., N might be fairly uniform. This lexicon can, for example, contain a list of devices of a certain type, and I want to see if a given document is likely about any of these devices. So I would want to score a document high(er) if it has one or more occurrences of any of the lexicon entries.
What is a standard NLP technique to do the scoring while accounting for various forms of the words that may appear in the lexicon? What sort of preprocessing would be required for both the input documents and the lexicon to be able to perform the scoring? What sort of open-source tools exist for both the preprocessing and the scoring?
I studied LSI and topic modeling almost a year ago, so what I say should be taken as merely a pointer to give you a general idea of where to look.
There are many different ways to do this with varying degrees of success. This is a hard problem in the realm of information retrieval. You can search for topic modeling to learn about different options and state of the art.
You definitely need some preprocessing and normalization if the words can appear in different forms. How about NLTK and one of its stemmers:
>>> from nltk.stem.lancaster import LancasterStemmer
>>> st = LancasterStemmer()
>>> st.stem('applied')
'apply'
>>> st.stem('applies')
'apply'
You have a lexicon of terms that I am going to call terms and also a bunch of documents. I am going to explore a very basic technique to rank documents with regards to the terms. There are a gazillion more sophisticated ways you can read about, but I think this might be enough if you are not looking for something too sophisticated and rigorous.
This is called a vector space IR model. Terms and documents are both converted to vectors in a k-dimensional space. For that we have to construct a term-by-document matrix. This is a sample matrix in which the numbers represent frequencies of the terms in documents:
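It might look something like this, with hypothetical counts (any small example serves the illustration):

            doc1  doc2  doc3  doc4
apply         2     0     1     0
model         0     3     1     1
device        1     0     0     2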
So far we have a 3x4 matrix, with which each document can be expressed as a 3-dimensional vector (one per column). But as the number of terms increases, these vectors become very large and increasingly sparse. Also, many words such as "I" or "and" occur in most documents without adding much semantic content, so you might want to disregard these types of words. For the problems of size and sparseness, you can use a mathematical technique called SVD that scales down the matrix while preserving most of the information it contains.
Also, the numbers in the chart above were raw counts. Another technique is to use Boolean values: 1 for the presence and 0 for the absence of a term in a document. But these assume that all words have equal semantic weight, whereas in reality rarer words carry more weight than common ones. So a good way to adjust the initial matrix is to use a weighting function like tf-idf to assign a relative weight to each term. Once SVD has been applied to the weighted term-by-document matrix, we can construct the k-dimensional query vectors, which are simply arrays of the term weights. If a query contains multiple instances of the same term, the product of the frequency and the term weight is used.
What we need to do from there is somewhat straightforward. We compare the query vectors with document vectors by analyzing their cosine similarities and that would be the basis for the ranking of the documents relative to the queries.
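A compact sketch of this pipeline with scikit-learn: TfidfVectorizer builds the weighted term-by-document matrix, TruncatedSVD performs the dimensionality reduction, and cosine similarity ranks the documents. The documents and query here are made up:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "wireless router with dual band support",
    "apply the firmware update to the router",
    "kitchen blender with glass jar",
    "bluetooth speaker, portable and waterproof",
]
query = ["router firmware"]

# tf-idf weighted term-by-document matrix (documents are rows here)
vectorizer = TfidfVectorizer(stop_words='english')
doc_matrix = vectorizer.fit_transform(documents)

# SVD scales the sparse matrix down to k dimensions while preserving most information
svd = TruncatedSVD(n_components=2)
doc_vecs = svd.fit_transform(doc_matrix)
query_vec = svd.transform(vectorizer.transform(query))

# rank documents by cosine similarity to the query
similarities = cosine_similarity(query_vec, doc_vecs)[0]
for rank, idx in enumerate(similarities.argsort()[::-1], start=1):
    print(rank, round(similarities[idx], 3), documents[idx])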
