Word frequency database with senses - nlp

I am looking for a downloadable database of word frequencies/probabilities that distinguishes word senses. Ideally, it would be mapped to WordNet.
In the list, some words would appear multiple times if they have multiple senses, e.g. the frequency of 'bank' as an institution would be greater than that of a river 'bank'.
Other datasets showing frequencies by word/part of speech would be helpful too.
Thanks for reading this.

N-gram frequencies are available in the Google Ngram data. Although this does not answer the WordNet or the "senses" part of the question, it is a good start.
Use this package to experiment with it.

Related

Measure of similarity using meronym/holonym edge on Wordnet

StackOverflow!
I searched on Stack Overflow but have not found an answer to my question, which is as follows:
Is there any similarity measure for WordNet that explores (navigates) holonym/meronym and hypernym/hyponym edges at the same time? I have found only measures that look for common hypernym vertices in WordNet ...
My question does not contain a code snippet; it is only about a WordNet feature.
UPDATE:
I'm searching for a measure that does not rely only on 'is-a' relations to compare two concepts semantically. I want a measure that, in some cases, is allowed to "bind" two concepts by skipping the climb through the 'is-a' taxonomy up to the closest hypernym and navigating the 'member of' (holonym/meronym) taxonomy instead, given some justification.
Thanks in advance.
I recently read this paper: Semantic Relatedness of Words and Phrases
Table 1 on p. 3 shows the weighting scheme they used. They then use the total weighted connections to decide whether two words are related.
As far as I am aware, there is no ready-made function in nltk to do this.
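To make the idea of mixing relation types concrete, here is a minimal sketch of a weighted traversal over a toy graph with both 'is-a' and 'part-of' edges. It is not the paper's method and not an NLTK function: the graph, the per-relation costs in `RELATION_COST`, and the `relatedness` formula are all assumptions chosen for illustration.

```python
import heapq

# Hypothetical per-relation traversal costs; these numbers are illustrative
# and are NOT taken from the paper's Table 1.
RELATION_COST = {"hypernym": 1.0, "hyponym": 1.0, "holonym": 1.5, "meronym": 1.5}

# Toy graph (an assumption, not real WordNet data): node -> [(relation, neighbour)].
TOY_GRAPH = {
    "wheel": [("holonym", "car"), ("hypernym", "artifact")],
    "car": [("meronym", "wheel"), ("hypernym", "vehicle")],
    "vehicle": [("hyponym", "car"), ("hypernym", "artifact")],
    "artifact": [("hyponym", "vehicle"), ("hyponym", "wheel")],
}


def cheapest_path_cost(graph, source, target, costs=RELATION_COST):
    """Dijkstra's algorithm over typed edges; returns inf if target is unreachable."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for relation, neighbour in graph.get(node, []):
            new_dist = dist + costs[relation]
            if new_dist < best.get(neighbour, float("inf")):
                best[neighbour] = new_dist
                heapq.heappush(queue, (new_dist, neighbour))
    return float("inf")


def relatedness(graph, a, b):
    """Map a path cost to a score in (0, 1]; unreachable pairs score 0."""
    return 1.0 / (1.0 + cheapest_path_cost(graph, a, b))
```

In this toy graph, 'wheel' reaches 'car' directly through a holonym edge, which is cheaper than going up to 'artifact' and back down through the 'is-a' taxonomy. With real WordNet data, the neighbour lists could be filled from the synset relation methods, and the costs would need tuning.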

Trying to detect products from text while using a dictionary

I have a list of product names and a collection of text generated by random users. I am trying to detect products mentioned in the text while taking spelling variation into account. For example, the text
Text = i am interested in galxy s8
Mentions the product samsung galaxy s8
But note the difference in spellings.
I've implemented the following approaches:
1- I max-tokenized the product names and the users' text (I split words on punctuation and digits, so s8 is tokenized into 's' and '8'). Then I checked each token in the user's text to see whether it is in my vocabulary with a Damerau-Levenshtein distance <= 1, to allow for variation in spelling. Once I detected a sequence of tokens that exist in the vocabulary, I searched for the product that matches the query, again checking the Damerau-Levenshtein distance on each token. This gave poor results, mainly because a sequence of tokens that exist in the vocabulary does not necessarily represent a product. For example, since the text is max-tokenized, numbers can be found in the vocabulary, and as a result dates are detected as products.
2- I constructed bigram and trigram indices from the list of products and converted each user text into a query, but the results weren't great either, given the spelling variation.
3- I manually labeled 270 sentences and trained a named entity recognizer with the labels 'O' and 'Product'. I split the data into 80% training and 20% test. Note that I didn't use the list of products as part of the features. Results were okay, but not great.
None of the above approaches achieved reliable performance. I tried regular expressions, but since there are so many different combinations to consider it became too complicated. Are there better ways to tackle this problem? I suppose NER could give better results if I train on more data, but supposing there isn't enough training data, what do you think a better solution would be?
If I come up with a better alternative to the ones I've already mentioned, I'll add it to this post. In the meantime, I'm open to suggestions.
Consider splitting your problem into two parts.
1) Run a spelling check using a dictionary of known product names (this is not an NLP task, and there should be guides on how to implement a spell checker).
2) Once you have done pre-processing (spell checking), use your NER algorithm
It should improve your accuracy.
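As a rough illustration of the spell-check step, the sketch below snaps each token in the user's text to the closest token in a vocabulary built from product names. It uses `difflib.get_close_matches` from the Python standard library, which ranks candidates by a similarity ratio rather than by Damerau-Levenshtein distance, so it is a stand-in for the distance check described in the question. The product list and the `cutoff` value are assumptions.

```python
import difflib
import re

# Hypothetical product catalog; replace with the real list of product names.
PRODUCTS = ["samsung galaxy s8", "apple iphone 7", "google pixel 2"]

# Vocabulary of whole tokens, so "s8" stays one token instead of "s" and "8".
VOCAB = sorted({token for name in PRODUCTS for token in name.split()})


def correct_tokens(text, vocab=VOCAB, cutoff=0.8):
    """Replace each token with its closest vocabulary entry, if one is similar enough."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    corrected = []
    for token in tokens:
        # Returns at most one candidate whose similarity ratio is >= cutoff.
        match = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else token)
    return corrected
```

Calling `correct_tokens("i am interested in galxy s8")` turns 'galxy' into 'galaxy' and leaves the other tokens alone; the corrected text can then be passed to the NER model. A fairly high cutoff keeps short tokens such as stray digits from being snapped to product tokens, which was one of the failure modes described in the question.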

Resource that provides number of documents where the term is covered

I am looking for resources that provide the number of documents in which a term occurs. For example, there are about 25 billion documents that contain the term "the" in the indexed internet.
I don't know of any document frequency lists for large corpora such as the web, but there are some term frequency lists available. For example, there are the frequency lists from the web corpora compiled by the Web-As-Corpus Kool Yinitiative, which include the 2-billion-word ukWaC English web corpus. Alternatively, there are the n-grams from the Google Books Corpus.
It has been shown that such term frequency counts can be used to reliably approximate document frequency counts.
Here is a somewhat more manageable frequency list.
Also take a look at this site - it contains a lot of info about existing corpora and words/ngrams lists. Unfortunately, most resources are paid, but not n-grams (for n > 1), so if you're going to process multiword terms, it can help.

Financial news headers classification to positive/negative classes

I'm doing a small research project in which I try to split the headers of financial news articles into positive and negative classes. For classification I'm using an SVM approach. The main problem I see now is that not many features can be produced for ML. News articles contain a lot of named entities and other "garbage" elements (from my point of view, of course).
Could you please suggest features that can be used for ML training? Current results are: precision = 0.6, recall = 0.8.
Thanks
The task is not trivial at all.
The straightforward approach would be to find or create a training set. That is a set of headers with positive news and a set of headers with negative news.
You turn the training set into a TF-IDF representation and then train a linear SVM to separate the two classes. Depending on the quality and size of your training set you can achieve something decent, though I'm not sure you would reach a 0.7 break-even point.
Then, to get better results, you need to go for NLP approaches. Try using a part-of-speech tagger to identify adjectives (trivial), and then score them using a sentiment DB such as SentiWordNet.
There is an excellent overview of sentiment analysis by Bo Pang and Lillian Lee that you should read.
How about these features?
1. Length of article header in words
2. Average word length
3. Number of words in a dictionary of "bad" words, e.g. dictionary = {terrible, horrible, downturn, bankruptcy, ...}. You may have to generate this dictionary yourself.
4. Ratio of words in that dictionary to total words in sentence
5. Similar to 3, but number of words in a "good" dictionary of words, e.g. dictionary = {boon, booming, employment, ...}
6. Similar to 4, but use the "good"-word dictionary
7. Time of the article's publication
8. Date of the article's publication
9. The medium through which it was published (you'll have to do some subjective classification)
10. A count of certain punctuation marks, such as the exclamation point
If you're allowed access to the actual article, you could use surface features from it, such as its total length and perhaps even the number of responses or the level of opposition to that article. You could also look at the many other word lists available online, such as Ogden's 850-word Basic English list, and see whether bad/good articles are likely to draw many words from them. I agree that it seems difficult to come up with a long list (e.g. 100 features) of useful features for this purpose.
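A minimal sketch of how the text-based features above could be computed for a single header. The two word sets contain only the example words from the list and would need to be extended; the function name and the feature names are assumptions.

```python
import re

# Seed lexicons using only the example words given above; extend them yourself.
BAD_WORDS = {"terrible", "horrible", "downturn", "bankruptcy"}
GOOD_WORDS = {"boon", "booming", "employment"}


def headline_features(headline):
    """Return surface features of one header as a dict of numbers."""
    words = re.findall(r"[a-z']+", headline.lower())
    n_words = len(words)
    n_bad = sum(word in BAD_WORDS for word in words)
    n_good = sum(word in GOOD_WORDS for word in words)
    return {
        "n_words": n_words,
        "avg_word_len": sum(len(word) for word in words) / n_words if n_words else 0.0,
        "n_bad": n_bad,
        "bad_ratio": n_bad / n_words if n_words else 0.0,
        "n_good": n_good,
        "good_ratio": n_good / n_words if n_words else 0.0,
        "n_exclaim": headline.count("!"),
    }
```

The resulting dicts can be turned into feature vectors for the SVM, alongside any metadata features such as publication time or medium.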
iliasfl is right, this is not a straightforward task.
I would use a bag of words approach but use a POS tagger first to tag each word in the headline. Then you could remove all of the named entities - which as you rightly point out don't affect the sentiment. Other words should appear frequently enough (if your dataset is big enough) to cancel themselves out from being polarised as either positive or negative.
One step further, if you still aren't close, could be to select only the adjectives and verbs from the tagged data, as they are the words that tend to convey emotion or mood.
I wouldn't be too disheartened by your precision and recall figures, though; an F-score of 0.8 and above is actually quite good.

Finding related texts(correlation between two texts)

I'm trying to find similar articles in a database via correlation.
So I split each text into an array of words, delete frequently used words (articles, pronouns and so on), and then compare two texts with a Pearson coefficient function. For some texts it works, but for others it's not so good (longer texts get a higher coefficient).
Can somebody advise a good method for finding related texts?
Some of the problems you mention boil down to normalizing over document length and overall word frequency. Try tf-idf.
First and foremost, you need to specify what you precisely mean by similarity and when two documents are (more/less) similar.
If the similarity you are looking for is literal, then I would vectorise the documents using term frequencies, and use the cosine similarity to liken them to each other given that texts are inherently directional data. tf-idf and log-entropy weighting schemes may be tested depending on your use-case. The edit distance is inefficient with long texts.
If you care more about the semantics, word embeddings are your ally.
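For the literal-similarity case, here is a minimal pure-Python sketch of tf-idf vectors compared with cosine similarity. The tokenizer and the smoothed idf formula are assumptions (several idf variants are in common use); a library implementation would normally be preferable on a real database.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf (one common variant); frequent terms get lower weight.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0 for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Because cosine similarity depends only on the direction of the vectors, a text and the same text repeated twice score about 1.0, so longer documents are not favoured just for being longer.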
