Classification of documents based on topic frequency - text

I need a way to identify the dominant topic for each document in the following data set, which was produced after pre-processing all the docs.
The frequencies of the selected topics are as follows:
TOPICS
id  Doc-name  total words  Politics  sport  food  animals
1   doc1      1000         300       250    100   350
2   doc2      2000         1000      400    200   400
3   doc3      4000         500       300    2000  200
etc...
My questions are:
Is there any classification method for this kind of data set?
If I consider doc1 to be about animals, is this correct?
Is there any way to calculate the probability of each topic in a document, in order to find the document's dominant topic?
Any suggestions, please?

This method of classification is only good when the type of document is to be determined in relation to a given set of topics. This type of analysis cannot, on its own, give an idea of the real context the document belongs to.
What is the context of the sentence if I say "The athlete is certainly faster than any cat, dog, cow or sheep"? Does it speak about animals?
The only conclusion you can make about the context of the sentence through this type of analysis is that "the sentence has factors pointing to both sports and animals, and the participation of those factors is 4 to 2".
You can go on to calculate the probability of each topic using standard methods, but the relevance of the numbers to the real context can be distant.
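As a minimal sketch of such a calculation (assuming the column names from the table in the question), you can normalise each document's topic counts by the sum of its topic counts and take the largest value as the dominant topic:
import pandas as pd
# Topic counts copied from the table above.
df = pd.DataFrame({
    'politics': [300, 1000, 500],
    'sport':    [250, 400, 300],
    'food':     [100, 200, 2000],
    'animals':  [350, 400, 200],
}, index=['doc1', 'doc2', 'doc3'])
probs = df.div(df.sum(axis=1), axis=0)   # P(topic | doc), normalised per document
dominant = probs.idxmax(axis=1)          # topic with the highest probability
print(probs.round(2))
print(dominant)   # doc1 -> animals, doc2 -> politics, doc3 -> food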

Gensim doc2vec's d2v.wv.most_similar() gives irrelevant words with high similarity scores

I've got a dataset of job listings with about 150,000 records. I extracted skills from the descriptions using NER with a dictionary of 30,000 skills. Every skill is represented by a unique identifier.
My data example:
   job_title         job_id  skills
1  business manager  4       12 13 873 4811 482 2384 48 293 48
2  java developer    55      48 2838 291 37 484 192 92 485 17 23 299 23...
3  data scientist    21      383 48 587 475 2394 5716 293 585 1923 494 3
Then, I train a doc2vec model using these data where job titles (their ids to be precise) are used as tags and skills vectors as word vectors.
import gensim

# Each record becomes a TaggedDocument: the skill ids are the "words", the job_id is the tag.
def tagged_document(df):
    for index, row in df.iterrows():
        yield gensim.models.doc2vec.TaggedDocument(row['skills'].split(), [str(row['job_id'])])

data_for_training = list(tagged_document(data[['job_id', 'skills']]))
model_d2v = gensim.models.doc2vec.Doc2Vec(dm=0, dbow_words=1, vector_size=80, min_count=3, epochs=100, window=100000)
model_d2v.build_vocab(data_for_training)
model_d2v.train(data_for_training, total_examples=model_d2v.corpus_count, epochs=model_d2v.epochs)
It works mostly okay, but I have issues with some job titles. I tried to collect more data for them, but I still see unpredictable behavior for them.
For example, I have a job title "Director Of Commercial Operations" which is represented by 41 data records having from 11 to 96 skills (mean 32). When I get the most similar words for it (skills in my case) I get the following:
docvec = model_d2v.docvecs[id_]
model_d2v.wv.most_similar(positive=[docvec], topn=5)
capacity utilization 0.5729076266288757
process optimization 0.5405482649803162
goal setting 0.5288119316101074
aeration 0.5124399662017822
supplier relationship management 0.5117508172988892
These are the top 5 skills and 3 of them look relevant. However, the top one doesn't look too valid, nor does "aeration". The problem is that none of the records for this job title contain these skills at all. It seems like noise in the output, but why does it get one of the highest similarity scores (although generally not high)?
Does it mean that the model can't pick out very specific skills for this kind of job title?
Can the number of "noisy" skills be reduced? Sometimes I see much more relevant skills with a lower similarity score, but it's often below 0.5.
One more example of correct behavior with a similar amount of data:
BI Analyst, 29 records, number of skills from 4 to 48 (mean 21). The top skills look alright.
business intelligence 0.6986587047576904
business intelligence development 0.6861011981964111
power bi 0.6589289903640747
tableau 0.6500121355056763
qlikview (data analytics software) 0.6307920217514038
business intelligence tools 0.6143202781677246
dimensional modeling 0.6032138466835022
exploratory data analysis 0.6005223989486694
marketing analytics 0.5737696886062622
data mining 0.5734485387802124
data quality 0.5729933977127075
data visualization 0.5691111087799072
microstrategy 0.5566076636314392
business analytics 0.5535123348236084
etl 0.5516749620437622
data modeling 0.5512707233428955
data profiling 0.5495884418487549
If your gold standard of what the model should report is skills that appeared in the training data, are you sure you don't want a simple count-based solution? For example, just provide a ranked list of the skills that appear most often in Director Of Commercial Operations listings?
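A minimal sketch of that count-based baseline, assuming the `data` DataFrame from the question with a 'job_title' column and space-separated skill ids in 'skills':
from collections import Counter

def top_skills(df, title, n=10):
    counts = Counter()
    for skills in df.loc[df['job_title'] == title, 'skills']:
        counts.update(skills.split())     # count each skill id occurrence
    return counts.most_common(n)          # ranked list of (skill_id, count)

print(top_skills(data, 'Director Of Commercial Operations'))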
On the other hand, the essence of compressing N job titles, and 30,000 skills, into a smaller (in this case vector_size=80) coordinate-space model is to force some non-intuitive (but perhaps real) relationships to be reflected in the model.
Might there be some real pattern in the model – even if, perhaps, just some idiosyncrasies in the appearance of less-common skills – that makes aeration necessarily slot near those other skills? (Maybe it's a rare skill whose few contextual appearances co-occur with other skills very near 'capacity utilization' – meaning that, with the tiny amount of data available and the tiny amount of overall attention given to this skill, there's no better place for it.)
Taking note of whether your 'anomalies' are often in low-frequency skills, or lower-frequency job-ids, might enable a closer look at the data causes, or some disclaimering/filtering of most_similar() results. (The most_similar() method can limit its returned rankings to the more frequent range of the known vocabulary, for cases when the long-tail or rare words, with their rougher vectors, are intruding into the higher-quality results from better-represented words. See the restrict_vocab parameter.)
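For example, a sketch reusing the variables from the question (the cutoff of 10,000 is just an illustrative guess):
docvec = model_d2v.docvecs[id_]
# Only consider the 10,000 most frequent skills when ranking similarities,
# so rare, roughly-trained vectors can't intrude in the results.
model_d2v.wv.most_similar(positive=[docvec], topn=5, restrict_vocab=10000)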
That said, tinkering with training parameters may result in rankings that better reflect your intent. A larger min_count might remove more tokens that, lacking sufficient varied examples, mostly just inject noise into the rest of training. A different vector_size, smaller or larger, might better capture the relationships you're looking for. A more-aggressive (smaller) sample could discard more high-frequency words that might be starving more-interesting less-frequent words of a chance to influence the model.
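A sketch of the kind of parameter changes meant here; the specific values are illustrative guesses, not recommendations:
model_d2v = gensim.models.doc2vec.Doc2Vec(
    dm=0, dbow_words=1,
    vector_size=120,   # try larger (or smaller) than the original 80
    min_count=10,      # drop skills with too few examples to train well
    sample=1e-4,       # more aggressive downsampling of very frequent skills
    epochs=100, window=100000,
)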
Note that with dbow_words=1 & a large window, and records with (perhaps?) dozens of skills each, the words are having a much-more neighborly effect on each other, in the model, than the tag<->word correlations. That might be good or bad.

Hierarchical semantic distance between words

I need labeled data (human judgments) for the structural/hierarchical semantic distance between many pairs (at least hundreds) of words.
For example, d(computer, television) < d(radio, television) < d(dish washer, television).
If we organize all the words in a dendrogram or tree, where each node is a category ("electric device", "with screen", etc.) and the words are at the leaves, the number would represent the number of steps (nodes) we have to go from one word to the other.
Does such dataset exist?
Per-pair ratings are enough; there is no need for a full embedding/tree or to specify the nodes.
An example dataset would be:
Computer Television 1
Radio Television 2
DishWasher Television 3
Thanks!
I'm not aware of such human-judgment datasets, but I guess you could look at semantic networks like WordNet, which is a lexical database of English in the form of a graph. Given two words, you could compute the distance between the nodes representing them in WordNet.
Both nouns and verbs are organized into hierarchies, defined by hypernym or IS-A relationships. For instance, one sense of the word dog is found by following the hypernym hierarchy below; the words at the same level represent synset members. Each set of synonyms has a unique index.
dog, domestic dog, Canis familiaris
canine, canid
carnivore
placental, placental mammal, eutherian, eutherian mammal
mammal
vertebrate, craniate
chordate
animal, animate being, beast, brute, creature, fauna
...
If you are looking for a dataset, you could also ask here.
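As a rough sketch of the WordNet-distance idea (using NLTK, taking the first noun sense of each word; real use would need word-sense disambiguation):
from nltk.corpus import wordnet as wn   # requires a one-time nltk.download('wordnet')

def wordnet_distance(w1, w2):
    s1 = wn.synsets(w1, pos=wn.NOUN)[0]        # first (most common) noun sense
    s2 = wn.synsets(w2, pos=wn.NOUN)[0]
    return s1.shortest_path_distance(s2)       # edges in the hypernym/hyponym graph

print(wordnet_distance('computer', 'television'))
print(wordnet_distance('radio', 'television'))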

How is polarity calculated for a sentence? (in sentiment analysis)

How is the polarity of the words in a statement calculated? For example, in
"i am successful in accomplishing the task, but in vain"
how is each word scored? (like: successful 0.7, accomplishing 0.8, but -0.5, vain -0.8)
How is it calculated? How is each word given a value or score? What is going on behind the scenes? As I am doing sentiment analysis I have a few things to clear up, so it would be great if someone could help. Thanks in advance.
If you are willing to use Python and NLTK, then check out Vader (http://www.nltk.org/howto/sentiment.html and skip down to the Vader section)
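A minimal example of the Vader suggestion above (the lexicon download is a one-time step):
import nltk
nltk.download('vader_lexicon')                 # one-time download of the VADER word list
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("i am successful in accomplishing the task, but in vain"))
# returns 'neg', 'neu', 'pos' proportions and an overall 'compound' score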
The scores for individual words can come from predefined word lists such as ANEW, General Inquirer, SentiWordNet, LabMT or my AFINN. They have been scored either by individual experts, by students or by Amazon Mechanical Turk workers. Obviously, these scores are not the ultimate truth.
Word scores can also be computed by supervised learning with annotated texts, or estimated from word ontologies or co-occurrence patterns.
As for the aggregation of individual words, there are various ways. One way would be to sum all the individual scores (valences), another to take the max valence among the words, a third to normalize (divide) by the number of words or by the number of scored words (i.e., getting a mean score), or to divide by the square root of that number. The results may differ a bit.
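Using the per-word scores from the question above as stand-in valences, the different aggregations look like this:
import math

scores = {'successful': 0.7, 'accomplishing': 0.8, 'but': -0.5, 'vain': -0.8}
total = sum(scores.values())                 # plain sum of valences
strongest = max(scores.values())             # max valence among the words
mean = total / len(scores)                   # normalise by the number of scored words
sqrt_norm = total / math.sqrt(len(scores))   # divide by the square root of that number
print(total, strongest, mean, sqrt_norm)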
I made some evaluation with my AFINN word list: http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/6028/pdf/imm6028.pdf
Another approach is with recursive models like Richard Socher's. The sentiment values of the individual words are aggregated in a tree-like structure, and such a model should find that the "but in vain" part of your example carries the most weight.

PredictionIO for Content Recommendation e.g. Tweets

I recently installed PredictionIO.
What I'd like to achieve is: I'd like to categorize content based on the words included in the text. But how can I import data like raw Tweets into PredictionIO? Is it possible to let PredictionIO run over the content, find strong words and sort them into categories?
What I'd like to get is something like this: Query for Boston Red Sox --> keywords that should appear would be: baseball, Boston, sports, ...
So I'll add on a little to what Thomas said. He's right, it all depends on whether or not you have labels associated with your tweets. If your data is labeled, then this is a Text Classification problem. Look at the Text Classification reference for more detailed info.
If you're instead looking to cluster, or group, a set of unlabeled observations then, as Thomas said, your best bet is to incorporate LDA into the works. Look at the LDA documentation for more information, but basically once you run the LDA model you'll obtain an object of type DistributedLDAModel which has a method topicDistributions that gives you, for each tweet, a vector where each component is associated with a topic, and the component entry gives you the probability that the tweet belongs to that topic. You can cluster by assigning each tweet the topic with the highest probability.
You also have access to a matrix of size MxN, where M is the number of words in your vocabulary and N is the number of topics, or clusters, you wish to discover in your data. You can roughly interpret the (i, j)-th entry of this Topics Matrix as the probability that word i appears in a document given that the document belongs to topic j.
Another rule you could use for clustering is to treat each word vector associated with your tweets as a vector of counts. Then you can interpret the (i, j) entry of the product of your word matrix (tweets as rows, words as columns) and the Topics Matrix returned by LDA as the probability that tweet i belongs to topic j (this follows under certain assumptions; feel free to ask if you want more details). Again, you assign tweet i to the topic associated with the largest numerical value in row i of the resulting matrix. You can even use this clustering rule for assigning topics to incoming observations once you have used your original set of tweets for topic discovery!
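A stand-in sketch of the first clustering rule, using scikit-learn rather than Spark MLlib (the answer above refers to DistributedLDAModel, but the idea of picking the highest-probability topic per tweet is the same; the tweet texts are placeholders):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = ["red sox win the game tonight",
          "new phone launch announced today",
          "baseball season starts at fenway park"]
counts = CountVectorizer().fit_transform(tweets)   # tweets as rows, word counts as columns
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
doc_topic = lda.transform(counts)                  # per-tweet topic probability vectors
print(doc_topic.argmax(axis=1))                    # assign each tweet its most likely topic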
Now, for data processing, you can still use the Text Classification reference for transforming your Tweets into word count vectors via the DataSource and Preparator components. As for importing your data, if you have the tweets saved locally in a file, you can use PredictionIO's Python SDK to import them. An example is also given in the classification reference.
Feel free to ask any questions if anything isn't clear, and good luck!
So, it really depends on whether you have labelled data.
For example:
Baseball :: "I love Boston Red Sox #GoRedSox"
Sports :: "Woohoo! I love sports #winning"
Boston :: "Baseball time at Fenway Park. Red Sox FTW!"
...
Then you would be able to train a model to classify Tweets against these keywords. You might be interested in the templates for MLlib Naive Bayes or Decision Trees.
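Outside of PredictionIO, the labelled case boils down to ordinary text classification; a minimal scikit-learn sketch using the example tweets and labels listed above:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

tweets = ["I love Boston Red Sox #GoRedSox",
          "Woohoo! I love sports #winning",
          "Baseball time at Fenway Park. Red Sox FTW!"]
labels = ["Baseball", "Sports", "Boston"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())   # word counts -> Naive Bayes
clf.fit(tweets, labels)
print(clf.predict(["Red Sox game at Fenway tonight"]))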
If you don't have labelled data (really, who wants to manually label Tweets) you might be able to use approaches such as Topic Modeling (e.g., LDA).
I don't think there is a template for LDA, but being an active open source project it wouldn't surprise me if someone has already implemented this, so it might be a good idea to ask on the PredictionIO user or dev forums.

tf-idf: Does using it help to weight documents that share the terms higher than a document that doesn't?

I'm working on a customized search feature for a website, and I was curious whether using only tf-idf to rank documents in my corpus would also help to weight documents that contain multiple search terms higher than documents with only one search term.
Example: Search = "poland spring water"
Theoretically, would the above query (using traditional tf-idf) weight a document higher if the document contained "poland" 100 times and "water" zero times? Or would it weight a document higher if it contained "poland" 10 times and "water" 10 times?
I'm aware that it all depends on the tf-idf values of "poland" and "water", but theoretically, on an even playing field, would the algorithm help bring documents to the top of the results more if there were multiple terms in the document, or is it really term independent?
It is term independent. Remember, the tf-idf weighting scheme treats the query as a bag of words, and each document is seen as a vector. For the above example, suppose the tf of "poland" is 100 in doc x while its idf is 1. Also, suppose the tf of "poland" is 10 and the tf of "water" is 2 in doc y, and the idf of "water" is 1.
score of doc x = 100 * 1 = 100
score of doc y = 10 * 1 + 2 * 1 = 12
Doc x is ranked higher even though it contains only one of the terms.
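A tiny sketch of that scoring, with the tf and idf values assumed in the example above, makes the term independence explicit: each query term contributes its own tf * idf, and the contributions are simply summed.
query = ["poland", "water"]
doc_x = {"poland": 100, "water": 0}   # term frequencies in doc x
doc_y = {"poland": 10, "water": 2}    # term frequencies in doc y
idf = {"poland": 1, "water": 1}

def score(doc):
    return sum(doc.get(term, 0) * idf[term] for term in query)

print(score(doc_x), score(doc_y))     # 100 vs 12: doc x ranks higher with one term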
It's term independent. It depends on the ratio of how many documents contain "poland" and how many contain "water"; the idf reflects that ratio. If it's half and half, then the second document wins. If the ratio is 100:1, then the first document wins, since that ratio is more similar to the in-document distribution of the words.
