How does sklearn's Latent Dirichlet Allocation really work? - python-3.x

I have some texts and I'm using sklearn LatentDirichletAllocation algorithm to extract the topics from the texts.
I already have the texts converted into sequences using Keras and I'm doing this:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation()
X_topics = lda.fit_transform(X)
X:
print(X)
# array([[0, 988, 233, 21, 42, 5436, ...],
[0, 43, 6526, 21, 566, 762, 12, ...]])
X_topics:
print(X_topics)
# array([[1.24143852e-05, 1.23983890e-05, 1.24238815e-05, 2.08399432e-01,
7.91563331e-01],
[5.64976371e-01, 1.33304549e-05, 5.60003133e-03, 1.06638803e-01,
3.22771464e-01]])
My question is: what exactly is being returned from fit_transform? I know it should be the main topics detected in the texts, but I cannot map those numbers to an index, so I can't see what those sequences mean. I failed to find an explanation of what is actually happening, so any suggestion will be much appreciated.

First, a general explanation: think of LDiA as a clustering algorithm that determines, by default, 10 centroids based on the frequencies of words in the texts, and puts greater weight on some of those words than others by virtue of their proximity to the centroid. Each centroid represents a 'topic' in this context, where the topic is unnamed but can be roughly described by the words that are most dominant in forming each cluster.
So generally what you're doing with LDA is:
getting it to tell you what the 10 (or whatever) topics are of a given text.
or
getting it to tell you which centroid/topic some new text is closest to
For the second scenario, your expectation is that LDiA will output the "score" of the new text for each of the 10 clusters/topics. The index of the highest score is the index of the cluster/topic to which that new text belongs.
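To make the "score per topic" idea concrete, here is a minimal sketch on a made-up four-document corpus (the texts and the choice of 2 topics are assumptions for illustration). It also shows the input sklearn's LDA expects: a matrix of per-document word counts, as produced by CountVectorizer, rather than padded sequences of token ids.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus, made up for illustration.
docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats are popular pets",
    "the stock market fell as investors sold shares",
    "investors bought shares when the market rose",
]

# LDA works on non-negative word counts per document,
# not on padded sequences of token ids.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)

# One row per document, one column per topic; each row sums to 1.
print(doc_topic.shape)
print(doc_topic.sum(axis=1))

# Index of the highest score = dominant topic for each document.
dominant_topic = doc_topic.argmax(axis=1)
print(dominant_topic)
```

Each row of the result can be read as that document's mixture over topics, which is why the rows in the question's X_topics output also add up to roughly 1.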
I prefer gensim.models.LdaMulticore, but since you've used the sklearn.decomposition.LatentDirichletAllocation I'll use that.
Here's some sample code (drawn from here) that runs through this process
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
n_samples = 2000
n_features = 1000
n_components = 10
n_top_words = 20
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
data, _ = fetch_20newsgroups(shuffle=True, random_state=1,
remove=('headers', 'footers', 'quotes'),
return_X_y=True)
X = data[:n_samples]
#create a count vectorizer using the sklearn CountVectorizer which has some useful features
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
max_features=n_features,
stop_words='english')
vectorizedX = tf_vectorizer.fit_transform(X)
lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
learning_method='online',
learning_offset=50.,
random_state=0)
lda.fit(vectorizedX)
Now let's try a new text:
testX = tf_vectorizer.transform(["I am educated about learned stuff"])
#get lda to score this text against each of the 10 topics
lda.transform(testX)
Out:
array([[0.54995409, 0.05001176, 0.05000163, 0.05000579, 0.05 ,
0.05001033, 0.05000001, 0.05001449, 0.05000123, 0.05000066]])
#looks like the first topic has the high score - now what are the words that are most associated with each topic?
print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
print_top_words(lda, tf_feature_names, n_top_words)
Out:
Topics in LDA model:
Topic #0: edu com mail send graphics ftp pub available contact university list faq ca information cs 1993 program sun uk mit
Topic #1: don like just know think ve way use right good going make sure ll point got need really time doesn
Topic #2: christian think atheism faith pittsburgh new bible radio games alt lot just religion like book read play time subject believe
Topic #3: drive disk windows thanks use card drives hard version pc software file using scsi help does new dos controller 16
Topic #4: hiv health aids disease april medical care research 1993 light information study national service test led 10 page new drug
Topic #5: god people does just good don jesus say israel way life know true fact time law want believe make think
Topic #6: 55 10 11 18 15 team game 19 period play 23 12 13 flyers 20 25 22 17 24 16
Topic #7: car year just cars new engine like bike good oil insurance better tires 000 thing speed model brake driving performance
Topic #8: people said did just didn know time like went think children came come don took years say dead told started
Topic #9: key space law government public use encryption earth section security moon probe enforcement keys states lunar military crime surface technology
Seems sensible - the sample text is about education and the word cloud for the first topic is about education.
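For anyone puzzled by how the topic word lists are produced, the following sketch isolates that step. The small matrix and word list are made-up stand-ins for the fitted model's components_ and the vectorizer's feature names; only the argsort slice is taken from the code above.

```python
import numpy as np

# Hypothetical stand-ins for the vectorizer's feature names and the
# fitted model's components_ (2 topics x 5 words); values are made up.
feature_names = np.array(["bible", "disk", "drive", "god", "windows"])
components = np.array([
    [9.0, 0.5, 0.2, 7.0, 0.3],   # topic 0
    [0.1, 6.0, 8.0, 0.2, 5.0],   # topic 1
])

n_top_words = 3
top_words = []
for topic in components:
    # argsort is ascending; this slice walks backwards from the end,
    # so it yields the indices of the n_top_words largest weights.
    top_idx = topic.argsort()[:-n_top_words - 1:-1]
    top_words.append(list(feature_names[top_idx]))

print(top_words)

# Normalising each row turns the weights into a word distribution per topic.
topic_word = components / components.sum(axis=1, keepdims=True)
print(topic_word.sum(axis=1))
```

So the column index of a score returned by transform is the row index of components_, and that row's largest entries are the words that describe the topic.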
The pictures below are from another dataset (ham vs spam SMS messages, so only two possible topics) which I reduced to 3 dimensions with PCA, but in case a picture helps, these two (same data from different angles) might give a general sense of what's going on with LDiA. (The graphs are from Linear Discriminant Analysis vs LDiA, but the representation is still relevant.)
While LDiA is an unsupervised method, to actually use it in a business context you'll likely want to at least manually intervene to give the topics names that are meaningful to your context. e.g. Assigning a subject area to stories on a news aggregation site, choosing amongst ['Business', 'Sports', 'Entertainment', etc]
For further study, perhaps run through something like this:
https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24

Related

How to understand the answer_start parameter of Squad dataset for training BERT-QA model + practical implications for creating custom dataset?

I am in the process of creating a custom dataset to benchmark the accuracy of the 'bert-large-uncased-whole-word-masking-finetuned-squad' model for my domain, to understand if I need to fine-tune further, etc.
When looking at the different Question Answering datasets on the Hugging Face site (squad, adversarial_qa, etc.), I see that the answer is commonly formatted as a dictionary with keys: text (the answer text) and answer_start (char index where the answer starts).
I'm trying to understand:
The intuition behind how the model uses the answer_start when calculating the loss, accuracy, etc.
If I need to go through the process of adding this to my custom dataset (easier to run model evaluation code, etc?)
If so, is there a programmatic way to do this to avoid manual effort?
Any help or direction would be greatly appreciated!
Code example to show format:
import datasets
ds = datasets.load_dataset('squad')
train = ds['train']
print('Example: \n')
print(train['answers'][0])
Your question is a bit broad to give you a specific answer, but I will try my best to point you in some directions.
The intuition behind how the model uses the answer_start when
calculating the loss, accuracy, etc.
There are different types of QA tasks/datasets. The ones you mentioned (SQuAD and adversarial_qa) belong to the field of extractive question answering. There, a model must select a span from a given context that answers the given question. For example:
context = 'Second, Democrats have always elevated their minority floor leader to the speakership upon reclaiming majority status. Republicans have not always followed this leadership succession pattern. In 1919, for instance, Republicans bypassed James R. Mann, R-IL, who had been minority leader for eight years, and elected Frederick Gillett, R-MA, to be Speaker. Mann "had angered many Republicans by objecting to their private bills on the floor;" also he was a protégé of autocratic Speaker Joseph Cannon, R-IL (1903–1911), and many Members "suspected that he would try to re-centralize power in his hands if elected Speaker." More recently, although Robert H. Michel was the Minority Leader in 1994 when the Republicans regained control of the House in the 1994 midterm elections, he had already announced his retirement and had little or no involvement in the campaign, including the Contract with America which was unveiled six weeks before voting day.'
question='How did Republicans feel about Mann in 1919?'
answer='angered' #-> starting at character 365
A simple approach that is often used today is a linear layer that predicts the answer start and answer end from the last hidden state of a transformer encoder (code example). The last hidden state holds one vector for each input token (tokens != words), and the linear layer is trained to assign high probabilities to the tokens that could be the start and end of the answer span. To train a model with your data, the loss function needs to know which tokens should get a high probability (i.e. the start and end tokens of the answer).
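As a rough illustration of where answer_start enters, the sketch below converts a character-level answer span into the start and end token indices that would serve as training targets. It uses plain whitespace tokens as a simplifying assumption (real models use subword tokenizers with their own offset mappings), and the helper name and the shortened context are made up.

```python
import re

def char_span_to_token_span(context, answer_text, answer_start):
    """Map a character-level answer span to (start_token, end_token)."""
    answer_end = answer_start + len(answer_text)  # exclusive
    # Whitespace tokens with their character offsets. Real models use
    # subword tokenizers, but the mapping idea is the same.
    offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", context)]
    start_token = end_token = None
    for i, (tok_start, tok_end) in enumerate(offsets):
        # A token belongs to the answer if it overlaps the answer span.
        if tok_start < answer_end and tok_end > answer_start:
            if start_token is None:
                start_token = i
            end_token = i
    return start_token, end_token

context = "Mann had angered many Republicans by objecting to their bills."
answer = "angered many Republicans"
answer_start = context.find(answer)

start_token, end_token = char_span_to_token_span(context, answer, answer_start)
tokens = context.split()
print(start_token, end_token, tokens[start_token:end_token + 1])
```

The two indices are what the loss compares against the model's predicted start and end distributions, which is why the dataset has to say where the answer begins.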
If I need to go through the process of adding this to my custom
dataset (easier to run model evaluation code, etc?)
You should go through this process; otherwise, how would someone know where the answer starts in your context? They can of course infer it programmatically, but what if your answer string appears twice in the context? Providing an answer start position avoids confusion and allows your users to use it right away with one of the many extractive question answering scripts that are already available out there.
If so, is there a programmatic way to do this to avoid manual effort?
You could simply loop through your dataset and use str.find:
context.find(answer)
Output:
365
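A slightly more defensive version of that loop might look like the sketch below. The record layout and field names are hypothetical; the point is to flag answers that occur zero times or more than once, since str.find alone would silently return -1 or only the first occurrence.

```python
def find_all(context, answer):
    """Return every character index at which `answer` occurs in `context`."""
    starts = []
    pos = context.find(answer)
    while pos != -1:
        starts.append(pos)
        pos = context.find(answer, pos + 1)
    return starts

# Hypothetical records; the field names are illustrative only.
records = [
    {"context": "Paris is the capital of France.", "answer": "Paris"},
    {"context": "The cat chased the other cat.", "answer": "cat"},
    {"context": "No match in this one.", "answer": "dog"},
]

for rec in records:
    starts = find_all(rec["context"], rec["answer"])
    # Exactly one hit is safe to use; zero or several need a human look.
    rec["answer_start"] = starts[0] if len(starts) == 1 else None
    rec["needs_review"] = len(starts) != 1

for rec in records:
    print(rec["answer"], rec["answer_start"], rec["needs_review"])
```

Only the flagged records then need manual annotation.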

Gensim doc2vec's d2v.wv.most_similar() gives not relevant words with high similarity scores

I've got a dataset of job listings with about 150 000 records. I extracted skills from the descriptions with NER, using a dictionary of 30 000 skills. Every skill is represented by a unique identifier.
My data example:
   job_title         job_id  skills
1  business manager  4       12 13 873 4811 482 2384 48 293 48
2  java developer    55      48 2838 291 37 484 192 92 485 17 23 299 23...
3  data scientist    21      383 48 587 475 2394 5716 293 585 1923 494 3
Then, I train a doc2vec model using these data where job titles (their ids to be precise) are used as tags and skills vectors as word vectors.
import gensim

def tagged_document(df):
    for index, row in df.iterrows():
        yield gensim.models.doc2vec.TaggedDocument(row['skills'].split(), [str(row['job_id'])])
data_for_training = list(tagged_document(data[['job_id', 'skills']]))
model_d2v = gensim.models.doc2vec.Doc2Vec(dm=0, dbow_words=1, vector_size=80, min_count=3, epochs=100, window=100000)
model_d2v.build_vocab(data_for_training)
model_d2v.train(data_for_training, total_examples=model_d2v.corpus_count, epochs=model_d2v.epochs)
It works mostly okay, but I have issues with some job titles. I tried to collect more data for them, but I still see unpredictable behavior with them.
For example, I have a job title "Director Of Commercial Operations" which is represented as 41 data records having from 11 to 96 skills (mean 32). When I get most similar words for it (skills in my case) I get the following:
docvec = model_d2v.docvecs[id_]
model_d2v.wv.most_similar(positive=[docvec], topn=5)
capacity utilization 0.5729076266288757
process optimization 0.5405482649803162
goal setting 0.5288119316101074
aeration 0.5124399662017822
supplier relationship management 0.5117508172988892
These are the top 5 skills, and 3 of them look relevant. However, the top one doesn't look valid, and neither does "aeration". The problem is that none of the job title's records contain these skills at all. It looks like noise in the output, but why does it get one of the highest similarity scores (although generally not a high one)?
Does it mean that the model can't outline very specific skills for this kind of job titles?
Can the number of "noisy" skills be reduced? Sometimes I see much more relevant skills with lower similarity score, but it's often lower than 0.5.
One more example of correct behavior with similar amount of data:
BI Analyst, 29 records, number of skills from 4 to 48 (mean 21). The top skills look alright.
business intelligence 0.6986587047576904
business intelligence development 0.6861011981964111
power bi 0.6589289903640747
tableau 0.6500121355056763
qlikview (data analytics software) 0.6307920217514038
business intelligence tools 0.6143202781677246
dimensional modeling 0.6032138466835022
exploratory data analysis 0.6005223989486694
marketing analytics 0.5737696886062622
data mining 0.5734485387802124
data quality 0.5729933977127075
data visualization 0.5691111087799072
microstrategy 0.5566076636314392
business analytics 0.5535123348236084
etl 0.5516749620437622
data modeling 0.5512707233428955
data profiling 0.5495884418487549
If your gold standard for what the model should report is skills that appeared in the training data, are you sure you don't want a simple count-based solution? For example, just provide a ranked list of the skills that appear most often in Director Of Commercial Operations listings?
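A count-based ranking of that kind could be sketched as follows. The listings are made-up (job_id, skills) pairs in the same shape as the question's data, and top_skills is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Hypothetical listings as (job_id, skills) pairs, made up for illustration.
listings = [
    ("4", "12 13 873 48"),
    ("4", "12 48 293"),
    ("4", "12 873 2384"),
    ("55", "48 2838 291"),
]

skill_counts = defaultdict(Counter)
for job_id, skills in listings:
    # set() so a skill repeated within one listing is counted once.
    skill_counts[job_id].update(set(skills.split()))

def top_skills(job_id, topn=5):
    """Skills ranked by the number of listings of this job that mention them."""
    return skill_counts[job_id].most_common(topn)

print(top_skills("4", topn=3))
```

By construction this can only return skills that actually appear in a job title's listings, which is the property the question seems to want.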
On the other hand, the essence of compressing N job titles, and 30,000 skills, into a smaller (in this case vector_size=80) coordinate-space model is to force some non-intuitive (but perhaps real) relationships to be reflected in the model.
Might there be some real pattern in the model (even if, perhaps, just some idiosyncrasies in the appearance of less-common skills) that makes aeration necessarily slot near those other skills? (Maybe it's a rare skill whose few contextual appearances co-occur with other skills very near 'capacity utilization', meaning that with the tiny amount of data available, and the tiny amount of overall attention given to this skill, there's no better place for it.)
Taking note of whether your 'anomalies' are often in low-frequency skills, or lower-frequency job-ids, might enable a closer look at the data causes, or some disclaimering/filtering of most_similar() results. (The most_similar() method can limit its returned rankings to the more frequent range of the known vocabulary, for cases when the long-tail or rare words are, with their rougher vectors, intruding on higher-quality results from better-represented words. See the restrict_vocab parameter.)
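To show the effect of restricting the search to the more frequent vocabulary, here is a small NumPy sketch of the idea. It is not gensim's implementation: the function, the 2-d vectors and the frequency ordering are all made up, under the assumption that entries are sorted from most to least frequent so that keeping the first N rows drops the rare ones.

```python
import numpy as np

def most_similar(query, vectors, words, topn=3, restrict_vocab=None):
    """Cosine-similarity ranking; `words`/`vectors` are assumed to be
    sorted from most to least frequent, so the first `restrict_vocab`
    rows are the best-supported entries."""
    if restrict_vocab is not None:
        vectors, words = vectors[:restrict_vocab], words[:restrict_vocab]
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ (query / np.linalg.norm(query))
    order = np.argsort(-sims)[:topn]
    return [(words[i], float(sims[i])) for i in order]

# Made-up 2-d vectors; "aeration" stands in for a rare, noisy skill
# that happens to sit close to the query vector.
words = ["process optimization", "goal setting", "tableau", "aeration"]
vectors = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [1.0, 0.0]])
query = np.array([1.0, 0.02])

unrestricted = most_similar(query, vectors, words)
restricted = most_similar(query, vectors, words, restrict_vocab=3)
print(unrestricted)
print(restricted)
```

With the full vocabulary the rare entry ranks first; limited to the three most frequent entries, it can no longer intrude.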
That said, tinkering with training parameters may result in rankings that better reflect your intent. A larger min_count might remove more tokens that, lacking sufficient varied examples, mostly just inject noise into the rest of training. A different vector_size, smaller or larger, might better capture the relationships you're looking for. A more-aggressive (smaller) sample could discard more high-frequency words that might be starving more-interesting less-frequent words of a chance to influence the model.
Note that with dbow_words=1 & a large window, and records with (perhaps?) dozens of skills each, the words are having a much-more neighborly effect on each other, in the model, than the tag<->word correlations. That might be good or bad.

spaCy fails to properly parse medical text

Recently I have been experiencing some issues while splitting some medical text into sentences with spaCy. Maybe you can explain, why these issues arise?
If the word has a length of 1 and the sentence ends with a dot, the end of the sentence won't be recognized.
For example:
There was no between-treatment difference in preoperative or
postoperative hemodynamics or in release of troponin I. (NO SPLIT HERE) Preoperative
oral coenzyme Q(10) therapy in patients undergoing cardiac surgery
increases myocardial and cardiac mitochondrial coenzyme Q(10) levels,
improves mitochondrial efficiency, and increases myocardial tolerance
to in vitro hypoxia-reoxygenation stress.
Another issue is with the characters +/-, which are treated as the end of a sentence. For instance, one whole sentence is split into several sentences like below:
VO(2max) decreased significantly by 3.6 +/-
2.1, 14 +/-
2.5, and 27.4 +/-
3.6% in TW, and by 5 +/-
4, 9.4 +/-
6.4, and 18.7 +/-
7% in SW at 1000, 2500, and 4500 m, respectively.
All of the above should be one single sentence!
Sometimes the sentence is interrupted between a word and a special character (special and special character, number and a word with a length less than 3).
The survival rates for patients receiving left ventricular assist
devices (n = 68) versus patients receiving optimal medical management
(n = 61) were 52% versus 28% at 1 year and 29% versus 13% at 2 years SPLITS HERE
( P = .008, log-rank test).
Thank you very much!
SpaCy's English models are trained on web data - mostly stuff like blog posts. Obviously the average blog post looks nothing like the medical literature you're working on, so spaCy is wildly confused. This problem isn't specific to spaCy, it will also happen with any system designed to work on "typical" English that doesn't include medical papers and uses statistical modelling.
Medical text is pretty notorious for having problems with NLP techniques that work in other circumstances, so you may want to look around for something specifically tailored for that. Alternately you can try making a small training set based on your data and making a new spaCy model.
That said, the +/- issue does look strange, and might be based on a tokenization issue or something rather than a model issue - I would recommend you file a bug report here.

Scikit-Learn - No True Positives - Best Way to Normalize Data

Thanks for taking the time to read my question!
So I am running an experiment to see if I can predict whether an individual has been diagnosed with depression (or at least says they have been) based on the words (or tokens) they use in their tweets. I found 139 users that at some point tweeted "I have been diagnosed with depression" or some variant of this phrase in an earnest context (i.e. not joking or sarcastic; human beings who were native speakers of the language of the tweet judged whether each tweet was genuine or not).
I then collected the entire public timeline of tweets of all of these users' tweets, giving me a "depressed user tweet corpus" of about 17000 tweets.
Next I created a database of about 4000 random "control" users, and with their timelines created a "control tweet corpus" of about 800,000 tweets.
Then I combined them both into a big dataframe, which looks like this:
,class,tweet
0,depressed,tweet text .. *
1,depressed,tweet text.
2,depressed,# tweet text
3,depressed,저 tweet text
4,depressed,# tweet text😚
5,depressed,# tweet text😍
6,depressed,# tweet text ?
7,depressed,# tweet text ?
8,depressed,tweet text *
9,depressed,# tweet text ?
10,depressed,# tweet text
11,depressed,tweet text *
12,depressed,#tweet text
13,depressed,
14,depressed,tweet text !
15,depressed,tweet text
16,depressed,tweet text. .
17,depressed,tweet text
...
150595,control,#tweet text?
150596,control,"# tweet text."
150597,control,# tweet text.
150598,control,"# tweet text. *"
150599,control,"#tweet text?"t
150600,control,"# tweet text?"
150601,control,# tweet text?
150602,control,# tweet text.
150603,control,#tweet text~
150604,control,# tweet text.
Then I trained a multinomial naive bayes classifier using an object from the CountVectorizer() class imported from the sklearn library:
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(tweet_corpus['tweet'].values)
classifier = MultinomialNB()
targets = tweet_corpus['class'].values
classifier.fit(counts, targets)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior= True)
Unfortunately, after running a 6-fold cross validation test, the results suck and I am trying to figure out why.
Total tweets classified: 613952
Score: 0.0
Confusion matrix:
[[596070 743]
[ 17139 0]]
So, I didn't properly predict a single depressed person's tweet! My initial thought is that I have not properly normalized the counts of the control group, and therefore even tokens which appear more frequently among the depressed user corpus are over represented in the control tweet corpus due to its much larger size. I was under the impression that .fit() did this already, so maybe I am on the wrong track here? If not, any suggestions on the most efficient way to normalize the data between two groups of disparate size?
You should use a re-sampling technique to deal with unbalanced classes. There are many ways to do that "by hand" in Python, but I recommend imbalanced-learn, which collects re-sampling techniques commonly used on datasets showing strong between-class imbalance.
If you are using Anaconda, you can use:
conda install -c conda-forge imbalanced-learn
or simply:
pip install -U imbalanced-learn
This library is compatible with scikit-learn. Your dataset looks very interesting, is it public? Hope this helps.
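For the "by hand" route, a minimal sketch of random under-sampling with NumPy is shown below. The undersample helper and the 95/5 label split are made up for illustration; this is one simple re-sampling strategy, not the specific method imbalanced-learn uses.

```python
import numpy as np

def undersample(labels, seed=0):
    """Return row indices that keep every class at the minority-class size."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    n_keep = counts.min()
    keep = [
        rng.choice(np.flatnonzero(labels == c), size=n_keep, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(keep))

# Made-up labels with the same kind of skew as in the question.
labels = np.array(["control"] * 95 + ["depressed"] * 5)
idx = undersample(labels)
balanced = labels[idx]
print(np.unique(balanced, return_counts=True))
```

The same indices would be used to select rows from both the count matrix and the targets, and the re-sampling should be applied to the training folds only, so that evaluation still reflects the real class balance.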

PredictionIO for Content Recommendation e.g. Tweets

I recently installed PredictionIO.
What I'd like to achieve is: I'd like to categorize content on the words included in the text. But how can I import data like raw Tweets to PredictionIO? Is it possible to let PredictionIO run over the content and find strong words and to sort them in categories?
What I'd like to get is something like this: Query for Boston Red Sox --> keywords that should appear would be: baseball, Boston, sports, ...
So I'll add on a little to what Thomas said. He's right, it all depends whether or not you have labels associated to your tweets. If your data is labeled then this will be a Text Classification problem. Look at this for more detailed info:
If you're instead looking to cluster, or group, a set of unlabeled observations then, as Thomas said, your best bet is to incorporate LDA into the works. Look at the latter documentation for more information, but basically once you run the LDA model you'll obtain an object of type DistributedLDAModel which has a method topicDistributions which gives you, for each tweet, a vector where each component is associated to a topic, and the component entry gives you the probability that the tweet belongs to that topic. You can cluster by assigning each tweet the topic with highest probability.
You also have access to a matrix of size MxN, where M is the number of words in your vocabulary, and N is the number of topics, or clusters, you wish to discover in your data. You can roughly interpret the ij th entry of this Topics Matrix as the probability that the word i appears in a document given that the document belongs to topic j. Another rule you could use for clustering is to treat each word vector associated to your tweets as a vector of counts. Then, you can interpret the ij entry of the product of your word matrix (tweets as rows, words as columns) and the Topics Matrix returned by LDA as the probability that tweet i belongs to topic j (this follows under certain assumptions, feel free to ask if you want more details). Again now you assign tweet i to the topic associated to the largest numerical value in row i of the resulting matrix. You can even use this clustering rule for assigning topics to incoming observations once you have used your original set of tweets for topic discovery!
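The matrix-product rule described above can be sketched in a few lines of NumPy. All numbers here are made up: three tweets, a vocabulary of M=4 words and N=2 topics, with the Topics Matrix laid out as M x N as in the description.

```python
import numpy as np

# Made-up example: 3 tweets, vocabulary of M=4 words, N=2 topics.
vocab = ["baseball", "sox", "election", "vote"]
word_counts = np.array([
    [2, 1, 0, 0],   # tweet 0
    [0, 0, 1, 3],   # tweet 1
    [1, 1, 0, 1],   # tweet 2
])
# Topics Matrix (M x N): column j holds P(word | topic j).
topics_matrix = np.array([
    [0.50, 0.05],
    [0.40, 0.05],
    [0.05, 0.40],
    [0.05, 0.50],
])

scores = word_counts @ topics_matrix      # shape (tweets, topics)
assigned = scores.argmax(axis=1)          # highest-scoring topic per tweet
print(scores)
print(assigned)
```

The raw scores are not normalised; dividing each row by its sum would make them comparable across tweets of different lengths, but the argmax assignment is the same either way.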
Now, for data processing, you can still use the Text Classification reference for transforming your Tweets to word count vectors via the DataSource and Preparator components. As for importing your data, if you have the tweets saved locally on a file, you can use PredictionIO's Python SDK to import your data. An example is also given in the classification reference.
Feel free to ask any questions if anything isn't clear, and good luck!
So, it really depends on whether you have labelled data.
For example:
Baseball :: "I love Boston Red Sox #GoRedSox"
Sports :: "Woohoo! I love sports #winning"
Boston :: "Baseball time at Fenway Park. Red Sox FTW!"
...
Then you would be able to train a model to classify Tweets against these keywords. You might be interested in the templates for MLlib Naive Bayes and Decision Trees.
If you don't have labelled data (really, who wants to manually label Tweets) you might be able to use approaches such as Topic Modeling (e.g., LDA).
I don't think there is a template for LDA but being an active open source project it wouldn't surprise me if someone has already implemented this so might be a good idea to ask on PredictionIO user or dev forums.

Resources