Could anyone please suggest a good pretrained NLP model to split a long question into multiple short questions in the medical/healthcare domain?
Example 1 -
Long Question -
Does the patient have a diagnosis of HIV / AIDS?
Answer -
1. Does the patient have a diagnosis of HIV?
2. Does the patient have a diagnosis of AIDS?
Example 2 -
Long Question -
Does the patient have CLL/SLL with del(17p)/TP53 mutation?
Answer-
1. Does the patient have CLL with del(17p)/TP53 mutation?
2. Does the patient have SLL with del(17p)/TP53 mutation?
Example 3 -
Long Question -
Does the patient have HCV/HIV or HBV/HCV co-infections?
Answer-
1. Does the patient have HCV co-infections?
2. Does the patient have HIV co-infections?
3. Does the patient have HBV co-infections?
4. Does the patient have HCV co-infections?
I tried using sentence transformers and paraphrasing, but it did not work for me.
I am in the process of creating a custom dataset to benchmark the accuracy of the 'bert-large-uncased-whole-word-masking-finetuned-squad' model for my domain, to understand if I need to fine-tune further, etc.
When looking at the different Question Answering datasets on the Hugging Face site (squad, adversarial_qa, etc.), I see that the answer is commonly formatted as a dictionary with the keys text (the answer string) and answer_start (the character index where the answer starts).
I'm trying to understand:
The intuition behind how the model uses the answer_start when calculating the loss, accuracy, etc.
If I need to go through the process of adding this to my custom dataset (easier to run model evaluation code, etc?)
If so, is there a programmatic way to do this to avoid manual effort?
Any help or direction would be greatly appreciated!
Code example to show format:
import datasets
ds = datasets.load_dataset('squad')
train = ds['train']
print('Example: \n')
print(train['answers'][0])  # a dict with parallel 'text' and 'answer_start' lists
Your question is a bit broad to give you a specific answer, but I will try my best to point you in some directions.
The intuition behind how the model uses the answer_start when calculating the loss, accuracy, etc.
There are different types of QA tasks/datasets. The ones you mentioned (SQuAD and adversarial_qa) belong to the field of extractive question answering. There, a model must select a span from a given context that answers the given question. For example:
context = 'Second, Democrats have always elevated their minority floor leader to the speakership upon reclaiming majority status. Republicans have not always followed this leadership succession pattern. In 1919, for instance, Republicans bypassed James R. Mann, R-IL, who had been minority leader for eight years, and elected Frederick Gillett, R-MA, to be Speaker. Mann "had angered many Republicans by objecting to their private bills on the floor;" also he was a protégé of autocratic Speaker Joseph Cannon, R-IL (1903–1911), and many Members "suspected that he would try to re-centralize power in his hands if elected Speaker." More recently, although Robert H. Michel was the Minority Leader in 1994 when the Republicans regained control of the House in the 1994 midterm elections, he had already announced his retirement and had little or no involvement in the campaign, including the Contract with America which was unveiled six weeks before voting day.'
question='How did Republicans feel about Mann in 1919?'
answer='angered' #-> starting at character 365
A simple approach that is often used today is a linear layer that predicts the answer start and answer end from the last hidden state of a transformer encoder (code example). The last hidden state holds one vector for each input token (tokens != words), and the linear layer is trained to assign high probabilities to tokens that could potentially be the start and end of the answer span. To train a model with your data, the loss function needs to know which tokens should get a high probability (i.e. the answer's start and end tokens), which is where answer_start, together with the answer text, comes in.
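To make that concrete, here is a minimal sketch of such a span-prediction head (written in PyTorch; the shapes, token indices and the averaging of the two losses are illustrative assumptions, not the exact implementation of any particular Hugging Face model):
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, seq_len, hidden_size = 2, 384, 768                      # assumed shapes
last_hidden_state = torch.randn(batch, seq_len, hidden_size)   # stand-in for the encoder output

qa_head = nn.Linear(hidden_size, 2)                            # per-token start and end scores
start_logits, end_logits = qa_head(last_hidden_state).split(1, dim=-1)
start_logits, end_logits = start_logits.squeeze(-1), end_logits.squeeze(-1)  # (batch, seq_len)

# answer_start (a character index) is first mapped to a token index with the
# tokenizer's offset mapping; that token index and the answer's end token are
# the targets of two cross-entropy losses:
start_positions = torch.tensor([17, 42])                       # hypothetical token indices
end_positions = torch.tensor([19, 45])
loss = (F.cross_entropy(start_logits, start_positions) +
        F.cross_entropy(end_logits, end_positions)) / 2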
If I need to go through the process of adding this to my custom dataset (easier to run model evaluation code, etc?)
You should go through this process; otherwise, how would anyone know where the answer starts in your context? They could of course infer it programmatically, but what if your answer string appears twice in the context? Providing an answer start position avoids that ambiguity and allows your users to use your dataset right away with one of the many extractive question answering scripts that are already available out there.
If so, is there a programmatic way to do this to avoid manual effort?
You could simply loop through your dataset and use str.find:
context.find(answer)
Output:
365
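For example, a minimal sketch (the field names context and answer are assumptions about your custom dataset; adapt them to your column names):
def add_answer_start(example):
    # SQuAD-style answers field: parallel lists of answer texts and character offsets
    start = example['context'].find(example['answer'])
    example['answers'] = {'text': [example['answer']], 'answer_start': [start]}
    # flag rows that need a manual look: answer not found verbatim, or found more than once
    example['needs_review'] = (start == -1) or (example['context'].find(example['answer'], start + 1) != -1)
    return example

# with a datasets.Dataset this can be applied in one pass:
# ds = ds.map(add_answer_start)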
I have some texts and I'm using sklearn LatentDirichletAllocation algorithm to extract the topics from the texts.
I already have the texts converted into sequences using Keras and I'm doing this:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation()
X_topics = lda.fit_transform(X)
X:
print(X)
# array([[0, 988, 233, 21, 42, 5436, ...],
#        [0, 43, 6526, 21, 566, 762, 12, ...]])
X_topics:
print(X_topics)
# array([[1.24143852e-05, 1.23983890e-05, 1.24238815e-05, 2.08399432e-01,
#         7.91563331e-01],
#        [5.64976371e-01, 1.33304549e-05, 5.60003133e-03, 1.06638803e-01,
#         3.22771464e-01]])
My question is: what exactly is being returned from fit_transform? I know it should be the main topics detected in the texts, but I cannot map those numbers to an index, so I'm not able to see what those sequences mean. I failed at finding an explanation of what is actually happening, so any suggestion will be much appreciated.
First, a general explanation: think of LDiA as a clustering algorithm that's going to determine, by default, 10 centroids based on the frequencies of words in the texts, and it's going to put greater weight on some of those words than others by virtue of proximity to the centroid. Each centroid represents a 'topic' in this context, where the topic is unnamed but can be loosely described by the words that are most dominant in forming each cluster.
So generally what you're doing with LDA is either:
1. getting it to tell you what the 10 (or whatever) topics are of a given text, or
2. getting it to tell you which centroid/topic some new text is closest to.
For the second scenario, your expectation is that LDiA will output the "score" of the new text for each of the 10 clusters/topics. The index of the highest score is the index of the cluster/topic to which that new text belongs.
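Concretely, for the array you printed, you can map each row back to its dominant topic index like this (a minimal sketch, assuming X_topics is exactly the output of your fit_transform call):
import numpy as np

# X_topics has shape (n_documents, n_topics): each row is one document's
# distribution over the topics and sums to (roughly) 1.
dominant_topic = np.argmax(X_topics, axis=1)
print(dominant_topic)   # e.g. array([4, 0]) for the two rows in your printout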
I prefer gensim.models.LdaMulticore, but since you've used the sklearn.decomposition.LatentDirichletAllocation I'll use that.
Here's some sample code (drawn from here) that runs through this process
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
import random
n_samples = 2000
n_features = 1000
n_components = 10
n_top_words = 20
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

data, _ = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'),
                             return_X_y=True)
X = data[:n_samples]

# create a count vectorizer using the sklearn CountVectorizer, which has some useful features
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
vectorizedX = tf_vectorizer.fit_transform(X)

lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(vectorizedX)
Now let's try a new text:
testX = tf_vectorizer.transform(["I am educated about learned stuff"])
#get lda to score this text against each of the 10 topics
lda.transform(testX)
Out:
array([[0.54995409, 0.05001176, 0.05000163, 0.05000579, 0.05      ,
        0.05001033, 0.05000001, 0.05001449, 0.05000123, 0.05000066]])
# looks like the first topic has the highest score - now what are the words that are most associated with each topic?
print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()  # use get_feature_names_out() on scikit-learn >= 1.0
print_top_words(lda, tf_feature_names, n_top_words)
Out:
Topics in LDA model:
Topic #0: edu com mail send graphics ftp pub available contact university list faq ca information cs 1993 program sun uk mit
Topic #1: don like just know think ve way use right good going make sure ll point got need really time doesn
Topic #2: christian think atheism faith pittsburgh new bible radio games alt lot just religion like book read play time subject believe
Topic #3: drive disk windows thanks use card drives hard version pc software file using scsi help does new dos controller 16
Topic #4: hiv health aids disease april medical care research 1993 light information study national service test led 10 page new drug
Topic #5: god people does just good don jesus say israel way life know true fact time law want believe make think
Topic #6: 55 10 11 18 15 team game 19 period play 23 12 13 flyers 20 25 22 17 24 16
Topic #7: car year just cars new engine like bike good oil insurance better tires 000 thing speed model brake driving performance
Topic #8: people said did just didn know time like went think children came come don took years say dead told started
Topic #9: key space law government public use encryption earth section security moon probe enforcement keys states lunar military crime surface technology
Seems sensible - the sample text is about education and the word cloud for the first topic is about education.
The pictures below are from another dataset (ham vs. spam SMS messages, so only two possible topics) which I reduced to 3 dimensions with PCA, but in case a picture helps, these two (the same data from different angles) might give a general sense of what's going on with LDiA. (The graphs are from Linear Discriminant Analysis vs. LDiA, but the representation is still relevant.)
While LDiA is an unsupervised method, to actually use it in a business context you'll likely want to at least manually intervene to give the topics names that are meaningful to your context, e.g. assigning a subject area to stories on a news aggregation site, choosing amongst ['Business', 'Sports', 'Entertainment', etc.].
For further study, perhaps run through something like this:
https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24
I have a question regarding time-dependent variables in survival analysis. Do you usually count age as a time-varying variable?
I am looking at a population of cancer patients who received a certain treatment and were initially cured. They were then followed up for a certain period of time. The total follow-up time is up to 15 years, so it is relatively long, but some patients of course have a much shorter follow-up time. The event of interest is whether they developed a recurrence of their cancer or not.
So in SAS, here is how I am doing this:
proc phreg data=have ;
Title 'Cox for cancer recurrence';
class sex tumor_differentiation;
model Time*Recurrence(0)= age sex tumor_size tumor_differentiation/rl;
run;
And in R:
surv_object <- Surv(time = df$Time, event = df$Recurrence)
fit.coxph <- coxph(surv_object ~ Age + Sex + TumorSize + TumorDifferentiation,
                   data = df)
The question here is: would you put in age as a time-dependent covariate, or would you just put in age at baseline in your model?
Thank you for your insights; I appreciate your help.
In cancer research, age at diagnosis has clinical importance, so in the usual setting we use age at diagnosis as the covariate, and it is fixed (entered once at baseline rather than as a time-varying variable)!
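If it helps to sanity-check this in Python as well, here is a minimal sketch with the lifelines package (column names copied from your R call; the dummy-coding of the categorical variables is an assumption):
import pandas as pd
from lifelines import CoxPHFitter

# df: one row per patient; Age is age at diagnosis and enters as a fixed baseline covariate
model_df = pd.get_dummies(
    df[['Time', 'Recurrence', 'Age', 'Sex', 'TumorSize', 'TumorDifferentiation']],
    columns=['Sex', 'TumorDifferentiation'], drop_first=True)

cph = CoxPHFitter()
cph.fit(model_df, duration_col='Time', event_col='Recurrence')
cph.print_summary()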
We have a task of identifying which company a news article is about. The input is a (business) news article; the goal is the company name.
Could you recommend a solution, please?
At the moment, we start by finding the N most-mentioned company names in the article (using a Named Entity Recognition algorithm). When N >= 2, the NER results can give us >75% accuracy, but when N = 1, this only gives us about 50% accuracy.
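For reference, the current baseline is roughly the following (a minimal sketch, assuming spaCy's en_core_web_sm model; the actual NER library is interchangeable):
from collections import Counter
import spacy

nlp = spacy.load('en_core_web_sm')

def top_companies(article_text, n=3):
    # count how often each ORG entity is mentioned and return the n most frequent
    doc = nlp(article_text)
    orgs = [ent.text.strip() for ent in doc.ents if ent.label_ == 'ORG']
    return Counter(orgs).most_common(n)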
I'm using a lexicon-based approach to text analysis. Basically I have a long list of words marked with whether they are positive/negative/angry/sad/happy etc. I match the words in the text I want to analyze to the words in the lexicon in order to help me determine if my text is positive/negative/angry/sad/happy etc.
But the length of the texts I want to analyze vary. Most of them are under 100 words, but consider the following example:
John is happy. (1 word in the category 'happy' giving a score of 33% for happy)
John told Mary yesterday that he was happy. (12.5% happy)
So comparing across different sentences, it seems that my first sentence is more 'happy' than my second sentence, simply because the sentence is shorter, and gives a disproportionate % to the word 'happy'.
Is there an algorithm or way of calculation you can think of that would allow me to make a fairer comparison, perhaps by taking into account the length of the sentence?
As many pointed out, you have to go down to the syntactic-tree level, with something similar to this work.
Also, consider this:
John told Mary yesterday that he was happy.
John told Mary yesterday that she was happy.
The second one says nothing about John's happiness, but a naive algorithm would quickly be confused. So in addition to syntactic parsing, pronouns have to be linked to their subjects; in particular, that means the algorithm should know that 'he' refers to John and 'she' refers to Mary.
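To illustrate, here is a minimal sketch with spaCy's dependency parse (the dependency labels are those typically produced by spaCy's English models; resolving 'he' back to 'John' would still require a coreference step on top of this):
import spacy

nlp = spacy.load('en_core_web_sm')

def sentiment_subjects(text, sentiment_words=('happy',)):
    # for each sentiment word, report the grammatical subject of its clause
    doc = nlp(text)
    results = []
    for token in doc:
        if token.lower_ in sentiment_words:
            # in a copular clause ("he was happy") the adjective attaches to the
            # verb as 'acomp' and the subject is an 'nsubj' child of that verb
            head = token.head if token.dep_ == 'acomp' else token
            subjects = [child.text for child in head.children
                        if child.dep_ in ('nsubj', 'nsubjpass')]
            results.append((token.text, subjects))
    return results

print(sentiment_subjects("John told Mary yesterday that he was happy."))
# expected along the lines of: [('happy', ['he'])]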
Ignoring the issue of negation raised by HappyTimeGopher, you can simply divide the number of happy words in the sentence by the length of the sentence. You get:
John is happy. (1 word in the category 'happy' / 3 words in sentence = score of 33% for happy)
John told Mary yesterday that he was happy. (1/8 = 12.5% happy)
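In code, that normalization might look like this (a toy sketch; the lexicon and the whitespace tokenization are placeholders for whatever you actually use):
def category_scores(text, lexicon):
    # fraction of tokens that fall into each lexicon category
    tokens = [t.strip('.,!?').lower() for t in text.split()]
    return {category: sum(t in words for t in tokens) / len(tokens)
            for category, words in lexicon.items()}

lexicon = {'happy': {'happy', 'glad', 'joyful'}}
print(category_scores('John is happy.', lexicon))                               # {'happy': 0.33...}
print(category_scores('John told Mary yesterday that he was happy.', lexicon))  # {'happy': 0.125}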
Keep in mind that word-list-based approaches will only go so far. What should be the score for "I was happy with the food, but the waiter was horrible"? Consider using a more sophisticated system; the papers below are a good place to start your research:
Choi, Y., & Cardie, C. (2008). Learning with compositional semantics as structural inference for subsentential sentiment analysis.
Moilanen, K., & Pulman, S. (2009). Multi-entity sentiment scoring.
Pang, B., & Lee, L. (2008). Opinion Mining and Sentiment Analysis.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up?: sentiment classification using machine learning techniques.
Turney, P. D., & Littman, M. L. (2003). Measuring praise and criticism: Inference of semantic orientation from association.