New to NLP: help needed with using spaCy to get POS - nlp

I have the list below and I want to get the corresponding POS tag for each token. A sample of the desired output is given below.
processed_lst = [['The', 'wild', 'is', 'dangerous'], ['The', 'rockstar', 'is', 'wild']]
I want to use the spaCy library and get output like:
final_lst = [[('The', 'DET'), ('wild', 'NOUN'), ('is', 'AUX'), ('dangerous', 'ADJ')], [('The', 'DET'), ('rockstar', 'NOUN'), ('is', 'AUX'), ('wild', 'ADJ')]]

You can do this with the .pos_ attribute of a token after you turn the text into a spaCy Doc. The code below is pulled from this post on Part of Speech Tagging.
import spacy
nlp = spacy.load("en_core_web_sm")
text = """This is where the calculation can get tricky. Here’s the thing about solar energy. Solar energy comes from the sun. That means solar panels cannot produce energy 24 hours a day. They only produce energy during sunlight hours. That energy then has to be stored somewhere while it is not being used. Energy storage is a whole other topic in and of itself. Let me get back to the point, there’s only an average of 4 peak sunlight hours a day. A solar panel may get more than that, but let’s take a conservative estimate of our solar power generation and confine it to those 4 hours only.
Back to the calculations. At 4 acres of solar panels to generate a megawatt-hour and 4 hours of power generation time a day, a 1 MW solar farm would generate 4 MWh of power over 4 acres every day. At 110,000 megawatt-hours of power needed a day to power America, we would need about 110,000 acres of solar farm. 110,000 acres? That sounds huge, that’s more land than the entire Mojave desert. It’s not as daunting as it sounds, there are 1.9 billion acres in the continental United States, and 110,000 acres is only slightly more than 0.5 percent of the total land of the continental US."""
doc = nlp(text)
for token in doc:
    print(token.text, token.pos_, token.tag_)
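Applied to the list in the question, a minimal sketch could look like the following. It assumes en_core_web_sm is installed and that re-joining the tokens with spaces lets spaCy re-tokenize them the same way, which holds for simple sentences like these:
import spacy

nlp = spacy.load("en_core_web_sm")

processed_lst = [['The', 'wild', 'is', 'dangerous'], ['The', 'rockstar', 'is', 'wild']]

final_lst = []
for tokens in processed_lst:
    # Re-join the pre-tokenized words; spaCy's tokenizer should split these
    # simple sentences the same way again.
    doc = nlp(" ".join(tokens))
    final_lst.append([(token.text, token.pos_) for token in doc])

print(final_lst)
# [[('The', 'DET'), ('wild', 'NOUN'), ('is', 'AUX'), ('dangerous', 'ADJ')], ...]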

Related

Extend length of OpenAI API TLDR output

I'd like to produce a 3-6 sentence summary from a 2-3 page article, using OpenAI's TLDR. I've pasted the article text but the output seems to stay between 1 and 2 sentences only.
Options to get a 3-sentence summary for a given prompt
There are multiple ways to tell the OpenAI API that you want a 3-sentence summary:
Option 1: Write TL;DR in 3 sentences (7 tokens)
Option 2: TL;DR 3 sentences (5 tokens)
Option 3: Write summary in 3 sentences (5 tokens)
Option 4: Summary 3 sentences (3 tokens)
Note: I used the Tokenizer to calculate the number of tokens.
All of the above will return a 3-sentence summary.
Choose Option 4
Why? Because you pay for every token you use. Therefore, Option 4 will cost the least.
What's more, if you write just TL;DR you'll burn 3 tokens, which is the same as writing Summary 3 sentences. In other words, it costs you the same, but you actually get a 3-sentence summary if you write Summary 3 sentences.
Also, don't forget to set the max_tokens parameter high enough so the OpenAI API can return 3 sentences.
An example
Let's take an example from the official OpenAI website.
Prompt:
A neutron star is the collapsed core of a massive supergiant star, which
had a total mass of between 10 and 25 solar masses, possibly more if the
star was especially metal-rich.[1] Neutron stars are the smallest and
densest stellar objects, excluding black holes and hypothetical white
holes, quark stars, and strange stars.[2] Neutron stars have a radius on
the order of 10 kilometres (6.2 mi) and a mass of about 1.4 solar masses.
[3] They result from the supernova explosion of a massive star, combined
with gravitational collapse, that compresses the core past white dwarf star
density to that of atomic nuclei.
Summary 3 sentences
Completion:
Neutron stars are the smallest and densest stellar objects, with a
radius of about 10 kilometres and a mass of 1.4 solar masses. They are
formed from the supernova explosion of a massive star, which had a
total mass between 10-25 solar masses, combined with gravitational
collapse to compress the core beyond white dwarf density. As such,
neutron stars represent the collapsed cores of massive supergiant
stars.
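As a rough illustration, a call through the legacy Completions endpoint of the openai Python package (pre-1.0 interface) might look like the sketch below; the model name, temperature, and max_tokens value are assumptions, not part of the original answer:
import openai

openai.api_key = "YOUR_API_KEY"

article_text = "A neutron star is the collapsed core of a massive supergiant star..."

response = openai.Completion.create(
    model="text-davinci-003",                      # assumed model
    prompt=article_text + "\n\nSummary 3 sentences",
    max_tokens=150,                                # leave enough room for 3 sentences
    temperature=0.7,
)
print(response["choices"][0]["text"].strip())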

Gensim doc2vec's d2v.wv.most_similar() gives not relevant words with high similarity scores

I've got a dataset of job listings with about 150,000 records. I extracted skills from the descriptions using NER with a dictionary of 30,000 skills. Every skill is represented as a unique identifier.
My data example:
   job_title         job_id  skills
1  business manager  4       12 13 873 4811 482 2384 48 293 48
2  java developer    55      48 2838 291 37 484 192 92 485 17 23 299 23...
3  data scientist    21      383 48 587 475 2394 5716 293 585 1923 494 3
Then, I train a doc2vec model using these data where job titles (their ids to be precise) are used as tags and skills vectors as word vectors.
import gensim

def tagged_document(df):
    for index, row in df.iterrows():
        yield gensim.models.doc2vec.TaggedDocument(row['skills'].split(), [str(row['job_id'])])

data_for_training = list(tagged_document(data[['job_id', 'skills']]))

model_d2v = gensim.models.doc2vec.Doc2Vec(dm=0, dbow_words=1, vector_size=80, min_count=3, epochs=100, window=100000)
model_d2v.build_vocab(data_for_training)
model_d2v.train(data_for_training, total_examples=model_d2v.corpus_count, epochs=model_d2v.epochs)
It works mostly okay, but I have issues with some job titles. I tried to collect more data for them, but I still get unpredictable behavior.
For example, I have a job title "Director Of Commercial Operations" which is represented by 41 data records having from 11 to 96 skills (mean 32). When I get the most similar words for it (skills, in my case) I get the following:
docvec = model_d2v.docvecs[id_]
model_d2v.wv.most_similar(positive=[docvec], topn=5)
capacity utilization 0.5729076266288757
process optimization 0.5405482649803162
goal setting 0.5288119316101074
aeration 0.5124399662017822
supplier relationship management 0.5117508172988892
These are the top 5 skills, and 3 of them look relevant. However, the top one doesn't look very valid, and neither does "aeration". The problem is that none of the records for this job title contain these skills at all. It seems like noise in the output, but why do they get some of the highest similarity scores (although not high in absolute terms)?
Does it mean that the model can't pick out very specific skills for this kind of job title?
Can the number of "noisy" skills be reduced? Sometimes I see much more relevant skills with a lower similarity score, but it's often below 0.5.
One more example of correct behavior with a similar amount of data:
BI Analyst, 29 records, number of skills from 4 to 48 (mean 21). The top skills look alright:
business intelligence 0.6986587047576904
business intelligence development 0.6861011981964111
power bi 0.6589289903640747
tableau 0.6500121355056763
qlikview (data analytics software) 0.6307920217514038
business intelligence tools 0.6143202781677246
dimensional modeling 0.6032138466835022
exploratory data analysis 0.6005223989486694
marketing analytics 0.5737696886062622
data mining 0.5734485387802124
data quality 0.5729933977127075
data visualization 0.5691111087799072
microstrategy 0.5566076636314392
business analytics 0.5535123348236084
etl 0.5516749620437622
data modeling 0.5512707233428955
data profiling 0.5495884418487549
If your gold standard of what the model should report is skills that appeared in the training data, are you sure you don't want a simple count-based solution? For example, just provide a ranked list of the skills that appear most often in Director Of Commercial Operations listings.
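A hypothetical count-based baseline, using the column names from the data example above, might be as simple as:
from collections import Counter

# Hypothetical baseline: rank skills by how often they appear in the
# listings for one job title; 'data' is the DataFrame from the question.
subset = data[data['job_title'] == 'Director Of Commercial Operations']
skill_counts = Counter(skill for row in subset['skills'] for skill in row.split())
print(skill_counts.most_common(10))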
On the other hand, the essence of compressing N job titles, and 30,000 skills, into a smaller (in this case vector_size=80) coordinate-space model is to force some non-intuitive (but perhaps real) relationships to be reflected in the model.
Might there be some real pattern in the model, even if, perhaps, just some idiosyncrasies in the appearance of less-common skills, that makes aeration necessarily slot near those other skills? (Maybe it's a rare skill whose few contextual appearances co-occur with skills very near 'capacity utilization', meaning that with the tiny amount of data available, and the tiny amount of overall attention given to this skill, there's no better place for it.)
Taking note of whether your 'anomalies' are often in low-frequency skills, or lower-frequency job-ids, might enable a closer look at the data causes, or some disclaimering/filtering of most_similar() results. (The most_similar() method can limit its returned rankings to the more frequent range of the known vocabulary, for cases when long-tail or rare words, with their rougher vectors, are intruding into the higher-quality results from better-represented words. See the restrict_vocab parameter.)
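For example, a sketch against the question's own code; the cutoff of 10,000 is an arbitrary assumption:
# Only rank against the 10,000 most frequent skills, so rare skills with
# rougher vectors can't intrude in the results.
docvec = model_d2v.docvecs[id_]
model_d2v.wv.most_similar(positive=[docvec], topn=5, restrict_vocab=10000)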
That said, tinkering with training parameters may result in rankings that better reflect your intent. A larger min_count might remove more tokens that, lacking sufficient varied examples, mostly just inject noise into the rest of training. A different vector_size, smaller or larger, might better capture the relationships you're looking for. A more-aggressive (smaller) sample could discard more high-frequency words that might be starving more-interesting less-frequent words of a chance to influence the model.
Note that with dbow_words=1 and a large window, and records with (perhaps?) dozens of skills each, the words are having a much stronger neighborly effect on each other in the model than the tag<->word correlations. That might be good or bad.
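As a concrete illustration only, one alternative configuration to experiment with might be the following; these parameter values are assumptions to try, not recommendations:
# Sketch: plain DBOW without word-vector training, a larger min_count,
# and more aggressive downsampling of very frequent skills.
model_alt = gensim.models.doc2vec.Doc2Vec(
    dm=0,              # DBOW
    dbow_words=0,      # skip skip-gram word training; tags drive the vectors
    vector_size=80,
    min_count=10,      # drop rarely-seen skills
    sample=1e-5,       # downsample very frequent skills more aggressively
    epochs=100,
)
model_alt.build_vocab(data_for_training)
model_alt.train(data_for_training, total_examples=model_alt.corpus_count, epochs=model_alt.epochs)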

How Sklearn Latent Dirichlet Allocation really Works?

I have some texts and I'm using sklearn LatentDirichletAllocation algorithm to extract the topics from the texts.
I already have the texts converted into sequences using Keras and I'm doing this:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation()
X_topics = lda.fit_transform(X)
X:
print(X)
# array([[0, 988, 233, 21, 42, 5436, ...],
#        [0, 43, 6526, 21, 566, 762, 12, ...]])
X_topics:
print(X_topics)
# array([[1.24143852e-05, 1.23983890e-05, 1.24238815e-05, 2.08399432e-01,
#         7.91563331e-01],
#        [5.64976371e-01, 1.33304549e-05, 5.60003133e-03, 1.06638803e-01,
#         3.22771464e-01]])
My question is: what exactly is being returned from fit_transform? I know it should be the main topics detected in the texts, but I cannot map those numbers to an index, so I'm not able to see what those sequences mean. I failed to find an explanation of what is actually happening, so any suggestion will be much appreciated.
First, a general explanation: think of LDiA as a clustering algorithm that's going to determine, by default, 10 centroids based on the frequencies of words in the texts, and it's going to put greater weight on some of those words than others by virtue of proximity to the centroid. Each centroid represents a 'topic' in this context, where the topic is unnamed but can be loosely described by the words that are most dominant in forming each cluster.
So generally what you're doing with LDiA is either:
getting it to tell you what the 10 (or however many) topics of a given text are, or
getting it to tell you which centroid/topic some new text is closest to.
For the second scenario, your expectation is that LDiA will output the "score" of the new text for each of the 10 clusters/topics. The index of the highest score is the index of the cluster/topic to which that new text belongs.
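Concretely, for the X_topics array in the question, you could map each row to its dominant topic with argmax. A small sketch, using the values printed above (rounded):
import numpy as np

# Each row of fit_transform's output is one document's distribution over
# the topics; argmax gives the index of the dominant topic per document.
X_topics = np.array([[1.24e-05, 1.24e-05, 1.24e-05, 2.08e-01, 7.92e-01],
                     [5.65e-01, 1.33e-05, 5.60e-03, 1.07e-01, 3.23e-01]])
print(X_topics.argmax(axis=1))   # -> [4 0]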
I prefer gensim.models.LdaMulticore, but since you've used the sklearn.decomposition.LatentDirichletAllocation I'll use that.
Here's some sample code (drawn from here) that runs through this process:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
import random

n_samples = 2000
n_features = 1000
n_components = 10
n_top_words = 20

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

data, _ = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'),
                             return_X_y=True)
X = data[:n_samples]

# create a count vectorizer using the sklearn CountVectorizer which has some useful features
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
vectorizedX = tf_vectorizer.fit_transform(X)

lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(vectorizedX)
Now let's try a new text:
testX = tf_vectorizer.transform(["I am educated about learned stuff"])
#get lda to score this text against each of the 10 topics
lda.transform(testX)
Out:
array([[0.54995409, 0.05001176, 0.05000163, 0.05000579, 0.05      ,
        0.05001033, 0.05000001, 0.05001449, 0.05000123, 0.05000066]])
# looks like the first topic has the highest score - now what are the words that are most associated with each topic?
print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)
Out:
Topics in LDA model:
Topic #0: edu com mail send graphics ftp pub available contact university list faq ca information cs 1993 program sun uk mit
Topic #1: don like just know think ve way use right good going make sure ll point got need really time doesn
Topic #2: christian think atheism faith pittsburgh new bible radio games alt lot just religion like book read play time subject believe
Topic #3: drive disk windows thanks use card drives hard version pc software file using scsi help does new dos controller 16
Topic #4: hiv health aids disease april medical care research 1993 light information study national service test led 10 page new drug
Topic #5: god people does just good don jesus say israel way life know true fact time law want believe make think
Topic #6: 55 10 11 18 15 team game 19 period play 23 12 13 flyers 20 25 22 17 24 16
Topic #7: car year just cars new engine like bike good oil insurance better tires 000 thing speed model brake driving performance
Topic #8: people said did just didn know time like went think children came come don took years say dead told started
Topic #9: key space law government public use encryption earth section security moon probe enforcement keys states lunar military crime surface technology
Seems sensible - the sample text is about education and the word cloud for the first topic is about education.
The pictures below are from another dataset (ham vs spam SMS messages, so only two possible topics) which I reduced to 3 dimensions with PCA, but in case a picture helps, these two (the same data from different angles) might give a general sense of what's going on with LDiA. (The graphs are from Linear Discriminant Analysis vs LDiA, but the representation is still relevant.)
While LDiA is an unsupervised method, to actually use it in a business context you'll likely want to at least manually intervene to give the topics names that are meaningful to your context, e.g. assigning a subject area to stories on a news aggregation site, choosing amongst ['Business', 'Sports', 'Entertainment', etc.].
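A hypothetical sketch of that last step, continuing the code above; the label strings are made up after looking at the top words:
# Map topic indices to human-readable labels chosen by a human reviewer.
topic_names = {0: 'Academia/Tech', 3: 'Computing hardware', 6: 'Sports', 7: 'Autos'}

scores = lda.transform(testX)
best = int(scores.argmax(axis=1)[0])
print(topic_names.get(best, "Topic #%d" % best))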
For further study, perhaps run through something like this:
https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24

spaCy fails to properly parse medical text

Recently I have been experiencing some issues while splitting medical text into sentences with spaCy. Maybe you can explain why these issues arise?
If a word has a length of 1 and the sentence ends with a dot, the end of the sentence won't be recognized.
For example:
There was no between-treatment difference in preoperative or
postoperative hemodynamics or in release of troponin I. (NO SPLIT HERE) Preoperative
oral coenzyme Q(10) therapy in patients undergoing cardiac surgery
increases myocardial and cardiac mitochondrial coenzyme Q(10) levels,
improves mitochondrial efficiency, and increases myocardial tolerance
to in vitro hypoxia-reoxygenation stress.
Another issue is with the character sequence +/-, which is treated as the end of a sentence. For instance, one whole sentence is split into several sentences like below:
VO(2max) decreased significantly by 3.6 +/-
2.1, 14 +/-
2.5, and 27.4 +/-
3.6% in TW, and by 5 +/-
4, 9.4 +/-
6.4, and 18.7 +/-
7% in SW at 1000, 2500, and 4500 m, respectively.
All of the above should be one single sentence!
Sometimes the sentence is split between a word and a special character (or between two special characters, or between a number and a word shorter than 3 characters):
The survival rates for patients receiving left ventricular assist
devices (n = 68) versus patients receiving optimal medical management
(n = 61) were 52% versus 28% at 1 year and 29% versus 13% at 2 years SPLITS HERE
( P = .008, log-rank test).
Thank you very much!
spaCy's English models are trained on web data - mostly stuff like blog posts. Obviously the average blog post looks nothing like the medical literature you're working with, so spaCy is wildly confused. This problem isn't specific to spaCy; it will also happen with any statistically modelled system designed for "typical" English whose training data doesn't include medical papers.
Medical text is pretty notorious for causing problems for NLP techniques that work in other circumstances, so you may want to look around for something specifically tailored to it. Alternatively, you can try making a small training set based on your data and training a new spaCy model.
That said, the +/- issue does look strange, and might be based on a tokenization issue or something rather than a model issue - I would recommend you file a bug report here.
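As an illustration of the custom-rule route, a component added before the parser can pre-set sentence boundaries that the parser will respect. The sketch below uses the spaCy 3.x API and assumes "+/-" survives tokenization as a single token, which you would want to verify on your data:
import spacy
from spacy.language import Language

@Language.component("no_split_after_plus_minus")
def no_split_after_plus_minus(doc):
    # Boundaries set before the parser runs are respected by it.
    for token in doc[1:]:
        if doc[token.i - 1].text == "+/-":
            token.is_sent_start = False
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("no_split_after_plus_minus", before="parser")

text = "VO(2max) decreased significantly by 3.6 +/- 2.1, 14 +/- 2.5, and 27.4 +/- 3.6% in TW."
print([sent.text for sent in nlp(text).sents])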

how to set dynamic time warping window adjustment?

I am using DTW (dynamic time warping) code.
Does anybody happen to know how I should set the size of the adjustment window (the global path constraint)?
Cross-validation: get some labeled data, set the warping window to zero, and measure the leave-one-out accuracy.
Then keep increasing the warping window size until the accuracy gets worse.
See Figures 5 and 6 of the paper below.
Eamonn
Ratanamahatana, C. A. and Keogh, E. (2004). Everything You Know about Dynamic Time Warping is Wrong. Third Workshop on Mining Temporal and Sequential Data, in conjunction with the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), August 22-25, 2004, Seattle, WA.
http://www.cs.ucr.edu/~eamonn/DTW_myths.pdf
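In code, that cross-validation loop might look roughly like the sketch below: a naive O(nm) DTW with a Sakoe-Chiba band, where X and y stand in for your labeled sequences and labels.
import numpy as np

def dtw_distance(a, b, window):
    """DTW distance between 1-D sequences a and b with a Sakoe-Chiba band
    of half-width `window` (window=0 reduces to a diagonal alignment)."""
    n, m = len(a), len(b)
    w = max(window, abs(n - m))          # band must cover the length difference
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])

def loo_accuracy(X, y, window):
    """Leave-one-out accuracy of a 1-NN classifier under a given window."""
    correct = 0
    for i in range(len(X)):
        dists = [dtw_distance(X[i], X[j], window) if j != i else np.inf
                 for j in range(len(X))]
        correct += y[int(np.argmin(dists))] == y[i]
    return correct / len(X)

# X: list of 1-D numpy arrays, y: list of labels (your labeled data)
# for w in range(0, 20):
#     print(w, loo_accuracy(X, y, w))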
