spaCy fails to properly parse medical text

Recently I have been running into problems when splitting medical text into sentences with spaCy. Can you explain why these issues arise?
If the last word of a sentence is a single character followed by a period, the end of the sentence is not recognized.
For example:
There was no between-treatment difference in preoperative or
postoperative hemodynamics or in release of troponin I. (NO SPLIT HERE) Preoperative
oral coenzyme Q(10) therapy in patients undergoing cardiac surgery
increases myocardial and cardiac mitochondrial coenzyme Q(10) levels,
improves mitochondrial efficiency, and increases myocardial tolerance
to in vitro hypoxia-reoxygenation stress.
Another issue is with the character sequence +/-, which is treated as the end of a sentence. For instance, one whole sentence is split into several sentences, as below:
VO(2max) decreased significantly by 3.6 +/-
2.1, 14 +/-
2.5, and 27.4 +/-
3.6% in TW, and by 5 +/-
4, 9.4 +/-
6.4, and 18.7 +/-
7% in SW at 1000, 2500, and 4500 m, respectively.
All of the above should be one single sentence!
Sometimes a sentence is split between a word and a special character (also between two special characters, or between a number and a word shorter than 3 characters).
The survival rates for patients receiving left ventricular assist
devices (n = 68) versus patients receiving optimal medical management
(n = 61) were 52% versus 28% at 1 year and 29% versus 13% at 2 years SPLITS HERE
( P = .008, log-rank test).
Thank you very much!

spaCy's English models are trained on web data, mostly things like blog posts. The average blog post looks nothing like the medical literature you're working on, so spaCy's statistical model handles it poorly. This problem isn't specific to spaCy; it will happen with any statistical system designed for "typical" English whose training data doesn't include medical papers.
Medical text is notorious for breaking NLP techniques that work in other circumstances, so you may want to look for something specifically tailored to it. Alternatively, you can try making a small training set from your data and training a new spaCy model.
That said, the +/- issue does look strange, and might stem from a tokenization issue rather than a model issue. I would recommend filing a bug report with the spaCy project.
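Until the tokenization question is settled, one stopgap is to post-process the splitter's output. The sketch below is plain Python, not a spaCy API: the helper name merge_plus_minus_fragments is made up for illustration, and it assumes a sentence never legitimately ends on "+/-".

```python
def merge_plus_minus_fragments(sentences):
    """Rejoin fragments that a sentence splitter cut right after a '+/-' sign."""
    merged = []
    carry = ""  # text waiting to be joined to the next fragment
    for fragment in sentences:
        text = (carry + " " + fragment.strip()).strip()
        if text.endswith("+/-"):
            carry = text  # assumption: a sentence cannot end on '+/-'
        else:
            merged.append(text)
            carry = ""
    if carry:  # input ended on a dangling '+/-'; keep it rather than drop it
        merged.append(carry)
    return merged


# The fragments from the question, as produced by the splitter.
fragments = [
    "VO(2max) decreased significantly by 3.6 +/-",
    "2.1, 14 +/-",
    "2.5, and 27.4 +/-",
    "3.6% in TW, and by 5 +/-",
    "4, 9.4 +/-",
    "6.4, and 18.7 +/-",
    "7% in SW at 1000, 2500, and 4500 m, respectively.",
]
sentences = merge_plus_minus_fragments(fragments)
```

This only repairs the +/- splits; the missed boundary after "troponin I." would need a different fix.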


Gensim doc2vec's d2v.wv.most_similar() returns irrelevant words with high similarity scores

I've got a dataset of job listings with about 150,000 records. I extracted skills from the descriptions with NER, using a dictionary of 30,000 skills. Every skill is represented by a unique identifier.
My data example:
   job_title         job_id  skills
1  business manager  4       12 13 873 4811 482 2384 48 293 48
2  java developer    55      48 2838 291 37 484 192 92 485 17 23 299 23...
3  data scientist    21      383 48 587 475 2394 5716 293 585 1923 494 3
Then, I train a doc2vec model using these data where job titles (their ids to be precise) are used as tags and skills vectors as word vectors.
import gensim

def tagged_document(df):
    # One TaggedDocument per listing: skill ids are the words, job_id is the tag.
    for index, row in df.iterrows():
        yield gensim.models.doc2vec.TaggedDocument(row['skills'].split(), [str(row['job_id'])])

data_for_training = list(tagged_document(data[['job_id', 'skills']]))
model_d2v = gensim.models.doc2vec.Doc2Vec(dm=0, dbow_words=1, vector_size=80, min_count=3, epochs=100, window=100000)
model_d2v.build_vocab(data_for_training)  # must run before train() so corpus_count is set
model_d2v.train(data_for_training, total_examples=model_d2v.corpus_count, epochs=model_d2v.epochs)
It works mostly okay, but I have issues with some job titles. I tried to collect more data for them, but their behavior is still unpredictable.
For example, I have a job title "Director Of Commercial Operations" which is represented as 41 data records having from 11 to 96 skills (mean 32). When I get most similar words for it (skills in my case) I get the following:
docvec = model_d2v.docvecs[id_]
model_d2v.wv.most_similar(positive=[docvec], topn=5)
capacity utilization 0.5729076266288757
process optimization 0.5405482649803162
goal setting 0.5288119316101074
aeration 0.5124399662017822
supplier relationship management 0.5117508172988892
These are the top 5 skills, and 3 of them look relevant. However, the top one doesn't look valid, and neither does "aeration". The problem is that none of the records for this job title have these skills at all. It looks like noise in the output, but why does it get one of the highest similarity scores (although the scores are generally not high)?
Does it mean that the model can't pick out very specific skills for this kind of job title?
Can the number of "noisy" skills be reduced? Sometimes I see much more relevant skills with lower similarity score, but it's often lower than 0.5.
One more example of correct behavior with similar amount of data:
BI Analyst, 29 records, number of skills from 4 to 48 (mean 21). The top skills look alright.
business intelligence 0.6986587047576904
business intelligence development 0.6861011981964111
power bi 0.6589289903640747
tableau 0.6500121355056763
qlikview (data analytics software) 0.6307920217514038
business intelligence tools 0.6143202781677246
dimensional modeling 0.6032138466835022
exploratory data analysis 0.6005223989486694
marketing analytics 0.5737696886062622
data mining 0.5734485387802124
data quality 0.5729933977127075
data visualization 0.5691111087799072
microstrategy 0.5566076636314392
business analytics 0.5535123348236084
etl 0.5516749620437622
data modeling 0.5512707233428955
data profiling 0.5495884418487549
If your gold standard for what the model should report is skills that appeared in the training data, are you sure you don't want a simple count-based solution? For example, just provide a ranked list of the skills that appear most often in Director Of Commercial Operations listings.
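A minimal sketch of that count-based alternative, assuming the listings are available as (job_id, space-separated skill ids) pairs. The function name top_skills_by_title and the toy records are made up for illustration.

```python
from collections import Counter


def top_skills_by_title(records, job_id, topn=5):
    """Rank skills by how many listings with this job_id mention them."""
    counts = Counter()
    for rec_job_id, skills in records:
        if rec_job_id == job_id:
            # set() counts each skill once per listing, even if it is repeated
            counts.update(set(skills.split()))
    return counts.most_common(topn)


# Hypothetical toy records: (job_id, space-separated skill ids)
records = [
    ("21", "383 48 587 475"),
    ("21", "48 587 2394 48"),
    ("21", "48 5716"),
    ("55", "48 2838 291"),
]
ranked = top_skills_by_title(records, "21", topn=2)
```

By construction, this can only return skills that actually occur in listings for the given title, which is exactly the property the doc2vec neighbors lack.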
On the other hand, the essence of compressing N job titles, and 30,000 skills, into a smaller (in this case vector_size=80) coordinate-space model is to force some non-intuitive (but perhaps real) relationships to be reflected in the model.
Might there be some real pattern in the model (even if, perhaps, just some idiosyncrasies in the appearance of less-common skills) that makes aeration necessarily slot near those other skills? (Maybe it's a rare skill whose few contextual appearances co-occur with other skills very near 'capacity utilization', meaning that with the tiny amount of data available, and the tiny amount of overall attention given to this skill, there's no better place for it.)
Taking note of whether your 'anomalies' are often in low-frequency skills, or lower-frequency job-ids, might enable a closer look at the causes in the data, or some disclaiming/filtering of most_similar() results. (The most_similar() method can limit its returned rankings to the more frequent range of the known vocabulary, for cases when long-tail or rare words, with their rougher vectors, are intruding on higher-quality results from better-represented words. See the restrict_vocab parameter.)
That said, tinkering with training parameters may result in rankings that better reflect your intent. A larger min_count might remove more tokens that, lacking sufficient varied examples, mostly just inject noise into the rest of training. A different vector_size, smaller or larger, might better capture the relationships you're looking for. A more-aggressive (smaller) sample could discard more high-frequency words that might be starving more-interesting less-frequent words of a chance to influence the model.
Note that with dbow_words=1 & a large window, and records with (perhaps?) dozens of skills each, the words are having a much-more neighborly effect on each other, in the model, than the tag<->word correlations. That might be good or bad.

Microsoft Translate Adding Extra Words To Translation

I am trying to translate from English to Welsh. I have a dataset of 3032 sentences, which I am aware is below the recommended minimum of 10000, but the issue is that random words are being added within sentences or at the end of the translation.
With the dataset I have, I am getting a BLEU score of 94.25.
Image of Translation Differences
I have attached four examples where extra words are being added throughout the form. At no point in the dataset is there duplication of words that match any of these formats and there is no trailing whitespace in the translations which would explain why "yn" in particular is appearing as a new sentence.
Is there any way of removing these erroneous extra words or increasing the accuracy of the translation? Increasing the number of sentences to more than 10000 would be a very large task, and not one worth undertaking if the system is still likely to return random words.
I also raised this as a support request with Microsoft. They said the issue was down to using a dictionary that included verbs as part of the translation.
I have since tried using English UK as the basis for the translation - an option that previously failed to build - and with the same dataset the BLEU score is 93.24 but the extra words have disappeared.
My issue has been resolved and it's now down to training out the incorrect translations. It appears the English to Welsh translation has a bug.

How can Stanford CoreNLP Named Entity Recognition capture measurements like 5 inches, 5", 5 in., 5 in

I'm looking to capture measurements using Stanford CoreNLP. (If you can suggest a different extractor, that is fine too.)
For example, I want to find 15kg, 15 kg, 15.0 kg, 15 kilogram, 15 lbs, 15 pounds, etc. But among CoreNLP's extraction rules, I don't see one for measurements.
Of course, I can do this with pure regexes, but toolkits can run more quickly, and they offer the opportunity to chunk at a higher level, e.g. to treat gb and gigabytes together, and RAM and memory as building blocks--even without full syntactic parsing--as they build bigger units like 128 gb RAM and 8 gigabytes memory.
I want an extractor for this that is rule-based (not machine-learning-based), but I don't see one as part of RegexNER or elsewhere. How do I go about this?
IBM Named Entity Extraction can do this. The regexes are run in an efficient way rather than passing the text through each one. And the regexes are bundled to express meaningful entities, as for example one that unites all the measurement units into a single concept.
I don't think a rule-based system exists for this particular task. However, it shouldn't be hard to make one with TokensRegexNER. For example, a mapping like:
[{ner:NUMBER}]+ /(k|m|g|t)b/ memory? MEMORY
[{ner:NUMBER}]+ /"|''|in(ches)?/ LENGTH
You could try using vanilla TokensRegex as well, and then just extract out the relevant value with a capture group:
(?$group_name [{ner:NUMBER}]+) /(k|m|g|t)b/ memory?
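For comparison, here is a rough plain-regex analogue of the two mapping rules above, sketched in Python. It is not CoreNLP and does not use the NUMBER tag; the names UNIT_TYPES, MEASUREMENT and find_measurements are made up for illustration, and the unit list is deliberately tiny.

```python
import re

# Assumption: a small illustrative unit table that maps each surface form to one concept.
UNIT_TYPES = {
    "kb": "MEMORY", "mb": "MEMORY", "gb": "MEMORY", "tb": "MEMORY",
    '"': "LENGTH", "''": "LENGTH", "in": "LENGTH", "inches": "LENGTH",
}

# A number, optional whitespace, then one of the known unit spellings.
MEASUREMENT = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[kmgt]b\b|inches\b|in\b|''|\")",
    re.IGNORECASE,
)


def find_measurements(text):
    """Return (value, unit as written, unit type) for each match in the text."""
    return [
        (m.group("value"), m.group("unit"), UNIT_TYPES[m.group("unit").lower()])
        for m in MEASUREMENT.finditer(text)
    ]


found = find_measurements('128 gb RAM and a 5" screen, 5 inches wide')
```

The lookup table is what groups different spellings under one concept; extending it to weights or other units follows the same pattern.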
You can build your own training data and label the required measurements accordingly.
For example, take a sentence like "Jack weighs about 50 kgs".
The model will then classify your input as:
weighs, O
about, O
50, MES
kgs, MES
Where MES stands for measurements.
I have recently made training data for the Stanford NER tagger for my own customized problem and have built a model for it.
I think you can do the same thing with Stanford CoreNLP NER.
Note that this is a machine-learning-based approach rather than a rule-based one.

Systematic threshold for cosine similarity with TF-IDF weights

I am running an analysis of several thousand (e.g., 10,000) text documents. I have computed TF-IDF weights and have a matrix with pairwise cosine similarities. I want to treat the documents as a graph to analyze various properties (e.g., the path length separating groups of documents) and to visualize the connections as a network.
The problem is that there are too many similarities. Most are too small to be meaningful. I see many people dealing with this problem by dropping all similarities below a particular threshold, e.g., similarities below 0.5.
However, 0.5 (or 0.6, or 0.7, etc.) is an arbitrary threshold, and I'm looking for techniques that are more objective or systematic to get rid of tiny similarities.
I'm open to many different strategies. For example, is there an alternative to tf-idf that would make most of the small similarities 0? Are there other methods to keep only significant similarities?
In short, take the average cosine value of an initial clustering or even all of the initial sentences and accept or reject clusters based on something akin to the following.
One way to look at the problem is to develop a score based on distance from the mean similarity, taking the high end for good measure. A value 1.5 standard deviations above the mean (roughly the 93rd percentile if the data were normal) tends to mark an outlier, and 3 standard deviations (roughly the 99.9th percentile) an extreme outlier. I cannot remember where, but this idea has had traction in other forums and formed the basis for my similarity.
Keep in mind that the data is not likely to be normally distributed.
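As a rough sketch of the mean-plus-deviations cutoff (without the alpha weighting discussed next), assuming the graph edges are available as (doc_a, doc_b, similarity) tuples. The function names and toy edges are made up for illustration, and k=1.5 is just the outlier mark mentioned above.

```python
import statistics


def similarity_threshold(similarities, k=1.5):
    """Cutoff at k sample standard deviations above the mean similarity."""
    mean = statistics.fmean(similarities)
    std = statistics.stdev(similarities)
    return mean + k * std


def keep_strong_edges(edges, k=1.5):
    """Keep only the (doc_a, doc_b, similarity) edges at or above the cutoff."""
    threshold = similarity_threshold([s for _, _, s in edges], k)
    return [(a, b, s) for a, b, s in edges if s >= threshold]


# Hypothetical toy graph: mostly tiny similarities and one strong pair.
edges = [
    ("d1", "d2", 0.05), ("d1", "d3", 0.10), ("d2", "d3", 0.08),
    ("d1", "d4", 0.07), ("d2", "d4", 0.06), ("d3", "d4", 0.90),
]
kept = keep_strong_edges(edges)
```

Because the similarities are unlikely to be normal, an empirical percentile of the observed values may be a more honest cutoff than a multiple of the standard deviation.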
In order to obtain alpha, you could use the Wu-Palmer score or another score as described by NLTK. Strong Wu-Palmer similarities should lead to a larger range of acceptance, while lower Wu-Palmer scores should lead to stricter acceptance. Therefore, taking 1 - Wu-Palmer score would be advisable. You can even use this method for LSA or LDA groups. To be even stricter and take things close to 1.5 or more standard deviations, you could even try 1 + Wu-Palmer (the cream of the crop), re-find the ultimate K, find the new score, cluster, and repeat.
Beware, though: this would mean finding the Wu-Palmer score of all relevant words, which is quite a large computational problem. Also, 10000 documents is peanuts compared to what most algorithms are run on. The smallest set I have seen for tweets was 15,000, and the 20 Newsgroups set has 20,000 documents. I am pretty sure Alchemy API uses something akin to the 20 Newsgroups set. They definitely use SentiWordNet.
The basic equation is not really mine so feel free to dig around for it.
Another thing to keep in mind is that the calculation is time-intensive. It may be a good idea to use a Student's t value for estimating the expected value/mean Wu-Palmer score of SOV pairings, and especially so if you try to take the entire sentence. Commons Math3 for Java/Scala includes the distribution, as does SciPy for Python, and R should already have something as well.
Xbar +/- t_(alpha/2) * sample_std / sqrt(sample_size)
Note: There is another option with this weight. You could use an algorithm that adds or subtracts from this threshold until achieving the best result. This would likely not be related solely to the cosine importance but possibly to an inflection point or gap as with Tibshirani's gap statistic.

tf-idf using data on unigram frequency from Google

I'm trying to identify important terms in a set of government documents. Generating the term frequencies is no problem.
For document frequency, I was hoping to use the handy Python scripts and accompanying data that Peter Norvig posted for his chapter in "Beautiful Data", which include the frequencies of unigrams in a huge corpus of data from the Web.
My understanding of tf-idf, however, is that "document frequency" refers to the number of documents containing a term, not the number of total words that are this term, which is what we get from the Norvig script. Can I still use this data for a crude tf-idf operation?
Here's some sample data:
word     tf      global frequency
china    1684    0.000121447
the      352385  0.022573582
economy  6602    0.0000451130774123
and      160794  0.012681757
iran     2779    0.0000231482902018
romney   1159    0.000000678497795593
Simply dividing tf by gf gives "the" a higher score than "economy," which can't be right. Is there some basic math I'm missing, perhaps?
As I understand it, the inverse of the global frequency corresponds to the "inverse total term frequency" described by Robertson. From Robertson's paper:
One possible way to get away from this problem would be to make a fairly radical replacement for IDF (that is, radical in principle, although it may be not so radical in terms of its practical effects). ....
the probability from the event space of documents to the event space of term positions in the concatenated text of all the documents in the collection.
Then we have a new measure, called here inverse total term frequency:
On the whole, experiments with inverse total term frequency weights have tended to show that they are not as effective as IDF weights
According to this text, you can use inverse global frequency as the IDF term, although it is cruder than the standard one.
You are also missing stop word removal. Words such as "the" are used in almost all documents, so they carry no information. Remove such stop words before computing tf-idf.
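A minimal sketch of both points, using the sample rows from the question. The weighting tf * log(1 / global frequency) is one common way to turn a relative frequency into an IDF-like factor and is my assumption here, not Robertson's exact formula. The two-word stop list is purely illustrative.

```python
import math

# Sample rows from the question: word -> (term count, global relative frequency)
rows = {
    "china": (1684, 0.000121447),
    "the": (352385, 0.022573582),
    "economy": (6602, 0.0000451130774123),
    "and": (160794, 0.012681757),
    "iran": (2779, 0.0000231482902018),
    "romney": (1159, 0.000000678497795593),
}
STOP_WORDS = {"the", "and"}  # assumption: a tiny illustrative stop list


def crude_tf_idf(tf, global_freq):
    """Term count weighted by the log of the inverse global frequency."""
    return tf * math.log(1.0 / global_freq)


scores = {
    word: crude_tf_idf(tf, gf)
    for word, (tf, gf) in rows.items()
    if word not in STOP_WORDS
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

With this weighting, "the" would still outscore "economy" if it were left in, simply because its raw count is so large, which is why the stop word filter comes first.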
