Extend length of OpenAI API TLDR output - openai-api

I'd like to produce a 3-6 sentence summary of a 2-3 page article using OpenAI's TL;DR. I've pasted in the article text, but the output seems to stay at only 1-2 sentences.

Options to get a 3-sentence summary for a given prompt
There are several ways to tell the OpenAI API that you want a 3-sentence summary:
Option 1: Write TL;DR in 3 sentences (7 tokens)
Option 2: TL;DR 3 sentences (5 tokens)
Option 3: Write summary in 3 sentences (5 tokens)
Option 4: Summary 3 sentences (3 tokens)
Note: I used the Tokenizer to calculate the number of tokens.
All of the above will return a 3-sentence summary.
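If you prefer to check the token counts programmatically rather than in the web Tokenizer, here is a small sketch using the tiktoken library (my assumption; any GPT-3-style encoding such as p50k_base gives comparable counts to the web tool):

import tiktoken

# p50k_base is the encoding used by the davinci-era completion models;
# counts may differ slightly from the web Tokenizer depending on the model.
enc = tiktoken.get_encoding("p50k_base")

for prompt in ["Write TL;DR in 3 sentences",
               "TL;DR 3 sentences",
               "Write summary in 3 sentences",
               "Summary 3 sentences"]:
    print(len(enc.encode(prompt)), prompt)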
Choose Option 4
Why? Because you pay for every token you use. Therefore, Option 4 will cost the least.
What's more, writing TL;DR on its own already burns 3 tokens, the same as Summary 3 sentences. In other words, it costs you the same, but Summary 3 sentences explicitly asks for a 3-sentence summary.
Also, don't forget to set the max_tokens parameter high enough so the OpenAI API can return 3 sentences.
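For completeness, here is a minimal sketch of such a request. It assumes the legacy (pre-1.0) openai Python package and a completion-style model like text-davinci-003; adjust to whatever client and model you actually use:

import openai  # legacy pre-1.0 SDK style

openai.api_key = "YOUR_API_KEY"  # placeholder

article = "..."  # the 2-3 page article text

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=article + "\n\nSummary 3 sentences",
    max_tokens=256,   # leave enough room for 3 full sentences
    temperature=0.7,
)

print(response["choices"][0]["text"].strip())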
An example
Let's take an example from the official OpenAI website.
Prompt:
A neutron star is the collapsed core of a massive supergiant star, which
had a total mass of between 10 and 25 solar masses, possibly more if the
star was especially metal-rich.[1] Neutron stars are the smallest and
densest stellar objects, excluding black holes and hypothetical white
holes, quark stars, and strange stars.[2] Neutron stars have a radius on
the order of 10 kilometres (6.2 mi) and a mass of about 1.4 solar masses.
[3] They result from the supernova explosion of a massive star, combined
with gravitational collapse, that compresses the core past white dwarf star
density to that of atomic nuclei.
Summary 3 sentences
Completion:
Neutron stars are the smallest and densest stellar objects, with a
radius of about 10 kilometres and a mass of 1.4 solar masses. They are
formed from the supernova explosion of a massive star, which had a
total mass between 10-25 solar masses, combined with gravitational
collapse to compress the core beyond white dwarf density. As such,
neutron stars represent the collapsed cores of massive supergiant
stars.

Related

New to NLP: help needed with using spaCy to get POS tags

I have the list below and want to get the corresponding POS tag for each token:
processed_lst = [['The', 'wild', 'is', 'dangerous'], ['The', 'rockstar', 'is', 'wild']]
I want to use the spaCy library and get output like:
final_lst = [[(The, DET), (wild, NOUN), (is, AUX), (dangerous, ADJ)], [(The, DET), (rockstar, NOUN), (is, AUX), (wild, ADJ) ]]
You can do this with the .pos_ attribute of each token after you turn the text into a spaCy Doc. The code below is pulled from this post on Part of Speech Tagging.
import spacy
nlp = spacy.load("en_core_web_sm")
text = """This is where the calculation can get tricky. Here’s the thing about solar energy. Solar energy comes from the sun. That means solar panels cannot produce energy 24 hours a day. They only produce energy during sunlight hours. That energy then has to be stored somewhere while it is not being used. Energy storage is a whole other topic in and of itself. Let me get back to the point, there’s only an average of 4 peak sunlight hours a day. A solar panel may get more than that, but let’s take a conservative estimate of our solar power generation and confine it to those 4 hours only.
Back to the calculations. At 4 acres of solar panels to generate a megawatt-hour and 4 hours of power generation time a day, a 1 MW solar farm would generate 4 MWh of power over 4 acres every day. At 110,000 megawatt-hours of power needed a day to power America, we would need about 110,000 acres of solar farm. 110,000 acres? That sounds huge, that’s more land than the entire Mojave desert. It’s not as daunting as it sounds, there are 1.9 billion acres in the continental United States, and 110,000 acres is only slightly more than 0.5 percent of the total land of the continental US."""
doc = nlp(text)
for token in doc:
    print(token.text, token.pos_, token.tag_)
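Applied to your processed_lst, here is a minimal sketch (assuming spaCy v3, where you can build a Doc from pre-tokenized words and pass it through the loaded pipeline, so your existing tokenization is kept):

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")

processed_lst = [['The', 'wild', 'is', 'dangerous'], ['The', 'rockstar', 'is', 'wild']]

final_lst = []
for tokens in processed_lst:
    # Build a Doc from the pre-tokenized words, then run the pipeline
    # (tagger, parser, ...) on it to get POS tags.
    doc = nlp(Doc(nlp.vocab, words=tokens))
    final_lst.append([(token.text, token.pos_) for token in doc])

print(final_lst)
# e.g. [[('The', 'DET'), ('wild', 'NOUN'), ('is', 'AUX'), ('dangerous', 'ADJ')], ...]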

Custom entity extraction from texts

What is the right approach for multi-label text information extraction/classification
I have texts that describe a caregiver/patient visit, for example (made-up example):
Mr *** visits the clinic on 02/2/2018 complaining about pain in the
lower back for several days, No pathological findings in the x-ray or
in the blood tests. I suggest Mr *** 5 resting days.
Now, the text can be a whole paragraph in which the only information I care about is lower back pain and resting days. I have 300-400 different labels, but only around 1000-1500 labeled samples in total. When I label the text I also mark the relevant words that make up the "label"; here that would be ['pain', 'lower', 'back'].
When I simply look up those words (or the words for the other 300-400 labels) in other texts, I manage to label a larger number of texts. But if the words are written in a different pattern, such as "Ache in the lower back" or "lowerback pain", and I've never added that pattern to the look-up table for "lower back pain", I won't find it.
Because the paragraphs can be large while the only information I need is just 3-4 words, DL/ML models fail to learn with this amount of data and this many labels. I am wondering whether there is a way to use the lookup table as a feature in the training phase, or whether I should try other approaches.
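For reference, here is a minimal sketch of the exact-lookup labelling described above (the label table and helper are hypothetical); it also shows why unseen variants such as "lowerback" slip through:

import re

# Hypothetical look-up table: label -> words that must all occur in the text.
label_table = {
    "lower back pain": ["pain", "lower", "back"],
    "resting days": ["resting", "days"],
}

def lookup_labels(text):
    tokens = set(re.findall(r"\w+", text.lower()))
    return [label for label, words in label_table.items()
            if all(w in tokens for w in words)]

note = ("Mr *** visits the clinic on 02/2/2018 complaining about pain in the "
        "lower back for several days. No pathological findings in the x-ray "
        "or in the blood tests. I suggest Mr *** 5 resting days.")

print(lookup_labels(note))                    # ['lower back pain', 'resting days']
print(lookup_labels("Ache in the lowerback")) # [] -- the variant is missed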

How to combine the results of multiple OCR tools to get better text recognition

Imagine you have several OCR tools to read text from images, but none of them gives a 100% accurate output. Combined, however, the result could come very close to the ground truth. What would be the best technique to "fuse" the texts together to get good results?
Example:
Actual text
§ 5.1: The contractor is obliged to announce the delay by 01.01.2019 at the latest. The identification-number to be used is OZ-771LS.
OCR tool 1
5 5.1 The contractor is obliged to announce the delay by O1.O1.2019 at the latest. The identification-number to be used is OZ77lLS.
OCR tool 2
§5.1: The contract or is obliged to announce theedelay by 01.O1. 2O19 at the latest. The identification number to be used is O7-771LS
OCR tool 3
§ 5.1: The contractor is oblige to do announced he delay by 01.01.2019 at the latest. T he identification-number ti be used is OZ-771LS.
What could be a promising algorithm to fuse OCR 1, 2 and 3 to get the actual text?
My first idea was to create a "tumbling window" of arbitrary length, compare the words in each window, and take the word that 2 out of 3 tools predict for every position.
For example with window size 3:
[5 5.1 The]
[§5.1: The contract]
[§ 5.1: The]
As you can see, the algorithm doesn't work here, since all three tools have different candidates for position one (5, §5.1: and §).
Of course it would be possible to add some tricks like Levenshtein distance to allow some deviation, but I fear this will not be robust enough.
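Here is a rough sketch of that 2-out-of-3 idea, using difflib to align each hypothesis against the first one before voting. It is purely illustrative: the choice of backbone, the whitespace tokenization, and ignoring insertions/deletions are all simplifying assumptions.

import difflib
from collections import Counter

def fuse_ocr(outputs):
    # Use the first hypothesis as an alignment backbone (an arbitrary choice).
    ref = outputs[0].split()
    candidates = [[tok] for tok in ref]  # one candidate list per backbone position
    for other in outputs[1:]:
        toks = other.split()
        sm = difflib.SequenceMatcher(a=ref, b=toks, autojunk=False)
        for tag, a0, a1, b0, b1 in sm.get_opcodes():
            # Collect aligned words; insertions/deletions are ignored in this sketch.
            if tag == "equal" or (tag == "replace" and a1 - a0 == b1 - b0):
                for i, j in zip(range(a0, a1), range(b0, b1)):
                    candidates[i].append(toks[j])
    # Majority vote per word position; ties resolve arbitrarily here.
    return " ".join(Counter(c).most_common(1)[0][0] for c in candidates)

ocr1 = "5 5.1 The contractor is obliged to announce the delay by O1.O1.2019 at the latest."
ocr2 = "§5.1: The contract or is obliged to announce theedelay by 01.O1. 2O19 at the latest."
ocr3 = "§ 5.1: The contractor is oblige to do announced he delay by 01.01.2019 at the latest."
print(fuse_ocr([ocr1, ocr2, ocr3]))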

spaCy fails to properly parse medical text

Recently I have been experiencing some issues while splitting medical text into sentences with spaCy. Maybe you can explain why these issues arise?
If the word has a length of 1 and the sentence ends with a dot, the end of the sentence won't be recognized.
For example:
There was no between-treatment difference in preoperative or
postoperative hemodynamics or in release of troponin I. (NO SPLIT HERE) Preoperative
oral coenzyme Q(10) therapy in patients undergoing cardiac surgery
increases myocardial and cardiac mitochondrial coenzyme Q(10) levels,
improves mitochondrial efficiency, and increases myocardial tolerance
to in vitro hypoxia-reoxygenation stress.
Another issue is with the characters +/-, which are treated as the end of a sentence. For instance, one whole sentence gets split into several sentences like below:
VO(2max) decreased significantly by 3.6 +/-
2.1, 14 +/-
2.5, and 27.4 +/-
3.6% in TW, and by 5 +/-
4, 9.4 +/-
6.4, and 18.7 +/-
7% in SW at 1000, 2500, and 4500 m, respectively.
All of the above should be one single sentence!
Sometimes the sentence is split between a word and a special character (or between two special characters, or between a number and a word shorter than 3 characters).
The survival rates for patients receiving left ventricular assist
devices (n = 68) versus patients receiving optimal medical management
(n = 61) were 52% versus 28% at 1 year and 29% versus 13% at 2 years SPLITS HERE
( P = .008, log-rank test).
Thank you very much!
spaCy's English models are trained on web data, mostly stuff like blog posts. Obviously the average blog post looks nothing like the medical literature you're working on, so spaCy is wildly confused. This problem isn't specific to spaCy; it will happen with any statistically modelled system designed for "typical" English whose training data doesn't include medical papers.
Medical text is notorious for breaking NLP techniques that work fine elsewhere, so you may want to look around for something specifically tailored to it. Alternatively, you can build a small training set from your own data and train a new spaCy model.
That said, the +/- issue does look strange, and might be based on a tokenization issue or something rather than a model issue - I would recommend you file a bug report here.
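If you need a stop-gap before retraining, one option is to veto the problematic boundaries with a small custom component placed before the parser, as in spaCy's custom-boundary example. This is only a sketch: it assumes spaCy v3 and that the parser respects sentence starts preset by an earlier component, and the heuristics are deliberately rough. Biomedical models such as scispaCy are another route.

import spacy
from spacy.language import Language

nlp = spacy.load("en_core_web_sm")

@Language.component("medical_boundary_fixes")
def medical_boundary_fixes(doc):
    for token in doc[:-1]:
        # Don't open a new sentence after "+/-" (however the tokenizer splits it:
        # this check matches any token made up only of '+', '/' and '-').
        if token.text and not token.text.strip("+-/"):
            doc[token.i + 1].is_sent_start = False
        # Rough heuristic: don't split after a period that follows a
        # single-character token, e.g. the "I" in "troponin I.".
        elif token.text == "." and token.i > 0 and len(doc[token.i - 1].text) == 1:
            doc[token.i + 1].is_sent_start = False
    return doc

# Must run before the parser so the preset boundaries are taken into account.
nlp.add_pipe("medical_boundary_fixes", before="parser")

doc = nlp("VO(2max) decreased significantly by 3.6 +/- 2.1, 14 +/- 2.5, "
          "and 27.4 +/- 3.6% in TW at 1000 and 2500 m, respectively.")
print([sent.text for sent in doc.sents])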

tf-idf using data on unigram frequency from Google

I'm trying to identify important terms in a set of government documents. Generating the term frequencies is no problem.
For document frequency, I was hoping to use the handy Python scripts and accompanying data that Peter Norvig posted for his chapter in "Beautiful Data", which include the frequencies of unigrams in a huge corpus of data from the Web.
My understanding of tf-idf, however, is that "document frequency" refers to the number of documents containing a term, not the number of total words that are this term, which is what we get from the Norvig script. Can I still use this data for a crude tf-idf operation?
Here's some sample data:
word      tf        global frequency
china     1684      0.000121447
the       352385    0.022573582
economy   6602      0.0000451130774123
and       160794    0.012681757
iran      2779      0.0000231482902018
romney    1159      0.000000678497795593
Simply dividing tf by gf gives "the" a higher score than "economy," which can't be right. Is there some basic math I'm missing, perhaps?
As I understand it, global frequency corresponds to the "inverse total term frequency" mentioned here by Robertson. From Robertson's paper:
One possible way to get away from this problem would be to make a fairly radical replacement for IDF (that is, radical in principle, although it may be not so radical in terms of its practical effects). ... the probability from the event space of documents to the event space of term positions in the concatenated text of all the documents in the collection. Then we have a new measure, called here inverse total term frequency: ... On the whole, experiments with inverse total term frequency weights have tended to show that they are not as effective as IDF weights.
According to this, you can use the inverse global frequency as the IDF term, albeit a cruder one than the standard IDF.
Also, you are missing stop-word removal. Words such as "the" occur in almost all documents and therefore carry no information. Before computing tf-idf, you should remove such stop words.
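Putting both points together, here is a crude sketch on your sample data. It uses -log(global frequency), i.e. log(1/gf), as the IDF-like weight; the stop-word set is a stand-in (in practice use a full list, e.g. from NLTK or spaCy):

import math

# Term frequencies (tf) and global unigram frequencies (gf) from the question.
data = {
    "china":   (1684,   0.000121447),
    "the":     (352385, 0.022573582),
    "economy": (6602,   0.0000451130774123),
    "and":     (160794, 0.012681757),
    "iran":    (2779,   0.0000231482902018),
    "romney":  (1159,   0.000000678497795593),
}

# Tiny, assumed stop-word list for illustration only.
stop_words = {"the", "and"}

# Crude tf-idf: tf * log(1 / gf), after dropping stop words.
scores = {
    term: tf * -math.log(gf)
    for term, (tf, gf) in data.items()
    if term not in stop_words
}

for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:10s} {score:12.1f}")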
