OpenAI API error: "This model's maximum context length is 4097 tokens" - openai-api

I am making a request to the completions endpoint. My prompt is 1360 tokens, as verified by the Playground and the Tokenizer. I won't show the prompt as it's a little too long for this question.
Here is my request to OpenAI in Node.js using the openai npm package:
const response = await openai.createCompletion({
  model: 'text-davinci-003',
  prompt,
  max_tokens: 4000,
  temperature: 0.2
})
When testing in the playground my total tokens after response are 1374.
When submitting my prompt via the completions API I am getting the following error:
error: {
  message: "This model's maximum context length is 4097 tokens, however you requested 5360 tokens (1360 in your prompt; 4000 for the completion). Please reduce your prompt; or completion length.",
  type: 'invalid_request_error',
  param: null,
  code: null
}
If you have been able to solve this one, I'd love to hear how you did it.

The model's token limit is shared between the prompt and the completion. The tokens from the prompt and the completion together must not exceed the token limit of the particular GPT-3 model.
As stated on the official OpenAI website:
Depending on the model used, requests can use up to 4097 tokens shared
between prompt and completion. If your prompt is 4000 tokens, your
completion can be 97 tokens at most.
The limit is currently a technical limitation, but there are often
creative ways to solve problems within the limit, e.g. condensing your
prompt, breaking the text into smaller pieces, etc.
(The docs follow this with a table of GPT-3 models and their maximum token limits.)

This was solved by Reddit user 'bortlip'.
The max_tokens parameter defines the maximum number of tokens in the response.
From OpenAI:
https://platform.openai.com/docs/api-reference/completions/create#completions/create-max_tokens
The token count of your prompt plus max_tokens cannot exceed the model's context length.
Therefore, to solve the issue, I subtract the token count of the prompt from the model's context length and pass the result as max_tokens, and it works just fine.
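A minimal sketch of that fix in Node.js (assuming the third-party gpt-3-encoder npm package for counting prompt tokens; any tokenizer matched to the model would do, and the 4097 figure is taken from the error message above):

const { encode } = require('gpt-3-encoder'); // third-party tokenizer package

const CONTEXT_LENGTH = 4097; // text-davinci-003's context window, per the error message

const promptTokens = encode(prompt).length; // `prompt` as in the question
const response = await openai.createCompletion({
  model: 'text-davinci-003',
  prompt,
  max_tokens: CONTEXT_LENGTH - promptTokens, // leave the rest for the completion
  temperature: 0.2
});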

Related

Openai API continue the output of the above content

How do you get continuous output from the OpenAI API, for example when letting the GPT API write an article? If the output is interrupted, you can ask it to continue and it carries on from the earlier content. This is very easy to do in ChatGPT, but with the OpenAI API, once you add the earlier output to the prompt, it eventually reports an error because the token limit is exceeded; and if you don't add the earlier content, the model can't continue it.
I also ran into this problem with the token limit being exceeded.
This problem has been answered in this question: OpenAI API error: "This model's maximum context length is 4097 tokens"
It has to do with the fact that your prompt tokens plus max_tokens cannot exceed the model's 4097-token context length.
For example, if your prompt is 200 tokens and your max_tokens is set to 4000, this results in a requested total of 4200 tokens and the API will return an error.
If your prompt is 200 tokens, max_tokens can be set to at most 3,897.
What you likely want to do here is build a system that takes the inputs and summarizes them. You can then track the length of the inputs and ensure that your "conversation memory" never exceeds the number of tokens you have budgeted. Each time you make a request, you pass in the summarized/truncated version of your conversation history along with the prompt itself, as in the sketch below.
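A rough sketch of the truncation half of that idea (countTokens here is a hypothetical stand-in; a real implementation would use a tokenizer such as gpt-3-encoder):

// Keep only as much recent history as fits in a token budget (illustrative).
const MAX_HISTORY_TOKENS = 2000; // leave room for the new prompt and the completion

// Hypothetical helper; in practice use a tokenizer matched to the model.
const countTokens = (text) => Math.ceil(text.length / 4); // rough heuristic

function trimHistory(messages) {
  const kept = [];
  let used = 0;
  // Walk backwards so the most recent messages are kept.
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = countTokens(messages[i]);
    if (used + cost > MAX_HISTORY_TOKENS) break;
    kept.unshift(messages[i]);
    used += cost;
  }
  return kept;
}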

How is a token value generated in mainline dht's get_peers query

I am reading BEP 5 and trying to understand how a token value is generated. As I understand it, the token is a randomly generated value returned in response to a get_peers query for safety. This same token value would then be sent in an announce_peer query, so the queried node can check that the same IP previously requested the same infohash.
My question is how is this value generated exactly? It says something about an unspecified implementation - does this mean I can implement it myself (for example by using the SHA-1 value)?
I tried looking at other BEPs but couldn't find any specific rules for generating a token value.
The token represents a write permission so that the other node may follow up with an announce request carrying that write permission.
Since the write permission is specific to the individual node providing the token, it is not necessary to specify how it keeps track of valid write permissions; there needs to be no agreement between nodes about how the implementation works. For everyone else the token is just an opaque sequence of bytes, essentially a key.
Possible implementations are:
(stateful) keep a hashmap mapping each issued token to its expiration time and the remote IP it is valid for.
(stateless) hash a secret together with the remote IP, remote ID, and a validity-time-window counter, then truncate the hash. Bump the counter on a timer; when verifying, check against both the current and the previous counter. (See the sketch below.)
Since a token is only valid for a few minutes and a node should also have a spam throttle it doesn't need to be high strength, just enough bits to make it impossible to brute-force. 6-8 bytes is generally enough for that purpose.
The underlying goal is to hand out a space-efficient, time-limited write permission to individual nodes in a way that other nodes can't forge.
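A sketch of the stateless variant in Node.js (the secret size, 8-byte truncation, and 5-minute window are illustrative choices; BEP 5 leaves all of this up to the implementation):

const crypto = require('crypto');

const SECRET = crypto.randomBytes(16); // per-node secret, never shared
const WINDOW_MS = 5 * 60 * 1000;       // validity-time-window length

function makeToken(remoteIp, remoteId, windowOffset = 0) {
  const counter = Math.floor(Date.now() / WINDOW_MS) + windowOffset;
  return crypto.createHash('sha1')
    .update(SECRET)
    .update(remoteIp)
    .update(remoteId)
    .update(String(counter))
    .digest()
    .subarray(0, 8); // truncate: 8 bytes is enough for a short-lived token
}

function verifyToken(token, remoteIp, remoteId) {
  // Accept tokens minted in the current or the previous time window.
  return [0, -1].some((offset) =>
    token.length === 8 &&
    crypto.timingSafeEqual(token, makeToken(remoteIp, remoteId, offset)));
}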

context window length in Codex

Is the completion window length included in the context window length for Codex?
For davinci, the context window length is set to 4000 tokens.
From what I understand, if the prompt is 3500 tokens, for example, then the remaining 500 are left for the completion, and there is no way to use the whole 4000 tokens for the prompt.
I am pretty sure of my understanding, but it would be helpful to have it confirmed by someone knowledgeable.
The context length for davinci is 4096 tokens. The prompt tokens plus max_tokens for the response cannot be greater than the context length.
This is from the OpenAI API docs:
The token count of your prompt plus max_tokens cannot exceed the model's context length. Most models have a context length of 2048 tokens (except for the newest models, which support 4096).
Ref: https://platform.openai.com/docs/api-reference/completions/create

yubikey 5 NFC enter 6 digit code on touch

I'm using my YubiKey 5 NFC with U2F as well as for OTP codes. I get OTP codes using the Yubico Authenticator app, which seems a little too complicated, and I was wondering if there is a way to assign them to a short/long touch on my key so I don't need to open that app every time for codes I use often.
It seems that the authenticator uses something other than the slots to store credentials; is it possible to read them with ykman or some other official command-line utility/SDK?
There are two types of 6-digit OTP codes that are part of OATH: HMAC-based (HOTP), which are generated in a fixed sequence, and time-based (TOTP), which update every 30 seconds or so. TOTP is more commonly used.
The YubiKey can generate HOTP codes on touch, in either slot 1 (short touch) or slot 2 (long touch). You can set this up with ykman otp hotp 1 or ykman otp hotp 2, as the case may be. It expects the secret key in base32 format.
This can't be done for TOTP, for the simple reason that in order to generate a time-based code, you have to know what time it is, and the Yubikey doesn't have a real-time clock on board (because it doesn't have any power source to keep it running). So it can't generate TOTP codes without assistance from the software application, which feeds it the current time from the system clock.
If you don't like the graphical authenticator app, you can generate HOTP/TOTP codes from the command line by running ykman oath code.
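For example (subcommand names vary between ykman versions; newer releases use ykman oath accounts code rather than ykman oath code):

# program slot 2 (long touch) with an HOTP credential; the base32 secret is prompted for
ykman otp hotp 2

# generate codes for the stored OATH credentials from the command line
ykman oath code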

Why should checking a wrong password take longer than checking the right one?

This question has always troubled me.
On Linux, when asked for a password, if your input is the correct one, it checks right away, with almost no delay. But, on the other hand, if you type the wrong password, it takes longer to check. Why is that?
I observed this in all Linux distributions I've ever tried.
It's actually to prevent brute-force attacks that try millions of passwords per second. The idea is to limit how fast passwords can be checked, and there are a number of rules that should be followed:
A successful user/password pair should succeed immediately.
There should be no discernible difference in reasons for failure that can be detected.
That last one is particularly important. It means no helpful messages like:
Your user name is correct but your password is wrong, please try again
or:
Sorry, password wasn't long enough
Not even a time difference in response between the "invalid user and password" and "valid user but invalid password" failure reasons.
Every failure should deliver exactly the same information, textual and otherwise.
Some systems take it even further, increasing the delay with each failure, or only allowing three failures then having a massive delay before allowing a retry.
This makes it take longer to guess passwords.
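As an illustration of those rules, here is a minimal sketch of the pattern in JavaScript (the toy credential store and the 3-second figure are made up; real systems do this inside PAM, as discussed below):

// Uniform failure path with a fixed delay (illustrative only; never store plaintext passwords).
const users = new Map([['alice', 'hunter2']]); // toy credential store

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function login(username, password) {
  const stored = users.get(username);
  if (stored !== undefined && stored === password) {
    return `welcome, ${username}`;    // success returns immediately
  }
  await sleep(3000);                  // same fixed delay on *any* failure...
  throw new Error('Login incorrect'); // ...and the same message for every failure reason
}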
I am not sure of the exact mechanism, but it is quite common to add a delay after a wrong password is entered, to make attacks harder. This makes an attack practically infeasible, because checking even a few passwords will take a long time.
Even trying a handful of guesses - birthdates, the name of the cat, and things like that - becomes no fun.
Basically, it's to mitigate brute-force and dictionary attacks.
From The Linux-PAM Application Developer's Guide:
Planning for delays

extern int pam_fail_delay(pam_handle_t *pamh, unsigned int micro_sec);

This function is offered by Linux-PAM to facilitate time delays following a failed call to pam_authenticate() and before control is returned to the application. When using this function, the application programmer should check whether it is available with:

#ifdef PAM_FAIL_DELAY
....
#endif /* PAM_FAIL_DELAY */

Generally, an application requests that a user is authenticated by Linux-PAM through a call to pam_authenticate() or pam_chauthtok(). These functions call each of the stacked authentication modules listed in the relevant Linux-PAM configuration file. As directed by this file, one or more of the modules may fail, causing the pam_...() call to return an error. It is desirable for there to also be a pause before the application continues. The principal reason for such a delay is security: a delay acts primarily to discourage brute force dictionary attacks, but also helps hinder timed (covert channel) attacks.
It's a very simple, virtually effortless way to greatly increase security. Consider:
System A has no delay. An attacker has a program that creates username/password combinations. At a rate of thousands of attempts per minute, it takes only a few hours to try every combination and record all successful logins.
System B generates a 5-second delay after each incorrect guess. The attacker's efficiency has been reduced to 12 attempts per minute, effectively crippling the brute-force attack. Instead of hours, it can take months to find a valid login. If hackers were that patient, they'd go legit. :-)
Failed authentication delays are there to reduce the rate of login attempts. The idea is that if somebody is running a dictionary or brute-force attack against one or many user accounts, that attacker will be required to wait out the fail delay, forcing the attack to take more time and giving you more chance to detect it.
You might also be interested in knowing that, depending on what you are using as a login shell, there is usually a way to configure this delay.
In GDM, the delay is set in the gdm.conf file (usually /etc/gdm/gdm.conf). You need to set RetryDelay=x, where x is a value in seconds.
Most Linux distributions these days also support FAIL_DELAY in /etc/login.defs, allowing you to set the wait time after a failed login attempt.
Finally, PAM also allows you to set a nodelay attribute on your auth line to bypass the fail delay. (Here's an article on PAM and Linux)
I don't see that it can be as simple as the other responses suggest.
If the response to a correct password is (some value of) immediate, don't you only have to wait longer than that value to know the password is wrong? (At least probabilistically, which is fine for cracking purposes.) And anyway you'd be running this attack in parallel... is this all one big DoS welcome mat?
What I tried before appeared to work, but actually did not; if you care you must review the wiki edit history...
What does work (for me) is to both lower the value of pam_faildelay.so delay=X in /etc/pam.d/login (I lowered it to 500000, half a second) and also add nodelay (preceded by a space) to the end of the line in common-auth, as described by Gabriel in his answer.
auth [success=1 default=ignore] pam_unix.so nullok_secure nodelay
At least for me (Debian sid), making only one of these changes will not shorten the delay appreciably below the default 3 seconds, although it is possible to lengthen the delay by only changing the value in /etc/pam.d/login.
This kind of crap is enough to make a grown man cry!
On Ubuntu 9.10, and I think newer versions too, the file you're looking for is
/etc/pam.d/login
Edit the line:
auth optional pam_faildelay.so delay=3000000
changing the 3000000 (3 seconds, expressed in microseconds) to whatever value you want.
Note that to get a 'nodelay' authentication, I THINK you should also edit the file
/etc/pam.d/common-auth
On the line:
auth [success=1 default=ignore] pam_unix.so nullok_secure
add 'nodelay' at the end (without the quotes).
But this last part about 'nodelay' is just my understanding.
I would like to add a note from a developer's perspective. Though it wouldn't be obvious to the naked eye, a smart developer would break out of a matching query as soon as the match is found, so a successful match would complete faster than a failed one. The matching function would compare the credentials to all known accounts until it finds the correct match. In other words, say there are 1,000,000 user accounts ordered by ID: 001, 002, 003 and so on, and your ID is 43,001. When you put in a correct username and password, the scan stops at 43,001 and logs you in. If your credentials are incorrect, it scans all 1,000,000 records. The difference in processing time on a dual-core server might be in the milliseconds; on Windows Vista with 5 user accounts it would be in the nanoseconds.
I agree that this is an arbitrary programming decision. Setting the delay to one second instead of three doesn't really hurt the crackability of the password, but it makes the system more user-friendly.
Technically, this deliberate delay is to prevent attacks like the "Linearization attack" (there are other attacks and reasons as well).
To illustrate the attack, consider a program (without this deliberate delay) that checks an entered serial against the correct serial, which in this case happens to be "xyba". For efficiency, the programmer decided to check one character at a time and to exit as soon as an incorrect character is found; the lengths are also checked before the loop begins.
A guess of the correct serial length will take longer to process than one of an incorrect length. Even better (for the attacker), a serial number with the first character correct will take longer than any with an incorrect first character. Each correct character means one more loop iteration and comparison, so the waiting time rises in steps with the length of the correct prefix.
So the attacker can try four-character strings and discover, by timing, that strings beginning with x take the longest.
The attacker can then fix the first character as x and vary the second character, finding that y takes the longest.
The attacker can then fix the first two characters as xy and vary the third character, finding that b takes the longest.
The attacker can then fix the first three characters as xyb and vary the fourth character, finding that a takes the longest.
Hence, the attackers can recover the serial one character at a time.
(The original answer includes Linearization.java and its sample output in Linearization.docx.)
The serial number is four characters long and each character has 128 possible values, so there are 128^4 = 2^28 = 268,435,456 possible serials. If the attacker must guess complete serial numbers at random, she would expect to find the serial in about 2^27 = 134,217,728 tries, which is an enormous amount of work. On the other hand, using the linearization attack above, an average of only 128/2 = 64 guesses is required for each letter, for a total expected work of about 4 * 64 = 2^8 = 256 guesses, which is a trivial amount of work.
Much of the written material above is adapted from Mark Stamp's "Information Security: Principles and Practice". Also, the calculations above do not take into account the guesswork needed to figure out the correct serial length.
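Here is a small self-contained sketch of that attack (the early-exit check and the four-character SECRET are made up for illustration, and timing is simulated by counting comparison steps rather than measuring wall-clock time):

// Vulnerable check: bails out at the first mismatching character.
const SECRET = 'xyba';
const ALPHABET = 'abcdefghijklmnopqrstuvwxyz';

function check(guess, counter) {
  if (guess.length !== SECRET.length) return false;
  for (let i = 0; i < SECRET.length; i++) {
    counter.steps++;                          // stand-in for elapsed time
    if (guess[i] !== SECRET[i]) return false; // early exit leaks the prefix length
  }
  return true;
}

let recovered = '';
for (let pos = 0; pos < SECRET.length; pos++) {
  let best = { char: '', steps: -1 };
  for (const c of ALPHABET) {
    const counter = { steps: 0 };
    const guess = recovered + c + '?'.repeat(SECRET.length - pos - 1);
    if (check(guess, counter)) { best = { char: c }; break; } // full match found
    // A correct prefix survives one more loop iteration than an incorrect one.
    if (counter.steps > best.steps) best = { char: c, steps: counter.steps };
  }
  recovered += best.char;
}
console.log(recovered); // prints "xyba", recovered one character at a time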
