How does the OpenAI API make tokens (tokenize) from all kinds of different languages? - nlp

We all know that ChatGPT can accept and produce text in all kinds of languages such as English, French, Chinese, Japanese and so on.
In traditional NLP, different languages use different tokenization methods.
For alphabetic languages such as English, models like BERT use subword methods (BPE or WordPiece) to make tokens like below:
Insomnia caused much frustration.
==>
In-, som-, nia, caus-, ed, much, frus-, tra-, tion, .,
For character-based languages such as Chinese or Japanese, each character itself is often used as the token, like below.
東京メトロは心に寄り添う
==>
東, 京, メ, ト, ロ, は, 心, に, 寄, り, 添, う,
我说你倒是快点啊!!!
==>
我, 说, 你, 倒, 是, 快, 点, 啊, !, !, !,
But ChatGPT handles many different languages and can produce both Chinese and English in one sentence, so I am really curious how this model makes tokens.

Use the Tokenizer to understand how a piece of text would be tokenized by the OpenAI API.
For example, Insomnia caused much frustration. would be tokenized as 6 tokens.
Whereas 我说你倒是快点啊!!! would be tokenized as 27 tokens, with a note at the bottom:
Note: Your input contained one or more unicode characters that map to
multiple tokens. The output visualization may display the bytes in
each token in a non-standard way.
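If you want to see this programmatically, below is a minimal sketch using tiktoken, the open-source tokenizer library OpenAI publishes. The current tokenizers are byte-level BPE: the text is first encoded as UTF-8 bytes, so one vocabulary covers English, Chinese, Japanese and everything else, and a single CJK character (usually 3 bytes) may map to one token or be split across several, which is what the note above refers to. Token counts depend on which encoding you choose, so the 6 and 27 quoted above (from the web Tokenizer at the time) may not match exactly.

import tiktoken

# cl100k_base is one of OpenAI's published encodings; other models use
# different encodings, so counts will not necessarily match the web
# Tokenizer output quoted above.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["Insomnia caused much frustration.", "我说你倒是快点啊!!!"]:
    ids = enc.encode(text)
    # Each token corresponds to a byte sequence, not a character: a CJK
    # character is 3 UTF-8 bytes and may be split across tokens.
    pieces = [enc.decode_single_token_bytes(i) for i in ids]
    print(f"{text!r}: {len(ids)} tokens -> {pieces}")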

Related

How to use NLP to detect sentences in a long text?

I am using automatic speech recognition to extract text from an audio file. However, the output is just a long sequence of words with no punctuation whatsoever. What I'd like to do is use some NLP technique to estimate beginnings and endings of sentences, or, in other words, predict positions of punctuation markers. I found that CoreNLP can do sentence splitting, but apparently only if punctuation is already present.
You may find relevant info in the answers to this other question: Sentence annotation in text without punctuation.
In particular, one of the answers claims the deepsegment package works well on unpunctuated text.
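For what it's worth, deepsegment is typically used roughly as follows (a minimal sketch based on the package's documented API; the exact class and method names are an assumption and may vary between versions):

from deepsegment import DeepSegment   # assumed API, per the package docs

segmenter = DeepSegment('en')          # load the pretrained English model
asr_output = "i went to the store it was closed so i came back home"
print(segmenter.segment(asr_output))   # expected: a list of sentence strings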
In spoken language you often find that people don't use sentences, but that the clauses simply run into each other. The degree to which this happens depends on the formality and setting -- a speech will conform more to written sentence structures than a conversation in a pub among friends.
One approach you could try is to identify words that typically begin/end sentences in written text, and see if that can help you segment your data. Or look for verbs, and then try to find boundaries between them; these might be clause boundaries rather than sentence boundaries, but as I said, in spoken language there often are no sentences.
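As a rough illustration of that idea (not code from the answer above), the sketch below proposes a boundary whenever a word from a small, hand-picked list of typical sentence starters appears; the word list and the rule are purely illustrative assumptions:

# Purely illustrative heuristic: break before typical sentence-starting words.
STARTERS = {"i", "we", "you", "he", "she", "it", "they", "then", "now"}

def rough_segments(words):
    segments, current = [], []
    for word in words:
        if current and word.lower() in STARTERS:
            segments.append(" ".join(current))
            current = []
        current.append(word)
    if current:
        segments.append(" ".join(current))
    return segments

print(rough_segments("i went to the store it was closed so i came back home".split()))
# -> ['i went to the store', 'it was closed so', 'i came back home']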

Stanford NLP POS Tagger has issues with very simple phrases?

I found examples of inconsistent behavior in my application using the Stanford NLP Parser/POS Tagger, and I was able to replicate it online at http://nlp.stanford.edu:8080/corenlp/process . I am using version 3.6.0:
Here are the 3 issues I have found so far:
Inconsistent results depending on whether the sentence ends with a period
Verbs that are tagged as nouns
Verbs that are tagged as adjectives
I know that language is fairly ambiguous, but I would like to know whether I can trust this library even for these simple phrases. I would also like to know if I am doing something wrong. I tried the problematic part of each example on its own, in other words in separate sentences, and the problem persists.
(The expected behavior for each case was shown in the original post.)
Any help is appreciated! Thanks
You're not doing anything wrong. You're of course welcome to decide for yourself how much to trust any tool, but I suspect you'll see similar issues with any parser trained empirically/statistically. As to your issues:
Periods are treated like any other token in model building, so, yes, they can influence the parse chosen.
There are indeed a lot of ambiguities in English (as there are in all other human languages), and the question of whether to interpret forms ending in ing as verbs, nouns (verbal nouns or gerunds), or adjectives is a common one. The parser does not always get it right.
In terms of particular bad choices it made, often they reflect usage/domain mismatches between the parser training data and the sentences you are trying. The training data is predominantly news articles – last millennium news articles for that matter – although we do mix in some other data and occasionally add to it. So:
The use of flagging as a verb, common in modern internet developer use, doesn't occur at all in the training data, so it not surprisingly tends to choose JJ for flagging, since that's the analysis of the only cases in the training data.
In news articles drinking is just more commonly a noun, with discussions of underage drinking, coffee drinking, drinking and driving, etc.
The different results from POS taggers were driving me crazy, so for sanity checks I finally wrote something to quickly compare results across the three I typically use (Stanford NLP, NLTK 3.2.1 and Senna).
It also times them, since one tagger can often choke on certain text.
https://github.com/StealthyK/TaggerTimer
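If you just want a quick sanity check of the period effect without the linked tool, something like the following NLTK snippet tags the same tokens with and without a trailing period. This is only an illustration, not the tool above, and NLTK's averaged perceptron tagger will not necessarily agree with Stanford's choices:

import nltk

# Resource name may differ slightly across NLTK versions.
nltk.download("averaged_perceptron_tagger", quiet=True)

for sentence in ["He is flagging the error", "He is flagging the error ."]:
    tokens = sentence.split()
    print(nltk.pos_tag(tokens))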

Unknown word handling in Part of speech Tagger

What is the correct way to handle unknown words?
I am confused about the order of checks: should I first check whether the word starts with a capital letter, or first check for the suffix?
Should I learn from the corpus that capitalized words tend to be nouns, or just assign them a noun tag blindly?
Which would be the better approach?
Your question is probably too broad to answer properly, but given your level of abstraction, here are a few things to consider; the honest answer is "it depends".
Capitalization is not a good universal strategy because different languages have different capitalization norms. In German, every properly spelled Noun is written with a Capital Letter, whereas some languages do not distinguish between upper and lower case at all (and some scripts lack this distinction -- Arabic, Hebrew, Thai, Devanagari, not to mention Far Eastern scripts which of course are a completely different challenge altogether).
In English, obviously, capitalization is a good indicator that you are probably looking at a proper noun, but the absence of capitalization does not help you decide the correct POS at all.
Suffix matching is one of many possible categories for deciding the POS of an unknown word. Your choice of wording -- "the suffix" -- implies you have a very simplistic understanding of word formation. Some languages have suffix derivation and inflection but there are many other patterns. Swahili inflection uses prefixes, Arabic and Hebrew use infixes (which are however not marked orthographically), some languages mark plural through reduplication, etc.
Though it's no longer state of the art, a look at the Brill tagger is probably a good start for a better understanding of possible strategies.
A competing approach is to use syntactic constraints to disambiguate the role of each word. An application of constraint grammar is to use the POS tags of surrounding words to decide the most likely reading of an ambiguous or unknown word.
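As a concrete illustration of the suffix and capitalization ideas (not a recommended production setup), NLTK's RegexpTagger can act as a simple fallback guesser. The patterns below are English-only assumptions for demonstration; in practice you would attach something like this as a backoff behind a tagger trained on your corpus:

import nltk

# English-only, illustrative fallback rules: suffixes first, then
# capitalization, then a default noun tag. Patterns are tried in order.
fallback = nltk.RegexpTagger([
    (r".*ing$", "VBG"),     # gerund / present participle
    (r".*ed$", "VBD"),      # simple past
    (r".*ly$", "RB"),       # adverb
    (r".*able$", "JJ"),     # adjective
    (r"^[A-Z].*$", "NNP"),  # capitalized -> proper-noun guess
    (r".*", "NN"),          # default: common noun
])

print(fallback.tag(["Tokyo", "flagging", "quickly", "zorbled"]))
# Typically used as a backoff, e.g. nltk.UnigramTagger(train_sents, backoff=fallback)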
Are you trying to write your own POS-tagger?
If not, I suggest you use the Stanford POS-tagger, or some other open source software. It will attempt to assign each word in a sentence the correct POS-tag. You can download it here:
http://nlp.stanford.edu/software/tagger.shtml
This paper presents a simple lexicon-based approach for tagging unknown words. It shows that the lexicon-based approach obtains promising tagging results for unknown words in 13 languages, including Bulgarian, Czech, Dutch, English, French, German, Hindi, Italian, Portuguese, Spanish, Swedish, Thai and Vietnamese.
In addition, the paper also reports accuracy results (for known words and unknown words) of 3 POS and morphological taggers on the 13 languages.

What methods are used for recognizing language a text is written in?

Given a text (long or short), which methods are usually used to detect the language it is written in?
It is clear that:
You need a training corpus to train the models you use (e.g. neural networks, if used)
Easiest thing coming to my mind is:
Check characters used in the text (e.g. hiragana are only used in Japanese, Umlauts probably only in European languages, ç in French, Turkish, …)
Extend the check to two- or three-letter sequences to find combinations specific to a language
Look words up in a dictionary to check which words occur in which language (probably only without stemming, as stemming depends on the language)
But I guess there are better ways to go. I am not searching for existing projects (those questions have already been answered), but for methods like Hidden-Markov-Models, Neural Networks, … whatever may be used for this task.
In the product I'm working on we use a dictionary-based approach.
First, relative probabilities for all words in the training corpus are calculated, and this is stored as a model.
Then the input text is processed word by word to see whether a particular model gives the best match (much better than the other models).
In some cases all models provide quite a bad match.
A few interesting points:
As we are working with social media, both normalized and non-normalized matches are attempted (in this context, normalization is the removal of diacritics from symbols). Non-normalized matches have a higher weight.
This method works rather badly on very short phrases (1-2 words), in particular when those words exist in several languages, which is the case for a few European languages.
Also, for better detection we are considering adding a per-character model as you described (certain languages have certain unique characters).
By the way, we use the ICU library to split words. It works rather well for European and East Asian languages (currently we support Chinese).
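That dictionary-based idea can be sketched in a few lines (the toy corpora and the add-one smoothing below are invented for demonstration, not the product's actual model): build a unigram log-probability model per language and pick the one that scores the input best.

import math
from collections import Counter

# Toy training "corpora"; in practice these would be large per-language corpora.
corpora = {
    "en": "the cat sat on the mat and the dog barked".split(),
    "de": "die katze sass auf der matte und der hund bellte".split(),
}

# Per-language unigram models with add-one smoothing for unseen words.
models = {}
for lang, words in corpora.items():
    counts = Counter(words)
    models[lang] = {"counts": counts, "total": sum(counts.values()), "vocab": len(counts)}

def score(text, model):
    counts, total, vocab = model["counts"], model["total"], model["vocab"]
    return sum(
        math.log((counts.get(w, 0) + 1) / (total + vocab + 1))
        for w in text.lower().split()
    )

text = "the dog sat on the mat"
print(max(models, key=lambda lang: score(text, models[lang])))  # -> 'en'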
Check the Cavnar and Trenkle algorithm.
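The Cavnar and Trenkle (1994) method builds per-language character n-gram frequency rank profiles and picks the language whose profile has the smallest "out-of-place" distance to the document's profile. A compact sketch of that idea, with toy training text standing in for real corpora:

from collections import Counter

def ngram_profile(text, n_max=3, top=300):
    # Rank character n-grams (1..n_max) by frequency, most frequent first.
    padded = f" {text.lower()} "
    grams = Counter(
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    )
    return {g: rank for rank, (g, _) in enumerate(grams.most_common(top))}

def out_of_place(doc_profile, lang_profile):
    # Sum of rank differences; n-grams missing from the language profile
    # get a maximum penalty.
    penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile.get(g, penalty))
        for g, rank in doc_profile.items()
    )

# Toy profiles; real ones are built from much larger corpora.
profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog"),
    "fr": ngram_profile("le renard brun rapide saute par-dessus le chien paresseux"),
}
doc = ngram_profile("the dog and the fox")
print(min(profiles, key=lambda lang: out_of_place(doc, profiles[lang])))  # -> 'en'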

Word/Phoneme Corpus for an Elman SRN (English)

I'm writing an Elman Simple Recurrent Network. I want to give it sequences of words, where each word is a sequence of phonemes, and I want a lot of training and test data.
So, what I need is a corpus of English words, together with the phonemes they're made up of, written as something like ARPAbet or SAMPA. British English would be nice but is not essential so long as I know what I'm dealing with. Any suggestions?
I do not currently have the time or the inclination to code something that derives the phonemes a word is comprised of from spoken or written data so please don't propose that.
Note: I'm aware of the CMU Pronouncing Dictionary, but it claims it's only based on the ARPABet symbol set - anyone know if there are actually any differences and if so what they are? (If there aren't any then I could just use that...)
EDIT: CMUPD 0.7a Symbol list - vowels may have lexical stress, and there are variants (of ARPABET standard symbols) indicating this.
CMUdict should be fine. "Arpabet symbol set" just means Arpabet. If there are any minor differences, they should be explained in the CMUdict documentation.
If you need data that's closer to real life than stringing together dictionary pronunciations of individual words, look for phonetically transcribed corpora, e.g., TIMIT.
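If it helps, the CMU Pronouncing Dictionary also ships as an NLTK corpus, so getting word-to-ARPAbet sequences (with the stress digits on vowels mentioned in the edit) takes only a few lines:

import nltk

nltk.download("cmudict", quiet=True)
from nltk.corpus import cmudict

pronouncing = cmudict.dict()   # word -> list of ARPAbet pronunciations
# Digits on vowels mark lexical stress (0 = none, 1 = primary, 2 = secondary).
for word in ["insomnia", "frustration"]:
    print(word, pronouncing[word])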
