I had never thought about it until recently, but I'm not sure why we call strings strings. I am a .NET programmer, but I believe the concept of strings exists in virtually every programming language.
Outside of programming, I don't believe I've heard the word string used to describe words or letters. A quick Google search for 'define: string' yields a bunch of definitions that have nothing to do with letters, words, or anything of that nature associated with programming.
My guess is that, back in the day, strings were really just arrays of characters of a particular length, often with a delimiting character at the end. But I don't see a natural transition from 'character array' to 'string'.
Can someone offer some insight into why we call strings strings?
My assumption has always been that the programming term originated from the following definition of the word "string" (from Merriam-Webster):
(1): a series of things arranged in or as if in a line <a string of cars> <a string of names>
(2): a sequence of like items (as bits, characters, or words)
Since a string in programming is simply an ordered sequence of characters, referring to this as a "string of characters" (or simply "string") seems like the most probable origin.
From this reference:
The 1971 OED (p. 3097) quotes an 1891 Century Dictionary on a source in the Milwaukee Sentinel of 11 Jan. 1898 (section 3, p. 1) to the effect that this is a compositor's term. Printers would paste up the text that they had generated in a long strip of characters. (Presumably, they were paid by the foot, not by the word!) The quote says that it was not unusual for compositors to create more than 1500 (characters?) per hour.
From searching through the ACM bibliography, it seems the word string acquired its meaning in computer science during the 1960s. At the beginning, a string was a general kind of sequence or list, e.g. "A command language for handling strings of symbols" from 1958.
This article explicitly mentions "character strings" in 1964.
Unfortunately I can't access the full texts, which are behind a toll booth.
I had guessed that "string" was in use by mathematicians long before its adoption in programming languages. Turing machines effectively operate on strings. Turing may not have used the term, but it is used everywhere in automata textbooks, going back decades.
The earliest reference I could find was a fragment in Google books of a 1944 article "Recursively enumerable sets of positive integers and their decision problems" by logician Emil Post in Bulletin of the AMS. Fortunately, AMS provides online archives of complete articles free for download. Here is a link: http://www.ams.org/journals/bull/1944-50-05/S0002-9904-1944-08111-1/S0002-9904-1944-08111-1.pdf
I think there is little doubt that he is using "string" in the conventional sense used in computer science. P. 286: "For working purposes, we introduce the letter b, and consider "strings" of 1's and b's such as 11b1bb1. An operation on such strings such as "b1bP produces P1bb1" we term a normal operation. This particular normal operation is applicable only to strings starting with b1b, and the derived string is then obtained from the given string by first removing the initial b1b, and then tacking on 1bb1 at the end. Thus b1bb becomes b1bb1."
I suspect it's because string originally meant just a sequence of data values: "I'll just string these together" etc. These values didn't have to be characters. One very common use for this general concept happened to be a sequence of characters, and this took over as the general meaning of the word.
The earliest reference I could find in computing is from March 1963's METEOR: A LISP Interpreter for String Transformations by Daniel G. Bobrow at MIT's AI Labs.
However, definition 15d. in the Oxford English Dictionary is:
Computing A linear sequence of records or data.
... and with a first quotation from a 1956 Journal of the Association for Computing Machinery:
Areas are set aside for shuttling strings of control fields back and forth until a completely sorted sequence is obtained.
This use naturally follows on from definition 15c.:
Math., etc. A sequence of symbols or linguistic elements in a definite order.
... and first used in Clarence Irving Lewis and Cooper Harold Langford's Symbolic Logic (1932):
Propositions are not strings of marks, or series of sounds, except incidentally.
This in turn follows on from many other, much earlier definitions for things in a line.
The word was originally used to differentiate between a collection of values whose particular order doesn't matter (for instance, a set of random measurement samples) and one whose meaning is preserved only when the order is preserved. Originally a string could hold values of any kind, but since in the post-mainframe era a string of characters is by far the most common kind, the fact that the values are characters became the "default".
A string is a sequence of discrete objects (usually char).
Given that, I would venture a guess that it has to do with the metaphor of a "string of pearls": each bead on the string is a single character.
It's called a string because it's actually an array of char-type elements.
That being said, the characters are strung together via this array, which turns them into a "string".
The lexical grammar of most programming languages is fairly non-expressive so that it can be lexed quickly. I'm not sure what category Rust's lexical grammar belongs to. Most of it seems regular, probably with the exception of raw string literals:
let s = r##"Hi lovely "\" and "#", welcome to Rust"##;
println!("{}", s);
Which prints:
Hi lovely "\" and "#", welcome to Rust
As we can add arbitrarily many #, it seems like it can't be regular, right? But is the grammar at least context-free? Or is there something non-context free about Rust's lexical grammar?
Related: Is Rust's syntactical grammar context-free or context-sensitive?
The raw string literal syntax is not context-free.
If you think of it as a string surrounded by r#^k"…"#^k (using ^k as a count operator, i.e. k repetitions of #), then you might expect it to be context-free:
raw_string_literal
: 'r' delimited_quoted_string
delimited_quoted_string
: quoted_string
| '#' delimited_quoted_string '#'
But that is not actually the correct syntax, because the quoted_string is not allowed to contain "#^k, although it can contain "#^j for any j < k.
Excluding the terminating sequence without excluding any other similar sequence of a different length cannot be accomplished with a context-free grammar because it involves three (or more) uses of the k-repetition in a single production, and stack automata can only handle two. (The proof that the grammar is not context-free is surprisingly complicated, so I'm not going to attempt it here for lack of MathJax. The best proof I could come up with uses Ogden's lemma and the uncommonly cited (but highly useful) property that context-free grammars are closed under the application of a finite-state transducer.)
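For intuition, here is a minimal Python sketch (not rustc's actual lexer, just a hand-written model) of how a scanner copes with such a literal: it counts the opening #s and then searches for a closing quote followed by the same number of #s. The unbounded counter is exactly the context-sensitivity described above.

def scan_raw_string(src, i):
    # assumes src[i] is the 'r' that introduces the raw string literal
    assert src[i] == "r"
    i += 1
    hashes = 0
    while src[i] == "#":
        hashes += 1
        i += 1
    assert src[i] == '"'
    i += 1
    body_start = i
    closer = '"' + "#" * hashes
    end = src.index(closer, i)   # raises ValueError if the literal is unterminated
    return src[body_start:end], end + len(closer)

body, next_pos = scan_raw_string('r##"Hi lovely "\\" and "#", welcome to Rust"##;', 0)
print(body)   # Hi lovely "\" and "#", welcome to Rust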
C++ raw string literals are also context-sensitive [or would be if the delimiter length were not limited, see Note 1], and pretty well all whitespace-sensitive languages (like Python and Haskell) are context-sensitive. None of these lexical analysis tasks is particularly complicated so the context-sensitivity is not a huge problem, although most standard scanner generators don't provide as much assistance as one might like. But there it is.
Rust's lexical grammar offers a couple of other complications for a scanner generator. One issue is the double meaning of ', which is used both to create character literals and to mark lifetime variables and loop labels. Apparently it is possible to determine which of these applies by considering the previously recognized token. That could be solved with a lexical scanner which is capable of generating two consecutive tokens from a single pattern, or it could be accomplished with a scannerless parser; the latter solution would be context-free but not regular. (C++'s use of ' as part of numeric literals does not cause the same problem; the C++ tokens can be recognized with regular expressions, because the ' can not be used as the first character of a numeric literal.)
Another slightly context-dependent lexical issue is that the range operator, .., takes precedence over floating point values, so that 2..3 must be lexically analysed as three tokens: 2 .. 3, rather than two floating point numbers 2. .3, which is how it would be analysed in most languages which use the maximal munch rule. Again, this might or might not be considered a deviation from regular expression tokenisation, since it depends on trailing context. But since the lookahead is at most one character, it could certainly be implemented with a DFA.
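That one-character lookahead can be sketched with an ordinary regular-expression alternation. The three token names below are a hypothetical subset for illustration, not Rust's real lexer: a '.' only starts a fractional part if it is not followed by another '.'.

import re

TOKEN = re.compile(r"""
      (?P<RANGE>\.\.)
    | (?P<FLOAT>\d+\.(?!\.)\d*)   # the '.' must not be followed by another '.'
    | (?P<INT>\d+)
""", re.VERBOSE)

print([(m.lastgroup, m.group()) for m in TOKEN.finditer("2..3")])
# [('INT', '2'), ('RANGE', '..'), ('INT', '3')]
print([(m.lastgroup, m.group()) for m in TOKEN.finditer("2.5")])
# [('FLOAT', '2.5')]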
Postscript
On reflection, I am not sure that it is meaningful to ask about a "lexical grammar". Or, at least, it is ambiguous: the "lexical grammar" might refer to the combined grammar for all of the language's tokens, or it might refer to the act of separating a sentence into tokens. The latter is really a transducer, not a parser, and raises the question of whether the language can be tokenised with a finite-state transducer. (The answer, again, is no, because raw strings cannot be recognized by an FSA, or even a PDA.)
Recognizing individual tokens and tokenising an input stream are not necessarily equivalent. It is possible to imagine a language in which the individual tokens are all recognized by regular expressions but an input stream cannot be handled with a finite-state transducer. That will happen if there are two regular expressions T and U such that some string matching T is the longest token which is a strict prefix of an infinite set of strings in U. As a simple (and meaningless) example, take a language with tokens:
a
a*b
Both of these tokens are clearly regular, but the input stream cannot be tokenized with a finite-state transducer because it must examine any sequence of a's (of any length) before deciding whether to fall back to the first a or to accept the token consisting of all the a's and the following b (if present).
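A small Python sketch of this pathology (the token names A and AB are made up for illustration): a maximal-munch tokenizer has to scan past the whole run of a's before it can decide which token to emit.

import re

TOKENS = [("AB", re.compile(r"a*b")), ("A", re.compile(r"a"))]

def maximal_munch(text):
    # at each position, emit the longest match among all token patterns;
    # choosing between A and AB means looking past an arbitrarily long run of a's
    out, i = [], 0
    while i < len(text):
        matches = [(name, p.match(text, i)) for name, p in TOKENS]
        name, m = max(matches, key=lambda nm: nm[1].end() if nm[1] else i)
        if m is None:
            raise ValueError("no token at position %d" % i)
        out.append((name, m.group()))
        i = m.end()
    return out

print(maximal_munch("aaab"))   # [('AB', 'aaab')]
print(maximal_munch("aaa"))    # [('A', 'a'), ('A', 'a'), ('A', 'a')]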
Few languages show this pathology (and, as far as I know, Rust is not one of them), but it is technically present in some languages in which keywords are multiword phrases.
Notes
Actually, C++ raw string literals are, in a technical sense, regular (and therefore context free) because their delimiters are limited to strings of maximum length 16 drawn from an alphabet of 88 characters. That means that it is (theoretically) possible to create a regular expression consisting of 13,082,362,351,752,551,144,309,757,252,761 patterns, each matching a different possible raw string delimiter.
When processing text, why would one need a tokenizer specialized for the language?
Wouldn't tokenizing by whitespace be enough? In which cases is it not a good idea to use simple whitespace tokenization?
Tokenization is the identification of linguistically meaningful units (LMU) from the surface text.
Chinese: 如果您在新加坡只能前往一间夜间娱乐场所,Zouk必然是您的不二之选。
English: If you only have time for one club in Singapore, then it simply has to be Zouk.
Indonesian: Jika Anda hanya memiliki waktu untuk satu klub di Singapura, pergilah ke Zouk.
Japanese: シンガポールで一つしかクラブに行く時間がなかったとしたら、このズークに行くべきです。
Korean: 싱가포르에서 클럽 한 군데밖에 갈시간이 없다면, Zouk를 선택하세요.
Vietnamese: Nếu bạn chỉ có thời gian ghé thăm một câu lạc bộ ở Singapore thì hãy đến Zouk.
Text Source: http://aclweb.org/anthology/Y/Y11/Y11-1038.pdf
The tokenized version of the parallel text above should look like this:
For English, it's simple because each LMU is delimited by whitespace. However, in other languages this might not be the case. Most romanized languages, such as Indonesian, have the same whitespace delimiter, which makes it easy to identify an LMU.
However, sometimes an LMU is a combination of two "words" separated by spaces. E.g. in the Vietnamese sentence above, you have to read thời_gian (it means time in English) as one token and not 2 tokens. Separating the two words into 2 tokens yields no LMU (e.g. http://vdict.com/th%E1%BB%9Di,2,0,0.html) or wrong LMU(s) (e.g. http://vdict.com/gian,2,0,0.html). Hence a proper Vietnamese tokenizer would output thời_gian as one token rather than thời and gian.
Other languages' orthographies might have no spaces to delimit "words" or "tokens", e.g. Chinese, Japanese and sometimes Korean. In that case, tokenization is necessary for a computer to identify LMUs. Often there are morphemes/inflections attached to an LMU, so sometimes a morphological analyzer is more useful than a tokenizer in Natural Language Processing.
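As a toy illustration of why whitespace alone is not enough, here is a greedy longest-match (maximal matching) segmenter in Python. The tiny lexicon is made up for this example; real tokenizers use large dictionaries and/or statistical models, and the same longest-match idea is what lets a Vietnamese tokenizer emit thời_gian as one token.

# hypothetical mini-lexicon; real systems use large dictionaries or trained models
LEXICON = {"如果", "您", "在", "新加坡", "只能", "前往", "一间", "夜间", "娱乐场所"}

def maximal_match(text, lexicon, max_len=6):
    tokens, i = [], 0
    while i < len(text):
        # try the longest dictionary entry first, fall back to a single character
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

print(maximal_match("如果您在新加坡只能前往一间夜间娱乐场所", LEXICON))
# ['如果', '您', '在', '新加坡', '只能', '前往', '一间', '夜间', '娱乐场所']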
Some languages, like Chinese, don't use whitespace to separate words at all.
Other languages will use punctuation differently - an apostrophe might or might not be a part of a word, for instance.
Case-folding rules vary from language to language.
Stopwords and stemming are different between languages (though I guess I'm straying from tokenizer to analyzer here).
Edit by Bjerva: Additionally, many languages concatenate compound nouns. Whether these should be tokenised into several tokens cannot be easily determined using only whitespace.
The question also implies "What is a word?" and can be quite task-specific (even disregarding multilinguality as one parameter). Here's my attempt at a subsuming answer:
(Missing) Spaces between words
Many languages do not put spaces in between words at all, and so the basic word division algorithm of breaking on whitespace is of no use at all. Such languages include major East-Asian languages/scripts, such as Chinese, Japanese, and Thai. Ancient Greek was also written by Ancient Greeks without word spaces. Spaces were introduced (together with accent marks, etc.) by those who came afterwards. In such languages, word segmentation is a much more major and challenging task. (MANNI:1999, p. 129)
Compounds
German compound nouns are written as a single word, e.g. "Kartellaufsichtsbehördenangestellter" (an employee at the "Anti-Trust agency"), and compounds de facto are single words -- phonologically (cf. (MANNI:1999, p. 120)). Their information density, however, is high, and one may wish to divide such a compound, or at least to be aware of the internal structure of the word, and this becomes a limited word segmentation task. (Ibidem)
There is also the special case of agglutinating languages; prepositions, possessive pronouns, ... 'attached' to the 'main' word; e.g. Finnish, Hungarian, Turkish in European domains.
Variant styles and codings
Variant coding of information of a certain semantic type, e.g. local syntax for phone numbers, dates, etc.:
[...] Even if one is not dealing with multilingual text, any application dealing with text from different countries or written according to different stylistic conventions has to be prepared to deal with typographical differences. In particular, some items such as phone numbers are clearly of one semantic sort, but can appear in many formats. (MANNI:1999, p. 130)
Misc.
One major task is the disambiguation of periods (or interpunctuation in general) and other non-alphanumeric symbols: if, e.g., a period is part of the word, keep it that way, so we can distinguish Wash., an abbreviation for the state of Washington, from the capitalized form of the verb wash (MANNI:1999, p. 129). Beyond cases like this, handling contractions and hyphenation also cannot be treated as a standard case across languages (even disregarding the missing whitespace separator).
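A minimal Python sketch of the "Wash." problem, using a made-up abbreviation list: splitting naively on periods breaks the sentence too early, while even a tiny abbreviation list fixes this particular case.

import re

ABBREVIATIONS = {"wash.", "mr.", "dr.", "etc."}   # toy list, far from complete

def naive_sentences(text):
    # break after every period that is followed by whitespace
    return re.split(r"(?<=\.)\s+", text)

def abbrev_aware_sentences(text):
    sentences = []
    for part in naive_sentences(text):
        if sentences and sentences[-1].split()[-1].lower() in ABBREVIATIONS:
            sentences[-1] += " " + part   # the previous "sentence" ended in an abbreviation
        else:
            sentences.append(part)
    return sentences

text = "He moved to Wash. last year. He likes it."
print(naive_sentences(text))         # ['He moved to Wash.', 'last year.', 'He likes it.']
print(abbrev_aware_sentences(text))  # ['He moved to Wash. last year.', 'He likes it.']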
If one wants to handle multilingual contractions/"cliticons":
English: They're my father's cousins.
French: Montrez-le à l'agent!
German: Ich hab's ins Haus gebracht. (in's is still a valid variant)
Since tokenization and sentence segmentation go hand in hand, they share the same (cross-language) problems. For whoever is concerned and wants a general direction:
Kiss, Tibor and Jan Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics 32(4), p. 485-525.
Palmer, D. and M. Hearst. 1997. Adaptive Multilingual Sentence Boundary Disambiguation. Computational Linguistics 23(2), p. 241-267.
Reynar, J. and A. Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. Proceedings of the Fifth Conference on Applied Natural Language Processing, p. 16-19.
References
(MANNI:1999) Manning Ch. D., H. Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press.
The Unicode Normalization FAQ includes the following paragraph:
Programs should always compare canonical-equivalent Unicode strings as equal ... The Unicode Standard provides well-defined normalization forms that can be used for this: NFC and NFD.
and continues...
The choice of which to use depends on the particular program or system. NFC is the best form for general text, since it is more compatible with strings converted from legacy encodings. ... NFD and NFKD are most useful for internal processing.
My questions are:
What makes NFC best for "general text"? What defines "internal processing", and why is it best left to NFD? And finally, never mind what is "best": are the two forms interchangeable as long as two strings are compared using the same normalization form?
The FAQ is somewhat misleading, starting from its use of “should” followed by the inconsistent use of “requirement” about the same thing. The Unicode Standard itself (cited in the FAQ) is more accurate. Basically, you should not expect programs to treat canonically equivalent strings as different, but neither should you expect all programs to treat them as identical.
In practice, it really depends on what your software needs to do. In most situations, you don’t need to normalize at all, and normalization may destroy essential information in the data.
For example, U+0387 GREEK ANO TELEIA (·) is defined as canonical equivalent to U+00B7 MIDDLE DOT (·). This was a mistake, as the characters are really distinct and should be rendered differently and treated differently in processing. But it’s too late to change that, since this part of Unicode has been carved into stone. Consequently, if you convert data to NFC or otherwise discard differences between canonically equivalent strings, you risk getting wrong characters.
There are risks that you take by not normalizing. For example, the letter “ä” can appear as a single Unicode character U+00E4 LATIN SMALL LETTER A WITH DIAERESIS or as two Unicode characters U+0061 LATIN SMALL LETTER A U+0308 COMBINING DIAERESIS. It will mostly be the former, i.e. the precomposed form, but if it is the latter and your code tests for data containing “ä”, using the precomposed form only, then it will not detect the latter. But in many cases, you don’t do such things but simply store the data, concatenate strings, print them, etc. Then there is a risk that the two representations result in somewhat different renderings.
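This is easy to see with Python's standard unicodedata module: both spellings render as "ä" but compare unequal until they are normalized to the same form.

import unicodedata

precomposed = "\u00E4"     # LATIN SMALL LETTER A WITH DIAERESIS
decomposed  = "a\u0308"    # LATIN SMALL LETTER A + COMBINING DIAERESIS

print(precomposed == decomposed)                   # False
print(unicodedata.normalize("NFC", precomposed) ==
      unicodedata.normalize("NFC", decomposed))    # True
print(len(unicodedata.normalize("NFC", decomposed)),
      len(unicodedata.normalize("NFD", precomposed)))   # 1 2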
It also matters whether your software passes character data to other software somehow. The recipient might expect, due to naive implicit assumptions or consciously and in a documented manner, that its input is normalized.
NFC is the general, common-sense form that you should use; "ä" is one code point there, and that makes sense.
NFD is good for certain internal processing - if you want to make accent-insensitive searches or sorting, having your string in NFD makes it much easier and faster. Another usage is making more robust slug titles. These are just the most obvious ones, I am sure there are plenty of more uses.
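For example, a common accent-insensitive comparison trick is to decompose to NFD and drop the combining marks (general category Mn). This is a simplification that works for many Latin-script cases, but not for every language.

import unicodedata

def strip_accents(s):
    # decompose, then drop combining marks (general category Mn)
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

print(strip_accents("café") == strip_accents("cafe"))   # True
print(strip_accents("Zürich"))                          # Zurich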
If two strings x and y are canonical equivalents, then
toNFC(x) = toNFC(y)
toNFD(x) = toNFD(y)
Is that what you meant?
Is there a finite number of questions that can be asked regarding a specific language (and/or topic)? For example, for T-SQL, given that there are only so many commands, can there be a limited number of non-repetitive questions? And if so, can you use that to determine sizing for a site like Stack Overflow and to determine the probability of a new question being a repeat of a prior one? If there is a finite number, how would you determine/calculate it? For instance, T-SQL has x commands, and each one can have a set of relevant questions (syntax, example of use, etc.), so could the number of questions be x times the potential questions per command, times some relevant variation? Or something like that?
No, since, theoretically, programs can be of infinite length, and this site is not just about language commands, but programs developed with those languages.
I'm pretty sure Turing says no, and if you don't believe him then Gödel might have something to say about it.
A Stack Overflow question is expressed as a finite-length sequence of bytes. One could in principle consider the question body as an integer, expressed lowest digit first, in base 256 (or larger, if you wish to think about it as Unicode). This is a bijection between questions and whole numbers. Therefore the set of all possible Stack Overflow questions has a countably infinite cardinality (how do I typeset \aleph_0 on SO?).
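That bijection is easy to sketch in Python (the question body here is of course made up):

body = "How do I convert a string to an integer?"   # hypothetical question body
data = body.encode("utf-8")

# read the bytes lowest digit first, in base 256
n = int.from_bytes(data, byteorder="little")
print(n)

# and map the number back to the original question
print(n.to_bytes((n.bit_length() + 7) // 8, "little").decode("utf-8"))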
I personally do not like programming languages being case sensitive.
(I know that the disadvantages of case sensitivity are nowadays compensated for by good IDEs.)
Still I would like to know whether there are any advantages for a programming language if it is case sensitive. Is there any reason why designers of many popular languages chose to make them case sensitive?
EDIT: duplicate of Why are many languages case sensitive?
EDIT: (I cannot believe I asked this question a few years ago)
This is a preference. I prefer case sensitivity; I find it easier to read code this way. For instance, the variable name "myVariable" has a different word shape than "MyVariable," "MYVARIABLE," and "myvariable." This makes it more straightforward to tell such identifiers apart at a glance. Of course, you should rarely, if ever, create identifiers that differ only in case. This is more about consistency than the obvious "benefit" of increasing the number of possible identifiers. Some people think this is a disadvantage. I can't think of any time when case sensitivity gave me any problems. But again, this is a preference.
Case-sensitivity is inherently faster to parse (albeit only slightly) since it can compare character sequences directly without having to figure out which characters are equivalent to each other.
It allows the implementer of a class/library to control how casing is used in the code. Case may also be used to convey meaning.
The code looks more the same. In the days of BASIC these were equivalent:
PRINT MYVAR
Print MyVar
print myvar
With type checking, case sensitivity prevents a misspelling from slipping through as an unrecognized variable. I have fixed bugs in code written in a case-insensitive, untyped language (FORTRAN 77), where the zero (0) and the capital letter O looked the same in the editor. The language silently created a new object, and so the output was flawed. With a case-sensitive, typed language, this would not have happened.
In the compiler or interpreter, a case-insensitive language is going to have to make everything upper or lowercase to test for matches, or otherwise use a case insensitive matching tool, but that's only a small amount of extra work for the compiler.
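As a rough illustration (a toy symbol table, not a real compiler): a case-insensitive language has to fold case on every lookup, while a case-sensitive one can compare spellings directly.

symbols = {"MyVar": 42, "myVar": 7}   # legal together only in a case-sensitive language

# case-sensitive lookup: a direct comparison / hash lookup
print("MYVAR" in symbols)                                            # False

# case-insensitive lookup: every comparison needs an extra case-folding step
print("MYVAR".casefold() in {name.casefold() for name in symbols})   # True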
Plus case-sensitive code allows certain patterns of declarations such as
MyClassName myClassName = new MyClassName()
and other situations where case sensitivity is nice.