Start state for a DFA in Automata - regular-language

Suppose a DFA has to be designed which accepts all strings over Σ = {0,1} that start and end with the same symbol (e.g. 0110, 10101). Is ε an acceptable string? In other words, is the start state a final state?

It depends entirely on what is meant. Human languages are vague and imprecise; that's why we invent formalisms like regular expressions in the first place.
If this is an exercise, I would ask whoever is giving you the exercise for clarification. On the surface, two interpretations seem reasonable:
The empty string does not start and end with different letters, so it should be included
The empty string does not start and end with the same letter, so it should be excluded
If it is an exercise and you have the original wording, you can provide a quote, but as stated, the answer is simply not clear. If homework, you could always provide two DFAs, one for each interpretation, with some discussion of the ambiguity.
If it is just a question you made up, then you will have to answer for yourself whether you want the empty string in your language.

YES.
The string ε belongs to {0,1}* and its start and end symbols are not different, so it should be accepted by the DFA; that is, the start state should be a final state.
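For concreteness, here is a minimal sketch of one such DFA, written as an OCaml simulation (the state names are my own invention, not from the question); the start state is made accepting so that ε is accepted, as argued above:

type state =
  | Start     (* accepting: the empty string *)
  | ZeroSame  (* accepting: started with 0 and currently ends with 0 *)
  | ZeroDiff  (* rejecting: started with 0 and currently ends with 1 *)
  | OneSame   (* accepting: started with 1 and currently ends with 1 *)
  | OneDiff   (* rejecting: started with 1 and currently ends with 0 *)

let step state c =
  match state, c with
  | Start, '0' -> ZeroSame
  | Start, '1' -> OneSame
  | (ZeroSame | ZeroDiff), '0' -> ZeroSame
  | (ZeroSame | ZeroDiff), '1' -> ZeroDiff
  | (OneSame | OneDiff), '1' -> OneSame
  | (OneSame | OneDiff), '0' -> OneDiff
  | _ -> invalid_arg "alphabet is {0,1}"

let accepts s =
  let n = String.length s in
  let rec run state i =
    if i >= n then state
    else run (step state s.[i]) (i + 1)
  in
  match run Start 0 with
  | Start | ZeroSame | OneSame -> true
  | ZeroDiff | OneDiff -> false

With these definitions accepts "0110", accepts "10101" and accepts "" all evaluate to true, while accepts "10" evaluates to false.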

Related

Is Rust's lexical grammar regular, context-free or context-sensitive?

The lexical grammar of most programming languages is fairly non-expressive so that it can be lexed quickly. I'm not sure what category Rust's lexical grammar belongs to. Most of it seems regular, probably with the exception of raw string literals:
let s = r##"Hi lovely "\" and "#", welcome to Rust"##;
println!("{}", s);
Which prints:
Hi lovely "\" and "#", welcome to Rust
As we can add arbitrarily many #, it seems like it can't be regular, right? But is the grammar at least context-free? Or is there something non-context free about Rust's lexical grammar?
Related: Is Rust's syntactical grammar context-free or context-sensitive?
The raw string literal syntax is not context-free.
If you think of it as a string surrounded by r#^k"…"#^k (writing ^k for k repetitions of the preceding character), then you might expect it to be context-free:
raw_string_literal
: 'r' delimited_quoted_string
delimited_quoted_string
: quoted_string
| '#' delimited_quoted_string '#'
But that is not actually the correct syntax, because the quoted_string is not allowed to contain "#^k, although it can contain "#^j for any j < k.
Excluding the terminating sequence without excluding any other similar sequence of a different length cannot be accomplished with a context-free grammar because it involves three (or more) uses of the k-repetition in a single production, and stack automata can only handle two. (The proof that the grammar is not context-free is surprisingly complicated, so I'm not going to attempt it here for lack of MathJax. The best proof I could come up with uses Ogden's lemma and the uncommonly cited (but highly useful) property that context-free grammars are closed under the application of a finite-state transducer.)
C++ raw string literals are also context-sensitive [or would be if the delimiter length were not limited, see Note 1], and pretty well all whitespace-sensitive languages (like Python and Haskell) are context-sensitive. None of these lexical analysis tasks is particularly complicated so the context-sensitivity is not a huge problem, although most standard scanner generators don't provide as much assistance as one might like. But there it is.
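To make that concrete, here is a minimal hand-written scanner sketch in OCaml (the name scan_raw_string and its interface are my own, not taken from any real Rust lexer). It counts the opening hashes and then looks for a closing quote followed by at least that many hashes; the counter k is exactly what a finite automaton or context-free grammar cannot supply, yet the procedure is trivial:

(* Scan a Rust-style raw string literal starting at position i of src:
   the letter r, k hashes, an opening quote, the body, then a closing
   quote followed by k hashes. Returns Some body on success. *)
let scan_raw_string src i =
  let n = String.length src in
  (* number of consecutive hash characters starting at position p *)
  let rec hashes p = if p < n && src.[p] = '#' then 1 + hashes (p + 1) else 0 in
  if i >= n || src.[i] <> 'r' then None
  else
    let k = hashes (i + 1) in
    let q = i + 1 + k in                    (* position of the opening quote *)
    if q >= n || src.[q] <> '"' then None
    else
      (* the literal ends at the first quote followed by at least k hashes *)
      let rec find p =
        if p >= n then None
        else if src.[p] = '"' && hashes (p + 1) >= k then
          Some (String.sub src (q + 1) (p - q - 1))
        else find (p + 1)
      in
      find (q + 1)

Applied at the position of the r in the example above, this returns Some of exactly the text that the println! call prints.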
Rust's lexical grammar offers a couple of other complications for a scanner generator. One issue is the double meaning of ', which is used both to create character literals and to mark lifetime variables and loop labels. Apparently it is possible to determine which of these applies by considering the previously recognized token. That could be solved with a lexical scanner which is capable of generating two consecutive tokens from a single pattern, or it could be accomplished with a scannerless parser; the latter solution would be context-free but not regular. (C++'s use of ' as part of numeric literals does not cause the same problem; the C++ tokens can be recognized with regular expressions, because the ' can not be used as the first character of a numeric literal.)
Another slightly context-dependent lexical issue is that the range operator, .., takes precedence over floating point values, so that 2..3 must be lexically analysed as three tokens: 2 .. 3, rather than two floating point numbers 2. .3, which is how it would be analysed in most languages which use the maximal munch rule. Again, this might or might not be considered a deviation from regular expression tokenisation, since it depends on trailing context. But since the lookahead is at most one character, it could certainly be implemented with a DFA.
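As a sketch of that one-character lookahead (OCaml again, and purely my own illustration rather than Rust's actual lexer): while scanning a number, a dot is only kept as part of the token when the character after it is not another dot, so 2..3 lexes as 2, .., 3:

(* Scan a numeric token starting at position i; return the token text
   and the position just past it. *)
let scan_number src i =
  let n = String.length src in
  let is_digit c = c >= '0' && c <= '9' in
  let rec digits p = if p < n && is_digit src.[p] then digits (p + 1) else p in
  let p = digits i in                              (* integer part *)
  if p < n && src.[p] = '.'
     && not (p + 1 < n && src.[p + 1] = '.')       (* lookahead: not a range *)
  then
    let q = digits (p + 1) in
    (String.sub src i (q - i), q)                  (* floating point literal *)
  else
    (String.sub src i (p - i), p)                  (* integer literal *)

Here scan_number "2..3" 0 yields ("2", 1), leaving the .. for the next token, while scan_number "2.5" 0 yields ("2.5", 3).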
Postscript
On reflection, I am not sure that it is meaningful to ask about a "lexical grammar". Or, at least, it is ambiguous: the "lexical grammar" might refer to the combined grammar for all of the language's tokens, or it might refer to the act of separating a sentence into tokens. The latter is really a transducer, not a parser, and suggests the question of whether the language can be tokenised with a finite-state transducer. (The answer, again, is no, because raw strings cannot be recognized by a FSA, or even a PDA.)
Recognizing individual tokens and tokenising an input stream are not necessarily equivalent. It is possible to imagine a language in which the individual tokens are all recognized by regular expressions but an input stream cannot be handled with a finite-state transducer. That will happen if there are two regular expressions T and U such that some string matching T is the longest token which is a strict prefix of an infinite set of strings in U. As a simple (and meaningless) example, take a language with tokens:
a
a*b
Both of these tokens are clearly regular, but the input stream cannot be tokenized with a finite-state transducer because it must examine any sequence of a's (of any length) before deciding whether to fall back to the single-a token or to accept the token consisting of all the a's and the following b (if present).
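A sketch of a maximal-munch tokenizer for this two-token language (OCaml, names mine) makes the problem visible: it has to read an entire run of a's before it can decide whether to emit one a*b token or a sequence of single-a tokens, which is precisely the unbounded buffering a finite-state transducer cannot do:

type token = A | A_STAR_B

let tokenize s =
  let n = String.length s in
  (* length of the run of the letter a starting at position i *)
  let rec count_as i = if i < n && s.[i] = 'a' then 1 + count_as (i + 1) else 0 in
  let rec go i acc =
    if i >= n then List.rev acc
    else
      let k = count_as i in
      if i + k < n && s.[i + k] = 'b' then
        go (i + k + 1) (A_STAR_B :: acc)      (* the whole run plus b is one token *)
      else if k > 0 then
        go (i + k) (List.rev_append (List.init k (fun _ -> A)) acc)
      else
        invalid_arg "alphabet is {a, b}"      (* neither a nor b *)
  in
  go 0 []

So tokenize "aaab" is [A_STAR_B] while tokenize "aaa" is [A; A; A], and the choice between the two cannot be made until the run of a's ends.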
Few languages show this pathology (and, as far as I know, Rust is not one of them), but it is technically present in some languages in which keywords are multiword phrases.
Notes
Actually, C++ raw string literals are, in a technical sense, regular (and therefore context free) because their delimiters are limited to strings of maximum length 16 drawn from an alphabet of 88 characters. That means that it is (theoretically) possible to create a regular expression consisting of 13,082,362,351,752,551,144,309,757,252,761 patterns, each matching a different possible raw string delimiter.

Algorithm (or pointer to literature) sought for string processing challenge

A group of amusing students write essays exclusively by plagiarising portions of the complete works of William Shakespeare. At one end of the scale, an essay might consist exclusively of a verbatim copy of a soliloquy... at the other, one might see work so novel that - while using a common alphabet - no two adjacent characters in the essay were used adjacently by Will.
Essays need to be graded. A score of 1 is assigned to any essay which can be found (character-by-character identical) in the plain-text of the complete works. A score of 2 is assigned to any work that can be successfully constructed from no fewer than two distinct (character-by-character identical) passages in the complete works, and so on... up to the limit - for an essay with N characters - which scores N if, and only if, no two adjacent characters in the essay were also placed adjacently in the complete works.
The challenge is to implement a program which can efficiently (and accurately) score essays. While any (practicable) data-structure to represent the complete works is acceptable - the essays are presented as ASCII strings.
Having considered this teasing question for a while, I came to the conclusion that it is much harder than it sounds. The naive solution, for an essay of length N, involves 2**(N-1) traversals of the complete works - which is far too inefficient to be practical.
While, obviously, I'm interested in suggested solutions - I'd also appreciate pointers to any literature that deals with this, or any similar, problem.
CLARIFICATIONS
Perhaps some examples (ranging over much shorter strings) will help clarify the 'score' for 'essays'?
Assume Shakespeare's complete works are abridged to:
"The quick brown fox jumps over the lazy dog."
Essays scoring 1 include "own fox jump" and "The quick brow". The essay "jogging" scores 6 (despite being short) because it can't be represented in fewer than 6 segments of the complete works... It can be segmented into six strings that are all substrings of the complete works as follows: "[j][og][g][i][n][g]". N.B. Establishing scores for this short example is trivial compared to the original problem - because, in this example "complete works" - there is very little repetition.
Hopefully, this example segmentation helps clarify the 2**(N-1) figure. Each of the (N-1) gaps between the N characters in the essay may either be a boundary between segments or not... resulting in ~2**(N-1) candidate segmentations, each requiring substring searches of the complete works to test.
An (N)DFA would be a wonderful solution - if it were practical. I can see how to construct something that solves 'substring matching' in this way - but not scoring. The state space for scoring, on the surface at least, seems far too large (for any substantial complete works of Shakespeare). I'd welcome any explanation that undermines my assumption that the (N)DFA would be too large to be practical to compute/store.
A general approach for plagiarism detection is to append the student's text to the source text separated by a character not occurring in either and then to build either a suffix tree or suffix array. This will allow you to find in linear time large substrings of the student's text which also appear in the source text.
I find it difficult to be more specific because I do not understand your explanation of the score - the method above would be good for finding the longest stretch in the student's work which is an exact quote, but I don't understand your N - is it the number of distinct sections of source text needed to construct the student's text?
If so, there may be a dynamic programming approach. At step k, we work out the least number of distinct sections of source text needed to construct first k characters of the student's text. Using a suffix array built just from the source text or otherwise, we find the longest match between the source text and characters x..k of the student's text, where x is of course as small as possible. Then the least number of sections of source text needed to construct the first k characters of student text is the least needed to construct 1..x-1 (which we have already worked out) plus 1. By running this process for k=1..the length of the student text we find the least number of sections of source text needed to reconstruct the whole of it.
(Or you could just search StackOverflow for the student's text, on the grounds that students never do anything these days except post their question on StackOverflow :-)).
I claim that repeatedly moving along the target string from left to right, using a suffix array or tree to find the longest match at any time, will find the smallest number of different strings from the source text that produces the target string. I originally found this by looking for a dynamic programming recursion but, as pointed out by Evgeny Kluev, this is actually a greedy algorithm, so let's try and prove this with a typical greedy algorithm proof.
Suppose not. Then there is a solution better than the one you get by going for the longest match every time you run off the end of the current match. Compare the two proposed solutions from left to right and look for the first time when the non-greedy solution differs from the greedy solution. If there are multiple non-greedy solutions that do better than the greedy solution I am going to demand that we consider the one that differs from the greedy solution at the last possible instant.
If the non-greedy solution is going to do better than the greedy solution, and there isn't a non-greedy solution that does better and differs later, then the non-greedy solution must find that, in return for breaking off its first match earlier than the greedy solution, it can carry on its next match for longer than the greedy solution. If it can't, it might somehow do better than the greedy solution, but not in this section, which means there is a better non-greedy solution which sticks with the greedy solution until the end of our non-greedy solution's second matching section, which is against our requirement that we want the non-greedy better solution that sticks with the greedy one as long as possible. So we have to assume that, in return for breaking off the first match early, the non-greedy solution gets to carry on its second match longer. But this doesn't work, because, when the greedy solution finally has to finish using its first match, it can jump on to the same section of matching text that the non-greedy solution is using, just entering that section later than the non-greedy solution did, but carrying on for at least as long as the non-greedy solution. So there is no non-greedy solution that does better than the greedy solution and the greedy solution is optimal.
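Here is a minimal OCaml sketch of that greedy scorer (the function names are mine). It uses a naive substring test where a real implementation would use the suffix array or suffix tree, so it is far from linear time, but it shows the greedy step and the resulting score:

(* true if s occurs somewhere in works *)
let is_substring works s =
  let n = String.length works and m = String.length s in
  let rec same i j = j >= m || (works.[i + j] = s.[j] && same i (j + 1)) in
  let rec search_from i = i + m <= n && (same i 0 || search_from (i + 1)) in
  m = 0 || search_from 0

let score works essay =
  let n = String.length essay in
  (* length of the longest prefix of essay, starting at i, that occurs in works *)
  let longest_match i =
    let rec grow len =
      if i + len <= n && is_substring works (String.sub essay i len)
      then grow (len + 1)
      else len - 1
    in
    grow 1
  in
  let rec go i segments =
    if i >= n then segments
    else
      let len = longest_match i in
      if len <= 0 then invalid_arg "essay uses a character absent from the works"
      else go (i + len) (segments + 1)
  in
  go 0 0

On the worked example from the question, score "The quick brown fox jumps over the lazy dog." "jogging" evaluates to 6.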
Have you considered using N-Grams to solve this problem?
http://en.wikipedia.org/wiki/N-gram
First read the complete works of Shakespeare and build a trie. Then process the string left to right. We can greedily take the longest substring that matches one in the data because we want the minimum number of strings, so there is no factor of 2^N. The second part is dirt cheap O(N).
The depth of the trie is limited by the available space. With a gigabyte of RAM you could reasonably expect to exhaustively cover Shakespearean English strings of length at least 5 or 6. I would require that the leaf nodes are unique (which also gives a rule for constructing the trie) and keep a pointer to their place in the actual works, so you have access to the continuation.
This feels like a problem of partial matching a very large regular expression.
If so, it can be solved by a very large nondeterministic finite automaton, or, put more broadly, by a graph representing, for every character in the works of Shakespeare, all the possible next characters.
If necessary for efficiency reasons, the NDFA is guaranteed to be convertible to a DFA. But that construction can give rise to 2^n states; maybe this is what you were alluding to?
This aspect of the complexity does not really worry me. The NDFA will have M + C states: one state for each character of the works, plus C states (where C = 26*2 + #punctuation) connected to each of the M states to allow the algorithm to (re)start when there are 0 matched characters. The question is whether the corresponding DFA would have O(2^M) states and, if so, whether it is necessary to build that DFA; theoretically it is not. However, consider that in the construction each state has one and only one transition to exactly one other state (the next state corresponding to the next character in that work). We would expect each of the start states to be connected to M/C states on average, but in the worst case M, meaning the NDFA has to track at most M simultaneous states. That is a large number, but not an impossibly large one for computers these days.
The score would be derived by initializing it to 1 and then incrementing it every time a non-accepting state is reached.
It's true that one of the approaches to string searching is building a DFA. In fact, for the majority of the string search algorithms, it looks like a small modification on failure to match (increment counter) and success (keep going) can serve as a general strategy.

caesar cipher check in ocaml

I want to implement a check function that, given two strings s1 and s2, will check whether s2 is the Caesar cipher of s1. The interface needs to look like string -> string -> bool.
The problem is that I am not allowed to use any string functions other than String.length, so how can I solve it? I am not permitted any lists, arrays, or iteration; only recursion and pattern matching.
Please help me. Also, can you tell me how I can write a substring function in OCaml, with the above restrictions, other than using the module function?
My guess is that you are probably allowed to use s.[i] to get the ith character of string s. This is the same as String.get, but the instructor may not think of it in those terms. Without some way of getting the individual characters of the string, I believe that this is impossible. You should probably double-check with your instructor to be sure, but I would be surprised if he had meant for you to be unable to separate a string into characters (which is something that you cannot do with pattern matching alone in OCaml).
Once you can get individual characters, the way to do it should be pretty clear (you do not need substring to traverse each string recursively).
If you still want to write substring, creating it would be complex since you don't have access to String.create or other similar functions. But you can write your own version of String.create using recursion, one-character string literals (like "x"), the ability to set a character in a string to another (like s.[0] <- c), and string concatenation (s1 ^ s2). Again, of course, all of this assumes that those operators are allowed.
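Assuming s.[i] is allowed, a minimal sketch of the check might look like this (for illustration I also assume both strings contain only lowercase letters a..z and that the shift is taken modulo 26; the function names are mine):

(* shift needed to turn c1 into c2, modulo 26 *)
let shift_of c1 c2 =
  (Char.code c2 - Char.code c1 + 26) mod 26

let is_caesar s1 s2 =
  let n = String.length s1 in
  if n <> String.length s2 then false
  else if n = 0 then true
  else
    (* the shift is fixed by the first pair of characters; check
       recursively that every other position uses the same shift *)
    let k = shift_of s1.[0] s2.[0] in
    let rec check i =
      i >= n || (shift_of s1.[i] s2.[i] = k && check (i + 1))
    in
    check 0

For example, is_caesar "abc" "cde" is true (a shift of 2) and is_caesar "abc" "abd" is false. Only String.length, s.[i], Char.code and recursion are used; if Char.code is also off-limits, the comparison would have to be written as a much longer pattern match on pairs of characters.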

The History Behind the Definition of a 'String' [closed]

I have never thought about it until recently, but I'm not sure why we call strings strings. I am a .NET programmer, but I believe the concept of strings exists in virtually every programming language.
Outside of programming, I don't believe I've heard the word string used to describe words or letters. A quick Google of 'Define: string' yields a bunch of definitions that have nothing to do with the concept of letters, words, or anything of that nature associated with programming.
My guess is that, back in the day, strings were really just arrays of characters of a particular length, often with a delimiting character at the end. But I don't see a natural transition from 'character array' to string.
Can someone offer up some insight to why we call strings strings?
My assumption has always been that the programming term originated from the following definition of the word "string" (from Merriam-Webster):
(1): a series of things arranged in or as if in a line <a string of cars> <a string of names>
(2): a sequence of like items (as bits, characters, or words)
Since a string in programming is simply an ordered sequence of characters, referring to this as a "string of characters" (or simply "string") seems like the most probable origin.
From this reference:
The 1971 OED (p. 3097) quotes an 1891 Century Dictionary on a source in the Milwaukee Sentinel of 11 Jan. 1898 (section 3, p. 1) to the effect that this is a compositor's term. Printers would paste up the text that they had generated in a long strip of characters. (Presumably, they were paid by the foot, not by the word!) The quote says that it was not unusual for compositors to create more than 1500 (characters?) per hour.
From searching through the ACM bibliography it seems the word string acquired its meaning in computer science during the 1960s. At the beginning a string is a general kind of sequence or list, e.g. A command language for handling strings of symbols from 1958.
This article explicitly mentions "character strings" in 1964.
Unfortunately I can't access the full texts, which are behind a toll booth.
I had guessed that "string" was in use by mathematicians long before its adoption in programming languages. Turing machines effectively operate on strings. Turing may not have used the term, but it is used everywhere in automata textbooks, going back decades.
The earliest reference I could find was a fragment in Google books of a 1944 article "Recursively enumerable sets of positive integers and their decision problems" by logician Emil Post in Bulletin of the AMS. Fortunately, AMS provides online archives of complete articles free for download. Here is a link: http://www.ams.org/journals/bull/1944-50-05/S0002-9904-1944-08111-1/S0002-9904-1944-08111-1.pdf
I think there is little doubt that he is using "string" in the conventional sense used in computer science. P. 286: "For working purposes, we introduce the letter b, and consider "strings" of 1's and b's such as 11b1bb1. An operation on such strings such as "b1bP produces P1bb1" we term a normal operation. This particular normal operation is applicable only to strings starting with b1b, and the derived string is then obtained from the given string by first removing the initial b1b, and then tacking on 1bb1 at the end. Thus b1bb becomes b1bb1."
I suspect it's because string originally meant just a sequence of data values: "I'll just string these together" etc. These values didn't have to be characters. One very common use for this general concept happened to be a sequence of characters, and this took over as the general meaning of the word.
The earliest reference I could find in computing is from March 1963's METEOR: A LISP Interpreter for String Transformations by Daniel G. Bobrow at MIT's AI Labs.
However, definition 15d. in the Oxford English Dictionary is:
Computing A linear sequence of records or data.
... and with a first quotation from a 1956 Journal of the Association for Computing Machinery:
Areas are set aside for shuttling strings of control fields back and forth until a completely sorted sequence is obtained.
This use naturally follows on from definition 15c.:
Math., etc. A sequence of symbols or linguistic elements in a definite order.
... and first used in Clarence Irving Lewis and Cooper Harold Langford's Symbolic Logic (1932):
Propositions are not strings of marks, or series of sounds, except incidentally.
This in turn follows on from many other, much earlier definitions for things in a line.
The word was originally used to differentiate between a set of values to which the particular order of elements doesn't matter (for instance, a set of random samples of measurements) and another that could only have its meaning preserved when the order is also preserved. Originally a string could be a set of any kind of values, but since in the post-mainframe era a string of characters is by far the most common kind, the fact that the values are characters became a "default".
A string is a sequence of discrete objects (usually char).
Given that, I would probably venture a guess that it may have to do with a metaphor related to "string of pearls". Each bead on the string is a single character.
It's called a string because it's actually an array of char-type elements.
That being said, the characters are "strung together" via this array, which turns them into a "string".

Is there any advantage of being a case-sensitive programming language? [duplicate]

I personally do not like programming languages being case sensitive.
(I know that the disadvantages of case sensitivity are nowadays mitigated by good IDEs)
Still I would like to know whether there are any advantages for a programming language if it is case sensitive. Is there any reason why designers of many popular languages chose to make them case sensitive?
EDIT: duplicate of Why are many languages case sensitive?
EDIT: (I cannot believe I asked this question a few years ago)
This is a preference. I prefer case sensitivity because I find it easier to read code this way. For instance, the variable name "myVariable" has a different word shape from "MyVariable," "MYVARIABLE," and "myvariable," which makes it more straightforward to tell such identifiers apart at a glance. Of course, you should never, or only very rarely, create identifiers that differ only in case. This is more about consistency than the obvious "benefit" of increasing the number of possible identifiers. Some people think this is a disadvantage. I can't think of any time in which case sensitivity gave me any problems. But again, this is a preference.
Case-sensitivity is inherently faster to parse (albeit only slightly) since it can compare character sequences directly without having to figure out which characters are equivalent to each other.
It allows the implementer of a class/library to control how casing is used in the code. Case may also be used to convey meaning.
The code looks more uniform. In the days of BASIC these were all equivalent:
PRINT MYVAR
Print MyVar
print myvar
With declaration checking, case sensitivity prevents a misspelling from silently becoming a new, unrecognized variable. I have fixed bugs in code written in a case-insensitive language with implicit declarations (FORTRAN 77), where the zero (0) and the capital letter O looked the same in the editor. The language silently created a new variable, so the output was wrong. With a case-sensitive language that requires declarations, this would not have happened.
In the compiler or interpreter, a case-insensitive language is going to have to make everything upper or lowercase to test for matches, or otherwise use a case insensitive matching tool, but that's only a small amount of extra work for the compiler.
Plus case-sensitive code allows certain patterns of declarations such as
MyClassName myClassName = new MyClassName()
and other situations where case sensitivity is nice.
