Finite strings but possibly infinite language - programming-languages

We know that a string is finite but on the other hand we know that a language is a set of strings(possibly infinite) over an alphabet. Isn't this relation a contradiction?

Every natural number has a finite number of digits in it. Yet, there's an infinite number of natural numbers.
In other words, as long as there's no bound on the number of digits per number, you can always create longer and longer numbers from the same alphabet.

In the phrase "a language is a set of strings(possibly infinite) over an alphabet", the parenthetical observation relates to the set, not to the strings. That is, it might equally well have been described as "a set (possibly infinite) of (finite) strings". There is no contradiction in the definition (properly understood), because it is the strings which are said to be finite and the set which is said to be infinite.
Note, by the way, that it is possible to allow infinite strings and to consider the properties of languages defined as sets of finite or infinite strings, but pretty much all work on formal languages restricts sentences to finite length; the restriction makes a lot of questions tractable which are not tractable for the case where infinite strings are allowed.

Related

What is a basic explanation of Finite Fields?

I have zero background and I have never seen these symbols before. Can someone explain what is going on here?
In a simple explanation: A finite field, is a finite set, meaning multiplication, addition, subtraction and division are defined and follow the rules of field axioms. It contains a finite number of elements. A basic example of Finite Fields are
p = prime
modulo P
If you are looking for more information, regarding programming with Finite Fields
See this source: https://jeremykun.com/2014/03/13/programming-with-finite-fields/
It might seem a bit wild, but if you read more about it, it will start to make sense.
I'd also research into Finite Field Arithmetic.
I hope my answer was useful
- yosh

Is Rust's lexical grammar regular, context-free or context-sensitive?

The lexical grammar of most programming languages is fairly non-expressive in order to quickly lex it. I'm not sure what category Rust's lexical grammar belongs to. Most of it seems regular, probably with the exception of raw string literals:
let s = r##"Hi lovely "\" and "#", welcome to Rust"##;
println!("{}", s);
Which prints:
Hi lovely "\" and "#", welcome to Rust
As we can add arbitrarily many #, it seems like it can't be regular, right? But is the grammar at least context-free? Or is there something non-context free about Rust's lexical grammar?
Related: Is Rust's syntactical grammar context-free or context-sensitive?
The raw string literal syntax is not context-free.
If you think of it as a string surrounded by r#k"…"#k (using the superscript k as a count operator), then you might expect it to be context-free:
raw_string_literal
: 'r' delimited_quoted_string
delimited_quoted_string
: quoted_string
| '#' delimited_quoted_string '#'
But that is not actually the correct syntax, because the quoted_string is not allowed to contain "#k although it can contain "#j for any j<k
Excluding the terminating sequence without excluding any other similar sequence of a different length cannot be accomplished with a context-free grammar because it involves three (or more) uses of the k-repetition in a single production, and stack automata can only handle two. (The proof that the grammar is not context-free is surprisingly complicated, so I'm not going to attempt it here for lack of MathJax. The best proof I could come up with uses Ogden's lemma and the uncommonly cited (but highly useful) property that context-free grammars are closed under the application of a finite-state transducer.)
C++ raw string literals are also context-sensitive [or would be if the delimiter length were not limited, see Note 1], and pretty well all whitespace-sensitive languages (like Python and Haskell) are context-sensitive. None of these lexical analysis tasks is particularly complicated so the context-sensitivity is not a huge problem, although most standard scanner generators don't provide as much assistance as one might like. But there it is.
Rust's lexical grammar offers a couple of other complications for a scanner generator. One issue is the double meaning of ', which is used both to create character literals and to mark lifetime variables and loop labels. Apparently it is possible to determine which of these applies by considering the previously recognized token. That could be solved with a lexical scanner which is capable of generating two consecutive tokens from a single pattern, or it could be accomplished with a scannerless parser; the latter solution would be context-free but not regular. (C++'s use of ' as part of numeric literals does not cause the same problem; the C++ tokens can be recognized with regular expressions, because the ' can not be used as the first character of a numeric literal.)
Another slightly context-dependent lexical issue is that the range operator, .., takes precedence over floating point values, so that 2..3 must be lexically analysed as three tokens: 2 .. 3, rather than two floating point numbers 2. .3, which is how it would be analysed in most languages which use the maximal munch rule. Again, this might or might not be considered a deviation from regular expression tokenisation, since it depends on trailing context. But since the lookahead is at most one character, it could certainly be implemented with a DFA.
Postscript
On reflection, I am not sure that it is meaningful to ask about a "lexical grammar". Or, at least, it is ambiguous: the "lexical grammar" might refer to the combined grammar for all of the languages "tokens", or it might refer to the act of separating a sentence into tokens. The latter is really a transducer, not a parser, and suggests the question of whether the language can be tokenised with a finite-state transducer. (The answer, again, is no, because raw strings cannot be recognized by a FSA, or even a PDA.)
Recognizing individual tokens and tokenising an input stream are not necessarily equivalent. It is possible to imagine a language in which the individual tokens are all recognized by regular expressions but an input stream cannot be handled with a finite-state transducer. That will happen if there are two regular expressions T and U such that some string matching T is the longest token which is a strict prefix of an infinite set of strings in U. As a simple (and meaningless) example, take a language with tokens:
a
a*b
Both of these tokens are clearly regular, but the input stream cannot be tokenized with a finite state transducer because it must examine any sequence of as (of any length) before deciding whether to fallback to the first a or to accept the token consisting of all the as and the following b (if present).
Few languages show this pathology (and, as far as I know, Rust is not one of them), but it is technically present in some languages in which keywords are multiword phrases.
Notes
Actually, C++ raw string literals are, in a technical sense, regular (and therefore context free) because their delimiters are limited to strings of maximum length 16 drawn from an alphabet of 88 characters. That means that it is (theoretically) possible to create a regular expression consisting of 13,082,362,351,752,551,144,309,757,252,761 patterns, each matching a different possible raw string delimiter.

How to determine whether given language is regular or not(by just looking at the language)?

Is there any trick to guess if a language is regular by just looking at the language?
In order to choose proof methods, I have to have some hypothesis at first. Do you know any hints/patterns required to reduce time consumption in solving long questions?
For instance, in order not to spend time on pumping lemma, when language is regular and I don't want to construct DFA/grammar.
For example:
1. L={w ε {a,b}*/no of a in (w) < no of b in (w)}
2. L={a^nb^m/n,m>=0}
How to tell which is regular by just looking at the above examples??
In general, when looking at a language, a good rule of thumb for whether the language is regular or not is to think of a program that can read a string and answer the question "is this string in the language?"
To write such a program, do you need to store some arbitrary value in a variable or is the program's state (that is, the combination of all possible variables' values) limited to some finite fixed number of possibilities? If the language can be recognized by a program that only needs a fixed number of variables that can only have a fixed number of values, then you've got a regular language. If not, then not.
Using this, I can see that the first language is not regular, but the second language is. In the first language, I need to remember how many as I've seen, and how many bs. (Or at the very least, I need to keep track of (# of as) - (# of bs), and accept if the string ends while that count is negative). At the same time, there's no limit on the number of as, so this count could go arbitrarily large.
In the second language, I don't care what n and m are at all. So with the second language, my program would just keep track of "have I seen at least one b yet?" to make sure we don't have any a characters that occur after the first b. (So, one variable with only two values - true or false)
So one way to make language 1 into a regular language is to change it to be:
1. L={w ∈ {a,b}*/no of a in (w) < no of b in (w), and no of a in (w) < 100}
Now I don't need to keep track of the number of as that I've seen once I hit 100 (since then I know automatically that the string isn't in the language), and likewise with the number of bs - once I hit 100, I can stop counting because I know that'll be enough unless the number of as is itself too large.
One common case you should watch out for with this is when someone asks you about languages where "number of as is a multiple of 13" or "w ∈ {0,1}* and w is the binary representation of a multiple of 13". With these, it might seem like you need to keep track of the whole number to make the determination, but in fact you don't - in both cases, you only need to keep a variable that can count from 0 to 12. So watch out for "multiple of"-type languages. (And the related "is odd" or "is even" or "is 1 more than a multiple of 13")
Other mathematical properties though - for example, w ∈ {0,1}* and w is the binary representation of a perfect square - will result in non-regular languages.

Can a Language accept Infinite numbers

I have a question whether a language can accept infinite numbers
I have to reduce Lempty to Linf
where Lempty ={e|L(Pe) is null}
Linf={e|L(Pe) is infinite}
so can i define a program P like this
"
input n
Run Pe on 1...n for n steps
if Pe accept any number then enter an infinite loop and start accepting infinite number
if pe does not accept any number then accept 1"
Now if i can define program P like above then asking the question whether Language accepted by P belong to Linf will tell me whether Language accepted by Pe is null or not.
Any advice or suggestions is always welcome
This question is a bit ill-defined, but here's an approximation:
In standard computational theory, languages cannot contain infinite strings. Infinite number is unclear. If you mean a number requiring infinite digits to precisely represent, no. Standard automata cannot recognize infinite length strings (how could it ever terminate?). However, many numbers that appear to require infinite digits (consider one-third as .3333333...) do not require infinite characters to represent otherwise, making them candidates for any language.
Hope that helps, depending on what exactly you're asking.

Is there a formal definition of character difference across a string and if so how is it calculated?

Overview
I'm looking to analyse the difference between two characters as part of a password strength checking process.
I'll explain what I'm trying to achieve and why and would like to know if what I'm looking to do is formally defined and whether there are any recommended algorithms for achieving this.
What I'm looking to do
Across a whole string, I'm looking to compare the current character with the previous character and determine how different they are.
As this relates to password strength checking, the difference between one character and it's predecessor in a string might be defined as being how predictable character N is from knowing character N - 1. There might be a formal definition for this of which I'm not aware.
Example
A password of abc123 could be arguably less secure than azu590. Both contain three letters followed by three numbers, however in the case of the former the sequence is more predictable.
I'm assuming that a password guesser might try some obvious sequences such that abc123 would be tried much before azu590.
Considering the decimal ASCII values for the characters in these strings, and given that b is 1 different from a and c is 1 different again from b, we could derive a simplistic difference calculation.
Ignoring cases where two consecutive characters are not in the same character class, we could say that abc123 has an overall character to character difference of 4 whereas azu590 has a similar difference of 25 + 5 + 4 + 9 = 43.
Does this exist?
This notion of character to character difference across a string might be defined, similar to the Levenshtein distance between two strings. I don't know if this concept is defined or what it might be called. Is it defined and if so what is it called?
My example approach to calculating the character to character difference across a string is a simple and obvious approach. It may be flawed, it may be ineffective. Are there any known algorithms for calculating this character to character difference effectively?
It sounds like you want a Markov Chain model for passwords. A Markov Chain has a number of states and a probability of transitioning between the states. In your case the states are the characters in the allowed character set and the probability of a transition is proportional to the frequency that those two letters appear consecutively. You can construct the Markov Chain by looking at the frequency of the transitions in an existing text, for example a freely available word list or password database.
It is also possible to use variations on this technique (Markov chain of order m) where you for example consider the previous two characters instead of just one.
Once you have created the model you can use the probability of generating the password from the model as a measure of its strength. This is the product of the probabilities of each state transition.
For general signals/time-series data, this is known as Autocorrelation.
You could try adapting the Durbin–Watson statistic and test for positive auto-correlation between the characters. A naïve way may be to use the unicode code-points of each character, but I'm sure that will not be good enough.

Resources