Representing the strings we use in programming in math notation - string

Now I'm a programmer who's recently discovered how bad he is when it comes to mathematics and decided to focus a bit on it from that point forward, so I apologize if my question insults your intelligence.
In mathematics, is there the concept of strings that is used in programming? i.e. a permutation of characters.
As an example, say I wanted to translate the following into mathematical notation:
let s be a string of n number of characters.
Reason being I would want to use that representation in find other things about string s, such as its length: len(s).
How do you formally represent such a thing in mathematics?
Talking more practically, so to speak, let's say I wanted to mathematically explain such a function:
fitness(s,n) = 1 / |n - len(s)|
Or written in more "programming-friendly" sort of way:
fitness(s,n) = 1 / abs(n - len(s))
I used this function to explain how a fitness function for a given GA works; the question was about finding strings with 5 characters, and I needed the solutions to be sorted in ascending order according to their fitness score, given by the above function.
So my question is, how do you represent the above pseudo-code in mathematical notation?

You can use the notation of language theory, which is used to discuss things like regular languages, context free grammars, compiler theory, etc. A quick overview:
A set of characters is known as an alphabet. You could write: "Let A be the ASCII alphabet, a set containing the 128 ASCII characters."
A string is a sequence of characters. ε is the empty string.
A set of strings is formally known as a language. A common statement is, "Let s ∈ L be a string in language L."
Concatenating alphabets produces sets of strings (languages). A represents all 1-character strings, AA, also written A2, is the set of all two character strings. A0 is the set of all zero-length strings and is precisely A0 = {ε}. (It contains exactly one string, the empty string.)
A* is special notation and represents the set of all strings over the alphabet A, of any length. That is, A* = A0 ∪ A1 ∪ A2 ∪ A3 ... . You may recognize this notation from regular expressions.
For length use absolute value bars. The length of a string s is |s|.
So for your statement:
let s be a string of n number of characters.
You could write:
Let A be a set of characters and s ∈ An be a string of n characters. The length of s is |s| = n.

Mathematically, you have explained fitness(s, n) just fine as long as len(s) is well-defined.
In CS texts, a string s over a set S is defined as a finite ordered list of elements of S and its length is often written as |s| - but this is only notation, and doesn't change the (mathematical) meaning behind your definition of fitness, which is pretty clear just how you've written it.

Related

Hash function to see if one string is scrambled form/permutation of another?

I want to check if string A is just a reordered version of string B. For example, "abc" = "bca" = "cab"...
There are other solutions here: https://www.geeksforgeeks.org/check-if-two-strings-are-permutation-of-each-other/
However, I was thinking a hash function would be an easy way of doing this, but the typical hash function takes order into consideration. Are there any hash functions that do not care about character order?
Are there any hash functions that do not care about character order?
I don't know of real-world hash functions that have this property, no. Because this is not a problem they are designed to solve.
However, in this specific case, you can make your own "hash" function (a very very bad one) that will indeed ignore order: just sum ASCII codes of characters. This works due to the commutative property of addition (a + b == b + a)
def isAnagram(self,a,b):
sum_a = 0
sum_b = 0
for c in a:
sum_a += ord(c)
for c in b:
sum_b += ord(c)
return sum_a == sum_b
To reiterate, this is absolutely a hack, that only happens to work because input strings are limited in content in the judge system (only have lowercase ASCII characters and do not contain spaces). It will not work (reliably) on arbitrary strings.
For a fast check you could use a kind af hash-funkction
Candidates are:
xor all characters of a String
add all characters of a String
multiply all characters of a String (be careful might lead to overflow for large Strings)
If the hash-value is equal, it could still be a collision of two not 'equal' strings. So you still need to make a dedicated compare. (e.g. sort the characters of each string before comparing them).

Representing a non-numeric string as in integer in Julia

I'm looking for a simple and invertible way to represent a Julia string by an integer (e.g. for cryptography). To be clear, I'm not considering string representations of integers like "123", but arbitrary strings like "Hello". The representation doesn't need to be human-readable, but it needs to be easily invertible back to a unique string (so not a hash). It doesn't need to be efficient; I'm just looking for something as simple as possible. (Also, it's fine if it only works on a small character set, e.g. lowercase Roman letters.)
One naive way would be to collect the string into a vector of chars, parse(Int, _) each char to an integer, and concatenate the integers. But this seems cumbersome, and I suspect that there's in built-in Julia function (or small composition of functions) that will get the job done more easily.
If your strings only use the numbers 0-9 and letters a-z and A-Z, then you can parse the string directly as base 62 BigInteger:
julia> s = randstring(123)
"RFXkzD6VpWcwvbsxOtdTxS4DGcgciKgDXECa9fEK0Djcdkcj5N75vIHEMVyuH9mcYgvFbLhbPdrKyPIO4JsK1DKgZIacov6WKDZdIpGJ5iJ15dpjmcCBCybMmxB"
julia> i = parse(BigInt, s, base=62)
12798646956721889529517502411501433963894611324020956397632780092623456213685688389093681112679380669903728068303911743800989012987014660454736389459814982802097607808640628339365945710572579898457023165244164689548286133
julia> string(i, base=62)
"RFXkzD6VpWcwvbsxOtdTxS4DGcgciKgDXECa9fEK0Djcdkcj5N75vIHEMVyuH9mcYgvFbLhbPdrKyPIO4JsK1DKgZIacov6WKDZdIpGJ5iJ15dpjmcCBCybMmxB"
I created a (somewhat complicated) implementation that works for ASCII strings:
stringToInt(str::String) = sum(i -> Int(str[end-i]) * 128^i, 0:length(str)-1)
function intToString(m::Int)
chars = Char[]
for n in div(ceil(Int, log2(x)), 7)-1:-1:0
d, m = divrem(m, 128^n)
push!(chars, d)
end
String(chars)
end
Let me know if you can think of a better one.

Prolog DCG Building/Recognizing Word Strings from Alphanumeric Characters

So I'm writing simple parsers for some programming languages in SWI-Prolog using Definite Clause Grammars. The goal is to return true if the input string or file is valid for the language in question, or false if the input string or file is not valid.
In all almost all of the languages there is an "identifier" predicate. In most of the languages the identifier is defined as the one of the following in EBNF: letter { letter | digit } or ( letter | digit ) { letter | digit }, that is to say in the first case a letter followed by zero or more alphanumeric characters, or i
My input file is split into a list of word strings (i.e. someIdentifier1 = 3 becomes the list [someIdentifier1,=,3]). The reason for the string to be split into lists of words rather than lists of letters is for recognizing keywords defined as terminals.
How do I implement "identifier" so that it recognizes any alphanumeric string or a string consisting of a letter followed by alphanumeric characters.
Is it possible or necessary to further split the word into letters for this particular predicate only, and if so how would I go about doing this? Or is there another solution, perhaps using SWI-Prolog libraries' built-in predicates?
I apologize for the poorly worded title of this question; however, I am unable to clarify it any further.
First, when you need to reason about individual letters, it is typically most convenient to reason about lists of characters.
In Prolog, you can easily convert atoms to characters with atom_chars/2.
For example:
?- atom_chars(identifier10, Cs).
Cs = [i, d, e, n, t, i, f, i, e, r, '1', '0'].
Once you have such characters, you can used predicates like char_type/2 to reason about properties of each character.
For example:
?- char_type(i, T).
T = alnum ;
T = alpha ;
T = csym ;
etc.
The general pattern to express identifiers such as yours with DCGs can look as follows:
identifier -->
[L],
{ letter(L) },
identifier_rest.
identifier_rest --> [].
identifier_rest -->
[I],
{ letter_or_digit(I) },
identifier_rest.
You can use this as a building block, and only need to define letter/1 and letter_or_digit/1. This is very easy with char_type/2.
Further, you can of course introduce an argument to relate such lists to atoms.

Find the minimal lexographical string formed by merging two strings

Suppose we are given two strings s1 and s2(both lowercase). We have two find the minimal lexographic string that can be formed by merging two strings.
At the beginning , it looks prettty simple as merge of the mergesort algorithm. But let us see what can go wrong.
s1: zyy
s2: zy
Now if we perform merge on these two we must decide which z to pick as they are equal, clearly if we pick z of s2 first then the string formed will be:
zyzyy
If we pick z of s1 first, the string formed will be:
zyyzy which is correct.
As we can see the merge of mergesort can lead to wrong answer.
Here's another example:
s1:zyy
s2:zyb
Now the correct answer will be zybzyy which will be got only if pick z of s2 first.
There are plenty of other cases in which the simple merge will fail. My question is Is there any standard algorithm out there used to perform merge for such output.
You could use dynamic programming. In f[x][y] store the minimal lexicographical string such that you've taken x charecters from the first string s1 and y characters from the second s2. You can calculate f in bottom-top manner using the update:
f[x][y] = min(f[x-1][y] + s1[x], f[x][y-1] + s2[y]) \\ the '+' here represents
\\ the concatenation of a
\\ string and a character
You start with f[0][0] = "" (empty string).
For efficiency you can store the strings in f as references. That is, you can store in f the objects
class StringRef {
StringRef prev;
char c;
}
To extract what string you have at certain f[x][y] you just follow the references. To udapate you point back to either f[x-1][y] or f[x][y-1] depending on what your update step says.
It seems that the solution can be almost the same as you described (the "mergesort"-like approach), except that with special handling of equality. So long as the first characters of both strings are equal, you look ahead at the second character, 3rd, etc. If the end is reached for some string, consider the first character of the other string as the next character in the string for which the end is reached, etc. for the 2nd character, etc. If the ends for both strings are reached, then it doesn't matter from which string to take the first character. Note that this algorithm is O(N) because after a look-ahead on equal prefixes you know the whole look-ahead sequence (i.e. string prefix) to include, not just one first character.
EDIT: you look ahead so long as the current i-th characters from both strings are equal and alphabetically not larger than the first character in the current prefix.

Finding length of substring

I have given n strings . I have to find a string S so that, given n strings are sub-sequence of S.
For example, I have given the following 5 strings:
AATT
CGTT
CAGT
ACGT
ATGC
Then the string is "ACAGTGCT" . . Because, ACAGTGCT contains all given strings as super-sequence.
To solve this problem I have to know the algorithm . But I have no idea how to solve this . Guys, can you help me by telling technique to solve this problem ?
This is a NP-complete problem called multiple sequence alignment.
The wiki page describes solution methods such as dynamic programming which will work for small n, but becomes prohibitively expensive for larger n.
The basic idea is to construct an array f[a,b,c,...] representing the length of the shortest string S that generates "a" characters of the first string, "b" characters of the second, and "c" characters of the third.
My Approach: using Trie
Building a Trie from the given words.
create empty string (S)
create empty string (prev)
for each layer in the trie
create empty string (curr)
for each character used in the current layer
if the character not used in the previous layer (not in prev)
add the character to S
add the character to curr
prev = curr
Hope this helps :)
1 Definitions
A sequence of length n is a concatenation of n symbols taken from an alphabet .
If S is a sequence of length n and T is a sequence of length m and n m then S is a subsequence of T if S can be obtained by deleting m-n symbols from T. The symbols need not be contiguous.
A sequence T of length m is a supersequence of S of length n if T can be obtained by inserting m-n symbols. That is, T is a supersequence of S if and only if S is a subsequence of T.
A sequence T is a common supersequence of the sequences S1 and S2 of T is a supersequence of both S1 and S2.
2 The problem
The problem is to find a shortest common supersequence (SCS), which is a common supersequence of minimal length. There could be more than one SCS for a given problem.
2.1 Example
S= {a, b, c}
S1 = bcb
S2 = baab
S3 = babc
One shortest common supersequence is babcab (babacb, baabcb, bcaabc, bacabc, baacbc).
3 Techniques
Dynamic programming Requires too much memory unless the number of input-sequences are very small.
Branch and bound Requires too much time unless the alphabet is very small.
Majority merge The best known heuristic when the number of sequences is large compared to the alphabet size. [1]
Greedy (take two sequences and replace them by their optimal shortest common supersequence until a single string is left) Worse than majority merge. [1]
Genetic algorithms Indications that it might be better than majority merge. [1]
4 Implemented heuristics
4.1 The trivial solution
The trivial solution is at most || times the optimal solution length and is obtained by concatenating the concatenation of all characters in sigma as many times as the longest sequence. That is, if = {a, b, c} and the longest input sequence is of length 4 we get abcabcabcabc.
4.2 Majority merge heuristic
The Majority merge heuristic builds up a supersequence from the empty sequence (S) in the following way:
WHILE there are non-empty input sequences
s <- The most frequent symbol at the start of non-empty input-sequences.
Add s to the end of S.
Remove s from the beginning of each input sequence that starts with s.
END WHILE
Majority merge performs very well when the number of sequences is large compared to the alphabet size.
5 My approach - Local search
My approach was to apply a local search heuristic to the SCS problem and compare it to the Majority merge heuristic to see if it might do better in the case when the alphabet size is larger than the number of sequences.
Since the length of a valid supersequence may vary and any change to the supersequence may give an invalid string a direct representation of a supersequence as a feasible solution is not an option.
I chose to view a feasible solution (S) as a sequence of mappings x1...xSl where Sl is the sum of the lengths of all sequences and xi is a mapping to a sequencenumber and an index.
That means, if L={{s1,1...s1,m1}, {s2,1...s2,m2} ...{sn,1...s3,mn}} is the set of input sequences and L(i) is the ith sequence the mappings are represented like this:
xi {k, l}, where k L and l L(k)
To be sure that any solution is valid we need to introduce the following constraints:
Every symbol in every sequence may only have one xi mapped to it.
If xi ss,k and xj ss,l and k < l then i < j.
If xi ss,k and xj ss,l and k > l then i > j.
The second constraint enforces that the order of each sequence is preserved but not its position in S. If we have two mappings xi and xj then we may only exchange mappings between them if they map to different sequences.
5.1 The initial solution
There are many ways to choose an initial solution. As long as the order of the sequences are preserved it is valid. I chose not to in some way randomize a solution but try two very different solution-types and compare them.
The first one is to create an initial solution by simply concatenating all the sequences.
The second one is to interleave the sequences one symbol at a time. That is to start with the first symbol of every sequence then, in the same order, take the second symbol of every sequence and so on.
5.2 Local change and the neighbourhood
A local change is done by exchanging two mappings in the solution.
One way of doing the iteration is to go from i to Sl and do the best exchange for each mapping.
Another way is to try to exchange the mappings in the order they are defined by the sequences. That is, first exchange s1,1, then s2,1. That is what we do.
There are two variants I have tried.
In the first one, if a single mapping exchange does not yield a better value I return otherwise I go on.
In the second one, I seperately for each sequence do as many exchanges as there are sequences so a symbol in each sequence will have a possibility of moving. The exchange that gives the best value I keep and if that value is worse than the value of the last step in the algorithm I return otherwise I go on.
A symbol may move any number of position to the left or to the right as long as the exchange does not change the order of the original sequences.
The neighbourhood in the first variant is the number of valid exchanges that can be made for the symbol. In the second variant it is the sum of valid exchanges of each symbol after the previous symbol has been exchanged.
5.3 Evaluation
Since the length of the solution is always constant it has to be compressed before the real length of the solution may be obtained.
The solution S, which consists of mappings is converted to a string by using the symbols each mapping points to. A new, initialy empty, solution T is created. Then this algorithm is performed:
T = {}
FOR i = 0 TO Sl
found = FALSE
FOR j = 0 TO |L|
IF first symbol in L(j) = the symbol xi maps to THEN
Remove first symbol from L(j)
found = TRUE
END IF
END FOR
IF found = TRUE THEN
Add the symbol xi maps to to the end of T
END IF
END FOR
Sl is as before the sum of the lengths of all sequences. L is the set of all sequences and L(j) is sequence number j.
The value of the solution S is obtained as |T|.
With Many Many Thanks to : Andreas Westling

Resources