What does the pound sign represent in a language?

L = { w#w#w | w ∈ {0,1}* }
What are some example strings in this language? The pound sign is throwing me off. Is it the same as www? For example, with w = 01, is 010101 in the language L?

So, the pound sign (or sometimes the $ sign) is used in formal language theory as a separator. It is just some symbol that is not already in your original alphabet. Formally, this means that you are considering words over the extended alphabet {0, 1, #}, which contains three symbols.
The language you described is basically a word replicated three times, separated by # signs. So for w = 01, the corresponding word in L is 01#01#01, not 010101. Here are some examples of words in the language: 0#0#0, 111000#111000#111000, ## (taking w to be the empty string). Here are some examples of words not in the language: 0, 1###, 01#11#11.
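To make the definition concrete, here is a minimal Python sketch of a membership test for L (in_L is just an illustrative name):

def in_L(word):
    # A word is in L = { w#w#w | w in {0,1}* } exactly when splitting on '#'
    # yields three identical parts consisting only of 0s and 1s.
    parts = word.split('#')
    return (len(parts) == 3
            and parts[0] == parts[1] == parts[2]
            and all(c in '01' for c in parts[0]))

print(in_L('0#0#0'))     # True
print(in_L('##'))        # True  (w is the empty string)
print(in_L('010101'))    # False (no separators)
print(in_L('01#11#11'))  # False (the parts differ)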

Related

Python: lower() method generates wrong letter in a string

text = 'ÇEKİM GÜNÜ KALİTESİNİ DÜZENLERLSE'
sentence = text.split(' ')
print(sentence)
if "ÇEKİM" in sentence:
    print("yes-1")
print(" ")
sentence_ = text.lower().split(' ')
print(sentence_)
if "çekim" in sentence_:
    print("yes-2")
>> output:
['ÇEKİM', 'GÜNÜ', 'KALİTESİNİ', 'DÜZENLERLSE']
yes-1
['çeki̇m', 'günü', 'kali̇tesi̇ni̇', 'düzenlerlse']
I have a problem with strings. I have a sentence as a text. When I check for a specific word in the list produced by splitting this sentence, I can find the word "ÇEKİM" (it prints yes-1). However, when I search after lowercasing the sentence, I cannot find the word in the list, because the İ letter has changed. What is the reason for this (encoding/decoding)? Why does the lower() method change the string beyond lowercasing it? By the way, it is a Turkish word. Upper: ÇEKİM, lower: çekim.
Turkish i and English i are treated differently. The capitalized Turkish dotted i is İ, while the capitalized English i is I. To differentiate them, Unicode has rules for converting to lower and upper case: lowercasing İ with the default (locale-independent) rules yields 'i' plus a combining dot above. Also, converting that lower-case version back to upper case leaves the characters in a decomposed form, so a proper comparison needs to normalize the strings to a standard form; you cannot compare a decomposed form to a composed form. Note the differences in the strings below:
# coding: utf8
import unicodedata as ud

def dump_names(s):
    print('string:', s)
    for c in s:
        print(f'U+{ord(c):04X} {ud.name(c)}')

turkish_i = 'İ'
dump_names(turkish_i)
dump_names(turkish_i.lower())
dump_names(turkish_i.lower().upper())
dump_names(ud.normalize('NFC', turkish_i.lower().upper()))
string: İ
U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
string: i̇
U+0069 LATIN SMALL LETTER I
U+0307 COMBINING DOT ABOVE
string: İ
U+0049 LATIN CAPITAL LETTER I
U+0307 COMBINING DOT ABOVE
string: İ
U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
Some terminals also have display issues. My system displays the lowercased string with the dot drawn over the m, not the i. On the Chrome browser, for example, the following displays correctly:
>>> s = 'ÇEKİM'
>>> s.lower()
'çeki̇m'
But one of my editors renders it with the combining dot misplaced (screenshot omitted), so it appears something like this is what the OP is seeing. The following comparison will work:
if "çeki\N{COMBINING DOT ABOVE}m" in sentence_:
print("yes-2")

How to manipulate strings in Go to reverse them?

I'm trying to reverse a string in Go, but I'm having trouble handling the characters. Unlike C, Go treats strings as read-only slices of bytes rather than arrays of characters; the characters themselves are called runes here. I tried some type conversions to do the assignments, but so far I have not succeeded.
The idea is to generate 5 strings of random characters with sizes 100, 200, 300, 400, and 500, and then reverse their characters. I was able to make this work in C with ease, but in Go the compiler returns an error saying that the assignment is not possible.
func inverte() {
    var c = "A"
    var strs, aux string
    rand.Seed(time.Now().UnixNano())
    // Generate 5 strings of 100, 200, 300, 400, and 500 characters
    for i := 1; i < 6; i++ {
        strs = randomString(i * 100)
        fmt.Print(strs)
        for i2, j := 0, len(strs); i2 < j; i2, j = i2+1, j-1 {
            aux = strs[i2]     // strs[i2] is a byte, not a string
            strs[i2] = strs[j] // compile error: strings are immutable, cannot assign to strs[i2]
            strs[j] = aux
        }
    }
}
If you want to take into account unicode combining characters (characters that are intended to modify other characters, like an acute accent ´ + e = é), Andrew Sellers has an interesting take in this gist.
It starts by listing the Unicode block ranges for combining diacritical marks (CDMs), the Unicode blocks containing the most common combining characters:
- regular (inherited): the usual ◌̀ ◌́ ◌̂ ◌̃ ◌̄ ◌̅ ◌̆ ◌̇ ◌̈, ...
- extended: diacritical marks used in German dialectology (Teuthonista)
- supplement: the Uralic Phonetic Alphabet, Medievalist notations, and German dialectology (again, Teuthonista)
- for symbols: arrows, dots, enclosures, and overlays for modifying symbol characters
- half marks: diacritic mark parts that span multiple characters, as seen here
var combining = &unicode.RangeTable{
    R16: []unicode.Range16{
        {0x0300, 0x036f, 1}, // combining diacritical marks
        {0x1ab0, 0x1aff, 1}, // combining diacritical marks extended
        {0x1dc0, 0x1dff, 1}, // combining diacritical marks supplement
        {0x20d0, 0x20ff, 1}, // combining diacritical marks for symbols
        {0xfe20, 0xfe2f, 1}, // combining half marks
    },
}
You can then read, rune after rune, your initial string:
sv := []rune(s)
But if you do so in reverse order, you will encounter the combining diacritical marks (CDMs) first, and those need to keep their original order rather than being reversed:
for ix := len(sv) - 1; ix >= 0; ix-- {
    r := sv[ix]
    if unicode.In(r, combining) {
        cv = append(cv, r)
        fmt.Printf("Detect combining diacritical mark ' %c'\n", r)
    }
(Note the space before %c when printing a combining rune: without the space, the mark would combine with the preceding quote character instead of being displayed on its own. I tried to use the COMBINING GRAPHEME JOINER \u034F, but that does not work.)
When you finally encounter a regular rune, you need to combine it with those pending CDMs before appending it to the final reversed rune slice:
    } else {
        rrv := make([]rune, 0, len(cv)+1)
        rrv = append(rrv, r)
        rrv = append(rrv, cv...)
        fmt.Printf("regular mark '%c' (with '%d' combining diacritical marks '%s') => '%s'\n", r, len(cv), string(cv), string(rrv))
        rv = append(rv, rrv...)
        cv = make([]rune, 0)
    }
It gets even more complex with emojis and, more recently, modifiers such as the Medium-Dark Skin Tone, type 5 on the Fitzpatrick scale of skin tones.
If these are ignored, reversing '👩🏾‍🦰👱🏾🧑🏾‍⚖️' gives '️⚖‍🏾🧑🏾👱🦰‍🏾👩', losing the skin tone on the last two emojis.
And don't get me started on the ZERO WIDTH JOINER (U+200D), which, per Wisdom/Awesome-Unicode, forces adjacent characters to be joined together (e.g., Arabic characters or supported emoji). It can be used to compose sequentially combined emoji.
Here are two examples of composed emojis whose inner elements must keep their order when "reversed".
👩🏾‍🦰 alone is (from a Unicode-to-code-points converter):
- 👩: woman (U+1F469)
- 🏾: dark skin tone (U+1F3FE)
- ZERO WIDTH JOINER (U+200D)
- 🦰: red hair (U+1F9B0)
Those should remain in the exact same order.
The "character" "judge" (meaning an abstract idea of the semantic value for "judge") can be represented with several glyphs or one glyph.
🧑🏾‍⚖️ is actually one composed glyph (composed here of two emojis), representing a judge. That sequence should not be inverted.
The program below correctly detects the ZERO WIDTH JOINER and does not invert the emojis it combines.
If you inspect that emoji, you will find it composed of:
- 🧑: adult (U+1F9D1)
- 🏾: dark skin tone (U+1F3FE)
- ZERO WIDTH JOINER (U+200D), discussed above
- ⚖: scale (U+2696)
- VARIATION SELECTOR-16 (U+FE0F), one of the Unicode combining characters (characters intended to modify other characters); here it requests that 'scale' be displayed emoji-style (in color, ⚖️, selected by VS16, U+FE0F) rather than text-style (monochrome, ⚖, selected by VS15, U+FE0E).
Again, that sequence order needs to be preserved.
Note: the actual judge emoji 👨🏾‍⚖️ uses MAN 👨 (U+1F468) instead of ADULT 🧑 (U+1F9D1) (plus the other characters listed above: dark skin tone, ZWJ, scale), and it is therefore rendered as one glyph instead of a cluster of graphemes.
Meaning: the single-glyph, official emoji for "judge" combines "man" with "scale" (resulting in the one glyph 👨🏾‍⚖️) instead of "adult" + "scale".
The latter, "adult" + "scale", is still treated as "one character": you cannot select just the scale, because of the ZWJ (ZERO WIDTH JOINER).
But that "character" is rendered as a composed glyph 🧑🏾‍⚖️: two glyphs, each a concrete written representation of a corresponding grapheme through codepoint + font.
Obviously, the first combination ("man" + "scale") results in the more expressive character 👨🏾‍⚖️.
See "The relationship between graphemes and abstract characters for textual representation"
Graphemes and orthographic characters are fairly concrete objects, in the sense that they are familiar to common users—non-experts, who are typically taught to work in terms of them from the time they first learn their “ABCs” (or equivalent from their writing system, of course).
In the domain of information systems, however, we have a different sense of character: abstract characters which are minimal units of textual representation within a given system.
These are, indeed, abstract in two important senses:
first, some of these abstract characters may not correspond to anything concrete in an orthography, as we saw above in the case of HORIZONTAL TAB.
Secondly, the concrete objects of writing (graphemes and orthographic characters) can be represented by abstract characters in more than one way, and not necessarily in a one-to-one manner, as we saw above in the case of “ô” being represented by a sequence <O, CIRCUMFLEX>.
Then: "From grapheme to codepoint to glyph":
Graphemes are the units in terms of which users are usually accustomed to thinking.
Within the computer, however, processes are done in terms of characters.
We don’t make any direct connection between graphemes and glyphs.
As we have defined these two notions here, there is no direct connection between them. They can only be related indirectly through the abstract characters.
This is a key point to grasp: the abstract characters are the element in common through which the others relate.
Full example in Go playground.
Reverse 'Hello, World' => 'dlroW ,olleH'
Reverse '👽👶⃠🎃' => '🎃👶⃠👽'
Reverse '👩🏾‍🦰👱🏾🧑🏾‍⚖️' => '🧑🏾‍⚖️👱🏾👩🏾‍🦰'
Reverse 'aͤoͧiͤ š́ž́ʟ́' => 'ʟ́ž́š́ iͤoͧaͤ'
Reverse 'H̙̖ell͔o̙̟͚͎̗̹̬ ̯W̖͝ǫ̬̞̜rḷ̦̣̪d̰̲̗͈' => 'd̰̲̗͈ḷ̦̣̪rǫ̬̞̜W̖͝ ̯o̙̟͚͎̗̹̬l͔leH̙̖'
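For comparison, a grapheme-aware reverse is much shorter in a language with a Unicode segmentation library. Here is a sketch in Python, assuming the third-party regex module (its \X pattern matches an extended grapheme cluster; recent versions keep combining marks and ZWJ emoji sequences attached to their base character):

import regex  # third-party module (pip install regex), not the stdlib re

def reverse_graphemes(s):
    # \X matches one extended grapheme cluster, so each base character
    # travels together with its combining marks when the order is reversed.
    return ''.join(reversed(regex.findall(r'\X', s)))

print(reverse_graphemes('Hello, World'))  # dlroW ,olleH
print(reverse_graphemes('aͤoͧiͤ š́ž́ʟ́'))  # ʟ́ž́š́ iͤoͧaͤ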
As you correctly identified, Go strings are immutable, so you cannot assign to byte or rune values at given indices.
Instead of reversing the string in place, create a copy of the string's runes, reverse those, and then return the resulting string.
For example (Go Playground):
func reverse(s string) string {
    rs := []rune(s)
    for i, j := 0, len(rs)-1; i < j; i, j = i+1, j-1 {
        rs[i], rs[j] = rs[j], rs[i]
    }
    return string(rs)
}

func main() {
    fmt.Println(reverse("Hello, World!"))
    // !dlroW ,olleH
    fmt.Println(reverse("Hello, 世界!"))
    // !界世 ,olleH
}
There are problems with this approach due to the intricacies of Unicode (e.g. combining diacritical marks) but this will get you started.

Prolog DCG Building/Recognizing Word Strings from Alphanumeric Characters

So I'm writing simple parsers for some programming languages in SWI-Prolog using Definite Clause Grammars. The goal is to return true if the input string or file is valid for the language in question, or false if the input string or file is not valid.
In almost all of the languages there is an "identifier" predicate. In most of the languages an identifier is defined in EBNF as one of the following: letter { letter | digit } or ( letter | digit ) { letter | digit }; that is to say, in the first case, a letter followed by zero or more alphanumeric characters, or, in the second case, an alphanumeric character followed by zero or more alphanumeric characters.
My input file is split into a list of word strings (i.e. someIdentifier1 = 3 becomes the list [someIdentifier1,=,3]). The reason for the string to be split into lists of words rather than lists of letters is for recognizing keywords defined as terminals.
How do I implement "identifier" so that it recognizes any alphanumeric string, or a string consisting of a letter followed by alphanumeric characters?
Is it possible or necessary to further split the word into letters for this particular predicate only, and if so how would I go about doing this? Or is there another solution, perhaps using SWI-Prolog libraries' built-in predicates?
I apologize for the poorly worded title of this question; however, I am unable to clarify it any further.
First, when you need to reason about individual letters, it is typically most convenient to reason about lists of characters.
In Prolog, you can easily convert atoms to characters with atom_chars/2.
For example:
?- atom_chars(identifier10, Cs).
Cs = [i, d, e, n, t, i, f, i, e, r, '1', '0'].
Once you have such characters, you can use predicates like char_type/2 to reason about properties of each character.
For example:
?- char_type(i, T).
T = alnum ;
T = alpha ;
T = csym ;
etc.
The general pattern to express identifiers such as yours with DCGs can look as follows:
identifier -->
    [L],
    { letter(L) },
    identifier_rest.

identifier_rest --> [].
identifier_rest -->
    [I],
    { letter_or_digit(I) },
    identifier_rest.
You can use this as a building block, and only need to define letter/1 and letter_or_digit/1. This is very easy with char_type/2.
Further, you can of course introduce an argument to relate such lists to atoms.
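For comparison only, the first identifier rule is a one-line regular expression in many other languages. Here is a Python sketch of the same check (ASCII-only, so deliberately narrower than what char_type/2 gives you in Prolog):

import re

# letter { letter | digit }: a letter followed by zero or more alphanumerics
IDENT = re.compile(r'[A-Za-z][A-Za-z0-9]*')

print(bool(IDENT.fullmatch('someIdentifier1')))  # True
print(bool(IDENT.fullmatch('1abc')))             # False (starts with a digit)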

How can I remove repeated characters in a string with R?

I would like to implement a function with R that removes repeated characters in a string. For instance, say my function is named removeRS, so it is supposed to work this way:
removeRS('Buenaaaaaaaaa Suerrrrte')
# [1] "Buena Suerte"
removeRS('Hoy estoy tristeeeeeee')
# [1] "Hoy estoy triste"
My function is going to be used with strings written in Spanish, where it is not common (or at least not correct) to find words that have more than three successive vowels; don't worry about the possible sentiment behind them. There are, however, words that can have two successive consonants (especially ll and rr), but we can leave those out of our function's concern.
So, to sum up, this function should replace any letter that appears at least three times in a row with a single instance of that letter. In one of the examples above, aaaaaaaaa is replaced with a.
Could you give me any hints to carry out this task with R?
I did not think very carefully about this, but here is a quick solution using backreferences in regular expressions:
gsub('([[:alpha:]])\\1+', '\\1', 'Buenaaaaaaaaa Suerrrrte')
# [1] "Buena Suerte"
() first captures a letter, \\1 refers back to that captured letter, and + matches it one or more times; putting these pieces together, the pattern matches a letter appearing two or more times in a row.
To include other characters besides alphanumerics, replace [[:alpha:]] with a regex matching whatever you wish to include.
I think you should pay attention to the ambiguities in your problem description. This is a first stab, but it clearly does not handle text like "Good Luck" in the manner you desire, since it collapses legitimate double letters too ("Good" becomes "God"):
removeRS <- function(str) paste(rle(strsplit(str, "")[[1]])$values, collapse="")
removeRS('Buenaaaaaaaaa Suerrrrte')
#[1] "Buena Suerte"
Since you want to replace letters that appear AT LEAST 3 times, here is my solution:
gsub("([[:alpha:]])\\1{2,}", "\\1", "Buennaaaa Suerrrtee")
#[1] "Buenna Suertee"
As you can see, the four a's have been reduced to a single a and the three r's to a single r, but the double n and the double e have not been changed.
As suggested above, you can replace [[:alpha:]] with a narrower class such as [a-zA-KM-Z]. Note that inside square brackets | is a literal character rather than an "or" operator, so to make the code affect only repetitions of a and e you would write [ae]:
gsub("([ae])\\1{2,}", "\\1", "Buennaaaa Suerrrtee")
# [1] "Buenna Suerrrtee"
# the triple r is not affected, and there is no triple e

Representing the strings we use in programming in math notation

I'm a programmer who has recently discovered how bad he is at mathematics and has decided to focus on it a bit from here on, so I apologize if my question insults your intelligence.
In mathematics, is there a concept of strings as used in programming, i.e. a sequence of characters?
As an example, say I wanted to translate the following into mathematical notation:
let s be a string of n number of characters.
The reason being, I would want to use that representation to find out other things about string s, such as its length: len(s).
How do you formally represent such a thing in mathematics?
Speaking more practically, let's say I wanted to explain the following function mathematically:
fitness(s,n) = 1 / |n - len(s)|
Or written in more "programming-friendly" sort of way:
fitness(s,n) = 1 / abs(n - len(s))
I used this function to explain how a fitness function for a given GA works; the question was about finding strings with 5 characters, and I needed the solutions to be sorted in ascending order according to their fitness score, given by the above function.
So my question is, how do you represent the above pseudo-code in mathematical notation?
You can use the notation of language theory, which is used to discuss things like regular languages, context free grammars, compiler theory, etc. A quick overview:
A set of characters is known as an alphabet. You could write: "Let A be the ASCII alphabet, a set containing the 128 ASCII characters."
A string is a sequence of characters. ε is the empty string.
A set of strings is formally known as a language. A common statement is, "Let s ∈ L be a string in language L."
Concatenating alphabets produces sets of strings (languages). A represents all 1-character strings; AA, also written A², is the set of all two-character strings. A⁰ is the set of all zero-length strings, and is precisely A⁰ = {ε}. (It contains exactly one string, the empty string.)
A* is special notation and represents the set of all strings over the alphabet A, of any length. That is, A* = A⁰ ∪ A¹ ∪ A² ∪ A³ ∪ ⋯. You may recognize this notation from regular expressions.
For length use absolute value bars. The length of a string s is |s|.
So for your statement:
let s be a string of n number of characters.
You could write:
Let A be a set of characters and s ∈ Aⁿ be a string of n characters. The length of s is |s| = n.
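If it helps to see the notation executed, here is a small Python sketch over a toy alphabet (the names A and A2 are purely illustrative):

from itertools import product

A = ['0', '1']  # a toy alphabet A = {0, 1}

# A^n is the set of all strings of length n over A; here n = 2
A2 = {''.join(p) for p in product(A, repeat=2)}
print(A2)  # the four strings '00', '01', '10', '11' (in some order)

s = '01'  # a string s in A^2, so |s| = 2
print(s in A2, len(s) == 2)  # True True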
Mathematically, you have explained fitness(s, n) just fine as long as len(s) is well-defined.
In CS texts, a string s over a set S is defined as a finite ordered list of elements of S and its length is often written as |s| - but this is only notation, and doesn't change the (mathematical) meaning behind your definition of fitness, which is pretty clear just how you've written it.
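To make that concrete, here is a minimal Python sketch of the fitness function, with one assumption of my own clearly flagged: the formula is undefined when len(s) == n (division by zero), so this sketch treats an exact-length match as maximal fitness.

def fitness(s, n):
    # Assumption (not in the original formula): an exact-length match
    # would divide by zero, so treat it as the best possible score.
    if len(s) == n:
        return float('inf')
    return 1 / abs(n - len(s))

# Sorting candidate strings in ascending order of fitness, as described:
candidates = ['abc', 'abcde', 'abcdefg', 'ab']
print(sorted(candidates, key=lambda s: fitness(s, 5)))
# ['ab', 'abc', 'abcdefg', 'abcde']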
