What does "?\s" mean in Elixir?

In the Elixir-documentation covering comprehensions I ran across the following example:
iex> for <<c <- " hello world ">>, c != ?\s, into: "", do: <<c>>
"helloworld"
I sort of understand the whole expression now, but I can't figure out what the "?\s" means.
I know that it somehow matches and thus filters out the spaces, but that's where my understanding ends.
Edit: I have now figured out that it resolves to 32, which is the character code of a space, but I still don't know why.

Erlang has char literals denoted by a dollar sign.
Erlang/OTP 22 [erts-10.6.1] [...]
Eshell V10.6.1 (abort with ^G)
1> $\s == 32.
%%⇒ true
In the same way, Elixir has char literals that, according to the code documentation, act exactly like Erlang char literals:
This is exactly what Erlang does with Erlang char literals ($a).
Basically, ?\s is exactly the same as ?  (question mark followed by a space.)
# ⇓ space here
iex|1 ▶ ?\s == ?
warning: found ? followed by code point 0x20 (space), please use ?\s instead
There is nothing special with ?\s, as you might see:
for <<c <- " hello world ">>, c != ?o, into: "", do: <<c>>
#⇒ " hell wrld "
Ruby also uses the ?c notation for char literals:
main> ?\s == ' '
#⇒ true

? is a literal that gives you the following character's codepoint (https://elixir-lang.org/getting-started/binaries-strings-and-char-lists.html#utf-8-and-unicode). For characters that cannot be expressed literally (space is just one of them, but there are more: tab, carriage return, ...) the escaped sequence should be used instead. So ?\s gives you the codepoint for space:
iex> ?\s
32

Related

python Using variable in re.search source.error("bad escape %s" % escape, len(escape)) [duplicate]

I want to use input from a user as a regex pattern for a search over some text. It works, but how can I handle cases where the user enters characters that have meaning in regex?
For example, the user wants to search for Word (s): the regex engine will take the (s) as a group. I want it to be treated as the string "(s)". I can run replace on the user input and replace the ( with \( and the ) with \), but the problem is that I would need to do a replace for every possible regex symbol.
Do you know a better way?
Use the re.escape() function for this:
4.2.3 re Module Contents
escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
A simplistic example: search for any occurrence of the provided string, optionally followed by 's', and return the match object.
import re

def simplistic_plural(word, text):
    word_or_plural = re.escape(word) + 's?'
    # re.search finds the pattern anywhere in text; re.match would only try the start
    return re.search(word_or_plural, text)
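Applied to the "Word (s)" case from the question, the difference can be sketched as follows. The helper name find_literal and the sample text are made up for illustration; only re.escape() and re.search() come from the standard library.

```python
import re

def find_literal(needle, text):
    """Search text for needle, treating every character of needle literally."""
    return re.search(re.escape(needle), text)

# Hypothetical sample text, not taken from the question.
text = "Please review Word (s) before Friday."

# Without escaping, "(s)" is parsed as a capture group, so the pattern
# looks for "Word s" and does not find it in the text.
unescaped = re.search("Word (s)", text)

# With escaping, the parentheses are matched as ordinary characters.
escaped = find_literal("Word (s)", text)

print(unescaped)
print(escaped.group(0))
```

Because the escaping happens inside the helper, callers can pass user input straight through without knowing which characters are special.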
You can use re.escape():
re.escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
>>> import re
>>> re.escape('^a.*$')
'\\^a\\.\\*\\$'
If you are using a Python version < 3.7, this will escape non-alphanumerics that are not part of regular expression syntax as well.
If you are using a Python version < 3.7 but >= 3.3, this will escape non-alphanumerics that are not part of regular expression syntax, except for specifically underscore (_).
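Unfortunately, re.escape() is not suited for the replacement string (the output below is from a Python version older than 3.3, where re.escape() still escaped the underscore):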
>>> re.sub('a', re.escape('_'), 'aa')
'\\_\\_'
A solution is to put the replacement in a lambda:
>>> re.sub('a', lambda _: '_', 'aa')
'__'
because the return value of the lambda is treated by re.sub() as a literal string.
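A slightly larger sketch of the same idea, assuming a replacement text that contains backslashes (the Windows-style path here is a hypothetical example). Passed as a plain string, the \n in it would be turned into a newline and the \1 would be read as a group reference. Two options that keep the text literal are a callable, or doubling the backslashes by hand:

```python
import re

# Hypothetical user-supplied replacement text containing backslashes.
replacement = r"C:\new\1"

# Option 1: a callable. Its return value is inserted without any
# processing of backslash escapes.
via_callable = re.sub("X", lambda match: replacement, "path=X")

# Option 2: double every backslash so that re.sub() turns each pair
# back into a single literal backslash.
via_escaping = re.sub("X", replacement.replace("\\", "\\\\"), "path=X")

print(via_callable)
print(via_escaping)
```

Both variants should produce the same string; the callable form avoids having to remember which characters are special in a replacement template.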
Usually you escape the string that you feed into a regex so that the regex treats those characters literally. Remember that you usually type strings into your computer and the computer inserts the specific characters. When you see \n in your editor, it's not really a newline until the parser decides it is; it's two characters. Once Python parses it as part of a string literal, print will display it as a newline, but in the text you see in the editor it's likely just the character for backslash followed by n. If you write r"\n", then Python will always interpret it as the raw thing you typed in (as far as I understand).
To complicate things further, regexes have a syntax of their own. The regex parser interprets the string it receives differently than Python's print would. I believe this is why we are advised to pass raw strings like r"(\n+)", so that the regex receives what you actually typed. However, the regex will receive a parenthesis and won't match it as a literal parenthesis unless you tell it to explicitly, using the regex's own syntax rules. For that you need something like r"(\fun \( x : nat \) :)": here the first pair of parentheses won't be matched literally, since without backslashes it is a capture group, but the second pair will be matched as literal parentheses.
Thus we usually call re.escape(regex) to escape the things we want to be interpreted literally, i.e. things that the regex parser would otherwise treat specially (parentheses, spaces, etc.) get escaped. For example, here is code I have in my app:
# Escapes non-alphanumeric characters to help match an arbitrary literal string. I think the reason this is here is to help differentiate the things escaped from the regex we are inserting in the next line and the literal things we wanted escaped.
__ppt = re.escape(_ppt)  # so that e.g. a parenthesis ( is not interpreted as a way to group but literally
e.g. see these strings:
_ppt
Out[4]: '(let H : forall x : bool, negb (negb x) = x := fun x : bool =>HEREinHERE)'
__ppt
Out[5]: '\\(let\\ H\\ :\\ forall\\ x\\ :\\ bool,\\ negb\\ \\(negb\\ x\\)\\ =\\ x\\ :=\\ fun\\ x\\ :\\ bool\\ =>HEREinHERE\\)'
print(rf'{_ppt=}')
_ppt='(let H : forall x : bool, negb (negb x) = x := fun x : bool =>HEREinHERE)'
print(rf'{__ppt=}')
__ppt='\\(let\\ H\\ :\\ forall\\ x\\ :\\ bool,\\ negb\\ \\(negb\\ x\\)\\ =\\ x\\ :=\\ fun\\ x\\ :\\ bool\\ =>HEREinHERE\\)'
the double backslashes I believe are there so that the regex receives a literal backslash.
btw, I am surprised it printed double backslashes instead of a single one. If anyone can comment on that it would be appreciated. I'm also curious how to match literal backslashes now in the regex. I assume it's 4 backslashes but I honestly expected only 2 would have been needed due to the raw string r construct.
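On the doubled backslashes: one likely explanation is that both the interactive prompt and the f-string form {name=} display the repr() of a string, and repr() writes every backslash as two. The string itself holds only one backslash per escaped character. A small sketch, using a shorter input than the one above:

```python
import re

escaped = re.escape("(x)")

print(repr(escaped))  # the form shown by the prompt and by f"{escaped=}"
print(escaped)        # the actual characters handed to the regex engine
print(len(escaped))   # 5: backslash, "(", "x", backslash, ")"

# To match one literal backslash, the regex itself needs two backslash
# characters. In a raw string that is typed as two, in a normal string
# literal as four.
pattern_raw = r"\\"
pattern_plain = "\\\\"
print(pattern_raw == pattern_plain)

backslashes = re.findall(pattern_raw, r"a\b\c")
print(len(backslashes))
```

So with a raw string, two backslashes are enough; four are only needed when the pattern is written as an ordinary string literal.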

Why does "\1" inside a triple-quoted string evaluate to a unicode 0x1 code point

I wanted a String containing a text \1.
What I did was (the real string was longer but it's not important):
'''
\1
'''
Which resulted in a String containing a unicode 0x1 codepoint.
I think what I should've done is just escape the backslash like this:
'''
\\1
'''
What I don't understand is why Groovy didn't report an error here. I thought unicode escapes are supposed to look like \u1?
Instead of a syntax error I got a runtime exception when I tried to put this String into an XML element:
An invalid XML character (Unicode: 0x1) was found in the element content of the document.
The \ (backward slash) symbol is an escape symbol. If you mean to use it literally, you must escape it itself: \\.
When you escape a character, it is interpreted as having a special meaning. In the case of the \1 sequence, it is read as an octal escape, which yields the 0x01 codepoint.
This is the same in Java Strings.
If you want to not have to escape characters in Groovy, use slashy strings:
def x = /\1/
assert x == "\\1"
which also works as multiline:
def x = /
\1
/

SWI Prolog escape quotes

I need to put " " around a string in Prolog.
I get the input from another program and, as it looks, I can't escape the " in that program, so I have to add the " in Prolog; otherwise the Prolog statement doesn't work.
Thanks for your help!
For a discussion of strings see here, they are SWI-Prolog specific but use the same escape rules as atoms. There are many ways to enter quotes into an atom in a Prolog text:
1) Doubling them. So for example 'can''t be' is an atom,
with a single quote as the fourth character, and no other
single quotes in it.
2) Escaping them, with the backslash. So for example
'can\'t be' is the same atom as 'can''t be'.
3) Character coding them, using octal code and a closing back slash.
So for example 'can\47\t be' is the same atom as 'can''t be'.
4) Character coding them, using hex code and a closing back slash.
So for example 'can\x27\t be' is the same atom as 'can''t be'.
The above possibilities are all defined in the ISO standard. A
Prolog implementation might define further possibilities.
Bye
P.S.: Here is an example run in SWI-Prolog, using a different
example character. In the first example query below, you don't
need doubling, doubling can only be done for the surrounding quote.
The last example query below shows a SWI-Prolog specific syntax
which is not found in the ISO standard, namely using a backslash
u with a fixed width hex code:
Welcome to SWI-Prolog (Multi-threaded, 64 bits, Version 7.1.33)
Copyright (c) 1990-2015 University of Amsterdam, VU Amsterdam
?- X = 'she said "bye"'.
X = 'she said "bye"'.
?- X = 'she said \"bye\"'.
X = 'she said "bye"'.
?- X = 'she said \42\bye\42\'.
X = 'she said "bye"'.
?- X = 'she said \x22\bye\x22\'.
X = 'she said "bye"'.
?- X = 'she said \u0022bye\u0022'.
X = 'she said "bye"'.

define a character string containing "

I wish to define a character variable as: a"", as in: my.string <- 'a""' Nothing I have tried works. I always get: "a\"\"", or some variation thereof.
I have been reading the documentation for: grep, strsplit, regex, substr, gregexpr and other functions for clues on how to tell R that " is a character I want to keep unchanged, and I have tried maybe a hundred variations of a"" by adding \\, \, /, //, [], _, $, [, #.
The only potential example I can find on the internet of a string including " is: ‘{}>=40*" years"’ from here: http://cran.r-project.org/doc/manuals/R-lang.html However, that example is for performing a mathematical operation.
Sorry for such a very basic question. Thank you for any advice.
The backslashes are an artifact of the print method. In fact, the default print surrounds your string with quotes. You can disable this by setting the argument quote to FALSE.
For example, you can use:
print(my.string,quote=FALSE)
[1] a""
But I would use cat or write, like this:
cat(my.string)
a""
write(my.string,"")
a""
Using substr, one sees that the backslashes seem just to be an artefact of printing:
substr(my.string,2,2)
gives
[1] "\""
also, the string length is as you want it:
> nchar(my.string)
[1] 3
if you want to print your string without the backslashes, use noquote :
> noquote(my.string)
[1] a""

How to break a big lua string into small ones

I have a big string (a base64 encoded image) and it is 1050 characters long. How can I append a big string formed of small ones, like this in C
function GetIcon()
return "Bigggg string 1"\
"continuation of string"\
"continuation of string"\
"End of string"
According to Programming in Lua 2.4 Strings:
We can delimit literal strings also by matching double square brackets [[...]]. Literals in this bracketed form may run for several lines, may nest, and do not interpret escape sequences. Moreover, this form ignores the first character of the string when this character is a newline. This form is especially convenient for writing strings that contain program pieces; for instance,
page = [[
<HTML>
<HEAD>
<TITLE>An HTML Page</TITLE>
</HEAD>
<BODY>
Lua
[[a text between double brackets]]
</BODY>
</HTML>
]]
This is the closest thing to what you are asking for, but using the above method keeps the newlines embedded in the string, so this will not work directly.
You can also do this with string concatenation (using ..):
value = "long text that" ..
" I want to carry over" ..
"onto multiple lines"
Most answers here solve this issue at run-time and not at compile-time.
Lua 5.2 introduces the escape sequence \z to solve this problem elegantly without incurring any run-time expense.
> print "This is a long \z
>> string with \z
>> breaks in between, \z
>> and is spanning multiple lines \z
>> but still is a single string only!"
This is a long string with breaks in between, and is spanning multiple lines but still is a single string only!
\z skips all subsequent whitespace characters in a string literal [1], up to the first non-space character. This works for non-multiline literal text too.
> print "This is a simple \z string"
This is a simple string
From Lua 5.2 Reference Manual
The escape sequence '\z' skips the following span of white-space characters, including line breaks; it is particularly useful to break and indent a long literal string into multiple lines without adding the newlines and spaces into the string contents.
1: All escape sequences, including \z, work only on short literal strings ("…", '…') and, understandably, not on long literal strings ([[...]], etc.)
I'd put all chunks in a table and use table.concat on it. This avoids the creation of new strings at every concatenation. For example (without counting overhead for strings in Lua):
             -- bytes used
foo="1234".. --          4  = 4
    "4567".. -- 4  + 4 + 8  = 16
    "89ab"   -- 16 + 4 + 12 = 32
             -- |    |   |    \_ grand total after concatenation on last line
             -- |    |   \_ second operand of concatenation
             -- |    \_ first operand of concatenation
             -- \_ total size used until last concatenation
As you can see, this explodes pretty rapidly. It's better to:
foo=table.concat{
"1234",
"4567",
"89ab"}
Which will take about 3*4+12=24 bytes.
Have you tried the string.sub(s, i [, j]) function?
You may like to look here:
http://lua-users.org/wiki/StringLibraryTutorial
This:
return "Bigggg string 1"\
"continuation of string"\
"continuation of string"\
"End of string"
C/C++ syntax causes the compiler to see it all as one large string. It is generally used for readability.
The Lua equivalent would be:
return "Bigggg string 1" ..
"continuation of string" ..
"continuation of string" ..
"End of string"
Do note that the C/C++ syntax is compile-time, while the Lua equivalent likely does the concatenation at runtime (though the compiler could theoretically optimize it). It shouldn't be a big deal though.
