What is \0 in a string? - string

I am aware of string escape sequences, which can be used to insert special characters. For example, there is the \u---- escape, where each - is a hexadecimal digit, which can be used to insert any UTF-16 code unit by its code. The exception is emoji and other characters outside the Basic Multilingual Plane, which take up two such "characters" (a surrogate pair) because their code points do not fit in 16 bits. Some emoji are also a pictograph followed by U+FE0F. Anyway, what is the function of \0 in a string?
I have tried searching on Google, Stack Overflow, and even W3Schools' JavaScript String lesson, and could not find anything.

This is the escape sequence for the null character. More details: https://en.wikipedia.org/wiki/Null_character
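As a minimal sketch of what that means in practice, here is the same escape in Python (chosen only for illustration; the question is about JavaScript, where "\0" likewise denotes U+0000). The variable names are hypothetical.

```python
# "\0" is a single character whose code point is 0 (NUL).
s = "a\0b"

length = len(s)   # the NUL counts as one character
code = ord(s[1])  # its code point is 0

# Three spellings of the same character.
same = "\0" == "\x00" == "\u0000"

print(length, code, same)
```

The null character is an ordinary member of the string: it is counted in the length and can be compared like any other character, even though most consoles will not display it.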

Related

Why is .add_reaction not working with unicode emojis?

I cannot seem to get this single line of code to output anything except BAD REQUEST Unknown Emoji
await dBot.add_reaction(message,"\\U00000031")
I cannot find any reason online why this shouldn't work. What wonderful noob mistake am I making?
The string you're using isn't an escaped Unicode character, but an escaped backslash character followed by eight digit characters. You probably want only one backslash in the literal, which will let Python parse the literal into a single character as it seems you intend. I'm still not sure that will do what you expect though, since "\U00000031" is the character '1', not an emoji.
From your comment below, it sounds like the emoji you actually want is composed of two Unicode codepoints. The first is just the normal '1' character I discussed above, which you don't need any escapes to write. The second character is U+20E3 ('COMBINING ENCLOSING KEYCAP'), which can be written in a Python literal as '\u20E3' or '\U000020E3'. This puts a keyboard key image around whatever the previous character was, so the sequence "1\u20e3" will render as 1⃣ (which my browser doesn't handle too well, but yours might). I don't know for sure, but I'd be fairly confident that Discord would accept that, if it supports the 1 key emoji you're looking for at all (which I expect it does).
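The difference between the three literals discussed above can be checked without Discord at all. This is a small sketch using only the standard library; the variable names are made up for illustration.

```python
import unicodedata

wrong = "\\U00000031"  # an escaped backslash followed by nine ordinary characters
right = "\U00000031"   # a single character: the digit '1'
keycap = "1\u20e3"     # '1' followed by the combining keycap codepoint

wrong_len = len(wrong)
right_is_one = right == "1"
keycap_names = [unicodedata.name(c) for c in keycap]

print(wrong_len)
print(right_is_one)
print(keycap_names)
```

Printing the length and the codepoint names like this is a quick way to see what a string literal really contains before passing it to an API.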

Literal Strings [Lua 5.1]

I started to learn Lua (5.1) and came across literal strings, and I have no idea what the escape sequences in them do. The manual says \a is a bell but when I type
print('hello\athere')
The IDE prints a weird square with 'bel' written on it.
If someone could explain every one of them (the literal string escapes), that would be really helpful.
P.S. I use Sublime Text 3.
Only ASCII between 0x20 and 0x7E are printable characters. How other characters are output, including '\a' and '\b', is up to the implementation.
'\a', ASCII 7 (BEL), is designed to be used as an alert. A typical terminal makes an audible or visible alert when outputting '\a'. Your IDE chooses to show something else instead of an alert. That's OK since it's up to the implementation.
Such sequences are called "escape sequences", and are found in many different languages. They are used to encode non-printable characters such as newlines in literal (hardcoded) strings.
Lua supports the following escape sequences:
\a: Bell
\b: Backspace
\f: Form feed
\n: Newline
\r: Carriage return
\t: Tab
\v: Vertical tab
\\: Backslash
\": Double quote
\': Single quote
\ddd: Decimal value (ddd is up to three decimal digits)
\xNN: Hex value (Lua5.2/LuaJIT, NN is two hex digits)
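To see which byte each single-letter escape stands for, here is a rough sketch in Python, which happens to share the escapes \a, \b, \f, \n, \r, \t and \v with Lua. The dictionary names are hypothetical, and the numeric escapes are left out because their rules differ between languages.

```python
# Map the spelling of each escape to the character it produces.
escapes = {
    r"\a": "\a",  # bell
    r"\b": "\b",  # backspace
    r"\f": "\f",  # form feed
    r"\n": "\n",  # newline
    r"\r": "\r",  # carriage return
    r"\t": "\t",  # tab
    r"\v": "\v",  # vertical tab
}

# Look up the numeric code of each resulting character.
codes = {spelling: ord(char) for spelling, char in escapes.items()}

for spelling, code in codes.items():
    print(f"{spelling} -> {code}")
```

All of these codes are below 0x20, which is why none of them has a printable form of its own.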
A literal is nothing more than a value written directly in the code, e.g. 'some text'.
The '\a' is something different: a special "char" that is used to output a sound (it used the PC speaker some aeons ago).

string.sub issue with non-English characters

I need to get the first char of a text variable. I achieve this with one of the following simple methods:
string.sub(someText,1,1)
or
someText:sub(1,1)
If I do the following, I expect to get 'ñ' as the first letter. However, the result of either of the sub methods is 'Ã'
local someText = 'ñññññññ'
print('Test whole: '..someText)
print('first char: '..someText:sub(1,1))
print('first char with .sub: '..string.sub(someText,1,1))
Here are the results from the console:
2014-03-02 09:08:47.959 Corona Simulator[1701:507] Test whole: ñññññññ
2014-03-02 09:08:47.960 Corona Simulator[1701:507] first char: Ã
2014-03-02 09:08:47.960 Corona Simulator[1701:507] first char with .sub: Ã
It seems like the string.sub() function is encoding the returned value in UTF-8. Just for kicks I tried using the utf8_decode() function that's provided by Corona SDK. It was not successful. The simulator indicated that the function expected a number but got nil instead.
I also searched the web to see if anyone else had run into this issue. I found out that there is a fair amount of discussion on Lua, Corona, Unicode and UTF-8 but I did not come across anything that would address this specific problem.
Lua strings are 8-bit clean, which means strings in Lua are a stream of bytes. The UTF-8 character ñ has multiple bytes, but someText:sub(1,1) returns only the first single byte.
In UTF-8, all characters in the ASCII range have the same representation as in ASCII, that is, a single byte smaller than 128. Every other code point is encoded as a sequence of bytes in which the first byte is in the range 194-244 and the continuation bytes are in the range 128-191.
Because of this, you can use the pattern ".[\128-\191]*" to match a single UTF-8 CodePoint (not Grapheme):
for c in "ñññññññ":gmatch(".[\128-\191]*") do -- pretend the first string is in NFC
    print(c)
end
Output:
ñ
ñ
ñ
ñ
ñ
ñ
ñ
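The same idea can be sketched in Python by working on the raw UTF-8 bytes, which is roughly what a Lua string is. This is an illustration under that assumption, not Corona or Lua code; the regular expression mirrors the Lua pattern, and the names are hypothetical.

```python
import re

# Seven copies of U+00F1 (ñ); each one is the two bytes C3 B1 in UTF-8.
data = ("\u00f1" * 7).encode("utf-8")

# Taking a single byte, as someText:sub(1,1) does, cuts the character in half.
first_byte = data[:1]
mojibake = first_byte.decode("latin-1")  # how a Latin-1 display shows byte C3

# Any byte followed by zero or more continuation bytes (128-191).
code_point = re.compile(rb".[\x80-\xbf]*", re.DOTALL)
chars = [chunk.decode("utf-8") for chunk in code_point.findall(data)]

print(mojibake)
print(chars[0], len(chars))
```

Like the Lua pattern, this assumes the input is valid UTF-8 and splits by code point, not by grapheme.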
Regarding the used character set:
Just know what requirements you bake into your own code, and make sure those are actually satisfied.
There are various typical requirements:
ASCII-compatible (aka any byte < 128 represents an ASCII character and all ASCII characters are represented as themselves)
Fixed-Size vs Variable-Width (maybe self-synchronizing) Character Set
No embedded 0-bytes
Write your code so that it depends on as few of those requirements as possible, and document the ones it does depend on.
Regarding matching a single UTF-8 character: be sure what you mean by "UTF-8 character". Is it a glyph or a code point? AFAIK you need full Unicode tables for glyph matching. Do you actually have to get to this level at all?

Replace character with a safe character and vice-versa

Here's my problem:
I need to store sentences "somewhere" (it doesn't matter where).
The sentences must not contain spaces.
When I extract the sentences from that "somewhere", I need to restore the spaces.
So, before storing the sentence "I am happy" I could replace the spaces with a safe character, such as &. In C#:
theString.Replace(' ', '&');
This would yield 'I&am&happy'.
And when retrieving the sentence, I would do the reverse:
theString.Replace('&', ' ');
But what if the original sentence already contains the '&' character?
Say I did the same thing with the sentence 'I am happy & healthy'. With the design above, the string would come back as 'I am happy   healthy' (three spaces in a row), since the '&' char has been replaced with a space.
(Of course, I could change the & character to a more unlikely symbol, such as ¤, but I want this to be bullet proof)
I used to know how to solve this, but I forgot how.
Any ideas?
Thanks!
Fredrik
Maybe you can use url encoding (percent encoding) as an inspiration.
Characters that are not valid in a url are escaped by writing %XX where XX is a numeric code that represents the character. The % sign itself can also be escaped in the same way, so that way you never run into problems when translating it back to the original string.
There are probably other similar encodings, and for your own application you can use an & just as well as a %, but by using an existing encoding like this, you can probably also find existing functions to do the encoding and decoding for you.
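As a minimal sketch of the percent-encoding idea, here is a round trip using Python's standard urllib.parse module (the question uses C#, so treat this as an illustration of the technique, not a drop-in answer). The helper names encode_sentence and decode_sentence are hypothetical.

```python
from urllib.parse import quote, unquote


def encode_sentence(sentence: str) -> str:
    # safe="" makes quote() escape every reserved character, including '/'.
    return quote(sentence, safe="")


def decode_sentence(stored: str) -> str:
    return unquote(stored)


original = "I am happy & healthy, 100% sure"
stored = encode_sentence(original)
restored = decode_sentence(stored)

print(stored)
print(restored == original)
```

Because the escape character % is itself escaped as %25, the stored form contains no spaces and always decodes back to exactly the original sentence.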

find non LaTeX characters (eg. acute accents) with regex in vim

I was pasting bibtex references into bibex. Some names contain characters that LaTeX skips, for example á. Is there a way in Vim, or with a regex, to search for all characters that are skipped by LaTeX? One way I can think of is to write a regex that matches anything other than 0-9, a-z, A-Z and some characters like / \ $
I am not familiar with which characters LaTeX ignores, but if the file you are editing is encoded in UTF-8, you might try searching for characters outside the ASCII repertoire (0–127; or 32–127).
As a search command in Vim:
/[^\d0-\d127]
/[^\d32-\d127]
You can also use hex or octal instead of decimal; see :help /[]. This requires that l and \ not be present in the value of cpoptions (they are not present in the default state).
This should work for any encoding that is “the same as ASCII (where it is defined)” (i.e. UTF-8 and most “latin” encodings). If you are dealing with an encoding that clashes with ASCII, then you will need to refine the range specification.
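Outside Vim, the same "anything beyond ASCII 0-127" search can be sketched with Python's re module. This assumes the text has already been decoded (for example from UTF-8); the function name find_non_ascii and the sample entry are made up for illustration.

```python
import re

# Any character outside the ASCII range 0-127.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def find_non_ascii(text):
    """Return (line, column, character) for each non-ASCII character, 1-based."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in NON_ASCII.finditer(line):
            hits.append((line_no, match.start() + 1, match.group()))
    return hits


sample = "author = {Jos\u00e9 P\u00e9rez},\ntitle = {Plain ASCII title}"
hits = find_non_ascii(sample)

for line_no, column, char in hits:
    print(line_no, column, char)
```

Reporting line and column makes it easy to jump to each offending character in the editor afterwards.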

Resources