Are Unicode characters U+80 - U+BF valid? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 1 year ago.
From the code point table:
Code point <-> UTF-8 conversion
First code point  Last code point  Byte 1    Byte 2    Byte 3    Byte 4
U+0000            U+007F           0xxxxxxx
U+0080            U+07FF           110xxxxx  10xxxxxx
U+0800            U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
U+10000           U+10FFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
The table is not clear to me: U+80 in binary is 10000000b, which would be invalid as a leading byte (it does not start with 110b). I would think that producing it takes the bytes 0xC2 0x80?
I was under the impression that U+80 - U+BF were all invalid start sequences. However, Unicode tables state that they are valid code points reserved for one-byte control characters.
Could anyone out there clarify this for me?

You're confusing Unicode code points with their representation in UTF-8.
The Unicode code point U+0080 is represented in UTF-8 by a two-byte sequence, 11000010 10000000 in binary, C2 80 in hex.
(Note we do not write U+xx for the individual bytes of UTF-8).
What this
U+0080 U+07FF 110xxxxx 10xxxxxx
is telling you is that for code points in the range 0080 to 07FF, the 11 significant bits are distributed over the 11 'x' places in the two bytes of the equivalent UTF-8 value.
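The bit distribution described above can be sketched in Python. This is a minimal illustration, not a full encoder: the helper name encode_two_byte is hypothetical and it only handles the two-byte row of the table.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as 110xxxxx 10xxxxxx."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only U+0080..U+07FF use the two-byte form")
    byte1 = 0b11000000 | (code_point >> 6)          # top 5 of the 11 bits
    byte2 = 0b10000000 | (code_point & 0b00111111)  # low 6 bits
    return bytes([byte1, byte2])


encoded = encode_two_byte(0x80)
print(encoded.hex())                         # c280
print(encoded == "\u0080".encode("utf-8"))   # True

# A lone 0x80 byte is a continuation byte, so it cannot start a sequence.
try:
    b"\x80".decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.reason)                        # invalid start byte
```

So the code point U+0080 is valid, while the single byte 0x80 on its own is not valid UTF-8.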


Python3 decode bytes [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
Trying to decode bytes
2k2P3PKIfViQ1L6TTc7kYks6bpeat6pPH9qRrNcj1S2195TYz\x88}\x88\x88JKgqzeXz96zKrTX05D9bkJf1yCf
Is there a way to convert \x88 to a letter, or to hide it?
trying this
s = b'2k2P3PKIfViQ1L6TTc7kYks6bpeat6pPH9qRrNcj1S2195TYz\x88}\x88\x88JKgqzeXz96zKrTX05D9bkJf1yCf'
d = s.decode('utf-8')
but got error
*** UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 64: invalid start byte
Any Help?? Thanks in advance...
Why do you think it's UTF-8? UTF-8 is a specific, self-checking encoding; you can't just decode random bytes with it. If you just want to convert every byte to the equivalent Unicode ordinal, decode with latin-1. Decoding with cp1252 will even turn \x88 into a useful printable character. Or choose any other ASCII-compatible, one-byte-per-character encoding and see what it looks like. With no idea what the data is supposed to mean, any one-to-one bytes-to-text encoding works; it's the logic of your program that determines whether it's correct.
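As a rough sketch of the options mentioned above, using the byte string from the question. The variable names are illustrative, and the two errors= variants are an extra suggestion for the "hide it" part of the question rather than something the answer proposes.

```python
s = b'2k2P3PKIfViQ1L6TTc7kYks6bpeat6pPH9qRrNcj1S2195TYz\x88}\x88\x88JKgqzeXz96zKrTX05D9bkJf1yCf'

# latin-1 maps every byte to the code point with the same value, so it never fails.
as_latin1 = s.decode('latin-1')

# cp1252 maps 0x88 to U+02C6, a printable circumflex. A few other byte values
# are undefined in Python's cp1252 codec, so this is not safe for arbitrary data.
as_cp1252 = s.decode('cp1252')

# Staying with UTF-8 but hiding or marking the bytes that cannot be decoded.
hidden = s.decode('utf-8', errors='ignore')    # drops each undecodable byte
marked = s.decode('utf-8', errors='replace')   # substitutes U+FFFD for each one

print(as_cp1252)
print(hidden)
```

Which of these is right depends entirely on what the bytes were meant to be.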

Does there exist unicode "New-Line" (Line-break), like there is unicode `space` ( )? [closed]

Closed 2 years ago.
There are sites where you can get Unicode characters, such as a Unicode space, that you can copy-paste.
For example, inside the brackets below are two different UNICODE spaces, which you can copy-paste:
U+0020: ( )
U+2001: ( )
Does there exist a Unicode new-line that I could copy-paste? (Please note: I am not asking about the code, like U+000D or whatever is considered a new line. I want the "copyable" output, like the spaces above, which I have put in brackets and which can be copied.) So, if there is one, please paste it in your answer so that I can copy it, just as you can copy the Unicode space from the brackets above. I can't explain it better.
Does there exist a Unicode new-line, which I could copy-paste?
Yes, but it depends on the exact circumstances.
There are many Unicode line terminators, for example NEL (U+0085), but these do not survive being cut and pasted into this answer's text-area input field in the Chrome web browser. However, I can successfully copy and paste NEL back and forth between, for example, the Notepad and Vim text editors.
Of course, neither of these applications respects the meaning of this particular character.
You can cut and paste Unicode LF (U+000A) between, for example, Vim and Notepad and have it be treated appropriately, but I'm sure the two applications are potentially performing some conversion during the paste operations.
The way cut and paste works is platform dependent; the above is true of Windows 10 and may not be true on Android, iOS, Linux, OS X or other platforms.
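To see a few of these line terminators without relying on the clipboard, here is a small Python sketch. The selection of characters and their labels is mine, and it relies on str.splitlines(), which treats all of them as line boundaries; other tools may not.

```python
# A handful of Unicode line terminators, written as escapes so that they
# survive being copied through a web form.
candidates = {
    "LF": "\u000A",
    "CR": "\u000D",
    "NEL": "\u0085",
    "LINE SEPARATOR": "\u2028",
    "PARAGRAPH SEPARATOR": "\u2029",
}

for label, ch in candidates.items():
    parts = f"a{ch}b".splitlines()
    print(f"U+{ord(ch):04X} {label}: splits into {parts}")
```

Whether a given editor or terminal honours each character as a line break is a separate matter, as noted above.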

Why does the permission expression 653 mean rw- r-x -wx [closed]

Closed 5 years ago.
I'm not sure whether it is chmod or umask, but this question is from an exam, where it asks for the meaning of 653.
I thought 6 means Read(4) + Write(2), 5 means Read(4) plus Execute(1), and 3 means Write(2) plus Execute(1).
So based on that, I can't make sense of the answer, as I thought it would be something like rw r+x w+x.
Why does the supposedly correct answer have the minus (-) sign instead of the plus (+)?
When permissions are displayed as strings, a dash in a given position means that bit isn't set. And for each of the three octal digits that make up the basic mode, the read bit is always shown in the first position, the write bit is always shown in the second position, and the execute bit is always shown in the third position*. So 0653 appears e.g. in the output of ls -l as rw-r-x-wx.
Plusses and minuses can also be used when setting or unsetting bits with chmod, e.g. chmod u+r (set the read bit in the left-most octal digit), chmod g-w (unset the write bit in the middle octal digit). A plus or minus in this syntax has nothing to do with how it's displayed.
* – Note that the character used to represent the execute bit (x) is sometimes overloaded to show additional information, such as the setuid and setgid bits (s or S) and the sticky bit (t or T).
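The display rule described above (one fixed position per bit, a dash where the bit is unset) can be sketched in a few lines of Python. The function name mode_to_string is hypothetical, and the sketch ignores the file-type character and the setuid, setgid and sticky bits.

```python
def mode_to_string(mode: int) -> str:
    """Render the nine basic permission bits the way ls -l shows them."""
    out = []
    for shift in (6, 3, 0):                     # user, group, other
        digit = (mode >> shift) & 0b111         # one octal digit
        out.append("r" if digit & 4 else "-")   # read is always first
        out.append("w" if digit & 2 else "-")   # write is always second
        out.append("x" if digit & 1 else "-")   # execute is always third
    return "".join(out)


print(mode_to_string(0o653))  # rw-r-x-wx
```

Each dash is simply a placeholder for a bit that is not set, which is why 5 (read plus execute) comes out as r-x.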

Bash Character transformations [closed]

Closed 6 years ago.
Here's the latest from my terminal.
E┬$?: N⎺ ⎽┤c▒ °☃┌e ⎺⎼ d☃⎼ec├⎺⎼≤
┴▒±⎼▒┼├#└e⎽⎺⎽:·/de┴e┌⎺⎻└e┼├/⎽⎻┌☃├├e⎼$ └▒┼ └▒⎼▒├▒⎺┼
N⎺ └▒┼┤▒┌ e┼├⎼≤ °⎺⎼ └▒⎼▒├▒⎺┼
See '└▒┼ 7 ┤┼d⎺c┤└e┼├ed' °⎺⎼ ▒e┌⎻ ┬▒e┼ └▒┼┤▒┌ ⎻▒±e⎽ ▒⎼e ┼⎺├ ▒┴▒☃┌▒b┌e↓
┴▒±⎼▒┼├#└e⎽⎺⎽:·/de┴e┌⎺⎻└e┼├/⎽⎻┌☃├├e⎼$ ⎻☃┼± ±⎺±⎺┌e↓c⎺└
PING ±⎺±⎺┌e↓c⎺└ (216↓58↓217↓36) 56(84) b≤├e⎽ ⎺° d▒├▒↓
64 b≤├e⎽ °⎼⎺└ de┼▮3⎽1▮↑☃┼↑°36↓1e1▮▮↓┼e├ (216↓58↓217↓36): ☃c└⎻_⎽e─=1 ├├┌=63 ├☃└e=29↓▮ └⎽
64 b≤├e⎽ °⎼⎺└ de┼▮3⎽1▮↑☃┼↑°36↓1e1▮▮↓┼e├ (216↓58↓217↓36): ☃c└⎻_⎽e─=2 ├├┌=63 ├☃└e=32↓4 └⎽
64 b≤├e⎽ °⎼⎺└ de┼▮3⎽1▮↑☃┼↑°36↓1e1▮▮↓┼e├ (216↓58↓217↓36): ☃c└⎻_⎽e─=3 ├├┌=63 ├☃└e=27↓4 └⎽
64 b≤├e⎽ °⎼⎺└ de┼▮3⎽1▮↑☃┼↑°36↓1e1▮▮↓┼e├ (216↓58↓217↓36): ☃c└⎻_⎽e─=4 ├├┌=63 ├☃└e=25↓9 └⎽
^C
↑↑↑ ±⎺±⎺┌e↓c⎺└ ⎻☃┼± ⎽├▒├☃⎽├☃c⎽ ↑↑↑
4 ⎻▒c┐e├⎽ ├⎼▒┼⎽└☃├├ed← 4 ⎼ece☃┴ed← ▮% ⎻▒c┐e├ ┌⎺⎽⎽← ├☃└e 32▮3└⎽
⎼├├ └☃┼/▒┴±/└▒│/└de┴ = 25↓927/28↓721/32↓426/2↓415 └⎽
┴▒±⎼▒┼├#└e⎽⎺⎽:·/de┴e┌⎺⎻└e┼├/⎽⎻┌☃├├e⎼$ E┴e⎼≤├▒☃┼± ☃⎽ ☃┼ ▒ ┼e┬ ┌▒┼±┤▒±e
E┴e⎼≤├▒☃┼±: c⎺└└▒┼d ┼⎺├ °⎺┤┼d
┴▒±⎼▒┼├#└e⎽⎺⎽:·/de┴e┌⎺⎻└e┼├/⎽⎻┌☃├├e⎼$ ┌⎽
▒⎻⎻↓┘⎽ c⎺┼°☃± D⎺c┐e⎼°☃┌e d⎺c┐e⎼☃≥e↓⎽▒ ┼⎺de_└⎺d┤┌e⎽ ⎻▒c┐▒±e↓┘⎽⎺┼ Re▒d└e↓└d README↓└d
┴▒±⎼▒┼├#└e⎽⎺⎽:·/de┴e┌⎺⎻└e┼├/⎽⎻┌☃├├e⎼$
Now, I'm aware that running cat on binary files causes all kinds of crazy stuff to happen to your terminal. But I've never asked about it before. I'm trying to track down what exactly would cause this character transformation.
Everything seems to work normally. I can't read the output, but ping commands produce output that behaves as I would expect. ls has the same color coding. Custom scripts have the same output (just transformed).
What character sequence would cause this consistent transformation?
Typing reset puts me back into sanity.
Am I getting a character transformation via console codes? If so, can I prank friends with this? (alias ls=ls #+some character transformation). Note: I don't want this to have a possibility of ls turning into rm -rf or anything else malicious.
This is caused by the smacs (enter_alt_charset_mode) terminfo sequence being entered into the terminal. It can be switched back with the rmacs (exit_alt_charset_mode) terminfo sequence.
echo "$(tput rmacs)"
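To illustrate what the alternate character set does to ordinary text, here is a Python sketch that applies part of the DEC Special Graphics mapping to a string. The function name to_alt_charset is hypothetical, the table covers only some lowercase letters, and a real terminal does this substitution itself once it has received the smacs sequence; no program rewrites the output.

```python
# Partial DEC Special Graphics table: ASCII letter -> line-drawing glyph.
# Characters not listed here are left unchanged by this sketch.
ALT_CHARSET = str.maketrans({
    "a": "▒", "f": "°", "g": "±", "j": "┘", "k": "┐",
    "l": "┌", "m": "└", "n": "┼", "o": "⎺", "p": "⎻",
    "q": "─", "r": "⎼", "s": "⎽", "t": "├", "u": "┤",
    "v": "┴", "w": "┬", "x": "│", "y": "≤", "z": "≥",
})


def to_alt_charset(text: str) -> str:
    """Show how text would look while the alternate charset is active."""
    return text.translate(ALT_CHARSET)


print(to_alt_charset("vagrant"))  # ┴▒±⎼▒┼├
print(to_alt_charset("ls"))       # ┌⎽
```

This matches the prompt and the ls command in the transcript above, which is consistent with the terminal having been switched into the alternate character set.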

Converting ASCII text to Unicode with formatting [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 11 years ago.
Is there a free tool on Linux for converting ASCII text to Unicode while keeping the original text formatting?
iconv can convert between different encodings, if that's what you mean.
Sure, it's called cat:
cat myasciifile > myunicodefile
Now myunicodefile consists of Unicode code points, encoded in the popular UTF-8 encoding. Note that this assumes that myasciifile consists only of legal ASCII characters (i.e. in the range 0-127).
An alternative to this is cp.
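The reason a plain copy is enough is that every ASCII character has the same single-byte representation in UTF-8. A quick Python check, using a made-up sample string:

```python
# Sample text with whitespace formatting; any pure-ASCII content would do.
text = "plain ASCII text\n\tindented line\n"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

print(ascii_bytes == utf8_bytes)           # True
print(all(b < 128 for b in utf8_bytes))    # True
```

The bytes are identical, so spacing, tabs and line breaks are preserved as well.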
