In older versions of heapless there was a from_utf8 method, but it has been removed.
How do you convert bytes to a string in an embedded system (no_std)?
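One option that works without the standard library is core::str::from_utf8, which borrows the bytes and validates them as UTF-8 without allocating. A minimal sketch (the bytes_as_str helper name is just for illustration, and nothing here assumes the current heapless API):

    use core::str;

    // Borrow a byte slice as &str without allocating.
    // Returns None if the bytes are not valid UTF-8.
    fn bytes_as_str(bytes: &[u8]) -> Option<&str> {
        str::from_utf8(bytes).ok()
    }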
I'm trying to find a definitive answer to: what is the maximum size of a Terraform value of type string?
I've been searching and googling and can't seem to find it defined anywhere. Does anyone have a reference they could point me to?
Thanks in advance,
Bill W
The length of strings in Terraform is constrained in two main ways:
Terraform internally tracks the length of the string, which is stored as an integer type with a limited range.
Strings need to exist in system memory as a consecutive sequence of bytes.
The first of these is directly answerable: Terraform tracks the length of a string using an integer type large enough to represent all pointers on the host platform. In practical terms, that means a 64-bit integer when you're using a 64-bit build and a 32-bit integer when you're using a 32-bit build.
That means that there's a hard upper limit imposed by the maximum value of that integer. Terraform is internally tracking the length of the UTF-8 representation of the string in bytes, and so this upper limit is measured in bytes rather than in characters:
32-bit systems: 4,294,967,295 bytes
64-bit systems: 18,446,744,073,709,551,615 bytes
Terraform stores strings in memory in Unicode NFC normal form, UTF-8 encoded, so the number of characters that fits within that byte limit depends on how many bytes each character takes up in the UTF-8 encoding. ASCII characters take only one byte, but other characters can require up to four bytes.
A string of the maximum representable length would take up the entire address space of the Terraform process, which is impossible (there needs to be room for the Terraform application code, libraries, and kernel space too!), so in practice the available memory on your system is the more relevant limit. That limit depends on the characteristics of the system where you're running Terraform, and so isn't answerable in a general sense.
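As a side illustration of the bytes-versus-characters point (sketched in Rust rather than Terraform, since Rust strings are also UTF-8 and make the byte counts easy to see): the same number of characters can occupy different numbers of bytes, so a limit measured in bytes doesn't correspond to a fixed number of characters.

    fn main() {
        let ascii = "test";     // every character is 1 byte in UTF-8
        let accented = "café";  // 'é' takes 2 bytes in UTF-8
        assert_eq!(ascii.len(), 4);              // 4 bytes
        assert_eq!(ascii.chars().count(), 4);    // 4 characters
        assert_eq!(accented.len(), 5);           // 5 bytes...
        assert_eq!(accented.chars().count(), 4); // ...but only 4 characters
    }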
I want to convert the first 10 bytes of an array to a string.
If I do String::from_utf8_lossy(), this will return &str.
Do I understand correctly that &str is the address of those 10 bytes, and that in fact memory will be allocated only to create the reference?
Quoting from the docs for String::from_utf8_lossy:
This function returns a Cow<'a, str>. If our byte slice is invalid UTF-8, then we need to insert the replacement characters, which will change the size of the string, and hence, require a String. But if it's already valid UTF-8, we don't need a new allocation. This return type allows us to handle both cases.
So it doesn't return a &str, but rather Cow<str>, and only allocates if necessary to replace invalid bytes with "�".
In general, though, if a function actually returns &str, that &str won't be (newly) allocated. It'll either be static (embedded in the binary itself) or will have a lifetime derived from some argument to the function (e.g. String::trim).
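A small sketch of both cases, matching on the Cow to see whether an allocation happened (the example byte values are arbitrary):

    use std::borrow::Cow;

    fn main() {
        // First 10 bytes of an array: "Hello!" plus NUL padding, all valid UTF-8.
        let bytes = [72, 101, 108, 108, 111, 33, 0, 0, 0, 0];
        match String::from_utf8_lossy(&bytes[..10]) {
            // Valid UTF-8: a borrowed view of the original bytes, no allocation.
            Cow::Borrowed(s) => println!("borrowed, no allocation: {:?}", s),
            // Invalid UTF-8: a new String was allocated with U+FFFD replacements.
            Cow::Owned(s) => println!("owned, allocated: {:?}", s),
        }

        // Bytes that aren't valid UTF-8 force the owned/allocated case.
        let bad = [0xFF, 0xFE, b'h', b'i'];
        assert!(matches!(String::from_utf8_lossy(&bad), Cow::Owned(_)));
    }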
I'm currently learning assembly programming by following Kip Irvine's book "Assembly Language for x86 Processors".
In the book, the author states:
The most common type of string ends with a null byte (containing 0).
Called a null-terminated string
In a subsequent section of the book, the author gives a string example without the null byte:
greeting1 \
BYTE "Welcome to the Encryption Demo program "
So I was just wondering: what is the difference between a null-terminated string and a string that is not terminated by null in x86 assembly language? Are they interchangeable? Or are they not equivalent to each other?
There's nothing specific to asm here; it's the same issue in C. It's all about how you store strings in memory and keep track of where they end.
what is the difference between a null-terminated string and a string that is not terminated by null?
A null-terminated string has a 0 byte after it, so you can find the end with strlen (e.g. with a slow repne scasb). This makes it usable as an implicit-length string, like C uses.
The SO question "NASM Assembly - what is the ", 0" after this variable for?" explains the NASM syntax for creating one in static storage with db, and "db usage in nasm, try to store and print string" shows what happens when you forget the 0 terminator.
Are they interchangeable?
If you know the length of a null-terminated string, you can pass pointer+length to a function that wants an explicit-length string. That function will never look at the 0 byte, because you will pass a length that doesn't include the 0 byte. It's not part of the string data proper.
But if you have a string without a terminator, you can't pass it to a function or system-call that wants a null-terminated string. (If the memory is writeable, you could store a 0 after the string to make it into a null-terminated string.)
In Linux, many system calls take strings as C-style implicit-length null-terminated strings. (i.e. just a char* without passing a length).
For example, open(2) takes a string for the path: int open(const char *pathname, int flags); You must pass a null-terminated string to the system call. It's impossible to create a file with a name that includes a '\0' in Linux (same as most other Unix systems), because all the system calls for dealing with files use null-terminated strings.
OTOH, write(2) takes a memory buffer which isn't necessarily a string. It has the signature ssize_t write(int fd, const void *buf, size_t count);. It doesn't care if there's a 0 at buf+count because it only looks at the bytes from buf to buf+count-1.
You can pass a string to write(). It doesn't care. It's basically just a memcpy into the kernel's pagecache (or into a pipe buffer or whatever for non-regular files). But like I said, you can't pass an arbitrary non-terminated buffer as the path arg to open().
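The same constraint is visible from higher-level languages. As a Rust-flavored sketch (an illustration, not part of the syscall discussion above): CString builds the kind of null-terminated string that open() expects, appends the terminator for you, and rejects interior 0 bytes, while an explicit-length buffer of the kind write() takes can contain 0 bytes freely.

    use std::ffi::CString;

    fn main() {
        // CString appends the terminating 0 byte for you...
        let path = CString::new("demo.txt").unwrap();
        assert_eq!(path.as_bytes_with_nul(), b"demo.txt\0");

        // ...and rejects an interior 0, since a null-terminated consumer
        // like open(2) would treat everything after it as cut off.
        assert!(CString::new("bad\0name").is_err());

        // An explicit-length buffer (pointer + count, as write(2) takes)
        // has no such restriction: a 0 byte is just data.
        let buf: &[u8] = b"contains a zero: \0 <- here";
        assert_eq!(buf.len(), 26);
    }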
Or are they not equivalent to each other?
Implicit-length and explicit-length are the two major ways of keeping track of string data/constants in memory and passing them around. They solve the same problem, but in opposite ways.
Long implicit-length strings are a bad choice if you sometimes need to find their length before walking through them. Looping through a string is a lot slower than just reading an integer: finding the length of an implicit-length string is O(n), while for an explicit-length string it's of course O(1), because the length is already known. At least the length in bytes is known; the length in Unicode characters might not be, if the string is in a variable-length encoding like UTF-8 or UTF-16.
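As a rough sketch of the two bookkeeping styles in Rust terms (the variable names are just for illustration): finding the end of an implicit-length string means scanning for the 0 byte, while a slice already carries its length.

    fn main() {
        // Implicit length: the data ends at the first 0 byte, so finding the
        // length means scanning the bytes (O(n), like strlen does).
        let null_terminated: &[u8] = b"hello\0";
        let implicit_len = null_terminated
            .iter()
            .position(|&b| b == 0)
            .expect("missing terminator");
        assert_eq!(implicit_len, 5);

        // Explicit length: a slice (or &str) is a pointer plus a length, so
        // the length is already stored (O(1), no scan needed).
        let explicit: &[u8] = b"hello";
        assert_eq!(explicit.len(), 5);
    }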
How a string is terminated has nothing to do with assembly. Historically, strings have ended with '$' or with CR/LF (13, 10 decimal, or 0x0D, 0x0A in hex), and the pair is sometimes reversed, as with gedit under Linux. Conventions are determined by how your system is going to interact with itself or other systems. As an example, my applications are strictly ASCII-oriented, so if I read a file that's UTF-8 or UTF-16, my application would fail miserably. NULLs, or any kind of termination, could be optional.
Consider this example:
Title: db 'Proto_Sys 1.00.0', 0, 0xc4, 0xdc, 0xdf, 'CPUID', 0, 'A20', 0
db 'AXCXDXBXSPBPSIDIESDSCSSSFSGS'
Err00: db 'Retry [Y/N]', 0
I've implemented a routine where, if CX=0, it's assumed a NULL-terminated string is to be displayed; otherwise only one character is read and repeated CX times. That is why 0xc4, 0xdc, 0xdf are not terminated. Similarly, there isn't a terminator before 'Retry [Y/N]' because, the way my algorithm is designed, there doesn't need to be.
The only things you need to concern yourself with are the source of your data and whether your application needs to be compatible with something else. Then you simply implement whatever you need to make it work.
In Python 3, 'test'.__sizeof__() returns 73. However, if I encode it as UTF-8, 'test'.encode().__sizeof__() returns 37.
Why is the size of the string significantly larger than the size of its UTF-8 encoding?
In CPython, up to and including 3.2, unicode characters, which became str characters in 3.x, were stored as either 16- or 32-bit unsigned ints, depending on whether one had a 'narrow' or 'wide' build (always narrow on Windows; both were used on Linux). In 3.3 and following, CPython switched to a flexible string representation (FSR), using 1, 2, or 4 bytes (8, 16, or 32 bits) per char, depending on the width needed for the 'widest' char in the string. See PEP 393.
For 64-bit 3.4.3, 'test'.__sizeof__() == 53, while b'test'.__sizeof__() is still 37. Since both are using 1 byte per char, the extra 16 bytes are additional overhead in the string object. Part of that is the hidden specification of whether the string is using 1, 2, or 4 bytes per char. For comparison, 'tes\u1111'.__sizeof__() == 82 and 'tes\U00011111'.__sizeof__() == 92.
(No, I do not know why the jump to 82. One would probably have to check the code to be sure.)
str in Python 3 is typically stored as 16-bit integers rather than bytes, unlike the encoded bytes object. This makes the string twice as large. Some extra metadata is probably also present, inflating the object further.
The Node.js documentation states that:
A Buffer is similar to an array of integers but corresponds to a raw memory allocation outside the V8 heap.
Am I right that all integers are represented as 64-bit floats internally in JavaScript?
Does it mean that storing 1 byte in Node.js Buffer actually takes 8 bytes of memory?
Thanks
A Buffer is simply an array of bytes, so the length of the buffer is essentially the number of bytes that the Buffer will occupy.
For instance, the new Buffer(size) constructor is documented as "Allocates a new buffer of size octets." Here "octets" clearly identifies the cells as single-byte values. Similarly, the documentation for buf[index] states "Get and set the octet at index. The values refer to individual bytes, so the legal range is between 0x00 and 0xFF hex or 0 and 255."
While a buffer is absolutely an array of bytes, you may interact with it as integers or other types using the buf.read* class of functions available on the buffer object. Each of these has a specific number of bytes that are affected by the operations.
For more specifics on the internals, Node just passes the length through to smalloc, which uses malloc, as you'd expect, to allocate the specified number of bytes.
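To illustrate the underlying idea (raw bytes stored one per cell, interpreted as wider integers only at read time), here is a sketch in Rust rather than the Node API itself:

    fn main() {
        // Four raw bytes: each occupies exactly one byte of the buffer.
        let buf: [u8; 4] = [0x01, 0x00, 0x00, 0x00];
        assert_eq!(std::mem::size_of_val(&buf), 4);

        // Interpreting those bytes as a 32-bit little-endian integer happens
        // at read time; it doesn't change how the bytes are stored.
        let n = u32::from_le_bytes(buf);
        assert_eq!(n, 1);
    }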