Is an empty string `[]byte("")` serialized as `\0`

Our code base is Go and C++. I'm a C++ programmer, so I can sort of follow Go, but I don't understand it well.
We are an embedded shop and the Go folks are serializing a string and sending it on an I2C buffer. It appears an empty Go string is appended to the I2C transaction as \0, instead of "nothing".
Everything I've seen in Go documentation has only described how to test for it (i.e. str == "" or len(str) > 0), but doesn't appear to describe how it is serialized.
As a C++ programmer, \0 makes sense, because it is the null terminator for strings or even simply NULL, which would make sense to store in a variable. Can someone please confirm or deny this?

The Go language does not specify how Go values are serialized, and the encoders in the Go standard library treat empty strings differently. The gob encoder omits empty strings. The JSON encoder writes an empty string as "" (unless told to omit the empty string).
In memory, Go string values do not have a null terminator.
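A minimal sketch of the point above: whether a \0 reaches the I2C buffer depends on the code doing the serializing, not on the Go string itself. The helpers appendString, appendCString, and jsonEncode are hypothetical names for illustration, not the code the Go team in the question is using.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// appendString models a hand-rolled serializer that copies the string's
// bytes into a buffer and adds no terminator.
func appendString(buf []byte, s string) []byte {
	return append(buf, s...)
}

// appendCString models a serializer that writes a C-style string, which
// always ends with a NUL byte, even when the string is empty.
func appendCString(buf []byte, s string) []byte {
	return append(append(buf, s...), 0)
}

// jsonEncode returns the JSON encoding of s as text.
func jsonEncode(s string) string {
	b, err := json.Marshal(s)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(len([]byte("")))            // 0: the empty string holds no bytes
	fmt.Println(len(appendString(nil, ""))) // 0: nothing is added to the buffer
	fmt.Println(appendCString(nil, ""))     // [0]: only the terminator is written
	fmt.Println(jsonEncode(""))             // ""
}
```

If a \0 shows up on the wire for an empty string, the serializer in use is most likely writing a terminator (or a zero length byte) explicitly.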

Related

Read substrings from a string containing multiplication [duplicate]

This question already has answers here:
'*' and '/' not recognized on input by a read statement
(2 answers)
Closed 4 years ago.
I am a scientist programming in Fortran, and I came across some strange behaviour. In one of my programs I have a string containing several "words", and I want to read all the words as substrings. The first word starts with an integer and a wildcard, like "2*something".
When I perform an internal read on that string, I expect to read all the words, but instead the READ statement repeatedly reads the first substring. I do not understand why, nor how to avoid this behaviour.
Below is a minimalist sample program that reproduces this behaviour. I would expect it to read the three substrings and to print "3*a b c" on the screen. Instead, I get "a a a".
What am I doing wrong? Can you please help me and explain what is going on?
I am compiling my programs under GNU/Linux x64 with Gfortran 7.3 (7.3.0-27ubuntu1~18.04).
PROGRAM testread
  IMPLICIT NONE
  CHARACTER(LEN=1024) :: string
  CHARACTER(LEN=16) :: v1, v2, v3
  string = "3*a b c"
  ! Internal list-directed read of three character items
  READ(string,*) v1, v2, v3
  PRINT*, v1, v2, v3
END PROGRAM testread

You are using list-directed input (the * format specifier). In list-directed input, a number (n) followed by an asterisk means "repeat this item n times", so it is processed as if the input was a a a b c. You would need to have as input '3*a' b c to get what you want.
I will use this as another opportunity to point out that list-directed I/O is sometimes the wrong choice as its inherent flexibility may not be what you want. That it has rules for things like repeat counts, null values, and undelimited strings is often a surprise to programmers. I also often see programmers complaining that list-directed input did not give an error when expected, because the compiler had an extension or the programmer didn't understand just how liberal the feature can be.
I suggest you pick up a Fortran language reference and carefully read the section on list-directed I/O. You may find you need to use an explicit format or change your program's expectations.
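To make the repeat-count rule concrete, here is a rough model of it written in Go. It is only a sketch of the rule described above, not how any Fortran runtime is implemented: the function name expandRepeats is made up, and quoting, commas, slashes, and null values are deliberately ignored.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expandRepeats is a simplified model of list-directed repeat counts:
// a token of the form n*value stands for n copies of value.
func expandRepeats(input string) []string {
	var out []string
	for _, tok := range strings.Fields(input) {
		count, value, found := strings.Cut(tok, "*")
		n, err := strconv.Atoi(count)
		if !found || err != nil || n < 1 {
			out = append(out, tok) // not a repeat form; keep the token as is
			continue
		}
		for i := 0; i < n; i++ {
			out = append(out, value)
		}
	}
	return out
}

func main() {
	fmt.Println(expandRepeats("3*a b c")) // [a a a b c]
}
```

A READ of three items consumes the first three values of that expanded list, which is why the program prints "a a a".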

Following the answer of @SteveLionel, here is the relevant part of the reference on list-directed sequential READ statements (in this case, for Intel Fortran, but you could find it for your specific compiler and it won't be much different).
A character string does not need delimiting apostrophes or quotation marks if the corresponding I/O list item is of type default character, and the following is true:
The character string does not contain a blank, comma (,), or slash ( / ).
The character string is not continued across a record boundary.
The first nonblank character in the string is not an apostrophe or a quotation mark.
The leading character is not a string of digits followed by an asterisk.
A nondelimited character string is terminated by the first blank, comma, slash, or end-of-record encountered. Apostrophes and quotation marks within nondelimited character strings are transferred as is.
In total, there are 4 forms of sequential read statements in Fortran, and you may choose the option that best fits your need:
Formatted Sequential Read:
To use this, you change the * to an actual format specifier. If you know the lengths of the strings in advance, this would be as easy as '(a3,a2,a2)'. Or you could come up with a format specifier that matches your data, but this generally requires you to know the length or format of the fields.
Formatted Sequential List-Directed:
You are currently using this option (the * format descriptor). As already shown, this kind of I/O comes with a lot of magic and surprising behavior. What is hitting you is the n*cte form, which is interpreted as n repetitions of the literal cte.
As Steve Lionel said, you could put quotation marks around the problematic word so that it is parsed as one piece. Or, as proposed by @evets, you could split your string using the intrinsics index or scan. Another option could be changing your wildcard from an asterisk to something else.
Formatted Namelist:
This could be an option if your data were (or could be) presented in the namelist format, but I don't think that is your case.
Unformatted:
This may not apply to your case because you are reading from a character variable, and an internal READ statement can only be formatted.
Otherwise, you could split your string by means of a function instead of an I/O operation. There is no intrinsic for this, but you could come up with one without much trouble (see this thread for reference). As you may have noted already, manipulating strings in Fortran is... awkward, at least. There are some libraries out there (like this) that may be useful if you are doing lots of string stuff in Fortran.
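As an illustration of the "split with a function" idea, here is what a plain whitespace split looks like in Go, where the asterisk gets no special treatment. This is not Fortran code; splitWords is a hypothetical wrapper used only to show the expected result of such a helper.

```go
package main

import (
	"fmt"
	"strings"
)

// splitWords splits on runs of whitespace and gives '*' no special meaning.
func splitWords(s string) []string {
	return strings.Fields(s)
}

func main() {
	fmt.Printf("%q\n", splitWords("3*a b c")) // ["3*a" "b" "c"]
}
```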

Conversion of list to string - TCL

I encountered the following problem in Tcl. In my application, I read very large text files (some hundreds of MB) into a Tcl list. The list is then returned by the function to the main context and checked for emptiness. Here is the code snippet:
set merged_trace_list [merge_trace_files $exclude_trace_file $trace_filenames ]
if {$merged_trace_list == ""} {
...
I get a crash at the "if" line. The crash seems to be related to memory overflow. I thought that the comparison to "" forces Tcl to convert the list to a string, and since the string is too long, this causes the crash. I then replaced the above "if" line with another one:
if {[lempty $merged_trace_list]} {
and the crash indeed disappeared. In light of the above, I have several questions:
What is the maximum allowed string length in Tcl?
What is the difference between a string and a list in Tcl in terms of memory allocation? Why can I have a very long list, but not the corresponding string?
When the list is first returned by the function into the main scope (the first line), is it not converted to a string first? And if so, why don't I get a crash on that line?
Thanks,
I hope the descriptions and the questions are clear.
Konstantin

The current maximum size of an individual memory object (e.g., a string) is 2GB. This is a known bug (of long standing) on 64-bit platforms, but fixing it requires a significant ABI and API breaking change, so it won't appear until Tcl 9.0.
The difference between strings and lists is that strings are stored in a single block of memory, whereas lists are stored in an array of pointers to elements. With 8-byte pointers you can probably get 256M elements in a list with no problem, but after that you might run into problems as the array reaches the 2GB limit.
Tcl's value objects may be simultaneously both lists and strings; the dictum about Tcl that “everything is a string” is not actually true, it's just that everything may be serialized to a string. Returning a list does not force it to be converted to a string (that's actually a fairly slow operation), but comparing the value for equality with a string does force the generation of the string. The lempty command must instead be getting the length of the list (you can use llength to do the same thing) and comparing that to zero.
Can you adjust your program to not need to hold all that data in memory at once? It's living a little dangerously given the bug mentioned above.

This is not really an answer, but it's slightly too much for a comment.
If you want to check if a list is empty, the best option is llength. If the list length is 0, your list has no content. The low-level lookup for this is very cheap.
If you still want to determine if a list is empty by comparing it to the empty string you will have to face the cost of resolving the string representation of the list. In this case, $myLongList eq {} is preferable to $myLongList == {}, since the latter comparison also forces the interpreter to check if the operands are numeric (at least it used to be like that, it might have changed).
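The cost difference between the two checks can be sketched by analogy in Go. This is only an analogy, not how Tcl works internally: the function names are made up, and Tcl's real string representation of a list involves quoting rules that strings.Join does not reproduce.

```go
package main

import (
	"fmt"
	"strings"
)

// isEmptyByLength checks emptiness from the element count alone; the cost
// does not depend on how much data the list holds.
func isEmptyByLength(list []string) bool {
	return len(list) == 0
}

// isEmptyByString first builds one string from every element, which
// allocates memory proportional to the total size of the list.
func isEmptyByString(list []string) bool {
	return strings.Join(list, " ") == ""
}

func main() {
	list := []string{"line 1", "line 2"}
	fmt.Println(isEmptyByLength(list), isEmptyByString(list)) // false false
}
```

Both checks give the same answer here, but the second one materializes the whole list as a single string first, which is the step that hits the size limit in the question.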

why does this string not contain the correct characters?

Written in Delphi XE3, my software is communicating with an instrument that occasionally sends binary data. I had expected I should use AnsiString since this data will never be Unicode. I couldn't believe that the following code doesn't work as I had expected. I'm supposing that the characters I'm exposing it to are considered illegitimate...
var
  s: AnsiString;
begin
  s := 'test' + chr(128);
  // had expected that since the string was set above to end in #128,
  // it should end in #128...it does not.
  if ord(s[5]) <> 128 then
    ShowMessage('String ending is not as expected!');
end;
Naturally, I could use a pointer to accomplish this, but I would think I should probably be using a different kind of string. Of course, I could use a byte array, but a string would be more convenient.
Really, I'd like to know "why" and have some good alternatives.
Thanks!

The behaviour you observe stems from the fact that Chr(128) is a UTF-16 WideChar representing U+0080.
When translated to your ANSI locale this does not map to ordinal 128. I would expect U+0080 to have no equivalent in your ANSI locale and therefore map to ? to indicate a failed translation.
Indeed, the compiler will even warn you that this can happen. Your code, when compiled with default compiler options, yields these warnings:
W1058 Implicit string cast with potential data loss from 'string' to 'AnsiString'
W1062 Narrowing given wide string constant lost information
Personally, I would configure the compiler to treat both of those warnings as errors.
The fundamental issue is revealed here:
My software is communicating with an instrument that occasionally sends binary data.
The correct data type for byte oriented binary data is an array of byte. In Delphi that would be TBytes.
It is wrong to use AnsiString since that exposes you to codepage translations. You want to be able to specify ordinal values and you categorically do not want text encodings to play a part. You do not want for your program's behaviour to be determined by the prevailing ANSI locale.
Strings are for text. For binary use byte arrays.
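The same class of problem can be shown in Go, used here only as an illustration and not as a description of Delphi's behaviour: once a value goes through a text type, the encoding decides which bytes you get, whereas a byte slice holds exactly what you put in. The helper names asText and asBytes are hypothetical.

```go
package main

import "fmt"

// asText appends code point U+0080 as text. Go strings are UTF-8 encoded,
// so the character occupies two bytes (0xC2 0x80), not one.
func asText() []byte {
	return []byte("test" + string(rune(128)))
}

// asBytes appends the raw byte value 128; no text encoding is involved.
func asBytes() []byte {
	return append([]byte("test"), 128)
}

func main() {
	fmt.Printf("% x\n", asText())  // 74 65 73 74 c2 80
	fmt.Printf("% x\n", asBytes()) // 74 65 73 74 80
}
```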

ReadStr() and WriteStr() in Delphi

I have some code that uses ReadStr and WriteStr for what I presume is writing a string to a binary file.
The explanation for WriteStr in the documentation states that it will write raw data in the shape of an AnsiString to the object's stream, which makes sense. But then ReadStr says that it reads a character. So are they not the opposite of each other?
Let's say I have
pName: String[80];
and I use WriteStr on it. What does it actually write? Since WriteStr expects an AnsiString, does it cast pName to one? In that case, does it not write the "Length" field into the stream, because an AnsiString pointer points to the first element and not the length field? I also looked around and it seems String == AnsiString these days, but my question about the length field remains the same.
If, let's say, it doesn't write the Length field into the file, does it still write the NULL at the end of the data? If so, can I find where the string ends by looking for a '\0'? Does ReadStr read until the NULL character?
Thank you kindly :)

In your pre-Unicode version of Delphi, WriteStr and ReadStr write and read an AnsiString value. The writing code writes the length, and then the string content. The reading code reads the length, allocates the string, and then fills it with the content.
This has the potential of involving a truncation when you assign the result of ReadStr to your 80 character short string.
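The length-then-content layout described above can be sketched as follows. This is an illustration in Go, not the Delphi implementation: the one-byte length prefix and the names writeStr, readStr, and roundTrip are assumptions made for the example, and the actual width of the length field in your Delphi version should be checked against its documentation.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// writeStr writes a one-byte length followed by the string's bytes.
// The one-byte prefix is an assumption; it limits strings to 255 bytes.
func writeStr(w io.Writer, s string) error {
	if len(s) > 255 {
		return errors.New("string longer than 255 bytes")
	}
	if _, err := w.Write([]byte{byte(len(s))}); err != nil {
		return err
	}
	_, err := io.WriteString(w, s)
	return err
}

// readStr reads the length byte, then exactly that many bytes of content.
func readStr(r io.Reader) (string, error) {
	var n [1]byte
	if _, err := io.ReadFull(r, n[:]); err != nil {
		return "", err
	}
	buf := make([]byte, n[0])
	if _, err := io.ReadFull(r, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

// roundTrip encodes s into an in-memory stream and decodes it again,
// returning the encoded bytes and the decoded string.
func roundTrip(s string) ([]byte, string, error) {
	var stream bytes.Buffer
	if err := writeStr(&stream, s); err != nil {
		return nil, "", err
	}
	encoded := append([]byte(nil), stream.Bytes()...) // copy before reading
	decoded, err := readStr(&stream)
	return encoded, decoded, err
}

func main() {
	encoded, decoded, err := roundTrip("abc")
	if err != nil {
		panic(err)
	}
	fmt.Printf("% x\n", encoded) // 03 61 62 63
	fmt.Println(decoded)         // abc
}
```

With a length prefix the reader never needs to search for a terminator, and no trailing \0 is written. In this sketch an empty string still occupies one byte in the stream, namely the zero length.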

What defines data that can be stored in strings

A few days ago, I asked why it's not possible to store binary data, such as a jpg file, in a string variable.
Most of the answers I got said that string is used for textual information such as what I'm writing now.
What is considered textual data though? Bytes of a certain nature represent a jpg file and those bytes could be represented by character byte values...I think. So when we say strings are for textual information, is there some sort of range or list of characters that aren't stored?
Sorry if the question sounds silly. Just trying to 'get it'

I see three major problems with storing binary data in strings:
Most systems assume a certain encoding within string variables - e.g. if it's a UTF-8, UTF-16 or ASCII string. New line characters may also be translated depending on your system.
You should watch out for restrictions on the size of strings.
If you use C style strings, every null character in your data will terminate the string and any string operations performed will only work on the bytes up to the first null.
Perhaps the most important: it's confusing - other developers don't expect to find random binary data in string variables. And a lot of code which works on strings might also get really confused when encountering binary data :)
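The null-character problem from the list above can be illustrated with a short Go sketch that mimics what a C-style string function would see. The function cStringLen is a hypothetical stand-in for C's strlen, and the sample bytes are made up.

```go
package main

import (
	"bytes"
	"fmt"
)

// cStringLen returns the length a C-style string function would see:
// the number of bytes before the first NUL, or all of them if there is none.
func cStringLen(data []byte) int {
	if i := bytes.IndexByte(data, 0); i >= 0 {
		return i
	}
	return len(data)
}

func main() {
	data := []byte{0xFF, 0xD8, 0x00, 0x10, 0x4A} // binary data with an embedded NUL
	fmt.Println(len(data), cStringLen(data))     // 5 2
}
```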

I would prefer to store binary data as binary. You would only think of converting it to text when there's no other choice, since a textual representation wastes some bytes (not much, but it still counts). That's how attachments are put in email.
Base64 is a good textual representation of binary files.
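A minimal sketch of the Base64 round trip, using Go's standard encoding/base64 package. The sample bytes are made up for illustration, and toText and fromText are hypothetical helper names.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
)

// toText encodes arbitrary bytes as Base64 text.
func toText(data []byte) string {
	return base64.StdEncoding.EncodeToString(data)
}

// fromText decodes Base64 text back into the original bytes.
func fromText(text string) ([]byte, error) {
	return base64.StdEncoding.DecodeString(text)
}

func main() {
	raw := []byte{0xFF, 0xD8, 0xFF, 0x00, 0x80} // includes a NUL and a non-ASCII byte
	text := toText(raw)
	fmt.Println(text) // /9j/AIA=

	back, err := fromText(text)
	if err != nil {
		panic(err)
	}
	fmt.Println(bytes.Equal(raw, back)) // true
}
```

The encoded form uses only printable ASCII characters, at the cost of four output characters for every three input bytes.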

I think you are referring to the binary-to-text encoding issue (translating a jpg into a string would require that sort of pre-processing).
Indeed, in that article, some characters are mentioned as not always supported, and others can be confusing:
Some systems have a more limited character set they can handle; not only are they not 8-bit clean, some can't even handle every printable ASCII character.
Others have limits on the number of characters that may appear between line breaks.
Still others add headers or trailers to the text.
And a few poorly-regarded but still-used protocols use in-band signaling, causing confusion if specific patterns appear in the message. The best-known is the string "From " (including trailing space) at the beginning of a line used to separate mail messages in the mbox file format.

Whoever told you you can't put 'binary' data into a string was wrong. A string simply represents an array of bytes that you most likely plan on using for textual data... but there is nothing stopping you from putting any data in there you want.
I do have to be careful though, because I don't know what language you are using... and in some languages \0 ends the string.
In C#, you can put any data into a string... example:
byte[] myJpegByteArray = GetBytesFromSomeImage();
string myString = Encoding.ASCII.GetString(myJpegByteArray);

Before internationalization, it didn't make much difference. ASCII characters are all bytes, so strings, character arrays and byte arrays ended up having the same implementation.
These days, though, strings are a lot more complicated, in order to deal with thousands of foreign language characters and the linguistic rules that go with them.
Sure, if you look deep enough, everything is just bits and bytes, but there's a world of difference in how the computer interprets them. The rules for "text" make things look right when it's displayed to a human, but the computer is free to monkey with the internal representation. For example,
In Unicode, there are many encoding systems. Changing between them makes every byte different.
Some languages have multiple characters that are linguistically equivalent. These could switch back and forth when you least expect it.
There are different ways to end a line of text. Unintended translations between CRLF and LF will break a binary file.
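The line-ending point can be shown with a small Go sketch. The function toCRLF is a hypothetical stand-in for any text-mode layer that rewrites LF as CRLF, and the sample bytes are made up.

```go
package main

import (
	"bytes"
	"fmt"
)

// toCRLF mimics a text-mode layer that rewrites every LF as CRLF.
func toCRLF(data []byte) []byte {
	return bytes.ReplaceAll(data, []byte("\n"), []byte("\r\n"))
}

func main() {
	binary := []byte{0x89, 0x0A, 0x1A, 0x0A} // binary data that happens to contain 0x0A
	fmt.Printf("% x\n", toCRLF(binary))      // 89 0d 0a 1a 0d 0a
}
```

Every 0x0A byte in the binary data is treated as a line ending, so the output is longer than the input and no longer matches the original file.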

Deep down, everything is just bytes.
Things like strings and pictures are defined by rules about how to order bytes.
C strings, for example, end in a byte with value 0 (other string formats use other conventions).
JPEGs don't.

Depends on the language. For example, in Python 2 the string type (str) is really a byte array, so it can indeed be used for binary data (Python 3 has a separate bytes type for this).
In C the null byte is used for string termination, so a string cannot be used for arbitrary binary data, since binary data could contain null bytes.
In C# a string is an array of chars, and since a char is basically an alias for a 16-bit int, you can probably get away with storing arbitrary binary data in a string. You might get errors when you try to display the string (because some values might not actually correspond to a legal Unicode character), and some operations like case conversions will probably fail in strange ways.
In short, it might be possible in some languages to store arbitrary binary data in strings, but they are not designed for this use, and you may run into all kinds of unforeseen trouble. Most languages have a byte-array type for storing arbitrary binary data.

I agree with Jacobus' answer:
In the end all data structures are made up of bytes. (Well, if you go even deeper: of bits). With some abstraction, you could say that a string or a byte array are conventions for programmers, on how to access them.
In this regard, the string is an abstraction for data interpreted as text. Text was invented for communication among humans; computers or programs do not communicate very well using text. SQL is textual, but it is an interface for humans to tell a database what to do.
So in general, textual data, and therefore strings, are primarily for human-to-human or human-to-machine interaction (say, for the content of a message box). Using them for something else (e.g. reading or writing binary image data) is possible, but carries a lot of risk because you are using the data type for something it was not designed to handle. This makes it much more error-prone. You may be able to store binary data in strings, but just because you are able to shoot yourself in the foot does not mean you should.
Summary: You can do it. But you had better not.

Your original question (c# - What is string really good for?) made very little sense. So the answers didn't make sense, either.
Your original question said "For some reason though, when I write this string out to a file, it doesn't open." Which doesn't really mean much.
Your original question was incomplete, and the answers were misleading and confusing. You CAN store anything in a String. Period. The "strings are for text" answers were there because you didn't provide enough information in your question to determine what's going wrong with your particular bit of C# code.
You didn't provide a code snippet or an error message. That's why it's hard to 'get it' -- you're not providing enough details for us to know what you don't get.
