why does this string not contain the correct characters? - string

My software, written in Delphi XE3, is communicating with an instrument that occasionally sends binary data. I had expected that I should use AnsiString since this data will never be Unicode, and I couldn't believe that the following code doesn't work as I had expected. I'm supposing that the characters I'm exposing it to are considered illegitimate...
var
s:AnsiString;
begin
s:='test' + chr(128);
// had expected that since the string was set above to end in #128,
// it should end in #128...it does not.
if ord(s[5])<>128 then
ShowMessage('String ending is not as expected!');
end;
Naturally, I could use a pointer to accomplish this, but I would think I should probably be using a different kind of string. Of course, I could use a byte array, but a string would be more convenient.
Really, I'd like to know "why", and I'd like some good alternatives.
Thanks!

The behaviour you observe stems from the fact that Chr(128) is a UTF-16 WideChar representing U+0080.
When translated to your ANSI locale this does not map to ordinal 128. I would expect U+0080 to have no equivalent in your ANSI locale and therefore map to ? to indicate a failed translation.
Indeed, the compiler will even warn you that this can happen. Your code, when compiled with default compiler options, yields these warnings:
W1058 Implicit string cast with potential data loss from 'string' to 'AnsiString'
W1062 Narrowing given wide string constant lost information
Personally, I would configure the compiler to treat both of those warnings as errors.
The fundamental issue is revealed here:
My software is communicating with an instrument that occasionally sends binary data.
The correct data type for byte-oriented binary data is an array of bytes. In Delphi that would be TBytes.
It is wrong to use AnsiString since that exposes you to codepage translations. You want to be able to specify ordinal values, and you categorically do not want text encodings to play a part. You do not want your program's behaviour to be determined by the prevailing ANSI locale.
Strings are for text. For binary use byte arrays.

Related

How to send a string instead of bytearray in python via USB port?

In LabVIEW, I convert an array to a string and just output it.
But in Python 3.6, when I use the serial.write(string) function, the string needs to be turned into a bytearray.
Is there any way I can send a string without converting it to a bytearray?
No.
A Python 3.x string is a sequence of Unicode code points. A Unicode code point is an abstract entity, a bit like the colour red: in order to store it or transmit it in digital form it has to be encoded into a specific representation of one or more bytes - a bit like encoding the colour red as #ff0000. To send a string to another computer, you need to encode it into a byte sequence, and because there are many possible encodings you might want to use, you need to specify which one:
bytesToSend = myString.encode(encoding="utf-8")
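For the serial port case, here is a minimal sketch assuming the pyserial package; the port name, baud rate and command string are hypothetical:

import serial

message = "MEAS:VOLT?\n"  # hypothetical command string you want to send
port = serial.Serial("/dev/ttyUSB0", baudrate=9600, timeout=1)  # hypothetical port settings
port.write(message.encode("utf-8"))  # encode str -> bytes before it goes on the wire
reply = port.read(64)  # read() hands back bytes
text = reply.decode("utf-8", errors="replace")  # decode to str only if the reply really is text
port.close()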
Why didn't you need to do this in LabVIEW? Many older programming languages, including both LabVIEW and Python before 3.x, are based on the assumption that strings are encoded 1:1 into bytes - every character is one byte and every byte is one character. That's the way it worked in the early days of computing when memory was tight and non-English text was unusual, but it's not good enough now that software has to be used globally in a networked world.
Python 3.x took the step of explicitly breaking the link and making strings and byte sequences distinct and incompatible types: this means you have to deal with the difference, but it's not a big deal, just encode and decode as necessary, and in my opinion it's less confusing than trying to pretend that strings and byte sequences are still the same thing.
LabVIEW is finally catching up with Unicode in the NXG version, although for backward compatibility it lets you get away with wiring strings directly to some functions that really operate on byte sequences.
For more information I recommend reading the Python 3.x Unicode HOWTO.

How can I get a Path from a raw C string (CStr or *const u8)?

What's the most direct way to use a C string as Rust's Path?
I've got const char * from FFI and need to use it as a filesystem path in Rust.
I'd rather not enforce UTF-8 on the path, so converting through str/String is undesirable.
It should work on Windows at least for ASCII paths.
To clarify: I'm just replacing an existing C implementation that passes the path to fopen with a Rust stdlib implementation. It's not my problem whether it's a valid path or encoded properly for a given filesystem, as long as it's not worse than fopen (and I know fopen basically doesn't work on Windows).
Here's what I've learned:
Path/OsStr always use WTF-8 on Windows, and are an encoding-ignorant bag of bytes on Unix.
They never ever store any paths using any "wide" encoding like UTF-16 or UCS-2. The Windows-only masquerade of OsStr is to hide the WTF-8 encoding, nothing more.
It is extremely unlikely to ever change, because the standard library API supports creation of Path and OsStr from UTF-8 &str without any allocation or mutation of memory (i.e. as_ref() is supported, and its strict API doesn't leave room to implement it as anything other than a pointer cast).
Unix-only zero-copy version (it doesn't even depend on any implementation details):
use std::ffi::{CStr, OsStr};
use std::path::Path;
use std::os::unix::ffi::OsStrExt;

// CStr::from_ptr is unsafe: the pointer must be non-null and NUL-terminated.
let slice = unsafe { CStr::from_ptr(c_null_terminated_string_ptr_here) };
let osstr = OsStr::from_bytes(slice.to_bytes());
let path: &Path = osstr.as_ref();
On Windows, converting only valid UTF-8 is the best Rust can do without a charade of creating WTF-8 OsString from code units:
…
let str = ::std::str::from_utf8(slice.to_bytes()).expect("keep your surrogates paired");
let path: &Path = str.as_ref();
Safely and portably? Insofar as I'm aware, there isn't a way. My advice is to demand UTF-8 and just pray it never breaks.
The problem is that the only thing you can really say about a "C string" is that it's NUL-terminated. You can't really say anything meaningful about how it's encoded. At least, not with any real certainty.
Unsafely and/or non-portably? If you're running on Linux (and possibly other modern *NIXen), you can maybe use OsStrExt to do the conversion. This only works assuming the C string was a valid path in the first place. If it came from some string processing code that wasn't using the same encoding as the filesystem (which these days is generally "arbitrary bytes that look like UTF-8 but might not be")... well, you'll have to convert it yourself, first.
On Windows? Hahahaha. This depends on where the string came from. C strings embedded in an executable can be in a variety of encodings depending on how the code was compiled. If it came from the OS itself, it could be in one of two different encodings: the thread's OEM codepage, or the thread's ANSI codepage. I never worked out how to check which it's set to. If it came from the console, it would be in whatever the console's input encoding was set to when you received it... assuming it wasn't piped in from something else that was using a different encoding (hi there, PowerShell!). All of the above require you to roll your own transcoding code, since Rust itself avoids this by never, ever using non-Unicode APIs on Windows.
Oh, and don't forget that there is no 8-bit encoding that can properly store Windows paths, since Windows paths are "arbitrary 16-bit words that look like UTF-16 but might not be". [1]
... so, like I said: demand UTF-8 and just pray it never breaks, because trying to do it "correctly" leads to madness.
[1]: I should clarify: there is such an encoding: WTF-8, which is what Rust uses for OsStr and OsString on Windows. The catch is that nothing else on Windows uses this, so it's never going to be how a C string is encoded.

Erlang binary strings by default

I am writing an Erlang module that has to deal a bit with strings, not too much; however, I do some TCP recv and then some parsing over the data.
While matching data and manipulating strings, I am using the binary module all the time, like binary:split(Data, <<":">>), and basically using <<"StringLiteral">> all the time.
Till now I have not encountered difficulties or missing functions compared with the alternative (using lists), and everything is coming out quite naturally, except maybe for adding the <<>>, but I was wondering if this way of dealing with strings might have drawbacks I am not aware of.
Any hint?
As long as you and your team remember that your strings are binaries and not lists, there are no inherent problems with this approach. In fact, CouchDB took this approach as an optimization, which apparently paid nice dividends.
You do need to be very aware of how your string is encoded in your binaries. When you do <<"StringLiteral">> in your code, you have to be aware that this is simply a binary serialization of the list of code points. Your Erlang compiler reads your code as ISO-8859-1 characters, so as long as you only use Latin-1 characters and do this consistently, you should be fine. But this isn't very friendly to internationalization.
Most application software these days should prefer a Unicode encoding. UTF-8 is compatible with your <<"StringLiteral">> for the first 128 code points, but not for the second 128, so be careful. You might be surprised what you see on your UTF-8 encoded web applications if you use <<"StrïngLïteral">> in your code.
There was an EEP proposal for binary support in the form of <<"StrïngLïteral"/utf8>>, but I don't think this is finalized.
Also be aware that the binary:split/2 function may have unexpected results in UTF-8 if there is a multi-byte character that contains the ISO-8859-1 byte that you are splitting on.
Some would argue that UTF-16 is a better encoding to use because it can be parsed more efficiently and can be more easily split by index, if you are assuming, or can verify, that there are no characters outside the 16-bit range (which UTF-16 encodes as surrogate pairs).
The unicode module should be used, but tread carefully when you use literals.
The only thing to be aware of is that a binary is a slice of bytes, whereas a list is a list of unicode codepoints. In other words, the latter is naturally unicode whereas the former requires you to do some sort of encoding, usually UTF-8.
To my knowledge, there are no drawbacks to your method.
Binaries are very efficient structures for storing strings. If they are longer than 64 bytes, they are also stored outside the process heap, so they are not subject to per-process garbage collection (they are still reclaimed by reference counting once the last reference is dropped). Don't forget to use iolists when concatenating them, to avoid copying when performance matters.

Native newline characters?

What's the best way to determine the native newline characters such as '\n' or '\r\n' in Haskell?
I see there is a "nativeNewline" function in GHC.IO.Handle, but I assume that it is both a private API and, most of all, non-standard Haskell.
You should think of the newline representation as part of the encoding of a text file that is stored in the filesystem, just like UTF-8. A text file is normally decoded when you read it into your program, and encoded when written -- converting to and from the native newline representation is done as part of this encoding and decoding. Inside your Haskell program, just as characters are represented by their Unicode code points, the newline character is always \n.
To tell the I/O system about the newline encoding you want to use, see the section on Newline Conversion in the documentation for System.IO.
System.IO.nativeNewline is not private; you can access it to find out what GHC considers the native "newline" to be on the current platform.
Note that the type of this variable, System.IO.Newline, does not have a Show instance as of GHC 6.12.3, so you can't easily print its value. Instead, check whether it is equal to System.IO.LF or System.IO.CRLF.
However, as Simon pointed out, you shouldn't need to know about the native newline sequence with normal usage of the text-oriented IO functions in GHC.
This variable, together with the rest of the new Unicode-aware capabilities of the IO system, is not yet part of the Haskell standard. It was not included in the Haskell 2010 report. However, since it is already implemented in GHC, and there is quite a widespread consensus that it is important and useful, expect it to be included in one of the upcoming yearly revisions of the standard.

What defines data that can be stored in strings

A few days ago, I asked why it's not possible to store binary data, such as a JPG file, into a string variable.
Most of the answers I got said that strings are used for textual information, such as what I'm writing now.
What is considered textual data, though? Bytes of a certain nature represent a JPG file, and those bytes could be represented by character byte values... I think. So when we say strings are for textual information, is there some sort of range or list of characters that aren't stored?
Sorry if the question sounds silly. Just trying to 'get it'.
I see several major problems with storing binary data in strings:
Most systems assume a certain encoding within string variables, e.g. whether it's a UTF-8, UTF-16 or ASCII string. Newline characters may also be translated, depending on your system.
You should watch out for restrictions on the size of strings.
If you use C-style strings, every null character in your data will terminate the string, and any string operations performed will only work on the bytes up to the first null (see the sketch after this list).
Perhaps the most important: it's confusing - other developers don't expect to find random binary data in string variables. And a lot of code which works on strings might also get really confused when encountering binary data :)
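To make the C-string point concrete, here is a small sketch in Python using only the standard ctypes module; the byte values are arbitrary:

import ctypes

data = b"ab\x00cd"  # binary data that happens to contain a NUL byte
print(len(data))  # 5: the bytes object keeps everything
print(ctypes.c_char_p(data).value)  # b'ab': a C-style string stops at the first NUL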
I would prefer to store binary data as binary; you would only think of converting it to text when there's no other choice, since converting it to a textual representation does waste some bytes (not much, but it still counts). That's how attachments are put in email.
Base64 is a good textual representation of binary files.
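For example, a rough sketch with Python's standard base64 module; the file name is hypothetical:

import base64

raw = open("photo.jpg", "rb").read()  # read the image as raw bytes
text = base64.b64encode(raw).decode("ascii")  # a plain ASCII string, roughly 33% larger
restored = base64.b64decode(text)  # decodes back to the exact original bytes
assert restored == raw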
I think you are referring to the binary-to-text encoding issue (translating a JPG into a string would require that sort of pre-processing).
Indeed, in that article, some characters are mentioned as not always supported, while others can be confusing:
Some systems have a more limited character set they can handle; not only are they not 8-bit clean, some can't even handle every printable ASCII character.
Others have limits on the number of characters that may appear between line breaks.
Still others add headers or trailers to the text.
And a few poorly-regarded but still-used protocols use in-band signaling, causing confusion if specific patterns appear in the message. The best-known is the string "From " (including trailing space) at the beginning of a line used to separate mail messages in the mbox file format.
Whoever told you you can't put 'binary' data into a string was wrong. A string simply represents an array of bytes that you most likely plan on using for textual data... but there is nothing stopping you from putting any data in there you want.
I do have to be careful though, because I don't know what language you are using... and in some languages \0 ends the string.
In C#, you can put any data into a string... example:
byte[] myJpegByteArray = GetBytesFromSomeImage();
string myString = Encoding.ASCII.GetString(myJpegByteArray);
Before internationalization, it didn't make much difference. ASCII characters are all bytes, so strings, character arrays and byte arrays ended up having the same implementation.
These days, though, strings are a lot more complicated, in order to deal with thousands of foreign language characters and the linguistic rules that go with them.
Sure, if you look deep enough, everything is just bits and bytes, but there's a world of difference in how the computer interprets them. The rules for "text" make things look right when it's displayed to a human, but the computer is free to monkey with the internal representation. For example,
In Unicode, there are many encoding systems. Changing between them makes every byte different (see the sketch after this list).
Some languages have multiple characters that are linguistically equivalent. These could switch back and forth when you least expect it.
There are different ways to end a line of text. Unintended translations between CRLF and LF will break a binary file.
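A small Python sketch of the first and third points; the exact byte values depend only on the encodings chosen:

text = "héllo\n"
print(text.encode("utf-8"))  # b'h\xc3\xa9llo\n' (7 bytes)
print(text.encode("utf-16-le"))  # b'h\x00\xe9\x00l\x00l\x00o\x00\n\x00' (12 bytes)
print(text.encode("cp1252"))  # b'h\xe9llo\n' (6 bytes)
# And if this text passes through a layer that translates \n to \r\n,
# the bytes on disk change yet again.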
Deep down everything is just bytes.
Things like strings and pictures are defined by rules about how to order bytes.
Strings, for example, end in a byte with value 0 (or something else, depending on the language and convention).
JPGs don't.
Depends on the language. For example, in Python 2 the string type (str) is really a byte array, so it can indeed be used for binary data; in Python 3 the separate bytes type plays that role.
In C the NUL byte is used for string termination, so a string cannot be used for arbitrary binary data, since binary data could contain NUL bytes.
In C# a string is an array of chars, and since a char is basically an alias for a 16-bit integer, you can probably get away with storing arbitrary binary data in a string. You might get errors when you try to display the string (because some values might not correspond to a legal Unicode character), and some operations like case conversions will probably fail in strange ways.
In short, it might be possible in some languages to store arbitrary binary data in strings, but they are not designed for this use, and you may run into all kinds of unforeseen trouble. Most languages have a byte-array type for storing arbitrary binary data.
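In Python 3 terms, a short sketch of that advice; the file name is hypothetical:

raw = open("photo.jpg", "rb").read()  # bytes: the right type for arbitrary binary data
print(type(raw), len(raw))  # <class 'bytes'> and the file size
# raw.decode("utf-8")  # forcing it into a text str raises UnicodeDecodeError right away,
#                      # because a JPEG starts with byte 0xFF, which is never valid UTF-8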
I agree with Jacobus' answer:
In the end all data structures are made up of bytes. (Well, if you go even deeper: of bits). With some abstraction, you could say that a string or a byte array are conventions for programmers, on how to access them.
In this regard, a string is an abstraction for data interpreted as text. Text was invented for communication among humans; computers and programs do not communicate very well using text. SQL is textual, but it is an interface for humans to tell a database what to do.
So in general, textual data, and therefore strings, are primarily for human-to-human or human-to-machine interaction (say, for the content of a message box). Using them for something else (e.g. reading or writing binary image data) is possible, but carries lots of risk because you are using the data type for something it was not designed to handle. This makes it much more error prone. You may be able to store binary data in strings, but just because you are able to shoot yourself in the foot does not mean you should; avoid doing so.
Summary: You can do it. But you'd better not.
Your original question (c# - What is string really good for?) made very little sense. So the answers didn't make sense, either.
Your original question said "For some reason though, when I write this string out to a file, it doesn't open." Which doesn't really mean much.
Your original question was incomplete, and the answers were misleading and confusing. You CAN store anything in a String. Period. The "strings are for text" answers were there because you didn't provide enough information in your question to determine what's going wrong with your particular bit of C# code.
You didn't provide a code snippet or an error message. That's why it's hard to 'get it' -- you're not providing enough details for us to know what you don't get.

Resources