Are extended attribute names and values guaranteed to be UTF-8 encoded? - linux

I am trying to implement Rusty wrappers for the extended attribute syscalls. If the names and values are guaranteed to be UTF-8 encoded, then I can use the type String; otherwise I have to use OsString.
I googled a lot, but only found these two pages:
freedesktop: CommonExtendedAttributes says that:
Attribute strings should be in UTF-8 encoding.
macOS man page for setxattr(2) says that: The extended attribute names are simple NULL-terminated UTF-8 strings
This seems to tell us that the name is guaranteed to be UTF-8 encoded on macOS.
I would like to know about as many platforms as possible, since I am trying to cover them all in my implementation.

No, in Linux they are absolutely not guaranteed to be in UTF-8. Attribute values are not even guaranteed to be strings at all. They are just arrays of bytes with no constraints on them.
int setxattr(const char *path, const char *name,
const void *value, size_t size, int flags);
const void *value, size_t size is not how you pass a string to a function; it is how you pass an arbitrary block of bytes. const char *name is how you pass a string, and attribute names are indeed strings, but they are null-terminated byte strings with no guaranteed encoding.
Freedesktop recommendations are just that, recommendations. They don't prevent anyone from creating any attribute they want.
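A minimal sketch of that point (Linux, using <sys/xattr.h>; the path "somefile" and the attribute name "user.example" are placeholders for illustration): the value passed to setxattr is just a byte array and may legally contain embedded NUL bytes and non-UTF-8 data.
#include <sys/xattr.h>
#include <cstdio>

int main() {
    // Not valid UTF-8 and contains an embedded NUL byte: perfectly legal as a value.
    const unsigned char value[] = {0xFF, 0x00, 0x7F, 0xC0};
    if (setxattr("somefile", "user.example", value, sizeof value, 0) != 0) {
        std::perror("setxattr");
        return 1;
    }
    return 0;
}
For a Rust wrapper this suggests exposing values as raw bytes and names as OsString rather than String.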

Related

Is an explicit NUL-byte necessary at the end of a bytearray for cython to be able to convert it to a null-terminated C-string

When converting a bytearray-object (or a bytes-object for that matter) to a C-string, the cython-documentation recommends using the following:
cdef char * cstr = py_bytearray
There is no overhead, as cstr points to the buffer of the bytearray-object.
However, C-strings are null-terminated, and thus in order to be able to pass cstr to a C-function it must also be null-terminated. The cython-documentation doesn't provide any information on whether the resulting C-strings are null-terminated.
It is possible to add a NUL-byte explicitly to the bytearray-object, e.g. by using b'text\x00' instead of just b'text'. Yet this is cumbersome, easy to forget, and there is at least experimental evidence that the explicit NUL-byte is not needed:
%%cython
from libc.stdio cimport printf
def printit(py_bytearray):
    cdef char *ptr = py_bytearray
    printf("%s\n", ptr)
And now
printit(bytearray(b'text'))
prints the desired "text" to stdout (which, in the case of an IPython-notebook, is obviously not the output shown in the browser).
But is this a lucky coincidence or is there a guarantee, that the buffer of a bytearray-object (or a bytes-object) is null-terminated?
I think it's safe (at least in Python 3); however, I'd be a bit wary.
Cython uses the C-API function PyByteArray_AsString. The Python 3 documentation for it says "The returned array always has an extra null byte appended." The Python 2 version does not have that note, so it's difficult to be sure whether it's safe.
Practically speaking, I think Python deals with this by always over-allocating bytearrays by one and NULL terminating them (see source code for one example of where this is done).
The only reason to be a bit cautious is that it's perfectly acceptable for bytearrays (and Python strings for that matter) to contain a 0 byte within the string, so it isn't a good indicator of where the end is. Therefore, you should really be using their len anyway. (This is a weak argument though, especially since you're probably the one initializing them, so you know if this should be true)
(My initial version of this answer had something about _PyByteArray_empty_string. #ead pointed out in the comments that I was mistaken about this and hence it's edited out...)
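To make the "use their len anyway" point concrete, here is a small sketch using the CPython C API that the answer refers to (print_bytearray is a hypothetical helper name): it reads the buffer by its explicit length instead of relying on the trailing NUL, since embedded zero bytes are legal in a bytearray.
#include <Python.h>
#include <cstdio>

void print_bytearray(PyObject *obj) {
    char *buf = PyByteArray_AsString(obj);   // Python 3 appends an extra NUL after the data
    Py_ssize_t len = PyByteArray_Size(obj);  // the authoritative length
    if (buf == NULL)
        return;
    std::fwrite(buf, 1, (size_t)len, stdout); // iterate by length, not by strlen()
    std::fputc('\n', stdout);
}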

What are UInt16LE, UInt16BE, etc. in Node JS?

In all of my time programming I have squeaked by without ever learning this stuff. Would love to know more about what these are and how they are used:
UInt8
UInt16LE
UInt16BE
UInt32LE
UInt32BE
Int8
Int16LE
Int16BE
Int32LE
Int32BE
FloatLE
FloatBE
DoubleLE
DoubleBE
See https://nodejs.org/api/buffer.html#buffer_buf_readuint8_offset_noassert for where Node uses these.
These datatypes describe how a number is represented in a particular byte order. This is typically essential for:
Network protocols
Binary file formats
It matters because the writing system must lay out integers/floats in a way that yields the same value on the reading side, so the format to be used is just a convention between the two sides (writer and reader).
What the acronyms mean:
The BE suffix stands for big-endian
LE stands for little-endian
Int is a signed integer
UInt is an unsigned integer
The number in the integer types is the number of bits in the word.
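As a rough illustration (a sketch in C++ rather than JavaScript; the buffer contents are made up), this is what readUInt16LE and readUInt16BE conceptually do with the same two bytes:
#include <cstdint>
#include <cstdio>

int main() {
    const std::uint8_t buf[2] = {0x12, 0x34};
    // Little-endian: the first byte is the least significant -> 0x3412
    std::uint16_t le = static_cast<std::uint16_t>(buf[0] | (buf[1] << 8));
    // Big-endian: the first byte is the most significant -> 0x1234
    std::uint16_t be = static_cast<std::uint16_t>((buf[0] << 8) | buf[1]);
    std::printf("LE = 0x%04X, BE = 0x%04X\n", (unsigned)le, (unsigned)be);
    return 0;
}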

Convert from a LPCTSTR to a wchar*

I have a C++ application, and I need to convert an LPCTSTR to a wchar_t*.
Is there a function to perform this conversion?
Using Visual Studio 2k8.
Thank you
From the comments, you are compiling for Unicode, in which case LPCTSTR evaluates to const wchar_t* and no conversion is necessary. If you need a modifiable buffer, then you can allocate one and perform a memory copy. That works because the string is already encoded in UTF-16.
Since you are using C++ it makes sense to store strings in string classes rather than using raw C strings. For example you could use std::wstring, or you could use the MFC/ATL string classes. Exactly which of these options is best for you depends on the specifics of the rest of your code base.
LPCTSTR may be either multibyte or Unicode, determined at compile time. WinNT.h defines it as follows:
#ifdef UNICODE
typedef LPCWSTR LPCTSTR;
#else
typedef LPCSTR LPCTSTR;
#endif
meaning that it is already composed of wchar_t, as Rup points out in a comment. So you might want to check UNICODE and use MultiByteToWideChar() if it is undefined. Of course, you'd need to know the code page the string is using, which depends on where and how it originates. The MultiByteToWideChar documentation has good code samples.
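A minimal sketch of that path (non-UNICODE build, so LPCTSTR is a narrow LPCSTR; CP_ACP is assumed as the source code page and should be replaced by whatever code page the string actually uses, and widen is just an illustrative helper name):
#include <windows.h>
#include <string>

std::wstring widen(LPCSTR narrow) {
    if (narrow == nullptr) return std::wstring();
    // First call: ask how many wide characters are needed (including the NUL).
    int len = MultiByteToWideChar(CP_ACP, 0, narrow, -1, nullptr, 0);
    if (len <= 0) return std::wstring();
    std::wstring wide(static_cast<size_t>(len), L'\0');
    // Second call: perform the conversion into the buffer.
    MultiByteToWideChar(CP_ACP, 0, narrow, -1, &wide[0], len);
    wide.resize(static_cast<size_t>(len) - 1); // drop the extra terminating NUL
    return wide;
}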

wstring string converter in Boost

I use the Boost library to implement my application. All the string characters in the data model of my application are wide chars (wchar_t type). But in the Boost library, some classes only handle narrow chars (char type), e.g. "address boost::asio::ip::address::from_string(const char* str)". So I need to convert between std::string and std::wstring when calling the Boost functions.
Is there a performance issue due to the string conversions?
Is there a converter in Boost that converts between std::wstring and std::string with good performance?
UPDATE
Regarding the converter function: I find the code below works.
std::wstring wstr(L"Hello World");
const std::string nstr( wstr.begin(), wstr.end());
const std::wstring wstr2(nstr.begin(), nstr.end());
Adding my own research conclusion.
Regarding the performance overhead of the string conversion: I debugged into the functions above. The conversion is implemented by casting character by character. The time complexity is O(L), where L is the length of the string. In my application, the strings that need to be converted are not very long, so I don't think there is any noticeable performance latency due to the conversions.
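Note that the element-wise conversion above only round-trips safely for 7-bit ASCII; anything outside ASCII needs a real encoding conversion. A sketch of one option, assuming Boost.Locale is available (the helper names narrow and widen are illustrative):
#include <boost/locale/encoding_utf.hpp>
#include <string>

// Convert a wide string (UTF-16 or UTF-32, depending on the platform's wchar_t) to UTF-8.
std::string narrow(const std::wstring &w) {
    return boost::locale::conv::utf_to_utf<char>(w);
}

// Convert a UTF-8 string back to a wide string.
std::wstring widen(const std::string &s) {
    return boost::locale::conv::utf_to_utf<wchar_t>(s);
}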

libspotify and const char * lifetimes / encoding

Are the various libspotify APIs that return const char* returning caller-owned strings or callee-owned strings?
The normal convention, as far as I know, is that const char* means the callee owns it; the caller can use it but cannot necessarily rely on its lifetime, and is not expected to free it.
Is this the pattern Spotify follows?
Also, I saw some mention in the api.h file that the strings are UTF-8 encoded. I assume this is true for all APIs, not just the one or two that explicitly mention it?
1) const char * returns are owned by libSpotify unless stated otherwise. You don't need to free() them, and if you want them to stick around you should copy them - for example, a playlist name's const char * will be freed by libSpotify when the playlist's name changes. The "Add your own locks" section of the libSpotify FAQ discusses this a little bit.
2) All strings are UTF-8.
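A minimal sketch of the copy-it-if-you-need-it pattern (sp_playlist_name is assumed here as the accessor; the same idea applies to any other libSpotify call returning const char*, and the copy should happen while you hold the appropriate lock):
#include <libspotify/api.h>
#include <string>

std::string copy_playlist_name(sp_playlist *playlist) {
    const char *name = sp_playlist_name(playlist); // owned by libSpotify, UTF-8
    // Deep-copy into storage we own, since libSpotify may free or replace
    // the original buffer (e.g. when the playlist is renamed).
    return name ? std::string(name) : std::string();
}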
