In Python 3, executing:
memoryview("this is a string")
produces the error:
TypeError: memoryview: str object does not have the buffer interface
What should I do to make memoryview accept strings, or how should I transform my strings so that memoryview will accept them?
From the docs, memoryview works only over objects that support the buffer protocol, such as bytes and bytearray. (These are similar types, except that the former is read-only.)
Strings in Python 3 are not raw byte buffers that we can manipulate directly; rather, they are immutable sequences of Unicode code points. A str can be converted to a buffer, though, by encoding it with any of the supported string encodings, such as 'utf-8' or 'ascii'.
memoryview(bytes("This is a string", encoding='utf-8'))
Note that the bytes() call necessarily involves converting and copying the string data into a new buffer that memoryview can access. As should be evident from the preceding paragraph, it is not possible to create a memoryview directly over the str's data.
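For illustration, a minimal round trip (the sample string is just an example):
# Encode the str into a bytes object (this copies the data), then wrap
# the copy in a memoryview; slicing the view does not copy again.
s = "This is a string"
mv = memoryview(s.encode('utf-8'))  # equivalent to bytes(s, encoding='utf-8')
print(mv[:4].tobytes().decode('utf-8'))  # -> This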
Given that the error message and camflint's answer already make the issue clear, I'll only add that you can succinctly create a memoryview instance from a byte-string literal as follows:
memoryview(b"this is a string")
Related
# Set up
s = open("Billion_of_UTF-8_chars.txt", encoding="UTF-8").read()

# The following doesn't look like a cheap operation,
# because Python 3 `str`s are UTF-8 encoded (EDIT: in some implementations only).
my_char = s[453_452_345]
However, many people write loops like this:
for i in range(len(s)):
    do_something_with(s[i])
performing the indexing operation n times or more.
How does Python 3 resolve the problem of indexing characters in strings for both code snippets?
Does it always perform a linear scan to find the nth character (a simple but expensive approach)?
Or does it store some additional C pointers to allow smart index calculations?
What is the computational complexity of Python 3's str.__getitem__?
A: O(1)
Python strings are not UTF-8 internally. In Python 3, when getting text from any external source, the text is decoded according to a given codec. This decoding defaults to UTF-8 on most sources/platforms, but varies with the OS default; in any case, all the relevant "text importing" APIs, like opening a file or connecting to a DB, allow you to specify the text encoding to use.
Internally, strings use one of "Latin-1", "UCS-2", or "UCS-4", according to the needs of the "widest" code point in the string.
This is new from Python 3.3 onwards (prior to that, the internal representation was fixed at build time to either UCS-2 or UCS-4, even for ASCII-only text). The spec is documented in PEP 393.
Since every character in a given string therefore occupies the same number of bytes, Python can locate the nth character with a single offset calculation: the data for s[n] starts at byte n * width.
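You can watch the flexible representation at work with sys.getsizeof; a small illustration (the exact byte counts vary by Python version and platform):
import sys

# The per-character storage width grows with the widest code point present:
# Latin-1 (1 byte/char), UCS-2 (2 bytes/char), or UCS-4 (4 bytes/char).
for text in ["abcd", "abc\u00e9", "abc\u6df7", "abc\U0001F600"]:
    print(f"{text!r}: {sys.getsizeof(text)} bytes")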
As an anecdote, Luciano Ramalho (author of the Fluent Python book) wrote LeanStr, a learning-purpose implementation of a string class that holds UTF-8 internally. Of course, your worries about __getitem__ complexity then apply: https://github.com/ramalho/leanstr
Unfortunately (or fortunately, in this case), a lot of the standard library and native code extensions to Python will not accept a class similar to str, even one that inherits from str, keeps its data separately, and re-implements all the dunder methods. But if all the str methods are in place, any pure-Python code dealing with strings should accept a LeanStr instance.
Other implementations: PyPy
So it happens that how text is stored internally is an "implementation detail", and PyPy from version 7.1 onwards does use UTF-8 byte strings internally for its text objects.
Unlike Ramalho's naive LeanStr above, however, PyPy keeps an index entry for every fourth UTF-8 character, so that access by index can still be done in O(1). I did not find any docs about it, but the code that builds the index is here.
I mentioned this question on Twitter, as I am an acquaintance of Ramalho, and eventually Carl Friedrich Bolz-Tereick, one of the PyPy developers, got back to me:
It's worked really quite well for us! Most Unicode strings don't need this index, and zero copy utf-8 decoding is quite cool. What's most annoying is actually str.find, because there you need the reverse conversion, from byte index to char index. we don't have an index for that.
I have a bytes-like object, something like:
aa = b'abc\u6df7\u5408def.mp3'
I want to save it into a file in binary mode. The code is below, but it doesn't work well:
if __name__ == "__main__":
    aa = b'abc\u6df7\u5408def.mp3'
    print(aa.decode('unicode-escape'))
    with open('database.bin', "wb") as datafile:
        datafile.write(aa)
The data in the file ends up containing the literal \u6df7\u5408 escape sequences as plain text,
but I want the file to contain the actual Unicode characters encoded as binary data.
How can I convert the bytes so that they are saved in the file that way?
\uNNNN escapes do not make sense in byte strings, because they do not specify a sequence of bytes. Unicode code points are conceptually abstract representations of characters, and they do not straightforwardly map to a serialization format (consisting of bytes or, in principle, any other sort of concrete symbolic representation).
There are well-defined serialization formats for Unicode; these are known as "encodings". You seem to be looking for the UTF-16 big-endian encoding of these characters.
aa = 'abc\u6df7\u5408def.mp3'.encode('utf-16-be')
With that out of the way, I believe the rest of your code should work as expected.
Unicode on disk is always encoded but you obviously have to know the encoding in order to read it correctly. An optional byte-order mark (BOM) is sometimes written to the beginning of serialized Unicode text files to help the reader discover the encoding; this is a single non-printing character whose sole purpose is to help disambiguate the encoding, and in particular its byte order (big-endian vs little-endian).
However, many places are standardizing on UTF-8 which doesn't require a BOM. The encoding itself is byte-oriented, so it is immune to byte order issues. Perhaps see also https://utf8everywhere.org/
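As a small sketch of the difference (the filename is just carried over from the question): the 'utf-16' codec prepends a BOM, while 'utf-16-be' and 'utf-8' do not.
text = 'abc\u6df7\u5408def.mp3'
for enc in ('utf-16-be', 'utf-16', 'utf-8'):
    # 'utf-16' writes a BOM first (ff fe on little-endian platforms)
    print(enc, text.encode(enc)[:6].hex(' '))

with open('database.bin', 'wb') as datafile:
    datafile.write(text.encode('utf-16-be'))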
Maybe a stupid question, but if I have some arbitrary binary data, can I cast it to string and back to byte array without corrupting it?
Is []byte(string(byte_array)) always the same as byte_array?
The expression []byte(string(byte_slice)) evaluates to a slice with the same length and contents as byte_slice. The capacity of the two slices may be different.
Although some language features assume that strings contain valid UTF-8 encoded text, a string can contain arbitrary bytes.
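A quick way to convince yourself, as a minimal check rather than a proof:
// The bytes here are deliberately not valid UTF-8.
package main

import (
	"bytes"
	"fmt"
)

func main() {
	original := []byte{0x00, 0xff, 0xfe, 0x80, 'a'}
	roundTrip := []byte(string(original))         // convert to string and back
	fmt.Println(bytes.Equal(original, roundTrip)) // prints: true
}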
I have some code that uses ReadStr and WriteStr for what I presume is writing a string to a binary file.
The explanation for WriteStr in the documentation states that it will write raw data in the shape of an AnsiString to the object's stream, which makes sense. But then ReadStr says that it reads a character. So are they not the opposite of each other?
Let's say I have:
pName: String[80];
and I use WriteStr on it, what does it actually write? Since WriteStr expects an AnsiString, does it cast pName to one? In that case, does it skip writing the "Length" field into the stream, since an AnsiString pointer points to the first element and not to the length field? From what I've seen, it appears that String == AnsiString these days, but my question about the length field remains the same.
If, let's say, it doesn't write the Length field into the file, does it still write the NUL at the end of the data? If so, can I find where the string ends by looking for a '\0'? Does ReadStr read until the NUL character?
Thank you kindly :)
In your pre-Unicode version of Delphi, WriteStr and ReadStr write and read an AnsiString value. The writing code writes the length, and then the string content. The reading code reads the length, allocates the string, and then fills it with the content.
This can involve truncation when you assign the result of ReadStr to your 80-character short string.
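If you need to read such a record from outside Delphi, the layout is a length field followed by the raw bytes. A hypothetical Python sketch, assuming a 1-byte length prefix; the actual field width depends on the Delphi version and stream class, so verify against your writer's output:
# Hypothetical reader for a length-prefixed string record (sketch only).
# ASSUMPTION: 1-byte length prefix, no trailing NUL; check the real bytes.
def read_prefixed_string(f):
    length = f.read(1)[0]  # length field (assumed width)
    return f.read(length)  # raw AnsiString content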
I have a trie-based dictionary on my drive that is encoded as a contiguous array of bit-packed 4-byte trie nodes. In Python, I would read it into an actual array of 4-byte integers in the following way:
import array

trie = array.array('I')
try:
    # some_limit caps how many items to read; hitting EOF earlier is fine
    trie.fromfile(open("trie.dat", "rb"), some_limit)
except EOFError:
    pass
How can I do the same in Haskell (reading from a file to an Array or Vector)? The best I could come up with is to read the file as usual and then take the bytes in chunks of four and massage them together arithmetically, but that's horribly ugly and also introduces a dependency on endianness.
encoded as a contiguous array of bit-packed 4-byte trie nodes
I presume the 'encoding' here is some Python format? You say "raw C-style array"?
To load data in this binary format (or any other) into Haskell, you can use the Data.Binary library and provide a Binary instance for your custom format.
For many existing data interchange formats there are libraries on Hackage, but you would need to specify the format. For image data, for example, there is repa-devil.
For truly raw data, you can mmap it to a bytestring, then process it further into a data structure.
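For instance, a sketch using the binary package's Data.Binary.Get, which makes the byte order explicit (getWord32le below is an assumption about how the file was written; swap in getWord32be if it was big-endian):
-- Read "trie.dat" (name taken from the question) as a flat sequence of
-- 32-bit words and pack it into an unboxed Vector.
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (Get, getWord32le, isEmpty, runGet)
import Data.Word (Word32)
import qualified Data.Vector.Unboxed as V

getNodes :: Get [Word32]
getNodes = do
  done <- isEmpty
  if done
    then pure []
    else (:) <$> getWord32le <*> getNodes

main :: IO ()
main = do
  bytes <- BL.readFile "trie.dat"
  let trie = V.fromList (runGet getNodes bytes)
  print (V.length trie)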