I have a trie-based dictionary on my drive that is encoded as a contiguous array of bit-packed 4-byte trie nodes. In Python I would read it to an actual array of 4-byte integers the following way:
import array
trie = array.array('I')
try:
    trie.fromfile(open("trie.dat", "rb"), some_limit)
except EOFError:
    pass
How can I do the same in Haskell (reading from a file to an Array or Vector)? The best I could come up with is to read the file as usual and then take the bytes in chunks of four and massage them together arithmetically, but that's horribly ugly and also introduces a dependency on endianness.
"encoded as a contiguous array of bit-packed 4-byte trie nodes"
I presume the 'encoding' here is some Python format? You say "raw C-style array"?
To load the data of this binary (or any other format) into Haskell you can use the Data.Binary library, and provide an instance of Binary for your custom format.
Libraries exist on Hackage for many established data interchange formats, but you would need to say which format you have. For image data, for example, there is repa-devil.
For truly raw data, you can mmap it to a bytestring, then process it further into a data structure.
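For comparison with the Python snippet in the question, the mmap approach can be sketched in Python as follows. This only illustrates the idea and is not Haskell code: `load_nodes` and its `byteorder` parameter are hypothetical names, and little-endian is an assumption about how the file was written. Naming the byte order explicitly avoids depending on the host's endianness.

```python
import mmap
import struct

def load_nodes(path, byteorder="<"):
    """Map a file and decode it as unsigned 32-bit trie nodes.

    byteorder is a struct prefix: "<" little-endian (assumed here), ">" big-endian.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; mmap raises ValueError for an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            usable = len(mm) - len(mm) % 4   # ignore a trailing partial node
            data = mm[:usable]               # slicing copies the bytes out of the map
    return [node for (node,) in struct.iter_unpack(byteorder + "I", data)]
```

The list comprehension is the "process it further" step; an `array.array` or a NumPy array could take its place if a compact representation matters.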
This document suggests that JSONEachRow Format can handle binary data.
http://www.devdoc.net/database/ClickhouseDocs_19.4.1.3-docs/interfaces/formats/#jsoneachrow
As I understand it, binary data can contain arbitrary bytes, so it can represent anything, including sequences that would break the JSON structure.
For example, a string may contain bytes that represent a quote when interpreted as UTF-8, or bytes that are not valid UTF-8 at all.
So,
How are they achieving it?
How do they know when that string is actually ending?
In the end, the DB needs to interpret the command and the values, so they must be using some encoding to do that.
Please correct me if I am wrong somehow.
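How ClickHouse handles this internally is not stated here, so the following is only a general illustration of why a quote or an arbitrary byte need not break a JSON string: the writer escapes such characters and the reader undoes the escaping. Mapping each byte to one code point through Latin-1 is an assumption chosen for this sketch, not a description of ClickHouse's behaviour.

```python
import json

raw = bytes(range(256))              # every possible byte value, including 0x22 (")

# Assumption for this sketch: map each byte to the code point of the same value.
as_text = raw.decode("latin-1")

# json.dumps escapes quotes, backslashes, control characters and (by default)
# all non-ASCII characters, so the resulting line is plain ASCII.
line = json.dumps({"payload": as_text})

# The only unescaped quotes left are the ones that delimit the key and the value,
# so the parser knows exactly where the string ends.
restored = json.loads(line)["payload"].encode("latin-1")
```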
I have a bytes-like object something like:
aa = b'abc\u6df7\u5408def.mp3'
I want to save it to a file in binary mode. My code is below, but it does not work as intended:
if __name__ == "__main__":
    aa = b'abc\u6df7\u5408def.mp3'
    print(aa.decode('unicode-escape'))
    with open('database.bin', "wb") as datafile:
        datafile.write(aa)
The file ends up containing the \uNNNN sequences as literal text, but I want the Unicode characters themselves stored as binary data.
How can I convert the bytes so that they are saved to the file in that form?
\uNNNN escapes do not make sense in byte strings because they do not specify a sequence of bytes. Unicode code points are conceptually abstract representations of strings, and do not straightforwardly map to a serialization format (consisting of bytes, or, in principle, any other sort of concrete symbolic representation).
There are well-defined serialization formats for Unicode; these are known as "encodings". You seem to be looking for the UTF-16 big-endian encoding of these characters.
aa = 'abc\u6df7\u5408def.mp3'.encode('utf-16-be')
With that out of the way, I believe the rest of your code should work as expected.
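If the input really is a byte string containing literal backslash-u sequences, as in the question, the conversion might be sketched like this. `save_decoded` is a hypothetical helper name, and UTF-16 big-endian is assumed to be the desired on-disk format.

```python
def save_decoded(raw, path, encoding="utf-16-be"):
    """Interpret literal \\uNNNN sequences in raw, then write the text in encoding."""
    # Step 1: turn the escape sequences into a real str.
    text = raw.decode("unicode-escape")
    # Step 2: pick a concrete encoding to turn the str back into bytes.
    encoded = text.encode(encoding)
    with open(path, "wb") as datafile:
        datafile.write(encoded)
    return encoded
```

For the question's data this would be called as `save_decoded(b'abc\\u6df7\\u5408def.mp3', 'database.bin')`; the doubled backslashes spell out the same bytes as the literal in the question.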
Unicode on disk is always encoded but you obviously have to know the encoding in order to read it correctly. An optional byte-order mark (BOM) is sometimes written to the beginning of serialized Unicode text files to help the reader discover the encoding; this is a single non-printing character whose sole purpose is to help disambiguate the encoding, and in particular its byte order (big-endian vs little-endian).
However, many places are standardizing on UTF-8, which doesn't require a BOM. The encoding itself is byte-oriented, so it is immune to byte order issues. See also https://utf8everywhere.org/
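The difference between these encodings is easy to see by comparing the bytes Python produces for the same text. This is a small sketch using only the standard codecs; the sample string is taken from the question.

```python
text = 'abc\u6df7\u5408'

with_bom = text.encode('utf-16')        # BOM first, then the platform's native byte order
big_endian = text.encode('utf-16-be')   # explicit byte order, no BOM
utf8 = text.encode('utf-8')             # byte-oriented, no BOM
utf8_sig = text.encode('utf-8-sig')     # UTF-8 with the optional BOM prepended

# A reader that knows the encoding gets the same text back in every case.
decoded = [
    with_bom.decode('utf-16'),
    big_endian.decode('utf-16-be'),
    utf8.decode('utf-8'),
    utf8_sig.decode('utf-8-sig'),
]
```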
In Python 3, executing:
memoryview("this is a string")
produces the error:
TypeError: memoryview: str object does not have the buffer interface
What should I do in order to make memoryview accept strings or what transformation should I do to my strings in order to be accepted by memoryview?
From the docs, memoryview works only over objects that support the buffer protocol, such as bytes and bytearray. (These are similar types, except that the former is read-only.)
Strings in Python 3 are not raw byte buffers that we can manipulate directly; they are immutable sequences of Unicode code points. A str can be converted to a buffer, though, by encoding it with any of the supported string encodings, such as 'utf-8' or 'ascii'.
memoryview(bytes("This is a string", encoding='utf-8'))
Note that the bytes() call necessarily involves converting and copying the string data into a new buffer accessible to memoryview. As should be evident from the preceding paragraph, it is not possible to directly create a memoryview over the str's data.
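A short sketch of what that copy means in practice, with a bytearray added as an assumed variation for the case where a writable buffer is wanted:

```python
text = "This is a string"

# Encoding copies the characters into a new, read-only bytes object.
view = memoryview(text.encode("utf-8"))
first = view[0]            # indexing a byte view yields an int
head = bytes(view[:4])     # the slice shares memory; bytes() makes the copy

# A bytearray gives a writable buffer, but it is still a copy of the str's data.
buf = bytearray(text, "utf-8")
writable = memoryview(buf)
writable[0] = ord("t")     # changes buf, never the original str
```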
Considering that the error already provides clarity on the issue and camflint's answer, I'll only add that you could succinctly create a memoryview instance from a string as follows:
memoryview( b"this is a string" )
I want to convert an array into xdr format and save it in binary format. Here's my code:
# myData is a Pandas data frame whose 3rd column is int (but it could be anything else)
import xdrlib
p=xdrlib.Packer()
p.pack_list(myData[2],p.pack_int)
newFile=open("C:\\Temp\\test.bin","wb")
# not sure what to put
# p.get_buffer() returns a string as per document, but how can I provide xdr object?
newFile.write(???)
newFile.close()
How can I provide the XDR-"packed" data to newFile.write function?
Thanks
XDR is a fairly raw data format. Its specification (RFC 1832, since superseded by RFC 4506) defines only the encoding of various data types; it does not specify file headers or anything else.
The binary string you get from p.get_buffer() is the XDR encoding of the data you've fed to p. There is no other kind of XDR object.
I suspect that what you want is simply newFile.write(p.get_buffer()).
Unrelated to the XDR question, I'd suggest using a with statement to take care of closing the file.
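A sketch of the with-statement suggestion follows. Because xdrlib is no longer part of the standard library as of Python 3.13, the sketch builds the bytes with struct instead, relying on the fact that an XDR int is a 4-byte big-endian two's-complement value. `write_xdr_ints` is a hypothetical helper; it writes the integers back to back, as in an XDR fixed-length array, and is not a byte-for-byte replacement for `pack_list`.

```python
import struct

def write_xdr_ints(path, values):
    """Write values as XDR signed ints (4 bytes each, big-endian)."""
    payload = b"".join(struct.pack(">i", v) for v in values)
    # The with statement closes the file even if write() raises.
    with open(path, "wb") as new_file:
        new_file.write(payload)
    return payload
```

With xdrlib available, the body of the with block would simply be `new_file.write(p.get_buffer())`.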
I'm running an analysis on a huge file that takes several hours to finish. The result is a dictionary that I need for the next steps, so I want to save it to a file. But when I write the output to a file, the dictionary is converted to a str, and Python cannot interpret the saved str as a dictionary later.
For example, my output dictionary is:
output={a:[1,2]}
When I save it, it is stored as:
'{a:[1,2]}' # cannot be interpreted as a dictionary by Python for further use
Is there any way to save my output as a dictionary in a file, or any way for Python to convert the string from the file back into a dictionary?
If the dictionary consists solely of values of simple types, you can use dump() and load() from the json module to produce and retrieve a text representation.
The representations produced by str() are not meant to be valid Python in all cases. It might be possible to reparse the representations produced by repr(), but json is a safe bet, and it works across programming languages and platforms.
If the dictionary contains non-simple types, the json module has provisions that allow you to provide your own marshalling and unmarshalling for them.
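A minimal sketch of that round trip. The helper names `save_result`, `load_result` and `encode_extra` are hypothetical, and treating a set as the "non-simple" type is just an assumption for the example.

```python
import json

def encode_extra(obj):
    """Called by json for values it cannot serialise on its own."""
    if isinstance(obj, set):
        return sorted(obj)           # store a set as a sorted list
    raise TypeError(f"cannot serialise {type(obj).__name__}")

def save_result(result, path):
    """Write a dictionary to path as JSON text."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(result, f, default=encode_extra)

def load_result(path):
    """Read the dictionary back from path."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

Two things to keep in mind: JSON object keys are always strings, so the example dictionary would be written as `{"a": [1, 2]}`, and a value that went through `encode_extra` comes back as a plain list unless an `object_hook` is supplied when loading.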