How to write XDR-packed data to a binary file - python-3.x

I want to convert an array to XDR format and save it in binary form. Here's my code:
# myData is a pandas DataFrame whose 3rd column is int (but it could be anything else)
import xdrlib
p = xdrlib.Packer()
p.pack_list(myData[2], p.pack_int)
newFile = open("C:\\Temp\\test.bin", "wb")
# not sure what to put here
# p.get_buffer() returns a string per the docs, but how can I provide the XDR object?
newFile.write(???)
newFile.close()
How can I pass the XDR-packed data to the newFile.write function?
Thanks

XDR is a pretty raw data format. Its specification (RFC 1832) doesn't define any file headers, or anything else beyond the encoding of the various data types.
The binary string you get from p.get_buffer() is the XDR encoding of the data you've fed to p. There is no other kind of XDR object.
I suspect that what you want is simply newFile.write(p.get_buffer()).
Unrelated to the XDR question, I'd suggest using a with statement to take care of closing the file.
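Putting that together (and using with for the file handling), a minimal sketch; myData is the DataFrame from your question, and note that xdrlib is deprecated since Python 3.11 and removed in 3.13:
import xdrlib  # stdlib; deprecated in Python 3.11, removed in 3.13

p = xdrlib.Packer()
p.pack_list(myData[2], p.pack_int)  # myData[2]: the int column from the question

# get_buffer() returns the packed bytes -- write them straight to the file
with open("C:\\Temp\\test.bin", "wb") as newFile:
    newFile.write(p.get_buffer())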

Related

How can JSONEachRow support binary data?

This document suggests that JSONEachRow Format can handle binary data.
http://www.devdoc.net/database/ClickhouseDocs_19.4.1.3-docs/interfaces/formats/#jsoneachrow
From my understanding, binary data can contain arbitrary bytes, so it can represent anything, and that can break the JSON structure.
For example, a string may contain bytes that, interpreted as UTF-8, look like a quote, or that are simply invalid UTF-8.
So,
How are they achieving it?
How do they know when that string actually ends?
In the end the DB needs to interpret the commands and values, so it must be using some encoding to do that.
Please correct me if I am wrong somehow.
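I can't speak to ClickHouse's exact escaping rules, but the general mechanism is standard JSON string escaping: any byte that could terminate or confuse the string (the quote, the backslash, control characters) is escaped, so the parser always knows where the string ends. A minimal Python sketch of the idea:
import json

# bytes that would "break" a naive JSON string: a quote, a backslash, a NUL
raw = b'ab"c\\d\x00e'

# latin-1 maps every byte 0-255 to a code point, so nothing is lost;
# json.dumps then escapes anything that could end the string early
row = json.dumps({"s": raw.decode("latin-1")})
print(row)  # {"s": "ab\"c\\d\u0000e"}

# the parser stops only at the first unescaped quote, so the boundary
# is unambiguous and the original bytes round-trip
back = json.loads(row)["s"].encode("latin-1")
assert back == raw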

Is there any difference between using/not using "astype(np.float)" for an ndarray?

I'm importing a txt file that contains only numbers, for some coding practice.
I noticed that I get the same result with either code_1 or code_2:
code_1 = np.array(pd.read_csv('e:/data.txt', sep='\t', header=None)).astype(np.float)
code_2 = np.array(pd.read_csv('e:/data.txt', sep='\t', header=None))
So I wonder: is there any difference between using or not using .astype(np.float)?
Please point me to a similar question if one exists. Thanks a lot.
The DataFrame.astype() method is used to cast a pandas object to a specified dtype. astype() can also convert any suitable existing column to a categorical type.
It comes in handy when you want to cast a particular column to another data type.
In your case, the file is loaded as a DataFrame, and the numbers are loaded as integers or floats depending on their form. astype(np.float) converts them to floats. If the numbers are already of float type, then, as you saw, there is no difference between the two.
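A quick way to see when it matters (the file contents here are made up; note also that np.float is deprecated in recent NumPy releases, so plain float or np.float64 is the safer spelling):
import numpy as np
import pandas as pd
from io import StringIO

# whole numbers only -> pandas infers an integer dtype
ints = pd.read_csv(StringIO("1\t2\n3\t4"), sep="\t", header=None)
print(np.array(ints).dtype)                # int64
print(np.array(ints).astype(float).dtype)  # float64 <- astype makes a difference

# any decimal point -> pandas infers float already
floats = pd.read_csv(StringIO("1.5\t2\n3\t4"), sep="\t", header=None)
print(np.array(floats).dtype)              # float64 with or without astype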

Reading from a binary file with a given data structure

I want to read the 'Last Traded Price' from the given binary file. How do I extract specific data from the file using format strings like 'hhl10s6sc'? I know I have to use the struct.unpack method, but where can I learn to write such format strings (with some illustrations) so that I can extract any data I want from such a binary file?
The thing troubling me is the format string the writer of the code (that I'm trying to understand) has used: 'hlhcl6s10s11s10s2s1s10s12schc'. I understand what 6s...12s mean, but what's the significance of 'hlhcl' (the 5 characters at the beginning) and 'chc' (the 3 at the end)? The writer is trying to retrieve the 'Last Traded Price' from the data structure.
If you could give some examples and/or sources, it'd be very helpful.
[image: the data structure of the given file]
struct format strings describe fields in order. Every letter is a format character, so hlhcl translates to "short, long, short, char, long". This doesn't resemble the image you linked (which is a tad impractical, being off-site and an extra step to look up): that structure starts with a single long and otherwise holds only strings. The hlhcl part might belong to a protocol wrapping that packet.
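To see how such a format string behaves, here is a sketch with an invented field layout (not the one in your image). Prefixing the format with a byte-order character such as '<' stops Python from inserting platform-dependent padding, and struct.calcsize tells you the record size:
import struct

# hypothetical record: short, long, short, char, long, then two strings
fmt = "<hlhcl6s10s"          # '<' = little-endian, no implicit padding
size = struct.calcsize(fmt)
print(size)                  # 29 bytes: 2+4+2+1+4+6+10

# pack a sample record, then unpack it as you would bytes read from a file
record = struct.pack(fmt, 1, 2, 3, b"A", 123456, b"SYMBOL", b"2024-01-01")
print(struct.unpack(fmt, record))
# (1, 2, 3, b'A', 123456, b'SYMBOL', b'2024-01-01')

# reading a file record by record:
# with open("feed.bin", "rb") as f:
#     while chunk := f.read(size):
#         fields = struct.unpack(fmt, chunk)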

pySpark DataFrame FloatType() with file coming in as unicode

Hello, I have the following schema:
[StructField(record_id,StringType,true), StructField(offer_id,FloatType,true)]
The file I am importing is coming in as unicode.
With sc.textFile, setting use_unicode to False still produces a string type error. My question: before I load the data into the DataFrame, do I have to cleanse it (i.e. convert the unicode values to float before declaring the field FloatType)?
What is the most efficient way to do this, especially as I scale to thousands of fields?
It is NOT good practice to convert implicitly between unrelated data types, so (almost) no system will do it automagically. Yes, you have to tell the system, and the system will accept that you are taking the risk of future failure (what happens if the string field suddenly contains "abc"?).
You should use a map function as a translation layer between your sc.textFile and the createDataFrame / apply-schema step. All casting to the correct data types should happen there.
If you have thousands of fields, you may want to implement an infer-schema mechanism: take a sample of the data to decide which schema to use, then apply it to the whole dataset.
(Assuming Spark 1.3.1 release)
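A minimal sketch of that translation layer with the Spark 1.x RDD API (the field names come from the question; the tab delimiter, the file path, and a live sc/sqlContext are assumptions):
from pyspark.sql.types import StructType, StructField, StringType, FloatType

schema = StructType([
    StructField("record_id", StringType(), True),
    StructField("offer_id", FloatType(), True),
])

def parse(line):
    # all casting happens here, between textFile and createDataFrame
    record_id, offer_id = line.split("\t")
    return (record_id, float(offer_id))

rdd = sc.textFile("data.txt").map(parse)
df = sqlContext.createDataFrame(rdd, schema)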

Reading from files to arrays in Haskell

I have a trie-based dictionary on my drive, encoded as a contiguous array of bit-packed 4-byte trie nodes. In Python I would read it into an actual array of 4-byte integers the following way:
import array
trie = array.array('I')
try:
    trie.fromfile(open("trie.dat", "rb"), some_limit)
except EOFError:
    pass
How can I do the same in Haskell (reading from a file to an Array or Vector)? The best I could come up with is to read the file as usual and then take the bytes in chunks of four and massage them together arithmetically, but that's horribly ugly and also introduces a dependency on endianness.
encoded as a contiguous array of bit-packed 4-byte trie nodes
I presume the 'encoding' here is some Python format? Or do you mean a raw C-style array?
To load this binary format (or any other) into Haskell you can use the Data.Binary library and provide a Binary instance for your custom format.
For many existing data interchange formats there are libraries on Hackage, though you would need to specify the format. For image data, for example, there is repa-devil.
For truly raw data, you can mmap it to a ByteString, then process it further into a data structure.
