I'm running an analysis on a huge file which takes several hours to finish, and the result is a dictionary that I need for the next steps, so I want to save the output to a file to keep it. But when I write the output to a file, the dictionary gets converted to a str, and Python can't interpret the saved str as a dictionary later.
For example, my output dictionary is:
output = {'a': [1, 2]}
When I save it, it's being saved as:
"{'a': [1, 2]}" # can no longer be interpreted as a dictionary by Python for further use!
Is there any way to save my output as a dictionary in a file, or any way Python could convert the string from a file back into a dictionary?
If the dictionary consists solely of values of simple types, then you can use dump() and load() from the json module to produce and retrieve a text representation.
The representations produced by str() are not meant to be valid Python in all cases. It might be possible to reparse the representations produced by repr(), but json is a safe bet, and it's cross-language and cross-platform.
If the dictionary contains non-simple types, the json module has provisions allowing you to provide your own marshalling/unmarshalling for them.
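For example, a minimal round trip with the question's dict (the filename output.json is arbitrary):

import json

output = {'a': [1, 2]}

# save: write a JSON text representation of the dict
with open('output.json', 'w') as f:
    json.dump(output, f)

# load: parse the text back into a real dict
with open('output.json') as f:
    restored = json.load(f)

assert restored == output

For non-simple values, json.dump() accepts a default= callable to serialize them, and json.load() accepts an object_hook= to rebuild them.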
I want to store the output of a Python function's locals() in a persistent object in utf-8 so that a person can access human-readable parameters and also use them again, as in func(**ast.literal_eval(stored_parameters)).
Is there a way to do this "safely", or is literal_eval sufficient to ensure the stored parameters probably aren't malicious?
If there are no safeguards to prevent stored_parameters from being modified after it was created, what risks are there when using the parameters in a function after they've been extracted with literal_eval (in terms of security, or data integrity, or some other sense I haven't thought of)?
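For concreteness, here is a minimal sketch of the round trip I have in mind (the function and parameter names are illustrative):

import ast

def func(a=1, b="x"):
    # repr() of locals() is human-readable and, for simple literal
    # values, can be parsed back by ast.literal_eval
    return repr(locals())

stored_parameters = func(a=42, b="hello")  # "{'a': 42, 'b': 'hello'}"

# literal_eval only evaluates Python literals (numbers, strings, tuples,
# lists, dicts, sets, booleans, None), so it never executes code -- but
# it does not validate that the values are still what func expects
params = ast.literal_eval(stored_parameters)
func(**params)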
''' Set up '''
s = open("Bilion_of_UTF-8_chars.txt", encoding="UTF-8").read()
'''
The following doesn't look like a cheap operation,
because Python 3 `str`s are UTF-8 encoded (EDIT: in some implementations only).
'''
my_char = s[453_452_345]
However, many people write loops like this:
for i in range(len(s)):
    do_something_with(s[i])
using the indexing operation up to n times or more.
How does Python3 resolve the problem of indexing UTF-8 characters in strings for both code snippets?
Does it always perform a linear look-up for the nth char (which would be a simple but expensive resolution)?
Or maybe it stores some additional C pointers to perform smart index calculations?
What is the computational complexity of Python 3's str.__getitem__?
A: O(1)
Python strings are not UTF-8 internally: in Python 3, when getting text from any external source, the text is decoded according to a given codec. This decoding defaults to UTF-8 on most sources/platforms, but varies according to the OS's defaults; in any case, all relevant "text importing" APIs, like opening a file or connecting to a DB, allow you to specify the text encoding to use.
Internally, strings use one of Latin-1, UCS-2 or UCS-4, according to the needs of the "widest" codepoint in the string.
This is new from Python 3.3 onwards (prior to that, the internal string representation defaulted to 32-bit UCS-4, even for ASCII-only text). The spec is documented in PEP 393.
Therefore, Python can jump straight to the correct character given its index.
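You can observe these per-codepoint widths on CPython 3.3+ with sys.getsizeof (the exact byte counts vary by build and version):

import sys

# per-character storage grows with the widest codepoint present
ascii_s = "a" * 100          # fits in Latin-1: 1 byte per char
bmp_s = "\u0394" * 100       # GREEK CAPITAL DELTA, needs UCS-2: 2 bytes per char
wide_s = "\U0001F600" * 100  # emoji, needs UCS-4: 4 bytes per char

for s in (ascii_s, bmp_s, wide_s):
    print(len(s), sys.getsizeof(s))
# the sizes differ by roughly 100, 200 and 400 bytes over the fixed
# header: fixed-width layouts are what make s[i] a simple O(1) lookup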
As an anecdote, Luciano Ramalho (author of the Fluent Python book) wrote LeanStr, a learning-purpose implementation of a string class that holds UTF-8 internally. Of course, your worries about __getitem__ complexity then apply: https://github.com/ramalho/leanstr
Unfortunately (or fortunately, in this case), a lot of the standard library and native code extensions to Python will not accept a class similar to str, even if it inherits from str and keeps its data separately, re-implementing all the dunder methods. But if all the str methods are in place, any pure-Python code dealing with strings should accept a LeanStr instance.
Other implementations: PyPy
So, it happens that how text is stored internally is an "implementation detail", and PyPy from version 7.1 onwards does use UTF-8 byte strings internally for its text objects.
Unlike Ramalho's naive LeanStr above, however, PyPy keeps an index for every 4th UTF-8 char so that access by index can still be done in O(1). I did not find any docs about it, but the code that creates the index is here.
I mentioned this question on Twitter, as I am an acquaintance of Ramalho, and eventually Carl Friedrich Bolz-Tereick, one of the PyPy developers, replied:
It's worked really quite well for us! Most Unicode strings don't need this index, and zero copy utf-8 decoding is quite cool. What's most annoying is actually str.find, because there you need the reverse conversion, from byte index to char index. we don't have an index for that.
I'm trying to make a function that can take an argument and return a unique, short expression of that data.
A hash.
There's a whole hashlib package for doing this, but hashlib only takes bytes. I want to easily hash anything: lists, functions, classes, anything.
How can I either convert anything into a unique string representation so I can hash it, or better yet, directly hash anything?
I thought you might be able to get the bytes() representation of an object, but that needs special encodings for whatever it's given, and whatnot. So I'm not sure if there's a solution there.
def hash_any(thing):
    # convert thing to a string of its unique byte data
    # return hashlib.sha256(byte_data_str)
How would you go about doing this?
Edit: I've found the correct vernacular to find what I'm looking for. This is what I mean:
Alternative to python hash function for arbitrary objects
What is the quickest way to hash a large arbitrary object?
Create Hash for Arbitrary Objects?
I'm sure one of these contains the solution I seek.
This is working for now; it's not optimal or efficient, but it's fine for what I need.
import json

def string_this(thing):
    '''
    https://stackoverflow.com/questions/60103855/how-to-convert-anything-to-a-string-bytes-object-so-it-can-be-hashed

    Attempts to turn anything into a string that represents its underlying
    data most accurately, such that it can be hashed.
    '''
    if isinstance(thing, str):
        return thing
    try:
        # objects like DataFrames provide their own JSON serialization
        return thing.to_json()
    except Exception:
        try:
            # other things without a built-in to_json method
            return json.dumps(thing)  # dumps(), not dump(): dump() writes to a file
        except Exception:
            # hopefully it's a Python primitive type
            return str(thing)
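Wired into hashlib as the hash_any from the question (just a sketch; the .encode() is needed because hashlib works on bytes, not str):

import hashlib

def hash_any(thing):
    # turn the object into its string form, then hash the UTF-8 bytes
    byte_data = string_this(thing).encode("utf-8")
    return hashlib.sha256(byte_data).hexdigest()

print(hash_any({"a": [1, 2]}))   # same dict -> same digest
print(hash_any([3, 1.5, "x"]))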
I'm trying to preset zlib's dictionary for compression. As of Python 3.3, the zlib.compressobj function offers the option. The docs say it should be some bytearray or bytes object, e.g. b"often-found".
Now: how do I pass multiple strings, ordered ascending by their likeliness to occur, as suggested in the docs? Is there a secret delimiter, e.g. b"likely,more-likely,most-likely"?
No, there is no delimiter needed. The dictionary is simply a resource in which to look for strings that match portions of the data to be compressed. Therefore, strings that are likely to occur can simply be concatenated, or even overlapped if starts and ends match. For example, if you want the words lighthouse and household to be available, you can just put lighthousehold in the dictionary.
Since it takes more bits to represent matches that are further back, you would put the most likely matches at the end of the dictionary.
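A minimal sketch (the dictionary and data strings are illustrative); note that the decompressor must be primed with the exact same dictionary:

import zlib

# most likely matches go at the END, since closer matches cost fewer bits
zdict = b"less-likelymore-likelymost-likely"

data = b"most-likely phrases compress best: most-likely, more-likely"

comp = zlib.compressobj(zdict=zdict)
compressed = comp.compress(data) + comp.flush()

# prime the decompressor with the same preset dictionary
decomp = zlib.decompressobj(zdict=zdict)
assert decomp.decompress(compressed) == data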
I want to convert an array into xdr format and save it in binary format. Here's my code:
# myData is a Pandas data frame whose 3rd col is int (but it could be anything else)
import xdrlib
p = xdrlib.Packer()
p.pack_list(myData[2], p.pack_int)
newFile = open("C:\\Temp\\test.bin", "wb")
# not sure what to put:
# p.get_buffer() returns a string as per the documentation, but how can I provide the xdr object?
newFile.write(???)
newFile.close()
How can I provide the XDR-"packed" data to newFile.write function?
Thanks
XDR is a pretty raw data format. Its specification (RFC 1832) doesn't specify any file headers, or anything else, beyond the encoding of various data types.
The binary string you get from p.get_buffer() is the XDR encoding of the data you've fed to p. There is no other kind of XDR object.
I suspect that what you want is simply newFile.write(p.get_buffer()).
Unrelated to the XDR question, I'd suggest using a with statement to take care of closing the file.
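Putting it together, something like this should work (a plain list stands in for the pandas column here):

import xdrlib

values = [1, 2, 3]  # stand-in for myData[2]

p = xdrlib.Packer()
p.pack_list(values, p.pack_int)

# get_buffer() returns the raw XDR bytes; that's all there is to write
with open("C:\\Temp\\test.bin", "wb") as f:
    f.write(p.get_buffer())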