Unpickling from converted string in python/numpy - python-3.x

I have a ton of numpy ndarrays that are stored pickled to strings. That may have been a poor design choice, but it's what I did. Now the pickled strings seem to have been converted somewhere along the way: when I try to unpickle, I notice they are of type str and I get the following error:
TypeError: 'str' does not support the buffer interface
when I invoke
numpy.loads(bin_str)
Where bin_str is the thing I'm trying to unpickle. If I print out bin_str it looks like
b'\x80\x02cnumpy.core.multiarray\n_reconstruct\nq\x00cnumpy\nndarray\nq\x01K\x00\x85q\x02c_codecs\nencode\nq\x03X\x01\x00\x00\ ...
continuing for some time, so the info seems to be there; I'm just not quite sure how to convert it into whatever string format numpy/pickle need. On a whim I tried
numpy.loads( bytearray(bin_str, encoding='utf-8') )
and
numpy.loads( bin_str.encode() )
which both throw an error _pickle.UnpicklingError: unpickling stack underflow. Any ideas?
PS: I'm on python 3.3.2 and numpy 1.7.1
Edit
I discovered that if I do the following:
open('temp.txt', 'wb').write(...)
return numpy.load( 'temp.txt' )
I get back my array, where ... denotes copying and pasting the output of print(bin_str) from another window. I've tried writing bin_str to a file directly in order to unpickle it, but that doesn't work; it complains that TypeError: 'str' does not support the buffer interface. A few sane ways of converting bin_str to something that can be written directly to a binary file result in pickle errors when trying to read it back.
Edit 2
So I guess what's happened is that my binary pickle string ended up encoded inside of a normal string, something like:
"b'pickle'"
which is unfortunate, and I haven't figured out how to deal with it, except via this ridiculous and convoluted way to get it back:
open('temp.py', 'w').write('foo = ' + bin_str)
from temp import foo
numpy.loads( foo )
This seems like a very shameful solution to the problem, so please give me a better one!

It sounds like your saved strings are the reprs of the original bytes instances returned by your pickling code. That's a bit unfortunate, but not too bad. repr is intended to return a "machine friendly" representation of an object, and it can often be reversed by using eval:
import numpy as np
import pickle
# this part has already happened
orig_obj = np.array([1,2,3])
orig_pickle = pickle.dumps(orig_obj)
saved_str = repr(orig_pickle) # this was a mistake, but it's already done
# this is what you need to do to get something equivalent to orig_obj back
reconstructed_pickle = eval(saved_str)
reconstructed_obj = pickle.loads(reconstructed_pickle)
# test
if np.all(reconstructed_obj == orig_obj):
    print("It worked!")
Obligatory note that using eval can be dangerous: be aware that eval can run any Python code it wants, so don't call it with untrusted data. However, pickle data carries the same risk (a malicious pickle string can run arbitrary code upon unpickling), so you're not losing much safety in this situation. I'm guessing you trust your data in this case anyway.
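If the eval still makes you uneasy, ast.literal_eval is a safer drop-in for this particular job (a sketch added here, not from the original answer): it only parses literals, including bytes literals like the one above, and refuses to run code. Note that pickle.loads itself can still execute arbitrary code, so this only hardens the eval step.
import ast
import pickle
import numpy as np
# the kind of string that was saved: the repr of a pickled bytes object
saved_str = repr(pickle.dumps(np.array([1, 2, 3])))
# literal_eval parses the b'...' literal back into bytes without executing anything
reconstructed_obj = pickle.loads(ast.literal_eval(saved_str))
print(reconstructed_obj)  # [1 2 3]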

Related

Python3 - Reading mixed data from a file and converting the read values to float

I have the following data stored in a file (tn.csv).
The content is as follows (the original file is very large, but I simplified it):
[file contents attached as file_content.csv, not reproduced here]
I read the data using the following source code:
[reading script attached as python_script.py, not reproduced here]
It nicely produces a numpy array of strings:
[output attached as sample_output.png, not reproduced here]
Ultimately I want to use the read data as a numpy array of floats in my python script.
All the entries should be float values, so I would like to cast the entries from string to float. As some entries are not numbers but strings of text (they are variables in my script), I cannot perform the cast directly; Python throws a ValueError.
Note: I define the (vector) variable nv before I read this matrix. Hence, it is well defined. So, the complexity is that some entries of the numpy arrays are variables.
What I would like to have is the following:
If I had typed the contents of the file (tn.csv) directly into my python script in the form of a numpy array, I would have been able to use it without any conversion. But the matrix is very big, so I store it in an external file. Now I want to read the file and store it in a numpy array, and the end result should be the same as if I had typed it directly in my script. Could someone help me tackle this issue or point me to relevant links?
This is my first post on Stack Overflow. If my question is not clear, please let me know and I will rewrite it.
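One possible approach (an editor's sketch, not from the original thread; it assumes a 2-D comma-separated file and that the only non-numeric entries are names of previously defined variables such as nv):
import numpy as np
nv = 2.5  # hypothetical: the variable defined before the matrix is read
def to_float(entry, env):
    # Try a direct cast; fall back to looking the entry up as a variable name.
    entry = entry.strip()
    try:
        return float(entry)
    except ValueError:
        return float(env[entry])  # e.g. 'nv' resolves to the value of nv
raw = np.genfromtxt('tn.csv', dtype=str, delimiter=',')  # 2-D array of strings
matrix = np.array([[to_float(e, {'nv': nv}) for e in row] for row in raw])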

Python 3.7, Feedparser module cannot parse BBC weather feed

When I parse the example RSS link provided by BBC Weather, it gives only an empty feed. The example link is: "https://weather-broker-cdn.api.bbci.co.uk/en/forecast/rss/3day/2643123"
I've tried using the feedparser module in Python; I would like to do this in either Python or C++, but Python seemed easier. I've also tried rewriting the URL without https:// and with .xml, and it still doesn't work.
import feedparser
d = feedparser.parse('https://weather-broker-cdn.api.bbci.co.uk/en/forecast/rss/3day/2643123')
print(d)
It should give a result similar to the RSS feed at the link, but it just gets an empty feed.
First, I know you got no result, not an error like me. Perhaps you are running a different version. As I mentioned, it yields a result on an older version in Python 2, in a program that has been running solidly every night for about 5 years, but it throws an exception on a freshly installed feedparser 5.2.1 on Python 3.7.4 64-bit.
I'm not entirely sure what is going on, but it's the function called _gen_georss_coords that is throwing a StopIteration on the first call. I have seen some references to this error arising from the implementation of PEP 479. The function is written as a generator, but for your RSS feed it only has to return one tuple. Here is the offending function:
def _gen_georss_coords(value, swap=True, dims=2):
    # A generator of (lon, lat) pairs from a string of encoded GeoRSS
    # coordinates. Converts to floats and swaps order.
    latlons = map(float, value.strip().replace(',', ' ').split())
    nxt = latlons.__next__
    while True:
        t = [nxt(), nxt()][::swap and -1 or 1]
        if dims == 3:
            t.append(nxt())
        yield tuple(t)
There is something curious going on, perhaps to do with PEP 479 and the fact that there are effectively two generators at work in the same function, that causes StopIteration to bubble up to the calling function. Anyway, I rewrote it in a somewhat more straightforward way:
def _gen_georss_coords(value, swap=True, dims=2):
    # A generator of (lon, lat) pairs from a string of encoded GeoRSS
    # coordinates. Converts to floats and swaps order.
    latlons = list(map(float, value.strip().replace(',', ' ').split()))
    for i in range(0, len(latlons), dims):  # consume the floats dims at a time
        t = [latlons[i], latlons[i+1]][::swap and -1 or 1]
        if dims == 3:
            t.append(latlons[i+2])
        yield tuple(t)
You can define the new function above in your code, then execute the following to patch it into feedparser:
saveit, feedparser._gen_georss_coords = (feedparser._gen_georss_coords, _gen_georss_coords)
Once you're done with it, you can restore feedparser to its previous state with
feedparser._gen_georss_coords, _gen_georss_coords = (saveit, feedparser._gen_georss_coords)
Or, if you're confident that this is solid, you can modify feedparser itself. Anyway, I did this trick and your RSS feed suddenly started working. Perhaps in your case it will also result in some improvement.
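For the curious, here is a minimal sketch (not feedparser code) of the PEP 479 change that seems to be at play: on Python 3.7+, a StopIteration escaping a generator body becomes a RuntimeError instead of quietly ending the iteration.
def pairs(values):
    it = iter(values)
    while True:
        yield (next(it), next(it))  # next() raises StopIteration once the input runs dry
try:
    print(list(pairs([1.0, 2.0, 3.0, 4.0])))  # Python <= 3.6: [(1.0, 2.0), (3.0, 4.0)]
except RuntimeError as err:
    print('PEP 479 in action:', err)  # Python 3.7+: generator raised StopIteration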

How to remove special unusual characters from a pandas dataframe using Python

I have a file with some crazy stuff in it. It looks like this:
I attempted to get rid of it using this:
df['firstname'] = map(lambda x: x.decode('utf-8','ignore'), df['firstname'])
But I wound up with this in my dataframe: <map object at 0x0000022141F637F0>
I got that example from another question, and this seems to be the Python 3 method for doing this, but I'm not sure what I'm doing wrong.
Edit: For some odd reason someone thinks that this has something to do with getting a map to return a list. The central issue is getting rid of non-UTF-8 characters. Whether or not I'm even doing that correctly has yet to be established.
As I understand it, I have to apply an operation to every character in a column of the dataframe. Is there another technique, or is map the correct way? And if it is, why am I getting the output I've indicated?
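For what it's worth, <map object at 0x...> is just Python 3's lazy map iterator landing in the column: unlike Python 2, map no longer returns a list. A minimal sketch (not from the original thread) of the difference and the idiomatic pandas alternative:
import pandas as pd
s = pd.Series(['a', 'b'])
m = map(str.upper, s)      # a lazy iterator; printing it shows <map object at 0x...>
print(pd.Series(list(m)))  # materializing it recovers the values
print(s.apply(str.upper))  # .apply is the idiomatic pandas way to do the same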
Edit 2: For some odd reason, my machine wouldn't let me create an example before. I can now. This is what I'm dealing with. All those weird characters need to go.
import pandas as pd
data = [['🦎Ale','Αλέξανδρα'],['��Grain','Girl🌾'],['Đỗ Vũ','ên Anh'],['Don','Johnson']]
df = pd.DataFrame(data,columns=['firstname','lastname'])
print(df)
Edit 3: I tried doing this using a regex, and for some reason it still didn't work.
df['firstname'] = df['firstname'].replace('[^a-zA-z\s]',' ')
This regex works FINE in another process, but here, it still leaves the ugly characters.
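A plausible explanation (an editor's sketch, not from the original thread): Series.replace treats the pattern as a literal cell value unless regex=True is passed, and [^a-zA-z] contains a typo'd range (A-z also spans characters such as ^ and _). Element-wise .str.replace with regex=True behaves as expected:
import pandas as pd
df = pd.DataFrame({'firstname': ['🦎Ale', 'Girl🌾', 'Don']})
# Series.replace without regex=True only matches cells equal to the literal pattern
print(df['firstname'].replace('[^a-zA-Z\\s]', ' '))  # unchanged
# element-wise regex substitution; note A-Z, not A-z
print(df['firstname'].str.replace(r'[^a-zA-Z\s]', ' ', regex=True))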
Edit 4: It turns out that it's image data that we're looking at.

Python 3 pickle load from Python 2

I have a pickle file that was created (I don't know how exactly) in python 2. It is intended to be loaded by the following python 2 lines, which when used in python 3 (unsurprisingly) do not work:
with open('filename','r') as f:
    foo, bar = pickle.load(f)
Result:
'ascii' codec can't decode byte 0xc2 in position 1219: ordinal not in range(128)
Manual inspection of the file indicates it is utf-8 encoded, therefore:
with open('filename','r', encoding='utf-8') as f:
    foo, bar = pickle.load(f)
Result:
TypeError: a bytes-like object is required, not 'str'
With binary encoding:
with open('filename','rb', encoding='utf-8') as f:
    foo, bar = pickle.load(f)
Result:
ValueError: binary mode doesn't take an encoding argument
Without binary encoding:
with open('filename','rb') as f:
    foo, bar = pickle.load(f)
Result:
UnpicklingError: invalid load key, '\n'.
Is this pickle file just broken? If not, how can I pry this thing open in python 3? (I have browsed the extensive collection of related questions and not found anything that works yet.)
Finally, note that the original
import cPickle as pickle
has been replaced with
import _pickle as pickle
Loading Python 2 pickles in Python 3 (version 3.7.2 in this example) can be helped by using the fix_imports parameter of the pickle.load function, but in my case it also worked without setting that parameter to True.
I was attempting to load a scipy.sparse.csr.csr_matrix contained in a pickle generated using Python 2.
When inspecting the file format using the UNIX command file it says:
>file -bi python2_generated.pckl
application/octet-stream; charset=binary
I could load the pickle in Python3 using the following code:
with open("python2_generated.pckl", "rb") as fd:
bh01 = pickle.load(fd, fix_imports=True, encoding="latin1")
Note that the loading was successful both with and without setting fix_imports to True.
As for the "latin1" encoding, the Python3 documentation (version 3.7.2) for the pickle.load function says:
Using encoding='latin1' is required for unpickling NumPy arrays and instances of datetime, date and time pickled by Python 2
Although this is specific to scipy matrices (or NumPy arrays), and since Novak has not clarified what his pickle file contained, I hope this can be of help to other users :)
Two errors were confounding each other.
First: By the time the .p file reached me, it had almost certainly been corrupted in transit, likely by FTP-ing (or similar) in ASCII rather than binary mode. I was able to get my hands on a properly transmitted copy, which allowed me to discover...
Second: Whatever the file might have implied on the inside, the proper encoding was 'latin1' not 'utf-8'.
So in a sense, yes, the file was broken, and even after that I was doing it wrong. I leave this here as a reminder to whoever eventually has the next bizarre pickle/python2/python3 issue that there can be multiple things gone wrong, and they have to be solved in the correct order.

How do I save a scipy distribution in a list or array to call? [duplicate]

I wonder how to save and load numpy.array data properly. Currently I'm using the numpy.savetxt() method. For example, if I got an array markers, which looks like this:
I try to save it by using:
numpy.savetxt('markers.txt', markers)
In another script I try to open the previously saved file:
markers = np.fromfile("markers.txt")
And that's what I get...
Saved data first looks like this:
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
But when I save the just-loaded data using the same method, i.e. numpy.savetxt(), it looks like this:
1.398043286095131769e-76
1.398043286095288860e-76
1.396426376485745879e-76
1.398043286055061908e-76
1.398043286095288860e-76
1.182950697433698368e-76
1.398043275797188953e-76
1.398043286095288860e-76
1.210894289234927752e-99
1.398040649781712473e-76
What am I doing wrong? PS: there are no other "backstage" operations that I perform, just saving and loading, and that's what I get. Thank you in advance.
The most reliable way I have found to do this is to use np.savetxt with np.loadtxt, and not np.fromfile, which is better suited to binary files written with tofile. The np.fromfile and np.tofile methods write and read binary files, whereas np.savetxt writes a text file.
So, for example:
a = np.array([1, 2, 3, 4])
np.savetxt('test1.txt', a, fmt='%d')
b = np.loadtxt('test1.txt', dtype=int)
a == b
# array([ True, True, True, True], dtype=bool)
Or:
a.tofile('test2.dat')
c = np.fromfile('test2.dat', dtype=int)
c == a
# array([ True, True, True, True], dtype=bool)
I use the former method even though it is slower and (sometimes) creates bigger files: the binary format can be platform dependent (for example, the file format depends on the endianness of your system).
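If you do need binary files to travel between platforms, one mitigation (a sketch, assuming both sides agree on the dtype) is to pin the byte order explicitly in the dtype string:
import numpy as np
a = np.array([1, 2, 3, 4])
a.astype('<i8').tofile('test_le.dat')        # write explicit little-endian 64-bit ints
c = np.fromfile('test_le.dat', dtype='<i8')  # the same explicit dtype reads it back anywhere
print(a == c)  # [ True  True  True  True]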
There is a platform independent format for NumPy arrays, which can be saved and read with np.save and np.load:
np.save('test3.npy', a) # .npy extension is added if not given
d = np.load('test3.npy')
a == d
# array([ True, True, True, True], dtype=bool)
np.save('data.npy', num_arr) # save
new_num_arr = np.load('data.npy') # load
The short answer is: you should use np.save and np.load.
The advantage of using these functions is that they are made by the developers of the Numpy library and they already work (plus they are likely optimized nicely for processing speed).
For example:
import numpy as np
from pathlib import Path
path = Path('~/data/tmp/').expanduser()
path.mkdir(parents=True, exist_ok=True)
lb,ub = -1,1
num_samples = 5
x = np.random.uniform(low=lb,high=ub,size=(1,num_samples))
y = x**2 + x + 2
np.save(path/'x', x)
np.save(path/'y', y)
x_loaded = np.load(path/'x.npy')
y_load = np.load(path/'y.npy')
print(x is x_loaded) # False
print(x == x_loaded) # [[ True True True True True]]
Expanded answer:
In the end it really depends on your needs, because you can also save it in a human-readable format (see Dump a NumPy array into a csv file) or even with other libraries if your files are extremely large (see best way to preserve numpy arrays on disk for an expanded discussion).
However (making an expansion, since you use the word "properly" in your question), I still think using the numpy functions out of the box (as with most code!) most likely satisfies most users' needs. The most important reason is that they already work. Trying to use something else for any other reason might take you down an unexpectedly LONG rabbit hole figuring out why it doesn't work and forcing it to work.
Take, for example, trying to save it with pickle. I tried that just for fun, and it took me at least 30 minutes to realize that pickle wouldn't save my stuff unless I opened the file in bytes mode with wb (and read it back with rb). It took time to google the problem, test potential solutions, understand the error message, etc. It's a small detail, but the fact that it already required me to open a file complicated things in unexpected ways. To add to that, it required me to re-read this (which, btw, is sort of confusing): Difference between modes a, a+, w, w+, and r+ in built-in open function?
So if there is an interface that meets your needs, use it unless you have a (very) good reason (e.g. compatibility with matlab, or for some reason you really want to read the file and printing in Python really doesn't meet your needs, which might be questionable). Furthermore, most likely if you need to optimize it, you'll find out later down the line (rather than spending ages debugging useless stuff like opening a simple Numpy file).
So use the interface/numpy provide. It might not be perfect, but it's most likely fine, especially for a library that's been around as long as Numpy.
I've already covered saving and loading data with numpy in a bunch of ways below, so have fun with it. Hope this helps!
import numpy as np
import pickle
from pathlib import Path
path = Path('~/data/tmp/').expanduser()
path.mkdir(parents=True, exist_ok=True)
lb,ub = -1,1
num_samples = 5
x = np.random.uniform(low=lb,high=ub,size=(1,num_samples))
y = x**2 + x + 2
# using save (to npy), savez (to npz)
np.save(path/'x', x)
np.save(path/'y', y)
np.savez(path/'db', x=x, y=y)
with open(path/'db.pkl', 'wb') as db_file:
    pickle.dump(obj={'x':x, 'y':y}, file=db_file)
## using loading npy, npz files
x_loaded = np.load(path/'x.npy')
y_load = np.load(path/'y.npy')
db = np.load(path/'db.npz')
with open(path/'db.pkl', 'rb') as db_file:
    db_pkl = pickle.load(db_file)
print(x is x_loaded)
print(x == x_loaded)
print(x == db['x'])
print(x == db_pkl['x'])
print('done')
Some comments on what I learned:
np.save: as expected, this works out of the box without any file opening and stores the array compactly in binary .npy form (see https://stackoverflow.com/a/55750128/1601580). Clean. Easy. Efficient. Use it.
np.savez saves in an uncompressed format (per the docs: "Save several arrays into a single file in uncompressed .npz format"). If you decide to use it (you were warned about going away from the standard solution, so expect bugs!), you might discover that you need to pass keyword argument names to store your arrays under your own names, unless you accept the defaults. So don't use this if the first option already works; see the short sketch after this list.
Pickle also allows for arbitrary code execution. Some people might not want to use this for security reasons.
Human-readable files are expensive to make etc. Probably not worth it.
There is something called hdf5 for large files. Cool! https://stackoverflow.com/a/9619713/1601580
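As a small illustration of the np.savez naming point above (a sketch, not from the original answer):
import numpy as np
x = np.arange(5)
np.savez('db.npz', x=x, y=x**2)           # keyword names become the keys in the archive
np.savez_compressed('db_small.npz', x=x)  # compressed variant with the same interface
db = np.load('db.npz')
print(db['x'], db['y'])                   # access the arrays by the names you gave them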
Note that this is not an exhaustive answer. But for other resources check this:
For pickle (guess the top answer is don't use pickle, use np.save): Save Numpy Array using Pickle
For large files (great answer! compares storage size, loading save and more!): https://stackoverflow.com/a/41425878/1601580
For matlab (we have to accept matlab has some freakin' nice plots!): "Converting" Numpy arrays to Matlab and vice versa
For saving in human-readable format: Dump a NumPy array into a csv file
np.fromfile() has a sep= keyword argument:
Separator between items if file is a text file. Empty (“”) separator means the file should be treated as binary. Spaces (” ”) in the separator match zero or more whitespace characters. A separator consisting only of spaces must match at least one whitespace.
The default value of sep="" means that np.fromfile() tries to read it as a binary file rather than a space-separated text file, so you get nonsense values back. If you use np.fromfile('markers.txt', sep=" ") you will get the result you are looking for.
However, as others have pointed out, np.loadtxt() is the preferred way to convert text files to numpy arrays, and unless the file needs to be human-readable it is usually better to use binary formats instead (e.g. np.load()/np.save()).
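To make that failure mode concrete, a quick sketch contrasting the three calls:
import numpy as np
a = np.linspace(0.0, 1.0, 5)
np.savetxt('markers.txt', a)                # writes text, one float per line
bad = np.fromfile('markers.txt')            # default sep='' reinterprets the text bytes as binary
good = np.fromfile('markers.txt', sep=' ')  # whitespace-separated text: correct values
best = np.loadtxt('markers.txt')            # the preferred reader for text files
print(np.array_equal(good, best))           # True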
