Python 3.7, Feedparser module cannot parse BBC weather feed - python-3.x

When I parse the example RSS link provided by BBC Weather, it gives only an empty feed. The example link is: "https://weather-broker-cdn.api.bbci.co.uk/en/forecast/rss/3day/2643123"
I've tried using the feedparser module in Python; I would like to do this in either Python or C++, but Python seemed easier. I've also tried rewriting the URL without https:// and with .xml, and it still doesn't work.
import feedparser
d = feedparser.parse('https://weather-broker-cdn.api.bbci.co.uk/en/forecast/rss/3day/2643123')
print(d)
This should give a result similar to the RSS feed at the link, but it just returns an empty feed.

First, I know you got no result - not an error like me. Perhaps you are running a different version. As I mentioned, it yields a result on an older version in Python 2, using a program that has been running solidly every night for about 5 years, but it throws an exception on a freshly installed feedparser 5.2.1 on Python 3.7.4 64 bit.
I'm not entirely sure what is going on, but the problem is in a function called _gen_georss_coords, which throws a StopIteration on the first call. I have seen some references to this error being due to the implementation of PEP 479. It is written as a generator, but for your RSS feed it only has to return one tuple. Here is the offending function.
def _gen_georss_coords(value, swap=True, dims=2):
    # A generator of (lon, lat) pairs from a string of encoded GeoRSS
    # coordinates. Converts to floats and swaps order.
    latlons = map(float, value.strip().replace(',', ' ').split())
    nxt = latlons.__next__
    while True:
        t = [nxt(), nxt()][::swap and -1 or 1]
        if dims == 3:
            t.append(nxt())
        yield tuple(t)
There is something curious going on, perhaps to do with PEP 479 and the fact that there are two separate generators at work in the same function, that is causing StopIteration to bubble up to the calling function. Anyway, I rewrote it in a somewhat more straightforward way.
def _gen_georss_coords(value, swap=True, dims=2):
    # A generator of (lon, lat) tuples from a string of encoded GeoRSS
    # coordinates. Converts to floats and swaps order.
    latlons = list(map(float, value.strip().replace(',', ' ').split()))
    # step by dims, so each yield consumes 2 values (or 3 when dims == 3)
    for i in range(0, len(latlons), dims):
        t = [latlons[i], latlons[i + 1]][::swap and -1 or 1]
        if dims == 3:
            t.append(latlons[i + 2])
        yield tuple(t)
You can define the above new function in your code, then execute the following to patch it into feedparser
saveit, feedparser._gen_georss_coords = (feedparser._gen_georss_coords, _gen_georss_coords)
Once you're done with it, you can restore feedparser to its previous state with
feedparser._gen_georss_coords, _gen_georss_coords = (saveit, feedparser._gen_georss_coords)
Or if you're confident that this is solid, you can modify feedparser itself. Anyway, I did this trick and your RSS feed suddenly started working. Perhaps in your case it will also result in some improvement.
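For reference, an end-to-end sketch of the workaround (it assumes the rewritten _gen_georss_coords from above is defined in your module; the feed URL is the one from the question):
import feedparser

# Swap the patched function in, keeping the original so it can be restored.
saveit, feedparser._gen_georss_coords = (feedparser._gen_georss_coords, _gen_georss_coords)

d = feedparser.parse('https://weather-broker-cdn.api.bbci.co.uk/en/forecast/rss/3day/2643123')
print(d.feed.get('title'))
for entry in d.entries:
    print(entry.title)

# Put the original implementation back when done.
feedparser._gen_georss_coords = saveit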

Related

Loading .npz with Python 3.5 always crashes

In this simple tutorial written in Python 2.7, they have a line loading the numpy array.
train_data = np.load(open('../musicnet.npz','rb'))
Then, they get the data by calling different keys
X,Y = train_data['2494']
Everything works well in python 2.7
Data type of train_data is numpy.lib.npyio.NpzFile
My problem
However, whenever I try to do the same in Python 3.5, most of the lines work fine, except the line X,Y = train_data['2494'], where it just freezes forever. I would like to use Python 3.5 because my other projects are written in Python 3.5.
How to rewrite this line so that it runs with Python 3.5?
Error Message
I finally managed to get the error message in the terminal.
It freezes there because there is a ton of output right after the error message; my Jupyter notebook just cannot handle that much information.
Solution
Change the encoding to 'bytes'
train_data = np.load('../musicnet.npz', encoding='bytes')
Then everything works fine.
You first said things crashed, now you say it freezes when trying to access a specific array. numpy has the same syntax in 3.5 compared to 2.7. You shouldn't have to rewrite anything.
np.load does have a couple of parameters that deal with differences between Py2 and Py3. But I'm not sure these are an issue for you.
fix_imports : bool, optional
Only useful when loading Python 2 generated pickled files on Python 3,
which includes npy/npz files containing object arrays. If `fix_imports`
is True, pickle will try to map the old Python 2 names to the new names
used in Python 3.
encoding : str, optional
What encoding to use when reading Python 2 strings. Only useful when
loading Python 2 generated pickled files in Python 3, which includes
npy/npz files containing object arrays. Values other than 'latin1',
'ASCII', and 'bytes' are not allowed, as they can corrupt numerical
data. Default: 'ASCII'
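For example, a minimal sketch of passing those options (whether they actually help depends on what was saved; allow_pickle only exists and matters on newer numpy releases):
import numpy as np

# encoding='bytes' and fix_imports=True only matter for object arrays
# pickled under Python 2; plain numeric arrays load the same either way.
train_data = np.load('../musicnet.npz', allow_pickle=True,
                     fix_imports=True, encoding='bytes')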
Try
print(list(train_data.keys()))
This should show the array names that were saved to the zip archive. Do they match the names in the Py2 load? Do they include the '2494' name?
A couple of things are unusual about:
X,Y = train_data['2494']
Naming an array in the zip archive by a string number, and unpacking the load into two variables.
Do you know anything about how this was saved with savez? What was saved?
Another question - are you loading this file from the same machine that Py2 worked on? Or has the file been transferred from another machine, and possibly corrupted?
As those parameters indicate, there are differences in the pickle code between Py2 and Py3. If the original save included object dtype arrays, or non-array objects, then they will be pickled and there might be incompatibilities in the pickle versions.
Try this,
with np.load('../musicnet.npz') as train_data:
    X, Y = train_data['2494']
There are two ways out, in my point of view:
Re-edit your code from
train_data = np.load(open('../musicnet.npz','rb'))
to
train_data = np.load(open('../musicnet.npz','r'))
because the difference between the 'r' and 'rb' modes in Python 2.7 / 3.5 is what matters in your situation.
Use the default debugger to pinpoint the significant error. (That usually works, in my experience.)

How to remove special/unusual characters from a pandas dataframe using Python

I have a file with some crazy stuff in it. It looks like this:
I attempted to get rid of it using this:
df['firstname'] = map(lambda x: x.decode('utf-8','ignore'), df['firstname'])
But I wound up with this in my dataframe: <map object at 0x0000022141F637F0>
I got that example from another question, and this seems to be the Python 3 method for doing this, but I'm not sure what I'm doing wrong.
Edit: For some odd reason someone thinks that this has something to do with getting a map to return a list. The central issue is getting rid of non UTF-8 characters. Whether or not I'm even doing that correctly has yet to be established.
As I understand it, I have to apply an operation to every character in a column of the dataframe. Is there another technique or is map the correct way and if it is, why am I getting the output I've indicated?
Edit 2: For some reason, my machine wouldn't let me create an example. I can now. This is what I'm dealing with. All those weird characters need to go.
import pandas as pd
data = [['🦎Ale','Αλέξανδρα'],['��Grain','Girl🌾'],['Đỗ Vũ','ên Anh'],['Don','Johnson']]
df = pd.DataFrame(data,columns=['firstname','lastname'])
print(df)
Edit 3: I tried doing this using a regex and for some reason, it still didn't work.
df['firstname'] = df['firstname'].replace('[^a-zA-z\s]',' ')
This regex works FINE in another process, but here, it still leaves the ugly characters.
Edit 4: It turns out that it's image data that we're looking at.
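For the record, a rough cleanup sketch for the two attempts above (wrapping the Python 3 map in a list, and using Series.str.replace with regex=True, which pandas needs for a pattern substitution); this only strips characters and does nothing about the underlying image data:
import pandas as pd

data = [['🦎Ale', 'Αλέξανδρα'], ['Grain', 'Girl🌾'], ['Đỗ Vũ', 'ên Anh'], ['Don', 'Johnson']]
df = pd.DataFrame(data, columns=['firstname', 'lastname'])

# In Python 3, map() returns a lazy iterator (hence <map object ...> in the frame);
# materialise it, and drop non-ASCII bytes with encode/decode instead of str.decode.
df['firstname'] = list(map(lambda x: x.encode('ascii', 'ignore').decode('ascii'), df['firstname']))

# Series.replace without regex=True only swaps whole cell values;
# .str.replace(..., regex=True) substitutes the pattern inside each string.
df['lastname'] = df['lastname'].str.replace(r'[^a-zA-Z\s]', ' ', regex=True)
print(df)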

Python Multiprocessing How can I make script faster?

Python 3.6
I am writing a script to automate checking that all the links on a website work.
I have a version of it, but it runs slowly because the Python interpreter is only running one request at a time.
I imported selenium to pull the links down into a list. I started with a list of 41000 links. I got rid of the duplicates, and now I am down to 7300 links in my list. I am using the requests module to just check for the response code. I know multiprocessing is the answer; I just see a bunch of different methods. Which is the best for my needs?
The only thing I need to keep in mind is that I can't run too many threads at once, so that I don't send the request load on our web server sky high.
Here is the function that checks the links with the python requests module that I am trying to speed up:
def check_links(y):
    for ii in y:
        try:
            r = requests.get(ii.get_attribute('href'))
            rc = r.status_code
            print(ii.get_attribute('href'), ' ', rc)
        except Exception as e:
            logf.write(str(e))
        finally:
            pass
If you just need to apply the same function to all the items in a list, you just need to use a process pool and map over your inputs. Here is a simple example:
from multiprocessing import pool

def square(x):
    return {x: x**2}

p = pool.Pool()
results = p.imap_unordered(square, range(10))
for r in results:
    print(r)
In the example I use imap_unordered, but also look at map and imap. You should choose the one that matches your needs the best.
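Applied to the link checker, a rough sketch could look like the following (the worker count and timeout are assumptions to tune so the web server isn't flooded; the URLs are pulled out of the selenium elements up front, since the element objects themselves don't pickle across processes):
from multiprocessing import pool
import requests

def check_link(url):
    # Fetch one URL; return the URL with its status code or the error text.
    try:
        return url, requests.get(url, timeout=10).status_code
    except Exception as e:
        return url, str(e)

if __name__ == '__main__':
    # In practice: hrefs = [el.get_attribute('href') for el in y], collected beforehand.
    hrefs = ['https://example.com/', 'https://example.org/']
    p = pool.Pool(processes=8)  # 8 is an arbitrary cap; lower it to go easier on the server
    for url, result in p.imap_unordered(check_link, hrefs):
        print(url, ' ', result)
    p.close()
    p.join()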

Pyx unicode text

So I am trying to generate postscript from Python.
Currently I am trying with PyX 0.14.1 on Python 3.4.2,
but I am open to suggestions, if you know something simpler.
I was mostly following the suggestions found on the PyX
mailing list in this thread. That was Python 2 and is quite old.
The following shows my current code after many changes:
from pyx import *
text.set(cls=text.LatexRunner, texenc='utf-8')
text.preamble(r'\usepackage{ucs}')
text.preamble(r'\usepackage[utf8x]{inputenc}')
c = canvas.canvas()
c.text(5, 5, "Sören Sundstrøm".encode("utf8"))
p = document.page(c, paperformat=document.paperformat.A4,
                  centered=0)
d = document.document([p])
d.writePSfile('test.ps')
PyX stops with a TexResultError. The interesting part of the error
shows what's happening in TeX:
pyx.text.TexResultError: unhandled TeX response (might be an error)
The expression passed to TeX was:
\ProcessPyXBox{b'S\xc3\xb6ren Sundstr\xc3\xb8m'%
}{1}%
\PyXInput{7}%
After parsing the return message from TeX, the following was left:
*
*! Undefined control sequence.
<argument> b'S\xc
3\xb 6ren Sundstr\xc 3\xb 8m'
<*> }{1}
(cut after 5 lines; use errordetail.full for all output)
So it looks like latex is receiving not utf-8,
but an escaped representation of utf-8.
My question: How do I pass the string to canvas.text correctly?
Or is my preamble wrong?
I also tried to follow this answer by wobsta here on SO,
but besides being much too complicated, it does not work for me either.
(Looks like PyX does not understand a metafont message in this case).
Running latex directly on a simple utf-8 input file with the same preamble
works fine by the way.
Looking into the PyX code revealed the problem.
The text module prepares an io.TextIOWrapper with utf-8 encoding to be used for TeX input. The string parameters in text.preamble and canvas.text are passed verbatim to the wrapper, so in Python 3 you just pass a string without any encoding necessary. Encoding will be done by the wrapper.
My original unsimplified code had another problem which made it difficult to solve this first problem. So for completeness here's the second problem and its solution. My original code had this order of operations:
from pyx import *
c = canvas.canvas()
# doing other stuff with canvas
text.set(cls=text.LatexRunner, texenc='utf-8')
text.preamble(r'\usepackage{ucs}')
text.preamble(r'\usepackage[utf8x]{inputenc}')
c.text(5, 5, "Sören Sundstrøm")
p = document.page(c, paperformat=document.paperformat.A4,
                  centered=0)
d = document.document([p])
d.writePSfile('test.ps')
This does not work either, because when a canvas is created it keeps a reference to a text.defaulttexrunner, which is set up with the current settings of the text module. The changed text module settings never influence the canvas instance. So you have to set up the text module before you create the canvas you want to draw text into.
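Putting the two fixes together, the working version of the code from the question looks roughly like this (same calls as above, just with text.set before the canvas is created and a plain str passed to c.text):
from pyx import *

# Configure the TeX runner BEFORE creating the canvas, so the canvas
# picks up a defaulttexrunner with these settings.
text.set(cls=text.LatexRunner, texenc='utf-8')
text.preamble(r'\usepackage{ucs}')
text.preamble(r'\usepackage[utf8x]{inputenc}')

c = canvas.canvas()
# Pass a plain Python 3 str; PyX's io.TextIOWrapper does the utf-8 encoding.
c.text(5, 5, "Sören Sundstrøm")

p = document.page(c, paperformat=document.paperformat.A4,
                  centered=0)
d = document.document([p])
d.writePSfile('test.ps')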
Thanks to anyone who looked into this.

Unpickling from converted string in python/numpy

I have a ton of numpy ndarrays that are stored pickled as strings. That may have been a poor design choice, but it's what I did, and now the pickled strings seem to have been converted or something along the way; when I try to unpickle them I notice they are of type str and I get the following error:
TypeError: 'str' does not support the buffer interface
when I invoke
numpy.loads(bin_str)
Where bin_str is the thing I'm trying to unpickle. If I print out bin_str it looks like
b'\x80\x02cnumpy.core.multiarray\n_reconstruct\nq\x00cnumpy\nndarray\nq\x01K\x00\x85q\x02c_codecs\nencode\nq\x03X\x01\x00\x00\ ...
continuing for some time, so the info seems to be there, I'm just not quite sure how to convert it into whatever string format numpy/pickle need. On a whim I tried
numpy.loads( bytearray(bin_str, encoding='utf-8') )
and
numpy.loads( bin_str.encode() )
which both throw an error _pickle.UnpicklingError: unpickling stack underflow. Any ideas?
PS: I'm on python 3.3.2 and numpy 1.7.1
Edit
I discovered that if I do the following:
open('temp.txt', 'wb').write(...)
return numpy.load( 'temp.txt' )
I get back my array, and ... denotes copying and pasting the output of print(bin_str) from another window. I've tried writing bin_str to a file directly to unpickle, but that doesn't work; it complains that TypeError: 'str' does not support the buffer interface. A few sane ways of converting bin_str to something that can be written directly to a binary file result in pickle errors when trying to read it back.
Edit 2
So I guess what's happened is that my binary pickle string ended up encoded inside of a normal string, something like:
"b'pickle'"
which is unfortunate and I haven't figured out how to deal with that, except this ridiculous and convoluted way to get it back:
open('temp.py', 'w').write('foo = ' + bin_str)
from temp import foo
numpy.loads( foo )
This seems like a very shameful solution to the problem, so please give me a better one!
It sounds like your saved strings are the reprs of the original bytes instances returned by your pickling code. That's a bit unfortunate, but not too bad. repr is intended to return a "machine friendly" representation of an object, and it can often be reversed by using eval:
import numpy as np
import pickle
# this part has already happened
orig_obj = np.array([1,2,3])
orig_pickle = pickle.dumps(orig_obj)
saved_str = repr(orig_pickle) # this was a mistake, but it's already done
# this is what you need to do to get something equivalent to orig_obj back
reconstructed_pickle = eval(saved_str)
reconstructed_obj = pickle.loads(reconstructed_pickle)
# test
if np.all(reconstructed_obj == orig_obj):
    print("It worked!")
Obligatory note that using eval can be dangerous: Be aware that eval can run any Python code it wants, so don't call it with untrusted data. However, pickle data has the same risks (a malicious Pickle string can run arbitrary code upon unpickling), so you're not losing much safety in this situation. I'm guessing that you trust your data in this case anyway.
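If you'd rather avoid eval for the outer step, ast.literal_eval should also work here, since the saved strings are just bytes literals; a minimal sketch under that assumption (pickle.loads itself still carries the usual unpickling risks):
import ast
import pickle
import numpy as np

# Reproduce the mistaken storage from above, then reverse it with literal_eval,
# which only evaluates literals and refuses arbitrary expressions.
orig_obj = np.array([1, 2, 3])
saved_str = repr(pickle.dumps(orig_obj))

reconstructed_obj = pickle.loads(ast.literal_eval(saved_str))
print(np.array_equal(reconstructed_obj, orig_obj))  # True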
