OSM XML to GeoJSON in Python

I am trying to learn Python and the OpenStreetMap API. I want to download a small region with Python and then convert it into GeoJSON. I have tried two Python libraries (osm2geojson and osmtogeojson) to convert the OSM XML to GeoJSON so far, but I am getting errors with both of them. My code is the following, using osm2geojson:
from OSMPythonTools.api import Api
import osm2geojson

api = Api()
bbox = api.query('map?bbox=-0.08918,51.47980,-0.08496,51.48128')
geojson = osm2geojson.xml2geojson(bbox.toXML())
My exception with osm2geojson is the following:
line 264, in multiline_realation_to_shape
refs_index[member['ref']]['used'] = rel['id']
KeyError: 8835435
and with osmtogeojson the exception is:
line 21, in _preprocess
for elem in j["elements"]:
TypeError: string indices must be integers
What am I doing wrong?

If you print the type of bbox:
print(type(bbox))
<class 'OSMPythonTools.api.ApiResult'>
You can inspect its tags, for example:
print(bbox.tags())
{'alt_name': 'London Cycle Network route 23',
'colour': '#0198E1',
'cycle_network': 'GB:London Cycle Network',
'name': 'LCN 23',
'network': 'lcn',
'ref': '23',
'ref:colour': '#0198E1',
'ref:colour_bg': 'white',
'ref:colour_tx': 'white',
'route': 'bicycle',
'type': 'route'}
print(type(bbox.tags()))
<class 'dict'>
https://github.com/mocnik-science/osm-python-tools/blob/master/docs/element.md
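If the converters choke on the wrapper object, a hedged alternative (a sketch, not part of the original answer) is to fetch the raw OSM XML yourself with requests and feed it straight to osm2geojson; the bbox is the one from the question. Relations clipped by the bounding box may still raise the KeyError shown above, because some relation members fall outside the downloaded data.
import requests
import osm2geojson

# Fetch the raw OSM XML for the bounding box from the question.
url = ('https://api.openstreetmap.org/api/0.6/'
       'map?bbox=-0.08918,51.47980,-0.08496,51.48128')
xml = requests.get(url).text
geojson = osm2geojson.xml2geojson(xml)  # dict shaped like a FeatureCollection
print(geojson['type'])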

One way to do this (although not through Python) is to use the ogr library.
Download and install the library, then:
ogr2ogr -f "GeoJSON" "outputFile.geojson" "inputFile.osm" lines
Please note that due to the different way each format handles layers, you may have to change the last argument from lines to whichever layer you are interested in.
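Since the question asks for Python, you can also shell out to ogr2ogr from a script. A minimal sketch, assuming GDAL/OGR is installed and ogr2ogr is on the PATH (file names are illustrative):
import subprocess

# Convert the 'lines' layer of an .osm file to GeoJSON via ogr2ogr.
subprocess.run(
    ['ogr2ogr', '-f', 'GeoJSON', 'outputFile.geojson', 'inputFile.osm', 'lines'],
    check=True,  # raise CalledProcessError if the conversion fails
)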

Related

What is a *.subwords file in natural language processing to use as a vocabulary file?

I have been trying to create a vocab file for an NLP task, to use in the tokenize method of trax to tokenize words, but I can't find which module/library to use to create the *.subwords file. Please help me out.
The easiest way to use trax.data.Tokenize with your own data and a subword vocabulary is to use Google's SentencePiece Python module:
import sentencepiece as spm
spm.SentencePieceTrainer.train('--input=data/my_data.csv --model_type=bpe --model_prefix=my_model --vocab_size=32000')
This creates two files:
my_model.model
my_model.vocab
We'll use this model in trax.data.Tokenize and add the parameter vocab_type with the value "sentencepiece":
trax.data.Tokenize(vocab_dir='vocab/', vocab_file='my_model.model', vocab_type='sentencepiece')
I think this is the best way, since you can load the model and use it to get the control ids while avoiding hardcoding them:
sp = spm.SentencePieceProcessor()
sp.load('my_model.model')
print('bos=sp.bos_id()=', sp.bos_id())
print('eos=sp.eos_id()=', sp.eos_id())
print('unk=sp.unk_id()=', sp.unk_id())
print('pad=sp.pad_id()=', sp.pad_id())
sentence = "hello world"
# encode: text => id
print("Pieces: ", sp.encode_as_pieces(sentence))
print("Ids: ", sp.encode_as_ids(sentence))
# decode: id => text
print("Decode Pieces: ", sp.decode_pieces(sp.encode_as_pieces(sentence)))
print("Decode ids: ", sp.decode_ids(sp.encode_as_ids(sentence)))
print([sp.bos_id()] + sp.encode_as_ids(sentence) + [sp.eos_id()])
If you still want a subword file, try this:
python trax/data/text_encoder_build_subword.py \
--corpus_filepattern=data/data.txt --corpus_max_lines=40000 \
--output_filename=data/my_file.subword
I hope this helps, since there is little clear literature out there on how to create compatible subword files.
You can use the TensorFlow Datasets API SubwordTextEncoder.
Use the following code snippet:
import tensorflow_datasets as tfds

vocab_fname = 'vocab'  # text_dataset is assumed to be an iterable of strings
encoder = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (text_row for text_row in text_dataset), target_vocab_size=2**15)
encoder.save_to_file(vocab_fname)
TensorFlow will append the .subwords extension to the above vocab file.
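To close the loop, here is a hedged sketch of pointing trax at the generated file, assuming it was saved as vocab/vocab.subwords (directory and file name are illustrative):
import trax

# vocab_type='subword' tells trax to read a *.subwords vocabulary file.
tokenize = trax.data.Tokenize(
    vocab_dir='vocab/', vocab_file='vocab.subwords', vocab_type='subword')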

AttributeError: 'str' object has no attribute 'parameters' due to new version of sklearn

I am doing topic modeling using sklearn. While trying to get the log-likelihoods from the grid search output, I am getting the error below:
AttributeError: 'str' object has no attribute 'parameters'
I think I understand the issue: 'parameters' was used in an older version, and I am using the new version (0.22) of sklearn, which gives this error. I also searched for the term used in the new version but couldn't find it. Below is the code:
# Get Log Likelyhoods from Grid Search Output
n_components = [10, 15, 20, 25, 30]
log_likelyhoods_5 = [round(gscore.mean_validation_score) for gscore in model.cv_results_ if gscore.parameters['learning_decay']==0.5]
log_likelyhoods_7 = [round(gscore.mean_validation_score) for gscore in model.cv_results_ if gscore.parameters['learning_decay']==0.7]
log_likelyhoods_9 = [round(gscore.mean_validation_score) for gscore in model.cv_results_ if gscore.parameters['learning_decay']==0.9]
# Show graph
plt.figure(figsize=(12, 8))
plt.plot(n_components, log_likelyhoods_5, label='0.5')
plt.plot(n_components, log_likelyhoods_7, label='0.7')
plt.plot(n_components, log_likelyhoods_9, label='0.9')
plt.title("Choosing Optimal LDA Model")
plt.xlabel("Num Topics")
plt.ylabel("Log Likelyhood Scores")
plt.legend(title='Learning decay', loc='best')
plt.show()
Thanks in advance!
There is a key 'params' which stores a list of parameter-settings dicts for all the parameter candidates; see the GridSearchCV doc in the sklearn documentation.
In your code, gscore is a string key of cv_results_: cv_results_ is a dictionary whose keys are strings like 'params', 'split0_test_score', etc. (you can refer to the doc) and whose values are lists or arrays.
So, you need to make the following change to your code:
log_likelyhoods_5 = [round(model.cv_results_['mean_test_score'][index])
                     for index, gscore in enumerate(model.cv_results_['params'])
                     if gscore['learning_decay'] == 0.5]
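The same pattern applies to the other two learning_decay values. A small hedged sketch (assuming model is the fitted GridSearchCV object) that avoids repeating the comprehension:
def scores_for(decay):
    # Pair each candidate's params dict with its mean test score.
    return [round(model.cv_results_['mean_test_score'][i])
            for i, params in enumerate(model.cv_results_['params'])
            if params['learning_decay'] == decay]

log_likelyhoods_5 = scores_for(0.5)
log_likelyhoods_7 = scores_for(0.7)
log_likelyhoods_9 = scores_for(0.9)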

Looking for documentation on ‘path’ argument in TabularDataBunch.from_df()

I'm not sure I understand the purpose of the mandatory path argument in TabularDataBunch.from_df(path=path, df=df, ...) of the fast.ai library in Python 3.6.
I checked documentation, but can't seem to find the details there.
In particular, I have a pd.DataFrame that does not have an associated CSV file on a disk. How do I go about applying .from_df method to it?
Does anyone have more info or links to references?
Found an example that helped, with the path value set to 'output'. The fast.ai lecture 4 video (43rd minute) also defines path as the location for the output results.
from fastai.tabular import *  # fastai v1 API: Categorify, TabularDataBunch, to_np
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list('aabbccabca'), 'B': np.random.normal(size=10).round(2), 'Y': list('aabbccabca')})
tfms = [Categorify]
tblrData = TabularDataBunch.from_df('output', df, dep_var='Y', valid_idx=[7,8], procs=tfms, cat_names=['A'], bs=4)
(cat_x,cont_x),y = next(iter(tblrData.train_dl))
for o in (cat_x, cont_x, y): print(to_np(o[:5]))
bs is a batch size parameter here.

Specifying the order of encoding in Ordinal Encoder

I'm using OrdinalEncoder, and I cannot find how to specify the encoding order. I mean that I have categories like "bad", "average", "good", which naturally have an order, but I want to specify that order myself, since the encoder cannot know the meaning of the categories. Indeed, with categories='auto', some categories are encoded in the wrong direction with respect to others, and I do not want this because I know, at least for some of them, whether the correlation is positive or negative.
But specifying the categories results in an error during fitting:
'OrdinalEncoder' object has no attribute 'handle_unknown'.
If I do not specify the categories, the fitting process goes well, and I do not understand why (the attribute categories_, after fitting, shows me the same categories I enter by hand when I try to specify them).
I specify the categories as a list of lists. Here is what happens without specifying categories:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame(np.array([['a','a','a'], ['b','c','c']]).transpose())
oE = OrdinalEncoder(categories='auto')
oE.fit(df)
print(oE.categories_)
Resulting in: [array(['a'], dtype=object), array(['b', 'c'], dtype=object)]
Specifying the categories explicitly:
df = pd.DataFrame(np.array([['a','a','a'], ['b','c','c']]).transpose())
oE = OrdinalEncoder(categories=[['a'], ['b', 'c']])
oE.fit(df)
The result is this error:
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
    oE.fit(df)
  File "/home/alessio/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 774, in fit
    self._fit(X)
  File "/home/alessio/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 85, in _fit
    if self.handle_unknown == 'error':
AttributeError: 'OrdinalEncoder' object has no attribute 'handle_unknown'
I had the same problem. This is a bug in scikit-learn, already fixed and slated for version 0.20.1, which has not been released yet:
https://github.com/scikit-learn/scikit-learn/issues/12365
I solved it temporarily by copying the fixed _encoders.py into my project and using:
from _encoders import OrdinalEncoder
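For reference, once the fix shipped (scikit-learn >= 0.20.1), passing an explicitly ordered list works as intended. A minimal sketch with illustrative category names:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'quality': ['bad', 'good', 'average', 'bad']})
# Categories listed from lowest to highest, so the codes respect that order.
oE = OrdinalEncoder(categories=[['bad', 'average', 'good']])
print(oE.fit_transform(df))  # bad -> 0.0, average -> 1.0, good -> 2.0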

Return contents of geometry() as a list in PySide

I'm currently writing an application in PySide, and I want it to save the window dimensions upon exiting. The geometry() method returns something like PySide.QtCore.QRect(300, 300, 550, 150), but all I want is (300, 300, 550, 150). I could find a way to parse it, but I want a cleaner method. Any suggestions?
The getRect method returns a tuple of the values:
>>> widget.geometry().getRect()
(0, 0, 640, 480)
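A hedged round trip with that tuple (widget is illustrative): setGeometry accepts the four integers directly, so the saved values restore the window.
# Save the plain tuple on exit, then restore it on the next start.
x, y, w, h = widget.geometry().getRect()
widget.setGeometry(x, y, w, h)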
The cleaner way, without any parsing, would be to use QSettings to store and retrieve the QRect returned by geometry to/from the native application settings storage (Windows registry, .ini file, .plist file...).
For example:
settings = QSettings(...)
settings.setValue("lastGeometry", self.geometry())
# and to retrieve the value
lastGeometry = settings.value("lastGeometry")
if lastGeometry is not None and lastGeometry.isValid():
    self.setGeometry(lastGeometry)
You can also binary-serialize or deserialize a QRect with QDataStream, as a 16-byte array representing the 4 32-bit integers.
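A minimal sketch of that round trip, assuming the Qt4-era PySide API from the question:
from PySide.QtCore import QByteArray, QDataStream, QIODevice, QRect

rect = QRect(300, 300, 550, 150)
data = QByteArray()
out = QDataStream(data, QIODevice.WriteOnly)
for v in (rect.x(), rect.y(), rect.width(), rect.height()):
    out.writeInt32(v)  # four 32-bit ints = 16 bytes

inp = QDataStream(data, QIODevice.ReadOnly)
restored = QRect(inp.readInt32(), inp.readInt32(),
                 inp.readInt32(), inp.readInt32())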
Considering the OP has accepted the answer from @alexisdm, this might be interesting:
I was looking into using restoreGeometry(), as it handles recovering windows that are off-screen or beyond the top border. BUT: it needs a QByteArray, and I can only save plain Python data in my case. So I tried to turn the byte array into a string:
encoded = str(self.saveGeometry().toPercentEncoding())
print('encoded: %s' % encoded)
>>> encoded: %01%D9%D0%CB%00%01%00%00%FF%F...
geometry = QtCore.QByteArray().fromPercentEncoding(encoded)
self.restoreGeometry(geometry)
Voilà!
