PyYAML parses '9:00' as int - python-3.x

I have a file with the following data:
classes:
- 9:00
- 10:20
- 12:10
(and so on up to 21:00)
I use Python 3 and the yaml module to parse it. Precisely, the source is config = yaml.load(open(filename, 'r')). But then, when I print config, I get the following output for this part of the data:
'classes': [540, 620, 730, 820, 910, 1000, 1090, 1180],
The values in the list are ints.
Previously, when I used Python 2 (and BaseLoader for YAML), I got the values as strings, and I use them as such. BaseLoader is no longer acceptable, since I want to read unicode strings from the file and it gives me byte-strings.
So, first, why does PyYAML parse my data as ints?
And, second, how do I prevent PyYAML from doing this? Is it possible without changing the data file (e.g. without adding !!str)?

The YAML documentation is a bit difficult to "parse", so I can imagine you missed this little bit of info about colons:
Normally, YAML insists the “:” mapping value indicator be separated from the value by white space. A benefit of this restriction is that the “:” character can be used inside plain scalars, as long as it is not followed by white space. This allows for unquoted URLs and timestamps. It is also a potential source for confusion as “a:1” is a plain scalar and not a key: value pair.
And what you have there in your input is a sexagesimal (base-60) number: your 9:00 is considered to be 9 minutes and 0 seconds, equalling a total of 540 seconds.
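You can see that rule in isolation with a quick check (using PyYAML, as in the question):

import yaml

print(yaml.safe_load('9:00'))     # 540  (9 * 60 + 0)
print(yaml.safe_load('1:00:00'))  # 3600 (1 * 60 * 60)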
Unfortunately this doesn't get constructed as some special Sexagesimal instance that can be used for calculations as if it were an integer but can be printed in its original form. Therefore, if you want to use this as a string internally you have to single quote them:
classes:
- '9:00'
- '10:20'
- '12:10'
which is what you would get if you dump {'classes': ['9:00', '10:20', '12:10']} (and note that the unambiguous classes doesn't get any quotes).
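For instance, with PyYAML (which the question uses), you can verify this round trip yourself:

import yaml

# the times get single quotes on dump, because unquoted
# they would re-load as sexagesimal integers
print(yaml.safe_dump({'classes': ['9:00', '10:20', '12:10']}))

which prints the quoted document shown above.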
That the BaseLoader gives you strings is not surprising. The BaseConstructor that is used by the BaseLoader handles any scalar as a string, including integers, booleans and "your" sexagesimals:
import ruamel.yaml as yaml

yaml_str = """\
classes:
- 12345
- 10:20
- abc
- True
"""

data = yaml.load(yaml_str, Loader=yaml.BaseLoader)
print(data)
data = yaml.load(yaml_str, Loader=yaml.SafeLoader)
print(data)
gives:
{u'classes': [u'12345', u'10:20', u'abc', u'True']}
{'classes': [12345, 620, 'abc', True]}
If you really don't want to use quotes, then you have to "reset" the implicit resolver for scalars that start with numbers:
import ruamel.yaml as yaml
from ruamel.yaml.resolver import Resolver
import re
yaml_str = """\
classes:
- 9:00
- 10:20
- 12:10
"""
for ch in list(u'-+0123456789'):
    del Resolver.yaml_implicit_resolvers[ch]

Resolver.add_implicit_resolver(
    u'tag:yaml.org,2002:int',
    re.compile(u'''^(?:[-+]?0b[0-1_]+
        |[-+]?0o?[0-7_]+
        |[-+]?(?:0|[1-9][0-9_]*)
        |[-+]?0x[0-9a-fA-F_]+)$''', re.X),  # <- copied from resolver.py, without sexagesimal support
    list(u'-+0123456789'))
data = yaml.load(yaml_str, Loader=yaml.SafeLoader)
print(data)
gives you:
{'classes': ['9:00', '10:20', '12:10']}
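The question itself uses PyYAML; the same resolver surgery should work there too, since PyYAML's Resolver exposes the same yaml_implicit_resolvers mapping and add_implicit_resolver classmethod (a sketch; like the code above, it also drops the float and timestamp resolvers for those starting characters):

import re
import yaml
from yaml.resolver import Resolver

# remove every implicit resolver keyed on characters that can start a number
for ch in list(u'-+0123456789'):
    del Resolver.yaml_implicit_resolvers[ch]

# re-register the int resolver without the sexagesimal alternative
Resolver.add_implicit_resolver(
    u'tag:yaml.org,2002:int',
    re.compile(u'''^(?:[-+]?0b[0-1_]+
        |[-+]?0o?[0-7_]+
        |[-+]?(?:0|[1-9][0-9_]*)
        |[-+]?0x[0-9a-fA-F_]+)$''', re.X),
    list(u'-+0123456789'))

print(yaml.safe_load('classes:\n- 9:00\n- 10:20'))  # {'classes': ['9:00', '10:20']}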

You should probably check the YAML documentation.
The colon is the mapping value indicator.
I presume you want a string and not an integer, so you should double quote your strings.
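For example, double quoting the times from the question:

classes:
- "9:00"
- "10:20"
- "12:10"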

Related

Python: translate a column with multiple languages to English

I have a dataset with multiple comment columns in multiple languages, and I want to translate these columns into English and create new columns with all the English translations.
Accountability_COMMENT is the column that has comments in a different language in every row. I want to create a new column and translate all such comments to English.
I have tried the following code:
from googletrans import Translator
from textblob import TextBlob

translator = Translator()
data_merge['Accountability_COMMENT'] = data_merge['Accountability_COMMENT'].apply(
    lambda x: TextBlob(x).translate(to='en'))
The error that I am getting is :
TypeError: The text argument passed to __init__(text) must be a string, not <class 'float'>
My column has object dtype, which is correct.
You most probably have some comments that consist only of a float (i.e. a decimal number); even if the column is of type object according to pandas, those values are still interpreted as float by TextBlob. This leads to the error:
TypeError: The text argument passed to __init__(text) must be a string, not <class 'float'>
One solution is to make sure that the input x of TextBlob(x) is a string. You could do this by modifying the apply line like:
data_merge['Accountability_COMMENT'] = data_merge['Accountability_COMMENT'].apply(lambda x: TextBlob(str(x)).translate(to='en'))
Unfortunately this will probably also rais an error like:
raise NotTranslated('Translation API returned the input string unchanged.')
textblob.exceptions.NotTranslated: Translation API returned the input string unchanged.
This is due to the fact that when translating a number, the translation and the original text will be exactly the same, and apparently TextBlob doesn't like that.
What you can do to avoid this is to catch that exception NotTranslated and just return the untranslated TextBlob, like this:
from textblob import TextBlob
from textblob.exceptions import NotTranslated

def translate_comment(x):
    try:
        # Try to translate the string version of the comment
        return TextBlob(str(x)).translate(to='en')
    except NotTranslated:
        # If the output is the same as the input, just return the TextBlob version of the input
        return TextBlob(str(x))

data_merge['Accountability_COMMENT'] = data_merge['Accountability_COMMENT'].apply(translate_comment)
EDIT:
If you get the HTTP error Too Many Requests, it's probably because you are being kicked out by the Google Translate API. Instead of using apply, you can make your translation "extra-slow" by using a for loop with some sleep in between cycles. In this case you should import another package (time) and substitute the last line:
from time import sleep
from textblob import TextBlob
from textblob.exceptions import NotTranslated

def translate_comment(x):
    try:
        # Try to translate the string version of the comment
        return TextBlob(str(x)).translate(to='en')
    except NotTranslated:
        # If the output is the same as the input, just return the TextBlob version of the input
        return TextBlob(str(x))

for i in range(len(data_merge['Accountability_COMMENT'])):
    # Translate one comment at a time
    data_merge['Accountability_COMMENT'].iloc[i] = translate_comment(
        data_merge['Accountability_COMMENT'].iloc[i])
    # Sleep for a quarter of a second
    sleep(0.25)
You can then experiment with different values for the sleep function. Of course, the longer the sleep, the slower the translation! N.B. the sleep argument is in seconds.

How to safe_dump the dictionary and list into YAML?

I want the output as the YAML below:
- item: Food_eat
  Food:
    foodNo: 42536216
    type: fruit
    moreInfo:
      - "organic"
I have used the following code to print in the same order as above, but the output is not as expected.
Code:
import yaml

yaml_result = [{'item': 'Food_eat', 'Food': {'foodNo': 42536216, 'type': 'fruit', 'moreInfo': ['organic']}}]
print(yaml.safe_dump(yaml_result))
Output:
- Food:
    foodNo: 42536216
    moreInfo:
    - organic
    type: fruit
  item: Food_eat
Not sure how to get the desired output.
ruamel.yaml (disclaimer: I am the author of that package) does have this feature built in, as it is necessary to support its capability to round-trip (load, modify, dump) YAML data without introducing spurious changes. Apart from that, it defaults to YAML 1.2, whereas PyYAML only supports YAML 1.1 (outdated more than 10 years ago).
import sys
import ruamel.yaml

data = [{'item': 'Food_eat', 'Food': {'foodNo': 42536216, 'type': 'fruit', 'moreInfo': ['organic']}}]

yaml = ruamel.yaml.YAML()
yaml.indent(sequence=4, offset=2)
yaml.dump(data, sys.stdout)
which gives:
- item: Food_eat
  Food:
    foodNo: 42536216
    type: fruit
    moreInfo:
      - organic
This relies on a modern Python's ability to keep the insertion order of a dict. For older versions, like Python 2.7, you'll have to explicitly make the object a CommentedMap (as imported from ruamel.yaml.comments) and either give it a list of tuples (in the right order), or assign the key-value pairs in the order you want them to be dumped.
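A minimal sketch of that older-Python route, using the keys from the question (the variable names are just for illustration):

from ruamel.yaml.comments import CommentedMap

# build the mappings from lists of tuples, so the key order is explicit
food = CommentedMap([('foodNo', 42536216), ('type', 'fruit'), ('moreInfo', ['organic'])])
data = [CommentedMap([('item', 'Food_eat'), ('Food', food)])]
# dumping `data` with the YAML() instance above preserves this order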
As you can see, within the indentation of the sequence the dash has an offset; this is something you cannot achieve using PyYAML without rewriting its emitter.
With PyYAML you don't want to do print(yaml.safe_dump(data)), as that is inefficient with respect to both memory and time; always use yaml.safe_dump(data, sys.stdout) instead.

Python GDAL, SetAttributeFilter not working

I am trying to use GDAL's SetAttributeFilter() to filter the features in a layer of my shapefile, but the filter seems to have no effect.
My current data is a shapefile from the US Census Bureau, but I have tried with other shapefiles and get a similar result.
For example
from osgeo import ogr
shapefile_path = '../input/processed/shapefile/'
shapefile_ds = ogr.Open(shapefile_path)
cbsa = shapefile_ds.GetLayer('cb_2016_us_cbsa_500k')
print(cbsa.GetFeatureCount())
cbsa.SetAttributeFilter('NAME = "Chicago-Naperville-Elgin, IL-IN-WI"')
feat = cbsa.GetNextFeature()
print(feat.GetField('NAME'))
print(cbsa.GetFeatureCount())
Yields
945
Platteville, WI
945
I'm using Python 3.6 and GDAL 2.2.1
You can capture the return value of the SetAttributeFilter call and make sure it's 0; otherwise something went wrong.
In this particular case, it's probably due to the quoting. Single quotes refer to string literals (a value), and double quotes refer to a column/table name.
Depending on how you run this Python code, somewhere in the stdout/stderr GDAL prints something like:
ERROR 1: "Chicago-Naperville-Elgin, IL-IN-WI" not recognised as an available field.
More details can be found at:
https://trac.osgeo.org/gdal/wiki/rfc52_strict_sql_quoting
To get it working, simply swap the single/double quoting, so:
cbsa.SetAttributeFilter("NAME='Chicago-Naperville-Elgin, IL-IN-WI'")
This was a while ago, but when I learn something I like to note what worked in my case, in case I search for this again.
For me, I had to have the syntax like:
cbsa.SetAttributeFilter('"NAME" = \'Chicago-Naperville-Elgin\'') # I didn't test multiple values
where the page referenced in the accepted answer says:
<delimited identifier> ::= <double quote> <delimited identifier body> <double quote>
<character string literal> ::= <quote> [ <character representation> ... ] <quote>
It may be that there has been an update to ogr changing this since '17.

PyParsing: Is it possible to globally suppress all Literals?

I have a simple data set to parse with lines like the following:
R1 (a/30) to R2 (b/30), metric 30
The only data that I need from the above is as follows:
R1, a, 30, R2, b, 30, 30
I can parse all of this easily with pyparsing, but I either end up with a bunch of literals in my output, or I have to specifically say Literal(thing).suppress() in my parsing grammar, which gets tiresome.
Ideally, I'd like to write a grammar for the above like:
Word(alphanums) + '(' + Word(alphanums) + '/' + Word(nums) + ... etc.
and have the literal tokens get ignored. Can I say anything like .suppressAllLiterals()?
Notes:
- new to PyParsing
- I've read the docs and 5 or 6 examples
- searched Google
Thanks!
You can use this method on ParserElement - call it immediately after importing pyparsing:
from pyparsing import ...whatever...
ParserElement.inlineLiteralsUsing(Suppress)
Now all the string literals in your parser will be wrapped in Suppress objects, and left out of the results, rather than the default Literal.
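For instance, applied to the sample line from the question (a minimal sketch; the grammar names are made up for illustration):

from pyparsing import ParserElement, Suppress, Word, alphanums, nums

ParserElement.inlineLiteralsUsing(Suppress)  # inline string literals are now suppressed

endpoint = Word(alphanums) + '(' + Word(alphanums) + '/' + Word(nums) + ')'
route = endpoint + 'to' + endpoint + ',' + 'metric' + Word(nums)

print(route.parseString('R1 (a/30) to R2 (b/30), metric 30'))
# -> ['R1', 'a', '30', 'R2', 'b', '30', '30']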
(I will probably make this the default in v3.0, someday, when I can break backward compatibility.)

How to use struct.pack for a list of strings

I want to write a list of strings to a binary file. Suppose I have a list of strings, mylist. Assume each item of the list has a '\t' at the end, except the last one, which has a '\n' at the end (to help me recover the data later). Example: ['test\t', 'test1\t', 'test2\t', 'testl\n']
For a numpy ndarray, I found the following script that worked (I got it from a "numpy to R converter" question):
import struct

binfile = open('myfile.bin', 'wb')
for i in range(mynpdata.shape[1]):
    binfile.write(struct.pack('%id' % mynpdata.shape[0], *mynpdata[:, i]))
binfile.close()
Does binfile.write automatically parse all the data if the variable has a * in front of it (as in the *mynpdata[:,i] example above)? Would this work with a list of integers in the same way (e.g. *myIntList)?
How can I do the same with a list of string?
I tried it on a single string using (which I found somewhere on the net):
oneString = 'test'
oneStringByte = bytes(oneString,'utf-8')
struct.pack('I%ds' % (len(oneString),), len(oneString), oneString)
but I couldn't understand why the % within 'I%ds' above is replaced by (len(oneString),) instead of len(oneString) as in the ndarray example, and also why both len(oneString) and oneString are passed.
Can someone help me with writing a list of strings (if necessary, assuming it is written to the same binary file where I wrote out the ndarray)?
There's no need for struct. Simply join the strings and encode them using either a specified or an assumed text encoding in order to turn them into bytes.
''.join(mylist).encode('utf-8')
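For the tab/newline layout in the question, a minimal round-trip sketch (the file name is made up):

mylist = ['test\t', 'test1\t', 'test2\t', 'testl\n']

with open('myfile.bin', 'wb') as binfile:
    binfile.write(''.join(mylist).encode('utf-8'))

with open('myfile.bin', 'rb') as binfile:
    text = binfile.read().decode('utf-8')

# re-split on tabs and restore the trailing separators
items = text.rstrip('\n').split('\t')
recovered = [s + '\t' for s in items[:-1]] + [items[-1] + '\n']
print(recovered == mylist)  # True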
