Python GDAL, SetAttributeFilter not working - python-3.x

I am trying to use GDAL's SetAttributeFilter() to filter the features in a layer of my shapefile, but the filter seems to have no effect.
My current data is a shapefile from the US Census Bureau, but I have tried with other shapefiles and get a similar result.
For example
from osgeo import ogr
shapefile_path = '../input/processed/shapefile/'
shapefile_ds = ogr.Open(shapefile_path)
cbsa = shapefile_ds.GetLayer('cb_2016_us_cbsa_500k')
print(cbsa.GetFeatureCount())
cbsa.SetAttributeFilter('NAME = "Chicago-Naperville-Elgin, IL-IN-WI"')
feat = cbsa.GetNextFeature()
print(feat.GetField('NAME'))
print(cbsa.GetFeatureCount())
Yields
945
Platteville, WI
945
I'm using Python 3.6 and GDAL 2.2.1

You can capture the return value of the SetAttributeFilter call and make sure it is 0; otherwise something went wrong.
In this particular case, it's probably due to the quoting. Single quotes denote string literals (values), and double quotes denote column/table names.
Depending on how you run this Python code, somewhere in the stdout/stderr GDAL prints something like:
ERROR 1: "Chicago-Naperville-Elgin, IL-IN-WI" not recognised as an available field.
More details can be found at:
https://trac.osgeo.org/gdal/wiki/rfc52_strict_sql_quoting
To get it working, simply swap the single/double quoting, so:
cbsa.SetAttributeFilter("NAME='Chicago-Naperville-Elgin, IL-IN-WI'")

This question is old, but I like to record what worked for me in case I search for it again.
In my case the syntax had to be:
cbsa.SetAttributeFilter('"NAME" = \'Chicago-Naperville-Elgin\'') # I didn't test multiple values
where the referenced page of the accepted answer says:
<delimited identifier> ::= <double quote> <delimited identifier body> <double quote>
<character string literal> ::= <quote> [ <character representation> ... ] <quote>
It may be that there has been an update to ogr changing this since '17.

Related

Python - re.sub return pattern rather than replacing

I am trying to revise a list of dictionary keys in Python 3 so that they are identifiable by the first set of numbers in the dictionary but it appears to be returning the regex pattern rather than the set of numbers.
>>> import re
>>> re.sub(r'GraphImages_[0-9]{2}_edge_media_to_caption_edges_0_node_text', '(?<=GraphImages_)\n{3}', 'GraphImages_99_edge_media_to_caption_edges_0_node_text')
'(?<=GraphImages_)\n{3}'
>>> re.sub(r'GraphImages_[0-9]{2}_edge_media_to_caption_edges_0_node_text', '(?<=GraphImages_)\n{3}', 'GraphImages_123_edge_media_to_caption_edges_0_node_text')
'(?<=GraphImages_)\n{3}'
The intended output would be 99 and 123 respectively.
Any guidance would be much appreciated. I am not very adept at the re package.
If you just want to extract the numbers, you need to find them, not to replace:
re.findall("GraphImages_([0-9]{2,})", yourstring)[0]
#'99'
In fact, in your case a split may be a better choice:
yourstring.split("_")[1]
#'99'
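For context, the second argument of re.sub is the replacement text, not a second pattern, which is why the call in the question returned that text instead of the digits. If re.sub is still preferred, a capturing group plus a backreference in the replacement does the job. A minimal sketch, where the variable names are illustrative only:

```python
import re

keys = [
    'GraphImages_99_edge_media_to_caption_edges_0_node_text',
    'GraphImages_123_edge_media_to_caption_edges_0_node_text',
]

# Capture the first run of digits; r'\1' in the replacement refers back
# to that captured group, so the whole key is replaced by its number.
numbers = [re.sub(r'^GraphImages_([0-9]+)_.*$', r'\1', key) for key in keys]
print(numbers)
```

Using [0-9]+ rather than [0-9]{2} lets the same pattern handle both two-digit and three-digit keys.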
I found a cumbersome workaround:
try_1 = re.sub('[^0-9]', "", 'GraphImages_99_edge_media_to_caption_edges_0_node_text')
try_2 = re.sub('0$', "", try_1)
You may use
^\D+(\d+).+
See a demo on regex101.com.
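The pattern above can be applied with re.match, reading the number from the first capturing group. A small sketch using the two keys from the question (the variable names are assumptions):

```python
import re

# ^\D+ skips the leading non-digits, (\d+) captures the first number,
# and .+ consumes the rest of the key.
pattern = re.compile(r'^\D+(\d+).+')

keys = [
    'GraphImages_99_edge_media_to_caption_edges_0_node_text',
    'GraphImages_123_edge_media_to_caption_edges_0_node_text',
]

first_numbers = [pattern.match(key).group(1) for key in keys]
print(first_numbers)
```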

Python: Print entire line of string match and not cut off after the period

See bottom for the solution I came up with.
Hopefully this is an easy question. I am trying to match a string against a list and print only the matched string. I was successful using re, but it cuts off the rest of the string after the period. The span reported by re is (0, 10), while the full string spans (0, 14), so the match drops everything after the period. I would like to learn how to print the entire span, or a different way to match a string variable against a list and print exactly that string. My original attempt printed anything containing TESTPR, three results in total; of the two I do not want, one has a 1 at the front and the other has an additional R at the end. Here is my current match code:
#OLD See below
for element in catalog:
    z = re.match("((TESTPRR )\w+)", element)
    if z:
        print((z.group()))
Output: TESTPR 105
It should show:
Wanted output: TESTPT 105.465
It will go up to 3 decimal places after the period and no more. I am currently taking a Python class to learn Python and love it so far, but this one has me stumped as I am just now learning about re and matching by reading as we have not gotten to that yet in class.
I am open to learning a different way to search for and match a string and print just that string. For my first attempt that prints 3 results was this:
catalog = [ long list pulled from API then code here to make it a nice column]
prod = 'TESTPR'
print ([s for s in catalog if prod in s])
When I add a space at the end of prod I can get rid of the match with the extra character at the end, but I cannot add a space to do the same thing for the match that has an extra character at the front. This applies to the code above, not to the re match code. Thanks!
Answer below!
Since you are interested in learning about ways to match strings and solve your problem: try fuzzywuzzy.
In your case you could try:
from fuzzywuzzy import process
catalog = [long list pulled from API then code here to make it a nice column]
prod = "TESTPR"
hit = process.extractOne(prod, catalog, score_cutoff = 75) #you can adjust this to suit how close the match should be
print(hit[0]) #hit will be sth like ("TESTPT 105.465", 75)
Output: TESTPT 105.465
For information on different ways of using fuzzywuzzy, check out this link.
Other scorers are available, such as:
fuzz.ratio
fuzz.partial_ratio
fuzz.token_sort_ratio
fuzz.token_set_ratio
To use them, add from fuzzywuzzy import fuzz.
I kept at it with re.match and got the correct regex, so the entire match prints without cutting off the numbers after the period.
My original match, as shown above, was re.match("((TESTPRR )\w+)", element). Some of the parentheses were unneeded, and a few more expressions had to be added; now it prints the correct match. See above for the old code and below for the new code that works.
# New code, replaced w+ with w*\d*[.,]?\d*$
for element in catalog:
    z = re.match("STRING\w*\d*[.,]?\d*$", element)
    if z:
        print(z.group())
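An alternative that avoids both the leading "1" and the trailing "R" problem is to build the pattern from the product name and require the whole string to match. This is only a sketch: the catalog entries below are made up, and it assumes each wanted entry is the product name, one space, then a number with at most three decimal places.

```python
import re

# Hypothetical catalog entries, standing in for the list pulled from the API.
catalog = ["1TESTPR 100.1", "TESTPR 105.465", "TESTPRR 7.25"]
prod = "TESTPR"

# re.escape guards against special characters in the product name;
# fullmatch requires the entire string to fit the pattern.
pattern = re.compile(re.escape(prod) + r" \d+(?:\.\d{1,3})?")

matches = [entry for entry in catalog if pattern.fullmatch(entry)]
print(matches)
```

Because fullmatch anchors both ends, no explicit ^ or $ is needed.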

PyParsing: Is it possible to globally suppress all Literals?

I have a simple data set to parse with lines like the following:
R1 (a/30) to R2 (b/30), metric 30
The only data that I need from the above is as follows:
R1, a, 30, R2, 192.168.0.2, 30, 30
I can parse all of this easily with pyparsing, but I either end up with a bunch of literals in my output, or I have to specifically say Literal(thing).suppress() in my parsing grammar, which gets tiresome.
Ideally, I'd like to write a grammar for the above like:
Word(alphanums) + '(' + Word(alphanums) + '/' + Word(nums) + ... etc.
and have the literal tokens get ignored. Can I say anything like .suppressAllLiterals()?
Notes:
new to PyParsing
I've read the docs and 5 or 6 examples
searched google
Thanks!
You can use this method on ParserElement - call it immediately after importing pyparsing:
from pyparsing import ...whatever...
ParserElement.inlineLiteralsUsing(Suppress)
Now all the string literals in your parser will be wrapped in Suppress objects, and left out of the results, rather than the default Literal.
(I will probably make this the default in v3.0, someday, when I can break backward compatibility.)

how use struct.pack for list of strings

I want to write a list of strings to a binary file. Suppose I have a list of strings, mylist. Assume each item has a '\t' at the end, except the last one, which has a '\n' at the end (to help me recover the data later). Example: ['test\t', 'test1\t', 'test2\t', 'testl\n']
For a numpy ndarray, I found the following script that worked (got it from here numpy to r converter):
binfile = open('myfile.bin','wb')
for i in range(mynpdata.shape[1]):
    binfile.write(struct.pack('%id' % mynpdata.shape[0], *mynpdata[:,i]))
binfile.close()
Does binfile.write automatically parse all the data if the variable has * in front of it (as in the *mynpdata[:,i] example above)? Would this work with a list of integers in the same way (e.g. *myIntList)?
How can I do the same with a list of strings?
I tried it on a single string using (which I found somewhere on the net):
oneString = 'test'
oneStringByte = bytes(oneString,'utf-8')
struct.pack('I%ds' % (len(oneString),), len(oneString), oneString)
but I couldn't understand why the % in 'I%ds' above is followed by (len(oneString),) instead of len(oneString) as in the ndarray example, and also why both len(oneString) and oneString are passed.
Can someone help me write a list of strings (if necessary, assuming it is written to the same binary file where I wrote out the ndarray)?
There's no need for struct. Simply join the strings and encode them using either a specified or an assumed text encoding in order to turn them into bytes.
''.join(L).encode('utf-8')
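If a length-prefixed layout like the one in the question is still wanted (for example, to mix strings with other binary data), the following sketch may help. In 'I%ds' the I is an unsigned integer holding the byte length and %ds becomes e.g. 5s, a 5-byte field, which is why both the length and the bytes are passed. (n,) and n behave the same with the % operator, and the * in the ndarray example simply unpacks a sequence into separate arguments for struct.pack. The helper names pack_strings and unpack_strings are hypothetical.

```python
import struct

def pack_strings(strings):
    """Encode each string as a 4-byte little-endian length plus UTF-8 bytes."""
    chunks = []
    for text in strings:
        data = text.encode('utf-8')  # struct's 's' format needs bytes, not str
        chunks.append(struct.pack('<I%ds' % len(data), len(data), data))
    return b''.join(chunks)

def unpack_strings(blob):
    """Reverse pack_strings by reading each length prefix in turn."""
    strings, offset = [], 0
    while offset < len(blob):
        (length,) = struct.unpack_from('<I', blob, offset)
        offset += 4
        strings.append(blob[offset:offset + length].decode('utf-8'))
        offset += length
    return strings

mylist = ['test\t', 'test1\t', 'test2\t', 'testl\n']
blob = pack_strings(mylist)
restored = unpack_strings(blob)
print(restored == mylist)
```

The length is taken from the encoded bytes rather than the str, since the two differ for non-ASCII text.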

PyYaml parses '9:00' as int

I have a file with the following data:
classes:
- 9:00
- 10:20
- 12:10
(and so on up to 21:00)
I use python3 and yaml module to parse it. Precisely, the source is config = yaml.load (open (filename, 'r')). But then, when I print config, I get the following output for this part of data:
'classes': [540, 630, 730, 820, 910, 1000, 1090, 1180],
The values in the list are ints.
Previously, when I used python2 (and BaseLoader for YAML), I got the values as strings, and I use them as such. BaseLoader is no longer acceptable, since I want to read unicode strings from the file and it gives me byte-strings.
So, first, why does pyyaml parse my data as ints?
And, second, how do I prevent pyyaml from doing this? Is it possible without changing the data file (e.g. without adding !!str)?
The documentation of YAML is a bit difficult to "parse" so I can imagine you missed this little bit of info about colons:
Normally, YAML insists the “:” mapping value indicator be separated from the value by white space. A benefit of this restriction is that the “:” character can be used inside plain scalars, as long as it is not followed by white space. This allows for unquoted URLs and timestamps. It is also a potential source for confusion as “a:1” is a plain scalar and not a key: value pair.
And what you have there in your input is a sexagesimal and your 9:00 is considered to be similar to 9 minutes and 0 seconds, equalling a total of 540 seconds.
Unfortunately this doesn't get constructed as some special Sexagesimal instance that can be used for calculations as if it were an integer but can be printed in its original form. Therefore, if you want to use this as a string internally you have to single quote them:
classes:
- '9:00'
- '10:20'
- '12:10'
which is what you would get if you dump {'classes': ['9:00', '10:20', '12:10']} (and note that the unambiguous classes doesn't get any quotes).
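The conversion described above is plain base-60 arithmetic. A minimal sketch in pure Python, independent of any YAML library (the helper name sexagesimal_to_int is hypothetical and only mirrors the behaviour for simple non-negative values):

```python
def sexagesimal_to_int(text):
    """Interpret a colon-separated scalar such as '9:00' as a base-60 integer."""
    value = 0
    for part in text.split(':'):
        value = value * 60 + int(part)
    return value

converted = {scalar: sexagesimal_to_int(scalar) for scalar in ['9:00', '10:20', '12:10']}
print(converted)
```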
That the BaseLoader gives you strings is not surprising. The BaseConstructor that is used by the BaseLoader handles any scalar as string, including integers, booleans and "your" sexagesimals:
import ruamel.yaml as yaml
yaml_str = """\
classes:
- 12345
- 10:20
- abc
- True
"""
data = yaml.load(yaml_str, Loader=yaml.BaseLoader)
print(data)
data = yaml.load(yaml_str, Loader=yaml.SafeLoader)
print(data)
gives:
{u'classes': [u'12345', u'10:20', u'abc', u'True']}
{'classes': [12345, 620, 'abc', True]}
If you really don't want to use quotes, then you have to "reset" the implicit resolver for scalars that start with numbers:
import ruamel.yaml as yaml
from ruamel.yaml.resolver import Resolver
import re
yaml_str = """\
classes:
- 9:00
- 10:20
- 12:10
"""
for ch in list(u'-+0123456789'):
    del Resolver.yaml_implicit_resolvers[ch]
Resolver.add_implicit_resolver(
    u'tag:yaml.org,2002:int',
    re.compile(u'''^(?:[-+]?0b[0-1_]+
    |[-+]?0o?[0-7_]+
    |[-+]?(?:0|[1-9][0-9_]*)
    |[-+]?0x[0-9a-fA-F_]+)$''', re.X),  # <- copy from resolver.py without sexagesimal support
    list(u'-+0123456789'))
data = yaml.load(yaml_str, Loader=yaml.SafeLoader)
print(data)
gives you:
{'classes': ['9:00', '10:20', '12:10']}
You should probably check the YAML documentation.
Colons are used for mapping values.
I presume you want strings and not integers, so you should double quote your values.

Resources