I have a simple data set to parse with lines like the following:
R1 (a/30) to R2 (b/30), metric 30
The only data that I need from the above is as follows:
R1, a, 30, R2, b, 30, 30
I can parse all of this easily with pyparsing, but I either end up with a bunch of literals in my output, or I have to specifically say Literal(thing).suppress() in my parsing grammar, which gets tiresome.
Ideally, I'd like to write a grammar for the above like:
Word(alphanums) + '(' + Word(alphanums) + '/' + Word(nums) + ... etc.
and have the literal tokens get ignored. Can I say anything like .suppressAllLiterals()?
Notes:
New to pyparsing
I've read the docs and 5 or 6 examples
Searched Google
Thanks!
You can use this method on ParserElement - call it immediately after importing pyparsing:
from pyparsing import ...whatever...
ParserElement.inlineLiteralsUsing(Suppress)
Now all the string literals in your parser will be wrapped in Suppress objects, and left out of the results, rather than the default Literal.
(I will probably make this the default in v3.0, someday, when I can break backward compatibility.)
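For example, here is a minimal sketch of a grammar for the sample line above (the '(' , '/', 'to', ',' and 'metric' literals come straight from the sample; the iface name is just illustrative):

from pyparsing import ParserElement, Suppress, Word, alphanums, nums

# every inline string literal now becomes a Suppress instead of a Literal
ParserElement.inlineLiteralsUsing(Suppress)

iface = Word(alphanums) + '(' + Word(alphanums) + '/' + Word(nums) + ')'
line = iface + 'to' + iface + ',' + 'metric' + Word(nums)

print(line.parseString('R1 (a/30) to R2 (b/30), metric 30').asList())
# -> ['R1', 'a', '30', 'R2', 'b', '30', '30']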
The problem
I am producing data files with a Fortran 90 program and then analyzing them with Python 3.7. The data files are just many numbers, written in scientific notation.
My problem is that at some point the scientific notation inside the files goes wrong: the E just disappears. See an example from the Fortran output:
...
0.2693606328922176223E+99
0.3769504465049542893E+99
0.5275144982939579450E+99
0.7382178439909821718E+99
0.1033081719932198510+100
0.1445722084267474381+100
0.2023182004494124187+100
...
Infinity
Infinity
Infinity
...
This leads to the following error in my Python analysis program:
ValueError: could not convert string to float: '0.1383817626772070210+100'
Kind of solution I'm looking for
I suppose this problem is inherent to Fortran. I am looking for a Python solution that can handle this weird notation, or at least a way to just skip the files that would give weird results.
Program sample
Here is a little sample from my Python code, which may help. The following code opens a file, reads a column, and loads the whole column into PMvar. The for loop is here because I read many files.
for rf in range(elem_2):
    # load fluorescence and Temperature
    PMvar[tr] = some_function(filepath[rf])[0]
some_function takes filepath[rf], which is the file path, as a parameter. I only retrieve the first column, hence the [0].
from numpy import loadtxt

def some_function(filepath):
    out0 = loadtxt('{}.dat'.format(filepath), comments='%')
    out1 = loadtxt('{}.dat'.format(filepath), comments='%')
    return out0, out1
I can't see any explicit conversions from string to float.
Is there a way to adapt this code to handle the weird scientific notation? Or should I just rewrite some stuff to handle this case?
EDIT 1
The Fortran code is not from me, but I think I found where the data is saved.
open(10, file = trim(adjustl(str_file_Temp))//'.dat', status='replace', access='sequential', action='write')
print*, 'j_save_temp Temp', j_save_temp
do i = 1, j_save_temp
    write(10,221) save_temperature(i,0), &
        m_kb_x_inv_n_ions2(1)*save_temperature(i,1:3), &
        m_kb_x_inv_n_ions(1) *save_temperature(i,4:6), &
        save_temperature(i,7)/real(n_dt)
enddo
221 format( 8(1X,e26.19))
close(10)
deallocate(save_temperature)
with double precision, allocatable, dimension(:,:) :: save_temperature
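For what it's worth, the missing E is standard Fortran behaviour for an Ew.d descriptor like e26.19: the default exponent field is two digits wide, and once the exponent needs three digits the E is dropped to make room. One way to cope on the Python side is to repair the lines before handing them to loadtxt, which accepts any iterable of lines. This is only a sketch; fix_exponents is a made-up helper name, not part of the original code:

import re
from numpy import loadtxt

def fix_exponents(path):
    """Yield lines with the missing 'E' re-inserted before three-digit exponents."""
    with open(path) as f:
        for line in f:
            # '0.1033081719932198510+100' -> '0.1033081719932198510E+100'
            yield re.sub(r'(\d)([+-]\d{3})(?!\d)', r'\1E\2', line)

def some_function(filepath):
    out0 = loadtxt(fix_exponents('{}.dat'.format(filepath)), comments='%')
    return out0

Lines containing Infinity should still parse (they become inf), so files with those values may need to be filtered separately afterwards.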
I am trying to use GDAL's SetAttributeFilter() to filter the features in a layer of my shapefile, but the filter seems to have no effect.
My current data is a shapefile from the US Census Bureau, but I have tried with other shapefiles and get a similar result.
For example
from osgeo import ogr
shapefile_path = '../input/processed/shapefile/'
shapefile_ds = ogr.Open(shapefile_path)
cbsa = shapefile_ds.GetLayer('cb_2016_us_cbsa_500k')
print(cbsa.GetFeatureCount())
cbsa.SetAttributeFilter('NAME = "Chicago-Naperville-Elgin, IL-IN-WI"')
feat = cbsa.GetNextFeature()
print(feat.GetField('NAME'))
print(cbsa.GetFeatureCount())
Yields
945
Platteville, WI
945
I'm using Python 3.6 and GDAL 2.2.1
You can capture the return value of the SetAttributeFilter call and make sure it's 0; otherwise something went wrong.
In this particular case, it's probably due to the quoting. Single quotes refer to string literals (a value), and double quotes refer to a column/table name.
Depending on how you run this Python code, somewhere in the stdout/stderr GDAL prints something like:
ERROR 1: "Chicago-Naperville-Elgin, IL-IN-WI" not recognised as an available field.
More details can be found at:
https://trac.osgeo.org/gdal/wiki/rfc52_strict_sql_quoting
To get it working, simply swap the single/double quoting, so:
cbsa.SetAttributeFilter("NAME='Chicago-Naperville-Elgin, IL-IN-WI'")
While this was a while ago, when I learn something I like to note what worked in my case, in case I search for this again.
For me, I had to have the syntax like:
cbsa.SetAttributeFilter('"NAME" = \'Chicago-Naperville-Elgin\'') # I didn't test multiple values
whereas the page referenced in the accepted answer says:
<delimited identifier> ::= <double quote> <delimited identifier body> <double quote>
<character string literal> ::= <quote> [ <character representation> ... ] <quote>
It may be that there has been an update to ogr changing this since '17.
I want to write a list of strings to a binary file. Suppose I have a list of strings mylist. Assume the items of the list have a '\t' at the end, except the last one, which has a '\n' at the end (to help me recover the data later). Example: ['test\t', 'test1\t', 'test2\t', 'testl\n']
For a numpy ndarray, I found the following script that worked (got it from here numpy to r converter):
import struct

binfile = open('myfile.bin', 'wb')
for i in range(mynpdata.shape[1]):
    binfile.write(struct.pack('%id' % mynpdata.shape[0], *mynpdata[:, i]))
binfile.close()
Does binfile.write automatically parse all the data if the variable has * in front of it (as in the *mynpdata[:,i] example above)? Would this work with a list of integers in the same way (e.g. *myIntList)?
How can I do the same with a list of strings?
I tried it on a single string using (which I found somewhere on the net):
oneString = 'test'
oneStringByte = bytes(oneString,'utf-8')
struct.pack('I%ds' % (len(oneString),), len(oneString), oneString)
but I couldn't understand why the % within 'I%ds' above is replaced by (len(oneString),) instead of len(oneString), as in the ndarray example, AND also why both len(oneString) and oneString are passed.
Can someone help me with writing a list of strings (if necessary, assuming it is written to the same binary file where I wrote out the ndarray)?
There's no need for struct. Simply join the strings and encode them using either a specified or an assumed text encoding in order to turn them into bytes.
''.join(mylist).encode('utf-8')
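A fuller sketch of the round trip (the file name is assumed), using the separators already present in the list to recover the items:

mylist = ['test\t', 'test1\t', 'test2\t', 'testl\n']

# write: join the strings and encode to bytes
with open('mystrings.bin', 'wb') as binfile:
    binfile.write(''.join(mylist).encode('utf-8'))

# read back: decode and split on the tab separators
with open('mystrings.bin', 'rb') as binfile:
    text = binfile.read().decode('utf-8')
items = text.rstrip('\n').split('\t')   # ['test', 'test1', 'test2', 'testl']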
I am new to natural language processing and I want to use it to write a news aggregator (in Node.js, in my case). Rather than just use a prepackaged framework, I want to learn the nuts and bolts, and I am starting with the NLP portion. I found this tutorial, which has been the most helpful so far:
http://www.p-value.info/2012/12/howto-build-news-aggregator-in-100-loc.html
In it, the author gets the RSS feeds and loops through them looking for the elements (or fields) title and description. I know Python and understand the code. But what I don't understand is what NLP is doing here with title and description under the hood (besides scraping and tokenizing, which is apparent... and those tasks don't need NLP).
import feedparser
import nltk

corpus = []
titles = []
ct = -1
for feed in feeds:
    d = feedparser.parse(feed)
    for e in d['entries']:
        words = nltk.wordpunct_tokenize(nltk.clean_html(e['description']))
        words.extend(nltk.wordpunct_tokenize(e['title']))
        lowerwords = [x.lower() for x in words if len(x) > 1]
        ct += 1
        print ct, "TITLE", e['title']
        corpus.append(lowerwords)
        titles.append(e['title'])
(Reading your question more carefully, maybe this was all already obvious to you, but it doesn't look like anything deeper or more interesting is going on.)
wordpunct_tokenize is set up here (last line) as
wordpunct_tokenize = WordPunctTokenizer().tokenize
WordPunctTokenizer is implemented by this code:
class WordPunctTokenizer(RegexpTokenizer):
    def __init__(self):
        RegexpTokenizer.__init__(self, r'\w+|[^\w\s]+')
The heart of this is just the regular expression r'\w+|[^\w\s]+', which defines what strings are considered to be tokens by this tokenizer. There are two options, separated by the |:
\w+, that is, one or more "word" characters (alphanumeric or underscore)
[^\w\s]+, one or more characters that are neither "word" characters nor whitespace, so this matches any run of punctuation
Here is a reference for Python regular expressions.
I have not dug into RegexpTokenizer, but I assume it is set up such that the tokenize function returns an iterator that searches the string for the first match of the regular expression, then the next, and so on.
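To see what that pattern does without pulling in NLTK at all, the same regular expression can be run directly with re (the sample sentence here is made up):

import re

# one or more word characters, or one or more punctuation characters
pattern = r'\w+|[^\w\s]+'
print(re.findall(pattern, "Don't panic - it's 2024!"))
# ['Don', "'", 't', 'panic', '-', 'it', "'", 's', '2024', '!']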
I have a file with the following data:
classes:
- 9:00
- 10:20
- 12:10
(and so on up to 21:00)
I use Python 3 and the yaml module to parse it. Precisely, the call is config = yaml.load(open(filename, 'r')). But then, when I print config, I get the following output for this part of the data:
'classes': [540, 630, 730, 820, 910, 1000, 1090, 1180],
The values in the list are ints.
Previously, when I used Python 2 (and BaseLoader for YAML), I got the values as strings, and I used them as such. BaseLoader is no longer acceptable since I want to read Unicode strings from the file, and it gives me byte strings.
So, first, why does pyyaml parse my data as ints?
And, second, how do I prevent pyyaml from doing this? Is it possible to do that without changing the data file (e.g. without adding !!str)?
The documentation of YAML is a bit difficult to "parse" so I can imagine you missed this little bit of info about colons:
Normally, YAML insists the “:” mapping value indicator be separated from the value by white space. A benefit of this restriction is that the “:” character can be used inside plain scalars, as long as it is not followed by white space. This allows for unquoted URLs and timestamps. It is also a potential source for confusion as “a:1” is a plain scalar and not a key: value pair.
What you have there in your input is a sexagesimal: your 9:00 is treated like 9 minutes and 0 seconds, for a total of 540 seconds.
Unfortunately this doesn't get constructed as some special Sexagesimal instance that could be used for calculations as if it were an integer yet printed in its original form. Therefore, if you want to use these values as strings internally, you have to single-quote them:
classes:
- '9:00'
- '10:20'
- '12:10'
which is what you would get if you dump {'classes': ['9:00', '10:20', '12:10']} (and note that the unambiguous classes doesn't get any quotes).
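You can verify this quickly (this dump snippet is mine, using the same ruamel.yaml import as the examples below):

import ruamel.yaml as yaml

print(yaml.dump({'classes': ['9:00', '10:20', '12:10']}, default_flow_style=False))
# classes:
# - '9:00'
# - '10:20'
# - '12:10'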
That the BaseLoader gives you strings is not surprising. The BaseConstructor that is used by the BaseLoader handles any scalar as string, including integers, booleans and "your" sexagesimals:
import ruamel.yaml as yaml
yaml_str = """\
classes:
- 12345
- 10:20
- abc
- True
"""
data = yaml.load(yaml_str, Loader=yaml.BaseLoader)
print(data)
data = yaml.load(yaml_str, Loader=yaml.SafeLoader)
print(data)
gives:
{u'classes': [u'12345', u'10:20', u'abc', u'True']}
{'classes': [12345, 620, 'abc', True]}
If you really don't want to use quotes, then you have to "reset" the implicit resolver for scalars that start with numbers:
import ruamel.yaml as yaml
from ruamel.yaml.resolver import Resolver
import re
yaml_str = """\
classes:
- 9:00
- 10:20
- 12:10
"""
for ch in list(u'-+0123456789'):
    del Resolver.yaml_implicit_resolvers[ch]

Resolver.add_implicit_resolver(
    u'tag:yaml.org,2002:int',
    re.compile(u'''^(?:[-+]?0b[0-1_]+
                    |[-+]?0o?[0-7_]+
                    |[-+]?(?:0|[1-9][0-9_]*)
                    |[-+]?0x[0-9a-fA-F_]+)$''', re.X),  # <- copy from resolver.py without sexagesimal support
    list(u'-+0123456789'))
data = yaml.load(yaml_str, Loader=yaml.SafeLoader)
print(data)
gives you:
{'classes': ['9:00', '10:20', '12:10']}
You should probably check the YAML documentation.
The colon is used for mapping values.
I presume you want a string and not an integer, so you should double-quote your strings.