While trying to tag named entities with the Stanford NER tool, I get this kind of output:
A jury in <ORGANIZATION>Marion County Superior Court</ORGANIZATION> was expected to begin deliberations in the case on <DATE>Wednesday</DATE> or <DATE>Thursday</DATE>.
Of course, processing XML without a root element does not work, so I wrapped the output:
<root>A jury in <ORGANIZATION>Marion County Superior Court</ORGANIZATION> was expected to begin deliberations in the case on <DATE>Wednesday</DATE> or <DATE>Thursday</DATE>.</root>
I tried building a tree with the method from "stripping inline tags with python's lxml", but it does not work. It raises this error on the line tree = etree.fromstring(text):
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 1, column 1793
Does anyone know a solution for this? (The xmlParseEntityRef error suggests the parser hit a raw, unescaped & somewhere in the text.) Or perhaps another method that lets me build a tree from any text with inline XML tags, keeping only the tagged tokens and removing/ignoring the rest of the text.
In the end I did it without a parser or a tree, just regular expressions. This code works nicely and fast:
import re

# data is the tagged NER output as one string (see below)
NER = ['TIME', 'LOCATION', 'ORGANIZATION', 'PERSON', 'MONEY', 'PERCENT', 'DATE']
entities = {}
for cat in NER:
    regex_cat = re.compile('<' + cat + '>(.*?)</' + cat + '>')
    entities[cat] = re.findall(regex_cat, data)
Here data is just a string of text. The code uses regular expressions to find all entities of each category specified in NER and stores them as a list in a dictionary. This works for any inline-XML string where NER is a list of all possible tags in the string.
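For instance, running it on the tagged sentence above (a quick check, with data holding that exact string) gives:

data = ('A jury in <ORGANIZATION>Marion County Superior Court</ORGANIZATION> '
        'was expected to begin deliberations in the case on '
        '<DATE>Wednesday</DATE> or <DATE>Thursday</DATE>.')
# ...run the loop above, then:
print(entities['ORGANIZATION'])  # ['Marion County Superior Court']
print(entities['DATE'])          # ['Wednesday', 'Thursday']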
I found the spaCy lib, which lets me apply lemmatization to words (blacks -> black, EN; bianchi -> bianco, IT). My work is to analyze entities, not verbs or adjectives.
I'm looking for something that gives me all the possible words starting from the canonical form.
Like going from "black" to "blacks" for English, or from "bianco" (in Italian) getting "bianca", "bianchi", "bianche", etc. Is there any library that does this?
I'm not clear on exactly what you're looking for, but if a list of English lemmas is all you need, you can extract that easily enough from a GitHub library I have. Take a look at lemminflect. It uses a dictionary approach to lemmatization, and there is a .csv file in there with all the different lemmas and their inflections. The file is LemmInflect/lemminflect/resources/infl_lu.csv.gz. You'll have to extract the lemmas from it. Something like...
import gzip

# Open in text mode ('rt'); gzip.open defaults to binary
with gzip.open('LemmInflect/lemminflect/resources/infl_lu.csv.gz', 'rt') as f:
    for line in f.readlines():
        parts = line.split(',')
        lemma = parts[0]
        pos = parts[1]
        print(lemma, pos)
Alternatively, if you need a system to inflect words, this is what LemmInflect is designed to do. You can use it as a stand-alone library or as an extension to spaCy. There are examples of how to use it in the README.md and in the ReadTheDocs documentation.
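As a minimal sketch of the stand-alone usage (the exact forms returned depend on the dictionary, so treat the outputs in the comments as illustrative):

from lemminflect import getAllInflections, getInflection

# All known inflections of a lemma, keyed by Penn Treebank tag
print(getAllInflections('black'))         # e.g. {'NN': ('black',), 'NNS': ('blacks',), ...}

# A specific inflection: the plural noun form
print(getInflection('black', tag='NNS'))  # ('blacks',)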
I should note that this is for English only. I haven't seen a lot of code for inflecting words and you may have some difficulty finding this for other languages.
I have a situation in which an XML document has information at varying depths (according to S1000D schemas), and I'm looking for a generic method to extract correct sentences.
I need to interpret a simple element containing text as one individual part/sentence, and when an element containing text itself contains other elements that in turn contain text, I need to flatten/concatenate the whole thing into one string/sentence. The nested elements shall not be visited again once this is done.
Using Python's lxml library and applying the tostring function works OK if the source XML is pretty-printed: I can split the concatenated string on newlines to get each sentence. If the source isn't pretty-printed and sits on one single line, there are no newlines to split on.
I have tried the iter function and applied XPaths to each node, but this often yields different results in Python than what I get when applying the same XPath in XMLSpy.
I have started down some of the following paths, and my question is if you have some input on which ones to continue on, or if you have other solutions.
I think I could use XSLT to preprocess the XML file and then use a simpler Python script to divide the content into a list of sentences for further processing. Using Saxon with Python is now doable, but here I run into problems if the XML source contains entities that I cannot get Saxon to resolve (such as &nbsp;). I have no problem parsing such files with lxml, so I lean towards a cleaner Python solution.
lxml doesn't seem to have XPath support that can give me all nodes with text that contain one or more children containing text, plus all nodes that are simple elements with no parents containing text nodes. Is there a way to preprocess the parsed tree so that I can ensure it is pretty-printed in memory, so that tostring works the same way for every XML file? Otherwise, my logic gives me one string for a document with no whitespace, and multiple sentences/strings if the source had been pretty-printed. That doesn't feel OK.
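For the in-memory pretty-printing idea, something like this might work (a sketch, assuming lxml >= 4.5, where etree.indent is available; note that remove_blank_text is heuristic and can drop significant whitespace in mixed content):

from lxml import etree

# Drop whitespace-only text nodes on input, then re-indent uniformly,
# so tostring() no longer depends on how the source file was formatted
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('source.xml', parser)
etree.indent(tree, space='  ')
print(etree.tostring(tree, encoding='unicode'))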
What are my options? Use XSLT 1.0 in Python, other parsers to get a better handle on where I am in the tree, ...
Just to reiterate the issue: I am looking for a generic way to extract text, and the only rules for the XML source are that a sentence may be built from an element with child elements containing text, with no additional nesting levels beyond that. The other possibility is the simple element, but it cannot sit inside a parent element with text, since that case is covered by the first rule.
Help/thoughts are appreciated.
This is downright ugly code, a hasty hack with no real thought for form, beauty or finesse. All I was after was one way of doing this in Python. I'll tidy things up when I find a good solution that I want to keep. This is one possible solution, so I figured I'd post it to see if someone can be kind enough to show me a better way.
The problem has been writing XPath expressions that get me all elements with text content, and then acting on them depending on their context. All my XPath expressions gave me the correct nodes, but also a root or ancestor that pulled in a more or less complete string at the beginning, so I gave up on those. My XPath works as it should in XSLT, but not in Python; I don't know why.
I had to resort to regex to find nodes containing strings that are not whitespace-only.
Using lxml with xpath and tostring gives different results depending on how the source XML is formatted, so I had to get around that.
The following formats have been tested:
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <subroot>
    <a>Intro, element a: <b>Nested b to be included in a, <c>and yet another nested c-element</c> and back to b.</b></a>
    <!-- Comment -->
    <a>Simple element.</a>
    <a>Text with<b> 1st nested b</b>, back in a, <b>and yet another b-element</b>, before ending in a.</a>
  </subroot>
</root>
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <subroot>
    <a>Intro, element a: <b>Nested b to be included in a, <c>and yet another nested c-element,
      </c> and back to b.</b>
    </a>
    <!-- Comment -->
    <a>Simple element.</a>
    <a>Text with<b> 1st nested b</b>, back in a, <b>and yet another b-element</b>, before ending in a.</a>
  </subroot>
</root>
<?xml version="1.0" encoding="UTF-8"?><root><subroot><a>Intro, element a: <b>Nested b to be included in a, <c>and yet another nested c-element</c> and back to b.</b></a><!-- Comment --><a>Simple element.</a><a>Text with<b> 1st nested b</b>, back in a, <b>and yet another b-element</b>, before ending in a.</a></subroot></root>
Python code:
import re
from lxml import etree as ET

dmParser = ET.XMLParser(resolve_entities=False, recover=True)
xml_doc = r'C:/Temp/xml-testdoc.xml'
parsed = ET.parse(xml_doc, dmParser)

ns = {"re": "http://exslt.org/regular-expressions"}
for elem in parsed.xpath(r"//*[re:match(text(), '\S')]", namespaces=ns):
    # If the parent also holds non-whitespace text, this element is already part of its sentence
    tmp = elem.xpath(r"parent::*[re:match(text(), '\S')]", namespaces=ns)
    if tmp and tmp[0].text and tmp[0].text.strip():  # the first two checks can yield None
        continue  # if so, discard this node
    elif elem.xpath(r"./*[re:match(text(), '\S')]", namespaces=ns):  # a child node also contains text
        # Flatten the element to plain text and collapse all unwanted whitespace
        line = re.sub(r'\s+', ' ', ET.tostring(elem, encoding='unicode', method='text').strip())
        if line:
            print(line)
    else:  # simple element
        print(elem.text.strip())
Always yields:
Intro, element a: Nested b to be included in a, and yet another nested c-element, and back to b.
Simple element.
Text with 1st nested b, back in a, and yet another b-element, before ending in a.
I have a list and a sentence, and I want to match the list against the lemmas of the words in the sentence, i.e.,
list_words = ['play', 'burn fireworks', 'eat']
sentence = "sita was playing with her friends while her brother was burning fireworks"
I tried:
from spacy.matcher import PhraseMatcher

patterns = [__model.make_doc(text) for text in list_words]
spacy_doc = __model(sentence)
matcher = PhraseMatcher(__model.vocab, attr="LEMMA")
matcher.add("Phrases", None, *patterns)
matches = matcher(spacy_doc)
that is, adding LEMMA as attr in PhraseMatcher, but it did not help: it should have matched "burning fireworks" and "playing" from the sentence, and instead I am getting an empty list.
If __model has the tagger enabled (which it probably does by default), this will work if you change __model.make_doc(text) to __model(text) when you create the patterns. make_doc() only works for attr="ORTH" because it doesn't do anything beyond tokenization.
If you have a lot of lemma-based patterns and none of them need parses or named entities, you could disable parser and ner in __model to make things faster, since the lemmatizer only depends on the tagger.
(PhraseMatcher warns you that nlp(text) might be slow for ORTH-only patterns and suggests using nlp.make_doc() instead, but I think it should also try to warn you if your document doesn't have the attributes you're trying to match.)
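A minimal sketch of the fix (assuming an English pipeline with a tagger, e.g. en_core_web_sm, and the spaCy v2 add() signature used in the question):

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])  # tagger is enough for LEMMA

list_words = ['play', 'burn fireworks', 'eat']
sentence = "sita was playing with her friends while her brother was burning fireworks"

# Run the full pipeline (not make_doc) so the patterns carry lemmas
patterns = [nlp(text) for text in list_words]
matcher = PhraseMatcher(nlp.vocab, attr="LEMMA")
matcher.add("Phrases", None, *patterns)

doc = nlp(sentence)
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # playing, burning fireworks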
I'd like to be able to have files that contain lists of terms that I can read and use in a modgrammar grammar, but OR() doesn't work on a Python list as far as I can tell...
from modgrammar import *

with open(termfile) as f:
    terms = [x.strip() for x in f.readlines()]

class SomeGrammar(Grammar):
    grammar = (OR(terms))
Trying to parse strings that begin with anything but the first term in the list throws an exception. Is there a way to do this cleanly?
Modgrammar will interpret a list as a series of terms to match in order, so OR(terms) is interpreted as "match these in order OR (nothing else)", which isn't what you're looking for.
Fortunately, Python has a built-in syntax to take a list and pass it as multiple arguments for a function (like OR). You should be able to use OR(*terms) to do what you want instead.
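A minimal sketch with the unpacking applied (termfile standing in for your actual path):

from modgrammar import *

with open(termfile) as f:
    terms = [x.strip() for x in f.readlines()]

class SomeGrammar(Grammar):
    # Unpack the list so each term becomes its own OR alternative
    grammar = OR(*terms)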
Using Stanford CoreNLP, I am trying to parse text with the neural-network dependency parser. It runs really fast (that's why I want to use it and not the LexicalizedParser), and produces high-quality dependency relations. I am also interested in retrieving the parse trees (Penn-tree style) from it. So, given the GrammaticalStructure, I get its root (using root()) and then try to print it out using the toOneLineString() method. However, root() returns the root node of the tree with an empty/null list of children. I couldn't find anything about this in the instructions or FAQs.
GrammaticalStructure gs = parser.predict(tagged);
// Print typed dependencies
System.err.println(gs);
// get the tree and print it out in the parenthesised form
TreeGraphNode tree = gs.root();
System.err.println(tree.toOneLineString());
The output of this is:
ROOT-0{CharacterOffsetBeginAnnotation=-1, CharacterOffsetEndAnnotation=-1, PartOfSpeechAnnotation=null, TextAnnotation=ROOT}Typed Dependencies:
[nsubj(tell-5, I-1), aux(tell-5, can-2), advmod(always-4, almost-3), advmod(tell-5, always-4), root(ROOT-0, tell-5), advmod(use-8, when-6), nsubj(use-8, movies-7), advcl(tell-5, use-8), amod(dinosaurs-10, fake-9), dobj(use-8, dinosaurs-10), punct(tell-5, .-11)]
ROOT-0
How can I get the parse tree too?
Figured I can use the Shift-Reduce constituency parser made available by Stanford. It's very fast and the results are comparable.