I am trying to remove an element from an XML tree. My attempt is based on the example that I found in the Python documentation:
for country in root.findall('country'):
    # using root.findall() to avoid removal during traversal
    rank = int(country.find('rank').text)
    if rank > 50:
        root.remove(country)
tree.write('output.xml')
But I'm trying to use the remove() function for a string attribute, not an integer one.
for country in root.findall('country'):
    # using root.findall() to avoid removal during traversal
    description = country.find('rank').text
    root.remove(description)
tree.write('SampleData.xml')
But I get the following error:
TypeError: remove() argument must be xml.etree.ElementTree.Element, not str.
I ultimately added another element under country called description which holds a short description of the country and its features:
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
        <description>Liechtenstein has a lot of flowers.</description>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N" />
        <description>Singapore has a lot of street markets</description>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W" />
        <neighbor name="Colombia" direction="E" />
        <description>Panama is predominantly Spanish speaking.</description>
    </country>
</data>
I'm trying to use the remove() function to delete that description attribute for all instances.
Indeed, in your code you are passing a string as argument, not a node:
description = country.find('rank').text
root.remove(description)
This is not what happens in the correct examples. The one with the integer does this:
rank = int(country.find('rank').text)
if rank > 50:
    root.remove(country)
Note that country is removed (a node), not rank (an int).
It is not clear what you want to do with the description, but make sure to remove a node, not a string. For instance,
description = country.find('rank').text
if description == "delete this": # just an example condition
    root.remove(country)
Or, if you just want to remove the "rank" node and keep the country node:
ranknode = country.find('rank')
if ranknode.text == "delete this":
    country.remove(ranknode)
Since you mention that you actually have a description element (you call it an attribute, but that is confusing), you can target that element instead of rank:
descriptionnode = country.find('description')
if descriptionnode.text == "delete this":
    country.remove(descriptionnode)
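To delete the description element from every country, as the question asks, the same idea can be applied without any condition. The following is only a minimal sketch: it uses a shortened copy of the sample data, and the None check is my own precaution for countries that have no description.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample data from the question
xml_text = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <description>Liechtenstein has a lot of flowers.</description>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <description>Panama is predominantly Spanish speaking.</description>
    </country>
</data>"""

root = ET.fromstring(xml_text)

for country in root.findall('country'):
    # find() returns the Element itself (or None), which is what remove() expects
    description = country.find('description')
    if description is not None:
        # remove() has to be called on the direct parent of the element
        country.remove(description)

print(ET.tostring(root, encoding='unicode'))
```

Note that remove() is called on country, not on root, because description is a child of country.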
When I format the reference using jm-chinese-gb7714-2005-numeric.csl in Juris-M, the title occurs twice. Does anyone know the reason?
Many thanks.
Example:
[1] MINEKUS M, ALMINGER M, ALVITO P, et al. A standardised static in vitro digestion method suitable for food – an international consensus
A standardised static in vitro digestion method suitable for food – an international consensus[J]. Food & Function, 2014, 5(6) : 1113–1124
The address of the reference https://pubs.rsc.org/en/content/articlelanding/2014/fo/c3fo60702j#!divAbstract
Gist of the csl file:
https://gist.github.com/redleafnew/6f6fa23c3627c67d968eee38e4d2d40a
This bug has been fixed. Replace
<text macro="title" suffix="[J]."/>
<text value=""/>
with
<text value="[J]."/>
The duplicated reference could be removed.
I am having a hard time using the advantages of beautiful soup for my use case. There are many similar but not always equal nested p tags where I want to get the contents from. Examples as follows:
<p><span class="example" data-location="1:20">20</span>normal string</p>
<p><span class="example" data-location="1:21">21</span>this text <strong>belongs together</strong></p>
<p><span class="example" data-location="1:22">22</span>some text (<span class="referencequote">a reference text</span>)that might continue</p>
<p><span class="example" data-location="1:23">23</span>more text</p><div class="linebreak"></div>
<p><span class="example" data-location="1:22">24</span>text with (<span class="referencequote">first</span>)two references <span class="referencequote">second</span>.</p>
I need to save the string of the span tag as well as the strings inside the p tag, no matter its styling and if applicable the referencequote. So from examples above I would like to extract:
example = 20, text = 'normal string', reference = []
example = 21, text = 'this text belongs together', reference = []
example = 22, text = 'some text that might continue', reference = ['a reference text']
example = 23, text = 'more text', reference = []
example = 24, text = 'text with two references', reference = ['first', 'second']
What I tried was to collect all items with the "example" class and then loop through the contents of each one's parent.
for span in bs.find_all("span", {"class": "example"}):
    references = []
    for item in span.parent.contents:
        if (type(item) == NavigableString):
            text= item
        elif (item['class'][0]) == 'verse':
            number= int(item.string)
        elif (item['class']) == 'referencequote':
            references.append(item.string)
        else:
            #how to handle <strong> tags?
    verses.append(MyClassObject(n=number, t=text, r=references))
My approach is very error-prone, and there might be even more tags like <strong> or <em> that I am ignoring right now. The get_text() method unfortunately gives back something like '22 some text a reference text that might continue'.
There must be an elegant way to extract this information. Could you give me some ideas for other approaches? Thanks in advance!
Try this.
from simplified_scrapy.core.regex_helper import replaceReg
from simplified_scrapy import SimplifiedDoc,utils
html = '''
<p><span class="example" data-location="1:20">20</span>normal string</p>
<p><span class="example" data-location="1:21">21</span>this text <strong>belongs together</strong></p>
<p><span class="example" data-location="1:22">22</span>some text (<span class="referencequote">a reference text</span>)that might continue</p>
<p><span class="example" data-location="1:23">23</span>more text</p><div class="linebreak"></div>
<p><span class="example" data-location="1:22">24</span>text with (<span class="referencequote">first</span>)two references <span class="referencequote">second</span>.</p>
'''
html = replaceReg(html,"<[/]*strong>","") # Pretreatment
doc = SimplifiedDoc(html)
ps = doc.ps
for p in ps:
    text = ''.join(p.spans.nextText())
    text = replaceReg(text,"[()]+","") # Remove ()
    span = p.span # Get first span
    spans = span.getNexts(tag="span").text # Get references
    print (span["class"], span.text, text, spans)
Result:
example 20 normal string []
example 21 this text belongs together []
example 22 some text that might continue ['a reference text']
example 23 more text []
example 24 text with two references. ['first', 'second']
Here are more examples. https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
A different approach I found, without regex and maybe more robust to other spans that might come up:
for s in bsItem.select('span'):
    if s['class'][0] == 'example' :
        # do whatever needed with the content of this span
        s.extract()
    elif s['class'][0] == 'referencequote':
        # do whatever needed with the content of this span
        s.extract()
    # check for all spans with a class where you want the text excluded
# finally get all the text
text = span.parent.text.replace(' ()', '')
maybe that approach is of interest for someone reading this :)
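For comparison, the same separation of number, text and references can be sketched with only the standard library. This swaps in html.parser instead of BeautifulSoup, so it is not a drop-in for the code above; the class name VerseParser and the row layout are made up for illustration. Text inside <strong>, <em> and similar tags simply falls through to the normal text.

```python
from html.parser import HTMLParser

class VerseParser(HTMLParser):
    """Collect number, text and references for every <p> element."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, one per <p>
        self.row = None         # row being built, None outside <p>
        self.span_classes = []  # class of each currently open <span>

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.row = {'example': None, 'text': [], 'references': []}
        elif tag == 'span':
            self.span_classes.append(dict(attrs).get('class'))

    def handle_endtag(self, tag):
        if tag == 'span' and self.span_classes:
            self.span_classes.pop()
        elif tag == 'p' and self.row is not None:
            # drop the now-empty brackets and normalise whitespace
            text = ''.join(self.row['text']).replace('()', '')
            self.row['text'] = ' '.join(text.split())
            self.rows.append(self.row)
            self.row = None

    def handle_data(self, data):
        if self.row is None:
            return  # text outside any <p>, e.g. newlines between paragraphs
        current = self.span_classes[-1] if self.span_classes else None
        if current == 'example':
            self.row['example'] = int(data)
        elif current == 'referencequote':
            self.row['references'].append(data)
        else:
            # plain text, including text nested in <strong>, <em>, ...
            self.row['text'].append(data)

html = '''
<p><span class="example" data-location="1:21">21</span>this text <strong>belongs together</strong></p>
<p><span class="example" data-location="1:22">22</span>some text (<span class="referencequote">a reference text</span>)that might continue</p>
'''

parser = VerseParser()
parser.feed(html)
parser.close()
for row in parser.rows:
    print(row)
```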
I have a bunch of nodes like this:
<root>
    <books>
        <book id="1">Book 1</book>
        <book id="2">Book 2</book>
        <book id="3">Book 3</book>
    </books>
</root>
What I want is to get the id of the book with text node "Book 2". How do I do this? I tried this without any result ($doc is my document path):
let $b := $doc/root/books/book[book = "Book 2"]
return data($b/@id)
EDIT: I meant that $doc is the document node, not only the path.
Assuming that $doc is in fact a document-node and not a document path as you described it then you can use the following:
$doc/root/books/book[. = "Book 2"]/data(@id)
Simply put, . refers to the current context item, which is already the book element, as that is the last step of the path before the predicate. The original predicate [book = "Book 2"] instead looks for a child element named book inside each book, and there is none.
If $doc is your document path, you'll need to call fn:doc($doc) to get the document node:
fn:doc($doc)/root/books/book[. = "Book 2"]/data(@id)
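The difference between the two predicates can also be tried outside XQuery. Below is a rough Python sketch using the limited XPath subset of xml.etree.ElementTree (the [.='text'] form needs Python 3.7 or later). It illustrates the predicate logic only and is not a replacement for the XQuery above.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<root>
    <books>
        <book id="1">Book 1</book>
        <book id="2">Book 2</book>
        <book id="3">Book 3</book>
    </books>
</root>""")

# [book='Book 2'] looks for a <book> child inside each <book>: nothing matches
wrong = root.find("./books/book[book='Book 2']")

# [.='Book 2'] compares the text of the <book> element itself
book = root.find("./books/book[.='Book 2']")
book_id = book.get('id') if book is not None else None

print(wrong)
print(book_id)
```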
I'm trying to deserialize XML data into an object with C#. I have always done this using the .NET deserialize method, and that has worked well for most of what I have needed.
Now, though, I have XML that is created by SharePoint, and the attribute names of the data I need to deserialize contain encoded characters, namely:
a space, º, ç, ã, : and a hyphen, encoded as
x0020, x00ba, x00e7, x00e3, x003a and x002d respectively.
I'm trying to figure out what I have to put in the AttributeName parameter of the XmlAttribute on each property.
x0020 converts to a space well, so, for instance, I can use
[XmlAttribute(AttributeName = "ows_Nome Completo")]
to read
ows_Nome_x0020_Completo="MARIA..."
On the other hand, neither
[XmlAttribute(AttributeName = "ows_Motiva_x00e7__x00e3_o_x003a_")]
nor
[XmlAttribute(AttributeName = "ows_Motivação_x003a_")]
nor
[XmlAttribute(AttributeName = "ows_Motivação:")]
allow me to read
ows_Motiva_x00e7__x00e3_o_x003a_="text to read..."
With the first two I get no value returned, and the third gives me a runtime error for invalid characters (the colon).
Is there any way to get this working with the .NET deserializer, or do I have to build a specific deserializer for this?
Thanks!
What you are looking at (the "cryptic" data) is an escaped numeric encoding of the form _xHHHH_, where HHHH is the hexadecimal code of the character. SharePoint uses it to keep attribute names and similar identifiers valid.
There are a few ways of dealing with this. The most elegant is to extract the list schema and match each element against it; the schema contains all the metadata about your list data. A polished example of a schema can be seen below or here http://www.bendsoft.com/documentation/camelot-php-tools/1_5/packets/schema-and-content-packets/schemas/example-list-view-schema/
If you don't want to walk that path you could start here http://msdn.microsoft.com/en-us/library/35577sxd.aspx
<Field Name="ContentType">
  <ID>c042a256-787d-4a6f-8a8a-cf6ab767f12d</ID>
  <DisplayName>Content Type</DisplayName>
  <Type>Text</Type>
  <Required>False</Required>
  <ReadOnly>True</ReadOnly>
  <PrimaryKey>False</PrimaryKey>
  <Percentage>False</Percentage>
  <RichText>False</RichText>
  <VisibleInView>True</VisibleInView>
  <AppendOnly>False</AppendOnly>
  <FillInChoice>False</FillInChoice>
  <HTMLEncode>False</HTMLEncode>
  <Mult>False</Mult>
  <Filterable>True</Filterable>
  <Sortable>True</Sortable>
  <Group>_Hidden</Group>
</Field>
<Field Name="Title">
  <ID>fa564e0f-0c70-4ab9-b863-0177e6ddd247</ID>
  <DisplayName>Title</DisplayName>
  <Type>Text</Type>
  <Required>True</Required>
  <ReadOnly>False</ReadOnly>
  <PrimaryKey>False</PrimaryKey>
  <Percentage>False</Percentage>
  <RichText>False</RichText>
  <VisibleInView>True</VisibleInView>
  <AppendOnly>False</AppendOnly>
  <FillInChoice>False</FillInChoice>
  <HTMLEncode>False</HTMLEncode>
  <Mult>False</Mult>
  <Filterable>True</Filterable>
  <Sortable>True</Sortable>
</Field>
<Field>
  ...
</Field>
Well... I guess I kind of hacked a way around it, which works for now. I just replaced the _x***_ characters with nothing and corrected the XmlAttributes accordingly. The replacement is done by first loading the XML as a string, then replacing, then loading the "clean" text as XML.
But I would still like to know whether it is possible to use some XmlAttribute name for a more direct approach...
Try using XmlConvert.EncodeName and XmlConvert.DecodeName from the System.Xml namespace.
I use a simple function to get the column name:
private string getNameCol(string colName) {
    if (colName.Length > 20) colName = colName.Substring(0, 20);
    return System.Xml.XmlConvert.EncodeName(colName);
}
I'm still looking for a way to replace characters like á, é, í, ó, ú. EncodeName doesn't convert these characters.
You can use Replace:
.Replace("ó","_x00f3_").Replace("á","_x00e1_")
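To make the _xHHHH_ scheme concrete, here is a small Python sketch of the decoding direction. It is only an approximation for illustration: the helper names are hypothetical, it handles four-digit escapes only, and it is not a reimplementation of XmlConvert.DecodeName.

```python
import re

# One escape is an underscore, 'x', four hex digits and a closing underscore
_ESCAPE = re.compile(r'_x([0-9A-Fa-f]{4})_')

def decode_name(name: str) -> str:
    """Replace each _xHHHH_ escape with the character it encodes (hypothetical helper)."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), name)

def encode_char(ch: str) -> str:
    """Build the escape for a single character, as the Replace() chain above does by hand."""
    return f'_x{ord(ch):04x}_'

print(decode_name('ows_Nome_x0020_Completo'))
print(decode_name('ows_Motiva_x00e7__x00e3_o_x003a_'))
print(encode_char('ó'))
```

Decoding the attribute names this way before matching them is one alternative to stripping the escapes entirely.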
I'm trying to create a dynamic row filter based on a variable. I have the following code:
<xsl:variable name="filter" select="contain(@Title, 'title1') or contain(@Title, 'title2')"/>
<xsl:variable name="Rows" select="/dsQueryResponse/Rows/Row[string($filter)]" />
This unfortunately doesn't seem to work and I end up with all rows. I'm guessing the filter doesn't actually get applied, since if I copy the output of the $filter variable and paste it into Row[], it works as expected.
Anyone tried to do this before?
In case you're wondering the filter variable is actually created using a template that splits a string like:
title1 - title2 - title3
and returns a string like:
contain(@Title, 'title1') or contain(@Title, 'title2') or contain(@Title, 'title3')
Any help would be greatly appreciated!
You can't do what you seem to be attempting here. An XPath expression is atomic: you can't store part of it in a variable and have it evaluated later. (Also, the function is contains(), not contain().)
You need something like this:
<xsl:variable name="Rows" select="
  /dsQueryResponse/Rows/Row[
    contains(@Title, 'title1') or contains(@Title, 'title2')
  ]
" />
Your "filter" does not work because if $filter is a string, then it is a string, nothing else. It does not get a magical meaning just because it looks like XPath. ;-)
This
<xsl:variable name="Rows" select="/dsQueryResponse/Rows/Row[string($filter)]" />
evaluates to a non-empty string as the predicate. And any non-empty string evaluates to true, which makes the expression return every node there is.
If you want a dynamic filter based on an input string, then do this:
<xsl:variable name="filter" select="'|title1|title2|title3|'" />
<xsl:variable name="Rows" select="
  /dsQueryResponse/Rows/Row[
    contains(
      $filter,
      concat('|', @Title, '|')
    )
  ]
" />
The use of delimiters also prevents "title11" from showing up if you look for "title1".
Make sure your filter always starts and ends with a delimiter, and use a delimiter that is reasonably unlikely to ever occur as a natural part of @Title. (For example, you could use a line feed, &#10;. If your title cannot be multi-line this is pretty safe.)
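The delimiter trick is not specific to XPath. As a rough illustration, here is the same check in Python, building the filter from the "title1 - title2 - title3" input mentioned in the question; the function name matches and the sample titles are made up.

```python
def matches(filter_string: str, title: str, delimiter: str = '|') -> bool:
    """Mirror contains($filter, concat('|', @Title, '|')) from the XSLT above."""
    return f'{delimiter}{title}{delimiter}' in filter_string

# Build '|title1|title2|title3|' from the question's input format
wanted = 'title1 - title2 - title3'.split(' - ')
filter_string = '|' + '|'.join(wanted) + '|'

titles = ['title1', 'title11', 'title2', 'other']
selected = [t for t in titles if matches(filter_string, t)]

print(filter_string)
print(selected)
```

Because every title is wrapped in delimiters before the substring test, "title11" does not match the "title1" entry in the filter.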