Python lxml.etree: how to add 'xml:lang="en-US"' as a namespace

I am trying to create an XML document whose root element is:
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
</speak>
I am able to add the first attributes with...
from lxml.etree import Element, SubElement, QName, tostring
root = Element('speak', version="1.0",
xmlns="http://www.w3.org/2001/10/synthesis")
...but not the xml:lang="en-US" attribute. Based on several tutorials and questions like this and this, I tried many solutions, but none worked.
For example, I tried this:
class XMLNamespaces:
    xml = 'http://www.w3.org/2001/10/synthesis'
root.attrib[QName(XMLNamespaces.xml, 'lang')] = "en-US"
But the output is:
<speak xmlns:ns0="http://www.w3.org/2001/10/synthesis" version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" ns0:lang="en-US">
How can I add xml:lang="en-US" to the root element of my XML?

The special xml: prefix is associated with the http://www.w3.org/XML/1998/namespace URI.
The following code adds xml:lang="en-US" to the root element:
root.attrib[QName("http://www.w3.org/XML/1998/namespace", "lang")] = "en-US"
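As a fuller sketch of the same idea, the snippet below builds the whole <speak> root element. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages (an assumption made for illustration); both libraries accept attribute names in the {namespace-uri}local form. The constant names SSML_NS and XML_NS are made up for this example.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"  # default namespace of <speak>
XML_NS = "http://www.w3.org/XML/1998/namespace"  # reserved URI behind the xml: prefix

# Serialize SSML elements without a prefix, i.e. with xmlns="...".
ET.register_namespace("", SSML_NS)

root = ET.Element(f"{{{SSML_NS}}}speak", {"version": "1.0"})
# The xml: prefix is predefined, so no xmlns:xml declaration is needed.
root.set(f"{{{XML_NS}}}lang", "en-US")

speak_xml = ET.tostring(root, encoding="unicode")
print(speak_xml)
```

In lxml, the default namespace would instead be passed to Element as nsmap={None: SSML_NS}. Setting xmlns as an ordinary attribute, as in the question, does not put the element into that namespace.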

Related

Parsing XML file to Nested DICT

I'm trying to save data from an XML file into a nested dict. In my XML file, shown below, I have multiple tags called DOCUMENT, and nested in each of them is a variable number of tags called LINK. Inside the links I have some URLs in ADDRESS tags.
<document>
<description>blah, blah, blah</description>
<link>
<description>Document1</description>
<address>url 1</address>
</link>
<link>
<description>Document23</description>
<address>url 2</address>
</link>
<link>
<description>Document43</description>
<address>url 3</address>
</link>
<regNum>201801289307</regNum>
<order>3</order>
<seqNum>24447778</seqNum>
<codType>6</codType>
<descType>Blah</descType>
</document>
I have created a dict like this:
op = {}
op['doc_dict'] = {"descriDoc":[], "orderDoc":[], "seqNum":[], "codType":[], "descType":[]}
op['doc_dict']['link_dict'] = {"seqNum":[], "linkUrl":[]}
I would like to end up with a dict in which I can match each URL inside the LINK tags to its parent DOCUMENT, using the value inside the seqNum tag:
{'doc_dict': {'descriDoc': ["blah, blah, blah"], 'orderDoc': ["4"], 'seqNum': ["24447779"],
'codType': ["6"], 'descType': ["Blah1"],
'link_dict': {'seqNum': ["24447779"], 'linkUrl': ["url 5", "url 7", "url 9"]}}}
Any idea on how to get the above DICT would be great. All my approaches failed.
Cheers,
I solved the problem with a list comprehension:
import xml.etree.ElementTree as ET  # assumption: ET is the standard-library ElementTree

def edicao(filename):
    """Return one dict per <document>, each holding a list of its links."""
    op = []
    tree = ET.parse(filename)  # read in the XML
    for item in tree.iter(tag='document'):
        seq_num = item.find('seqNum').text
        doc = {}
        doc["descriDoc"] = item.find('description').text
        doc["orderDoc"] = item.find('order').text
        doc["seqNum"] = seq_num
        # Each link carries its parent's seqNum so it can be matched back later.
        doc["links"] = [{'seqNum': seq_num,
                         'descricaoDoc': e.find('description').text,
                         'url': e.find('address').text}
                        for e in item.findall('link')]
        op.append(doc)
    return op
Cheers,
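For the lookup described in the question (finding the URLs that belong to a document by its seqNum), a dictionary keyed by seqNum may be more convenient than a list. This is a minimal sketch, assuming the element names from the sample above. The function name links_by_seqnum, the <documents> wrapper element and the shortened sample data are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample with two documents under a made-up wrapper.
SAMPLE = """
<documents>
  <document>
    <description>blah, blah, blah</description>
    <link><description>Document1</description><address>url 1</address></link>
    <link><description>Document23</description><address>url 2</address></link>
    <seqNum>24447778</seqNum>
  </document>
  <document>
    <description>more blah</description>
    <link><description>Document43</description><address>url 3</address></link>
    <seqNum>24447779</seqNum>
  </document>
</documents>
"""

def links_by_seqnum(root):
    """Map each document's seqNum to the list of URLs in its <link> children."""
    result = {}
    for doc in root.iter("document"):
        seq_num = doc.findtext("seqNum")
        urls = [link.findtext("address") for link in doc.findall("link")]
        result.setdefault(seq_num, []).extend(urls)
    return result

lookup = links_by_seqnum(ET.fromstring(SAMPLE))
print(lookup)
```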

Would like to output xml file as in body from python using lxml

I would like to output the following at the head of an XML file.
I can find a lot on parsing and validating, but not much on creation and output.
I can find some documentation on QName, but how do I output this?
`
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<gdml xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="http://service-spi.web.cern.ch/service-spi/app /releases/GDML/schema/gdml.xsd">`
Use QName to create the attribute (noNamespaceSchemaLocation) that is bound to the http://www.w3.org/2001/XMLSchema-instance namespace.
from lxml.etree import QName, Element, tostring
qname = QName("http://www.w3.org/2001/XMLSchema-instance", "noNamespaceSchemaLocation")
attr_dict = {qname: "http://service-spi.web.cern.ch/service-spi/app /releases/GDML/schema/gdml.xsd"}
gdml = Element("gdml", attr_dict)
print(tostring(gdml, encoding="UTF-8", standalone=False).decode())
Output:
<?xml version='1.0' encoding='UTF-8' standalone='no'?>
<gdml xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://service-spi.web.cern.ch/service-spi/app /releases/GDML/schema/gdml.xsd"/>
The namespace declaration (xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance") is created automatically.
Thanks - I already did it another way
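from lxml import etree as ET  # assumption: ET refers to lxml.etree
NS = 'http://www.w3.org/2001/XMLSchema-instance'
# The attribute's local name is noNamespaceSchemaLocation (lowercase "s" in "space").
location_attribute = '{%s}noNamespaceSchemaLocation' % NS
gdml = ET.Element('gdml', attrib={location_attribute: 'http://service-spi.web.cern.ch/service-spi/app/releases/GDML/schema/gdml.xsd'})
print(gdml.tag)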

How to access xml field with lxml?

Python 3.6, Lxml, Windows 10
I am going crazy. I want to access the item field, but I always get this error:
AttributeError: 'cython_function_or_method' object has no attribute 'item'
Everything else (address fields etc.) I can access without problems. How can I access the item fields (sku, amount etc.)?
I've used this code:
import requests
from lxml import objectify
url = "URL_TO_XML_FILE"
xml_content = requests.get(url).text.encode('utf-8')
xml = objectify.fromstring(xml_content)
for sale in xml.response.sales.sale:
    for item in sale.items.item:
        print(item.sku)
Here is the beginning of the xml:
<?xml version="1.0" encoding="ISO-8859-1"?>
<getnewsalesresult xmlns="https://pmcdn.priceminister.com/res/schema/getnewsales">
<request>
<version>2017-08-07</version>
<user>SELLER</user>
</request>
<response>
<lastversion>2017-08-07</lastversion>
<sellerid>95029358</sellerid>
<sales>
<sale>
<purchaseid>297453287592813953</purchaseid>
<purchasedate>15/12/2018-19:10</purchasedate>
<deliveryinformation>
<shippingtype>Normal</shippingtype>
<isfullrsl>N</isfullrsl>
<purchasebuyerlogin><![CDATA[LOGIN]]></purchasebuyerlogin>
<purchasebuyeremail>EMAIL</purchasebuyeremail>
<deliveryaddress>
<civility>Mme</civility>
<lastname><![CDATA[Lastname]]></lastname>
<firstname><![CDATA[Firstname]]></firstname>
<address1><![CDATA[STREET]]></address1>
<address2><![CDATA[]]></address2>
<zipcode>13570</zipcode>
<city><![CDATA[Paris]]></city>
<country><![CDATA[France]]></country>
<countryalpha2>FX</countryalpha2>
<phonenumber1></phonenumber1>
<phonenumber2>PHONENUMBER</phonenumber2>
</deliveryaddress>
</deliveryinformation>
<items>
<item>
<sku><![CDATA[SKU1]]></sku>
<advertid>411812243030</advertid>
<advertpricelisted>
<amount>15.99</amount>
<currency>EUR</currency>
</advertpricelisted>
<itemid>551131040</itemid>
<headline><![CDATA[HEADLINE]]></headline>
<itemstatus><![CDATA[REQUESTED]]></itemstatus>
<ispreorder>N</ispreorder>
<isnego>N</isnego>
<negotiationcomment></negotiationcomment>
<price>
<amount>15.99</amount>
<currency>EUR</currency>
</price>
<isrsl>N</isrsl>
<isbn></isbn>
<ean>4363745894373857474; </ean>
<paymentstatus><![CDATA[INCOMING]]></paymentstatus>
<sellerscore></sellerscore>
</item>
</items>
</sale>
<sale>
The problem is that items is actually a method of ObjectifiedElement, so the expression sale.items actually returns the method, because it has precedence.
To get the 'items' child element you want, you have to bypass the normal attribute lookup. Python looks for methods of the class first and only falls back to __getattr__ when that lookup fails, so the method always wins. You can call the fallback yourself:
sale.__getattr__('items')
This will also work (it's a dictionary-like interface to the attributes of an object):
sale.__dict__['items']
The revised code:
import requests
from lxml import objectify
url = "URL_TO_XML_FILE"
xml_content = requests.get(url).text.encode('utf-8')
xml = objectify.fromstring(xml_content)
for sale in xml.response.sales.sale:
    for item in sale.__dict__['items'].item:
        print(item.sku)
Another way to deal with this is to avoid using the flaky attribute interface:
for sale in xml['response']['sales']['sale']:
    for item in sale['items']['item']:
        print(item['sku'])
Using the dict-like indexing interface, you never have to worry about certain attribute names (which include such common words as items, index, keys, remove, replace, tag, set, text, and values) returning surprising results.
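The shadowing itself is ordinary Python behaviour and can be reproduced without lxml. The sketch below uses a made-up Node class (not lxml's implementation) whose children are reachable through __getattr__ and __getitem__. Because __getattr__ is only consulted when normal lookup fails, a child named items is hidden behind the method of the same name.

```python
class Node:
    """Toy stand-in for an objectified element (hypothetical, not lxml code)."""

    def __init__(self, **children):
        self._children = children

    def items(self):
        # A regular method, standing in for the items() method of an element.
        return []

    def __getattr__(self, name):
        # Only called when normal attribute lookup has already failed.
        try:
            return self._children[name]
        except KeyError:
            raise AttributeError(name) from None

    def __getitem__(self, name):
        # Index access never competes with method names.
        return self._children[name]


sale = Node(items="<items element>", purchaseid="297453287592813953")

print(callable(sale.items))       # the method wins over the child
print(sale.__getattr__("items"))  # the explicit fallback reaches the child
print(sale["items"])              # so does index access
print(sale.purchaseid)            # no clash, so attribute access works
```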

How to get the Structure/Template id by Structure/Template name

I need to create a JournalArticle with a Structure and a Template. The method that creates the JournalArticle expects a StructureId and a TemplateId, but these are generated by Liferay. How can I get both IDs by name?
Create and execute a DynamicQuery, like so (just replace Template with Structure to get structures):
DynamicQuery q = DynamicQueryFactoryUtil.forClass(DDMTemplate.class)
.add(PropertyFactoryUtil.forName("name").like("%YOUR NAME%"));
List<DDMTemplate> templates = DDMTemplateLocalServiceUtil.dynamicQuery(q);
You have to use like since the names of the structures/templates are saved like so:
<?xml version='1.0' encoding='UTF-8'?>
<root available-locales="de_DE" default-locale="de_DE">
<Name language-id="de_DE">YOUR NAME</Name>
</root>
There can be different names for different locales.
You can get StructureId (called DDMStructure) with this code
long classNameIdJournalArticle = ClassNameLocalServiceUtil.getClassNameId(JournalArticle.class);
DDMStructure ddmStructure = DDMStructureLocalServiceUtil.getStructure(groupId, classNameIdJournalArticle, "myDDMStructureName");
And TemplateId (called DDMTemplate) with this code
DDMTemplate ddmTemplate = DDMTemplateLocalServiceUtil.getTemplate(groupId, classNameIdDDMStructure, "ddmTemplateName");

Parse XML using Groovy: Override charset in declaration and add XML processing instruction

My initial question has been answered, but that just opened up further issues.
Example code
Using Groovy 2.0.5 JVM 1.6.0_31
import groovy.xml.*
import groovy.xml.dom.DOMCategory
def xml = '''<?xml version="1.0" encoding="UTF-16"?>
| <?xml-stylesheet type="text/xsl" href="Bp8DefaultView.xsl"?>
|<root>
| <Settings>
| <Setting name="CASEID_SEQUENCE_SIZE">
| <HandlerURL>
| <![CDATA[ admin/MainWindow.jsp ]]>
| </HandlerURL>
| </Setting>
| <Setting name="SOMETHING_ELSE">
| <HandlerURL>
| <![CDATA[ admin/MainWindow.jsp ]]>
| </HandlerURL>
| </Setting>
| </Settings>
|</root>'''.stripMargin()
def document = DOMBuilder.parse( new StringReader( xml ) )
def root = document.documentElement
// Edit: Added the line below
def pi = document.createProcessingInstruction('xml-stylesheet', 'type="text/xsl" href="Bp8DefaultView.xsl"');
// Edit #2: Added line below
document.insertBefore(pi, root)
use(DOMCategory) {
    root.Settings.Setting.each {
        // DOMCategory exposes attributes through the '@' prefix
        if( it.'@name' == 'CASEID_SEQUENCE_SIZE' ) {
            it[ '@value' ] = 100
        }
    }
}
def outputfile = new File( 'c:/temp/output.xml' )
XmlUtil.serialize( root , new PrintWriter(outputfile))
// Edit #2: Changed from root to document.documentElement to see if that
// would make any difference
println XmlUtil.serialize(document.documentElement)
Problem description
I'm trying to parse an XML file exported from a third-party tool, and before promoting it to stage and production I need to replace certain attribute values. That part works, but in addition I must keep the encoding and the reference to the stylesheet.
Since this is pretty static, it is fine to keep both the encoding and the stylesheet reference in a property file, meaning I do not need to find the declarations in the original file first.
Encoding in declaration issue
As shown in this answer found here on Stack Overflow, you can do
new File('c:/data/myutf8.xml').write(f,'utf-8')
I have also tried
XmlUtil.serialize( root , new GroovyPrintStream('c:/temp/output.txt', 'utf-16'))
but that did not solve the problem for me either. I still do not understand how to override the encoding in the declaration.
Processing instruction issue
Simply, how do I add
<?xml-stylesheet type="text/xsl" href="Bp8DefaultView.xsl"?>
to the output?
Update - This can be done like this
def pi = document.createProcessingInstruction('xml-stylesheet', 'type="text/xsl" href="Bp8DefaultView.xsl"');
The processing instruction is added as this guideline showed me, but I still do not see it in the output.
document.insertBefore(pi, root) // Fails
All issues in this question have been answered in another question I raised; see Groovy and XML: Not able to insert processing instruction
The trick is that I expected
document.documentElement
to contain the processing instruction. But that is wrong, documentElement is:
...This is a convenience attribute that allows direct access to the child node that is the document element of the document...
The processing instruction is just another child node of the document, alongside the document element. So what I had to use instead was either the LSSerializer or the Transformer. See Serialize XML processing instruction before root element for details.
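The same distinction between the document node and its documentElement exists in other DOM implementations. A minimal sketch in Python's xml.dom.minidom (chosen here only because it is easy to run; it is not the Groovy solution from the linked answers) shows that the processing instruction and the encoding declaration appear only when the whole document is serialized. The one-line input document is a shortened stand-in for the sample above.

```python
from xml.dom import minidom

doc = minidom.parseString("<root><Settings/></root>")

# Create the processing instruction and insert it before the root element.
pi = doc.createProcessingInstruction(
    "xml-stylesheet", 'type="text/xsl" href="Bp8DefaultView.xsl"'
)
doc.insertBefore(pi, doc.documentElement)

# Serializing only the root element leaves out the processing instruction.
root_only = doc.documentElement.toxml()

# Serializing the document node emits declaration, PI and root element.
# With an explicit encoding, toxml() returns bytes, so decode them for display.
whole_doc = doc.toxml(encoding="UTF-16").decode("UTF-16")

print(root_only)
print(whole_doc)
```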
