Extract data from an XML string with xml.etree.ElementTree

I have an XML string and I need to extract the first three "col" tags in each "row" group. In other words, the output should be the following:
['1043100330', 'Smith', 'John', '1043100331', 'Swartz', 'Francis', '1043100332', 'Laff', 'Michael']
This is the XML:
data = '''<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Body>
<processOXIMessageResponse xmlns="urn:com:singun:webservice" xmlns:ns="urn:com:singun:webservice">
<ns1:processOXIMessageReturn xmlns:ns1="urn:com:singun:webservice">
<SingunDocument xmlns="C" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" protocol="OXI">
<sessionId xmlns="">1613762599483</sessionId>
<command xmlns="" echo="" xsi:type="ServiceProviderGetListResponse">
<serviceProviderTable>
<colHeading>User Id</colHeading>
<colHeading>Last Name</colHeading>
<colHeading>First Name</colHeading>
<colHeading>Email Address</colHeading>
<colHeading>Phone Number</colHeading>
<colHeading>Extension</colHeading>
<colHeading>Country Code</colHeading>
<colHeading>National Prefix</colHeading>
<row>
<col>1043100330</col>
<col>Smith</col>
<col>John</col>
<col>jsmith@example.com</col>
<col>1043101330</col>
<col>1330</col>
<col>52</col>
<col>52</col>
</row>
<row>
<col>1043100331</col>
<col>Swartz</col>
<col>Francis</col>
<col>fswartz@example.com</col>
<col>1043101331</col>
<col>1331</col>
<col>52</col>
<col>52</col>
</row>
<row>
<col>1043100332</col>
<col>Laff</col>
<col>Michael</col>
<col>mlaff@example.com</col>
<col>1043101332</col>
<col>1332</col>
<col>52</col>
<col>52</col>
</row>
</serviceProviderTable>
</command>
</SingunDocument>
</ns1:processOXIMessageReturn>
</processOXIMessageResponse>
</soapenv:Body>
</soapenv:Envelope>'''
The following is my code, but it extracts only the first "col" tag instead of the first three of each group as I would like:
import xml.etree.ElementTree as ET
users = []
root = ET.fromstring(resp)
for col in root.iterfind('.//row/col[1]'):
    users.append(col.text)
print(users)
This is the output of my code:
['1043100330', '1043100331', '1043100332']
Could you give me a hand with this? Thanks a lot.

You need to take into account the namespaces, as well as the structure of your xml. Try something like:
data = root.findall('.//{C}serviceProviderTable//{C}row')
for datum in data:
    entries = datum.findall('.//{C}col')
    users.append([entry.text.strip() for entry in entries[:3]])
for user in users:
    print(user)
Output:
['1043100330', 'Smith', 'John']
['1043100331', 'Swartz', 'Francis']
['1043100332', 'Laff', 'Michael']
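If the flat list from the question is what is needed, the per-row values can be collected into a single list. One thing worth checking first: in the XML as posted, `<command xmlns="">` resets the default namespace, so `serviceProviderTable`, `row` and `col` would be in no namespace there, and unqualified names would be the ones that match. The following is a minimal sketch under that assumption. The shortened `sample` string and the `first_three_cols` helper are illustrative names, not part of the original post.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the SOAP response in the question.
sample = '''<SingunDocument xmlns="C" protocol="OXI">
<command xmlns="">
<serviceProviderTable>
<row><col>1043100330</col><col>Smith</col><col>John</col><col>jsmith@example.com</col></row>
<row><col>1043100331</col><col>Swartz</col><col>Francis</col><col>fswartz@example.com</col></row>
</serviceProviderTable>
</command>
</SingunDocument>'''


def first_three_cols(xml_text):
    """Return the text of the first three <col> children of every <row>, flattened."""
    root = ET.fromstring(xml_text)
    values = []
    for row in root.iter('row'):       # unqualified name: xmlns="" removed the namespace
        cols = row.findall('col')[:3]  # direct children only, first three
        values.extend((col.text or '').strip() for col in cols)
    return values


users = first_three_cols(sample)
print(users)
```

ElementTree's XPath subset accepts an integer index such as `col[1]` but not a range predicate like `position() <= 3`, which is why the sketch slices the list in Python instead.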

Related

XSLT Can't Read an Excel XML File?

I'm using XSLT / XPath to browse some of the XML files you get when you unzip an Excel file. I found a "relationships" file workbook.xml.rels that I don't seem to be able to read, using code similar to that which successfully read the workbook.xml file.
Here's some of the workbook.xml file:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
...
<sheets>
<sheet name="Sheet1"
sheetId="2"
r:id="rId1"/>
<sheet name="Test Sheet"
sheetId="1"
r:id="rId2"/>
</sheets>
...
</workbook>
Here's the workbook.xml.rels file:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId3"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme"
Target="theme/theme1.xml"/>
<Relationship Id="rId2"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet"
Target="worksheets/sheet2.xml"/>
<Relationship Id="rId1"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet"
Target="worksheets/sheet1.xml"/>
<Relationship Id="rId5"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/sharedStrings"
Target="sharedStrings.xml"/>
<Relationship Id="rId4"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles"
Target="styles.xml"/>
</Relationships>
Here's some of the XSLT:
<?xml version="1.0"?>
<!-- greeting.xsl -->
<xsl:stylesheet
...
<xsl:output method="text"/>
<xsl:variable name="baseDir" select="replace(document-uri(.), '(.*[\\/]xl).*', '$1/')"/>
<xsl:variable name="workbook" select="concat($baseDir, 'workbook.xml')"/>
<xsl:variable name="theSheetId" select="doc($workbook)/workbook/sheets/sheet[matches(@name, 'Test Sheet')]/@r:id"/>
<xsl:variable name="workbook_rels" select="concat($baseDir, '_rels/workbook.xml.rels')"/>
<!-- code to read workbook.xml.rels -->
<xsl:variable name="theSheet" select="doc($workbook_rels)/Relationships/Relationship[matches(@Id, $theSheetId)]/@Target"/>
<xsl:template match="/">
<xsl:text>
baseDir = </xsl:text><xsl:value-of select="$baseDir"/>
<xsl:text>
workbook = </xsl:text><xsl:value-of select="$workbook"/>
<xsl:text>
workbook_rels = </xsl:text><xsl:value-of select="$workbook_rels"/>
<xsl:text>
theSheetId = </xsl:text><xsl:value-of select="$theSheetId"/>
<xsl:text>
theSheet = </xsl:text><xsl:value-of select="$theSheet"/>
<xsl:text>
end</xsl:text>
</xsl:template>
</xsl:stylesheet>
And the output:
baseDir = file:/C:/Training/sandbox/conv_/xl/
workbook = file:/C:/Training/sandbox/conv_/xl/workbook.xml
workbook_rels = file:/C:/Training/sandbox/conv_/xl/_rels/workbook.xml.rels
theSheetId = rId2
theSheet = **<I get nothing here>**
end
You can see that the 'theSheetId' variable is set correctly when reading workbook.xml. But when I use that variable to get the corresponding Target value from workbook.xml.rels into the 'theSheet' variable, I get nothing. I tried replacing the matches expression with just a number, but I still get nothing. Is there a problem with reading this type of file?
Suggestions? Thanks!

The use of matches and replace suggests you are using an XSLT 2 or 3 processor, so you can declare xpath-default-namespace. You just have to change it in the sections that deal with elements from a different namespace, e.g. <xsl:variable name="theSheet" select="doc($workbook_rels)/Relationships/Relationship[matches(@Id, $theSheetId)]/@Target" xpath-default-namespace="http://schemas.openxmlformats.org/package/2006/relationships"/>.
Given the samples, I would rather use a key, <xsl:key name="rel" match="Relationships/Relationship" use="@Id" xpath-default-namespace="http://schemas.openxmlformats.org/package/2006/relationships"/>, and then <xsl:variable name="theSheet" select="key('rel', $theSheetId, doc($workbook_rels))/@Target"/>. Either way, using xpath-default-namespace to declare the relevant namespace when selecting elements from a particular document is probably what is missing in your XSLT.
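The same namespace effect can be reproduced outside XSLT. The following is a small Python sketch, not the answer's stylesheet, that looks up a relationship by Id in a shortened copy of the workbook.xml.rels sample. The `rel` prefix and the variable names are arbitrary choices made for this illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the workbook.xml.rels sample in the question.
REL_NS = {'rel': 'http://schemas.openxmlformats.org/package/2006/relationships'}

# Shortened copy of the sample file.
rels_xml = '''<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet" Target="worksheets/sheet2.xml"/>
<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet" Target="worksheets/sheet1.xml"/>
</Relationships>'''

root = ET.fromstring(rels_xml)

# Unqualified name: matches nothing, because the elements live in the relationships namespace.
unqualified = root.findall("Relationship[@Id='rId2']")

# Qualified name: matches the entry, whose Target attribute can then be read.
qualified = root.find("rel:Relationship[@Id='rId2']", REL_NS)
target = qualified.get('Target')

print(len(unqualified), target)
```

The attributes `Id` and `Target` carry no prefix in the file, so they are in no namespace and need no qualification. Only the element names do.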

move an element to another element or create a new one if it does not exist using xslt-3

Using XSLT 3, I need to take the values of all content elements and move them into the title elements. If a record already has a title element, the content value needs to be appended to it with a separator such as " - ". I have now posted my real data, since the solution below does not solve the problem when applied to something like this:
Example input:
<data>
<RECORD ID="31365">
<no>25099</no>
<seq>0</seq>
<date>2/4/2012</date>
<ver>2/4/2012</ver>
<access>021999</access>
<col>GS</col>
<call>889</call>
<pr>0</pr>
<days>0</days>
<stat>0</stat>
<ch>0</ch>
<title>1 title</title>
<content>1 content</content>
<sj>1956</sj>
</RECORD>
<RECORD ID="31366">
<no>25100</no>
<seq>0</seq>
<date>2/4/2012</date>
<ver>2/4/2012</ver>
<access>022004</access>
<col>GS</col>
<call>8764</call>
<pr>0</pr>
<days>0</days>
<stat>0</stat>
<ch>0</ch>
<sj>1956</sj>
<content>1 title</content>
</RECORD>
</data>
expected output:
<data>
<RECORD ID="31365">
<no>25099</no>
<seq>0</seq>
<date>2/4/2012</date>
<ver>2/4/2012</ver>
<access>021999</access>
<col>GS</col>
<call>889</call>
<pr>0</pr>
<days>0</days>
<stat>0</stat>
<ch>0</ch>
<title>1 title - 1 content</title>
<sj>1956</sj>
</RECORD>
<RECORD ID="31366">
<no>25100</no>
<seq>0</seq>
<date>2/4/2012</date>
<ver>2/4/2012</ver>
<access>022004</access>
<col>ΓΣ</col>
<call>8764</call>
<pr>0</pr>
<days>0</days>
<stat>0</stat>
<ch>0</ch>
<sj>1956</sj>
<title>1 title</title>
</RECORD>
</data>
With my attempt, I did not manage to move the elements. I just got an empty line where the content element used to be, so please include the removal of blank lines in the suggested solution.
I believe the blank lines could be removed with
<xsl:template match="text()"/>

One way to achieve this is the following stylesheet. It uses XSLT 3.0 text value templates, enabled by expand-text="true".
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0" expand-text="true">
<xsl:output method="xml" indent="yes" />
<xsl:mode on-no-match="shallow-copy" />
<xsl:strip-space elements="*" /> <!-- Remove space between elements -->
<xsl:template match="RECORD">
<xsl:copy>
<xsl:copy-of select="@*" />
<title>{title[1]}{if (title[1]) then ' - ' else ''}<xsl:value-of select="content" separator=" " /></title>
<xsl:apply-templates select="node() except (title,content)" />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Its output is as desired.
If you want to separate the <content> elements with a -, too, you can simplify the core <title> expression to
<xsl:value-of select="title|content" separator=" - " />
EDIT:
All I changed was replacing chapter with RECORD, and it's working fine with Saxon-HE 9.9.1.4J. The only difference in the output is that the title element is always at the first position, but that shouldn't matter. I also added a directive to remove space between elements.
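For cross-checking the expected result outside an XSLT processor, the same merge can be sketched in Python with the standard library. This is an illustrative equivalent under stated assumptions, not the stylesheet above: it keeps an existing title where it is, renames the first content element to title when no title exists, and joins every value with " - ". The `sample` string is a shortened form of the input and `merge_content_into_title` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened form of the question's input.
sample = '''<data>
<RECORD ID="31365"><no>25099</no><title>1 title</title><content>1 content</content><sj>1956</sj></RECORD>
<RECORD ID="31366"><no>25100</no><sj>1956</sj><content>1 title</content></RECORD>
</data>'''


def merge_content_into_title(record, separator=' - '):
    """Fold all <content> children of a record into its <title>."""
    title = record.find('title')
    contents = record.findall('content')
    if not contents:
        return
    if title is None:
        # No title yet: turn the first <content> into the <title>, keeping its position.
        title = contents.pop(0)
        title.tag = 'title'
    for content in contents:
        title.text = (title.text or '') + separator + (content.text or '')
        record.remove(content)


root = ET.fromstring(sample)
for record in root.findall('RECORD'):
    merge_content_into_title(record)

titles = [record.findtext('title') for record in root.findall('RECORD')]
print(titles)
```

Unlike the stylesheet, this variant leaves the title at its original position, which matches the element order shown in the expected output.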

How to get the required values from the below mentioned xml file?

I want to read the XML file below and access its values. I have already tried many approaches but could not get at them. For example, I want the 'NightRaidPerformanceCPUScore' value and the passIndex it belongs to.
<?xml version='1.0' encoding='utf8'?>
<benchmark>
<results>
<result>
<name />
<description />
<passIndex>-1</passIndex>
<sourceId>C:\Users\dgadhipx\Documents\3DMark\3dmark-autosave-20200401155825.3dmark-result</sourceId>
<NightRaidPerformance3DMarkScore>2066</NightRaidPerformance3DMarkScore>
<NightRaidPerformanceCPUScore>1454</NightRaidPerformanceCPUScore>
<NightRaidPerformanceGraphicsScore>2233</NightRaidPerformanceGraphicsScore>
<benchmarkRunId>8045dec5-e97c-452b-abeb-54af187fd50a</benchmarkRunId>
</result>
<result>
<name />
<description />
<passIndex>0</passIndex>
<sourceId>C:\Users\dgadhipx\Documents\3DMark\3dmark-autosave-20200401155825.3dmark-result</sourceId>
<NightRaidPerformanceCPUScoreForPass>1454</NightRaidPerformanceCPUScoreForPass>
<NightRaidPerformance3DMarkScoreForPass>2066</NightRaidPerformance3DMarkScoreForPass>
<NightRaidPerformanceGraphicsScoreForPass>2233</NightRaidPerformanceGraphicsScoreForPass>
<NightRaidPerformanceGraphicsTest1>9.57</NightRaidPerformanceGraphicsTest1>
<NightRaidPerformanceGraphicsTest2>12.18</NightRaidPerformanceGraphicsTest2>
<NightRaidCpuP>395.2</NightRaidCpuP>
<benchmarkRunId>8045dec5-e97c-452b-abeb-54af187fd50a</benchmarkRunId>
</result>
</results>
</benchmark>

You can use BeautifulSoup as follows:
from bs4 import BeautifulSoup

# file_path is assumed to hold the path of the XML file
with open(file_path, "r") as f:
    content = f.read()
xml = BeautifulSoup(content, 'xml')  # the 'xml' parser requires lxml to be installed
elements = xml.find_all("NightRaidPerformanceCPUScore")
for i in elements:
    print(i.text)
That prints the values of all "NightRaidPerformanceCPUScore" tags.
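The question also asks which passIndex each value belongs to. A minimal sketch with the standard library's xml.etree.ElementTree could pair the two by walking each result element. The shortened `sample` string and the `scores_by_pass` helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the benchmark file in the question.
sample = '''<benchmark>
<results>
<result>
<passIndex>-1</passIndex>
<NightRaidPerformanceCPUScore>1454</NightRaidPerformanceCPUScore>
</result>
<result>
<passIndex>0</passIndex>
<NightRaidPerformanceCPUScoreForPass>1454</NightRaidPerformanceCPUScoreForPass>
</result>
</results>
</benchmark>'''


def scores_by_pass(xml_text, tag):
    """Return (passIndex, value) pairs for every <result> that contains `tag`."""
    root = ET.fromstring(xml_text)
    pairs = []
    for result in root.iter('result'):
        value = result.findtext(tag)  # None when this result has no such child
        if value is not None:
            pairs.append((result.findtext('passIndex'), value))
    return pairs


cpu_scores = scores_by_pass(sample, 'NightRaidPerformanceCPUScore')
print(cpu_scores)
```

To read from a file instead of a string, `ET.parse(path).getroot()` can replace the `ET.fromstring` call.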

Modify Excel Sheet with data from XML using XSLT

I need to provide a report in Excel format from a Template Excel and an XML file that contains the data (created from SQL request: report_data.xml). But I can't use XPath expression to navigate through an Excel sheet, and I can't select a specific row in the template to duplicate with the data from the report_data.xml
In order to achieve this, I've first "unzipped" the Excel template in order to have access to the individual sheets in .xml format. At the same time I'm setting "source" files that will be used as the default file (eg: source-sheet1.xml, source-sharedstring.xml, ...) to create the new populated files.
I can't find a way to select a specific row in the template to duplicate with the data from the xml.
I've tested this using Xalan 2.7.2 and Saxon 9.7.0.15 (XSLT 1.0).
source-sheet1.xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
mc:Ignorable="x14ac"
xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac">
<dimension ref="A1:G1"/>
<sheetViews>
<sheetView tabSelected="1" workbookViewId="0">
<selection activeCell="B5" sqref="B5"/>
</sheetView>
</sheetViews>
<sheetFormatPr baseColWidth="10" defaultColWidth="9.140625" defaultRowHeight="15" x14ac:dyDescent="0.25"/>
<sheetData>
<row r="1" spans="1:7" x14ac:dyDescent="0.25">
<c r="A1" t="s">
<v>0</v>
</c><c r="B1" t="s">
<v>1</v>
</c><c r="C1" t="s">
<v>2</v>
</c><c r="D1" t="s">
<v>3</v>
</c><c r="E1" t="s">
<v>4</v>
</c><c r="F1" t="s">
<v>5</v>
</c><c r="G1" t="s">
<v>6</v>
</c>
</row>
</sheetData>
<pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/>
</worksheet>
XSLT:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0"
xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions"
xmlns:math="http://exslt.org/math"
xmlns:set="http://exslt.org/sets"
xmlns:exslt="http://exslt.org/common"
xmlns:redirect="org.apache.xalan.xslt.extensions.Redirect"
xmlns:office="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
extension-element-prefixes="redirect"
exclude-result-prefixes="fo fn math set redirect office r ss">
<xsl:variable name="SrcSheet1" select="document('XLSX/source-sheet1.xml')"/>
<xsl:template match="Report">
<xsl:call-template name="Test1"/>
</xsl:template>
<xsl:template name="Test1">
<xsl:for-each select="$SrcSheet1/Worksheet/sheetData/row">
<xsl:message>
row = <xsl:value-of select="position()"/>
colcount= <xsl:value-of select="count(./c)"/>
</xsl:message>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
report_data.xml:
<?xml version="1.0" encoding="utf-8"?>
<Report attrib1="foo" ...>
<Data attrib1="foo" ...>
<SubData attrib1="foo" ... />
...
</Data>
...
</Report>
I would like to be able to "read" source-sheet1.xml, copy its rows, and then change the values in each column with the data from report_data.xml where needed (styles and fonts depend on whether it is the 1st, 2nd, ... row for the same data).
If this is not the right way to generate an Excel report from XML data, with styles that depend on the "position" of the data in the Excel file, what would be a better approach?

XSLT Count Distinct in a Group (Data View Web Part)

I am working on a data view web part in SP 2010 to roll up some information from sub sites. These sites contain project task lists that have tasks assigned to users. The customer wants to see a rollup that will list projects by user based on whether or not they have tasks assigned. Since a project can have multiple tasks, a project could show up under multiple users in the rollup, depending on which tasks are assigned to which users.
Long story short, I need a way to select the count of distinct projects for each user.
The xml structure is like this:
<Rows>
<Row Project="Project 1" TaskID="1" AssignedTo="Worker A" />
<Row Project="Project 1" TaskID="2" AssignedTo="Worker B" />
</Rows>
From this, I would expect the following for project counts:
Worker A: 1
Worker B: 1
I am trying the following formula:
count($nodeset[not(@Project=preceding-sibling::Row/@Project) and @AssignedTo = current()/@AssignedTo])
This returns a count of 1 project for Worker A, but it returns 0 for Worker B, because Project 1 has already been counted for Worker A.
I don't know much about XSLT, but from what I understand, using current() is not the most resource-friendly method. I've tried Muenchian grouping but haven't been able to get it to work. I'm willing to try again, though. Any advice would be helpful since, again, I'm not good with XSLT.

This transformation:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:key name="kWorker" match="@AssignedTo" use="."/>
<xsl:key name="kProjByWorker" match="@Project" use="../@AssignedTo"/>
<xsl:key name="kProjWorker" match="@Project" use="concat(.,'+',../@AssignedTo)"/>
<xsl:template match="/*">
<xsl:apply-templates select=
"//@AssignedTo
[generate-id()
=
generate-id(key('kWorker',.)[1])
]"/>
</xsl:template>
<xsl:template match="@AssignedTo">
<xsl:value-of select="concat('&#xA;',.,': ')"/>
<xsl:value-of select=
"count(
key('kProjByWorker', .)
[generate-id()
=
generate-id(key('kProjWorker', concat(.,'+',current()))[1])]
)"/>
</xsl:template>
</xsl:stylesheet>
when applied on the following XML document (a more challenging version of the provided one):
<Rows>
<Row Project="Project 1" TaskID="1" AssignedTo="Worker A" />
<Row Project="Project 1" TaskID="2" AssignedTo="Worker B" />
<Row Project="Project 1" TaskID="2" AssignedTo="Worker A" />
<Row Project="Project 2" TaskID="1" AssignedTo="Worker A" />
</Rows>
produces the wanted, correct result:
Worker A: 2
Worker B: 1
Explanation:
Use of the Muenchian grouping method twice: to find all different workers, and then to find the distinct projects assigned to this worker.
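The expected counts can be cross-checked outside XSLT. The following Python sketch is not Muenchian grouping itself. It swaps in a dictionary of sets to express the same two-level idea: group rows by worker, then keep each project once per worker. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The more challenging input document from the answer.
rows_xml = '''<Rows>
<Row Project="Project 1" TaskID="1" AssignedTo="Worker A" />
<Row Project="Project 1" TaskID="2" AssignedTo="Worker B" />
<Row Project="Project 1" TaskID="2" AssignedTo="Worker A" />
<Row Project="Project 2" TaskID="1" AssignedTo="Worker A" />
</Rows>'''

# First level: one entry per worker. Second level: a set, so each project counts once.
projects_by_worker = defaultdict(set)
for row in ET.fromstring(rows_xml).iter('Row'):
    projects_by_worker[row.get('AssignedTo')].add(row.get('Project'))

counts = {worker: len(projects) for worker, projects in projects_by_worker.items()}
for worker, n in counts.items():
    print(f'{worker}: {n}')
```

The set plays the role of the second key in the stylesheet, where the worker and project are concatenated so that only the first row of each pair is counted.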
