Parsing text of the given format in Python - python-3.x

I want to parse a file which looks like this:
<item> <one-of> <item> deepa vats </item> <item> deepa <ruleref uri="#Dg-e_n_t41"/> </item> </one-of> <tag> out = "u-dvats"; </tag> </item>
<item> <one-of> <item> maitha al owais </item> <item> doctor maitha </item> <item> maitha <ruleref uri="#Dg-clinical_nutrition24"/> </item> </one-of> <tag> out = "u-mal_owais"; </tag> </item>
The result should be username : out, for example:
deepa vats : u-dvats and maitha al owais : u-mal_owais
To extract the username, I tried:
print ([j for i,j in re.findall(r"(<item>)\s*(.*?)\s*(?!\1)(?:</item>)",line)])
if len(list1) != 0:
    print(list1[0].split("<item>")[-1])

You can parse the xml with objectify from lxml.
To parse an XML string you could use objectify.fromstring(). Then you can use dot notation or square bracket notation to navigate through the element and use the text property to get the text inside the element. Like so:
item = objectify.fromstring(item_str)
item_text = item.itemchild['anotherchild'].otherchild.text
From there you can manipulate the string and format it.
In this case I can see that you want the text inside item >> one-of >> item and the text inside item >> tag. In order to get it we could do something like this:
>>> from lxml import objectify
>>> item_str = '<item> <one-of> <item> maitha al owais </item> <item> doctor maitha </item> <item> maitha <ruleref uri="#Dg-clinical_nutrition24"/> </item> </one-of> <tag> out = "u-mal_owais"; </tag> </item>'
>>> item = objectify.fromstring(item_str)
>>> item_text = item['one-of'].item.text
>>> tag_text = item['tag'].text
>>> item_text
' maitha al owais '
>>> tag_text
' out = "u-mal_owais"; '
Since Python doesn't allow hyphens in attribute names, and since tag is already a property of the objectify object, you have to use bracket notation instead of dot notation for 'one-of' and 'tag'.
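The session above stops at the raw strings. As a minimal sketch of the remaining step, the following builds the username : out pairs asked for in the question. It uses the standard library's xml.etree.ElementTree instead of lxml.objectify (a substitution made only to keep the example dependency-free). The helper name parse_items, the synthetic <root> wrapper, and the regular expression for the out value are illustrative assumptions, not part of the original answer:

```python
import re
import xml.etree.ElementTree as ET

# Matches the value in a tag body such as: out = "u-dvats";
OUT_PATTERN = re.compile(r'out\s*=\s*"([^"]*)"')


def parse_items(text):
    """Map the first name of each top-level <item> to its out value."""
    # The input holds several top-level <item> elements, which is not
    # well-formed XML on its own, so wrap it in a synthetic root element.
    root = ET.fromstring("<root>" + text + "</root>")
    pairs = {}
    for item in root.findall("item"):
        first_name = item.find("one-of/item")  # first <item> inside <one-of>
        tag = item.find("tag")
        if first_name is None or tag is None:
            continue  # skip entries that do not follow the expected layout
        name = (first_name.text or "").strip()
        match = OUT_PATTERN.search(tag.text or "")
        if name and match:
            pairs[name] = match.group(1)
    return pairs


sample = (
    '<item> <one-of> <item> deepa vats </item> <item> deepa '
    '<ruleref uri="#Dg-e_n_t41"/> </item> </one-of> '
    '<tag> out = "u-dvats"; </tag> </item>\n'
    '<item> <one-of> <item> maitha al owais </item> <item> doctor maitha </item> '
    '<item> maitha <ruleref uri="#Dg-clinical_nutrition24"/> </item> </one-of> '
    '<tag> out = "u-mal_owais"; </tag> </item>'
)

for name, out in parse_items(sample).items():
    print(f"{name} : {out}")
# deepa vats : u-dvats
# maitha al owais : u-mal_owais
```

Only the first <item> inside each <one-of> is used as the username, which matches the expected output in the question. Collecting every alternative would need a loop over all of the <one-of> children.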

I suggest using BeautifulSoup:
import bs4
soup = bs4.BeautifulSoup(your_text, "lxml")
' '.join(x.strip() for x in soup.strings if x.strip())
#'deepa vats deepa out = "u-dvats"; maitha al owais doctor maitha maitha out = "u-mal_owais";'

Related

how to iterate xml nodes with groovy

I'm trying to iterate through an xml file with groovy to get some values.
I found many people with the same problem, but the solution they used doesn't work for me, or it's too complicated.
I'm not a Groovy dev, so I need a bulletproof solution that I can implement.
Basically I have an XML response file that looks like this (it looks bad, but that's what I get):
<Body>
<head>
<Details>
<items>
<item>
<AttrName>City</AttrName>
<AttrValue>Rome</AttrValue>
</item>
<item>
<AttrName>Street</AttrName>
<AttrValue>Via_del_Corso</AttrValue>
</item>
<item>
<AttrName>Number</AttrName>
<AttrValue>34</AttrValue>
</item>
</items>
</Details>
</head>
</Body>
I've already tried this solution I found here on StackOverflow to print the values:
def envelope = new XmlSlurper().parseText("the xml above")
envelope.Body.head.Details.items.item.each(item -> println( "${tag.name}") item.children().each {tag -> println( " ${tag.name()}: ${tag.text()}")} }
The best I get is:
ConsoleScript11$_run_closure1$_closure2#2bfec433
ConsoleScript11$_run_closure1$_closure2#70eb8de3
ConsoleScript11$_run_closure1$_closure2#7c0da10
Result: CityRomeStreetVia_del_CorsoNumber34
I can also remove everything after the first println, and anything inside it, and the result is the same.
My main goal here is not to print the values but to extract them from the XML and save them as string variables.
I know that using strings is not the best practice, but for now I just need to understand how this works.
Your code as written has 2 flaws:
1. With envelope.Body you would NOT find anything, because the object returned by parseText already represents the root element <Body>.
2. If you fix No. 1, you would run into multiple compile errors for each(item -> println( "${tag.name}"). Here ( is used instead of {, and you use an undefined tag variable.
The working code would look like:
import groovy.xml.*
def xmlBody = new XmlSlurper().parseText '''\
<Body>
 <head>
  <Details>
<items>
<item>
<AttrName>City</AttrName>
<AttrValue>Rome</AttrValue>
</item>
<item>
<AttrName>Street</AttrName>
<AttrValue>Via_del_Corso</AttrValue>
</item>
<item>
<AttrName>Number</AttrName>
<AttrValue>34</AttrValue>
</item>
</items>
 
  </Details>
 </head>
</Body>'''
xmlBody.head.Details.items.item.children().each { tag ->
    println("  ${tag.name()}: ${tag.text()}")
}
and prints:
AttrName: City
AttrValue: Rome
AttrName: Street
AttrValue: Via_del_Corso
AttrName: Number
AttrValue: 34

How to update all sub-sub tag of XML with different values using BeautifulSoup

Let's say this is my XML:
<sky class="new">
<list name="school">
<p>63</p>
<p>62</p>
<p>61</p>
</list>
</sky>
And these are my values, in a list:
value = [51,56,87]
Now what I need is:
<sky class="new">
<list name="school">
<p>51</p>
<p>56</p>
<p>87</p>
</list>
</sky>
So far this is what I did:
for i in soup.find_all('sky', {'class':'new'}):
    k = i.find('list',{'name':'school'})
I don't know what to do after this. Could you help?
EDIT1:
<sky class="new">
<list name="alpha">
<item>
<p unit="kg">63</p>
<p weight="wg">54</p>
</item>
<item>
<p unit="kg">57</p>
<p weight="wg">32</p>
</item>
</list>
</sky>
Another version:
from bs4 import BeautifulSoup
txt = '''<sky class="new">
<list name="school">
<p>63</p>
<p>62</p>
<p>61</p>
</list>
</sky>'''
soup = BeautifulSoup(txt, 'xml')
values = [51, 56, 87]
for p, new_value in zip(soup.select('sky.new > list[name="school"] > p'), values):
    p.string = str(new_value)
print(soup)
Prints:
<?xml version="1.0" encoding="utf-8"?>
<sky class="new">
<list name="school">
<p>51</p>
<p>56</p>
<p>87</p>
</list>
</sky>
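Try something like this:
# assumes soup already holds the parsed XML and value is the list from the question
targets = soup.select('p')
# Pair each <p> with its value by position. Looking the position up with
# targets.index(target) returns the first equal tag, which is wrong for duplicates.
for target, new_value in zip(targets, value):
    target.string.replace_with(str(new_value))
soup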
Output:
<html><body><sky class="new">
<list name="school">
<p>51</p>
<p>56</p>
<p>87</p>
</list>
</sky></body></html>
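Neither answer shows the nested layout from EDIT1. A minimal sketch for that case follows, under the assumption that every <p> should be overwritten in document order regardless of depth. It uses the standard library's xml.etree.ElementTree rather than BeautifulSoup, so it is an alternative technique, not the answerers' method. The function name replace_p_values and the four replacement numbers are made up for illustration:

```python
import xml.etree.ElementTree as ET


def replace_p_values(xml_text, new_values):
    """Overwrite the text of each <p>, in document order, with new_values."""
    root = ET.fromstring(xml_text)
    # iter("p") yields every <p> descendant at any depth, in document order.
    # zip stops at the shorter sequence, so surplus <p> elements or surplus
    # values are left untouched rather than raising an error.
    for p, new_value in zip(root.iter("p"), new_values):
        p.text = str(new_value)
    return ET.tostring(root, encoding="unicode")


edit1_xml = """<sky class="new">
<list name="alpha">
<item>
<p unit="kg">63</p>
<p weight="wg">54</p>
</item>
<item>
<p unit="kg">57</p>
<p weight="wg">32</p>
</item>
</list>
</sky>"""

# Hypothetical replacement values, one per <p> element.
print(replace_p_values(edit1_xml, [70, 60, 50, 40]))
```

Attributes such as unit="kg" are preserved because only the element text is assigned.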

How to get the required values from the below mentioned xml file?

I want to read the XML file below and access its values. I have already tried many approaches without success. For example, I want the 'NightRaidPerformanceCPUScore' value and the passIndex it belongs to.
<?xml version='1.0' encoding='utf8'?>
<benchmark>
<results>
<result>
<name />
<description />
<passIndex>-1</passIndex>
<sourceId>C:\Users\dgadhipx\Documents\3DMark\3dmark-autosave-20200401155825.3dmark-result</sourceId>
<NightRaidPerformance3DMarkScore>2066</NightRaidPerformance3DMarkScore>
<NightRaidPerformanceCPUScore>1454</NightRaidPerformanceCPUScore>
<NightRaidPerformanceGraphicsScore>2233</NightRaidPerformanceGraphicsScore>
<benchmarkRunId>8045dec5-e97c-452b-abeb-54af187fd50a</benchmarkRunId>
</result>
<result>
<name />
<description />
<passIndex>0</passIndex>
<sourceId>C:\Users\dgadhipx\Documents\3DMark\3dmark-autosave-20200401155825.3dmark-result</sourceId>
<NightRaidPerformanceCPUScoreForPass>1454</NightRaidPerformanceCPUScoreForPass>
<NightRaidPerformance3DMarkScoreForPass>2066</NightRaidPerformance3DMarkScoreForPass>
<NightRaidPerformanceGraphicsScoreForPass>2233</NightRaidPerformanceGraphicsScoreForPass>
<NightRaidPerformanceGraphicsTest1>9.57</NightRaidPerformanceGraphicsTest1>
<NightRaidPerformanceGraphicsTest2>12.18</NightRaidPerformanceGraphicsTest2>
<NightRaidCpuP>395.2</NightRaidCpuP>
<benchmarkRunId>8045dec5-e97c-452b-abeb-54af187fd50a</benchmarkRunId>
</result>
</results>
</benchmark>
You can use BeautifulSoup as follows:
from bs4 import BeautifulSoup
# file_path is the path to the XML file shown above
with open(file_path, "r") as f:
    content = f.read()
xml = BeautifulSoup(content, 'xml')
elements = xml.find_all("NightRaidPerformanceCPUScore")
for i in elements:
    print(i.text)
That prints the values of all "NightRaidPerformanceCPUScore" tags.
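The question also asks which passIndex each score belongs to, which the snippet above does not report. A minimal sketch of one way to pair the two follows, assuming each <result> holds at most one score element with the given name. It uses the standard library's xml.etree.ElementTree instead of BeautifulSoup. The function name scores_by_pass is hypothetical and the sample is a trimmed copy of the file in the question:

```python
import xml.etree.ElementTree as ET


def scores_by_pass(xml_text, score_tag="NightRaidPerformanceCPUScore"):
    """Return (passIndex, score) text pairs for every <result> with score_tag."""
    root = ET.fromstring(xml_text)
    found = []
    for result in root.iter("result"):
        score = result.findtext(score_tag)
        if score is None:
            continue  # this <result> has no such score element
        found.append((result.findtext("passIndex"), score))
    return found


# Trimmed copy of the benchmark file from the question.
sample = """<benchmark>
<results>
<result>
<passIndex>-1</passIndex>
<NightRaidPerformance3DMarkScore>2066</NightRaidPerformance3DMarkScore>
<NightRaidPerformanceCPUScore>1454</NightRaidPerformanceCPUScore>
</result>
<result>
<passIndex>0</passIndex>
<NightRaidPerformanceCPUScoreForPass>1454</NightRaidPerformanceCPUScoreForPass>
</result>
</results>
</benchmark>"""

for pass_index, score in scores_by_pass(sample):
    print(f"passIndex {pass_index}: {score}")
# passIndex -1: 1454
```

The values come back as strings. Convert them with int() or float() if arithmetic is needed.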

Amazon returns 0 TotalOffers for valid ASIN

ASIN: B000XETPPY
I'm using the Amazon scratchpad, so I'm not even coding. Here is the response:
<Items>
<Request>
<IsValid>True</IsValid>
<ItemLookupRequest>
<IdType>ASIN</IdType>
<ItemId>B000XETPPY</ItemId>
<ResponseGroup>OfferFull</ResponseGroup>
<VariationPage>All</VariationPage>
</ItemLookupRequest>
</Request>
<Item>
<ASIN>B000XETPPY</ASIN>
<ParentASIN>B000XETPPY</ParentASIN>
<OfferSummary>
<TotalNew>0</TotalNew>
<TotalUsed>0</TotalUsed>
<TotalCollectible>0</TotalCollectible>
<TotalRefurbished>0</TotalRefurbished>
</OfferSummary>
<Offers>
<TotalOffers>0</TotalOffers>
<TotalOfferPages>0</TotalOfferPages>
<MoreOffersUrl>0</MoreOffersUrl>
</Offers>
</Item>
</Items>
However, the ASIN is for this popular product: http://www.amazon.com/Timberland-PRO-Mens-Pitboss-Steel-Toe/dp/B000XETPPY
Why is it saying 0 TotalOffers?
The call I'm using is ItemLookup and the response group is OfferFull.

How to get the attribute value of an xml code using linq

<Shape ID="1" NameU="Start/End" Name="Start/End" Type="Shape" Master="2">
....</Shape>
<Shape ID="2" NameU="Start/End" Name="Start/End" Type="Shape" Master="5">
....</Shape>
I have to return the Master value for every ID value.
How can I achieve this using LINQ to XML?
You didn't show what your XML document looks like, so I assumed it is as follows:
<Shapes>
<Shape ID="1" NameU="Start/End" Name="Start/End" Type="Shape" Master="2">
</Shape>
<Shape ID="2" NameU="Start/End" Name="Start/End" Type="Shape" Master="5">
</Shape>
</Shapes>
You can get the Master attribute value for every ID like this:
// requires: using System.Linq; using System.Xml.Linq;
var xDoc = XDocument.Load("Input.xml");
var masters = xDoc.Root
    .Elements("Shape")
    .ToDictionary(
        x => (int)x.Attribute("ID"),
        x => (int)x.Attribute("Master")
    );
masters will be a Dictionary<int, int> where the key is your ID and the value is the corresponding Master attribute value.
