How to parse big XML file using beautiful soup?

How to parse big XML file using beautiful soup? - python-3.x

I am trying to parse an XML file named document.xml which contains around 400000 character (including tags, breakline, space) init find the code below
document_xml_file_object = open('document.xml', 'r')
document_xml_file_content = document_xml_file_object.read()
xml_content = BeautifulSoup(document_xml_file_content, 'lxml-xml')
print("XML CONTENT: ", xml_content)
when I am printing xml_content below is my output:
XML CONTENT: <?xml version="1.0" encoding="utf-8"?>
For the smaller size of files its printing complete XML code. can anyone help me with this why its happening.
Edit : Click Here to see my XML Content.
Thanks in Advance

For large files it better to use line parser like xml.sax. beautifulsoup will load the whole file in memory and parse, while using xml.sax you will use quite less memory.

Related

When I parse a large XML sitemap on Beautifulsoup in Python, it only parses part of the file

I have written code that pulls out URLs of a very large sitemap xml file (10mb) using Beautiful Soup, and it works exactly how I want it, but it only seems to do a small amount of the overall file. This is my code:
`sitemap = "sitemap1.xml"
from bs4 import BeautifulSoup as bs
import lxml
content = []
with open(sitemap, "r") as file:
# Read each line in the file, readlines() returns a list of lines
content = file.readlines()
# Combine the lines in the list into a string
content = "".join(content)
bs_content = bs(content, "xml")
result = bs_content.find_all("loc")
for result in result:
print(result.text)
`
I have changed my IDE to allow for larger files, it just seems to start the process at a random point towards the end of the XML file and only extracts from there on.

I just wanted to say I ended up sorting this out. I used the read XML function in pandas and it worked well. The original XML file was corrupted.
... I also realised that the console was just printing from a certain point because it's such a large file, and it was still actually processing the whole file.
Sorry about this - I'm new :)

Generate random numbers and update in XML file using Robot Framework

I've two XML files in which I manually change the values before proceeding with further evaluation. I would like to know how should I be able to update the values in the XML file using Robot Framework.
I've used faker library to generate random number but I don't know how to update them in XML. The first XML file is something like this:
<dns:ManageRequest>
<SPResource>
<ID>ORD452257337191</ID>
<interactionDate>2016-09-20T02:35:30Z</interactionDate>
<orderType>Connect</orderType>
<SPResourceComprisedOf>
<DescribedBy>
<value>CLI0000000000191</value>
<Characteristic>
<ID>clientID</ID>
</Characteristic>
</DescribedBy>
<DescribedBy>
<value>TOW566105009191</value>
<Characteristic>
<ID>ticketOfWorkId</ID>
</Characteristic>
</DescribedBy>
</SPResourceComprisedOf>
</SPResource>
</dns:ManageRequest>
and the second xml file looks like this:
<dns:ManageOrder>
<FieldWork>
<ID>WOR140618136785</ID>
<Priority>
<priorityValues>45</priorityValues>
</Priority>
<baseRevisionNumber>-1</baseRevisionNumber>
<FieldWorkSpecifiedBy>
<ID>Activation</ID>
<version>1.0.5</version>
<type>WorkOrder Specification</type>
</FieldWorkSpecifiedBy>
<FieldWorkOverview>
<DescribedBy>
<value>WRQ140618136785</value>
<Characteristic>
<ID>Work Request ID</ID>
<type>Overview</type>
</Characteristic>
</DescribedBy>
<DescribedBy>
<value>ORD452257337191</value>
<Characteristic>
<ID>Reference ID</ID>
<type>Overview</type>
</Characteristic>
</DescribedBy>
</FieldWorkOverview>
</FieldWork>
</dns:ManageOrder>
In the firs XML file the values of ORD, CLI & TOW needs to be changed and in the second XML file WOR & WRQ need to be changed but the value of ORD in the second file needs to same as the value of ORD in first file.
I really appreciate any help, because I am really lost in this now :( Thanks!

you can use lxml library.
Link: https://pypi.org/project/lxml/
This is example for edit element ID with your value ORD452257337191 to value '123456'.
Code:
${file}= get file ${path_to_file} encoding=UTF-8
${xml_file}= parse xml ${file}
set element text ${xml_file} 123456 xpath=ID
save xml ${xml_file} ${path_to_file} encoding=UTF-8

result shows file full of some symbols rather than text when I loop files

I was looping some files to copy the content of somes file to a new file but after I run the code, the result shows lot of symbols in the new file, not the text content of the files I looped.
first, when I ran the code without putting the 'encoding' attribute in open file line, it showed an error message like,
UnicodeEncodeError: 'charmap' codec can't encode character '\x8b' in position 12: character maps to .
I tried various encodings like utf-8,latin1 but nothing worked and when i put 'errors=ignore' in the open file line, then the result showed like I described above.
import os
import glob
folder = os.path.join('R:', os.sep, 'Files')
def notes():
for doc in glob.glob(folder + r'\*'):
if doc.endswith('.pdf'):
with open(doc,'r') as f:
x = f.readlines()
with open('doc1.text', 'w+') as f1:
for line in x:
f1.write(line)
notes()

If I understand your example correctly and you’re trying to read PDF files, your problem is not one of encoding but of file format. PDF files don’t just to store your text in coding materials are unique format that you need to be able to read in order to extract the text. There are a couple of python libraries that can read PDF files (such as Py2PDF), please refer to this thread for more information: How to extract text from a PDF file?

why junk code is appending while writing to text file in python

I have stored txt. file contents in a list like, given below then I am
trying to store it in new.txt file , file writing is working fine but
instead of text I am getting junk code , How can i get rid of this Please
help
ListOfLines =['ÿþGroup:Test\n', '\n', 'Fields:³TESTNO³TESTNUM³\n', '\n',
'³37³DUCK³DAFFY³³³³2']
fileName= open("path"+'//'+'TestFile+'.txt',"w+")
for item in ListOfLines :
x=(''.join([str(item)]))
fileName.write(x)
OutPut
片畯㩰䅐䥔久ൔഊ䘊敩摬㩳傳呁低䲳十乔䵁덅䥆乒䵁덅䥍乄䵁덅䥔䱔덅啓䙆塉傳呁䑉傳呁䥂呒덈䅐協塅亳䵁ㅅ傳呁䑉댱䍏啃䅐더䱈䰷䍏덋䅐䡔卉더䅐䍔䵏䆳䑄䕒卓䖳䡔䥎덃䥍剌

Set Encoding='UTF-8' before reading this file. It will work

Loop and merge XML files in a folder to one HTML file using Groovy script?

I am using ReadyAPI and trying to fetch my reports generation, so I'm at the point where all the xml files for each test case are generated, and I need to merge them.
So basically I only have a path where the files are, let's say "C:\Path", where XML files lie.
I have found parsers for single files, and ways to append some information of one XML file into another XML file, but I have not found the way to loop through all XML files and dump their content into a new file...
Any help or indication could be much appreciated...
Jackson.

There is a working example of this answer here.
Let's assume that we have XML files of this form:
<composer>
<name>Wolfgang Mozart</name>
<born>1756</born>
</composer>
Then, we could build a list of parsed XML documents from each .xml file in the current directory (or whichever you need):
def composers = []
new File(".").eachFile { def file ->
if (file.name ==~ /.*\.xml/) {
composers << new XmlSlurper().parse(file)
}
}
Then, we could use a StreamingMarkupBuilder to create a unified XML document. Note this mixes markup with the composers list built above:
def xml = new StreamingMarkupBuilder().bind {
root {
composers.each { c ->
mkp.yield c
}
}
}.toString()
That is, the document looks like:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<composer>
<name>Wolfgang Mozart</name>
<born>1756</born>
</composer>
<composer>
<name>JS Bach</name>
<born>1685</born>
</composer>
...
</root>
Altering the solution for your local goals should be straight-forward.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

How to parse big XML file using beautiful soup? - python-3.x

For large files it better to use line parser like xml.sax. beautifulsoup will load the whole file in memory and parse, while using xml.sax you will use quite less memory.

Related

When I parse a large XML sitemap on Beautifulsoup in Python, it only parses part of the file

Generate random numbers and update in XML file using Robot Framework

result shows file full of some symbols rather than text when I loop files

why junk code is appending while writing to text file in python

Loop and merge XML files in a folder to one HTML file using Groovy script?

Categories

Resources