Split a giant kml file

I have a giant kml file with the following structure:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Style id="transBluePoly">
<LineStyle>
<width>1.5</width>
</LineStyle>
<PolyStyle>
<color>30ffa911</color>
</PolyStyle>
</Style>
<Style id="labelStyle">
<IconStyle>
<color>ffffa911</color>
<scale>0.35</scale>
</IconStyle>
<LabelStyle>
<color>ffffffff</color>
<scale>0.35</scale>
</LabelStyle>
</Style>
<Placemark>
<name>9840229084|2013-03-06 13:41:34.0|rent|0.0|2|0|0|1|T|5990F529FB98F28A1F17D182152201A4|0|null|null|null|null|null|null|null|null|null|null|F|F|0|NO_POSTCODE</name>
<styleUrl>#transBluePoly</styleUrl>
<Polygon>
<outerBoundaryIs>
<LinearRing>
<coordinates>
-1.5191200,53.4086600
-1.5214300,53.4011900
-1.5303600,53.4028800
-1.5435800,53.4033900
-1.5404900,53.4083600
-1.5191200,53.4086600
</coordinates>
</LinearRing>
</outerBoundaryIs>
</Polygon>
</Placemark>
<Placemark>
<name>9840031669|2013-03-06 13:14:22.0|rent|0.0|0|0|0|1|F|E5BAC836984F53F91D7F60F247920F0C|0|null|null|null|null|null|null|null|null|null|null|F|F|3641161|DE4 3JT</name>
<styleUrl>#transBluePoly</styleUrl>
<Polygon>
<outerBoundaryIs>
<LinearRing>
<coordinates>
-1.2370933,53.1227587
-1.2304837,53.1690463
-1.1783129,53.2226956
-1.2016444,53.2833233
-1.3213687,53.3248921
-1.4809916,53.3039582
-1.6167192,53.2438689
-1.5593782,53.1336370
-1.4296123,53.0962399
-1.3205129,53.1024090
-1.2370933,53.1227587
</coordinates>
</LinearRing>
</outerBoundaryIs>
</Polygon>
</Placemark>
I need to extract 1 million polygons from this to make it manageable (I know a geo DB is the ultimate solution; I'm looking for a quick fix).
Loading it into a lightweight text editor and just deleting some lines would be my first port of call, but I suspect this will take forever and a day (it's 10 GB and I've got 16 GB of RAM). I'm just wondering if there is a more intelligent solution from the Linux terminal that avoids having to read it all into RAM. I've seen Perl and bash commands for doing this but can't see how they would work for taking a random (or first million) sample: http://www.unix.com/shell-programming-scripting/159470-filter-kml-file-xml-remove-unwanted-entries.html
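A minimal line-streaming sketch of that idea in Python (the file names and placemark count are placeholders; it assumes each Placemark starts and ends on its own line, as in the sample above):

N = 1_000_000  # how many placemarks to keep

count = 0
stopped_early = False
with open('giant.kml') as src, open('subset.kml', 'w') as dst:
    for line in src:
        dst.write(line)
        if '</Placemark>' in line:
            count += 1
            if count == N:
                stopped_early = True
                break
    if stopped_early:
        # we cut the source off mid-document, so close it ourselves
        dst.write('</Document>\n</kml>\n')

Because it reads one line at a time, memory use stays flat regardless of the file size.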

You can use a KML parsing library and a few lines of code to parse out what you need from a large KML or KMZ file.
Java
The GIScore Java library, for example, uses StAX to parse the KML source file one feature at a time, so it does not need to load the entire file into memory. The library works very fast, so a 10 GB file won't take very long.
Here's a simple Java program that extracts the points from the polygons inside a KML file; it doesn't matter how large the KML file is or how deeply the Placemarks are nested within folders.
import org.opensextant.geodesy.Geodetic2DPoint;
import org.opensextant.giscore.events.*;
import org.opensextant.giscore.geometry.*;
import org.opensextant.giscore.input.kml.KmlInputStream;

import java.io.FileInputStream;
import java.io.IOException;
import java.text.DecimalFormat;

public class Test {
    public static void main(String[] args) throws IOException {
        KmlInputStream kis = new KmlInputStream(new FileInputStream("test.kml"));
        IGISObject obj;
        DecimalFormat df = new DecimalFormat("0.0#####");
        while ((obj = kis.read()) != null) {
            if (obj instanceof Feature) {
                Feature f = (Feature) obj;
                Geometry g = f.getGeometry();
                if (g instanceof Polygon) {
                    System.out.println("Points");
                    for (Point p : ((Polygon) g).getOuterRing().getPoints()) {
                        // do something with the points (e.g. insert in database, etc.)
                        Geodetic2DPoint pt = p.asGeodetic2DPoint();
                        System.out.printf("%s,%s%n",
                                df.format(pt.getLatitudeAsDegrees()),
                                df.format(pt.getLongitudeAsDegrees()));
                    }
                }
            }
        }
        kis.close();
    }
}
To run it, create the source file Test.java in the directory src/main/java and copy the code above into the file.
If the Geometry is a MultiGeometry then you'd need to add a check for that and iterate over the sub-geometries.
Using Gradle, here's a sample build.gradle script to run the above test program using the command: gradle run
apply plugin: 'java'

repositories {
    mavenCentral()
}

task run(dependsOn: 'compileJava', type: JavaExec) {
    main = 'Test'
    classpath = sourceSets.main.runtimeClasspath
}

dependencies {
    compile 'org.opensextant:geodesy:2.0.1'
    compile 'org.opensextant:giscore:2.0.1'
}
This does require that you install both Gradle and the Java Development Kit (JDK).
Python
Alternatively, you can parse the KML using Python with the pykml library. You can create multiple smaller KML files with some logic to split up the polygons (see the sketch after the code below), or insert the polygon geometry features into a PostgreSQL database, etc. There is support for pykml on Stack Overflow under the pykml tag.
from pykml import parser
import re

with open('data.kml', 'r') as fp:
    doc = parser.parse(fp)
    for pm in doc.getroot().Document.Placemark:
        print(pm.name)
        # Get the coordinates from either polygon or polygon inside multigeometry
        if hasattr(pm, 'MultiGeometry'):
            pm = pm.MultiGeometry
        if hasattr(pm, 'Polygon'):
            pcoords = pm.Polygon.outerBoundaryIs.LinearRing.coordinates.text
            # extract coords into a list of lon,lat coord pairs
            coords = re.split(r'\s+', pcoords.strip())
            for coord in coords:
                lonlat = coord.split(',')
                if len(lonlat) > 1:
                    print(lonlat[0], lonlat[1])
                    # add logic here - insert points into DB, etc.
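A hedged sketch of the "multiple smaller KML files" idea, using lxml.etree.iterparse so the 10 GB source is streamed rather than loaded whole (the file names, chunk size and output skeleton are my assumptions; note the shared Style elements are not copied into the output files):

from lxml import etree

KML_NS = 'http://www.opengis.net/kml/2.2'
CHUNK = 100_000  # placemarks per output file

header = ('<?xml version="1.0" encoding="UTF-8"?>\n'
          '<kml xmlns="%s">\n<Document>\n' % KML_NS)
footer = '</Document>\n</kml>\n'

out, written, part = None, 0, 0
for event, elem in etree.iterparse('data.kml', tag='{%s}Placemark' % KML_NS):
    if out is None:
        part += 1
        out = open('part_%03d.kml' % part, 'wb')
        out.write(header.encode())
    out.write(etree.tostring(elem))
    written += 1
    # release the memory held by the placemark we just copied
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]
    if written == CHUNK:
        out.write(footer.encode())
        out.close()
        out, written = None, 0
if out is not None:
    out.write(footer.encode())
    out.close()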

This might be too late but some thoughts for you.
I have traditionally modified such code blocks in Microsoft Word using wildcard searches.
While your file may be too large for Word, the concepts will work with other similar tools.
I took one block of your file and performed three search-and-replaces: (1) to get the name out and wrap it in " marks, (2) to remove the intermediate block of characters and replace it with an = character, and (3) to delete the final block of tags.
It worked like this:
(I actually did some tidy up to remove spaces first - these are probably an artefact of this website rather than the code itself)
Replace [<]Placemark[>][<]name[>](**)[<]/name[>] by “\1”
Replace [<]styleUrl(**)[<]coordinates[>] by =
Replace [<]/coordinates(**)[<]Placemark[>] by nothing
The square brackets are needed to stop Word using some characters as escapes (I may have used them more often than strictly necessary).
The (**) sequence captures everything in that position and makes it available as \1 (and \2, \3 for later groups) in the replace field.
In theory you should be able to do this in one hit using all three together, but that gives a "too complex" error in Word until you get right back to basics and cut the pattern down. So:
Replace [<]Place**name[>](**)[<]/name**nates[>](**)[<]/coord**mark[>] by "\1"=\3
Will actually work.
Of course you can then easily change the format of your results to what you want (ie not using " or = in the output), and using further search and replaces you can manipulate the output ready for whatever package you want it for.
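For anyone who would rather script the same transformation, a rough Python re equivalent of those three replacements (the patterns are my approximation of the wildcards above, not Bob's exact expressions):

import re

with open('giant.kml') as f:
    text = f.read()  # for a 10 GB file you would apply this per placemark block instead

# (1) keep the name, wrapped in quotes
text = re.sub(r'<Placemark>\s*<name>(.*?)</name>', r'"\1"', text, flags=re.S)
# (2) collapse everything between the name and the coordinates into '='
text = re.sub(r'<styleUrl>.*?<coordinates>', '=', text, flags=re.S)
# (3) drop the closing tags after each coordinate list
text = re.sub(r'</coordinates>.*?</Placemark>', '', text, flags=re.S)

print(text)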
Wildcard search and replaces are fun!
Bob J.
PS I have written a compiler in Word VBA using this concept to take a series of text strings from Excel which hold the essential mapping data and convert them into fully operational KML files. The current input file is >200k chars over 2,500 lines and produces a 700k KML file spread over nearly 19,000 lines. It takes about 30 seconds to 'compile'. This is the reverse of your situation.

I am a little bit late, but this answer might help someone. You can split the KML file perfectly using FME Desktop (a giant piece of software!) with the ModuloCount transformer. See: Split kml file ModuloCount

Related

Skip over element in large XML file (python 3)

I'm new to xml parsing, and I've been trying to figure out a way to skip over a parent element's contents because there is a nested element that contains a large amount of data in its text attribute (I cannot change how this file is generated). Here's an example of what the xml looks like:
<root>
  <Parent>
    <thing_1>
      <a>I need this</a>
    </thing_1>
    <thing_2>
      <a>I need this</a>
    </thing_2>
    <thing_3>
      <subgroup>
        <huge_thing>enormous string here</huge_thing>
      </subgroup>
    </thing_3>
  </Parent>
  <Parent>
    <thing_1>
      <a>I need this</a>
    </thing_1>
    <thing_2>
      <a>I need this</a>
    </thing_2>
    <thing_3>
      <subgroup>
        <huge_thing>enormous string here</huge_thing>
      </subgroup>
    </thing_3>
  </Parent>
</root>
I've tried lxml.iterparse and xml.sax implementations to try and work this out, but no dice. These are the majority of the answers I've found in my searches:
"Use the tag keyword in iterparse."
This does not work because, although lxml cleans up the elements in the background, the large text in the element is still parsed into memory, so I get large memory spikes.
"Create a flag, set it to True when the start event for that element is found, and then ignore the element while parsing."
This does not work, as the element is still parsed into memory at the end event.
"Break before you reach the end event of the specific element."
I cannot just break when I reach the element, because there are multiples of these elements and I need specific children data from each of them.
"This is not possible, as stream parsers still have an end event and generate the full element."
... ok.
I'm currently trying to directly edit the stream data that the GzipFile sends to iterparse in hopes that it would be able to not even know that the element exists, but I'm running into issues with that. Any direction would be greatly appreciated.
I don't think you can get a parser to selectively ignore some part of the XML it's parsing. Here are my findings using the SAX parser...
I took your sample XML, blew it up to just under 400MB, created a SAX parser, and ran it against my big.xml file two different ways.
For the straightforward approach, sax.parse('big.xml', MyHandler()), memory peaked at 12M.
For a buffered file reader approach, using 4K chunks, parser.feed(chunk), memory peaked at 10M.
I then doubled the size, for an 800M file, re-ran it both ways, and the peak memory usage didn't change: ~10M. The SAX parser seems very efficient.
I ran this script against your sample XML to create some really big text nodes, 400M each.
with open('input.xml') as f:
    data = f.read()

with open('big.xml', 'w') as f:
    f.write(data.replace('enormous string here', 'a'*400_000_000))
Here's big.xml's size in MB:
du -ms big.xml
763 big.xml
Here's my SAX ContentHandler, which only handles the character data if the path to the data's parent ends in thing_*/a (which, according to your sample, disqualifies huge_thing)...
BTW, much appreciation to l4mpi for this answer, showing how to buffer the character data you do want:
from xml import sax

class MyHandler(sax.handler.ContentHandler):
    def __init__(self):
        self._charBuffer = []
        self._path = []

    def _getCharacterData(self):
        data = ''.join(self._charBuffer).strip()
        self._charBuffer = []
        return data.strip()  # remove strip() if whitespace is important

    def characters(self, data):
        if len(self._path) < 2:
            return
        if self._path[-1] == 'a' and self._path[-2].startswith('thing_'):
            self._charBuffer.append(data)

    def startElement(self, name, attrs):
        self._path.append(name)

    def endElement(self, name):
        self._path.pop()
        if len(self._path) == 0:
            return
        if self._path[-1].startswith('thing_'):
            print(self._path[-1])
            print(self._getCharacterData())
For both the whole-file parse method, and the chunked reader, I get:
thing_1
I need this
thing_2
I need this
thing_3
thing_1
I need this
thing_2
I need this
thing_3
It's printing thing_3 because of my simple logic, but the data in subgroup/huge_thing is ignored.
Here's how I call the handler with the straight-forward parse() method:
handler = MyHandler()
sax.parse('big.xml', handler)
When I run that with Unix/BSD time, I get:
/usr/bin/time -l ./main.py
...
1.45 real 0.64 user 0.11 sys
...
11027456 peak memory footprint
Here's how I call the handler with the more complex chunked reader, using a 4K chunk size:
handler = MyHandler()
parser = sax.make_parser()
parser.setContentHandler(handler)

Chunk_Sz = 4096
with open('big.xml') as f:
    chunk = f.read(Chunk_Sz)
    while chunk != '':
        parser.feed(chunk)
        chunk = f.read(Chunk_Sz)
/usr/bin/time -l ./main.py
...
1.85 real 1.65 user 0.19 sys
...
10453952 peak memory footprint
Even with a 512B chunk size, it doesn't get below 10M, but the runtime doubled.
I'm curious to see what kind of performance you're getting.
You cannot use a DOM parser, as that would by definition load the whole document into RAM. But basically a DOM parser is just a SAX parser that builds a DOM as it goes through the SAX events.
When creating your custom SAX handler you can not only build the DOM (or whichever other in-memory representation you prefer) but also start ignoring events when they relate to some specific location in the document.
Be aware that the parsing needs to continue so you know when to stop ignoring the events, but the output of the parser will not contain this unneeded large chunk of data.
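A minimal sketch of that idea, assuming the element to suppress is called huge_thing as in the sample above (the handler and variable names are illustrative):

from xml import sax

class SkippingHandler(sax.handler.ContentHandler):
    """Keeps text only for elements outside the ignored subtree."""

    IGNORE = 'huge_thing'  # element whose contents we drop

    def __init__(self):
        super().__init__()
        self._ignoring = 0  # nesting depth inside the ignored element
        self.chunks = []    # whatever representation you are building

    def startElement(self, name, attrs):
        if name == self.IGNORE or self._ignoring:
            self._ignoring += 1  # entered, or went deeper into, the ignored subtree

    def endElement(self, name):
        if self._ignoring:
            self._ignoring -= 1

    def characters(self, data):
        if not self._ignoring:
            self.chunks.append(data)  # keep only the text we care about

handler = SkippingHandler()
sax.parse('big.xml', handler)
print(len(''.join(handler.chunks)))

The events for huge_thing are still delivered (a stream parser cannot skip them), but they are discarded instead of being kept in whatever structure you build.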

Google KML file to python

I have the following code to access the coordinates of KML files in Python:
from pykml import parser

with open('test.kml', 'r') as kml_file:
    root = parser.parse(kml_file).getroot()
    for i in root.findall('{http://www.opengis.net/kml/2.2}Document/{http://www.opengis.net/kml/2.2}Placemark/{http://www.opengis.net/kml/2.2}Point'):
        print(i.coordinates)
This finds all the individual points in the KML file, where I marked certain points of interest. But I also have some polygons that I created in Google Earth, and this code does not return them. How can I also get the polygons?
Please let me know if you have any questions.
If the KML source file has a Document with Placemarks then the following Python code will iterate over each Placemark with a Polygon geometry and dump out the coordinates.
from pykml import parser

with open('test.kml', 'r') as f:
    root = parser.parse(f).getroot()

namespace = {"kml": 'http://www.opengis.net/kml/2.2'}
pms = root.xpath(".//kml:Placemark[.//kml:Polygon]", namespaces=namespace)
for p in pms:
    print(p.Polygon.outerBoundaryIs.LinearRing.coordinates)
If the KML uses MultiGeometry with one or more Polygons then a small change is needed to the check inside the for loop.
for p in pms:
    if hasattr(p, 'MultiGeometry'):
        for poly in p.MultiGeometry.Polygon:
            print(poly.outerBoundaryIs.LinearRing.coordinates)
    else:
        print(p.Polygon.outerBoundaryIs.LinearRing.coordinates)

How to parse big XML file using beautiful soup?

I am trying to parse an XML file named document.xml which contains around 400,000 characters (including tags, line breaks and spaces) in it. Find the code below:
from bs4 import BeautifulSoup

document_xml_file_object = open('document.xml', 'r')
document_xml_file_content = document_xml_file_object.read()
xml_content = BeautifulSoup(document_xml_file_content, 'lxml-xml')
print("XML CONTENT: ", xml_content)
When I print xml_content, below is my output:
XML CONTENT: <?xml version="1.0" encoding="utf-8"?>
For smaller files it prints the complete XML code. Can anyone help me understand why this is happening?
Edit : Click Here to see my XML Content.
Thanks in Advance
For large files it is better to use a streaming parser like xml.sax. BeautifulSoup will load the whole file into memory and parse it, while with xml.sax you will use far less memory.
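For example, a minimal xml.sax sketch (the element-counting logic is just a placeholder, since the contents of document.xml aren't shown in the question):

from xml import sax

class CountingHandler(sax.handler.ContentHandler):
    """Streams the file and counts elements instead of holding the whole tree in memory."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1

handler = CountingHandler()
sax.parse('document.xml', handler)  # reads the file incrementally
for name, n in sorted(handler.counts.items()):
    print(name, n)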

Merging GDML files

The GDML manual documents how to set up multiple GDML files via !ENTITY includes.
Now, lxml has a facility to parse multiple files as follows:
from lxml import etree
parser = etree.XMLParser(resolve_entities=True)
root = etree.parse(filename, parser=parser)
I assume that at least under the covers it must combine the multiple files before parsing.
My question: is there some way I can just save the combined files? I would like to have the facility to combine and save a LARGE (> 100) number of XML files, where they are specified as includes in a GDML file, and discard the !ENTITY definitions.
Note: I don't want to fully parse the files, as for that large a number this takes a long time.
I guess one option would be to code something like
gdml = etree.tostring('gdml')
etree.ElementTree(gdml).write('filename')
But I have a concern that I might hit a maximum string size, which I understand would be limited to the maximum addressable size:
import sys
sys.maxsize
Wondering if there is a better way.
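For reference, a minimal sketch of that serialize-and-save idea (the file names are placeholders; lxml's ElementTree.write() serializes straight to the file, so the document does not have to pass through one huge Python string):

from lxml import etree

# Parse the top-level GDML file, pulling in the !ENTITY includes.
parser = etree.XMLParser(resolve_entities=True)
tree = etree.parse('main.gdml', parser=parser)

# Write the combined document back out; the entity references have already
# been replaced by the included content at parse time.
tree.write('combined.gdml', xml_declaration=True, encoding='UTF-8')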

Loop and merge XML files in a folder to one HTML file using Groovy script?

I am using ReadyAPI and trying to set up my report generation, so I'm at the point where all the XML files for each test case are generated, and I need to merge them.
So basically I only have a path where the files are, let's say "C:\Path", where the XML files live.
I have found parsers for single files, and ways to append some information from one XML file into another XML file, but I have not found a way to loop through all the XML files and dump their content into a new file...
Any help or indication could be much appreciated...
Jackson.
There is a working example of this answer here.
Let's assume that we have XML files of this form:
<composer>
  <name>Wolfgang Mozart</name>
  <born>1756</born>
</composer>
Then, we could build a list of parsed XML documents from each .xml file in the current directory (or whichever you need):
def composers = []
new File(".").eachFile { def file ->
    if (file.name ==~ /.*\.xml/) {
        composers << new XmlSlurper().parse(file)
    }
}
Then, we could use a StreamingMarkupBuilder to create a unified XML document. Note this mixes markup with the composers list built above:
def xml = new StreamingMarkupBuilder().bind {
    root {
        composers.each { c ->
            mkp.yield c
        }
    }
}.toString()
That is, the document looks like:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<composer>
<name>Wolfgang Mozart</name>
<born>1756</born>
</composer>
<composer>
<name>JS Bach</name>
<born>1685</born>
</composer>
...
</root>
Altering the solution for your local goals should be straight-forward.
