Reading MultiGeometry KML file using Geopandas

I have a kml file with multi-geometries (points and polygons). I want to access only the polygons present inside the kml file.
I tried reading the kml file using Geopandas:
inputfile = 'path to kml file'
fiona.supported_drivers['KML'] = 'rw'
sp = gpd.read_file(inputfile, driver='KML')
Here the 'sp' variable only contains the point features present inside the kml file. I tried using the 'Geometry' argument along with the driver argument, but still only the point features are read.
Can anyone help me access the 'Polygon' entities in the kml file?

I had this same question today! However, I was obtaining the KML through a URL response, so I was able to "remove" the point data prior to writing it as a kml file (leaving only polygon data). This is what I did:
from bs4 import BeautifulSoup as bs
import geopandas as gpd
import fiona

fiona.drvsupport.supported_drivers['kml'] = 'rw'
fiona.drvsupport.supported_drivers['KML'] = 'rw'

soup = bs(r.content, 'xml')  # kml/xml content obtained from hitting an API
child = soup.find_all('Placemark')
child_polys = []

# Creates a list of only the polygon feature data
for c in child:
    if c.find('Point') is None:
        child_polys.append(c)

# Add the required header/footer to the polygon data, then join below.
header = ['''<?xml version="1.0" encoding="utf-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<open>
1
</open>''']
header.extend(child_polys)
header.append('''</Document>
</kml>''')

with open('temp_data.kml', 'w') as f:
    f.write(''.join([str(x) for x in header]))

data = gpd.read_file(r"temp_data.kml", driver="kml")
This got me a geodataframe with Name, Description and geometry columns for the polygons only. In your case, you may have to find a way to read in the kml, then edit it as I did above, before reconstructing and passing it to geopandas. Good luck.
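A possible shortcut worth trying before rewriting the file (an untested sketch): KML files often store features in several layers, and read_file only returns one layer per call, which may be why only the points show up. Enumerating the layers with fiona.listlayers and then filtering by geometry type might get the polygons without round-tripping through BeautifulSoup:

import fiona
import geopandas as gpd
import pandas as pd

fiona.supported_drivers['KML'] = 'rw'

inputfile = 'path to kml file'
# Read every layer present in the KML, not just the default one
layers = [gpd.read_file(inputfile, driver='KML', layer=name)
          for name in fiona.listlayers(inputfile)]
gdf = gpd.GeoDataFrame(pd.concat(layers, ignore_index=True))

# Keep only the polygon geometries
polygons = gdf[gdf.geom_type.isin(['Polygon', 'MultiPolygon'])]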

Related

PDF parser in pdfs with multiple images and formats with python and tabula (open to other options)

So first off, what I'm trying to do: create a PDF parser that will take ONLY tables out of any given PDF. I currently have some PDFs that are parts manuals, which contain an image of the part and then a table with details of the part, and I want to scrape and parse the table data from the PDF into a CSV or similar Excel-style file (csv, xls, etc.).
What I've tried / am trying: I am currently using python3 and tabula (I have no preference for either of these and am open to other options), in which I have a py program that is able to scrape all the data of any PDF or directory of PDFs. However, it takes EVERYTHING, including the image file code that has a bunch of 0 1 NaN (adding examples at the bottom). I was thinking of writing a filter function that removes these (see the sketch after the code below), however that feels like overkill, and I was wondering/hoping there is a way to filter out the images with tabula or another library? (Side note: I've also attempted camelot, however the module is not importing correctly even when it is in my pip freeze, and this happens on both my mac m1 and mac m2, so I'm assuming there is no ARM support.)
If anyone could help me, or guide me toward a library or method of iterating through all pages in a PDF and JUST grabbing the tables for export to csv, that would be AMAZING!
current main file:
from tabula.io import read_pdf
import pandas as pd
from tabulate import tabulate
import os

def parser(fileName, count):
    print("\nFile Number: ", count, "\nNow parsing file: ", fileName)
    df = read_pdf(fileName, pages="all")  # returns a list of DataFrames, one per detected table
    for i in range(len(df)):
        df[i].to_excel("./output/test" + str(i) + ".xlsx")
        print(tabulate(df[i]))

def reader(type):
    filecount = 1
    if type == 'f':
        file = input("\nFile (f) type selected\nPlease enter the full file name with path (ex. Users/Name/directory1/filename.pdf): ")
        parser(file, filecount)
    elif type == 'd':
        # directory selected
        location = input("\nPlease enter the directory path; if in the same folder just enter a period (.): ")
        print("Opening directory: ", location)
        # loop through and parse the directory
        for filename in os.listdir(location):
            f = os.path.join(location, filename)
            # checking if it is a file
            if os.path.isfile(f):
                parser(f, filecount)
                filecount += 1
            else:
                print('\n\nERROR: path given does not contain a file or is not a directory type.')
    else:
        print("Error: please select directory (d) or file (f)")

fileType = input("\n-----> Hello!\n----> Would you like to parse a directory (d) or file (f)? ").lower()
reader(fileType)
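Regarding the filter-function idea mentioned in the question, a minimal sketch of what it could look like (the file name and thresholds are placeholders to tune per document): drop extracted "tables" that have too few columns or are mostly NaN, which is typical of the image artifacts tabula picks up:

from tabula.io import read_pdf

def looks_like_table(df, min_cols=2, max_nan_ratio=0.5):
    # Reject single-column extractions and frames that are mostly NaN,
    # which usually come from images rather than real tables.
    if df.shape[1] < min_cols:
        return False
    return df.isna().sum().sum() / df.size <= max_nan_ratio

tables = [t for t in read_pdf("parts_manual.pdf", pages="all") if looks_like_table(t)]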

How to save all figures in pdf file in python created from seaborn style & dataframe?

This code gives me output like image 1: a grid rendered as a table with a styled background.
def plot(grid):
    cmap = sns.light_palette("red", as_cmap=True)
    figure = pd.DataFrame(grid)
    figure = figure.style.background_gradient(cmap=cmap, axis=None)
    display(figure)
I want to store multiple images such as image 1 in a single pdf file generated by the function plot. In matplotlib's case,
from matplotlib.backends.backend_pdf import PdfFile, PdfPages

pdfFile = PdfPages("name.pdf")
pdfFile.savefig(plot)
pdfFile.close()
can do this, but in this case I am facing issues because the output is a styled dataframe (seaborn background_gradient), not a matplotlib figure.
Could you please suggest how to store the output above in a single pdf file, or as png or jpg?
Here is my code to save all open figures to a pdf; it saves each plot to a separate page in the pdf.
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

pp = PdfPages(r'C:\path\filename.pdf')  # path to where you want to save the pdf (raw string so the backslashes are not treated as escapes)
figNum = plt.get_fignums()  # creates a list of all figure numbers
for i in range(len(figNum)):  # loop to add each figure to the pdf
    pp.savefig(figNum[i])  # uses the figure number in the list to save it to the pdf
pp.close()  # closes the opened file in memory
We can create a folder named 'image' and store all the images of the code output there in png format. We will have to use the dataframe_image package for that.
import dataframe_image as dfi
from PIL import Image

def plot(grid):
    cmap = sns.light_palette("red", as_cmap=True)
    figure = pd.DataFrame(grid)
    figure = figure.style.background_gradient(cmap=cmap, axis=None)
    dfi.export(figure, 'image/df_styled.png', max_cols=-1)  # writes the styled dataframe to a png
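The PIL import above hints at the follow-up step: once each styled table has been exported as a png, the images can be merged into a single pdf. A minimal sketch (assuming the pngs were written to the image folder as above, with distinct file names):

from pathlib import Path
from PIL import Image

# Open every exported png and save them as consecutive pdf pages
pages = [Image.open(p).convert('RGB') for p in sorted(Path('image').glob('*.png'))]
pages[0].save('styled_tables.pdf', save_all=True, append_images=pages[1:])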

Python 3: write Russian text to PDF file

The problem was to write Russian text to a PDF file. I tried several encodings; however, this didn't solve the problem. You can find the solution I came up with in the answer section. Please note that the write_to_file function writes text on one page only. It was not tested for larger files.
Here is a solution. I am using reportlab version 3.5.42.
from reportlab.lib.units import cm
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.lib.styles import ParagraphStyle
from reportlab.platypus import Paragraph, Frame
from reportlab.pdfgen.canvas import Canvas

def write_to_file(filename, story):
    """
    (str, list) -> None

    Write text from the list of strings story to filename. filename should be in the format name.pdf.
    Russian text is supported by the font DejaVuSerif. DejaVuSerif.ttf should be saved in the working directory.
    filename is stored in the same working directory.
    """
    canvas = Canvas(filename)
    pdfmetrics.registerFont(TTFont('DejaVuSerif', 'DejaVuSerif.ttf'))
    # Various style options are available; consult the reportlab User Guide
    style = ParagraphStyle('russian_text')
    style.fontName = 'DejaVuSerif'
    style.leading = 0.5 * cm
    # Using XML markup for the newline character
    for i, part in enumerate(story):
        story[i] = Paragraph(part.replace('\n', '<br/>'), style)
    # Create a frame to make the text fit the page; A4 format is used by default
    frame = Frame(0, 0, 21 * cm, 29.7 * cm, leftPadding=cm, bottomPadding=cm, rightPadding=cm, topPadding=cm)
    # Add the different parts of the story
    frame.addFromList(story, canvas)
    canvas.save()
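For example, a hypothetical call (assuming DejaVuSerif.ttf sits in the working directory) could look like this:

story = ['Привет, мир!\nЭто первый абзац.', 'А это второй абзац.']
write_to_file('russian_text.pdf', story)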

Modify large file containing multiple XML files to create small file depending on condition

I have a large file that contains multiple XMLs on different lines. I want to create a new file with only the lines (or XMLs) where multiple tags match columns of a spreadsheet. For example, I have a large XML file:
<?xml version="1.0" encoding="UTF-8"?><data><student><result><grade>A</grade></result><details><name>John</name><house>Red</house><id>100</id><age>16</age><email>john@mail.com</email></details></student></data>
<?xml version="1.0" encoding="UTF-8"?><data><student><result><grade>B</grade></result><details><name>Alice</name><house>Blue</house><id>101</id><age>17</age><email>alice@mail.com</email></details></student></data>
<?xml version="1.0" encoding="UTF-8"?><data><student><result><grade>F</grade></result><details><name>Bob</name><house>Blue</house><id>100</id><age>16</age><email>bob@mail.com</email></details></student></data>
<?xml version="1.0" encoding="UTF-8"?><data><student><result><grade>A</grade></result><details><name>Hannah</name><house>Blue</house><id>103</id><age>17</age><email>hannah@mail.com</email></details></student></data>
<?xml version="1.0" encoding="UTF-8"?><data><student><result><grade>C</grade></result><details><name>James</name><house>Red</house><id>101</id><age>18</age><email>james@mail.com</email></details></student></data>
I need to create a file where the house and id are picked from an xlsx file (the spreadsheet screenshot is not included here; per below, the required pairs are Blue-100 and Blue-103), and create a new file like below:
<?xml version="1.0" encoding="UTF-8"?><data><student><result><grade>F</grade></result><details><name>Bob</name><house>Blue</house><id>100</id><age>16</age><email>bob@mail.com</email></details></student></data>
<?xml version="1.0" encoding="UTF-8"?><data><student><result><grade>A</grade></result><details><name>Hannah</name><house>Blue</house><id>103</id><age>17</age><email>hannah@mail.com</email></details></student></data>
What I have tried:
from lxml import etree as ET
import pandas as pd

df = pd.read_excel(open('Student_data.xlsx', 'rb'), sheet_name="Sheet2")
df['House_Id'] = df['House'].map(str) + '-' + df['Id'].map(str)
required_ids = df['House_Id'].tolist()
required_ids = [str(i) for i in required_ids]

for event, element in ET.iterparse('new_student.xml'):
    if element.tag == 'xml' and not (element.xpath('.//id/text()')[0] in required_ids):
        element.clear()
        element.getparent().remove(element)
    if element.tag == 'data':
        tree = ET.ElementTree(element)
        tree.write('student_output.xml')
I am able to create the required id using the 2 variables from the xlsx file (i.e. ['Blue-100', 'Blue-103']) but don't know how to:
1. Create a similar "pair-id" using the XMLs
2. Navigate to look for the "pair-id" and create a new file that contains only the required lines
Please let me know a way to do this. Thanks in advance.
I think, if you know the encoding given in the various XML declarations on each line is always UTF-8, then you can process the file, where each line represents an XML document, with file-based IO and lxml etree as follows:
from lxml import etree as ET

required_ids = ['100', '103']
required_houses = ['Blue', 'Blue']

with open('input.txt', 'r') as f, open('output.txt', 'w') as w:
    for line in f.readlines():
        root = ET.fromstring(bytes(line, encoding='UTF-8'))
        if root.xpath('.//id/text()')[0] in required_ids and root.xpath('.//house/text()')[0] in required_houses:
            # ET.dump(root)
            print(ET.tostring(root, encoding='unicode'), file=w, end='\n')
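Note that the check above matches ids and houses independently rather than as pairs. A small variation (a sketch reusing the "House_Id" idea from the question) keys each line on its house-id combination, so only exact pairs pass:

from lxml import etree as ET

required_pairs = {'Blue-100', 'Blue-103'}  # e.g. set(df['House_Id']) from the question

with open('input.txt', 'r') as f, open('output.txt', 'w') as w:
    for line in f:
        root = ET.fromstring(line.encode('UTF-8'))
        # Build the same "House-Id" key from the XML that the spreadsheet produces
        pair = root.xpath('.//house/text()')[0] + '-' + root.xpath('.//id/text()')[0]
        if pair in required_pairs:
            w.write(ET.tostring(root, encoding='unicode') + '\n')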

Parse multiple xml files in Python

I am stuck with a problem here. I want to parse multiple xml files that share the same structure. I was already able to get the locations of all files and save them into three different lists, since there are three different types of xml structures. Now I want to create three functions (one for each list) which loop through the lists and parse the information I need. Somehow I am not able to do it. Anybody here who could give me a hint how to do it?
import os
import glob
import xml.etree.ElementTree as ET
import fnmatch
import re
import sys

#### Get the location of each XML file and save them into a list ####
all_xml_list = []

def locate(pattern, root=os.curdir):
    for path, dirs, files in os.walk(os.path.abspath(root)):
        for filename in fnmatch.filter(files, pattern):
            yield os.path.join(path, filename)

for files in locate('*.xml', r'C:\Users\Lars\Documents\XML-Files'):
    all_xml_list.append(files)

#### Create lists by GameDay Events ####
xml_GameDay_Player = [x for x in all_xml_list if 'Player' in x]
xml_GameDay_Team = [x for x in all_xml_list if 'Team' in x]
xml_GameDay_Match = [x for x in all_xml_list if 'Match' in x]
The XML file looks like this:
<sports-content xmlns:imp="url">
<sports-metadata date-time="20160912T000000+0200" doc-id="sports_event_" publisher="somepublisher" language="en_EN" document-class="player-statistics">
<sports-title>player-statistics-165483</sports-title>
</sports-metadata>
<sports-event>
<event-metadata id="E_165483" event-key="165483" event-status="post-event" start-date-time="20160827T183000+0200" start-weekday="saturday" heat-number="1" site-attendance="52183" />
<team>
<team-metadata id="O_17" team-key="17">
<name full="TeamName" nickname="NicknameoftheTeam" imp:dfl-3-letter-code="NOT" official-3-letter-code="" />
</team-metadata>
<player>
<player-metadata player-key="33201" uniform-number="1">
<name first="Max" last="Mustermann" full="Max Mustermann" nickname="Mäxchen" imp:extensive="Name" />
</player-metadata>
<player-stats stats-coverage="standard" date-coverage-type="event" minutes-played="90" score="0">
<rating rating-type="standard" imp:rating-value-goalie="7.6" imp:rating-value-defenseman="5.6" imp:rating-value-mid-fielder="5.8" imp:rating-value-forward="5.0" />
<rating rating-type="grade" rating-value="2.2" />
<rating rating-type="index" imp:rating-value-goalie="7.6" imp:rating-value-defenseman="3.7" imp:rating-value-mid-fielder="2.5" imp:rating-value-forward="1.2" />
<rating rating-type="bemeister" rating-value="16.04086" />
<player-stats-soccer imp:duels-won="1" imp:duels-won-ground="0" imp:duels-won-header="1" imp:duels-lost-ground="0" imp:duels-lost-header="0" imp:duels-lost="0" imp:duels-won-percentage="100" imp:passes-completed="28" imp:passes-failed="4" imp:passes-completions-percentage="87.5" imp:passes-failed-percentage="12.5" imp:passes="32" imp:passes-short-total="22" imp:balls-touched="50" imp:tracking-distance="5579.80" imp:tracking-average-speed="3.41" imp:tracking-max-speed="23.49" imp:tracking-sprints="0" imp:tracking-sprints-distance="0.00" imp:tracking-fast-runs="3" imp:tracking-fast-runs-distance="37.08" imp:tracking-offensive-runs="0" imp:tracking-offensive-runs-distance="0.00" dfl-distance="5579.80" dfl-average-speed="3.41" dfl-max-speed="23.49">
<stats-soccer-defensive saves="5" imp:catches-punches-crosses="3" imp:catches-punches-corners="0" goals-against-total="1" imp:penalty-saves="0" imp:clear-cut-chance="0" />
<stats-soccer-offensive shots-total="0" shots-on-goal-total="0" imp:shots-off-post="0" offsides="0" corner-kicks="0" imp:crosses="0" assists-total="0" imp:shot-assists="0" imp:freekicks="3" imp:miss-chance="0" imp:throw-in="0" imp:punt="2" shots-penalty-shot-scored="0" shots-penalty-shot-missed="0" dfl-assists-total="0" imp:shots-total-outside-box="0" imp:shots-total-inside-box="0" imp:shots-foot-inside-box="0" imp:shots-foot-outside-box="0" imp:shots-total-header="0" />
<stats-soccer-foul fouls-commited="0" fouls-suffered="0" imp:yellow-red-cards="0" imp:red-cards="0" imp:yellow-cards="0" penalty-caused="0" />
</player-stats-soccer>
</player-stats>
</player>
</team>
</sports-event>
</sports-content>
I want to extract everything that is within the player-metadata tag, the player-stats (stats-coverage) tag and the player-stats-soccer tag.
Improving on @Gnudiff's answer, here is a more resilient approach:
import os
from glob import glob
from lxml import etree

xml_GameDay = {
    'Player': [],
    'Team': [],
    'Match': [],
}

# sort all files into the right buckets
for filename in glob(r'C:\Users\Lars\Documents\XML-Files\*.xml'):
    for key in xml_GameDay.keys():
        if key in os.path.basename(filename):
            xml_GameDay[key].append(filename)
            break

def select_first(context, path):
    result = context.xpath(path)
    if len(result):
        return result[0]
    return None

# extract data from Player files
for filename in xml_GameDay['Player']:
    tree = etree.parse(filename)
    for player in tree.xpath('.//player'):
        player_data = {
            'key': select_first(player, './player-metadata/@player-key'),
            'lastname': select_first(player, './player-metadata/name/@last'),
            'firstname': select_first(player, './player-metadata/name/@first'),
            'nickname': select_first(player, './player-metadata/name/@nickname'),
        }
        print(player_data)
        # ...
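One natural continuation from the # ... above (an assumption on my part, not part of the original answer) is to collect each player_data dict and build a single table across all Player files, e.g. with pandas:

import pandas as pd

rows = []
for filename in xml_GameDay['Player']:
    tree = etree.parse(filename)
    for player in tree.xpath('.//player'):
        rows.append({
            'key': select_first(player, './player-metadata/@player-key'),
            'lastname': select_first(player, './player-metadata/name/@last'),
            'minutes_played': select_first(player, './player-stats/@minutes-played'),
        })
df = pd.DataFrame(rows)  # one row per player across all Player files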
XML files can come in a variety of byte encodings and are prefixed by the XML declaration, which declares the encoding of the rest of the file.
<?xml version="1.0" encoding="UTF-8"?>
UTF-8 is a common encoding for XML files (it also is the default), but in reality it can be anything. It's impossible to predict and it's very bad practice to hard-code your program to expect a certain encoding.
XML parsers are designed to deal with this peculiarity in a transparent way, so you don't really have to worry about it, unless you do it wrong.
This is a good example of doing it wrong:
# BAD CODE, DO NOT USE
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()

tree = etree.XML(file_get_contents('some_filename.xml'))
What happens here is this:
1. Python opens filename as a text file f (using the locale's default encoding).
2. f.read() returns a string.
3. etree.XML() parses that string and creates a DOM object tree.
Doesn't sound so wrong, does it? But if the XML is like this:
<?xml version="1.0" encoding="UTF-8"?>
<Player nickname="Mäxchen">...</Player>
then the DOM you will end up with will be:
Player
    @nickname="MÃ¤xchen"
You have just destroyed the data. And unless the XML contained an "extended" character like ä, you would not even have noticed that this approach is borked. This can easily slip into production unnoticed.
There is exactly one correct way of opening an XML file (and it's also simpler than the code above): Give the file name to the parser.
tree = etree.parse('some_filename.xml')
This way the parser can figure out the file encoding before it reads the data and you don't have to care about those details.
This won't be a complete solution for your particular case, because this is a bit of a task; also I don't have a keyboard, working from a tablet.
In general, you can do it several ways, depending on whether you really need all the data or only a specific subset, and whether you know all the possible structures in advance.
For example, one way:
from lxml import etree

Playerdata = []
for F in xml_GameDay_Player:
    tree = etree.XML(file_get_contents(F))
    for player in tree.xpath('.//player'):
        row = {}
        row['player'] = player.xpath('./player-metadata/name/@last')
        for plrdata in player.xpath('.//player-stats'):
            pass  # do stuff with player data
        Playerdata.append(row)
This is adapted from my existing script; however, it is more tailored to extracting only a specific subset of the xml. If you need all the data, it would probably be better to use some xml tree walker.
file_get_contents is a small helper function:
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()
XPath is a powerful language for finding nodes within xml.
Note that depending on the XPath you use, the result may be either an xml node, as in the "for player in ..." statement, or a string, as in the "row['player'] =" statement.
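For instance, a minimal illustration of that distinction, using the player element from the loop above:

stats_nodes = player.xpath('.//player-stats')              # list of element nodes
last_names = player.xpath('./player-metadata/name/@last')  # list of strings (attribute values)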
You can use the ElementTree XML library (it ships with Python; pip install lxml gets you a faster alternative). Then follow the code structure below:
import xml.etree.ElementTree as ET
import os

my_dir = "your_directory"
for fn in os.listdir(my_dir):
    tree = ET.parse(os.path.join(my_dir, fn))
    root = tree.getroot()
    btf = root.find('tag_name')
    btf.text = new_value  # modify the value of the tag to new_value, whatever you want to put
    tree.write(os.path.join(my_dir, fn))
If you still need a detailed explanation, go through this link:
https://www.datacamp.com/community/tutorials/python-xml-elementtree
